The primary objective of this exploratory data analysis (EDA) project is to uncover insights from Hillary Clinton's emails using NLP and statistical methods. The project aims to:

1) Identify the common topics discussed using TF-IDF.
2) Understand the network of senders and recipients to identify key figures and their relationships.
3) Explore the frequency and patterns of email communications over time.


**Data Cleaning and Preprocessing:** This initial step involves cleaning the dataset for analysis, including handling missing values, removing duplicates, and standardizing date formats. NLP-specific preprocessing will also be necessary, such as tokenization, removing stopwords, and lemmatization.

**Exploratory Data Analysis:** Employ statistical and visualization techniques to summarize the dataset's main characteristics. This includes analyzing the distribution of emails over time, the most frequent senders and recipients, and the length of emails.

**TF-IDF:** Use NLP techniques such as TF-IDF or count vectorization to generate a list of the five TF-IDF terms that best describe the emails. This can provide insights into the main topics and themes discussed in the emails.

**Temporal Analysis:** Investigate how email communications change over time, identify any patterns or anomalies in the volume of emails sent and received, and correlate these with external events or timelines.


**Import libraries**

In [33]:
import pandas as pd
import numpy as np
import sklearn
import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument


**Read csv file**

In [3]:
emails = pd.read_csv("Emails.csv")

**(1a) Data Cleaning: General** 

Handling Missing Values

In [4]:
emails.fillna('null', inplace=True)

In [5]:
emails

Unnamed: 0,Id,DocNumber,MetadataSubject,MetadataTo,MetadataFrom,SenderPersonId,MetadataDateSent,MetadataDateReleased,MetadataPdfLink,MetadataCaseNumber,...,ExtractedTo,ExtractedFrom,ExtractedCc,ExtractedDateSent,ExtractedCaseNumber,ExtractedDocNumber,ExtractedDateReleased,ExtractedReleaseInPartOrFull,ExtractedBodyText,RawText
0,1,C05739545,WOW,H,"Sullivan, Jacob J",87.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739545...,F-2015-04841,...,,"Sullivan, Jacob J <Sullivan11@state.gov>",,"Wednesday, September 12, 2012 10:16 AM",F-2015-04841,C05739545,05/13/2015,RELEASE IN FULL,,UNCLASSIFIED\r\nU.S. Department of State\r\nCa...
1,2,C05739546,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,H,,,2011-03-03T05:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH1/DOC_0C05739546...,F-2015-04841,...,,,,,F-2015-04841,C05739546,05/13/2015,RELEASE IN PART,"B6\r\nThursday, March 3, 2011 9:45 PM\r\nH: La...",UNCLASSIFIED\r\nU.S. Department of State\r\nCa...
2,3,C05739547,CHRIS STEVENS,;H,"Mills, Cheryl D",32.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739547...,F-2015-04841,...,B6,"Mills, Cheryl D <MillsCD@state.gov>","Abedin, Huma","Wednesday, September 12, 2012 11:52 AM",F-2015-04841,C05739547,05/14/2015,RELEASE IN PART,Thx,UNCLASSIFIED\r\nU.S. Department of State\r\nCa...
3,4,C05739550,CAIRO CONDEMNATION - FINAL,H,"Mills, Cheryl D",32.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739550...,F-2015-04841,...,,"Mills, Cheryl D <MillsCD@state.gov>","Mitchell, Andrew B","Wednesday, September 12,2012 12:44 PM",F-2015-04841,C05739550,05/13/2015,RELEASE IN PART,,UNCLASSIFIED\r\nU.S. Department of State\r\nCa...
4,5,C05739554,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,"Abedin, Huma",H,80.0,2011-03-11T05:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH1/DOC_0C05739554...,F-2015-04841,...,,,,,F-2015-04841,C05739554,05/13/2015,RELEASE IN PART,"H <hrod17@clintonemail.com>\r\nFriday, March 1...",B6\r\nUNCLASSIFIED\r\nU.S. Department of State...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7940,7941,C05778462,WYDEN,H,"Verma, Richard R",180.0,2010-12-16T05:00:00+00:00,2015-08-31T04:00:00+00:00,DOCUMENTS/HRCEmail_August_Web/IPS-0113/DOC_0C0...,F-2014-20439,...,,"Verma, Richard R <VermaRR@state.gov>",,"Thursday, December 16, 2010 7:41 PM",F-2014-20439,C05778462,08/31/2015,RELEASE IN PART,,UNCLASSIFIED U.S. Department of State Case No....
7941,7942,C05778463,SENATE,H,"Verma, Richard R",180.0,2010-12-16T05:00:00+00:00,2015-08-31T04:00:00+00:00,DOCUMENTS/HRCEmail_August_Web/IPS-0113/DOC_0C0...,F-2014-20439,...,,"Verma, Richard R <VermaRR@state.gov>","Sullivan, Jacob J; Mills, Cheryl D; Abedin, Huma","Thursday, December 16, 2010 8:09 PM",F-2014-20439,C05778463,08/31/2015,RELEASE IN FULL,Big change of plans in the Senate. Senator Rei...,UNCLASSIFIED U.S. Department of State Case No....
7942,7943,C05778465,RICHARD (TNR),H,"Jiloty, Lauren C",116.0,2010-12-16T05:00:00+00:00,2015-08-31T04:00:00+00:00,DOCUMENTS/HRCEmail_August_Web/IPS-0113/DOC_0C0...,F-2014-20439,...,,"Jiloty, Lauren C <JilotyLC@state.gov>",,"Thursday, December 16, 2010 10:52 PM",F-2014-20439,C05778465,08/31/2015,RELEASE IN PART,,UNCLASSIFIED U.S. Department of State Case No....
7943,7944,C05778466,FROM,H,PVerveer,143.0,2012-12-17T05:00:00+00:00,2015-08-31T04:00:00+00:00,DOCUMENTS/HRCEmail_August_Web/IPS-0113/DOC_0C0...,F-2014-20439,...,"PVervee,",,,12/14/201,F-2014-20439,C05778466,08/31/2015,RELEASE IN PART,"PVerveer B6\r\nFriday, December 17, 2010 12:12...","Hi dear Melanne and Alyse,\r\nHope this email ..."


Removing Duplicates

In [6]:
# Detect and display duplicate rows based on all columns
duplicate_rows = emails[emails.duplicated()]

In [7]:
duplicate_rows

Unnamed: 0,Id,DocNumber,MetadataSubject,MetadataTo,MetadataFrom,SenderPersonId,MetadataDateSent,MetadataDateReleased,MetadataPdfLink,MetadataCaseNumber,...,ExtractedTo,ExtractedFrom,ExtractedCc,ExtractedDateSent,ExtractedCaseNumber,ExtractedDocNumber,ExtractedDateReleased,ExtractedReleaseInPartOrFull,ExtractedBodyText,RawText


Standardizing Date Formats

In [8]:
# metadata_dates_sent = emails['MetadataDateSent'].unique()
# print(metadata_dates_sent)

In [9]:
# # Convert 'date_column' to datetime
# emails['MetadataDateSent'] = pd.to_datetime(emails['MetadataDateSent'], errors='coerce')

# # Display the unique values to check if the conversion was successful
# print(emails['MetadataDateSent'].unique())

In [10]:
# metadata_dates_released = emails['MetadataDateReleased'].unique()
# print(metadata_dates_released)

In [11]:
sent_dates = emails['ExtractedDateSent'].unique()
print(sent_dates)

['Wednesday, September 12, 2012 10:16 AM' 'null'
 'Wednesday, September 12, 2012 11:52 AM' ...
 'Thursday, December 16, 2010 10:52 PM' '12/14/201'
 'Friday, December 17, 2010 10:42 AM']


In [12]:
released_dates = emails['ExtractedDateReleased'].unique()
print(released_dates)

['05/13/2015' '05/14/2015' '05/22/2015' '06/30/2015' '07/31/2015' 'null'
 '08/31/2015']


In [13]:
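# ExtractedDateReleased is already in MM/DD/YYYY format (apart from the 'null' placeholder).
# ExtractedDateSent is free text and still contains malformed values such as '12/14/201'.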
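
The `ExtractedDateSent` values printed above are free-text strings, so they would need parsing before any date arithmetic. The following is a minimal sketch using only the standard library. The helper name `parse_sent_date` and the format string are assumptions based on the sample values shown in the output, not something the dataset documents; values that do not match are returned as `None` rather than guessed.

```python
import re
from datetime import datetime
from typing import Optional

# Assumed format of the well-formed values seen in the output above,
# e.g. 'Wednesday, September 12, 2012 10:16 AM'
SENT_FORMAT = "%A, %B %d, %Y %I:%M %p"

def parse_sent_date(raw: str) -> Optional[datetime]:
    """Return a datetime for a well-formed ExtractedDateSent string, else None."""
    if raw == "null":  # placeholder written by fillna('null')
        return None
    # Normalize spacing after commas so '12,2012' parses like '12, 2012'
    cleaned = re.sub(r",\s*", ", ", raw.strip())
    try:
        return datetime.strptime(cleaned, SENT_FORMAT)
    except ValueError:
        return None  # e.g. the truncated value '12/14/201'

samples = [
    "Wednesday, September 12, 2012 10:16 AM",
    "Wednesday, September 12,2012 12:44 PM",
    "12/14/201",
    "null",
]
parsed = [parse_sent_date(s) for s in samples]
for raw, value in zip(samples, parsed):
    print(repr(raw), "->", value)
```

Applied to the DataFrame, this could look like `emails['ExtractedDateSent'].apply(parse_sent_date)`. The `MetadataDateSent` column is already in ISO 8601 form and may be the easier source for time-based analysis.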

**(1b) NLP Specific Processing:** Tokenization, Removing Stopwords and Lemmatization

In [14]:
# We can use NLTK to tokenize and stem our text
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
import string
nltk.download('stopwords')

# Create an instance of the Porter stemmer
stemmer = PorterStemmer()

# Stop words: NLTK's English list plus punctuation, digits, and the 'null' placeholder from fillna
punct = list(string.punctuation) + list(string.digits)
stop_words = stopwords.words('english') + punct + ['null']

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Rija\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [15]:
# Create a list to store processed data
corpus = []

# Iterate over the rows in the 'emails' DataFrame
for index, row in emails.iterrows():
    # Extract ID and content from the DataFrame
    email_id = row['Id']
    email_content = row['ExtractedBodyText']
    
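    # Tokenize, keep alphabetic tokens, then lowercase and stem them (Porter stemming, not lemmatization)
    tokens = word_tokenize(email_content)
    tokens = [stemmer.stem(token.lower()) for token in tokens if token.isalpha()]
    
    # Remove stopwords. This runs after stemming, so stop words whose stems differ
    # from the original word (e.g. 'this' -> 'thi', 'was' -> 'wa') are not removed.
    tokens = [token for token in tokens if token not in stop_words]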
    
    # Append ID and processed text to the corpus list
    corpus.append([email_id, ' '.join(tokens)])
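
The order of stemming and stop-word removal matters, which is why stems such as `thi`, `wa`, and `hi` show up in the outputs further down. The sketch below illustrates the effect with standard-library code only. `crude_stem` is a hypothetical, deliberately simple stand-in for `PorterStemmer.stem`, and `STOP_WORDS` is a tiny illustrative subset, not NLTK's list.

```python
# Tiny illustrative subset of stop words (not NLTK's full English list)
STOP_WORDS = {"this", "was", "his", "has", "the", "for"}

def crude_stem(token: str) -> str:
    """Hypothetical stand-in for PorterStemmer.stem: drops one trailing 's'."""
    return token[:-1] if token.endswith("s") and len(token) > 2 else token

def stem_then_filter(tokens):
    """Order used in the cell above: stem first, then drop stop words."""
    stems = [crude_stem(t.lower()) for t in tokens if t.isalpha()]
    return [s for s in stems if s not in STOP_WORDS]

def filter_then_stem(tokens):
    """Alternative order: drop stop words first, then stem what is left."""
    kept = [t.lower() for t in tokens if t.isalpha() and t.lower() not in STOP_WORDS]
    return [crude_stem(t) for t in kept]

tokens = "This was his plan for the meetings".split()
print(stem_then_filter(tokens))   # stop words survive as 'thi', 'wa', 'hi'
print(filter_then_stem(tokens))   # only content words remain
```

With the real pipeline, the same idea would mean checking `token.lower() not in stop_words` before calling `stemmer.stem`. Doing so would change every downstream output in this notebook, so the recorded results reflect the original order.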

In [16]:
corpus

[[1, ''],
 [2,
  'thursday march pm h latest syria aid qaddafi sid hrc memo syria aid libya hrc memo syria aid libya march hillari'],
 [3, 'thx'],
 [4, ''],
 [5,
  'h friday march pm huma abedin fw h latest syria aid qaddafi sid hrc memo syria aid libya pi print'],
 [6,
  'pi print h clintonernailcom wednesday septemb pm fw meet extremist behind film spark deadli riot meat sent wednesday septemb pm subject meet right wing extremist behind film spark deadli riot sent verizon wireless lte droid depart state case doc date state dept produc hous select benghazi comm subject agreement sensit inform redact foia waiver'],
 [7, ''],
 [8,
  'h friday march pm huma abedin fw h latest syria aid qaddafi sid hrc memo syria aid libya pi print'],
 [9, 'fyi'],
 [10,
  'wednesday septemb pm fwd libya libya sept send direct sent verizon wireless lte druid'],
 [11, 'fyi'],
 [12,
  'wednesday septemb pm fwd libya libya sept send direct sent verizon wireless lte druid'],
 [13, 'fyi'],
 [14,
  'slaughter su

In [17]:
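# Tokenize a text, skip stop words, and stem the remaining tokens.
# Despite the name, this uses the Porter stemmer rather than a lemmatizer.
# Texts taken from 'corpus' are already stemmed, so they are stemmed a second
# time here (e.g. 'hous' becomes 'hou').
def tokenize_lemmatize_text(text):
    lemmas = []  # Local list of stems for this text
    tokens = word_tokenize(text)
    for token in tokens:
        if token not in stop_words:
            lemmas.append(stemmer.stem(token))
    return lemmas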


In [None]:
# Assuming 'corpus' is your list of emails and 'tokenize_lemmatize_text' is the modified function

# Retrieve the content of the third email
third_email_content = corpus[2][1]  # Index 2 corresponds to the third element (0-based index)

# Call the function to get lemmas for the third email
third_email_lemmas = tokenize_lemmatize_text(third_email_content)

# Print the lemmas
print(third_email_lemmas)


['thx']


In [None]:
# Initialize a plain dict to store bigram counts
bigram_freqs = {}

# Iterate over each email in the corpus
for _, email_content in corpus:
    # Tokenize and stem the email content
    email_token_lemmas = tokenize_lemmatize_text(email_content)

    # Create a list of bigrams
    email_bigrams = [(email_token_lemmas[i], email_token_lemmas[i + 1]) for i in range(len(email_token_lemmas) - 1)]

    # Update the bigram frequencies
    for bigram in email_bigrams:
        bigram_freqs[bigram] = bigram_freqs.get(bigram, 0) + 1

# Create a DataFrame from the bigram frequencies
df = pd.DataFrame(list(bigram_freqs.items()), columns=['bigram', 'freq'])

# Sort the DataFrame by frequency in descending order
df = df.sort_values(by='freq', ascending=False)

# Expand bigrams into separate columns
df[['first_term', 'second_term']] = pd.DataFrame(df['bigram'].tolist(), index=df.index)


In [None]:
df
#Unfortunately, many of these bigrams don't make much sense. Is there anything else I can use?

Unnamed: 0,bigram,freq,first_term,second_term
1771,"(secretari, offic)",461,secretari,offic
1764,"(state, depart)",457,state,depart
122,"(unit, state)",447,unit,state
1846,"(white, hou)",416,white,hou
50,"(depart, state)",400,depart,state
...,...,...,...,...
85431,"(game, thi)",1,game,thi
85432,"(question, china)",1,question,china
85433,"(game, whi)",1,game,whi
85434,"(whi, china)",1,whi,china
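
One possible answer to the question in the cell above is to rank bigrams by an association score rather than raw frequency. The sketch below uses pointwise mutual information (PMI) with a minimum-count filter, written with the standard library only. The function name `pmi_bigrams`, the `min_count` default, and the toy documents are illustrative assumptions, not part of the original analysis.

```python
import math
from collections import Counter

def pmi_bigrams(docs, min_count=2):
    """Score adjacent-token pairs by PMI; docs is a list of token lists."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent pairs within one document
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs get unstable, inflated PMI values
        p_pair = count / n_bi
        p_first = unigrams[first] / n_uni
        p_second = unigrams[second] / n_uni
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up token lists, for illustration only
docs = [
    ["white", "hous", "call"],
    ["white", "hous", "meet"],
    ["state", "depart", "call"],
    ["state", "depart", "meet"],
    ["call", "meet", "call"],
]
ranked = pmi_bigrams(docs)
for pair, score in ranked:
    print(pair, round(score, 2))
```

On the real data, `docs` could be built from the per-email token lists. A higher `min_count` trades coverage for more stable scores.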


**(2) Exploratory Data Analysis**

In [None]:
emails.shape

(7945, 22)

Analyzing the distribution of emails over time

Most frequent senders and recipients

In [None]:
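# Top 10 senders, ignoring the 'null' placeholder inserted by fillna above
top_values = emails.loc[emails['MetadataFrom'] != 'null', 'MetadataFrom'].value_counts().head(10)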
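
Beyond per-person counts, the sender and recipient columns can be paired up to get the edges of a communication network. The following standard-library sketch counts `(sender, recipient)` pairs. The helper name `edge_counts` is hypothetical, and splitting recipients on `;` is an assumption based on how `ExtractedCc` is formatted in the table above. The sample rows are copied from the displayed `MetadataFrom` and `MetadataTo` values.

```python
from collections import Counter

def edge_counts(rows):
    """Count (sender, recipient) pairs; rows are (MetadataFrom, MetadataTo) tuples."""
    edges = Counter()
    for sender, recipients in rows:
        if sender == "null" or recipients == "null":
            continue  # skip the fillna('null') placeholder
        # Assumption: several recipients are separated by ';' (as in ExtractedCc)
        for recipient in recipients.split(";"):
            recipient = recipient.strip()
            if recipient:
                edges[(sender.strip(), recipient)] += 1
    return edges

rows = [
    ("Sullivan, Jacob J", "H"),
    ("null", "H"),
    ("Mills, Cheryl D", ";H"),
    ("Mills, Cheryl D", "H"),
    ("H", "Abedin, Huma"),
]
edges = edge_counts(rows)
print(edges.most_common(3))
```

For the full dataset, the rows could come from `zip(emails['MetadataFrom'], emails['MetadataTo'])`.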

Length of emails
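
A minimal sketch of how email length could be summarized, using only the standard library. The helper name `length_summary` and the choice of statistics are assumptions; the sample bodies are short strings in the style of `ExtractedBodyText`, and `'null'` entries are skipped because they are placeholders rather than real text.

```python
from statistics import mean, median

def length_summary(bodies):
    """Summarize body lengths in characters and whitespace-separated words."""
    texts = [b for b in bodies if b and b != "null"]
    if not texts:
        return {"emails_with_body": 0}
    chars = [len(t) for t in texts]
    words = [len(t.split()) for t in texts]
    return {
        "emails_with_body": len(texts),
        "median_chars": median(chars),
        "mean_words": mean(words),
        "max_words": max(words),
    }

bodies = ["Thx", "null", "Big change of plans in the Senate.", "fyi"]
summary = length_summary(bodies)
print(summary)
```

On the DataFrame, the input could be `emails['ExtractedBodyText']`.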

**(3) TF-IDF**

Use NLP techniques such as TF-IDF or count vectorization to generate a list of the five TF-IDF terms that best describe the emails.

In [None]:
# # Assuming 'corpus' is your list of emails and 'tokenize_lemmatize_text' is your lemmatization function

# # Create an empty list to store the lemmatized tokens
# lemmas_list = []

# # Iterate over each item in the corpus
# for docID, content in corpus:
#     # Apply tokenize_lemmatize_text to lemmatize the content
#     lemmas = tokenize_lemmatize_text(content)
    
#     # Append the lemmatized tokens to the list as a dictionary
#     lemmas_list.append({'docID': docID, 'lemmas': lemmas})

# # Create a DataFrame from the list of dictionaries
# lemmas_df = pd.DataFrame(lemmas_list)

# # Continue with the rest of the TF-IDF calculations...




In [None]:
# Assuming 'corpus' is your list of emails and 'tokenize_lemmatize_text' is your lemmatization function

# Create an empty list to store the lemmatized tokens
lemmas_list = []

# Iterate over each item in the corpus
for docID, content in corpus:
    # Apply tokenize_lemmatize_text to lemmatize the content
    lemmas = tokenize_lemmatize_text(content)
    
    # Append the lemmatized tokens to the list as a single tuple (docID, lemmas)
    lemmas_list.append((docID, lemmas))

# Create a DataFrame from the list of tuples
lemmas_df = pd.DataFrame(lemmas_list, columns=['docID', 'lemmas'])

# Continue with the rest of the TF-IDF calculations...


In [None]:
lemmas_df

Unnamed: 0,docID,lemmas
0,1,[]
1,2,"[thursday, march, pm, h, latest, syria, aid, q..."
2,3,[thx]
3,4,[]
4,5,"[h, friday, march, pm, huma, abedin, fw, h, la..."
...,...,...
7940,7941,[]
7941,7942,"[big, chang, plan, senat, senat, reid, announc..."
7942,7943,[]
7943,7944,"[pverveer, friday, decemb, plea, let, know, an..."


In [None]:
# # Term Frequency
# term_frequency = (
#     lemmas_df.groupby(by=['docID', 'lemmas'])
#     .size()
#     .reset_index(name='term_frequency')
# )

# # Document Frequency
# document_frequency = (
#     term_frequency.groupby(['docID', 'lemmas'])
#     .size()
#     .unstack()
#     .sum()
#     .reset_index()
#     .rename(columns={0: 'document_frequency'})
# )

# # Merge the document freqs into the term dataframe
# term_frequency = term_frequency.merge(document_frequency)

# # Inverse Document Frequency (IDF)
# documents_in_corpus = term_frequency['docID'].nunique()
# term_frequency['idf'] = np.log((1 + documents_in_corpus) / (1 + term_frequency['document_frequency'])) + 1

# # TF-IDF
# term_frequency['tfidf'] = term_frequency['term_frequency'] * term_frequency['idf']

# # Normalize our data
# term_frequency['tfidf_norm'] = preprocessing.normalize(term_frequency[['tfidf']], axis=0, norm='l2')

# # Select the top 5 TF-IDF terms for each document
# top_n_terms = term_frequency.sort_values(by=['docID', 'tfidf'], ascending=[True, False]).groupby(['docID']).head(5)

In [19]:
# Assuming 'corpus' is your list of emails and 'tokenize_lemmatize_text' is your lemmatization function

# Create an empty list to store the lemmatized tokens
lemmas_list = []

# Iterate over each item in the corpus
for docID, content in corpus:
    # Apply tokenize_lemmatize_text to lemmatize the content
    lemmas = tokenize_lemmatize_text(content)
    
    # Append the lemmatized tokens to the list as a single tuple (docID, lemmas)
    lemmas_list.append((docID, lemmas))

# Create a DataFrame from the list of tuples
lemmas_df = pd.DataFrame(lemmas_list, columns=['docID', 'lemmas'])

# Continue with the TF-IDF calculations
from sklearn.feature_extraction.text import TfidfVectorizer

# Combine lemmatized tokens into a single string per document
lemmas_df['text'] = lemmas_df['lemmas'].apply(lambda x: ' '.join(x))

# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(lemmas_df['text'])

# Create DataFrame with TF-IDF values
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Add 'docID' column to the TF-IDF DataFrame
tfidf_df['docID'] = lemmas_df['docID']

# Continue with any further processing you need...


In [20]:
lemmas_df

Unnamed: 0,docID,lemmas,text
0,1,[],
1,2,"[thursday, march, pm, h, latest, syria, aid, q...",thursday march pm h latest syria aid qaddafi s...
2,3,[thx],thx
3,4,[],
4,5,"[h, friday, march, pm, huma, abedin, fw, h, la...",h friday march pm huma abedin fw h latest syri...
...,...,...,...
7940,7941,[],
7941,7942,"[big, chang, plan, senat, senat, reid, announc...",big chang plan senat senat reid announc wa lon...
7942,7943,[],
7943,7944,"[pverveer, friday, decemb, plea, let, know, an...",pverveer friday decemb plea let know ani help ...


In [None]:
tfidf_df

Unnamed: 0,aa,aab,aafia,aar,aari,aaron,aaronovitch,ab,aback,abandon,...,zohra,zone,zoo,zsu,zuckerman,zulciernain,zuma,zumbi,zurich,docID
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7940,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7941
7941,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7942
7942,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7943
7943,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7944


In [None]:

# # Assuming 'tfidf_df' is your DataFrame with TF-IDF values and 'lemmas_df' is your DataFrame with lemmatized tokens

# # Create a new DataFrame to store the top 5 TF-IDF terms for each document
# top_n_tfidf_terms = pd.DataFrame(columns=['docID', 'top_tfidf_terms'])

# # Iterate over unique document IDs
# for docID in lemmas_df['docID'].unique():
#     # Select the top 5 TF-IDF terms for each document
#     top_5_terms = tfidf_df.loc[docID].nlargest(5).reset_index()
    
#     # Append the result to the new DataFrame
#     top_n_tfidf_terms = pd.concat([top_n_tfidf_terms, top_5_terms[['index', docID]]])

# # Rename the columns for clarity
# top_n_tfidf_terms.columns = ['term', 'docID', 'tfidf']

# # Reset the index of the new DataFrame
# top_n_tfidf_terms = top_n_tfidf_terms.reset_index(drop=True)

# # Print or use 'top_n_tfidf_terms' as needed
# print(top_n_tfidf_terms)


KeyboardInterrupt: 
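
The per-document loop over the dense `tfidf_df` was interrupted, plausibly because that frame has one column per vocabulary term. As an alternative, the top five terms per document can be computed directly from token counts. The sketch below is standard-library only and uses the smoothed IDF from the commented-out cell earlier, `log((1 + N) / (1 + df)) + 1`. The function name `top_tfidf_terms` and the toy documents are illustrative assumptions. It skips the L2 row normalization that `TfidfVectorizer` applies by default, which does not change the ranking of terms within a single document.

```python
import math
from collections import Counter

def top_tfidf_terms(docs, n=5):
    """docs maps docID -> list of tokens; returns docID -> top-n (term, score) pairs."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    result = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)  # raw term frequency within this document
        scores = {
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in counts.items()
        }
        # Highest score first; ties broken alphabetically for a stable order
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result[doc_id] = ranked[:n]
    return result

# Made-up token lists keyed by docID, for illustration only
docs = {
    1: [],
    2: ["syria", "aid", "libya", "syria", "memo"],
    3: ["thx"],
    4: ["libya", "memo", "senat"],
}
top_terms_by_doc = top_tfidf_terms(docs)
for doc_id, terms in top_terms_by_doc.items():
    print(doc_id, [term for term, _ in terms])
```

With the notebook's data, `docs` could be built as `dict(zip(lemmas_df['docID'], lemmas_df['lemmas']))`. Empty emails simply yield an empty list.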

Latent Semantic Analysis (LSA)

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Extract lemmatized text from the DataFrame
lemmatized_text = lemmas_df['lemmas'].apply(lambda lemmas: ' '.join(lemmas))

# TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(lemmatized_text)

# Latent Semantic Analysis (LSA)
num_topics = 5  # Specify the number of topics
lsa_model = TruncatedSVD(n_components=num_topics)
lsa_matrix = lsa_model.fit_transform(tfidf_matrix)

# Get the top terms for each topic
terms = tfidf_vectorizer.get_feature_names_out()
top_terms_indices = lsa_model.components_.argsort(axis=1)[:, ::-1]
top_terms = [[terms[idx] for idx in row[:5]] for row in top_terms_indices]

# Display the top terms for each topic
for i, topic_terms in enumerate(top_terms):
    print(f"Topic {i + 1}: {', '.join(topic_terms)}")


Topic 1: fyi, fw, cheryl, millscd, pm
Topic 2: ok, talk, pm, thi, thx
Topic 3: pm, print, thi, pl, secretari
Topic 4: print, pl, pi, copi, hrc
Topic 5: ye, thx, pl, print, work


Word Embeddings (Word2Vec)

In [27]:
from gensim.models import Word2Vec

# Extract lemmatized tokens from the DataFrame
lemmatized_tokens = lemmas_df['lemmas'].tolist()

# Train Word2Vec model
word2vec_model = Word2Vec(sentences=lemmatized_tokens, vector_size=100, window=5, min_count=1, workers=4)

# Get the most similar terms for each term in the vocabulary
similar_terms = {term: [similar[0] for similar in word2vec_model.wv.most_similar(term, topn=5)] for term in word2vec_model.wv.index_to_key}

# Display the similar terms for each term
for term, similar_list in similar_terms.items():
    print(f"{term}: {', '.join(similar_list)}")


thi: great, forward, veri, done, anyth
wa: fifti, hi, poll, three, run
hi: wa, mr, dure, bush, run
state: rockefel, nypd, congreg, rothschild, nation
ha: term, success, becom, strong, critic
pm: pool, aboul, franklin, benjamin, camera
call: op, w, schedul, sheet, berri
would: like, could, way, say, happen
secretari: hillari, rodham, outer, brief, confer
time: today, scrub, happi, delight, holbrook
work: around, need, well, togeth, appreci
offic: confer, outer, staff, brief, daili
said: mr, morsi, dure, first, run
obama: barack, presid, mr, administr, bush
depart: air, whiteh, doc, cherish, en
presid: barack, bush, mr, obama, dure
one: made, correctli, even, say, never
new: york, upstat, frontier, citi, zealand
meet: offic, confer, phone, nalbandian, brief
also: speak, proxim, leav, mayb, someon
hou: senat, nocol, white, chatham, powder
like: happen, might, still, someth, could
get: sure, go, hope, asap, see
us: issu, help, pakistan, need, direct
say: one, happen, correctli, point, woul
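
The neighbours listed above are ranked by cosine similarity between word vectors, which is what `most_similar` computes. The sketch below reproduces that ranking with the standard library. The three-dimensional vectors are made up for illustration and are not taken from the trained model; the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(term, vectors, topn=5):
    """Rank every other term by cosine similarity to `term`."""
    target = vectors[term]
    scored = [(other, cosine(target, vec)) for other, vec in vectors.items() if other != term]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:topn]

# Made-up toy vectors; a real model here would use vector_size=100
vectors = {
    "obama": [0.9, 0.1, 0.0],
    "presid": [0.8, 0.2, 0.1],
    "print": [0.0, 0.1, 0.9],
}
neighbours = most_similar("obama", vectors, topn=2)
print(neighbours)
```

Because only the angle between vectors matters, very frequent and very rare terms can still end up as close neighbours, which may explain some of the odd pairings in the output above.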

Non-Negative Matrix Factorization (NMF)

In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(lemmatized_text)

# Non-Negative Matrix Factorization (NMF)
num_topics = 5  # Specify the number of topics
nmf_model = NMF(n_components=num_topics)
nmf_matrix = nmf_model.fit_transform(tfidf_matrix)

# Get the top terms for each topic
terms = tfidf_vectorizer.get_feature_names_out()
top_terms_indices = nmf_model.components_.argsort(axis=1)[:, ::-1]
top_terms = [[terms[idx] for idx in row[:5]] for row in top_terms_indices]

# Display the top terms for each topic
for i, topic_terms in enumerate(top_terms):
    print(f"Topic {i + 1}: {', '.join(topic_terms)}")


Topic 1: fyi, fw, cheryl, millscd, high
Topic 2: ok, talk, sound, thx, relea
Topic 3: pm, thi, secretari, state, offic
Topic 4: print, pl, pi, thx, copi
Topic 5: ye, thx, work, lona, set


Doc2Vec

In [34]:
# TaggedDocument lives in gensim.models.doc2vec, not in gensim.models
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tag documents with unique IDs (tags are stored as strings)
tagged_data = [TaggedDocument(words=lemmas, tags=[str(docID)]) for docID, lemmas in zip(lemmas_df['docID'], lemmas_df['lemmas'])]

# Train Doc2Vec model
doc2vec_model = Doc2Vec(vector_size=100, window=5, min_count=1, workers=4, epochs=20)
doc2vec_model.build_vocab(tagged_data)
doc2vec_model.train(tagged_data, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.epochs)

# Get the most similar documents for each document
# (gensim 4.x exposes document vectors as `dv`; look up by the string tag used above)
similar_docs = {docID: [similar[0] for similar in doc2vec_model.dv.most_similar(str(docID), topn=5)] for docID in lemmas_df['docID']}

# Display the most similar documents for each document
for docID, similar_list in similar_docs.items():
    print(f"Document {docID}: {', '.join(similar_list)}")


Topic Modeling (Latent Dirichlet Allocation - LDA)

In [31]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Count Vectorization
count_vectorizer = CountVectorizer(stop_words='english')
count_matrix = count_vectorizer.fit_transform(lemmatized_text)

# Latent Dirichlet Allocation (LDA)
num_topics = 5  # Specify the number of topics
lda_model = LatentDirichletAllocation(n_components=num_topics, random_state=42)
lda_matrix = lda_model.fit_transform(count_matrix)

# Get the top terms for each topic
terms = count_vectorizer.get_feature_names_out()
top_terms_indices = lda_model.components_.argsort(axis=1)[:, ::-1]
top_terms = [[terms[idx] for idx in row[:5]] for row in top_terms_indices]

# Display the top terms for each topic
for i, topic_terms in enumerate(top_terms):
    print(f"Topic {i + 1}: {', '.join(topic_terms)}")


Topic 1: thi, work, wa, want, know
Topic 2: wa, hi, ha, thi, obama
Topic 3: thi, women, ok, need, work
Topic 4: state, pm, depart, secretari, offic
Topic 5: fyi, pm, cheryl, huma, fw


**(4) Temporal Analysis**

Investigate how email communications change over time.

Identify any patterns or anomalies in the volume of emails sent and received.

Correlate these with external events or timelines.
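
A minimal sketch of the first two steps, using only the standard library: count emails per month from ISO 8601 timestamps like those in `MetadataDateSent`, then flag months whose volume is unusually high. The function names, the z-score rule, and the `z_threshold` default are assumptions chosen for illustration. The timestamps are copied from the table shown earlier, while the counts passed to `flag_busy_months` are made up.

```python
from collections import Counter
from datetime import datetime
from statistics import mean, pstdev

def monthly_counts(timestamps):
    """Count emails per 'YYYY-MM' from ISO 8601 strings, skipping 'null' placeholders."""
    counts = Counter()
    for raw in timestamps:
        if raw == "null":
            continue
        sent = datetime.fromisoformat(raw)
        counts[sent.strftime("%Y-%m")] += 1
    return dict(sorted(counts.items()))

def flag_busy_months(counts, z_threshold=1.5):
    """Return months whose count lies more than z_threshold standard deviations above the mean."""
    values = list(counts.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [month for month, c in counts.items() if (c - mu) / sigma > z_threshold]

timestamps = [
    "2012-09-12T04:00:00+00:00",
    "2011-03-03T05:00:00+00:00",
    "2012-09-12T04:00:00+00:00",
    "2011-03-11T05:00:00+00:00",
    "2010-12-16T05:00:00+00:00",
    "null",
]
per_month = monthly_counts(timestamps)
print(per_month)

# Made-up monthly volumes to show the flagging step
example_counts = {"2011-01": 10, "2011-02": 12, "2011-03": 11, "2011-04": 9, "2011-05": 40}
busy = flag_busy_months(example_counts)
print(busy)
```

Months with no emails do not appear in the counts at all, so gaps would need to be filled with zeros before reading too much into the statistics. Flagged months can then be compared by hand against a timeline of external events.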