# Data Analysis of Speech Text Files
This notebook will explore the text files provided by the assignment. 

In [1]:
import os, glob, re
import pandas as pd
import numpy as np

## Read Text Files

In [2]:
data_path = '../data/presidentspeeches/'
# Build the full path of every file in the speech directory
file_names = [os.path.join(data_path, name) for name in os.listdir(data_path)]

Some files raised encoding errors when read as UTF-8, which prevented us from iterating over them, hence the `errors='ignore'` argument.
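The `errors='ignore'` option drops undecodable bytes silently, so it is worth knowing what it hides. Here is a small sketch on a made-up byte string (the sample bytes are an assumption, not taken from the speech files), compared with strict decoding and `errors='replace'`:

```python
# Hypothetical sample: 0x92 is a Windows-1252 apostrophe, which is not valid UTF-8.
raw = b"the nation\x92s affairs"

# Strict decoding (the default) raises on the first invalid byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

ignored = raw.decode("utf-8", errors="ignore")    # invalid byte is dropped
replaced = raw.decode("utf-8", errors="replace")  # invalid byte becomes U+FFFD

print(strict_ok)
print(ignored)
print(replaced)
```

With `ignore` the text stays readable but loses characters without any trace; `replace` keeps a visible marker if we later want to count how much was affected.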

In [3]:
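text_dict = {}
for path in file_names:
    # Use only the digits in the file name itself, not in its directory path
    num = ''.join(re.findall('[0-9]', os.path.basename(path)))
    # errors='ignore' silently drops bytes that cannot be decoded as UTF-8
    with open(path, 'r', encoding='utf-8', errors='ignore') as f:
        text_dict[num] = f.read()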

In [4]:
speeches = pd.Series(data= [text_dict[key] for key in text_dict.keys()],
                     index= [int(key) for key in text_dict.keys()],
                     name= 'speeches')
speeches = speeches.sort_index()

## Data Cleaning
### Test for duplicates

In [5]:
speeches.duplicated().any()

True

Uh oh! Let's find out how many duplicates there are, and then which files are duplicates of each other.

In [6]:
speeches.duplicated().sum()

1

We only have one duplicate. By default, the `.duplicated()` method flags every occurrence except the first (or except the last with `keep='last'`), so a single flagged row does not by itself tell us which other document it matches. Let's look at both sides.

In [7]:
print(speeches[speeches.duplicated(keep='first')])
print(speeches[speeches.duplicated(keep='last')])

119        To the Senate and House of Representatives...
Name: speeches, dtype: object
116        To the Senate and House of Representatives...
Name: speeches, dtype: object


In [8]:
duplicate = speeches[119]
for key, val in zip(speeches.index,speeches.values):
    if val == duplicate:
        print(key)

116
119
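As an aside, `keep=False` flags every member of a duplicate group at once, which avoids the manual loop. A minimal sketch on a toy series (the values and index below are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the speeches series; index values mimic file numbers.
toy = pd.Series(["a", "b", "a", "c"], index=[116, 117, 119, 120], name="speeches")

# keep=False marks all occurrences of a repeated value, not just the later ones.
mask = toy.duplicated(keep=False)
dup_index = toy[mask].index.tolist()

print(dup_index)
```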


We have confirmed that only two documents are exact duplicates of each other. Opening both text files confirms this too, but of course we had to perform this process as true data scientists before resorting to our lesser non-Python ways.

In dealing with this dilemma, we will merely remove the second instance. We arbitrarily keep the first copy.

In [9]:
print('Length of speech series before removal:',len(speeches))
speeches = speeches[~speeches.duplicated(keep='first')]
print('Length of speech series after removal:',len(speeches))

Length of speech series before removal: 217
Length of speech series after removal: 216


In [94]:
#Let's make sure we do the same with our dictionary of text values 
del text_dict['119']

In [10]:
df = pd.DataFrame(speeches)

# Preprocessing
With this mild exploration under our belts, we are ready to begin the critical preprocessing step. A processed corpus of text will often take the form of a massive matrix holding some interpretation of our word frequency values. Of course, many of these cells will contain zeros, since most words in the corpus appear in only a few of its documents.

This is the problem of *matrix sparsity*. For its size, our processed corpus matrix actually has very little information. Luckily, through the act of preprocessing we can minimize its impact on our model.
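To make sparsity concrete, here is a minimal sketch that builds a document-term count matrix by hand and measures the share of zero cells. The three miniature documents are made up for illustration; the real corpus would give a far larger and sparser matrix.

```python
# Three made-up miniature "documents" (an assumption for illustration only).
docs = [
    "the senate and the house",
    "the nation in commotion",
    "fatal fever visited our shores",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})

# Document-term count matrix: one row per document, one column per vocabulary word.
matrix = [[tokens.count(word) for word in vocab] for tokens in tokenized]

total_cells = len(docs) * len(vocab)
zero_cells = sum(row.count(0) for row in matrix)
sparsity = zero_cells / total_cells

print(f"{len(docs)} x {len(vocab)} matrix, {sparsity:.0%} of cells are zero")
```

Even with three short documents most cells are zero, because each new document mostly adds words the others do not use.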

For this introduction to text mining, we will split text preprocessing into two key steps. 
- **Noise Removal**: Removing data components which are redundant to our text analytics and are thus considered noise.
- **Normalization**: The process of handling multiple occurrences or representations of the same word. Normalization can be further broken down into two types: *stemming* and *lemmatization*.
  - **Stemming**: Text is normalized by removing suffixes and prefixes, e.g. *learned*, *learner*, and *learning* all become *learn*.
  - **Lemmatization**: A more advanced technique which identifies the *root* (dictionary form) of the word.
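
The difference between the two normalization types can be sketched with toy stand-ins. The suffix list and lemma table below are made-up assumptions for illustration; they are far cruder than NLTK's `PorterStemmer` and `WordNetLemmatizer`.

```python
# Toy rules only: a real stemmer or lemmatizer uses far richer rules and a dictionary.
SUFFIXES = ("ing", "ed", "er", "s")                          # hypothetical suffix list
LEMMAS = {"better": "good", "cities": "city", "was": "be"}   # hypothetical lookup table


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in a table of known roots; fall back to the word itself."""
    return LEMMAS.get(word, word)


stems = [toy_stem(w) for w in ["learned", "learner", "learning"]]
print(stems)
print(toy_stem("cities"), toy_lemmatize("cities"))
print(toy_stem("better"), toy_lemmatize("better"))
```

Suffix stripping is cheap but can produce non-words (here *citie* and *bett*), whereas a lookup-based lemmatizer returns real dictionary forms at the cost of needing that dictionary.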

To perform these operations, we look to a widely used library for NLP workflows, the **Natural Language Toolkit (NLTK)**.

In [66]:
import re
import nltk
#nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer

#nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer

In [67]:
# create the set of English stop words (custom stop words could be added here)
stop_words = set(stopwords.words("english"))

In [95]:
corpus = {}
lem = WordNetLemmatizer()  # build once rather than once per document

# Keys are file numbers stored as strings; '119' was removed above as a duplicate
for key, raw in text_dict.items():
    # Remove tags first, while the angle brackets are still present
    text = re.sub(r"</?.*?>", " ", raw)

    # Replace punctuation, digits and other non-letter characters with spaces
    text = re.sub(r"[^a-zA-Z]", " ", text)

    # Convert to lowercase
    text = text.lower()

    # Convert to list from string
    words = text.split()

    # Lemmatisation, dropping stop words (no stemming is applied in this pass)
    words = [lem.lemmatize(word) for word in words if word not in stop_words]
    corpus[key] = " ".join(words)

In [96]:
text_dict['17']

'    The Senate and House of Representatives of the United States: At a moment when the nations of Europe are in commotion and arming against each other, and when those with whom we have principal intercourse are engaged in the general contest, and when the countenance of some of them toward our peaceable country threatens that even that may not be unaffected by what is passing on the general theater, a meeting of the representatives of the nation in both Houses of Congress has become more than usually desirable. Coming from every section of our country, they bring with them the sentiments and the information of the whole, and will be enabled to give a direction to the public affairs which the will and the wisdom of the whole will approve and support.In taking a view of the state of our country we in the first place notice the late affliction of two of our cities under the fatal fever which in latter times has occasionally visited our shores. Providence in His goodness gave it an early

In [97]:
corpus['17']

'senate house representative united state moment nation europe commotion arming principal intercourse engaged general contest countenance toward peaceable country threatens even may unaffected passing general theater meeting representative nation house congress become usually desirable coming every section country bring sentiment information whole enabled give direction public affair wisdom whole approve support taking view state country first place notice late affliction two city fatal fever latter time occasionally visited shore providence goodness gave early termination occasion lessened number victim usually fallen course several visitation disease appeared strictly local incident city tide water incommunicable country either person disease good carried diseased place access autumn disappears early frost restriction within narrow limit time space give security even maritime city three quarter year country always although fact appears unnecessary yet satisfy fear foreign nation caut

# Join with Metadata
I've noticed that the files on record for the metadata don't match up perfectly with all the text files. We'll first see which numbers we don't have metadata for.

In [11]:
meta_df = pd.read_csv('../data/speechdetails.csv')
meta_df.head()

Unnamed: 0,Name,Year,File number,IC
0,G. Washington,1790,1,2.1
1,G. Washington,1790,2,2.0
2,G. Washington,1791,3,1.75
3,G. Washington,1792,4,1.4
4,J. Adam,1797,9,2.5


In [12]:
meta_files = meta_df['File number'].values

unmatched = set(df.index.values)
for num in meta_files:
    if num in unmatched:
        unmatched.remove(num)
#unmatched now contains all the files with no meta_data
print('Number of unmatched rows:',len(unmatched))

Number of unmatched rows: 68
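
The same check can be written as a set difference, which also makes it easy to look in the opposite direction. A minimal sketch with made-up file numbers standing in for `df.index.values` and `meta_df['File number'].values`:

```python
# Hypothetical stand-ins for df.index.values and meta_df['File number'].values.
speech_files = [1, 2, 3, 4, 5, 6, 9]
meta_files_demo = [1, 2, 3, 4, 9, 200]

# Files that have text but no metadata row.
unmatched_demo = set(speech_files) - set(meta_files_demo)

# The reverse direction is worth checking too: metadata rows with no text file.
orphan_meta = set(meta_files_demo) - set(speech_files)

print(sorted(unmatched_demo), sorted(orphan_meta))
```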


The best way to move forward with this discovery is to treat the samples without metadata as our test set. We should be able to extract the speaker from the text, although we wouldn't expect the year or speaker features to be too predictive.

In [13]:
test_df = df.loc[df.index.isin(unmatched)]
train_df = df.loc[~df.index.isin(unmatched)]

## NOTE: Create a 3. FEATURE notebook in which you can pass in a dataframe with various forms of text cleaning and then create features --> we will have features for each type of pre-processing

# Add Features

In [14]:
from feat_utils import add_feature_character_count, \
    add_feature_ngram_count, add_feature_overlap_ngrams, \
    add_feature_sentence_count, add_feature_unique_ngram_count, \
    add_feature_unique_ngram_ratio, add_feature_word_count

#Advanced features
from feat_utils import add_feature_tfidf_svd_similarity, \
    add_feature_w2v_similarity, add_feature_sentiment_sid

### Add basic count features

In [15]:
add_feature_character_count(df,'speeches')
add_feature_word_count(df,'speeches')
add_feature_sentence_count(df,'speeches')

### Ngram related features

In [16]:
#uni, bi, and tri-gram count
add_feature_ngram_count(df,'speeches',1)
add_feature_ngram_count(df,'speeches',2)
add_feature_ngram_count(df,'speeches',3)

#unique uni, bi, and tri-gram count
add_feature_unique_ngram_count(df,'speeches',1)
add_feature_unique_ngram_count(df,'speeches',2)
add_feature_unique_ngram_count(df,'speeches',3)

#unique uni, bi, and tri-gram ratios
add_feature_unique_ngram_ratio(df,'speeches',1)
add_feature_unique_ngram_ratio(df,'speeches',2)
add_feature_unique_ngram_ratio(df,'speeches',3)

In [17]:
df

Unnamed: 0,speeches,num_chars_speeches,num_word_speeches,num_sents_speeches,count_1gram_speeches,count_2gram_speeches,count_3gram_speeches,count_unique_1gram_speeches,count_unique_2gram_speeches,count_unique_3gram_speeches,ratio_unique_1gram_speeches,ratio_unique_2gram_speeches,ratio_unique_3gram_speeches
1,Fellow-Citizens of the Senate and House of ...,6677,1079,10,1076,1075,1074,479,938,1048,0.445167,0.872558,0.975791
2,Fellow-Citizens of the Senate and House of...,8363,1396,29,1392,1391,1390,621,1211,1370,0.446121,0.870597,0.985612
3,Fellow-Citizens of the Senate and House of...,14075,2287,34,2276,2275,2274,871,1839,2165,0.382689,0.808352,0.952067
4,Fellow-Citizens of the Senate and House of...,12644,2078,35,2074,2073,2072,832,1728,1996,0.401157,0.833575,0.963320
5,Fellow-Citizens of the Senate and House of...,11591,1953,37,1949,1948,1947,829,1627,1867,0.425346,0.835216,0.958911
6,Fellow-Citizens of the Senate and House of...,17533,2897,56,2893,2892,2891,1201,2456,2809,0.415140,0.849239,0.971636
7,Fellow-Citizens of the Senate and House of ...,12217,1973,35,1970,1969,1968,860,1681,1924,0.436548,0.853733,0.977642
8,Fellow-Citizens of the Senate and House of...,17250,2844,48,2840,2839,2838,1086,2335,2726,0.382394,0.822473,0.960536
9,Gentlemen of the Senate and Gentlemen of t...,12423,2057,42,2048,2047,2046,803,1687,1941,0.392090,0.824133,0.948680
10,Gentlemen of the Senate and Gentlemen of t...,13328,2203,34,2199,2198,2197,882,1828,2109,0.401091,0.831665,0.959945
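
The `feat_utils` helpers are not shown in this notebook. Judging from the table, each ratio column equals the unique count divided by the total count (for file 1, 479 / 1076 ≈ 0.445). A minimal sketch of that calculation, assuming simple whitespace tokenization (the real helper may tokenize differently):

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unique_ngram_ratio(text, n):
    """Share of n-grams that are distinct; hypothetical stand-in for the feat_utils helper."""
    grams = ngrams(text.split(), n)
    return len(set(grams)) / len(grams) if grams else 0.0


sample = "of the people by the people for the people"
unigram_ratio = unique_ngram_ratio(sample, 1)
bigram_ratio = unique_ngram_ratio(sample, 2)
print(unigram_ratio, bigram_ratio)
```

A ratio near 1 means almost every n-gram occurs only once, which is why the trigram ratios in the table sit well above the unigram ratios.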


### Sentiment Features

In [18]:
add_feature_sentiment_sid(df,'speeches',avg=False) 

### Comparison Features
**Idea**: There are many useful comparison-based features that we can use (TF-IDF & Word2Vec embedded differences). As we don't have a natural comparison between documents (as we might if we were comparing article headlines to body text) we can manufacture our own. Let's add as a feature the **TF-IDF & Word2Vec Similarity to the Min and Max integrative complexity article**.

Perhaps the TF-IDF matrix and Word2Vec embedding of sample speeches will capture some of the complexity. To do this, we will first need to identify the min and max scores of the ICs.

In [23]:
print('Maximum IC sample')
meta_df[meta_df['IC']== max(meta_df.IC.values)]

Maximum IC sample


Unnamed: 0,Name,Year,File number,IC
87,W.Taft,1909,121,3.75


In [24]:
print('Minimum IC sample')
meta_df[meta_df['IC']== min(meta_df.IC.values)]

Minimum IC sample


Unnamed: 0,Name,Year,File number,IC
35,J.Tyler,1844,56,1.0
38,J.Polk,1847,59,1.0
39,J.Polk,1848,60,1.0
70,C.Arthur,1884,96,1.0
94,W.Wilson,1916,128,1.0


As we have multiple speeches with low IC scores, we can try using the vectorized similarities of **each** as a feature, as well as a feature which averages their matrices (why not?).

To satisfy the structure of the feature generation functions we've created, we must add columns to the dataframe holding the text of the minimum- and maximum-scored speeches, for instance *min_IC_speech3* and *max_IC_speech0*. Our functions compute the similarity between two entries in the same row, so each reference speech needs its own column.

In [26]:
min_files = meta_df[meta_df['IC']== min(meta_df.IC.values)]['File number'].values
max_files = meta_df[meta_df['IC']== max(meta_df.IC.values)]['File number'].values

In [39]:
for i, num in enumerate(min_files):
    df['min_IC_speech'+str(i)] = text_dict[str(num)]
for i, num in enumerate(max_files):
    df['max_IC_speech'+str(i)] = text_dict[str(num)]

### Create Comparison Features
These features are a touch more complex, and require us to pass models in as arguments.

#### TF-IDF Similarity Feature

In [51]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1,3),max_df=0.8,min_df=2)

#Add in TFIDF difference for as many min_IC_speech columns we have
for col2 in [column for column in df.columns if 'min_IC_speech' in column]:
    add_feature_tfidf_svd_similarity(df,'speeches',col2, vectorizer)

#Do the same for max_IC_speech columns
for col2 in [column for column in df.columns if 'max_IC_speech' in column]:
    add_feature_tfidf_svd_similarity(df,'speeches',col2, vectorizer)
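
`add_feature_tfidf_svd_similarity` lives in `feat_utils` and is not shown here, so the following is only a sketch of the underlying idea: weight each term by TF-IDF and compare two documents by cosine similarity. It uses plain Python, a smoothed IDF, and made-up documents; the SVD step implied by the function name is omitted.

```python
import math
from collections import Counter

# Made-up mini corpus; the first entry plays the role of a reference speech.
docs = [
    "peace and commerce with all nations",
    "commerce with foreign nations is vital",
    "the fever visited our cities",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(tokens):
    """Map each term to its count times a smoothed inverse document frequency."""
    counts = Counter(tokens)
    return {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
            for t, c in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vectors = [tfidf(tokens) for tokens in tokenized]
sim_related = cosine(vectors[0], vectors[1])    # shares "commerce", "with", "nations"
sim_unrelated = cosine(vectors[0], vectors[2])  # shares no terms
print(round(sim_related, 3), sim_unrelated)
```

Documents with no vocabulary in common score exactly zero, which is one reason a dense projection such as SVD is often applied before comparing TF-IDF vectors.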

#### Word2Vec Similarity Features

In [60]:
#Import this 1.5GB model from Google!!!! (yikes)
import gensim 
model = gensim.models.KeyedVectors.load_word2vec_format('../../../../../Github/avengers-ensemble/eda_pipeline/GoogleNews-vectors-negative300.bin',binary=True)

In [62]:
#Add in W2V difference for as many min_IC_speech columns we have
for col2 in [column for column in df.columns if 'min_IC_speech' in column]:
    add_feature_w2v_similarity(df,'speeches',col2, model,1)

#Do the same for max_IC_speech columns
for col2 in [column for column in df.columns if 'max_IC_speech' in column]:
    add_feature_w2v_similarity(df,'speeches',col2, model,1)

AttributeError: 'float' object has no attribute 'split'
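
The error above indicates that a non-string value reached a `.split()` call. One plausible cause, assuming the TF-IDF step added numeric columns whose names contain `min_IC_speech` (the feature names produced by `feat_utils` are not shown, so this is a guess), is that the substring test also selects those feature columns. A sketch of a stricter selection on hypothetical column names:

```python
import re

# Hypothetical column names; the real feature names come from feat_utils and may differ.
columns = [
    "speeches",
    "min_IC_speech0",
    "min_IC_speech1",
    "max_IC_speech0",
    "tfidf_svd_sim_speeches_min_IC_speech0",  # numeric feature, not text
]

# Substring match: also picks up the numeric similarity column.
loose = [c for c in columns if "min_IC_speech" in c]

# Anchored match: only the reference-text columns named min_IC_speech<number>.
strict = [c for c in columns if re.fullmatch(r"min_IC_speech\d+", c)]

print(loose)
print(strict)
```

If this is indeed the cause, collecting the reference-text column names once, before any similarity features are added, would avoid the problem in both loops.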

## Pickle Dataframe
We have a lot of features, so let's save the dataframe as a pickle to avoid reprocessing the text each time!

In [65]:
df.to_pickle('../data/features_df.pkl')