# Feature Engineering

We create features from the raw text so we can train the machine learning models. The steps followed are:

2. **Label coding**: creation of a dictionary to map each category to a code.
3. **Train-test split**: to test the models on unseen data.
4. **Text representation**: use of TF-IDF scores to represent text.

In [1]:
import pickle
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
import numpy as np

First of all we'll load the dataset:

In [3]:
df = pd.read_csv('ml_model.csv')


In [4]:
# Keep only the data from 1913 onward by dropping rows 21932 to 42285
df = df.drop(df.index[21932:42286])

Next, we can delete the intermediate columns. Let's first check which columns we have:

In [5]:
print(df.columns)

Index(['Name', 'Type', 'Party', 'Republican', 'year', 'mean_word_length',
       'word_count', 'some_word_count', 'unique_word', 'some_unique_word',
       'unique_word_ratio', 'some_unique_word_ratio', 'words', 'sentences',
       'mean_sentence_length', 'positive_words', 'negative_words',
       'positive_words_ratio', 'negative_words_ratio', 'we_count', 'war_count',
       'i_count', 'Sentences', 'vader_neg', 'vader_pos', 'vader_neu',
       'vader_com'],
      dtype='object')


In [6]:
#df.head(1)
df.drop(['year','we_count','war_count','i_count','words'], axis=1, inplace=True)

In [7]:
print(df.columns)

Index(['Name', 'Type', 'Party', 'Republican', 'mean_word_length', 'word_count',
       'some_word_count', 'unique_word', 'some_unique_word',
       'unique_word_ratio', 'some_unique_word_ratio', 'sentences',
       'mean_sentence_length', 'positive_words', 'negative_words',
       'positive_words_ratio', 'negative_words_ratio', 'Sentences',
       'vader_neg', 'vader_pos', 'vader_neu', 'vader_com'],
      dtype='object')


In [9]:
list_columns = ['Name', 'Type', 'Party', 'Republican', 'mean_word_length', 'word_count',
       'some_word_count', 'unique_word', 'some_unique_word',
       'unique_word_ratio', 'some_unique_word_ratio', 'sentences',
       'mean_sentence_length', 'positive_words', 'negative_words',
       'positive_words_ratio', 'negative_words_ratio', 'Sentences', 'vader_neg', 
        'vader_pos', 'vader_neu', 'vader_com']
df = df[list_columns]

df = df.rename(columns={'Sentences': 'Sentences'})

In [11]:
print(df.shape)
print(df.columns)

(21932, 22)
Index(['Name', 'Type', 'Party', 'Republican', 'mean_word_length', 'word_count',
       'some_word_count', 'unique_word', 'some_unique_word',
       'unique_word_ratio', 'some_unique_word_ratio', 'sentences',
       'mean_sentence_length', 'positive_words', 'negative_words',
       'positive_words_ratio', 'negative_words_ratio', 'Sentences',
       'vader_neg', 'vader_pos', 'vader_neu', 'vader_com'],
      dtype='object')


## 2. Label coding

We'll create a dictionary that maps each party label to a numeric code:

In [12]:
category_codes = {
    'Democrat': 0,
    'Republican': 1}

In [13]:
# Category mapping
df['Republican'] = df['Party']
df = df.replace({'Republican':category_codes})

In [14]:
df.head()

| | Name | Type | Party | Republican | mean_word_length | word_count | some_word_count | unique_word | some_unique_word | unique_word_ratio | ... | mean_sentence_length | positive_words | negative_words | positive_words_ratio | negative_words_ratio | Sentences | vader_neg | vader_pos | vader_neu | vader_com |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Donald Trump | State of the Union | Republican | 1 | 6.01407 | 29897.0 | 22255.0 | 1659.0 | 1387.0 | 0.055491 | ... | 105.414634 | 244.0 | 177.0 | 0.010964 | 0.007953 | madam speaker, mr. vice president, membe... | 0.0 | 0.097 | 0.903 | 0.4215 |
| 1 | Donald Trump | State of the Union | Republican | 1 | 6.01407 | 29897.0 | 22255.0 | 1659.0 | 1387.0 | 0.055491 | ... | 105.414634 | 244.0 | 177.0 | 0.010964 | 0.007953 | as we begin a new congress, i stand here ready... | 0.0 | 0.122 | 0.878 | 0.3612 |
| 2 | Donald Trump | State of the Union | Republican | 1 | 6.01407 | 29897.0 | 22255.0 | 1659.0 | 1387.0 | 0.055491 | ... | 105.414634 | 244.0 | 177.0 | 0.010964 | 0.007953 | millions of our fellow citizens are watching u... | 0.053 | 0.159 | 0.788 | 0.4954 |
| 3 | Donald Trump | State of the Union | Republican | 1 | 6.01407 | 29897.0 | 22255.0 | 1659.0 | 1387.0 | 0.055491 | ... | 105.414634 | 244.0 | 177.0 | 0.010964 | 0.007953 | the agenda i will lay out this evening is not ... | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | Donald Trump | State of the Union | Republican | 1 | 6.01407 | 29897.0 | 22255.0 | 1659.0 | 1387.0 | 0.055491 | ... | 105.414634 | 244.0 | 177.0 | 0.010964 | 0.007953 | it is the agenda of the american people. | 0.0 | 0.0 | 1.0 | 0.0 |
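As a side note, the same label coding can be sketched with `Series.map`, which leaves `NaN` wherever a label is missing from the dictionary and so makes unmapped categories easy to spot. The toy frame below is hypothetical; only the `Party` and `Republican` column names and the `category_codes` dictionary come from the notebook.

```python
import pandas as pd

# Hypothetical toy frame; only the column names mirror the notebook.
toy = pd.DataFrame({'Party': ['Democrat', 'Republican', 'Republican', 'Democrat']})

category_codes = {'Democrat': 0, 'Republican': 1}

# Series.map returns NaN for any label that is not a key of the dictionary,
# so counting NaN values reveals unmapped categories.
toy['Republican'] = toy['Party'].map(category_codes)

unmapped = int(toy['Republican'].isna().sum())
codes = toy['Republican'].astype(int).tolist()
print(unmapped, codes)
```

If `unmapped` is greater than zero, the dictionary is missing at least one category that appears in the data.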


## 3. Train-test split

We'll set apart a test set to assess the quality of our models. We'll use cross-validation on the training set to tune the hyperparameters, and then measure performance on the unseen data of the test set.

In [15]:
X_train, X_test, y_train, y_test = train_test_split(df['Sentences'], 
                                                    df['Republican'], 
                                                    test_size=0.30, 
                                                    random_state=8)

Since we have relatively many observations (21,932), we'll use a 70/30 train/test split of the full dataset.
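One optional variation, not used in the notebook, is to pass `stratify` so that both subsets keep the same class proportions. The sketch below uses a hypothetical balanced toy dataset of 20 rows; the column names mirror the notebook, everything else is an assumption.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 sentences per class.
toy = pd.DataFrame({
    'Sentences': [f'sentence {i}' for i in range(20)],
    'Republican': [0] * 10 + [1] * 10,
})

# stratify keeps the 50/50 class balance in both the train and test subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy['Sentences'],
    toy['Republican'],
    test_size=0.30,
    random_state=8,
    stratify=toy['Republican'],
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```

With classes this balanced the effect is small, but stratifying can matter when one party has far fewer sentences than the other.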

## 4. Text representation

We have various options:

* Count Vectors as features
* TF-IDF Vectors as features
* Word Embeddings as features
* Text / NLP based features
* Topic Models as features

We'll use **TF-IDF Vectors** as features.

We have to define the different parameters:

* `ngram_range`: We want to consider both unigrams and bigrams.
* `max_df`: When building the vocabulary, ignore terms that have a document
    frequency strictly higher than the given threshold.
* `min_df`: When building the vocabulary, ignore terms that have a document
    frequency strictly lower than the given threshold.
* `max_features`: If not None, build a vocabulary that only considers the top
    `max_features` terms ordered by term frequency across the corpus.

Note that the `norm` argument implicitly scales our data when we represent it as TF-IDF features.
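To make these parameters concrete, here is a minimal sketch on a hypothetical three-sentence corpus. The values `min_df=2` and the corpus itself are assumptions chosen for illustration; they are not the notebook's settings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus.
corpus = [
    'we will cut taxes',
    'we will raise wages',
    'we will protect jobs',
]

# ngram_range=(1, 2) adds bigrams; min_df=2 drops terms seen in one document only;
# max_df=1.0 is a proportion, so it applies no upper cutoff.
vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_df=1.0, norm='l2')
X = vec.fit_transform(corpus).toarray()

vocab = sorted(vec.vocabulary_)
row_norms = np.linalg.norm(X, axis=1)
print(vocab)
print(row_norms)
```

Only the unigrams and the bigram shared by at least two sentences survive the `min_df` filter, and with `norm='l2'` every non-empty row has unit Euclidean length.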

In [16]:
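# Parameter selection
ngram_range = (1, 2)  # unigrams and bigrams
min_df = 10           # ignore terms that appear in fewer than 10 documents
max_df = 1.0          # proportion of documents; 1.0 applies no upper cutoff
max_features = 22     # keep only the 22 most frequent terms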

We have chosen these values as a first approximation. Due to time constraints, we'll stick with them, although other combinations could be tried to further improve the accuracy of the models.
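If time allowed, one possible way to try other combinations is a cross-validated grid search over the vectorizer parameters. The sketch below is an assumption-laden illustration: the toy texts and labels, the `LogisticRegression` classifier, and the grid values are all hypothetical and are not part of this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data; in the notebook this would be X_train and y_train.
texts = ['cut taxes now', 'lower taxes for all', 'cut taxes and spending',
         'taxes must be lower', 'raise wages now', 'higher wages for all',
         'raise wages and benefits', 'wages must be higher']
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Putting the vectorizer inside the pipeline refits it on each training fold,
# so the validation fold never influences the vocabulary.
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(sublinear_tf=True, norm='l2')),
    ('clf', LogisticRegression(max_iter=1000)),  # placeholder classifier (assumption)
])

param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__min_df': [1, 2],
    'tfidf__max_features': [5, None],
}

search = GridSearchCV(pipe, param_grid, cv=2, scoring='accuracy')
search.fit(texts, labels)
print(search.best_params_)
print(search.best_score_)
```

On a dataset this small the selected parameters are not meaningful; the point is only the shape of the search.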

In [17]:
tfidf = TfidfVectorizer(encoding='utf-8',
                        ngram_range=ngram_range,
                        stop_words=None,
                        lowercase=False,
                        max_df=max_df,
                        min_df=min_df,
                        max_features=max_features,
                        norm='l2',
                        sublinear_tf=True)
                        
features_train = tfidf.fit_transform(X_train).toarray()
labels_train = y_train
print(features_train.shape)

features_test = tfidf.transform(X_test).toarray()
labels_test = y_test
print(features_test.shape)

(15352, 22)
(6580, 22)


Please note that we have fitted and then transformed the training set, but we have **only transformed** the **test set**.
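The reason for this is that the vocabulary and the IDF weights must be learned from the training data alone. A minimal sketch on hypothetical toy documents shows the consequence: a word that appears only in the test data is simply ignored, and the test matrix has the same columns as the training matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents.
train_docs = ['the economy is growing', 'the economy needs jobs']
test_docs = ['the climate is changing']

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = vec.transform(test_docs)        # reuses them, learns nothing new

print(train_matrix.shape, test_matrix.shape)
print('climate' in vec.vocabulary_)
```

Calling `fit_transform` on the test set instead would build a different vocabulary, and the resulting columns would no longer line up with the features the model was trained on.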

Let's save the files we'll need in the next steps:

In [20]:
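import os

# Make sure the output folder exists before writing the pickles
os.makedirs('Pickles', exist_ok=True)

# X_train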
with open('Pickles/X_train.pickle', 'wb') as output:
    pickle.dump(X_train, output)
    
# X_test    
with open('Pickles/X_test.pickle', 'wb') as output:
    pickle.dump(X_test, output)
    
# y_train
with open('Pickles/y_train.pickle', 'wb') as output:
    pickle.dump(y_train, output)
    
# y_test
with open('Pickles/y_test.pickle', 'wb') as output:
    pickle.dump(y_test, output)
    
# df
with open('Pickles/df.pickle', 'wb') as output:
    pickle.dump(df, output)
    
# features_train
with open('Pickles/features_train.pickle', 'wb') as output:
    pickle.dump(features_train, output)

# labels_train
with open('Pickles/labels_train.pickle', 'wb') as output:
    pickle.dump(labels_train, output)

# features_test
with open('Pickles/features_test.pickle', 'wb') as output:
    pickle.dump(features_test, output)

# labels_test
with open('Pickles/labels_test.pickle', 'wb') as output:
    pickle.dump(labels_test, output)
    
# TF-IDF object
with open('Pickles/tfidf.pickle', 'wb') as output:
    pickle.dump(tfidf, output)
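In the next steps, these files can be read back with `pickle.load`, opening each file in binary read mode. The sketch below is a self-contained round trip that uses a temporary folder and a hypothetical small array instead of the real `Pickles/` files.

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical stand-in for features_train.
features = np.array([[0.0, 1.0], [0.6, 0.8]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'features_train.pickle')

    # Write in binary mode ('wb'), as in the cell above.
    with open(path, 'wb') as output:
        pickle.dump(features, output)

    # Read back in binary mode ('rb').
    with open(path, 'rb') as data:
        restored = pickle.load(data)

print(np.array_equal(features, restored))
```

Pickle files should only be loaded from trusted sources, and the pickled `tfidf` object is best reloaded with the same scikit-learn version that created it.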