### Importing data

In [1]:
import pandas as pd
import numpy as np

import data_exploration

import string
import nltk
from nltk import ngrams

# Caching stopwords
from nltk.corpus import stopwords
nltk.download('stopwords', quiet=True)
stop_words = set(stopwords.words('english'))

from nltk.stem.porter import PorterStemmer


In [2]:
df, df_crowdsourced, df_ground_truth = data_exploration.data_loading()

Could we flag each row as ground truth or crowdsourced and give the ground-truth rows more weight in training?
Or penalize errors on the ground-truth rows more heavily?
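
One way to try the weighting idea is the `sample_weight` argument that scikit-learn estimators accept in `fit`. This is only a minimal sketch: the boolean ground-truth flag, the weight of 3.0, and the toy data are all assumptions and do not come from the dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def make_sample_weights(is_ground_truth, ground_truth_weight=3.0):
    """Per-row weights: ground-truth rows get a larger (assumed) weight."""
    is_ground_truth = np.asarray(is_ground_truth, dtype=bool)
    return np.where(is_ground_truth, ground_truth_weight, 1.0)

# Toy stand-ins for the real features, labels and flag (all hypothetical).
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(60, 4))
y_toy = rng.choice([-1, 0, 1], size=60)
is_gt_toy = rng.random(60) < 0.2

weights = make_sample_weights(is_gt_toy)
clf_weighted = RandomForestClassifier(n_estimators=20, random_state=42)
clf_weighted.fit(X_toy, y_toy, sample_weight=weights)
```

The weight itself would be a hyperparameter to tune, not something to fix up front.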

## Preparing for processing

### Preparing text

In [3]:
df.head()

Unnamed: 0,Sentence_id,Text,Speaker,Speaker_title,Speaker_party,File_id,Length,Line_number,Sentiment,Verdict
0,16,I think we've seen a deterioration of values.,George Bush,Vice President,REPUBLICAN,1988-09-25.txt,8,16,0.0,-1
1,17,I think for a while as a nation we condoned th...,George Bush,Vice President,REPUBLICAN,1988-09-25.txt,16,17,-0.456018,-1
2,18,"For a while, as I recall, it even seems to me ...",George Bush,Vice President,REPUBLICAN,1988-09-25.txt,29,18,-0.805547,-1
3,19,"So we've seen a deterioration in values, and o...",George Bush,Vice President,REPUBLICAN,1988-09-25.txt,35,19,0.698942,-1
4,20,"We got away, we got into this feeling that val...",George Bush,Vice President,REPUBLICAN,1988-09-25.txt,15,20,0.0,-1


#### Doing stemming

Stemming should help because it reduces the variation in how words are written. But we should probably also try without it.

In [4]:
df['Text_stemmed'] = data_exploration.stem(df)

In [5]:
df.Text_stemmed

0       ['I', 'think', "we'v", 'seen', 'a', 'deterior'...
1       ['I', 'think', 'for', 'a', 'while', 'as', 'a',...
2       ['for', 'a', 'while,', 'as', 'I', 'recall,', '...
3       ['So', "we'v", 'seen', 'a', 'deterior', 'in', ...
4       ['We', 'got', 'away,', 'we', 'got', 'into', 't...
                              ...                        
1027    ['He', 'ha', 'promis', 'a', 'trillion', 'dolla...
1028    ['(laughter)', 'I', '--', "there'", 'an', 'old...
1029             ['well,', 'can', 'I', 'answer', 'that?']
1030    ['I', 'look', 'forward', 'to', 'the', 'final',...
1031    ['for', 'those', 'of', 'you', 'for', 'me,', 't...
Name: Text_stemmed, Length: 23533, dtype: object
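
The output above looks like whitespace tokenisation followed by Porter stemming, with punctuation left attached to the tokens. The body of `data_exploration.stem` is not shown here, so the following is a guess at an equivalent, not the actual implementation; `stem_text` is a hypothetical name.

```python
from nltk.stem.porter import PorterStemmer

def stem_text(text, stemmer=None):
    """Split on whitespace and stem each token (punctuation stays attached)."""
    stemmer = stemmer or PorterStemmer()
    return [stemmer.stem(token) for token in text.split()]

tokens = stem_text("running dogs")
```

Stripping punctuation before stemming would probably merge tokens such as `'while,'` and `'while'`, which may be worth trying as a separate variant.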

## Split into test and train

According to the task description, we should split the dataset into train and test based on the year of the debate. All debates up to and including 2008 go into train, and more recent debates go into test. (I would also consider holding out a validation set later on, so that we have a final validation step.)

In [6]:
df_train, df_test = data_exploration.test_train_split(df)
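
A minimal sketch of such a year-based split, assuming the year is the first four characters of `File_id` (as in `1988-09-25.txt`). `split_by_year` is a hypothetical helper and may differ from what `data_exploration.test_train_split` does; the toy rows are made up.

```python
import pandas as pd

def split_by_year(frame, cutoff_year=2008, file_col="File_id"):
    """Debates up to and including cutoff_year go to train, later ones to test."""
    years = frame[file_col].str.slice(0, 4).astype(int)
    train = frame[years <= cutoff_year]
    test = frame[years > cutoff_year]
    return train, test

# Toy rows (the file names are illustrative only).
toy = pd.DataFrame({
    "File_id": ["1988-09-25.txt", "2008-09-26.txt", "2012-10-03.txt"],
    "Verdict": [-1, 1, 0],
})
toy_train, toy_test = split_by_year(toy)
```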

### Create the TF-IDF matrix for the text column

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [9]:
train_tfid, test_tfid = data_exploration.tfid(train=df_train.Text, test=df_test.Text)


In [10]:
train_tfid

<18170x10641 sparse matrix of type '<class 'numpy.float64'>'
	with 277846 stored elements in Compressed Sparse Row format>

In [11]:
test_tfid

<5363x10641 sparse matrix of type '<class 'numpy.float64'>'
	with 70684 stored elements in Compressed Sparse Row format>

Looks good. I first fitted the vectorizer on the train set (so only words that occur in the train set are counted) and then transformed the test set with the same vectorizer. Both matrices have the same number of columns (10641), which indicates this was done correctly. They are kept sparse to save memory.
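
The fit-on-train, transform-on-test pattern described above can be sketched as follows. `tfidf_train_test` is a hypothetical stand-in for `data_exploration.tfid`, whose body is not shown, and the default `TfidfVectorizer` settings are an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_train_test(train_texts, test_texts):
    """Learn the vocabulary and IDF weights on train only, then reuse them for test."""
    vectorizer = TfidfVectorizer()
    train_matrix = vectorizer.fit_transform(train_texts)
    test_matrix = vectorizer.transform(test_texts)  # unseen words are dropped
    return train_matrix, test_matrix, vectorizer

toy_train_texts = ["taxes went up", "jobs went down"]
toy_test_texts = ["taxes and unemployment went up"]
train_m, test_m, vec = tfidf_train_test(toy_train_texts, toy_test_texts)
```

Words that appear only in the test texts (here `unemployment`) get no column, which is why the two matrices end up with the same width.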

## Predict using standard models

In [44]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

In [82]:
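def predict_it(X, Y, x, y, method = RandomForestClassifier(n_estimators=20, max_depth=20, random_state = 42, class_weight = 'balanced_subsample')):
    # Fit on the train set (X, Y) and return the weighted F1 as (train score, test score).
    # Note: f1_score is documented as f1_score(y_true, y_pred); the calls here pass the
    # predictions first, which changes the class weights used by average='weighted'.
    classifier = method
    classifier.fit(X,Y)
    return f1_score(classifier.predict(X), Y, average = 'weighted'), f1_score(classifier.predict(x),y, average = 'weighted')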
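
A possible variant of this helper, sketched under two assumptions: that the true labels should go first in `f1_score(y_true, y_pred)`, and that each call should fit a fresh copy of the estimator instead of reusing the single default instance. `fit_and_score` is a hypothetical name, the toy data is random, and scores from this version need not match the values recorded in this notebook.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def fit_and_score(X_train, y_train, X_test, y_test, estimator=None):
    """Fit a fresh copy of the estimator; return (train, test) weighted F1."""
    if estimator is None:
        estimator = RandomForestClassifier(
            n_estimators=20, max_depth=20, random_state=42,
            class_weight="balanced_subsample",
        )
    model = clone(estimator)  # leaves the estimator passed in untouched
    model.fit(X_train, y_train)
    train_f1 = f1_score(y_train, model.predict(X_train), average="weighted")
    test_f1 = f1_score(y_test, model.predict(X_test), average="weighted")
    return train_f1, test_f1

# Random toy data, only to show the call.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(80, 5))
y_toy = rng.choice([-1, 0, 1], size=80)
scores = fit_and_score(X_toy[:60], y_toy[:60], X_toy[60:], y_toy[60:])
```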


In [81]:
clf = RandomForestClassifier(n_estimators=20, max_depth=20, random_state = 42, class_weight = 'balanced_subsample')
clf.fit(train_tfid,df_train.Verdict.values)
f1_score(clf.predict(test_tfid),df_test.Verdict, average = 'weighted')

0.6280225318138658

The weighted F1 score, measured as suggested, looks good, BUT that is largely because one class (-1) is over-represented.

In [78]:
pd.DataFrame(clf.predict(test_tfid))[0].value_counts(normalize = True) * 100

-1    60.190192
 0    20.846541
 1    18.963267
Name: 0, dtype: float64

In [53]:
df_train['Verdict'].value_counts(normalize = True) * 100

-1    66.604293
 1    23.252614
 0    10.143093
Name: Verdict, dtype: float64

In [52]:
df_test['Verdict'].value_counts(normalize = True) * 100

-1    61.793772
 1    26.589595
 0    11.616632
Name: Verdict, dtype: float64

In [83]:
# Random forest on the TF-IDF features
Y = df_train.Verdict# Train
X = train_tfid # Train
y = df_test.Verdict # Test
x = test_tfid # Test

predict_it(X, Y, x, y)

(0.749147541587202, 0.6280225318138658)

In [85]:
df_train[df_train.Sentiment.isna()] # rows with missing Sentiment; replace the NaNs with the mean Sentiment of the -1 class?

Unnamed: 0,Sentence_id,Text,Speaker,Speaker_title,Speaker_party,File_id,Length,Line_number,Sentiment,Verdict,Text_stemmed,Year
2466,3490,I -- I hope so.,13,8,2,2008-09-26.txt,5,97,,-1,"['I', '--', 'I', 'hope', 'so.']",2008
7819,12350,"I apologize, Mr. Vice President.",6,3,2,2000-10-11.txt,5,974,,-1,"['I', 'apologize,', 'mr.', 'vice', 'president.']",2000
17355,25764,My word is my bond.,3,8,2,1996-10-16.txt,5,109,,-1,"['My', 'word', 'is', 'my', 'bond.']",1996


In [37]:
# Random forest on the metadata features (Speaker_party, Length, Sentiment)
Y = df_train.Verdict# Train
X = df_train[['Speaker_party', 'Length', 'Sentiment']] # Train
y = df_test.Verdict # Test
x = df_test[['Speaker_party', 'Length', 'Sentiment']] # Test

predict_it(X, Y, x, y)

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
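
The error comes from the missing `Sentiment` values. A minimal sketch of one fix: fill them with the mean computed on the train set only. Note that this swaps the class-conditional mean considered earlier for the overall train mean, because the verdict is not known for rows we want to predict. `impute_with_train_mean` is a hypothetical helper and the toy frames are made up.

```python
import numpy as np
import pandas as pd

def impute_with_train_mean(train, test, column="Sentiment"):
    """Fill NaNs in both frames with the mean of the train column."""
    fill_value = train[column].mean()  # pandas skips NaN when averaging
    train_filled = train.assign(**{column: train[column].fillna(fill_value)})
    test_filled = test.assign(**{column: test[column].fillna(fill_value)})
    return train_filled, test_filled, fill_value

toy_train = pd.DataFrame({"Sentiment": [0.0, -0.4, np.nan, 0.7]})
toy_test = pd.DataFrame({"Sentiment": [np.nan, 0.2]})
toy_train_filled, toy_test_filled, fill_value = impute_with_train_mean(toy_train, toy_test)
```

Since only a handful of rows are affected, dropping them from the train set would be a reasonable alternative.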

In [72]:
# Doing machine learning with the binary
Y = df.Verdict.iloc[:1144] # Train
X = binary[:1144] # Train
y = df.Verdict.iloc[1144:] # Test
x = binary[1144:] # Test

predict_it(X, Y, x, y)

(0.8036964531134257, 0.7909601468841126)

In [77]:
x.shape

(22389, 1)

In [78]:
X.shape

(1144, 1)

In [42]:
X.shape

(1144, 133240)

In [46]:
df.head(1)

Unnamed: 0,Sentence_id,Text,Speaker,Speaker_title,Speaker_party,File_id,Length,Line_number,Sentiment,Verdict,Text_stemmed,Year
0,16,I think we've seen a deterioration of values.,George Bush,9,2,1988-09-25.txt,8,16,0.0,-1,"['I', 'think', ""we'v"", 'seen', 'a', 'deterior'...",1988


In [31]:
clf = RandomForestClassifier(n_estimators = 100, max_depth = 10, random_state = 42)
clf.fit(df[['Speaker_party', 'Speaker_title', 'Length', 'Sentiment']].iloc[:1144],df.Verdict.iloc[:1144])

RandomForestClassifier(max_depth=10, random_state=42)

In [32]:
f1_score(df.Verdict.iloc[:1144], clf.predict(df[['Speaker_party', 'Speaker_title', 'Length', 'Sentiment']].iloc[:1144]), average='weighted')

0.8692397892258337

In [33]:
f1_score(df.Verdict.iloc[1144:], clf.predict(df[['Speaker_party', 'Speaker_title', 'Length', 'Sentiment']].iloc[1144:]), average='weighted')

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

Trying SVM/SVC 

In [34]:
from sklearn.svm import SVC

In [35]:
svc = SVC(random_state = 42)

In [36]:
svc.fit(X,Y)

SVC(random_state=42)

In [37]:
f1_score(Y, svc.predict(X), average='weighted') # super overfitting

0.9696129025218587

In [38]:
f1_score(y, svc.predict(x), average='weighted')

0.5603502910080052

Trying logistic regression

In [63]:
from sklearn.linear_model import LogisticRegression

In [64]:
lr = LogisticRegression()
lr.fit(X,Y)

LogisticRegression()

In [65]:
f1_score(Y, lr.predict(X), average='weighted') # wow this overfitted like crazy

0.9499773585157968

In [66]:
f1_score(y, lr.predict(x), average='weighted')

0.5174502332132564
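
The gap between train and test scores suggests trying stronger regularization. In scikit-learn's `LogisticRegression`, a smaller `C` means a stronger penalty. The sketch below sweeps a few arbitrary `C` values on random toy data; whether this helps on the real features is untested, and the sweep should be scored on a validation split carved out of the train set, not on the test set. `sweep_C` is a hypothetical helper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def sweep_C(X_train, y_train, X_val, y_val, C_values=(0.01, 0.1, 1.0)):
    """Return {C: (train weighted F1, validation weighted F1)}."""
    results = {}
    for C in C_values:
        model = LogisticRegression(C=C, max_iter=1000)
        model.fit(X_train, y_train)
        results[C] = (
            f1_score(y_train, model.predict(X_train), average="weighted"),
            f1_score(y_val, model.predict(X_val), average="weighted"),
        )
    return results

# Random toy data, only to show the call.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 6))
y_toy = rng.choice([-1, 0, 1], size=100)
results = sweep_C(X_toy[:70], y_toy[:70], X_toy[70:], y_toy[70:])
```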