# Baseline model

In [23]:
import pandas as pd
import numpy as np

import os
import sys
sys.path.append(os.path.abspath('../src'))

import string
import nltk
from nltk import ngrams

# Caching stopwords
from nltk.corpus import stopwords
nltk.download('stopwords', quiet=True)
stop_words = set(stopwords.words('english'))

from nltk.stem.porter import PorterStemmer

from fact_classification import *

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import precision_recall_fscore_support


## Importing data

In [24]:
df, df_crowdsourced, df_ground_truth = data_loading()

## Preparing for processing

### Preparing text

In [25]:
df.head()

Unnamed: 0,Sentence_id,Text,Speaker,Speaker_title,Speaker_party,File_id,Length,Line_number,Sentiment,Verdict
0,16,I think we've seen a deterioration of values.,George Bush,Vice President,REPUBLICAN,1988-09-25.txt,8,16,0.0,-1
1,17,I think for a while as a nation we condoned th...,George Bush,Vice President,REPUBLICAN,1988-09-25.txt,16,17,-0.456018,-1
2,18,"For a while, as I recall, it even seems to me ...",George Bush,Vice President,REPUBLICAN,1988-09-25.txt,29,18,-0.805547,-1
3,19,"So we've seen a deterioration in values, and o...",George Bush,Vice President,REPUBLICAN,1988-09-25.txt,35,19,0.698942,-1
4,20,"We got away, we got into this feeling that val...",George Bush,Vice President,REPUBLICAN,1988-09-25.txt,15,20,0.0,-1


#### Doing stemming

Stemming should help because it removes some of the variability in how words are written, mapping inflected forms to a common stem. We should probably also try the model without it.

In [26]:
df['Text_stemmed'] = stem(df)

In [27]:
df.Text_stemmed

0       ['i', 'think', "we'v", 'seen', 'a', 'deterior'...
1       ['i', 'think', 'for', 'a', 'while', 'as', 'a',...
2       ['for', 'a', 'while,', 'as', 'i', 'recall,', '...
3       ['so', "we'v", 'seen', 'a', 'deterior', 'in', ...
4       ['we', 'got', 'away,', 'we', 'got', 'into', 't...
                              ...                        
1027    ['he', 'ha', 'promis', 'a', 'trillion', 'dolla...
1028    ['(laughter)', 'i', '--', "there'", 'an', 'old...
1029             ['well,', 'can', 'i', 'answer', 'that?']
1030    ['i', 'look', 'forward', 'to', 'the', 'final',...
1031    ['for', 'those', 'of', 'you', 'for', 'me,', 't...
Name: Text_stemmed, Length: 23533, dtype: object
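The implementation of stem lives in fact_classification and is not shown here. Judging only from the output above (lowercased tokens, punctuation still attached, Porter-style stems such as "we'v" and 'deterior'), a minimal sketch of a comparable step could look like the following. The function name stem_sentence is hypothetical, and the whitespace tokenization is an assumption.

```python
from nltk.stem.porter import PorterStemmer


def stem_sentence(sentence, stemmer=None):
    """Lowercase a sentence, split it on whitespace, and Porter-stem each token.

    Hypothetical helper: the real stem() in fact_classification may differ.
    """
    stemmer = stemmer or PorterStemmer()
    # Splitting on whitespace keeps punctuation attached to the tokens,
    # which is consistent with entries such as 'while,' in the output above.
    return [stemmer.stem(token) for token in sentence.lower().split()]


tokens = stem_sentence("I think we've seen a deterioration of values.")
print(tokens)
```

Applied to a whole column, this would be something like df.Text.apply(stem_sentence).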

## Split into test and train

According to the task description, we should split the dataset into train and test based on the year of the debate. All debates up to and including 2008 go into train, and more recent debates go into test. (I would also consider setting aside a validation set when we get closer to the end, to have a final validation.)

In [28]:
df_train, df_test = test_train_split(df)
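The code of test_train_split is not shown in this notebook. As a rough sketch of the rule described above, and assuming the year can be read from the first four characters of File_id (as in '1988-09-25.txt'), the split could be written like this. The name split_by_year and its parameters are hypothetical.

```python
import pandas as pd


def split_by_year(df, last_train_year=2008, file_col="File_id"):
    """Split rows into train (year <= last_train_year) and test (later years).

    Assumes file_col starts with a four-digit year, e.g. '1988-09-25.txt'.
    """
    years = df[file_col].str.slice(0, 4).astype(int)
    train = df[years <= last_train_year]
    test = df[years > last_train_year]
    return train, test


# Tiny made-up frame, only to illustrate the behaviour.
demo = pd.DataFrame(
    {
        "File_id": ["1988-09-25.txt", "2008-10-15.txt", "2012-10-03.txt"],
        "Text": ["a", "b", "c"],
    }
)
demo_train, demo_test = split_by_year(demo)
print(len(demo_train), len(demo_test))
```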

### Create the TF-IDF matrix for the text column

In [29]:
train_tfid, test_tfid = tfid(train=df_train.Text, test=df_test.Text)


In [30]:
train_tfid

<18170x10641 sparse matrix of type '<class 'numpy.float64'>'
	with 277846 stored elements in Compressed Sparse Row format>

In [31]:
test_tfid

<5363x10641 sparse matrix of type '<class 'numpy.float64'>'
	with 70684 stored elements in Compressed Sparse Row format>

This looks good. The vectorizer was first fitted on the train set (so only words that occur in the train set are counted), and the test set was then transformed with the same vectorizer. Both matrices have the same number of columns (10641), which indicates this was done correctly. They are kept sparse to save memory.
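The fit-on-train, transform-on-test pattern can be sketched with scikit-learn's TfidfVectorizer. This is only an illustration under default vectorizer settings; the actual tfid helper in fact_classification may use other options, and the name tfidf_train_test and the example sentences are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def tfidf_train_test(train_texts, test_texts):
    """Fit a TF-IDF vocabulary on the train texts only, then reuse it for test."""
    vectorizer = TfidfVectorizer()  # default settings, an assumption
    train_matrix = vectorizer.fit_transform(train_texts)  # learns the vocabulary
    test_matrix = vectorizer.transform(test_texts)  # unseen words are ignored
    return train_matrix, test_matrix, vectorizer


demo_train_texts = ["taxes went up", "jobs went down"]
demo_test_texts = ["taxes and jobs and deficits"]
train_m, test_m, vec = tfidf_train_test(demo_train_texts, demo_test_texts)

# Both matrices share the column count of the train vocabulary.
print(train_m.shape, test_m.shape)
```

Words that appear only in the test texts (here "deficits") get no column, which is why the two matrices end up with identical widths.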

## Predict using standard models

The base model:

RandomForestClassifier(
    n_estimators=20,
    max_depth=20,
    random_state=42,
    class_weight="balanced_subsample",
)

In [32]:
pred_train, pred_test = predict_it(train_tfid, df_train.Verdict, test_tfid) # predicting with basemodel
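predict_it is defined in fact_classification and is not reproduced here. Assuming it fits the random forest configured above and predicts on both sets, a minimal stand-in could look like this. The name fit_and_predict and the random demo data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def fit_and_predict(X_train, y_train, X_test):
    """Fit the baseline random forest and return (train, test) predictions."""
    model = RandomForestClassifier(
        n_estimators=20,
        max_depth=20,
        random_state=42,
        class_weight="balanced_subsample",  # reweights classes per bootstrap sample
    )
    model.fit(X_train, y_train)
    return model.predict(X_train), model.predict(X_test)


# Small synthetic data with the three verdict labels, only to show the call.
rng = np.random.default_rng(0)
X_tr = rng.random((30, 5))
y_tr = np.tile([-1, 0, 1], 10)
X_te = rng.random((6, 5))
demo_pred_train, demo_pred_test = fit_and_predict(X_tr, y_tr, X_te)
print(demo_pred_train.shape, demo_pred_test.shape)
```

The same call works unchanged on the sparse TF-IDF matrices, since RandomForestClassifier accepts sparse input.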

## Scoring

The score of our base model is OK. Because our data is so unbalanced, it is important to also look at the scores for the individual classes.

In [33]:
df_score = score_it(df_test.Verdict, pred_test, df_train.Verdict, pred_train)
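The internals of score_it are not shown. A sketch of how per-class and weighted scores of this kind can be computed with precision_recall_fscore_support follows. The mapping of labels -1, 0, 1 to NFS, UFS, CFS is an assumption, the function name per_class_scores is hypothetical, and zero_division=0 is chosen here so that classes that are never predicted score 0.

```python
from sklearn.metrics import precision_recall_fscore_support

LABELS = [-1, 0, 1]  # assumed label order
NAMES = ["NFS", "UFS", "CFS"]  # assumed class names for those labels


def per_class_scores(y_true, y_pred):
    """Return precision, recall and F-score per class plus weighted averages."""
    p, r, f, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=LABELS, zero_division=0
    )
    p_w, r_w, f_w, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=LABELS, average="weighted", zero_division=0
    )
    scores = {}
    for prefix, values, weighted in (("p", p, p_w), ("r", r, r_w), ("f", f, f_w)):
        for name, value in zip(NAMES, values):
            scores[f"{prefix}_{name}"] = float(value)
        scores[f"{prefix}_wavg"] = float(weighted)
    return scores


demo_scores = per_class_scores([-1, -1, 0, 1], [-1, -1, -1, 1])
print(demo_scores)
```

Note that the support-weighted recall equals plain accuracy, so r_wavg can be read as the share of correctly classified sentences.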

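A very simple RandomForestClassifier based only on the stemmed and vectorized text gives an accuracy of about 67 percent on the test set (the r_wavg column in the score table further down, since weighted recall equals accuracy). This will be our baseline model.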

## Why you should check more than one metric and look at your data

Here we make a model that gets a seemingly reasonable score but is completely useless. If we predict -1 for every sentence, we get a weighted-average F-score of about 0.47, so nearly 50%.

In [34]:
pred_test = stupid_model(df_test.shape[0])

In [35]:
pd.concat([df_score, score_it(df_test.Verdict, pred_test, df_train.Verdict, pred_train, algorithm = 'stupid_model', features = 'none')])



Unnamed: 0,alogrithm,features,p_NFS,p_UFS,p_CFS,p_wavg,r_NFS,r_UFS,r_CFS,r_wavg,f_NFS,f_UFS,f_CFS,f_wavg
0,RandomForrestClassifier,tfid,0.787743,0.287819,0.644,0.691448,0.795112,0.470305,0.451613,0.666045,0.79141,0.357099,0.530915,0.671693
0,stupid_model,none,0.617938,0.0,0.0,0.381847,1.0,0.0,0.0,0.617938,0.763858,0.0,0.0,0.472017
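The stupid_model row can be reproduced by hand, which shows where the 0.47 comes from. The sketch below assumes the model simply returns -1 for every row (the helper name constant_prediction is hypothetical) and takes the NFS share of 0.617938 from the table above.

```python
import numpy as np


def constant_prediction(n_rows, label=-1):
    """Predict the same label for every row (assumed behaviour of stupid_model)."""
    return np.full(n_rows, label)


demo_pred = constant_prediction(5363)  # number of test rows reported above

# Share of NFS sentences in the test set, taken from p_NFS in the table.
nfs_share = 0.617938

# Predicting only NFS: precision equals the NFS share, recall is 1.
precision_nfs = nfs_share
recall_nfs = 1.0
f_nfs = 2 * precision_nfs * recall_nfs / (precision_nfs + recall_nfs)

# The other two classes score 0, so the weighted F-score is the NFS part only.
f_wavg = nfs_share * f_nfs

print(round(f_nfs, 6), round(f_wavg, 6))
```

The result matches f_NFS and f_wavg in the table: the weighted average hides the fact that UFS and CFS are never predicted at all.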
