# Baseline model

In [1]:
import pandas as pd
import numpy as np

import os
import sys
sys.path.append(os.path.abspath('../src'))

import string
import nltk
from nltk import ngrams

# Caching stopwords
from nltk.corpus import stopwords
nltk.download('stopwords', quiet=True)
stop_words = set(stopwords.words('english'))

from nltk.stem.porter import PorterStemmer

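# Import only the project helpers used in this notebook
from fact_classification import (
    data_loading,
    test_train_split,
    tfid,
    predict_it,
    score_it,
    stupid_model,
)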

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import precision_recall_fscore_support


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\signe\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## Importing data

In [2]:
df, df_crowdsourced, df_ground_truth = data_loading()

In [3]:
df_score_test = pd.read_csv('score_test.csv')
df_score_train = pd.read_csv('score_train.csv')

## Preparing for processing

### Preparing text

In [4]:
df.head()

Unnamed: 0,Sentence_id,Text,Speaker,Speaker_title,Speaker_party,File_id,Length,Line_number,Sentiment,Verdict
0,16,I think we've seen a deterioration of values.,George Bush,Vice President,REPUBLICAN,1988-09-25.txt,8,16,0.0,-1
1,17,I think for a while as a nation we condoned th...,George Bush,Vice President,REPUBLICAN,1988-09-25.txt,16,17,-0.456018,-1
2,18,"For a while, as I recall, it even seems to me ...",George Bush,Vice President,REPUBLICAN,1988-09-25.txt,29,18,-0.805547,-1
3,19,"So we've seen a deterioration in values, and o...",George Bush,Vice President,REPUBLICAN,1988-09-25.txt,35,19,0.698942,-1
4,20,"We got away, we got into this feeling that val...",George Bush,Vice President,REPUBLICAN,1988-09-25.txt,15,20,0.0,-1


## Split into test and train

According to the task description, we should split the dataset into train and test sets based on the year of the debate. All debates up to and including 2008 go into train, and more recent debates go into test. (I would also consider setting aside a validation set when we get closer to the end, to have a final validation.)

In [5]:
df_train, df_test = test_train_split(df)
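
The `test_train_split` helper lives in `fact_classification` and is not shown in this notebook. As a rough sketch of what a year-based split could look like, the snippet below assumes the year can be read from the first four characters of `File_id` (as in `1988-09-25.txt`). The function name `split_by_year`, its parameters, and the toy rows are made up for illustration and are not the project's actual implementation.

```python
import pandas as pd


def split_by_year(df, last_train_year=2008, file_col="File_id"):
    """Hypothetical helper: train = debates up to last_train_year, test = later ones.

    Assumes file names start with a four-digit year, e.g. '1988-09-25.txt'.
    """
    year = df[file_col].str.slice(0, 4).astype(int)
    train_mask = year <= last_train_year
    return df[train_mask].copy(), df[~train_mask].copy()


# Toy rows for illustration only (not taken from the real dataset)
toy = pd.DataFrame(
    {
        "File_id": ["1988-09-25.txt", "2008-10-15.txt", "2012-10-03.txt"],
        "Text": ["a", "b", "c"],
    }
)
toy_train, toy_test = split_by_year(toy)
print(len(toy_train), len(toy_test))
```

Splitting on whole debates rather than on random rows keeps sentences from the same debate out of both sets at once.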

### Create the TF-IDF matrix for the text column

In [6]:
train_tfid, test_tfid, vocab = tfid(train=df_train.Text, test=df_test.Text, n_gram_range=1)

In [7]:
train_tfid

<18170x10641 sparse matrix of type '<class 'numpy.float64'>'
	with 277846 stored elements in Compressed Sparse Row format>

In [8]:
test_tfid

<5363x10641 sparse matrix of type '<class 'numpy.float64'>'
	with 70684 stored elements in Compressed Sparse Row format>

Looks good. I first fitted the vectorizer to the train set (so only words in the train set are counted) and then transformed the test set using the same vectorizer. Both matrices have the same number of columns (10641), which indicates that this was done correctly. They are kept sparse to save memory.
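
The fit-on-train, transform-on-test pattern described above can be sketched directly with scikit-learn's `TfidfVectorizer`. This is a minimal illustration on toy sentences, not the body of the `tfid` helper. In particular, mapping `n_gram_range = 1` to `ngram_range=(1, 1)` is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy sentences for illustration only
train_texts = ["taxes went up last year", "we value our families"]
test_texts = ["taxes went down", "unseen words are dropped"]

# Assumption: unigrams only, all other settings left at their defaults
vectorizer = TfidfVectorizer(ngram_range=(1, 1))

# fit_transform learns the vocabulary and IDF weights from the train set only
train_matrix = vectorizer.fit_transform(train_texts)

# transform reuses that vocabulary, so words unseen in training are ignored
test_matrix = vectorizer.transform(test_texts)

print(train_matrix.shape, test_matrix.shape)
```

Both results are SciPy sparse matrices with one column per vocabulary term, which is why the train and test matrices end up with the same width.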

## Predict using standard models

The base model is a `RandomForestClassifier` with the following settings:

    RandomForestClassifier(
        max_depth=20,
        random_state=42,
        class_weight="balanced_subsample",
    )

In [9]:
pred_train, pred_test = predict_it(train_tfid, df_train.Verdict, test_tfid) # predicting with basemodel
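
`predict_it` is defined in `fact_classification` and not shown here. A minimal sketch of a fit-and-predict step with the random forest settings listed in this section might look like the following. The function name `fit_and_predict` and the random toy data are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def fit_and_predict(X_train, y_train, X_test):
    """Hypothetical stand-in: fit the base model, predict on train and test."""
    model = RandomForestClassifier(
        max_depth=20,
        random_state=42,
        class_weight="balanced_subsample",  # reweights classes within each bootstrap sample
    )
    model.fit(X_train, y_train)
    return model.predict(X_train), model.predict(X_test)


# Random toy data for illustration only
rng = np.random.default_rng(0)
toy_X_train = rng.random((60, 5))
toy_y_train = rng.choice([-1, 0, 1], size=60, p=[0.6, 0.1, 0.3])
toy_X_test = rng.random((10, 5))

toy_pred_train, toy_pred_test = fit_and_predict(toy_X_train, toy_y_train, toy_X_test)
print(toy_pred_train.shape, toy_pred_test.shape)
```

`RandomForestClassifier` also accepts SciPy sparse matrices, so the TF-IDF output can be passed in without converting it to a dense array.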

## Scoring

The score of our base model is OK, but because our data is so unbalanced it is important to also consider the scores for the individual classes.

In [10]:
df_score_test = score_it(df_test.Verdict, pred_test)
df_score_train = score_it(df_train.Verdict, pred_train)
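
`score_it` is another project helper whose body is not shown. As a sketch of how per-class and weighted-average precision, recall, and F-score could be collected into one row, the snippet below uses scikit-learn's `precision_recall_fscore_support`. The name `score_sketch`, the label coding (-1 = NFS, 0 = UFS, 1 = CFS), and the toy labels are assumptions. The key names follow the column names in the score tables.

```python
from sklearn.metrics import precision_recall_fscore_support

LABELS = [-1, 0, 1]            # assumed coding of the Verdict column
NAMES = ["NFS", "UFS", "CFS"]  # assumed class names, in the same order


def score_sketch(y_true, y_pred):
    """Hypothetical scorer returning one dict of rounded scores."""
    # Per-class scores (arrays with one entry per label)
    p, r, f, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=LABELS, zero_division=0
    )
    # Support-weighted averages (scalars)
    pw, rw, fw, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=LABELS, average="weighted", zero_division=0
    )
    row = {}
    for prefix, per_class, wavg in (("p", p, pw), ("r", r, rw), ("f", f, fw)):
        for name, value in zip(NAMES, per_class):
            row[f"{prefix}_{name}"] = round(float(value), 2)
        row[f"{prefix}_wavg"] = round(float(wavg), 2)
    return row


# Toy labels for illustration only
toy_true = [-1, -1, -1, 0, 1, 1]
toy_pred = [-1, -1, -1, -1, 1, -1]
row = score_sketch(toy_true, toy_pred)
print(row)
```

Passing `zero_division=0` reports a score of 0 for a class that is never predicted instead of emitting a warning.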

In [11]:
df_score_train

Unnamed: 0,alogrithm,features,p_NFS,p_UFS,p_CFS,p_wavg,r_NFS,r_UFS,r_CFS,r_wavg,f_NFS,f_UFS,f_CFS,f_wavg
0,RandomForrest,tfid,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [12]:
df_score_test

Unnamed: 0,alogrithm,features,p_NFS,p_UFS,p_CFS,p_wavg,r_NFS,r_UFS,r_CFS,r_wavg,f_NFS,f_UFS,f_CFS,f_wavg
0,RandomForrest,tfid,0.67,0.6,0.81,0.7,0.99,0.06,0.23,0.68,0.8,0.11,0.35,0.6


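A very simple random forest classifier trained only on the TF-IDF-vectorized text reaches a weighted F1 score of about 0.60 on the test set (weighted recall, which equals accuracy, is 0.68). It is clearly overfitted: every score on the train set is 1.0, while test recall for UFS and CFS is only 0.06 and 0.23. It serves as a minimum-effort baseline model.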

## Proving the point of checking more than one accuracy measure and your data

Here we make a model that gets a seemingly decent score but is completely useless. If we predict -1 (NFS) for every sentence, we get a weighted-average F-score of nearly 50%...

In [13]:
pred_test = stupid_model(df_test.shape[0])
pred_train = stupid_model(df_train.shape[0])
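
To see why a constant prediction can still score reasonably on a weighted average, here is a small worked sketch. The function `constant_model` is a hypothetical stand-in for `stupid_model` (whose body is not shown), and the 6/1/3 class distribution is a toy example, not the real data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score


def constant_model(n_rows, label=-1):
    """Hypothetical baseline: predict the same label for every row."""
    return np.full(n_rows, label)


# Toy labels: 6 x NFS (-1), 1 x UFS (0), 3 x CFS (1)
y_true = np.array([-1] * 6 + [0] * 1 + [1] * 3)
y_pred = constant_model(len(y_true))

# NFS: precision 0.6, recall 1.0, F1 0.75; the other two classes score 0.
# Weighted F1 = 0.6 * 0.75 = 0.45
weighted_f = f1_score(y_true, y_pred, average="weighted", zero_division=0)
acc = accuracy_score(y_true, y_pred)
print(round(weighted_f, 2), round(acc, 2))  # 0.45 0.6
```

The larger the majority class, the better this kind of baseline looks, which is why the per-class columns matter.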

In [14]:
df_score_test = pd.concat([df_score_test, score_it(df_test.Verdict, pred_test, algorithm = 'stupid_model', features = 'none')])
df_score_train = pd.concat([df_score_train, score_it(df_train.Verdict, pred_train, algorithm = 'stupid_model', features = 'none')])



In [15]:
df_score_test

Unnamed: 0,alogrithm,features,p_NFS,p_UFS,p_CFS,p_wavg,r_NFS,r_UFS,r_CFS,r_wavg,f_NFS,f_UFS,f_CFS,f_wavg
0,RandomForrest,tfid,0.67,0.6,0.81,0.7,0.99,0.06,0.23,0.68,0.8,0.11,0.35,0.6
0,stupid_model,none,0.62,0.0,0.0,0.38,1.0,0.0,0.0,0.62,0.76,0.0,0.0,0.47


In [16]:
df_score_train

Unnamed: 0,alogrithm,features,p_NFS,p_UFS,p_CFS,p_wavg,r_NFS,r_UFS,r_CFS,r_wavg,f_NFS,f_UFS,f_CFS,f_wavg
0,RandomForrest,tfid,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
0,stupid_model,none,0.67,0.0,0.0,0.44,1.0,0.0,0.0,0.67,0.8,0.0,0.0,0.53


In [17]:
df_score_test.to_csv(r'score_test.csv')
df_score_train.to_csv(r'score_train.csv')