# Text classification 

This notebook implements a text classification model that predicts whether a comment is pro- or anti-Brexit. 9000 YouTube/Twitter comments have been gathered and annotated manually. Later we will see which words carry the heaviest weights; for example, "#voteleave" is used more often among pro-Brexit commenters.




## Pre-Processing

Pre-processing consists of:
- **Tokenization**
 - Extract useful items (words) from each Brexit comment. More specifically, *unigram* (*1-gram*) tokenization is used, where each word is treated as one item.
- **Stop word removal**
 - Remove words (features) that would most likely make the ML model perform worse. These words are called *stop words* (e.g. "the", "they", "shouldn't", ".").
- **Feature representation (bag of words)**
 - Represent each feature (word) as the number of times the word occurs in each Brexit comment.
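
The three steps can be illustrated with a small, library-free sketch. It is not the pipeline used in this notebook: the regular-expression tokenizer and the tiny `STOP_WORDS` set are assumptions made for illustration, standing in for NLTK's `TweetTokenizer` and English stop word list.

```python
import re
from collections import Counter

# Hypothetical mini stop-word list, for illustration only
# (the notebook uses NLTK's English list instead).
STOP_WORDS = {"the", "they", "is", "a", "to", "."}


def tokenize(comment):
    """Lowercase the comment and split it into unigram tokens.

    Hashtags such as '#voteleave' are kept as one token; this is a
    crude stand-in for a real tweet tokenizer.
    """
    return re.findall(r"#?\w+|[^\w\s]", comment.lower())


def bag_of_words(comment):
    """Count how often each non-stop-word token occurs in the comment."""
    tokens = [tok for tok in tokenize(comment) if tok not in STOP_WORDS]
    return Counter(tokens)


example = "They say the vote is the vote. #VoteLeave"
counts = bag_of_words(example)
print(counts)
```

Each comment becomes a mapping from word to count; stacking these mappings over a shared vocabulary gives the document-term matrix that a classifier can consume.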


The following code section first loads the data set and splits it into a training set and a test set. The training set is used to find the best-suited hyperparameter values for the different estimators (**Model selection**), while the test set is used for a final evaluation of the models (**Model evaluation**).





In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# The file is tab-separated, so the sep argument is set to '\t'
with open("Data/a2_train.tsv", 'r',encoding='utf-8') as myfile:
    df = pd.read_csv(myfile,sep='\t',header=None)

df.rename(columns={0: 'label', 1: 'Comment'},inplace=True)
print(df.head(5))

X = df['Comment'].tolist()
y = df['label'].tolist()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)



   label                                            Comment
0      1                                   Brexit = Freedom
1      0  I voted brexit, because of the lies about fund...
2      0  Old people voted to leave. Young people voted ...
3      0  On the contrary, poll after poll shows the peo...
4      0  I'd rather see it not happening, but it will b...


### Tokenization


In [2]:
import nltk
from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)

### Stop word removal

In [3]:
from nltk.corpus import stopwords
nltk.download('stopwords')

stopWords = stopwords.words('english')
#stopWords.append("!")


[nltk_data] Downloading package stopwords to /home/chris/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Feature representation


In [4]:
#count instance will be used in the pipeline
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer(ngram_range=(1, 1), lowercase=True, tokenizer=tknzr.tokenize, stop_words=stopWords)

#tfidf instance will be used in the pipeline
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(use_idf=True, norm='l2',smooth_idf=True)
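
The `TfidfTransformer` settings above (`use_idf=True`, `smooth_idf=True`, `norm='l2'`) reweight the raw counts so that words occurring in many comments count for less. The following is a minimal pure-Python sketch of that computation, assuming the smoothed formula from the scikit-learn documentation, idf(t) = ln((1 + n) / (1 + df(t))) + 1; the function name `tfidf_rows` and the toy counts are made up for illustration.

```python
import math


def tfidf_rows(count_rows):
    """Turn per-document term counts into L2-normalised tf-idf weights."""
    n_docs = len(count_rows)

    # Document frequency: number of documents that contain each term.
    doc_freq = {}
    for row in count_rows:
        for term in row:
            doc_freq[term] = doc_freq.get(term, 0) + 1

    # Smoothed inverse document frequency (smooth_idf=True).
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in doc_freq.items()}

    weighted = []
    for row in count_rows:
        raw = {t: c * idf[t] for t, c in row.items()}
        # L2 norm of the row; fall back to 1.0 for an empty document.
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        weighted.append({t: v / norm for t, v in raw.items()})
    return weighted


# Toy bag-of-words counts for two comments (hypothetical data).
toy_counts = [
    {"brexit": 1, "freedom": 1},
    {"brexit": 2, "lies": 1},
]
rows = tfidf_rows(toy_counts)
print(rows)
```

In the first toy comment, "freedom" ends up with a larger weight than "brexit" even though both occur once, because "brexit" appears in every document.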

## Model selection

Hyperparameter optimization is done using grid search combined with 10-fold cross-validation. Standard accuracy is the performance metric to optimize. This is justified because the classes are balanced; otherwise, *balanced accuracy* would be a more appropriate metric.
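
To see why the choice of metric depends on class balance, consider a small worked sketch. The labels below are synthetic and deliberately imbalanced (90 vs. 10), unlike the data in this notebook, and the helper functions are illustrative re-implementations rather than the scikit-learn ones.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose prediction matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so each class counts equally."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        correct = sum(y_pred[i] == label for i in idx)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Synthetic, imbalanced labels: 90 samples of class 0, 10 of class 1.
y_true = [0] * 90 + [1] * 10
# A classifier that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print(acc, bal_acc)
```

On imbalanced data the majority-class predictor looks strong under plain accuracy but scores no better than chance under balanced accuracy. With roughly balanced classes, as here, the two metrics are close, so plain accuracy is adequate.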

In [5]:
#Import the pipeline, and all of the learning algorithms
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC
from sklearn.dummy import DummyClassifier

#Use pickle to save the model
import pickle

pipeline_perceptron = Pipeline([
    ('vect', count),
    ('tfidf', tfidf),
    ('perceptron', Perceptron()),
])


pipeline_SVC = Pipeline([
    ('vect', count),
    ('tfidf', tfidf),
    ('SVC', SVC()),
])

pipeline_DummyClassifier = Pipeline([
    ('vect', count),
    ('tfidf', tfidf),
    ('DummyClassifier', DummyClassifier()),
])


#Perform grid search for each of the learning algorithms

from sklearn.model_selection import GridSearchCV


#Grid search on perceptron
param_grid_perceptron = {'perceptron__max_iter':[1,10,20,30,40,50],'perceptron__tol':[0.001]}
GridSearch_perceptron = GridSearchCV(estimator=pipeline_perceptron, param_grid = param_grid_perceptron, 
             cv=10, scoring='accuracy')
GridSearch_perceptron.fit(X_train, y_train)
with open('Model/perceptron.pickle', 'wb') as model_file:
    pickle.dump(GridSearch_perceptron.best_estimator_, model_file)
print("GridSearchCV on perceptron done")


#Grid search on SVC
param_grid_SVC = {'SVC__C': [0.0001, 0.001, 0.01, 0.1, 1.0, 5, 10.0], 'SVC__kernel': ['linear']}
GridSearch_SVC = GridSearchCV(estimator=pipeline_SVC, param_grid = param_grid_SVC, 
             cv=10, scoring='accuracy')
GridSearch_SVC.fit(X_train, y_train)
with open('Model/SVC.pickle', 'wb') as model_file:
    pickle.dump(GridSearch_SVC.best_estimator_, model_file)
print("GridSearchCV on SVC done")


#Grid search on a DummyClassifier
param_grid_DummyClassifier = {'DummyClassifier__strategy':['most_frequent']} #'uniform','stratified',
GridSearch_DummyClassifier = GridSearchCV(estimator=pipeline_DummyClassifier, param_grid = param_grid_DummyClassifier, 
             cv=10, scoring='accuracy')
GridSearch_DummyClassifier.fit(X_train, y_train)
with open('Model/DummyClassifier.pickle', 'wb') as model_file:
    pickle.dump(GridSearch_DummyClassifier.best_estimator_, model_file)
print("GridSearchCV on DummyClassifier done")




print("Perceptron best params {0}  \n Perceptron best score {1} \n \n".format(GridSearch_perceptron.best_params_, 
                                                                              GridSearch_perceptron.best_score_))

print("SVC best params: {0}  \n SVC best score: {1} \n \n".format(GridSearch_SVC.best_params_, GridSearch_SVC.best_score_))

print("DummyClassifier best params: {0}  \n DummyClassifier best score: {1} \n \n".format(GridSearch_DummyClassifier.best_params_,
                                                                                          GridSearch_DummyClassifier.best_score_))







GridSearchCV on perceptron done
GridSearchCV on SVC done
GridSearchCV on DummyClassifier done
Perceptron best params {'perceptron__max_iter': 30, 'perceptron__tol': 0.001}  
 Perceptron best score 0.7087845674080565 
 

SVC best params: {'SVC__C': 1.0, 'SVC__kernel': 'linear'}  
 SVC best score: 0.7535209573636419 
 

DummyClassifier best params: {'DummyClassifier__strategy': 'most_frequent'}  
 DummyClassifier best score: 0.5115993574048778 
 



#### Results of model selection


**Perceptron best params:** {'perceptron__max_iter': 30, 'perceptron__tol': 0.001}  
**Perceptron best score:** 0.7087845674080565  
 

**SVC best params:** {'SVC__C': 1.0, 'SVC__kernel': 'linear'}  
**SVC best score:** 0.7535209573636419  
 

**DummyClassifier best params:** {'DummyClassifier__strategy': 'most_frequent'}  
**DummyClassifier best score:** 0.5115993574048778  

## Model evaluation

The test set held out earlier is evaluated in this section. A confusion matrix is also computed to see whether one class (e.g. anti-Brexit comments) is harder to predict than the other.

In [6]:
from sklearn.metrics import confusion_matrix

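#Load the saved pipelines; the with-blocks close the files afterwards
with open("Model/perceptron.pickle", "rb") as model_file:
    clf_perceptron = pickle.load(model_file)
with open("Model/SVC.pickle", "rb") as model_file:
    clf_SVC = pickle.load(model_file)
with open("Model/DummyClassifier.pickle", "rb") as model_file:
    clf_DummyClassifier = pickle.load(model_file)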

#Print test set accuracy results
print("Perceptron test set results: {0}".format(clf_perceptron.score(X_test,y_test)))
print("SVC test set results: {0}".format(clf_SVC.score(X_test,y_test)))
print("Dummy Classifier test set results: {0} \n \n".format(clf_DummyClassifier.score(X_test,y_test)))


#Compute a confusion matrix for each classifier
ypred_perceptron = clf_perceptron.predict(X_test)
ypred_SVC = clf_SVC.predict(X_test)
ypred_DummyClassifier = clf_DummyClassifier.predict(X_test)

conf_matrix_perceptron = confusion_matrix(y_true=y_test, y_pred=ypred_perceptron, labels = [0,1])
conf_matrix_SVC = confusion_matrix(y_true=y_test, y_pred=ypred_SVC , labels = [0,1])
conf_matrix_DummyClassifier = confusion_matrix(y_true=y_test, y_pred=ypred_DummyClassifier, labels = [0,1])
#Print the confusion matrices for all three classifiers
print('Confusion matrix for perceptron: {0} \n Confusion matrix for SVC: {1}\n Confusion matrix for DummyClassifier: {2}\n'.format(conf_matrix_perceptron,
                                                                                                                                    conf_matrix_SVC,
                                                                                                                                    conf_matrix_DummyClassifier))

tn, fp, fn, tp = conf_matrix_perceptron.ravel()
print("True negatives are anti brexit comments correctly identified")
print("False positive are anti-brexit Predicted as Pro brexit")
print("False negative are pro brexit comments predicted as anti brexit")
print("True positives are pro brexit comments correctly identified")
print("For example: Perceptron has  {0} true negatives, {1} false positives, {2} false negatives, and {3} true positives".format(tn, fp, fn, tp))


#Trying to print important features for perceptron and SVC
#feature_names = clf_perceptron.named_steps['vect'].get_feature_names()
#print(feature_names[0:5])
#weights = clf_perceptron.named_steps['perceptron'].coef_
#print(weights[0:5])
#mapped = zip(clf_perceptron.named_steps['perceptron'].coef_ , clf_perceptron.named_steps['vect'].get_feature_names())
#Print which features the SVC thinks are important
#for weight, fname in mapped:
#    print(fname, weight)


Perceptron test set results: 0.7258200168208578
SVC test set results: 0.7485281749369218
Dummy Classifier test set results: 0.5126156433978133 
 

Confusion matrix for perceptron: [[802 357]
 [295 924]] 
 Confusion matrix for SVC: [[859 300]
 [298 921]]
 Confusion matrix for DummyClassifier: [[   0 1159]
 [   0 1219]]

True negatives are anti brexit comments correctly identified
False positive are anti-brexit Predicted as Pro brexit
False negative are pro brexit comments predicted as anti brexit
True positives are pro brexit comments correctly identified
For example: Perceptron has  802 true negatives, 357 false positives, 295 false negatives, and 924 true positives


**Perceptron test set results:** 0.7258200168208578  
**SVC test set results:** 0.7485281749369218  
**Dummy Classifier test set results:** 0.5126156433978133  
  
Both learning algorithms (SVC and perceptron) perform better than the trivial baseline that always predicts the most frequent class.

**Confusion matrix for perceptron:**  
[[802 357]  
 [295 924]]  


**Confusion matrix for SVC:**  
[[859 300]  
 [298 921]]  


**Confusion matrix for DummyClassifier:**  
[[   0 1159]  
 [   0 1219]]  

*True negatives* are anti-Brexit comments correctly identified.  
*False positives* are anti-Brexit comments predicted as pro-Brexit.  
*False negatives* are pro-Brexit comments predicted as anti-Brexit.  
*True positives* are pro-Brexit comments correctly identified.  


For example, the perceptron has 802 true negatives, 357 false positives, 295 false negatives, and 924 true positives.  
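
To check whether one class is harder to predict, the per-class recall can be read off a confusion matrix. The sketch below uses the SVC matrix reported above, with rows as true labels and columns as predictions (0 = anti-Brexit, 1 = pro-Brexit); the variable names are illustrative.

```python
# SVC confusion matrix as reported above.
# Rows are true labels, columns are predicted labels (0 = anti, 1 = pro).
conf_matrix_svc = [[859, 300],
                   [298, 921]]

# Same order as confusion_matrix(...).ravel() for binary labels [0, 1].
(tn, fp), (fn, tp) = conf_matrix_svc

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

# Recall per class: share of each true class that was predicted correctly.
recall_anti = tn / (tn + fp)
recall_pro = tp / (tp + fn)

print("accuracy:    {0:.4f}".format(accuracy))
print("recall anti: {0:.4f}".format(recall_anti))
print("recall pro:  {0:.4f}".format(recall_pro))
```

The accuracy recomputed this way matches the SVC test set result, and the two recalls are within about one and a half percentage points of each other, so neither class stands out as markedly harder for the SVC.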




### Running the best estimators on a separate test set

The optimized estimators are stored in the folder named "Model". The *pickle* library can be used to load a classifier.

In [7]:
import pickle
with open("Model/SVC.pickle", "rb") as model_file:
    clf_pipeline = pickle.load(model_file)
clf_pipeline.score(X_test,y_test)

0.7485281749369218

### Investigating features

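Because the perceptron and the SVC with a linear kernel are both linear classifiers, their learned weights show which features (words) matter most for the classification. Positive weights push a comment towards pro-Brexit (label 1), negative weights towards anti-Brexit (label 0).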

In [8]:
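with open("Model/SVC.pickle", "rb") as model_file:
    clf_pipeline = pickle.load(model_file)
#coef_ is a sparse matrix because the SVC was trained on sparse input
featuresSparse = clf_pipeline['SVC'].coef_
features = featuresSparse.todense()


countVectorizer = clf_pipeline['vect']
#get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
words = countVectorizer.get_feature_names_out()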



df = pd.DataFrame(data={'Words': words, 'Score': features.tolist()[0][:]})
df_proBrexit = df[df.Score > 0]
df_AntiBrexit = df[df.Score < 0]

df_proBrexit = df_proBrexit.sort_values(by=['Score'], ascending=False)
df_AntiBrexit = df_AntiBrexit.sort_values(by=['Score'], ascending=False)
print("Top pro brexit")
print(df_proBrexit.head(10))
print("Top Anti brexit")
print(df_AntiBrexit.tail(10))




Top pro brexit
           Words     Score
3105          eu  3.331471
5276       media  2.698867
5266       means  2.639581
6863   remainers  2.439756
91    #voteleave  2.360061
2092     corrupt  2.291570
4028        hard  2.264120
2417  democratic  2.192142
2415   democracy  2.124605
4885        laws  2.003526
Top Anti brexit
            Words     Score
7830     stopping -1.796267
5686     nonsense -1.821959
1333     brexshit -1.892504
1317       brexit -1.920070
93    #voteremain -1.926407
6061        peace -2.112792
7899       stupid -2.227987
4915      leavers -2.437080
1322   brexiteers -2.621479
1325    brexiters -2.701223
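
How such weights turn into a prediction can be sketched as follows. The weights are copied (rounded) from the table above; the feature values of the two comments and the zero bias are assumptions made for illustration, since the real inputs are tf-idf values and the fitted intercept is not shown in this notebook.

```python
# A few SVC weights copied (rounded) from the table above.
weights = {
    "eu": 3.33,
    "#voteleave": 2.36,
    "democracy": 2.12,
    "#voteremain": -1.93,
    "stupid": -2.23,
    "brexiters": -2.70,
}


def linear_score(features, weights, bias=0.0):
    """Compute w . x + b over the features present in one comment.

    bias=0.0 is an assumption; the fitted intercept is not reported here.
    """
    return bias + sum(weights.get(tok, 0.0) * val for tok, val in features.items())


def predict(features, weights):
    """Label 1 (pro-Brexit) for a positive score, else 0 (anti-Brexit)."""
    return 1 if linear_score(features, weights) > 0 else 0


# Hypothetical feature values for two comments.
pro_comment = {"eu": 0.6, "democracy": 0.8}
anti_comment = {"brexiters": 0.7, "stupid": 0.7}

pro_score = linear_score(pro_comment, weights)
anti_score = linear_score(anti_comment, weights)
print(pro_score, predict(pro_comment, weights))
print(anti_score, predict(anti_comment, weights))
```

Words missing from the weight table contribute nothing to the score, which is why only the strongly weighted words are worth inspecting.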


## Analysing the annotation

Each sample was labelled by several annotators. However, since the third annotation is missing for some of the samples, only the agreement between the first two annotators is analysed.


In [9]:
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# The file is comma-separated, so the sep argument is set to ','
with open("Data/a2_final.csv", 'r',encoding='utf-8') as myfile:
    df = pd.read_csv(myfile,sep=',')

df.rename(columns={0: 'labels', 1: 'Comment', 2: 'Anotator 1', 3: 'Anotator 2', 4: 'Anotator 3'},inplace=True)
#print(df)

df_two_annotators  = df[['Anotator 1', 'Anotator 2']].copy()
#print(df_two_annotators)
df_two_annotators = df_two_annotators.dropna()
Anotator_1 = df_two_annotators['Anotator 1'].tolist()
Anotator_2 = df_two_annotators['Anotator 2'].tolist()

print(accuracy_score(Anotator_1, Anotator_2))
print(cohen_kappa_score(Anotator_1, Anotator_2))





0.8649101431617423
0.7408589591450898


**Accuracy (raw agreement):** 0.8649101431617423  
**Cohen's kappa:** 0.7408589591450898
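
Cohen's kappa is lower than the raw agreement because it corrects for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the chance agreement derived from each annotator's label frequencies. Below is a minimal sketch of that formula on made-up labels; it is an illustrative re-implementation, not the scikit-learn function used above.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same samples."""
    n = len(labels_a)

    # Observed agreement: share of samples with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same class
    # if each annotator labelled at random with their own frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    return (p_o - p_e) / (1 - p_e)


# Made-up annotations for ten samples (8 of 10 agree).
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
annotator_2 = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]

kappa = cohen_kappa(annotator_1, annotator_2)
print(kappa)
```

In this toy example the annotators agree on 80% of the samples, but with both using each label half of the time, 50% agreement is expected by chance, which gives a kappa of about 0.6.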