<H1>INSTRUCTIONS</H1>

In this ICA, you will build various text classification models and use them to classify sentences from 2016 presidential debates according to speaker.

The ICA zip file contains three files: train, dev, and test. Each file has one document per line, tokenized and lowercased, so you don't have to do any preprocessing. Each line has the format:

      
   <font color="red"> <center>trump i fully understand .</center></font>

where the first token is the speaker's name, and the remaining tokens are the words of the document.


<H2>Hands on programming with Text Classification</H2> 

A Naive Bayes classifier with 51.5% accuracy is provided.
Experiment with two new kinds of features.
<ol type="a">
<li>Extend the provided bag-of-words code to include an additional kind of feature besides words. You can try bigrams, prefixes, parts of speech, anything you like. Describe your new features. Report your new accuracy. Briefly write down your conclusions from this experiment. <b>(5 points)</b></li>
<li>Do the same thing for another kind of feature. Report your new accuracy. Briefly write down your conclusions from this experiment. <b>(5 points)</b></li>
</ol>
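
One way to approach part (a) is a custom analyzer that emits each word together with a short prefix of it. The sketch below is illustrative only: the function name `words_and_prefixes`, the `pre=` tag, and the prefix length of 3 are assumptions, not part of the provided code. `CountVectorizer` accepts a callable for its `analyzer` parameter, so a function like this could be passed in place of the default word analyzer.

```python
from collections import Counter

def words_and_prefixes(doc, prefix_len=3):
    """Return word tokens plus tagged prefix features for one document.

    Hypothetical helper: the name, the 'pre=' tag and prefix_len=3 are
    illustrative choices, not part of the assignment code.
    """
    tokens = doc.split()  # the data is already tokenized and lowercased
    features = list(tokens)
    for tok in tokens:
        # Only add a prefix when it is shorter than the word itself,
        # so short words are not counted twice.
        if len(tok) > prefix_len:
            features.append("pre=" + tok[:prefix_len])
    return features

feature_counts = Counter(words_and_prefixes("i fully understand ."))
print(feature_counts)

# Possible use with scikit-learn (not executed here):
#   vect = CountVectorizer(analyzer=words_and_prefixes)
```

Tagging the prefixes keeps them from colliding with real three-letter words in the vocabulary.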

In [0]:
import pandas as pd
import numpy as np

In [0]:
##Load Training Data
train_x=[]
train_y=[]

In [0]:
with open('data/train.txt', 'r') as f:
    for line in f:
        # First token is the speaker; the rest of the line is the sentence
        temp = line.split(' ', 1)
        train_x.append(temp[1])
        train_y.append(temp[0])

In [4]:
np.unique(train_y)

array(['bush', 'carson', 'chafee', 'christie', 'clinton', 'cruz',
       'fiorina', 'huckabee', 'kasich', "o'malley", 'paul', 'perry',
       'rubio', 'sanders', 'trump', 'walker', 'webb'], dtype='<U8')

In [5]:
len(np.unique(train_y))

17

In [6]:
train_df = pd.DataFrame({'sentence':train_x, 'speaker':train_y})
train_df.head()

,sentence,speaker
0,"no . i am very proud to be jewish , and being ...",sanders
1,"well , that just wasn't that just wasn't the f...",clinton
2,"thank you , anderson . thank you , cnn . and t...",chafee
3,. . . let us talk about issues . \n,sanders
4,"thank you , bernie . thank you . \n",clinton


In [0]:
##Load test data
test_x=[]
test_y=[]

In [0]:
with open('data/test.txt', 'r') as f:
    for line in f:
        # First token is the speaker; the rest of the line is the sentence
        temp = line.split(' ', 1)
        test_x.append(temp[1])
        test_y.append(temp[0])

In [9]:
test_df = pd.DataFrame({'sentence':test_x, 'speaker':test_y})
test_df.head()

,sentence,speaker
0,"madam secretary , when he asked me to speak . ...",sanders
1,too many lives have been destroyed because peo...,sanders
2,. . . well and i . . . \n,sanders
3,"look , the secretary is right . this is a terr...",sanders
4,"well , hillary clinton , and everybody else wh...",sanders


<H3>Train Naive Bayes Classifier with the data in "train.txt" file</H3>

In [10]:
#Count of documents per class
from collections import Counter
docs_per_class = Counter(train_df['speaker'])
docs_per_class

Counter({'bush': 207,
         'carson': 104,
         'chafee': 18,
         'christie': 81,
         'clinton': 455,
         'cruz': 273,
         'fiorina': 90,
         'huckabee': 41,
         'kasich': 162,
         "o'malley": 128,
         'paul': 94,
         'perry': 1,
         'rubio': 305,
         'sanders': 501,
         'trump': 637,
         'walker': 31,
         'webb': 28})
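
Naive Bayes combines per-class word likelihoods with a class prior, and with counts this skewed the prior matters. As a rough illustration, the sketch below recomputes the maximum-likelihood priors from the counts printed above. The variable names are assumptions, and always guessing the most frequent speaker is offered only as a sanity-check baseline.

```python
from collections import Counter

# Document counts per speaker, copied from the Counter output above.
docs_per_class = Counter({
    'bush': 207, 'carson': 104, 'chafee': 18, 'christie': 81,
    'clinton': 455, 'cruz': 273, 'fiorina': 90, 'huckabee': 41,
    'kasich': 162, "o'malley": 128, 'paul': 94, 'perry': 1,
    'rubio': 305, 'sanders': 501, 'trump': 637, 'walker': 31,
    'webb': 28,
})

total_docs = sum(docs_per_class.values())

# Maximum-likelihood class prior: P(c) = N_c / N
priors = {speaker: n / total_docs for speaker, n in docs_per_class.items()}

top_speaker, top_count = docs_per_class.most_common(1)[0]
print(total_docs)
print(top_speaker, round(priors[top_speaker], 3))
```

Speakers with very few training sentences (for example `perry`, with one) start from a very small prior, so the classifier will rarely choose them.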

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(train_df.sentence)


In [12]:
#Mapping from each vocabulary word to its column index in the document-term matrix
vect.vocabulary_

{'no': 5532,
 'am': 514,
 'very': 8682,
 'proud': 6413,
 'to': 8240,
 'be': 931,
 'jewish': 4529,
 'and': 560,
 'being': 976,
 'is': 4477,
 'so': 7528,
 'much': 5395,
 'of': 5625,
 'what': 8902,
 'look': 4936,
 'my': 5420,
 'father': 3241,
 'family': 3215,
 'was': 8824,
 'wiped': 8957,
 'out': 5744,
 'by': 1310,
 'hitler': 3979,
 'in': 4192,
 'the': 8144,
 'holocaust': 3992,
 'know': 4672,
 'about': 228,
 'crazy': 2052,
 'radical': 6518,
 'extremist': 3169,
 'politics': 6131,
 'mean': 5132,
 'learned': 4774,
 'that': 8141,
 'lesson': 4813,
 'as': 718,
 'tiny': 8232,
 'child': 1543,
 'when': 8906,
 'mother': 5374,
 'would': 9013,
 'take': 8010,
 'me': 5131,
 'shopping': 7385,
 'we': 8850,
 'see': 7246,
 'people': 5956,
 'working': 8998,
 'stores': 7774,
 'who': 8923,
 'had': 3808,
 'numbers': 5584,
 'on': 5659,
 'their': 8146,
 'arms': 696,
 'because': 944,
 'they': 8162,
 'were': 8895,
 'concentration': 1790,
 'camp': 1337,
 'an': 551,
 'essential': 3007,
 'part': 5866,
 'human': 4066,

In [0]:
#Train Naive Bayes Classifier
#MultinomialNB has default smoothing factor of 1
from sklearn.naive_bayes import MultinomialNB
nb_clf = MultinomialNB().fit(X_train_dtm, train_df.speaker)
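
The smoothing factor mentioned in the comment is the additive (Laplace) term in the word likelihood estimate, P(w | c) = (count(w, c) + alpha) / (total words in c + alpha * |V|). The toy sketch below works through that formula by hand. The two-class corpus and all names in it are made up for illustration and are not the debate data.

```python
from collections import Counter

# Made-up toy corpus (not the debate data): class -> documents.
toy_docs = {
    "a": ["great wall great", "wall"],
    "b": ["health care", "care care plan"],
}
alpha = 1.0  # same value as the default smoothing factor noted above

# Word counts per class, and the vocabulary shared by all classes.
word_counts = {c: Counter(" ".join(docs).split()) for c, docs in toy_docs.items()}
vocab = sorted(set().union(*word_counts.values()))

def smoothed_likelihood(word, cls):
    """P(word | cls) with additive smoothing."""
    counts = word_counts[cls]
    return (counts[word] + alpha) / (sum(counts.values()) + alpha * len(vocab))

print(smoothed_likelihood("great", "a"))  # seen twice in class "a"
print(smoothed_likelihood("plan", "a"))   # never seen in class "a", still non-zero
```

Without the `alpha` term, a single unseen word would drive the whole class probability to zero.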

<H3>Test the trained Naive Bayes classifier with "test.txt" file</H3>

In [0]:
X_test_dtm = vect.transform(test_df.sentence)
predicted_speaker=nb_clf.predict(X_test_dtm)

In [15]:
from sklearn import metrics
metrics.accuracy_score(test_df.speaker, predicted_speaker)

0.515

In [16]:
#Evaluate model
#classification_report lists classes in sorted label order, so no target_names are passed
from sklearn import metrics
print(metrics.classification_report(test_df.speaker, predicted_speaker))

              precision    recall  f1-score   support

        bush       0.75      0.10      0.18        30
      carson       0.00      0.00      0.00        16
      chafee       0.00      0.00      0.00         2
    christie       0.00      0.00      0.00        12
     clinton       0.43      0.74      0.54        53
        cruz       0.79      0.65      0.71        34
     fiorina       0.00      0.00      0.00        10
    huckabee       0.00      0.00      0.00         7
      kasich       0.79      0.35      0.49        31
    o'malley       0.60      0.17      0.26        18
        paul       0.00      0.00      0.00         5
       rubio       0.52      0.54      0.53        41
     sanders       0.51      0.69      0.59        64
       trump       0.48      0.87      0.62        71
      walker       0.00      0.00      0.00         3
        webb       0.00      0.00      0.00         3

    accuracy                           0.52       400
   macro avg       0.30
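
The many 0.00 rows are easier to read with the definitions in hand: precision is the share of predictions for a speaker that were correct, and recall is the share of that speaker's sentences that were found. The sketch below computes both for a tiny made-up example (the labels and predictions are invented, not taken from the test set). It treats a speaker who is never predicted as having precision 0, which appears to match what scikit-learn reports alongside a warning.

```python
def precision_recall(y_true, y_pred, label):
    """Per-class precision and recall; 0.0 when the denominator is zero."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)  # times the label was predicted
    actual = sum(t == label for t in y_true)     # times the label was correct
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

# Invented labels for illustration only.
y_true = ["trump", "trump", "clinton", "webb", "clinton"]
y_pred = ["trump", "trump", "trump", "trump", "clinton"]

for speaker in sorted(set(y_true)):
    print(speaker, precision_recall(y_true, y_pred, speaker))
```

A classifier that over-predicts the frequent speakers shows the same pattern as here: high recall but modest precision for the frequent class, and zeros for classes it never predicts.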


## Using TFIDF as Features

We use TF-IDF to represent each sentence as a vector in which a word's frequency is weighted down by the number of documents in the corpus that contain it.
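
To make the weighting concrete, here is a small hand-rolled sketch on a made-up three-document corpus. It uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation, which to our understanding is what `TfidfVectorizer` does with its default settings. The corpus and function names are assumptions for illustration.

```python
import math
from collections import Counter

# Made-up corpus for illustration (not the debate data).
corpus = ["the wall", "the plan", "the the care"]
tokenized = [doc.split() for doc in corpus]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """L2-normalised TF-IDF weights for one tokenized document."""
    term_freq = Counter(tokens)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_freq.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

weights = tfidf(tokenized[0])
print(weights)  # "wall" outweighs "the", which occurs in every document
```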

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
tfidf_vector = vectorizer.fit_transform(train_df.sentence)

In [0]:
#Train Naive Bayes Classifier
#MultinomialNB has default smoothing factor of 1
from sklearn.naive_bayes import MultinomialNB
nb_clf_tfidf = MultinomialNB().fit(tfidf_vector, train_df.speaker)

In [0]:
X_test_tfidf = vectorizer.transform(test_df.sentence)
predicted_speaker=nb_clf_tfidf.predict(X_test_tfidf)

In [39]:
from sklearn import metrics
metrics.accuracy_score(test_df.speaker, predicted_speaker)

0.3725

In [40]:
#Evaluate model
#classification_report lists classes in sorted label order, so no target_names are passed
from sklearn import metrics
print(metrics.classification_report(test_df.speaker, predicted_speaker))

              precision    recall  f1-score   support

        bush       0.00      0.00      0.00        30
      carson       0.00      0.00      0.00        16
      chafee       0.00      0.00      0.00         2
    christie       0.00      0.00      0.00        12
     clinton       0.36      0.62      0.46        53
        cruz       0.00      0.00      0.00        34
     fiorina       0.00      0.00      0.00        10
    huckabee       0.00      0.00      0.00         7
      kasich       0.00      0.00      0.00        31
    o'malley       0.00      0.00      0.00        18
        paul       0.00      0.00      0.00         5
       rubio       0.00      0.00      0.00        41
     sanders       0.47      0.75      0.57        64
       trump       0.33      0.96      0.49        71
      walker       0.00      0.00      0.00         3
        webb       0.00      0.00      0.00         3

    accuracy                           0.37       400
   macro avg       0.07


As we can see, accuracy drops considerably, from 51.5% to about 37%.

## Using NGrams as Features

Our original experiment only included unigrams. We now use n-grams of five lengths, from N=2 to N=6, in place of the unigrams.
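
As a quick illustration of what these features look like, the sketch below lists the word n-grams for a short sentence. The helper `word_ngrams` is hypothetical and only mimics the idea behind `ngram_range`; note that a range starting at 2 leaves the unigrams out.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        # A sentence shorter than n words simply yields no n-grams.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = "let us talk about issues".split()
grams = word_ngrams(tokens, 2, 6)
print(len(grams))   # 4 bigrams + 3 trigrams + 2 four-grams + 1 five-gram
print(grams[:4])
```

Longer n-grams are far sparser than single words, so most of them occur in only one or two training sentences.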

In [0]:
bigram_vectorizer = CountVectorizer(ngram_range=(2, 6))

In [0]:
X_train_ngrams = bigram_vectorizer.fit_transform(train_df.sentence)

In [0]:
#Train Naive Bayes Classifier
#MultinomialNB has default smoothing factor of 1
from sklearn.naive_bayes import MultinomialNB
nb_clf_ngrams = MultinomialNB().fit(X_train_ngrams, train_df.speaker)

In [0]:
X_test_ngrams = bigram_vectorizer.transform(test_df.sentence)

In [0]:
predicted_speaker=nb_clf_ngrams.predict(X_test_ngrams)

In [89]:
from sklearn import metrics
metrics.accuracy_score(test_df.speaker, predicted_speaker)

0.5625

In [90]:
#Evaluate model
#classification_report lists classes in sorted label order, so no target_names are passed
from sklearn import metrics
print(metrics.classification_report(test_df.speaker, predicted_speaker))

              precision    recall  f1-score   support

        bush       0.80      0.27      0.40        30
      carson       1.00      0.12      0.22        16
      chafee       0.00      0.00      0.00         2
    christie       1.00      0.08      0.15        12
     clinton       0.44      0.72      0.55        53
        cruz       0.88      0.62      0.72        34
     fiorina       1.00      0.10      0.18        10
    huckabee       0.00      0.00      0.00         7
      kasich       0.80      0.26      0.39        31
    o'malley       1.00      0.33      0.50        18
        paul       1.00      0.40      0.57         5
       rubio       0.67      0.63      0.65        41
     sanders       0.52      0.72      0.61        64
       trump       0.50      0.92      0.65        71
      walker       0.50      0.33      0.40         3
        webb       0.00      0.00      0.00         3

    accuracy                           0.56       400
   macro avg       0.63


Replacing the unigram features with n-grams of length 2 to 6 increases the number of features and raises accuracy from 51.5% to about 56%.