<H1>INSTRUCTIONS</H1>

In this ICA, you will build various text classification models and use them to classify sentences from 2016 presidential debates according to speaker.

The ICA zip file contains three files: train, dev, and test. Each file has one document per line, tokenized and lowercased, so you don't have to do any preprocessing. Each line has the format:

      
   <font color="red"> <center>trump i fully understand .</center></font>

where the first token is the speaker's name, and the remaining tokens are the words of the document.
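
As a minimal sketch of this line format, the snippet below splits one line into its speaker and its words. The helper name `parse_line` is hypothetical and is not part of the provided code.

```python
def parse_line(line):
    """Split one data line into (speaker, text).

    The first whitespace-delimited token is the speaker; everything
    after the first space is the document text.
    """
    speaker, _, text = line.rstrip("\n").partition(" ")
    return speaker, text


speaker, text = parse_line("trump i fully understand .\n")
print(speaker)  # speaker label
print(text)     # remaining tokens as one string
```

Unlike the loading cells further down, this helper strips the trailing newline, which has no effect on the default `CountVectorizer` tokenization.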


<H2>Hands on programming with Text Classification</H2> 

A Naive Bayes classifier with 51.5% accuracy is provided.
Experiment with two new kinds of features.
<ol type="a">
<li>Extend the code provided to construct a bag of words to include an additional kind of feature besides words. You can try bigrams, prefixes, parts of speech, anything you like. Describe your new features. Report your new accuracy. Briefly write down your conclusions from this experiment. <b>(5 points)</b></li>
<li>Do the same thing for another kind of feature. Report your new accuracy. Briefly write down your conclusions from this experiment. <b>(5 points)</b></li>
</ol>
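
One possible way to add a second kind of feature next to the word counts is to build two count matrices and stack them column-wise. The sketch below assumes character n-grams (a rough stand-in for prefixes and suffixes) as the extra feature; the toy sentences and variable names are illustrative only.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

# Toy sentences standing in for the debate data (assumption)
docs = ["i fully understand .", "let us talk about issues ."]

# Feature kind 1: plain word counts
word_vect = CountVectorizer()
X_words = word_vect.fit_transform(docs)

# Feature kind 2: character 2- and 3-grams within word boundaries
char_vect = CountVectorizer(analyzer="char_wb", ngram_range=(2, 3))
X_chars = char_vect.fit_transform(docs)

# Stack both feature blocks side by side into one matrix
X_combined = hstack([X_words, X_chars]).tocsr()
print(X_combined.shape)
```

The same two fitted vectorizers would then be applied with `transform` to the test sentences, and the resulting blocks stacked in the same order.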

In [11]:
import pandas as pd
import numpy as np
import nltk

In [2]:
##Load Training Data
train_x=[]
train_y=[]

In [3]:
with open('train.txt', 'r') as f:
    for line in f:
        # First token is the speaker, the rest is the sentence
        temp=line.split(' ',1)
        train_x.append(temp[1])
        train_y.append(temp[0])
       

In [4]:
np.unique(train_y)

array(['bush', 'carson', 'chafee', 'christie', 'clinton', 'cruz',
       'fiorina', 'huckabee', 'kasich', "o'malley", 'paul', 'perry',
       'rubio', 'sanders', 'trump', 'walker', 'webb'], dtype='<U8')

In [5]:
len(np.unique(train_y))

17

In [6]:
train_df = pd.DataFrame({'sentence':train_x, 'speaker':train_y})
train_df.head()

Unnamed: 0,sentence,speaker
0,"no . i am very proud to be jewish , and being ...",sanders
1,"well , that just wasn't that just wasn't the f...",clinton
2,"thank you , anderson . thank you , cnn . and t...",chafee
3,. . . let us talk about issues . \n,sanders
4,"thank you , bernie . thank you . \n",clinton


In [7]:
##Load test data
test_x=[]
test_y=[]

In [9]:
with open('test.txt', 'r') as f:
    for line in f:
        # First token is the speaker, the rest is the sentence
        temp=line.split(' ',1)
        test_x.append(temp[1])
        test_y.append(temp[0])

In [10]:
test_df = pd.DataFrame({'sentence':test_x, 'speaker':test_y})
test_df.head()

Unnamed: 0,sentence,speaker
0,"madam secretary , when he asked me to speak . ...",sanders
1,too many lives have been destroyed because peo...,sanders
2,. . . well and i . . . \n,sanders
3,"look , the secretary is right . this is a terr...",sanders
4,"well , hillary clinton , and everybody else wh...",sanders


<H3>Train Naive Bayes Classifier with the data in "train.txt" file</H3>

In [19]:
#Count of documents per class
from collections import Counter
docs_per_class = Counter(train_df['speaker'])
docs_per_class

Counter({'sanders': 501,
         'clinton': 455,
         'chafee': 18,
         "o'malley": 128,
         'webb': 28,
         'bush': 207,
         'cruz': 273,
         'trump': 637,
         'christie': 81,
         'rubio': 305,
         'kasich': 162,
         'fiorina': 90,
         'paul': 94,
         'carson': 104,
         'huckabee': 41,
         'walker': 31,
         'perry': 1})

In [37]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(train_df.sentence)


In [38]:
#Mapping from each vocabulary term to its column index in the document-term matrix (not a word count)
vect.vocabulary_

{'ain': 442,
 'acquiring': 290,
 'nicaraguan': 5515,
 'assess': 741,
 'quarters': 6495,
 'possess': 6166,
 'recall': 6618,
 'partly': 5878,
 'panels': 5841,
 'planes': 6062,
 'neutrally': 5502,
 'cover': 2033,
 'confronted': 1834,
 'normal': 5551,
 'alcohol': 466,
 'amwill': 549,
 'parental': 5859,
 'marketing': 5083,
 'lost': 4954,
 'referenced': 6684,
 'alongside': 498,
 'columbus': 1690,
 'equate': 2981,
 'mandated': 5044,
 'triad': 8369,
 'behaved': 967,
 'ended': 2906,
 'paris': 5863,
 'mom': 5331,
 'mocked': 5314,
 'see': 7246,
 'hacked': 3805,
 'rumble': 7081,
 'made': 4999,
 'rattling': 6575,
 'responded': 6899,
 'coroner': 1983,
 'akron': 460,
 'heed': 3917,
 'priority': 6290,
 'requested': 6858,
 'cousin': 2032,
 'embracing': 2849,
 'contradicts': 1928,
 'yours': 9064,
 'gulf': 3786,
 'highly': 3953,
 'figures': 3312,
 'ethanol': 3016,
 'specter': 7625,
 'continuously': 1922,
 'maximize': 5117,
 'militarily': 5223,
 'enduring': 2916,
 'fresh': 3527,
 'neil': 5487,
 'mm': 5308
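
The values in `vocabulary_` are column indices, not frequencies. A small sketch on toy sentences (an assumption, not the debate data) shows the difference and how corpus-wide counts could be recovered from the document-term matrix:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["we will win", "we will will fight"]
toy_vect = CountVectorizer()
toy_dtm = toy_vect.fit_transform(toy_docs)

# vocabulary_ maps term -> column index (terms are indexed alphabetically)
print(toy_vect.vocabulary_)

# Summing each column of the matrix gives the total count per term
totals = np.asarray(toy_dtm.sum(axis=0)).ravel()
term_counts = {term: int(totals[idx]) for term, idx in toy_vect.vocabulary_.items()}
print(term_counts)
```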

In [39]:
#Train Naive Bayes Classifier
#MultinomialNB has default smoothing factor of 1
from sklearn.naive_bayes import MultinomialNB
nb_clf = MultinomialNB().fit(X_train_dtm, train_df.speaker)
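
To illustrate what the smoothing factor does, the following sketch recomputes the smoothed word probabilities of one class by hand and compares them with what `MultinomialNB` stores in `feature_log_prob_`. The tiny count matrix and labels are made up for the example.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix: 3 documents x 3 vocabulary terms (assumption)
X = np.array([[2, 1, 0],
              [1, 0, 0],
              [0, 0, 3]])
y = np.array(["a", "a", "b"])

alpha = 1.0  # add-one (Laplace) smoothing, the MultinomialNB default
clf = MultinomialNB(alpha=alpha).fit(X, y)

# Manual estimate for class "a":
# (count of term in class + alpha) / (total count in class + alpha * vocabulary size)
counts_a = X[y == "a"].sum(axis=0)
manual = np.log((counts_a + alpha) / (counts_a.sum() + alpha * X.shape[1]))

print(np.exp(manual))                    # smoothed probabilities, none are zero
print(np.exp(clf.feature_log_prob_[0]))  # row 0 corresponds to class "a"
```

The third term never occurs in class "a", yet its probability stays above zero, so an unseen word does not zero out the whole class score.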

<H3>Test the developed naive bayes classifier with "test.txt" file</H3>

In [40]:
X_test_dtm = vect.transform(test_df.sentence)
predicted_speaker=nb_clf.predict(X_test_dtm)

In [41]:
from sklearn import metrics
metrics.accuracy_score(test_df.speaker, predicted_speaker)

0.515

In [42]:
#Evaluate model
from sklearn import metrics
#Rows are reported in sorted label order, so target_names is omitted:
#passing test_df.speaker.unique() (order of appearance) attaches the wrong names to the rows
print(metrics.classification_report(test_df.speaker, predicted_speaker))

              precision    recall  f1-score   support

        bush       0.75      0.10      0.18        30
      carson       0.00      0.00      0.00        16
      chafee       0.00      0.00      0.00         2
    christie       0.00      0.00      0.00        12
     clinton       0.43      0.74      0.54        53
        cruz       0.79      0.65      0.71        34
     fiorina       0.00      0.00      0.00        10
    huckabee       0.00      0.00      0.00         7
      kasich       0.79      0.35      0.49        31
    o'malley       0.60      0.17      0.26        18
        paul       0.00      0.00      0.00         5
       rubio       0.52      0.54      0.53        41
     sanders       0.51      0.69      0.59        64
       trump       0.48      0.87      0.62        71
      walker       0.00      0.00      0.00         3
        webb       0.00      0.00      0.00         3

    accuracy                           0.52       400
   macro avg       0.30   

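
The row order of `classification_report` follows the sorted labels, not the order in which speakers first appear in the data. A minimal sketch with made-up labels illustrates this; `zero_division=0` is assumed to be available (it exists in reasonably recent scikit-learn releases) and silences the undefined-metric warning for classes that are never predicted.

```python
from sklearn.metrics import classification_report

# Toy labels (assumption); "trump" appears first but sorts last
y_true = ["trump", "bush", "trump", "clinton"]
y_pred = ["trump", "trump", "trump", "clinton"]

report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)

# Per-class rows come out in sorted label order
summary_keys = ("accuracy", "macro avg", "weighted avg")
row_order = [key for key in report if key not in summary_keys]
print(row_order)
print(report["bush"])  # never predicted, so precision and recall are 0
```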


## Problem 1

In [43]:
vect = CountVectorizer(tokenizer=nltk.word_tokenize,ngram_range=(1,3))
X_train_dtm = vect.fit_transform(train_df.sentence)
vect.vocabulary_

{'cutting social security': 71550,
 'significant way in': 194429,
 'laughing stock in': 133411,
 'the japanese know': 214642,
 'caught in': 61160,
 'honest ,': 110652,
 '. compared': 13680,
 'in paris yesterday': 118163,
 'have nukes': 105393,
 'three words bush': 226639,
 'with health .': 255282,
 'top of every': 234249,
 "'s visa overstays": 4090,
 'a career': 20747,
 "'s really doing": 3569,
 ', to find': 11380,
 'syria , parts': 203336,
 'knows politicians': 132562,
 'whatever they have': 250545,
 'strategy for our': 200840,
 'to say they': 231932,
 'and start leading': 37935,
 'every telephone': 84970,
 'and marriage .': 36645,
 'that and the': 207528,
 'are talking about': 42688,
 'many places .': 140770,
 'and we stood': 39095,
 "'d like": 186,
 'legally in': 134880,
 'nixon . he': 152205,
 'deliver benefits': 73786,
 'for middle': 91573,
 'plots and these': 173728,
 'privileges .': 177169,
 'killing , i': 131589,
 'seven million': 192790,
 'people are tired': 170995,
 'answer e

In [47]:
nb_clf = MultinomialNB().fit(X_train_dtm, train_df.speaker)
X_test_dtm = vect.transform(test_df.sentence)
predicted_speaker=nb_clf.predict(X_test_dtm)

In [48]:
metrics.accuracy_score(test_df.speaker, predicted_speaker)

0.465

In [46]:
#target_names is omitted so that each row keeps its own (sorted) label
print(metrics.classification_report(test_df.speaker, predicted_speaker))

              precision    recall  f1-score   support

        bush       0.00      0.00      0.00        30
      carson       0.00      0.00      0.00        16
      chafee       0.00      0.00      0.00         2
    christie       0.00      0.00      0.00        12
     clinton       0.27      0.75      0.39        53
        cruz       0.88      0.44      0.59        34
     fiorina       0.00      0.00      0.00        10
    huckabee       0.00      0.00      0.00         7
      kasich       0.50      0.03      0.06        31
    o'malley       1.00      0.06      0.11        18
        paul       1.00      0.20      0.33         5
       rubio       0.65      0.37      0.47        41
     sanders       0.61      0.80      0.69        64
       trump       0.50      0.87      0.64        71
      walker       0.00      0.00      0.00         3
        webb       0.00      0.00      0.00         3

    accuracy                           0.47       400
   macro avg       0.34   



I added unigram, bigram, and trigram counts as features for the model. However, the accuracy decreases to 46.5%. If I use only bigram counts or only trigram counts, the accuracy is around 57%.
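
For reference, the difference between the combined setting and a bigram-only setting comes down to the `ngram_range` argument. The sketch below uses a toy sentence and the default tokenizer rather than `nltk.word_tokenize`, so it only illustrates which features each setting produces, not the accuracies reported above.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["make america great again"]

# Unigrams, bigrams, and trigrams together, as in the cell above
uni_to_tri = CountVectorizer(ngram_range=(1, 3)).fit(toy_docs)

# Bigrams only: lower and upper bound of the range are both 2
bigram_only = CountVectorizer(ngram_range=(2, 2)).fit(toy_docs)

print(sorted(uni_to_tri.vocabulary_))
print(sorted(bigram_only.vocabulary_))
```

Using `ngram_range=(3, 3)` in the same way would keep trigrams only.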

## Problem 2

In [49]:

from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(tokenizer=nltk.word_tokenize)
X_train_dtm = vect.fit_transform(train_df.sentence)

In [50]:
vect.vocabulary_

{'possess': 6204,
 'acquiring': 318,
 'nicaraguan': 5548,
 'assess': 768,
 'quarters': 6534,
 'recall': 6658,
 'partly': 5916,
 'panels': 5879,
 'planes': 6100,
 'neutrally': 5535,
 'cover': 2061,
 'confronted': 1863,
 'normal': 5584,
 'alcohol': 494,
 'amwill': 577,
 'parental': 5897,
 'marketing': 5113,
 'lost': 4984,
 'referenced': 6724,
 'alongside': 526,
 'columbus': 1719,
 'equate': 3007,
 'mandated': 5074,
 'triad': 8410,
 'behaved': 995,
 'ended': 2932,
 'paris': 5901,
 'mom': 5361,
 'mocked': 5344,
 'see': 7287,
 'hacked': 3833,
 'rumble': 7121,
 'made': 5030,
 'rattling': 6615,
 'responded': 6939,
 'coroner': 2012,
 'akron': 488,
 'heed': 3943,
 'priority': 6328,
 'requested': 6898,
 'cousin': 2060,
 'embracing': 2875,
 'contradicts': 1957,
 'yours': 9107,
 'gulf': 3813,
 'highly': 3979,
 'figures': 3339,
 'ethanol': 3042,
 'specter': 7664,
 'continuously': 1951,
 'maximize': 5147,
 'militarily': 5253,
 'enduring': 2942,
 'fresh': 3554,
 'neil': 5520,
 "let'slet'slet": 4847,


In [51]:
nb_clf = MultinomialNB().fit(X_train_dtm, train_df.speaker)

In [52]:
X_test_dtm = vect.transform(test_df.sentence)
predicted_speaker=nb_clf.predict(X_test_dtm)

In [53]:
metrics.accuracy_score(test_df.speaker, predicted_speaker)

0.3475

In [54]:
#target_names is omitted so that each row keeps its own (sorted) label
print(metrics.classification_report(test_df.speaker, predicted_speaker))

              precision    recall  f1-score   support

        bush       0.00      0.00      0.00        30
      carson       0.00      0.00      0.00        16
      chafee       0.00      0.00      0.00         2
    christie       0.00      0.00      0.00        12
     clinton       0.38      0.53      0.44        53
        cruz       0.00      0.00      0.00        34
     fiorina       0.00      0.00      0.00        10
    huckabee       0.00      0.00      0.00         7
      kasich       0.00      0.00      0.00        31
    o'malley       0.00      0.00      0.00        18
        paul       0.00      0.00      0.00         5
       rubio       0.00      0.00      0.00        41
     sanders       0.46      0.66      0.54        64
       trump       0.29      0.97      0.45        71
      walker       0.00      0.00      0.00         3
        webb       0.00      0.00      0.00         3

    accuracy                           0.35       400
   macro avg       0.07   



TF-IDF tries to identify distinctive words by weighing how frequently a word occurs in a document against how rare it is in the other documents. The sentences in this data set apparently contain few such distinctive words, so the accuracy decreases to about 35% (0.3475) when TF-IDF features are used.
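
The weighting described above can be seen on a toy corpus (an assumption, not the debate data): a word that occurs in every document gets the lowest inverse document frequency, while a word that occurs in only one document gets a higher one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["we need jobs", "we need peace", "we need ethanol subsidies"]
tfidf = TfidfVectorizer().fit(toy_docs)

# Pair each term with its IDF weight, using the term -> column index mapping
terms_by_column = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)
idf = dict(zip(terms_by_column, tfidf.idf_))

print(idf["we"])       # occurs in all three documents
print(idf["ethanol"])  # occurs in one document only
```

With the default smoothing, scikit-learn computes the weight as ln((1 + n) / (1 + df)) + 1, so a word present in all n documents receives exactly 1.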