In [1]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import seaborn as sn
import pandas as pd
import numpy as np

### Step 1: Load the data
For this you can use python built in file reader to load the data, you will have to split each row to create list of Tweets(X) and list of sentiment Labels(Y). An alternate option is to use [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html) to load the csv file, then return the two cols in pandas dataframe as arrays.

In [2]:
def load_data(filename):
    """
    Load data from file

    Args:
        filename : Name of the file from which the data is to be loaded
    
    Returns: tweet_X, sentiment_Y
    tweet_X: list of tweets
    sentiment_Y: list of sentiment lables correponding to each tweet
    """
    df = pd.read_csv(filename)
    df.head()
    y = df[['class']]
    x = df[['text']]
    return x,y#update these
tweet_X_task1, sentiment_label_Y_task1=load_data("twitter-sanders-apple2.csv")
tweet_X_task2, sentiment_label_Y_task2=load_data("twitter-sanders-apple3.csv")

### Step 2: Preprocessing
Before we move on to the actual classification section, there is some data cleaning up to be done. As a matter of fact, this step is critical, as it ultimately effects the learning process of your model. So basically [this](https://ibb.co/G9zcx2K) can happen to your model.

We will keep to simple preprocessing to make out task easier. Some of you might disagree with the preprocessing steps stated below, hence we wish to mention that choosing different preprocessing steps depend on the nature of task at hand and alos that there are always trade offs associated augmenting the orignal dataset.<br>

You are required to perform the following preprocessing steps:<br>
<i>(note that you must figure out the order of steps for ease of preprocessing yourself)</i><br>
<ul>
    <li><b>Remove all punctuations.</b><br> Although this makes things simple it is not always a good idea to remove punctuation for example in sentiment classification the use of exclaimation marks, can be used to intensify the sentiment of a given word or sentence hence might be useful, nevertheless it increases the vocabulary size hence we will not use it here. e.g. <blockquote>The opening ceremony was extreamly disappointing!!! #PSL5</blockquote>
        Removing pucntutations also removes emojis which might also be helpful.
    </li><br>
    <li><b>Remove all html tags.</b><br> Mostly done for all tasks </li><br>
    <li><b>Replace all @sometext with AT_TOKEN.</b><br>Since there will be lots of different usernames or words that can follow @ we will choose to ignore them as they dont have much sentiment information. Might be usefull for Named Entity Recognition </li>e.g. 
    <blockquote>@apple will be converted to AT_TOKEN</blockquote>
    <li><b>Remove all # symbols.</b> We want to remove the # in #applesucks so that it is not treated differently that any other word in the sentence, this migh also help us to reduce our vocabulary size.</li><br>
    <li><b>Convert the tweets to lower case text</b><br> Although all CAPS convey meaningful information for sentiment analysis we mostly convert text to lower case to reduce vocabulary size by reducing the number of variations. For example Car, cAr ,CAR, caR and car will now be treated as a single token car.</li><br>
    <li><b>Tokenize the text</b><br></li><br>
</ul>    
    

In [3]:
sentiment_label_Y_task1

Unnamed: 0,class
0,Pos
1,Pos
2,Pos
3,Pos
4,Pos
...,...
474,Neg
475,Neg
476,Neg
477,Neg


In [4]:
for val in tweet_X_task1['text']:
    print(val)

Now all @Apple has to do is get swype on the iphone and it will be crack. Iphone that is
@Apple will be adding more carrier support to the iPhone 4S (just announced)
Hilarious @youtube video - guy does a duet with @apple 's Siri. Pretty much sums up the love affair! http://t.co/8ExbnQjY
@RIM you made it too easy for me to switch to @Apple iPhone. See ya!
I just realized that the reason I got into twitter was ios5 thanks @apple
I'm a current @Blackberry user little bit disappointed with it! Should I move to @Android or @Apple @iphone
The 16 strangest things Siri has said so far. I am SOOO glad that @Apple gave Siri a sense of humor! http://t.co/TWAeUDBp via @HappyPlace
Great up close & personal event @Apple tonight in Regent St store!
From which companies do you experience the best customer service aside from @zappos and @apple?
Just apply for a job at @Apple hope they call me lol
RT @JamaicanIdler: Lmao I think @apple is onto something magical! I am DYING!!! haha. Siri suggested where 

In [5]:
import string

In [6]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [7]:
def remove_punctuation(data):
    for punctuation in string.punctuation:
        if punctuation != '@':
            data = data.replace(punctuation, ' ')
    return data

In [8]:
def remove_trailing(data):
    data = data.replace('@',' ')
    return data

In [9]:
import re

In [10]:
tweet_X_task1.head(50)

Unnamed: 0,text
0,Now all @Apple has to do is get swype on the i...
1,@Apple will be adding more carrier support to ...
2,Hilarious @youtube video - guy does a duet wit...
3,@RIM you made it too easy for me to switch to ...
4,I just realized that the reason I got into twi...
5,I'm a current @Blackberry user little bit disa...
6,The 16 strangest things Siri has said so far. ...
7,Great up close & personal event @Apple tonight...
8,From which companies do you experience the bes...
9,Just apply for a job at @Apple hope they call ...


In [11]:
import nltk

In [12]:
xCount = None

In [13]:
def preprocessing(data):
    """
    Perform preprocessing of the tweets

    Args:
        data : list of tweets
    
    Returns: data: preprocessed list of tweets
    """
    data['text'] = data['text'].replace(to_replace=r'http://t\.co/[A-Za-z0-9]{8}',value=" ",regex=True)
    data['text'] = data['text'].str.lower()
    data['text'] = data['text'].apply(remove_punctuation)
    data['text'] = data['text'].replace(to_replace = r'@[A-Za-z0-9]*',value = 'AT_TOKEN',regex=True)
    data['text'] = data['text'].apply(remove_trailing)
    xCount = data['text']
    data['text'] = data['text'].apply(nltk.word_tokenize)
    return data
    
tweet_X_task1=preprocessing(tweet_X_task1)
tweet_X_task2=preprocessing(tweet_X_task2)
    

In [14]:
tweet_X_task1

Unnamed: 0,text
0,"[now, all, AT_TOKEN, has, to, do, is, get, swy..."
1,"[AT_TOKEN, will, be, adding, more, carrier, su..."
2,"[hilarious, AT_TOKEN, video, guy, does, a, due..."
3,"[AT_TOKEN, you, made, it, too, easy, for, me, ..."
4,"[i, just, realized, that, the, reason, i, got,..."
...,...
474,"[houston, we, have, a, problem, my, ipad, has,..."
475,"[siri, went, down, for, a, little, while, last..."
476,"[AT_TOKEN, should, have, teamed, up, with, AT_..."
477,"[rt, AT_TOKEN, really, AT_TOKEN, what, have, y..."


## Step 3: Feature Extraction
For feature extraction you  will  need  a  way  of  mapping  from  sentences  (lists  of  strings)  to  featurevectors, a process called feature extraction or featurization.<br>
For feature vectors we will explore four techniques:
<ul>
    <li><b>Unigram feature vector</b>
        <br> This will be a sparse vector with length equal to the vocabulary size.   It is sparse because a sentence generally contains only 20-40 unique words, so only this many positions in the vector will be nonzero.
</li><br>
    <li><b>Bigram feature vector</b><br>
    The number of unique bigrams will be significantly lower that unigrams, but this will also be sparse.</li><br>
    <li><b>POS features</b>
    Instead of using the words we will use POS tags, this will be generally not as sparse as unigram and bigram, you can use nltk to get POS tags for a sentence, here the featurevectors length will be equal to the total number of unique POS tags.</li><br>
    <li><b>Complex features</b>
    These can be a combination of unigrams and bigrams or any other complex feature that you can think of, we will explore these in later part of the assignment.</li><br>
</ul>

Firstly we need the extract the vocabulary for each of these different feature vectors.<br> 
For this task use [nltk](https://www.nltk.org/)


In [15]:
import nltk
from nltk import word_tokenize
from nltk.util import ngrams
from collections import Counter
from nltk.tag import pos_tag

In [16]:
def converter(df,n=0,pos=False):
    tempSet = set()
    for val in df['text']:
        if n==1:
            for subval in val:
                tempSet.add(subval)
        elif n>1:
            ngramss = (list(ngrams(val,n)))
            for subval in ngramss:
                tempSet.add(subval)
        elif pos==True:
            tags = pos_tag(val)
            for tag in tags:
                tempSet.add(tag)
    return list(tempSet)
        

In [17]:
def vocabulary_unigram(data):
    """
    Returns list of unique unigrams

    Args:
        data : list of tweets
    
    Returns: vocab_unigram: list of unique unigrams
    """
    return (converter(data,1))
    
    
def vocabulary_bigram(data):
    """
    Returns list of unique bigrams

    Args:
        data : list of tweets
    
    Returns: vocab_unigram: list of unique bigrams
    """
    return converter(data,2)
def vocabulary_POS(data):
    """
    Returns list of unique POS tags

    Args:
        data : list of tweets
    
    Returns: vocab_unigram: list of unique POS tags
    """
    return converter(data,pos=True)
    
vocab_unigram_task1=vocabulary_unigram(tweet_X_task1)
vocab_bigram_task1=vocabulary_bigram(tweet_X_task1)
vocab_pos_tag_task1=vocabulary_POS(tweet_X_task1)

vocab_unigram_task2=vocabulary_unigram(tweet_X_task2)
vocab_bigram_task2=vocabulary_bigram(tweet_X_task2)
vocab_pos_tag_task2=vocabulary_POS(tweet_X_task2)

In [18]:
vocab_unigram_task1

['line',
 'launch',
 'sucks',
 'shit',
 'welcome',
 'because',
 'about',
 'pictures',
 'its',
 'send',
 'much',
 'forces',
 'map',
 'sooo',
 'cost',
 'piece',
 'sick',
 'reminders',
 'include',
 'features',
 'tweet',
 'item',
 '27',
 'except',
 'bloomberg',
 'aaron',
 'geolocation',
 'updater',
 'duplicating',
 'worst',
 'accounts',
 'efffing',
 'word',
 'subscriptions',
 'ipads',
 'timed',
 'featured',
 'keys',
 'get',
 'calls',
 'things',
 'technology',
 'iphone',
 'on',
 'keyboard',
 'authorise',
 'driving',
 'reappeared',
 'kudos',
 'bump',
 'imamac',
 'fucked',
 'thxx',
 'guy',
 'correction',
 'break',
 '9',
 '64gb',
 'option',
 'customer',
 'blaze',
 'through',
 'list',
 'greatly',
 'each',
 'does',
 'name',
 'slide',
 'handle',
 'flashcard',
 '5th',
 'take',
 'houston',
 'can',
 '16',
 'itunesmatch',
 'search',
 'rock',
 'maybe',
 'lolol',
 'class',
 'week',
 'such',
 'me',
 'fruits',
 'desktop',
 'point',
 'wasted',
 'happened',
 'garbage',
 'be',
 'schedule',
 '24',
 'resolve'

In [19]:
vocab_bigram_task1

[('what', 'was'),
 ('told', 'i'),
 ('time', 'AT_TOKEN'),
 ('per', 'day'),
 ('to', 'do'),
 ('that', 'the'),
 ('as', 'i'),
 ('syncing', 'stop'),
 ('to', 'will'),
 ('had', 'nothing'),
 ('jolly', 'ass'),
 ('considering', 'going'),
 ('s', 'right'),
 ('it', 'took'),
 ('you', 'and'),
 ('drain', 'issue'),
 ('bar', 'to'),
 ('an', 'hour'),
 ('snatched', 'siri'),
 ('refurbished', '1st'),
 ('this', 'can'),
 ('fm', 'kevin'),
 ('AT_TOKEN', 'signups'),
 ('it', 'just'),
 ('seemed', 'to'),
 ('of', 'insanely'),
 ('AT_TOKEN', 'unhappy'),
 ('laptop', 'help'),
 ('to', 'go'),
 ('the', 'top'),
 ('video', 'card'),
 ('my', 'ipad'),
 ('customer', 'base'),
 ('game', 'and'),
 ('he', 'gets'),
 ('same', 'co'),
 ('AT_TOKEN', 'hero'),
 ('care', 'protection'),
 ('looking', 'petty'),
 ('the', 'issues'),
 ('service', 'experience'),
 ('pins', 'in'),
 ('you', 'just'),
 ('staff', 'at'),
 ('said', 'to'),
 ('camera', 'in'),
 ('you', 'think'),
 ('any', 'plans'),
 ('know', 'which'),
 ('world', 'fuck'),
 ('apps', 'data'),
 ('be

In [20]:
vocab_pos_tag_task1

[('smallvictory', 'NN'),
 ('cheer', 'VB'),
 ('why', 'WRB'),
 ('corporate', 'JJ'),
 ('app', 'JJ'),
 ('w', 'IN'),
 ('av', 'NN'),
 ('damn', 'NN'),
 ('redeemed', 'VBD'),
 ('write', 'VB'),
 ('hindi', 'NN'),
 ('second', 'JJ'),
 ('french', 'JJ'),
 ('former', 'JJ'),
 ('gave', 'VBD'),
 ('integrate', 'VB'),
 ('intensifies', 'NNS'),
 ('replaceit', 'NN'),
 ('got', 'VBD'),
 ('under', 'IN'),
 ('making', 'VBG'),
 ('registered', 'VBN'),
 ('totally', 'RB'),
 ('fuh', 'NN'),
 ('music', 'NN'),
 ('retrieve', 'VB'),
 ('health', 'NN'),
 ('worth', 'IN'),
 ('lose', 'VB'),
 ('brightness', 'NN'),
 ('AT_TOKEN', 'NN'),
 ('genius', 'NN'),
 ('t', 'IN'),
 ('onto', 'IN'),
 ('full', 'JJ'),
 ('sign', 'VB'),
 ('lesson', 'NN'),
 ('absolutely', 'RB'),
 ('workstation', 'NN'),
 ('inbox', 'NN'),
 ('avail', 'VBP'),
 ('frozen', 'JJ'),
 ('thx', 'VBP'),
 ('that', 'WDT'),
 ('email', 'NN'),
 ('again', 'RB'),
 ('sound', 'VBD'),
 ('article', 'NN'),
 ('need', 'VBP'),
 ('mall', 'NN'),
 ('arrived', 'VBN'),
 ('pissed', 'VBN'),
 ('cracked

### Creating the feature vector
The feature vector that is to be created is in the following format:e.g.
<blockquote>{'simplistic': False, 'silly': True, 'and': True, 'tedious': False, 'its': False, 'so': True, 'laddish': False, 'juvenile': False, 'only': False, 'teenage': False, 'boys': False, 'could': False, 'possibly': False, 'find': False, 'it': False, 'funny': False, 'exploitative': False, 'largely': False, 'devoid': False}</blockquote><br>
Instructions on how to make the feature vectors:
<ul>
    <li><b>For each tweet</b> the feature vector to be constructed will be equal to the <b>size of the vocabulary</b>.</li>
    <li>The feature vector is a dictionary with <b>word and boolean pairs</b>, here the boolean will be <b>true</b> if the specific word exists in the tweet otherwise it will be <b>false</b>.</li>
    
</ul>


In [None]:
def features_unigram(data,vocab):
    """
    Returns list of feature vectors

    Args:
        data : list of tweets
    
    Returns: vocab_unigram: list of unique unigrams
    """
    newList = []
    for i in range(len(data)):
        newDict = {}
        for value in vocab:
            newDict[value]= False
        newList.append(newDict)
    j = 0
    for val in data['text']:
        #print(val)
        for subval in val:
            if subval in vocab:
                newList[j][subval] = True
        j=j+1
    return newList
def features_bigram(data,vocab):
    """
    Returns list of unique bigrams

    Args:
        data : list of tweets
    
    Returns: vocab_unigram: list of unique bigrams
    """
    newList = []
    for i in range(len(data)):
        newDict = {}
        for value in vocab:
            newDict[value] = False
        newList.append(newDict)
    j = 0
    for val in data['text']:
        for subval in val:
            if subval in vocab:
                newList[j][subval] = True
        j = j+1
    return newList
def features_POS(data,vocab):
    """
    Returns list of unique POS tags

    Args:
        data : list of tweets
    
    Returns: vocab_unigram: list of unique POS tags
    """
    newList = []
    for i in range(len(data)):
        newDict = {}
        for value in vocab:
            newDict[value] = False
        newList.append(newDict)
    j = 0
    for val in data['text']:
        for subval in val:
            if subval in vocab:
                newList[j][subval] = True
        j = j+1
    return newList
task1_unigram_features=features_unigram(tweet_X_task1,vocab_unigram_task1)
task1_bigram_features=features_bigram(tweet_X_task1,vocab_bigram_task1)
task1_POS_features=features_POS(tweet_X_task1,vocab_pos_tag_task1)

task2_unigram_features=features_unigram(tweet_X_task2,vocab_unigram_task2)
task2_bigram_features=features_bigram(tweet_X_task2,vocab_bigram_task2)
task2_POS_features=features_POS(tweet_X_task2,vocab_pos_tag_task2)

In [None]:
task1_unigram_features[1]

### Step 4: Test Train Split
Test train split, use [sklearns test train split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to split the data into 80-20 ratio for training and testing. Here we are not using validation set as our dataset is too small.
<blockquote>Note: The lists for labels are overwritten beacause each split should return the same labels</blockquote>

In [None]:
# sentiment_label_Y_task2['class'].value_counts()


In [None]:
# from sklearn import preprocessing

In [None]:
# le = preprocessing.LabelEncoder()
# le.fit_transform(sentiment_label_Y_task2['class'].values)
# le.inverse_transform([0, 0, 1, 0])

In [None]:
# le

In [None]:
# def changeLabels(df):
#     nlist = []
#     for value in df['class']:
#         #print(value)
#         if value=='Pos':
#             nlist.append(0)
#         elif value=='Neg':
#             nlist.append(1)
#         else:
#             nlist.append(2)
#     return np.array(nlist)

In [None]:
def test_train_split(featurevectors,labels):
    """
    Returns test train split data

    Args:
        featurevectors : list of feature vectors
    
    Returns: feature_vector_train, feature_vector_test, label_train, label_test
    """
    X = []
    #a = changeLabels(labels)
    Y = labels
    from sklearn import preprocessing
    #le = preprocessing.LabelEncoder()
    #Y = le.fit_transform(labels['class'])
    for i in range(len(featurevectors)):
        X.append(np.array(list(featurevectors[i].values())))
    X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=0.2,shuffle=True)
    return X_train,X_test,y_train,y_test
task1_unigram_features_train, task1_unigram_features_test, task1_label_train_unigram, task1_label_test_unigram=test_train_split(task1_unigram_features,sentiment_label_Y_task1)

task1_bigram_features_train, task1_bigram_features_test, task1_label_train_bigram, task1_label_test_bigram=test_train_split(task1_bigram_features,sentiment_label_Y_task1)

task1_POS_features_train, task1_POS_features_test, task1_label_train_POS, task1_label_test_POS=test_train_split(task1_POS_features,sentiment_label_Y_task1)


task2_unigram_features_train, task2_unigram_features_test, task2_label_train_unigram, task2_label_test_unigram=test_train_split(task2_unigram_features,sentiment_label_Y_task2)

task2_bigram_features_train, task2_bigram_features_test, task2_label_train_bigram, task2_label_test_bigram=test_train_split(task2_bigram_features,sentiment_label_Y_task2)

task2_POS_features_train, task2_POS_features_test, task2_label_train_POS, task2_label_test_POS=test_train_split(task2_POS_features,sentiment_label_Y_task2)

In [None]:
task1_label_train_unigram

###### Step 5: Modelling and Predictions 

#### Model 1: Niave Bayes
For the purpose of classification we will use [Niave Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) from sklearn. Train the Niave Bayes algorithm on all featurevector variants of task 1 and task 2.<br>
In order to compare and quantify the goodness of our model we will compute a [classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html). We will also create a [confusion matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) to see what mistakes the model might be making.

First program the train_model and predictions function which are general functions that will be used later on.

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

In [None]:
def train_model(model,features,labels):
    """
    Trains a Naive Bayes model 

    Args:
        features : list of feature vectors or X
        labels: targets or Y
        model: Niave Bayes model that is to be trained
    
    Returns: trained model
    """
    model.fit(features,labels)
    return model

In [None]:
def predictions(model,features):
    """
    Returns predictions from a trained model

    Args:
        features : list of feature vectors or X
        model: Niave Bayes model that is to be trained
    
    Returns: prediction list
    """
    y_predict = model.predict(features)
    
    return y_predict

In [None]:
task1_label_train_unigram

##### Using the above functions train different models for different feature vectors and  obtain the models predictions for test set.

In [None]:
#train model on task1_unigram_features_train
###__________Code_Here________________###
gnb = MultinomialNB()
gnb = train_model(gnb,task1_unigram_features_train,task1_label_train_unigram.values.ravel())

###___________________________________###

In [None]:
#Classification report for task1_unigram_features_train
###__________Code_Here________________###
pred=gnb.predict(task1_unigram_features_test)
#print(pred)
#print(type(task1_label_test_unigram))
#print(type(pred))
#print(type(task1_label_test))
###___________________________________###

print(classification_report(task1_label_test_unigram, pred, labels=["Pos", "Neg"]))
cm=confusion_matrix(task1_label_test_unigram,pred)
cm = pd.DataFrame(cm, ['Pos','Neg'], ['Pos','Neg'])# matrix,names row,names col,
# plt.figure(figsize=(10,7))
sn.set(font_scale=1.5) # for label size
sn.heatmap(cm, annot=True, annot_kws={"size": 11}, fmt="d") # font size
plt.show()

In [None]:
##### train model on task1_bigram_features_train
###__________Code_Here________________###
gnb = MultinomialNB()
gnb = train_model(gnb,task1_bigram_features_train,task1_label_train_bigram.values.ravel())
###___________________________________###

In [None]:
#Classification report and confusion matrix for task1_bigram_features_train
###__________Code_Here________________###
pred=predictions(gnb,task1_bigram_features_test)
###___________________________________###

print(classification_report(task1_label_test_bigram, pred, labels=["Pos", "Neg"]))
cm=confusion_matrix(task1_label_test_bigram,pred)
cm = pd.DataFrame(cm, ['Pos','Neg'], ['Pos','Neg'])# matrix,names row,names col,
# plt.figure(figsize=(10,7))
sn.set(font_scale=1.5) # for label size
sn.heatmap(cm, annot=True, annot_kws={"size": 11}, fmt="d") # font size
plt.show()

In [None]:
#train model on task1_POS_features_train
###__________Code_Here________________###
gnb = MultinomialNB()
gnb = train_model(gnb,task1_POS_features_train,task1_label_train_POS.values.ravel())

###___________________________________###

In [None]:
#Classification report for task1_POS_features
###__________Code_Here________________###
pred=predictions(gnb,task1_POS_features_test)
###___________________________________###

print(classification_report(task1_label_test_POS, pred, labels=["Pos", "Neg"]))
cm=confusion_matrix(task1_label_test_POS,pred)
cm = pd.DataFrame(cm, ['Pos','Neg'], ['Pos','Neg'])# matrix,names row,names col,
# plt.figure(figsize=(10,7))
sn.set(font_scale=1.5) # for label size
sn.heatmap(cm, annot=True, annot_kws={"size": 11}, fmt="d") # font size
plt.show()

#### Compare how performance changes for different feature vectors for task1? 

%%capture
###______________Answer_Start_________###
For task 1, in terms of accuracy,precision,recall,f1-score,micro and macro scores, unigram features outperforms pos and bigram features and pos features outperform bigram features. Fscore and precision are illdefined for bigram and pos features because they result in a 0/0 situation with recall equal to 1 for negative class. The reason might be that model gets too much complex in these cases and we have not done any kind of stop words pre-processing
###______________Answer_End_________###

##### Repeat the above steps for task2

In [None]:
#train model on task2_unigram_features_train
###__________Code_Here________________###
gnb = MultinomialNB()
gnb = train_model(gnb,task2_unigram_features_train,task2_label_train_unigram.values.ravel())

###___________________________________###

In [None]:
#Classification report for task2_unigram_features
###__________Code_Here________________###
pred=predictions(gnb,task2_unigram_features_test)
###___________________________________###

print(classification_report(task2_label_test_unigram, pred, labels=["Pos", "Neg","Neutral"]))
cm=confusion_matrix(task2_label_test_unigram,pred)
cm = pd.DataFrame(cm, ["Pos","Neg","Neutral"], ["Pos","Neg","Neutral"])# matrix,names row,names col,
# plt.figure(figsize=(10,7))
sn.set(font_scale=1.5) # for label size
sn.heatmap(cm, annot=True, annot_kws={"size": 11}, fmt="d") # font size
plt.show()

In [None]:
#train model on task2_bigram_features_train
###__________Code_Here________________###
gnb = MultinomialNB()
gnb = train_model(gnb,task2_bigram_features_train,task2_label_train_bigram.values.ravel())

###___________________________________###

In [None]:

#Classification report for task2_bigram_features
###__________Code_Here________________###
pred=predictions(gnb,task2_bigram_features_test)
###___________________________________###

print(classification_report(task2_label_test_bigram, pred, labels=["Pos", "Neg","Neutral"]))
cm=confusion_matrix(task2_label_test_bigram,pred)
cm = pd.DataFrame(cm, ["Pos","Neg","Neutral"], ["Pos","Neg","Neutral"])# matrix,names row,names col,
# plt.figure(figsize=(10,7))
sn.set(font_scale=1.5) # for label size
sn.heatmap(cm, annot=True, annot_kws={"size": 11}, fmt="d") # font size
plt.show()

In [None]:
#train model on task2_POS_features_train
###__________Code_Here________________###
gnb = MultinomialNB()
gnb = train_model(gnb,task2_POS_features_train,task2_label_train_POS.values.ravel())
###___________________________________###

In [None]:

#Classification report for task2_POS_features
###__________Code_Here________________###
pred=predictions(gnb,task2_POS_features_test)
###___________________________________###

print(classification_report(task2_label_test_POS, pred, labels=["Pos", "Neg","Neutral"]))
cm=confusion_matrix(task2_label_test_POS,pred)
cm = pd.DataFrame(cm, ["Pos","Neg","Neutral"], ["Pos","Neg","Neutral"])# matrix,names row,names col,
# plt.figure(figsize=(10,7))
sn.set(font_scale=1.5) # for label size
sn.heatmap(cm, annot=True, annot_kws={"size": 11}, fmt="d") # font size
plt.show()

#####  How does the addition of an extra class affect the learning of our algorithm? Which pairs of classes are most often confused? Why might that be? 

%%capture
###______________Answer_Start_________###
The addition of extra class makes our model predict majority of the testing examples as negative. Neutral and positive classes are mostly confused with  negative classes. The reason might be that the number of training examples are skewed meaning that the classes used for training are mostly negative classes due to which model is classifying all of the training data as negative

###______________Answer_End_________###

### We can further create more complex features by combining different ngram features to  improve performance. We can also remove words that occur rarely in our dataset to reduce the size of our feature vectors. 

#### Model2: Logistic Regression
We will now use [logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to explore the improvements stated above. To create complex feature vectors we will use [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) from sklearn.<br>
We will play around with two parameters of CountVectorizer:
<ul>
    <li>min_df ( = 5): defines the minimum frequency of a word for it to be counted as a feature, in this case the word should have a frequency count of atleast 5.</li>
    <li>ngram_range (= (2,2)): The ngram_range parameter is a tuple. It defines the minimum and maximum length of sequence of tokens considered. In this case, this length is 2. So, this will find sequence of 2 tokens like — ‘but the’, ‘wise man’ etc.</li>
</ul>
Use the ngram_range=(1,2) and min_df=5.

Hint: See the fit and transform functions<br>
<i>If you get stuck, you can look at our solution</i> [here](https://ibb.co/p4BFdvs)
    

    

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
x,y = tweet_X_task1,sentiment_label_Y_task1

In [None]:
x['text']=x['text'].apply(', '.join)

In [None]:
#xTrain,xTest,yTrain,yTest = train_test_split(tweet_X_task1,sentiment_label_Y_task1)

In [None]:
#Need to jo rejoining

In [None]:
#create complex features
###__________Code_Here________________###

vectorizer = CountVectorizer(min_df=5,ngram_range=(2,2))

x = vectorizer.fit_transform(x['text'])
xTrain,xTest,yTrain,yTest = train_test_split(x,sentiment_label_Y_task1)
###___________________________________###

In [None]:
#train logistic regression model
###__________Code_Here________________###
from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression()
logistic.fit(xTrain,yTrain.values.ravel())

###___________________________________###

In [None]:
#train model on complex_feature
###__________Code_Here________________###
pred= logistic.predict(xTest)
#print(pred)
###___________________________________###
#print(len(pred),len(yTest))
print(classification_report(yTest, pred, labels=["Pos", "Neg",]))
cm=confusion_matrix(yTest,pred)
cm = pd.DataFrame(cm, ["Pos","Neg"], ["Pos","Neg"])# matrix,names row,names col,
# plt.figure(figsize=(10,7))
sn.set(font_scale=1.5) # for label size
sn.heatmap(cm, annot=True, annot_kws={"size": 11}, fmt="d") # font size
plt.show()

#### You can play around with ngram_range and min_df to further improve performance. 


Great you have completed the first task now move onto [Part 2](https://ibb.co/MfpxHZh).