# Ronen Reouveni 

# NLP Homework 3: Sentiment Classification 

---

## Introduction 

In this analysis we train sentiment classifiers on airline tweets, then apply them to a set of covid tweets and analyze the resulting sentiment distributions. Each tweet in the airline dataset is labeled as negative, positive, or neutral. The covid tweets have no labels at all, so every evaluation metric is calculated with cross validation within the airline tweets. The hope is that the models generalize to the covid tweets. We can count how many covid sentences receive each sentiment, but we cannot measure classification accuracy on them because there are no annotated labels to compare against.


## Explanation of Cleaning, Preprocessing, and Analysis 

### Naive Bayes 

To begin the cleaning and preprocessing, we remove the '@', '#', and digits from all the tweets in the airline corpus. The objective is to classify sentiment at the sentence level of the covid tweets using the models trained on the airline tweets. Therefore, I iterate over the covid tweets and append each sentence to a new list. An inner loop in this iteration tracks the author and country. This is necessary because a tweet of 3 sentences has a single author but is split into 3 separate sentences, so the author needs to be associated with each one. The same logic applies to country and title. I then check that the resulting lists have matching lengths. Only a quarter of the covid dataset is used. Finally, I extract the labels of the airline tweets and create a list of airline tweets and a list of sentences from the covid set.
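The cleaning and sentence-splitting steps described above can be sketched on toy data as follows. This is a minimal illustration, not the notebook's code: `naive_sentences` is a hypothetical regex stand-in for `nltk.sent_tokenize`, and the `records` list is invented for the example.

```python
import re

def remove_at(text):
    """Strip '@', '#', and digits, mirroring the cleaning step described above."""
    return re.sub(r"[@#0-9]", "", str(text))

def naive_sentences(text):
    """Rough stand-in for nltk.sent_tokenize: split on whitespace after ., !, or ?."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

# Hypothetical records standing in for rows of the covid data.
records = [
    {"text": "Cases are rising. Hospitals are ready!", "author": "A", "country": "US"},
    {"text": "Markets edged higher.", "author": "B", "country": "GB"},
]

sentences, authors, countries = [], [], []
for rec in records:
    for sent in naive_sentences(rec["text"]):
        sentences.append(sent)
        authors.append(rec["author"])      # repeat the metadata once per sentence
        countries.append(rec["country"])

print(remove_at("@united flight #123 was late"))
print(list(zip(sentences, authors, countries)))
```

Because the metadata is appended inside the inner loop, the three lists always have the same length, which is the property the later length checks rely on.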

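At no point do I explicitly remove stopwords or lowercase any text. Note that CountVectorizer and the Keras Tokenizer both lowercase their input by default, so the lack of lowercasing mainly affects the phrase extraction. Although there are still very interesting results, lowercasing would have improved them. This is discussed more in the final analysis and interpretation.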

---

Two separate feature sets are created for two separate Naive Bayes models. I use CountVectorizer from sklearn to both tokenize the text and build the 'bag of words'. To be clear, the bag of words is created on the entire dataset, both the covid tweets and airline tweets. CountVectorizer is especially useful because of the parameters that can be passed into it. The two used in this analysis are ngram_range and min_df. ngram_range sets the minimum and maximum n-gram size. For example, (1,1) yields only unigrams, (1,2) yields unigrams and bigrams, and (1,3) yields unigrams, bigrams, and trigrams, while (2,3) yields only bigrams and trigrams. min_df sets the minimum number of documents an n-gram must appear in; n-grams below that threshold are excluded.
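To make the effect of these two parameters concrete, here is a small pure-Python sketch of n-gram extraction with a document-frequency cutoff. It is not sklearn's implementation: the helper names `ngrams` and `build_vocabulary` are hypothetical, and tokenization is simplified to lowercasing and splitting on whitespace.

```python
from collections import Counter

def ngrams(tokens, ngram_range):
    """Return all n-grams (space-joined) for every n in [lo, hi]."""
    lo, hi = ngram_range
    out = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

def build_vocabulary(docs, ngram_range=(1, 1), min_df=1):
    """Keep the n-grams that occur in at least min_df documents."""
    doc_freq = Counter()
    for doc in docs:
        tokens = doc.lower().split()                       # simplified tokenizer (assumption)
        doc_freq.update(set(ngrams(tokens, ngram_range)))  # count each document once
    return sorted(g for g, df in doc_freq.items() if df >= min_df)

docs = ["the flight was late", "the flight was great", "great crew"]

print(build_vocabulary(docs, (1, 1), min_df=2))  # unigrams only
print(build_vocabulary(docs, (1, 3), min_df=2))  # unigrams, bigrams, trigrams
print(build_vocabulary(docs, (2, 3), min_df=2))  # bigrams and trigrams only
```

Raising `min_df` shrinks the vocabulary, while widening `ngram_range` grows it, which is why the second model in this report can use a lower threshold and still end up with more features.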


1. Complement Naive Bayes:
    * Unigrams 
    * minimum document number: 2000
    * resulting features: 891

2. Complement Naive Bayes:
    * Unigrams, Bigrams, Trigrams
    * minimum document number: 1000
    * resulting features: 2707
    
The CountVectorizer is run on both corpora together, so the n-grams are found across all the data and the minimum document count applies to both corpora combined. However, the Naive Bayes models are trained only on the airline tweets, because that is the only set that has labels. The count vectorization is done on all the data so that both sets share one feature space and the trained models can be applied directly to the covid sentences. I accomplish this by subsetting my dataframes by row index, where the cutoff index is the last row of the airline tweet data set.



### Deep RNN: LSTM 

The preprocessing required to run the LSTM is more involved than for the Naive Bayes models. We first create a tokenizer with a vocabulary limit of 7,500 words, so only about the 7,500 most frequent words are kept when text is encoded. We then create padded sequences with a max length. This processing converts words to numbers, like key-value pairs in a dictionary, while maintaining the order of the words. The max length is set to 200, meaning we analyze a window of up to 200 sequential words. Because language is sequential in nature, we need a model that does not discard word order. This is a major improvement over the Naive Bayes models. We achieve it with an RNN (Recurrent Neural Network), specifically an LSTM.
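The encoding and padding steps can be illustrated with a simplified pure-Python sketch. It only approximates the Keras utilities: the helper names are hypothetical, punctuation filtering is omitted, and ties in word frequency are broken alphabetically here. It does follow the Keras conventions that index 0 is reserved for padding, that only words with an index below `num_words` are kept, and that padding and truncation happen at the front of the sequence by default.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent, 0 is reserved for padding."""
    counts = Counter(w for t in texts for w in t.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: i + 1 for i, (word, _) in enumerate(ranked)}

def texts_to_sequences(texts, word_index, num_words):
    """Encode words as integers, dropping words whose index is >= num_words."""
    return [[word_index[w] for w in t.lower().split()
             if w in word_index and word_index[w] < num_words] for t in texts]

def pad(seqs, maxlen):
    """Left-pad with zeros and keep only the last maxlen tokens."""
    return [[0] * (maxlen - len(s[-maxlen:])) + s[-maxlen:] for s in seqs]

texts = ["the flight was late", "the flight was great", "the crew was great"]
word_index = fit_word_index(texts)
encoded = texts_to_sequences(texts, word_index, num_words=5)
padded = pad(encoded, maxlen=6)

print(word_index)
print(padded)
```

Rare words ("late" and "crew" in this toy corpus) simply disappear from the encoded sequences, while the relative order of the remaining words is preserved.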

The LSTM implemented here also uses dropout, which randomly turns off a portion of the neurons during training to prevent overfitting. The final layer of the model is a dense, fully connected layer with a sigmoid activation function. It has three nodes because we are looking for three outputs: positive, neutral, or negative. The LSTM outputs a score between 0 and 1 for each class, and the classification is the node with the highest score. Because the activation is a sigmoid rather than a softmax, the three scores are not constrained to sum to one. We use SparseCategoricalCrossentropy as the loss function because this is a three-class problem whose labels are encoded as integers (0, 1, 2) rather than one-hot vectors. Furthermore, we use the 'adam' optimizer.
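The difference between per-class scores and a normalized distribution, and the way a sparse cross-entropy loss uses an integer label, can be seen in a short worked example. This is an illustrative sketch in plain Python, not the Keras implementation. The first score vector is one of the LSTM outputs printed later in this report; the logits passed to `softmax` are hypothetical.

```python
import math

labels = ["neutral", "positive", "negative"]   # index order used in this report

def softmax(logits):
    """Convert raw scores into probabilities that sum to one."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(true_index, probs):
    """Loss for one example: negative log of the probability given to the true class."""
    return -math.log(probs[true_index])

def predict_label(scores):
    """Pick the label whose score is highest (argmax)."""
    return labels[max(range(len(scores)), key=scores.__getitem__)]

sigmoid_scores = [0.44376218, 0.46981794, 0.5761415]
print(predict_label(sigmoid_scores))   # argmax over the three scores
print(sum(sigmoid_scores))             # sigmoid scores need not sum to 1

probs = softmax([0.2, 1.5, -0.3])      # hypothetical logits
print(probs, sum(probs))
print(sparse_categorical_crossentropy(1, probs))  # true class given as an integer
```

The argmax decision works the same way with either activation; the difference is only in whether the scores can be read as a probability distribution over the three classes.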

The actual architecture of the LSTM implemented in this analysis is as follows.

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
-----------------------------------------------------------------
embedding_6 (Embedding)      (None, 200, 32)           504608    
_________________________________________________________________
spatial_dropout1d_6 (Spatial (None, 200, 32)           0         
_________________________________________________________________
lstm_6 (LSTM)                (None, 50)                16600     
_________________________________________________________________
dropout_6 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_6 (Dense)              (None, 3)                 153       


* Total params: 521,361
* Trainable params: 521,361
* Non-trainable params: 0


As we can see above, there are 521,361 parameters that are trained and optimized. 
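The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers together with the layer sizes from the summary; the vocabulary size is inferred from the embedding row and is not stated directly in the report.

```python
embedding_dim = 32
lstm_units = 50
n_classes = 3

embedding_params = 504608                        # from the model summary
vocab_size = embedding_params // embedding_dim   # rows in the embedding matrix

# Each of the 4 LSTM gates has input weights, recurrent weights, and a bias.
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Dense layer: one weight per (LSTM unit, class) pair plus one bias per class.
dense_params = lstm_units * n_classes + n_classes

total = embedding_params + lstm_params + dense_params
print(vocab_size, lstm_params, dense_params, total)
```

The inferred vocabulary size is larger than 7,500 because the embedding layer is sized with `len(tokenizer.word_index) + 1`, which counts every word seen during fitting, not only the words kept by `num_words`.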


## Explanation of Analysis

Cross validation will be used on both Naive Bayes models to calculate accuracy scores. The best Naive Bayes model will then be chosen and we will use that model to predict the sentiment of the sentences in the covid tweets set. From here, we can get counts of how many sentences are neutral, positive, and negative. Furthermore, I can extract top adverb, adjective, and verb phrases from both the positive and negative tweets.

Another interesting mode of analysis is looking at the conditional mean for each sentiment classification given the country it comes from. This highlights which countries have the most positive and the most negative sentiment. However, average sentiment is not simple to interpret because of how the data is structured. I execute this by taking the conditional mean count of positive and of negative sentences per country, using the labels predicted by the Naive Bayes model. Then I take the difference between positive and negative. If the result is negative, the sentiment for that country is on average more negative. If it is positive, the sentiment for that country is on average more positive.
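As a small worked example of this calculation, the sketch below computes the mean positive count minus the mean negative count per country. The five rows are the first five groups shown in the output table of this report, so the resulting numbers describe only that tiny sample, not the full data.

```python
from collections import defaultdict
from statistics import mean

# (country, negative_count, positive_count) for one title/author group each.
rows = [
    ("IN", 1, 2),
    ("US", 0, 1),
    ("US", 1, 0),
    ("US", 1, 5),
    ("CA", 3, 7),
]

by_country = defaultdict(lambda: {"neg": [], "pos": []})
for country, neg, pos in rows:
    by_country[country]["neg"].append(neg)
    by_country[country]["pos"].append(pos)

# Positive difference: more positive sentences on average; negative: more negative.
difference = {c: mean(v["pos"]) - mean(v["neg"]) for c, v in by_country.items()}

for c, d in sorted(difference.items(), key=lambda kv: kv[1], reverse=True):
    print(c, round(d, 3))
```

Keeping the positive mean on the left of the subtraction is what gives the sign the interpretation described in the text.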

I will also test the LSTM on some fake sentences to try and understand its nature. After getting an idea of its behavior I will randomly select covid tweets that are labeled as positive and negative by the Naive Bayes and then reclassify them using the LSTM. The hope is to compare tweets classified with opposite sentiments by each classifier. 
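One detail matters when comparing the two classifiers on a random sample: predictions made on the sample are ordered like the sample, so each position has to be mapped back to its index in the full sentence list before looking the sentence up. A minimal sketch with hypothetical labels:

```python
import random

# Hypothetical labels, one per sentence position in the corpus.
nb_labels = ["negative", "positive", "negative", "neutral", "negative", "positive"]
lstm_labels = ["positive", "positive", "negative", "neutral", "positive", "negative"]

random.seed(0)  # fixed seed only so the sketch is repeatable

# Corpus indices that Naive Bayes labeled negative, then a random sample of them.
neg_indices = [i for i, lab in enumerate(nb_labels) if lab == "negative"]
sampled = random.sample(neg_indices, 2)

# Reclassify the sampled sentences; results are ordered like `sampled`.
sample_preds = [lstm_labels[idx] for idx in sampled]

# Map each position in the sample back to its corpus index before lookup.
disagreements = [sampled[pos] for pos, lab in enumerate(sample_preds) if lab == "positive"]
print(disagreements)
```

Every index in `disagreements` refers to a sentence that the first classifier called negative and the second called positive.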




## Code and Output


In [None]:
#import data
import pandas as pd 
import numpy as np
train_dataset = pd.read_csv('train_tweets_airlines.csv') 
test_set = pd.read_csv('test_covid.csv')

In [None]:
import re
import nltk
# Since this is social media data, we will have to add a few extra preprocessing steps
# First, let's remove  @ and # (Twitter platform affordances) from the training data
# We'll use regular expressions for that, creating a Function that we can use to pass the data through
pattern = r'[0-9]'

def remove_at(x):
    x = str(x).replace('@', '')
    x = str(x).replace('#', '')
    x = re.sub(pattern, '', x)
    return x

In [None]:
# Cleaning the data with our function
textList = list(train_dataset['text'].apply(remove_at)) #call above function on text 


textList_test = list(test_set['text'].apply(lambda x: remove_at(x))) #unused

sentList = list(train_dataset['airline_sentiment']) #extract sentiment into new list 

In [None]:
#split covid set into sentences
#save the author, country, title of each sentence into new lists

sentences = []
authors = []
country = []
title = []

for i in range(0, int(len(test_set['text'])/4)): #select quarter of data 
    container = (nltk.sent_tokenize(test_set['text'][i])) #tokenize each tweet by sentence 
    sentences.append(container) #append sentences 
    for k in range(0, len(container)): #append data for each sentence 
        authors.append(test_set['author'][i])  
        title.append(test_set['title'][i])
        country.append(test_set['country'][i])

    

In [None]:
import itertools
#fix data structure created in the previous code, itertools 
sentences = (list(itertools.chain.from_iterable(sentences)))


In [None]:
#ensure the lengths all match up 
print(len(sentences))
print(len(authors))
print(len(country))
print(len(title))

583450
583450
583450
583450


In [None]:
#ensure the first sentence list has correct dimensions 
print(len(textList))

14640


In [None]:
#create list of all data from both airline and covid 
totalList = textList + sentences
print(len(totalList))

598090


In [None]:
#get grams 
from sklearn.feature_extraction.text import CountVectorizer #import module 
vec = CountVectorizer(ngram_range=(1,1), min_df = 2000) #create object with unigrams and 2000 min doc size 
X = vec.fit_transform(totalList) #apply the object to ALL tweets/sentences 
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out()) #save in a dataframe (get_feature_names() was removed in scikit-learn 1.2)

In [None]:
#the shape of the df ensures we have the correct rows 
#14640+583450 = 598090

#note that the above code left us with 891 predictors 
df.shape

(598090, 891)

In [None]:
from sklearn.model_selection import cross_val_score #import cross validation 
from sklearn.naive_bayes import ComplementNB        #import model
clf = ComplementNB() #instantiate model 

#train the classifier on ONLY the airline tweets
#notice we have 14640 rows (airline set), and the sentList, the extracted sentiment labels 
#note: with cv=5 the folds are stratified by label but the rows are NOT shuffled 
scores = cross_val_score(clf, df[0:14640], sentList, cv=5) #call model with 5 folds 
scores

array([0.67588798, 0.64378415, 0.60382514, 0.72096995, 0.70696721])

In [None]:
#cv score of only .67
sum(scores)/5

0.6702868852459016

In [None]:
#repeat same process as above but with the addition of bigram, and trigram
#also note new minimum of 1000 documents 
vec = CountVectorizer(ngram_range=(1,3), min_df = 1000)
X = vec.fit_transform(totalList)
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())

In [None]:
# now we have 2707 columns 
df.shape

(598090, 2707)

In [None]:
#train new NB on ONLY the airline tweets 
#note: with cv=5 the folds are stratified by label but the rows are NOT shuffled 
clf_bi = ComplementNB()
scores_bi = cross_val_score(clf_bi, df[0:14640], sentList, cv=5)
scores_bi

array([0.69740437, 0.6875    , 0.67657104, 0.76058743, 0.73155738])

In [None]:
#new score is up to .71 with unigrams, bigrams, and trigrams 
sum(scores_bi)/5

0.7107240437158471

In [None]:
#re-fit the better model with the data from the airline set 
clf_bi.fit(df[0:14640], sentList)

ComplementNB()

In [None]:
#make predictions on the covid set using the NB model 
testPredictions = clf_bi.predict(df[14640:598090])

In [None]:
#ensure the number of predictions made matches the expected
len(testPredictions)

583450

In [None]:
#get index locations of positive, negative, and neutral sentences
indices_pos = [i for i, x in enumerate(testPredictions) if x == "positive"]

In [None]:
indices_neg = [i for i, x in enumerate(testPredictions) if x == "negative"]

In [None]:
indices_nue = [i for i, x in enumerate(testPredictions) if x == "neutral"]

In [None]:
#view counts 
len(indices_pos)

171743

In [None]:
#view counts 
len(indices_neg)

107066

In [None]:
#view counts 
len(indices_nue)

304641

In [None]:
#make sure the three counts add back up to the expected total
len(indices_pos) + len(indices_neg) + len(indices_nue)

583450

In [None]:
#loop through sentence list and append to a new list 
pos_sents = []

for i in indices_pos:
    pos_sents.append(sentences[i])

In [None]:
#ensure num of pos sentences matches len of indices 
len(pos_sents)

171743

In [None]:
#look at first positive sentence 
pos_sents[0]

'Rajiv Gandhi Institute of Chest Diseases (RGICD) with 15 beds and Wenlock Hospital at Mangaluru with 10 beds have been selected for the treatment of the virus.'

In [None]:
#fix country data type 
country = list(country)

In [None]:
#zip and join all in df 
subDF = pd.DataFrame(list(zip(title, authors, country, testPredictions)), columns =['title', 'author','country','sentiment'])

In [None]:
#turn sentiment labels using one hot encoding 
subDF = pd.get_dummies(subDF, columns=['sentiment'])


In [None]:
#group by to get appropriate table 
finaldf = subDF.fillna('none').groupby(['title','author', 'country'], sort = False).sum()

In [None]:
#view table 
finaldf

| title | author | country | sentiment_negative | sentiment_neutral | sentiment_positive |
| --- | --- | --- | --- | --- | --- |
| Karnataka: Helplines, isolation wards set up for coronavirus - Udayavani | Udayavani | IN | 1.0 | 5.0 | 2.0 |
| Health dept. monitoring 24 people for possible infection | none | US | 0.0 | 2.0 | 1.0 |
| none | jmccorm | US | 1.0 | 1.0 | 0.0 |
| Asian Markets Mostly Higher | rttnews.com | US | 1.0 | 16.0 | 5.0 |
| Tesla soars as bearish analysts left with little to highlight - BNN Bloomberg | Joe Easton | CA | 3.0 | 14.0 | 7.0 |
| ... | ... | ... | ... | ... | ... |
| Coronavirus: expert warns infection could reach 60% of world's population \| World news \| The Guardian | Sarah Boseley | US | 0.0 | 1.0 | 0.0 |
| Viruses don’t care if you’re lying or not | Neil Steinberg | US | 29.0 | 26.0 | 13.0 |
| Australian Dollar Edges Higher as Aussie Home Loans Beat Forecasts | Colin Lawrence | GB | 1.0 | 22.0 | 4.0 |
| Eight days in Wuhan, cut off from the world | none | ZA | 25.0 | 22.0 | 24.0 |


In [None]:
#write frame to csv 
finaldf.to_csv('hm3_nlp_table.csv', index=True)

In [None]:
#note: average sentiment is not simple to interpret because of how the data is structured 

In [None]:
#average negative sentiment 
finaldf['sentiment_negative'].mean()

4.49271956695061

In [None]:
#average positive sentiment 
finaldf['sentiment_positive'].mean()

7.206705551592464

In [None]:
#get conditional mean of NEGATIVE sentence counts by country 
avgSentiment = finaldf.groupby('country')['sentiment_negative'].mean()
avgSentiment

country
IN    2.771093
US    5.402157
CA    5.180972
DE    2.404110
GB    5.002394
        ...   
LT    0.000000
NP    3.000000
RS    6.000000
AZ    3.000000
KG    1.000000
Name: sentiment_negative, Length: 127, dtype: float64

In [None]:
#get conditional mean of POSITIVE sentence counts by country (note: stored under the name avgSentiment_neg) 
avgSentiment_neg = finaldf.groupby('country')['sentiment_positive'].mean()
index = avgSentiment_neg.index
avgSentiment_neg

country
IN     5.148686
US     8.382962
CA     7.816960
DE     5.047945
GB     7.144225
        ...    
LT     4.000000
NP    11.000000
RS     5.666667
AZ     1.000000
KG     0.500000
Name: sentiment_positive, Length: 127, dtype: float64

In [None]:
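#create a new frame of the mean difference between pos and neg sentences by country 
#caution: avgSentiment holds the negative means and avgSentiment_neg holds the positive means,
#so the 'pos' column below actually contains the negative means and 'neg' the positive means,
#and sent_difference works out to (mean negative - mean positive)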
avgSentFrame = pd.DataFrame(list(zip(index, avgSentiment, avgSentiment_neg)), columns = ['country','pos','neg'])
avgSentFrame['sent_difference'] = avgSentFrame['pos'] - avgSentFrame['neg']
avgSentFrame.sort_values(by = 'sent_difference', ascending = False)


| | country | pos | neg | sent_difference |
| --- | --- | --- | --- | --- |
| 35 | BR | 10.307692 | 3.000000 | 7.307692 |
| 108 | PR | 8.000000 | 5.000000 | 3.000000 |
| 87 | BT | 6.666667 | 4.333333 | 2.333333 |
| 72 | MU | 8.875000 | 6.625000 | 2.250000 |
| 116 | BZ | 5.500000 | 3.500000 | 2.000000 |
| ... | ... | ... | ... | ... |
| 123 | NP | 3.000000 | 11.000000 | -8.000000 |
| 22 | SO | 4.555556 | 14.333333 | -9.777778 |
| 102 | PL | 8.000000 | 18.666667 | -10.666667 |
| 99 | CM | 6.000000 | 22.000000 | -16.000000 |


In [None]:
#how does USA look 
print('Global average',avgSentFrame['sent_difference'].mean())
print(avgSentFrame.loc[avgSentFrame['country'] == 'US'])

Global average -2.2353136235379365
  country       pos       neg  sent_difference
1      US  5.402157  8.382962        -2.980805


In [None]:
#get modules needed 
import nltk
from nltk import sent_tokenize
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/ronenreouveni/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [None]:
#split text and tag
tokentext = [nltk.word_tokenize(sent) for sent in pos_sents] #tokenize words within sentences
taggedtext = [nltk.pos_tag(tokens) for tokens in tokentext] #apply POS tags to each token generated in the previous line

In [None]:
#create function to get phrases 
def findPhrases(inpuText, phrase):
  #build a chunk parser from the regex grammar
  chunk_parser = nltk.RegexpParser(phrase)

  #parse each tagged sentence and collect the subtrees labeled 'test'
  myTags = []
  for sent in inpuText:
    if len(sent) > 0:
      tree = chunk_parser.parse(sent)
      for subtree in tree.subtrees():
        if subtree.label() == 'test':
          myTags.append(subtree)

  #join the words of each chunk into one phrase string (trailing space kept)
  myPhrase = []
  for chunk in myTags:
    temp = ''
    for w, t in chunk:
      temp += w + ' '
    myPhrase.append(temp)

  #calculate average length of phrases in characters (0 if none were found)
  avgLeng = sum(len(p) for p in myPhrase) / len(myPhrase) if myPhrase else 0

  #calculate frequency distro
  freq_ = nltk.FreqDist(myPhrase)
  distro = freq_.most_common(50)

  #turn data into pandas frame
  df = pd.DataFrame(distro, columns=['Phrase', 'Freq'])
  df.sort_values('Freq', inplace=True)

  #return an object with the info needed for analysis
  return (distro, df, avgLeng)

In [None]:
#function to loop through object output from findPhrases and print results 
def printFreq(obj, title):
  print(title)
  for word, freq in obj:
    print(word, freq)

In [None]:
adjPhrases = findPhrases(taggedtext,"test: {<RB.?>+<JJ.?>}") #"adjective phrases" 

In [None]:
#call functions on RB + RB
adverbPhrases = findPhrases(taggedtext,"test: {<RB>+<RB>}") #"adverb phrases"

In [None]:
#verbs 
verbverb = findPhrases(taggedtext,"test: {<VB.?>+<VB.?>}") #verbs+verbs

In [None]:
#combine all into dataframe of tuples with phrase and count
df_positive = pd.DataFrame(list(zip(adjPhrases[0], adverbPhrases[0], verbverb[0])), columns =['adjective phrase', 'adverb phrase','verb phrase']) 
df_positive

| | adjective phrase | adverb phrase | verb phrase |
| --- | --- | --- | --- |
| 0 | (late last , 329) | (so far , 2111) | (have been , 1824) |
| 1 | (most important , 203) | (as well , 2010) | (has killed , 1778) |
| 2 | (as much , 167) | (So far , 586) | (has been , 1093) |
| 3 | (so much , 147) | (not just , 217) | (have tested , 806) |
| 4 | (too much , 147) | (not only , 201) | (has infected , 768) |
| 5 | (too early , 143) | (right now , 190) | (have died , 764) |
| 6 | (so many , 143) | (once again , 180) | (has spread , 745) |
| 7 | (very early , 143) | (very much , 150) | (have been reported , 591) |
| 8 | (extremely well-prepared , 129) | (as soon , 124) | (have been confirmed , 587) |
| 9 | (very good , 121) | (not yet , 122) | (had been , 573) |


In [None]:
#loop through sentence list and append to a new list 
neg_sents = []

for i in indices_neg:
    neg_sents.append(sentences[i])

In [None]:
tokentext_neg = [nltk.word_tokenize(sent) for sent in neg_sents] #tokenize words within sentences
taggedtext_neg = [nltk.pos_tag(tokens) for tokens in tokentext_neg] #apply POS tags to each token generated in the previous line

In [None]:
adjPhrases_neg = findPhrases(taggedtext_neg,"test: {<RB.?>+<JJ.?>}") #"adjective phrases" 

In [None]:
#call functions on RB + RB
adverbPhrases_neg = findPhrases(taggedtext_neg,"test: {<RB>+<RB>}") #"adverb phrases"

In [None]:
verbverb_neg = findPhrases(taggedtext_neg,"test: {<VB.?>+<VB.?>}") #verbs+verbs

In [None]:
#combine all into dataframe of tuples with phrase and count
df_neg = pd.DataFrame(list(zip(adjPhrases_neg[0], adverbPhrases_neg[0], verbverb_neg[0])), columns =['adjective phrase', 'adverb phrase','verb phrase']) 
df_neg

| | adjective phrase | adverb phrase | verb phrase |
| --- | --- | --- | --- |
| 0 | (so many , 125) | (right now , 323) | (has been , 1455) |
| 1 | (not available , 86) | (so far , 274) | (have been , 1290) |
| 2 | (Most Popular , 82) | (as well , 263) | (had been , 581) |
| 3 | (safer self-quarantining , 72) | (not yet , 235) | (is happening , 489) |
| 4 | (not clear , 69) | (not only , 229) | (’ s , 463) |
| 5 | (very close , 65) | (not just , 187) | (is going , 344) |
| 6 | (not sure , 65) | (else right , 154) | (are going , 254) |
| 7 | (already fully disinfect , 63) | (not necessarily , 143) | (’ re , 228) |
| 8 | (too late , 61) | (not immediately , 116) | (are getting , 206) |
| 9 | (most important , 60) | (no longer , 90) | (have seen , 186) |


## LSTM 

In [None]:
#give credit to source of tutorial (although changes were implemented)
#https://medium.datadriveninvestor.com/deep-learning-lstm-for-sentiment-analysis-in-tensorflow-with-keras-api-92e62cde7626

In [None]:
tweet_df = train_dataset[['text','airline_sentiment']] #new df with relevant info 


In [None]:
sentiment_label = tweet_df.airline_sentiment.factorize() #turn named sentiment into factors(0,1,2)

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer #import modules 
from tensorflow.keras.preprocessing.sequence import pad_sequences #import modules 

#get values 
tweet = tweet_df.text.values
tokenizer = Tokenizer(num_words=7500) #tokenizer with 7,500 size vocab
tokenizer.fit_on_texts(tweet) #apply tokenizer
vocab_size = len(tokenizer.word_index) + 1 #get size 
encoded_docs = tokenizer.texts_to_sequences(tweet) #encode to numbered sequence
padded_sequence = pad_sequences(encoded_docs, maxlen=200) #pad or truncate every sequence to length 200

In [None]:
# Build the model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM,Dense, Dropout
from tensorflow.keras.layers import SpatialDropout1D
from tensorflow.keras.layers import Embedding

#set vector length
embedding_vector_length = 32
model = Sequential() #instantiate sequential model

#add all layers 

#dropout turns off neurons to prevent overfitting 
model.add(Embedding(vocab_size, embedding_vector_length,input_length=200) )
model.add(SpatialDropout1D(0.25))
model.add(LSTM(50, dropout=0.5, recurrent_dropout=0.5))
model.add(Dropout(0.2))

#dense layer has 3 outputs for the 3 classes 
model.add(Dense(3, activation='sigmoid'))
model.compile(loss='SparseCategoricalCrossentropy',optimizer='adam', metrics=['accuracy'])
print(model.summary())

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 200, 32)           504608    
_________________________________________________________________
spatial_dropout1d_6 (Spatial (None, 200, 32)           0         
_________________________________________________________________
lstm_6 (LSTM)                (None, 50)                16600     
_________________________________________________________________
dropout_6 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_6 (Dense)              (None, 3)                 153       
Total params: 521,361
Trainable params: 521,361
Non-trainable params: 0
_________________________________________________________________
None


In [None]:
#fit everything with validation split at 10% and run for 5 epochs
history = model.fit(padded_sequence,sentiment_label[0],validation_split=0.1, epochs=5, batch_size=32)

#validation accuracy of .8436, a massive improvement on NB

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [None]:
#create ordered list of sentiment for getting back label from an index 
#factorize() returns the labels in code order; for this data that is ['neutral', 'positive', 'negative']
sentiment = list(sentiment_label[1])

In [None]:
#create function to test string
def testString(test_word):
    tw = tokenizer.texts_to_sequences([test_word])
    tw = pad_sequences(tw,maxlen=200)
    prediction = model.predict(tw)
    print(prediction)
    print(sentiment[np.argmax(prediction)])

In [None]:
testString("they were amazingly bad")

[[0.44376218 0.46981794 0.5761415 ]]
negative


In [None]:
testString("this is the best thing I have done")

[[0.38965803 0.9004164  0.11448115]]
positive


In [None]:
testString("people said they were amazingly bad, but I actually really like them")

[[0.27355775 0.3706975  0.8630587 ]]
negative


In [None]:
#import module
from random import sample 

In [None]:
#randomly sample sentences labeled negative by Naive Bayes 
testNegs = sample(indices_neg,300)

In [None]:
#hold LSTM predictions
deepSents = []

#loop through the negative tweets and classify them using the LSTM
for i in testNegs:
    tw = tokenizer.texts_to_sequences([sentences[i]])
    tw = pad_sequences(tw,maxlen=200)
    prediction = model.predict(tw)
    deepSents.append(sentiment[np.argmax(prediction)])

In [None]:
#find the tweets that have opposite classification as Naive Bayes 
#positions in deepSents line up with testNegs, so map each one back to its sentence index 
indTest = [testNegs[i] for i, x in enumerate(deepSents) if x == "positive"]

In [None]:
#tagged as positive by LSTM and negative by NB
for i in indTest:
    print()
    print(sentences[i])


Toronto Public Health will monitor the patient while he continues to recover at home, where his wife is also in self-isolation.

Dr Tedros, speaking at the press conference in Geneva, described the virus as an "unprecedented outbreak" that has been met with an "unprecedented response".

The WHO declares a Public Health Emergency of International Concern when there is "an extraordinary event which is determined … to constitute a public health risk to other states through the international spread of disease".

Although questions have been raised about transparency, the WHO has praised China's handling of the outbreak.

The province of 60 million people is home to Wuhan, the heart of the outbreak.

Peter Morris, chief economist at Ascend by Cirium, said: “Cirium data clearly shows the dramatic impact that coronavirus is having, with nearly 10,000 scheduled flights to, from and within China being suspended between January 23 and 28.

Beijing, which has only just started to mend tattered t

In [None]:
#same as above but for positives 
testPos = sample(indices_pos,30)

deepSents = []

for i in testPos:
    tw = tokenizer.texts_to_sequences([sentences[i]])
    tw = pad_sequences(tw,maxlen=200)
    prediction = model.predict(tw)
    deepSents.append(sentiment[np.argmax(prediction)])

In [None]:
#tagged as negative by LSTM and positive by NB
#positions in deepSents line up with testPos, so map each one back to its sentence index 
indTest = [testPos[i] for i, x in enumerate(deepSents) if x == "negative"]
for i in indTest:
    print()
    print(sentences[i])


All district hospitals will have five beds isolated for patients carrying the virus.

104 Arogya Sahayavani helpline run by the Health Department will take all calls related to the virus.

The government making sure that the new coronavirus does not make its way to the country.

Apart from more people falling sick (as bad as that is), is there a more fundamental concern that if it runs wild in a less developed country, it'll mutate into something more dangerous?

Among the major miners, Fortescue Metals is advancing more than 1 percent, BHP is adding 1 percent and Rio Tinto is up 0.2 percent.

Among the big four banks, ANZ Banking, Commonwealth Bank and National Australia Bank are rising in a range of 0.4 percent to 0.5 percent, while Westpac is edging up 0.1 percent.

Commonwealth Bank, which reports half-year results on February 12, said it will make a provision of A$83 million for insurance claims related to the recent bushfires in Australia.

Gold miners are weak despite gold pric

## Interpretation of the results

### NB 1 
Both Naive Bayes models are trained on only the airline data. The first is a Complement Naive Bayes model. The documentation states that Complement Naive Bayes is built to handle unbalanced datasets, so it seemed like the correct choice for this problem. The initial model uses only unigrams. For a unigram to be admitted as a predictor, it must show up in at least 2,000 documents. This helps keep the model from overfitting to rare words. The hope is also to avoid overfitting sentiment to the name of the airline. Based on this limit, 891 unigrams are chosen as predictors.

Cross validation is essential for truly measuring the validity of the model. Five fold cross validation is used and the rates are as follows. 

[0.67588798, 0.64378415, 0.60382514, 0.72096995, 0.70696721]

As we can see, the best fold is .72 (72% correct) but the worst is .60 (60%). This spread is notable, though not in a good way: it shows large variability depending on which fold is held out, so the model is potentially somewhat unstable.

The average of these is .67. Although this seems bad, it is interesting that a simple Naive Bayes classifier with only 891 unigrams can accurately classify a 3 class problem 67% of the time. It does need improvement. 


### NB 2 

The second Naive Bayes model is also a Complement Naive Bayes model. However, this time we use unigrams, bigrams, and trigrams. For any of these to be admitted as a predictor, it must be found in at least 1,000 of the documents. This should also help prevent overfitting on something like airline names. The model results in 2,707 features.

The cross validation scores for this model are as follows.

[0.69740437, 0.6875 , 0.67657104, 0.76058743, 0.73155738]

The highest fold is .76 or 76%. The lowest fold is .67 or 67%. Compared to the previous model, the folds as a whole have a lower standard deviation. This is better, because less variability means that our model is more stable and consistent across folds. The average rate is .71 or 71%. There is still much room for improvement, but this is an interesting baseline to compare against the LSTM, which is the main model of this report.

It is important to reiterate that both Naive Bayes models are trained on only the airline data.
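The comparison of stability between the two models can be checked directly from the fold scores reported above. This short sketch uses the standard library's sample standard deviation; the dictionary keys are just descriptive labels chosen for the example.

```python
from statistics import mean, stdev

fold_scores = {
    "unigrams, min_df=2000": [0.67588798, 0.64378415, 0.60382514, 0.72096995, 0.70696721],
    "1-3 grams, min_df=1000": [0.69740437, 0.6875, 0.67657104, 0.76058743, 0.73155738],
}

summary = {name: (mean(s), stdev(s)) for name, s in fold_scores.items()}

for name, (avg, sd) in summary.items():
    print(f"{name}: mean={avg:.4f}, std={sd:.4f}")
```

A lower standard deviation alongside a higher mean supports choosing the second model for the predictions on the covid sentences.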

### Phrases: Sentiment Predictions (Naive Bayes)

We will use the better of the two Naive Bayes models to make predictions on every single sentence in the covid set (1/4 of actual data). This allows us to view the counts of positive, negative, and neutral sentiment for each sentence in the covid tweet data. The result is 171,743 positive sentences, 107,066 negative sentences, and 304,641 neutral sentences. This is clearly imbalanced. Furthermore, summing all of these results in 583,450 sentences, showing that we have made the correct number of predictions. 
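To put the imbalance in proportion, the counts above can be converted to shares of the total. This is simple arithmetic on the numbers reported in this section.

```python
counts = {"positive": 171743, "negative": 107066, "neutral": 304641}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

print(total)
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

Neutral sentences make up a little over half of the predictions, with positive sentences clearly outnumbering negative ones.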

From here, we build a table that has title, author, country, negative, neutral, and positive as the columns. This allows us to calculate the conditional mean number of positive and negative sentences given country. This is more involved than it may seem. First, we get the conditional averages for positive and negative sentiment. However, we cannot simply order these and analyze the result, because each mean regards only a single sentiment type. For example, a country may have the largest average number of positive sentences but also many negative ones. Therefore, I take the difference between the average positive and negative sentences given the country. As reported, BR (Brazil) has by far the highest difference score at 7.3, BA (Bosnia and Herzegovina) has the lowest at -17, the global average is -2.2, and the United States scores just under the average at -3. One caveat is essential when interpreting these scores: in the code, the column labeled pos is filled with the mean negative counts and the column labeled neg with the mean positive counts (for the US, pos = 5.40 matches the negative mean and neg = 8.38 the positive mean). The score is therefore effectively mean negative minus mean positive, so a high score indicates more negative sentences. Read that way, Brazil is on average the most negative country and Bosnia and Herzegovina the most positive, and the global average of -2.2 indicates that positive sentences outnumber negative ones overall, which agrees with the overall means of 7.2 positive versus 4.5 negative sentences. All of this assumes that the model is classifying sentiment correctly.

---

##### Phrase extraction and issues 


Next, we extract adverb, verb, and adjective phrases for all the positive and negative sentences in the covid tweet data set. It is important to note some issues with the results. There were clearly tokenization issues, such as 's being counted as a word, and these occur in all three phrase groups. Deeper investigation into the tokenizer is needed to resolve them. In addition, the data is never lowercased, so the lowercase and uppercase versions of the same phrase are counted as different phrases. This also needs to be fixed moving forward. For the purpose of this report, we were still able to obtain very interesting results that are worth exploring. 
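As a rough sketch of what this kind of extraction involves, the function below pulls adjective phrases out of a sentence that has already been part-of-speech tagged. The pattern (one or more adverbs followed by an adjective) and the Penn Treebank style tags are assumptions made for illustration; the exact patterns and tagger used for this report may differ.

```python
def adjective_phrases(tagged_tokens):
    """Collect phrases of the form adverb(s) + adjective.

    `tagged_tokens` is a list of (token, tag) pairs with Penn Treebank
    style tags (RB* for adverbs, JJ* for adjectives). This pattern is an
    illustrative assumption, not the report's actual grammar.
    """
    phrases = []
    adverbs = []
    for token, tag in tagged_tokens:
        if tag.startswith("RB"):
            adverbs.append(token)  # keep collecting adverbs
        elif tag.startswith("JJ") and adverbs:
            phrases.append(" ".join(adverbs + [token]))
            adverbs = []
        else:
            adverbs = []  # any other tag breaks the phrase
    return phrases


# Hand-tagged toy sentence, so no external tagger is needed.
tagged = [
    ("It", "PRP"), ("was", "VBD"), ("too", "RB"), ("late", "JJ"),
    ("and", "CC"), ("not", "RB"), ("clear", "JJ"), (".", "."),
]
found = adjective_phrases(tagged)
print(found)
```

Because the phrase is built from whatever tokens the tokenizer produces, a token such as 's ends up inside a phrase whenever the tokenizer splits it off and the tagger labels it with a matching tag.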

#### Adjective Phrases (Positive and Negative) 

Through living with covid and being a part of this history, we understand that positive covid results often come from early action. Negative results often come from late action or inaction altogether. Amazingly, we can see this in the adjective phrases. Many of the top ten adjective phrases for the positive sentences have to do with early action or preparedness. The 8th phrase is (extremely well-prepared , 129) and the 7th is (very early , 143). This is a stark contrast to the negative adjective phrases, where one of the top ten is (too late , 61). Another interesting example of this is that the positive list has the phrase (too early , 143) while the negative has the phrase (too late , 61). It is important to note that there are many similarities between both lists. For example, the positive list of adjective phrases has phrases like (too much , 147), (so far killed , 88), and (more serious , 117). These could all be interpreted as negative. This is to be expected, though, as the subject of covid itself is very negative. A distinct theme found in the negative list of adjective phrases is that of the unknown. For example, some of the top adjective phrases in the negative list are (not clear , 69), (not available , 86), and (not sure , 65). Although both lists are negative, we can draw some interesting differences as described above. 


#### Adverb Phrases (Positive and Negative) 

The adverb phrases as a whole are less interesting than the adjective phrases because the lists are extremely similar. More preprocessing may help, but it may simply be that the differences in speech between positive and negative sentences are small in this specific case. For example, 5 of the first 6 phrases in both the negative and positive adverb lists are the same, although they appear in different orders. The only difference is that the negative list has (not yet , 235) while the positive does not. The five shared phrases are (so far , 2111), (as well , 2010), (not just , 217), (not only , 201), and (right now , 190). The fact that the lists are similar implies that these adverbs are simply a part of the speech used and are uncorrelated with sentiment. This is also an opportunity to point out an issue discussed above. The positive list has (so far , 2111) and (So far , 586). This should be fixed by lowercasing the text in the preprocessing stage of the data pipeline. 
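The effect of the missing lowercasing step can be illustrated by merging case variants after the fact. This is a minimal sketch using the counts quoted above; the function name `merge_case_variants` is hypothetical, and in practice it would be cleaner to lowercase the text before extracting phrases.

```python
from collections import Counter


def merge_case_variants(phrase_counts):
    """Combine counts of phrases that differ only by letter case."""
    merged = Counter()
    for phrase, count in phrase_counts.items():
        merged[phrase.lower()] += count
    return merged


# Counts taken from the positive adverb list discussed above.
positive_adverbs = {"so far": 2111, "So far": 586, "as well": 2010}

merged = merge_case_variants(positive_adverbs)
print(merged.most_common())
```

With the two variants merged, "so far" has a combined count of 2,697 in the positive list.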



#### Verb Phrases (Positive and Negative) 

The verb phrases, especially in the negative list, have many issues. There are many examples like ('s going , 159) that are not correct. This is another example of something that should be fixed in the tokenization process. Although there are differences between the lists, the positive list is very negative. This makes it difficult to assess how these sentences were tagged as positive when focusing on the verb phrases. Some of these include (has killed , 1778), (has infected , 768), (have died , 764), and (have been reported , 591). Needless to say, these do not seem positive in nature. The negative list includes things like (has been , 1455), (have been , 1290), and (is happening , 489). These results imply that some preprocessing should have been done to remove uninformative words such as these. 

An important insight here is that we can often learn from what we do not find. The fact that the verbs mentioned above, which appear very negative, are so prevalent in the positive list actually tells us something interesting. The Naive Bayes model is trained on the airline data, which most likely does not include words like killed and infected. Therefore, when the model comes across them in the covid sentences, it has little or no sentiment evidence from training to draw on. This calls into question whether a model trained on one dataset can generalize to another that is very different. 
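One possible mechanism behind this can be sketched with a toy calculation. The sketch assumes a multinomial Naive Bayes model with additive (Laplace) smoothing, which this report does not confirm, and every count below is made up. Under that assumption, a word that never appears in training gets the likelihood `alpha / (class_total + alpha * vocab_size)` in each class, so it is not neutral: it slightly favors whichever class has fewer training tokens.

```python
import math


def smoothed_log_likelihood(word, class_word_counts, vocab_size, alpha=1.0):
    """log P(word | class) under additive smoothing (multinomial NB)."""
    class_total = sum(class_word_counts.values())
    count = class_word_counts.get(word, 0)
    return math.log((count + alpha) / (class_total + alpha * vocab_size))


# Hypothetical training counts; the negative class has more tokens overall.
negative_counts = {"delayed": 30, "late": 20, "flight": 50}
positive_counts = {"thanks": 15, "great": 10, "flight": 15}

# Hypothetical vocabulary size, built over both corpora so that
# covid-only words are in the vocabulary but have zero training count.
vocab_size = 8

unseen = "killed"
ll_negative = smoothed_log_likelihood(unseen, negative_counts, vocab_size)
ll_positive = smoothed_log_likelihood(unseen, positive_counts, vocab_size)
print(f"negative: {ll_negative:.3f}, positive: {ll_positive:.3f}")
```

In this toy setup the unseen word scores higher under the positive class purely because that class has fewer tokens, not because of anything the word means. Whether the same effect drives the results above depends on the class sizes in the airline data and on the model variant used.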




### Deep RNN: LSTM 

The LSTM performs far better at this task than the models above. The best cross-validated Naive Bayes model scored about .72 or 72%. On the validation set the LSTM scores .8436 or 84.36%. This is a massive improvement on what was previously achieved. After seeing this result on the validation set, I started testing arbitrary strings to see the predicted label. I was initially impressed: it was able to classify 'they were amazingly bad' as 'negative'. This is impressive because the sentence combines 'amazingly', which usually signals positive sentiment, with 'bad'. With very high confidence it classified 'this is the best thing I have done' as 'positive'. However, I was able to confuse the model. I created the string, "people said they were amazingly bad, but I actually really like them". This was classified as 'negative' with 86% certainty. 

From here I tried using the LSTM to predict every sentence that was predicted by the Naive Bayes model; however, this was too computationally expensive. Therefore, I simply sampled some sentences that Naive Bayes labeled positive or negative to see if the LSTM would classify them with the opposite sentiment. This was actually often the case. For example, the string 'Although questions have been raised about transparency, the WHO has praised China's handling of the outbreak.' was classified as 'positive' by the LSTM and 'negative' by the Naive Bayes model. In my opinion, the LSTM gets this right. On the other hand, the string 'ResMed reported a 13 percent increase in revenue for the second quarter from last year, reflecting growing demand for masks and other medical accessories.' is classified as 'negative' by the LSTM and 'positive' by the Naive Bayes model. In this case, I actually think the Naive Bayes model is correct. 

In conclusion, I think that training the models on more relevant data is desirable. There are words in one corpus that are not in the other, and an external sentiment lexicon would help remedy this vocabulary issue. However, I deliberately evaluated these models without sentiment lexicons because I wanted to see how they perform when relying only on the vocabulary itself. 