### Week 10 | Natural Language Processing (NLP) using Naive Bayes to predict the sentiment of Tweets
<hr>
## Learning Objectives
At the end of this lesson, you will be able to:

- Understand that the Naive Bayes classifier is a simple classifier that assigns classes based on the probabilities of events

- Use Naive Bayes to classify texts into categories

- Understand how to express the probability of a category given a sentence as a fraction using Bayes' theorem

- Know that the denominator can be ignored, since it is the same for all categories

- Apply the pandas `groupby` function to calculate the right side of the numerator, which is the probability of each category

- Identify that the left-hand side of the numerator needs to be split into individual words: the exact sentence to be predicted is unlikely to appear in the training data, so the probability of the whole sentence given the category would be zero

- Apply `CountVectorizer` from `sklearn` to obtain the word frequencies, in order to find the individual word probabilities in a given category

- Remember to use Laplace smoothing to stop the entire numerator from becoming zero because a single word probability is zero, which happens when a word does not appear in the training data for that category

- Understand how to combine the left- and right-hand sides of the numerator to give the required probability of each category given a sentence to be predicted

- Use these probabilities to predict the category of the given sentence

- Credits: 
 1. https://medium.com/analytics-vidhya/naive-bayes-classifier-for-text-classification-556fabaf252b
 2. https://www.kaggle.com/c/tweet-sentiment-extraction/

## Twitter sentiment extraction
- "My ridiculous dog is amazing." [sentiment: positive]

- With so many tweets circulating every second, it is hard to tell whether the sentiment behind a specific tweet will boost a company's or a person's brand by going viral (positive), or devastate profit because it strikes a negative tone.
- Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.
- Each row contains the text of a tweet and a sentiment label. In the training set you are provided with a word or phrase drawn from the tweet (selected_text) that encapsulates the provided sentiment.
- The columns are:
 1. textID - unique ID for each piece of text
 2. text - the text of the tweet
 3. selected_text - [train only] the text that supports the tweet's sentiment
 4. sentiment - the general sentiment of the tweet
- Your task will be to identify the sentiment (neutral, positive or negative), given the tweet text.

In [1]:
import pandas as pd
df = pd.read_csv("train.csv")
df = df.dropna()
df.head(10)

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative
5,28b57f3990,http://www.dothebouncy.com/smf - some shameles...,http://www.dothebouncy.com/smf - some shameles...,neutral
6,6e0c6d75b1,2am feedings for the baby are fun when he is a...,fun,positive
7,50e14c0bb8,Soooo high,Soooo high,neutral
8,e050245fbd,Both of you,Both of you,neutral
9,fc2cbefa9d,Journey!? Wow... u just became cooler. hehe....,Wow... u just became cooler.,positive


## Naive Bayes theorem
- Given a text such as "The company is going bankrupt", we need to find the probability of each sentiment: `neutral`, `positive` and `negative`. That is, we need to compare the following probabilities:
 1. P(neutral|The company is going bankrupt)
 2. P(positive|The company is going bankrupt)
 3. P(negative|The company is going bankrupt)
- Using Bayes' theorem, the above can be written as:
 1. [P(The company is going bankrupt|neutral) * P(neutral)] / P(The company is going bankrupt)
 2. [P(The company is going bankrupt|positive) * P(positive)] / P(The company is going bankrupt)
 3. [P(The company is going bankrupt|negative) * P(negative)] / P(The company is going bankrupt)
- The denominator is the same in all three expressions, so we only need to compare the numerators, i.e.
 1. [P(The company is going bankrupt|neutral) * P(neutral)]
 2. [P(The company is going bankrupt|positive) * P(positive)]
 3. [P(The company is going bankrupt|negative) * P(negative)]
- Let us now look at the right side of the numerator above. To estimate P(neutral), P(positive) and P(negative), we count the tweets in each class with the pandas dataframe `groupby` function and divide by the total number of tweets.

In [2]:
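# Total number of tweets left after dropna()
total_count = len(df)
group_count = df.groupby('sentiment').count()
# Prior of each class = number of tweets in the class / total number of tweets.
# Label-based lookup avoids relying on the row order of the grouped frame.
negative_prob = group_count.loc['negative', 'text'] / total_count
neutral_prob = group_count.loc['neutral', 'text'] / total_count
positive_prob = group_count.loc['positive', 'text'] / total_count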
print(negative_prob, neutral_prob, positive_prob)
group_count

0.2831513828238719 0.40454876273653567 0.3122998544395924


sentiment,textID,text,selected_text
negative,7781,7781,7781
neutral,11117,11117,11117
positive,8582,8582,8582
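
As a cross-check, the class priors can also be obtained in one step with `value_counts(normalize=True)`. The sketch below is illustrative only: it rebuilds a label column from the class counts in the table above rather than loading `train.csv`, and the names `class_counts`, `labels` and `priors` are assumptions, not part of the original notebook.

```python
import pandas as pd

# Class counts copied from the groupby table above
class_counts = {"negative": 7781, "neutral": 11117, "positive": 8582}

# Rebuild a label column with one entry per tweet (a stand-in for df['sentiment'])
labels = pd.Series(
    [label for label, n in class_counts.items() for _ in range(n)],
    name="sentiment",
)

# normalize=True divides each class count by the total number of rows
priors = labels.value_counts(normalize=True)

for label in ["negative", "neutral", "positive"]:
    print(label, priors[label])
```

On the real dataframe the equivalent call would be `df['sentiment'].value_counts(normalize=True)`.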


## Probability of the tweet given the sentiment
- We have to tackle the issue that the exact tweet may not appear in a given sentiment class within the training set. In that case the estimated probability of the tweet given that class is zero, and if the tweet appears in none of the classes, every numerator is zero and no comparison is possible.
- Hence, we split the tweet into words and assume that every word in the tweet is independent of the others (this is the "naive" assumption). Instead of looking at the entire tweet, we analyse the individual words.
- Let us now look at the left side of the numerator above:
 1. P(The company is going bankrupt|neutral) = P(The|neutral) * P(company|neutral) * P(is|neutral) * P(going|neutral) * P(bankrupt|neutral)
 2. P(The company is going bankrupt|positive) = P(The|positive) * P(company|positive) * P(is|positive) * P(going|positive) * P(bankrupt|positive)
 3. P(The company is going bankrupt|negative) = P(The|negative) * P(company|negative) * P(is|negative) * P(going|negative) * P(bankrupt|negative)
- In short, what we are trying to calculate is P(word|sentiment) = Number of times the word appears in the sentiment class / total number of words in the sentiment class
- To count how often each word appears in a sentiment class, we can use sklearn's `CountVectorizer`. This gives us the term-document matrix (TDM), which holds the frequency of every word in every document.
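
Written out, the independence assumption above turns the quantity to compare into the class prior multiplied by one term per word. Here $c$ stands for a sentiment class and $w_1, \dots, w_n$ for the words of the tweet; this notation is introduced only for this summary and does not appear in the original notebook.

$$
P(c \mid w_1, \dots, w_n) \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c),
\qquad
P(w \mid c) = \frac{\text{count}(w, c)}{\sum_{w'} \text{count}(w', c)}
$$

The proportionality sign reflects the fact that the common denominator has been dropped.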

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

# List of the selected_text for which the sentiment is negative
# (boolean indexing gives the same list as looping over df.iterrows())
negative_docs = df.loc[df['sentiment'] == 'negative', 'selected_text'].tolist()
negative_docs

['Sooo SAD',
 'bullying me',
 'leave me alone',
 'Sons of ****,',
 'DANGERously',
 'lost',
 'Uh oh, I am sunburned',
 '*sigh*',
 'sick',
 'onna',
 'I`m sorry.',
 '.no internet',
 'Power back up not working too',
 'well so much for being unhappy for about 10 minute',
 'miss',
 'soooooo sleeeeepy!!!',
 'SUCKKKKKK',
 'dont like go',
 'd I`m not thrilled at all with mine.',
 'it is ****...u have dissappointed me that past few days',
 'hurts',
 'Torn ace of hearts',
 'i lost all my friends, i`m alone and sleepy..',
 'I give in to easily',
 'jealous..',
 'BADDD.',
 'I am sooo tired',
 'Sick.',
 ', sorry guys',
 'i miss you bby',
 'tired',
 'freaked',
 'unfortunately',
 'horrible,',
 'busy',
 'I don`t feel confident',
 'sad.',
 'Not looking forward',
 'Poor you',
 'not well',
 'painful.',
 'sad?',
 'missed all the awesome weather,',
 'terrible!',
 'Unfortunatley,',
 'That sucks, tho.',
 'Hate fighting',
 'Car-warmed Sprite tastes like sore throat',
 'hate',
 'i don`t like the other ones.',
 '

In [4]:
# TDM for negative sentiments
vec_n = CountVectorizer()
X_n = vec_n.fit_transform(negative_docs)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tdm_n = pd.DataFrame(X_n.toarray(), columns=vec_n.get_feature_names_out())

# In this case for negative sentiment tweets, "uh" appeared 6 times, "sigh" appeared 7 times, "dangerously" appeared 4 times.
sliced = tdm_n.loc[:10,["uh","sigh","dangerously"]]
sliced

Unnamed: 0,uh,sigh,dangerously
0,0,0,0
1,0,0,0
2,0,0,0
3,0,0,0
4,0,0,1
5,0,0,0
6,1,0,0
7,0,1,0
8,0,0,0
9,0,0,0
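
To make the counts in the matrix above less opaque, here is a rough standard-library approximation of what `CountVectorizer` does with its default settings: lowercase the text, then keep runs of two or more word characters as tokens. The helper `tokenize` and the three-document list are illustrative assumptions, and the real vectorizer has more options than this sketch covers.

```python
import re
from collections import Counter

# Default CountVectorizer token pattern: words of two or more characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into tokens, roughly as CountVectorizer does by default."""
    return TOKEN_PATTERN.findall(text.lower())

# Three of the negative selected_text entries listed earlier
docs = ["Sooo SAD", "Uh oh, I am sunburned", "*sigh*"]

# One Counter per document corresponds to one row of the term-document matrix
rows = [Counter(tokenize(doc)) for doc in docs]

# Summing the rows gives the per-class word frequencies
class_freq = sum(rows, Counter())

print(tokenize("Uh oh, I am sunburned"))
print(class_freq)
```

Note that the single-character word "I" is dropped and "SAD" is stored as "sad", which matters when words are looked up later.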


In [5]:
# Let us do the above two cells for neutral and positive sentiments as well
neutral_docs = df.loc[df['sentiment'] == 'neutral', 'selected_text'].tolist()
vec_t = CountVectorizer()
X_t = vec_t.fit_transform(neutral_docs)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tdm_t = pd.DataFrame(X_t.toarray(), columns=vec_t.get_feature_names_out())

positive_docs = df.loc[df['sentiment'] == 'positive', 'selected_text'].tolist()
vec_p = CountVectorizer()
X_p = vec_p.fit_transform(positive_docs)
tdm_p = pd.DataFrame(X_p.toarray(), columns=vec_p.get_feature_names_out())

## Probability of words in each sentiment class (Part 1)
- We now need to find the total number of times each word appeared in each sentiment class. For example, as seen above, "uh" appeared 6 times in the negative sentiment class.
- After that, we will use the frequency values to find the probability that each word appears in each sentiment class.

In [6]:
# Frequency of words in negative sentiment class
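# get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
word_list_n = vec_n.get_feature_names_out().tolist()
# Sum each column of the TDM to get the total count of each word in the class
count_list_n = X_n.toarray().sum(axis=0)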
freq_n = dict(zip(word_list_n,count_list_n))
freq_n

{'00': 2,
 '09': 1,
 '0ut': 1,
 '10': 5,
 '100': 2,
 '10000000000': 1,
 '107': 1,
 '11': 3,
 '12': 2,
 '12th': 1,
 '13': 1,
 '13pdrmj': 1,
 '14': 1,
 '140': 1,
 '14m': 1,
 '15': 1,
 '150': 1,
 '15am': 1,
 '16th': 1,
 '18hrs': 1,
 '1st': 2,
 '20': 2,
 '200': 1,
 '2008': 1,
 '2009': 1,
 '21': 2,
 '22hrs': 1,
 '24hrs': 1,
 '27': 1,
 '2day': 2,
 '2nd': 1,
 '2nit': 1,
 '2stop': 1,
 '30': 5,
 '30am': 1,
 '333333333': 1,
 '35': 1,
 '36': 1,
 '3647': 1,
 '38': 1,
 '3am': 1,
 '3wordsaftersex': 1,
 '42nd': 1,
 '45': 1,
 '4th': 1,
 '51': 1,
 '58': 1,
 '5ghz': 1,
 '5hours': 1,
 '5jib6': 1,
 '630': 1,
 '674p1': 1,
 '6th': 2,
 '7500': 1,
 '75c': 1,
 '7th': 1,
 '80': 1,
 '800': 1,
 '87': 1,
 '88': 1,
 '8th': 1,
 '90': 1,
 '970': 1,
 '9am': 1,
 '9pm': 1,
 '_c': 1,
 '_da_': 1,
 '_hall': 1,
 '_kill_boy': 1,
 '_l': 2,
 '_lord': 1,
 '_mcloven': 1,
 '_skies': 1,
 '_troy': 1,
 '_violence': 1,
 '_x': 1,
 'aaaaaaaaaahhhhhhhh': 1,
 'aaaaarrrrggghhh': 1,
 'aaaagggessss': 1,
 'aaaaw': 1,
 'aaarrrgggghhh': 1,
 'a

In [7]:
# Frequency of words in neutral sentiment class
# (get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2)
word_list_t = vec_t.get_feature_names_out().tolist()
count_list_t = X_t.toarray().sum(axis=0)
freq_t = dict(zip(word_list_t, count_list_t))

# Frequency of words in positive sentiment class
word_list_p = vec_p.get_feature_names_out().tolist()
count_list_p = X_p.toarray().sum(axis=0)
freq_p = dict(zip(word_list_p, count_list_p))

In [8]:
# Probabilities of words in negative sentiment class
prob_n = []
total_n_values = sum(freq_n.values())
for word, count in zip(word_list_n, count_list_n):
    prob_n.append(count / total_n_values)
n_prob_dict = dict(zip(word_list_n, prob_n))
n_prob_dict

{'00': 7.218392463998267e-05,
 '09': 3.6091962319991336e-05,
 '0ut': 3.6091962319991336e-05,
 '10': 0.0001804598115999567,
 '100': 7.218392463998267e-05,
 '10000000000': 3.6091962319991336e-05,
 '107': 3.6091962319991336e-05,
 '11': 0.00010827588695997402,
 '12': 7.218392463998267e-05,
 '12th': 3.6091962319991336e-05,
 '13': 3.6091962319991336e-05,
 '13pdrmj': 3.6091962319991336e-05,
 '14': 3.6091962319991336e-05,
 '140': 3.6091962319991336e-05,
 '14m': 3.6091962319991336e-05,
 '15': 3.6091962319991336e-05,
 '150': 3.6091962319991336e-05,
 '15am': 3.6091962319991336e-05,
 '16th': 3.6091962319991336e-05,
 '18hrs': 3.6091962319991336e-05,
 '1st': 7.218392463998267e-05,
 '20': 7.218392463998267e-05,
 '200': 3.6091962319991336e-05,
 '2008': 3.6091962319991336e-05,
 '2009': 3.6091962319991336e-05,
 '21': 7.218392463998267e-05,
 '22hrs': 3.6091962319991336e-05,
 '24hrs': 3.6091962319991336e-05,
 '27': 3.6091962319991336e-05,
 '2day': 7.218392463998267e-05,
 '2nd': 3.6091962319991336e-05,
 '2

In [9]:
# Probabilities of words in neutral sentiment class
prob_t = []
total_t_values = sum(freq_t.values())
for word, count in zip(word_list_t, count_list_t):
    prob_t.append(count / total_t_values)
t_prob_dict = dict(zip(word_list_t, prob_t))

# Probabilities of words in positive sentiment class
prob_p = []
total_p_values = sum(freq_p.values())
for word, count in zip(word_list_p, count_list_p):
    prob_p.append(count / total_p_values)
p_prob_dict = dict(zip(word_list_p, prob_p))

## Probability of words in each sentiment class (Part 2)
- However, the product above involves the probability of each word of a new sentence given a sentiment class. If any word of the new sentence does not occur within that sentiment class in the training set, its probability is zero and the entire product becomes zero.
- For example, P(The company is going bankrupt|neutral) = P(The|neutral) * P(company|neutral) * P(is|neutral) * P(going|neutral) * P(bankrupt|neutral) will be equal to 0 if any of the probabilities on the right-hand side is zero.
- To fix this problem, we use Laplace Smoothing.
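
One common form of Laplace (add-one) smoothing, and the form that the `(count + 1)/(total_cnts_features_n + total_features)` expression in this notebook appears to follow, adds 1 to every word count and adds the vocabulary size to the denominator. In the notation assumed here, $N_c$ is the total number of word occurrences in class $c$ and $|V|$ is the number of distinct words in the whole training set:

$$
P(w \mid c) = \frac{\text{count}(w, c) + 1}{N_c + |V|}
$$

A word that never occurs in class $c$ then gets probability $1 / (N_c + |V|)$ instead of zero.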

In [10]:
# Vocabulary size: number of distinct words (features) across the whole training set
docs = df['selected_text'].tolist()

vec = CountVectorizer()
X = vec.fit_transform(docs)

# get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
total_features = len(vec.get_feature_names_out())

# Total number of word occurrences in each sentiment class
total_cnts_features_n = count_list_n.sum(axis=0)
total_cnts_features_t = count_list_t.sum(axis=0)
total_cnts_features_p = count_list_p.sum(axis=0)
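
A small self-contained sketch of the smoothing step, using made-up counts rather than the tweet data. The function name `laplace_prob`, the toy dictionary and the vocabulary size of 5 are all assumptions for illustration.

```python
def laplace_prob(count, total_words_in_class, vocab_size):
    """Add-one smoothed estimate of P(word | class)."""
    return (count + 1) / (total_words_in_class + vocab_size)

# Hypothetical word counts for one sentiment class
toy_freq = {"sad": 3, "tired": 2, "miss": 1}
total_words = sum(toy_freq.values())  # 6 word occurrences in this class
vocab_size = 5                        # pretend the full vocabulary has 5 distinct words

seen = laplace_prob(toy_freq.get("sad", 0), total_words, vocab_size)
unseen = laplace_prob(toy_freq.get("bankrupt", 0), total_words, vocab_size)

print(seen)    # (3 + 1) / (6 + 5)
print(unseen)  # (0 + 1) / (6 + 5), small but not zero
```

Because the vocabulary size is added to the denominator, the smoothed probabilities over the whole vocabulary still sum to 1.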

## Combining the probabilities
- Now, we will calculate the left-hand side of the numerator, before combining it with the right-hand side found above to obtain the total value of the numerator.
- To find the left-hand side of the numerator, we multiply the individual word probabilities, applying Laplace smoothing.
- The right-hand side of the numerator was found above; it is simply the probability of each of the sentiment classes.
- We combine the left and right hand sides of the numerator to find the total value of the numerator, which are:
 1. [P(The company is going bankrupt|neutral) * P(neutral)]
 2. [P(The company is going bankrupt|positive) * P(positive)]
 3. [P(The company is going bankrupt|negative) * P(negative)]
- As mentioned above, the denominators are all the same and can be ignored, so comparing the numerators tells us which of these probabilities is the largest (and hence which sentiment is most likely):
 1. P(neutral|The company is going bankrupt)
 2. P(positive|The company is going bankrupt)
 3. P(negative|The company is going bankrupt)
- As seen below, since P(negative|The company is going bankrupt) is the largest, it is most likely that the sentiment is negative, given the sentence "The company is going bankrupt", hence we conclude that the sentiment is negative.

In [11]:
from nltk.tokenize import word_tokenize
# Tokenize new_sentence, the sentence whose sentiment we would like to predict.
new_sentence = 'The company is going bankrupt'
new_word_list = word_tokenize(new_sentence)
new_word_list 

['The', 'company', 'is', 'going', 'bankrupt']

In [12]:
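# P(The company is going bankrupt|negative) = P(The|negative) * P(company|negative) * P(is|negative) * P(going|negative) * P(bankrupt|negative)
# To find P(The company is going bankrupt|negative), multiply the individual word probabilities.
# Note: CountVectorizer lowercases its vocabulary by default, while word_tokenize keeps the
# original case, so a capitalised token such as 'The' is not found in freq_n and is treated as unseen.
prob_n_with_ls = []
for word in new_word_list:
    count = freq_n.get(word, 0)  # 0 if the word never appears in the negative class
    # Laplace smoothing: add 1 to the count and the vocabulary size to the denominator
    prob_n_with_ls.append((count + 1) / (total_cnts_features_n + total_features))
dict_n_dict = dict(zip(new_word_list, prob_n_with_ls))
print(dict_n_dict)
n_prob_left = 1
for key in dict_n_dict:
    n_prob_left = n_prob_left * dict_n_dict[key]
n_prob_left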

{'The': 2.2061905707415006e-05, 'company': 6.618571712224501e-05, 'is': 0.008273214640280628, 'going': 0.0017428905508857855, 'bankrupt': 2.2061905707415006e-05}


4.645096429746363e-19

In [13]:
# Now, we just need to find the value of the entire numerator of the negative sentiment class
# [P(The company is going bankrupt|negative) * P(negative)]
n_prob_total = n_prob_left * negative_prob
n_prob_total

1.315265477432913e-19

In [14]:
# Repeat this for the neutral and positive sentiment classes

# P(The company is going bankrupt|neutral) = P(The|neutral) * P(company|neutral) * P(is|neutral) * P(going|neutral) * P(bankrupt|neutral)
# To find P(The company is going bankrupt|neutral), multiply the individual word probabilities
prob_t_with_ls = []
for word in new_word_list:
    if word in freq_t.keys():
        count = freq_t[word]
    else:
        count = 0
    prob_t_with_ls.append((count + 1)/(total_cnts_features_t + total_features))
dict_t_dict = dict(zip(new_word_list,prob_t_with_ls))
print(dict_t_dict)
t_prob_left = 1
for key in dict_t_dict:    
    t_prob_left = t_prob_left * dict_t_dict[key]

    
# Now, we just need to find the value of the entire numerator of the neutral sentiment class
# [P(The company is going bankrupt|neutral) * P(neutral)]
t_prob_total = t_prob_left * neutral_prob
t_prob_total

{'The': 7.00883814490072e-06, 'company': 0.00013316792475311366, 'is': 0.010408124645177569, 'going': 0.0033432157951176432, 'bankrupt': 7.00883814490072e-06}


9.208724189362161e-20

In [15]:
# P(The company is going bankrupt|positive) = P(The|positive) * P(company|positive) * P(is|positive) * P(going|positive) * P(bankrupt|positive)
# To find P(The company is going bankrupt|positive), multiply the individual word probabilities
prob_p_with_ls = []
for word in new_word_list:
    if word in freq_p.keys():
        count = freq_p[word]
    else:
        count = 0
    prob_p_with_ls.append((count + 1)/(total_cnts_features_p + total_features))
dict_p_dict = dict(zip(new_word_list,prob_p_with_ls))
print(dict_p_dict)
p_prob_left = 1
for key in dict_p_dict:    
    p_prob_left = p_prob_left * dict_p_dict[key]

    
# Now, we just need to find the value of the entire numerator of the positive sentiment class
# [P(The company is going bankrupt|positive) * P(positive)]
p_prob_total = p_prob_left * positive_prob
p_prob_total

{'The': 2.2107264447097318e-05, 'company': 8.842905778838927e-05, 'is': 0.006057390458504665, 'going': 0.0009727196356722819, 'bankrupt': 2.2107264447097318e-05}


7.952616574025986e-20

In [16]:
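# Pick the class with the largest numerator, starting from negative as the current best
# (p_prob_total was already computed in the previous cell)
cur_max_sentiment = "negative"
cur_max_num = n_prob_total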

if t_prob_total > cur_max_num:
    cur_max_sentiment = "neutral"
    cur_max_num = t_prob_total
if p_prob_total > cur_max_num:
    cur_max_sentiment = "positive"
    cur_max_num = p_prob_total

print("Positive: " + str(p_prob_total))
print("Negative: " + str(n_prob_total))
print("Neutral: " + str(t_prob_total) + "\n")

print("Sentence: " + new_sentence)
print("Predicted Sentiment: " + cur_max_sentiment)

Positive: 7.952616574025986e-20
Negative: 1.315265477432913e-19
Neutral: 9.208724189362161e-20

Sentence: The company is going bankrupt
Predicted Sentiment: negative
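
The numerators above are already of the order of 1e-19 for a five-word sentence, and the product shrinks with every extra word. A common workaround, which the notebook itself does not use, is to add log-probabilities instead of multiplying probabilities; the ranking of the classes is unchanged because the logarithm is monotonic. The sketch below reuses the smoothed word probabilities and priors printed earlier; the names `word_probs`, `priors`, `log_scores` and `predicted` are assumptions.

```python
import math

# Smoothed word probabilities copied from the printed dictionaries above
# (order: The, company, is, going, bankrupt)
word_probs = {
    "negative": [2.2061905707415006e-05, 6.618571712224501e-05, 0.008273214640280628,
                 0.0017428905508857855, 2.2061905707415006e-05],
    "neutral": [7.00883814490072e-06, 0.00013316792475311366, 0.010408124645177569,
                0.0033432157951176432, 7.00883814490072e-06],
    "positive": [2.2107264447097318e-05, 8.842905778838927e-05, 0.006057390458504665,
                 0.0009727196356722819, 2.2107264447097318e-05],
}

# Class priors printed by the groupby cell
priors = {
    "negative": 0.2831513828238719,
    "neutral": 0.40454876273653567,
    "positive": 0.3122998544395924,
}

# log P(class) + sum of log P(word | class)
log_scores = {
    label: math.log(priors[label]) + sum(math.log(p) for p in probs)
    for label, probs in word_probs.items()
}

# The class with the highest log score is the prediction
predicted = max(log_scores, key=log_scores.get)

print(log_scores)
print(predicted)
```

Exponentiating a log score recovers the corresponding numerator, so the two approaches agree as long as the plain product has not underflowed.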


## Custom sentiment prediction
- Now, let us place the entire section above into a function in order to make predictions for a custom sentence.
- The code in this function is the same as above.
- Have fun making your own predictions based on the dataset! Change the value of `sentence_to_be_predicted` to try your own sentence.

In [17]:
def predict(my_sentence):
    # Tokenize my_sentence, the sentence whose sentiment we would like to predict.
    new_word_list = word_tokenize(my_sentence)
    
    ## NEGATIVE sentiment
    prob_n_with_ls = []
    for word in new_word_list:
        if word in freq_n.keys():
            count = freq_n[word]
        else:
            count = 0
        prob_n_with_ls.append((count + 1)/(total_cnts_features_n + total_features))
    dict_n_dict = dict(zip(new_word_list,prob_n_with_ls))
    n_prob_left = 1
    for key in dict_n_dict:    
        n_prob_left = n_prob_left * dict_n_dict[key]
    n_prob_total = n_prob_left * negative_prob
    
    ## NEUTRAL sentiment
    prob_t_with_ls = []
    for word in new_word_list:
        if word in freq_t.keys():
            count = freq_t[word]
        else:
            count = 0
        prob_t_with_ls.append((count + 1)/(total_cnts_features_t + total_features))
    dict_t_dict = dict(zip(new_word_list,prob_t_with_ls))
    t_prob_left = 1
    for key in dict_t_dict:    
        t_prob_left = t_prob_left * dict_t_dict[key]

    t_prob_total = t_prob_left * neutral_prob
    
    ## POSITIVE sentiment
    prob_p_with_ls = []
    for word in new_word_list:
        if word in freq_p.keys():
            count = freq_p[word]
        else:
            count = 0
        prob_p_with_ls.append((count + 1)/(total_cnts_features_p + total_features))
    dict_p_dict = dict(zip(new_word_list,prob_p_with_ls))
    p_prob_left = 1
    for key in dict_p_dict:    
        p_prob_left = p_prob_left * dict_p_dict[key]
        
    p_prob_total = p_prob_left * positive_prob
    cur_max_sentiment = "negative"
    cur_max_num = n_prob_total
    
    if t_prob_total > cur_max_num:
        cur_max_sentiment = "neutral"
        cur_max_num = t_prob_total
    if p_prob_total > cur_max_num:
        cur_max_sentiment = "positive"
        cur_max_num = p_prob_total
    
    print("Positive: " + str(p_prob_total))
    print("Negative: " + str(n_prob_total))
    print("Neutral: " + str(t_prob_total) + "\n")
    
    print("Predicted Sentiment: " + cur_max_sentiment)

In [30]:
sentence_to_be_predicted = "this is difficult"
predict(sentence_to_be_predicted)

Positive: 1.7011600944603131e-10
Negative: 1.3864767838425653e-09
Neutral: 5.967346292592119e-10

Predicted Sentiment: negative
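
The three near-identical blocks inside `predict` could be folded into one loop over the classes. The following is only a sketch of that idea under several assumptions: the function `predict_sentiment`, its arguments and the toy counts are hypothetical, tokens are lowercased and split with `str.split` instead of `word_tokenize`, and log-probabilities are summed rather than multiplying raw probabilities.

```python
import math

def predict_sentiment(sentence, freq_by_class, priors, vocab_size):
    """Return the class with the highest smoothed Naive Bayes score, plus all scores."""
    words = sentence.lower().split()
    scores = {}
    for label, freq in freq_by_class.items():
        total_words = sum(freq.values())
        score = math.log(priors[label])
        for word in words:
            # Laplace smoothing keeps unseen words from zeroing out the score
            score += math.log((freq.get(word, 0) + 1) / (total_words + vocab_size))
        scores[label] = score
    return max(scores, key=scores.get), scores

# Hypothetical toy counts, not taken from the tweet dataset
toy_freq = {
    "negative": {"sad": 4, "difficult": 3, "is": 3, "this": 2},
    "neutral": {"is": 5, "today": 3, "this": 2},
    "positive": {"great": 4, "fun": 3, "is": 2},
}
toy_priors = {"negative": 0.3, "neutral": 0.4, "positive": 0.3}

# Vocabulary = every distinct word seen in any class
toy_vocab = {word for freq in toy_freq.values() for word in freq}

label, scores = predict_sentiment("this is difficult", toy_freq, toy_priors, len(toy_vocab))
print(label)
print(scores)
```

Returning the label and scores, rather than printing them, makes the function easier to reuse, for example to score a whole column of tweets.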
