# Naive Bayes - Class Exercise 3

## Introduction

## Metadata (Data Dictionary)

| No.| Variable | Data Type | Description |
|----|----------|-----------|-------------|
| 1  | text | string | The text of the tweet |
| 2  | sentiment | string | Sentiment category |


## Import necessary libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn import metrics

## Import Data

In [2]:
df = pd.read_csv('sentiment.csv')
df

Unnamed: 0,text,sentiment
0,bullying me,negative
1,leave me alone,negative
2,DANGERously,negative
3,"Uh oh, I am sunburned",negative
4,*sigh*,negative
...,...,...
5959,wanna leave work al,negative
5960,good,positive
5961,welcome,positive
5962,enjoy,positive


In [3]:
# Convert the label to a binary variable (negative -> 0, positive -> 1)

df['sentiment'] = df['sentiment'].map({'negative': 0, 'positive': 1})
df.head()

Unnamed: 0,text,sentiment
0,bullying me,0
1,leave me alone,0
2,DANGERously,0
3,"Uh oh, I am sunburned",0
4,*sigh*,0


## Vectorization

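Vectorization turns each sentence into a row of word counts, with one column per word in the vocabulary. It is similar to one-hot encoding applied to sentences.

Assuming we have 3 sentences: "I eat chicken rice", "I like cheese burger", "I drink coffee".<br>
Conceptually, they would be vectorized into the format below.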

| No.| I | eat | chicken | rice | like | cheese | burger | drink | coffee |
|----|---|---|---|---|---|---|---|---|---|
| 1  | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2  | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| 3  | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |

In [4]:
# We will use the vectorizer from sklearn
from sklearn.feature_extraction.text import CountVectorizer

In [5]:
# Initialize the vectorizer
cv = CountVectorizer()

In [6]:
# Obtain vectorized words
vectorized_words = cv.fit_transform(df['text']).toarray()

In [7]:
# The word counts are now stored in matrix form
vectorized_words

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [8]:
# Check the shape of the matrix
# It has 5964 rows (one per tweet in the dataset) and 3018 columns (one per distinct word in the vocabulary)

vectorized_words.shape

(5964, 3018)
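
Because `.toarray()` produces a dense array, every one of those cells is stored explicitly, zeros included. A rough back-of-the-envelope estimate of the memory involved, assuming 8 bytes per element for the `int64` dtype reported in the array output:

```python
rows, cols = 5964, 3018          # shape reported by vectorized_words.shape
bytes_per_element = 8            # int64

n_elements = rows * cols
n_bytes = n_elements * bytes_per_element
print(n_elements)                # 17999352
print(round(n_bytes / 1e6, 1))   # 144.0 (megabytes)
```

The object returned by `fit_transform` before `.toarray()` is called is a SciPy sparse matrix, which stores only the non-zero entries and is usually the more economical choice for larger corpora.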

In [9]:
# The vocabulary_ attribute maps each word to its column index
# If a word has index n, its counts are stored in column n of the matrix (counting from 0)
cv.vocabulary_

{'bullying': 394,
 'me': 1688,
 'leave': 1548,
 'alone': 114,
 'dangerously': 650,
 'uh': 2748,
 'oh': 1898,
 'am': 122,
 'sunburned': 2513,
 'sigh': 2332,
 'sorry': 2410,
 'no': 1849,
 'internet': 1415,
 'funny': 1051,
 'soooooo': 2403,
 'sleeeeepy': 2357,
 'romance': 2206,
 'zero': 3015,
 'is': 1426,
 'torn': 2672,
 'ace': 51,
 'of': 1890,
 'hearts': 1256,
 'give': 1084,
 'in': 1392,
 'to': 2648,
 'easily': 820,
 'jealous': 1451,
 'chilliin': 488,
 'better': 281,
 'baddd': 234,
 'sooo': 2400,
 'tired': 2644,
 'thank': 2589,
 'yyyyyyyyyoooooooooouuuuu': 3014,
 'sick': 2328,
 'happy': 1221,
 'star': 2449,
 'wars': 2831,
 'day': 660,
 'everyone': 882,
 'and': 132,
 'enjoy': 854,
 'the': 2599,
 'holiday': 1295,
 'uk': 2749,
 'mothers': 1773,
 'all': 107,
 'you': 2994,
 'mums': 1790,
 'out': 1934,
 'there': 2603,
 'unfortunately': 2763,
 'pretty': 2065,
 'wish': 2907,
 'was': 2832,
 'allowed': 112,
 'go': 1094,
 'hahaa': 1188,
 'your': 2996,
 'awesomee': 211,
 'awesomeeeee': 212,
 'not': 

In [10]:
# Summing the matrix down each column (axis=0) gives the total count of each word
vectorized_words_sum = vectorized_words.sum(axis=0)
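
As a small illustration of why a column-wise sum yields word counts, the sketch below uses a hand-made 3 x 4 count matrix and a made-up vocabulary mapping. Both are hypothetical and not taken from the dataset.

```python
# Toy count matrix: one row per sentence, one column per word
toy_matrix = [
    [1, 0, 2, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 3],
]
# Toy vocabulary in the same word -> column index format as cv.vocabulary_
toy_vocabulary = {"coffee": 0, "eat": 1, "like": 2, "rice": 3}

# zip(*rows) transposes the matrix, so each tuple is one column
column_sums = [sum(column) for column in zip(*toy_matrix)]

# Look up each word's total through its column index
word_counts = {word: column_sums[index] for word, index in toy_vocabulary.items()}
print(column_sums)
print(word_counts)
```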

In [11]:
# Build a DataFrame that records the column index and total count of each word
word_rows = []

# Loop over the vocabulary_ attribute (word -> column index)
for word, index in cv.vocabulary_.items():
    word_rows.append([index, word, vectorized_words_sum[index]])

# Convert the list to a DataFrame ordered by column index
word_count_df = (
    pd.DataFrame(word_rows, columns=['index', 'word', 'count'])
    .sort_values('index')
    .reset_index(drop=True)
)

In [12]:
# Take a look at it
word_count_df

Unnamed: 0,index,word,count
0,0,000,1
1,1,09,2
2,2,10,1
3,3,100,2
4,4,13pdrmj,1
...,...,...,...
3013,3013,yuuum,1
3014,3014,yyyyyyyyyoooooooooouuuuu,1
3015,3015,zero,2
3016,3016,zuccini,1


In [13]:
# We can also convert the vectorized matrix to a DataFrame for better readability
vectorized_df = pd.DataFrame(vectorized_words, columns=word_count_df['word'].to_list())
vectorized_df

Unnamed: 0,000,09,10,100,13pdrmj,17,1st,2008,2009,200lbs,...,yum,yummm,yummmmy,yummy,yup,yuuum,yyyyyyyyyoooooooooouuuuu,zero,zuccini,â½m
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5959,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5960,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5961,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5962,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
# Many words appear only once, so we cannot rely on them to make predictions
# We keep only the words whose total count is above a threshold
# Here, words that appear more than 10 times

frequent_word_df = word_count_df[word_count_df['count'] > 10]
frequent_word_df

Unnamed: 0,index,word,count
40,40,about,14
80,80,again,19
107,107,all,82
117,117,already,11
121,121,always,12
...,...,...,...
2979,2979,yes,12
2994,2994,you,306
2996,2996,your,47
3008,3008,yum,12
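
A related option is to filter inside the vectorizer with `CountVectorizer`'s `min_df` parameter. Note that `min_df` thresholds on document frequency (the number of texts that contain a word), not on the total count used here, so the two filters can disagree. The toy corpus below is invented to show the difference using only the standard library.

```python
from collections import Counter

toy_texts = [
    "good good good day",
    "good morning",
    "bad day",
]

total_count = Counter()      # how many times each word occurs overall
document_freq = Counter()    # how many texts contain each word

for text in toy_texts:
    tokens = text.split()
    total_count.update(tokens)
    document_freq.update(set(tokens))   # count each word once per text

print(total_count["good"], document_freq["good"])
print(total_count["day"], document_freq["day"])
```

In this toy corpus "good" occurs 4 times but in only 2 texts, so a threshold of 3 would keep it under a total-count filter and drop it under a document-frequency filter.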


In [15]:
# Now, we extract those columns of frequent words
x = vectorized_df[frequent_word_df['word']]
x

Unnamed: 0,about,again,all,already,always,am,amazing,an,and,are,...,work,working,wow,wrong,yay,yes,you,your,yum,yummy
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5959,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
5960,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5961,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5962,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
# Extract the label
y = df['sentiment']

In [17]:
# Import MultinomialNB from sklearn
# This model is well suited to word-count features in text classification

from sklearn.naive_bayes import MultinomialNB
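
For intuition about what `MultinomialNB` estimates, here is a minimal from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, which corresponds to the default `alpha=1.0`. The function names and the four toy tweets are invented for illustration, and this is not scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(texts, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from word counts."""
    vocabulary = sorted({word for text in texts for word in text.split()})
    class_docs = Counter(labels)                 # number of texts per class
    word_counts = defaultdict(Counter)           # class -> word -> count
    for text, label in zip(texts, labels):
        word_counts[label].update(text.split())

    log_prior = {c: math.log(n / len(texts)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        # Smoothed denominator: words seen in class c plus alpha per vocabulary word
        total = sum(word_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocabulary
        }
    return log_prior, log_likelihood

def predict(text, log_prior, log_likelihood):
    """Return the class with the highest log score; unknown words are ignored."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in text.split() if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

texts = ["good happy day", "happy thank you", "sad tired day", "leave me alone"]
labels = [1, 1, 0, 0]
log_prior, log_likelihood = train_multinomial_nb(texts, labels)
print(predict("happy day", log_prior, log_likelihood))
print(predict("tired alone", log_prior, log_likelihood))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a word that never appeared in a class from forcing that class's probability to zero.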

In [18]:
# Fit the model, then predict on the same data (in-sample predictions)
model = MultinomialNB()
model.fit(x, y)
yhat = model.predict(x)
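
This cell predicts on the same rows the model was fitted on, so the scores that follow are in-sample. `train_test_split`, imported at the top of the notebook, is the usual way to hold out data for evaluation. The standard-library sketch below only illustrates the idea on placeholder data, and the helper name `simple_split` is hypothetical.

```python
import random

def simple_split(rows, labels, test_size=0.2, seed=42):
    """Shuffle the indices, then split rows and labels into train and test parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

rows = [[i] for i in range(10)]        # placeholder feature rows
labels = [i % 2 for i in range(10)]    # placeholder labels
x_train, x_test, y_train, y_test = simple_split(rows, labels)
print(len(x_train), len(x_test))
```

With scikit-learn, the equivalent call would be along the lines of `train_test_split(x, y, test_size=0.2, random_state=42)`, after which the model is fitted on the training part and scored on the test part.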

In [19]:
# metrics was already imported at the top of the notebook; repeated here for convenience
from sklearn import metrics

In [20]:
metrics.confusion_matrix(y, yhat)

array([[1998,  847],
       [ 264, 2855]], dtype=int64)

In [21]:
metrics.f1_score(y, yhat)

0.8371206567951914
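
The F1 score can be reproduced by hand from the confusion matrix, which also helps in reading it. scikit-learn lays the matrix out with true labels on the rows and predicted labels on the columns, so for labels 0 and 1 it reads `[[TN, FP], [FN, TP]]`.

```python
# Values copied from the confusion matrix output
tn, fp = 1998, 847
fn, tp = 264, 2855

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)      # of the tweets predicted positive, how many are positive
recall = tp / (tp + fn)         # of the positive tweets, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
```

The hand-computed F1 agrees with the `metrics.f1_score` output (about 0.837). The 847 false positives against 264 false negatives suggest that the model leans towards predicting the positive class.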