# SMAI Assignment - 2

## Question - `3` : Multinomial Naïve Bayes

| | |
|- | -|
| Course | Statistical Methods in AI |
| Release Date | `16.02.2023` |
| Due Date | `24.02.2023` |

This question has you work and experiment with the Multinomial Naïve Bayes classifier. First, you will transform the data in the given CSV file into a count matrix and calculate the class priors. You will then use those priors, together with the likelihoods computed according to Multinomial Naïve Bayes, to classify the test data. Please note that use of `sklearn` implementations is allowed only for the final question of the assignment; for other doubts regarding libraries, you can reach out to the TAs.

The dataset is about `Spam SMS`. It has one attribute, the `message`, and a class label that is either `spam` or `ham`. The data is in `spam.csv` and contains 5,572 samples.
For your convenience the data is already pre-processed and loaded, but I suggest you take a look at the code for your own understanding. The vectorization part is left to you, and it can be done easily with the help of the given example code.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Reading text-based data using pandas

In [2]:
# read file into pandas using a relative path

df = pd.read_csv("./spam.csv", encoding='latin-1')
df.dropna(how="any", inplace=True, axis=1)
df.columns = ['label', 'message']

df.head()

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


## Pre-processing

- The main issue with our data is that it is all in text format (strings). The classification algorithms that we usually use need some sort of numerical feature vector in order to perform the classification task. There are many methods to convert a corpus to a vector format. The simplest is the bag-of-words approach, where each unique word in the corpus is assigned one index and a text is represented by the counts of its words.

- As a first step, let's write a function that cleans a message by removing punctuation and very common words ('the', 'a', etc.) and returns the cleaned text. To do this we will take advantage of the NLTK library. It's pretty much the standard library in Python for processing text and has a lot of useful features. We'll only use some of the basic ones here.
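As a rough illustration of the bag-of-words idea, the sketch below builds a vocabulary from a tiny made-up corpus and turns each message into a vector of word counts. The corpus and the helper name `bag_of_words` are assumptions for illustration only; they are not part of the assignment code.

```python
from collections import Counter

# Hypothetical toy corpus, used only for illustration.
docs = ["free entry win prize", "call me later", "win free prize call now"]

# Vocabulary: every unique word in the corpus, sorted so that each word
# gets a stable column position.
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc):
    """Return a count vector with one entry per vocabulary word."""
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocab]

vectors = [bag_of_words(doc) for doc in docs]
print(vocab)
for doc, vec in zip(docs, vectors):
    print(vec, "<-", doc)
```

Each row has the same length as the vocabulary, which is what lets a classifier treat messages of different lengths as fixed-size feature vectors.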

In [3]:
import string
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

def text_process(mess):
    """
    Takes in a string of text, then performs the following:
    1. Remove all punctuation
    2. Remove all stopwords
    3. Return the cleaned text as a single space-joined string
    """
    STOPWORDS = stopwords.words('english') + ['u', 'ü', 'ur', '4', '2', 'im', 'dont', 'doin', 'ure']
    # Check characters to see if they are in punctuation
    nopunc = [char for char in mess if char not in string.punctuation]

    # Join the characters again to form the string.
    nopunc = ''.join(nopunc)
    
    # Now just remove any stopwords
    return ' '.join([word for word in nopunc.split() if word.lower() not in STOPWORDS])

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Lokes\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
df['message'] = df.message.apply(text_process)
df.head()

Unnamed: 0,label,message
0,ham,Go jurong point crazy Available bugis n great ...
1,ham,Ok lar Joking wif oni
2,spam,Free entry wkly comp win FA Cup final tkts 21s...
3,ham,dun say early hor c already say
4,ham,Nah think goes usf lives around though


In [5]:
df['label'] = df.label.map({'ham':0, 'spam':1})
df.head()

Unnamed: 0,label,message
0,0,Go jurong point crazy Available bugis n great ...
1,0,Ok lar Joking wif oni
2,1,Free entry wkly comp win FA Cup final tkts 21s...
3,0,dun say early hor c already say
4,0,Nah think goes usf lives around though


## Splitting the data

In [6]:
# split X and y into training and testing sets 
from sklearn.model_selection import train_test_split

X = df.message
y = df.label

print(f'X: {X.shape}')
print(f'y: {y.shape}')
print()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

print(f'X_train: {X_train.shape}')
print(f'y_train: {y_train.shape}')
print()

print(f'X_test: {X_test.shape}')
print(f'y_test: {y_test.shape}')
print()

X: (5572,)
y: (5572,)

X_train: (4457,)
y_train: (4457,)

X_test: (1115,)
y_test: (1115,)



## Helper code / Example code for representing text as numerical data using scikit-learn

📌 From the scikit-learn documentation:
- Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect numerical feature vectors with a fixed size rather than the raw text documents with variable length.
- We will use CountVectorizer to "convert text into a matrix of token counts":

In [8]:
# example text for model training (SMS messages)
simple_train = ['call you tonight', 'Call me a cab', 'Please call me... PLEASE!']

In [9]:
# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
simple_train = vect.fit_transform(simple_train)

vect.get_feature_names_out()

array(['cab', 'call', 'me', 'please', 'tonight', 'you'], dtype=object)

In [11]:
# convert sparse matrix to a dense matrix
simple_train.toarray()

array([[0, 1, 0, 0, 1, 1],
       [1, 1, 1, 0, 0, 0],
       [0, 1, 1, 2, 0, 0]], dtype=int64)

In this scheme, features and samples are defined as follows:

- Each individual token occurrence frequency (normalized or not) is treated as a feature.
- The vector of all the token frequencies for a given document is considered a multivariate sample.

A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.

In [12]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_train.toarray(), columns=vect.get_feature_names_out())

Unnamed: 0,cab,call,me,please,tonight,you
0,0,1,0,0,1,1
1,1,1,1,0,0,0
2,0,1,1,2,0,0


### Transform Testing data into a document-term matrix (using existing / training vocabulary)

- You are supposed to use the training vocabulary to build the count matrix for the test data. Words that do not appear in the training vocabulary are ignored.
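To make the role of the fixed vocabulary concrete, here is a minimal sketch that counts only tokens already present in the training vocabulary. The helper `transform_with_vocab` is hypothetical, and its whitespace tokenizer is a simplification (an assumption); `CountVectorizer` tokenizes with a regular expression instead.

```python
from collections import Counter

# Vocabulary learned from the small training example above
# (assumed to stay fixed once the vectorizer is fitted).
train_vocab = ['cab', 'call', 'me', 'please', 'tonight', 'you']

def transform_with_vocab(text, vocab):
    """Count only the tokens that already exist in the training vocabulary."""
    # Simplified tokenizer: lowercase and split on whitespace.
    counts = Counter(text.lower().split())
    return [counts.get(word, 0) for word in vocab]

row = transform_with_vocab("please don't call me", train_vocab)
print(row)  # [0, 1, 1, 1, 0, 0]
```

The token "don't" is not in the vocabulary, so it adds neither a column nor a count. This keeps the test matrix aligned column by column with the training matrix.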

In [13]:
simple_test = ["please don't call me"]

In [14]:
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()

array([[0, 1, 1, 1, 0, 0]], dtype=int64)

In [15]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names_out())

Unnamed: 0,cab,call,me,please,tonight,you
0,0,1,1,1,0,0


## Multinomial Naive Bayes Implementation

- Your task is to implement Multinomial Naive Bayes from scratch. You can use numpy to vectorize your code and matplotlib to show your analysis.
- Some information from the documentation about Multinomial Naive Bayes is given below; it will give you some idea about using *Smoothing Priors*.
- There is a sub-question for experimenting with $\alpha > 0$. You don't have to implement it separately; try to incorporate it in the same Model Class / Function.

📌 From the scikit-learn documentation:

- Multinomial Naive Bayes implements the naive Bayes algorithm for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification (where the data are typically represented as word vector counts, although tf-idf vectors are also known to work well in practice).

- The distribution is parametrized by vectors $\theta_y = (\theta_{y1}, \theta_{y2}, \dots, \theta_{yn})$ for each class $y$, where $n$ is the number of features (in text classification, the size of the vocabulary) and $\theta_{yi}$ is the probability $P(x_i|y)$ of feature $i$ appearing in a sample belonging to class $y$.

- The parameters $\theta_y$ are estimated by a smoothed version of maximum likelihood, i.e. relative frequency counting:

$$
\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_{y} + \alpha n}
$$

 where $N_{yi} = \sum_{x \in T}{x_i}$ is the number of times feature $i$ appears in all samples of class $y$ in the training set $T$, and $N_{y} = \sum^{n}_{i=1}{N_{yi}}$ is the total count of all features for class $y$.

- The smoothing prior $\alpha \gt 0$ accounts for features not present in the learning samples and **prevents zero probabilities** in further computations. Setting $\alpha = 1$ is called Laplace smoothing, while $\alpha \lt 1$ is called Lidstone smoothing.
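The smoothed estimate can be sketched in a few lines of plain Python. The function name `estimate_theta`, the toy count matrix (borrowed from the helper example) and the labels assigned to its rows are assumptions made for illustration; this is not the required implementation.

```python
def estimate_theta(count_rows, labels, target_class, alpha):
    """Smoothed estimate (N_yi + alpha) / (N_y + alpha * n) for one class."""
    n = len(count_rows[0])  # number of features (vocabulary size)
    n_yi = [0] * n          # N_yi: per-feature counts within the class
    for row, label in zip(count_rows, labels):
        if label == target_class:
            for i, count in enumerate(row):
                n_yi[i] += count
    n_y = sum(n_yi)         # N_y: total count of all features in the class
    return [(count + alpha) / (n_y + alpha * n) for count in n_yi]

# Toy document-term matrix with hypothetical labels (1 = spam, 0 = ham).
toy_counts = [
    [0, 1, 0, 0, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 1, 1, 2, 0, 0],
]
toy_labels = [0, 0, 1]

theta_unsmoothed = estimate_theta(toy_counts, toy_labels, target_class=1, alpha=0)
theta_laplace = estimate_theta(toy_counts, toy_labels, target_class=1, alpha=1)
print(theta_unsmoothed)  # contains zeros for words never seen in class 1
print(theta_laplace)     # every entry is strictly positive
```

With $\alpha = 0$, any word unseen in a class gets probability zero, which wipes out the whole product for that class. With $\alpha = 1$, the same word gets $1 / (N_y + n)$ instead.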


In [16]:
"""
Your code here
"""


'\nYour code here\n'

## Vectorizing Training Sample

- Use the Helper code above to vectorize for training samples
- Don't overthink it; it's very easy to do

In [17]:
"""
Your code here
"""
count_vectorizer = CountVectorizer(stop_words='english')
 
# Fit the vocabulary on the training messages and build the count matrix: count_train
count_train = count_vectorizer.fit_transform(X_train)

In [18]:
count_train = pd.DataFrame(count_train.toarray(), columns=count_vectorizer.get_feature_names_out())

#### Resetting the index

In [19]:
y_train = y_train.reset_index(drop=True)
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

## Calculate Priors and Estimate Model's performance on Training Sample

- Calculate priors based on Training Sample using your NB implementation
- Evaluate your model's performance on Training Data ($\alpha = 0$)
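As a small sketch of the prior computation, each class prior is the fraction of training samples carrying that label. The helper `class_priors` and the toy labels below are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter

def class_priors(labels):
    """Return P(y) for every class as its relative frequency in `labels`."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: counts[cls] / total for cls in sorted(counts)}

# Hypothetical labels: 0 = ham, 1 = spam.
toy_labels = [0, 0, 0, 1, 0, 1, 0, 0]
priors = class_priors(toy_labels)
print(priors)  # {0: 0.75, 1: 0.25}
```

The priors always sum to 1, which is a quick sanity check to run on the real training labels.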

##### 1. Count num_spam and num_ham

In [55]:
## step 1 count number of spam and ham
value_counts = y_train.value_counts()
value_counts

0    3849
1     608
Name: label, dtype: int64

In [67]:
import seaborn as sns
%matplotlib inline

# Plot the class balance of the training labels (0 = ham, 1 = spam)
plt.style.use('fivethirtyeight')
plt.figure(figsize=(12,8))
sns.countplot(x=y_train)  # y_train is a Series, so it is passed directly
plt.ylabel('')
plt.yticks([])
y_train.value_counts(normalize=True)

In [21]:
num_ham = value_counts[0]
num_spam = value_counts[1]

print("Number of zeros:", num_ham)
print("Number of ones:", num_spam)

Number of zeros: 3849
Number of ones: 608


##### 2. Split the data into separate spam and ham DataFrames

In [22]:
"""
Your code here
"""
#1. concat
res =  pd.concat([y_train,X_train,count_train], axis=1)
res.head(1)

Unnamed: 0,label,message,008704050406,0089my,0121,01223585236,01223585334,0125698789,02,020603,...,åômorrow,åôrents,ìll,ìï,ìïll,ûthanks,ûªve,ûïharry,ûò,ûówell
0,0,Sleeping nt feeling well,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
# split into two separate DataFrames, one per class
df_spam = res[res['label'] == 1].reset_index(drop=True)
df_ham = res[res['label'] == 0].reset_index(drop=True)

In [24]:
print("total_spam:",len(df_spam))
df_ham.head(1)

total_spam: 608


Unnamed: 0,label,message,008704050406,0089my,0121,01223585236,01223585334,0125698789,02,020603,...,åômorrow,åôrents,ìll,ìï,ìïll,ûthanks,ûªve,ûïharry,ûò,ûówell
0,0,Sleeping nt feeling well,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [25]:
print("total_ham:",len(df_ham))
df_spam.head(1)

total_ham: 3849


Unnamed: 0,label,message,008704050406,0089my,0121,01223585236,01223585334,0125698789,02,020603,...,åômorrow,åôrents,ìll,ìï,ìïll,ûthanks,ûªve,ûïharry,ûò,ûówell
0,1,FREE RING TONE text POLYS 87131 every week get...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


##### 3. Calculate the priors P(x = spam) and P(x = ham)

In [26]:
total_data_train = len(y_train)
# print("total_data_train :",total_data_train)
p_spam = len(df_spam) / total_data_train
p_ham = len(df_ham) / total_data_train
print("P(x = spam):",p_spam)
print("P(x = ham):",p_ham)

P(x = spam): 0.13641462867399595
P(x = ham): 0.863585371326004


#### NSpam, NHam, NVocabulary
It's important to note that:
- NSpam is the total number of words in all the spam messages. It is not the number of spam messages, and it is not the number of unique words in spam messages.
- NHam is the total number of words in all the non-spam messages. It is not the number of non-spam messages, and it is not the number of unique words in non-spam messages.
- NVocabulary is the number of unique words in the training vocabulary.
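The three quantities are easy to confuse, so here is a tiny sketch on made-up spam messages that computes each of them side by side. The messages and variable names are hypothetical.

```python
# Hypothetical spam messages, already cleaned and lowercased.
spam_messages = ["win free prize", "free free entry"]

# Flatten all messages into one list of word occurrences.
tokens = [word for message in spam_messages for word in message.split()]

n_messages = len(spam_messages)  # number of spam messages
n_words = len(tokens)            # total word occurrences (this is NSpam)
n_unique = len(set(tokens))      # number of distinct words in spam

print(n_messages, n_words, n_unique)  # 2 6 4
```

Only the middle value belongs in the denominator of the likelihood estimate; the other two would give probabilities that do not sum to 1 over the vocabulary.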

In [27]:
column_names = count_vectorizer.get_feature_names_out()

# Calculate the sum of each column and create a new row with the column name and its frequency
sum_row_spam = df_spam.iloc[:, 2:]
sum_row_spam= pd.DataFrame(data=[sum_row_spam.sum(axis=0)], columns=column_names)


sum_row_ham = df_ham.iloc[:, 2:]
sum_row_ham= pd.DataFrame(data=[sum_row_ham.sum(axis=0)], columns=column_names)


In [28]:
sum_row_ham

Unnamed: 0,008704050406,0089my,0121,01223585236,01223585334,0125698789,02,020603,02070836089,02072069400,...,åômorrow,åôrents,ìll,ìï,ìïll,ûthanks,ûªve,ûïharry,ûò,ûówell
0,0,0,0,0,0,1,0,0,0,0,...,1,1,2,41,1,1,0,0,8,1


In [29]:
arr = np.array(sum_row_ham)
row_sums = arr.sum(axis=1)
print(row_sums)

[25620]


In [30]:
# N_Ham: total number of word occurrences across all ham messages
n_ham = int(sum_row_ham.loc[0].sum())

# N_Spam: total number of word occurrences across all spam messages
n_spam = int(sum_row_spam.loc[0].sum())
print(n_ham , n_spam)

# N_Vocabulary: number of unique words in the training vocabulary
n_vocabulary = len(count_train.columns)

# alpha = 0 means no smoothing
alpha = 0

25620 8686


In [31]:
column_names = count_vectorizer.get_feature_names_out()

In [32]:
vocabulary = []
for word in column_names:
    vocabulary.append(word)
len(vocabulary)

8152

## Vectorizing Test Sample

- Use the Training Sample vocabulary to create word count matrix for test samples
- This is also shown in the Helper code

In [33]:
"""
Your code here
"""
# Transform the test messages using the vocabulary fitted on the training data: count_test
count_test = count_vectorizer.transform(X_test)

## Estimate Model's performance on Test Sample

- Evaluate your model's performance on Test Sample, using the Training Priors ($\alpha = 0$)

In [34]:
# n_spam, n_ham and n_vocabulary are the totals computed from the count matrix earlier
parameters_spam = {}
parameters_ham = {}
for word in vocabulary:
    # P(word | spam); sum_row_spam[word] is a one-element Series,
    # so the stored value is read back later with [0]
    n_word_given_spam = sum_row_spam[word]
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
    parameters_spam[word] = p_word_given_spam

    # P(word | ham), computed the same way from the ham counts
    n_word_given_ham = sum_row_ham[word]
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
    parameters_ham[word] = p_word_given_ham
   

In [35]:
"""
Your code here
"""
import re

def Classify_email(message: str) -> int:
    """Return 1 (spam) or 0 (ham) for a single message."""
    # Replace non-word characters with spaces, then lowercase and tokenize
    message = re.sub(r'\W', ' ', str(message))
    words = message.lower().split()

    # Start from the class priors
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in words:
        # Words outside the training vocabulary are ignored
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word][0]
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word][0]

    # Ties (including both scores being zero) are classified as ham
    if p_spam_given_message > p_ham_given_message:
        return 1
    else:
        return 0
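Multiplying many small probabilities, as the classifier above does, can underflow to zero for long messages. A common alternative is to add log probabilities instead. The sketch below shows that variant under the assumption that every likelihood is strictly positive (that is, $\alpha > 0$); the function name `classify_log_space` and the toy parameters are hypothetical.

```python
import math

def classify_log_space(tokens, log_prior, log_theta):
    """Return the class with the highest log posterior score."""
    scores = {}
    for cls in log_prior:
        score = log_prior[cls]
        for token in tokens:
            # Tokens outside the vocabulary are skipped.
            if token in log_theta[cls]:
                score += log_theta[cls][token]
        scores[cls] = score
    return max(scores, key=scores.get)

# Hypothetical smoothed parameters for a three-word vocabulary
# (0 = ham, 1 = spam); all values are made up for illustration.
theta = {
    0: {"call": 0.5, "free": 0.1, "later": 0.4},
    1: {"call": 0.3, "free": 0.6, "later": 0.1},
}
prior = {0: 0.8, 1: 0.2}

log_theta = {c: {w: math.log(p) for w, p in words.items()} for c, words in theta.items()}
log_prior = {c: math.log(p) for c, p in prior.items()}

print(classify_log_space(["free", "free", "call"], log_prior, log_theta))  # 1
print(classify_log_space(["call", "later"], log_prior, log_theta))         # 0
```

Because the logarithm is monotonic, comparing sums of logs picks the same class as comparing products of probabilities, as long as no probability is zero.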

In [36]:
Classify_email("As a valued customer, I am pleased to advise you that following recent review of your Mob No. you are awarded with a £1500 Bonus Prize, call 09066364589")

1

In [37]:
X_test


0                                          Convey regards
1                            Â‰Ã› anyway many good evenings
2       sort code acc bank natwest reply confirm ive s...
3                                   Sorry din lock keypad
4       Hi babe Chloe r smashed saturday night great w...
                              ...                        
1110                                              problem
1111    New Theory Argument wins SITUATION loses PERSO...
1112    real getting yo need tickets one jacket done a...
1113                                      dear sleeping P
1114    Yeah DonÂ‰Ã›Ã·t stand close tho youÂ‰Ã›Ã·ll catch so...
Name: message, Length: 1115, dtype: object

In [38]:
pred = []
for message in X_test:
    # classify the message text itself, not its row index
    pred.append(Classify_email(message))

In [39]:
correct = 0
total = len(pred)

for i in range(len(pred)):
   if pred[i] == y_test[i]:
      correct += 1

print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)


## Select Smoothing Priors

- Refactor your code to incorporate smoothing priors; keep $\alpha = 0$ for the previous estimates / sub-questions
- Compare the performance with different values of $\alpha \gt 0$ as smoothing priors to take care of zero probabilities
- You can display a Plot or Table to show the comparison.
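One possible shape for this comparison is sketched below on a tiny made-up dataset: a single `fit` function takes `alpha` as a parameter, and a loop records the test accuracy for several values. All data, function names and the chosen `alpha` values are assumptions for illustration; the sketch works in log space, so it only covers $\alpha > 0$.

```python
import math
from collections import Counter

# Hypothetical toy data: (tokens, label) pairs with 1 = spam and 0 = ham.
train = [
    (["free", "prize", "win"], 1),
    (["win", "cash", "free"], 1),
    (["call", "me", "later"], 0),
    (["see", "you", "later"], 0),
    (["call", "you", "tonight"], 0),
]
test = [
    (["free", "cash", "tonight"], 1),
    (["call", "later"], 0),
    (["win", "prize", "call"], 1),
]
vocab = sorted({word for tokens, _ in train for word in tokens})

def fit(train, alpha):
    """Return log priors and smoothed log likelihoods for each class."""
    log_prior, log_theta = {}, {}
    for cls in sorted({label for _, label in train}):
        docs = [tokens for tokens, label in train if label == cls]
        counts = Counter(word for tokens in docs for word in tokens)
        n_y = sum(counts.values())
        log_prior[cls] = math.log(len(docs) / len(train))
        log_theta[cls] = {
            word: math.log((counts[word] + alpha) / (n_y + alpha * len(vocab)))
            for word in vocab
        }
    return log_prior, log_theta

def predict(tokens, log_prior, log_theta):
    """Pick the class with the highest log posterior, skipping unknown words."""
    scores = {
        cls: log_prior[cls]
        + sum(log_theta[cls][w] for w in tokens if w in log_theta[cls])
        for cls in log_prior
    }
    return max(scores, key=scores.get)

def accuracy(alpha):
    log_prior, log_theta = fit(train, alpha)
    hits = sum(predict(tokens, log_prior, log_theta) == label for tokens, label in test)
    return hits / len(test)

results = {alpha: accuracy(alpha) for alpha in (0.01, 0.1, 1.0, 10.0, 50.0)}
for alpha, acc in results.items():
    print(f"alpha={alpha:<5} accuracy={acc:.3f}")
```

Very large values of `alpha` pull every likelihood toward a uniform distribution, so the prior starts to dominate the decision. The same `results` dictionary can be passed to a matplotlib plot or shown as a table.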

In [40]:
"""
Your code here
"""


'\nYour code here\n'

## Comparison with scikit-learn Implementation

- Use scikit-learn's `sklearn.naive_bayes.MultinomialNB` model to compare your implementation's performance
- (Optional) try other classifiers from `sklearn.naive_bayes` and see if you can make them work

In [41]:
"""
Your code here
"""
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score,confusion_matrix
clf = MultinomialNB()
clf.fit(count_train, y_train)
Y_pred = clf.predict(count_test)
print(accuracy_score(y_test,Y_pred))
print(confusion_matrix(y_test,Y_pred))

0.9847533632286996
[[970   6]
 [ 11 128]]


