Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [161]:
NAME = "ati tesakulsiri"
ID = "st123009"

---

# Lab 06: Generative classifiers: Naive Bayes

As discussed in class, a Naive Bayes classifier works as follows:
$$\begin{eqnarray}
p(y \mid \mathbf{x} ; \theta) & = & \frac{p(\mathbf{x} \mid y ; \theta) p(y ; \theta)}{p(\mathbf{x} ; \theta)} \\
& \propto & p(\mathbf{x} \mid y ; \theta) p(y ; \theta) \\
& \approx & p(y ; \theta) \prod_j p(x_j \mid y ; \theta)
\end{eqnarray}$$
We will use Naive Bayes to perform diabetes diagnosis and text classification.

## Example 1: Diabetes classification

In this example we predict whether a patient with specific diagnostic measurements has diabetes or not. As the features are
continuous, we will model the conditional probabilities
$p(x_j \mid y ; \theta)$ as univariate Gaussians with mean $\mu_{j,y}$ and standard deviation $\sigma_{j,y}$.
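
For reference, the univariate Gaussian density assumed for each feature can be written in its standard form, using the symbols defined above:

$$p(x_j \mid y ; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma_{j,y}} \exp\left(-\frac{(x_j - \mu_{j,y})^2}{2\sigma_{j,y}^2}\right)$$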

The data are originally from the U.S. National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) and are available
from [Kaggle](https://www.kaggle.com/uciml/pima-indians-diabetes-database).


In [162]:
import csv
import math
import random
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

### Data manipulation

First we have some functions to read the dataset, split it into train and test, and partition it according to target class ($y$).

In [163]:
# Load data from CSV file
def loadCsv(filename):
    data_raw = pd.read_csv(filename)
    headers = data_raw.columns
    dataset = data_raw.values
    return dataset, headers

# Split dataset into test and train with given ratio
def splitDataset(test_size,*arrays,**kwargs):
    return train_test_split(*arrays,test_size=test_size,**kwargs)

# Separate training data according to target class
# Return key value pairs array in which keys are possible target variable values
# and values are the data records.

def data_split_byClass(dataset):
    Xy = {}
    for i in range(len(dataset)):
        datapair = dataset[i]
        # datapair[-1] (the last column) is the target class for this record.
        # Check if we already have this value as a key in the return array
        if (datapair[-1] not in Xy):
            # Add class as key
            Xy[datapair[-1]] = []
        # Append this record to array of records for this class key
        Xy[datapair[-1]].append(datapair)
    return Xy

### Model training

Next we have some functions used for training the model. The parameters are the mean and standard deviation of each feature within each class, which define
the Gaussian class-conditional densities, as well as the class priors $p(y)$.

In [164]:
# Parameters of a Gaussian are its mean and standard deviation

def mean(numbers):
    return sum(numbers)/float(len(numbers))

def stdev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x-avg,2) for x in numbers])/float(len(numbers)-1)
    return math.sqrt(variance)

# Calculate Gaussian parameters mu and sigma for each attribute over a dataset

def get_gaussian_parameters(X,y):
    parameters = {}
    unique_y = np.unique(y)
    for uy in unique_y:
        mean = np.mean(X[y==uy],axis=0)
        std = np.std(X[y==uy],axis=0)
        py = y[y==uy].size/y.size
        parameters[uy] = {'prior':py,'mean':mean,'std':std}
    return parameters, unique_y

def calculateProbability(x, mu, sigma):
    sigma = np.diag(sigma**2)
    x = x.reshape(-1,1)
    mu = mu.reshape(-1,1)
    exponent = np.exp(-1/2*(x-mu).T@np.linalg.inv(sigma)@(x-mu))
    return ((1/(np.sqrt(((2*np.pi)**x.size)*np.linalg.det(sigma))))*exponent)[0,0]
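
`calculateProbability` evaluates a multivariate Gaussian whose covariance is `np.diag(sigma**2)`. With a diagonal covariance, that density equals the product of univariate Gaussians, which is exactly the naive Bayes factorization. The sketch below checks this numerically in plain Python; the function names and the numbers are illustrative assumptions, not part of the lab code.

```python
import math

def univariate_gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

def diagonal_gaussian_pdf(xs, mus, sigmas):
    """Joint density of a Gaussian with covariance diag(sigma_1^2, ..., sigma_d^2)."""
    d = len(xs)
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu) for a diagonal Sigma
    quad = sum(((x - m) / s) ** 2 for x, m, s in zip(xs, mus, sigmas))
    # Determinant of a diagonal matrix is the product of its diagonal entries
    det = math.prod(s ** 2 for s in sigmas)
    return math.exp(-0.5 * quad) / math.sqrt((2.0 * math.pi) ** d * det)

# Hypothetical feature vector and per-class parameters (illustrative values only)
xs = [1.0, 2.5, -0.3]
mus = [0.5, 2.0, 0.0]
sigmas = [1.0, 0.5, 2.0]

product_of_marginals = math.prod(
    univariate_gaussian_pdf(x, m, s) for x, m, s in zip(xs, mus, sigmas)
)
joint = diagonal_gaussian_pdf(xs, mus, sigmas)
```

Note that `np.std` defaults to the population estimate (`ddof=0`), whereas the `stdev` helper above divides by `n-1`; `get_gaussian_parameters` uses `np.std`.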

### Model testing

Next are some functions for testing the model on a test set and computing its accuracy. Note that when `prior=False` we assume
$$ p(y \mid \mathbf{x} ; \theta) \propto p(\mathbf{x} \mid y ; \theta), $$
which means the priors $p(y)$ are taken to be equal for each possible value of $y$. With `prior=True` (the default), the likelihood is also multiplied by the estimated prior $p(y ; \theta)$.
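
In `predict_one` below, the prior is switched on or off with the expression `prior_value ** float(flag)`. The following minimal sketch isolates that trick; the likelihoods and priors are made-up numbers, and `posterior_score` is a hypothetical helper name.

```python
def posterior_score(likelihood, prior, use_prior=True):
    # float(True) == 1.0 keeps the prior; float(False) == 0.0 turns prior**0 into 1.0
    return likelihood * prior ** float(use_prior)

# Hypothetical likelihoods p(x | y) and priors p(y) for two classes
likelihoods = {0: 0.020, 1: 0.030}
priors = {0: 0.66, 1: 0.34}

with_prior = {y: posterior_score(likelihoods[y], priors[y], True) for y in priors}
without_prior = {y: posterior_score(likelihoods[y], priors[y], False) for y in priors}

pred_with = max(with_prior, key=with_prior.get)
pred_without = max(without_prior, key=without_prior.get)
```

With these numbers the two rules disagree: the larger prior for class 0 outweighs the slightly larger likelihood for class 1.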

In [165]:
# Calculate class conditional probabilities for given input data vector

def predict_one(x,parameters,unique_y,prior = True):
    probabilities = []
    for key in parameters.keys():
        probabilities.append(calculateProbability(x,parameters[key]['mean'],parameters[key]['std'])*(parameters[key]['prior']**(float(prior))))
    probabilities = np.array(probabilities)
    return unique_y[np.argmax(probabilities)]

def getPredictions(X, parameters, unique_y,prior=True):
    predictions = []
    for i in range(X.shape[0]):
        predictions.append(predict_one(X[i],parameters,unique_y,prior))
    return np.array(predictions)

# Get accuracy for test set

def getAccuracy(y, y_pred):
    correct = len(y[y==y_pred])
    return correct/y.size

### Experiment

Here we load the diabetes dataset, split it into training and test data, train a Gaussian NB model, and test the model on the test set.

In [166]:
# Load dataset

filename = 'diabetes.csv'
dataset, headers = loadCsv(filename)
#print(headers)
#print(np.array(dataset)[0:5,:])

# Split into training and test

X_train,X_test,y_train,y_test = splitDataset(0.4,dataset[:,:-1],dataset[:,-1])
print("Total =",len(dataset),"Train =", len(X_train),"Test =",len(X_test))

# Train model

parameters, unique_y = get_gaussian_parameters(X_train,y_train)

# Test model with the class priors

prediction = getPredictions(X_test,parameters,unique_y)
print("Accuracy with Prior =",getAccuracy(y_test,prediction))

# Test model without the class priors

prediction = getPredictions(X_test,parameters,unique_y,prior = False)
print("Accuracy without Prior =",getAccuracy(y_test,prediction))

Total = 768 Train = 460 Test = 308
Accuracy with Prior = 0.7272727272727273
Accuracy without Prior = 0.7337662337662337


###  Exercise In lab / take home work (20 points)

Find out what proportion of the records in your dataset are positive vs. negative.  Can we conclude that $p(y=1) = p(y=0)$? If not, add
the priors $p(y=1)$ and $p(y=0)$ to your NB model. Does it improve the result?



In [167]:
np.unique(y_train)

array([0., 1.])

In [168]:
# YOUR CODE HERE
# raise NotImplementedError()
py1 = y_train[y_train == 1].shape[0]/y_train.shape[0]

py0 = y_train[y_train == 0].shape[0]/y_train.shape[0]
py1,py0

prediction = getPredictions(X_test,parameters,unique_y,prior = True)
print(f'p(y=1) = {py1} \nand p(y=0) = {py0}\n')
print("Accuracy with Prior =",getAccuracy(y_test,prediction))

p(y=1) = 0.34347826086956523 
and p(y=0) = 0.6565217391304348

Accuracy with Prior = 0.7272727272727273


**Explain whether you can conclude that $p(y=1) = p(y=0)$. If not, add
the priors $p(y=1)$ and $p(y=0)$ to your NB model. Does it improve the result? (double click to explain)**



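> ## ANS

- p(y=1) = p(y=0)?
    - No. In this dataset p(y=1) != p(y=0), as shown in the cell above (about 0.34 vs. 0.66 on the training split).
    
- Adding the priors and comparing
    - Tested in the cell above.


- Does it improve the result?
    - No. On this split the accuracy is 0.727 with the priors and 0.734 without them, so adding the priors did not improve the result.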
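
When the classes are imbalanced, it can help to compare the classifier against the accuracy of always predicting the majority class. A minimal sketch of computing class proportions and that baseline follows; the label vector is hypothetical, not the diabetes data, and `class_proportions` is an illustrative helper.

```python
from collections import Counter

def class_proportions(labels):
    """Return the fraction of records that belong to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}

# Hypothetical label vector (7 negatives, 3 positives)
labels = [0, 0, 0, 1, 0, 1, 0, 1, 0, 0]
props = class_proportions(labels)

# Always predicting the most frequent class is right this often on the same labels
majority_label = max(props, key=props.get)
baseline_accuracy = props[majority_label]
```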

## Example 2: Text classification

This example has been adapted from a post by Jaya Aiyappan, available at
[Analytics Vidhya](https://medium.com/analytics-vidhya/naive-bayes-classifier-for-text-classification-556fabaf252b#:~:text=The%20Naive%20Bayes%20classifier%20is,time%20and%20less%20training%20data).

We will generate a small dataset of sentences that are classified as either "statements" or "questions."

We will assume that the occurrence and placement of words within a sentence are independent of each other
(i.e., the features are conditionally independent given $y$). So the sentence "this is my book" is the same as "is this my book."
We will treat words as case insensitive.
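
To make the word-order assumption concrete, here is a small sketch that represents a sentence as a case-insensitive bag of words. It uses `collections.Counter` and `str.split` rather than the lab's own helpers, so treat it as an illustration only.

```python
from collections import Counter

def bag_of_words(sentence):
    """Case-insensitive word counts; word order is discarded."""
    return Counter(sentence.lower().split())

# Under the bag-of-words view these two sentences are indistinguishable
same_bag = bag_of_words("This is my book") == bag_of_words("is this my book")
```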

In [169]:
# Generate text data for two classes, "statement" and "question"

text_train = [['This is my novel book', 'statement'],
              ['this book has more than one author', 'statement'],
              ['is this my book', 'question'],
              ['They are novels', 'statement'],
              ['have you read this book', 'question'],
              ['who is the novels author', 'question'],
              ['what are the characters', 'question'],
              ['This is how I bought the book', 'statement'],
              ['I like fictional characters', 'statement'],
              ['what is your favorite book', 'question']]

text_test = [['this is the book', 'statement'], 
             ['who are the novels characters', 'question'], 
             ['is this the author', 'question'],
            ['I like apples']]

# Load training and test data into pandas data frames

training_data = pd.DataFrame(text_train, columns= ['sentence', 'class'])
print(training_data)
print('\n------------------------------------------\n')
testing_data = pd.DataFrame(text_test, columns= ['sentence', 'class'])
print(testing_data)


                             sentence      class
0               This is my novel book  statement
1  this book has more than one author  statement
2                     is this my book   question
3                     They are novels  statement
4             have you read this book   question
5            who is the novels author   question
6             what are the characters   question
7       This is how I bought the book  statement
8         I like fictional characters  statement
9          what is your favorite book   question

------------------------------------------

                        sentence      class
0               this is the book  statement
1  who are the novels characters   question
2             is this the author   question
3                  I like apples       None


In [170]:
# Partition training data by class

stmt_docs = [train['sentence'] for index,train in training_data.iterrows() if train['class'] == 'statement']
question_docs = [train['sentence'] for index,train in training_data.iterrows() if train['class'] == 'question']
all_docs = [train['sentence'] for index,train in training_data.iterrows()]

# Get word frequencies for each sentence and class

def get_words(text):
    # Initialize word list
    words = []
    # Loop through each sentence in input array
    for text_row in text:       
        # Count the blank spaces. Assume each word is separated by a single blank space
        # so that the number of words is the number of blank spaces + 1
        number_of_spaces = text_row.count(' ')
        # Loop through the sentence and get the words between blank spaces
        for i in range(number_of_spaces):
            # Take the text before the next space as a word, then drop it from the row
            words.append([text_row[:text_row.index(' ')].lower()])
            text_row = text_row[text_row.index(' ')+1:]  
        # What remains is the last word (note: it is not lowercased here)
        words.append([text_row])
    return np.unique(words)

# Get frequency of each word in each document

def get_doc_word_frequency(words, text):  
    word_freq_table = np.zeros((len(text),len(words)), dtype=int)
    i = 0
    for text_row in text:
        # Insert extra space between each pair of words to prevent
        # partial match of words
        text_row_temp = ''
        for idx, val in enumerate(text_row):
            if val == ' ':
                 text_row_temp = text_row_temp + '  '
            else:
                  text_row_temp = text_row_temp + val.lower()
        text_row = ' ' + text_row_temp + ' '
        j = 0
        for word in words: 
            word = ' ' + word + ' '
            freq = text_row.count(word)
            word_freq_table[i,j] = freq
            j = j + 1
        i = i + 1
    
    return word_freq_table

In [171]:
# Get word frequencies for statement documents

word_list_s = get_words(stmt_docs)
word_freq_table_s = get_doc_word_frequency(word_list_s, stmt_docs)
tdm_s = pd.DataFrame(word_freq_table_s, columns=word_list_s)
print(tdm_s)

   are  author  book  bought  characters  fictional  has  how  i  is  like  \
0    0       0     1       0           0          0    0    0  0   1     0   
1    0       1     1       0           0          0    1    0  0   0     0   
2    1       0     0       0           0          0    0    0  0   0     0   
3    0       0     1       1           0          0    0    1  1   1     0   
4    0       0     0       0           1          1    0    0  1   0     1   

   more  my  novel  novels  one  than  the  they  this  
0     0   1      1       0    0     0    0     0     1  
1     1   0      0       0    1     1    0     0     1  
2     0   0      0       1    0     0    0     1     0  
3     0   0      0       0    0     0    1     0     1  
4     0   0      0       0    0     0    0     0     0  


In [172]:
# Get word frequencies over all statement documents

freq_list_s = word_freq_table_s.sum(axis=0) 
freq_s = dict(zip(word_list_s,freq_list_s))
print(freq_s)

{'are': 1, 'author': 1, 'book': 3, 'bought': 1, 'characters': 1, 'fictional': 1, 'has': 1, 'how': 1, 'i': 2, 'is': 2, 'like': 1, 'more': 1, 'my': 1, 'novel': 1, 'novels': 1, 'one': 1, 'than': 1, 'the': 1, 'they': 1, 'this': 3}


In [173]:
# Get word frequencies for question documents

word_list_q = get_words(question_docs)
word_freq_table_q = get_doc_word_frequency(word_list_q, question_docs)
tdm_q = pd.DataFrame(word_freq_table_q, columns=word_list_q)
print(tdm_q)

   are  author  book  characters  favorite  have  is  my  novels  read  the  \
0    0       0     1           0         0     0   1   1       0     0    0   
1    0       0     1           0         0     1   0   0       0     1    0   
2    0       1     0           0         0     0   1   0       1     0    1   
3    1       0     0           1         0     0   0   0       0     0    1   
4    0       0     1           0         1     0   1   0       0     0    0   

   this  what  who  you  your  
0     1     0    0    0     0  
1     1     0    0    1     0  
2     0     0    1    0     0  
3     0     1    0    0     0  
4     0     1    0    0     1  


In [174]:
# Get word frequencies over all question documents

freq_list_q = word_freq_table_q.sum(axis=0) 
freq_q = dict(zip(word_list_q,freq_list_q))
print(freq_q)
print(freq_list_s)
print(freq_list_q)

{'are': 1, 'author': 1, 'book': 3, 'characters': 1, 'favorite': 1, 'have': 1, 'is': 3, 'my': 1, 'novels': 1, 'read': 1, 'the': 2, 'this': 2, 'what': 2, 'who': 1, 'you': 1, 'your': 1}
[1 1 3 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 3]
[1 1 3 1 1 1 3 1 1 1 2 2 2 1 1 1]


In [175]:
# Get word probabilities for statement class
a = 1
prob_s = []
for count in freq_list_s:
    #print(word, count)
    prob_s.append((count+a)/(sum(freq_list_s)+len(freq_list_s)*a))
prob_s.append(a/(sum(freq_list_s)+len(freq_list_s)*a))
    
# Get word probabilities for question class

prob_q = []
for count in freq_list_q:
    prob_q.append((count+a)/(sum(freq_list_q)+len(freq_list_q)*a))
prob_q.append(a/(sum(freq_list_q)+len(freq_list_q)*a))   
    
    
print('Probability of words for "statement" class \n')
print(dict(zip(word_list_s, prob_s)))
print('------------------------------------------- \n')
print('Probability of words for "question" class \n')
print(dict(zip(word_list_q, prob_q)))

Probability of words for "statement" class 

{'are': 0.043478260869565216, 'author': 0.043478260869565216, 'book': 0.08695652173913043, 'bought': 0.043478260869565216, 'characters': 0.043478260869565216, 'fictional': 0.043478260869565216, 'has': 0.043478260869565216, 'how': 0.043478260869565216, 'i': 0.06521739130434782, 'is': 0.06521739130434782, 'like': 0.043478260869565216, 'more': 0.043478260869565216, 'my': 0.043478260869565216, 'novel': 0.043478260869565216, 'novels': 0.043478260869565216, 'one': 0.043478260869565216, 'than': 0.043478260869565216, 'the': 0.043478260869565216, 'they': 0.043478260869565216, 'this': 0.08695652173913043}
------------------------------------------- 

Probability of words for "question" class 

{'are': 0.05128205128205128, 'author': 0.05128205128205128, 'book': 0.10256410256410256, 'characters': 0.05128205128205128, 'favorite': 0.05128205128205128, 'have': 0.05128205128205128, 'is': 0.10256410256410256, 'my': 0.05128205128205128, 'novels': 0.0512820512

In [176]:
# Calculate prior for one class

def prior(className):    
    denominator = len(stmt_docs) + len(question_docs)
    
    if className == 'statement':
        numerator =  len(stmt_docs)
    else:
        numerator =  len(question_docs)
        
    return np.divide(numerator,denominator)
    
# Calculate class conditional probability for a sentence
    
def classCondProb(sentence, className):
    words = get_words(sentence)
    prob = 1
    for word in words:
        if className == 'statement':
            idx = np.where(word_list_s == word)
            prob = prob * prob_s[np.array(idx)[0,0]]
        else:
            idx = np.where(word_list_q == word)
            prob = prob * prob_q[np.array(idx)[0,0]]   
    
    return prob

# Predict class of a sentence

def predict(sentence):
    prob_statement = classCondProb(sentence, 'statement') * prior('statement')
    prob_question = classCondProb(sentence, 'question') * prior('question')
    if  prob_statement > prob_question:
        return 'statement'
    else:
        return 'question'

### In-lab exercise: Laplace smoothing

Run the code below and figure out why it fails.

When a word does not appear with a specific class in the training data, its class-conditional probability is 0, and we are unable to
get a reasonable probability for that class.

Research Laplace smoothing, and modify the code above to implement Laplace smoothing (setting the frequency of all words with frequency 0 to a frequency of 1).
Run the modified code on the test set.
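
As a starting point, here is a minimal sketch of add-one (Laplace) smoothing over a dictionary of word counts. The cell above divides by `sum(freq_list_s) + len(freq_list_s)*a`; this variant reserves one extra vocabulary slot for unseen words so that all probabilities, including the unseen-word entry, sum to 1. That extra slot, the helper name, and the toy counts are assumptions of this sketch.

```python
def laplace_smoothed_probs(counts, alpha=1.0):
    """counts: dict mapping word -> count within one class.

    Returns (probs, unseen_prob), where unseen_prob is used for any word
    that never occurred with this class in the training data.
    """
    vocab_size = len(counts) + 1  # assumption: one extra slot for unseen words
    total = sum(counts.values()) + alpha * vocab_size
    probs = {word: (count + alpha) / total for word, count in counts.items()}
    unseen_prob = alpha / total
    return probs, unseen_prob

# Toy counts, not the lab's training data
counts = {"book": 3, "this": 3, "is": 2, "the": 1}
probs, unseen_prob = laplace_smoothed_probs(counts)
```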

In [177]:
test_docs = list([test['sentence'] for index,test in testing_data.iterrows()])
print('Getting prediction for "%s"' % test_docs[0])
predict(test_docs[0])


Getting prediction for "this is the book"


IndexError: index 0 is out of bounds for axis 1 with size 0

### Exercise 1.1 (10 points)

Explain why it failed and how to solve the problem.

Explanation here! (Double click to explain)

> A word in the test sentence is new to the model (it is not in the class vocabulary), so `np.where` returns an empty index array, and indexing into it raises the error.
<br>
To fix this, the `classCondProb` function is changed to multiply by the class prior $p(y)$ when a word is not in the vocabulary, as in the code below.
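
One further detail may be worth checking: `get_words` loops with `for text_row in text`, while `classCondProb` passes it a single sentence string. Iterating over a Python string yields its characters, so the tokens seen inside `classCondProb` appear to be single characters rather than words. A minimal sketch of a guard follows, using a hypothetical helper `as_documents` that is not part of the lab code.

```python
def as_documents(text):
    """Wrap a single string in a list so that loops iterate over sentences."""
    return [text] if isinstance(text, str) else list(text)

sentence = "this is the book"

# Iterating a str directly yields one character at a time
chars = [item for item in sentence]

# Iterating the wrapped value yields the whole sentence once
docs = [item for item in as_documents(sentence)]
```

Calling the tokenizer on `as_documents(sentence)` instead of `sentence` would make it see whole sentences in both the training and prediction paths.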

### Exercise 1.2 (20 points)

Modify your code to make it work.

In [178]:
# YOUR CODE HERE
# raise NotImplementedError()
# idx = np.where(word_list_s == 'author')
# if 'author' not in word_list_s:
#     print('hi')
# prob_s[np.array(idx)[0,0]]

def classCondProb(sentence, className):
    words = get_words(sentence)
    prob = 1
    for word in words:
        if className == 'statement':
            if word not in word_list_s:
                # print('hi')
                prob = prob * prior(className)
            else:
                idx = np.where(word_list_s == word)
                prob = prob * prob_s[np.array(idx)[0,0]]
        else:
            if word not in word_list_q:
                prob = prob * prior(className)
            else:
                idx = np.where(word_list_q == word)
                print(idx)
                prob = prob * prob_q[np.array(idx)[0,0]]   
    
    return prob

# classCondProb('this is the book', 'question')

In [179]:
# Test function: Do not remove
test_docs = list([test['sentence'] for index,test in testing_data.iterrows()])

for sentence in test_docs:
    print('Getting prediction for %s"' % sentence)
    print(predict(sentence))
    
print("success!")
# End Test function

Getting prediction for this is the book"
question
Getting prediction for who are the novels characters"
question
Getting prediction for is this the author"
question
Getting prediction for I like apples"
question
success!


**Expected result**:\
Getting prediction for this is the book"\
question\
Getting prediction for who are the novels characters"\
question\
Getting prediction for is this the author"\
question

### Take home exercise

Find a more substantial text classification dataset, clean up the documents, and build your NB classifier. Write a brief report on your in-lab and take home exercises and results.

In [183]:
%reset -f

In [184]:
import csv
import math
import random
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import re

In [185]:
data_set = pd.read_csv('HIDE')

In [186]:
data_set.head()

   Category                                            Message
0       ham  Go until jurong point, crazy.. Available only ...
1       ham                      Ok lar... Joking wif u oni...
2      spam  Free entry in 2 a wkly comp to win FA Cup fina...
3       ham  U dun say so early hor... U c already then say...
4       ham  Nah I don't think he goes to usf, he lives aro...


In [187]:
data_set.isna().sum()

Category    0
Message     0
dtype: int64

In [188]:
data_set.describe()
data_set_cut = data_set[:300]

In [189]:
data_set_cut.columns

Index(['Category', 'Message'], dtype='object')

In [190]:
# data_set_cut = data_set_cut.reset_index()
cleaner = FUNCTION HIDE
test_sen = 'atdfhsgjk;l43895y68875y8**&*(%^&*('
print(cleaner(test_sen))

data_set_cut['Message'] = data_set_cut['Message'].apply(cleaner) 

atdfhsgjklyy


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_set_cut['Message'] = data_set_cut['Message'].apply(cleaner)
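
The definition of `cleaner` is not shown above. Judging only from the printed result (`atdfhsgjklyy`) and the cleaned messages, it appears to keep letters and spaces and drop everything else. Here is a minimal sketch under that assumption; `keep_letters_and_spaces` is a hypothetical name and the regular expression is a guess, not the author's function.

```python
import re

def keep_letters_and_spaces(text):
    # Assumption: remove every character that is not an ASCII letter or a space
    return re.sub(r"[^A-Za-z ]", "", text)

cleaned = keep_letters_and_spaces('atdfhsgjk;l43895y68875y8**&*(%^&*(')
```

The pandas warning shown above is typically avoided by taking an explicit copy of the slice before assigning to it, for example `data_set_cut = data_set[:300].copy()`.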


In [191]:
# def splitDataset(test_size,*arrays,**kwargs):
#     return train_test_split(*arrays,test_size=test_size,**kwargs)
# X_train,X_test,y_train,y_test = splitDataset(0.6,data_set_cut.Message,data_set_cut.Category)
train_data = data_set_cut.iloc[:200]
test_data = data_set_cut.iloc[200:]

In [192]:
train_data.head()

   Category                                            Message
0       ham  Go until jurong point crazy Available only in ...
1       ham                            Ok lar Joking wif u oni
2      spam   Free entry in a wkly comp to win FA Cup final...
3       ham        U dun say so early hor U c already then say
4       ham  Nah I dont think he goes to usf he lives aroun...


In [193]:
ham_docs = [train['Message'] for index,train in train_data.iterrows() if train['Category'] == 'ham']
spam_docs = [train['Message'] for index,train in train_data.iterrows() if train['Category'] == 'spam']
all_docs = [train['Message'] for index,train in train_data.iterrows()]


In [194]:
def get_words(text):
    # Initialize word list
    words = []
    # Loop through each sentence in input array
    for text_row in text:       
        # Count the blank spaces. Assume each word is separated by a single blank space
        # so that the number of words is the number of blank spaces + 1
        number_of_spaces = text_row.count(' ')
        # Loop through the sentence and get the words between blank spaces
        for i in range(number_of_spaces):
            # Take the text before the next space as a word, then drop it from the row
            words.append([text_row[:text_row.index(' ')].lower()])
            text_row = text_row[text_row.index(' ')+1:]  
        # What remains is the last word (note: it is not lowercased here)
        words.append([text_row])
    return np.unique(words)

# Get frequency of each word in each document

def get_doc_word_frequency(words, text):  
    word_freq_table = np.zeros((len(text),len(words)), dtype=int)
    i = 0
    for text_row in text:
        # Insert extra space between each pair of words to prevent
        # partial match of words
        text_row_temp = ''
        for idx, val in enumerate(text_row):
            if val == ' ':
                 text_row_temp = text_row_temp + '  '
            else:
                  text_row_temp = text_row_temp + val.lower()
        text_row = ' ' + text_row_temp + ' '
        j = 0
        for word in words: 
            word = ' ' + word + ' '
            freq = text_row.count(word)
            word_freq_table[i,j] = freq
            j = j + 1
        i = i + 1
    
    return word_freq_table

In [195]:
word_list_ham = get_words(ham_docs)
word_freq_table_ham = get_doc_word_frequency(word_list_ham, ham_docs)
tdm_ham = pd.DataFrame(word_freq_table_ham, columns=word_list_ham)
print(tdm_ham)

         Abiola  Callertune  Hee  IQ  LUCYxx  Lol  No  SEEING  Smiling  ...  \
0    19       0           0    0   0       0    0   0       0        0  ...   
1     5       0           0    0   0       0    0   0       0        0  ...   
2    10       0           0    0   0       0    0   0       0        0  ...   
3    12       0           0    0   0       0    0   0       0        0  ...   
4    15       0           0    0   0       0    0   0       0        0  ...   
..   ..     ...         ...  ...  ..     ...  ...  ..     ...      ...  ...   
162  18       0           0    0   0       0    0   0       0        0  ...   
163   6       0           0    0   0       0    0   0       0        0  ...   
164   5       0           0    0   0       0    0   0       0        0  ...   
165  13       0           0    0   0       0    0   0       0        0  ...   
166  29       0           0    0   0       0    0   0       0        0  ...   

     yo  you  youd  youhow  youll  your  youre  you

In [196]:
# Get word frequencies over all ham documents

freq_list_ham = word_freq_table_ham.sum(axis=0) 
freq_ham = dict(zip(word_list_ham,freq_list_ham))
print(freq_ham)

{'': 2376, 'Abiola': 0, 'Callertune': 0, 'Hee': 0, 'IQ': 0, 'LUCYxx': 0, 'Lol': 0, 'No': 0, 'SEEING': 0, 'Smiling': 0, 'Sorry': 0, 'WILL': 0, 'XX': 0, 'You': 0, 'Yummy': 0, 'a': 40, 'aaooooright': 1, 'able': 1, 'about': 5, 'abt': 2, 'accomodate': 1, 'accomodations': 1, 'account': 1, 'actin': 1, 'activities': 1, 'address': 3, 'aft': 1, 'after': 4, 'afternoon': 2, 'again': 3, 'ah': 1, 'ahead': 2, 'ahhh': 1, 'aids': 1, 'aight': 1, 'all': 16, 'almost': 1, 'already': 8, 'alright': 1, 'also': 1, 'always': 3, 'am': 11, 'amore': 1, 'amp': 1, 'ams': 1, 'an': 2, 'and': 37, 'animation': 1, 'another': 1, 'answer': 1, 'any': 1, 'anymore': 2, 'anythin': 1, 'anything': 7, 'anyway': 1, 'anyways': 1, 'apartment': 1, 'apologetic': 1, 'apologise': 1, 'applespairsall': 1, 'appointment': 1, 'approaches': 1, 'arabian': 1, 'ard': 2, 'are': 19, 'around': 2, 'as': 9, 'ask': 1, 'askd': 1, 'at': 13, 'available': 1, 'ave': 1, 'avoid': 1, 'awesome': 1, 'axis': 1, 'b': 3, 'babe': 2, 'babyjontet': 1, 'back': 7, 'bad
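
The `''` key with a count of 2376 and the capitalized words with a count of 0 both look like tokenization side effects. Removing digits and punctuation leaves consecutive spaces in the messages, and cutting at every single space then produces empty tokens. Separately, `get_words` lowercases every word except the last one in a message, while `get_doc_word_frequency` lowercases the whole message before counting, so a capitalized last word never matches. The sketch below shows the difference when using `str.split()` with no argument, which splits on any run of whitespace; it is an alternative to the lab's tokenizer, not a description of it.

```python
# One of the cleaned test messages, with double spaces where characters were removed
message = "Found it ENC  ltgt  where you at"

# Splitting on a single space keeps '' between consecutive spaces
single_space_tokens = message.lower().split(" ")

# Splitting on runs of whitespace drops the empty tokens
whitespace_tokens = message.lower().split()
```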

In [197]:
word_list_spam = get_words(spam_docs)
word_freq_table_spam = get_doc_word_frequency(word_list_spam, spam_docs)
tdm_spam = pd.DataFrame(word_freq_table_spam, columns=word_list_spam)
print(tdm_spam)

        AJ  Expires  LDNWARW  PPM  SPTV  SPTyrone  a  about  ac  ...  \
0   27   0        0        0    0     0         0  1      0   0  ...   
1   31   0        0        0    0     0         0  0      0   0  ...   
2   25   0        0        0    0     0         0  1      0   0  ...   
3   28   0        0        0    0     0         0  0      0   0  ...   
4   25   0        0        0    0     0         0  0      0   0  ...   
5   25   0        0        0    0     0         0  1      0   0  ...   
6   18   0        0        0    0     0         0  0      0   0  ...   
7   23   0        0        0    0     0         0  0      0   0  ...   
8   28   0        0        0    0     0         0  0      0   0  ...   
9   32   0        0        0    0     0         0  1      0   0  ...   
10  21   0        0        0    0     0         0  0      0   1  ...   
11  26   0        0        0    0     0         0  0      0   0  ...   
12  27   0        0        0    0     0         0  2      0   0 

In [198]:
freq_list_spam = word_freq_table_spam.sum(axis=0) 
freq_spam = dict(zip(word_list_spam,freq_list_spam))
print(freq_spam)
print(freq_list_ham)
print(freq_list_spam)

{'': 807, 'AJ': 0, 'Expires': 0, 'LDNWARW': 0, 'PPM': 0, 'SPTV': 0, 'SPTyrone': 0, 'a': 22, 'about': 1, 'ac': 1, 'account': 1, 'acoentry': 1, 'advise': 1, 'again': 1, 'age': 1, 'algarve': 1, 'all': 2, 'am': 1, 'ampm': 2, 'and': 5, 'annoncement': 1, 'ansr': 1, 'any': 1, 'app': 1, 'apply': 2, 'are': 7, 'arrange': 1, 'as': 4, 'august': 1, 'award': 1, 'awarded': 3, 'back': 1, 'bangb': 1, 'bangbabes': 1, 'barbie': 1, 'be': 4, 'been': 4, 'between': 2, 'bonus': 3, 'bootydelious': 1, 'box': 1, 'boxwrc': 1, 'britney': 1, 'bt': 1, 'burns': 1, 'bxipwe': 1, 'by': 1, 'c': 1, 'call': 14, 'caller': 1, 'camcorder': 1, 'camera': 1, 'cash': 4, 'chances': 1, 'charged': 2, 'chat': 1, 'chgs': 1, 'cinema': 1, 'claim': 10, 'click': 2, 'co': 1, 'code': 5, 'collected': 1, 'colour': 1, 'comes': 1, 'comp': 1, 'complimentary': 1, 'confirm': 1, 'congrats': 1, 'contact': 3, 'content': 1, 'correct': 1, 'cost': 1, 'country': 1, 'credit': 1, 'csh': 1, 'cup': 1, 'customer': 5, 'darling': 1, 'days': 1, 'delivery': 3, 'd

In [199]:
# Get word probabilities for ham class
a = 1
prob_ham = []
for count in freq_list_ham:
    #print(word, count)
    prob_ham.append((count+a)/(sum(freq_list_ham)+len(freq_list_ham)*a))
prob_ham.append(a/(sum(freq_list_ham)+len(freq_list_ham)*a))
    
# Get word probabilities for spam class

prob_spam = []
for count in freq_list_spam:
    prob_spam.append((count+a)/(sum(freq_list_spam)+len(freq_list_spam)*a))
prob_spam.append(a/(sum(freq_list_spam)+len(freq_list_spam)*a))   
    
    
print('Probability of words for "statement" class \n')
print(dict(zip(word_list_ham, prob_ham)))
print('------------------------------------------- \n')
print('Probability of words for "question" class \n')
print(dict(zip(word_list_spam, prob_spam)))

Probability of words for "statement" class 

{'': 0.4142558382711746, 'Abiola': 0.0001742767514813524, 'Callertune': 0.0001742767514813524, 'Hee': 0.0001742767514813524, 'IQ': 0.0001742767514813524, 'LUCYxx': 0.0001742767514813524, 'Lol': 0.0001742767514813524, 'No': 0.0001742767514813524, 'SEEING': 0.0001742767514813524, 'Smiling': 0.0001742767514813524, 'Sorry': 0.0001742767514813524, 'WILL': 0.0001742767514813524, 'XX': 0.0001742767514813524, 'You': 0.0001742767514813524, 'Yummy': 0.0001742767514813524, 'a': 0.007145346810735448, 'aaooooright': 0.0003485535029627048, 'able': 0.0003485535029627048, 'about': 0.0010456605088881143, 'abt': 0.0005228302544440571, 'accomodate': 0.0003485535029627048, 'accomodations': 0.0003485535029627048, 'account': 0.0003485535029627048, 'actin': 0.0003485535029627048, 'activities': 0.0003485535029627048, 'address': 0.0006971070059254096, 'aft': 0.0003485535029627048, 'after': 0.000871383757406762, 'afternoon': 0.0005228302544440571, 'again': 0.00069710

In [200]:
def prior(className):    
    denominator = len(ham_docs) + len(spam_docs)
    
    if className == 'ham':
        numerator =  len(ham_docs)
    else:
        numerator =  len(spam_docs)
        
    return np.divide(numerator,denominator)
    
# Calculate class conditional probability for a sentence
    
def classCondProb(sentence, className):
    words = get_words(sentence)
    prob = 1
    for word in words:
        if className == 'ham':
            if word not in word_list_ham:
                # print('hi')
                prob = prob * prior('ham')
            else:
                idx = np.where(word_list_ham == word)
                prob = prob * prob_ham[np.array(idx)[0,0]]
        else:
            if word not in word_list_spam:
                prob = prob * prior('spam')
            else:
                idx = np.where(word_list_spam == word)
                # print(idx)
                prob = prob * prob_spam[np.array(idx)[0,0]]   
    
    return prob

# Predict class of a sentence

def predict(sentence):
    score_ham = classCondProb(sentence, 'ham') * prior('ham')
    score_spam = classCondProb(sentence, 'spam') * prior('spam')
    if  score_ham > score_spam:
        return 'ham'
    else:
        return 'spam'
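
Multiplying many small word probabilities can underflow to zero for long messages, at which point both class scores compare as equal. A common remedy is to add log-probabilities instead. A minimal sketch follows, with hypothetical per-word probabilities and priors; `log_score` is an illustrative helper, not part of the code above.

```python
import math

def log_score(word_probs, prior):
    """log(prior) + sum of log word probabilities, i.e. the log of the usual product."""
    return math.log(prior) + sum(math.log(p) for p in word_probs)

# Hypothetical per-word probabilities for a 100-token message under each class
probs_ham = [1e-4] * 100
probs_spam = [2e-4] * 100

# The direct product underflows to 0.0 in double precision
direct_product = math.prod(probs_ham)

log_ham = log_score(probs_ham, prior=0.8)
log_spam = log_score(probs_spam, prior=0.2)
predicted = "ham" if log_ham > log_spam else "spam"
```

Because the logarithm is monotonic, comparing log scores picks the same class as comparing the products would if they could be represented exactly.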

In [230]:
test_docs = list([test['Message'] for index,test in test_data.iterrows()])

result_hat  = []
for sentence in test_docs[:20]:
    print('Getting prediction for %s"' % sentence)
    print(predict(sentence))
for sentence in test_docs:
    result_hat.append(predict(sentence))

Getting prediction for Found it ENC  ltgt  where you at"
ham
Getting prediction for I sent you  ltgt  bucks"
spam
Getting prediction for Hello darlin ive finished college now so txt me when u finish if u can love Kate xxx"
ham
Getting prediction for Your account has been refilled successfully by INR  ltDECIMALgt  Your KeralaCircle prepaid account balance is Rs  ltDECIMALgt  Your Transaction ID is KR ltgt "
ham
Getting prediction for Goodmorning sleeping ga"
ham
Getting prediction for U call me alter at  ok"
spam
Getting prediction for  say until like dat i dun buy ericsson oso cannot oredi lar"
spam
Getting prediction for As I entered my cabin my PA said  Happy Bday Boss  I felt special She askd me  lunch After lunch she invited me to her apartment We went there"
ham
Getting prediction for Aight yo dats straight dogg"
ham
Getting prediction for You please give us connection today itself before  ltDECIMALgt  or refund the bill"
ham
Getting prediction for Both  i shoot big loads so get r

In [228]:
# test_data.Category,result_hat
acc = np.where(test_data.Category == result_hat)[0].shape[0]/len(test_data) * 100

In [229]:
print(f' Accuracy percent = {acc}')

 Accuracy percent = 60.0
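
Accuracy alone can hide how a classifier treats each class, especially when one class is much more frequent than the other. A small sketch of precision and recall for the `spam` class follows; the labels and predictions are made up for illustration and are not the SMS results above.

```python
def per_class_report(y_true, y_pred, positive="spam"):
    """Precision and recall for one class, computed from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}

# Hypothetical labels and predictions
y_true = ["ham", "ham", "spam", "ham", "spam", "ham"]
y_pred = ["ham", "spam", "spam", "ham", "ham", "ham"]
report = per_class_report(y_true, y_pred)
```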
