# Using Naive Bayes algorithm for spam detection

In this assignment, you will predict whether an SMS message is 'spam' or 'ham' (i.e. not 'spam') using the Bernoulli Naive Bayes *classifier*.

The training data is from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection. Please download and extract the zip file before running the code below.


## Step 1:  Getting, understanding, and cleaning the dataset


###  Importing the dataset

In [66]:
# Import the usual libraries
import numpy as np 
import pandas as pd  # We will use the pandas library to read in the dataset
# The file is tab-separated and has no header row, so we supply the column names ourselves
df = pd.read_table('sms+spam+collection/SMSSpamCollection', sep='\t', header=None, names=['label', 'sms_message'])

# Next we observe the first 5 rows of the data to ensure everything was read correctly
df.head() 

|   | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


### Data preprocessing
It is more convenient if the labels are binary instead of 'ham' and 'spam'. This way our code can always work with numerical values instead of strings.

In [67]:
df['label']=df.label.map({'spam':1, 'ham':0})
df.head() # Again, let's observe the first 5 rows to make sure everything worked before we continue

|   | label | sms_message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


### Splitting the dataset into a training set and a test set

In [68]:
# This time we will use sklearn's method for separating the data
from sklearn.model_selection import train_test_split

df_train_msgs, df_test_msgs, df_ytrain, df_ytest = train_test_split(df['sms_message'],df['label'], random_state=0)

# Looking at the train/test split
print("The number of training examples: ", df_train_msgs.shape[0])
print("The number of test examples: ", df_test_msgs.shape[0])

print("The first four labels")
print(df_ytrain[0:4])

print("The first four sms messages")
print(df_train_msgs[0:4])


The number of training examples:  4179
The number of test examples:  1393
The first four labels
872     0
831     1
1273    0
3314    0
Name: label, dtype: int64
The first four sms messages
872     Its going good...no problem..but still need li...
831     U have a secret admirer. REVEAL who thinks U R...
1273                                                Ok...
3314    Huh... Hyde park not in mel ah, opps, got conf...
Name: sms_message, dtype: object


###  Creating the feature vector from the text (feature extraction)

Each message will have its own feature vector, built as we discussed in class: there is one feature for every word in our vocabulary. The $j$th feature is set to one ($x_j=1$) if the $j$th word of our vocabulary occurs in the message, and to zero ($x_j=0$) otherwise.

We will use the sklearn class CountVectorizer to create the feature vector for every message.

We could have written the code to do this directly by performing the following steps:
* remove capitalization
* remove punctuation 
* tokenize (i.e. split the document into individual words)
* count frequencies of each token 
* remove 'stop words' (words that will not help us predict because they occur in most documents, e.g. 'a', 'and', 'the', 'him', 'is', ...)
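
The manual steps listed above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names (`tokenize`, `build_vocabulary`, `to_binary_vector`), the five-word stop list, and the two toy messages are all hypothetical, and the tokenization rule is simpler than the one CountVectorizer applies. Because the Bernoulli model only needs presence or absence, the sketch skips the frequency-counting step.

```python
import re

# Tiny illustrative stop-word list (an assumption; sklearn's English list is much longer)
STOP_WORDS = {"a", "and", "the", "him", "is"}

def tokenize(message):
    """Lowercase the text, split it into word tokens, and drop punctuation and stop words."""
    tokens = re.findall(r"[a-z0-9]+", message.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_vocabulary(messages):
    """Map every distinct token in the training messages to a column index."""
    words = sorted({t for m in messages for t in tokenize(m)})
    return {word: j for j, word in enumerate(words)}

def to_binary_vector(message, vocabulary):
    """Return x where x[j] = 1 if vocabulary word j occurs in the message, else 0."""
    x = [0] * len(vocabulary)
    for t in set(tokenize(message)):
        if t in vocabulary:  # words never seen in training are ignored
            x[vocabulary[t]] = 1
    return x

# Made-up training messages, used only to build a small vocabulary
train = ["Free entry to win the cup!", "Ok, see him and win"]
vocabulary = build_vocabulary(train)
print(vocabulary)
print(to_binary_vector("WIN a free prize, win!", vocabulary))
```

Repeated words ('win' appears twice in the last message) still produce a single 1, which is the behavior that `binary = True` requests from CountVectorizer.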

In [69]:
# importing the library
from sklearn.feature_extraction.text import CountVectorizer
# creating an instance of CountVectorizer
# Note there are issues with the way CountVectorizer removes stop words.  To learn more: https://scikit-learn.org/stable/modules/feature_extraction.html#stop-words
vectorizer = CountVectorizer(binary = True, stop_words='english')

# If we wanted to perform Multinomial Naive Bayes, we would include the word counts and use the following code
#vectorizer = CountVectorizer(stop_words='english')

# To inspect the vectorizer object
print(vectorizer)

CountVectorizer(binary=True, stop_words='english')


In [70]:
# To see the 'stop words' 
print(vectorizer.get_stop_words())

frozenset({'hereafter', 'via', 'per', 'besides', 'before', 'serious', 'indeed', 'whether', 'few', 'i', 'thereby', 'for', 'move', 'therefore', 'whatever', 'others', 'she', 'behind', 'call', 'yourself', 'there', 'among', 'anyway', 'someone', 'together', 'always', 'ourselves', 'fifteen', 'and', 'fifty', 'become', 'forty', 'along', 'eleven', 'these', 'whence', 'nothing', 'eg', 'all', 'neither', 'again', 'whereupon', 'me', 'four', 'whereby', 'thereafter', 'bill', 'might', 'first', 'back', 'with', 'an', 'nobody', 'bottom', 'everything', 'show', 'whither', 'that', 'too', 'some', 'name', 'hereby', 'couldnt', 'thick', 'since', 'no', 'top', 'own', 'twenty', 're', 'from', 'anyhow', 'however', 'seems', 'during', 'very', 'latterly', 'seem', 'by', 'should', 'rather', 'of', 'may', 'can', 'ever', 'everywhere', 'find', 'alone', 'below', 'do', 'found', 'had', 'another', 'why', 'meanwhile', 'side', 'not', 'the', 'be', 'beyond', 'see', 'less', 'con', 'its', 'though', 'put', 'whole', 'give', 'latter', 'bec

In [71]:
# Create the vocabulary for our feature transformation
vectorizer.fit(df_train_msgs)

# Next we create the feature vectors for both the training data and the test data
X_train = vectorizer.transform(df_train_msgs).toarray() # turn the training messages into feature vectors
X_test = vectorizer.transform(df_test_msgs).toarray() # turn the test messages into feature vectors

# Changing the target vectors' data type
y_train = df_ytrain.to_numpy() # Converting from a pandas Series to a numpy array
y_test = df_ytest.to_numpy()

# To observe what the data looks like
print("The label of the first training example: ", y_train[0])
print("The first training example: ", X_train[0].tolist()) # Converted to a list so that all the feature values are printed

The label of the first training example:  0
The first training example:  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

# Your code goes here

In [72]:
# You should NOT use any features of the sklearn library in your code.
#### TO-DO #####
def Class_Probability_Estimation(df_ytrain):
    total_examples = df_ytrain.size
    spam_examples = df_ytrain.value_counts()[1]
    ham_examples = df_ytrain.value_counts()[0]
    prob_spam = spam_examples/total_examples
    prob_ham = ham_examples/total_examples
    return prob_spam, prob_ham

##############
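
The prior estimate above is simply the fraction of training labels in each class. A minimal library-free sketch of the same idea follows; the helper name `estimate_class_priors` and the toy labels are assumptions made for illustration and are not part of the assignment code.

```python
from collections import Counter

def estimate_class_priors(labels):
    """Return {class: fraction of training examples that carry that class}."""
    counts = Counter(labels)
    total = len(labels)
    return {y: n / total for y, n in counts.items()}

# Made-up labels: 1 = spam, 0 = ham
toy_labels = [0, 0, 0, 1, 0, 1, 0, 0]
priors = estimate_class_priors(toy_labels)
print(priors)
```

The estimates always sum to 1, which gives a quick sanity check on the two values returned by `Class_Probability_Estimation`.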

In [73]:
#print(df_ytrain)
prob_spam, prob_ham = Class_Probability_Estimation(df_ytrain)
y_probs = {0:prob_ham,1:prob_spam}

In [74]:
print("1.")
print("The estimated value of P(y) for each class y")
print("Estimated value of P(y=0) is",prob_ham)
print("Estimated value of P(y=1) is",prob_spam)

1.
The estimated value of P(y) for each class y
Estimated value of P(y=0) is 0.8655180665230916
Estimated value of P(y=1) is 0.13448193347690834


In [75]:
# Calculating phi(x_i | y), the estimate of p(x_i = 1 | y), for each feature i and class y
def calculate_phi_xi_given_y(X_train, y_train):
    phi_values = {}
    n_features = len(X_train[0])
    for y in np.unique(y_train):
        phi_values[y] = []
        for i in range(n_features):
            numerator = np.sum((X_train[:,i] == 1) & (y_train == y))
            denominator = np.sum(y_train == y)
            phi_value = numerator/denominator
            phi_values[y].append(phi_value)
    return phi_values
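
For 0/1 features, the ratio computed in the inner loop equals the mean of column `i` over the rows of class `y`, so the same estimates can be obtained without looping over features. The sketch below assumes that equivalence and uses a hypothetical helper name (`estimate_phi_vectorized`) and made-up toy data; it is meant as an illustration, not as a replacement for the assignment code.

```python
import numpy as np

def estimate_phi_vectorized(X, y):
    """Return {class: array of estimated p(x_j = 1 | y = class)} using column means."""
    X = np.asarray(X)
    y = np.asarray(y)
    # For 0/1 features, the mean of a column over the rows of one class equals
    # (number of rows with x_j = 1) / (number of rows in the class).
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

# Made-up toy data: 4 messages, 3 vocabulary words
X_toy = np.array([[1, 0, 1],
                  [0, 0, 1],
                  [1, 1, 0],
                  [0, 1, 0]])
y_toy = np.array([0, 0, 1, 1])

phi = estimate_phi_vectorized(X_toy, y_toy)
print(phi[0])  # estimates for class 0
print(phi[1])  # estimates for class 1
```

On a matrix with several thousand columns, the boolean mask and column mean avoid rescanning `y_train` once per feature, which is where the nested loops spend most of their time.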
    

In [76]:
phi_values = calculate_phi_xi_given_y(X_train, y_train)
phi_x_given_y_equals_one = phi_values[1]
phi_x_given_y_equals_zero = phi_values[0]
print("2.")
print("For class y=1")
print("The estimated p(xi|y=1) is:")
print(phi_x_given_y_equals_one)


2.
For class y=1
The estimated p(xi|y=1) is:
[0.012455516014234875, 0.03558718861209965, 0.0, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0, 0.010676156583629894, 0.0017793594306049821, 0.0017793594306049821, 0.0035587188612099642, 0.0035587188612099642, 0.019572953736654804, 0.014234875444839857, 0.0017793594306049821, 0.0071174377224199285, 0.0035587188612099642, 0.0035587188612099642, 0.010676156583629894, 0.0035587188612099642, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0035587188612099642, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0035587188612099642, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0035587188612099642, 0.001779

In [77]:
print("For class y=0")
print("The estimated p(xi|y=0) is:")
print(phi_x_given_y_equals_zero)

For class y=0
The estimated p(xi|y=0) is:
[0.0, 0.0, 0.0002764722145424385, 0.0, 0.0, 0.0, 0.0, 0.0002764722145424385, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0

In [78]:
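def determine_class_scores(example,y_probs,phi_values):
    # Returns {class: log P(y) + sum of the log-likelihoods of every feature value}
    class_scores = {}
    for y in y_probs:
        score = np.log(y_probs[y])
        for i,x in enumerate(example):
            # Note: the `!= 0` tests only skip terms whose log is exactly 0 (probability 1).
            # An unsmoothed probability of 0 still contributes log(0) = -inf to the score.
            if (x == 1) and (np.log(phi_values[y][i]) != 0):
                score += np.log(phi_values[y][i])
            elif (x == 0) and (np.log(1 - phi_values[y][i]) != 0):
                score += np.log(1 - phi_values[y][i])
        class_scores[y] = score
        
    return class_scores  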
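
Written out, the score that this function accumulates for a class $y$ is the logarithm of the Naive Bayes joint probability. The notation is an assumption introduced here for readability: $\phi_{j\mid y}$ stands for the estimated $p(x_j=1\mid y)$ and $n$ for the vocabulary size.

$$
\text{score}(y) = \log P(y) + \sum_{j=1}^{n} \Big[\, x_j \log \phi_{j\mid y} + (1 - x_j) \log\big(1 - \phi_{j\mid y}\big) \Big]
$$

The predicted class is the $y$ with the larger score. If a test message contains a word with $\phi_{j\mid y} = 0$, the corresponding term is $\log 0 = -\infty$ and the whole score for that class becomes $-\infty$, whatever the other words suggest. This is the situation that the smoothing parameter $m$ is meant to prevent.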
    

In [79]:
predicted_label = []
for example in X_test:
    class_scores = determine_class_scores(example, y_probs, phi_values)
    if class_scores[1] >= class_scores[0]:
        predicted_label.append("spam")
    else:
        predicted_label.append("ham")



In [80]:
print("3.")
print("The predicted class for the first 50 test examples")
print(predicted_label[:50])

3.
The predicted class for the first 50 test examples
['ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam', 'ham', 'ham', 'ham', 'spam', 'ham', 'ham', 'spam', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam']


In [81]:
correct_predictions = 0
incorrect_predictions = 0
for i,y in enumerate(y_test):
    if predicted_label[i] == "ham" and y == 0:
        correct_predictions += 1
    elif predicted_label[i] == "spam" and y == 1:
        correct_predictions += 1
    else:
        incorrect_predictions += 1
        
accuracy = correct_predictions/len(y_test)
print("4. Total number of examples classified correctly = ", correct_predictions)
print("5. Total number of misclassified examples = ", incorrect_predictions)
print("6. Percentage error = ", incorrect_predictions/len(y_test))
print("Accuracy = ", correct_predictions/len(y_test))
    

4. Total number of examples classified correctly =  1325
5. Total number of misclassified examples =  68
6. Percentage error =  0.048815506101938265
Accuracy =  0.9511844938980617


In [82]:
print("a) Estimated value of p(y) when y=1 is ", prob_spam)

a) Estimated value of p(y) when y=1 is  0.13448193347690834


In [83]:
print("b) Estimated value of p(y) when y=0 is ", prob_ham)

b) Estimated value of p(y) when y=0 is  0.8655180665230916


In [84]:
feature_names = vectorizer.get_feature_names_out()

In [85]:
# Find the column index of the word 'admirer' in the vocabulary
ctr = 0
for i in feature_names:
    if i == 'admirer':
        break
    ctr += 1
print(ctr)

799


In [86]:
print("c)")
print("phi(admirer|y=1) is equal to, ",phi_x_given_y_equals_one[799])
print("phi(admirer|y=0) is equal to,",phi_x_given_y_equals_zero[799])

c)
phi(admirer|y=1) is equal to,  0.014234875444839857
phi(admirer|y=0) is equal to, 0.0


In [87]:
# Find the column index of the word 'secret' in the vocabulary
ctr = 0
for i in feature_names:
    if i == 'secret':
        break
    ctr += 1
print(ctr)

5642


In [88]:
print("d)")
print("phi(secret|y=1) is equal to, ",phi_x_given_y_equals_one[5642])
print("phi(secret|y=0) is equal to,",phi_x_given_y_equals_zero[5642])

d)
phi(secret|y=1) is equal to,  0.014234875444839857
phi(secret|y=0) is equal to, 0.000552944429084877


In [89]:
print("e) Classes predicted for first five examples in the test set:", predicted_label[:5])

e) Classes predicted for first five examples in the test set: ['ham', 'ham', 'ham', 'ham', 'ham']


In [90]:
print("f) Classes predicted for last five examples in the test set:", predicted_label[-5:])

f) Classes predicted for last five examples in the test set: ['ham', 'spam', 'spam', 'ham', 'ham']


In [91]:
print("g) Percentage error = ", incorrect_predictions/len(y_test))

g) Percentage error =  0.048815506101938265


In [92]:
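# Repeat the experiment for different values of the smoothing parameter m
# a) m = 0.33

def calculate_smoothed_phi_xi_given_y(X_train, y_train, m):
    # Smoothed estimate of p(x_i = 1 | y): m is added to every word count
    # so that no estimate is exactly 0.
    phi_values = {}
    n_features = len(X_train[0])
    for y in np.unique(y_train):
        phi_values[y] = []
        for i in range(n_features):
            numerator = np.sum((X_train[:,i] == 1) & (y_train == y)) + m
            # Note: this denominator scales the class count by (1 + m); textbook
            # additive smoothing of a binary feature would use class count + 2*m.
            denominator = np.sum(y_train == y) + (np.sum(y_train == y) * m)
            phi_value = numerator/denominator
            phi_values[y].append(phi_value)
    return phi_values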
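
For comparison, the textbook form of additive (Laplace-style) smoothing adds `m` to the count of each outcome and `k * m` to the total, where `k` is the number of outcomes a feature can take (2 for a present/absent word). The sketch below shows that form only as a reference point. It is not the formula used in the cell above, whose denominator is the class count multiplied by `(1 + m)`, so it would not reproduce the numbers printed in this notebook. The function name `smoothed_phi` and the counts are made up.

```python
def smoothed_phi(count_word_and_class, count_class, m, k=2):
    """Additive (Laplace-style) estimate of p(x_j = 1 | y).

    count_word_and_class: training messages of the class that contain the word
    count_class: training messages of the class
    m: pseudo-count added to each outcome
    k: number of outcomes per feature (2 for a present/absent feature)
    """
    return (count_word_and_class + m) / (count_class + k * m)

# Made-up counts for a class with 10 training messages
print(smoothed_phi(0, 10, m=1))   # word never seen in the class: 1/12 instead of 0
print(smoothed_phi(10, 10, m=1))  # word in every message of the class: 11/12 instead of 1
print(smoothed_phi(3, 10, m=0))   # m = 0 recovers the unsmoothed estimate 3/10
```

With `k = 2`, the estimates for "word present" and "word absent" still sum to 1, and neither can be exactly 0 or 1 when `m > 0`, so both logarithms in the class score stay finite.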



In [93]:
phi_values = calculate_smoothed_phi_xi_given_y(X_train, y_train, 0.33)
phi_x_given_y_equals_one = phi_values[1]
phi_x_given_y_equals_zero = phi_values[0]
print(len(phi_x_given_y_equals_one))
print(phi_x_given_y_equals_one)


7287
[0.009806544831830466, 0.02719877986781901, 0.0004414951970674016, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0004414951970674016, 0.008468680598292885, 0.0017793594306049821, 0.0017793594306049821, 0.0031172236641425627, 0.0031172236641425627, 0.015158001765980788, 0.011144409065368045, 0.0017793594306049821, 0.005792952131217724, 0.0031172236641425627, 0.0031172236641425627, 0.008468680598292885, 0.0031172236641425627, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0031172236641425627, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0031172236641425627, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0031172236641425627, 0.00177935943

In [94]:
print(len(phi_x_given_y_equals_zero))
print(phi_x_given_y_equals_zero)

7287
[6.859836902180803e-05, 6.859836902180803e-05, 0.00027647221454243844, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 0.00027647221454243844, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.859836902180803e-05, 6.8

In [95]:
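# Same computation as determine_class_scores; the prediction cells below call determine_class_scores directly.
def determine_class_scores_after_smoothing(example,y_probs,phi_values):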
    class_scores = {}
    for y in y_probs:
        score = np.log(y_probs[y])
        for i,x in enumerate(example):
            if (x == 1) and (np.log(phi_values[y][i]) != 0):
                score += np.log(phi_values[y][i])
            elif (x == 0) and (np.log(1 - phi_values[y][i]) != 0):
                score += np.log(1 - phi_values[y][i])
        class_scores[y] = score
        
    return class_scores

In [96]:
predicted_label = []
for example in X_test:
    class_scores = determine_class_scores(example, y_probs, phi_values)
    if class_scores[1] >= class_scores[0]:
        predicted_label.append(1)
    else:
        predicted_label.append(0)

In [97]:
correct_predictions = 0
incorrect_predictions = 0
for i,y in enumerate(y_test):
    if y == predicted_label[i]:
        correct_predictions += 1
    else:
        incorrect_predictions += 1
        
accuracy = correct_predictions/len(y_test)
print("Total number of examples classified correctly (m = 0.33) = ", correct_predictions)
print("Total number of misclassified examples (m = 0.33) = ", incorrect_predictions)
print("Percentage error (m = 0.33) = ", incorrect_predictions/len(y_test))
print("Accuracy = (m = 0.33) ", correct_predictions/len(y_test))
    

Total number of examples classified correctly (m = 0.33) =  1375
Total number of misclassified examples (m = 0.33) =  18
Percentage error (m = 0.33) =  0.012921751615218953
Accuracy = (m = 0.33)  0.9870782483847811


In [98]:
## for m = 0.66

In [99]:
phi_values = calculate_smoothed_phi_xi_given_y(X_train, y_train, 0.66)
phi_x_given_y_equals_one = phi_values[1]
phi_x_given_y_equals_zero = phi_values[0]
print(len(phi_x_given_y_equals_one))
print(phi_x_given_y_equals_one)


7287
[0.008210779059297688, 0.022145521588131885, 0.0007074561591561977, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0007074561591561977, 0.007138875787848904, 0.0017793594306049821, 0.0017793594306049821, 0.0028512627020537665, 0.0028512627020537665, 0.012498392145092825, 0.009282682330746472, 0.0017793594306049821, 0.004995069244951336, 0.0028512627020537665, 0.0028512627020537665, 0.007138875787848904, 0.0028512627020537665, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0028512627020537665, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0028512627020537665, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0028512627020537665, 0.0017793594

In [100]:
print(len(phi_x_given_y_equals_zero))
print(phi_x_given_y_equals_zero)

7287
[0.00010992268770964422, 0.00010992268770964422, 0.0002764722145424385, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.0002764722145424385, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.00010992268770964422, 0.0001099226

In [102]:
predicted_label = []
for example in X_test:
    class_scores = determine_class_scores(example, y_probs, phi_values)
    if class_scores[1] >= class_scores[0]:
        predicted_label.append(1)
    else:
        predicted_label.append(0)

In [103]:
correct_predictions = 0
incorrect_predictions = 0
for i,y in enumerate(y_test):
    if y == predicted_label[i]:
        correct_predictions += 1
    else:
        incorrect_predictions += 1
        
accuracy = correct_predictions/len(y_test)
print("Total number of examples classified correctly (m = 0.66) = ", correct_predictions)
print("Total number of misclassified examples (m = 0.66) = ", incorrect_predictions)
print("Percentage error (m = 0.66) = ", incorrect_predictions/len(y_test))
print("Accuracy = (m = 0.66) ", correct_predictions/len(y_test))
    

Total number of examples classified correctly (m = 0.66) =  1375
Total number of misclassified examples (m = 0.66) =  18
Percentage error (m = 0.66) =  0.012921751615218953
Accuracy = (m = 0.66)  0.9870782483847811


In [104]:
# m = 1

In [105]:
phi_values = calculate_smoothed_phi_xi_given_y(X_train, y_train, 1)
phi_x_given_y_equals_one = phi_values[1]
phi_x_given_y_equals_zero = phi_values[0]
print(len(phi_x_given_y_equals_one))
print(phi_x_given_y_equals_one)


7287
[0.0071174377224199285, 0.018683274021352312, 0.0008896797153024911, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0008896797153024911, 0.006227758007117438, 0.0017793594306049821, 0.0017793594306049821, 0.0026690391459074734, 0.0026690391459074734, 0.010676156583629894, 0.00800711743772242, 0.0017793594306049821, 0.004448398576512456, 0.0026690391459074734, 0.0026690391459074734, 0.006227758007117438, 0.0026690391459074734, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0026690391459074734, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0026690391459074734, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0026690391459074734, 0.0017793594

In [106]:
print(len(phi_x_given_y_equals_zero))
print(phi_x_given_y_equals_zero)

7287
[0.00013823610727121925, 0.00013823610727121925, 0.0002764722145424385, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.0002764722145424385, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.00013823610727121925, 0.0001382361

In [108]:
predicted_label = []
for example in X_test:
    class_scores = determine_class_scores(example, y_probs, phi_values)
    if class_scores[1] >= class_scores[0]:
        predicted_label.append(1)
    else:
        predicted_label.append(0)

In [109]:
correct_predictions = 0
incorrect_predictions = 0
for i,y in enumerate(y_test):
    if y == predicted_label[i]:
        correct_predictions += 1
    else:
        incorrect_predictions += 1
        
accuracy = correct_predictions/len(y_test)
print("Total number of examples classified correctly (m = 1) = ", correct_predictions)
print("Total number of misclassified examples (m = 1) = ", incorrect_predictions)
print("Percentage error (m = 1) = ", incorrect_predictions/len(y_test))
print("Accuracy = (m = 1) ", correct_predictions/len(y_test))
    

Total number of examples classified correctly (m = 1) =  1369
Total number of misclassified examples (m = 1) =  24
Percentage error (m = 1) =  0.01722900215362527
Accuracy = (m = 1)  0.9827709978463748


In [110]:
#m = 0.75

In [111]:
phi_values = calculate_smoothed_phi_xi_given_y(X_train, y_train, 0.75)
phi_x_given_y_equals_one = phi_values[1]
phi_x_given_y_equals_zero = phi_values[0]
print(len(phi_x_given_y_equals_one))
print(phi_x_given_y_equals_one)


7287
[0.00788002033553635, 0.021098118962887647, 0.000762582613116421, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.000762582613116421, 0.0068632435180477885, 0.0017793594306049821, 0.0017793594306049821, 0.0027961362480935434, 0.0027961362480935434, 0.011947127605490595, 0.008896797153024912, 0.0017793594306049821, 0.004829689883070666, 0.0027961362480935434, 0.0027961362480935434, 0.0068632435180477885, 0.0027961362480935434, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0027961362480935434, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0027961362480935434, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0027961362480935434, 0.00177935943

In [112]:
print(len(phi_x_given_y_equals_zero))
print(phi_x_given_y_equals_zero)

7287
[0.00011848809194675935, 0.00011848809194675935, 0.0002764722145424385, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.0002764722145424385, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.00011848809194675935, 0.0001184880

In [114]:
predicted_label = []
for example in X_test:
    class_scores = determine_class_scores(example, y_probs, phi_values)
    if class_scores[1] >= class_scores[0]:
        predicted_label.append(1)
    else:
        predicted_label.append(0)

In [115]:
correct_predictions = 0
incorrect_predictions = 0
for i,y in enumerate(y_test):
    if y == predicted_label[i]:
        correct_predictions += 1
    else:
        incorrect_predictions += 1
        
accuracy = correct_predictions/len(y_test)
print("Total number of examples classified correctly (m = 0.75) = ", correct_predictions)
print("Total number of misclassified examples (m = 0.75) = ", incorrect_predictions)
print("Percentage error (m = 0.75) = ", incorrect_predictions/len(y_test))
print("Accuracy = (m = 0.75) ", correct_predictions/len(y_test))
    

Total number of examples classified correctly (m = 0.75) =  1371
Total number of misclassified examples (m = 0.75) =  22
Percentage error (m = 0.75) =  0.015793251974156496
Accuracy = (m = 0.75)  0.9842067480258435


In [116]:
# m = 0.5

In [117]:
phi_values = calculate_smoothed_phi_xi_given_y(X_train, y_train, 0.5)
phi_x_given_y_equals_one = phi_values[1]
phi_x_given_y_equals_zero = phi_values[0]
print(len(phi_x_given_y_equals_one))
print(phi_x_given_y_equals_one)


7287
[0.008896797153024912, 0.02431791221826809, 0.0005931198102016608, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0005931198102016608, 0.00771055753262159, 0.0017793594306049821, 0.0017793594306049821, 0.0029655990510083037, 0.0029655990510083037, 0.013641755634638196, 0.010083036773428233, 0.0017793594306049821, 0.005338078291814947, 0.0029655990510083037, 0.0029655990510083037, 0.00771055753262159, 0.0029655990510083037, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0029655990510083037, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0029655990510083037, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0017793594306049821, 0.0029655990510083037, 0.0017793594306

In [118]:
print(len(phi_x_given_y_equals_zero))
print(phi_x_given_y_equals_zero)

7287
[9.21574048474795e-05, 9.21574048474795e-05, 0.0002764722145424385, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 0.0002764722145424385, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.21574048474795e-05, 9.

In [120]:
predicted_label = []
for example in X_test:
    class_scores = determine_class_scores(example, y_probs, phi_values)
    if class_scores[1] >= class_scores[0]:
        predicted_label.append(1)
    else:
        predicted_label.append(0)

In [121]:
correct_predictions = 0
incorrect_predictions = 0
for i,y in enumerate(y_test):
    if y == predicted_label[i]:
        correct_predictions += 1
    else:
        incorrect_predictions += 1
        
accuracy = correct_predictions/len(y_test)
print("Total number of examples classified correctly (m = 0.5) = ", correct_predictions)
print("Total number of misclassified examples (m = 0.5) = ", incorrect_predictions)
print("Percentage error (m = 0.5) = ", incorrect_predictions/len(y_test))
print("Accuracy = (m = 0.5) ", correct_predictions/len(y_test))
    

Total number of examples classified correctly (m = 0.5) =  1375
Total number of misclassified examples (m = 0.5) =  18
Percentage error (m = 0.5) =  0.012921751615218953
Accuracy = (m = 0.5)  0.9870782483847811
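
The five experiments above repeat the same three cells with a different value of `m`. One way to fold them into a single loop is sketched below. To stay self-contained, the sketch uses plain Python, hypothetical helper names (`fit`, `predict`, `error_rate`) and a made-up toy dataset, so its printed error rates say nothing about the SMS data. The smoothing formula and the tie-breaking rule (ties go to spam) mirror the notebook's cells.

```python
import math

def fit(X, y, m):
    """Estimate class priors and smoothed p(x_j = 1 | class) from 0/1 features."""
    priors, phi = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        priors[c] = len(rows) / len(y)
        # Same smoothing formula as calculate_smoothed_phi_xi_given_y
        phi[c] = [(sum(col) + m) / (len(rows) + len(rows) * m) for col in zip(*rows)]
    return priors, phi

def predict(x, priors, phi):
    """Return 1 (spam) if its log score is at least that of 0 (ham), else 0."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for xj, p in zip(x, phi[c]):
            score += math.log(p) if xj == 1 else math.log(1 - p)
        scores[c] = score
    return 1 if scores[1] >= scores[0] else 0

def error_rate(X, y, priors, phi):
    """Fraction of examples whose predicted label differs from the true label."""
    wrong = sum(predict(x, priors, phi) != label for x, label in zip(X, y))
    return wrong / len(y)

# Made-up toy data: 3 vocabulary words, labels 1 = spam, 0 = ham
X_train_toy = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
y_train_toy = [1, 1, 0, 0]
X_test_toy = [[1, 0, 0], [0, 0, 1], [1, 1, 0]]
y_test_toy = [1, 0, 1]

results = {}
for m in (0.33, 0.5, 0.66, 0.75, 1):
    priors, phi = fit(X_train_toy, y_train_toy, m)
    results[m] = error_rate(X_test_toy, y_test_toy, priors, phi)
    print(f"m = {m}: error rate = {results[m]:.3f}")
```

Collecting the error rates in a dictionary keyed by `m` makes it easy to compare the settings side by side instead of scrolling between cells.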


In [122]:
# Zero-R Algorithm

In [123]:
def estimate_majority_class(y_train):
    unique_classes, class_count = np.unique(y_train, return_counts = True)
    majority_class_index = np.argmax(class_count)
    return unique_classes[majority_class_index]

In [124]:
majority_class_label = estimate_majority_class(y_train)

In [125]:
zero_r_predictions = np.full_like(y_test, fill_value=majority_class_label)

In [126]:
correct_predictions = np.sum(zero_r_predictions == y_test)

In [127]:
accuracy = correct_predictions / len(y_test)

In [128]:
print("i) accuracy using Zero-R algorithm = ", accuracy)

i) accuracy using Zero-R algorithm =  0.8671931083991385
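
Zero-R ignores the features entirely: it predicts the majority class of the training set for every test example, so its accuracy equals the share of that class among the test labels. The same idea in library-free form is sketched below; the function names and the toy labels are assumptions made for illustration.

```python
from collections import Counter

def zero_r_fit(train_labels):
    """Return the most frequent label in the training set."""
    return Counter(train_labels).most_common(1)[0][0]

def zero_r_accuracy(majority_label, test_labels):
    """Accuracy obtained by predicting the majority label for every test example."""
    hits = sum(1 for y in test_labels if y == majority_label)
    return hits / len(test_labels)

# Made-up labels: 1 = spam, 0 = ham
train_labels = [0, 0, 0, 1, 0, 1]
test_labels = [0, 1, 0, 0]

majority = zero_r_fit(train_labels)
print(majority, zero_r_accuracy(majority, test_labels))
```

The Zero-R accuracy of about 0.867 reported above is the baseline that any classifier on this data should beat; the Naive Bayes accuracies of roughly 0.95 without smoothing and 0.98 to 0.99 with smoothing are best read against it.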
