# Machine Learning - Group Assignment 2

### Group G: Joanna Andari, Karim Awad, Jiye Ren, Nirbhay Sharma, Qiuyue Zhang, Xiaoyan Zhou
#### 09/12/2017

In [32]:
import re
import pandas as pd
import sklearn
import numpy as np
from sklearn.utils import shuffle
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix
from astropy.table import Table, Column

##### Question 1: 
Load the data into a Python data frame

In [33]:
#Data loading
messages = pd.read_csv('C:/Users/karim/Documents/Imperial/Machine Learning/ProblemSets/Assignment2/SMSSpamCollection.txt',sep='\t', header=None,
                           names=["label", "messages"])

##### Question 2: 
Pre-process the SMS messages: Remove all punctuation and numbers from the SMS messages, and change all messages to lower case. (Please provide the Python code that achieves this!)

In [34]:
#Data processing
def data_processing(p):
    remove_number_punc = re.sub("[^a-zA-Z]", " ", p)
    convert_to_lower_letter = remove_number_punc.lower()
    return convert_to_lower_letter

messages['messages'] = messages['messages'].apply(data_processing)
print(messages.head())
messages.groupby('label').describe()
messages['messages'].describe()

  label                                           messages
0   ham  go until jurong point  crazy   available only ...
1   ham                      ok lar    joking wif u oni   
2  spam  free entry in   a wkly comp to win fa cup fina...
3   ham  u dun say so early hor    u c already then say   
4   ham  nah i don t think he goes to usf  he lives aro...


count                       5572
unique                      5146
top       sorry  i ll call later
freq                          30
Name: messages, dtype: object
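
As a quick sanity check of the pre-processing step, the sketch below applies the same regular expression to a made-up message (the sample string is hypothetical and not taken from the data set). It shows that every removed character is replaced by a space, so the cleaned text keeps runs of spaces that disappear once the message is split into words.

```python
import re

def data_processing(p):
    # Replace every character that is not a letter with a space, then lower-case.
    return re.sub("[^a-zA-Z]", " ", p).lower()

sample = "WIN $1000 now!! Call 0800-123"  # hypothetical message
cleaned = data_processing(sample)
print(repr(cleaned))
print(cleaned.split())  # ['win', 'now', 'call']
```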

##### Question 3: 
Shuffle the messages and split them into a training set (2,500 messages), a validation set (1,000 messages) and a test set (all remaining messages).


In [35]:
#Data shuffling and segmenting
random_seed = 100

clean_messages_shuffled = shuffle(messages, random_state = random_seed)
n_train, n_val = 2500, 1000  # set sizes required by the question
training_set = clean_messages_shuffled.iloc[:n_train]
validation_set = clean_messages_shuffled.iloc[n_train:n_train + n_val]
test_set = clean_messages_shuffled.iloc[n_train + n_val:]  # all remaining messages

print(len(training_set),len(validation_set),len(test_set))

2500 1000 2072
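
The split can also be checked for correctness independently of pandas. The following is a minimal sketch, assuming a hypothetical helper `three_way_split` and using Python's `random` module instead of `sklearn.utils.shuffle`, so the resulting order differs from the notebook's. It only illustrates that the three sets have the required sizes and together cover every message exactly once.

```python
import random

def three_way_split(items, n_train, n_val, seed=100):
    # Shuffle a copy of the items with a fixed seed, then cut it at two boundaries.
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# 5572 is the number of messages reported by describe() above.
train, val, test = three_way_split(range(5572), 2500, 1000)
print(len(train), len(val), len(test))  # 2500 1000 2072
```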


##### Question 4: 
While Python's SciKit-Learn library has a Naïve Bayes classifier, it works with continuous probability distributions and assumes numerical features. Although it is possible to transform categorical variables into numerical features using a binary encoding, we will instead build a simple Naïve Bayes classifier from scratch:

In [36]:
class NaiveBayesForSpam:
    def train (self, hamMessages, spamMessages):
        self.words = set (' '.join (hamMessages + spamMessages).split())
        self.priors = np.zeros (2)
        self.priors[0] = float (len (hamMessages)) / (len (hamMessages) + len (spamMessages)) # prior P(ham): share of ham messages
        self.priors[1] = 1.0 - self.priors[0] # prior P(spam)
        self.likelihoods = [] # one [P(w|ham), P(w|spam)] pair per vocabulary word
        for i, w in enumerate (self.words):
            # P(w|ham) with the Laplace estimator (add one); note that `w in m` is a substring test
            prob1 = (1.0 + len ([m for m in hamMessages if w in m])) / len (hamMessages)
            # P(w|spam), estimated in the same way
            prob2 = (1.0 + len ([m for m in spamMessages if w in m])) / len (spamMessages)
            # cap both likelihoods at 0.95 so that (1 - likelihood) in predict never reaches 0
            self.likelihoods.append ([min (prob1, 0.95), min (prob2, 0.95)])
        self.likelihoods = np.array (self.likelihoods).T  # shape (2, number of words): row 0 = ham, row 1 = spam
        
        

    def train2 (self, hamMessages, spamMessages):
        self.words = set (' '.join (hamMessages + spamMessages).split())
        self.priors = np.zeros (2)
        self.priors[0] = float (len (hamMessages)) / (len (hamMessages) + len (spamMessages))
        self.priors[1] = 1.0 - self.priors[0]
        self.likelihoods = []
        spamkeywords = []
        for i, w in enumerate (self.words):
            prob1 = (1.0 + len ([m for m in hamMessages if w in m])) / len (hamMessages)
            prob2 = (1.0 + len ([m for m in spamMessages if w in m])) / len (spamMessages)
            if prob1 * 20 < prob2: # checks if the probability of a word being a spam is 20 times higher than the probability
                                   # of a word being a ham message.
                self.likelihoods.append ([min (prob1, 0.95), min (prob2, 0.95)])
                spamkeywords.append (w)
        self.words = spamkeywords
        self.likelihoods = np.array (self.likelihoods).T

    
    def predict (self, message):
        posteriors = np.copy (self.priors) # start from the class priors [P(ham), P(spam)]
        for i, w in enumerate (self.words): # loop over every vocabulary word
            if w in message.lower():  # word occurs in the lower-cased message (substring test)
                posteriors *= self.likelihoods[:,i]  # word present: multiply by P(w|class)
            else:
                posteriors *= np.ones (2) - self.likelihoods[:,i] # word absent: multiply by 1 - P(w|class)
            posteriors = posteriors / np.linalg.norm (posteriors)  # rescale to unit Euclidean length to avoid underflow
        if posteriors[0] > 0.5: # classify as ham when the rescaled ham score exceeds 0.5
            return ['ham', posteriors[0]]
        return ['spam', posteriors[1]]

    def score (self, messages, labels):
        confusion = np.zeros(4).reshape (2,2) # confusion matrix: rows = predicted (ham, spam), columns = actual (ham, spam)
        for m, l in zip (messages, labels):
            if self.predict(m)[0] == 'ham' and l == 'ham':
                confusion[0,0] += 1
            elif self.predict(m)[0] == 'ham' and l == 'spam':
                confusion[0,1] += 1
            elif self.predict(m)[0] == 'spam' and l == 'ham':
                confusion[1,0] += 1
            elif self.predict(m)[0] == 'spam' and l == 'spam':
                confusion[1,1] += 1
        return (confusion[0,0] + confusion[1,1]) / float (confusion.sum()), confusion

##### Question 5:
Explain the code: What is the purpose of each function? What do 'train' and 'train2' do, and what is the difference between them? Where in the code is Bayes' Theorem being applied?

a. def train
This function is the training step of the Naive Bayes classifier. It first builds the vocabulary (every distinct word in the training messages) and computes the prior probabilities of the two classes, i.e. the share of ham and of spam messages in the training set. For each word it then estimates the conditional probabilities P(word|ham) and P(word|spam) as the fraction of messages in each class that contain the word, and stores them in the frequency matrix. The Laplace estimator adds one to every count so that no probability is zero, and every probability above 0.95 is capped at 0.95 so that none reaches 1.
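
The estimate for a single word can be illustrated in isolation. This is a minimal sketch, assuming a hypothetical helper `capped_likelihood` and three made-up ham messages; it mirrors the expression used inside train rather than replacing it.

```python
def capped_likelihood(word, msgs, cap=0.95):
    # Add-one (Laplace) estimate of P(word | class), as in train():
    # (1 + number of messages containing the word) / number of messages.
    estimate = (1.0 + sum(1 for m in msgs if word in m)) / len(msgs)
    # Cap the estimate so that 1 - likelihood stays above zero.
    return min(estimate, cap)

ham = ["ok see you", "see you soon", "call me"]  # hypothetical ham messages
print(capped_likelihood("prize", ham))  # unseen word: 1/3 rather than 0
print(capped_likelihood("you", ham))    # (1 + 2) / 3 = 1.0, capped to 0.95
```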

b. def train2
The train2 function is similar to the train function, except that it keeps only the words whose conditional probability given spam is more than 20 times their conditional probability given ham (the if statement prob1 * 20 < prob2). Since prob2 can be at most slightly above 1, the retained words all have a ham probability below roughly 0.05. The frequency matrix therefore contains only strong spam indicators, which makes it much smaller and speeds up prediction; in our experiments it also gave a slightly higher accuracy.
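
The effect of the filter can be seen on a toy data set. The sketch below is illustrative only: the messages are hypothetical, and `likelihood` is a stand-in for the expression used in train2. With 40 ham messages, a word that never appears in ham has a ham probability of 1/40, which is small enough to pass the factor-of-20 test.

```python
def likelihood(word, msgs):
    # Same estimate as in train2: (1 + messages containing the word) / messages.
    return (1.0 + sum(1 for m in msgs if word in m)) / len(msgs)

# Hypothetical training data: 40 ham messages and 2 spam messages.
ham = ["see you at home", "call me later"] * 20
spam = ["free prize call now", "win a free prize"]

vocab = sorted(set(" ".join(ham + spam).split()))
# Keep a word only if it is more than 20 times as likely under spam as under ham.
kept = [w for w in vocab if likelihood(w, ham) * 20 < likelihood(w, spam)]
print(len(vocab), "words in total, kept:", kept)
```

Note that "call" is dropped even though it occurs in spam, because it is also common in ham.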


c. def predict 
This is where Bayes' theorem is applied, using the quantities estimated by train or train2. The function starts from the prior probabilities and loops over every word in the vocabulary: if the word occurs in the new message, the running posteriors are multiplied by that word's conditional probabilities from the frequency matrix, otherwise by one minus those probabilities. The posteriors are rescaled after each word, and the message is finally classified as ham if the ham posterior exceeds 0.5 and as spam otherwise. Words in the message that are not in the vocabulary are ignored.
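
Written out, the computation in predict corresponds to the following form of Bayes' theorem under the naive independence assumption. The notation ($c$ for the class, $m$ for the message, $x_i$ for the presence indicator of vocabulary word $w_i$) is introduced here for illustration and does not appear in the code:

$$P(c \mid m) \;\propto\; P(c) \prod_{i} P(w_i \mid c)^{x_i}\,\bigl(1 - P(w_i \mid c)\bigr)^{1 - x_i}, \qquad c \in \{\text{ham}, \text{spam}\}$$

Here $x_i = 1$ if $w_i$ occurs in $m$ and $x_i = 0$ otherwise. The code rescales with `np.linalg.norm`, which is the Euclidean norm, so the two values it returns are proportional to these posteriors but do not sum to 1.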

d. def score 
This function evaluates the classifier with a confusion matrix: it compares the predicted class of each message (obtained from predict) with its actual label, counts the four possible outcomes, and returns the overall accuracy together with the matrix. Rows of the matrix correspond to the predicted class and columns to the actual class.
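
Because score indexes the matrix as [predicted, actual], it is easy to misread the off-diagonal entries. The sketch below uses hypothetical counts (not results from this notebook) to show which cell is which, with spam treated as the positive class.

```python
# Hypothetical confusion matrix in the layout produced by score():
# rows = predicted class (ham, spam), columns = actual class (ham, spam).
confusion = [[90, 4],
             [6, 20]]

true_ham = confusion[0][0]         # predicted ham, actually ham
false_negatives = confusion[0][1]  # predicted ham, actually spam
false_positives = confusion[1][0]  # predicted spam, actually ham
true_spam = confusion[1][1]        # predicted spam, actually spam

total = true_ham + false_negatives + false_positives + true_spam
accuracy = (true_ham + true_spam) / total
print("false positives:", false_positives)
print("false negatives:", false_negatives)
print("accuracy:", round(accuracy, 4))
```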

##### Question 6:
Use your training set to train the classifiers 'train' and 'train2'. Note that the interfaces of our classifiers require you to pass the ham and spam messages separately.

In [37]:
train_ham = training_set.loc[training_set['label'] == 'ham']
train_spam = training_set[training_set['label'] == 'spam']

classifier_train1 = NaiveBayesForSpam()
classifier_train1.train(train_ham['messages'].tolist(), train_spam['messages'].tolist())
classifier_train1.score(test_set['messages'],test_set['label'])

(0.96138996138996136, array([[ 1758.,    44.],
        [   36.,   234.]]))

In [38]:
train_ham = training_set.loc[training_set['label'] == 'ham']
train_spam = training_set[training_set['label'] == 'spam']


classifier_train2 = NaiveBayesForSpam()
classifier_train2.train2(train_ham['messages'].tolist(), train_spam['messages'].tolist())
classifier_train2.score(test_set['messages'],test_set['label'])

(0.96718146718146714, array([[ 1785.,    59.],
        [    9.,   219.]]))

##### Question 7:
Using the validation set, explore how each of the two classifiers performs out of sample.

In [39]:
classifier_train1.score(validation_set['messages'],validation_set['label'])

(0.96199999999999997, array([[ 848.,   18.],
        [  20.,  114.]]))

In [40]:
classifier_train2.score(validation_set['messages'],validation_set['label'])

(0.96299999999999997, array([[ 862.,   31.],
        [   6.,  101.]]))

The train classifier has a validation accuracy of 0.962 and train2 has a validation accuracy of 0.963. Thus, the train2 classifier performs marginally better, although the difference amounts to a single message out of 1,000.

##### Question 8:
Why is the 'train2' classifier faster? Why does it yield a better accuracy both on the training and the validation set?

The train2 classifier is faster because of the additional if statement discussed previously in Question 5. It keeps only a short list of key spam words, those satisfying $20 \cdot \text{prob1} < \text{prob2}$, rather than every word in the training messages (including words that are common in ham). This shrinks the frequency matrix, so predict loops over far fewer words for each SMS message. A plausible reason for the better accuracy is that the discarded words carry little information about the class and mainly add noise to the product of probabilities, whereas the retained words are strong indicators of spam.

##### Question 9:
How many false positives (ham messages classified as spam messages) did you get in your validation set? How would you change the code to reduce false positives at the expense of possibly having more false negatives (spam messages classified as ham messages)?

Referring to the output in Q7, where the score function places predicted classes in rows and actual classes in columns, the train classifier produced 20 false positives (ham predicted as spam) and the train2 classifier produced 6 false positives.

The threshold of 0.5 on the ham posterior within the predict function can be decreased, so that a message is labelled spam only when the evidence for spam is stronger. This reduces false positives at the expense of possibly having more false negatives.
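
One way to make the threshold adjustable is sketched below. This is a standalone illustration under stated assumptions, not the class method itself: the function `predict` here takes the vocabulary and probabilities as arguments, `ham_threshold` is a hypothetical parameter name, and the two-word vocabulary and its probabilities are made up.

```python
import math

def predict(message, words, priors, likelihoods, ham_threshold=0.5):
    """Mirror of the notebook's predict() with an adjustable ham threshold.

    likelihoods[c][i] is P(words[i] | class c), with c = 0 for ham, 1 for spam.
    """
    post = list(priors)
    for i, w in enumerate(words):
        present = w in message.lower()
        for c in (0, 1):
            p = likelihoods[c][i]
            post[c] *= p if present else 1.0 - p
        # Rescale by the Euclidean norm, as np.linalg.norm does in the notebook.
        norm = math.hypot(post[0], post[1])
        post = [post[0] / norm, post[1] / norm]
    if post[0] > ham_threshold:
        return "ham", post[0]
    return "spam", post[1]

# Hypothetical vocabulary and probabilities, for illustration only.
words = ["free", "prize"]
priors = [0.87, 0.13]
likelihoods = [[0.02, 0.01],   # P(word | ham)
               [0.40, 0.30]]   # P(word | spam)

msg = "claim your free gift"
print(predict(msg, words, priors, likelihoods, ham_threshold=0.5)[0])
print(predict(msg, words, priors, likelihoods, ham_threshold=0.2)[0])
```

In this example the ham score is about 0.43, so the message is labelled spam at the default threshold and ham once the threshold is lowered to 0.2.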

##### Question 10:
Run the 'train2' classifier on the test set and report its performance using a confusion matrix.

In [41]:
y = classifier_train2.score(test_set['messages'],test_set['label'])
print(y)

(0.96718146718146714, array([[ 1785.,    59.],
       [    9.,   219.]]))


In [42]:
print('Classification Report')
def print_table(table):
    col_width = [max(len(str(x)) for x in col) for col in zip(*table)]
    for line in table:
        print("| " + " | ".join("{:{}}".format(x, col_width[i])
                                for i, x in enumerate(line)) + " |")

# score() stores predicted classes in rows and actual classes in columns,
# so the counts are transposed here to show actual classes in rows.
print('Confusion Matrix')
table = [['/', 'Pred ham', 'Pred spam'],
         ['Actual ham', '1785', '9'],
         ['Actual spam', '59', '219']]

print_table(table)

print('total_error_rate:')
print((59+9)/ (1785+59+9+219))

print ('Accuracy:')
print(1 - (59+9)/ (1785+59+9+219))

# Spam is the positive class: sensitivity = TP / (TP + FN)
print ('Sensitivity:')
print(round(219/(219+59), 4))

# Specificity = TN / (TN + FP)
print ('Specificity:')
print(round(1785/(1785+9), 4))

Classification Report
Confusion Matrix
| /           | Pred ham | Pred spam |
| Actual ham  | 1785     | 9         |
| Actual spam | 59       | 219       |
total_error_rate:
0.032818532818532815
Accuracy:
0.9671814671814671
Sensitivity:
0.7878
Specificity:
0.995


On the test set, the train2 classifier reaches an accuracy of 96.7%. Treating spam as the positive class, the specificity is 0.995 (only 9 of the 1,794 actual ham messages are flagged as spam), while the sensitivity is 0.788 (59 of the 278 actual spam messages are classified as ham). The classifier therefore errs on the side of letting spam through rather than losing legitimate messages. As discussed above, we can adjust the posterior threshold within our predict function to shift this balance, depending on what we determine to be more important (e.g. reducing spam in our inbox, or avoiding ham messages being mis-categorised into our spam folder).