# Week 5 Part 2
Benson Toi, Noah Collin, Ahmed Elsaeyed

It can be useful to classify new "test" documents using already classified "training" documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam. Here is one example of such data: the UCI Machine Learning Repository Spambase Data Set.
For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).
For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.
This assignment is due end of day on Sunday.

The dataset we will be using is available here: https://archive.ics.uci.edu/ml/datasets/spambase

In [26]:
import random
import nltk
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import CategoricalNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_score
import pandas as pd

# Data Prep


In [2]:
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data', header=None)


Without headers, pandas simply numbers the columns 0 through 57.


In [3]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,57
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


The website also provides us with the actual column names which we can add in thanks to pandas.


In [4]:
df.columns = ['word_freq_make','word_freq_address','word_freq_all','word_freq_3d','word_freq_our','word_freq_over','word_freq_remove','word_freq_internet','word_freq_order','word_freq_mail','word_freq_receive','word_freq_will','word_freq_people','word_freq_report','word_freq_addresses','word_freq_free','word_freq_business','word_freq_email','word_freq_you','word_freq_credit','word_freq_your','word_freq_font','word_freq_000','word_freq_money','word_freq_hp','word_freq_hpl','word_freq_george','word_freq_650','word_freq_lab','word_freq_labs','word_freq_telnet','word_freq_857','word_freq_data','word_freq_415','word_freq_85','word_freq_technology','word_freq_1999','word_freq_parts','word_freq_pm','word_freq_direct','word_freq_cs','word_freq_meeting','word_freq_original','word_freq_project','word_freq_re','word_freq_edu','word_freq_table','word_freq_conference','char_freq_;','char_freq_(','char_freq_[','char_freq_!','char_freq_$','char_freq_#','capital_run_length_average','capital_run_length_longest','capital_run_length_total','is_spam']

In [5]:
df.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,is_spam
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


# Splitting the Data into Testing and Training

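Judging by the columns above, some "natural language processing" has already been done: each row holds precomputed word and character frequencies rather than raw e-mail text. Our model can therefore be created directly in sklearn, with the first 57 columns as features and the last one (is_spam) as the label.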


In [6]:
# create a df named X that represents only the first 57 columns
X = df.iloc[:,:-1]

# create a df named y that represents only the last column
y = df.iloc[:,-1]

Sklearn lets us easily create testing and training data sets from an existing data set using the function train_test_split: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?highlight=train_test_split#sklearn.model_selection.train_test_split


In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [8]:
# check the lengths of the dfs we just made to confirm the 70/30 split
len(X_train)

3220

In [9]:
len(X_test)

1381

In [10]:
# the sum of the above should equal the length of the original df
len(df)

4601
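
The two lengths above follow from `test_size=0.3`. As a rough sketch in plain Python (not scikit-learn's own code, and assuming the test share is rounded up with the remainder going to training), the arithmetic works out like this:

```python
import math

n_rows = 4601      # len(df) from the cell above
test_size = 0.3    # fraction passed to train_test_split

# Assumption: the test share is rounded up and training gets whatever is left.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)
```

Under that assumption the numbers match the lengths reported for X_train and X_test.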

# Initial Model Creation - Bernoulli Naive Bayes

The first step is to create a model object of the type we want to use. We start with Bernoulli Naive Bayes.
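
One detail worth keeping in mind: BernoulliNB models each feature as a yes/no value, and with its default `binarize=0.0` setting it thresholds the inputs first, so a word frequency only counts as present or absent. Below is a minimal plain-Python sketch of that thresholding, applied to the first five values of row 0 from `df.head()`. The helper name `binarize_row` is made up for illustration and is not part of scikit-learn.

```python
def binarize_row(row, threshold=0.0):
    """Map each feature to 1 if it is strictly above the threshold, else 0."""
    return [1 if value > threshold else 0 for value in row]

# First five feature values of row 0 in df.head()
row0 = [0.0, 0.64, 0.64, 0.0, 0.32]
row0_binary = binarize_row(row0)
print(row0_binary)
```

This means the Bernoulli model ignores how often a word appears, which may partly explain why it behaves differently from the other classifiers tried later.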

In [11]:
bernoulli_nb_classifier = BernoulliNB()
# The next step is to train the model using the training data we extracted before
# The method 'fit' takes the feature set and the labels as inputs and 'fits' the model to them
bernoulli_nb_classifier.fit(X_train, y_train)

# How to get a prediction
# The BernoulliNB() object has a method 'predict' which takes one or more rows of features and returns a predicted label per row
# Here we can use any row from our 'X_test' data; the double brackets keep a single row two-dimensional
# (these three results are not printed, so they do not appear in the output below)
bernoulli_nb_classifier.predict(X_test.iloc[[56]])
bernoulli_nb_classifier.predict(X_test.iloc[[60]])
bernoulli_nb_classifier.predict(X_test.iloc[[1000]])


# We can also pass in several rows of data from X_test and get predictions for all of them
print(bernoulli_nb_classifier.predict(X_test[0:30]))

# We can then compare these predictions against y_test, which contains the actual labels
print(y_test[0:30].values)

[0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 1 0 0 1 0 1 0 1 1]
[0 0 0 1 0 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 1]


# Exploring Different Models

As we can see, the predictions are not that good: 6 of the first 30 test rows are misclassified. We can repeat this with other classifiers to see if we get better results.

In [12]:
#Trying it again with another classifier, Gaussian Naive Bayes

gnbc = GaussianNB()
gnbc.fit(X_train, y_train)
print(gnbc.predict(X_test[0:30]))
print(y_test[0:30].values)

[1 0 0 1 0 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 1 1 0 1 1 1 0 1 1]
[0 0 0 1 0 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 1]


In [13]:
#Trying it again with Multinomial Naive Bayes

mnbc = MultinomialNB()
mnbc.fit(X_train, y_train)
print(mnbc.predict(X_test[0:30]))
print(y_test[0:30].values)

[1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 1 0 1 0 1 0 1 1]
[0 0 0 1 0 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 1]


In [14]:
#Trying it again with Categorical Naive Bayes

cnbc = CategoricalNB()
cnbc.fit(X_train, y_train)
print(cnbc.predict(X_test[0:30]))
print(y_test[0:30].values)

[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 1 1]
[0 0 0 1 0 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 1]


In [15]:
#Trying it again with logistic regression

lrc = LogisticRegression(solver="liblinear", random_state=0)
lrc.fit(X_train, y_train)
print(lrc.predict(X_test[0:30]))
print(y_test[0:30].values)

[0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 1 0 1 0 1 1]
[0 0 0 1 0 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 1]


It looks like logistic regression gives us the best predictions on these 30 rows, with only 3 misclassified. These results might improve further with more training data.
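
To put numbers on the side-by-side comparisons above, the sketch below counts how many of the 30 printed predictions disagree with the true labels for each classifier. The arrays are copied from the cell outputs; `parse` and `count_mismatches` are helper names made up for this illustration. Thirty rows is a small sample, so the counts are only a rough indication.

```python
def parse(bits):
    """Turn a space-separated string of 0/1 labels into a list of ints."""
    return [int(b) for b in bits.split()]

def count_mismatches(predicted, actual):
    """Number of positions where the prediction and the true label differ."""
    return sum(p != a for p, a in zip(predicted, actual))

# True labels for the first 30 test rows
actual = parse("0 0 0 1 0 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 1")

# Predictions printed by each classifier for the same 30 rows
predictions = {
    "bernoulli":   parse("0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 1 0 0 1 0 1 0 1 1"),
    "gaussian":    parse("1 0 0 1 0 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 1 1 0 1 1 1 0 1 1"),
    "multinomial": parse("1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 1 0 1 0 1 0 1 1"),
    "categorical": parse("0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 1 1"),
    "logistic":    parse("0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 1 0 1 0 1 1"),
}

mismatches = {name: count_mismatches(pred, actual) for name, pred in predictions.items()}
for name, wrong in mismatches.items():
    print(f"{name}: {wrong} of {len(actual)} wrong, accuracy {1 - wrong / len(actual):.2f}")
```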


# A More Scientific Way of Measuring Model Accuracy

roc_auc_score is a function in scikit-learn that computes the area under the ROC curve (AUC) from a model's predicted probabilities.
The higher the number, the better the model's scores separate the positive class (spam) from the negative class (ham) across all classification thresholds. As the scores below show, logistic regression provides the best model so far, followed by categorical naive Bayes (for which only a training score is available).

More about area-under-the-curve and how it's used to evaluate models is available here:
https://www.analyticsvidhya.com/blog/2020/06/auc-roc-curve-machine-learning/#:~:text=The%20Area%20Under%20the%20Curve,the%20positive%20and%20negative%20classes.

From the link: "When AUC = 1, then the classifier is able to perfectly distinguish between all the Positive and the Negative class points correctly. If, however, the AUC had been 0, then the classifier would be predicting all Negatives as Positives, and all Positives as Negatives."
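
Another way to read the AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python on a few made-up scores. `pairwise_auc` is a hypothetical helper written for illustration, not part of scikit-learn, and it assumes ties count as half a win.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs that are ranked correctly."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # assumption: a tie counts as half a win
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]                                        # made-up labels
auc_example = pairwise_auc(labels, [0.1, 0.4, 0.35, 0.8])    # one pair out of four is misordered
auc_perfect = pairwise_auc(labels, [0.1, 0.2, 0.8, 0.9])     # every positive outranks every negative
auc_inverted = pairwise_auc(labels, [0.9, 0.8, 0.2, 0.1])    # every negative outranks every positive
print(auc_example, auc_perfect, auc_inverted)
```

The last two calls reproduce the two extremes described in the quote: an AUC of 1 for a perfect ranking and 0 for a fully inverted one.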

In [17]:
# score for logistic regression model on the training data set
print(roc_auc_score(y_train, lrc.predict_proba(X_train)[:,1], multi_class='ovr') )

# score using the test data set
print(roc_auc_score(y_test, lrc.predict_proba(X_test)[:,1], multi_class='ovr') )

0.9747468012710094
0.9733912758564197


We see that logistic regression does a little worse on the test data set than on the training data set (about 0.9734 versus 0.9747).

In [25]:
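# score for categorical naive bayes (training data only)
print(roc_auc_score(y_train, cnbc.predict_proba(X_train)[:,1], multi_class='ovr') )

# score using the test data set throws an error, most likely because CategoricalNB treats each
# feature value as a category index and the test set contains values larger than any seen in training
#print(roc_auc_score(y_test, cnbc.predict_proba(X_test)[:,1], multi_class='ovr') )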

0.9669575454770853


In [20]:
# score for bernoulli naive bayes
print(roc_auc_score(y_train, bernoulli_nb_classifier.predict_proba(X_train)[:,1], multi_class='ovr') )
print(roc_auc_score(y_test, bernoulli_nb_classifier.predict_proba(X_test)[:,1], multi_class='ovr') )

0.9509557446628041
0.9561928227148486


In [21]:
# score for gaussian naive bayes
print(roc_auc_score(y_train, gnbc.predict_proba(X_train)[:,1], multi_class='ovr') )
print(roc_auc_score(y_test, gnbc.predict_proba(X_test)[:,1], multi_class='ovr') )

0.9453477741022027
0.948221414590824


In [22]:
# score for multinomial naive bayes
print(roc_auc_score(y_train, mnbc.predict_proba(X_train)[:,1], multi_class='ovr') )
print(roc_auc_score(y_test, mnbc.predict_proba(X_test)[:,1], multi_class='ovr') )

0.8474456248695063
0.8553129068694655


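The Bernoulli, Gaussian, and multinomial naive Bayes models each score slightly higher on the test data set than on the training data set, but the differences are small (less than 0.01 in every case).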
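
For a quick side-by-side view, the sketch below subtracts each training score from the matching test score, using the numbers printed in the cells above. Categorical naive Bayes is left out because it has no test score. The dictionary layout is just one convenient way to organize the values.

```python
# (train AUC, test AUC) copied from the cell outputs above
auc_scores = {
    "logistic":    (0.9747468012710094, 0.9733912758564197),
    "bernoulli":   (0.9509557446628041, 0.9561928227148486),
    "gaussian":    (0.9453477741022027, 0.948221414590824),
    "multinomial": (0.8474456248695063, 0.8553129068694655),
}

# A positive gap means the model scored higher on the test set than on the training set
gaps = {name: test - train for name, (train, test) in auc_scores.items()}
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name:12s} {gap:+.4f}")
```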

# Precision

precision_score is another metric provided by scikit-learn, documented here: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html

From the documentation: "The precision is intuitively the ability of the classifier not to label as positive a sample that is negative". 


    

In [27]:
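# precision for logistic regression; average="macro" is the unweighted mean of the precision for ham (0) and for spam (1)
y_pred = lrc.predict(X_test)
print(precision_score(y_test, y_pred, average="macro"))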

0.9321089229209281
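
With `average="macro"`, the reported number is the unweighted mean of the precision for each class rather than the precision for spam alone. As a hand-worked sketch, here is that calculation in plain Python on the 30 logistic regression predictions printed earlier (not the full test set, so the result differs from the score above). `class_precision` is a helper name made up for this example.

```python
def class_precision(actual, predicted, label):
    """TP / (TP + FP) for one class: of the rows predicted as `label`, the share that truly are."""
    truths_for_predicted = [a for a, p in zip(actual, predicted) if p == label]
    if not truths_for_predicted:
        return 0.0  # assumption: report 0 when the class is never predicted
    return sum(a == label for a in truths_for_predicted) / len(truths_for_predicted)

# First 30 true labels and logistic regression predictions, copied from the earlier output
actual = [int(b) for b in "0 0 0 1 0 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 1".split()]
predicted = [int(b) for b in "0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 1 0 1 0 1 1".split()]

precision_spam = class_precision(actual, predicted, 1)
precision_ham = class_precision(actual, predicted, 0)
macro_precision = (precision_spam + precision_ham) / 2
print(precision_spam, precision_ham, macro_precision)
```

On this sample every predicted spam is truly spam, while 3 of the 20 predicted hams are actually spam, so the macro average sits between the two per-class values.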


In [29]:
# same metric for Gaussian naive Bayes
y_pred2 = gnbc.predict(X_test)
print(precision_score(y_test, y_pred2, average="macro"))

0.8362268003677202
