In [147]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
from column_names import column_list
from pandas import DataFrame, Series

Loading in the data file:

In [148]:
spam_df = pd.read_csv('spambase.data', names = column_list, index_col = False)

We are going to split the dataset into training and test sets, so let's first check the shape of the original data. Later we can verify that `train_test_split` split the data according to the parameters we passed:

In [149]:
spam_df.shape

(4601, 58)

Before splitting, we assign the features and the target to the variables $X$ and $y$. Since our goal is to classify spam, the last column of the original DataFrame (1 = spam, 0 = not spam) is assigned to $y$, and the remaining columns are assigned to $X$.

In [150]:
#X = spam_df.drop('spam', axis = 1)
X = spam_df[column_list[:-1]]
y = spam_df['spam']

Splitting the original dataset into training and testing datasets (60/40):

In [151]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.6, random_state = 42)

In [152]:
X_train.shape

(2760, 57)

Let's check the result of `train_test_split`: the training set has 2760 of the 4601 rows in the original dataset, which is about 60% (2760 / 4601 ≈ 0.5999).
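The row counts can be reproduced with a little arithmetic. The sketch below assumes that `train_test_split` rounds the training count down and gives the remaining rows to the test set. That assumption agrees with the shapes reported in this notebook, but it is an assumption about the library rather than something shown here.

```python
import math

n_samples = 4601   # rows in spam_df, from spam_df.shape
train_size = 0.6   # fraction passed to train_test_split

# Assumption: the training count is rounded down, the test set gets the rest
n_train = math.floor(train_size * n_samples)
n_test = n_samples - n_train

print(n_train, n_test)                 # training rows, test rows
print(round(n_train / n_samples, 4))   # actual training fraction
```

Because 4601 is not divisible by 5, the training fraction cannot be exactly 0.6, which is why the ratio comes out slightly below it.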

Training our model on the training set of data:

In [153]:
clf = MultinomialNB()
clf.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [154]:
clf.score(X_train, y_train)

0.78260869565217395

In [155]:
clf.score(X_test, y_test)

0.78109722976643126
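For a classifier, `score` reports mean accuracy: the fraction of samples whose predicted label matches the true label. A minimal sketch of that calculation in plain Python follows; the helper `accuracy` and the toy labels are made up for illustration and are not part of the notebook or of scikit-learn.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Hypothetical toy labels: 1 = spam, 0 = not spam
y_true_toy = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred_toy = [1, 0, 1, 1, 0, 0, 0, 0]

toy_accuracy = accuracy(y_true_toy, y_pred_toy)
print(toy_accuracy)  # 6 of the 8 labels match
```

The training and test scores above are close to each other, which suggests the model is not overfitting the training set, even though the accuracy itself is modest.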

Making predictions on the testing set of data:

In [156]:
list_1_0 = clf.predict(X_test)
list_1_0

array([1, 0, 0, ..., 1, 0, 0])

`list_1_0` is a NumPy array of 1s and 0s, where 1 means spam and 0 means not spam. Let's find what percentage of the predictions are spam, first with plain Python and then with pandas:

In [157]:
count_spam = 0
for item in list_1_0:
    if item == 1:
        count_spam += 1
count_spam

703

In [158]:
spam_percentage = count_spam / len(list_1_0) * 100
spam_percentage

38.18576860401956

In [159]:
df_list = DataFrame(list_1_0)
df_list.columns = ['spam']
df_list.head()

   spam
0     1
1     0
2     0
3     1
4     0


In [160]:
df_list.spam.value_counts()

0    1138
1     703
Name: spam, dtype: int64

In [161]:
len(df_list)

1841

In [162]:
703 / 1841 * 100

38.18576860401956
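The same percentage can also be computed directly on the NumPy array, without a loop or a DataFrame. The sketch below rebuilds a stand-in array from the counts reported above (703 ones and 1138 zeros) rather than calling `clf.predict`, so the order of the values is not the real one; only the totals match.

```python
import numpy as np

# Stand-in for clf.predict(X_test): same counts as above, arbitrary order
predictions = np.array([1] * 703 + [0] * 1138)

# The mean of a boolean mask is the fraction of True values
spam_share = np.mean(predictions == 1) * 100
print(round(spam_share, 2))
```

Since the labels are already 0 and 1, `predictions.mean() * 100` would give the same number; the explicit comparison just makes the intent clearer.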

### Advanced Mode

Let's drop the features `capital_run_length_average`, `capital_run_length_longest` and `capital_run_length_total` from the feature set and check how the score changes.

In [163]:
X = spam_df[column_list[:-4]]
y = spam_df['spam']
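The slice `column_list[:-4]` works because the three `capital_run_length_*` features and the `spam` label are the last four entries of `column_list`. A sketch of selecting the columns by name instead, which does not depend on their order, is shown below. The short `column_list` here is a hypothetical stand-in for the real one, and the first two names in it are placeholders.

```python
# Hypothetical, shortened stand-in for the notebook's column_list
column_list = [
    'word_freq_make',              # placeholder feature name
    'char_freq_!',                 # placeholder feature name
    'capital_run_length_average',
    'capital_run_length_longest',
    'capital_run_length_total',
    'spam',                        # target column
]

target = 'spam'
excluded = {
    'capital_run_length_average',
    'capital_run_length_longest',
    'capital_run_length_total',
}

# Keep every column that is neither the target nor an excluded feature
feature_columns = [c for c in column_list if c != target and c not in excluded]
print(feature_columns)

# With this ordering, the name-based selection equals the positional slice
print(feature_columns == column_list[:-4])
```

With the real list, `X = spam_df[feature_columns]` should select the same columns as the positional slice, assuming the names in `column_list` match those used in the text.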

In [164]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.6, random_state = 42)

In [165]:
X_train.shape

(2760, 54)

In [166]:
clf = MultinomialNB()
clf.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [167]:
clf.score(X_train, y_train)

0.87282608695652175

In [168]:
clf.score(X_test, y_test)

0.87235198261814229

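Result: After dropping the features `capital_run_length_average`, `capital_run_length_longest` and `capital_run_length_total`, both the training and the test accuracy rise from about 0.78 to about 0.87. A plausible explanation is that `MultinomialNB` treats feature values as counts, and these three features are on a much larger scale than the frequency features, so they can dominate the likelihood.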