# Classification task: Spam Classifier

* author: Haochen Guo
* email:  guohaoch@usc.edu

In this task, I build three classifiers, tune each one with Grid Search and K-Fold Cross-validation, and choose the one with the best performance. Finally, I output the results in the given format.

### Load Spambase dataset
The Spambase dataset is from the UCI Machine Learning Repository:
http://archive.ics.uci.edu/ml/datasets/Spambase

In [1]:
import numpy as np
# Load Spambase dataset from disk 
data_path = "/mnt/hgfs/data/spambase/spambase.data"
with open(data_path, 'r') as i_f:
    data = np.array([[float(i) for i in line.split(',')] for line in i_f])
# Shuffle the data
np.random.shuffle(data)
# Slice data into features and labels
X = data[:, :-1]
y = data[:, -1]
# Drop the name of the full array (X and y are views into it, so the buffer itself stays alive)
del data
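
Because `np.random.shuffle` is called without a seed, every run orders the rows differently, so the cross-validation scores below can vary slightly between runs. Here is a minimal sketch of a reproducible variant. The helper name `load_shuffled` and the seed value 0 are illustrative choices, not part of the original notebook, and the demo runs on a tiny temporary file rather than the real dataset:

```python
import os
import tempfile

import numpy as np


def load_shuffled(path, seed=0):
    """Load a comma-separated numeric file and shuffle its rows reproducibly.

    `load_shuffled` and `seed` are illustrative names, not from the notebook.
    """
    data = np.loadtxt(path, delimiter=',', ndmin=2)
    rng = np.random.default_rng(seed)  # seeded generator: same order every run
    rng.shuffle(data)                  # shuffles rows in place (axis 0)
    return data[:, :-1], data[:, -1]   # features, labels (last column)


# Tiny stand-in for spambase.data: two features plus a 0/1 label per row
with tempfile.NamedTemporaryFile('w', suffix='.data', delete=False) as f:
    f.write("0.1,1.0,1\n0.0,2.0,0\n0.5,3.0,1\n0.2,4.0,0\n")
    demo_path = f.name

X_demo, y_demo = load_shuffled(demo_path)
X_again, y_again = load_shuffled(demo_path)
os.remove(demo_path)

print(X_demo.shape, y_demo.shape)
print(np.array_equal(X_demo, X_again))  # same seed, same row order
```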

### Building the classifier

In [2]:
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
# parameters settings to try in Grid Search
param_dict = {
    'svm': {'C': [1, 10, 1e2, 1e3], 'gamma': ['auto', 0.001, 0.0001]}, # SVM
    'nb': {'alpha': [0.01, 0.1, 1.]}, # Multinomial Naive Bayes
    'rf': {'n_estimators': [50, 100, 150], 'criterion': ['gini', 'entropy']} # Random Forest
}
# estimators
svm = SVC(kernel='rbf', class_weight='balanced')
nb = MultinomialNB()
rf = RandomForestClassifier()
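
Grid search fits one model per parameter combination per fold, so the size of each grid drives the running time. The sketch below works through that arithmetic in plain Python. It assumes 5 folds, and the helper name `count_candidates` is made up for illustration:

```python
from itertools import product

# Same grids as param_dict above, repeated so the snippet runs on its own
grids = {
    'svm': {'C': [1, 10, 1e2, 1e3], 'gamma': ['auto', 0.001, 0.0001]},
    'nb': {'alpha': [0.01, 0.1, 1.]},
    'rf': {'n_estimators': [50, 100, 150], 'criterion': ['gini', 'entropy']},
}


def count_candidates(grid):
    """Number of parameter combinations in one grid (hypothetical helper)."""
    return len(list(product(*grid.values())))


n_folds = 5  # assumption: number of cross-validation folds
fits = {name: count_candidates(g) * n_folds for name, g in grids.items()}
for name, n in fits.items():
    print(name, count_candidates(grids[name]), "candidates,", n, "fits")
# svm 12 candidates, 60 fits
# nb 3 candidates, 15 fits
# rf 6 candidates, 30 fits
```

With the default `refit=True`, `GridSearchCV` additionally refits the best candidate once on the full data.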

Parameter tuning using Grid Search with K-Fold Cross-validation:

In [3]:
from sklearn.model_selection import GridSearchCV
k = 5 # K-Fold with n_splits = 5
# map each key of param_dict to a display name and its base estimator
methods = {
    'svm': ('SVM', svm),
    'nb': ('Multinomial Naive Bayes', nb),
    'rf': ('Random Forest', rf),
}
# find the best estimator of the best method
s_max = 0
for m in param_dict: # for each method
    name, base = methods[m]
    clf = GridSearchCV(base, param_dict[m], cv=k)
    clf.fit(X, y)
    print("%s:\nBest_score: %f\nBest_estimator:" % (name, clf.best_score_))
    print(clf.best_estimator_)
    print('-------------------------------------')
    if clf.best_score_ > s_max: # keep the best model seen so far
        s_max = clf.best_score_
        estimator = clf.best_estimator_

SVM:
Best_score: 0.922191
Best_estimator:
SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.0001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
-------------------------------------
Multinomial Naive Bayes:
Best_score: 0.792871
Best_estimator:
MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)
-------------------------------------
Random Forest:
Best_score: 0.956749
Best_estimator:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
-------------------------------------


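Unsurprisingly, Random Forest performs best of the three, with a mean cross-validated accuracy of about 0.957 versus 0.922 for the SVM and 0.793 for Multinomial Naive Bayes.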
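
`best_score_` is the mean cross-validated score of the winning parameter setting. The per-candidate means and standard deviations are kept in `cv_results_`, which helps judge how close the candidates really are. Here is a small sketch on synthetic data from `make_classification`. It is a stand-in, not Spambase, and the reduced grid and the `random_state` values are my own choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, NOT the Spambase dataset
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

# Deliberately small grid so the example runs quickly
toy_grid = {'n_estimators': [10, 20], 'criterion': ['gini', 'entropy']}
search = GridSearchCV(RandomForestClassifier(random_state=0), toy_grid, cv=5)
search.fit(X_toy, y_toy)

# cv_results_ holds one entry per parameter combination
results = search.cv_results_
for params, mean, std in zip(results['params'],
                             results['mean_test_score'],
                             results['std_test_score']):
    print(params, "mean=%.3f std=%.3f" % (mean, std))
```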

### Evaluating with K-Fold cross-validation
The grid search above reports only the mean cross-validated accuracy, not the false positive, false negative, and overall error rates of each fold. I therefore run K-Fold again, this time with 10 folds and the best classifier, to get those results:

In [4]:
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
k = 10
res = np.zeros((k+1, 3)) # initial result matrix
kf = KFold(n_splits = k)
for i, (train_index, test_index) in enumerate(kf.split(X)): # for each fold
    estimator.fit(X[train_index], y[train_index])
    y_pred = estimator.predict(X[test_index])
    tn, fp, fn, tp = confusion_matrix(y[test_index], y_pred).ravel() # get fp, fn
    res[i] += np.array([fp, fn, fp+fn]) / len(test_index) # add to the result matrix (fractions of the fold's test samples)
res[k] += np.mean(res[:k], axis=0) # the final row holds the column-wise average error rates
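
Note that the loop above divides the false positive and false negative counts by the size of the test fold. The "FP Rate" and "FN Rate" columns are therefore fractions of all test samples, and they add up to the overall error rate. The more common definitions divide by the number of actual negatives and actual positives instead. The worked sketch below shows both, using made-up labels and a hypothetical helper `error_rates`:

```python
def error_rates(y_true, y_pred):
    """Return error fractions for binary 0/1 labels (illustrative helper)."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    n = len(pairs)
    return {
        # fractions of all test samples, as computed in the notebook
        'fp_share': fp / n,
        'fn_share': fn / n,
        'overall_error': (fp + fn) / n,
        # conventional rates, conditioned on the true class
        'fpr': fp / (fp + tn),
        'fnr': fn / (fn + tp),
    }


# Made-up example: 6 non-spam (0) and 4 spam (1) messages
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
rates = error_rates(y_true, y_pred)
print(rates)
```

Here one of ten messages is a false positive (share 0.1), but one of six non-spam messages is misclassified (conventional rate of about 0.167), so the two definitions can differ noticeably.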

In [5]:
# display the table
import pandas as pd
rows = ['F'+str(i) for i in range(1, k+1)] + ['AVG'] # fold labels plus the average row
# use the labels as the index so the error rates stay numeric instead of being cast to strings
res_tab = pd.DataFrame(res, index=rows, columns=['FP Rate', 'FN Rate', 'Overall Error Rate'])
res_tab

| Fold | FP Rate | FN Rate | Overall Error Rate |
|------|---------|---------|--------------------|
| F1 | 0.0043383947939262 | 0.0347071583514099 | 0.0390455531453362 |
| F2 | 0.0108695652173913 | 0.0282608695652173 | 0.0391304347826087 |
| F3 | 0.0282608695652173 | 0.0347826086956521 | 0.0630434782608695 |
| F4 | 0.0195652173913043 | 0.0304347826086956 | 0.05 |
| F5 | 0.017391304347826 | 0.0239130434782608 | 0.0413043478260869 |
| F6 | 0.0239130434782608 | 0.0217391304347826 | 0.0456521739130434 |
| F7 | 0.0260869565217391 | 0.0239130434782608 | 0.05 |
| F8 | 0.008695652173913 | 0.0282608695652173 | 0.0369565217391304 |
| F9 | 0.0108695652173913 | 0.0282608695652173 | 0.0391304347826087 |
| F10 | 0.0260869565217391 | 0.0217391304347826 | 0.0478260869565217 |
