# Cloud Constable Content-Based Spam/Fraud Detection
______
### Stephen Camera-Murray, Himani Garg, Vijay Thangella
## CSDMC2010 SPAM Corpus
(http://csmining.org/index.php/spam-email-datasets-.html)

The corpus contains 4327 messages: 2949 non-spam messages (HAM) and 1378 spam messages (SPAM).

Spam                                                   |  Ham
:-----------------------------------------------------:|:------------------------------------------------------:
<img src="Spam.png" alt="Spam" style="width: 200px;"/> | <img src="Ham.png" alt="Ham" style="width: 200px;"/>

### Step 3 - Build a Predictive Model
____

#### Import required libraries

In [13]:
#import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

#### Load Email Features Dataframe

In [14]:
# load the email features dataframe (tab-separated, gzip-compressed)
emailFeatures = pd.read_csv ('data/emailFeatures.tab.gz', compression='gzip', sep='\t')
emailFeatures.head()

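Unnamed: 0 | ham | phishy | aa | ab | ability | able | absolute | absolutely | abuse | ac | ... | yw | zd | zdnet | zero | zip | zm | zt | zvbnq | zw | zzzzteana
:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:
0 | 0.0 | 1.0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
1 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
2 | 1.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
3 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
4 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0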


#### Split Dataset into Test and Train

In [15]:
# set random seed for reproducibility
np.random.seed(222)

# create a random mask and split into train and test
msk = np.random.rand(len(emailFeatures)) < 0.7
train = emailFeatures[ msk]
test  = emailFeatures[~msk]

# let's confirm that our split worked
print ( 'Our training dataset has', round ( train.shape[0]/emailFeatures.shape[0] * 100, 2 ),
        '% of the observations and our test dataset has', round ( test.shape[0]/emailFeatures.shape[0] * 100, 2 ), '%' )

Our training dataset has 70.53 % of the observations and our test dataset has 29.47 %
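
The random mask gives only an approximate 70/30 split (70.53 % here) and does not control the ham/spam ratio in each part. As a possible alternative, not what this notebook uses, scikit-learn's `train_test_split` can stratify on the label. The sketch below runs on a hypothetical toy dataframe (`toy`, with a made-up `offer` column) rather than on `emailFeatures`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# toy stand-in for emailFeatures: 68 ham rows and 32 spam rows (illustrative only)
toy = pd.DataFrame({
    "ham": np.array([1.0] * 68 + [0.0] * 32),
    "offer": np.arange(100) % 5,
})

# stratify on the label so both splits keep a similar ham/spam ratio
toy_train, toy_test = train_test_split(
    toy, test_size=0.3, stratify=toy["ham"], random_state=222
)

print('train rows:', len(toy_train), 'test rows:', len(toy_test))
print('ham share in train:', round(toy_train["ham"].mean(), 2),
      'ham share in test:', round(toy_test["ham"].mean(), 2))
```

Switching the real split to this approach would change which rows land in each set, so the accuracy figures reported later would need to be recomputed.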


#### Build a Naive Model

In [16]:
# placeholder: predict all spam or all ham, whichever class is more common
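
The naive model is only described by a comment so far. A minimal sketch of such a majority-class baseline follows, assuming `ham` is a 0/1 label as in the dataframe preview. The helper name `majority_class_baseline` and the toy labels are hypothetical; in the notebook, `train["ham"]` and `test["ham"]` would take their place. If the labels follow the corpus counts quoted at the top (2949 ham out of 4327), always predicting ham would score roughly 68 %, which is the figure any real model has to beat.

```python
import numpy as np
import pandas as pd

def majority_class_baseline(y_train, y_test):
    """Predict the most common training label for every test row."""
    majority = pd.Series(y_train).mode().iloc[0]
    pred = np.full(len(y_test), majority)
    accuracy = float(np.mean(pred == np.asarray(y_test)))
    return majority, accuracy

# toy labels standing in for train["ham"] and test["ham"] (illustrative values only)
toy_train_labels = pd.Series([1.0, 1.0, 0.0, 1.0, 0.0, 1.0])
toy_test_labels = pd.Series([1.0, 0.0, 1.0, 1.0])

majority, naive_accuracy = majority_class_baseline(toy_train_labels, toy_test_labels)
print('Naive model always predicts', majority,
      'and has', round(naive_accuracy * 100, 2), '% accuracy on the toy test labels')
```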

#### Build an SVM Model
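
This section has no code yet. As a rough sketch only, one way to fit a linear SVM with scikit-learn's `LinearSVC` is shown below. It runs on a small synthetic word-count dataframe (`toy`, with made-up `offer` and `meeting` columns), and the `C` and `max_iter` values are untuned assumptions, so nothing here reflects results on the real email features.

```python
import numpy as np
import pandas as pd
from sklearn.svm import LinearSVC
from sklearn import metrics

# synthetic stand-in for emailFeatures (illustrative only)
rng = np.random.RandomState(222)
n = 200
y = rng.randint(0, 2, size=n)
toy = pd.DataFrame({
    "ham": y.astype(float),
    # "offer" is more frequent in spam (ham == 0), "meeting" in ham
    "offer": rng.poisson(np.where(y == 0, 3.0, 0.3)),
    "meeting": rng.poisson(np.where(y == 1, 3.0, 0.3)),
})

# same random-mask split style as the notebook
msk = rng.rand(n) < 0.7
toy_train, toy_test = toy[msk], toy[~msk]

# fit a linear SVM on the word counts, with "ham" as the response variable
svm = LinearSVC(C=1.0, max_iter=10000, random_state=24)
svm.fit(toy_train.drop("ham", axis=1), toy_train["ham"])

# predict and score on the held-out rows
svm_pred_test = svm.predict(toy_test.drop("ham", axis=1))
svm_accuracy_test = metrics.accuracy_score(toy_test["ham"], svm_pred_test)
print('Toy SVM test accuracy:', round(svm_accuracy_test * 100, 3), '%')
```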

#### Build a Random Forest Model

In [17]:
# Initialize a Random Forest classifier with 100 trees and a random seed for consistency
forest = RandomForestClassifier(n_estimators = 100, random_state = 24 ) 

# Fit the forest to the training set, using the "ham" column as the
# response variable and every remaining column as a feature (the bag-of-words
# counts, plus "Unnamed: 0" and "phishy", which are not dropped here)
#
# This may take a few minutes to run
forest = forest.fit( train.drop ("ham", axis=1 ), train["ham"] )

Now, let's make our predictions.

In [18]:
# predict train and test 
pred_train = forest.predict ( train.drop ("ham", axis=1 ) )
pred_test  = forest.predict ( test.drop  ("ham", axis=1 ) )

#### Evaluate our models
Note: only accuracy is reported for now. We should add additional scores (such as precision and recall) and maybe a confusion matrix.

In [19]:
# now score our predictions based on accuracy
rforest_accuracy_train = metrics.accuracy_score(train["ham"], pred_train)
rforest_accuracy_test  = metrics.accuracy_score(test["ham"], pred_test)

# print our results
print ( 'Our random forest model has', round(rforest_accuracy_train*100,3), '% accuracy on our training dataset and',
        round(rforest_accuracy_test*100,3), '% accuracy on our test dataset' )

Our random forest model has 100.0 % accuracy on our training dataset and 97.961 % accuracy on our test dataset
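
The note at the top of this section asks for additional scores and a confusion matrix. A minimal sketch using `sklearn.metrics` is given below. It runs on hand-made toy labels standing in for `test["ham"]` and `pred_test`, and it treats spam (`ham == 0`) as the positive class, which is an assumption about what matters most here.

```python
import numpy as np
from sklearn import metrics

# toy labels standing in for test["ham"] and pred_test (1 = ham, 0 = spam)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 1, 0, 0, 1, 1, 0, 1])

# rows are true classes, columns are predicted classes, ordered as in `labels`
cm = metrics.confusion_matrix(y_true, y_pred, labels=[0, 1])

# precision, recall and F1 with spam (0) as the positive class
spam_precision = metrics.precision_score(y_true, y_pred, pos_label=0)
spam_recall = metrics.recall_score(y_true, y_pred, pos_label=0)
spam_f1 = metrics.f1_score(y_true, y_pred, pos_label=0)

print(cm)
print('spam precision:', spam_precision, 'recall:', spam_recall, 'F1:', spam_f1)
```

In the confusion matrix the off-diagonal cells separate the two kinds of error: spam that slipped through as ham, and ham that was wrongly flagged as spam.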


With 100% accuracy on the training dataset, our model is clearly overfit to it. Even so, it reaches almost 98% accuracy on the test dataset, and because the data is not highly imbalanced that figure is meaningful, so the model may be a good candidate for further refinement.