# Spam Detector

In [157]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Retrieve the Data

The data is located at [https://static.bc-edx.com/ai/ail-v-1-0/m13/challenge/spam-data.csv](https://static.bc-edx.com/ai/ail-v-1-0/m13/challenge/spam-data.csv)

Dataset Source: [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/94/spambase)

Import the data using Pandas. Display the resulting DataFrame to confirm the import was successful.

In [158]:
# Import the data
df_spam = pd.read_csv("https://static.bc-edx.com/ai/ail-v-1-0/m13/challenge/spam-data.csv")
df_spam.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,spam
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


## Predict Model Performance

You will be creating and comparing two models on this data: a Logistic Regression model and a Random Forest Classifier. Before you create, fit, and score the models, make a prediction as to which model you think will perform better. You do not need to be correct!

Write down your prediction in the designated cells in your Jupyter Notebook, and provide justification for your educated guess.

***Any prediction that I would make at this point would be a coin flip. I think that different models will perform differently based on the dataset they are used with. The quarter landed on heads, so I will go with the logistic regression model as my guess.***

## Split the Data into Training and Testing Sets

In [159]:
# Create the labels set `y` and features DataFrame `X`

s_label = "spam"

# Drop the s_label to create the X data
X = df_spam.drop(s_label, axis=1)

# Create the y set from the s_label column
y = df_spam[s_label]

In [160]:
# Check the balance of the labels variable (`y`) by using the `value_counts` function.
y.value_counts()

spam
0    2788
1    1813
Name: count, dtype: int64
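
The class counts above also give a useful reference point for the accuracy scores later in the notebook. As a minimal sketch (the counts are copied from the output above; the variable names are only illustrative), the accuracy of a trivial model that always predicts the majority class can be computed like this:

```python
# Class counts copied from the value_counts() output above.
counts = {0: 2788, 1: 1813}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = counts[majority_class] / total

print(f"Total rows: {total}")
print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

Any model worth keeping should score clearly above this baseline.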

In [161]:
# Split the data into X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)


## Scale the Features

Use the `StandardScaler` to scale the features data. Remember that only `X_train` and `X_test` DataFrames should be scaled.
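
To illustrate what the scaler does, here is a small sketch on toy data (the arrays and variable names are hypothetical, not taken from the spam dataset). It assumes the usual pattern of fitting on the training split only and reusing those statistics for the test split, and checks the result against the manual formula `(x - mean) / std`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values, not taken from the spam dataset).
X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_test_toy = np.array([[4.0]])

# Fit on the training split only, then reuse the same statistics for the test split.
scaler = StandardScaler().fit(X_train_toy)
X_test_toy_scaled = scaler.transform(X_test_toy)

# Manual equivalent: subtract the training mean and divide by the training
# standard deviation (population form, ddof=0, which StandardScaler uses).
train_mean = X_train_toy.mean(axis=0)
train_std = X_train_toy.std(axis=0)
manual_scaled = (X_test_toy - train_mean) / train_std

print(np.allclose(X_test_toy_scaled, manual_scaled))
```

Fitting on the training split alone keeps information about the test rows out of the preprocessing step.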

In [162]:
from sklearn.preprocessing import StandardScaler

# Create the StandardScaler instance
ss_spam = StandardScaler()

In [163]:
# Fit the Standard Scaler with the training data
ss_spam.fit(X_train)

In [164]:
# Scale the training and testing data
X_train_scaled = ss_spam.transform(X_train)
X_test_scaled = ss_spam.transform(X_test)


## Create and Fit a Logistic Regression Model

Create a Logistic Regression model, fit it to the training data, make predictions with the testing data, and print the model's accuracy score. You may choose any starting settings you like. 

In [165]:
# Train a Logistic Regression model and print the model score
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(random_state=1)
lr_model.fit(X_train_scaled, y_train)

train_accuracy = lr_model.score(X_train_scaled, y_train)
test_accuracy = lr_model.score(X_test_scaled, y_test)

print('Scores measured using the LogisticRegression model score:')
print(f'Training Accuracy: {train_accuracy}')
print(f'Test Accuracy: {test_accuracy}')

Scores measured using the LogisticRegression model score:
Training Accuracy: 0.9298550724637681
Test Accuracy: 0.9287576020851434


In [166]:
# Make and save testing predictions with the saved logistic regression model using the test data
predictions = lr_model.predict(X_test_scaled)
# Review the predictions
predictions

array([0, 0, 1, ..., 0, 0, 1], dtype=int64)

In [167]:
from sklearn.metrics import accuracy_score

# Calculate the accuracy score by evaluating `y_test` vs. `predictions`
test_accuracy = accuracy_score(y_test, predictions)

print('Score measured using the sklearn.metrics accuracy_score function:')
print(f'Test Accuracy: {test_accuracy}')

Score measured using the sklearn.metrics accuracy_score function:
Test Accuracy: 0.9287576020851434
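
The two test scores match because, for scikit-learn classifiers, `score` returns the mean accuracy on the given data, which is the same quantity `accuracy_score` computes from the predictions. A minimal pure-Python sketch of that calculation (the helper name `accuracy` and the toy labels are hypothetical):

```python
def accuracy(y_true, y_pred):
    """Return the fraction of positions where the two label sequences agree."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Hypothetical labels for illustration only.
toy_true = [0, 0, 1, 1, 0]
toy_pred = [0, 1, 1, 1, 0]
toy_accuracy = accuracy(toy_true, toy_pred)
print(toy_accuracy)
```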


## Create and Fit a Random Forest Classifier Model

Create a Random Forest Classifier model, fit it to the training data, make predictions with the testing data, and print the model's accuracy score. You may choose any starting settings you like. 

In [168]:
# # Checking for the best n_estimators value.
# # Selecting 50 based on test accuracy and the train/test gap
# from sklearn.ensemble import RandomForestClassifier
# estimators_values = [5, 10, 20, 50, 100, 200, 300, 400, 500]

# for estimators in estimators_values:
#     rf_model = RandomForestClassifier(n_estimators=estimators, random_state=1)
        
#     rf_model.fit(X_train_scaled, y_train)

#     train_accuracy = rf_model.score(X_train_scaled, y_train)
#     test_accuracy = rf_model.score(X_test_scaled, y_test)

#     print (f'using {estimators}:')
#     print (f'Training Accuracy: {train_accuracy}')
#     print (f'Test Accuracy: {test_accuracy}')
#     print (f'gap = {train_accuracy - test_accuracy}')
#     print ()

In [169]:
# # Checking for the best max_depth value.
# # Selecting 7 based on test accuracy and the train/test gap
# from sklearn.ensemble import RandomForestClassifier
# depth_values = [1,3,5,7,9,11,15,21]

# for depth in depth_values:
#     rf_model = RandomForestClassifier(n_estimators=50, random_state=1, max_depth = depth)
        
#     rf_model.fit(X_train_scaled, y_train)

#     train_accuracy = rf_model.score(X_train_scaled, y_train)
#     test_accuracy = rf_model.score(X_test_scaled, y_test)

#     print (f'using {depth}:')
#     print (f'Training Accuracy: {train_accuracy}')
#     print (f'Test Accuracy: {test_accuracy}')
#     print (f'gap = {train_accuracy - test_accuracy}')
#     print ()

In [170]:
# Train a Random Forest Classifier model and print the model score
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=50, random_state=1, max_depth = 7)
rf_model.fit(X_train_scaled, y_train)

train_accuracy = rf_model.score(X_train_scaled, y_train)
test_accuracy = rf_model.score(X_test_scaled, y_test)

print('Scores measured using the RandomForest model score:')
print(f'Training Accuracy: {train_accuracy}')
print(f'Test Accuracy: {test_accuracy}')

Scores measured using the RandomForest model score:
Training Accuracy: 0.9484057971014492
Test Accuracy: 0.946133796698523


In [171]:
# Make and save testing predictions with the saved Random Forest model using the test data
predictions = rf_model.predict(X_test_scaled)
# Review the predictions
predictions


array([1, 0, 1, ..., 1, 0, 1], dtype=int64)

In [172]:
# Calculate the accuracy score by evaluating `y_test` vs. `predictions`
test_accuracy = accuracy_score(y_test, predictions)

print('Score measured using the sklearn.metrics accuracy_score function:')
print(f'Test Accuracy: {test_accuracy}')

Score measured using the sklearn.metrics accuracy_score function:
Test Accuracy: 0.946133796698523


## Evaluate the Models

Which model performed better? How does that compare to your prediction? Write down your results and thoughts in the following markdown cell.

***Initially the logistic regression model looked better, but I realized that the random forest model was overfit. I then ran some tests to find model parameters that would improve the Random Forest model. By changing `n_estimators` to 50 and `max_depth` to 7, I improved the performance of the Random Forest model.***
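
The tuning in this notebook compared candidate settings by their test accuracy. One common alternative, sketched here under stated assumptions, is to select parameters with cross-validation on the training split and keep the test split for a final check. The synthetic data from `make_classification`, the small grid, and the variable names are all illustrative stand-ins rather than what was run above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption; swap in the scaled spam features and labels).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

# Small illustrative grid; these values are hypothetical, not the ones searched in the notebook.
param_grid = {"n_estimators": [10, 50], "max_depth": [3, 7]}

search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)  # only the training split is used for tuning

print(search.best_params_)
print(f"Held-out test accuracy: {search.score(X_te, y_te):.3f}")
```

With this approach the reported test accuracy is measured on rows that played no part in choosing the parameters.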

Which model performs better? ***The Random Forest model, with an accuracy_score of 0.946, performs better than the Logistic Regression model, with an accuracy_score of 0.929. As a side note, both models show only a small gap between training and test accuracy, which suggests they should generalize well to new data.***

How does that compare to your prediction? ***The results are contrary to my prediction. This demonstrates why testing different models is important. In this case, the Random Forest was tuned to perform better than the Logistic Regression.***