# Spam Detector

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Retrieve the Data

The data is located at [https://static.bc-edx.com/ai/ail-v-1-0/m13/challenge/spam-data.csv](https://static.bc-edx.com/ai/ail-v-1-0/m13/challenge/spam-data.csv)

Dataset Source: [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/94/spambase)

Import the data using Pandas. Display the resulting DataFrame to confirm the import was successful.

In [2]:
# Import the data
data = pd.read_csv("https://static.bc-edx.com/ai/ail-v-1-0/m13/challenge/spam-data.csv")
data.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,spam
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


## Predict Model Performance

You will be creating and comparing two models on this data: a Logistic Regression model and a Random Forest Classifier. Before you create, fit, and score the models, make a prediction as to which model you think will perform better. You do not need to be correct!


ANSWER:

It's hard to tell which model will perform better, since each has different strengths and weaknesses. Logistic Regression is well suited to binary classification (spam or not spam), but it can only model a linear relationship between the features and the log-odds of the label, so it may miss patterns that depend on non-linear effects or on interactions between features. A Random Forest Classifier can capture those non-linear relationships and interactions, which may help on a dataset with this many features. It's also hard to say in advance whether either model will overfit or underfit. If I had to make a prediction, I would expect the Random Forest Classifier to be more accurate, since there are many variables (58 columns, including the `spam` label) to work with here.
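To make the linear-versus-non-linear point concrete, here is a minimal sketch on synthetic data, not the spam dataset. It assumes scikit-learn's `make_circles` toy generator, where the two classes form concentric rings that no straight line can separate; all variable names are illustrative.

```python
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic toy data (an assumption for illustration): two concentric rings
X_toy, y_toy = make_circles(n_samples=1000, factor=0.5, noise=0.05, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=1)

# A linear model cannot draw a circular boundary; a tree ensemble can approximate one
lr_toy_acc = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)
rf_toy_acc = (
    RandomForestClassifier(n_estimators=50, random_state=1)
    .fit(X_tr, y_tr)
    .score(X_te, y_te)
)

print(f"Logistic Regression accuracy: {lr_toy_acc:.3f}")
print(f"Random Forest accuracy: {rf_toy_acc:.3f}")
```

Real data is rarely this extreme. Whether the spam features behave more like this toy case or are close to linearly separable is what the rest of the notebook tests.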

## Split the Data into Training and Testing Sets

In [3]:
# Create the labels set `y` and features DataFrame `X`
X = data.copy()
X = X.drop(columns='spam')

y = data['spam']

In [4]:
# Check the balance of the labels variable (`y`) by using the `value_counts` function.
data['spam'].value_counts()

spam
0    2788
1    1813
Name: count, dtype: int64
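The counts above also give a useful reference point: the accuracy a model would get by always predicting the majority class. The sketch below recomputes it from the printed counts; `label_counts` and `baseline_accuracy` are names introduced here for illustration.

```python
import pandas as pd

# Label counts copied from the `value_counts` output above
label_counts = pd.Series({0: 2788, 1: 1813}, name="count")

# Share of each class, and the accuracy of always predicting the majority class
class_share = label_counts / label_counts.sum()
baseline_accuracy = class_share.max()

print(class_share.round(3))
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Roughly 61% of the emails are not spam, so the classes are moderately imbalanced, and any model worth keeping should score well above that baseline.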

In [5]:
# Split the data into X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,word_freq_conference,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total
3035,0.0,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.0,0.000,0.000,0.000,0.000,0.000,0.000,1.156,3,37
1718,0.0,0.26,0.78,0.0,0.26,0.43,0.08,1.12,0.43,1.47,...,0.0,0.097,0.222,0.000,0.444,0.250,0.111,3.138,54,929
3066,0.0,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.0,0.000,0.000,0.000,0.000,0.000,0.000,1.125,2,18
3728,0.0,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.0,0.000,0.062,0.000,0.000,0.000,0.000,1.210,6,69
3996,0.0,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.0,0.000,0.000,0.000,0.000,0.000,0.000,3.153,38,82
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
420,0.0,0.00,0.00,0.0,0.00,0.68,0.00,1.36,0.68,0.68,...,0.0,0.000,0.000,0.000,1.143,0.519,0.000,3.737,75,228
3054,0.0,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.0,0.000,0.026,0.343,0.000,0.000,0.026,4.326,28,822
927,0.0,0.00,0.00,0.0,0.43,0.43,0.43,0.43,0.00,0.00,...,0.0,0.000,0.000,0.000,0.395,0.000,1.121,7.983,72,495
3810,0.0,0.00,0.00,0.0,0.00,0.00,0.00,3.97,0.00,0.66,...,0.0,0.110,0.110,0.000,0.000,0.000,0.000,2.857,19,120
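The call above relies on the defaults of `train_test_split`, which hold out 25% of the rows and shuffle them differently on each run, so the scores later in the notebook will vary slightly between runs. A minimal sketch of a reproducible, stratified split follows. The data is a hypothetical stand-in, and `random_state=1` and `stratify` are choices made here, not settings used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data (hypothetical): 1,000 rows with roughly the notebook's 60/40 label mix
rng = np.random.default_rng(1)
X_demo = pd.DataFrame(rng.random((1000, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.Series([0] * 600 + [1] * 400, name="spam")

# random_state fixes the shuffle; stratify keeps the label mix equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1, stratify=y_demo
)

print(f"Spam share in train: {y_tr.mean():.3f}")
print(f"Spam share in test:  {y_te.mean():.3f}")
```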


## Scale the Features

Use the `StandardScaler` to scale the features data. Remember that only `X_train` and `X_test` DataFrames should be scaled.

In [6]:
from sklearn.preprocessing import StandardScaler

# Create the StandardScaler instance
scaler = StandardScaler()

In [7]:
# Fit the Standard Scaler with the training data
X_scaler = scaler.fit(X_train)

In [8]:
# Scale the training and testing data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
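The scaler is fit on the training data only, so that no information about the test set leaks into preprocessing. The sketch below, on hypothetical stand-in features, shows what that implies: the scaled training columns have mean 0 and standard deviation 1, while the scaled test columns are only approximately centered because they reuse the training statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in features (hypothetical), deliberately on very different scales
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=[0.0, 50.0, 1000.0], scale=[1.0, 10.0, 300.0], size=(400, 3))
X_tr, X_te = train_test_split(X_demo, random_state=1)

# Learn the mean and standard deviation from the training rows only
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print("train means:", X_tr_scaled.mean(axis=0).round(3))
print("train stds: ", X_tr_scaled.std(axis=0).round(3))
print("test means: ", X_te_scaled.mean(axis=0).round(3))
```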

## Create and Fit a Logistic Regression Model

Create a Logistic Regression model, fit it to the training data, make predictions with the testing data, and print the model's accuracy score. You may choose any starting settings you like. 

In [9]:
# Train a Logistic Regression model and print the model score
from sklearn.linear_model import LogisticRegression

logistic_regression_model = LogisticRegression(random_state=1, max_iter=1500)

lr_model = logistic_regression_model.fit(X_train_scaled, y_train)

print(f"Training Data Score: {lr_model.score(X_train_scaled, y_train)}")
print(f"Testing Data Score: {lr_model.score(X_test_scaled, y_test)}")

Training Data Score: 0.9359420289855073
Testing Data Score: 0.9183318853171155


In [10]:
# Make and save testing predictions with the saved logistic regression model using the test data
testing_predictions = lr_model.predict(X_test_scaled)

# Review the predictions
testing_predictions

array([1, 0, 0, ..., 1, 0, 0], dtype=int64)

In [11]:
# Calculate the accuracy score by evaluating `y_test` vs. `testing_predictions`.
accuracy_score(y_test, testing_predictions)

0.9183318853171155
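Accuracy counts correct predictions for both classes together, so it does not show how many spam emails were missed or how many legitimate emails were flagged. A confusion matrix breaks that down. The sketch below uses small hypothetical label arrays so the arithmetic is easy to follow; it is not this notebook's output.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions (1 = spam, 0 = not spam), for illustration only
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 1])

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_true_demo, y_pred_demo)    # (TP + TN) / all
precision = precision_score(y_true_demo, y_pred_demo)  # TP / (TP + FP)
recall = recall_score(y_true_demo, y_pred_demo)        # TP / (TP + FN)

print(cm)
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```

The same calls could be applied to `y_test` and the saved test predictions to see which kind of error each model makes more often.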

## Create and Fit a Random Forest Classifier Model

Create a Random Forest Classifier model, fit it to the training data, make predictions with the testing data, and print the model's accuracy score. You may choose any starting settings you like. 

In [12]:
# Train a Random Forest Classifier model and print the model score
from sklearn.ensemble import RandomForestClassifier

random_forest_model = RandomForestClassifier(n_estimators=128, random_state=1)

rf_model = random_forest_model.fit(X_train_scaled, y_train)

print(f"Training Data Score: {rf_model.score(X_train_scaled, y_train)}")
print(f"Testing Data Score: {rf_model.score(X_test_scaled, y_test)}")

Training Data Score: 0.9994202898550725
Testing Data Score: 0.947871416159861


In [13]:
# Make and save testing predictions with the saved random forest model using the test data
testing_predictions_2 = rf_model.predict(X_test_scaled)

# Review the predictions
testing_predictions_2

array([1, 0, 0, ..., 1, 0, 0], dtype=int64)

In [14]:
# Calculate the accuracy score by evaluating `y_test` vs. `testing_predictions_2`.
accuracy_score(y_test, testing_predictions_2)

0.947871416159861
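The truncated array printouts for the two models look identical, so they cannot show where the models differ. A minimal sketch of how to count disagreements between two prediction arrays follows. The arrays here are hypothetical stand-ins, not the notebook's predictions.

```python
import numpy as np

# Hypothetical prediction arrays standing in for the two models' outputs
lr_preds_demo = np.array([1, 0, 0, 1, 1, 0, 1, 0])
rf_preds_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0])

# Boolean mask of positions where the two models disagree
disagree_mask = lr_preds_demo != rf_preds_demo

print(f"Disagreements: {disagree_mask.sum()} of {disagree_mask.size}")
print("At positions:", np.flatnonzero(disagree_mask))
```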

## Evaluate the Models

Which model performed better? How does that compare to your prediction? Write down your results and thoughts in the following markdown cell.

ANSWER:

--------------------------------------------------------

Logistic Regression Model:

Training Data Score: 0.9359420289855073

Testing Data Score: 0.9183318853171155

Accuracy Score: 0.9183318853171155

This model performed reasonably well on both the training and testing datasets, and the two scores are close to one another, which suggests little overfitting. The accuracy score of 91.8% means it correctly classified 91.8% of the test emails as spam or not spam.

-------------------------------------------------------

Random Forest Classifier Model:

Training Data Score: 0.9994202898550725

Testing Data Score: 0.947871416159861

Accuracy Score: 0.947871416159861

This model performed very well on the training data, with an almost perfect score. It correctly classified 94.8% of the test emails, about three percentage points better than the logistic regression model. However, there is a noticeable gap of about five percentage points between the training and testing scores, and the training score is almost perfect, both of which indicate that there may be some overfitting.
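A near-perfect training score is common for random forests, because by default each tree is grown until its leaves are pure. One way to probe how much the train/test gap matters is to constrain the trees and see how both scores move. The sketch below does this on synthetic stand-in data; `min_samples_leaf=20` is an arbitrary value chosen for illustration, not a tuned setting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical) with 10% label noise
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

settings = {
    "fully grown trees": {},
    "min_samples_leaf=20": {"min_samples_leaf": 20},
}

# Fit one forest per setting and record (train accuracy, test accuracy)
results = {}
for label, params in settings.items():
    model = RandomForestClassifier(n_estimators=128, random_state=1, **params)
    model.fit(X_tr, y_tr)
    results[label] = (model.score(X_tr, y_tr), model.score(X_te, y_te))

for label, (train_acc, test_acc) in results.items():
    gap = train_acc - test_acc
    print(f"{label}: train={train_acc:.3f}, test={test_acc:.3f}, gap={gap:.3f}")
```

If the constrained forest narrows the gap without lowering the test score, the original gap was mostly a side effect of how the trees are grown rather than a sign of worse generalization.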

-------------------------------------------------------

Overall, both models performed very well. Logistic Regression had a slightly lower accuracy score than Random Forest (91.8% versus 94.8% on the test set), but the gap between training and testing scores and the near-perfect training score for Random Forest suggest overfitting, which means that model fits this dataset very closely and may not perform as well on new data. So although it scored slightly lower, my choice would be the Logistic Regression model in this instance. It seems I was half-correct in my prediction: Random Forest did perform better. However, the signs of overfitting make me trust it less than the slightly less accurate Logistic Regression.
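A single train/test split gives only one estimate of how each model generalizes. Cross-validation repeats the comparison over several splits and is one way to check the conclusion above. This is a minimal sketch on synthetic stand-in data, reusing the model settings from this notebook; the 5-fold setup and the `candidates` dictionary are assumptions made here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical); swap in the notebook's `X` and `y` to reuse
X_demo, y_demo = make_classification(n_samples=600, n_features=20, random_state=1)

candidates = {
    # The pipeline refits the scaler inside each fold, so held-out rows never influence it
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(random_state=1, max_iter=1500)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=128, random_state=1),
}

# Five accuracy scores per model, one for each held-out fold
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    for name, model in candidates.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

A model whose mean score is higher and whose fold-to-fold spread is small is the safer choice for new data, whatever its training score looks like.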
