# Spam Detector

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Moving imports of starter code to first cell as is standard for code
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

## Retrieve the Data

The data is located at [https://static.bc-edx.com/ai/ail-v-1-0/m13/challenge/spam-data.csv](https://static.bc-edx.com/ai/ail-v-1-0/m13/challenge/spam-data.csv)

Dataset Source: [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/94/spambase)

Import the data using Pandas. Display the resulting DataFrame to confirm the import was successful.

### Questioning the quality of data here

This data is said to be from 1999. The content of spam has changed considerably in 25 years, as have many other aspects of spam, so basing a machine learning model on such old data is rather pointless. Indeed, this data appears to be based only on word and character frequencies plus a few capital-letter run lengths, and even then on only 50 or so words. Any modern filter would need to utilize a large language model and look beyond word counts to word positioning and neighboring words, along with formatting features, header metadata, email origin, whether DKIM or at least SPF records are used, and so on. The evolution (or perhaps revolution) of email in the last 25 years renders this simple word-count model completely useless. But I will work with what has been provided for this demonstration.

In [2]:
# Import the data
data = pd.read_csv("https://static.bc-edx.com/ai/ail-v-1-0/m13/challenge/spam-data.csv")
data.head()

| | word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | char_freq_; | char_freq_( | char_freq_[ | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total | spam |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | 0.64 | 0.64 | 0.0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.778 | 0.0 | 0.0 | 3.756 | 61 | 278 | 1 |
| 1 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ... | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 2 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | 1 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
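As a quick check on what kinds of features the file holds, the column names can be grouped by prefix. This is only a sketch: `count_feature_groups` is a hypothetical helper, and the list below contains just the column names visible in the truncated preview, not the full set. In the notebook, `data.columns` could be passed in instead.

```python
from collections import Counter

def count_feature_groups(column_names):
    """Count columns by name prefix (hypothetical helper, not part of the starter code)."""
    prefixes = ("word_freq_", "char_freq_", "capital_run_length_")
    groups = Counter()
    for name in column_names:
        # Names matching no known prefix (such as the label) are counted as "other"
        group = next((p.rstrip("_") for p in prefixes if name.startswith(p)), "other")
        groups[group] += 1
    return dict(groups)

# Only the columns shown in the truncated preview; the real DataFrame has more
preview_columns = [
    "word_freq_make", "word_freq_address", "word_freq_all", "word_freq_3d",
    "word_freq_our", "word_freq_over", "word_freq_remove", "word_freq_internet",
    "word_freq_order", "word_freq_mail",
    "char_freq_;", "char_freq_(", "char_freq_[", "char_freq_!", "char_freq_$", "char_freq_#",
    "capital_run_length_average", "capital_run_length_longest", "capital_run_length_total",
    "spam",
]
feature_groups = count_feature_groups(preview_columns)
print(feature_groups)
```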


## Predict Model Performance

You will be creating and comparing two models on this data: a Logistic Regression, and a Random Forest Classifier. Before you create, fit, and score the models, make a prediction as to which model you think will perform better. You do not need to be correct!

Write down your prediction in the designated cells in your Jupyter Notebook, and provide justification for your educated guess.

### My Prediction

To make my prediction, I first turned to GeeksforGeeks,* a website that I have found, over the period of this course, to provide accurate answers to my questions. There I read about both of these models prior to any class work.

From this documentation I predict the **random forest model will be better.** A random forest trains a group of decision trees on random subsets of the data and features, whereas logistic regression is extended from a single linear regression line. The decision trees act as a group of experts who "vote": each tree provides a prediction, and the most frequent prediction across these random subsets is picked. This reduces the overall variability and the risk of overfitting the training data. In addition, the random forest model provides a variable importance assessment of the features and offers a form of built-in validation through its out-of-bag samples. It can also handle missing data, so there is less need to drop NaN values (though I note the documentation still says to deal with missing values beforehand). A downside of logistic regression is that it expects there to be no outliers in the data set, an unrealistic expectation in my opinion, though I could not find whether the forest model has the same expectation.

All in all, my own logic suggests that the **random forest model** should be more accurate as I analyze the methodologies of both.

\* https://www.geeksforgeeks.org/understanding-logistic-regression/ and https://www.geeksforgeeks.org/random-forest-algorithm-in-machine-learning/
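The "voting" idea in my prediction can be sketched in a few lines. This is an illustration only: `majority_vote` is a hypothetical helper and the per-tree votes are made up. Note also that scikit-learn's `RandomForestClassifier` combines its trees by averaging their predicted class probabilities rather than by a strict hard vote, so this shows the concept, not the library's internals.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most common label among the per-tree predictions for one email."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical votes from five trees for three emails (1 = spam, 0 = not spam)
votes_per_email = [
    [1, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
]
ensemble_predictions = [majority_vote(votes) for votes in votes_per_email]
print(ensemble_predictions)  # [1, 0, 1]
```

An odd number of trees is used here so that a two-class vote can never tie.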

## Split the Data into Training and Testing Sets

In [3]:
# Create the labels set `y` and features DataFrame `X`
y = data["spam"]

## making a copy for X (cheaper to copy before scaling than reverse scaling [should we want original data])
X = data.copy().drop(columns="spam")

In [4]:
# Check the balance of the labels variable (`y`) by using the `value_counts` function.
y.value_counts()

spam
0    2788
1    1813
Name: count, dtype: int64
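These counts also give a useful reference point: a model that always predicts the majority class would already be right for about 61% of the emails, so any accuracy score should be judged against that baseline. A minimal sketch using the counts printed above (the variable names are my own):

```python
# Label counts as printed by y.value_counts() (0 = not spam, 1 = spam)
label_counts = {0: 2788, 1: 1813}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)

# Accuracy of a trivial model that always predicts the majority label
baseline_accuracy = label_counts[majority_label] / total
print(f"{total} emails, majority label {majority_label}, baseline accuracy {baseline_accuracy:.3f}")
```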

In [5]:
# Split the data into X_train, X_test, y_train, y_test
## setting random state to 1 here to ensure consistent output for this demonstration
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

## showing the first five rows of the training set of data
X_train.head()

| | word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | word_freq_conference | char_freq_; | char_freq_( | char_freq_[ | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4576 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.131 | 0.0 | 0.0 | 0.0 | 0.0 | 1.488 | 5 | 64 |
| 4401 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.571 | 5 | 11 |
| 3707 | 0.17 | 0.0 | 0.17 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.8 | 0.0 | ... | 0.0 | 0.253 | 0.168 | 0.084 | 0.0 | 0.024 | 0.0 | 4.665 | 81 | 1031 |
| 2362 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.228 | 53 | 148 |
| 1537 | 0.0 | 0.0 | 0.0 | 0.0 | 2.17 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.333 | 5 | 16 |


## Scale the Features

Use the `StandardScaler` to scale the features data. Remember that only `X_train` and `X_test` DataFrames should be scaled.

In [6]:
## Moved to first cell
# from sklearn.preprocessing import StandardScaler

"""
Note to myself:
    Fitting the scaler on the training data only, then
    scaling both sets with that training fit, avoids the
    bias or data leakage that would result if we had
    also fit on the testing data
"""
# Create the StandardScaler instance
scaler = StandardScaler()

In [7]:
# Fit the Standard Scaler with the training data
X_scaler = scaler.fit(X_train)

In [8]:
# Scale the training data
X_train_scaled = X_scaler.transform(X_train)

## and now scale the testing data too
X_test_scaled = X_scaler.transform(X_test)
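To make the leakage point concrete, here is a minimal sketch of what fitting on the training data only means, written with the standard library rather than `StandardScaler`. The helper names and the numbers are hypothetical. The population standard deviation is used because that is what `StandardScaler` computes.

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    """Learn the mean and population standard deviation from training values only."""
    return mean(train_values), pstdev(train_values)

def transform(values, mu, sigma):
    """Standardize values with statistics learned elsewhere (here: the training set)."""
    return [(v - mu) / sigma for v in values]

# Hypothetical single-feature data
train = [1.0, 2.0, 3.0, 4.0]
test = [2.5, 10.0]

mu, sigma = fit_scaler(train)                # the test values never influence mu or sigma
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)     # scaled with the training statistics

print(mu, sigma)
print(test_scaled)
```

The scaled training values end up with mean 0 and standard deviation 1, while the scaled test values generally do not, which is expected: the test set is treated as unseen data.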

## Create and Fit a Logistic Regression Model

Create a Logistic Regression model, fit it to the training data, make predictions with the testing data, and print the model's accuracy score. You may choose any starting settings you like. 

In [9]:
## Moved to first cell
# from sklearn.linear_model import LogisticRegression

# Train a Logistic Regression model and print the model score

## instantiate the LogisticRegression class with random state set to 1 for consistent output
log_reg_model = LogisticRegression(random_state=1)

## fit the model to training set X and y
log_reg_model.fit(X_train_scaled, y_train)

In [10]:
# Make and save testing predictions with the saved logistic regression model using the test data
## using a distinct variable name here as I want to extend the evaluation below
lrm_testing_predictions = log_reg_model.predict(X_test_scaled)

# Review the predictions
lrm_testing_predictions

array([0, 0, 1, ..., 0, 0, 1], dtype=int64)

In [11]:
# Calculate the accuracy score by evaluating `y_test` vs. `lrm_testing_predictions`.
accuracy_score(y_test, lrm_testing_predictions)

0.9287576020851434

## Create and Fit a Random Forest Classifier Model

Create a Random Forest Classifier model, fit it to the training data, make predictions with the testing data, and print the model's accuracy score. You may choose any starting settings you like. 

In [12]:
## Moved to first cell
# from sklearn.ensemble import RandomForestClassifier

# Train a Random Forest Classifier model and print the model score

## instantiate the RandomForestClassifier class with random state set to 1 for consistent output
rand_forest_model = RandomForestClassifier(random_state=1)

## fit the model to training set X and y
rand_forest_model.fit(X_train_scaled, y_train)

In [13]:
# Make and save testing predictions with the saved [random forest] model using the test data
## using a distinct variable name here as I want to extend the evaluation below
rfm_testing_predictions = rand_forest_model.predict(X_test_scaled)

# Review the predictions
rfm_testing_predictions

array([1, 1, 1, ..., 1, 0, 1], dtype=int64)

In [14]:
# Calculate the accuracy score by evaluating `y_test` vs. `rfm_testing_predictions`.
accuracy_score(y_test, rfm_testing_predictions)

0.9669852302345786

## Evaluate the Models

Which model performed better? How does that compare to your prediction? Write down your results and thoughts in the following markdown cell.

**The Random Forest model was better.**

The Logistic Regression model was 92.9% accurate while the Random Forest model was 96.7% accurate, so the Random Forest model was more accurate, as I predicted, although I would have expected a wider difference between the two had I dared to predict one. Both models were much better than I would expect given the spam filters and associated predictors I have seen over the years. However, I cannot stop at just that.
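One way to put the gap between the two scores in perspective is to convert each accuracy into a count of misclassified test emails. The sketch below assumes the split used `train_test_split`'s default test fraction of 0.25, since no `test_size` was passed; the variable names are my own.

```python
import math

n_samples = 2788 + 1813        # label counts reported by y.value_counts()
test_fraction = 0.25           # assumption: default test_size of train_test_split
n_test = math.ceil(test_fraction * n_samples)

accuracies = {
    "Logistic Regression": 0.9287576020851434,
    "Random Forest": 0.9669852302345786,
}

# Number of misclassified test emails implied by each accuracy score
errors = {name: round((1 - acc) * n_test) for name, acc in accuracies.items()}
print(n_test, errors)
```

Under that assumption, the Random Forest model misclassifies fewer than half as many test emails as the Logistic Regression model, even though the accuracy scores differ by less than four percentage points.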

### Extending the Evaluation

The biggest problem I have with spam filters is their penchant for blocking important emails. To better evaluate both of these models, I need to see their false positive rates. I would take a less accurate model over a more accurate one if it meant fewer false positives, so let's look into that.

In [15]:
## first let's pack our two sets of predictions with our testing labeled data
compare_df = pd.DataFrame(y_test.copy())
compare_df["Logistic Regression"] = lrm_testing_predictions
compare_df["Random Forest"] = rfm_testing_predictions
compare_df.head()

| | spam | Logistic Regression | Random Forest |
| --- | --- | --- | --- |
| 1351 | 1 | 0 | 1 |
| 1687 | 1 | 0 | 1 |
| 1297 | 1 | 1 | 1 |
| 2101 | 0 | 0 | 0 |
| 3920 | 0 | 0 | 0 |


In [16]:
## cutting out the actual spam to look at false positives
legit_compare_df = compare_df[compare_df["spam"] == 0]
legit_compare_df.head()

| | spam | Logistic Regression | Random Forest |
| --- | --- | --- | --- |
| 2101 | 0 | 0 | 0 |
| 3920 | 0 | 0 | 0 |
| 3313 | 0 | 1 | 1 |
| 4102 | 0 | 1 | 0 |
| 3836 | 0 | 0 | 0 |


In [17]:
## calculate percentage for false positives on Logistic Regression Model
lrm_false_positives = legit_compare_df["Logistic Regression"].sum()
lrm_false_pct = lrm_false_positives / len(legit_compare_df["Logistic Regression"])
lrm_false_pct

0.03566333808844508

In [18]:
## calculate percentage for false positives on Random Forest Model
rfm_false_positives = legit_compare_df["Random Forest"].sum()
rfm_false_pct = rfm_false_positives / len(legit_compare_df["Random Forest"])
rfm_false_pct

0.02282453637660485
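The same false positive rate can be read off a confusion matrix, which also shows how much of the actual spam each model catches (recall). The sketch below is illustrative only: the helper names are hypothetical, and it uses just the eight distinct rows visible in the two previews above, so its numbers are not representative of the full test set. scikit-learn's `confusion_matrix` provides the same four counts for the full data.

```python
def confusion_counts(y_true, y_pred):
    """Tally true/false positives and negatives, treating 1 (spam) as the positive class."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1:
            counts["tp" if actual == 1 else "fp"] += 1
        else:
            counts["fn" if actual == 1 else "tn"] += 1
    return counts

def false_positive_rate(counts):
    """Share of legitimate emails that were flagged as spam."""
    return counts["fp"] / (counts["fp"] + counts["tn"])

def recall(counts):
    """Share of actual spam that was caught."""
    return counts["tp"] / (counts["tp"] + counts["fn"])

# Rows 1351, 1687, 1297, 2101, 3920, 3313, 4102, 3836 from the previews above
actual  = [1, 1, 1, 0, 0, 0, 0, 0]
log_reg = [0, 0, 1, 0, 0, 1, 1, 0]
forest  = [1, 1, 1, 0, 0, 1, 0, 0]

lrm_counts = confusion_counts(actual, log_reg)
rfm_counts = confusion_counts(actual, forest)
print(lrm_counts, false_positive_rate(lrm_counts), recall(lrm_counts))
print(rfm_counts, false_positive_rate(rfm_counts), recall(rfm_counts))
```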

#### Final Words

Again we see that the Random Forest model is superior, as it has only a 2.3% false positive rate, whereas the Logistic Regression model has a 3.6% false positive rate. However, both of these would be unacceptable for any email address used for financial transactions, especially when the sender uses an automated email response system, such as one sending the license key for software I just purchased. I would not trust either of these models to filter my email. But I will admit that accuracy above 90% with so low a false positive rate is still better than any spam filter I have seen in the wild over the past 25 years. If only this could be adapted to modern spam data, it might be useful for the more throw-away type of email address that doesn't receive critical messages from businesses.