# Spam Detector

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Retrieve the Data

The data is located at [https://static.bc-edx.com/ai/ail-v-1-0/m13/challenge/spam-data.csv](https://static.bc-edx.com/ai/ail-v-1-0/m13/challenge/spam-data.csv)

Dataset Source: [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/94/spambase)

Import the data using Pandas. Display the resulting DataFrame to confirm the import was successful.

In [3]:
# Import the data
data = pd.read_csv("https://static.bc-edx.com/ai/ail-v-1-0/m13/challenge/spam-data.csv")
data.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,spam
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


## Predict Model Performance

You will create and compare two models on this data: a Logistic Regression model and a Random Forest Classifier. Before you create, fit, and score the models, predict which one you think will perform better. You do not need to be correct!

Write down your prediction in the designated cells in your Jupyter Notebook, and provide justification for your educated guess.

Imagine sorting a pile of fruit into apples and oranges. Logistic Regression is like someone using a single, straightforward rule: weigh up a few clues, such as how red the fruit is and how large it is, and if the combined score passes a threshold, the fruit goes into the apple basket; otherwise, it's considered an orange. This approach is simple and easy to follow, but it might not always get things right, especially with fruits that don't fit neatly into one rule.

On the other hand, think of Random Forest as a group of friends, each with their own unique way of deciding whether a fruit is an apple or an orange. One friend might focus on color, another on size, and another on shape. After they all make their individual decisions, they vote to decide the final outcome for each fruit. This method is great for dealing with complex situations where you need to look at the fruit from different angles, but it's a bit harder to explain why the group called a fruit an apple or an orange because so many factors were considered.

Now, if you're sorting a bunch of standard, easy-to-identify fruits, the single rule might work just fine. But if you have a mix of unusual or tricky fruits, the group's collective wisdom is likely to do a better job, even though it's a bit more complicated.

Applying this to our task of spotting spam emails: if spotting spam depends on simple, clear-cut rules (like looking for certain suspicious words), then Logistic Regression could be all you need. However, if identifying spam requires considering a whole mix of subtle clues, then Random Forest, with its team approach, might be more up to the task, despite being a bit more complex to understand.
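
The two mental models above can be sketched in a few lines of plain Python. The feature names, weights, and thresholds below are invented purely for illustration and are not learned from the spam data; the sketch only contrasts one weighted rule passed through a sigmoid (the Logistic Regression picture) with a majority vote over several simple rules (the Random Forest picture).

```python
import math

# Hypothetical email described by two made-up features (not from the dataset).
email = {"freq_free": 0.8, "freq_exclaim": 0.5}

def logistic_rule(x):
    """One weighted sum passed through a sigmoid; the weights are illustrative only."""
    score = 2.0 * x["freq_free"] + 1.5 * x["freq_exclaim"] - 1.0
    probability = 1.0 / (1.0 + math.exp(-score))
    return 1 if probability >= 0.5 else 0

# Three hypothetical "friends", each a tiny rule that looks at the email differently.
friends = [
    lambda x: 1 if x["freq_free"] > 0.5 else 0,
    lambda x: 1 if x["freq_exclaim"] > 0.7 else 0,
    lambda x: 1 if x["freq_free"] + x["freq_exclaim"] > 1.0 else 0,
]

def forest_vote(x):
    """Majority vote across the individual rules."""
    votes = [friend(x) for friend in friends]
    return 1 if sum(votes) > len(votes) / 2 else 0

logistic_label = logistic_rule(email)
forest_label = forest_vote(email)
print(logistic_label, forest_label)
```

A real Random Forest learns its trees from bootstrapped samples of the training data rather than using hand-written rules, so this is a simplification of the voting idea only.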

## Split the Data into Training and Testing Sets

In [6]:
# Create the labels set `y` and features DataFrame `X`
y = data['spam']  # label.
X = data.drop('spam', axis=1)  # features.

In [7]:
# Check the balance of the labels variable (`y`) by using the `value_counts` function.
y.value_counts()


spam
0    2788
1    1813
Name: count, dtype: int64
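
These counts also give a useful reference point: a hypothetical model that always predicts the majority class (not spam) would already be right on about 60.6% of the full dataset. A quick check of that arithmetic, using only the counts shown above:

```python
# Label counts copied from the value_counts() output above.
not_spam_count = 2788
spam_count = 1813

total = not_spam_count + spam_count

# Accuracy of a hypothetical "always predict 0" baseline on the full dataset.
majority_baseline = max(not_spam_count, spam_count) / total

print(f"Total emails: {total}")
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
```

Any trained model should beat this baseline comfortably to be worth using.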

In [9]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
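
Because the classes are moderately imbalanced, one optional variation is to pass `stratify=y` so that both splits keep roughly the same spam share. The sketch below uses stand-in lists with the same label counts as above rather than the real DataFrame, so it illustrates split sizes and proportions only; it is not the split used in this notebook.

```python
from sklearn.model_selection import train_test_split

# Stand-in data: same label counts as the real dataset, dummy one-column features.
y_demo = [0] * 2788 + [1] * 1813
X_demo = [[i] for i in range(len(y_demo))]

# stratify=y_demo keeps the class proportions similar in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

print(len(X_tr), len(X_te))
print(f"Spam share in train: {sum(y_tr) / len(y_tr):.3f}")
print(f"Spam share in test:  {sum(y_te) / len(y_te):.3f}")
```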

## Scale the Features

Use the `StandardScaler` to scale the features data. Remember that only the `X_train` and `X_test` DataFrames should be scaled; the labels in `y_train` and `y_test` are left unchanged.

In [10]:
from sklearn.preprocessing import StandardScaler

# Create the StandardScaler instance
scaler = StandardScaler()

In [11]:
# Fit the Standard Scaler with the training data
scaler.fit(X_train)

In [12]:
# Scale the training data
X_train_scaled = scaler.transform(X_train)
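
What `fit` and `transform` do can be sketched by hand for a single column. `StandardScaler` learns each column's mean and population standard deviation during `fit`, and `transform` subtracts that mean and divides by that standard deviation. The numbers below are made up for illustration; the key point is that test values are scaled with the statistics learned from the training values, never their own.

```python
import statistics

# Made-up values for one feature column (illustrative only).
train_column = [0.0, 0.5, 1.0, 1.5, 2.0]
test_column = [1.0, 3.0]

# "fit": learn the mean and population standard deviation from training data only.
train_mean = statistics.mean(train_column)
train_std = statistics.pstdev(train_column)

def scale(values):
    # "transform": apply the training statistics to any set of values.
    return [(v - train_mean) / train_std for v in values]

train_scaled = scale(train_column)
test_scaled = scale(test_column)

print(train_scaled)
print(test_scaled)
```

After scaling, the training column has mean 0 and standard deviation 1, while the test column generally does not, which is expected.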

In [14]:
# Scale the testing data
X_test_scaled = scaler.transform(X_test)

## Create and Fit a Logistic Regression Model

Create a Logistic Regression model, fit it to the training data, make predictions with the testing data, and print the model's accuracy score. You may choose any starting settings you like.

In [15]:
# Train a Logistic Regression model and print the model score
from sklearn.linear_model import LogisticRegression

# Create a Logistic Regression model instance
logistic_model = LogisticRegression()

# Train the model using the scaled training data
logistic_model.fit(X_train_scaled, y_train)

# Print the model score using the scaled testing data
print("Logistic Regression model score:", logistic_model.score(X_test_scaled, y_test))

Logistic Regression model score: 0.9272529858849077


In [16]:
# Make and save predictions with the fitted logistic regression model using the scaled test data
y_test_pred = logistic_model.predict(X_test_scaled)

# Review the predictions
print(y_test_pred)

[0 0 1 0 0 1 1 0 1 1 0 1 1 1 0 0 0 1 1 0 0 0 0 1 1 1 1 0 1 1 1 1 0 0 1 0 0
 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 1 1 1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0
 0 1 1 0 1 1 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 1 0 0 1
 0 0 0 1 1 0 0 1 1 0 1 0 0 0 0 1 0 1 0 1 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0
 0 0 1 0 1 0 0 0 1 1 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 0 0 1 0 0 0 1 0 1 0 0 0
 0 1 0 1 1 0 1 1 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0 1 0 1 0 1 0 1 1 1 0 0 0 1 0
 0 1 0 1 0 0 0 0 1 1 0 1 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0
 0 1 1 0 0 0 0 0 1 1 0 1 0 1 0 0 0 0 0 1 0 1 1 1 1 0 0 1 0 0 1 0 1 0 1 1 1
 0 1 1 1 0 0 1 0 1 0 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 1 1 0 0 1 0 0 0 1 0 0 0
 0 1 0 1 0 0 1 0 0 0 1 1 0 1 1 0 0 0 0 0 1 0 1 0 1 0 1 0 0 1 1 1 1 1 1 1 0
 1 0 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 1 1 1 0 0 0 0 0 0 1
 0 1 1 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 1 0 1 1 0 1 0 1 1 0 1 0 1 0 1 0 0 0 1
 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0
 0 0 0 0 1 0 0 0 1 0 0 0 

In [23]:
# Calculate the accuracy score by evaluating `y_test` vs. `y_test_pred`.
from sklearn.metrics import accuracy_score

# Calculate the accuracy score
accuracy_log = accuracy_score(y_test, y_test_pred)

# Print the accuracy score
print("Accuracy score of the Logistic Regression model:", accuracy_log)

Accuracy score of the Logistic Regression model: 0.9272529858849077
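
This value matches the model score printed earlier because, for classifiers, `score` reports mean accuracy, which is the same quantity `accuracy_score` computes: the fraction of predictions that equal the true labels. A minimal hand-rolled version, with toy labels invented for illustration:

```python
def simple_accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Toy labels (not from the spam dataset).
toy_true = [1, 0, 0, 1, 1, 0, 0, 0]
toy_pred = [1, 0, 1, 1, 0, 0, 0, 0]

toy_accuracy = simple_accuracy(toy_true, toy_pred)
print(toy_accuracy)  # 6 of 8 predictions match
```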


## Create and Fit a Random Forest Classifier Model

Create a Random Forest Classifier model, fit it to the training data, make predictions with the testing data, and print the model's accuracy score. You may choose any starting settings you like. 

In [18]:
# Train a Random Forest Classifier model and print the model score
from sklearn.ensemble import RandomForestClassifier

# Create a Random Forest Classifier model instance
random_forest_model = RandomForestClassifier(random_state=1)

# Train the model using the scaled training data
random_forest_model.fit(X_train_scaled, y_train)

# Print the model score using the scaled testing data
print("Random Forest Classifier model score:", random_forest_model.score(X_test_scaled, y_test))

Random Forest Classifier model score: 0.9587404994571118


In [19]:
# Make predictions using the random forest classifier model on the scaled test data
y_test_pred_rf = random_forest_model.predict(X_test_scaled)

# Review the predictions
print(y_test_pred_rf)

[1 1 1 0 0 1 0 0 1 1 0 1 1 1 0 0 0 1 1 0 0 0 0 1 1 1 1 0 1 1 1 1 0 0 1 0 0
 0 0 0 0 1 1 0 0 0 1 0 0 1 0 0 1 1 1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0
 0 1 1 1 1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 1 0 0 1
 0 0 0 1 1 0 0 1 1 0 1 0 0 0 0 1 0 1 0 1 0 1 0 0 0 0 1 1 0 1 0 0 1 0 0 0 0
 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 1 0 1 1 0 1 0 0 1 0 0 0 1 0 1 0 1 0 1 1 0 0
 0 1 0 0 1 0 1 0 0 0 0 0 1 1 1 0 1 1 0 0 0 0 0 1 0 1 0 1 0 1 1 1 0 0 0 1 0
 0 1 0 1 0 0 0 0 1 1 0 1 0 0 1 0 1 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0
 0 1 1 0 0 0 0 0 1 1 0 1 1 1 0 0 0 0 0 1 0 1 1 1 1 0 0 1 0 0 1 1 1 0 1 1 1
 0 1 1 1 0 0 1 0 1 0 1 0 1 0 0 0 0 1 1 1 0 0 0 1 0 1 1 0 0 1 0 0 0 1 0 0 0
 0 0 0 1 0 0 1 0 0 0 1 1 0 1 1 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 0 1 1 1 1 0
 1 0 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 1 0 1 0 0 0 0 0 0 1
 0 1 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0 1 1 0 1 0 1 1 0 1 0 1 0 1 0 0 0 0
 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0
 0 0 0 0 1 0 1 0 1 0 0 0 

In [24]:
# Calculate the accuracy score by evaluating `y_test` vs. `y_test_pred_rf`.
from sklearn.metrics import accuracy_score

# Calculate the accuracy score
accuracy_ran = accuracy_score(y_test, y_test_pred_rf)
print("Accuracy score:", accuracy_ran)

Accuracy score: 0.9587404994571118


In [25]:
# Compare the two accuracy scores computed above (`accuracy_log` and `accuracy_ran`)
print(f"Logistic Regression Accuracy: {accuracy_log}")
print(f"Random Forest Classifier Accuracy: {accuracy_ran}")

if accuracy_log > accuracy_ran:
    print("Logistic Regression model performed the best.")
elif accuracy_log < accuracy_ran:
    print("Random Forest Classifier model performed the best.")
else:
    print("Both models performed equally.")

Logistic Regression Accuracy: 0.9272529858849077
Random Forest Classifier Accuracy: 0.9587404994571118
Random Forest Classifier model performed the best.


## Evaluate the Models

Which model performed better? How does that compare to your prediction? Write down your results and thoughts in the following markdown cell.

In the evaluation of the models applied to our spam detection task, the **Random Forest Classifier** emerged as the superior model with an accuracy of **approximately 95.87%**, outperforming the **Logistic Regression model**, which had an accuracy of **around 92.73%**. This result aligns with my initial prediction, which was influenced by an analogy comparing the model selection to choosing between a single set of rules or a team of experts for fruit classification.

I anticipated that the Random Forest, akin to a diverse group of friends each bringing their own expertise to fruit identification, would handle the complexities and potential non-linear relationships within the dataset more adeptly than Logistic Regression. This prediction was rooted in the understanding that the ensemble approach of Random Forest, which aggregates the votes of many decision trees, would provide a more nuanced and robust analysis. That is especially useful in a dataset like ours, which mixes word frequencies, character frequencies, and capital-letter run lengths.

The analogy proved to be an effective illustration of why the Random Forest's collective 'wisdom' could offer superior performance in our context, emphasizing the value of multiple perspectives in tackling complex classification tasks. The outcome reinforces the importance of understanding the underlying mechanics and strengths of different models, aiding in making informed choices for future predictive modeling tasks.
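
To put the two accuracy figures in concrete terms, they can be converted back into counts of emails. The sketch assumes the test set holds 921 rows, which is what a 20% split of the 4,601 rows works out to when the test size is rounded up; the accuracy values are copied from the outputs above.

```python
import math

total_rows = 2788 + 1813                 # label counts from value_counts()
test_rows = math.ceil(0.2 * total_rows)  # assumed test-set size for test_size=0.2

accuracy_log = 0.9272529858849077  # Logistic Regression, from the output above
accuracy_ran = 0.9587404994571118  # Random Forest, from the output above

# Convert each accuracy back into a whole number of correctly classified emails.
correct_log = round(accuracy_log * test_rows)
correct_ran = round(accuracy_ran * test_rows)

print(f"Test rows: {test_rows}")
print(f"Logistic Regression: {correct_log} correct, {test_rows - correct_log} wrong")
print(f"Random Forest:       {correct_ran} correct, {test_rows - correct_ran} wrong")
```

Under that assumption, the gap of about three percentage points corresponds to a few dozen fewer misclassified emails for the Random Forest on this particular split.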