# Spam Detector

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Retrieve the Data

The data is located at [https://static.bc-edx.com/ai/ail-v-1-0/m13/challenge/spam-data.csv](https://static.bc-edx.com/ai/ail-v-1-0/m13/challenge/spam-data.csv)

Dataset Source: [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/94/spambase)

Import the data using Pandas. Display the resulting DataFrame to confirm the import was successful.

In [2]:
# Import the data
data = pd.read_csv("https://static.bc-edx.com/ai/ail-v-1-0/m13/challenge/spam-data.csv")

In [3]:
# View the first few rows of the DataFrame
print("First few rows of the DataFrame:")
print(data.head())

# Check column names
print("\nColumn names:")
print(data.columns)

# Get DataFrame info
print("\nDataFrame info:")
data.info()

# Describe the data
print("\nData description:")
print(data.describe())

# Check the unique values in the target column
print("\nUnique values in the target column:")
print(data['spam'].value_counts())

First few rows of the DataFrame:
   word_freq_make  word_freq_address  word_freq_all  word_freq_3d  \
0            0.00               0.64           0.64           0.0   
1            0.21               0.28           0.50           0.0   
2            0.06               0.00           0.71           0.0   
3            0.00               0.00           0.00           0.0   
4            0.00               0.00           0.00           0.0   

   word_freq_our  word_freq_over  word_freq_remove  word_freq_internet  \
0           0.32            0.00              0.00                0.00   
1           0.14            0.28              0.21                0.07   
2           1.23            0.19              0.19                0.12   
3           0.63            0.00              0.31                0.63   
4           0.63            0.00              0.31                0.63   

   word_freq_order  word_freq_mail  ...  char_freq_;  char_freq_(  \
0             0.00            0.00  ..

## Predict Model Performance

You will be creating and comparing two models on this data: a Logistic Regression model and a Random Forest Classifier. Before you create, fit, and score the models, make a prediction as to which model you think will perform better. You do not need to be correct!

Write down your prediction in the designated cells in your Jupyter Notebook, and provide justification for your educated guess.

### Prediction: Random Forest Classifier will perform better than Logistic Regression for spam detection.

Key reasons:

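- Complexity of email spam: Spam patterns are often non-linear (for example, a word may only signal spam in combination with other words), and Random Forest can capture such interactions, whereas Logistic Regression fits a single linear decision boundary.
- Number of features: The dataset has 57 features, and Random Forest can combine many features without manual feature engineering.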
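The non-linearity argument can be illustrated with a small sketch. The data below is synthetic and hypothetical (an XOR-style pattern, not the spam dataset), and the variable names are made up for this example. It only shows the kind of pattern a linear model struggles with and says nothing about how the two models will behave on the real data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic XOR-style data (hypothetical, not the spam dataset):
# the label is 1 only when the two features have opposite signs.
rng = np.random.default_rng(0)
X_toy = rng.uniform(-1, 1, size=(2000, 2))
y_toy = ((X_toy[:, 0] > 0) ^ (X_toy[:, 1] > 0)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# A linear model cannot separate the classes with a single straight boundary
toy_lr = LogisticRegression().fit(X_tr, y_tr)
# A forest of decision trees can split on each feature in turn
toy_rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)

toy_lr_acc = toy_lr.score(X_te, y_te)
toy_rf_acc = toy_rf.score(X_te, y_te)
print(f"Logistic Regression: {toy_lr_acc:.3f}")
print(f"Random Forest:       {toy_rf_acc:.3f}")
```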

## Split the Data into Training and Testing Sets

In [4]:
# Create the labels set `y`
y = data['spam']

# Create the features DataFrame `X`
feature_columns = data.columns.drop('spam')
X = data[feature_columns]

In [5]:
# Check the balance of the labels variable (`y`) by using the `value_counts` function.
value_counts = y.value_counts()
print(value_counts)
# Express the same balance as percentages
percentage = y.value_counts(normalize=True) * 100
print(percentage)

spam
0    2788
1    1813
Name: count, dtype: int64
spam
0    60.595523
1    39.404477
Name: proportion, dtype: float64


In [6]:
# Split the data into X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [7]:
# Display the first few rows of the training data
print("First few rows of X_train:")
display(X_train.head())

# Display information about the split
print("\nShape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

First few rows of X_train:


| | word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | word_freq_conference | char_freq_; | char_freq_( | char_freq_[ | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1370 | 0.09 | 0.0 | 0.09 | 0.0 | 0.39 | 0.09 | 0.09 | 0.0 | 0.19 | 0.29 | ... | 0.0 | 0.0 | 0.139 | 0.0 | 0.31 | 0.155 | 0.0 | 6.813 | 494 | 1458 |
| 3038 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.124 | 0.124 | 0.0 | 0.0 | 0.0 | 0.0 | 1.8 | 8 | 45 |
| 2361 | 0.0 | 0.0 | 2.43 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.27 | 0.0 | ... | 0.0 | 0.0 | 0.344 | 0.0 | 0.0 | 0.0 | 0.0 | 2.319 | 12 | 167 |
| 156 | 0.0 | 0.0 | 0.0 | 0.0 | 1.31 | 0.0 | 1.31 | 1.31 | 1.31 | 1.31 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.117 | 0.117 | 0.0 | 48.5 | 186 | 291 |
| 2526 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.307 | 8 | 30 |



Shape of X_train: (3680, 57)
Shape of X_test: (921, 57)
Shape of y_train: (3680,)
Shape of y_test: (921,)
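The split above shuffles rows without regard to class, so the roughly 60/40 balance is only approximately preserved in each subset. One common alternative, sketched here as an option rather than as what this notebook does, is to pass `stratify` to `train_test_split`. The `X_demo` and `y_demo` objects are hypothetical stand-ins that reuse only the class counts printed earlier.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in with the same class counts as the notebook's `y`
y_demo = pd.Series([0] * 2788 + [1] * 1813, name="spam")
X_demo = pd.DataFrame({"feature": range(len(y_demo))})

# stratify=y_demo keeps the spam share nearly identical in both subsets
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_share = ys_train.mean()  # fraction of rows labelled 1 (spam)
test_share = ys_test.mean()
print(f"Spam share in train: {train_share:.4f}")
print(f"Spam share in test:  {test_share:.4f}")
```

With real data, the same call would take `X` and `y` in place of the stand-ins.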


## Scale the Features

Use the `StandardScaler` to scale the features data. Remember that only `X_train` and `X_test` DataFrames should be scaled.

In [8]:
from sklearn.preprocessing import StandardScaler

# Create the StandardScaler instance
scaler = StandardScaler()


In [9]:
# Fit the Standard Scaler with the training data
scaler.fit(X_train)

In [10]:
# Scale the training and testing data with the scaler fitted on X_train
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
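Fitting the scaler on the training rows only, as above, keeps test-set statistics out of the model. The same fit-then-transform sequence can also be bundled with the estimator using scikit-learn's `make_pipeline`. The sketch below uses synthetic data from `make_classification` as a hypothetical stand-in for the spam features and checks that both approaches give the same test accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical), with 57 features like the spam dataset
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=57, n_informative=10, random_state=0
)
Xp_train, Xp_test, yp_train, yp_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Pipeline: fit() fits the scaler on the training rows only, then the model
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(Xp_train, yp_train)
pipe_acc = pipe.score(Xp_test, yp_test)

# Manual equivalent of the steps used in this notebook
manual_scaler = StandardScaler().fit(Xp_train)
manual_lr = LogisticRegression(max_iter=1000).fit(
    manual_scaler.transform(Xp_train), yp_train
)
manual_acc = manual_lr.score(manual_scaler.transform(Xp_test), yp_test)

print(f"Pipeline accuracy: {pipe_acc:.4f}")
print(f"Manual accuracy:   {manual_acc:.4f}")
```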

## Create and Fit a Logistic Regression Model

Create a Logistic Regression model, fit it to the training data, make predictions with the testing data, and print the model's accuracy score. You may choose any starting settings you like. 

In [11]:
# Imports
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [12]:
# Create the Logistic Regression model
lr_model = LogisticRegression(random_state=1)

In [13]:
# Fit the logistic regression model
lr_model.fit(X_train_scaled, y_train)

In [14]:
# Make predictions
lr_predictions = lr_model.predict(X_test_scaled)
# View a sample of predictions
print("\nSample of predictions:")
for i in range(20):  # Show first 20 predictions
    print(f"Predicted: {lr_predictions[i]}, Actual: {y_test.iloc[i]}")


Sample of predictions:
Predicted: 0, Actual: 0
Predicted: 0, Actual: 0
Predicted: 0, Actual: 0
Predicted: 1, Actual: 1
Predicted: 0, Actual: 0
Predicted: 1, Actual: 1
Predicted: 0, Actual: 0
Predicted: 0, Actual: 0
Predicted: 0, Actual: 0
Predicted: 1, Actual: 0
Predicted: 0, Actual: 0
Predicted: 0, Actual: 0
Predicted: 0, Actual: 0
Predicted: 1, Actual: 1
Predicted: 0, Actual: 1
Predicted: 0, Actual: 0
Predicted: 0, Actual: 1
Predicted: 0, Actual: 1
Predicted: 1, Actual: 1
Predicted: 0, Actual: 0


In [15]:
# Calculate the accuracy score
lr_accuracy = accuracy_score(y_test, lr_predictions)
print(f"Logistic Regression Accuracy: {lr_accuracy:.4f}")

Logistic Regression Accuracy: 0.9197
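Accuracy alone does not show which kind of error the model makes. As a rough illustration, the 20 sample predictions printed above can be tabulated with `confusion_matrix`. The two lists are copied from that output; 20 rows are far too few to represent the full test set, so treat the numbers as a demonstration of the metric rather than as an evaluation of the model.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# The 20 (predicted, actual) pairs from the sample printed above
sample_pred   = [0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
sample_actual = [0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(sample_actual, sample_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"False positives (legitimate mail flagged as spam): {fp}")
print(f"False negatives (spam that got through):           {fn}")

# Precision: share of flagged emails that really are spam
sample_precision = precision_score(sample_actual, sample_pred)
# Recall: share of actual spam that was flagged
sample_recall = recall_score(sample_actual, sample_pred)
print(f"Precision: {sample_precision:.3f}, Recall: {sample_recall:.3f}")
```

On the full test set, the same calls would take `y_test` and `lr_predictions`.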


## Create and Fit a Random Forest Classifier Model

Create a Random Forest Classifier model, fit it to the training data, make predictions with the testing data, and print the model's accuracy score. You may choose any starting settings you like. 

In [18]:
#Imports
from sklearn.ensemble import RandomForestClassifier

In [19]:
# Create the Random Forest Classifier model
rf_model = RandomForestClassifier(n_estimators=100, random_state=1)

In [20]:
# Fit the model on the scaled training data
# (tree-based models do not require scaling, but it does no harm here)
rf_model.fit(X_train_scaled, y_train)

In [21]:
# Make predictions
rf_predictions = rf_model.predict(X_test_scaled)
# View a sample of predictions
print("\nSample of predictions:")
for i in range(20):  # Show first 20 predictions
    print(f"Predicted: {rf_predictions[i]}, Actual: {y_test.iloc[i]}")


Sample of predictions:
Predicted: 0, Actual: 0
Predicted: 0, Actual: 0
Predicted: 0, Actual: 0
Predicted: 1, Actual: 1
Predicted: 0, Actual: 0
Predicted: 0, Actual: 1
Predicted: 0, Actual: 0
Predicted: 0, Actual: 0
Predicted: 0, Actual: 0
Predicted: 0, Actual: 0
Predicted: 0, Actual: 0
Predicted: 0, Actual: 0
Predicted: 0, Actual: 0
Predicted: 1, Actual: 1
Predicted: 1, Actual: 1
Predicted: 0, Actual: 0
Predicted: 1, Actual: 1
Predicted: 1, Actual: 1
Predicted: 1, Actual: 1
Predicted: 0, Actual: 0


In [22]:
# Calculate the accuracy score
rf_accuracy = accuracy_score(y_test, rf_predictions)
print(f"Random Forest Classifier Accuracy: {rf_accuracy:.4f}")

Random Forest Classifier Accuracy: 0.9566


In [23]:
print("Model Accuracy Comparison:")
print(f"Logistic Regression Accuracy: {lr_accuracy:.4f}")
print(f"Random Forest Classifier Accuracy: {rf_accuracy:.4f}")

# Calculate the difference in accuracy
accuracy_difference = rf_accuracy - lr_accuracy
print(f"\nDifference in Accuracy: {accuracy_difference:.4f}")

# Determine which model performed better
better_model = "Random Forest Classifier" if rf_accuracy > lr_accuracy else "Logistic Regression"
print(f"\nThe better performing model is: {better_model}")

Model Accuracy Comparison:
Logistic Regression Accuracy: 0.9197
Random Forest Classifier Accuracy: 0.9566

Difference in Accuracy: 0.0369

The better performing model is: Random Forest Classifier
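The comparison above rests on a single train/test split. A common way to check that a ranking is not an artifact of one split is k-fold cross-validation. The sketch below shows the mechanics on synthetic data from `make_classification`, a hypothetical stand-in; the scores it prints are not results for the spam dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the spam dataset)
X_cv, y_cv = make_classification(
    n_samples=600, n_features=20, n_informative=8, random_state=0
)

# Five stratified folds; each fold serves once as the held-out set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    # The pipeline refits the scaler inside every fold
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest Classifier": RandomForestClassifier(
        n_estimators=100, random_state=1
    ),
}

cv_scores = {
    name: cross_val_score(model, X_cv, y_cv, cv=cv, scoring="accuracy")
    for name, model in candidates.items()
}
for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.4f}, std={scores.std():.4f}")
```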


## Evaluate the Models

The Random Forest Classifier performed better, achieving an accuracy of 0.9566 compared to the Logistic Regression model's accuracy of 0.9197. This result aligns with my initial prediction.

The outcome is consistent with the two reasons behind my prediction, although this experiment does not isolate either of them as the cause:

1. The complexity of email spam patterns, which Random Forest can better capture.
2. The number of features (57) in the dataset, which Random Forest handles well.

Additionally, the Random Forest's ensemble approach, which averages many decision trees and implicitly down-weights uninformative features, likely contributed to its stronger performance.
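A fitted `RandomForestClassifier` exposes how much each feature contributed through its `feature_importances_` attribute. The sketch below is a minimal illustration on synthetic data; the feature names are hypothetical, and only the first two columns are constructed to carry signal. On this notebook's model, the same attribute could be read from `rf_model` and paired with `X.columns`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (hypothetical): with shuffle=False the two informative
# features are columns 0 and 1, and the remaining four are pure noise.
X_fi, y_fi = make_classification(
    n_samples=500,
    n_features=6,
    n_informative=2,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)
feature_names = [f"feature_{i}" for i in range(6)]  # hypothetical names

forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_fi, y_fi)

# Impurity-based importances; they are non-negative and sum to 1
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```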

While both models performed well, with accuracies above 90%, the Random Forest Classifier's 3.69 percentage point improvement over Logistic Regression is meaningful in the context of spam detection, where even small improvements affect how much spam reaches users and how much legitimate mail is wrongly flagged. Because the gap was measured on a single 921-email test split, it has not been tested for statistical significance here.
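The accuracy gap can be translated into email counts and given a rough uncertainty estimate. The sketch below uses only the test-set size and the two accuracies reported above. The interval is a normal approximation that treats each accuracy independently; since both models were scored on the same emails, a paired test such as McNemar's would be the more appropriate check, so read the intervals as a back-of-the-envelope guide.

```python
import math

n_test = 921                      # test-set size, from the split shapes above
lr_acc, rf_acc = 0.9197, 0.9566   # accuracies reported above

# Convert the accuracies back into counts of correctly classified emails
lr_correct = round(lr_acc * n_test)         # 847
rf_correct = round(rf_acc * n_test)         # 881
extra_correct = rf_correct - lr_correct     # 34 more emails classified correctly


def normal_interval(acc, n, z=1.96):
    """Approximate 95% interval for an accuracy (normal approximation)."""
    half_width = z * math.sqrt(acc * (1 - acc) / n)
    return acc - half_width, acc + half_width


lr_interval = normal_interval(lr_acc, n_test)
rf_interval = normal_interval(rf_acc, n_test)

print(f"Extra emails classified correctly by Random Forest: {extra_correct}")
print(f"Logistic Regression 95% interval: {lr_interval[0]:.4f} to {lr_interval[1]:.4f}")
print(f"Random Forest 95% interval:       {rf_interval[0]:.4f} to {rf_interval[1]:.4f}")
```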

This outcome supports the value of choosing algorithms that match the problem's characteristics and illustrates how ensemble methods like Random Forest can perform well on classification tasks with many interacting features, such as spam detection.