# Spam Detector

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

## Retrieve the Data

The data is located at [https://static.bc-edx.com/ai/ail-v-1-0/m13/challenge/spam-data.csv](https://static.bc-edx.com/ai/ail-v-1-0/m13/challenge/spam-data.csv)

Dataset Source: [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/94/spambase)

Import the data using Pandas. Display the resulting DataFrame to confirm the import was successful.

In [2]:
# Import the data
data = pd.read_csv("https://static.bc-edx.com/ai/ail-v-1-0/m13/challenge/spam-data.csv")
data.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,spam
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


## Predict Model Performance

You will create and compare two models on this data: a Logistic Regression model and a Random Forest Classifier. Before you create, fit, and score the models, predict which one you think will perform better. You do not need to be correct!

Write down your prediction in the designated cells in your Jupyter Notebook, and provide justification for your educated guess.

*Replace the text in this markdown cell with your predictions, and be sure to provide justification for your guess.*

## Split the Data into Training and Testing Sets

In [3]:
# Create the labels set `y` and the features DataFrame `X`
y = data['spam']  # Labels, assuming 'spam' holds 0 for not spam and 1 for spam
X = data.drop('spam', axis=1)  # Every column except 'spam' is a feature

# Display the first few rows of the labels and features
y.head(), X.head()



(0    1
 1    1
 2    1
 3    1
 4    1
 Name: spam, dtype: int64,
    word_freq_make  word_freq_address  word_freq_all  word_freq_3d  \
 0            0.00               0.64           0.64           0.0   
 1            0.21               0.28           0.50           0.0   
 2            0.06               0.00           0.71           0.0   
 3            0.00               0.00           0.00           0.0   
 4            0.00               0.00           0.00           0.0   
 
    word_freq_our  word_freq_over  word_freq_remove  word_freq_internet  \
 0           0.32            0.00              0.00                0.00   
 1           0.14            0.28              0.21                0.07   
 2           1.23            0.19              0.19                0.12   
 3           0.63            0.00              0.31                0.63   
 4           0.63            0.00              0.31                0.63   
 
    word_freq_order  word_freq_mail  ...  word_freq_confere

In [4]:
# Check the balance of the labels variable (`y`) by using the `value_counts` function.
y.value_counts()


spam
0    2788
1    1813
Name: count, dtype: int64
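
The counts above are easier to read as proportions. The following is a minimal sketch, assuming a stand-in label Series built from those two counts (`y_demo` and the placeholder `X_demo` are hypothetical, not the real dataset). It also shows the optional `stratify` argument of `train_test_split`, which the notebook's own split does not use, as one way to keep the class ratio similar in both splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in labels with the same class counts as the output above
# (assumption: synthetic data, not the real dataset).
y_demo = pd.Series([0] * 2788 + [1] * 1813, name="spam")
X_demo = pd.DataFrame({"row_id": range(len(y_demo))})  # hypothetical placeholder feature

# Class proportions instead of raw counts
proportions = y_demo.value_counts(normalize=True)
print(proportions.round(3))

# stratify=y_demo keeps the spam share nearly identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)
print(f"spam share in train: {y_tr.mean():.3f}, in test: {y_te.mean():.3f}")
```

With a roughly 61/39 split between the classes, a model that always predicted "not spam" would already score about 61% accuracy, which is a useful baseline to keep in mind when reading the scores later in the notebook.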

In [5]:
# Split the data into X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Print the shapes of the training and testing sets
print("Training set shape:", X_train.shape, y_train.shape)
print("Testing set shape:", X_test.shape, y_test.shape)

Training set shape: (3680, 57) (3680,)
Testing set shape: (921, 57) (921,)


## Scale the Features

Use the `StandardScaler` to scale the features data. Remember that only `X_train` and `X_test` DataFrames should be scaled.

In [6]:
# Create an instance of StandardScaler (already imported in the first cell)
scaler = StandardScaler()


In [7]:
# Fit the StandardScaler with the training data only
scaler.fit(X_train)


In [8]:
# Scale the training data with the fitted scaler
X_train_scaled = scaler.transform(X_train)



In [9]:
# Scale the testing features using the fitted scaler from the training data
X_test_scaled = scaler.transform(X_test)
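
The reason for fitting the scaler on the training data only is that the test set should play the role of unseen data: its mean and standard deviation must not influence the transformation. The sketch below illustrates this with tiny made-up arrays (`X_train_demo`, `X_test_demo`, and `scaler_demo` are hypothetical names, not part of the notebook).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up feature arrays (assumption: illustrative values only)
X_train_demo = np.array([[1.0], [2.0], [3.0]])
X_test_demo = np.array([[4.0], [5.0]])

scaler_demo = StandardScaler()
scaler_demo.fit(X_train_demo)  # learns mean and std from the training rows only

train_scaled = scaler_demo.transform(X_train_demo)
test_scaled = scaler_demo.transform(X_test_demo)  # reuses the training statistics

print("training mean learned by the scaler:", scaler_demo.mean_)
print("mean of scaled training column:", train_scaled.mean())
print("scaled test values:", test_scaled.ravel())
```

The scaled training column is centered on zero, while the scaled test values are not, because they are expressed relative to the training statistics. That is the intended behavior.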

In [10]:
# Display the first few rows of the scaled features to check
X_train_scaled[:5], X_test_scaled[:5]

(array([[-0.35849294, -0.1663156 , -0.56145245, -0.04622283, -0.46213226,
         -0.34826528, -0.28974258, -0.25433765, -0.32332618, -0.37163401,
         -0.29468554, -0.62650109, -0.31339295, -0.17532912, -0.19447397,
         -0.29414289, -0.31752206, -0.35579466,  0.13679517, -0.16618766,
         -0.67441325, -0.12561726, -0.29347046, -0.215155  , -0.32371654,
         -0.29991686, -0.22988673, -0.22211804, -0.16692576, -0.21817596,
         -0.1695615 , -0.14085718, -0.17082404, -0.14304829, -0.18630409,
         -0.23399564, -0.32328961, -0.06090721, -0.18036846, -0.18204882,
         -0.12120056, -0.17268519, -0.20493605, -0.14140736, -0.30532057,
         -0.19618137, -0.07194518, -0.1076204 , -0.156207  , -0.48491863,
         -0.15134534, -0.32037559, -0.31155305,  0.69767005, -0.108199  ,
         -0.21450276, -0.43523962],
        [-0.08676893, -0.03497293, -0.22384706, -0.04622283,  0.74237706,
         -0.06285915, -0.28974258,  0.38019668, -0.32332618, -0.37163401,
  

## Create and Fit a Logistic Regression Model

Create a Logistic Regression model, fit it to the training data, make predictions with the testing data, and print the model's accuracy score. You may choose any starting settings you like. 

In [11]:
# Train a Logistic Regression model and print the model score
# (LogisticRegression is already imported in the first cell)
logistic_model = LogisticRegression(random_state=1)

# Fit the model to the scaled training data
logistic_model.fit(X_train_scaled, y_train)

# Make predictions on the testing data
y_pred_logistic = logistic_model.predict(X_test_scaled)

# Print the accuracy score of the Logistic Regression model
accuracy_logistic = accuracy_score(y_test, y_pred_logistic)
print(f"Logistic Regression Model Accuracy: {accuracy_logistic:.4f}")

Logistic Regression Model Accuracy: 0.9273


In [12]:
# Make and save testing predictions with the fitted logistic regression model using the test data
test_predictions_logistic = logistic_model.predict(X_test_scaled)

# Collect the actual and predicted labels side by side for review
predictions_df = pd.DataFrame({'Actual': y_test, 'Predicted_Logistic': test_predictions_logistic})

# Review the predictions
print(predictions_df.head())


      Actual  Predicted_Logistic
1351       1                   0
1687       1                   0
1297       1                   1
2101       0                   0
3920       0                   0


In [13]:
# Calculate the accuracy score by evaluating `y_test` vs. the saved logistic regression predictions
accuracy_testing = accuracy_score(y_test, predictions_df['Predicted_Logistic'])

# Print the accuracy score
print(f"Testing Data Accuracy Score: {accuracy_testing:.6f}")


Testing Data Accuracy Score: 0.927253
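
Accuracy alone does not say which kind of mistake a spam filter makes: missed spam (false negatives) or legitimate mail flagged as spam (false positives). The following sketch, using small hand-made label lists that are assumptions for illustration and not the notebook's results, shows how `confusion_matrix`, `precision_score`, and `recall_score` from scikit-learn separate the two.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-made labels (assumption: illustrative only, 1 = spam, 0 = not spam)
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_demo = [0, 0, 1, 0, 0, 0, 1, 1]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()
print(f"tn={tn} fp={fp} fn={fn} tp={tp}")

# Precision: share of predicted spam that really is spam
precision = precision_score(y_true_demo, y_pred_demo)
# Recall: share of actual spam that the model caught
recall = recall_score(y_true_demo, y_pred_demo)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In the notebook itself, the same calls could be applied to `y_test` and the saved predictions.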


## Create and Fit a Random Forest Classifier Model

Create a Random Forest Classifier model, fit it to the training data, make predictions with the testing data, and print the model's accuracy score. You may choose any starting settings you like. 

In [14]:
# Train a Random Forest Classifier model and print the model score
# (RandomForestClassifier is already imported in the first cell)
random_forest_model = RandomForestClassifier(random_state=1)

# Fit the model to the scaled training data
random_forest_model.fit(X_train_scaled, y_train)

# Make predictions on the testing data
y_pred_rf = random_forest_model.predict(X_test_scaled)

# Print the accuracy score of the Random Forest Classifier model
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Classifier Model Accuracy: {accuracy_rf:.7f}")

Random Forest Classifier Model Accuracy: 0.9587405
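
A fitted random forest also exposes `feature_importances_`, which can hint at which inputs drive its predictions. The sketch below is a minimal illustration on synthetic data from `make_classification` (the data, `rf_demo`, and the `feature_i` names are assumptions, not the spam dataset or the notebook's model).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the spam dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=1
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # hypothetical names

rf_demo = RandomForestClassifier(n_estimators=50, random_state=1)
rf_demo.fit(X_demo, y_demo)

# Impurity-based importances, one value per feature, summing to 1
importances = rf_demo.feature_importances_
order = np.argsort(importances)[::-1]  # indices from most to least important

for idx in order[:3]:
    print(f"{feature_names[idx]}: {importances[idx]:.3f}")
```

In the notebook, the same attribute on `random_forest_model` could be paired with `X.columns` to see which word or character frequencies matter most.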


In [15]:
# Note: this cell repeats the logistic regression predictions from In [12];
# the random forest predictions are stored in `y_pred_rf` from the previous cell.
test_predictions_logistic = logistic_model.predict(X_test_scaled)

# Save the predictions to a DataFrame
predictions_logistic_df = pd.DataFrame({'Actual': y_test, 'Predicted_Logistic': test_predictions_logistic})

# Display the predictions DataFrame
print(predictions_logistic_df.head())


      Actual  Predicted_Logistic
1351       1                   0
1687       1                   0
1297       1                   1
2101       0                   0
3920       0                   0


In [16]:
# Calculate the accuracy score by evaluating `y_test` vs. the random forest predictions (`y_pred_rf`)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Classifier Model Accuracy: {accuracy_rf:.7f}")

Random Forest Classifier Model Accuracy: 0.9587405


## Evaluate the Models

Which model performed better? How does that compare to your prediction? Write down your results and thoughts in the following markdown cell.

*Replace the text in this markdown cell with your answers to these questions.*

In [17]:
# Make predictions on the test data using the Random Forest Classifier model
test_predictions_rf = random_forest_model.predict(X_test_scaled)

# Calculate the accuracy score for Random Forest Classifier model
accuracy_rf = accuracy_score(y_test, test_predictions_rf)

# Print the accuracy score for both models
print(f"Logistic Regression Model Accuracy: {accuracy_logistic:.4f}")
print(f"Random Forest Classifier Model Accuracy: {accuracy_rf:.4f}")


Logistic Regression Model Accuracy: 0.9273
Random Forest Classifier Model Accuracy: 0.9587


The models scored as follows on the test set:

- Logistic Regression Model Accuracy: 92.73%
- Random Forest Classifier Model Accuracy: 95.87%

The Random Forest Classifier outperformed the Logistic Regression model by about 3.1 percentage points of accuracy. This is consistent with the common expectation that Random Forests handle non-linear relationships and feature interactions better than a linear model.

My prediction that the Random Forest model would perform better was correct. Because this comparison rests on a single train/test split and a single metric, it is still worth assessing several models before choosing the one that best suits the data at hand.
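
A comparison based on one split can shift with a different `random_state`. One hedged way to check how stable the ranking is would be k-fold cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, not the spam dataset) and wraps the scaler and logistic regression in a pipeline so that scaling is refit inside each fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the spam dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=1
)

models = {
    # The pipeline refits the scaler on each training fold, avoiding leakage
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000, random_state=1)
    ),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=1),
}

cv_scores = {}
for name, model in models.items():
    # Five accuracy scores per model, one per held-out fold
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

If the gap between the mean scores is large relative to the fold-to-fold standard deviation, the ranking is less likely to be an artifact of one particular split.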