# Spam Detector

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Retrieve the Data

The data is located at [https://static.bc-edx.com/ai/ail-v-1-0/m13/challenge/spam-data.csv](https://static.bc-edx.com/ai/ail-v-1-0/m13/challenge/spam-data.csv)

Dataset Source: [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/94/spambase)

Import the data using Pandas. Display the resulting DataFrame to confirm the import was successful.

In [2]:
# Import the data
data = pd.read_csv("https://static.bc-edx.com/ai/ail-v-1-0/m13/challenge/spam-data.csv")
data.head()

|   | word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | char_freq_; | char_freq_( | char_freq_[ | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total | spam |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.64 | 0.64 | 0.0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.778 | 0.0 | 0.0 | 3.756 | 61 | 278 | 1 |
| 1 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ... | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 2 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | 1 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |


## Predict Model Performance

**Prediction:** I predict that the Random Forest Classifier will perform better than the Logistic Regression model. A random forest averages many decision trees, which reduces variance and lets it capture non-linear relationships and interactions between features; that may be an advantage with the 57 word-, character-, and capitalization-based features in this dataset. Logistic Regression, on the other hand, models the log-odds of spam as a linear combination of the features, so it may not perform as well if the relationships between the features and the target variable are more complex.
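
To make the reasoning behind this prediction concrete, here is a minimal sketch on hypothetical toy data (not the spam dataset). The label depends on the product of two features, so no single straight line separates the classes. The names `X_toy`, `y_toy`, `lr_toy`, and `rf_toy` are illustrative assumptions, and real email data will behave less cleanly than this.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy data: the class is 1 when the two features share a sign,
# an interaction that a linear decision boundary cannot represent.
rng = np.random.default_rng(0)
X_toy = rng.uniform(-1, 1, size=(1000, 2))
y_toy = (X_toy[:, 0] * X_toy[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

# Fit both model types with default settings on the same split
lr_toy = LogisticRegression().fit(X_tr, y_tr)
rf_toy = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# `score` returns mean accuracy on the held-out data
lr_toy_acc = lr_toy.score(X_te, y_te)
rf_toy_acc = rf_toy.score(X_te, y_te)
print(f"Logistic Regression (toy): {lr_toy_acc:.3f}")
print(f"Random Forest (toy):       {rf_toy_acc:.3f}")
```

On data like this the linear model should hover near chance while the forest does well. Whether the spam features contain interactions of this kind is an empirical question that the rest of the notebook tests.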


## Split the Data into Training and Testing Sets

In [3]:
# Create the labels set `y` and features DataFrame `X`
y = data['spam']
X = data.drop('spam', axis=1)


In [4]:
# Check the balance of the labels variable (`y`) by using the `value_counts` function.
y.value_counts()

spam
0    2788
1    1813
Name: count, dtype: int64
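
The counts above show a moderate class imbalance, which gives a useful reference point for the accuracy scores later on. As a small sketch, the accuracy of a trivial model that always predicts the majority class can be computed directly from those counts (the variable names are illustrative):

```python
# Label counts copied from the `value_counts` output
label_counts = {0: 2788, 1: 1813}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)

# Accuracy of always predicting the most common label
baseline_accuracy = label_counts[majority_label] / total

print(f"Total rows: {total}")
print(f"Majority label: {majority_label}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any model worth keeping should score clearly above this baseline.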

In [5]:
# Split the data into X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y)
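
The split above uses the defaults, so it changes on every run and does not enforce the class ratio in each subset. One possible variation is sketched below on stand-in arrays with the same shape and label counts as the spam data. The `random_state=1` and `stratify` arguments are assumptions that are not part of the original cell, and using them in the notebook would change the scores reported later.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays (hypothetical): random features, real label counts
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(4601, 57))
y_demo = np.array([0] * 2788 + [1] * 1813)

# stratify keeps the spam share the same in both subsets;
# random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=1
)

print(X_tr.shape, X_te.shape)
print(f"Spam share, train: {y_tr.mean():.3f}  test: {y_te.mean():.3f}")
```

With the default `test_size` of 0.25, the training set has 3450 rows, matching the shape printed in the scaling check.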


## Scale the Features

Use the `StandardScaler` to scale the features data. Fit the scaler on `X_train` only, then use it to transform both `X_train` and `X_test`; the labels (`y_train` and `y_test`) are not scaled.

In [6]:
from sklearn.preprocessing import StandardScaler

# Create the StandardScaler instance
scaler = StandardScaler()


In [None]:
# Fit the StandardScaler on the training features only (labels are not needed)
scaler.fit(X_train)

In [9]:
# Scale the training and testing data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
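
As a rough illustration of what these two `transform` calls do, the sketch below reproduces the scaling by hand on hypothetical random data: every column is shifted by the training mean and divided by the training standard deviation, and the test data reuses those training statistics. All `*_demo` names are assumptions for this example only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical skewed features standing in for frequency-style columns
rng = np.random.default_rng(2)
X_tr_demo = rng.exponential(scale=3.0, size=(200, 4))
X_te_demo = rng.exponential(scale=3.0, size=(50, 4))

# Fit on the training data only, then transform both sets
scaler_demo = StandardScaler().fit(X_tr_demo)
X_tr_scaled = scaler_demo.transform(X_tr_demo)
X_te_scaled = scaler_demo.transform(X_te_demo)

# Manual equivalent using training statistics
train_mean = X_tr_demo.mean(axis=0)
train_std = X_tr_demo.std(axis=0)  # population std (ddof=0), as StandardScaler uses
manual_te_scaled = (X_te_demo - train_mean) / train_std

print("Manual result matches StandardScaler:", np.allclose(X_te_scaled, manual_te_scaled))
print("Train column means:", X_tr_scaled.mean(axis=0).round(6))
print("Train column stds: ", X_tr_scaled.std(axis=0).round(6))
```

The scaled training columns have mean 0 and standard deviation 1 by construction. The scaled test columns will only be close to those values, which is expected.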


In [None]:
# Verify the Scaling Process
print(X_train_scaled.shape)
print(X_train_scaled[:5])

(3450, 57)
[[-3.40571451e-01  1.33138427e-01  9.47167946e-01 -4.12599337e-02
   1.10930720e-01 -3.42471607e-01  1.96580466e-01 -2.82895806e-01
  -3.28900890e-01  5.10261992e-01  6.27395179e-01 -6.29435503e-01
   2.98613207e-01 -1.90347977e-01  5.99395206e-01 -7.06598397e-02
  -3.28805278e-01  1.47048552e+00 -8.25656538e-01  1.92738910e-01
   1.20525570e-01 -1.10236736e-01 -2.86804264e-01 -2.20125713e-01
  -3.19621827e-01 -3.08514804e-01 -2.25850991e-01 -2.23988244e-01
  -1.61221165e-01 -2.30086889e-01 -1.56249388e-01 -1.45077428e-01
  -1.71015904e-01 -1.47384069e-01 -1.87874087e-01 -2.48077303e-01
  -3.24661911e-01 -5.76587515e-02 -1.81544883e-01  3.63865803e-01
  -1.16118250e-01 -1.68966595e-01 -2.09536491e-01 -1.16516742e-01
  -2.90281404e-01 -1.94589757e-01 -7.20178202e-02 -1.13223098e-01
  -1.56989837e-01 -6.04906175e-01 -1.50803408e-01  4.02117380e-01
  -1.88824480e-01  1.30032854e-01  4.40806983e-02  6.36597010e-01
   9.24948400e-01]
 [-3.40571451e-01 -1.66426461e-01 ...

## Create and Fit a Logistic Regression Model

Create a Logistic Regression model, fit it to the training data, make predictions with the testing data, and print the model's accuracy score. You may choose any starting settings you like. 

In [14]:
# Train a Logistic Regression model and print the model score
from sklearn.linear_model import LogisticRegression

# Create and fit the Logistic Regression model
model_lr = LogisticRegression(random_state=1)
model_lr.fit(X_train_scaled, y_train)

In [19]:
# Make and save testing predictions with the logistic regression model
testing_predictions = model_lr.predict(X_test_scaled)
# Review the predictions
print("Testing Predictions:", testing_predictions[:10])

Testing Predictions: [0 1 1 0 1 0 0 0 0 0]


In [20]:
# Calculate and print the accuracy score
accuracy = accuracy_score(y_test, testing_predictions)
print(f"Logistic Regression Accuracy: {accuracy}")

Logistic Regression Accuracy: 0.9209383145091226
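
Accuracy alone does not show which kind of mistake a spam filter makes. Flagging a legitimate email as spam is usually more costly than letting one spam message through. The sketch below uses a small set of hypothetical labels (not the notebook's predictions) to show how a confusion matrix, precision, and recall could be read alongside accuracy.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels for illustration only (1 = spam, 0 = not spam)
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred_demo = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()

demo_accuracy = accuracy_score(y_true_demo, y_pred_demo)
demo_precision = precision_score(y_true_demo, y_pred_demo)  # TP / (TP + FP)
demo_recall = recall_score(y_true_demo, y_pred_demo)        # TP / (TP + FN)

print(cm)
print(f"Legitimate emails flagged as spam (FP): {fp}")
print(f"Spam emails missed (FN): {fn}")
print(f"Accuracy: {demo_accuracy:.2f}  Precision: {demo_precision:.2f}  Recall: {demo_recall:.2f}")
```

Applied to this notebook, the same calls would take `y_test` and `testing_predictions` as their arguments.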


## Create and Fit a Random Forest Classifier Model

Create a Random Forest Classifier model, fit it to the training data, make predictions with the testing data, and print the model's accuracy score. You may choose any starting settings you like. 

In [25]:
# Train a Random Forest Classifier model and print the model score
from sklearn.ensemble import RandomForestClassifier

# Create and fit the Random Forest model
model_rf = RandomForestClassifier(random_state=1)
model_rf.fit(X_train_scaled, y_train)

In [29]:
# Make predictions on the test data
rf_predictions = model_rf.predict(X_test_scaled)
# Review the predictions
rf_predictions

array([0, 1, 1, ..., 0, 0, 1], dtype=int64)

In [30]:
# Calculate and print the accuracy score
rf_accuracy = accuracy_score(y_test, rf_predictions)
print(f"Random Forest Classifier Accuracy: {rf_accuracy}")

Random Forest Classifier Accuracy: 0.9548218940052129


## Evaluate the Models

**Question 1:** Which model performed better?
The Random Forest model performed better, with a test accuracy of about 0.955 compared to about 0.921 for Logistic Regression, a difference of roughly 3.4 percentage points on this train/test split.

**Question 2:** How does that compare to your prediction?
My prediction was that the Random Forest Classifier would perform better than the Logistic Regression model, and the results agree. I expected Random Forest to handle non-linear relationships in the data more effectively, which is consistent with its higher accuracy, although a single accuracy comparison does not prove that this is the reason.

**Thoughts:**
The prediction was correct: Random Forest achieved the higher accuracy. Because a random forest can model non-linear relationships and interactions between features that a linear model cannot, its better score suggests the dataset contains some structure that Logistic Regression did not fully capture. The comparison rests on a single train/test split made without a fixed `random_state`, and both models used default settings, so the exact size of the gap may vary between runs.
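
One way to check whether a gap like this holds up beyond a single split is k-fold cross-validation. The following is a minimal sketch on synthetic data from `make_classification`, not the spam dataset; the fold count, `max_iter=1000`, and all variable names are assumptions chosen for illustration. Putting the scaler inside a pipeline means it is refit on each training fold, so no information from the validation fold leaks into the scaling.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical); replace with X and y in the notebook
X_cv, y_cv = make_classification(
    n_samples=600, n_features=20, n_informative=8, random_state=1
)

candidates = {
    # Scaling lives inside the pipeline so it is fit per training fold
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(random_state=1, max_iter=1000)
    ),
    # Tree-based models do not depend on feature scale, so no scaler is used
    "Random Forest": RandomForestClassifier(random_state=1),
}

cv_results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_cv, y_cv, cv=5, scoring="accuracy")
    cv_results[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

If the mean scores differ by more than the fold-to-fold spread, the ranking of the two models is less likely to be an artifact of one particular split.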