
# Spam Detection Model Analysis

In this notebook, we train and evaluate two machine learning models, Logistic Regression and Random Forest, that classify emails as spam or not spam. We then compare their accuracy on a held-out test set from the provided dataset to see which one performs better.


In [None]:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the dataset
df = pd.read_csv('/mnt/data/spam-data.csv')

# Display the first few rows of the dataset
df.head()



## Data Preparation and Splitting

We separate the label column (`spam`) from the feature columns, then split both into training and testing sets using a 70-30 split. A fixed `random_state` makes the split reproducible.


In [None]:

# Splitting the data into labels (y) and features (X)
y = df['spam']
X = df.drop(columns=['spam'])

# Splitting the data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
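
If the spam and non-spam classes are imbalanced, a plain random split can leave the two partitions with different class ratios. The sketch below shows one way to guard against that with the `stratify` argument of `train_test_split`. It uses synthetic data with an assumed 20% spam rate, since the class balance of the real dataset is not stated here; `X_demo`, `y_demo` and the ratio variables are illustrative names only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): 100 rows, 20% spam.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 20 + [0] * 80)

# stratify=y_demo keeps the spam ratio the same in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

# Fraction of spam rows in each partition
train_spam_ratio = y_tr.mean()
test_spam_ratio = y_te.mean()
print(len(y_tr), len(y_te), train_spam_ratio, test_spam_ratio)
```

In the notebook itself this would amount to passing `stratify=y` to the existing call; whether that is needed depends on how balanced the `spam` column is.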



## Feature Scaling

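To put the features on the same scale, we use `StandardScaler`, which standardizes each feature to zero mean and unit variance. The scaler is fit on the training set only, and the same statistics are then used to transform the test set, so no information from the test set leaks into training.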


In [None]:

# Scaling the features
scaler = StandardScaler()

# Fit the scaler on the training data
X_train_scaled = scaler.fit_transform(X_train)

# Transform the testing data
X_test_scaled = scaler.transform(X_test)
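
To make the scaling step concrete, here is a minimal sketch on a tiny made-up array (the `*_demo` and `manual_*` names are illustrative, not part of the notebook). It checks that `StandardScaler` computes `(x - mean) / std` per column, using the mean and the population standard deviation of the training rows, and that the test rows are transformed with those same training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up feature matrices, for illustration only
X_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_demo = np.array([[4.0, 40.0]])

scaler_demo = StandardScaler()
scaled_train = scaler_demo.fit_transform(X_train_demo)
scaled_test = scaler_demo.transform(X_test_demo)

# Manual equivalent: statistics come from the training rows only.
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # population std (ddof=0)
manual_train = (X_train_demo - mean) / std
manual_test = (X_test_demo - mean) / std

print(np.allclose(scaled_train, manual_train))
print(np.allclose(scaled_test, manual_test))
```

Note that the scaled test values are not guaranteed to have zero mean or unit variance themselves, because they are shifted and scaled by the training statistics.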



## Logistic Regression Model

Next, we train a Logistic Regression model on the scaled training data and measure its accuracy on the test set.


In [None]:

# Creating and training the Logistic Regression model
log_reg_model = LogisticRegression(random_state=1)
log_reg_model.fit(X_train_scaled, y_train)

# Making predictions with the Logistic Regression model
y_pred_log_reg = log_reg_model.predict(X_test_scaled)

# Evaluating the Logistic Regression model
log_reg_accuracy = accuracy_score(y_test, y_pred_log_reg)
log_reg_accuracy
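
Accuracy is the fraction of predictions that match the true labels, so it does not show which kind of mistake a spam filter makes. As a supplementary sketch, the example below uses small hand-made label lists (not results from this dataset) to show how `confusion_matrix`, `precision_score` and `recall_score` break the same predictions down further.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hand-made labels for illustration only: 1 = spam, 0 = not spam
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

acc = accuracy_score(y_true_demo, y_pred_demo)    # 7 of 10 correct
cm = confusion_matrix(y_true_demo, y_pred_demo)   # rows: true class, columns: predicted class
prec = precision_score(y_true_demo, y_pred_demo)  # of emails flagged as spam, share that are spam
rec = recall_score(y_true_demo, y_pred_demo)      # of actual spam emails, share that were flagged

print(acc, prec, rec)
print(cm)
```

Here the model reaches 0.7 accuracy while catching only half of the spam, which the accuracy figure alone would not reveal.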



## Random Forest Model

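We also train a Random Forest model on the same scaled features and measure its accuracy on the test set. Tree-based models do not depend on feature scale, so scaling is not required here; reusing the scaled data simply keeps the inputs identical for both models.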


In [None]:

# Creating and training the Random Forest model
rf_model = RandomForestClassifier(random_state=1)
rf_model.fit(X_train_scaled, y_train)

# Making predictions with the Random Forest model
y_pred_rf = rf_model.predict(X_test_scaled)

# Evaluating the Random Forest model
rf_accuracy = accuracy_score(y_test, y_pred_rf)
rf_accuracy



## Model Comparison

Finally, we compare the test-set accuracy of both models to determine which one is better at detecting spam on this split.


In [None]:

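# Pick the model with the higher test accuracy.
# Note: on an exact tie, the else branch reports Random Forest.
if log_reg_accuracy > rf_accuracy:
    better_model = "Logistic Regression"
else:
    better_model = "Random Forest"

# Show the winner alongside both accuracy scores
better_model, log_reg_accuracy, rf_accuracy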
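
The comparison can also be written so that it scales to more than two models and reports ties explicitly. The following is a minimal sketch; `pick_better_model` is a hypothetical helper, and the scores in `example_scores` are placeholder values, not results from this notebook.

```python
def pick_better_model(scores):
    """Return the names of all models that share the highest score."""
    best = max(scores.values())
    return [name for name, score in scores.items() if score == best]


# Placeholder accuracies for illustration only (not real results)
example_scores = {"Logistic Regression": 0.92, "Random Forest": 0.95}
winners = pick_better_model(example_scores)
print(winners)

# A tie returns every tied model instead of silently choosing one.
tied = pick_better_model({"Logistic Regression": 0.9, "Random Forest": 0.9})
print(tied)
```

In the notebook, the dictionary would be built from `log_reg_accuracy` and `rf_accuracy`.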
