# Task
Build a spam email classification model from the provided data, keeping the workflow simple and easy to follow.

Here is all the data you need:
"spambase_csv.csv"

## Data loading

### Subtask:
Load the spam email dataset.


**Reasoning**:
I need to import pandas and load the data from the provided CSV file into a DataFrame, then display the first 5 rows and the shape of the DataFrame.



In [None]:
import pandas as pd

try:
    df = pd.read_csv('spambase_csv.csv')
    display(df.head())
    print(df.shape)
except FileNotFoundError:
    print("Error: 'spambase_csv.csv' not found. Please ensure the file exists in the current directory.")
    df = None
except pd.errors.EmptyDataError:
    print("Error: 'spambase_csv.csv' is empty.")
    df = None
except pd.errors.ParserError:
    print("Error: Unable to parse 'spambase_csv.csv'. Please check the file format.")
    df = None

|  | word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | char_freq_%3B | char_freq_%28 | char_freq_%5B | char_freq_%21 | char_freq_%24 | char_freq_%23 | capital_run_length_average | capital_run_length_longest | capital_run_length_total | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.64 | 0.64 | 0.0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.778 | 0.0 | 0.0 | 3.756 | 61 | 278 | 1 |
| 1 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ... | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 2 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | 1 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |


(4601, 58)


## Data exploration

### Subtask:
Explore the loaded dataset to understand its characteristics.


**Reasoning**:
Examine data types, missing values, target variable distribution, and identify features for correlation analysis and visualization.



In [None]:
# 1. Examine data types
print(df.dtypes)

# 2. Check for missing values
print(df.isnull().sum())

# 3. Analyze target variable distribution
spam_percentage = (df['class'].sum() / len(df)) * 100
print(f"Percentage of spam emails: {spam_percentage:.2f}%")
print(f"Percentage of non-spam emails: {100 - spam_percentage:.2f}%")

# 4. Identify numerical features for visualization (histograms/box plots)
numerical_features_for_visualization = df.select_dtypes(include=['float64', 'int64']).columns.tolist()
numerical_features_for_visualization.remove('class') # Remove the target variable from the list
print("Numerical features to visualize:", numerical_features_for_visualization)

# 5. Examine correlation between numerical features and target variable
correlations = df.corr()['class'].drop('class') # Calculate correlations and drop the target variable's self-correlation
top_positive_correlations = correlations.nlargest(3)
top_negative_correlations = correlations.nsmallest(3)
print("Top 3 positive correlations:\n", top_positive_correlations)
print("Top 3 negative correlations:\n", top_negative_correlations)

word_freq_make                float64
word_freq_address             float64
word_freq_all                 float64
word_freq_3d                  float64
word_freq_our                 float64
word_freq_over                float64
word_freq_remove              float64
word_freq_internet            float64
word_freq_order               float64
word_freq_mail                float64
word_freq_receive             float64
word_freq_will                float64
word_freq_people              float64
word_freq_report              float64
word_freq_addresses           float64
word_freq_free                float64
word_freq_business            float64
word_freq_email               float64
word_freq_you                 float64
word_freq_credit              float64
word_freq_your                float64
word_freq_font                float64
word_freq_000                 float64
word_freq_money               float64
word_freq_hp                  float64
word_freq_hpl                 float64
word_freq_ge

## Data preparation

### Subtask:
Prepare the data for model training.


**Reasoning**:
Address class imbalance and scale the features in a single code block. Outlier handling was optional and no method was specified, so I will skip it. I will use SMOTE to oversample the minority (spam) class and StandardScaler to standardize the features. Note that both steps are fitted on the full dataset here, before the train/test split, so information from the eventual test rows reaches preprocessing and the test metrics reported later may be optimistic.



In [None]:
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Separate features (X) and target (y)
X = df.drop('class', axis=1)
y = df['class']

# Handle class imbalance using SMOTE
smote = SMOTE(random_state=42)  # Use a random state for reproducibility
X_resampled, y_resampled = smote.fit_resample(X, y)

# Feature scaling using StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_resampled)

# Convert the scaled features back to a DataFrame
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)

# Concatenate the scaled features and the resampled target variable
df_scaled = pd.concat([X_scaled_df, y_resampled], axis=1)
display(df_scaled.head())

|  | word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | char_freq_%3B | char_freq_%28 | char_freq_%5B | char_freq_%21 | char_freq_%24 | char_freq_%23 | capital_run_length_average | capital_run_length_longest | capital_run_length_total | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.373038 | 0.370293 | 0.682401 | -0.046419 | -0.03589 | -0.399787 | -0.336199 | -0.300452 | -0.360977 | -0.405004 | ... | -0.158408 | -0.476844 | -0.15603 | 0.589798 | -0.362258 | -0.11035 | -0.057697 | 0.008666 | -0.057169 | 1 |
| 1 | 0.324688 | 0.065033 | 0.399199 | -0.046419 | -0.305762 | 0.595023 | 0.16356 | -0.126538 | -0.360977 | 1.089627 | ... | -0.158408 | -0.0067 | -0.15603 | 0.079417 | 0.333848 | -0.008515 | -0.014278 | 0.184698 | 1.113899 | 1 |
| 2 | -0.173688 | -0.172392 | 0.824002 | -0.046419 | 1.328466 | 0.275263 | 0.115963 | -0.002313 | 1.859293 | -0.007495 | ... | -0.113737 | 0.032479 | -0.15603 | -0.041264 | 0.349317 | -0.089134 | 0.136215 | 1.874608 | 3.036011 | 1 |
| 3 | -0.373038 | -0.172392 | -0.612237 | -0.046419 | 0.428891 | -0.399787 | 0.40154 | 1.264776 | 0.714466 | 0.596717 | ... | -0.158408 | 0.011109 | -0.15603 | -0.216 | -0.362258 | -0.11035 | -0.064699 | -0.083751 | -0.193012 | 1 |
| 4 | -0.373038 | -0.172392 | -0.612237 | -0.046419 | 0.428891 | -0.399787 | 0.40154 | 1.264776 | 0.714466 | 0.596717 | ... | -0.158408 | 0.003985 | -0.15603 | -0.218514 | -0.362258 | -0.11035 | -0.064699 | -0.083751 | -0.193012 | 1 |
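
For intuition about the two preprocessing steps used above, the sketch below shows the core idea of each on a handful of made-up points. It is a simplified illustration, not imbalanced-learn's actual SMOTE implementation: real SMOTE chooses the neighbor among the k nearest minority-class points, whereas here the neighbor is fixed by hand. The array `minority` and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Three hypothetical minority-class points with two features each.
minority = np.array([[1.0, 2.0],
                     [2.0, 3.0],
                     [3.0, 1.0]])

# SMOTE-style interpolation: take a point and one of its minority-class
# neighbors, then create a new point somewhere on the segment between them.
base, neighbor = minority[0], minority[1]
gap = rng.random()                          # uniform in [0, 1)
synthetic = base + gap * (neighbor - base)  # lies between base and neighbor

# Standardization (what StandardScaler computes): subtract each column's
# mean and divide by its population standard deviation.
augmented = np.vstack([minority, synthetic])
standardized = (augmented - augmented.mean(axis=0)) / augmented.std(axis=0)

print("Synthetic point:", synthetic)
print("Column means after scaling:", standardized.mean(axis=0).round(6))
print("Column stds after scaling: ", standardized.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, which is why the values in the table above are centered around zero.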


## Data splitting

### Subtask:
Split the prepared data into training and testing sets.


**Reasoning**:
Split the preprocessed data into training and testing sets using train_test_split.



In [None]:
from sklearn.model_selection import train_test_split

# Assuming df_scaled is your prepared DataFrame
X = df_scaled.drop('class', axis=1)
y = df_scaled['class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
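
One caveat about the order of steps: SMOTE and StandardScaler were fitted on the full dataset before this split, so synthetic samples and scaling statistics derived from test rows can influence training. A minimal sketch of an alternative ordering follows. It runs on synthetic data from `make_classification` (an assumption, standing in for `spambase_csv.csv`), splits first with `stratify`, and fits the scaler only on the training portion through a scikit-learn `Pipeline`. To depend on scikit-learn alone, the sketch swaps SMOTE for `class_weight="balanced"`, which is a different technique and not a drop-in equivalent.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the spam data (assumption): roughly 40% positives.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_clusters_per_class=1,
    weights=[0.6, 0.4], random_state=42,
)

# Split first, so held-out rows never influence preprocessing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only, then reuses
# those training statistics when transforming the test rows.
leak_free_model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced",
                               random_state=42)),
])
leak_free_model.fit(X_tr, y_tr)

test_accuracy = leak_free_model.score(X_te, y_te)
print(f"Held-out accuracy: {test_accuracy:.4f}")
```

If oversampling is preferred over class weighting, imbalanced-learn provides its own `imblearn.pipeline.Pipeline` that accepts samplers such as SMOTE as steps and applies them during fitting only.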

## Model training

### Subtask:
Train a Logistic Regression model on the training data.


**Reasoning**:
Train a Logistic Regression model using the training data.



In [None]:
from sklearn.linear_model import LogisticRegression

# Instantiate a LogisticRegression object
logreg_model = LogisticRegression(max_iter=1000, random_state=42)

# Train the Logistic Regression model
logreg_model.fit(X_train, y_train)
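
Once a logistic regression model is fitted, its coefficients give a rough view of which features push a prediction toward the positive class. With standardized inputs, as in this notebook, coefficient magnitudes are roughly comparable across features. The sketch below shows one way to inspect them; it uses synthetic data and hypothetical column names (`feature_0` and so on) rather than the spam dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with made-up column names (assumption, for illustration).
X_toy, y_toy = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0,
    random_state=0,
)
toy_columns = [f"feature_{i}" for i in range(X_toy.shape[1])]

toy_model = LogisticRegression(max_iter=1000, random_state=0)
toy_model.fit(X_toy, y_toy)

# coef_ has shape (1, n_features) for a binary problem; row 0 holds the
# weights for the positive class. Sorting puts the most negative first.
coefficients = pd.Series(toy_model.coef_[0], index=toy_columns).sort_values()
print(coefficients)
```

On the notebook's own model, the same idea would be `pd.Series(logreg_model.coef_[0], index=X_train.columns)`.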

## Model evaluation

### Subtask:
Evaluate the trained Logistic Regression model.


**Reasoning**:
Evaluate the trained Logistic Regression model using the test set and calculate the evaluation metrics.



In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Predict the target variable using the trained model
y_pred = logreg_model.predict(X_test)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Generate confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Print the results
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")
print("Confusion Matrix:\n", cm)

# Interpret the results
print("\nInterpretation:")
print(f"The accuracy of {accuracy:.2%} indicates the overall correctness of the model's predictions.")
print(f"A precision of {precision:.2%} means that out of all the emails predicted as spam, {precision:.2%} were actually spam.")
print(f"A recall of {recall:.2%} means that out of all the actual spam emails, the model correctly identified {recall:.2%}.")
print(f"The F1-score of {f1:.2%} is the harmonic mean of precision and recall, providing a balanced measure of the model's performance.")

Accuracy: 0.9140
Precision: 0.9310
Recall: 0.8943
F1-score: 0.9122
Confusion Matrix:
 [[521  37]
 [ 59 499]]

Interpretation:
The accuracy of 91.40% indicates the overall correctness of the model's predictions.
A precision of 93.10% means that out of all the emails predicted as spam, 93.10% were actually spam.
A recall of 89.43% means that out of all the actual spam emails, the model correctly identified 89.43%.
The F1-score of 91.22% is the harmonic mean of precision and recall, providing a balanced measure of the model's performance.
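
As a sanity check, the four metrics can be recomputed by hand from the confusion matrix printed above. The sketch assumes scikit-learn's convention that rows are true labels and columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]`. The variable names are illustrative.

```python
# Confusion matrix copied from the output above: [[TN, FP], [FN, TP]].
cm_values = [[521, 37],
             [59, 499]]
(tn, fp), (fn, tp) = cm_values

total = tn + fp + fn + tp
accuracy_check = (tp + tn) / total
precision_check = tp / (tp + fp)   # of predicted spam, how much was spam
recall_check = tp / (tp + fn)      # of actual spam, how much was caught
f1_check = 2 * precision_check * recall_check / (precision_check + recall_check)

print(f"Accuracy:  {accuracy_check:.4f}")   # 0.9140
print(f"Precision: {precision_check:.4f}")  # 0.9310
print(f"Recall:    {recall_check:.4f}")     # 0.8943
print(f"F1-score:  {f1_check:.4f}")         # 0.9122
```

The hand-computed values match the reported metrics, and the matrix sums to 1,116 test emails with 558 in each true class.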


## Summary:

### 1. Q&A

No questions were explicitly asked in the task. The implicit question is whether a reasonably accurate spam classifier can be built from the provided dataset, and the evaluation suggests it can: the Logistic Regression model reached 91.40% accuracy, 93.10% precision, and 89.43% recall on a test set drawn from the SMOTE-balanced data.

### 2. Data Analysis Key Findings

* **Class Imbalance:** The original dataset had a moderate class imbalance, with 39.40% spam emails and 60.60% non-spam emails.  This was addressed using SMOTE oversampling.
* **Top Correlated Features:** 'word\_freq\_your', 'word\_freq\_000', and 'word\_freq\_remove' had the largest positive correlations with the spam label, while 'word\_freq\_hp', 'word\_freq\_hpl', and 'word\_freq\_george' had the most negative ones.
* **Model Performance:** The trained Logistic Regression model achieved an accuracy of 91.40%, precision of 93.10%, recall of 89.43%, and an F1-score of 91.22% on the test set.
* **Confusion Matrix Insights:**  The model correctly classified 521 non-spam emails and 499 spam emails, misclassifying 37 non-spam emails as spam, and 59 spam emails as non-spam.

### 3. Insights or Next Steps

* **Explore Other Models:**  Experiment with other classification algorithms (e.g., Random Forest, Support Vector Machines, Naive Bayes) to see if they can achieve better performance.
* **Hyperparameter Tuning:**  Fine-tune the hyperparameters of the Logistic Regression model (and other models) using techniques like grid search or randomized search to potentially improve the model's performance.
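
Both next steps can be sketched with scikit-learn's cross-validation tools. The example below is a starting point under stated assumptions, not a tuned solution: it uses synthetic data from `make_classification` in place of the spam dataset, and the candidate models, the `C` grid, and the F1 scoring choice are all illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the spam data (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_clusters_per_class=1, random_state=42
)

# Scaling lives inside the pipelines so each CV fold fits its own scaler.
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "svm_rbf": make_pipeline(StandardScaler(), SVC()),
    "naive_bayes": GaussianNB(),
}

# Step 1: compare models with 5-fold cross-validated F1.
cv_f1 = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="f1")
    cv_f1[name] = scores.mean()
    print(f"{name}: mean F1 = {cv_f1[name]:.4f}")

# Step 2: grid search over the regularization strength C.
# make_pipeline names each step after its lowercased class name.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(
    candidates["logistic_regression"], param_grid, cv=5, scoring="f1"
)
search.fit(X_demo, y_demo)
print("Best parameters:", search.best_params_)
print(f"Best cross-validated F1: {search.best_score_:.4f}")
```

Keeping the scaler inside each pipeline means every cross-validation fold is preprocessed using its own training portion only.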
