## Load and Inspect the Dataset

In [3]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression, LogisticRegression, Perceptron
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Load the dataset from the same directory
emails_df = pd.read_csv("emails.csv")

# Display the first few rows of the dataset
print(f"{emails_df.head()}\n")

# Display basic information about the dataset
# (info() prints its summary itself and returns None, so it is not wrapped in print)
emails_df.info()

                                                text  spam
0  Subject: naturally irresistible your corporate...     1
1  Subject: the stock trading gunslinger  fanny i...     1
2  Subject: unbelievable new homes made easy  im ...     1
3  Subject: 4 color printing special  request add...     1
4  Subject: do not have money , get software cds ...     1

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5726 entries, 0 to 5725
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    5726 non-null   object
 1   spam    5726 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 89.6+ KB



## Data Preprocessing
### Data Cleaning

In [5]:
# Keep only rows whose 'spam' label is present and equal to 0 or 1
emails_df_cleaned = emails_df.dropna(subset=['spam'])  # Drop rows with a missing label
emails_df_cleaned = emails_df_cleaned[emails_df_cleaned['spam'].isin([0, 1])]  # Drop any other label value

# Display the first few rows of the cleaned dataset
emails_df_cleaned.head()

                                                text  spam
0  Subject: naturally irresistible your corporate...     1
1   Subject: the stock trading gunslinger fanny i...     1
2   Subject: unbelievable new homes made easy im ...     1
3   Subject: 4 color printing special request add...     1
4  Subject: do not have money , get software cds ...     1


### Feature Extraction and Data Splitting

In [7]:
# Using TF-IDF to convert email text into numerical vectors
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(emails_df_cleaned['text'])

# Target variable
y = emails_df_cleaned['spam']

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
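
One caveat about the cell above: the vectorizer is fitted on every email before the split, so its vocabulary and IDF weights also reflect the test emails. The sketch below shows one possible leakage-free ordering, where the split happens first and the vectorizer is fitted on the training texts only. It assumes a small made-up corpus (`toy_texts`, `toy_labels`) in place of `emails.csv`, and the `stratify` argument is an addition that the notebook itself does not use. The results reported in this notebook come from the original ordering, not from this sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus standing in for emails_df_cleaned['text'] and ['spam'].
toy_texts = [
    "win free money now",
    "cheap software cds special offer",
    "claim your free prize today",
    "unbelievable new homes made easy",
    "exclusive stock trading tips inside",
    "meeting agenda for monday",
    "please review the attached report",
    "lunch plans for friday",
    "project schedule update attached",
    "notes from the conference call",
]
toy_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Split the raw texts first so the test emails stay unseen.
text_train, text_test, y_toy_train, y_toy_test = train_test_split(
    toy_texts, toy_labels, test_size=0.3, random_state=42, stratify=toy_labels
)

# Fit the vocabulary and IDF weights on the training texts only,
# then reuse them to transform the test texts.
toy_vectorizer = TfidfVectorizer(stop_words="english")
X_toy_train = toy_vectorizer.fit_transform(text_train)
X_toy_test = toy_vectorizer.transform(text_test)

print(X_toy_train.shape, X_toy_test.shape)
```

Words that appear only in the test texts are simply ignored by `transform`, which mirrors what happens when the model meets new emails after deployment.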

## Logistic Regression and Linear Regression Classifiers
### Logistic Regression Classifier

In [9]:
# Logistic Regression Model
log_model = LogisticRegression()
log_model.fit(X_train, y_train)

In [13]:
# Predict on test data
log_pred = log_model.predict(X_test)

# Evaluate the model
print("Logistic Regression Accuracy:", accuracy_score(y_test, log_pred))
print("Classification Report:\n", classification_report(y_test, log_pred))

Logistic Regression Accuracy: 0.9679860302677532
Classification Report:
               precision    recall  f1-score   support

           0       0.96      1.00      0.98      1277
           1       1.00      0.88      0.93       441

    accuracy                           0.97      1718
   macro avg       0.98      0.94      0.96      1718
weighted avg       0.97      0.97      0.97      1718



The logistic regression model reaches 96.8% accuracy on the test set (55 of 1,718 emails misclassified). Precision is 0.96 for non-spam and 1.00 for spam, with F1-scores of 0.98 and 0.93 respectively. The weak point is spam recall: at 0.88, roughly 12% of the 441 spam emails in the test set are classified as non-spam. The model is therefore conservative: it almost never flags a legitimate email as spam, but it lets some spam through. The weighted averages of precision, recall, and F1-score are all 0.97.
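
The trade-off between spam precision and spam recall described above depends on the decision cutoff. For a binary `LogisticRegression`, `predict` corresponds to a cutoff of 0.5 on the predicted spam probability. The sketch below illustrates how a lower cutoff shifts the balance. It uses made-up labels and probabilities (`y_true_toy`, `spam_proba_toy`); in this notebook the corresponding inputs would presumably be `y_test` and `log_model.predict_proba(X_test)[:, 1]`, and the cutoff of 0.3 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predicted spam probabilities (not from emails.csv).
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])
spam_proba_toy = np.array([0.05, 0.10, 0.20, 0.45, 0.35, 0.48, 0.70, 0.90])

threshold_results = {}
for threshold in (0.5, 0.3):
    # Label an email as spam when its probability reaches the cutoff.
    y_pred_toy = (spam_proba_toy >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
    threshold_results[threshold] = {
        "missed_spam": int(fn),
        "false_alarms": int(fp),
        "precision": precision_score(y_true_toy, y_pred_toy),
        "recall": recall_score(y_true_toy, y_pred_toy),
    }

for threshold, stats in threshold_results.items():
    print(threshold, stats)
```

With these made-up values, lowering the cutoff from 0.5 to 0.3 raises spam recall from 0.50 to 1.00 and lowers spam precision from 1.00 to 0.80. Whether such a shift is worthwhile on the real data depends on how costly a missed spam email is compared with a legitimate email flagged as spam.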

### Linear Regression Classifier

In [15]:
# Linear Regression Model
lin_model = LinearRegression()
lin_model.fit(X_train, y_train)

In [19]:
# Predict on test data
lin_pred = lin_model.predict(X_test)
lin_pred_class = (lin_pred > 0.5).astype(int)  # Threshold the continuous output at 0.5 to get class labels

# Evaluate the model
print("Linear Regression Accuracy:", accuracy_score(y_test, lin_pred_class))
print("Classification Report:\n", classification_report(y_test, lin_pred_class))

Linear Regression Accuracy: 0.980209545983702
Classification Report:
               precision    recall  f1-score   support

           0       0.99      0.98      0.99      1277
           1       0.96      0.97      0.96       441

    accuracy                           0.98      1718
   macro avg       0.97      0.98      0.97      1718
weighted avg       0.98      0.98      0.98      1718



Linear regression is not a classifier by design; here its continuous predictions are thresholded at 0.5 to produce class labels. Used this way, it reaches 98.0% accuracy (34 of 1,718 test emails misclassified). Precision is 0.99 for non-spam and 0.96 for spam, recall is 0.98 and 0.97, and the F1-scores are 0.99 and 0.96. Compared with logistic regression, spam recall rises from 0.88 to 0.97 while spam precision drops from 1.00 to 0.96, so the errors are spread more evenly across the two classes. The weighted average F1-score is 0.98.
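
Because linear regression fits a straight line to the 0/1 labels, its raw outputs are scores rather than probabilities and can fall outside the range [0, 1]. The sketch below illustrates this on a made-up one-feature dataset (`X_toy`, `y_toy`), which is an assumption for illustration and unrelated to the TF-IDF features used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical one-feature data with 0/1 labels; the last point lies far from the rest.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_toy = np.array([0, 0, 1, 1, 1])

toy_lin_model = LinearRegression()
toy_lin_model.fit(X_toy, y_toy)

# Raw outputs are unbounded real numbers, not probabilities.
raw_scores = toy_lin_model.predict(X_toy)

# Thresholding at 0.5 turns the scores into class labels.
toy_labels_pred = (raw_scores > 0.5).astype(int)

print("raw scores:", np.round(raw_scores, 3))
print("labels:    ", toy_labels_pred)
```

In this toy example the score for the last point exceeds 1, which a probability could never do. Logistic regression avoids this by passing the linear score through a sigmoid, which is one reason it is the more conventional choice for classification even when a thresholded linear regression happens to score well.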

## Implementing a Perceptron Classifier

In [21]:
# Perceptron Model
perc_model = Perceptron()
perc_model.fit(X_train, y_train)

In [23]:
# Predict on test data
perc_pred = perc_model.predict(X_test)

# Evaluate the model
print("Perceptron Accuracy:", accuracy_score(y_test, perc_pred))
print("Classification Report:\n", classification_report(y_test, perc_pred))

Perceptron Accuracy: 0.9866123399301513
Classification Report:
               precision    recall  f1-score   support

           0       0.99      1.00      0.99      1277
           1       0.99      0.96      0.97       441

    accuracy                           0.99      1718
   macro avg       0.99      0.98      0.98      1718
weighted avg       0.99      0.99      0.99      1718



The Perceptron reaches 98.7% accuracy (23 of 1,718 test emails misclassified), the highest of the three models on this split. Precision is 0.99 for both classes; recall is 1.00 for non-spam and 0.96 for spam, giving F1-scores of 0.99 and 0.97. The weighted average F1-score is 0.99.
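
For readers who want to see what the Perceptron learns, the sketch below implements the textbook perceptron update rule in NumPy: the weights change only when a training example is misclassified. It is an illustration under stated assumptions, not scikit-learn's implementation (which shares its machinery with `SGDClassifier`). The function `train_perceptron` and the toy data are hypothetical names introduced here.

```python
import numpy as np

def train_perceptron(X, y, epochs=10, learning_rate=1.0):
    """Textbook perceptron for 0/1 labels (hypothetical helper, not sklearn's code)."""
    targets = np.where(y == 1, 1, -1)  # Map labels {0, 1} to {-1, +1}
    weights = np.zeros(X.shape[1])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x_i, t_i in zip(X, targets):
            # A non-positive margin means the example is on the wrong side.
            if t_i * (x_i @ weights + bias) <= 0:
                weights += learning_rate * t_i * x_i
                bias += learning_rate * t_i
                mistakes += 1
        if mistakes == 0:  # Every training example is classified correctly
            break
    return weights, bias

# Hypothetical, linearly separable toy data.
X_toy_perc = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y_toy_perc = np.array([1, 1, 0, 0])

perc_weights, perc_bias = train_perceptron(X_toy_perc, y_toy_perc)
toy_perc_pred = (X_toy_perc @ perc_weights + perc_bias > 0).astype(int)
print(toy_perc_pred)
```

On linearly separable data such as this toy set, the loop stops as soon as a full pass produces no mistakes. High-dimensional TF-IDF features are often close to linearly separable, which may help explain why such a simple model performs well here.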