### Part 2: CODING

This part applies the spam classification exercise from Problem 4 at the end of Chapter 3 to a different dataset.

The chosen dataset is Kaggle's **"Email Spam Classification Dataset"**, a .csv file with word-count information for 5172 randomly picked emails and a spam / not-spam label for each.

The steps to follow (as given in Chapter 3, Problem 4) are:


* _Split the datasets into a training set and a test set._
* _Write a data preparation pipeline to convert each email into a feature vector._
* _Finally, try out several classifiers and see if you can build a great spam classifier, with both high recall and high precision._
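
Because the last step asks for both high precision and high recall, it may help to recall how the two are computed. The following is a minimal sketch in plain Python; the function name `precision_recall` and the toy labels are illustrative assumptions, not part of the exercise or the dataset.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy labels (hypothetical): 1 = spam, 0 = not spam
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 1, 0, 1, 0]
toy_precision, toy_recall = precision_recall(toy_true, toy_pred)
print(toy_precision, toy_recall)  # 0.75 0.75
```

Precision is the share of emails flagged as spam that really are spam; recall is the share of actual spam that gets flagged.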

In [3]:
import kagglehub

path = kagglehub.dataset_download("balaka18/email-spam-classification-dataset-csv")

print("Path to dataset files:", path)

Path to dataset files: /root/.cache/kagglehub/datasets/balaka18/email-spam-classification-dataset-csv/versions/1


In [4]:
import os

import pandas as pd
from sklearn.model_selection import train_test_split

# Build the CSV path from the download location returned by kagglehub above
dataset_path = os.path.join(path, 'emails.csv')

# Load the dataset
data = pd.read_csv(dataset_path)
print(data.head())  # Inspect the first few rows to understand the data format

# Split the dataset into training and test sets
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

# Display basic information about the dataset
# (info() prints its summary itself and returns None, so no print() is needed)
train_set.info()
test_set.info()


  Email No.  the  to  ect  and  for  of    a  you  hou  ...  connevey  jay  \
0   Email 1    0   0    1    0    0   0    2    0    0  ...         0    0   
1   Email 2    8  13   24    6    6   2  102    1   27  ...         0    0   
2   Email 3    0   0    1    0    0   0    8    0    0  ...         0    0   
3   Email 4    0   5   22    0    5   1   51    2   10  ...         0    0   
4   Email 5    7   6   17    1    5   2   57    0    9  ...         0    0   

   valued  lay  infrastructure  military  allowing  ff  dry  Prediction  
0       0    0               0         0         0   0    0           0  
1       0    0               0         0         0   1    0           0  
2       0    0               0         0         0   0    0           0  
3       0    0               0         0         0   0    0           0  
4       0    0               0         0         0   1    0           0  

[5 rows x 3002 columns]
<class 'pandas.core.frame.DataFrame'>
Index: 4137 entries, 316

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Separating the features and the target variable
X_train = train_set.drop(columns=['Email No.', 'Prediction'])
y_train = train_set['Prediction']
X_test = test_set.drop(columns=['Email No.', 'Prediction'])
y_test = test_set['Prediction']

# Creating and training the logistic regression model
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)

# Predicting the test set results
y_pred = model.predict(X_test)

# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")


Accuracy: 0.97
Precision: 0.94
Recall: 0.96


Next, a validation check to make sure I have not made any errors while following the steps.
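
One possible way to make such a check more robust is k-fold cross-validation, which scores the model on several held-out folds instead of a single test split. The sketch below is only an illustration under stated assumptions: it uses a small synthetic count matrix (`X_demo`, `y_demo`, and the labelling rule are hypothetical stand-ins, not the Kaggle data) and scikit-learn's `cross_val_score`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a word-count matrix (assumption: NOT the Kaggle data)
rng = np.random.default_rng(42)
n_samples, n_features = 200, 20
X_demo = rng.poisson(lam=1.0, size=(n_samples, n_features))

# Hypothetical labelling rule: "spam" when the first three counts sum to more than 3
y_demo = (X_demo[:, :3].sum(axis=1) > 3).astype(int)

# 5-fold cross-validation; for classifiers an integer cv uses stratified folds
clf = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="f1")

print("F1 per fold:", cv_scores)
print("Mean F1:", cv_scores.mean())
```

On the real data, the same call could be made with `X_train` and `y_train` in place of the synthetic arrays, leaving the test set untouched until the end.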



--------------------

In [9]:
import re

from nltk.stem import SnowballStemmer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Build the stemmer once instead of on every call
stemmer = SnowballStemmer('english')

# Example function to clean and preprocess raw email text
def clean_email_text(email):
    email = email.lower()  # convert to lowercase
    email = re.sub(r'https?://\S+|www\.\S+', 'URL', email)  # replace URLs
    email = re.sub(r'\d+', 'NUMBER', email)  # replace numbers
    email = re.sub(r'\W', ' ', email)  # replace punctuation with spaces
    email = ' '.join(stemmer.stem(word) for word in email.split())  # stemming
    return email

# CountVectorizer turns the cleaned text into binary presence vectors
vectorizer = CountVectorizer(preprocessor=clean_email_text, binary=True)  # Use binary=False for counts

# Text pipeline for raw emails. emails.csv already contains word counts,
# so this pipeline is not fitted in the cells below.
pipeline = Pipeline([
    ('vectorizer', vectorizer),
    ('classifier', LogisticRegression(max_iter=1000))
])
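
To make the cleaning and vectorization steps in the cell above more concrete, here is a small plain-Python sketch of what they do to one message. It is an illustration only: the stemming step is left out to avoid the NLTK dependency, and the helper names (`clean_text_no_stem`, `binary_vector`), the sample message, and the tiny vocabulary are all hypothetical.

```python
import re

def clean_text_no_stem(email):
    """Same regex steps as the cell above, without stemming (assumption)."""
    email = email.lower()                                   # lowercase
    email = re.sub(r'https?://\S+|www\.\S+', 'URL', email)  # replace URLs
    email = re.sub(r'\d+', 'NUMBER', email)                 # replace numbers
    email = re.sub(r'\W', ' ', email)                       # punctuation -> spaces
    return ' '.join(email.split())                          # collapse whitespace

def binary_vector(text, vocabulary):
    """1 if the vocabulary word occurs in the text, else 0."""
    tokens = set(text.split())
    return [1 if word in tokens else 0 for word in vocabulary]

sample = "WIN $1000 now!!! Visit http://example.com/prize"
cleaned = clean_text_no_stem(sample)
print(cleaned)  # win NUMBER now visit URL

toy_vocabulary = ["free", "NUMBER", "URL", "win"]
print(binary_vector(cleaned, toy_vocabulary))  # [0, 1, 1, 1]
```

With `binary=True`, `CountVectorizer` records presence in the same spirit; with `binary=False` it would store how often each word occurs, which is the form the columns of `emails.csv` already have.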


In [10]:
# Separate features and target (same as in the earlier cell)
X_train = train_set.drop(['Email No.', 'Prediction'], axis=1)
y_train = train_set['Prediction']
X_test = test_set.drop(['Email No.', 'Prediction'], axis=1)
y_test = test_set['Prediction']

# Fit multiple models to compare
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100),
    'SVC': SVC(kernel='linear')
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    results[name] = (accuracy, precision, recall)

for name, scores in results.items():
    print(f"{name} - Accuracy: {scores[0]:.2f}, Precision: {scores[1]:.2f}, Recall: {scores[2]:.2f}")


Logistic Regression - Accuracy: 0.97, Precision: 0.94, Recall: 0.96
Random Forest - Accuracy: 0.98, Precision: 0.96, Recall: 0.96
SVC - Accuracy: 0.96, Precision: 0.92, Recall: 0.94
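
Since the goal is both high precision and high recall, one way to compare the three models with a single number is the F1 score, the harmonic mean of the two. The sketch below derives it from the rounded figures printed above, so the values are approximate; the names `reported` and `f1_from` are illustrative assumptions.

```python
# Rounded (precision, recall) pairs copied from the printed results above
reported = {
    'Logistic Regression': (0.94, 0.96),
    'Random Forest': (0.96, 0.96),
    'SVC': (0.92, 0.94),
}

def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f1_scores = {name: f1_from(p, r) for name, (p, r) in reported.items()}
for name, f1 in f1_scores.items():
    print(f"{name} - F1 (approx.): {f1:.2f}")
```

For exact values, `sklearn.metrics.f1_score(y_test, y_pred)` could be called inside the evaluation loop instead of working from rounded numbers.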
