**Task 1:**

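Four machine learning models were implemented for email spam classification: Multinomial Naive Bayes, Logistic Regression, Linear Support Vector Machine, and Random Forest. TF-IDF was used for feature extraction. On the held-out 20% test split (1,115 messages), Linear SVM achieved the highest accuracy (98.5%), ahead of Naive Bayes and Random Forest (both 97.9%) and Logistic Regression (96.9%), which makes it the strongest candidate for deployment among the four. Its spam recall is 0.89, so roughly one spam message in nine is still missed. The comparison covers a probabilistic baseline, two linear models, and an ensemble method.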

In [2]:
import pandas as pd
import numpy as np
import re
import nltk

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [3]:
# Load the dataset
df = pd.read_csv("/content/drive/MyDrive/unpaid remote internships/ArchTech/mail_data.csv", encoding="latin-1")
df.columns = ['Category', 'Message']

# Encode the labels as integers: ham -> 0, spam -> 1
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})
df.head()


   Category                                            Message
0         0  Go until jurong point, crazy.. Available only ...
1         0                      Ok lar... Joking wif u oni...
2         1  Free entry in 2 a wkly comp to win FA Cup fina...
3         0  U dun say so early hor... U c already then say...
4         0  Nah I don't think he goes to usf, he lives aro...


In [4]:
# Text preprocessing
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

def preprocess(text):
    """Lowercase, keep letters only, drop stop words, and stem each word."""
    text = text.lower()
    text = re.sub('[^a-z]', ' ', text)  # replace digits and punctuation with spaces
    words = text.split()
    words = [ps.stem(word) for word in words if word not in stop_words]
    return ' '.join(words)

df['cleaned_message'] = df['Message'].apply(preprocess)
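
To make the effect of the cleaning steps concrete, the sketch below traces one message through the lowercase, letters-only, and stop-word stages. It is an illustration only: `DEMO_STOP_WORDS` and `clean_without_stemming` are hypothetical names, the three-word stop list stands in for NLTK's much larger English list, and the Porter stemming step is left out so the example runs without NLTK.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list; the notebook
# uses stopwords.words('english'), which is far larger.
DEMO_STOP_WORDS = {"in", "a", "to"}


def clean_without_stemming(text, stop_words=DEMO_STOP_WORDS):
    """Mirror preprocess() up to, but not including, the Porter stemmer."""
    text = text.lower()
    # Every character outside a-z (digits, punctuation) becomes a space.
    text = re.sub('[^a-z]', ' ', text)
    return [word for word in text.split() if word not in stop_words]


sample = "Free entry in 2 a wkly comp to win FA Cup"
tokens = clean_without_stemming(sample)
print(tokens)
```

Note that the digit `2` disappears along with the stop words. The same rule removes phone numbers and prize amounts, which may be useful spam signals, so keeping digits is a variation that could be worth testing.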


In [5]:
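# Feature extraction with TF-IDF (vocabulary capped at the 3,000 most frequent terms)
vectorizer = TfidfVectorizer(max_features=3000)

# Dense array with one row per message and one column per vocabulary term
X = vectorizer.fit_transform(df['cleaned_message']).toarray()
y = df['Category']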


In [6]:
# Train-test split: hold out 20% of the messages for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
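
In the cells above, the vectorizer is fit on every message before the split, so the vocabulary and IDF weights are computed with the test messages included. One possible alternative, sketched here and not the workflow used in this notebook, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline so that TF-IDF is fit on the training portion only. The toy corpus is hypothetical, and `stratify=labels` is an added assumption that keeps the ham/spam ratio equal in both splits.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in corpus (hypothetical, not taken from mail_data.csv).
texts = [
    "win free prize now", "free entry win cash", "claim free cash prize",
    "urgent win cash now", "see you at lunch", "call me after class",
    "are we meeting today", "thanks for the notes",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = spam, 0 = ham

# Split the raw text first; stratify keeps the class ratio in both parts.
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then trains the SVM.
model = make_pipeline(TfidfVectorizer(max_features=3000), LinearSVC())
model.fit(texts_train, y_train)
predictions = model.predict(texts_test)

# The learned vocabulary contains training words only.
vocabulary = set(model.named_steps["tfidfvectorizer"].vocabulary_)
train_words = {word for text in texts_train for word in text.split()}
print(len(predictions), vocabulary <= train_words)
```

With this layout the same `model` object can be passed to cross-validation utilities, and the vectorizer is refit inside each fold. Scores measured this way are not directly comparable with the accuracies reported in this notebook.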


In [8]:
# Train and evaluate all models
# Multinomial Naive Bayes
nb = MultinomialNB()
nb.fit(X_train, y_train)
nb_pred = nb.predict(X_test)

print("Naive Bayes Accuracy:", accuracy_score(y_test, nb_pred))


Naive Bayes Accuracy: 0.979372197309417


In [10]:
# Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)

print("Logistic Regression Accuracy:", accuracy_score(y_test, lr_pred))


Logistic Regression Accuracy: 0.968609865470852


In [11]:
# Linear Support Vector Machine
svm = LinearSVC()
svm.fit(X_train, y_train)
svm_pred = svm.predict(X_test)

print("Linear SVM Accuracy:", accuracy_score(y_test, svm_pred))


Linear SVM Accuracy: 0.9847533632286996


In [12]:
# Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)

print("Random Forest Accuracy:", accuracy_score(y_test, rf_pred))


Random Forest Accuracy: 0.979372197309417


In [15]:
# Detailed classification report for the best model (Linear SVM)
print(classification_report(y_test, svm_pred))


              precision    recall  f1-score   support

           0       0.98      1.00      0.99       966
           1       0.99      0.89      0.94       149

    accuracy                           0.98      1115
   macro avg       0.99      0.95      0.97      1115
weighted avg       0.98      0.98      0.98      1115
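
The report gives rounded ratios, not raw counts. As a worked check, the sketch below searches for the split of Linear SVM errors into missed spam (false negatives) and ham flagged as spam (false positives) that reproduces the printed precision and recall for class 1. It uses only the supports and the accuracy printed in this notebook; the variable names are illustrative, and the result is a reconstruction from rounded figures, not an actual confusion matrix.

```python
# Figures taken from the outputs above.
test_size = 1115
support_spam = 149
svm_accuracy = 0.9847533632286996

# Total number of misclassified test messages.
total_errors = test_size - round(svm_accuracy * test_size)

# Try every way to split the errors into false negatives and false positives.
matches = []
for false_negatives in range(total_errors + 1):
    false_positives = total_errors - false_negatives
    true_positives = support_spam - false_negatives
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / support_spam
    # Compare at two decimals, the precision classification_report prints.
    if f"{precision:.2f}" == "0.99" and f"{recall:.2f}" == "0.89":
        matches.append((false_negatives, false_positives))

print(total_errors, matches)
```

Under these figures the only consistent split is 16 missed spam messages and 1 ham message flagged as spam, so nearly all of the model's errors are spam that slips through.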



In [16]:
# Model comparison table
results = pd.DataFrame({
    "Model": [
        "Naive Bayes",
        "Logistic Regression",
        "Linear SVM",
        "Random Forest"
    ],
    "Accuracy": [
        accuracy_score(y_test, nb_pred),
        accuracy_score(y_test, lr_pred),
        accuracy_score(y_test, svm_pred),
        accuracy_score(y_test, rf_pred)
    ]
})

results


                 Model  Accuracy
0          Naive Bayes  0.979372
1  Logistic Regression  0.968610
2           Linear SVM  0.984753
3        Random Forest  0.979372
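
Accuracies this close together are easier to compare as error counts. The sketch below converts each accuracy printed above into the number of misclassified messages on the 1,115-message test set. The dictionary simply restates the notebook's outputs; `errors` is an illustrative name.

```python
test_size = 1115  # number of test messages, from the classification report

# Accuracies as printed by the cells above.
accuracies = {
    "Naive Bayes": 0.979372197309417,
    "Logistic Regression": 0.968609865470852,
    "Linear SVM": 0.9847533632286996,
    "Random Forest": 0.979372197309417,
}

# Misclassified messages = total minus correctly classified.
errors = {
    name: test_size - round(accuracy * test_size)
    for name, accuracy in accuracies.items()
}

# List the models from fewest to most errors.
for name, count in sorted(errors.items(), key=lambda item: item[1]):
    print(f"{name}: {count} errors")
```

Seen this way, the gap between the best and worst model is 18 messages on a single split, so repeating the comparison with cross-validation would give a firmer basis for choosing between them.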
