In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("emails.csv")
df

Unnamed: 0,Email No.,the,to,ect,and,for,of,a,you,hou,...,connevey,jay,valued,lay,infrastructure,military,allowing,ff,dry,Prediction
0,Email 1,0,0,1,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Email 2,8,13,24,6,6,2,102,1,27,...,0,0,0,0,0,0,0,1,0,0
2,Email 3,0,0,1,0,0,0,8,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Email 4,0,5,22,0,5,1,51,2,10,...,0,0,0,0,0,0,0,0,0,0
4,Email 5,7,6,17,1,5,2,57,0,9,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5167,Email 5168,2,2,2,3,0,0,32,0,0,...,0,0,0,0,0,0,0,0,0,0
5168,Email 5169,35,27,11,2,6,5,151,4,3,...,0,0,0,0,0,0,0,1,0,0
5169,Email 5170,0,0,1,1,0,0,11,0,0,...,0,0,0,0,0,0,0,0,0,1
5170,Email 5171,2,7,1,0,2,1,28,2,0,...,0,0,0,0,0,0,0,1,0,1


In [3]:
# Checking for null values
print("Null values:\n",df.isna().sum(),"\n\n")

# There are no null values, but drop any rows with nulls as a safety measure
print("Before:", df.shape)
df = df.dropna()
print("After:", df.shape) # Shape is unchanged, so no rows were dropped, confirming there are no null values

Null values:
 Email No.     0
the           0
to            0
ect           0
and           0
             ..
military      0
allowing      0
ff            0
dry           0
Prediction    0
Length: 3002, dtype: int64 


Before: (5172, 3002)
After: (5172, 3002)


In [4]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score


# Applying KNN to the word-frequency features from the dataset
X = df.drop(columns=['Prediction', 'Email No.'])
y = df['Prediction']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Predict
y_pred = knn.predict(X_test)

# Evaluate
print("KNN")
accuracy = accuracy_score(y_test, y_pred)
error = 1 - accuracy
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Error:", error)
print("Precision:", precision)
print("Recall", recall)



# Applying SVM
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Scaling is important for SVM
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

svm = SVC(kernel='rbf')
svm.fit(X_train_scaled, y_train)

# Predict
y_pred = svm.predict(X_test_scaled)

# Evaluate
print("\n\nSVM")
accuracy = accuracy_score(y_test, y_pred)
error = 1 - accuracy
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Error:", error)
print("Precision:", precision)
print("Recall", recall)


KNN
Accuracy: 0.863768115942029
Error: 0.13623188405797104
Precision: 0.7272727272727273
Recall 0.8378378378378378


SVM
Accuracy: 0.9468599033816425
Error: 0.0531400966183575
Precision: 0.9958847736625515
Recall 0.8175675675675675


# ...  Explanation...


K-nearest neighbors (KNN) is a supervised machine learning algorithm used for classification and regression. It predicts the label or value of a new data point from its 'k' closest neighbors in the training dataset. For classification, it assigns the new point to the class that is most common among its 'k' neighbors; for regression, it predicts the average of its neighbors' values.
How it works
Classification: The algorithm finds the 'k' nearest data points to the new, unclassified point. The new point is then assigned the class that appears most frequently among these neighbors. For example, if the three nearest neighbors are class A, class B, and class A, the new point would be classified as class A.
Regression: The algorithm finds the 'k' nearest neighbors and predicts a value by taking the average of the values of those neighbors. For instance, to predict a house's price, it would use the average price of the 'k' houses closest to it. 
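
The majority-vote rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not how scikit-learn implements KNeighborsClassifier; the function name knn_classify, the Euclidean distance metric, and the toy points are assumptions made for the example.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point (assumed metric).
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two clusters labelled "A" and "B" (made up for illustration).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
labels = ["A", "A", "A", "B", "B"]

print(knn_classify(points, labels, query=(1.1, 1.0), k=3))  # A
print(knn_classify(points, labels, query=(4.8, 5.1), k=3))  # B (two B votes, one A)
```

For regression, the vote would be replaced by the mean of the neighbours' target values.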


The Jupyter Notebook Exp2.ipynb performs binary classification for email spam detection using the K-Nearest Neighbors (KNN) and Support Vector Machine (SVM) algorithms. The dataset is already pre-processed into a Bag-of-Words format, where columns represent word frequencies, and the target variable, Prediction, is the binary class (0 for Not Spam, 1 for Spam).

Here is a breakdown of the code's steps, execution, and model analysis.

💻 Code Explanation
1. Setup and Data Loading (Cells 1, 2)
Imports: The code imports pandas (pd) for data manipulation and numpy (np) for numerical operations.

Data Loading: The emails.csv dataset is loaded into the DataFrame df.

The DataFrame has 5172 rows and 3002 columns.

Columns: Email No., a large number of word frequency counts (e.g., the, to, ect), and the target column, Prediction.
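
To make the Bag-of-Words layout concrete, here is a minimal sketch of how raw text could be turned into the kind of word-count row stored in emails.csv. The notebook does not show how the dataset was built, so the tokenisation (lowercase, whitespace split), the five-word vocabulary, and the sample sentence are all assumptions.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary: a handful of the column names shown in the DataFrame.
vocabulary = ["the", "to", "and", "for", "you"]
email_text = "Thanks for the update and for sending the files to you"

row = bag_of_words(email_text, vocabulary)
print(dict(zip(vocabulary, row)))
# {'the': 2, 'to': 1, 'and': 1, 'for': 2, 'you': 1}
```

Each email in the dataset is one such row, extended to 3000 vocabulary words.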

2. Data Preprocessing (Cell 3)
Handling Null Values:

Python
print("Null values:\n",df.isna().sum(),"\n\n")
df = df.dropna()
The code checks for and removes any rows with missing (null) values. The output confirms that there are no null values in the dataset, so the shape remains unchanged: (5172, 3002).
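
The before/after shape comparison only reveals something when nulls are present. As a small sketch on made-up data (not taken from emails.csv), this is what the same two calls report when one value is missing:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (made-up data for illustration).
toy = pd.DataFrame({"the": [3, 0, 5], "to": [1, np.nan, 2], "Prediction": [0, 1, 0]})

nulls_per_column = toy.isna().sum()   # count of NaNs in each column
cleaned = toy.dropna()                # drops every row that contains a NaN

print(nulls_per_column.to_dict())
print("Before:", toy.shape, "After:", cleaned.shape)
```

Here the row count drops from 3 to 2, whereas in the notebook the shape stays at (5172, 3002).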

3. Model Training and Evaluation (Cell 4)
This is the core of the analysis, where the data is prepared, two models are trained, and their performances are assessed using key metrics.

Data Preparation
Feature and Target Selection:

Python
X = df.drop(columns=['Prediction', 'Email No.'])
y = df['Prediction']
Features (X): All word frequency columns (3000 total) are selected as features. The Prediction and Email No. columns are dropped from the feature set.

Target (y): The Prediction column, which contains the binary class (0 or 1), is the target variable.

Train-Test Split:

Python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
The data is split into training (80%) and testing (20%) sets. This ensures the model is trained on one set of data and evaluated on a completely unseen set.
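
As a quick check of what the 80/20 split means in row counts, the sketch below applies the same call to placeholder arrays with 5172 rows. The arrays are dummies created for illustration; only the row count and the split parameters come from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the notebook's DataFrame.
n_rows = 5172
X_demo = np.zeros((n_rows, 3))      # 3 dummy feature columns
y_demo = np.arange(n_rows) % 2      # dummy binary labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape[0], X_te.shape[0])  # 4137 1035
```

The test set has 1035 rows rather than 1034 because scikit-learn rounds the test count up (20% of 5172 is 1034.4).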

A. K-Nearest Neighbors (KNN)
KNN is a non-parametric, instance-based learning algorithm that classifies a point based on the majority class of its k nearest data points in the feature space.

Training:

Python
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
A KNN classifier is initialized with k=5 (meaning it looks at 5 neighbors) and trained on the training data.

Prediction: y_pred = knn.predict(X_test) generates predictions for the test set.

Evaluation: The model's performance on the test set is evaluated:

Accuracy: 0.86377

Error: 0.13623

Precision: 0.72727

Recall: 0.83784

B. Support Vector Machine (SVM)
SVM works by finding an optimal hyperplane that maximizes the margin (distance) between the different classes in the feature space.

Scaling (Crucial for SVM):

Python
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
SVM is highly sensitive to the scale of features. The data is standardized (scaled to have a mean of 0 and standard deviation of 1) before training to ensure all features contribute equally to the distance calculations.
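
The fit-on-train, transform-on-test pattern can be seen on a tiny example. The numbers below are made up to show two features on very different scales; they are not from emails.csv.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up counts on very different scales (for illustration only).
X_train_demo = np.array([[0.0, 100.0], [2.0, 300.0], [4.0, 500.0]])
X_test_demo = np.array([[2.0, 100.0]])

demo_scaler = StandardScaler()
X_train_std = demo_scaler.fit_transform(X_train_demo)  # learns mean/std from training data only
X_test_std = demo_scaler.transform(X_test_demo)        # reuses the training mean/std

print(X_train_std.mean(axis=0))  # approximately 0 for each column
print(X_train_std.std(axis=0))   # approximately 1 for each column
print(X_test_std)                # test row expressed in training-set units
```

Calling fit_transform on the test set instead would leak test statistics into preprocessing, which is why the notebook uses transform there.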

Training:

Python
svm = SVC(kernel='rbf')
svm.fit(X_train_scaled, y_train)
A Support Vector Classifier is trained using the Radial Basis Function ('rbf') kernel, which allows the model to find non-linear decision boundaries.
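
The RBF kernel measures similarity as exp(-gamma * ||x - z||^2), so nearby points score close to 1 and distant points close to 0. A minimal sketch of that formula follows; the helper name rbf_kernel_value and gamma = 0.5 are assumptions for illustration (the notebook leaves gamma at SVC's default).

```python
import math

def rbf_kernel_value(x, z, gamma):
    """RBF (Gaussian) kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

gamma = 0.5  # arbitrary value chosen for illustration

print(rbf_kernel_value([1.0, 2.0], [1.0, 2.0], gamma))  # identical points -> 1.0
print(rbf_kernel_value([1.0, 2.0], [2.0, 3.0], gamma))  # squared distance 2 -> exp(-1)
print(rbf_kernel_value([1.0, 2.0], [9.0, 9.0], gamma))  # distant points -> close to 0
```

Because the kernel depends on squared distances, features with large raw ranges would dominate it, which is the reason for the scaling step.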

Prediction: y_pred = svm.predict(X_test_scaled) generates predictions for the scaled test set.

Evaluation:

Accuracy: 0.94686

Error: 0.05314

Precision: 0.99588

Recall: 0.81757

📈 Model Performance Analysis
The final output provides a comparison of the two models based on the selected classification metrics:

| Metric | KNN Performance | SVM Performance |
|---|---|---|
| Accuracy | 86.38% | 94.69% |
| Error Rate | 13.62% | 5.31% |
| Precision | 72.73% | 99.59% |
| Recall | 83.78% | 81.76% |

Conclusion
Overall Performance: The Support Vector Machine (SVM) model outperforms KNN on overall Accuracy (94.69% vs 86.38%) and correspondingly has a lower Error Rate (5.31% vs 13.62%).

Precision vs. Recall:

SVM demonstrates excellent Precision (99.59%). In the context of spam detection, high precision means that when the model predicts an email is spam, it is almost certainly correct. This minimizes the number of false positives (legitimate emails marked as spam).

KNN has a slightly higher Recall (83.78% vs 81.76%). Higher recall means the model is slightly better at catching all the actual spam emails (minimizing false negatives).
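
To see where these percentages come from, the sketch below recomputes them from confusion-matrix counts. The notebook does not print a confusion matrix; the counts here are back-computed from the reported metrics, assuming a 1035-email test set (20% of 5172, rounded up), so treat them as a reconstruction rather than notebook output.

```python
def summarize(tp, fp, fn, tn):
    """Compute accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),   # of emails flagged as spam, how many were spam
        "recall": tp / (tp + fn),      # of actual spam emails, how many were caught
    }

# Counts reconstructed from the reported metrics (an assumption, not notebook output).
knn_counts = dict(tp=248, fp=93, fn=48, tn=646)
svm_counts = dict(tp=242, fp=1, fn=54, tn=738)

for name, counts in [("KNN", knn_counts), ("SVM", svm_counts)]:
    metrics = summarize(**counts)
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

Under this reconstruction, SVM flags only 1 legitimate email as spam against 93 for KNN, while missing 54 spam emails against KNN's 48.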

Model Choice: Given the high priority of Precision in spam detection (to avoid losing important, legitimate emails), the SVM model with an RBF kernel is the superior choice for this dataset.

The last markdown cell contains commented-out code that represents a backup or alternative approach to report model performance, including using a different test size (25%) and generating visual reports like Confusion Matrices, Precision-Recall curves, and ROC curves, which would provide a richer analysis if executed.

# ... Backup better code with tables and all
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('emails.csv')
df.head()

df.isnull().sum()

df.dropna(how='any',inplace=True)

x = df.iloc[:,1:-1].values
y = df.iloc[:,-1].values

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state=10)

from sklearn.metrics import (ConfusionMatrixDisplay, PrecisionRecallDisplay, RocCurveDisplay,
                             confusion_matrix, accuracy_score, precision_score, recall_score)

def report(classifier):
    """Print test-set metrics and plot the confusion matrix, PR curve, and ROC curve."""
    y_pred = classifier.predict(x_test)
    cm = confusion_matrix(y_test,y_pred)
    display = ConfusionMatrixDisplay(cm,display_labels=classifier.classes_)
    display.plot()
    print(f"Accuracy:  {accuracy_score(y_test,y_pred)}")
    print(f"Precision Score:  {precision_score(y_test,y_pred)}")
    print(f"Recall Score:  {recall_score(y_test,y_pred)}")
    # plot_precision_recall_curve and plot_roc_curve were removed in scikit-learn 1.2;
    # the Display.from_estimator classmethods (available since 1.0) replace them.
    PrecisionRecallDisplay.from_estimator(classifier,x_test,y_test)
    RocCurveDisplay.from_estimator(classifier,x_test,y_test)


from sklearn.neighbors import KNeighborsClassifier

kNN = KNeighborsClassifier(n_neighbors=10)
kNN.fit(x_train,y_train)

report(kNN)

from sklearn.svm import SVC
svm = SVC(gamma='auto',random_state=10)
svm.fit(x_train,y_train)

report(svm)