# Email Spam Detection

**Problem Statement:** Classify emails as spam or not spam using binary classification. The two states are:
a) Normal State – Not Spam (0)
b) Abnormal State – Spam (1)

**Objectives:**
1. Pre-process the dataset.
2. Implement K-Nearest Neighbors (KNN) and Support Vector Machine (SVM) for classification.
3. Analyze and compare the performance of the models.

## Imports
Import necessary libraries for data manipulation, visualization, and modeling.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## 1. Data Loading and Initial Exploration

In [2]:
df = pd.read_csv('emails.csv')

In [3]:
df.head()

Unnamed: 0,Email No.,the,to,ect,and,for,of,a,you,hou,...,connevey,jay,valued,lay,infrastructure,military,allowing,ff,dry,Prediction
0,Email 1,0,0,1,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Email 2,8,13,24,6,6,2,102,1,27,...,0,0,0,0,0,0,0,1,0,0
2,Email 3,0,0,1,0,0,0,8,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Email 4,0,5,22,0,5,1,51,2,10,...,0,0,0,0,0,0,0,0,0,0
4,Email 5,7,6,17,1,5,2,57,0,9,...,0,0,0,0,0,0,0,1,0,0


In [4]:
# Check data types and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5172 entries, 0 to 5171
Columns: 3002 entries, Email No. to Prediction
dtypes: int64(3001), object(1)
memory usage: 118.5+ MB


In [5]:
# Check for missing values
df.isnull().sum()

Email No.     0
the           0
to            0
ect           0
and           0
             ..
military      0
allowing      0
ff            0
dry           0
Prediction    0
Length: 3002, dtype: int64

### 1.2: Clean Data
Drop the `Email No.` column. It is only a row identifier and carries no predictive signal.

In [6]:
df.drop(['Email No.'], axis=1, inplace=True)

### 1.3. Define Input (X) and Output (y) Features
Separate the dataset into features (X) and the target variable (y).

In [7]:
# 'Prediction' is the target column (0 for Not Spam, 1 for Spam)
X = df.drop('Prediction', axis=1)
y = df['Prediction']

### 1.4. Train-Test Split
Split the data into training and testing sets (80% train, 20% test).

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
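
The test split reported later contains 739 non-spam and 296 spam emails, so the classes are imbalanced. A minimal sketch, assuming synthetic stand-in data (`X_demo`, `y_demo` are hypothetical and not part of the notebook), of how the `stratify` argument of `train_test_split` keeps the class ratio the same in both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the email data (assumption): 1000 rows, 30% spam.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 5, size=(1000, 10))
y_demo = np.array([1] * 300 + [0] * 700)

# stratify=y_demo preserves the 30/70 class ratio in train and test.
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_spam_ratio = ytr.mean()
test_spam_ratio = yte.mean()
print(f'Train spam ratio: {train_spam_ratio:.2f}')
print(f'Test spam ratio:  {test_spam_ratio:.2f}')
```

Whether stratification changes the results on `emails.csv` was not tested here; it mainly guards against an unlucky split.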

## 2. Implement Classification Models

### 2.1: K-Nearest Neighbors (KNN)
Train and predict using KNN with n_neighbors=5.

In [10]:
from sklearn.neighbors import KNeighborsClassifier

In [11]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)

### 2.2: Support Vector Machine (SVM)
Train and predict using SVM with a linear kernel.

In [12]:
from sklearn.svm import SVC

In [13]:
# Using kernel='linear' as it is often good for high-dimensional text data
svm = SVC(kernel='linear', random_state=42)
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_test)

## 3. Evaluate the Models
Compare the performance of both models using standard classification metrics.

In [14]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

### 3.1: KNN Performance

In [15]:
print('--- K-Nearest Neighbors ---')
print(f'Accuracy: {accuracy_score(y_test, y_pred_knn)}')
print('\nConfusion Matrix:')
print(confusion_matrix(y_test, y_pred_knn))
print('\nClassification Report:')
print(classification_report(y_test, y_pred_knn))

--- K-Nearest Neighbors ---
Accuracy: 0.8628019323671497

Confusion Matrix:
[[645  94]
 [ 48 248]]

Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.87      0.90       739
           1       0.73      0.84      0.78       296

    accuracy                           0.86      1035
   macro avg       0.83      0.86      0.84      1035
weighted avg       0.87      0.86      0.87      1035



### 3.2: SVM Performance

In [16]:
print('--- Support Vector Machine ---')
print(f'Accuracy: {accuracy_score(y_test, y_pred_svm)}')
print('\nConfusion Matrix:')
print(confusion_matrix(y_test, y_pred_svm))
print('\nClassification Report:')
print(classification_report(y_test, y_pred_svm))

--- Support Vector Machine ---
Accuracy: 0.9594202898550724

Confusion Matrix:
[[715  24]
 [ 18 278]]

Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.97      0.97       739
           1       0.92      0.94      0.93       296

    accuracy                           0.96      1035
   macro avg       0.95      0.95      0.95      1035
weighted avg       0.96      0.96      0.96      1035



### 3.3. Performance Comparison

In [17]:
print(f'K-Nearest Neighbors (KNN) Accuracy: {accuracy_score(y_test, y_pred_knn)}')
print(f'Support Vector Machine (SVM) Accuracy: {accuracy_score(y_test, y_pred_svm)}')

K-Nearest Neighbors (KNN) Accuracy: 0.8628019323671497
Support Vector Machine (SVM) Accuracy: 0.9594202898550724


### Conclusion
The Support Vector Machine (SVM) classifier clearly outperformed the K-Nearest Neighbors (KNN) classifier on this dataset: the classification reports show an accuracy of about 0.96 for SVM versus 0.86 for KNN. For the spam class, SVM reaches a precision of 0.92 and a recall of 0.94, compared with 0.73 and 0.84 for KNN. SVM also misclassifies far fewer legitimate emails as spam (24 versus 94 in the confusion matrices), making it the more robust model for this specific task.

# Notes For Viva

Algorithm and Metrics Explanation

**K-Nearest Neighbors (KNN)**

Definition: A supervised, non-parametric, and "lazy learning" algorithm used for both classification and regression. For classification, it predicts the class of a new data point based on the majority class of its 'k' nearest neighbors in the feature space.

Important Info:

- Lazy Learning: It does not build a general model during training; it stores the entire training dataset. The computation occurs at prediction time.

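- Choice of 'k': The value of 'k' is critical. A small 'k' makes the model sensitive to noise (it tends to overfit), while a large 'k' oversmooths the decision boundary (it tends to underfit) and lets the majority class dominate the vote. An odd 'k' avoids tied votes in binary classification.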

- Distance Metric: Relies on a distance metric, typically Euclidean distance.

- Scaling: Very sensitive to the scale of features. Features with larger ranges can dominate the distance calculation, so feature scaling (e.g., standardization) is almost always required.
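
One common way to choose 'k' is to compare cross-validated accuracy over a few candidate values. A minimal sketch, assuming synthetic data from `make_classification` (the candidate values and the data are illustrative, not taken from the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (assumption, not the email dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

# Mean 5-fold cross-validated accuracy for each candidate k.
cv_accuracy = {}
for k in (1, 3, 5, 11, 21):
    model = KNeighborsClassifier(n_neighbors=k)
    cv_accuracy[k] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

best_k = max(cv_accuracy, key=cv_accuracy.get)
print(cv_accuracy)
print('Best k on this synthetic data:', best_k)
```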

Formula (Euclidean Distance):


$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

- $p, q$: Two data points

- $n$: Number of features
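
The formula above translates directly into code. A small sketch with two made-up word-count vectors (`email_a` and `email_b` are hypothetical):

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two equal-length feature vectors."""
    if len(p) != len(q):
        raise ValueError('p and q must have the same number of features')
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Hypothetical word counts for two emails over four words.
email_a = [0, 0, 1, 0]
email_b = [3, 4, 1, 0]

distance = euclidean_distance(email_a, email_b)
print(distance)  # sqrt(3^2 + 4^2 + 0 + 0) = 5.0
```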

**Support Vector Machine (SVM)**

Definition: A supervised learning algorithm that finds an optimal "hyperplane" that best separates data points into different classes. The best hyperplane is the one that maximizes the "margin" (the distance) between the hyperplane and the nearest data points (called "support vectors").

Important Info:

- Kernel Trick: Can efficiently perform non-linear classification by mapping data to a higher-dimensional space. Common kernels are 'linear', 'poly', and 'rbf' (Radial Basis Function).

- High-Dimensional Data: Very effective in high-dimensional spaces, making it suitable for tasks like text classification (where each word can be a feature).

- Support Vectors: These are the data points closest to the hyperplane (with a soft margin, also any points that fall inside the margin or on the wrong side). They are the only points that influence the position and orientation of the hyperplane.
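
A fitted `SVC` exposes its support vectors, which makes the last point easy to check. A minimal sketch on toy 2-D data (an assumption for illustration, not the email dataset): it fits a linear SVM, then refits after removing one point that is not a support vector and compares the hyperplane weights.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2-D data (assumption for illustration).
X_toy = np.array([[0, 0], [1, 0], [0, 1], [3, 3], [4, 3], [3, 4]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear').fit(X_toy, y_toy)
print('Support vector indices:', clf.support_)
print('Hyperplane weights:', clf.coef_, 'bias:', clf.intercept_)

# Refit without the first point that is NOT a support vector.
non_sv = [i for i in range(len(X_toy)) if i not in clf.support_]
keep = [i for i in range(len(X_toy)) if i != non_sv[0]]
clf_reduced = SVC(kernel='linear').fit(X_toy[keep], y_toy[keep])
print('Weights after removal:', clf_reduced.coef_)
```

Up to solver tolerance, the two sets of weights should agree, since the removed point did not constrain the margin.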

**Evaluation Metrics for Classification**

**Confusion Matrix**

Definition: A table used to visualize the performance of a classifier. It shows the number of correct and incorrect predictions for each class.

- True Positive (TP): Actual: 1, Predicted: 1

- True Negative (TN): Actual: 0, Predicted: 0

- False Positive (FP): Actual: 0, Predicted: 1 (Type I Error)

- False Negative (FN): Actual: 1, Predicted: 0 (Type II Error)
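
The four counts can be tallied by hand from actual and predicted labels. A small sketch with made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical). For labels 0 and 1, scikit-learn's `confusion_matrix` arranges the same counts as `[[TN, FP], [FN, TP]]`, with rows for actual classes and columns for predicted classes.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels where 1 = spam, 0 = not spam."""
    counts = {'TP': 0, 'TN': 0, 'FP': 0, 'FN': 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            counts['TP'] += 1
        elif actual == 0 and predicted == 0:
            counts['TN'] += 1
        elif actual == 0 and predicted == 1:
            counts['FP'] += 1  # legitimate email flagged as spam
        else:
            counts['FN'] += 1  # spam that slipped through
    return counts

# Hypothetical labels for eight emails.
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 1, 0, 1, 0]

counts = confusion_counts(y_true_demo, y_pred_demo)
print(counts)  # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```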

**Accuracy**

Definition: The ratio of correctly predicted instances to the total number of instances.

Formula:


$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

**Precision**

Definition: Measures the accuracy of positive predictions. Answers the question: "Of all the emails the model predicted as spam, what fraction was actually spam?"

Formula:


$$Precision = \frac{TP}{TP + FP}$$

**Recall (Sensitivity)**

Definition: Measures the model's ability to find all positive instances. Answers the question: "Of all the actual spam emails, what fraction did the model correctly identify as spam?"

Formula:


$$Recall = \frac{TP}{TP + FN}$$

**F1-Score**

Definition: The harmonic mean of Precision and Recall. It provides a single score that balances both metrics, which is useful when classes are imbalanced.

Formula:


$$F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}$$
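
As a worked check, the four formulas can be applied to the SVM confusion matrix printed in section 3.2 (TN = 715, FP = 24, FN = 18, TP = 278). The rounded results should match the spam-class row of the classification report.

```python
# Counts read from the SVM confusion matrix [[715, 24], [18, 278]].
tn, fp, fn, tp = 715, 24, 18, 278

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f'Accuracy:  {accuracy:.2f}')   # 0.96
print(f'Precision: {precision:.2f}')  # 0.92
print(f'Recall:    {recall:.2f}')     # 0.94
print(f'F1-score:  {f1:.2f}')         # 0.93
```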

**Potential Viva Questions**

Q1: What is "lazy learning" and why is KNN called that?  
A: "Lazy learning" (or instance-based learning) is a method where the model does not build a general internal model during the training phase. It simply stores the training data. The main computational work is delayed until a prediction is requested. KNN is called this because it just stores all training points and finds the 'k' nearest neighbors only when it needs to make a new prediction.
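
To make the "lazy" part concrete, here is a minimal from-scratch sketch (`TinyKNN` is a hypothetical illustration, not scikit-learn's implementation). Note that `fit` only stores the data, and all distance computation happens in `predict_one`.

```python
import math
from collections import Counter

class TinyKNN:
    """Illustrative k-nearest-neighbors classifier for small datasets."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # "Training" is just memorizing the data: no model is built here.
        self.X_ = [list(row) for row in X]
        self.y_ = list(y)
        return self

    def predict_one(self, x):
        # All the work happens at prediction time.
        distances = sorted(
            (math.dist(row, x), label) for row, label in zip(self.X_, self.y_)
        )
        votes = Counter(label for _, label in distances[: self.k])
        return votes.most_common(1)[0][0]

# Toy data (assumption): two well-separated clusters.
X_toy = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_toy = [0, 0, 0, 1, 1, 1]

knn_demo = TinyKNN(k=3).fit(X_toy, y_toy)
print(knn_demo.predict_one([0.5, 0.5]))  # 0
print(knn_demo.predict_one([5.5, 5.5]))  # 1
```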

Q2: In spam detection, is a False Positive or a False Negative a worse error?  
A: A False Positive (classifying a real email as spam) is generally considered much worse. This means a user might miss an important email (e.g., from their bank, job, or family) because it was incorrectly sent to the spam folder. A False Negative (classifying a spam email as real) is just an annoyance that the user has to delete manually.
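
One way to quantify this with the notebook's own results is the false positive rate, FP / (FP + TN), the share of legitimate emails wrongly flagged as spam. A short sketch using the counts from the two confusion matrices in section 3:

```python
def false_positive_rate(fp, tn):
    """Fraction of actual non-spam emails that were predicted as spam."""
    return fp / (fp + tn)

# Counts read from the confusion matrices in sections 3.1 and 3.2.
knn_fpr = false_positive_rate(fp=94, tn=645)
svm_fpr = false_positive_rate(fp=24, tn=715)

print(f'KNN false positive rate: {knn_fpr:.3f}')  # 0.127
print(f'SVM false positive rate: {svm_fpr:.3f}')  # 0.032
```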

Q3: What are "support vectors" in SVM?  
A: Support vectors are the data points from each class that are closest to the decision boundary (hyperplane). They are the most difficult points to classify and are the only points that "support" or define the optimal position of the hyperplane. If any other (non-support vector) point were removed, the hyperplane would not change.

Q4: Why might you need to scale your data before using KNN?  
A: KNN works by calculating distances between data points. If one feature has a very large scale (e.g., word count from 0-1000) and another has a small scale (e.g., 0-1), the feature with the larger scale will completely dominate the distance calculation. This makes the smaller-scale feature almost irrelevant. Scaling (like standardization or normalization) brings all features to a comparable scale so that each feature contributes fairly to the distance.
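
A small sketch of the effect, assuming made-up data with one large-scale feature (a word count) and one small-scale feature (a ratio between 0 and 1). The nearest neighbor of the query changes once both columns are standardized. The manual standardization below uses the column mean and population standard deviation, which is also what scikit-learn's `StandardScaler` computes.

```python
import numpy as np

# Hypothetical data: column 0 is a word count, column 1 is a 0-1 ratio.
X_demo = np.array([[520.0, 0.1], [600.0, 0.9], [0.0, 0.0], [1000.0, 1.0]])
query = np.array([500.0, 0.9])

def nearest_index(X, q):
    """Index of the row in X with the smallest Euclidean distance to q."""
    return int(np.argmin(np.linalg.norm(X - q, axis=1)))

# Unscaled: the word-count column dominates the distance.
raw_nearest = nearest_index(X_demo, query)

# Standardized: both columns contribute on a comparable scale.
mean = X_demo.mean(axis=0)
std = X_demo.std(axis=0)
scaled_nearest = nearest_index((X_demo - mean) / std, (query - mean) / std)

print('Nearest neighbor without scaling:', raw_nearest)   # 0
print('Nearest neighbor with scaling:   ', scaled_nearest)  # 1
```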

Q5: What is the "kernel trick" in SVM?  
A: The kernel trick is a mathematical method that lets SVM separate data that is not linearly separable in the original feature space. Conceptually, the data is mapped into a higher-dimensional space where a linear hyperplane can separate it. The mapping is never computed explicitly: a kernel function returns the dot product between two points as if they had already been mapped into that higher-dimensional space, which saves a lot of computation.
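
A small numeric check of this idea, using the degree-2 polynomial kernel $K(x, z) = (x \cdot z)^2$ on 2-D points as an illustrative choice. Its explicit feature map is $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, and the kernel gives the same value as the dot product in that 3-D space without ever building $\phi$.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel, computed in the original 2-D space."""
    return dot(x, z) ** 2

def phi(x):
    """Explicit feature map into 3-D space for the same kernel."""
    return [x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2]

x, z = [1.0, 2.0], [3.0, 4.0]

kernel_value = poly2_kernel(x, z)     # (1*3 + 2*4)^2 = 121
explicit_value = dot(phi(x), phi(z))  # 9 + 48 + 64 = 121

print(kernel_value, explicit_value)
```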