In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

In [None]:
# Load dataset
df = pd.read_csv("emails.csv")

In [None]:
df.head()

Unnamed: 0,Email No.,the,to,ect,and,for,of,a,you,hou,...,connevey,jay,valued,lay,infrastructure,military,allowing,ff,dry,Prediction
0,Email 1,0,0,1,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Email 2,8,13,24,6,6,2,102,1,27,...,0,0,0,0,0,0,0,1,0,0
2,Email 3,0,0,1,0,0,0,8,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Email 4,0,5,22,0,5,1,51,2,10,...,0,0,0,0,0,0,0,0,0,0
4,Email 5,7,6,17,1,5,2,57,0,9,...,0,0,0,0,0,0,0,1,0,0


In [None]:
df.isnull().sum()

Email No.     0
the           0
to            0
ect           0
and           0
             ..
military      0
allowing      0
ff            0
dry           0
Prediction    0
Length: 3002, dtype: int64

In [None]:
X = df.iloc[:,1:3001]  # word frequency features
X

Unnamed: 0,the,to,ect,and,for,of,a,you,hou,in,...,enhancements,connevey,jay,valued,lay,infrastructure,military,allowing,ff,dry
0,0,0,1,0,0,0,2,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,8,13,24,6,6,2,102,1,27,18,...,0,0,0,0,0,0,0,0,1,0
2,0,0,1,0,0,0,8,0,0,4,...,0,0,0,0,0,0,0,0,0,0
3,0,5,22,0,5,1,51,2,10,1,...,0,0,0,0,0,0,0,0,0,0
4,7,6,17,1,5,2,57,0,9,3,...,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5167,2,2,2,3,0,0,32,0,0,5,...,0,0,0,0,0,0,0,0,0,0
5168,35,27,11,2,6,5,151,4,3,23,...,0,0,0,0,0,0,0,0,1,0
5169,0,0,1,1,0,0,11,0,0,1,...,0,0,0,0,0,0,0,0,0,0
5170,2,7,1,0,2,1,28,2,0,8,...,0,0,0,0,0,0,0,0,1,0


In [None]:
Y = df.iloc[:,-1].values # 1 = spam, 0 = not spam
Y

array([0, 0, 0, ..., 1, 1, 0], shape=(5172,))

In [None]:
# Split data: hold out 25% for testing; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=42)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

# -------- Support Vector Machine --------
# RBF-kernel SVM; gamma='auto' sets gamma = 1 / n_features
svc = SVC(C=1.0, kernel='rbf', gamma='auto')
svc.fit(X_train, y_train)
svc_pred = svc.predict(X_test)

In [None]:
print("SVM Accuracy:", accuracy_score(y_test, svc_pred))
print("SVM Classification Report:\n", classification_report(y_test, svc_pred))
print("SVM Confusion Matrix:\n", confusion_matrix(y_test, svc_pred))

SVM Accuracy: 0.8932714617169374
SVM Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.96      0.93       913
           1       0.87      0.74      0.80       380

    accuracy                           0.89      1293
   macro avg       0.89      0.85      0.87      1293
weighted avg       0.89      0.89      0.89      1293

SVM Confusion Matrix:
 [[872  41]
 [ 97 283]]


In [None]:
# -------- K-Nearest Neighbors --------
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
knn_pred = knn.predict(X_test)

In [None]:
print("KNN Accuracy:", knn.score(X_test, y_test))
print("KNN Classification Report:\n", classification_report(y_test, knn_pred))
print("KNN Confusion Matrix:\n", confusion_matrix(y_test, knn_pred))

KNN Accuracy: 0.8685990338164251
KNN Classification Report:
               precision    recall  f1-score   support

           0       0.94      0.87      0.90       739
           1       0.73      0.86      0.79       296

    accuracy                           0.87      1035
   macro avg       0.83      0.87      0.85      1035
weighted avg       0.88      0.87      0.87      1035

KNN Confusion Matrix:
 [[645  94]
 [ 42 254]]


In [None]:
One-line explanation for each code cell

Cell 0: Import required libraries and ML classes (pandas, train_test_split, SVC, accuracy_score, KNeighborsClassifier).

Cell 1: Load the emails.csv dataset into a DataFrame.

Cell 2: Display the first few rows of the dataset to inspect structure and sample records.

Cell 3: Check for missing values column-wise to ensure data completeness.

Cell 4: Select the feature matrix X (word-frequency features from columns 1 to 3000).

Cell 5: Extract the target vector Y (spam label: 1 = spam, 0 = not spam).

Cell 6: Split the data into training and test sets with 25% reserved for testing.

Cell 7: Import classification_report and confusion_matrix, train a Support Vector Classifier (RBF kernel, C=1.0, gamma='auto') on the training set, and predict on the test set.

Cell 8: Print SVM accuracy, classification report and confusion matrix for evaluation.

Cell 9: Train a K-Nearest Neighbors classifier (k=7) on the training data and predict on the test set.

Cell 10: Print KNN accuracy, classification report and confusion matrix for evaluation.


Theory (concise, exam-style)

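Feature representation: Each email is represented as a 3000-dimensional vector of word-frequency features, a typical bag-of-words / term-frequency approach.

Train/Test split: Holds out data the model never sees during training so that evaluation is fair; here 25% of the emails form the test set used to estimate generalization.

Support Vector Machine (SVM): A discriminative classifier that finds a hyperplane maximizing the margin between classes; effective in high-dimensional spaces and can use kernels to model nonlinearity. Training objective (hard-margin form): find w, b minimizing ||w|| subject to the margin constraints y_i(w·x_i + b) ≥ 1; the soft-margin form used by SVC relaxes these constraints, with C controlling the penalty for violations. The decision function is then sign(w·x + b).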

K-Nearest Neighbors (KNN): A non-parametric instance-based classifier that assigns labels by majority vote among the k nearest training points under a distance metric (usually Euclidean). Sensitive to feature scaling and high dimensionality.

Evaluation metrics:

Accuracy = (TP+TN)/total: overall correctness.

Precision = TP/(TP+FP): how many predicted spam were actually spam.

Recall (Sensitivity) = TP/(TP+FN): how many actual spam were caught.

F1-score = 2·Precision·Recall/(Precision+Recall): harmonic mean of precision and recall.

Confusion matrix shows TP, FP, TN, FN counts for error analysis; scikit-learn prints it as [[TN, FP], [FN, TP]] (rows = true class, columns = predicted class).
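As a worked check of the formulas above, the following sketch recomputes the metrics by hand from the SVM confusion matrix printed in the notebook. It assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and treats class 1 (spam) as the positive class.

```python
# SVM confusion matrix as printed by the notebook.
# Assumed layout (scikit-learn): rows = true class, columns = predicted class.
cm = [[872, 41],
      [97, 283]]

tn, fp = cm[0]  # true class 0 (not spam)
fn, tp = cm[1]  # true class 1 (spam)
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.8933 precision=0.87 recall=0.74 f1=0.80
```

These values agree with the class-1 row of the SVM classification report.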

Algorithm (step-by-step), mapped to notebook cells

Import libraries (Cell 0).

Load dataset: read CSV into DataFrame (Cell 1).

Inspect data: preview rows and check missing values (Cells 2-3).

Prepare features and labels: X = columns 1 to 3000 (df.iloc[:, 1:3001]), Y = last column labels (Cells 4-5).

Split dataset: train_test_split with test_size=0.25 (Cell 6).

Train SVM: instantiate SVC(C=1.0, kernel='rbf', gamma='auto'), fit on X_train/y_train, predict on X_test (Cell 7).

Evaluate SVM: compute accuracy, classification report, confusion matrix (Cell 8).

Train KNN: instantiate KNeighborsClassifier(n_neighbors=7), fit, predict (Cell 9).

Evaluate KNN: compute accuracy, classification report, confusion matrix (Cell 10).
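The train and evaluate steps are the same for both models, so they could be wrapped in one helper. The sketch below is one possible way to do that; the `evaluate` function is hypothetical (not part of the notebook), and the data is a synthetic stand-in from `make_classification` instead of emails.csv. Passing the same split to both calls also means both models are scored on the same test rows.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit `model`, predict on the test set, and return the three metrics."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, pred),
        "report": classification_report(y_test, pred),
        "confusion_matrix": confusion_matrix(y_test, pred),
    }


# Synthetic stand-in for the email features (assumption: emails.csv is not loaded here).
X, y = make_classification(n_samples=400, n_features=50, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Same hyperparameters as the notebook cells.
models = {
    "SVM": SVC(C=1.0, kernel="rbf", gamma="auto"),
    "KNN": KNeighborsClassifier(n_neighbors=7),
}
results = {
    name: evaluate(model, X_train, X_test, y_train, y_test)
    for name, model in models.items()
}

for name, res in results.items():
    print(f"{name} accuracy: {res['accuracy']:.3f}")
    print(res["confusion_matrix"])
```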

Key concepts: definition + one-line code-specific example

Bag-of-words / term-frequency: Vectorizing text by counts of words; here each row in X is a vector of word frequencies (Cell 4).

Train/Test split: Reserve data for evaluation; test_size=0.25 gives a held-out test set (Cell 6).

Support Vector Machine: Margin-based classifier suitable for high-dimensional features; used via SVC(C=1.0, kernel='rbf', gamma='auto') and fit/predict (Cell 7).

K-Nearest Neighbors: Predicts label by majority of k closest training samples; used with n_neighbors=7 (Cell 9).

Feature scaling: Rescales features to comparable ranges. It is missing here; it is strongly recommended for KNN (distances are otherwise dominated by high-count words) and often helpful for SVM.

Confusion matrix: 2×2 matrix of TP/FP/TN/FN to inspect error types (Cells 8 & 10).

Precision/Recall/F1: Metrics shown in classification_report ‚Äî interpret precision vs recall tradeoffs (Cells 8 & 10).

Curse of dimensionality: As dimension increases (3000 features), distance measures become less informative, which affects KNN performance (Cells 4 & 9).

Overfitting/Underfitting: If a model performs much better on train than test, it overfits. The notebook reports only test scores, so checking this requires also computing the training score, e.g. svc.score(X_train, y_train).

Hyperparameter tuning: Choosing the kernel and C for SVM, or k for KNN. It is not implemented here; use cross-validation to tune.
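Feature scaling and hyperparameter tuning are both noted above as missing from the notebook. The sketch below shows one common way to add them for KNN, using a scikit-learn pipeline with `StandardScaler` and `GridSearchCV`. The data is a synthetic stand-in, and the candidate values of k, the 5-fold setting, and the F1 scoring choice are illustrative assumptions, not values tuned for emails.csv.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the word-count matrix (assumption, not emails.csv).
X, y = make_classification(
    n_samples=400, n_features=50, n_informative=10, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# The scaler lives inside the pipeline so it is re-fit on each CV training
# fold, which keeps validation-fold statistics out of training.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Illustrative grid of k values; make_pipeline names the step after its class.
param_grid = {"kneighborsclassifier__n_neighbors": [3, 5, 7, 9, 11]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
best_model = search.best_estimator_  # refit on the full training set

# A large gap between these two accuracies would suggest overfitting.
train_acc = best_model.score(X_train, y_train)
test_acc = best_model.score(X_test, y_test)
print("best k:", best_k)
print(f"train accuracy: {train_acc:.3f}  test accuracy: {test_acc:.3f}")
```

The same pattern should carry over to the SVM by swapping in `SVC()` and a grid over `svc__C` and `svc__kernel`.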


1. K-Nearest Neighbors (KNN)
Concept:

KNN is a lazy learning algorithm: it doesn't learn a model in advance.
Instead, it stores all the training data and makes predictions only when needed.

When a new data point comes, KNN:

Looks at the K closest points in the training data (using a distance metric, usually Euclidean distance).

Checks the majority class among those K neighbors.

Assigns that class to the new data point.
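The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not how KNeighborsClassifier is implemented; the function name `knn_predict` and the toy two-feature data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 1: Euclidean distance from the query to every stored training point.
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Step 2: keep the k closest points.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: majority class among those neighbours.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy 2-feature data (hypothetical word counts); 1 = spam, 0 = not spam.
train_X = [(0, 1), (1, 0), (1, 1), (8, 9), (9, 8), (9, 9)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, query=(8, 8), k=3))  # 1
print(knn_predict(train_X, train_y, query=(0, 0), k=3))  # 0
```

Note that there is no training step: all the work happens at prediction time, which is why KNN is called lazy.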

2. Support Vector Machine (SVM)
Concept:

SVM tries to find the best dividing boundary (hyperplane) that separates different classes with the maximum margin.

Think of it as drawing a line (2D) or plane (3D) that divides the data as cleanly as possible, keeping the widest possible gap between the two classes.
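To make the "widest gap" idea concrete, the sketch below checks the margin constraints and computes the margin width 2/||w|| for a hand-picked line in 2D. The points, w, and b are hypothetical values chosen for illustration; this is the linear case only, whereas the notebook's SVC uses an RBF kernel and finds its boundary by optimization.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b; its sign says which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# Hypothetical 2-D linearly separable points with labels +1 / -1.
points = [((2, 2), 1), ((3, 3), 1), ((0, 0), -1), ((-1, 0), -1)]

# Hand-picked line x1 + x2 = 2, scaled so the closest points score exactly +1 or -1.
w, b = (0.5, 0.5), -1.0

# Every point satisfies the margin constraint y * (w.x + b) >= 1.
constraints_ok = all(y * decision_value(w, b, x) >= 1 for x, y in points)
print(constraints_ok)  # True

# Width of the gap between the two margin lines is 2 / ||w||.
margin_width = 2 / math.hypot(*w)
print(round(margin_width, 3))  # 2.828
```

A smaller ||w|| gives a wider gap, which is why the training objective minimizes ||w|| subject to the constraints.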