# Email Classification Using Machine Learning  
**Course:** CC6057 – Applied Machine Learning  
**Student:** Pawan Pokhrel (23048667)

---

## 1. Introduction
Email generates large volumes of unstructured text.  
Automatically sorting messages into categories such as *spam*, *promotions*, *forum*, *social media*, and *verification codes* makes an inbox easier to use and helps flag unwanted or risky mail. The dataset used here has six categories (IDs 0 to 5).

This notebook demonstrates an end-to-end machine learning pipeline for **email classification**, including:
- Text preprocessing
- Feature extraction using TF-IDF
- Model training (Logistic Regression and SVM)
- Model evaluation
- Manual prediction using the trained model


## 2. Import Required Libraries

In [42]:

import pandas as pd
import numpy as np
import re # for regex operations
import string # for string manipulation

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

import nltk

## 3. Download NLTK Resources

In [43]:

nltk.download('stopwords')
nltk.download('wordnet') # lemmatization
nltk.download('omw-1.4') # WordNet corpus

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


## 4. Load Dataset

In [44]:

# Load dataset CSV file into a DataFrame
df = pd.read_csv('full_dataset.csv')
df.head()


| | id | subject | body | text | category | category_id |
|---|---|---|---|---|---|---|
| 0 | promotions_582 | Anniversary Special: Buy one get one free | As our loyal customer, get exclusive $60 off $... | Anniversary Special: Buy one get one free As o... | promotions | 1 |
| 1 | spam_1629 | Your Amazon was used on new device | Your $5000 refund is processed. Claim: bit.ly/... | Your Amazon was used on new device Your $5000 ... | spam | 3 |
| 2 | spam_322 | Re: Your Google inquiry | Hi, following up about your Google application... | Re: Your Google inquiry Hi, following up about... | spam | 3 |
| 3 | social_media_80 | Digital Ritual Experience Creation | Cross-cultural ceremony design. Join: virtualr... | Digital Ritual Experience Creation Cross-cultu... | social_media | 2 |
| 4 | forum_1351 | Your post was moved to "Programming Help" | Trending: "cooking" (258 comments). View: supp... | Your post was moved to "Programming Help" Tren... | forum | 0 |


## 5. Text Preprocessing Function

In [45]:

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    """Clean and preprocess text"""
    text = str(text).lower()
    text = re.sub(r'http\S+|www\S+', '', text)  # remove URLs
    text = re.sub(r'<.*?>', '', text)  # remove HTML tags
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)  # remove punctuation
    text = re.sub(r'\d+', '', text)  # remove numbers

    tokens = text.split()  # split text into words
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]  # remove stopwords + lemmatize
    return ' '.join(tokens)

df['clean_text'] = df['text'].apply(preprocess_text)
df[['text', 'clean_text']].head()


| | text | clean_text |
|---|---|---|
| 0 | Anniversary Special: Buy one get one free As o... | anniversary special buy one get one free loyal... |
| 1 | Your Amazon was used on new device Your $5000 ... | amazon used new device refund processed claim ... |
| 2 | Re: Your Google inquiry Hi, following up about... | google inquiry hi following google application... |
| 3 | Digital Ritual Experience Creation Cross-cultu... | digital ritual experience creation crosscultur... |
| 4 | Your post was moved to "Programming Help" Tren... | post moved programming help trending cooking c... |
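
To see what each cleaning step contributes, the sketch below replays the same regex steps on one sentence without NLTK. `clean_without_nltk` and `DEMO_STOP_WORDS` are hypothetical names introduced for illustration; the tiny stopword set is an assumption standing in for NLTK's much longer English list, and lemmatization is skipped.

```python
import re
import string

# Assumption: a tiny stand-in for NLTK's English stopword list.
DEMO_STOP_WORDS = {"this", "to", "in", "the", "your", "is", "was", "on"}

def clean_without_nltk(text):
    """Replay the notebook's regex steps in order, without lemmatization."""
    text = str(text).lower()
    text = re.sub(r'http\S+|www\S+', '', text)  # drop URLs
    text = re.sub(r'<.*?>', '', text)           # drop HTML tags
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)  # drop punctuation
    text = re.sub(r'\d+', '', text)             # drop digits
    # Keep only tokens that are not in the demo stopword set
    return ' '.join(w for w in text.split() if w not in DEMO_STOP_WORDS)

demo = "Use this code: 383921 to complete sign-in. Expires in 5 minutes."
print(clean_without_nltk(demo))
```

Because punctuation is deleted rather than replaced with a space, hyphenated words fuse ("sign-in" becomes "signin", "cross-cultural" becomes "crosscultural"). With the real `preprocess_text`, the lemmatizer would additionally reduce "minutes" to "minute".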


## 6. Train-Test Split

In [46]:

X = df['clean_text']
y = df['category_id']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
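
Passing `stratify=y` asks `train_test_split` to keep each class's share the same in the training and test sets. The sketch below is a hand-rolled illustration of that idea using only the standard library; it is not scikit-learn's implementation, and `stratified_split` and the toy labels are made up for the example.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), holding out about test_size of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)                      # shuffle within the class
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels (made up): 50 of class 0, 30 of class 1, 20 of class 2.
labels = [0] * 50 + [1] * 30 + [2] * 20
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in test_idx))
```

Each class contributes 20% of its samples to the test set, so a class that is rare in the full data stays proportionally represented in both splits.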


## 7. TF-IDF Feature Extraction

In [47]:

# Convert text to numerical features using TF-IDF
tfidf = TfidfVectorizer(max_features=100)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
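
With its default settings, `TfidfVectorizer` weights each term by its count in the document times a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit L2 length. The sketch below reproduces that arithmetic by hand on three made-up documents. `tfidf_rows` is a hypothetical helper, and plain whitespace tokenization is an assumption (the vectorizer's own tokenizer differs, for example by dropping single-character tokens).

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Smoothed, L2-normalised TF-IDF weights for whitespace-tokenised docs."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(tok for toks in tokenised for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}

    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in raw])
    return vocab, rows

# Made-up documents in the style of the cleaned text
docs = ["refund processed claim refund", "post moved programming help", "claim code expires"]
vocab, rows = tfidf_rows(docs)
for tok, weight in zip(vocab, rows[0]):
    if weight > 0:
        print(f"{tok}: {weight:.3f}")
```

In the first document, "refund" gets the largest weight because it occurs twice and appears in no other document, while "claim" is discounted for appearing in two documents. Setting `max_features=100` keeps only the 100 terms that are most frequent across the training corpus.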


## 8. Train Logistic Regression and SVM Models

In [48]:

# Logistic Regression
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train_tfidf, y_train)


In [49]:

# Support Vector Machine (Linear SVM)
svm_model = LinearSVC()
svm_model.fit(X_train_tfidf, y_train)

## 9. Model Evaluation and Confusion Matrix

In [54]:

# Predictions
lr_preds = lr_model.predict(X_test_tfidf)
svm_preds = svm_model.predict(X_test_tfidf)

# Accuracy scores
print('Logistic Regression Accuracy:', accuracy_score(y_test, lr_preds))
print('SVM Accuracy:', accuracy_score(y_test, svm_preds))

# Classification report for SVM
print('\nClassification Report (SVM):\n')
print(classification_report(y_test, svm_preds))


Logistic Regression Accuracy: 0.9558605341246291
SVM Accuracy: 0.9551186943620178

Classification Report (SVM):

              precision    recall  f1-score   support

           0       0.99      0.92      0.95       450
           1       1.00      0.94      0.97       449
           2       0.94      0.97      0.95       449
           3       0.96      0.97      0.97       449
           4       0.88      0.95      0.91       449
           5       0.98      0.98      0.98       450

    accuracy                           0.96      2696
   macro avg       0.96      0.96      0.96      2696
weighted avg       0.96      0.96      0.96      2696



### Confusion Matrix for Logistic Regression

In [59]:
confusion_matrix(y_test, lr_preds)

array([[418,   0,  12,   1,  18,   1],
       [  0, 425,   7,   1,  12,   4],
       [  4,   0, 432,   1,  10,   2],
       [  1,   0,   1, 435,  12,   0],
       [  6,   1,   6,   7, 426,   3],
       [  0,   0,   0,   4,   5, 441]])

### Confusion Matrix for SVM

In [58]:
confusion_matrix(y_test, svm_preds)

array([[412,   0,  17,   1,  19,   1],
       [  0, 423,   7,   3,  12,   4],
       [  0,   0, 436,   1,  10,   2],
       [  1,   0,   0, 434,  14,   0],
       [  5,   0,   6,   7, 427,   4],
       [  0,   0,   0,   4,   3, 443]])
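
The classification report above covers only the SVM. As a sketch, the same per-class figures can be derived for Logistic Regression directly from its confusion matrix, where rows are actual classes and columns are predicted classes. `per_class_metrics` is a hypothetical helper; the matrix is copied from the cell output.

```python
# Logistic Regression confusion matrix copied from the cell output
# (rows = actual class, columns = predicted class).
lr_cm = [
    [418, 0, 12, 1, 18, 1],
    [0, 425, 7, 1, 12, 4],
    [4, 0, 432, 1, 10, 2],
    [1, 0, 1, 435, 12, 0],
    [6, 1, 6, 7, 426, 3],
    [0, 0, 0, 4, 5, 441],
]

def per_class_metrics(cm):
    """Return (accuracy, [(precision, recall), ...]) for a square confusion matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    metrics = []
    for i in range(n):
        predicted_i = sum(cm[r][i] for r in range(n))  # column sum: predicted as class i
        actual_i = sum(cm[i])                           # row sum: truly class i
        precision = cm[i][i] / predicted_i if predicted_i else 0.0
        recall = cm[i][i] / actual_i if actual_i else 0.0
        metrics.append((precision, recall))
    return correct / total, metrics

accuracy, metrics = per_class_metrics(lr_cm)
print(f"accuracy: {accuracy:.4f}")
for class_id, (precision, recall) in enumerate(metrics):
    print(f"class {class_id}: precision={precision:.2f} recall={recall:.2f}")
```

The diagonal sums to 2,577 of 2,696 test emails, which matches the reported Logistic Regression accuracy. In both matrices, column 4 collects the most off-diagonal counts, which is why class 4 has the lowest precision.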

## 10. Manual Prediction on a Single Sample

In [56]:

sample_index = 5
sample_text = X_test.iloc[sample_index]  # preprocessed email text (X holds clean_text)
actual_label = y_test.iloc[sample_index]
print('Original Text:\n' + df.loc[X_test.index[sample_index], 'text'])

# Convert to TF-IDF vector
sample_tfidf = tfidf.transform([sample_text])
predicted_label = svm_model.predict(sample_tfidf)[0]

# Results
print('\nProcessed Email Text:\n', sample_text)
print('\nActual Category ID:', actual_label)
print('Predicted Category ID:', predicted_label)

# Category names
category_map = dict(zip(df['category_id'], df['category']))
print('\nActual Category Name:', category_map[actual_label])
print('Predicted Category Name:', category_map[predicted_label])


Original Text:
Email binding code: 467569 Use this code: 383921 to complete sign-in. Expires in 5 minutes.

Processed Email Text:
 email binding code use code complete signin expires minute

Actual Category ID: 5
Predicted Category ID: 5

Actual Category Name: verify_code
Predicted Category Name: verify_code


## 11. Conclusion
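This notebook demonstrated a complete email classification workflow: text preprocessing, TF-IDF feature extraction, and training two linear classifiers. On the held-out test set of 2,696 emails, Logistic Regression reached about 95.6% accuracy and the linear SVM about 95.5%, so the two models perform comparably on this dataset.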