# Project: Marketing Email Text Classification (NLP)

This project classifies marketing emails into three categories (Promotion,
Transactional, and Engagement) using TF-IDF features and a Logistic Regression classifier.

## Steps:
1. Load dataset of marketing emails
2. Split dataset into train/test
3. Vectorize the text using TF-IDF, removing English stop words (fit on the training set only)
4. Train Logistic Regression classifier
5. Evaluate using accuracy and classification report

## Files:
- marketing_email_classification.ipynb
- emails.csv
- requirements.txt


In [1]:
# Import packages and libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

In [2]:
# Load the dataset

df = pd.read_csv("emails.csv")
df.head()

                                          Email_Text       Category
0  Limited-time offer! Upgrade your plan today fo...      Promotion
1   Your subscription has been renewed successfully.  Transactional
2  Reminder: Complete your profile to get persona...     Engagement
3             Flash sale starts now! Don't miss out.      Promotion
4  Your ticket has been received. Support will re...  Transactional


In [3]:
df.shape

(306, 2)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Email_Text  306 non-null    object
 1   Category    306 non-null    object
dtypes: object(2)
memory usage: 4.9+ KB


In [5]:
df.value_counts("Category")

Category
Transactional    140
Engagement       125
Promotion         41
Name: count, dtype: int64
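
The counts above show a clear class imbalance, which is worth quantifying before reading any accuracy figure. The sketch below only re-uses the three counts printed by `value_counts`; the "majority baseline" is the accuracy a trivial classifier would get by always predicting the most common class, and is offered here as a reference point rather than as part of the original notebook.

```python
# Class counts copied from the value_counts output above.
class_counts = {"Transactional": 140, "Engagement": 125, "Promotion": 41}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}

# Accuracy of a trivial classifier that always predicts the most common class.
majority_label = max(class_counts, key=class_counts.get)
majority_baseline = class_counts[majority_label] / total

for label, share in shares.items():
    print(f"{label:<14}{share:.1%}")
print(f"Majority baseline ({majority_label}): {majority_baseline:.3f}")
```

Promotion makes up roughly 13% of the data, so per-class metrics are likely to be more informative than overall accuracy for that class.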

In [6]:
# Split data

X_train, X_test, y_train, y_test = train_test_split(
    df["Email_Text"], df["Category"], test_size=0.2, random_state=42
)
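
The split above is purely random. With a minority class as small as Promotion, one possible variation is a stratified split, which keeps the class proportions roughly equal in the train and test sets. The sketch below is not what the notebook ran: it uses placeholder texts and stand-in labels that only reproduce the class counts of `emails.csv`, so that it runs without the data file.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Stand-in labels with the same class counts as emails.csv (assumption:
# only the counts matter here; the texts are placeholders).
labels = ["Transactional"] * 140 + ["Engagement"] * 125 + ["Promotion"] * 41
texts = [f"placeholder email {i}" for i in range(len(labels))]

# stratify=labels asks scikit-learn to preserve class proportions in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print("train:", Counter(y_train))
print("test: ", Counter(y_test))
```

In the notebook itself this would amount to passing `stratify=df["Category"]`. Doing so changes which emails land in the test set, so the reported metrics would need to be recomputed.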

In [7]:
# TF-IDF vectorization

vectorizer = TfidfVectorizer(stop_words="english", max_features=500)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

In [8]:
# Train model

model = LogisticRegression()
print(model.fit(X_train_tfidf, y_train))

LogisticRegression()


In [9]:
# Predict
y_pred = model.predict(X_test_tfidf)

In [10]:
# Evaluate

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.9354838709677419

               precision    recall  f1-score   support

   Engagement       0.90      1.00      0.95        26
    Promotion       1.00      0.50      0.67         8
Transactional       0.97      1.00      0.98        28

     accuracy                           0.94        62
    macro avg       0.95      0.83      0.86        62
 weighted avg       0.94      0.94      0.93        62
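
The figures in the report can be cross-checked by hand. The sketch below copies the per-class recall and support from the report and recovers the number of correctly classified emails per class. Because the report rounds to two decimals, the counts are inferred rather than read from the model.

```python
# Recall and support per class, copied from the classification report above.
recall = {"Engagement": 1.00, "Promotion": 0.50, "Transactional": 1.00}
support = {"Engagement": 26, "Promotion": 8, "Transactional": 28}

# recall = correct / support, so correct = recall * support.
# The report is rounded to two decimals, so these counts are inferred.
correct = {label: round(recall[label] * support[label]) for label in support}
total = sum(support.values())

# Accuracy is the share of all test emails classified correctly.
accuracy = sum(correct.values()) / total
# Macro recall is the unweighted mean of the per-class recalls.
macro_recall = sum(recall.values()) / len(recall)

print(correct)
print(f"accuracy     = {accuracy:.4f}")
print(f"macro recall = {macro_recall:.2f}")
```

Under these inferred counts, 58 of the 62 test emails are classified correctly, and all 4 errors are Promotion emails assigned to another class. This is why macro-averaged recall (0.83) sits well below accuracy (0.94).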



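## Results:
- The model achieves an accuracy of 0.935 on the 62-email test set.
- Promotion is the weakest class: its precision is 1.00 but its recall is only 0.50, so half of the 8 Promotion emails in the test set are assigned to other categories. Promotion is also the smallest class in the dataset (41 of 306 emails).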

## Future Scope:

- Expand Dataset – Increase the number of emails and diversify categories to improve model generalization and robustness.
- Advanced NLP Models – Implement transformer-based models like BERT, DistilBERT, or RoBERTa for higher accuracy and better understanding of email context.
- Hyperparameter Optimization – Explore techniques like Grid Search or Random Search to fine-tune vectorization and classifier parameters.
- Feature Engineering – Experiment with additional features such as email metadata (sender, subject line length, time sent) to improve predictions.
- Real-Time Classification – Build a pipeline to classify incoming marketing emails in real time, integrating with a messaging system or email server.
- Sentiment & Intent Analysis – Extend the model to identify sentiment or specific call-to-action intent, providing actionable insights for marketing campaigns.
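
As a starting point for the hyperparameter optimization and pipeline items above, the following is a minimal sketch that wraps the vectorizer and classifier in a scikit-learn `Pipeline` and tunes both with `GridSearchCV`. Everything specific in it is an assumption made for illustration: the twelve toy emails are invented and not taken from `emails.csv`, and the parameter grid, `cv=2`, and the `f1_macro` scoring choice are placeholders rather than recommended settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy emails, invented for illustration; not from emails.csv.
texts = [
    "Flash sale starts now, save big on every plan",
    "Exclusive discount on premium upgrades this weekend",
    "Save twenty percent with this coupon code",
    "Special offer: upgrade today for a lower price",
    "Your payment was received, thank you",
    "Your order has shipped and is on its way",
    "Your password was changed successfully",
    "Invoice for your recent purchase is attached",
    "Complete your profile to get recommendations",
    "We miss you, come back and explore new features",
    "Tell us what you think in a short survey",
    "Reminder: finish setting up your account",
]
labels = ["Promotion"] * 4 + ["Transactional"] * 4 + ["Engagement"] * 4

# Chaining both steps means the vectorizer is refit inside each CV fold,
# so no vocabulary leaks from validation data into training.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid over vectorizer and classifier settings.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
    "clf__class_weight": [None, "balanced"],
}

# Macro F1 weights every class equally, which suits imbalanced labels.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_macro")
search.fit(texts, labels)

print(search.best_params_)
print(search.predict(["Huge sale: discount on all plans"]))
```

On the real data, `texts` and `labels` would be replaced by the `Email_Text` and `Category` columns, with a larger `cv` value. The fitted `search` object can then classify raw email strings directly, which is the piece a real-time service would call.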