# 💌 Email Classification using LLaMA

This notebook uses the TinyLlama 1.1B chat model, run locally through `llama-cpp-python`, to classify emails into one of three categories: `Priority`, `Promotions`, and `Updates`.

## 📦 Step 1: Install & Import Required Packages

In [1]:
# Install the dependencies if they are not already present:
# pip install llama-cpp-python pandas scikit-learn matplotlib seaborn tqdm

from llama_cpp import Llama
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from tqdm import tqdm
import time
import re


pd.set_option('display.max_colwidth', None)

## 📁 Step 2: Load Dataset

In [3]:
# Load only the first 40 rows of the dataset
emails_df = pd.read_csv(r"C:\Users\USER\Documents\my-ai-projects\email_dataset.csv", nrows=40)

# Check structure
emails_df.head()

|   | email_id | email_content | expected_category |
|---|---|---|---|
| 0 | 1 | Flash Sale - 48 Hours Only!\nEverything must go! Massive discounts on all items. Shop now before it's too late! | Promotions |
| 1 | 2 | Monthly Department Updates\nReview this month's KPIs and upcoming projects. New policies attached for review. | Updates |
| 2 | 3 | Monthly Department Updates\nReview this month's KPIs and upcoming projects. New policies attached for review. | Updates |
| 3 | 4 | Urgent: Server Maintenance Required\nOur main server needs immediate maintenance due to critical errors. Please address ASAP. | Priority |
| 4 | 5 | Urgent: Server Maintenance Required\nOur main server needs immediate maintenance due to critical errors. Please address ASAP. | Priority |


In [4]:
emails_df['expected_category'].value_counts()

expected_category
Priority      22
Updates       11
Promotions     7
Name: count, dtype: int64
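
The counts above are imbalanced, so it helps to know what a trivial classifier would score before reading any accuracy number. The sketch below copies the three counts from the output above and computes the accuracy of always predicting the most frequent label. The variable names are illustrative and not part of the notebook.

```python
from collections import Counter

# Label counts copied from the value_counts() output above.
label_counts = Counter({"Priority": 22, "Updates": 11, "Promotions": 7})

total = sum(label_counts.values())
majority_label, majority_count = label_counts.most_common(1)[0]

# Accuracy of a classifier that always answers with the majority label.
baseline_accuracy = majority_count / total

print(f"Majority class: {majority_label}")
print(f"Baseline accuracy: {baseline_accuracy:.0%}")
```

Any model result on these 40 rows is only informative to the extent that it beats this baseline.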

In [5]:
print(emails_df.columns)

Index(['email_id', 'email_content', 'expected_category'], dtype='object')


In [6]:
emails_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   email_id           40 non-null     int64 
 1   email_content      40 non-null     object
 2   expected_category  40 non-null     object
dtypes: int64(1), object(2)
memory usage: 1.1+ KB


## 🧠 Step 3: Load the TinyLLaMA Model

In [7]:
model_path = r"C:\Users\USER\Documents\my-ai-projects\tinyllama-1.1b-chat-v0.3.Q4_K_M.gguf"

llm = Llama(
    model_path=model_path,
    n_ctx=1024,     # context window in tokens; the prompt plus the generated tokens must fit
    verbose=False
)

llama_context: n_ctx_per_seq (1024) < n_ctx_train (2048) -- the full capacity of the model will not be utilized
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
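
Because the model is loaded with `n_ctx=1024`, a long few-shot prompt can exceed the window. Below is a minimal sketch of a length check, assuming a token-counting callable is supplied by the caller. The helper names `fits_in_context` and `rough_token_count` are hypothetical, and the whitespace counter is only a crude stand-in for a real tokenizer.

```python
from typing import Callable


def fits_in_context(
    prompt: str,
    count_tokens: Callable[[str], int],
    n_ctx: int = 1024,
    max_new_tokens: int = 10,
) -> bool:
    """Return True if the prompt plus the tokens to generate fit in the window."""
    return count_tokens(prompt) + max_new_tokens <= n_ctx


def rough_token_count(text: str) -> int:
    """Crude stand-in: counts whitespace-separated words.

    Real tokenizers usually produce more tokens than this, so treat the
    result as a lower bound only.
    """
    return len(text.split())


short_prompt = "Classify this email: Flash Sale - 48 Hours Only!"
long_prompt = "word " * 1020

print(fits_in_context(short_prompt, rough_token_count))
print(fits_in_context(long_prompt, rough_token_count))
```

With the loaded model, the counter could instead be something like `lambda s: len(llm.tokenize(s.encode("utf-8")))`, assuming the installed `llama-cpp-python` version exposes `tokenize` with a bytes argument.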


## 🛠️ Step 4: Define a Good Prompt Template

In [None]:
import random

few_shot_examples = [
    ("Save big on clearance items – up to 70% off!", "Promotions"),
    ("Introducing our new product line – available now!", "Promotions"),
    ("Exclusive deal: Subscribe now and get 2 months free", "Promotions"),
    ("Weekend Special: Free shipping on all orders!", "Promotions"),
    ("Don't miss our Labor Day Sale!", "Promotions"),
    ("Biggest Sale of the Year – Limited Time Offer!", "Promotions"),

    ("Deadline Approaching: Final drafts due by Friday", "Priority"),
    ("Incident Alert: Unexpected downtime in production", "Priority"),
    ("Reminder: Submit budget revisions by 5 PM", "Priority"),
    ("New Office Guidelines: Please review seating update", "Updates"),
    ("Quarterly Newsletter: Message from the CEO", "Updates"),
    ("Staff Bulletin: Remote work policy updates", "Updates"),
]

def build_prompt(email_text):
    # Shuffle a copy so the shared example list is not reordered in place
    shuffled = random.sample(few_shot_examples, k=len(few_shot_examples))
    shots = "\n\n".join([f'Email: "{ex[0]}"\nCategory: {ex[1]}' for ex in shuffled])

    return f"""
You are an intelligent email assistant. Classify the email into one of: Priority, Updates, or Promotions.

Think:
- Does the email promote sales, offers, or marketing content? → Promotions
- Is it internal info or general company updates? → Updates
- Is it time-sensitive or requesting urgent action? → Priority

Examples:
{shots}

Now classify this email:
\"\"\"{email_text}\"\"\"

Category:"""



def classify_email(email_content):
    full_prompt = build_prompt(email_content)
    output = llm(full_prompt, max_tokens=10, temperature=0.0, top_p=0.9)
    raw_text = output['choices'][0]['text'].strip()

    # Extract category
    match = re.search(r'\b(Priority|Updates|Promotions)\b', raw_text, re.IGNORECASE)
    return match.group(1).capitalize() if match else "Unknown"


# Classify every email in the loaded DataFrame
emails_df["predicted_category"] = emails_df["email_content"].apply(classify_email)
# Drop rows where no category could be parsed from the model output
valid_df = emails_df[emails_df["predicted_category"] != "Unknown"]
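
Randomly ordered examples make runs hard to compare, since the same email can see a different prompt each time. A sketch of one way to keep the variation but make it repeatable is to shuffle with a seeded `random.Random` instance. The function name `format_shots` and the `seed` parameter are assumptions for illustration, and the three examples are taken from the list above.

```python
import random


def format_shots(examples, seed=0):
    """Return the few-shot block with examples in a seeded, repeatable order."""
    rng = random.Random(seed)
    # sample() returns a new list, so the caller's list is left untouched.
    shuffled = rng.sample(examples, k=len(examples))
    return "\n\n".join(
        f'Email: "{text}"\nCategory: {label}' for text, label in shuffled
    )


demo_examples = [
    ("Don't miss our Labor Day Sale!", "Promotions"),
    ("Reminder: Submit budget revisions by 5 PM", "Priority"),
    ("Staff Bulletin: Remote work policy updates", "Updates"),
]

print(format_shots(demo_examples, seed=42))
```

Calling the function twice with the same seed yields the same text, which keeps evaluation runs comparable.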

In [None]:
few_shot_prompt = """
You are an intelligent email assistant. Your job is to classify emails into one of three categories:

- "Priority": Time-sensitive or urgent emails that require immediate action.
- "Updates": Informational emails such as newsletters or internal updates.
- "Promotions": Marketing emails such as sales, discounts, or advertisements.

Think:
- Is the email trying to **sell** or **advertise** something? → Promotions
- Is it **internal information**, a **regular update**, or general announcement? → Updates
- Is it **urgent**, **time-sensitive**, or asking for **immediate action**? → Priority

Classify the email strictly as one of: Priority, Updates, or Promotions.

Here are some examples:

---

Email: "Flash Sale - 48 Hours Only! Everything must go! Massive discounts on all items. Shop now before it's too late!"
Category: Promotions

Email: "Get early access to our Black Friday deals – Exclusive to our subscribers!"
Category: Promotions

Email: "Just for you: Buy 1 Get 1 Free this week at our stores."
Category: Promotions

Email: "Special offer ends tonight! Get 30% off all electronics."
Category: Promotions

Email: "Join us for our grand opening – Free samples, music, and giveaways!"
Category: Promotions

Email: "Sign up for our loyalty program and receive 10% off every purchase."
Category: Promotions

Email: "This week only: Buy one, get one 50% off on all footwear!"
Category: Promotions

---

Email: "New Office Guidelines: Please review the new post-pandemic seating arrangement."
Category: Updates

Email: "Quarterly Company Newsletter: A message from the CEO, Q1 highlights, and policy changes."
Category: Updates

Email: "Upcoming Maintenance: Our servers will be down this Sunday from 2–4 AM."
Category: Updates

Email: "Staff Bulletin: Remote work policy updates for Q3."
Category: Updates

Email: "Product Roadmap: Here's what's coming in version 2.5"
Category: Updates

---

Email: "Reminder: Submit Q2 budget revisions by 5 PM today to avoid approval delays."
Category: Priority

Email: "Incident Alert: Unexpected downtime detected in the production environment. Engineering is investigating."
Category: Priority

Email: "Action Required: Confirm your attendance for tomorrow’s board meeting."
Category: Priority

Email: "Payroll Issue: Please contact HR by noon if your payment hasn’t arrived."
Category: Priority

Email: "Deadline Approaching: Final project drafts due by EOD Friday."
Category: Priority

---

Now classify this email:
\"{email_content}\"

Category:"""

LLaMA models are not fine-tuned for classification, so the prompt must state the task and the allowed labels clearly.
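
One way to keep the instruction explicit is to generate it from a single mapping of labels to descriptions, so the label list in the prompt cannot drift from the labels the parser expects. This is a minimal sketch: the descriptions are copied from the prompt above, while `CATEGORY_DESCRIPTIONS`, `build_instruction`, and the closing "Answer with one word" sentence are illustrative choices, not part of the notebook.

```python
# Label descriptions copied from the few-shot prompt above.
CATEGORY_DESCRIPTIONS = {
    "Priority": "Time-sensitive or urgent emails that require immediate action.",
    "Updates": "Informational emails such as newsletters or internal updates.",
    "Promotions": "Marketing emails such as sales, discounts, or advertisements.",
}


def build_instruction(descriptions):
    """Render the task statement and the label list from one mapping."""
    label_lines = "\n".join(
        f'- "{label}": {text}' for label, text in descriptions.items()
    )
    allowed = ", ".join(descriptions)
    return (
        "You are an intelligent email assistant. "
        "Classify the email into exactly one category.\n\n"
        f"{label_lines}\n\n"
        f"Answer with one word from: {allowed}."
    )


instruction = build_instruction(CATEGORY_DESCRIPTIONS)
print(instruction)
```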

## 🧪 Step 5: Run Inference for All Emails

In [None]:
def classify_email(email_content):
    # Inject email into few-shot prompt
    full_prompt = few_shot_prompt.format(email_content=email_content.strip())
    output = llm(full_prompt, max_tokens=10, temperature=0.0, top_p=0.9)
    raw_text = output['choices'][0]['text'].strip()

    # Try to extract clean category
    match = re.search(r'\b(Priority|Updates|Promotions)\b', raw_text, re.IGNORECASE)
    return match.group(1).capitalize() if match else "Unknown"


# Classify every email in the loaded DataFrame
emails_df["predicted_category"] = emails_df["email_content"].apply(classify_email)
# Drop rows where no category could be parsed from the model output
valid_df = emails_df[emails_df["predicted_category"] != "Unknown"]

## 📊 Step 6: Evaluate Performance

In [10]:
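# Evaluate only the rows where a category could be parsed.
# Emails predicted as "Unknown" are excluded, so the scores below do not reflect them.
valid_df = emails_df[emails_df["predicted_category"] != "Unknown"]
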
print(classification_report(valid_df["expected_category"], valid_df["predicted_category"]))
accuracy = accuracy_score(valid_df["expected_category"], valid_df["predicted_category"])
print(f"✅ Accuracy: {accuracy:.2%}")

              precision    recall  f1-score   support

    Priority       1.00      1.00      1.00        14
     Updates       1.00      1.00      1.00         2

    accuracy                           1.00        16
   macro avg       1.00      1.00      1.00        16
weighted avg       1.00      1.00      1.00        16

✅ Accuracy: 100.00%
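
The report above has a total support of 16 while 40 rows were loaded, which suggests the remaining predictions were `Unknown` and were filtered out before scoring. A sketch of reporting coverage alongside accuracy, so that unparsed outputs count against the model, could look like this. The function name and the four toy rows are hypothetical and are not results from the notebook.

```python
def coverage_and_accuracy(expected, predicted, unknown_label="Unknown"):
    """Summarise how many emails got a label and how many labels were right."""
    if len(expected) != len(predicted) or not expected:
        raise ValueError("expected and predicted must be non-empty and equal length")

    total = len(expected)
    answered = sum(p != unknown_label for p in predicted)
    # An "Unknown" prediction never equals a real label, so it counts as wrong.
    correct = sum(e == p for e, p in zip(expected, predicted))

    return {
        "coverage": answered / total,
        "accuracy_on_answered": correct / answered if answered else 0.0,
        "accuracy_overall": correct / total,
    }


# Toy data for illustration only.
toy_expected = ["Priority", "Priority", "Updates", "Promotions"]
toy_predicted = ["Priority", "Unknown", "Updates", "Unknown"]

summary = coverage_and_accuracy(toy_expected, toy_predicted)
print(summary)
```

In the toy data, accuracy on answered rows is perfect while overall accuracy is only half, which is the gap a filtered report hides.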


## 🔁 Step 7: Improve Accuracy (Few-Shot Prompting)

In [None]:
def build_fewshot_prompt(email_text):
    return f"""
You are an intelligent email assistant. Your task is to classify the email content into one of: Priority, Promotions, Updates.

Examples:
Email: "Urgent: Server will go down tonight."  
Category: Priority

Email: "Huge sale: 50% off electronics!"  
Category: Promotions

Email: "Here is your weekly newsletter."  
Category: Updates

Now classify the following email:
\"\"\"
{email_text.strip()}
\"\"\"

Category:"""
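
To compare this shorter prompt with the longer one, the prompt builder and the model call can be passed in as arguments instead of being hard-coded. The sketch below assumes a hypothetical `make_classifier` helper. `tiny_prompt` and `fake_generate` are stand-ins that let it run without a model, and the label parsing mirrors the regex used earlier.

```python
import re

LABELS = ("Priority", "Updates", "Promotions")
LABEL_PATTERN = re.compile(r"\b(" + "|".join(LABELS) + r")\b", re.IGNORECASE)


def make_classifier(build_prompt, generate):
    """Combine a prompt builder and a text generator into a classify function."""

    def classify(email_text):
        raw_text = generate(build_prompt(email_text))
        match = LABEL_PATTERN.search(raw_text)
        return match.group(1).capitalize() if match else "Unknown"

    return classify


# Hypothetical stand-ins so the sketch runs without loading a model.
def tiny_prompt(email_text):
    return f'Email: "{email_text.strip()}"\nCategory:'


def fake_generate(prompt):
    return " promotions" if "sale" in prompt.lower() else " not sure"


classify = make_classifier(tiny_prompt, fake_generate)
print(classify("Huge sale: 50% off electronics!"))
print(classify("Lunch on Thursday?"))
```

With the real model, `generate` could be `lambda p: llm(p, max_tokens=10, temperature=0.0)["choices"][0]["text"]`, matching the call used in Step 5, and `build_prompt` could be `build_fewshot_prompt`.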

In [None]:
categories = ["Priority", "Updates", "Promotions"]
cm = confusion_matrix(valid_df["expected_category"], valid_df["predicted_category"], labels=categories)

plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=categories, yticklabels=categories)
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.tight_layout()
plt.show()
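
Since the matrix above is built from `valid_df`, emails whose output could not be parsed do not appear in it. A small sketch of counting (actual, predicted) pairs with an extra `Unknown` column shows where those emails come from. The toy lists are hypothetical and `confusion_counts` is an illustrative helper, not part of the notebook.

```python
from collections import Counter


def confusion_counts(expected, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(expected, predicted))


# Toy data for illustration only.
toy_expected = ["Priority", "Priority", "Updates", "Promotions"]
toy_predicted = ["Priority", "Unknown", "Updates", "Unknown"]

counts = confusion_counts(toy_expected, toy_predicted)

row_labels = ["Priority", "Updates", "Promotions"]
col_labels = row_labels + ["Unknown"]  # extra column for unparsed outputs

for actual in row_labels:
    # Missing pairs read as 0 because Counter returns 0 for absent keys.
    row = [counts[(actual, guess)] for guess in col_labels]
    print(f"{actual:<11}", row)
```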

In [None]:
report = classification_report(valid_df["expected_category"], valid_df["predicted_category"],
                               labels=categories, output_dict=True)

category_names = categories
f1_scores = [report[cat]["f1-score"] for cat in category_names]

plt.figure(figsize=(6, 4))
sns.barplot(x=category_names, y=f1_scores)
plt.title("F1 Score per Category")
plt.ylabel("F1 Score")
plt.ylim(0, 1)
plt.show()

In [None]:
misclassified = valid_df[valid_df["expected_category"] != valid_df["predicted_category"]]
print("\n❌ Sample Misclassifications:\n")
display(misclassified[["email_content", "expected_category", "predicted_category"]].head(5))

In [None]:
misclassified.shape