# Advanced Predictive Analytics (APA)
## Course Code: MDI-3003
### Laboratory Assignment - 04

# K-Nearest Neighbors Algorithm for Email Classification

## Student Information
- **Name:** Dharshan Raj P A
- **Register Number:** 22MIC0073
- **Date:** 29-08-2025

## Objective
Implement the K-Nearest Neighbors (K-NN) algorithm to classify emails, using a custom dataset of 100 emails evenly split across five categories: Spam, Customer Inquiry, Complaint, Feedback, and Support Request.


## 1. Setup and Import Libraries


In [88]:
# Import required libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re
from collections import Counter

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, precision_recall_fscore_support

print("✅ All libraries imported successfully!")


✅ All libraries imported successfully!


## 2. Dataset Creation


In [89]:
# Create comprehensive email dataset for classification
email_data = {
    "emailSubject": [
        # Spam emails (20)
        "Special Offer Just for You!", "Limited Time Discount", "Exclusive Deal Inside",
        "Get Rich Quick Scheme", "Free Money Now!", "Win $1000 Today!",
        "Urgent: Claim Your Prize", "Act Now - Limited Time", "You've Won!",
        "Free iPhone Giveaway", "Make Money Fast", "Lose Weight in 7 Days",
        "Investment Opportunity", "Work from Home", "Free Credit Report",
        "Viagra Special Offer", "Lottery Winner Notification", "Bank Account Alert",
        "Tax Refund Available", "Free Vacation Package",
        
        # Customer inquiry emails (20)
        "Issue with My Recent Order", "Question About Product", "Inquiry About Warranty",
        "General Inquiry", "Product Information Request", "Service Availability",
        "Pricing Information", "Delivery Status", "Return Policy Question",
        "Technical Specifications", "Compatibility Question", "Installation Help",
        "Account Information", "Billing Question", "Subscription Details",
        "Feature Request", "Upgrade Information", "Support Options",
        "Training Materials", "Documentation Request",
        
        # Complaint emails (20)
        "Complaint About Service", "Complaint About Billing", "Service Quality Complaint",
        "Poor Customer Service", "Product Defect Report", "Delivery Issue",
        "Billing Error", "Refund Request", "Dissatisfied with Product",
        "Late Delivery Complaint", "Wrong Item Received", "Damaged Product",
        "Overcharged on Bill", "Service Interruption", "Technical Problems",
        "Unresponsive Support", "Policy Violation", "Privacy Concern",
        "Security Issue", "Account Suspension Appeal",
        
        # Feedback emails (20)
        "Your Feedback Matters", "Feedback on Our New Product", "Thank You for Your Feedback",
        "Positive Feedback", "Feedback on Website", "User Satisfaction Survey",
        "Product Review Request", "Service Rating", "Improvement Suggestions",
        "Feature Feedback", "User Experience Feedback", "Website Feedback",
        "App Review", "Training Feedback", "Support Feedback",
        "Product Suggestion", "Feature Request", "Bug Report",
        "Performance Feedback", "Overall Experience",
        
        # Support request emails (20)
        "Support Needed for Installation", "Customer Support Needed", "Technical Support Needed",
        "Support Request: Technical Issue", "Account Access Issues", "Password Reset Request",
        "Software Installation Help", "Configuration Assistance", "Troubleshooting Help",
        "System Error Report", "Login Problems", "Feature Not Working",
        "Data Recovery Request", "Backup Assistance", "Migration Help",
        "Integration Support", "API Documentation", "Custom Development",
        "Performance Optimization", "Security Configuration"
    ],
    "emailContent": [
        # Spam content
        "Get 50% off on your next purchase. Click here to claim the offer! Limited time only!",
        "Enjoy a limited time discount of 25% on all products. Don't miss out!",
        "Check out this exclusive deal inside. Limited time offer! Act now!",
        "Make thousands of dollars working from home. No experience required!",
        "You have been selected to receive $1000. Click here to claim your prize!",
        "Congratulations! You've won our grand prize. Claim it now!",
        "Urgent: Your account needs verification. Click here immediately!",
        "Act now before this offer expires. Limited quantities available!",
        "You've been chosen for our exclusive membership program!",
        "Free iPhone 15 Pro Max! Just pay shipping and handling!",
        "Make $5000 per week from home. No investment required!",
        "Lose 30 pounds in 30 days with our miracle diet pill!",
        "Exclusive investment opportunity. Guaranteed 500% returns!",
        "Work from home and earn $200 per day. Start today!",
        "Get your free credit report now. No strings attached!",
        "Special offer on prescription medications. Order now!",
        "You've won the lottery! Claim your prize immediately!",
        "Bank security alert. Verify your account now!",
        "Tax refund available. Click here to claim your money!",
        "Free vacation package to Hawaii. Limited time offer!",
        
        # Customer inquiry content
        "I have an issue with my recent order. Please assist me with this matter.",
        "I have a question about the product specifications. Can you provide more details?",
        "Can you provide information about the product warranty? What does it cover?",
        "I have a general inquiry about your services. Please provide more information.",
        "I would like to know more about your product features and capabilities.",
        "Is this service available in my area? What are the coverage details?",
        "Could you please provide pricing information for your premium package?",
        "What is the current status of my order? When will it be delivered?",
        "What is your return policy? Can I return items if I'm not satisfied?",
        "I need technical specifications for the software. Please provide details.",
        "Is this product compatible with Windows 11? I need to know before purchasing.",
        "I need help with the installation process. Can you provide guidance?",
        "I need to update my account information. How can I do this?",
        "I have a question about my recent bill. Can you explain the charges?",
        "What are the details of my current subscription? When does it expire?",
        "I would like to request a new feature for your application.",
        "What are the benefits of upgrading to the premium version?",
        "What support options are available for enterprise customers?",
        "Do you provide training materials for new users?",
        "Where can I find documentation for your API?",
        
        # Complaint content
        "I am not satisfied with the service provided. Please address my complaint.",
        "There is an issue with my billing. Please help me resolve this.",
        "I am not happy with the quality of service. Please address my complaint.",
        "The customer service I received was poor. I expect better treatment.",
        "I received a defective product. Please provide a replacement or refund.",
        "My order was delivered late. This is unacceptable.",
        "I was charged incorrectly on my bill. Please correct this error.",
        "I would like to request a refund for my recent purchase.",
        "I am not satisfied with the product quality. Please help.",
        "The delivery was extremely late. I need compensation for this delay.",
        "I received the wrong item. Please send me the correct product.",
        "The product arrived damaged. I need a replacement immediately.",
        "I was overcharged on my bill. Please investigate and correct this.",
        "I experienced a service interruption. This caused significant problems.",
        "I am experiencing technical problems with your service.",
        "The support team is not responding to my requests. This is frustrating.",
        "I believe there has been a policy violation. Please investigate.",
        "I have concerns about my privacy and data security.",
        "I discovered a security issue with your system. Please address this.",
        "My account was suspended without explanation. Please review this decision.",
        
        # Feedback content
        "Your feedback is valuable to us. Please share your thoughts on our service.",
        "We would love to hear your feedback on our new product launch.",
        "Thank you for your feedback. We appreciate your input and suggestions.",
        "I am happy with the service and wanted to share positive feedback.",
        "We would like to hear your feedback on our website design and functionality.",
        "We would like to hear your feedback on your recent experience with us.",
        "Please provide a review of the product you purchased. Your opinion matters.",
        "How would you rate our service? Please share your experience.",
        "We welcome your improvement suggestions. Please let us know how we can do better.",
        "We value your feedback on our new features. Please share your thoughts.",
        "How was your user experience? We'd love to hear your feedback.",
        "Please provide feedback on our website. We're always looking to improve.",
        "We'd appreciate your review of our mobile application.",
        "How was the training session? Please share your feedback.",
        "We value your feedback on our support services. Please let us know your thoughts.",
        "Do you have any suggestions for new product features?",
        "We're considering new features. What would you like to see?",
        "We found a bug in our system. Please report any issues you encounter.",
        "How is the performance of our service? Please share your experience.",
        "Overall, how would you rate your experience with our company?",
        
        # Support request content
        "I need support for installing the new software I purchased.",
        "I am facing issues with my recent order and need support.",
        "I need technical support for the new software I'm using.",
        "I am experiencing a technical issue and need support.",
        "I am unable to access my account. Please assist me.",
        "I need to reset my password. Can you help me with this?",
        "I need help with the software installation process.",
        "I need assistance with configuring the system settings.",
        "I need help troubleshooting a problem with the application.",
        "I encountered a system error. Please help me resolve this issue.",
        "I'm having trouble logging into my account. Please assist.",
        "One of the features is not working properly. I need help.",
        "I need help recovering my lost data. Can you assist me?",
        "I need assistance with setting up a backup system.",
        "I need help migrating my data to the new system.",
        "I need support with integrating third-party applications.",
        "I need documentation for your API. Can you provide this?",
        "I need custom development for my specific requirements.",
        "I need help optimizing the performance of my system.",
        "I need assistance with security configuration settings."
    ],
    "emailCategory": [
        # Spam labels (20)
        "Spam", "Spam", "Spam", "Spam", "Spam", "Spam", "Spam", "Spam", "Spam", "Spam",
        "Spam", "Spam", "Spam", "Spam", "Spam", "Spam", "Spam", "Spam", "Spam", "Spam",
        
        # Customer inquiry labels (20)
        "CustomerInquiry", "CustomerInquiry", "CustomerInquiry", "CustomerInquiry", "CustomerInquiry",
        "CustomerInquiry", "CustomerInquiry", "CustomerInquiry", "CustomerInquiry", "CustomerInquiry",
        "CustomerInquiry", "CustomerInquiry", "CustomerInquiry", "CustomerInquiry", "CustomerInquiry",
        "CustomerInquiry", "CustomerInquiry", "CustomerInquiry", "CustomerInquiry", "CustomerInquiry",
        
        # Complaint labels (20)
        "Complaint", "Complaint", "Complaint", "Complaint", "Complaint", "Complaint", "Complaint", "Complaint", "Complaint", "Complaint",
        "Complaint", "Complaint", "Complaint", "Complaint", "Complaint", "Complaint", "Complaint", "Complaint", "Complaint", "Complaint",
        
        # Feedback labels (20)
        "Feedback", "Feedback", "Feedback", "Feedback", "Feedback", "Feedback", "Feedback", "Feedback", "Feedback", "Feedback",
        "Feedback", "Feedback", "Feedback", "Feedback", "Feedback", "Feedback", "Feedback", "Feedback", "Feedback", "Feedback",
        
        # Support request labels (20)
        "SupportRequest", "SupportRequest", "SupportRequest", "SupportRequest", "SupportRequest", "SupportRequest", "SupportRequest", "SupportRequest", "SupportRequest", "SupportRequest",
        "SupportRequest", "SupportRequest", "SupportRequest", "SupportRequest", "SupportRequest", "SupportRequest", "SupportRequest", "SupportRequest", "SupportRequest", "SupportRequest"
    ]
}

# Create DataFrame
email_df = pd.DataFrame(email_data)

print("📧 Email Dataset Information:")
print(f"Total emails: {len(email_df)}")
print(f"Categories: {email_df['emailCategory'].value_counts().to_dict()}")
print("\nFirst few rows:")
print(email_df.head())


📧 Email Dataset Information:
Total emails: 100
Categories: {'Spam': 20, 'CustomerInquiry': 20, 'Complaint': 20, 'Feedback': 20, 'SupportRequest': 20}

First few rows:
                  emailSubject  \
0  Special Offer Just for You!   
1        Limited Time Discount   
2        Exclusive Deal Inside   
3        Get Rich Quick Scheme   
4              Free Money Now!   

                                        emailContent emailCategory  
0  Get 50% off on your next purchase. Click here ...          Spam  
1  Enjoy a limited time discount of 25% on all pr...          Spam  
2  Check out this exclusive deal inside. Limited ...          Spam  
3  Make thousands of dollars working from home. N...          Spam  
4  You have been selected to receive $1000. Click...          Spam  


## 3. Data Preprocessing


In [90]:
# Text preprocessing function
def preprocess_email_text(text):
    """Clean and normalize email text for better classification"""
    # Convert to lowercase
    text = text.lower()
    # Remove everything except letters and whitespace (punctuation, digits, symbols)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Combine subject and content for better classification
email_df['combined_text'] = email_df['emailSubject'] + " " + email_df['emailContent']

# Convert categories to numerical labels
category_mapping = {
    "Spam": 0,
    "CustomerInquiry": 1,
    "Complaint": 2,
    "Feedback": 3,
    "SupportRequest": 4
}

email_df['category_numeric'] = email_df['emailCategory'].map(category_mapping)

# Apply preprocessing
email_df['processed_text'] = email_df['combined_text'].apply(preprocess_email_text)

print("✅ Data preprocessing completed!")
print("\nCategory Mapping:")
for key, value in category_mapping.items():
    print(f"  {key}: {value}")

print(f"\nDataset shape: {email_df.shape}")
print(f"Categories distribution:\n{email_df['emailCategory'].value_counts()}")

print("\nExample of preprocessing:")
print("Original:", email_df['combined_text'].iloc[0][:100] + "...")
print("Processed:", email_df['processed_text'].iloc[0][:100] + "...")


✅ Data preprocessing completed!

Category Mapping:
  Spam: 0
  CustomerInquiry: 1
  Complaint: 2
  Feedback: 3
  SupportRequest: 4

Dataset shape: (100, 6)
Categories distribution:
emailCategory
Spam               20
CustomerInquiry    20
Complaint          20
Feedback           20
SupportRequest     20
Name: count, dtype: int64

Example of preprocessing:
Original: Special Offer Just for You! Get 50% off on your next purchase. Click here to claim the offer! Limite...
Processed: special offer just for you get off on your next purchase click here to claim the offer limited time ...
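
The example output above shows that `50%` vanishes entirely, because the regular expression drops every character that is not a letter or whitespace. The sketch below reproduces that behavior on a made-up sentence and compares it with a hypothetical variant, `preprocess_keep_digits` (not part of the assignment), which keeps digits in case amounts such as `1000` turn out to be useful spam signals.

```python
import re

def preprocess_email_text(text):
    """Same cleaning steps as the notebook's function."""
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

def preprocess_keep_digits(text):
    """Hypothetical variant: keeps digits, still drops punctuation."""
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

# Made-up sample sentence for illustration
sample = "Win $1000 Today! You've won 50% off."
print(preprocess_email_text(sample))   # win today youve won off
print(preprocess_keep_digits(sample))  # win 1000 today youve won 50 off
```

Note that apostrophes are removed without inserting a space, so `You've` becomes the single token `youve` in both versions.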


## 4. Train-Test Split and Feature Extraction


In [91]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    email_df['processed_text'], 
    email_df['category_numeric'], 
    test_size=0.3, 
    random_state=42,
    stratify=email_df['category_numeric']
)

print(f"✅ Training set size: {len(X_train)}")
print(f"✅ Test set size: {len(X_test)}")
print(f"✅ Training set distribution:\n{y_train.value_counts().sort_index()}")
print(f"✅ Test set distribution:\n{y_test.value_counts().sort_index()}")


✅ Training set size: 70
✅ Test set size: 30
✅ Training set distribution:
category_numeric
0    14
1    14
2    14
3    14
4    14
Name: count, dtype: int64
✅ Test set distribution:
category_numeric
0    6
1    6
2    6
3    6
4    6
Name: count, dtype: int64


In [92]:
# TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer(
    max_features=1000,
    stop_words='english',
    ngram_range=(1, 2),
    min_df=2,
    max_df=0.8
)

# Fit and transform training data
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Transform test data
X_test_tfidf = tfidf_vectorizer.transform(X_test)

print(f"✅ TF-IDF matrix shape for training: {X_train_tfidf.shape}")
print(f"✅ TF-IDF matrix shape for test: {X_test_tfidf.shape}")
print(f"✅ Vocabulary size: {len(tfidf_vectorizer.vocabulary_)}")
print("✅ TF-IDF vectorization completed!")


✅ TF-IDF matrix shape for training: (70, 105)
✅ TF-IDF matrix shape for test: (30, 105)
✅ Vocabulary size: 105
✅ TF-IDF vectorization completed!
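
`TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so for these vectors the squared Euclidean distance equals 2 minus twice the cosine similarity. Ranking neighbors by Euclidean distance therefore gives the same order as ranking by cosine similarity. The sketch below checks this identity on made-up vectors using NumPy only; it does not use the assignment data.

```python
import numpy as np

# Three made-up nonnegative "term weight" vectors (illustrative only)
rng = np.random.default_rng(0)
vectors = rng.random((3, 5))

# L2-normalize each row, as TfidfVectorizer does with its default norm='l2'
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

a, b = unit[0], unit[1]
squared_euclidean = np.sum((a - b) ** 2)
cosine_similarity = a @ b

# For unit-length vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
identity_holds = np.isclose(squared_euclidean, 2 - 2 * cosine_similarity)
print(identity_holds)
```

One caveat: an email whose words all fall outside the fitted vocabulary produces an all-zero row, which is not unit length, so the identity does not apply to it.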


## 5. Find Optimal K Value


In [93]:
# Find optimal K using cross-validation
k_values = range(1, 16)
cv_scores = []

print("Finding optimal K value...")
print("K\tCV Score")
print("-" * 20)

for k in k_values:
    knn_classifier = KNeighborsClassifier(n_neighbors=k, metric='euclidean')
    scores = cross_val_score(knn_classifier, X_train_tfidf, y_train, cv=5, scoring='accuracy')
    cv_scores.append(scores.mean())
    print(f"{k}\t{scores.mean():.4f}")

# Find best K
best_k = k_values[np.argmax(cv_scores)]
best_score = max(cv_scores)

print(f"\nBest K value: {best_k}")
print(f"Best CV Score: {best_score:.4f}")


Finding optimal K value...
K	CV Score
--------------------
1	0.5714
2	0.2571
3	0.5429
4	0.6429
5	0.6714
6	0.7000
7	0.7143
8	0.7000
9	0.7286
10	0.7143
11	0.7143
12	0.7000
13	0.6857
14	0.6714
15	0.6143

Best K value: 9
Best CV Score: 0.7286
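
In the loop above, the TF-IDF vocabulary and IDF weights were fitted on the whole training set before cross-validation, so every validation fold had already influenced the features. One common way to avoid this is to wrap the vectorizer and the classifier in a scikit-learn `Pipeline`, which refits the vectorizer inside each fold. The sketch below shows the pattern on a made-up toy corpus (`texts` and `labels` are illustrative assumptions, not the assignment data), so its scores say nothing about the email dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Toy corpus, made up for illustration (two classes, six texts each)
texts = [
    "free prize claim now", "win money fast", "claim your free money",
    "limited offer win prize", "free offer claim now", "win a free prize",
    "need help with login", "password reset help", "cannot access my account",
    "help installing software", "account login problem", "need support with setup",
]
labels = [0] * 6 + [1] * 6

# Shuffled, stratified folds with a fixed seed for repeatability
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

mean_scores = {}
for k in (1, 3, 5):
    # The vectorizer is refit on the training part of every fold
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("knn", KNeighborsClassifier(n_neighbors=k)),
    ])
    scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")
    mean_scores[k] = scores.mean()
    print(f"k={k}: mean CV accuracy = {mean_scores[k]:.4f}")
```

Applied to this notebook, the pipeline would take the raw `processed_text` training Series in place of `X_train_tfidf`.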


## 6. Manual K-NN Implementation


In [94]:
# Manual K-NN implementation
class CustomKNN:
    def __init__(self, k=5):
        self.k = k
        self.X_train = None
        self.y_train = None
    
    def fit(self, X_train, y_train):
        self.X_train = X_train
        # Store labels as a NumPy array so self.y_train[i] indexes by position
        # (a pandas Series would look up index labels and can raise KeyError)
        self.y_train = np.asarray(y_train)
    
    def euclidean_distance(self, x1, x2):
        return np.sqrt(np.sum((x1 - x2) ** 2))
    
    def predict_single(self, x_test):
        distances = []
        
        for i, x_train in enumerate(self.X_train):
            dist = self.euclidean_distance(x_test, x_train)
            distances.append((dist, self.y_train[i]))
        
        distances.sort(key=lambda x: x[0])
        k_neighbors = distances[:self.k]
        neighbor_labels = [label for _, label in k_neighbors]
        
        label_counts = Counter(neighbor_labels)
        predicted_label = label_counts.most_common(1)[0][0]
        
        return predicted_label, k_neighbors
    
    def predict(self, X_test):
        predictions = []
        for x_test in X_test:
            pred, _ = self.predict_single(x_test)
            predictions.append(pred)
        return np.array(predictions)

print("✅ CustomKNN class created successfully!")
print("✅ Labels are stored as a NumPy array, so positional indexing avoids KeyError.")


✅ CustomKNN class created successfully!
✅ Labels are stored as a NumPy array, so positional indexing avoids KeyError.
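
With five classes, and especially with an even `k`, the neighbor vote in `CustomKNN` can tie. `Counter.most_common(1)` returns the label that was counted first among equal counts (Python 3.7+), and because the neighbors are sorted by distance, that is the label of the closest tied neighbor. The sketch below illustrates this with made-up distances.

```python
from collections import Counter

# Made-up (distance, label) pairs, already sorted by distance
k_neighbors = [(0.4, 2), (0.5, 1), (0.7, 1), (0.9, 2)]

neighbor_labels = [label for _, label in k_neighbors]
label_counts = Counter(neighbor_labels)

# Labels 2 and 1 both have two votes; most_common keeps first-seen order
# for equal counts, so the nearest neighbor's label (2) wins the tie.
predicted_label = label_counts.most_common(1)[0][0]
print(predicted_label)  # 2
```

Other K-NN implementations may break ties differently, so predictions can diverge from this class on tied votes even when the neighbors are identical.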


## 7. Model Training and Evaluation


In [95]:
# Train KNN model with best K using sklearn
knn_model = KNeighborsClassifier(n_neighbors=best_k, metric='euclidean')
knn_model.fit(X_train_tfidf, y_train)

# Make predictions
y_pred = knn_model.predict(X_test_tfidf)

# Calculate accuracy
model_accuracy = accuracy_score(y_test, y_pred)

print(f"K-NN Classifier Results (K={best_k}):")
print(f"Test Accuracy: {model_accuracy:.4f}")

# Classification report
target_names = list(category_mapping.keys())
classification_report_result = classification_report(y_test, y_pred, target_names=target_names)
print(f"\nClassification Report:\n{classification_report_result}")

# Confusion matrix
confusion_matrix_result = confusion_matrix(y_test, y_pred)
print(f"Confusion Matrix:\n{confusion_matrix_result}")


K-NN Classifier Results (K=9):
Test Accuracy: 0.7333

Classification Report:
                 precision    recall  f1-score   support

           Spam       0.86      1.00      0.92         6
CustomerInquiry       0.60      0.50      0.55         6
      Complaint       0.67      0.33      0.44         6
       Feedback       0.83      0.83      0.83         6
 SupportRequest       0.67      1.00      0.80         6

       accuracy                           0.73        30
      macro avg       0.72      0.73      0.71        30
   weighted avg       0.72      0.73      0.71        30

Confusion Matrix:
[[6 0 0 0 0]
 [0 3 1 0 2]
 [1 2 2 1 0]
 [0 0 0 5 1]
 [0 0 0 0 6]]
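
The per-class figures in the classification report can be recomputed directly from the confusion matrix: recall divides each diagonal entry by its row sum, precision divides it by its column sum. A minimal NumPy sketch using the matrix printed above:

```python
import numpy as np

# Confusion matrix from the run above (rows = true class, columns = predicted)
cm = np.array([
    [6, 0, 0, 0, 0],
    [0, 3, 1, 0, 2],
    [1, 2, 2, 1, 0],
    [0, 0, 0, 5, 1],
    [0, 0, 0, 0, 6],
])
class_names = ["Spam", "CustomerInquiry", "Complaint", "Feedback", "SupportRequest"]

true_positives = np.diag(cm)
recall = true_positives / cm.sum(axis=1)     # diagonal / row sums
precision = true_positives / cm.sum(axis=0)  # diagonal / column sums
accuracy = true_positives.sum() / cm.sum()

for name, p, r in zip(class_names, precision, recall):
    print(f"{name:<16} precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.4f}")
```

Read row by row, the matrix shows that Complaint is the weakest class: only 2 of 6 complaints are recovered, with 2 predicted as CustomerInquiry, 1 as Spam, and 1 as Feedback.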


In [96]:
# Test manual K-NN implementation
print("🧪 Testing Manual K-NN Implementation...")

# Test with a small subset to verify that the code runs
X_train_small = X_train_tfidf[:15].toarray()
y_train_small = y_train.iloc[:15]  # .iloc makes the positional slice explicit
X_test_small = X_test_tfidf[:5].toarray()
y_test_small = y_test.iloc[:5]

# Create and test the model
custom_knn = CustomKNN(k=3)
custom_knn.fit(X_train_small, y_train_small)

# Make predictions
manual_predictions = custom_knn.predict(X_test_small)

# Display results
print("✅ Manual K-NN Predictions:", manual_predictions)
print("✅ Actual Labels:", y_test_small.values)
print("✅ Manual K-NN Accuracy:", f"{accuracy_score(y_test_small, manual_predictions):.4f}")
# This subset run only checks that the code executes; with 15 training
# samples the accuracy is not a measure of model quality.
print("🎉 The manual implementation ran end to end on the subset.")


🧪 Testing Manual K-NN Implementation...
✅ Manual K-NN Predictions: [1 0 3 0 4]
✅ Actual Labels: [2 2 3 2 1]
✅ Manual K-NN Accuracy: 0.2000
🎉 The manual implementation ran end to end on the subset.
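
`CustomKNN` computes distances one training row at a time in a Python loop. The sketch below is one possible vectorized alternative using NumPy broadcasting; `knn_predict_vectorized` is a hypothetical helper, not part of the assignment. It assumes dense arrays (for example the result of `.toarray()`) and keeps the same tie-breaking rule, where the nearest tied neighbor wins. The data points are made up for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict_vectorized(X_train, y_train, X_test, k=3):
    """Hypothetical helper: K-NN with Euclidean distance on dense arrays."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)

    # Pairwise differences via broadcasting: shape (n_test, n_train, n_features)
    diffs = X_test[:, None, :] - X_train[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))

    # Indices of the k closest training rows per test row, nearest first
    nearest = np.argsort(distances, axis=1, kind="stable")[:, :k]

    # Majority vote; ties go to the label seen first (the nearest neighbor)
    return np.array([
        Counter(y_train[row]).most_common(1)[0][0] for row in nearest
    ])

# Made-up 2-D points for illustration
X_train_demo = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train_demo = [0, 0, 0, 1, 1, 1]
X_test_demo = [[0.2, 0.1], [5.5, 5.5]]

demo_predictions = knn_predict_vectorized(X_train_demo, y_train_demo, X_test_demo, k=3)
print(demo_predictions)  # [0 1]
```

The broadcasted array holds `n_test * n_train * n_features` values, which is fine for 100 emails with 105 features but would need batching on larger datasets.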
