# Email Similarity Analysis using Naive Bayes Classification

In this project, we explore how similar or distinct emails from different categories are, using **scikit-learn's Naive Bayes implementation**. By classifying emails into topics such as *hockey*, *baseball*, and *tech*, we evaluate how easy or difficult it is to tell these topics apart based on text content alone.

## Introduction
### Objectives
- To classify emails into distinct categories using the Naive Bayes algorithm.
- To measure the accuracy of the classifier for various datasets.
- To determine which topics are more difficult to differentiate based on email content.

### Key Questions
1. How challenging is it to differentiate between emails about hockey and baseball?
2. How does the difficulty compare when distinguishing between hockey and tech-related emails?

By analyzing the classifier's performance across multiple datasets, we will uncover insights into the inherent similarities between different topics and the limitations of text-based classification.

### Tools and Techniques
- **Programming Language**: Python
- **Libraries**: scikit-learn, pandas, numpy
- **Methods**: Text preprocessing, word-count (bag-of-words) vectorization, Naive Bayes classification

## First Steps
### Libraries and dataset import
We load the emails with scikit-learn's built-in `fetch_20newsgroups` dataset loader; each email is labeled with the category it belongs to.

In [22]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
import random

emails = fetch_20newsgroups(categories=['rec.sport.baseball', 'rec.sport.hockey'])
print(emails.data[5])       # the email texts are stored in a list called emails.data
print(emails.target_names)  # category names; the position in this list is the label: 0 --> baseball, 1 --> hockey
print(emails.target[5])     # the labels are stored in the array emails.target; this email belongs to rec.sport.hockey

From: mmb@lamar.ColoState.EDU (Michael Burger)
Subject: More TV Info
Distribution: na
Nntp-Posting-Host: lamar.acns.colostate.edu
Organization: Colorado State University, Fort Collins, CO  80523
Lines: 36

United States Coverage:
Sunday April 18
  N.J./N.Y.I. at Pittsburgh - 1:00 EDT to Eastern Time Zone
  ABC - Gary Thorne and Bill Clement

  St. Louis at Chicago - 12:00 CDT and 11:00 MDT - to Central/Mountain Zones
  ABC - Mike Emerick and Jim Schoenfeld

  Los Angeles at Calgary - 12:00 PDT and 11:00 ADT - to Pacific/Alaskan Zones
  ABC - Al Michaels and John Davidson

Tuesday, April 20
  N.J./N.Y.I. at Pittsburgh - 7:30 EDT Nationwide
  ESPN - Gary Thorne and Bill Clement

Thursday, April 22 and Saturday April 24
  To Be Announced - 7:30 EDT Nationwide
  ESPN - To Be Announced


Canadian Coverage:

Sunday, April 18
  Buffalo at Boston - 7:30 EDT Nationwide
  TSN - ???

Tuesday, April 20
  N.J.D./N.Y. at Pittsburgh - 7:30 EDT Nationwide
  TSN - ???

Wednesday, April 21
  St. Louis a

## Train-Test split and data processing

In [23]:
# Load the dataset's predefined training and test subsets
# Training dataset
train_emails = fetch_20newsgroups(
    categories=['rec.sport.baseball', 'rec.sport.hockey'], 
    subset='train', 
    shuffle=True, 
    random_state=108
)

# Test dataset
test_emails = fetch_20newsgroups(
    categories=['rec.sport.baseball', 'rec.sport.hockey'], 
    subset='test', 
    shuffle=True, 
    random_state=108
)


In [24]:
# We want to transform these emails into lists of word counts: we can use the CountVectorizer class provided by scikit-learn to do this

# Create a CountVectorizer object
counter = CountVectorizer()

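# Fit the CountVectorizer to the combined dataset so that both sets share one vocabulary
# (note: this includes test-set words; fitting on the training emails alone would avoid that)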
counter.fit(train_emails.data + test_emails.data)

# Transform the training emails into word count vectors
train_counts = counter.transform(train_emails.data)

# Transform the testing emails into word count vectors
test_counts = counter.transform(test_emails.data)
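
To make the word-count step concrete, here is a minimal pure-Python sketch of what a count vectorizer does conceptually: build a vocabulary, then turn each text into a vector of counts. The helper names `build_vocabulary` and `count_vector` and the toy sentences are illustrative assumptions, not part of scikit-learn, and the whitespace tokenization is much simpler than what `CountVectorizer` does. In this sketch the vocabulary is built from the training texts only, so words that appear only in the test text are ignored.

```python
from collections import Counter

def build_vocabulary(texts):
    """Map each distinct lowercase word to a column index (sorted for determinism)."""
    words = sorted({word for text in texts for word in text.lower().split()})
    return {word: index for index, word in enumerate(words)}

def count_vector(text, vocabulary):
    """Return a list of word counts aligned with the vocabulary's column order."""
    counts = Counter(text.lower().split())
    vector = [0] * len(vocabulary)
    for word, index in vocabulary.items():
        vector[index] = counts[word]  # Counter returns 0 for absent words
    return vector

# Toy data (hypothetical, for illustration only)
train_texts = ["the goalie made a save", "the pitcher threw a strike"]
test_text = "the goalie stopped the puck"

vocabulary = build_vocabulary(train_texts)
test_vector = count_vector(test_text, vocabulary)

print(vocabulary)
print(test_vector)  # "stopped" and "puck" are not in the vocabulary, so they are dropped
```

Each column of the vector corresponds to one vocabulary word, which is the same row-per-email, column-per-word layout that the classifier receives.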



## Training a Naive Bayes classification model

In [25]:
# Create a Multinomial Naive Bayes classifier
classifier = MultinomialNB()

# Train the classifier using the training data and labels
classifier.fit(train_counts, train_emails.target)
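
As a rough illustration of what a multinomial Naive Bayes classifier computes, the sketch below scores each class as its log prior plus the sum of smoothed log word likelihoods, then picks the highest score. The function names `train_naive_bayes` and `predict` and the toy documents are assumptions made for this example; it uses add-one smoothing (`alpha=1.0`, the same default value as `MultinomialNB`) but is not scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocabulary = {word for doc in documents for word in doc.split()}
    word_counts = defaultdict(Counter)  # class -> word -> count
    class_counts = Counter(labels)      # class -> number of documents
    for doc, label in zip(documents, labels):
        word_counts[label].update(doc.split())

    log_prior = {c: math.log(n / len(documents)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Smoothed denominator: total words in the class plus alpha per vocabulary word
        total = sum(word_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocabulary
        }
    return log_prior, log_likelihood

def predict(document, log_prior, log_likelihood):
    """Pick the class with the highest log posterior; unknown words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in document.split() if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Toy training set (hypothetical, for illustration only)
docs = ["puck goalie ice", "goalie save puck", "pitcher bat inning", "bat home run"]
labels = ["hockey", "hockey", "baseball", "baseball"]

log_prior, log_likelihood = train_naive_bayes(docs, labels)
print(predict("goalie stopped the puck", log_prior, log_likelihood))
print(predict("pitcher at bat", log_prior, log_likelihood))
```

Working in log space avoids multiplying many small probabilities together, which would underflow for long emails.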


## Testing the model accuracy

In [26]:
# Test the classifier and print the accuracy
accuracy = classifier.score(test_counts, test_emails.target)
print(f"Classifier Accuracy: {accuracy*100:.2f}%")

Classifier Accuracy: 97.24%
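
For a classifier, `score` reports the mean accuracy, that is, the fraction of test emails whose predicted label matches the true label. A minimal sketch of that calculation follows; the helper name `manual_accuracy` and the label lists are made up for illustration and are not taken from the run above.

```python
def manual_accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("Both label sequences must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Hypothetical labels: 0 = baseball, 1 = hockey
true_labels = [0, 1, 1, 0, 1]
predicted_labels = [0, 1, 0, 0, 1]

example_accuracy = manual_accuracy(true_labels, predicted_labels)
print(f"Classifier Accuracy: {example_accuracy*100:.2f}%")  # 4 of 5 correct
```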


## Testing different categories with a function abstraction

In [27]:
# Function encapsulation

def train_and_evaluate(categories, random_state=108):
    """
    Train and evaluate a Naive Bayes classifier for the given email categories.

    Parameters:
        categories (list): List of categories to classify.
        random_state (int): Random seed for reproducibility.

    Returns:
        float: Accuracy of the classifier on the test set.
    """
    train_emails = fetch_20newsgroups(categories=categories, subset='train', shuffle=True, random_state=random_state)
    test_emails = fetch_20newsgroups(categories=categories, subset='test', shuffle=True, random_state=random_state)

    # Create a CountVectorizer object and fit it to the combined data
    counter = CountVectorizer()
    counter.fit(train_emails.data + test_emails.data)

    # Transform the emails into word count vectors
    train_counts = counter.transform(train_emails.data)
    test_counts = counter.transform(test_emails.data)

    # Create and train the Naive Bayes classifier
    classifier = MultinomialNB()
    classifier.fit(train_counts, train_emails.target)

    # Evaluate the classifier and return the accuracy
    accuracy = classifier.score(test_counts, test_emails.target)
    return accuracy

In [28]:
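# Function to randomly pick two categories and evaluate
def evaluate_random_categories(num_trials=5):
    """
    Evaluate the classifier on randomly chosen pairs of categories.

    Relies on the global list all_categories, which must be defined
    before this function is called.

    Returns:
        list: (categories, accuracy) tuples, one per trial.
    """
    results = []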
    for _ in range(num_trials):
        # Pick two random categories
        categories = random.sample(all_categories, 2)
        # Evaluate accuracy
        accuracy = train_and_evaluate(categories)
        results.append((categories, accuracy))
        print(f"Categories: {categories}, Accuracy: {accuracy:.2f}")
    return results
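
Because `random.sample` is called without a seed, each run of this function evaluates different category pairs. The sketch below shows two possible alternatives under stated assumptions: drawing pairs from a dedicated, seeded `random.Random` instance so runs are repeatable, and enumerating every pair with `itertools.combinations`. The helper name `sample_category_pairs`, the seed value, and the shortened category list are illustrative choices.

```python
import random
from itertools import combinations

# A shortened category list, for illustration only
categories = ['rec.sport.baseball', 'rec.sport.hockey', 'sci.space', 'comp.graphics']

def sample_category_pairs(categories, num_trials, seed=0):
    """Draw num_trials random pairs using a seeded generator for repeatability."""
    rng = random.Random(seed)
    return [rng.sample(categories, 2) for _ in range(num_trials)]

# The same seed yields the same pairs on every run
pairs_run_1 = sample_category_pairs(categories, num_trials=3)
pairs_run_2 = sample_category_pairs(categories, num_trials=3)
print(pairs_run_1 == pairs_run_2)  # True

# Alternatively, enumerate every unordered pair instead of sampling
all_pairs = list(combinations(categories, 2))
print(len(all_pairs))  # 6 pairs from 4 categories
```

With the full list of 20 categories there are 190 unordered pairs, so exhaustive evaluation is feasible but takes considerably longer than five random trials.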

In [29]:
# List of possible categories
all_categories = [
    'alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 
    'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x',
    'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 
    'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 
    'sci.space', 'soc.religion.christian', 'talk.politics.guns', 
    'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc'
]

# Run the function with 5 random trials
random_results = evaluate_random_categories(5)

Categories: ['rec.sport.hockey', 'sci.space'], Accuracy: 0.99
Categories: ['alt.atheism', 'talk.politics.misc'], Accuracy: 0.96
Categories: ['comp.sys.mac.hardware', 'comp.windows.x'], Accuracy: 0.96
Categories: ['comp.windows.x', 'comp.graphics'], Accuracy: 0.86
Categories: ['alt.atheism', 'talk.politics.mideast'], Accuracy: 0.95
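
To see which pairs were hardest to separate, the results can be sorted by accuracy. The sketch below copies the five accuracies printed above (rounded to two decimals, as shown in the output) into a list named `printed_results`, which is an illustrative name; a different random draw would produce different pairs.

```python
# Results copied from the printed output of the run above
printed_results = [
    (['rec.sport.hockey', 'sci.space'], 0.99),
    (['alt.atheism', 'talk.politics.misc'], 0.96),
    (['comp.sys.mac.hardware', 'comp.windows.x'], 0.96),
    (['comp.windows.x', 'comp.graphics'], 0.86),
    (['alt.atheism', 'talk.politics.mideast'], 0.95),
]

# Sort from lowest accuracy (hardest to distinguish) to highest
ranked = sorted(printed_results, key=lambda item: item[1])
for categories, accuracy in ranked:
    print(f"{accuracy:.2f}  {categories[0]} vs {categories[1]}")

hardest_pair, lowest_accuracy = ranked[0]
easiest_pair, highest_accuracy = ranked[-1]
```

In this particular run, the two computing-related categories `comp.windows.x` and `comp.graphics` had the lowest accuracy (0.86), while `rec.sport.hockey` versus `sci.space` had the highest (0.99).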
