# Email similarity

In this project, I will use scikit-learn's Naive Bayes implementation to classify text from several datasets. My primary objective is to assess the classifier's accuracy and determine how difficult it is to distinguish between different categories of text documents. Specifically, I will compare how challenging it is to separate emails about hockey from emails about baseball with how challenging it is to separate emails about hockey from emails about technology.

**Key Project Objectives:**

1. **Exploring the Data**: I will begin by examining a dataset of emails, each labeled based on its content. I'll investigate the categories represented in the dataset to understand the scope of classification.

2. **Selecting Categories**: My focus will be on evaluating the effectiveness of a Naive Bayes classifier in distinguishing between baseball and hockey-related emails. I will specify the relevant categories for my analysis.

3. **Data Inspection**: I'll take a closer look at one of the emails within the chosen categories to understand the content.

4. **Label Analysis**: To determine whether an email is related to baseball or hockey, I'll inspect the labels associated with the emails and map them to their respective categories.

5. **Creating Training and Test Sets**: I will load separate training and test subsets of the data to facilitate model training and evaluation. Passing the same `random_state` value keeps the shuffled order of each subset consistent across runs.

6. **Counting Words**: I will transform the emails into lists of word counts using the `CountVectorizer` class, enabling me to work with numerical data.

7. **Building a Naive Bayes Classifier**: I'll create a Multinomial Naive Bayes classifier and train it using the training data and associated labels.

8. **Model Evaluation**: I will assess the performance of the Naive Bayes Classifier by calculating its accuracy score on the test data. Accuracy measures the percentage of correct classifications made by the classifier.

9. **Testing Other Datasets**: To further explore the classifier's capabilities, I will apply it to a different pair of topics and assess its performance. I'll examine a scenario where the topics are more distinct to see how the classifier's accuracy changes.

Throughout this project, I aim to gain insights into the effectiveness of the Multinomial Naive Bayes classifier in classifying text documents and the challenges it may encounter when distinguishing between different categories of content.

-----

## Import Libraries

In [2]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

## Load Data

In [3]:
# Fetch the 20 newsgroups dataset for 'rec.sport.baseball' and 'rec.sport.hockey'
emails = fetch_20newsgroups(categories=['rec.sport.baseball', 'rec.sport.hockey'])

## Explore Data

In [4]:
# Print sample email
print("Sample Email:")
print(emails.data[5])

Sample Email:
From: mmb@lamar.ColoState.EDU (Michael Burger)
Subject: More TV Info
Distribution: na
Nntp-Posting-Host: lamar.acns.colostate.edu
Organization: Colorado State University, Fort Collins, CO  80523
Lines: 36

United States Coverage:
Sunday April 18
  N.J./N.Y.I. at Pittsburgh - 1:00 EDT to Eastern Time Zone
  ABC - Gary Thorne and Bill Clement

  St. Louis at Chicago - 12:00 CDT and 11:00 MDT - to Central/Mountain Zones
  ABC - Mike Emerick and Jim Schoenfeld

  Los Angeles at Calgary - 12:00 PDT and 11:00 ADT - to Pacific/Alaskan Zones
  ABC - Al Michaels and John Davidson

Tuesday, April 20
  N.J./N.Y.I. at Pittsburgh - 7:30 EDT Nationwide
  ESPN - Gary Thorne and Bill Clement

Thursday, April 22 and Saturday April 24
  To Be Announced - 7:30 EDT Nationwide
  ESPN - To Be Announced


Canadian Coverage:

Sunday, April 18
  Buffalo at Boston - 7:30 EDT Nationwide
  TSN - ???

Tuesday, April 20
  N.J.D./N.Y. at Pittsburgh - 7:30 EDT Nationwide
  TSN - ???

Wednesday, April 21

In [5]:
# Print target label
print("Target Label:", emails.target[5])

Target Label: 1
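
The integer label can be mapped back to a newsgroup name. In scikit-learn, the loaded dataset exposes the category names as `target_names`, indexed by label. The sketch below hard-codes that list as an assumption (so it runs without downloading the data) instead of reading it from `emails`, and `label_to_category` is a hypothetical helper name.

```python
# Stand-in for emails.target_names. Assumption: the two selected categories
# appear in alphabetical order, so 0 is baseball and 1 is hockey.
target_names = ['rec.sport.baseball', 'rec.sport.hockey']


def label_to_category(label, names=target_names):
    """Return the newsgroup name for an integer target label."""
    return names[label]


sample_label = 1  # the label printed for emails.data[5]
print(label_to_category(sample_label))
```

Under that assumption, label `1` corresponds to the hockey newsgroup, which matches the content of the sample email (NHL television coverage).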


## Data Wrangling

In [6]:
# Load the predefined training and test subsets (random_state fixes the shuffle order)
train_emails = fetch_20newsgroups(categories=['rec.sport.baseball', 'rec.sport.hockey'], subset='train', shuffle=True, random_state=108)
test_emails = fetch_20newsgroups(categories=['rec.sport.baseball', 'rec.sport.hockey'], subset='test', shuffle=True, random_state=108)

# Save the training and test labels
train_labels = train_emails.target
test_labels = test_emails.target
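
In these calls, `random_state` does not decide which posts belong to the training or test subset; it only fixes the order in which each subset is shuffled. As a rough illustration of why a fixed seed gives repeatable results, the sketch below shuffles a hypothetical list of document indices twice with the same seed. It uses Python's `random` module, not scikit-learn's own shuffling, so the actual order will differ from what `fetch_20newsgroups` produces.

```python
import random


def shuffled(items, seed):
    """Return a shuffled copy of items using a dedicated seeded generator."""
    rng = random.Random(seed)
    copy = list(items)
    rng.shuffle(copy)
    return copy


doc_ids = list(range(10))  # hypothetical document indices

first = shuffled(doc_ids, seed=108)
second = shuffled(doc_ids, seed=108)

print(first == second)           # same seed, same order
print(sorted(first) == doc_ids)  # shuffling reorders but keeps every item
```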

In [7]:
# Count words using CountVectorizer
counter = CountVectorizer()
# Build the vocabulary from both subsets so every test word has a column
counter.fit(test_emails.data + train_emails.data)

# Save the training and test counts
train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)
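
To make the fit and transform steps concrete, here is a minimal pure-Python sketch of the same idea on two toy sentences. It only approximates `CountVectorizer`'s defaults (lowercasing and keeping tokens of two or more word characters) and returns plain lists, whereas scikit-learn returns a sparse matrix. The function names and the toy documents are assumptions for illustration.

```python
import re
from collections import Counter

# Approximates CountVectorizer's default tokenization: 2+ word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(documents):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(documents, vocabulary):
    """Turn each document into a list of counts, one per vocabulary entry."""
    ordered = sorted(vocabulary, key=vocabulary.get)
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        # Tokens outside the vocabulary are ignored; missing ones count as 0.
        rows.append([counts[tok] for tok in ordered])
    return rows


docs = ["The puck hit the post", "A home run in the ninth"]  # toy examples
vocab = fit_vocabulary(docs)
print(vocab)
print(transform(docs, vocab))
```

Each row has one entry per vocabulary word, so every document becomes a vector of the same length regardless of how long the original text was.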

## Naive Bayes Classifier

In [8]:
# Create and train a Naive Bayes Classifier
classifier = MultinomialNB()
classifier.fit(train_counts, train_labels)
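
Conceptually, fitting a multinomial Naive Bayes model means estimating a prior for each class and a smoothed probability for each word within each class; prediction then picks the class with the highest log score. The sketch below is a simplified pure-Python version of that idea, not scikit-learn's implementation. It assumes additive smoothing with `alpha=1.0`, and the toy vocabulary, counts, and function names are hypothetical.

```python
import math


def train_multinomial_nb(count_rows, labels, alpha=1.0):
    """Estimate log priors and smoothed log word probabilities per class."""
    classes = sorted(set(labels))
    n_features = len(count_rows[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [row for row, y in zip(count_rows, labels) if y == c]
        log_prior[c] = math.log(len(rows) / len(count_rows))
        # Total count of each word across all documents of class c
        totals = [sum(col) for col in zip(*rows)]
        denom = sum(totals) + alpha * n_features
        log_likelihood[c] = [math.log((t + alpha) / denom) for t in totals]
    return log_prior, log_likelihood


def predict(count_row, log_prior, log_likelihood):
    """Return the class with the highest log posterior score."""
    def score(c):
        return log_prior[c] + sum(
            n * lp for n, lp in zip(count_row, log_likelihood[c])
        )
    return max(log_prior, key=score)


# Toy counts over a hypothetical vocabulary: [bat, goal, inning, puck]
train_rows = [[2, 0, 1, 0], [1, 0, 2, 0], [0, 2, 0, 1], [0, 1, 0, 3]]
train_y = [0, 0, 1, 1]  # 0 = baseball, 1 = hockey

priors, likelihoods = train_multinomial_nb(train_rows, train_y)
print(predict([1, 0, 1, 0], priors, likelihoods))  # document with bat + inning
print(predict([0, 1, 0, 2], priors, likelihoods))  # document with goal + puck
```

The smoothing term keeps a word that never appeared in one class from forcing that class's probability to zero.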

In [9]:
# Evaluate the classifier's accuracy on the test data
accuracy = classifier.score(test_counts, test_labels)
print(f"The accuracy of the classifier on the test data is {accuracy:.3f}")

The accuracy of the classifier on the test data is 0.972
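
For a classifier, `score` reports mean accuracy: the number of correct predictions divided by the total number of test examples. A minimal sketch of that calculation, using hypothetical prediction and label lists and a hypothetical helper name `fraction_correct`:

```python
def fraction_correct(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


predicted = [1, 0, 1, 1, 0]  # hypothetical predictions
actual = [1, 0, 0, 1, 0]     # hypothetical true labels
print(f"{fraction_correct(predicted, actual):.3f}")  # 4 of 5 correct -> 0.800
```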


**Summary:** 
The classifier distinguishes baseball emails from hockey emails well, reaching 97.2% accuracy on the test set.

## Testing with Different Datasets

In [10]:
print("\nTesting with a different dataset:")

# Load the predefined training and test subsets (random_state fixes the shuffle order)
train_emails = fetch_20newsgroups(categories=['comp.sys.ibm.pc.hardware', 'rec.sport.hockey'], subset='train', shuffle=True, random_state=108)
test_emails = fetch_20newsgroups(categories=['comp.sys.ibm.pc.hardware', 'rec.sport.hockey'], subset='test', shuffle=True, random_state=108)

# Save the training and test labels
train_labels = train_emails.target
test_labels = test_emails.target

# Count words using CountVectorizer
counter = CountVectorizer()
# Build the vocabulary from both subsets so every test word has a column
counter.fit(test_emails.data + train_emails.data)

# Save the training and test counts
train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)

# Create and train a Naive Bayes Classifier
classifier = MultinomialNB()
classifier.fit(train_counts, train_labels)

# Evaluate the classifier's accuracy on the test data
accuracy = classifier.score(test_counts, test_labels)
print(f"The accuracy of the classifier on the test data is {accuracy:.3f}")


Testing with a different dataset:


The accuracy of the classifier on the test data is 0.997


**Summary:** 
The classifier does even better at distinguishing IBM PC hardware emails from hockey emails, reaching 99.7% accuracy on the test set, compared with 97.2% for baseball versus hockey.