# Email Similarity

In this project, scikit-learn's Naive Bayes classifier is used to distinguish the difference between emails about different categories.

## 1. Explore the Data

In [15]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

In [16]:
emails = fetch_20newsgroups()

print(emails.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


See how effective the Naive Bayes classifier is at distinguishing the difference between a baseball email and a hockey email.

In [17]:
# Import the baseball and hockey emails
emails = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'])

In [18]:
# Emails are stored in a list called emails.data
print(emails.data[5])

# Labels are stored in a list called emails.target
print(emails.target[5])

From: mmb@lamar.ColoState.EDU (Michael Burger)
Subject: More TV Info
Distribution: na
Nntp-Posting-Host: lamar.acns.colostate.edu
Organization: Colorado State University, Fort Collins, CO  80523
Lines: 36

United States Coverage:
Sunday April 18
  N.J./N.Y.I. at Pittsburgh - 1:00 EDT to Eastern Time Zone
  ABC - Gary Thorne and Bill Clement

  St. Louis at Chicago - 12:00 CDT and 11:00 MDT - to Central/Mountain Zones
  ABC - Mike Emerick and Jim Schoenfeld

  Los Angeles at Calgary - 12:00 PDT and 11:00 ADT - to Pacific/Alaskan Zones
  ABC - Al Michaels and John Davidson

Tuesday, April 20
  N.J./N.Y.I. at Pittsburgh - 7:30 EDT Nationwide
  ESPN - Gary Thorne and Bill Clement

Thursday, April 22 and Saturday April 24
  To Be Announced - 7:30 EDT Nationwide
  ESPN - To Be Announced


Canadian Coverage:

Sunday, April 18
  Buffalo at Boston - 7:30 EDT Nationwide
  TSN - ???

Tuesday, April 20
  N.J.D./N.Y. at Pittsburgh - 7:30 EDT Nationwide
  TSN - ???

Wednesday, April 21
  St. Louis a

## 2. Make the Training and Test Sets

In [19]:
# Split the data into training and test sets
train_emails = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'],\
                                  subset = 'train', shuffle = True, random_state = 1)

test_emails = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'],\
                                 subset = 'test', shuffle = True, random_state = 1)

## 3. Count Words

In [20]:
# Create a CountVectorizer object
counter = CountVectorizer()

# Tell counter what possible words can exist in our emails
counter.fit(test_emails.data + train_emails.data)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [21]:
# Transform these emails into lists of word counts
train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)

## 3. Make a Naive Bayes Classifier

In [22]:
# Create a MultinomialNB object
classifier = MultinomialNB()

# Train the classifier with the training set and the lables associated with
classifier.fit(train_counts, train_emails.target)

# Test the classifier on the test set with accuracy metric
print(classifier.score(test_counts, test_emails.target))

0.9723618090452262


## 4. Test Other Datasets

In [23]:
emails1 = fetch_20newsgroups(categories = ['comp.sys.ibm.pc.hardware', 'rec.sport.hockey'])

train_emails1 = fetch_20newsgroups(categories = ['comp.sys.ibm.pc.hardware', 'rec.sport.hockey'],\
                                   subset = 'train', shuffle = True, random_state = 1)

test_emails1 = fetch_20newsgroups(categories = ['comp.sys.ibm.pc.hardware', 'rec.sport.hockey'],\
                                  subset = 'test', shuffle = True, random_state = 1)

counter1 = CountVectorizer()
counter1.fit(test_emails1.data + train_emails1.data)

train_counts1 = counter.transform(train_emails1.data)
test_counts1 = counter.transform(test_emails1.data)

classifier1 = MultinomialNB()
classifier1.fit(train_counts1, train_emails1.target)

print(classifier1.score(test_counts1, test_emails1.target))

0.9860935524652339


It looks like the classifier do a better job at distinguishing the difference between emails about tech and emails about hockey.