# Email Similarity

In this project, you will use scikit-learn’s **Naive Bayes** implementation on several different datasets. By comparing the classifier’s accuracy across datasets, we can find which topics are harder to tell apart. For example, how difficult do you think it is to distinguish emails about baseball from emails about hockey? How hard is it to distinguish emails about hockey from emails about tech? In this project, we’ll measure how difficult those two tasks are.

In [2]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

## Exploring the Data

In [24]:
emails = fetch_20newsgroups()

print(emails.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [28]:
# Select targets of interest
emails = fetch_20newsgroups(categories=['rec.sport.baseball', 'rec.sport.hockey'])

print(emails.target_names)

['rec.sport.baseball', 'rec.sport.hockey']


In [34]:
# Explore specific data point and check the label matches
print(emails.data[5])

print(emails.target[5]) # This email should correspond with the target index (0=baseball, 1=hockey)

From: mmb@lamar.ColoState.EDU (Michael Burger)
Subject: More TV Info
Distribution: na
Nntp-Posting-Host: lamar.acns.colostate.edu
Organization: Colorado State University, Fort Collins, CO  80523
Lines: 36

United States Coverage:
Sunday April 18
  N.J./N.Y.I. at Pittsburgh - 1:00 EDT to Eastern Time Zone
  ABC - Gary Thorne and Bill Clement

  St. Louis at Chicago - 12:00 CDT and 11:00 MDT - to Central/Mountain Zones
  ABC - Mike Emerick and Jim Schoenfeld

  Los Angeles at Calgary - 12:00 PDT and 11:00 ADT - to Pacific/Alaskan Zones
  ABC - Al Michaels and John Davidson

Tuesday, April 20
  N.J./N.Y.I. at Pittsburgh - 7:30 EDT Nationwide
  ESPN - Gary Thorne and Bill Clement

Thursday, April 22 and Saturday April 24
  To Be Announced - 7:30 EDT Nationwide
  ESPN - To Be Announced


Canadian Coverage:

Sunday, April 18
  Buffalo at Boston - 7:30 EDT Nationwide
  TSN - ???

Tuesday, April 20
  N.J.D./N.Y. at Pittsburgh - 7:30 EDT Nationwide
  TSN - ???

Wednesday, April 21
  St. Louis a

## Making the Training and Test Sets

Replace the single emails variable with two variables, train_emails and test_emails, each created by its own call to fetch_20newsgroups(). Add these three parameters to both calls:

+ subset='train' for train_emails, subset='test' for test_emails
+ shuffle=True
+ random_state=108

In [19]:
# split into train/test sets
train_emails = fetch_20newsgroups(categories=['rec.sport.baseball', 'rec.sport.hockey'], subset='train', shuffle=True, random_state=108)

test_emails = fetch_20newsgroups(categories=['rec.sport.baseball', 'rec.sport.hockey'], subset='test', shuffle=True, random_state=108)

## Counting Words

In [21]:
# Build the vocabulary from every word that appears in the training and test emails
counter = CountVectorizer()
counter.fit(test_emails.data + train_emails.data)

In [38]:
# Count how often each vocabulary word appears in each email (one sparse-matrix row per email)
train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)
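
To make the counting step concrete, here is a minimal sketch on a two-sentence toy corpus. The sentences and the names docs, toy_counter, and toy_counts are made up for illustration and are not part of the newsgroups data. It shows that fitting learns a vocabulary and that transforming produces one row of word counts per document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the newsgroups data
docs = ["the puck hit the net", "the bat hit the ball"]

toy_counter = CountVectorizer()
toy_counts = toy_counter.fit_transform(docs)  # learn the vocabulary, then count

# Vocabulary words listed in column order
vocab = sorted(toy_counter.vocabulary_, key=toy_counter.vocabulary_.get)
print(vocab)
# ['ball', 'bat', 'hit', 'net', 'puck', 'the']

# One row per document, one column per vocabulary word
print(toy_counts.toarray())
# [[0 0 1 1 1 2]
#  [1 1 1 0 0 2]]
```

Each column corresponds to one vocabulary word, so the final column shows that "the" appears twice in both sentences.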

## Making a Naive Bayes Classifier

The classifier’s .fit() method takes two parameters: the training set (train_counts) and the labels associated with the training emails (train_emails.target).

In [42]:
classifier = MultinomialNB()
classifier.fit(train_counts, train_emails.target)

In [44]:
# Test the classifier
# .score() takes the test set and the test labels and returns the classifier's accuracy on the test data.
# Accuracy is the fraction of test emails that the classifier labeled correctly.
print(classifier.score(test_counts, test_emails.target))

0.9723618090452262
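
As a rough check on what .score() reports, the sketch below computes accuracy by hand on a tiny made-up dataset and compares it with the value from .score(). The texts, labels, and variable names are hypothetical. Unlike the cells above, this sketch builds the vocabulary from the training texts only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: 0 = baseball, 1 = hockey
train_texts = [
    "pitcher threw a strike",
    "home run in the ninth inning",
    "goalie blocked the puck",
    "power play goal on ice",
]
train_labels = np.array([0, 0, 1, 1])
test_texts = ["the pitcher hit a home run", "puck on the ice"]
test_labels = np.array([0, 1])

vec = CountVectorizer()
X_train = vec.fit_transform(train_texts)  # vocabulary from training texts only
X_test = vec.transform(test_texts)

clf = MultinomialNB().fit(X_train, train_labels)

# Accuracy by hand: the fraction of predictions that match the true labels
predictions = clf.predict(X_test)
manual_accuracy = np.mean(predictions == test_labels)

print(manual_accuracy, clf.score(X_test, test_labels))
```

The two printed numbers should agree, since .score() on a classifier returns the mean accuracy over the samples it is given.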


## Testing Other Datasets

Change the categories used to create train_emails and test_emails to try other combinations:

+ 'alt.atheism'
+ 'comp.graphics'
+ 'comp.os.ms-windows.misc'
+ 'comp.sys.ibm.pc.hardware'
+ 'comp.sys.mac.hardware'
+ 'comp.windows.x'
+ 'misc.forsale'
+ 'rec.autos'
+ 'rec.motorcycles'
+ 'rec.sport.baseball'
+ 'rec.sport.hockey'
+ 'sci.crypt'
+ 'sci.electronics'
+ 'sci.med'
+ 'sci.space'
+ 'soc.religion.christian'
+ 'talk.politics.guns'
+ 'talk.politics.mideast'
+ 'talk.politics.misc'
+ 'talk.religion.misc'
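
One way to try many combinations without editing the cells each time is to wrap the steps above in a function. The sketch below is only an illustration: score_categories, compare_pairs, and the fetch parameter are hypothetical names introduced here, and the fetch parameter exists only so that a different loader can be swapped in.

```python
from itertools import combinations

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def score_categories(categories, fetch=fetch_20newsgroups):
    """Return Naive Bayes test accuracy for the given newsgroup categories.

    `fetch` is a hypothetical hook; by default it downloads the real data.
    """
    train = fetch(categories=categories, subset='train', shuffle=True, random_state=108)
    test = fetch(categories=categories, subset='test', shuffle=True, random_state=108)

    # Same steps as the cells above: build the vocabulary, count, fit, score
    counter = CountVectorizer()
    counter.fit(test.data + train.data)
    train_counts = counter.transform(train.data)
    test_counts = counter.transform(test.data)

    classifier = MultinomialNB()
    classifier.fit(train_counts, train.target)
    return classifier.score(test_counts, test.target)


def compare_pairs(category_names, fetch=fetch_20newsgroups):
    """Score every pair of categories and return {(name_a, name_b): accuracy}."""
    return {
        pair: score_categories(list(pair), fetch)
        for pair in combinations(category_names, 2)
    }
```

Calling score_categories(['rec.sport.hockey', 'comp.sys.ibm.pc.hardware']) would fetch the real data on first use, so it needs a network connection. A lower accuracy for a pair suggests that its two topics are harder to tell apart.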