# Email Similarity

In this project, you will use scikit-learn’s Naive Bayes implementation on several different datasets. 
By reporting the accuracy of the classifier, we can find which datasets are harder to distinguish. 

For example:
- How difficult do you think it is to tell emails about hockey apart from emails about baseball?
- How hard is it to tell emails about hockey apart from emails about tech?

In this project, we’ll find out exactly how difficult those two tasks are.

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

### Load the Data

In [2]:
emails = fetch_20newsgroups()

In [3]:
# categories
print(emails.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


### Exploring the Data

We’re interested in seeing how effective our Naive Bayes classifier is at telling the difference between <b>a baseball email and a hockey email</b>.

In [4]:
emails = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'])
print(emails.target_names)

['rec.sport.baseball', 'rec.sport.hockey']


In [5]:
# the email at index 5 in the list.
print(emails.data[5])

From: mmb@lamar.ColoState.EDU (Michael Burger)
Subject: More TV Info
Distribution: na
Nntp-Posting-Host: lamar.acns.colostate.edu
Organization: Colorado State University, Fort Collins, CO  80523
Lines: 36

United States Coverage:
Sunday April 18
  N.J./N.Y.I. at Pittsburgh - 1:00 EDT to Eastern Time Zone
  ABC - Gary Thorne and Bill Clement

  St. Louis at Chicago - 12:00 CDT and 11:00 MDT - to Central/Mountain Zones
  ABC - Mike Emerick and Jim Schoenfeld

  Los Angeles at Calgary - 12:00 PDT and 11:00 ADT - to Pacific/Alaskan Zones
  ABC - Al Michaels and John Davidson

Tuesday, April 20
  N.J./N.Y.I. at Pittsburgh - 7:30 EDT Nationwide
  ESPN - Gary Thorne and Bill Clement

Thursday, April 22 and Saturday April 24
  To Be Announced - 7:30 EDT Nationwide
  ESPN - To Be Announced


Canadian Coverage:

Sunday, April 18
  Buffalo at Boston - 7:30 EDT Nationwide
  TSN - ???

Tuesday, April 20
  N.J.D./N.Y. at Pittsburgh - 7:30 EDT Nationwide
  TSN - ???

Wednesday, April 21
  St. Louis a

In [6]:
# the label of the email at index 5
print(emails.target[5])

1


The label 1 means emails.data[5] is an email about <b>hockey</b>. <i>Each label is an index into the target names list, so ['rec.sport.baseball', 'rec.sport.hockey'] corresponds to the labels [0, 1].</i>
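The lookup from a numeric label to its category name can be sketched as below. The `targets` list is made up for illustration; in the notebook the labels come from `emails.target`.

```python
# Category names in the order printed above; a label is an index into this list.
target_names = ['rec.sport.baseball', 'rec.sport.hockey']

# Hypothetical labels standing in for a few entries of emails.target.
targets = [1, 0, 1]

# Translate each numeric label into its category name.
labels = [target_names[t] for t in targets]
print(labels)
```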

### Making the Training and Test Sets

In [7]:
# Training subset, shuffled with a fixed seed so the order is reproducible
train_emails = fetch_20newsgroups(
    categories = ['rec.sport.baseball', 'rec.sport.hockey'], 
    subset = 'train',
    shuffle = True,
    random_state=108
)

In [8]:
# Test subset, loaded with the same categories and seed
test_emails = fetch_20newsgroups(
    categories = ['rec.sport.baseball', 'rec.sport.hockey'], 
    subset = 'test',
    shuffle = True,
    random_state=108
)

In [9]:
print(train_emails.target)

[1 1 0 ... 1 1 0]


In [10]:
print(test_emails.target)

[1 0 0 1 1 0 1 1 1 1 0 0 0 1 0 1 1 1 1 0 0 0 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1
 1 1 1 0 0 1 1 0 1 1 1 0 1 0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 1 0 0 0 1 1 1 1
 0 1 1 0 0 0 0 0 0 1 1 0 1 1 1 0 0 1 0 1 1 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0
 0 0 1 1 1 1 0 0 0 1 1 0 0 1 0 0 0 1 1 1 0 0 1 1 0 0 0 0 0 0 0 1 1 0 1 0 1
 1 0 0 1 1 0 1 0 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 0 0 1 0 1 0 0 1 1 1 1 1 0
 0 0 1 1 0 0 0 1 0 1 0 1 1 1 0 0 0 1 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 0 1 0
 0 1 0 1 0 1 1 0 1 0 1 1 0 0 0 1 1 1 0 1 1 0 0 0 0 1 0 1 1 1 1 0 0 0 0 0 0
 0 0 0 1 1 0 1 1 0 1 1 1 1 0 1 1 0 1 0 1 0 1 0 1 1 0 1 1 0 0 1 1 0 0 0 0 0
 1 0 0 1 1 0 1 0 1 0 1 0 0 0 1 1 0 1 1 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1
 1 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 1 0 1 0 1 0 1 1 1 1 1 0 0 0 0 0 0 0
 1 1 0 0 0 1 1 1 1 0 1 1 0 0 1 1 1 1 0 0 0 1 0 1 0 0 1 1 1 1 1 1 0 0 1 0 0
 1 0 0 0 1 0 1 0 1 0 0 0 1 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 1 1 1 1 0 1 0 1 0
 0 0 0 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 1 1 0 1 1 0 0 1 1 1 1 1 1
 1 0 1 0 1 1 0 0 0 0 1 0 

## Counting Words

In [11]:
counter = CountVectorizer()
counter.fit(test_emails.data + train_emails.data)

train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)
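To see what `CountVectorizer` produces, here is a minimal sketch on two made-up sentences (they are not taken from the newsgroups data). It assumes the default settings, which lowercase the text and keep words of two or more characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents standing in for emails.
docs = ["the puck hit the goalie", "the pitcher threw the ball"]

counter = CountVectorizer()
counter.fit(docs)  # learn the vocabulary: one column per distinct word

# Words ordered by their column index (alphabetical by default).
vocabulary = sorted(counter.vocabulary_, key=counter.vocabulary_.get)
print(vocabulary)

# Each row is a document, each column a word, each value a count.
counts = counter.transform(docs).toarray()
print(counts)

# Words that were not seen during fit are ignored by transform.
unseen = counter.transform(["the zamboni"]).toarray()
print(unseen)
```

The last call shows why the vocabulary matters: a word absent from the fitted vocabulary contributes nothing to the counts.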

In [12]:
print(train_counts)

  (0, 833)	1
  (0, 949)	1
  (0, 1029)	1
  (0, 1417)	1
  (0, 1573)	1
  (0, 2137)	1
  (0, 2322)	1
  (0, 2501)	1
  (0, 3521)	1
  (0, 3910)	3
  (0, 3982)	1
  (0, 4071)	2
  (0, 4103)	2
  (0, 4187)	1
  (0, 4309)	5
  (0, 4396)	1
  (0, 4400)	1
  (0, 4945)	1
  (0, 5512)	2
  (0, 5896)	1
  (0, 6104)	2
  (0, 6473)	1
  (0, 6588)	4
  (0, 6988)	1
  (0, 7081)	1
  :	:
  (1196, 18104)	1
  (1196, 19058)	1
  (1196, 19376)	1
  (1196, 19764)	1
  (1196, 20289)	1
  (1196, 20487)	2
  (1196, 20595)	1
  (1196, 20880)	1
  (1196, 20929)	34
  (1196, 21481)	2
  (1196, 21484)	1
  (1196, 21541)	2
  (1196, 21733)	2
  (1196, 21745)	1
  (1196, 22121)	1
  (1196, 22173)	1
  (1196, 22174)	1
  (1196, 22351)	1
  (1196, 22656)	1
  (1196, 22911)	1
  (1196, 23044)	1
  (1196, 23134)	1
  (1196, 23144)	1
  (1196, 23186)	1
  (1196, 23579)	1


In [13]:
print(test_counts)

  (0, 3514)	1
  (0, 3778)	1
  (0, 4071)	1
  (0, 4309)	1
  (0, 4458)	1
  (0, 4496)	1
  (0, 4982)	1
  (0, 5611)	1
  (0, 5896)	1
  (0, 7549)	1
  (0, 9016)	1
  (0, 9052)	2
  (0, 10005)	1
  (0, 10220)	1
  (0, 10306)	1
  (0, 10396)	1
  (0, 10745)	1
  (0, 10802)	1
  (0, 10808)	1
  (0, 11348)	1
  (0, 11974)	2
  (0, 12139)	1
  (0, 12167)	1
  (0, 12220)	1
  (0, 13206)	1
  :	:
  (795, 21530)	3
  (795, 21561)	1
  (795, 21662)	1
  (795, 21733)	1
  (795, 21801)	1
  (795, 22236)	1
  (795, 22284)	2
  (795, 22290)	1
  (795, 22351)	1
  (795, 22357)	1
  (795, 22499)	1
  (795, 22655)	1
  (795, 22773)	3
  (795, 22941)	1
  (795, 23092)	1
  (795, 23142)	1
  (795, 23198)	3
  (795, 23303)	2
  (795, 23409)	1
  (795, 23445)	1
  (795, 23454)	1
  (795, 23550)	1
  (795, 23615)	4
  (795, 23619)	1
  (795, 23625)	1
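The printed output above is a sparse matrix: each line reads as (document row, word column) followed by the count. A small sketch, using made-up documents, of how those entries could be mapped back to words:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents; the real notebook uses train_emails.data instead.
docs = ["goal goal save", "home run"]

counter = CountVectorizer()
counts = counter.fit_transform(docs)

# Shape is (number of documents, vocabulary size).
print(counts.shape)

# Invert the vocabulary so a column index can be turned back into a word.
index_to_word = {index: word for word, index in counter.vocabulary_.items()}

# Walk the non-zero entries: the same (row, column) and value that print() shows.
coo = counts.tocoo()
entries = sorted(
    (int(row), index_to_word[int(col)], int(value))
    for row, col, value in zip(coo.row, coo.col, coo.data)
)
print(entries)
```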


## Making a Naive Bayes Classifier

In [14]:
classifier = MultinomialNB()
classifier.fit(train_counts, train_emails.target)

print(classifier.score(test_counts, test_emails.target))

0.9723618090452262


The score is the accuracy: the classifier labels every email in the test set, and each prediction is compared with the email's actual label. Here about 97% of the predictions are correct.
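As a sketch of what `score` computes, the toy example below (invented texts and labels, not the newsgroups data) checks that the accuracy equals the fraction of predictions that match the true labels.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented training and test texts; 0 = baseball, 1 = hockey.
train_texts = ["puck ice goalie", "ice rink puck", "pitcher bat inning", "bat home run"]
train_labels = [1, 1, 0, 0]
test_texts = ["goalie on the ice", "pitcher hit a home run"]
test_labels = [1, 0]

counter = CountVectorizer()
train_counts = counter.fit_transform(train_texts)
test_counts = counter.transform(test_texts)

classifier = MultinomialNB()
classifier.fit(train_counts, train_labels)

# Accuracy by hand: the share of predictions equal to the true label.
predictions = classifier.predict(test_counts)
manual_accuracy = np.mean(predictions == np.array(test_labels))

print(manual_accuracy, classifier.score(test_counts, test_labels))
```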

### Testing Other Datasets

Our classifier does a pretty good job distinguishing between baseball emails and hockey emails. But let’s see how it does with emails about really different topics.

In [15]:
train_emails = fetch_20newsgroups(
    categories = ['comp.sys.ibm.pc.hardware','rec.sport.hockey'], 
    subset = 'train',
    shuffle = True,
    random_state=108
)

test_emails = fetch_20newsgroups(
    categories = ['comp.sys.ibm.pc.hardware','rec.sport.hockey'], 
    subset = 'test',
    shuffle = True,
    random_state=108
)

# counting words
counter = CountVectorizer()
counter.fit(test_emails.data + train_emails.data)

train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)

# Bayes Classification
classifier = MultinomialNB()
classifier.fit(train_counts, train_emails.target)

print(classifier.score(test_counts, test_emails.target))

0.9974715549936789


The classifier was about 99.7% accurate when distinguishing hockey emails from tech emails.

This is better than the roughly 97% it reached on hockey versus baseball. That makes sense: emails about two sports probably share more words with each other than sports emails share with emails about PC hardware.
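The idea that similar topics share more words can be illustrated with a rough overlap measure. The sketch below uses invented sentences and the Jaccard similarity of their word sets; it is only an illustration of the intuition, not a measurement on the actual newsgroups.

```python
def vocabulary(texts):
    """Return the set of lowercase words across a list of texts."""
    return {word for text in texts for word in text.lower().split()}


def jaccard(a, b):
    """Share of words common to both sets, out of all words in either set."""
    return len(a & b) / len(a | b)


# Invented sentences standing in for three topics.
hockey = ["the team scored in the final game", "the goalie made a save"]
baseball = ["the team scored in the ninth inning", "the pitcher threw a strike"]
tech = ["my hard drive controller fails on boot", "which bios setting enables the port"]

sports_overlap = jaccard(vocabulary(hockey), vocabulary(baseball))
tech_overlap = jaccard(vocabulary(hockey), vocabulary(tech))

print(sports_overlap, tech_overlap)
```

A higher overlap means the word counts of the two classes look more alike, which gives the classifier less to work with.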

### Extra
Play around with different sets of data. Can you find a set that’s incredibly accurate or incredibly inaccurate?

Your classifier can work even when there are more than two labels. Try setting categories equal to a list of three or four of the categories.
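Since the same steps repeat for every set of categories, they could be wrapped in a helper. The function name `naive_bayes_accuracy` is hypothetical, and the three-class texts below are invented so the sketch runs without downloading anything.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def naive_bayes_accuracy(train_texts, train_labels, test_texts, test_labels):
    """Count words, fit MultinomialNB on the training set, return test accuracy."""
    counter = CountVectorizer()
    # Mirrors the cells above: the vocabulary is built from both subsets.
    counter.fit(list(test_texts) + list(train_texts))
    train_counts = counter.transform(train_texts)
    test_counts = counter.transform(test_texts)

    classifier = MultinomialNB()
    classifier.fit(train_counts, train_labels)
    return classifier.score(test_counts, test_labels)


# Invented three-class data: 0 = space, 1 = religion, 2 = politics.
train_texts = [
    "rocket orbit launch moon",
    "church faith prayer bible",
    "election vote senate law",
]
train_labels = [0, 1, 2]
test_texts = ["moon launch", "prayer faith", "senate vote"]
test_labels = [0, 1, 2]

accuracy = naive_bayes_accuracy(train_texts, train_labels, test_texts, test_labels)
print(accuracy)
```

With the real data, the same helper could be called as `naive_bayes_accuracy(train_emails.data, train_emails.target, test_emails.data, test_emails.target)`.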

In [29]:
email_trial = fetch_20newsgroups()
print(len(email_trial.data))

11314


In [24]:
train_emails = fetch_20newsgroups(
    categories = ['soc.religion.christian', 'talk.politics.misc', 'sci.space'], 
    subset = 'train',
    shuffle = True,
    random_state=108
)

test_emails = fetch_20newsgroups(
    categories = ['soc.religion.christian', 'talk.politics.misc', 'sci.space'], 
    subset = 'test',
    shuffle = True,
    random_state=108
)

# counting words
counter = CountVectorizer()
counter.fit(test_emails.data + train_emails.data)

train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)

# Bayes Classification
classifier = MultinomialNB()
classifier.fit(train_counts, train_emails.target)

print(classifier.score(test_counts, test_emails.target))

0.9609800362976406


The classifier was about 96% accurate when distinguishing among soc.religion.christian, talk.politics.misc and sci.space.

In [25]:
train_emails = fetch_20newsgroups(
    categories = ['soc.religion.christian', 'talk.religion.misc', 'talk.politics.mideast', 'sci.med'], 
    subset = 'train',
    shuffle = True,
    random_state=108
)

test_emails = fetch_20newsgroups(
    categories = ['soc.religion.christian', 'talk.religion.misc', 'talk.politics.mideast', 'sci.med'], 
    subset = 'test',
    shuffle = True,
    random_state=108
)

# counting words
counter = CountVectorizer()
counter.fit(test_emails.data + train_emails.data)

train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)

# Bayes Classification
classifier = MultinomialNB()
classifier.fit(train_counts, train_emails.target)

print(classifier.score(test_counts, test_emails.target))

0.9106263194933145


The classifier was about 91% accurate when distinguishing among soc.religion.christian, talk.religion.misc, talk.politics.mideast and sci.med.

This is lower than the roughly 96% for soc.religion.christian, talk.politics.misc and sci.space. A likely reason is that soc.religion.christian, talk.religion.misc and talk.politics.mideast share many of the same words, while the science categories do not.
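One way to check which categories get mixed up is a confusion matrix. The sketch below uses invented true and predicted labels (they are not results from this notebook) to show how `sklearn.metrics.confusion_matrix` is read; with the real model, the inputs would be `test_emails.target` and `classifier.predict(test_counts)`.

```python
from sklearn.metrics import confusion_matrix

# Invented labels for three classes: 0 and 1 are similar topics, 2 is distinct.
true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
predicted_labels = [0, 0, 1, 0, 1, 1, 2, 2, 2]

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(true_labels, predicted_labels)
print(matrix)
```

Off-diagonal entries count the mistakes, so large values between two rows and columns point to a pair of categories the classifier struggles to separate.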