# Machine Learning: Supervised Learning
# Email Similarity: a Naive Bayes problem
### Eleazar I. Madariaga González
### 16/07/2020

In this project, I'll use scikit-learn’s Naive Bayes implementation on several different datasets. By reporting the accuracy of the classifier, we can find which datasets are harder to distinguish. For example:  
How difficult do you think it is to tell apart emails about hockey and emails about baseball?  
How hard is it to tell apart emails about hockey and emails about tech?  
In this project, I’ll find out exactly how difficult those two tasks are.

## 1: Exploring the Data

I’ve imported a dataset of emails from scikit-learn’s datasets. All of these emails are tagged based on their content. 

In [3]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

emails = fetch_20newsgroups()
# Print emails.target_names to see the different categories
print(emails.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


We’re interested in seeing how effective our Naive Bayes classifier is at telling the difference between a baseball email and a hockey email. We can select the categories of articles we want from __fetch_20newsgroups__ by adding the parameter __categories__.  
In the function call, set categories equal to the list __['rec.sport.baseball', 'rec.sport.hockey']__

In [4]:
emails = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'])
print(emails.target_names)

['rec.sport.baseball', 'rec.sport.hockey']


Let’s take a look at one of these emails. All of the emails are stored in a list called __emails.data__.

In [5]:
# Print the email at index 5 in the list
print(emails.data[5])

From: mmb@lamar.ColoState.EDU (Michael Burger)
Subject: More TV Info
Distribution: na
Nntp-Posting-Host: lamar.acns.colostate.edu
Organization: Colorado State University, Fort Collins, CO  80523
Lines: 36

United States Coverage:
Sunday April 18
  N.J./N.Y.I. at Pittsburgh - 1:00 EDT to Eastern Time Zone
  ABC - Gary Thorne and Bill Clement

  St. Louis at Chicago - 12:00 CDT and 11:00 MDT - to Central/Mountain Zones
  ABC - Mike Emerick and Jim Schoenfeld

  Los Angeles at Calgary - 12:00 PDT and 11:00 ADT - to Pacific/Alaskan Zones
  ABC - Al Michaels and John Davidson

Tuesday, April 20
  N.J./N.Y.I. at Pittsburgh - 7:30 EDT Nationwide
  ESPN - Gary Thorne and Bill Clement

Thursday, April 22 and Saturday April 24
  To Be Announced - 7:30 EDT Nationwide
  ESPN - To Be Announced


Canadian Coverage:

Sunday, April 18
  Buffalo at Boston - 7:30 EDT Nationwide
  TSN - ???

Tuesday, April 20
  N.J.D./N.Y. at Pittsburgh - 7:30 EDT Nationwide
  TSN - ???

Wednesday, April 21
  St. Louis a

All of the labels can be found in the list __emails.target__.

In [6]:
# Print the label of the email at index 5
print(emails.target[5])

1


The labels themselves are numbers, but those numbers correspond to the label names found at __emails.target_names__.  
Is this a baseball email or a hockey email?

In [7]:
# The target of email 5 is 1, which corresponds to rec.sport.hockey
print(emails.target_names)

['rec.sport.baseball', 'rec.sport.hockey']
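
The lookup from a numeric label to its category name can also be written directly. This is a minimal sketch: `target_names` mirrors the output above, while the short `target` list is a hypothetical stand-in for __emails.target__, hard-coded so the snippet runs without downloading the dataset.

```python
# Stand-ins for emails.target_names and emails.target.
# The target values are hypothetical, chosen only for illustration.
target_names = ['rec.sport.baseball', 'rec.sport.hockey']
target = [0, 1, 1, 0, 0, 1]

# A label is simply an index into target_names.
label = target[5]
print(label, '->', target_names[label])
```

With the real data, the same idea would read `emails.target_names[emails.target[5]]`.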


## 2: Making the Training and Test Sets

We now want to split our data into training and test sets. Change the name of our variable from __emails__ to __train_emails__. Add these three parameters to the function call:

* subset='train'
* shuffle = True
* random_state = 108

Adding the __random_state__ parameter makes sure that every time we run the code, the emails are shuffled into the same order, so the results are reproducible.

In [10]:
train_emails = fetch_20newsgroups(
    categories = ['rec.sport.baseball', 'rec.sport.hockey'], 
    subset = 'train', 
    shuffle =  True,
    random_state = 108)

Create another variable named __test_emails__ and set it equal to __fetch_20newsgroups__. The parameters of the function should be the same as before except __subset__ should now be __'test'__.

In [11]:
test_emails = fetch_20newsgroups(
    categories = ['rec.sport.baseball', 'rec.sport.hockey'], 
    subset = 'test', 
    shuffle =  True,
    random_state = 108)
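
The effect of a fixed seed on shuffling can be seen in isolation. The sketch below is only an analogy: it uses Python's standard `random` module rather than scikit-learn's internal random state, and the helper `shuffled_indices` is a hypothetical name introduced for illustration.

```python
import random

def shuffled_indices(n, seed):
    """Return the indices 0..n-1 shuffled with a fixed seed."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return indices

first_run = shuffled_indices(10, seed=108)
second_run = shuffled_indices(10, seed=108)

# The same seed gives the same order on every run.
print(first_run == second_run)
# Shuffling only reorders: every index is still present exactly once.
print(sorted(first_run) == list(range(10)))
```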

## 3: Counting Words

We want to transform these emails into lists of word counts. The __CountVectorizer__ class makes this easy for us.  
Create a __CountVectorizer__ object and name it __counter__.

In [12]:
counter = CountVectorizer()

We need to tell __counter__ what possible words can exist in our emails. __counter__ has a __.fit()__ function that takes a list of all our data.  
Call __.fit()__ with __test_emails.data__ + __train_emails.data__ as a parameter.

In [13]:
counter.fit(test_emails.data + train_emails.data)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

We can now make a list of the counts of our words in our training set.  
Create a variable named __train_counts__. Set it equal to __counter__’s __transform__ function using __train_emails.data__ as a parameter.

In [14]:
train_counts = counter.transform(train_emails.data)

Let’s also make a variable named __test_counts__. This should be the same function call as before, but use __test_emails.data__ as the parameter of transform.

In [15]:
test_counts = counter.transform(test_emails.data)
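
To see what these word counts look like, here is a small sketch on two made-up sentences instead of the real emails. The names `toy_docs`, `toy_counter`, and `toy_counts` are hypothetical; the calls are the same __.fit()__ and __.transform()__ used above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences standing in for emails.
toy_docs = ["the puck hit the post", "the pitcher threw the ball"]

toy_counter = CountVectorizer()
toy_counter.fit(toy_docs)                    # learn the vocabulary
toy_counts = toy_counter.transform(toy_docs) # one row of counts per sentence

# vocabulary_ maps each word to its column index; sort the words by column.
vocab = sorted(toy_counter.vocabulary_, key=toy_counter.vocabulary_.get)
print(vocab)
print(toy_counts.toarray())
```

Each row is one sentence and each column is one word of the vocabulary. The word `the` appears twice in both sentences, so its column holds 2 in both rows.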

## 4: Making a Naive Bayes Classifier

Let’s now make a Naive Bayes classifier that we can train and test on. Create a MultinomialNB object named __classifier__.

In [16]:
classifier = MultinomialNB()

Call __classifier__’s __.fit()__ function. __.fit()__ takes two parameters:  
* The first should be our training set, which for us is __train_counts__.  
* The second should be the labels associated with the training emails. Those are found in __train_emails.target__.

In [17]:
classifier.fit(train_counts, train_emails.target)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Test the Naive Bayes classifier by printing the result of __classifier__’s __.score()__ function. __.score()__ takes the test set and the test labels as parameters.  
__.score()__ returns the accuracy of the classifier on the test data. Accuracy is the fraction of test emails the classifier labeled correctly, as a number between 0 and 1.

In [18]:
# the two parameters to .score() should be test_counts and test_emails.target
print(classifier.score(test_counts, test_emails.target))

0.9723618090452262


__.score()__ will classify all the emails in the test set and compare the classification of each email to its actual label.  
After completing these comparisons, it will calculate and return the accuracy.
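
The same comparison can be done by hand to check what __.score()__ reports. This is a sketch on a handful of made-up sentences, not the real emails, and all `toy_` names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: label 0 = baseball, label 1 = hockey.
toy_train = ["pitcher threw the ball", "home run in the ninth inning",
             "goalie stopped the puck", "power play goal on the ice"]
toy_train_labels = [0, 0, 1, 1]
toy_test = ["the pitcher hit a home run", "the puck slid across the ice"]
toy_test_labels = [0, 1]

toy_counter = CountVectorizer()
toy_counter.fit(toy_train + toy_test)
toy_classifier = MultinomialNB()
toy_classifier.fit(toy_counter.transform(toy_train), toy_train_labels)

toy_test_counts = toy_counter.transform(toy_test)
predictions = toy_classifier.predict(toy_test_counts)

# Accuracy by hand: the fraction of predictions that match the true labels.
correct = sum(int(p == t) for p, t in zip(predictions, toy_test_labels))
manual_accuracy = correct / len(toy_test_labels)
library_accuracy = toy_classifier.score(toy_test_counts, toy_test_labels)

print(manual_accuracy, library_accuracy)
```

The two printed numbers should agree, since both are the share of test sentences whose predicted label equals the actual label.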

## 5: Testing Other Datasets

Our classifier does a pretty good job distinguishing between baseball emails and hockey emails. But let’s see how it does with emails about very different topics.  
Find where we create __train_emails__ and __test_emails__. Change the categories to be __['comp.sys.ibm.pc.hardware', 'rec.sport.hockey']__.  
Did our classifier do a better or worse job on these two datasets?

In [21]:
# train_emails
train_emails = fetch_20newsgroups(
    categories = ['comp.sys.ibm.pc.hardware', 'rec.sport.hockey'], 
    subset = 'train', 
    shuffle =  True,
    random_state = 108)

# test_emails
test_emails = fetch_20newsgroups(
    categories = ['comp.sys.ibm.pc.hardware', 'rec.sport.hockey'], 
    subset = 'test', 
    shuffle =  True,
    random_state = 108)

counter = CountVectorizer()
counter.fit(test_emails.data + train_emails.data)
train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)
classifier = MultinomialNB()
classifier.fit(train_counts, train_emails.target)

print(classifier.score(test_counts, test_emails.target))

0.9974715549936789


The classifier was about 99.7% accurate when classifying hockey and tech emails.  
This is better than the roughly 97.2% it reached on hockey and baseball emails. This makes sense: emails about two sports probably share more words with each other than sports emails share with hardware emails.
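
One rough way to make the shared-words intuition concrete is to measure how much two vocabularies overlap. The sketch below uses made-up one-line "emails" and two hypothetical helpers, `vocabulary` and `jaccard_overlap`; it illustrates the idea only and says nothing about the real newsgroup data.

```python
import re

def vocabulary(docs):
    """Return the set of lowercase words (two or more characters) in docs."""
    words = set()
    for doc in docs:
        words.update(re.findall(r"\b\w\w+\b", doc.lower()))
    return words

def jaccard_overlap(docs_a, docs_b):
    """Shared vocabulary divided by combined vocabulary (0 = disjoint, 1 = identical)."""
    vocab_a, vocab_b = vocabulary(docs_a), vocabulary(docs_b)
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)

# Hypothetical one-line "emails", made up for illustration.
hockey = ["the team won the game in overtime", "great season for the team"]
baseball = ["the team lost the game in extra innings", "a long season for the pitcher"]
hardware = ["my hard drive controller fails on boot", "which bus does the card use"]

print(jaccard_overlap(hockey, baseball))  # two sports: larger overlap
print(jaccard_overlap(hockey, hardware))  # sport vs. hardware: smaller overlap
```

The more words two categories share, the fewer words are left to tell them apart, which is one plausible reason the sports pair scores lower.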


We can play around with different sets of data. Can we find a pair of categories where the classifier is incredibly accurate or incredibly inaccurate?  
The possible categories are listed below.

* 'alt.atheism'
* 'comp.graphics'
* 'comp.os.ms-windows.misc'
* 'comp.sys.ibm.pc.hardware'
* 'comp.sys.mac.hardware'
* 'comp.windows.x'
* 'misc.forsale'
* 'rec.autos'
* 'rec.motorcycles'
* 'rec.sport.baseball'
* 'rec.sport.hockey'
* 'sci.crypt'
* 'sci.electronics'
* 'sci.med'
* 'sci.space'
* 'soc.religion.christian'
* 'talk.politics.guns'
* 'talk.politics.mideast'
* 'talk.politics.misc'
* 'talk.religion.misc'

Example with the categories: __['alt.atheism', 'rec.autos']__.

In [22]:
# train_emails
train_emails = fetch_20newsgroups(
    categories = ['alt.atheism', 'rec.autos'], 
    subset = 'train', 
    shuffle =  True,
    random_state = 108)

# test_emails
test_emails = fetch_20newsgroups(
    categories = ['alt.atheism', 'rec.autos'], 
    subset = 'test', 
    shuffle =  True,
    random_state = 108)

counter = CountVectorizer()
counter.fit(test_emails.data + train_emails.data)
train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)
classifier = MultinomialNB()
classifier.fit(train_counts, train_emails.target)

print(classifier.score(test_counts, test_emails.target))

0.9916083916083916
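
Since the same steps are repeated for every pair of categories, they could be wrapped in a function. This is a sketch, with the hypothetical name `naive_bayes_accuracy`, that takes the texts and labels as arguments so it can be tried on toy data without downloading anything.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def naive_bayes_accuracy(train_texts, train_labels, test_texts, test_labels):
    """Count words, fit MultinomialNB on the training data, return test accuracy."""
    counter = CountVectorizer()
    # Same choice as the cells above: the vocabulary is built from both subsets.
    counter.fit(list(test_texts) + list(train_texts))
    classifier = MultinomialNB()
    classifier.fit(counter.transform(train_texts), train_labels)
    return classifier.score(counter.transform(test_texts), test_labels)

# Hypothetical toy data: label 0 = baseball, label 1 = hockey.
toy_accuracy = naive_bayes_accuracy(
    ["pitcher threw the ball", "goalie stopped the puck"], [0, 1],
    ["the pitcher threw a strike", "the puck hit the goalie"], [0, 1])
print(toy_accuracy)
```

With the real data, the call would be `naive_bayes_accuracy(train_emails.data, train_emails.target, test_emails.data, test_emails.target)` after fetching __train_emails__ and __test_emails__ for the chosen categories.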


                                                                                                                   By  
                                                                                        Eleazar I. Madariaga González
                                                                                  As part of my Data Analyst training
                                                                                Thanks to Codecademy for the guidance 