### Naive Bayes - Email Similarity

In this project we will apply scikit-learn's Naive Bayes implementation to several datasets.
By comparing the classifier's accuracy across datasets, we can find which ones are harder to tell apart.

In [16]:
# Importing our modules
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

emails = fetch_20newsgroups()

# Printing the different categories in emails
print(emails.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [17]:
# Starting our investigation with baseball and hockey
emails = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'])

# Taking a look at the email data & label
print(emails.data[0])
print('\nLabel:')
print(emails.target[0])
print(emails.target_names[emails.target[0]])

From: dougb@comm.mot.com (Doug Bank)
Subject: Re: Info needed for Cleveland tickets
Reply-To: dougb@ecs.comm.mot.com
Organization: Motorola Land Mobile Products Sector
Distribution: usa
Nntp-Posting-Host: 145.1.146.35
Lines: 17

In article <1993Apr1.234031.4950@leland.Stanford.EDU>, bohnert@leland.Stanford.EDU (matthew bohnert) writes:

|> I'm going to be in Cleveland Thursday, April 15 to Sunday, April 18.
|> Does anybody know if the Tribe will be in town on those dates, and
|> if so, who're they playing and if tickets are available?

The tribe will be in town from April 16 to the 19th.
There are ALWAYS tickets available! (Though they are playing Toronto,
and many Toronto fans make the trip to Cleveland as it is easier to
get tickets in Cleveland than in Toronto.  Either way, I seriously
doubt they will sell out until the end of the season.)

-- 
Doug Bank                       Private Systems Division
dougb@ecs.comm.mot.com          Motorola Communications Sector
dougb@nwu.edu       

In [18]:
# Making the training and test sets
train_emails = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'],
                                  subset = 'train',
                                  shuffle = True,
                                  random_state = 108)

test_emails = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'],
                                  subset = 'test',
                                  shuffle = True,
                                  random_state = 108)

In [19]:
# Transforming the emails into word-count vectors (the vocabulary is fit on both the test and training text)
counter = CountVectorizer()
counter.fit(test_emails.data + train_emails.data)

train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)
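
To make the word-count step concrete, here is a minimal pure-Python sketch of the idea on two toy sentences. It is a simplified stand-in, not scikit-learn's implementation: the helper names (`tokenize`, `build_vocabulary`, `count_vector`) and the toy sentences are hypothetical. It does reuse `CountVectorizer`'s default token pattern, which lowercases the text and keeps runs of two or more word characters.

```python
import re
from collections import Counter

# Same regular expression as CountVectorizer's default token_pattern:
# runs of two or more word characters (single letters like "a" are dropped).
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def build_vocabulary(documents):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def count_vector(document, vocabulary):
    """Return one count per vocabulary entry; unknown tokens are ignored."""
    counts = Counter(tokenize(document))
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


# Hypothetical toy documents, not taken from the newsgroups data.
toy_docs = ["The puck hit the post", "The pitcher hit a home run"]
vocab = build_vocabulary(toy_docs)
vectors = [count_vector(doc, vocab) for doc in toy_docs]

print(vocab)
print(vectors)
```

Each document becomes a row with one column per vocabulary word, which is the shape of data that `train_counts` and `test_counts` hold (in sparse form) for the real emails.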

In [22]:
# Creating our Naive Bayes Classifier
classifier = MultinomialNB()
classifier.fit(train_counts, train_emails.target)
print('Classifier Score:', classifier.score(test_counts, test_emails.target))

Classifier Score: 0.9723618090452262
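
For intuition about what the classifier does with those counts, the following is a rough from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. It is an illustration under simplifying assumptions, not scikit-learn's actual code: `train_nb`, `predict_nb`, and the toy word lists are hypothetical, and documents are assumed to be already tokenized.

```python
import math
from collections import Counter, defaultdict


def train_nb(documents, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocabulary = {word for doc in documents for word in doc}
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(documents, labels):
        word_counts[label].update(doc)

    # Prior: fraction of training documents in each class.
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}

    # Likelihood: smoothed share of each word among the class's words.
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocabulary
        }
    return log_prior, log_likelihood


def predict_nb(doc, log_prior, log_likelihood):
    """Pick the class with the highest log posterior; skip unseen words."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Hypothetical, pre-tokenized toy documents.
train_docs = [
    ["pitcher", "inning", "home", "run"],
    ["goalie", "puck", "ice", "penalty"],
    ["inning", "pitcher", "strike"],
    ["puck", "goalie", "overtime"],
]
train_labels = ["baseball", "hockey", "baseball", "hockey"]

log_prior, log_likelihood = train_nb(train_docs, train_labels)
print(predict_nb(["pitcher", "strike", "run"], log_prior, log_likelihood))
print(predict_nb(["goalie", "ice", "overtime"], log_prior, log_likelihood))
```

Words that appear often in one class and rarely in the other pull the prediction toward that class, which is why two newsgroups with little shared vocabulary tend to be easy to separate.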


### Final thoughts

The categories used for the training and test emails can be swapped for any of the newsgroups listed above. Try different pairs to see which datasets are easier or harder to distinguish!
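
One possible way to run that comparison is to wrap the steps from this notebook in a pair of helpers. This is only a sketch: `score_split` and `compare_pairs` are hypothetical names introduced here, and the vocabulary is fit on both splits simply to mirror the cells above.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def score_split(train_texts, train_labels, test_texts, test_labels):
    """Fit word counts and Naive Bayes, then return the test accuracy."""
    counter = CountVectorizer()
    # Mirrors the notebook: the vocabulary is built from both splits.
    counter.fit(list(test_texts) + list(train_texts))
    classifier = MultinomialNB()
    classifier.fit(counter.transform(train_texts), train_labels)
    return classifier.score(counter.transform(test_texts), test_labels)


def compare_pairs(category_pairs, random_state=108):
    """Download each pair of newsgroups and record its test accuracy."""
    results = {}
    for pair in category_pairs:
        train = fetch_20newsgroups(categories=list(pair), subset='train',
                                   shuffle=True, random_state=random_state)
        test = fetch_20newsgroups(categories=list(pair), subset='test',
                                  shuffle=True, random_state=random_state)
        results[pair] = score_split(train.data, train.target,
                                    test.data, test.target)
    return results
```

For example, `compare_pairs([('rec.sport.baseball', 'rec.sport.hockey'), ('comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware')])` would return a dictionary mapping each pair to its accuracy, so the pairs can be ranked from easiest to hardest.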