# Email Similarity with Naive Bayes Classifier

For this project, we'll use the 20 Newsgroups dataset, loaded through the sklearn library, and train a Naive Bayes classifier to distinguish the topics of its messages (referred to as emails throughout).

In [2]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

emails = fetch_20newsgroups()

In the dataset, each email is tagged with a category based on its content. Let's see which tags we are working with.

In [3]:
print(emails.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


If we're interested in seeing how effective our Naive Bayes Classifier is at differentiating between a baseball email and a hockey email, we can select those categories via:

```
emails = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'])
```



In [4]:
emails = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'])

Let's see what we're working with by printing out the email at index 5 of the list.

In [5]:
print(emails.data[5])

From: mmb@lamar.ColoState.EDU (Michael Burger)
Subject: More TV Info
Distribution: na
Nntp-Posting-Host: lamar.acns.colostate.edu
Organization: Colorado State University, Fort Collins, CO  80523
Lines: 36

United States Coverage:
Sunday April 18
  N.J./N.Y.I. at Pittsburgh - 1:00 EDT to Eastern Time Zone
  ABC - Gary Thorne and Bill Clement

  St. Louis at Chicago - 12:00 CDT and 11:00 MDT - to Central/Mountain Zones
  ABC - Mike Emerick and Jim Schoenfeld

  Los Angeles at Calgary - 12:00 PDT and 11:00 ADT - to Pacific/Alaskan Zones
  ABC - Al Michaels and John Davidson

Tuesday, April 20
  N.J./N.Y.I. at Pittsburgh - 7:30 EDT Nationwide
  ESPN - Gary Thorne and Bill Clement

Thursday, April 22 and Saturday April 24
  To Be Announced - 7:30 EDT Nationwide
  ESPN - To Be Announced


Canadian Coverage:

Sunday, April 18
  Buffalo at Boston - 7:30 EDT Nationwide
  TSN - ???

Tuesday, April 20
  N.J.D./N.Y. at Pittsburgh - 7:30 EDT Nationwide
  TSN - ???

Wednesday, April 21
  St. Louis a

Now we find the label of that email:

In [6]:
print(emails.target[5])

1


In [7]:
print(emails.target_names)

['rec.sport.baseball', 'rec.sport.hockey']


The label 1 is an index into `target_names`, where position 1 holds `rec.sport.hockey`, so the email at index 5 of the list is talking about hockey!
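
The two lookups above can be combined into a single expression. The sketch below uses a small stand-in object instead of downloading the dataset; only the label at index 5 and the two category names come from the outputs above, and the remaining labels are made up for illustration.

```python
from types import SimpleNamespace

# Stand-in for the object returned by fetch_20newsgroups. The values are
# hypothetical, except target[5] == 1 and the category names shown above.
emails_demo = SimpleNamespace(
    target=[0, 1, 0, 0, 1, 1],
    target_names=['rec.sport.baseball', 'rec.sport.hockey'],
)

# The integer label is an index into target_names.
label = emails_demo.target[5]
category = emails_demo.target_names[label]
print(category)
```

With the real dataset, the same lookup would be `emails.target_names[emails.target[5]]`.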

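Now that we understand the data, we're ready to load separate training and test sets and train the model. Note that the calls below omit the `categories` argument, so both sets contain all 20 newsgroups rather than only the baseball and hockey categories.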

In [8]:
train_emails = fetch_20newsgroups(subset='train', shuffle=True, random_state=108)
test_emails = fetch_20newsgroups(subset='test', shuffle=True, random_state=108)

We now set up our model. The first step is to transform each email into a vector of word counts, which lets the model infer the topic from the words that appear in it. We will do so with the `CountVectorizer` class from sklearn. The vocabulary is fit on the test and training emails together, so every word in either set gets its own column.

In [9]:
counter = CountVectorizer()
counter.fit(test_emails.data + train_emails.data)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [10]:
train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)
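
To make the count matrices more concrete, here is a minimal sketch of what `CountVectorizer` produces on two made-up sentences. The sentences and the `demo_` names are assumptions for illustration and are not part of the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents, used only for illustration.
docs = ["the puck hit the post", "the pitcher threw the ball"]

demo_counter = CountVectorizer()
demo_counts = demo_counter.fit_transform(docs)

# vocabulary_ maps each word to its column index; sort words by that index.
vocab = sorted(demo_counter.vocabulary_, key=demo_counter.vocabulary_.get)
print(vocab)

# One row per document, one column per vocabulary word.
print(demo_counts.toarray())
```

Each row is one document and each column is one word, so the column for "the" holds 2 in both rows. The real `train_counts` and `test_counts` have the same layout, only with far more rows and columns.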

Great! We're now ready to build the Naive Bayes classifier.
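
As a rough summary of what a multinomial Naive Bayes model computes from word counts, the decision rule can be written as follows. The symbols are notation chosen here for illustration rather than names from sklearn: $n_w$ is the count of word $w$ in the email being classified, $N_{wc}$ is the total count of $w$ across training emails of category $c$, $N_c$ is the total word count in category $c$, $V$ is the vocabulary size, and $\alpha$ is the smoothing parameter (`MultinomialNB` uses `alpha=1.0` by default).

$$
\hat{c} = \arg\max_{c} \left[ \log P(c) + \sum_{w} n_w \log P(w \mid c) \right],
\qquad
P(w \mid c) = \frac{N_{wc} + \alpha}{N_c + \alpha V}
$$

The smoothing term keeps a word that never appeared in a category during training from driving that category's probability to zero.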

In [11]:
classifier = MultinomialNB()
classifier.fit(train_counts, train_emails.target)
print(classifier.score(test_counts, test_emails.target))

0.7626128518321826
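
The number printed by `score` is the mean accuracy, i.e. the fraction of test emails whose predicted category matches the true one. The sketch below repeats the same fit, predict, and score steps on a tiny made-up corpus; the sentences, labels, and `toy_` names are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training sentences: label 0 = baseball, label 1 = hockey.
toy_texts = [
    "pitcher threw a strike",
    "home run in the ninth inning",
    "goalie saved the puck",
    "power play goal on ice",
]
toy_labels = [0, 0, 1, 1]

toy_counter = CountVectorizer()
toy_train_counts = toy_counter.fit_transform(toy_texts)

toy_model = MultinomialNB()
toy_model.fit(toy_train_counts, toy_labels)

# Words unseen during fitting ("stopped") are ignored by transform.
new_counts = toy_counter.transform(["the goalie stopped the puck"])
prediction = toy_model.predict(new_counts)[0]
print(prediction)

# score() is the fraction of predictions that match the true labels.
manual_accuracy = (toy_model.predict(toy_train_counts) == toy_labels).mean()
model_score = toy_model.score(toy_train_counts, toy_labels)
print(manual_accuracy, model_score)
```

Here the new sentence is assigned to the hockey label because "goalie" and "puck" only occur in the hockey training sentences. The toy accuracy is measured on the training sentences themselves, so it says nothing about generalization; the score reported above is measured on the held-out test emails.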


From this project, I learned how to build a Naive Bayes classifier that labels emails by topic based on the words they contain. Trained on all 20 categories, the model reaches an accuracy of about 0.76 on the test set.