# Email Similarity

In [23]:
# Import libraries and datasets
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

In [24]:
# Explore email dataset
emails = fetch_20newsgroups()
print(emails.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


We want to see how well a Naive Bayes classifier can tell a baseball email from a hockey email.

## Training and Test Sets

In [25]:
categories = ['rec.sport.baseball', 'rec.sport.hockey']

In [26]:
# Let's take a look at one email
print(emails.data[5])

From: dfo@vttoulu.tko.vtt.fi (Foxvog Douglas)
Subject: Re: Rewording the Second Amendment (ideas)
Organization: VTT
Lines: 58

In article <1r1eu1$4t@transfer.stratus.com> cdt@sw.stratus.com (C. D. Tavares) writes:
>In article <1993Apr20.083057.16899@ousrvr.oulu.fi>, dfo@vttoulu.tko.vtt.fi (Foxvog Douglas) writes:
>> In article <1qv87v$4j3@transfer.stratus.com> cdt@sw.stratus.com (C. D. Tavares) writes:
>> >In article <C5n3GI.F8F@ulowell.ulowell.edu>, jrutledg@cs.ulowell.edu (John Lawrence Rutledge) writes:
>
>> >> The massive destructive power of many modern weapons, makes the
>> >> cost of an accidental or crimial usage of these weapons to great.
>> >> The weapons of mass destruction need to be in the control of
>> >> the government only.  Individual access would result in the
>> >> needless deaths of millions.  This makes the right of the people
>> >> to keep and bear many modern weapons non-existant.

>> >Thanks for stating where you're coming from.  Needless to say, I
>> >disagree 

All of the labels are stored in the array emails.target.
The labels themselves are numbers, and each number is an index into the list of label names, emails.target_names.

In [27]:
print(emails.target_names[emails.target[5]])

talk.politics.guns


For example, the email printed above belongs to talk.politics.guns.

Let's keep only the baseball and hockey emails.

In [29]:
# Split data into training and test sets, keeping only the two categories
train_emails = fetch_20newsgroups(categories=categories, subset='train',
                                  shuffle=True, random_state=100)
test_emails = fetch_20newsgroups(categories=categories, subset='test',
                                 shuffle=True, random_state=100)
train_targets = train_emails.target
test_targets = test_emails.target

## Counting Words  

We want to transform each email into a list of word counts.

In [30]:
counter = CountVectorizer()

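First, counter needs to learn the vocabulary: every word that can appear in our emails. We fit it on both the training and test emails so that no test word is missing from the vocabulary.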

In [31]:
counter.fit(test_emails.data + train_emails.data)

We can now count how often each vocabulary word appears in every training and test email.

In [32]:
train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)
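
To make the fit and transform steps concrete, here is a minimal pure-Python sketch of the same idea on two made-up sentences. The helper names (`tokenize`, `fit_vocabulary`, `transform`) and the toy emails are assumptions for illustration only. The real CountVectorizer returns a sparse matrix rather than nested lists, but its defaults likewise lowercase the text, keep tokens of two or more word characters, and assign column indices in sorted order.

```python
import re
from collections import Counter

# Roughly mirrors CountVectorizer's defaults: lowercase the text and keep
# tokens made of two or more word characters (so "a" is dropped).
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Split a document into lowercase word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(documents):
    """Map every distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(documents, vocabulary):
    """Turn each document into a list of counts, one per vocabulary word."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        # Words that are not in the vocabulary are simply ignored.
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


# Hypothetical toy emails, not taken from the dataset.
toy_emails = ["The pitcher threw a strike",
              "The goalie made a save, a great save"]
vocabulary = fit_vocabulary(toy_emails)
toy_counts = transform(toy_emails, vocabulary)
print(vocabulary)
print(toy_counts)
```

Each row has one entry per vocabulary word, so every email ends up as a vector of the same length regardless of how long the email is.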

In [33]:
# Create the Naive Bayes classifier
classifier = MultinomialNB()
# Fit the model
classifier.fit(train_counts, train_targets)
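
As a rough picture of what fitting a multinomial Naive Bayes model involves, the sketch below estimates a prior for each class and a smoothed probability for each word within each class, then classifies a new count vector by comparing log scores. The function names and the toy counts are hypothetical, and the smoothing value `alpha=1.0` is chosen to match MultinomialNB's default. This is a simplified illustration, not scikit-learn's implementation.

```python
import math


def fit_naive_bayes(count_rows, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class log word probabilities."""
    classes = sorted(set(labels))
    n_words = len(count_rows[0])
    log_priors, log_word_probs = {}, {}
    for c in classes:
        rows = [row for row, label in zip(count_rows, labels) if label == c]
        # Prior: the fraction of training documents that belong to class c.
        log_priors[c] = math.log(len(rows) / len(count_rows))
        # Add alpha to every word total so unseen words never get probability 0.
        totals = [sum(row[j] for row in rows) + alpha for j in range(n_words)]
        denom = sum(totals)
        log_word_probs[c] = [math.log(t / denom) for t in totals]
    return log_priors, log_word_probs


def predict(count_row, log_priors, log_word_probs):
    """Return the class with the highest log prior plus log likelihood."""
    def score(c):
        return log_priors[c] + sum(
            n * lp for n, lp in zip(count_row, log_word_probs[c]))
    return max(log_priors, key=score)


# Hypothetical toy data: columns count the words ["bat", "goal", "ice", "pitch"].
toy_counts = [[2, 0, 0, 1], [1, 0, 0, 2], [0, 2, 1, 0], [0, 1, 2, 0]]
toy_labels = [0, 0, 1, 1]  # 0 = baseball, 1 = hockey
priors, word_probs = fit_naive_bayes(toy_counts, toy_labels)
print(predict([1, 0, 0, 1], priors, word_probs))
print(predict([0, 1, 1, 0], priors, word_probs))
```

Working in log space turns the product of many small word probabilities into a sum, which avoids numerical underflow on long emails.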

In [35]:
print(classifier.score(test_counts, test_targets))

0.9723618090452262
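
For a classifier, score reports the mean accuracy: the fraction of test emails whose predicted label matches the true label. A minimal sketch of that calculation, using a hypothetical `accuracy` helper and made-up labels:

```python
def accuracy(predicted_labels, true_labels):
    """Fraction of predictions that match the true labels."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("both label sequences must have the same length")
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)


# Hypothetical predictions and true labels for five emails.
toy_predictions = [0, 1, 1, 0, 1]
toy_truth = [0, 1, 0, 0, 1]
print(accuracy(toy_predictions, toy_truth))
```

Read this way, the score above means that roughly 97% of the baseball and hockey test emails were classified correctly.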


## Try Other Topics for Classification

In [37]:
# Split data into training and test sets for a new pair of topics
categories = ['comp.sys.ibm.pc.hardware', 'talk.politics.guns']
train_emails = fetch_20newsgroups(categories=categories, subset='train',
                                  shuffle=True, random_state=100)
test_emails = fetch_20newsgroups(categories=categories, subset='test',
                                 shuffle=True, random_state=100)
train_targets = train_emails.target
test_targets = test_emails.target

# Learn the vocabulary from all emails in both sets
counter = CountVectorizer()
counter.fit(test_emails.data + train_emails.data)

# Convert each email into a vector of word counts
train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)

# Train the Naive Bayes classifier on the training counts
classifier = MultinomialNB()
classifier.fit(train_counts, train_targets)

# Report accuracy on the held-out test emails
print(classifier.score(test_counts, test_targets))

0.9933862433862434
