# Email Similarity

### SUPERVISED LEARNING: ADVANCED CLASSIFICATION

#### In this project, we will use scikit-learn’s Naive Bayes implementation on several different datasets. By reporting the accuracy of the classifier on each one, we can find which datasets are harder to distinguish.

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

emails = fetch_20newsgroups()

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [3]:
#categories of emails
emails.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [5]:
#we want to see how well our classifier can tell
#a baseball email from a hockey email
emails = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'])

In [6]:
#printing email at index 5
emails.data[5]

'From: mmb@lamar.ColoState.EDU (Michael Burger)\nSubject: More TV Info\nDistribution: na\nNntp-Posting-Host: lamar.acns.colostate.edu\nOrganization: Colorado State University, Fort Collins, CO  80523\nLines: 36\n\nUnited States Coverage:\nSunday April 18\n  N.J./N.Y.I. at Pittsburgh - 1:00 EDT to Eastern Time Zone\n  ABC - Gary Thorne and Bill Clement\n\n  St. Louis at Chicago - 12:00 CDT and 11:00 MDT - to Central/Mountain Zones\n  ABC - Mike Emerick and Jim Schoenfeld\n\n  Los Angeles at Calgary - 12:00 PDT and 11:00 ADT - to Pacific/Alaskan Zones\n  ABC - Al Michaels and John Davidson\n\nTuesday, April 20\n  N.J./N.Y.I. at Pittsburgh - 7:30 EDT Nationwide\n  ESPN - Gary Thorne and Bill Clement\n\nThursday, April 22 and Saturday April 24\n  To Be Announced - 7:30 EDT Nationwide\n  ESPN - To Be Announced\n\n\nCanadian Coverage:\n\nSunday, April 18\n  Buffalo at Boston - 7:30 EDT Nationwide\n  TSN - ???\n\nTuesday, April 20\n  N.J.D./N.Y. at Pittsburgh - 7:30 EDT Nationwide\n  TSN - ??

In [8]:
#label of the email at index 5; labels index into emails.target_names,
#so 1 is 'rec.sport.hockey' and 0 is 'rec.sport.baseball'
emails.target[5]

1

In [10]:
#training set
train_emails = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'],
                                  subset = 'train',
                                  shuffle = True,
                                  random_state = 108)


In [11]:
#test set
test_emails = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'],
                                 subset = 'test',
                                 shuffle = True,
                                 random_state = 108)

In [12]:
#CountVectorizer will turn each email into a vector of word counts
counter = CountVectorizer()

In [14]:
#building the vocabulary from every word that appears in the test and training emails
counter.fit(test_emails.data + train_emails.data)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [16]:
#word-count matrix for the training emails (one row per email, one column per word)
train_counts = counter.transform(train_emails.data)

In [18]:
#word-count matrix for the test emails, using the same columns
test_counts = counter.transform(test_emails.data)
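
To make the `fit` and `transform` steps above concrete, here is a minimal sketch on a made-up three-message corpus. The names `toy_emails`, `toy_counter`, `vocab`, `dense`, and `unseen` are hypothetical and the sentences are not taken from the 20 newsgroups data. It also shows that `transform` silently ignores words that were not seen during `fit`, which is presumably why the notebook builds the vocabulary from both the training and test emails.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (an assumption for illustration, not real emails).
toy_emails = [
    "the pitcher threw the ball",
    "the goalie stopped the puck",
    "puck puck goalie",
]

toy_counter = CountVectorizer()
toy_counter.fit(toy_emails)  # learns the vocabulary: one column per distinct word

# vocabulary_ maps each word to its column index; list the words in column order.
vocab = sorted(toy_counter.vocabulary_, key=toy_counter.vocabulary_.get)

# transform returns a sparse matrix: one row per message, one column per word.
dense = toy_counter.transform(toy_emails).toarray()

# Words missing from the fitted vocabulary ("referee", "whistled") are dropped.
unseen = toy_counter.transform(["the referee whistled"]).toarray()

print(vocab)
print(dense)
print(unseen)
```

Each row of `dense` is the kind of word-count vector that `train_counts` and `test_counts` hold for the real emails, only with far fewer columns.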

In [20]:
#Naive Bayes Classifier
classifier = MultinomialNB()

In [21]:
classifier.fit(train_counts, train_emails.target)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
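
The `alpha=1.0` shown in the output above is the additive (Laplace) smoothing parameter. As described in scikit-learn's Naive Bayes documentation, the classifier estimates the probability of word $w$ given class $c$ from the training counts as:

$$
\hat{\theta}_{cw} = \frac{N_{cw} + \alpha}{N_{c} + \alpha n}
$$

Here $N_{cw}$ is the number of times word $w$ appears in training emails of class $c$, $N_{c} = \sum_{w} N_{cw}$ is the total word count for that class, and $n$ is the vocabulary size (the number of columns in `train_counts`). With $\alpha = 1$, a word that never appears in, say, the baseball training emails still gets a small nonzero probability, so a single unfamiliar word cannot rule out a class on its own.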

In [22]:
#mean accuracy: the fraction of test emails assigned to the correct newsgroup
classifier.score(test_counts, test_emails.target)

0.9723618090452262
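
To compare several datasets as the introduction suggests, the steps above can be wrapped in one function. The following is a sketch under stated assumptions, not the notebook's own code: the function name `naive_bayes_accuracy` and the toy sentences are hypothetical, and accuracy is computed by hand as the fraction of correct predictions so the meaning of the score is explicit.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def naive_bayes_accuracy(train_texts, train_labels, test_texts, test_labels):
    """Train MultinomialNB on word counts and return accuracy on the test texts."""
    counter = CountVectorizer()
    # Same vocabulary choice as the notebook: words from both training and test texts.
    counter.fit(list(train_texts) + list(test_texts))

    classifier = MultinomialNB()
    classifier.fit(counter.transform(train_texts), train_labels)

    predictions = classifier.predict(counter.transform(test_texts))
    # Accuracy = number of correct predictions / number of test texts.
    correct = sum(int(p == t) for p, t in zip(predictions, test_labels))
    return correct / len(test_labels)


# Hypothetical toy data (0 = baseball, 1 = hockey), only to show the call shape.
toy_train = [
    "pitcher threw the ball",
    "home run in the ninth inning",
    "goalie stopped the puck",
    "power play goal on the ice",
]
toy_train_labels = [0, 0, 1, 1]
toy_test = ["the pitcher threw a home run ball", "the goalie is on the ice"]
toy_test_labels = [0, 1]

toy_score = naive_bayes_accuracy(toy_train, toy_train_labels, toy_test, toy_test_labels)
print(toy_score)
```

With the real data, the same function could be called with `train_emails.data`, `train_emails.target`, `test_emails.data`, and `test_emails.target` after fetching a different pair of categories. A pair that scores lower than the baseball and hockey pair would be the harder one to distinguish.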