# Email Similarity

In this project, you will use scikit-learn’s Naive Bayes implementation on several different datasets. By reporting the accuracy of the classifier on each one, we can find which pairs of topics are harder to tell apart. For example, how difficult do you think it is to distinguish emails about hockey from emails about baseball? How hard is it to tell the difference between emails about hockey and emails about tech? In this project, we’ll measure how difficult those two tasks are.

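We’ve imported a dataset of emails from scikit-learn’s datasets. All of these emails are tagged based on their content.
Print emails.target_names to see the different categories. (In the cell below, the variable has already been renamed to train_emails, so the call is train_emails.target_names.)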

In [14]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

train_emails = fetch_20newsgroups(categories=['rec.sport.baseball', 'rec.sport.hockey'],subset='train',shuffle = True,random_state = 108)

test_emails=fetch_20newsgroups(categories=['rec.sport.baseball', 'rec.sport.hockey'],subset='test',shuffle = True,random_state = 108)


print(train_emails.target_names)

['rec.sport.baseball', 'rec.sport.hockey']


We now want to split our data into training and test sets. Change the name of your variable from emails to train_emails. Add these three parameters to the function call:

subset='train'
shuffle = True
random_state = 108
Create test_emails with the same call, using subset='test' instead. The dataset ships with a predefined train/test split, so subset selects which part you get. Adding the random_state parameter makes sure that every time you run the code, the emails are shuffled into the same order.
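
To see why a fixed random_state matters, here is a minimal sketch of seeded shuffling. It does not call fetch_20newsgroups; it uses sklearn.utils.shuffle on a hypothetical list of five placeholder emails as a stand-in, on the assumption that the idea carries over: the same seed gives the same order.

```python
from sklearn.utils import shuffle

# Hypothetical placeholder emails, not taken from the real dataset.
toy_emails = ["email A", "email B", "email C", "email D", "email E"]

# Shuffling twice with the same seed produces the same order both times.
first_run = shuffle(toy_emails, random_state=108)
second_run = shuffle(toy_emails, random_state=108)

print(first_run)
print(first_run == second_run)
```

Without a seed, each run may produce a different order, which makes results harder to compare between runs.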



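Before we can train a classifier, the emails need to be turned into vectors of word counts. Create a CountVectorizer named counter, fit it on all of the email text (test_emails.data + train_emails.data), and use .transform() to create train_counts and test_counts.

Let’s now make a Naive Bayes classifier that we can train and test on. Create a MultinomialNB object named classifier.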

Call classifier’s .fit() function. .fit() takes two parameters. The first should be our training set, which for us is train_counts. The second should be the labels associated with the training emails. Those are found in train_emails.target.
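
To make train_counts less abstract, here is a small sketch of what CountVectorizer produces. The two sentences are hypothetical stand-ins for train_emails.data; the real matrix has one row per email and one column per distinct word.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-email corpus standing in for train_emails.data.
toy_emails = ["the puck hit the goalie", "the pitcher threw the ball"]

counter = CountVectorizer()
counter.fit(toy_emails)                      # learn the vocabulary
toy_counts = counter.transform(toy_emails)   # one row per email, one column per word

# vocabulary_ maps each word to its column index in the count matrix.
the_column = counter.vocabulary_["the"]

print(sorted(counter.vocabulary_))
print(toy_counts.toarray())
print(toy_counts[0, the_column])  # how many times "the" appears in the first email
```

With default settings, CountVectorizer lowercases the text and ignores single-character tokens, so words like "a" do not get a column.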

Test the Naive Bayes Classifier by printing classifier’s .score() function. .score() takes the test set and the test labels as parameters.

.score() returns the accuracy of the classifier on the test data. Accuracy measures the percentage of classifications a classifier correctly made.
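
The sketch below illustrates what .score() computes by comparing it with accuracy calculated by hand. The toy emails and labels are hypothetical, and for brevity the vocabulary here is learned from the training emails only, which is a simplification rather than the exact setup of this project.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy emails; label 0 = baseball, label 1 = hockey.
train_docs = ["pitcher threw a strike", "home run in the ninth inning",
              "goalie stopped the puck", "power play on the ice"]
train_labels = [0, 0, 1, 1]
test_docs = ["the pitcher hit a home run", "the puck slid across the ice"]
test_labels = [0, 1]

counter = CountVectorizer()
train_counts = counter.fit_transform(train_docs)
test_counts = counter.transform(test_docs)

classifier = MultinomialNB()
classifier.fit(train_counts, train_labels)

# Accuracy by hand: the fraction of predictions that match the true labels.
predictions = classifier.predict(test_counts)
manual_accuracy = np.mean(predictions == np.array(test_labels))

print(manual_accuracy)
print(classifier.score(test_counts, test_labels))
```

Both printed values should agree, since .score() for a classifier is the mean accuracy on the data it is given.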

In [16]:
train_emails = fetch_20newsgroups(categories=['rec.sport.baseball', 'rec.sport.hockey'],subset='train',shuffle = True,random_state = 108)

test_emails=fetch_20newsgroups(categories=['rec.sport.baseball', 'rec.sport.hockey'],subset='test',shuffle = True,random_state = 108)



# print(emails.target_names)

# print(emails.data[5])
# print(emails.target[5])

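# Turn each email into a vector of word counts.
# The vocabulary is built from both the test and the training emails.
counter = CountVectorizer()
counter.fit(test_emails.data + train_emails.data)
train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)

# Train on the training counts, then print the accuracy on the test counts.
classifier = MultinomialNB()
classifier.fit(train_counts, train_emails.target)
print(classifier.score(test_counts, test_emails.target))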

0.9723618090452262


Our classifier does a pretty good job distinguishing between baseball emails and hockey emails: it labels about 97% of the test emails correctly. But let’s see how it does with emails about really different topics.

Find where you create train_emails and test_emails. Change the categories to be ['comp.sys.ibm.pc.hardware','rec.sport.hockey'].

Did your classifier do a better or worse job on these two datasets?

In [17]:
# train_emails = fetch_20newsgroups(categories=['rec.sport.baseball', 'rec.sport.hockey'],subset='train',shuffle = True,random_state = 108)

# test_emails=fetch_20newsgroups(categories=['rec.sport.baseball', 'rec.sport.hockey'],subset='test',shuffle = True,random_state = 108)

train_emails = fetch_20newsgroups(categories=['comp.sys.ibm.pc.hardware','rec.sport.hockey'],subset='train',shuffle = True,random_state = 108)

test_emails=fetch_20newsgroups(categories=['comp.sys.ibm.pc.hardware','rec.sport.hockey'],subset='test',shuffle = True,random_state = 108)

# print(emails.target_names)

# print(emails.data[5])
# print(emails.target[5])

# Same steps as in the previous cell, now on the hardware and hockey emails.
counter = CountVectorizer()
counter.fit(test_emails.data + train_emails.data)
train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)

classifier = MultinomialNB()
classifier.fit(train_counts, train_emails.target)
print(classifier.score(test_counts, test_emails.target))

0.9974715549936789
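
Since the two experiments differ only in their categories, the repeated steps could be wrapped in a function. The following is one possible sketch; naive_bayes_accuracy is a hypothetical helper name, not part of scikit-learn, and it is demonstrated on made-up toy data so that it runs without downloading the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def naive_bayes_accuracy(train_docs, train_labels, test_docs, test_labels):
    """Vectorize the text, train MultinomialNB, and return test accuracy."""
    counter = CountVectorizer()
    # Mirrors the cells above: the vocabulary is built from both splits.
    counter.fit(list(test_docs) + list(train_docs))
    train_counts = counter.transform(train_docs)
    test_counts = counter.transform(test_docs)

    classifier = MultinomialNB()
    classifier.fit(train_counts, train_labels)
    return classifier.score(test_counts, test_labels)


# Hypothetical toy data; label 0 = hockey, label 1 = hardware.
toy_accuracy = naive_bayes_accuracy(
    ["puck ice goalie", "ice rink puck", "cpu ram motherboard", "ram disk cpu"],
    [0, 0, 1, 1],
    ["goalie puck", "motherboard disk"],
    [0, 1],
)
print(toy_accuracy)

# With the real data (fetch_20newsgroups downloads the dataset on first use):
# cats = ['comp.sys.ibm.pc.hardware', 'rec.sport.hockey']
# train = fetch_20newsgroups(categories=cats, subset='train', shuffle=True, random_state=108)
# test = fetch_20newsgroups(categories=cats, subset='test', shuffle=True, random_state=108)
# print(naive_bayes_accuracy(train.data, train.target, test.data, test.target))
```

Calling the helper once per pair of categories makes it easy to compare how hard each pair is to separate without copying the cell each time.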
