###Welcome

In this project we will categorize emails using a Naive Bayes Classifier

We will use different datasets to find out which kind of emails are harder to classify 

What exactly are we doing?

The Bayes Theorem is a branch of statistics called Bayesian Statistics where we calculate probabilities based on prior knowledge of any given event or series of events. It has applications in A/B testing, Statistical Modeling, Machine Learning and Robotics. 

P(A|B) = P(B|A) * P(A) / P(B) This is the formula
We can read it as: What is the probability of A given that B is true = The Probability of B given that A is true * Probability of A (all outcomes where A is true)/ Probability of B (all outcomes where B is true) 


We want to look at the text contained in different emails and calculate the probability of a series of words belonging to one or different topics in order to classify such emails (Example emails where the main topic is politics or the main topic is religion) 

In [10]:
#Let's first import our dataset and libraries to run the naive bayes classifier from scikit learn

from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

In [11]:
#The dataset contains different kinds of emails
emails = fetch_20newsgroups()

print(emails.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


Our objective is to find out how accurate is our Naive Bayes classifier at knowing the difference between one type of email from the other, in this case hockey vs baseball

In [16]:
#We assign to the variable emails only the two groups of emails we are interested in classifying
emails_data = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'])
print(emails_data.target_names)

['rec.sport.baseball', 'rec.sport.hockey']


In [24]:
#Let's check one of the emails in our dataset just to be sure
print(emails_data.data[3])

From: monack@helium.gas.uug.arizona.edu (david n monack)
Subject: Re: ESPN Tonight
Organization: University of Arizona - Tucson, Arizona
Lines: 17

In <1qkj1kINN3g1@master.cs.rose-hulman.edu> swartzjh@RoseVC.Rose-Hulman.Edu writes:

>Has anyone heard what game ESPN is showing tonight.  They said they will
>show whatever game means the most playoff-wise. I would assume this would
>be the Blues-Tampa game or the Minnesota-Red Wings game...  Anyone heard for
>sure???

>		Jeff Swartz

I heard it will be the Minnesota-Detroit game. Don't know the time
though.

Dave

--
David Monack        e-mail: monack@gas.uug.arizona.edu
"Love is the delusion that one woman differs from another." H.L. Mencken



All of the emails have been previously labeled since this is the only way of training our supervised machine learning algorithm, lets find out if the previous email was labeled as hockey or baseball

In [25]:
#Okay the previous was labeled as a hockey email
print(emails_data.target[3])

1


In [27]:
#We now want to split our data for training

emails_training = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'], subset = 'train', shuffle = True, random_state = 108)

In [28]:
#Now for testing

emails_test = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'], subset = 'test', shuffle = True, random_state = 108)

In [30]:
#We now transform the emails into word counts using the CountVectorizer function

counter = CountVectorizer()

#Then tell the function all the possible words in our dataset

counter.fit(emails_test.data + emails_training.data)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [32]:
#Let's now make a list of the counts of words in our training and test sets

training_count = counter.transform(emails_training.data)
test_count = counter.transform(emails_test.data)

Great, time to make our Naive Bayes Classifier

In [33]:
#First we create an object out of the MutinomialNB scikit learn library

classifier = MultinomialNB()

In [34]:
#Now we fit in the data specifying the training set and the labels

classifier.fit(training_count, emails_training.target)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [None]:
Okay time to check how accurate is our classifier

In [36]:
print(classifier.score(test_count, emails_test.target))

0.9723618090452262


###Great! We achieved a 97.2% Accuracy

Let's test our classifier one more time with 2 different kinds of emails

We will choose politics and religion = 'talk.politics.misc', 'talk.religion.misc'

In [43]:
#Split our data for training and testing 

emails_training2 = fetch_20newsgroups(categories = ['talk.politics.misc', 'talk.religion.misc'], subset = 'train', shuffle = True, random_state = 108)

emails_test2 = fetch_20newsgroups(categories = ['talk.politics.misc', 'talk.religion.misc'], subset = 'test', shuffle = True, random_state = 108)

In [44]:
#Create Naive Bayes Classifier

counter2 = CountVectorizer()

#Tell the function the vocabulary
counter2.fit(emails_test2.data + emails_training2.data)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [45]:
#Make a list of the counts of words in our training and test sets

training_count2 = counter.transform(emails_training2.data)
test_count2 = counter.transform(emails_test2.data)

In [46]:
#Create object from scikit learn library

classifier2 = MultinomialNB()

In [47]:
#Fit in the data specifying the training set and the labels

classifier2.fit(training_count2, emails_training2.target)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [48]:
#Check accuracy
print(classifier2.score(test_count2, emails_test2.target))

0.8805704099821747


###The classifier is less accurate distinguishing between politics and religion emails, yet 88% is still quite good for the purpose of this project

Thank you for reading