# Email Classifier Model With Diverse Aspects

In this project, I applied scikit-learn's Naive Bayes implementation (`MultinomialNB`) to several subsets of the 20 Newsgroups dataset. Comparing the classifier's accuracy across these subsets shows which groups of topics are harder to tell apart.


I explored three questions:

- How difficult is it to distinguish emails about hockey from emails about baseball?
- How hard is it to tell emails about hockey apart from emails about tech?
- How well does an email classifier separate emails on contentious political topics (guns, the Middle East, religion)?

In [37]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

## 1. Baseball email and Hockey email Classifier Model

In [39]:
emails = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'])



### Exploring an Email

In [40]:
print(emails.data[5])

From: mmb@lamar.ColoState.EDU (Michael Burger)
Subject: More TV Info
Distribution: na
Nntp-Posting-Host: lamar.acns.colostate.edu
Organization: Colorado State University, Fort Collins, CO  80523
Lines: 36

United States Coverage:
Sunday April 18
  N.J./N.Y.I. at Pittsburgh - 1:00 EDT to Eastern Time Zone
  ABC - Gary Thorne and Bill Clement

  St. Louis at Chicago - 12:00 CDT and 11:00 MDT - to Central/Mountain Zones
  ABC - Mike Emerick and Jim Schoenfeld

  Los Angeles at Calgary - 12:00 PDT and 11:00 ADT - to Pacific/Alaskan Zones
  ABC - Al Michaels and John Davidson

Tuesday, April 20
  N.J./N.Y.I. at Pittsburgh - 7:30 EDT Nationwide
  ESPN - Gary Thorne and Bill Clement

Thursday, April 22 and Saturday April 24
  To Be Announced - 7:30 EDT Nationwide
  ESPN - To Be Announced


Canadian Coverage:

Sunday, April 18
  Buffalo at Boston - 7:30 EDT Nationwide
  TSN - ???

Tuesday, April 20
  N.J.D./N.Y. at Pittsburgh - 7:30 EDT Nationwide
  TSN - ???

Wednesday, April 21
  St. Louis a

In [41]:
print('This Email is Classified as:\n                      {}'.format(emails.target_names[emails.target[5]]))

This Email is Classified as:
                      rec.sport.hockey


### Train - Test Split

In [42]:
train_emails = fetch_20newsgroups(categories=['rec.sport.baseball', 'rec.sport.hockey'], subset = 'train', shuffle = True, random_state = 108)

test_emails = fetch_20newsgroups(categories=['rec.sport.baseball', 'rec.sport.hockey'], subset = 'test', shuffle = True, random_state = 108)

### Data Transformation

In [43]:
counter = CountVectorizer()
counter.fit(train_emails.data + test_emails.data )
train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)
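To make the transformation step concrete, here is a minimal sketch of what `CountVectorizer` produces. The two sentences are made-up stand-ins for emails, not rows from the dataset, and the `toy_` names are hypothetical. With default settings the vectorizer lowercases text and ignores one-character tokens such as "a".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents (not taken from the 20 Newsgroups data).
toy_docs = [
    "the goalie made a great save",
    "the pitcher threw a great pitch",
]

toy_counter = CountVectorizer()
toy_counts = toy_counter.fit_transform(toy_docs)

# Words learned from the toy documents, listed in column order.
vocab = sorted(toy_counter.vocabulary_, key=toy_counter.vocabulary_.get)
print(vocab)

# Dense view of the sparse matrix: one row per document, one column per word.
dense = toy_counts.toarray()
print(dense)
```

Each row is the word-count vector that the classifier sees in place of the raw email text.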

### Naive Bayes

In [44]:
classifier = MultinomialNB()
classifier.fit(train_counts, train_emails.target)
print('While distinguishing between two sports \'Baseball\' and \'Hockey\'\nThe model score is : {} %'.format(classifier.score(test_counts, test_emails.target)*100))

While distinguishing between two sports 'Baseball' and 'Hockey'
The model score is : 97.23618090452261 %
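The score reported by `classifier.score` is mean accuracy: the fraction of test emails whose predicted label matches the true label. The sketch below checks this on invented toy data (label 0 for hockey, label 1 for baseball). It is an illustration under stated assumptions rather than the notebook's method: it wraps the vectorizer and classifier in a `make_pipeline`, so the vocabulary is learned from the training texts only, whereas the cell above fits the counter on train and test data together.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: label 0 = hockey, label 1 = baseball.
toy_train_texts = [
    "goalie save puck ice",
    "puck ice hockey stick",
    "pitcher bat inning home run",
    "bat inning pitcher strike",
]
toy_train_labels = [0, 0, 1, 1]
toy_test_texts = ["puck on the ice", "pitcher strike"]
toy_test_labels = np.array([0, 1])

# The pipeline fits the vectorizer on the training texts only.
toy_model = make_pipeline(CountVectorizer(), MultinomialNB())
toy_model.fit(toy_train_texts, toy_train_labels)

# Accuracy computed by hand: share of predictions equal to the true labels.
toy_predictions = toy_model.predict(toy_test_texts)
manual_accuracy = float(np.mean(toy_predictions == toy_test_labels))

# Accuracy as reported by scikit-learn.
pipeline_score = toy_model.score(toy_test_texts, toy_test_labels)
print(manual_accuracy, pipeline_score)
```

The two printed values agree, which is why multiplying the score by 100 gives a percentage of correctly classified emails.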


## 2. Tech email and Hockey email Classifier Model

In [45]:
emails = fetch_20newsgroups(categories = ['comp.sys.ibm.pc.hardware','rec.sport.hockey'])



### Exploring an Email

In [46]:
print(emails.data[5])

From: smorris@venus.lerc.nasa.gov (Ron Morris )
Subject: Murray as GM  (was: Wings will win
Organization: NASA Lewis Research Center
Lines: 37
Distribution: world
NNTP-Posting-Host: venus.lerc.nasa.gov
News-Software: VAX/VMS VNEWS 1.41    

In article <1993Apr19.204348.8254@sol.UVic.CA>, gballent@hudson.UVic.CA writes...
> 
>In article 735249453@vela.acs.oakland.edu, ragraca@vela.acs.oakland.edu (Randy A. Graca) writes:
> 
>>are predicting).  Although I think Bryan Murray is probably the best GM
>>I have ever seen in hockey
> 
>How do you figure that??  When Bryan Murray took over the Wings they were
>a pretty good team that was contending for the Stanley Cup but looked
>unlikely to win it.  Now they are a pretty good team that is contending for
>the Stanley Cup but looks unlikely to win it.  A truly great GM would
>have been able to make the moves to push the team to the upper echelon
>of the NHL and maybe win the Stanley Cup.  A good GM (like Murray) can

I think Murray has done a gr

In [47]:
print('This Email is Classified as:\n                      {}'.format(emails.target_names[emails.target[5]]))

This Email is Classified as:
                      rec.sport.hockey


### Naive Bayes

In [49]:
train_emails = fetch_20newsgroups(categories=['comp.sys.ibm.pc.hardware','rec.sport.hockey'], subset = 'train', shuffle = True, random_state = 108)

test_emails = fetch_20newsgroups(categories=['comp.sys.ibm.pc.hardware','rec.sport.hockey'], subset = 'test', shuffle = True, random_state = 108)




counter = CountVectorizer()
counter.fit(train_emails.data + test_emails.data )
train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)




classifier = MultinomialNB()
classifier.fit(train_counts, train_emails.target)
print('While distinguishing between \'Tech\' and \'Hockey\'\nThe model score is : {} %'.format(classifier.score(test_counts, test_emails.target)*100))

While distinguishing between 'Tech' and 'Hockey'
The model score is : 99.74715549936789 %


### The classifier was **99.7%** accurate when classifying **hockey and tech emails**.

This is better than the 97.2% it reached on **hockey and baseball emails**. This makes sense: emails about two sports probably share more words in common than emails about a sport and computer hardware.
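One rough way to probe the shared-vocabulary intuition is to measure how much two groups of texts overlap in the words they use. The sketch below computes a Jaccard overlap (shared words divided by all words) on invented sentences. The helper names `vocabulary` and `vocab_jaccard` and the toy sentences are hypothetical, and a toy result says nothing definitive about the real newsgroups.

```python
import re


def vocabulary(docs):
    """Return the set of lowercase words (two or more characters) in docs."""
    words = set()
    for doc in docs:
        words.update(re.findall(r"\b\w\w+\b", doc.lower()))
    return words


def vocab_jaccard(docs_a, docs_b):
    """Jaccard overlap between the vocabularies of two document groups."""
    vocab_a, vocab_b = vocabulary(docs_a), vocabulary(docs_b)
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)


# Hypothetical toy sentences standing in for each category.
hockey = ["the team won the game in overtime", "the goalie saved the game"]
baseball = ["the team won the game in extra innings", "the pitcher saved the game"]
tech = ["the motherboard needs a new bios", "install the driver for the disk controller"]

sports_overlap = vocab_jaccard(hockey, baseball)
tech_overlap = vocab_jaccard(hockey, tech)
print(sports_overlap, tech_overlap)
```

On this toy data the two sports groups share far more of their vocabulary than hockey and tech do. The same functions could be applied to `train_emails.data` split by `train_emails.target` to check the real categories.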

## 3. Contradictory Political email Classifier Model

In [50]:
emails = fetch_20newsgroups(categories = ['talk.politics.guns','talk.politics.mideast', 'talk.religion.misc'])

print(emails.target_names)

['talk.politics.guns', 'talk.politics.mideast', 'talk.religion.misc']


### Exploring an Email

In [51]:
print(emails.data[5])

From: roby@chopin.udel.edu (Scott W Roby)
Subject: Re: BATF/FBI Murders Almost Everyone in Waco Today! 4/19
Nntp-Posting-Host: chopin.udel.edu
Organization: University of Delaware
Lines: 32

In article <1993Apr20.142131.27347@rti.rti.org> jbs@rti.rti.org writes:
>In article <C5rpoJ.IJv@news.udel.edu> roby@chopin.udel.edu (Scott W Roby) writes:
>>
>>Well they had over 40 days to come out with their hands up on national tv 
>>to get the trial they deserved.  Instead they chose to set fire to their 
>>compund hours after the tanks dropped off the tear gas.
>
>This is about the third person who's parroted the FBI's line about the
>fires being set "six hours after the tear gas was injected."  Suppose you
>want to explain to us the videotape footage shown on national TV last night
>in which a tank with the gas-injecting tubes is pulling its injection tubes
>out of the second story of a building as the building begins to belch smoke
>and then fire?

I've already corrected my mistake earlier i

In [52]:
print('This Email is Classified as:\n                      {}'.format(emails.target_names[emails.target[5]]))

This Email is Classified as:
                      talk.politics.guns


### Naive Bayes

In [53]:
train_emails = fetch_20newsgroups(categories=['talk.politics.guns','talk.politics.mideast', 'talk.religion.misc'], subset = 'train', shuffle = True, random_state = 108)

test_emails = fetch_20newsgroups(categories=['talk.politics.guns','talk.politics.mideast', 'talk.religion.misc'], subset = 'test', shuffle = True, random_state = 108)




counter = CountVectorizer()
counter.fit(train_emails.data + test_emails.data )
train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)




classifier = MultinomialNB()
classifier.fit(train_counts, train_emails.target)
print('While distinguishing among \'Politics_Guns\', \'Middle East\' and \'Religion\'\nThe model score is : {} %'.format(classifier.score(test_counts, test_emails.target)*100))

While distinguishing among 'Politics_Guns', 'Middle East' and 'Religion'
The model score is : 94.24823410696267 %
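With three classes, a single accuracy figure does not show which pairs of topics get mixed up. A confusion matrix is one way to look at that. The sketch below uses scikit-learn's `confusion_matrix` on invented toy texts (labels 0, 1, 2 standing in for guns, Middle East and religion). The data and the `toy_` names are hypothetical, and unlike the cell above the vectorizer here is fitted on the toy training texts only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: 0 = guns, 1 = Middle East, 2 = religion.
toy_train_texts = [
    "rifle ownership law", "gun permit law",
    "border peace talks", "regional peace treaty",
    "church faith scripture", "faith prayer belief",
]
toy_train_labels = [0, 0, 1, 1, 2, 2]
toy_test_texts = ["gun law debate", "peace talks resume",
                  "faith and prayer", "law and faith"]
toy_test_labels = [0, 1, 2, 0]

toy_counter = CountVectorizer()
toy_train_counts = toy_counter.fit_transform(toy_train_texts)
toy_test_counts = toy_counter.transform(toy_test_texts)

toy_classifier = MultinomialNB()
toy_classifier.fit(toy_train_counts, toy_train_labels)
toy_predictions = toy_classifier.predict(toy_test_counts)

# Rows are true labels, columns are predicted labels.
matrix = confusion_matrix(toy_test_labels, toy_predictions, labels=[0, 1, 2])
print(matrix)

# The diagonal holds the correct predictions, so accuracy is trace / total.
accuracy = matrix.trace() / matrix.sum()
print(accuracy)
```

Off-diagonal cells count the emails of one topic that were assigned to another, which points to the topic pairs the classifier finds hardest to separate.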
