
In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Classification of emails with Naive Bayes

This project uses one of Scikit-Learn's real-world datasets - the 20 newsgroups dataset.

https://scikit-learn.org/stable/datasets/real_world.html

The original source of the data is:
http://qwone.com/~jason/20Newsgroups/

This is a collection of around 18,000 newsgroup posts (referred to as emails throughout this notebook) from 20 different newsgroups. The subjects of the posts range from cars to space to politics.

The goal in this project is to create a classifier that can determine whether the posts are about particular subjects.

The choice of subjects is somewhat arbitrary. We will use two sports, baseball and hockey, as the targets.

# Downloading the data

In [2]:
emails = fetch_20newsgroups()

The dataset is already split into training and test sets.

To load one of them, pass `subset='train'` or `subset='test'` alongside the `categories` argument. When `subset` is omitted, the training set is returned.

## Training set

In [3]:
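# Load the two sports newsgroups. With no `subset` argument the training set is returned;
# `emails` is used below to explore the data attributes.
emails = fetch_20newsgroups(categories=['rec.sport.baseball', 'rec.sport.hockey'])

# Create the training set with a fixed shuffle for reproducibility.
train_emails = fetch_20newsgroups(categories=['rec.sport.baseball', 'rec.sport.hockey'],
                                  subset='train',
                                  shuffle=True,
                                  random_state=1)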

### Investigating the data attributes

As always, the attributes of the dataset can be explored using tab completion: type `emails.` and press Tab to get a list of the available attributes.
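
Tab completion only works interactively. As a rough programmatic equivalent, the object returned by `fetch_20newsgroups` is a scikit-learn `Bunch`, which behaves like a dictionary, so its keys can be listed directly. The sketch below builds a small `Bunch` by hand (the name `toy` and its contents are made up for illustration) so that it runs without downloading the dataset.

```python
from sklearn.utils import Bunch

# Hypothetical stand-in for the object returned by fetch_20newsgroups,
# built by hand so the example runs without downloading anything.
toy = Bunch(
    data=["first post", "second post"],
    target=[0, 1],
    target_names=["rec.sport.baseball", "rec.sport.hockey"],
)

# A Bunch is a dict subclass, so keys() lists the available attributes.
attribute_names = sorted(toy.keys())
print(attribute_names)

# Each key is also reachable as an attribute.
print(toy.target_names[toy.target[0]])
```

Calling `sorted(emails.keys())` on the real dataset should work the same way.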

In [4]:
emails.data[0]

"From: dougb@comm.mot.com (Doug Bank)\nSubject: Re: Info needed for Cleveland tickets\nReply-To: dougb@ecs.comm.mot.com\nOrganization: Motorola Land Mobile Products Sector\nDistribution: usa\nNntp-Posting-Host: 145.1.146.35\nLines: 17\n\nIn article <1993Apr1.234031.4950@leland.Stanford.EDU>, bohnert@leland.Stanford.EDU (matthew bohnert) writes:\n\n|> I'm going to be in Cleveland Thursday, April 15 to Sunday, April 18.\n|> Does anybody know if the Tribe will be in town on those dates, and\n|> if so, who're they playing and if tickets are available?\n\nThe tribe will be in town from April 16 to the 19th.\nThere are ALWAYS tickets available! (Though they are playing Toronto,\nand many Toronto fans make the trip to Cleveland as it is easier to\nget tickets in Cleveland than in Toronto.  Either way, I seriously\ndoubt they will sell out until the end of the season.)\n\n-- \nDoug Bank                       Private Systems Division\ndougb@ecs.comm.mot.com          Motorola Communications Sect

In [5]:
emails.DESCR



In [6]:
emails.filenames

array(['C:\\Users\\Stewa\\scikit_learn_data\\20news_home\\20news-bydate-train\\rec.sport.baseball\\102709',
       'C:\\Users\\Stewa\\scikit_learn_data\\20news_home\\20news-bydate-train\\rec.sport.hockey\\53653',
       'C:\\Users\\Stewa\\scikit_learn_data\\20news_home\\20news-bydate-train\\rec.sport.baseball\\102689',
       ...,
       'C:\\Users\\Stewa\\scikit_learn_data\\20news_home\\20news-bydate-train\\rec.sport.baseball\\104496',
       'C:\\Users\\Stewa\\scikit_learn_data\\20news_home\\20news-bydate-train\\rec.sport.baseball\\102648',
       'C:\\Users\\Stewa\\scikit_learn_data\\20news_home\\20news-bydate-train\\rec.sport.hockey\\53647'],
      dtype='<U95')

In [7]:
emails.target

array([0, 1, 0, ..., 0, 0, 1], dtype=int64)

In [8]:
len(emails.target)

1197

In [9]:
emails.target_names

['rec.sport.baseball', 'rec.sport.hockey']
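
The integers in `target` are indices into `target_names`, so `0` stands for baseball and `1` for hockey. A minimal sketch of translating labels into names and counting posts per class follows; the arrays are toy values shaped like `emails.target` and `emails.target_names`, not the real data.

```python
import numpy as np

# Toy values shaped like emails.target and emails.target_names (not the real data).
target = np.array([0, 1, 0, 0, 1])
target_names = ["rec.sport.baseball", "rec.sport.hockey"]

# Translate each integer label into its newsgroup name.
labels = [target_names[i] for i in target]

# Count how many posts fall into each class.
counts = dict(zip(target_names, np.bincount(target).tolist()))

print(labels)
print(counts)
```

Checking the class counts this way is a quick test of whether the two categories are roughly balanced.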

## Test set

In [10]:
# creating the test set
test_emails = fetch_20newsgroups(categories=['rec.sport.baseball', 'rec.sport.hockey'],
                                 subset='test',
                                 shuffle=True,
                                 random_state=1)

# Creating the classifier

## Vectorising the data

The `CountVectorizer` class can be used to convert the text into a form suitable for modelling.

The following converts the data into vectors of word counts.
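
To make the idea of word-count vectors concrete, here is a small sketch on two made-up sentences (they are not taken from the dataset, and the names `docs`, `toy_counter` and `toy_matrix` are illustrative). Each row of the resulting matrix is one document and each column is one word of the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences, not taken from the dataset.
docs = ["the puck hit the net", "the pitcher threw the ball"]

toy_counter = CountVectorizer()
toy_matrix = toy_counter.fit_transform(docs)

# vocabulary_ maps each word to its column index; sort the words by that index.
vocab = sorted(toy_counter.vocabulary_, key=toy_counter.vocabulary_.get)
print(vocab)

# One row per document, one column per word in the vocabulary.
print(toy_matrix.toarray())
```

With the default settings the text is lowercased and single-character tokens are ignored, so the vocabulary contains only words of two or more characters.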

In [11]:
# Build the vocabulary from both sets, then convert each set into a sparse matrix of word counts
counter = CountVectorizer()
counter.fit(test_emails.data + train_emails.data)
train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)
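
The cell above builds the vocabulary from the training and test text together. A common alternative is to fit the vectoriser on the training text only, so that nothing from the test set influences the features. The sketch below illustrates this on made-up sentences; the names `toy_train`, `toy_test` and `train_only_counter` are hypothetical, and the score reported in this notebook was obtained with the combined vocabulary, not with this variant.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-ins for train_emails.data and test_emails.data.
toy_train = ["great save by the goalie", "home run in the ninth"]
toy_test = ["the goalie made a spectacular save"]

train_only_counter = CountVectorizer()

# The vocabulary comes from the training text only.
toy_train_counts = train_only_counter.fit_transform(toy_train)

# Words that never appeared in training (e.g. "spectacular") are dropped here.
toy_test_counts = train_only_counter.transform(toy_test)

print(toy_train_counts.shape, toy_test_counts.shape)
print("spectacular" in train_only_counter.vocabulary_)
```

Both matrices share the same number of columns, so a classifier trained on one can be applied to the other.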

## Creating a Naive Bayes Classifier

In [12]:
classifier = MultinomialNB()
classifier.fit(train_counts, train_emails.target)


## Scoring the classifier

In [13]:
print(classifier.score(test_counts, test_emails.target))

0.9723618090452262
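
For a classifier, `score` returns the mean accuracy, that is, the fraction of test posts whose predicted label matches the true label. The sketch below checks this on a handful of made-up posts (all `toy_*` names, `vec` and `model` are illustrative and separate from the notebook's variables) by comparing `score` with a manual calculation.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up posts; 0 = baseball, 1 = hockey, mirroring target_names.
toy_train = ["pitcher threw a strike", "home run over the fence",
             "goalie stopped the puck", "power play on the ice"]
toy_train_labels = np.array([0, 0, 1, 1])
toy_test = ["the pitcher hit a home run", "the puck slid across the ice"]
toy_test_labels = np.array([0, 1])

vec = CountVectorizer()
model = MultinomialNB()
model.fit(vec.fit_transform(toy_train), toy_train_labels)

toy_test_matrix = vec.transform(toy_test)
predictions = model.predict(toy_test_matrix)

# score() is the mean accuracy: the share of predictions that match the labels.
manual_accuracy = float(np.mean(predictions == toy_test_labels))
print(manual_accuracy, model.score(toy_test_matrix, toy_test_labels))
```

Read this way, the score of about 0.972 above means that roughly 97% of the test posts were assigned to the correct newsgroup.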
