# Newsgroups Similarity

In this project we apply scikit-learn's Multinomial Naive Bayes classifier to the 20 Newsgroups dataset that ships with scikit-learn, to find which category combinations are harder for it to distinguish. We do that by reporting the accuracy of two classifiers, each fit on a different set of newsgroup categories.

How difficult is it to distinguish posts about hockey from posts about baseball?
How hard is it to tell the difference between posts about sports and posts about tech?
In this project, we'll find out how difficult those two tasks are.

We are going to use the 20 Newsgroups dataset which, according to the documentation, comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training (or development) and one for testing (or performance evaluation). The split between the train and test sets is based on messages posted before and after a specific date. The documentation also says that it is easy for any classifier to overfit on particular things that appear in the 20 Newsgroups data. Many classifiers achieve very high F-scores, but their results would not generalize to other documents that aren't from this window of time. For this reason, the functions that load 20 Newsgroups data provide a parameter called `remove`, which tells them what kinds of information to strip out of each file. We are going to remove headers, signature blocks (`footers`), and quotation blocks (`quotes`), as the documentation recommends, to get more realistic results.

Let's start with imports and data exploration.


In [1]:
# Data import
from sklearn.datasets import fetch_20newsgroups

# Imports for ml
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Imports related to evaluation
from sklearn.metrics import classification_report

## Exploring the Data

We’re interested in seeing how effective our Naive Bayes classifier is at telling the difference between a baseball and a hockey newsgroup as well as the difference between sports and tech newsgroups. Let's create 2 variables containing lists of target newsgroups.

In [2]:
# Preselect sports categories
cats_sport = [
    'rec.sport.baseball',
    'rec.sport.hockey',
]

# Preselect sports and tech categories
cats_sport_tech = [
    'rec.sport.baseball',
    'rec.sport.hockey',
    'comp.os.ms-windows.misc',
    'comp.sys.ibm.pc.hardware',
]

Now let's fetch the posts that belong to our preselected newsgroups. By default, `fetch_20newsgroups` loads only the training subset.

In [3]:
# Fetch only preselected categories from 20 available
ng_sport = fetch_20newsgroups(categories = cats_sport)
ng_sport_tech = fetch_20newsgroups(categories = cats_sport_tech)

All the posts are stored in a list called `data`. Let's access this list and view the first item in each of the previously created variables.

In [4]:
# Show the 1st item of sports newsgroups
ng_sport.data[0]

"From: dougb@comm.mot.com (Doug Bank)\nSubject: Re: Info needed for Cleveland tickets\nReply-To: dougb@ecs.comm.mot.com\nOrganization: Motorola Land Mobile Products Sector\nDistribution: usa\nNntp-Posting-Host: 145.1.146.35\nLines: 17\n\nIn article <1993Apr1.234031.4950@leland.Stanford.EDU>, bohnert@leland.Stanford.EDU (matthew bohnert) writes:\n\n|> I'm going to be in Cleveland Thursday, April 15 to Sunday, April 18.\n|> Does anybody know if the Tribe will be in town on those dates, and\n|> if so, who're they playing and if tickets are available?\n\nThe tribe will be in town from April 16 to the 19th.\nThere are ALWAYS tickets available! (Though they are playing Toronto,\nand many Toronto fans make the trip to Cleveland as it is easier to\nget tickets in Cleveland than in Toronto.  Either way, I seriously\ndoubt they will sell out until the end of the season.)\n\n-- \nDoug Bank                       Private Systems Division\ndougb@ecs.comm.mot.com          Motorola Communications Sect

In [5]:
# Show the 1st item of sports & tech newsgroups
ng_sport_tech.data[0]

'From: jimg@cybernet.cse.fau.edu (Jim Gorycki)\nSubject: New Franchise name\nOrganization: Cybernet BBS, Boca Raton, Florida\nLines: 31\n\nThe new name is Florida Panthers.  \nThe panther is an endangered species, mostly located in the Everglades.\nA couple of years ago, there were license plates made with Panthers on\nthem (part of the revenue were to go to some protection fund).\n\nThe name of the new President of the Panthers should be announced today.\n\nAs of yesterday\'s paper, Huizenga\'s new hockey team will take the ice at\nthe Miami Arena this fall.  The team has a guaranteed two-year lease with\nthe arena, with four one-year options that could run through 1999.\n\n"It\'s not our choice", James Blosser, a lawyer and Huizenga Aid said\nabout ruling out the arena as a long term option.  "The NHL told us we \ncan\'t stay there.  It\'s not economically feasible."\n\nOne reason is because the Miami Heat basketball team controls skybox\nand advertising revenue at the arena, reducin

All of the labels can be found in the array `target`. When we fetch only 2 newsgroups this array contains only 2 distinct numerical values.

In [6]:
ng_sport.target[:10]

array([0, 1, 0, 1, 0, 1, 0, 1, 1, 1], dtype=int64)

When we fetch more newsgroups the number of distinct numerical values rises to match the number of newsgroups - in our case to 4.

In [7]:
ng_sport_tech.target[:10]

array([3, 3, 0, 2, 1, 2, 2, 3, 1, 1], dtype=int64)

Let's check whether the number of posts and the number of labels match for both variables.

In [8]:
print('Sports newsgroups labels vs data ratio:', len(ng_sport.target), '/', len(ng_sport.data)) 
print('Sports & Tech newsgroups labels vs data ratio:', len(ng_sport_tech.target), '/', len(ng_sport_tech.data))


Sports newsgroups labels vs data ratio: 1197 / 1197
Sports & Tech newsgroups labels vs data ratio: 2378 / 2378


The labels themselves are numbers, but those numbers correspond to the label names found in a list `target_names`, so we can easily map one to another. 

In [9]:
# Map the label of the item at index 5 in dataset ng_sport to its target name
ng_sport.target_names[ng_sport.target[5]]

'rec.sport.hockey'
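
The same lookup can be applied to every label at once. The sketch below is only an illustration: it hard-codes the two sports category names and the ten labels printed above instead of downloading the dataset, and the variable names (`target_names`, `target`, `names`, `label_counts`) are stand-ins chosen for this example.

```python
import numpy as np

# Stand-ins for ng_sport.target_names and ng_sport.target[:10],
# copied from the outputs shown above (no dataset download needed)
target_names = ['rec.sport.baseball', 'rec.sport.hockey']
target = np.array([0, 1, 0, 1, 0, 1, 0, 1, 1, 1])

# Indexing an array of names with the label array maps all labels in one step
names = np.array(target_names)[target]

# Count how many of these posts carry each label
counts = np.bincount(target, minlength=len(target_names)).tolist()
label_counts = dict(zip(target_names, counts))

print(names[5])
print(label_counts)
```

With the real data, `ng_sport.target_names` and `ng_sport.target` would take the place of the two hard-coded variables.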

## Making the Training and Test Sets

We now want to split our data into training and test sets (one pair for each variable) and strip newsgroup-specific metadata: `headers`, `footers`, and `quotes`.

In [10]:
train_ng_sport = fetch_20newsgroups(
    categories = cats_sport,
    subset='train',
    remove=('headers', 'footers', 'quotes'),
    shuffle=True, 
    random_state=33
)
 
test_ng_sport = fetch_20newsgroups(
    categories = cats_sport,
    subset='test',
    remove=('headers', 'footers', 'quotes'),
    shuffle=True, 
    random_state=33
)

train_ng_sport_tech = fetch_20newsgroups(
    categories = cats_sport_tech,
    subset='train',
    remove=('headers', 'footers', 'quotes'),
    shuffle=True, 
    random_state=96
)
 
test_ng_sport_tech = fetch_20newsgroups(
    categories = cats_sport_tech,
    subset='test',
    remove=('headers', 'footers', 'quotes'),
    shuffle=True, 
    random_state=96
)

In [11]:
print('Sports train and test sets ratio:', len(train_ng_sport.data), '/', len(test_ng_sport.data)) 
print('Sports & Tech train and test sets ratio:', len(train_ng_sport_tech.data), '/', len(test_ng_sport_tech.data))

Sports train and test sets ratio: 1197 / 796
Sports & Tech train and test sets ratio: 2378 / 1582


## Counting Words

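We want to transform these posts into vectors of word counts. The `CountVectorizer` class makes this easy for us, but we need two of them, one per category set, so that each classifier works with its own vocabulary. Note that each vectorizer below is fit on the training and test text together, so its vocabulary also includes words that appear only in the test set.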


In [12]:
counter_s = CountVectorizer()
counter_s.fit(train_ng_sport.data + test_ng_sport.data)

counter_st = CountVectorizer()
counter_st.fit(train_ng_sport_tech.data + test_ng_sport_tech.data)

CountVectorizer()

In [13]:
train_s_counts = counter_s.transform(train_ng_sport.data)
test_s_counts = counter_s.transform(test_ng_sport.data)

train_st_counts = counter_st.transform(train_ng_sport_tech.data)
test_st_counts = counter_st.transform(test_ng_sport_tech.data)
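
To make the counting step concrete, here is a minimal sketch on a toy corpus. The sentences are invented for illustration and are not taken from the 20 Newsgroups data. The sketch fits the vectorizer on the training text only, which is a common alternative to the approach used in this notebook; a word that appears only in the test text is then simply ignored by `transform`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration (not from the 20 Newsgroups data)
toy_train_docs = ["the goalie made a save", "the pitcher threw a strike"]
toy_test_docs = ["the goalie threw a puck"]

toy_counter = CountVectorizer()
toy_counter.fit(toy_train_docs)  # vocabulary is learned from the training text only

# Vocabulary terms listed in column order of the count matrix
vocab = sorted(toy_counter.vocabulary_, key=toy_counter.vocabulary_.get)

# One row per document, one column per vocabulary term
toy_test_counts = toy_counter.transform(toy_test_docs).toarray()

print(vocab)
print(toy_test_counts)
```

With the default settings, single-character tokens such as "a" are dropped, and "puck" gets no column because it never occurs in the training text.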

## Making a Naive Bayes Classifier

Now it's time to create classifiers, fit them and compare their accuracy scores.

In [14]:
classifier_s = MultinomialNB()
classifier_st = MultinomialNB()

classifier_s.fit(train_s_counts, train_ng_sport.target)
classifier_st.fit(train_st_counts, train_ng_sport_tech.target)


MultinomialNB()

In [15]:
print('Sports newsgroups distinction accuracy score:', round(classifier_s.score(test_s_counts, test_ng_sport.target),2) * 100, '%')
print('Sports and Tech newsgroups distinction accuracy score:', round(classifier_st.score(test_st_counts, test_ng_sport_tech.target),2) * 100, '%')

Sports newsgroups distinction accuracy score: 87.0 %
Sports and Tech newsgroups distinction accuracy score: 70.0 %
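
The vectorize, fit, and score steps can also be chained into a single object. The following is a sketch under assumptions rather than the approach used in this notebook: it relies on scikit-learn's `make_pipeline`, and the toy posts, labels, and variable names are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy posts invented for illustration; 0 = baseball, 1 = hockey
toy_train_docs = [
    "goalie puck ice",
    "puck ice rink",
    "pitcher bat inning",
    "bat inning homer",
]
toy_train_labels = [1, 1, 0, 0]
toy_test_docs = ["ice puck", "homer bat"]
toy_test_labels = [1, 0]

# The pipeline counts words and then fits the classifier on those counts
toy_model = make_pipeline(CountVectorizer(), MultinomialNB())
toy_model.fit(toy_train_docs, toy_train_labels)

# score() vectorizes the test posts with the fitted vocabulary, then returns accuracy
toy_accuracy = toy_model.score(toy_test_docs, toy_test_labels)
print(f"Toy accuracy: {toy_accuracy:.0%}")
```

The `:.0%` format specifier prints the score as a percentage directly, which avoids the manual `round(...) * 100` arithmetic.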


In [16]:
# Calculate evaluation metrics for sports ng
print(classification_report(test_ng_sport.target, classifier_s.predict(test_s_counts)))

              precision    recall  f1-score   support

           0       0.82      0.95      0.88       397
           1       0.94      0.79      0.86       399

    accuracy                           0.87       796
   macro avg       0.88      0.87      0.87       796
weighted avg       0.88      0.87      0.87       796



In [17]:
# Calculate evaluation metrics for sports & tech ng
print(classification_report(test_ng_sport_tech.target, classifier_st.predict(test_st_counts)))

              precision    recall  f1-score   support

           0       1.00      0.01      0.02       394
           1       0.50      0.98      0.66       392
           2       0.90      0.88      0.89       397
           3       0.86      0.92      0.89       399

    accuracy                           0.70      1582
   macro avg       0.82      0.70      0.61      1582
weighted avg       0.82      0.70      0.62      1582
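
A confusion matrix shows which classes get mistaken for which, something the per-class scores only hint at. The sketch below uses hypothetical labels and predictions invented for illustration (they are not the notebook's results) together with scikit-learn's `confusion_matrix` and the `target_names` argument of `classification_report`.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Category names in the alphabetical order used for the numeric labels
toy_names = [
    'comp.os.ms-windows.misc',
    'comp.sys.ibm.pc.hardware',
    'rec.sport.baseball',
    'rec.sport.hockey',
]

# Hypothetical true labels and predictions, invented for illustration
toy_true = [0, 0, 0, 1, 1, 2, 2, 3, 3, 3]
toy_pred = [1, 1, 0, 1, 1, 2, 2, 3, 3, 2]

# Rows are true classes, columns are predicted classes
toy_matrix = confusion_matrix(toy_true, toy_pred, labels=[0, 1, 2, 3])
print(toy_matrix)

# target_names replaces the numeric labels in the report with readable names
toy_report = classification_report(toy_true, toy_pred, target_names=toy_names)
print(toy_report)
```

In this toy matrix, the off-diagonal value in row 0, column 1 counts the posts of class 0 that were predicted as class 1. On the real data, the same calls would take `test_ng_sport_tech.target`, `classifier_st.predict(test_st_counts)`, and `test_ng_sport_tech.target_names`.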



## Conclusion 

We created two classifiers that were fit on two sets of newsgroups. The first set contained only the two sports newsgroups; the second extended it with two tech newsgroups.
- The sports-only classifier reached `87%` accuracy, 17 percentage points more than the `70%` of the mixed version.
- The per-class scores in the detailed report show that the drop comes from the two tech newsgroups: label 0 (`comp.os.ms-windows.misc`) has a recall of only 0.01, while label 1 (`comp.sys.ibm.pc.hardware`) has a recall of 0.98 but a precision of 0.50. This suggests that the classifier assigned nearly all Windows posts to the hardware newsgroup.
- At the same time, it did just fine at telling tech and sports newsgroups apart in general: both sports newsgroups (labels 2 and 3) reached an F1-score of 0.89 in the mixed setting.