### Naive Bayes

The following work is based on the tutorial located at [http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html). First, import the dataset and assign a variable to the training data.

In [1]:
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)

Some print statements to get a sense of the underlying data.

In [118]:
# List of names corresponding to newsgroup integers
print twenty_train.target_names
# List of emails
print len(twenty_train.data)
# List of newsgroup integer 
print len(twenty_train.target)
# Print the first three lines of the 0th email
print("\n".join(twenty_train.data[0].split("\n")[:3]))
# Print the newsgroups corresponding to the target integers of the first 10 emails
for t in twenty_train.target[:10]:print(twenty_train.target_names[t])

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
11314
11314
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
rec.autos
comp.sys.mac.hardware
comp.sys.mac.hardware
comp.graphics
sci.space
talk.politics.guns
sci.med
comp.sys.ibm.pc.hardware
comp.os.ms-windows.misc
comp.sys.mac.hardware


Warning: computationally expensive. The built in vectorizer creates a sparse representation of words that are longer than 2 letters, removing stop words (highly common words) and punctuation. See e.g. [http://scikit-learn.org/stable/modules/feature_extraction.html](http://scikit-learn.org/stable/modules/feature_extraction.html). By setting `binary=True` the vectorizer considers all non-zero counts to be 1. This corresponds to the binomial version of the naive Bayes assumption.

In [68]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(binary=True)
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(11314, 130107)

I prefer to convert the returned object to a simple array. The `CountVectorizer.build_analyzer()` is a useful tool for extracting feature indices from test data. A sample text is analyzed and given as a list of words, and as a list of feature indices.

In [71]:
X_train = X_train_counts.toarray()
print X_train[47]
print X_train.shape
analyze = count_vect.build_analyzer()
a = analyze("This is something that you never saw, but might want to categorize.")
print a
indices = []
for item in a:
    indices.append(count_vect.vocabulary_.get(item))
print indices

[0 0 0 ..., 0 0 0]
(11314, 130107)
[u'this', u'is', u'something', u'that', u'you', u'never', u'saw', u'but', u'might', u'want', u'to', u'categorize']
[114731, 68532, 108821, 114440, 128402, 86839, 104813, 35805, 81998, 123196, 115475, 38131]


Check that features for some random word are indeed binomial. In the following example, the 3453th word contains 63 unique words from the corpus.

In [70]:
import numpy as np
unique, counts = np.unique(X_train[3453], return_counts=True)
print unique
print counts

[0 1]
[130044     63]


Need to use this to check whether the count is non-zero. My first guess of a non-occurring word was successful!  

In [122]:
print type(count_vect.vocabulary_.get('doogie'))

<type 'NoneType'>


Check how balanced the data is across newsgroups. Looks quite balanced. A sanity check is provided to confirm a partition of categories in the provided list of names.

In [87]:
from itertools import compress
totalEntries = 0
for i in range(len(twenty_train.target_names)):
    boolVec = twenty_train.target == i
    #print boolVec[0:47]
    subset = X_train[boolVec,:]
    totalEntries += subset.shape[0]
    print twenty_train.target_names[i], ":", subset.shape

print totalEntries

alt.atheism : (480, 130107)
comp.graphics : (584, 130107)
comp.os.ms-windows.misc : (591, 130107)
comp.sys.ibm.pc.hardware : (590, 130107)
comp.sys.mac.hardware : (578, 130107)
comp.windows.x : (593, 130107)
misc.forsale : (585, 130107)
rec.autos : (594, 130107)
rec.motorcycles : (598, 130107)
rec.sport.baseball : (597, 130107)
rec.sport.hockey : (600, 130107)
sci.crypt : (595, 130107)
sci.electronics : (591, 130107)
sci.med : (594, 130107)
sci.space : (593, 130107)
soc.religion.christian : (599, 130107)
talk.politics.guns : (546, 130107)
talk.politics.mideast : (564, 130107)
talk.politics.misc : (465, 130107)
talk.religion.misc : (377, 130107)
11314


In [152]:
# Given a word and an integer category, it calculates the likelihood that a document with that word belongs
# to that category.

# Subset the training dataset for the category.
# Numerator:    number of occurrences of word + 1
# Denominator:  number of documents in subset +  len(twenty_train.target_names)
from itertools import compress

totalDocs = X_train.shape[0]
subset = []
def wordCatProb(word):
    #subsetBoolVector = twenty_train.target == category
    #subset = X_train[subsetBoolVector,:]
    index = count_vect.vocabulary_.get(word)
    if index is None:
        numerator = float(1)
    else:
        numerator = float(sum(subset[:,index])) + 1
    denominator = subset.shape[0] + len(twenty_train.target_names)
    return numerator/denominator
    


It turns out that just over one third of the entries in the "Atheism" newsgroup contain the word "God" or "god", whereas soc.religion.christian has that word in just over half of its submissions. Note that the vectorizer converts all words to lowercase.

In [142]:
word = "god"
print "Occurences of", word, "in", twenty_train.target_names[0],":",  wordCatProb(word, 0)
print "Occurences of", word, "in", twenty_train.target_names[15],":",  wordCatProb(word, 15)


Occurences of god in alt.atheism : 0.346
Occurences of god in soc.religion.christian : 0.557350565428


In [136]:
twenty_test = fetch_20newsgroups(subset='test', shuffle=True, random_state=42)
test_emails = twenty_test.data
test_labels = twenty_test.target

print test_emails[0]
print twenty_train.target_names[test_labels[0]]
print analyze(test_emails[0])

From: v064mb9k@ubvmsd.cc.buffalo.edu (NEIL B. GANDLER)
Subject: Need info on 88-89 Bonneville
Organization: University at Buffalo
Lines: 10
News-Software: VAX/VMS VNEWS 1.41
Nntp-Posting-Host: ubvmsd.cc.buffalo.edu


 I am a little confused on all of the models of the 88-89 bonnevilles.
I have heard of the LE SE LSE SSE SSEI. Could someone tell me the
differences are far as features or performance. I am also curious to
know what the book value is for prefereably the 89 model. And how much
less than book value can you usually get them for. In other words how
much are they in demand this time of year. I have heard that the mid-spring
early summer is the best time to buy.

			Neil Gandler

rec.autos
[u'from', u'v064mb9k', u'ubvmsd', u'cc', u'buffalo', u'edu', u'neil', u'gandler', u'subject', u'need', u'info', u'on', u'88', u'89', u'bonneville', u'organization', u'university', u'at', u'buffalo', u'lines', u'10', u'news', u'software', u'vax', u'vms', u'vnews', u'41', u'nntp', u'posting', u'

In [None]:
from time import time

test_predictions = []
counter = 0

test_emails = test_emails[0:400]

for i in test_emails:
    t0 = time()
    squozenEmail = analyze(i)
    squozenEmail = set(squozenEmail)
    phi = []
    probDict = {}
    for cat in range(len(twenty_train.target_names)):
        subsetBoolVector = twenty_train.target == cat
        subset = X_train[subsetBoolVector,:]
        # cat is the Category under consideration.
        prob = 1
        for word in squozenEmail:
            if word + str(cat) in probDict.keys():
                prob *= probDict[word + str(cat)]
            else:
            #print word
                curProb = wordCatProb(word)
                prob *= curProb
                probDict[word + str(cat)] = curProb
            #print "wordcatprob", wordCatProb(word)
        phi.append(prob)
        #print prob
    test_predictions.append(phi.index(max(phi)))
    if counter % 10 == 0:
        print "Just finished email number", "\t\t", counter
        print "Category was", "\t\t", test_predictions[counter]
        print "Categorization time:", "\t\t", round(time()-t0, 3), "s"
    counter+=1 
    
print test_predictions.shape     
print test_labels.shape

In [140]:
phi

[0.0]

In [155]:
print test_labels[0:20]

[ 7  5  0 17 19 13 15 15  5  1  2  5 17  8  0  2  4  1  6 16]
