# Naïve Bayesian Classifier for text classification

Assuming a set of documents that need to be classified, use the naïve Bayesian Classifier model to perform this task. Built-in Java classes/API can be used to write the program. Calculate the accuracy, precision, and recall for your data set.

In [1]:
# import dataset from sklearn
from sklearn.datasets import fetch_20newsgroups

In [2]:
# This lists out all the available categories
print(fetch_20newsgroups().target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [3]:
# Here we'll pick only a few for now
categories = fetch_20newsgroups().target_names[:3]
categories

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc']

In [4]:
# Fetch the training and testing sets for the categories selected above
train = fetch_20newsgroups(categories=categories, shuffle=True, subset='train')
test = fetch_20newsgroups(categories=categories, shuffle=True, subset='test')

In [5]:
# Each object returned above is a dict-like bundle with the following keys; feel free to explore
train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [6]:
# The text feature we will first extract is frequency/count hence we need a count vectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [7]:
# Instantiate the vectorizer, fit it on the training documents, and transform them into counts
count = CountVectorizer()
train_data = count.fit_transform(train.data)

In [86]:
# The result is a sparse count matrix of shape (number of documents, vocabulary size)
train_data.shape

(1655, 57389)
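
To make the count step concrete, here is a minimal pure-Python sketch of a bag-of-words count matrix. The two-document corpus and the helper names `tokenize` and `count_vector` are made up for illustration, and the tokenizer only approximates what `CountVectorizer` does by default (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter

# Toy corpus, invented for illustration (not taken from the 20 newsgroups data)
docs = ["the cat sat on the mat", "the dog sat"]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token, sorted, mapped to a column index
all_tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
vocab = {tok: idx for idx, tok in enumerate(all_tokens)}

def count_vector(text):
    # One row of the matrix: how often each vocabulary term occurs in the text
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocab]

matrix = [count_vector(doc) for doc in docs]
print(vocab)
print(matrix)
```

Each row corresponds to one document and each column to one vocabulary term, which is why the shape above is (documents, vocabulary size).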

In [8]:
# Now that we have the basic count we can calculate tf and idf for the train data documents using the below
from sklearn.feature_extraction.text import TfidfTransformer

In [9]:
# Instantiate the tf-idf transformer, fit it on the count matrix, and transform the counts
tfidf = TfidfTransformer()
trans_train_data = tfidf.fit_transform(train_data)

In [11]:
# The result is a tf-idf weighted feature matrix with the same shape as the count matrix
trans_train_data.shape

(1655, 57389)
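
As a rough illustration of what the tf-idf step computes, the sketch below reweights a small count matrix by hand. It assumes the `TfidfTransformer` defaults (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); the 3 x 3 count matrix is hypothetical.

```python
import math

# Hypothetical count matrix: 3 documents (rows) x 3 terms (columns)
counts = [
    [2, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents each term appears
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency (assumed default formula)
idf = [math.log((1 + n_docs) / (1 + df[j])) + 1 for j in range(n_terms)]

def tfidf_row(row):
    # Weight raw counts by idf, then scale the row to unit Euclidean length
    weighted = [row[j] * idf[j] for j in range(n_terms)]
    norm = math.sqrt(sum(w * w for w in weighted))
    return [w / norm if norm else 0.0 for w in weighted]

tfidf_matrix = [tfidf_row(row) for row in counts]
for row in tfidf_matrix:
    print([round(w, 3) for w in row])
```

A term that appears in every document (the first column here) gets the minimum idf of 1, so rarer terms carry relatively more weight. The shape is unchanged because only the values are rescaled.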

In [12]:
# Import a naive Bayes model; in this case, MultinomialNB
from sklearn.naive_bayes import MultinomialNB

In [13]:
# Instantiate the model and fit it on the transformed training data and the expected target labels
model = MultinomialNB()
model.fit(trans_train_data, train.target)

MultinomialNB()
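
For intuition about what the fitted model does, here is a minimal pure-Python sketch of multinomial naive Bayes with Laplace smoothing (`alpha = 1.0`, which matches the `MultinomialNB` default). It is not scikit-learn's implementation: the toy corpus, labels, and function names are invented, and it works on raw word counts rather than the tf-idf weights used in this notebook.

```python
import math
from collections import Counter, defaultdict

# Toy labelled corpus, invented for illustration
train_docs = [
    ("god faith belief", "religion"),
    ("faith church god", "religion"),
    ("pixel image render", "graphics"),
    ("image render polygon", "graphics"),
]
alpha = 1.0  # Laplace smoothing strength

# Count documents per class and word occurrences per class
class_docs = Counter(label for _, label in train_docs)
word_counts = defaultdict(Counter)
for text, label in train_docs:
    word_counts[label].update(text.split())
vocabulary = {w for counts in word_counts.values() for w in counts}

def log_posterior(text, label):
    # log P(class) + sum of log P(word | class), up to a shared constant
    score = math.log(class_docs[label] / len(train_docs))
    total = sum(word_counts[label].values())
    for word in text.split():
        if word in vocabulary:  # words unseen in training are ignored
            numerator = word_counts[label][word] + alpha
            denominator = total + alpha * len(vocabulary)
            score += math.log(numerator / denominator)
    return score

def predict(text):
    # Choose the class with the highest log posterior
    return max(class_docs, key=lambda label: log_posterior(text, label))

print(predict("god image faith"))
print(predict("render polygon"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a word that never occurred in a class from forcing that class's probability to zero.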

In [14]:
# The model is now trained. We can test it on data it has never seen before.

In [15]:
# Vectorize the test data with the already-fitted objects (transform only, no refitting)
test_data = count.transform(test.data)
trans_test_data = tfidf.transform(test_data)
trans_test_data.shape

(1102, 57389)

In [17]:
# Call model.predict() with the transformed test data and store the predictions
preds = model.predict(trans_test_data)

In [18]:
# For evaluating model performance
from sklearn import metrics

In [19]:
# Call classification_report with the expected and predicted values along with the target class names
print(metrics.classification_report(test.target, preds, target_names=test.target_names))

                         precision    recall  f1-score   support

            alt.atheism       0.90      0.98      0.94       319
          comp.graphics       0.89      0.88      0.88       389
comp.os.ms-windows.misc       0.91      0.86      0.88       394

               accuracy                           0.90      1102
              macro avg       0.90      0.91      0.90      1102
           weighted avg       0.90      0.90      0.90      1102



In [20]:
# Confusion matrix: rows are the true classes, columns are the predicted classes
print(metrics.confusion_matrix(test.target, preds))

[[314   4   1]
 [ 16 342  31]
 [ 18  39 337]]
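
The accuracy, precision, and recall in the classification report can be recomputed directly from this confusion matrix. The sketch below copies the matrix printed above and assumes rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
# Confusion matrix copied from the output above
cm = [
    [314, 4, 1],
    [16, 342, 31],
    [18, 39, 337],
]
labels = ["alt.atheism", "comp.graphics", "comp.os.ms-windows.misc"]

# Accuracy: correctly classified documents (the diagonal) over all documents
total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

precision = {}
recall = {}
for i, name in enumerate(labels):
    predicted_as_i = sum(row[i] for row in cm)  # column sum
    actually_i = sum(cm[i])                     # row sum (the "support")
    precision[name] = cm[i][i] / predicted_as_i
    recall[name] = cm[i][i] / actually_i

print(f"accuracy: {accuracy:.2f}")
for name in labels:
    print(f"{name}: precision={precision[name]:.2f} recall={recall[name]:.2f}")
```

Rounded to two decimals, these values agree with the report: for example, `alt.atheism` has high recall (314 of 319 documents found) but somewhat lower precision, because 34 documents from the other two classes were also assigned to it.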


# End