# 6) Assuming a set of documents that need to be classified, use the naïve Bayesian Classifier model to perform this task. Built-in Java classes/API can be used to write the program. Calculate the accuracy, precision, and recall for your data set.

#### Document Classification using Naive Bayes Classifier

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
txt=pd.read_csv('Text.csv',names=['text','label']) #Tabular form data
print('Total instances in the dataset:',txt.shape[0])
txt['labelnum']=txt.label.map({'pos':1,'neg':0})
X=txt.text
Y=txt.labelnum
print('\nThe message and its label of first 5 instances are listed below')
X5, Y5 = X[0:5], txt.label[0:5]
for x, y in zip(X5,Y5):
    print(x,'->',y)

Total instances in the dataset: 18

The message and its label of first 5 instances are listed below
I love this sandwich -> pos
This is an amazing place -> pos
I feel very good about these beers -> pos
This is my best work -> pos
What an awesome view -> pos


In [3]:
# Splitting the dataset into train and test data
xtrain,xtest,ytrain,ytest=train_test_split(X,Y,random_state=0)
print('\nDataset is split into Training and Testing samples')
print('Total training instances :', xtrain.shape[0])
print('Total testing instances :', xtest.shape[0])


Dataset is split into Training and Testing samples
Total training instances : 13
Total testing instances : 5
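
The 13/5 split above follows from `train_test_split` being called without a `test_size`: the default test fraction is 0.25, and a fractional test share is rounded up. A minimal sketch of that arithmetic, where the helper name `default_split_sizes` is hypothetical and not part of scikit-learn:

```python
import math

def default_split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test) for a fractional test_size (hypothetical helper)."""
    n_test = math.ceil(test_size * n_samples)  # the test share is rounded up
    n_train = n_samples - n_test               # the remainder is used for training
    return n_train, n_test

n_train, n_test = default_split_sizes(18)
print('train:', n_train, 'test:', n_test)
```

With only 18 instances, a 5-document test set makes every metric very coarse: one misclassified document changes accuracy by 0.2.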


In [4]:
# CountVectorizer performs feature extraction: it tokenizes each document
# and counts how often each vocabulary word occurs in it.
# Its output is a sparse document-term matrix.
count_vect = CountVectorizer()
xtrain_dtm = count_vect.fit_transform(xtrain) # Transform training data into document-term matrix
xtest_dtm = count_vect.transform(xtest)
print('\nTotal features extracted using CountVectorizer:',xtrain_dtm.shape[1])


Total features extracted using CountVectorizer: 50
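
To make the document-term matrix less of a black box, the following is a rough pure-Python sketch of what `CountVectorizer` does with its default settings: lowercase the text, keep tokens of two or more word characters, and sort the vocabulary alphabetically. The function names (`tokenize`, `fit_vocabulary`, `transform`) are illustrative assumptions, and the word "pizza" is made up to show how unseen words are handled.

```python
import re
from collections import Counter

# Tokens of two or more word characters, so "I" and "a" are dropped.
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN.findall(text.lower())

def fit_vocabulary(docs):
    # The vocabulary is learned from the training documents only.
    return sorted({tok for doc in docs for tok in tokenize(doc)})

def transform(docs, vocab):
    # One row per document, one column per vocabulary word.
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows

train_docs = ["I love this sandwich", "What an awesome view"]
vocab = fit_vocabulary(train_docs)
matrix = transform(train_docs, vocab)

# Words not seen during fitting ("pizza") are silently ignored.
unseen = transform(["I love this awesome pizza"], vocab)

print(vocab)
print(matrix)
print(unseen)
```

This is why the test data is passed through `transform` rather than `fit_transform`: the columns must match the vocabulary learned from the training set.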


In [5]:
print('\nFeatures for first 5 training instances are listed below')
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df=pd.DataFrame(xtrain_dtm.toarray(),columns=count_vect.get_feature_names_out())
#print(df[0:5])#tabular representation
print(df)
#print(xtrain_dtm) #Same as above but sparse matrix representation
# Training Naive Bayes (NB) classifier on training data.
clf = MultinomialNB().fit(xtrain_dtm,ytrain)
predicted = clf.predict(xtest_dtm)


Features for first 5 training instances are listed below
    about  am  an  and  awesome  bad  beers  best  boss  can  ...   today  \
0       0   0   1    0        1    0      0     0     0    0  ...       0   
1       1   0   0    0        0    0      1     0     0    0  ...       0   
2       0   0   0    0        0    0      0     0     0    0  ...       0   
3       0   0   0    0        0    0      0     0     0    0  ...       1   
4       0   0   0    0        0    0      0     0     1    0  ...       0   
5       0   0   0    0        0    0      0     0     0    1  ...       0   
6       0   1   0    1        0    0      0     0     0    0  ...       0   
7       0   0   0    0        0    0      0     0     0    0  ...       0   
8       0   0   0    0        0    0      0     1     0    0  ...       0   
9       0   0   0    0        0    0      0     0     0    0  ...       0   
10      0   0   0    0        0    0      0     0     0    0  ...       0   
11      0   0   0 
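
For intuition about what `MultinomialNB().fit(...)` and `predict(...)` compute, here is a minimal hand-rolled sketch of multinomial Naive Bayes with Laplace smoothing (`alpha=1.0`, which matches the scikit-learn default). It is not the library's implementation: the names `train_nb` and `predict_nb` are hypothetical, tokenization is a plain whitespace split rather than `CountVectorizer`, and the four-sentence corpus is only a small illustrative subset of the sentences shown in this notebook.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        word_counts[y].update(d.lower().split())
    model = {}
    for y in class_counts:
        total = sum(word_counts[y].values())
        log_prior = math.log(class_counts[y] / len(docs))
        # Laplace smoothing keeps unseen (word, class) pairs from getting probability 0.
        log_lik = {
            w: math.log((word_counts[y][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
        model[y] = (log_prior, log_lik)
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior score."""
    scores = {}
    for y, (log_prior, log_lik) in model.items():
        # Words outside the training vocabulary are skipped.
        scores[y] = log_prior + sum(
            log_lik[w] for w in doc.lower().split() if w in log_lik
        )
    return max(scores, key=scores.get)

docs = ["I love this sandwich", "What an awesome view",
        "I am tired of this stuff", "He is my sworn enemy"]
labels = ["pos", "pos", "neg", "neg"]
model = train_nb(docs, labels)

print(predict_nb(model, "What an awesome sandwich"))
print(predict_nb(model, "He is tired"))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.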

In [6]:
print('\nClassification results of testing samples are given below')
for doc, p in zip(xtest, predicted):
    pred = 'pos' if p==1 else 'neg'
    print('%s -> %s ' % (doc, pred))
# Print evaluation metrics
print('\nAccuracy metrics')
print('Accuracy of the classifier is',metrics.accuracy_score(ytest,predicted))
print('Recall :',metrics.recall_score(ytest,predicted),
'\nPrecision :',metrics.precision_score(ytest,predicted))
print('Confusion matrix')
print(metrics.confusion_matrix(ytest,predicted))


Classification results of testing samples are given below
This is an amazing place -> neg 
I am tired of this stuff -> neg 
He is my sworn enemy -> neg 
This is an awesome place -> pos 
What a great holiday -> pos 

Accuracy metrics
Accuracy of the classifier is 0.8
Recall : 0.6666666666666666 
Precision : 1.0
Confusion matrix
[[2 0]
 [1 2]]
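
The reported scores can be checked by hand from the confusion matrix. In scikit-learn's layout, rows are true labels and columns are predicted labels, with label 0 (`neg`) first. A short sketch of the arithmetic, using plain Python lists rather than the `metrics` calls:

```python
# Confusion matrix as printed above: rows = true label, columns = predicted label.
cm = [[2, 0],
      [1, 2]]

tn, fp = cm[0]  # true 'neg': 2 predicted neg, 0 predicted pos
fn, tp = cm[1]  # true 'pos': 1 predicted neg, 2 predicted pos

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 4 of 5 documents correct
precision = tp / (tp + fp)                  # no false positives
recall = tp / (tp + fn)                     # 2 of 3 positives found

print('Accuracy :', accuracy)
print('Precision:', precision)
print('Recall   :', recall)
```

The single false negative is "This is an amazing place", a positive document that the classifier labelled `neg`.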
