## **Implementing a Naive Bayes classifier to classify a set of documents and measure accuracy, precision, and recall**

Import Necessary Libraries

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import numpy as np

Load the Data

In [2]:
# Restrict the corpus to four of the twenty newsgroups.
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True)
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True)

print(f"Number of training documents: {len(twenty_train.data)}")
print(f"Number of testing documents: {len(twenty_test.data)}")

# target_names is sorted alphabetically; each integer label indexes into this list.
print(f"Target class names: {twenty_train.target_names}")

print("\nExample document from the training set:")
print(twenty_train.data[0])
print(f"Target label: {twenty_train.target[0]}")

Number of training documents: 2257
Number of testing documents: 1502
Target class names: ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

Example document from the training set:
From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
Organization: The City University
Lines: 14

Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format.  We would also like to
do the same, converting to HPGL (HP plotter) files.

Please email any response.

Is this the correct group?

Thanks in advance.  Michael.
-- 
Michael Collier (Programmer)                 The Computer Unit,
Email: M.P.Collier@uk.ac.city                The City University,
Tel: 071 477-8000 x3769                      London,
Fax: 071 477-8565                            EC1V 0HB.

Target label: 1


Preprocess the Data

In [3]:
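# Build the vocabulary from the training set only, dropping English stop words.
vectorizer = CountVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(twenty_train.data)
# Reuse the training vocabulary for the test set; words not seen in training are ignored.
X_test = vectorizer.transform(twenty_test.data)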

print(f"Shape of training data: {X_train.shape}")
print(f"Shape of testing data: {X_test.shape}")

Shape of training data: (2257, 35482)
Shape of testing data: (1502, 35482)
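
To make the shapes above concrete, here is a minimal pure-Python sketch of the bag-of-words idea behind `CountVectorizer`. It is not scikit-learn's implementation: the helper names (`tokenize`, `build_vocabulary`, `count_vector`) and the toy documents are hypothetical, and stop-word removal is left out for brevity. It shows that the vocabulary is fixed by the training documents and that each document becomes one row of token counts, which is why the training and test matrices share the same number of columns.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def build_vocabulary(docs):
    # Map each distinct token to a column index, in alphabetical order.
    tokens = {tok for doc in docs for tok in tokenize(doc)}
    return {tok: idx for idx, tok in enumerate(sorted(tokens))}

def count_vector(doc, vocabulary):
    # One row of counts; tokens outside the vocabulary are dropped.
    row = [0] * len(vocabulary)
    for tok, n in Counter(tokenize(doc)).items():
        if tok in vocabulary:
            row[vocabulary[tok]] = n
    return row

# Hypothetical toy corpus, not taken from the newsgroups data.
train_docs = ["convert image files", "image files and image formats"]
vocab = build_vocabulary(train_docs)
train_rows = [count_vector(d, vocab) for d in train_docs]

# "unknown" never appeared in training, so it contributes nothing.
test_row = count_vector("convert unknown image", vocab)

print(vocab)
print(train_rows)
print(test_row)
```

Every row has `len(vocab)` entries regardless of which split the document came from, mirroring the shared second dimension (35482) in the output above.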


Train the Naive Bayes Classifier

In [4]:
# MultinomialNB defaults to alpha=1.0 (Laplace smoothing).
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, twenty_train.target)

Make Predictions

In [5]:
y_pred = nb_classifier.predict(X_test)
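
The `fit` and `predict` calls hide the arithmetic, so a rough sketch of multinomial Naive Bayes with add-one (Laplace) smoothing may help. This is an illustration under stated assumptions, not scikit-learn's code: the function names `train_multinomial_nb` and `predict`, the pre-tokenized toy documents, and the class labels are all hypothetical. Training estimates a log prior per class and a smoothed log likelihood per token; prediction sums those logs and picks the highest-scoring class.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    # docs: list of token lists; labels: one class label per document.
    vocab = sorted({tok for doc in docs for tok in doc})
    class_doc_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    # Log prior: fraction of training documents in each class.
    log_prior = {c: math.log(n / len(docs)) for c, n in class_doc_counts.items()}

    # Smoothed log likelihood: (count + alpha) / (class total + alpha * |vocab|).
    log_likelihood = {}
    for c in class_doc_counts:
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            tok: math.log((token_counts[c][tok] + alpha) / total) for tok in vocab
        }
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    scores = {}
    for c in log_prior:
        # Tokens outside the training vocabulary are skipped.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][tok] for tok in doc if tok in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Hypothetical toy data, not taken from the newsgroups corpus.
toy_docs = [
    ["image", "file", "format"],
    ["image", "plotter", "file"],
    ["patient", "doctor", "treatment"],
    ["doctor", "medicine", "patient"],
]
toy_labels = ["graphics", "graphics", "med", "med"]

log_prior, log_likelihood = train_multinomial_nb(toy_docs, toy_labels)
prediction = predict(["image", "format", "unknown"], log_prior, log_likelihood)
print(prediction)
```

Working in log space avoids underflow when many small probabilities are multiplied, and the smoothing term keeps a token that never appeared in a class from forcing that class's probability to zero.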

Measure Accuracy, Precision, and Recall

In [6]:
accuracy = accuracy_score(twenty_test.target, y_pred)
print(f"Accuracy: {accuracy:.4f}")

print("\nClassification report:")
print(classification_report(twenty_test.target, y_pred, target_names=twenty_test.target_names))

cm = confusion_matrix(twenty_test.target, y_pred)
print("\nConfusion matrix:")
print(cm)


Accuracy: 0.9421

Classification report:
                        precision    recall  f1-score   support

           alt.atheism       0.93      0.91      0.92       319
         comp.graphics       0.95      0.97      0.96       389
               sci.med       0.96      0.92      0.94       396
soc.religion.christian       0.93      0.96      0.95       398

              accuracy                           0.94      1502
             macro avg       0.94      0.94      0.94      1502
          weighted avg       0.94      0.94      0.94      1502


Confusion matrix:
[[289   3   5  22]
 [  5 376   6   2]
 [ 11  13 366   6]
 [  5   4   5 384]]
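
As a cross-check, the per-class precision and recall in the report can be recomputed by hand from the confusion matrix printed above. The sketch below copies that matrix as a plain list and assumes scikit-learn's convention that rows are true classes and columns are predicted classes; the helper name `per_class_metrics` is hypothetical.

```python
# Confusion matrix copied from the output above (rows: true class, columns: predicted class).
cm = [
    [289, 3, 5, 22],
    [5, 376, 6, 2],
    [11, 13, 366, 6],
    [5, 4, 5, 384],
]
names = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

def per_class_metrics(cm):
    # Returns a (precision, recall) pair for each class index.
    n = len(cm)
    metrics = []
    for k in range(n):
        true_positives = cm[k][k]
        predicted_as_k = sum(cm[i][k] for i in range(n))  # column sum
        actually_k = sum(cm[k])                           # row sum
        metrics.append((true_positives / predicted_as_k, true_positives / actually_k))
    return metrics

# Accuracy is the diagonal total divided by the number of test documents.
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))
metrics = per_class_metrics(cm)

for name, (precision, recall) in zip(names, metrics):
    print(f"{name:>22}  precision={precision:.2f}  recall={recall:.2f}")
print(f"accuracy={accuracy:.4f}")
```

For `alt.atheism`, this gives precision 289/310 and recall 289/319, matching the 0.93 and 0.91 in the report. The 22 atheism posts predicted as `soc.religion.christian` are the largest single source of error.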
