<a href="https://colab.research.google.com/github/Muthon1/DataScience/blob/main/phase_2_Decision_Tree_and_Naive_Bayes_Classifiers_for_Text_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Difference between Decision Tree and Naive Bayes Classifiers
The Naive Bayes classifier is a probabilistic classifier based on Bayes' theorem. It assumes that features are conditionally independent given the class. Because this assumption keeps the number of parameters small, it works well with high-dimensional data and is commonly used for text classification tasks such as spam detection, document categorization, and sentiment analysis.
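
To make the conditional independence assumption concrete, here is a minimal sketch of a multinomial Naive Bayes classifier in plain Python. It is not how scikit-learn implements `MultinomialNB`; the function names `train_nb` and `predict_nb`, the toy spam/ham corpus, and the smoothing value `alpha=1.0` are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit a toy multinomial Naive Bayes model on tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}

    # Log prior: fraction of training documents in each class
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}

    # Log likelihood of each word given the class, with additive smoothing
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood, vocab


def predict_nb(model, tokens):
    """Return the class with the highest posterior score for the tokens."""
    log_prior, log_likelihood, vocab = model
    scores = {}
    for c in log_prior:
        # Independence assumption: the joint likelihood factorizes into
        # per-word terms, so the log-probabilities simply add up.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in tokens if w in vocab
        )
    return max(scores, key=scores.get)


# Hypothetical toy corpus, already tokenized
docs = [
    ["free", "prize", "click"],
    ["meeting", "agenda", "notes"],
    ["free", "offer", "click"],
    ["project", "meeting", "schedule"],
]
labels = ["spam", "ham", "spam", "ham"]

model = train_nb(docs, labels)
print(predict_nb(model, ["free", "click", "now"]))  # prints "spam"
```

Words that never appeared in training (such as "now" above) are skipped in this sketch, and each remaining word contributes to the score independently of the others.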

The decision tree classifier, on the other hand, is a tree-based model that recursively splits the data on feature values and makes no independence assumption. Decision trees can be used for both classification and regression, and are often chosen where model interpretability is important.
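
As a rough illustration of how a tree chooses a split, the sketch below scores candidate thresholds on a single feature by weighted Gini impurity (the default criterion of scikit-learn's `DecisionTreeClassifier`). The helper names `gini` and `best_split`, and the toy feature values, are assumptions for illustration; a real implementation searches over all features and recurses on each side of the split.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best `value <= threshold` split."""
    best = (None, gini(labels))  # baseline: no split at all
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Hypothetical TF-IDF weights of one word across six documents
tfidf_weight = [0.0, 0.1, 0.2, 0.7, 0.8, 0.9]
topic = ["other", "other", "other", "space", "space", "space"]

threshold, impurity = best_split(tfidf_weight, topic)
print(threshold, impurity)  # prints "0.2 0.0"
```

Each split is a readable rule of the form "feature <= threshold", which is where the interpretability of a decision tree comes from.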

In [16]:
# Building the models

# Install the required library
!pip install scikit-learn

# Import the necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score



In [7]:
# Load the 20 Newsgroup dataset
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))


In [14]:
# Preprocessing the text data
# Vectorize text using TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(newsgroups.data)
y = newsgroups.target

# Split the vectorized features into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the Decision Tree model
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)

# Make predictions
dt_predictions = dt_classifier.predict(X_test)

# Evaluate the model
print("Decision Tree Classifier Performance:")
print(f"Accuracy: {accuracy_score(y_test, dt_predictions)}")
print(classification_report(y_test, dt_predictions))

Decision Tree Classifier Performance:
Accuracy: 0.46047745358090186
              precision    recall  f1-score   support

           0       0.34      0.34      0.34       151
           1       0.39      0.36      0.37       202
           2       0.44      0.43      0.43       195
           3       0.36      0.43      0.39       183
           4       0.50      0.42      0.46       205
           5       0.62      0.55      0.58       215
           6       0.51      0.51      0.51       193
           7       0.28      0.55      0.37       196
           8       0.50      0.49      0.50       168
           9       0.58      0.52      0.55       211
          10       0.62      0.60      0.61       198
          11       0.67      0.58      0.62       201
          12       0.34      0.32      0.33       202
          13       0.60      0.57      0.58       194
          14       0.53      0.52      0.53       189
          15       0.47      0.44      0.45       202
          16 

In [15]:
# Training the Naive Bayes model
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)

# Make predictions
nb_predictions = nb_classifier.predict(X_test)

# Evaluate the model
print("Naive Bayes Classifier Performance:")
print(f"Accuracy: {accuracy_score(y_test, nb_predictions)}")
print(classification_report(y_test, nb_predictions))


Naive Bayes Classifier Performance:
Accuracy: 0.7196286472148541
              precision    recall  f1-score   support

           0       0.75      0.27      0.40       151
           1       0.70      0.68      0.69       202
           2       0.67      0.66      0.66       195
           3       0.55      0.78      0.64       183
           4       0.87      0.67      0.75       205
           5       0.90      0.81      0.85       215
           6       0.79      0.70      0.74       193
           7       0.84      0.76      0.79       196
           8       0.49      0.77      0.60       168
           9       0.92      0.83      0.88       211
          10       0.88      0.92      0.90       198
          11       0.70      0.85      0.77       201
          12       0.85      0.62      0.72       202
          13       0.91      0.86      0.88       194
          14       0.80      0.83      0.81       189
          15       0.42      0.94      0.58       202
          16    

# Summary
The outputs display the accuracy and classification report for each model. The classification report includes precision, recall, f1-score, and support for each of the 20 categories.
On this test split, the Naive Bayes classifier reached an accuracy of about 0.72, while the Decision Tree classifier reached about 0.46. An unpruned decision tree is prone to overfitting, particularly on sparse, high-dimensional TF-IDF features. Naive Bayes is fast to train and has a higher f1-score on every category shown in the reports above, but it may struggle with topics that require capturing feature dependencies.
Model selection therefore depends on the goal. A decision tree fits best where an interpretable model that can capture feature interactions is needed, whereas Naive Bayes is better suited to a fast, scalable, and simple solution.
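
To show what the columns of the classification report measure, the following sketch computes precision, recall, f1-score, and support per class from scratch. The helper `per_class_report` and the toy labels are assumptions for illustration; scikit-learn's `classification_report` remains the practical way to obtain these values.

```python
def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} computed from scratch."""
    report = {}
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        report[label] = (precision, recall, f1, tp + fn)  # support = tp + fn
    return report


# Hypothetical predictions: class 0 is rarely predicted, class 1 is over-predicted
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1]

report = per_class_report(y_true, y_pred)
for label, (p, r, f1, support) in report.items():
    print(f"{label}  precision={p:.2f}  recall={r:.2f}  f1={f1:.2f}  support={support}")
```

The toy data reproduces a pattern visible in the Naive Bayes report: a class that is rarely predicted has high precision but low recall (as with category 0), while an over-predicted class has high recall but low precision (as with category 15).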

In [17]:
# Improving the accuracy of the classifiers
# Using the Random forest classifier
from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf_classifier.fit(X_train, y_train)

# Make predictions on the test set
rf_predictions = rf_classifier.predict(X_test)

# Evaluate the model
print("Random Forest Classifier Performance:")
print(f"Accuracy: {accuracy_score(y_test, rf_predictions)}")
print(classification_report(y_test, rf_predictions))

Random Forest Classifier Performance:
Accuracy: 0.6692307692307692
              precision    recall  f1-score   support

           0       0.55      0.43      0.48       151
           1       0.61      0.58      0.60       202
           2       0.60      0.67      0.63       195
           3       0.54      0.67      0.60       183
           4       0.80      0.67      0.73       205
           5       0.79      0.76      0.77       215
           6       0.69      0.68      0.68       193
           7       0.44      0.75      0.56       196
           8       0.61      0.65      0.63       168
           9       0.79      0.76      0.77       211
          10       0.81      0.86      0.83       198
          11       0.83      0.77      0.80       201
          12       0.59      0.54      0.57       202
          13       0.79      0.74      0.77       194
          14       0.77      0.75      0.76       189
          15       0.62      0.78      0.69       202
          16  

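The Random Forest classifier improves accuracy over the single Decision Tree (about 0.67 versus 0.46) because combining many trees reduces overfitting and generalizes better. On this split it still falls short of the Naive Bayes classifier (about 0.72).
Tuning hyperparameters such as the number of trees or the maximum depth may improve performance further.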
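
One possible way to tune the hyperparameters is a cross-validated grid search. The sketch below is a minimal example under stated assumptions: the twelve-sentence corpus is a toy stand-in for the newsgroup text, and the grid values are arbitrary starting points rather than recommended settings. Wrapping the vectorizer and the classifier in a `Pipeline` means the TF-IDF vocabulary is refit on the training folds only during each cross-validation round.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in corpus (an assumption for illustration); in the notebook this
# would be the raw newsgroup documents and their targets.
texts = [
    "the rocket launch reached orbit",
    "nasa plans a new moon mission",
    "the shuttle docked with the station",
    "astronauts trained for the space walk",
    "satellite orbit and launch window",
    "telescope images of distant galaxies",
    "the goalie stopped every shot",
    "the team won the hockey playoff",
    "a late goal decided the game",
    "the coach praised the defense",
    "players skated hard in overtime",
    "the league announced the season schedule",
]
labels = [0] * 6 + [1] * 6

# Vectorizer and classifier are tuned and cross-validated as one estimator
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("rf", RandomForestClassifier(random_state=42)),
])

# Illustrative grid; the "rf__" prefix routes each value to the forest step
param_grid = {
    "rf__n_estimators": [50, 100],
    "rf__max_depth": [None, 10],
    "rf__max_features": ["sqrt", "log2"],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

On the full dataset the same search would be considerably slower, so a smaller grid or `RandomizedSearchCV` may be a more practical first step.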