# Lab 03: Text Classification on the DBpedia14 dataset

### Objectives:
1. Build a Naive Bayes classification model from scratch
2. Evaluate the performance of your model on the DBpedia14 dataset
3. Train an off-the-shelf NB classifier and compare its performance to your implementation
4. Train off-the-shelf implementations of the linear-SVM, RBF-kernel-SVM, and perceptron and compare their performance with the NB models

### Suggested Reading

1. https://arxiv.org/pdf/1811.12808.pdf

### Download the dataset

In [3]:
import datasets
import pandas as pd
from sklearn.model_selection import train_test_split
import random

#train_ds, test_ds = datasets.load_dataset('dbpedia_14', split=['train[:80%]', 'test[80%:]'])
#df_train: pd.DataFrame = train_ds.to_pandas()
#df_test: pd.DataFrame = test_ds.to_pandas()


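# Load the full training split, then keep a 10% random sample to limit memory and runtime
ds = datasets.load_dataset('dbpedia_14', split='train')
df_ds: pd.DataFrame = ds.to_pandas()
df_ds = df_ds.sample(frac=0.1, random_state=123)  # fixed seed so the subsample is reproducible
X, Y = df_ds['content'], df_ds['label']
# A stratified 80/20 split preserves the class proportions in both partitions
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=123, stratify=Y)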

Reusing dataset d_bpedia14 (/root/.cache/huggingface/datasets/d_bpedia14/dbpedia_14/2.0.0/7f0577ea0f4397b6b89bfe5c5f2c6b1b420990a1fc5e8538c7ab4ec40e46fa3e)


# Part I: Build your own Naive Bayes classification model

### (5 pts) Task I: Build a model from scratch
Using your notes from lecture-02, implement a Naive Bayes model and train it on the DBpedia dataset. Also, feel free to use any text preprocessing you wish, such as the pipeline from Lab02. 

Below is a template class to help you think about the structure of this problem (feel free to design your own code if you like). It contains methods for each inference step in NB. It also has a classmethod that you could use to instantiate the class from a list of documents and a corresponding list of labels. Here we suggest creating a dictionary that maps each word to a unique index $i$ in the $\phi_{i,k}$ probability matrix, which you need to estimate. Because the labels are 0-indexed integers, each label $k$ naturally maps to a unique position in $\mu_{k}$ (you should check this to make sure).
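As a rough illustration of the estimates described above, the following sketch builds the word-to-index dictionary and computes $\mu_k$ and $\phi_{i,k}$ with add-$\alpha$ smoothing on a tiny corpus. The corpus, the whitespace tokenizer, and the value `alpha = 1.0` are assumptions made for the example, not part of the lab template.

```python
import numpy as np

# Toy corpus (hypothetical data, for illustration only)
docs = ["red apple", "green apple", "fast car", "red car"]
labels = np.array([0, 0, 1, 1])

# Map each word to a unique index i (sorted so the mapping is reproducible)
vocab = {w: i for i, w in enumerate(sorted(set(" ".join(docs).split(" "))))}
n_words = len(vocab)
n_classes = len(set(labels.tolist()))

# Check that the labels are exactly 0..K-1, so label k indexes mu[k] directly
labels_are_zero_indexed = sorted(set(labels.tolist())) == list(range(n_classes))

# counts[i, k] = number of occurrences of word i in documents of class k
counts = np.zeros((n_words, n_classes))
for doc, k in zip(docs, labels):
    for w in doc.split(" "):
        counts[vocab[w], k] += 1

alpha = 1.0  # assumed smoothing strength
# Class priors: fraction of documents in each class
mu = np.bincount(labels, minlength=n_classes) / len(labels)
# Smoothed word likelihoods: each column phi[:, k] sums to 1
phi = (counts + alpha) / (counts.sum(axis=0) + alpha * n_words)
```

Here `phi` is stored as `phi[i, k]` to match the $\phi_{i,k}$ notation; storing the transpose is equally valid as long as prediction indexes it consistently.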

In [4]:
from typing import Dict, Iterable
import numpy as np

Y_train = np.array(list(Y_train))


class NaiveBayesModel:
    """Multinomial NB model with add-alpha (Laplace) smoothing."""

    phi: np.ndarray  # (K, N): phi[k, j] = P(word j | class k), the transpose of phi_{i,k} in the text
    mu: np.ndarray   # (K,): class priors
    vocab: dict      # vocabulary map from word to column index in phi
    n_class: int     # number of classes

    def __init__(self, vocabulary: Dict[str, int], num_classes: int, alpha: float = 1.0):
        """
        Parameters
        ----------
        vocabulary: {str: int} <- {word: index}
        num_classes: Number of classes
        alpha: additive smoothing strength
        """
        self.vocab = vocabulary
        self.n_class = num_classes
        self.alpha = alpha

    @classmethod
    def from_docs(cls, docs: Iterable[str], labels, alpha: float = 1.0):
        """Build the vocabulary from the documents and fit the model."""
        docs = list(docs)
        labels = np.asarray(labels)
        vocabulary = {word: idx for idx, word in enumerate(sorted(set(" ".join(docs).split(" "))))}
        return cls(vocabulary, len(set(labels)), alpha).fit(docs, labels)

    def vectorize(self, docs) -> np.ndarray:
        """Dense (n_docs, N) count matrix; out-of-vocabulary words are skipped.
        Note: a dense matrix is memory-hungry for large vocabularies."""
        X = np.zeros(shape=(len(docs), len(self.vocab)))
        for i, doc in enumerate(docs):
            for word in doc.split(" "):
                j = self.vocab.get(word)
                if j is not None:
                    X[i, j] += 1
        return X

    def fit(self, docs, labels):
        """Estimate the class priors mu and the smoothed word likelihoods phi."""
        labels = np.asarray(labels)
        X = self.vectorize(docs)
        # mu[k]: fraction of training documents with label k
        self.mu = np.array([np.mean(labels == k) for k in range(self.n_class)])
        # counts[k, j]: occurrences of word j in documents of class k
        counts = np.vstack([X[labels == k].sum(axis=0) for k in range(self.n_class)])
        # Each row of phi sums to 1
        self.phi = (self.alpha + counts) / (self.alpha * len(self.vocab) + counts.sum(axis=1, keepdims=True))
        return self

In [None]:
model = NaiveBayesModel.from_docs(X_train, Y_train)
Xte = model.vectorize(list(X_test))
phi_hat, mu_hat = model.phi, model.mu

In [None]:
# Unnormalized log-posterior for every test document: log p(x | y=k) + log p(y=k)
p_y = Xte.dot(np.log(phi_hat).T) + np.log(mu_hat)  # shape (n_test, K)
# The predicted label is the class with the highest score
yte_hat = np.argmax(p_y, axis=1)

# Part II: Model performance evaluation

Evaluating the performance of a classification model may seem as simple as computing an accuracy, and in some cases that is sufficient, but in general accuracy is not a reliable metric by itself. Typically we need to evaluate our model using several different metrics. 

One common issue is class imbalance, which occurs when the label distribution in the data is far from uniform. In this case a high accuracy can be misleading because low-frequency labels contribute little to the score. More generally, this is one of the biggest drawbacks of using MLE in NLP: models tend to be much less sensitive to low-probability labels than to high-probability labels. Later in this class we will explore models that learn by predicting words given their context. Can you think of reasons why this can be problematic? Hint: remember Zipf's law?
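To make the class-imbalance point concrete, here is a minimal sketch with made-up labels: a classifier that always predicts the majority class scores high on accuracy, while macro-averaged F1, which weights every class equally, exposes the failure on the rare class. The 95/5 split is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels: 95 examples of class 0 and 5 of class 1
y_true = np.array([0] * 95 + [1] * 5)
# A degenerate classifier that always predicts the majority class
y_pred = np.zeros_like(y_true)

# Accuracy only counts correct predictions, so the rare class barely matters
acc = accuracy_score(y_true, y_pred)
# Macro F1 averages per-class F1 scores; class 1 is never predicted, so its F1 is 0
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy = {acc:.3f}, macro F1 = {macro_f1:.3f}")
```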

Another reason to use multiple evaluation methods is that it can help you better understand your data. Evaluating performance on individual classes often reveals problems with the data that would otherwise go unnoticed. For example, if you observe an abundance of misclassified data specific to only a few classes, chances are you have inconsistent labels for those classes in the training set. This is very common in third-party Mechanical Turk data, where quality can vary wildly.

In this lab we will use three metrics and one visualization tool:

1. [Accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision)
2. [F1 score](https://en.wikipedia.org/wiki/F-score)
3. [AUC ROC score](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)
4. [The confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix)

The [metrics module](https://scikit-learn.org/stable/modules/model_evaluation.html) within sklearn provides support for nearly any evaluation metric that you will need.
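The sketch below shows one way these four tools might be computed with `sklearn.metrics` for a multiclass model. The score matrix and labels are invented for the example. It assumes the model produces unnormalized log-joint scores (as a Naive Bayes model does), which are passed through a softmax because multiclass `roc_auc_score` expects per-class probabilities that sum to 1.

```python
import numpy as np
from scipy.special import softmax
from sklearn.metrics import (accuracy_score, f1_score, roc_auc_score,
                             multilabel_confusion_matrix)

# Hypothetical unnormalized log-joint scores, shape (n_samples, n_classes)
log_joint = np.array([
    [-1.0, -3.0, -4.0],
    [-5.0, -1.5, -2.5],
    [-4.0, -3.5, -0.5],
    [-2.0, -1.0, -3.0],
    [-0.5, -2.0, -2.5],
    [-3.0, -2.8, -1.0],
])
y_true = np.array([0, 1, 2, 0, 0, 2])

# Hard predictions: the highest-scoring class per sample
y_pred = log_joint.argmax(axis=1)
# Softmax over classes turns log-joint scores into posteriors that sum to 1
proba = softmax(log_joint, axis=1)

accuracy = accuracy_score(y_true, y_pred)
f1_micro = f1_score(y_true, y_pred, average="micro")  # pools all classes
f1_macro = f1_score(y_true, y_pred, average="macro")  # weights classes equally
auc = roc_auc_score(y_true, proba, multi_class="ovr")  # one-vs-rest AUC
# One 2x2 matrix [[TN, FP], [FN, TP]] per class
confusion = multilabel_confusion_matrix(y_true, y_pred)
```

Note that micro-averaged F1 equals accuracy in single-label multiclass problems, so macro averaging is usually the more informative choice when per-class behavior matters.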

# Part III: Compare your performance to an off-the-shelf NB classifier
Open source implementations of your custom NB classifier from Part I already exist of course. One such implementation is [`sklearn.naive_bayes.MultinomialNB`](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB) from the sklearn library. 

### (5 pts) Task II: NB model comparison
Train this model on the same data and compare its performance with your model using the metrics from Part II.

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
import sklearn.metrics

# Candidate smoothing strengths for the grid search
params = {'alpha': [0.5, 0.7, 1, 2]}
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', GridSearchCV(MultinomialNB(), param_grid=params, n_jobs=-1, cv=5, verbose=5)),
])
# Fitting our train data to the pipeline
text_clf.fit(X_train, Y_train)
# Predicting our test data
predicted = text_clf.predict(X_test)
print(classification_report(Y_test, predicted))

In [None]:
results= np.reshape([nb_accuracy,mm_accuracy,nb_f1score,mm_f1score,nb_AUC,mm_AUC],(3,2))
measurement= pd.DataFrame(data=results, index=["Accuracy","F1 score","AUC ROC score"], columns=["NB model","My model"])
print(measurement)
print("NB model:",nb_confu)
print("My model:",mm_confu)

               NB model  My model
Accuracy       0.140089  0.954821
F1 score       0.140089  0.954821
AUC ROC score  0.630883  0.000000
NB model: [[[ 6485  3913]
  [  407   395]]

 [[10147   247]
  [  685   121]]

 [[ 8017  2386]
  [  471   326]]

 [[10320    92]
  [  783     5]]

 [[ 9923   468]
  [  763    46]]

 [[10209   186]
  [  781    24]]

 [[ 9327  1062]
  [  576   235]]

 [[10237   166]
  [  510   287]]

 [[10394     1]
  [  805     0]]

 [[10318    72]
  [  782    28]]

 [[10393    30]
  [  772     5]]

 [[10392    10]
  [  795     3]]

 [[ 9651   737]
  [  745    67]]

 [[10156   261]
  [  756    27]]]
My model: [[[10360    38]
  [  105   697]]

 [[10355    39]
  [   16   790]]

 [[10363    40]
  [   92   705]]

 [[10397    15]
  [   10   778]]

 [[10342    49]
  [   16   793]]

 [[10369    26]
  [   10   795]]

 [[10322    67]
  [   31   780]]

 [[10380    23]
  [   21   776]]

 [[10394     1]
  [   42   763]]

 [[10379    11]
  [   65   745]]

 [[10367    56]
  [   13   7

# Part IV: Compare NB to other classification models

Now that we've built and validated our NB classifier, we want to evaluate other models on this task.

### (5 pts) Task III: Evaluate the perceptron, SVM (linear), and SVM (RBF kernel)
Train and evaluate the following models on this dataset, and compare them with the NB models.

1. [Perceptron](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html#sklearn.linear_model.Perceptron)
2. [Linear-SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)
3. [RBF-Kernel-SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)

In [None]:
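from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import Perceptron
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', Perceptron(tol=1e-3, random_state=0)),
])
# Fitting our train data to the pipeline
text_clf.fit(X_train, Y_train)

# Predicting our test data and scoring against the true test labels
predicted = text_clf.predict(X_test)
print(classification_report(Y_test, predicted))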

In [None]:
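from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# Candidate regularization strengths: 0.1, 0.2, ..., 0.9
param_grid = {'C': np.arange(0.1, 1, 0.1)}
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', GridSearchCV(LinearSVC(tol=1e-5), param_grid, refit=True, verbose=2)),
])
# Fitting our train data to the pipeline
text_clf.fit(X_train, Y_train)

# Predicting our test data (the documents, not the labels)
predicted = text_clf.predict(X_test)
print(classification_report(Y_test, predicted))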
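
Task III also lists the RBF-kernel SVM. The following is a minimal sketch of how such a model could be set up with `sklearn.svm.SVC`, using a small made-up corpus so that it runs on its own. The corpus, the use of `TfidfVectorizer`, and the values `C=1.0` and `gamma="scale"` are assumptions for illustration, not tuned settings or results from this lab.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus with two classes (0 = sports, 1 = politics)
docs = [
    "the striker scored a late goal", "the goalkeeper saved a penalty",
    "the midfielder won the match", "the team lost the football game",
    "the senate passed the budget bill", "the president signed the law",
    "parliament debated the new tax", "the minister announced an election",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# TF-IDF features followed by an SVM with a non-linear RBF kernel
rbf_clf = make_pipeline(TfidfVectorizer(), SVC(kernel="rbf", C=1.0, gamma="scale"))
rbf_clf.fit(docs, labels)

# Predictions on the training documents, just to show the API
train_pred = rbf_clf.predict(docs)
```

On the real data, `docs` and `labels` would be replaced by `X_train` and `Y_train`. Kernel SVM training time grows at least quadratically with the number of samples, so fitting on a smaller subsample first is a reasonable precaution.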

### (5 pts) Task IV: Select the best model

1. Which model performed the best overall? 
2. What metric(s) influence this decision?
3. Does the model that learns a non-linear decision boundary help?

1. The SVM performs the best overall.
2. The decision is based on accuracy on the test set.
3. Yes. A non-linear decision boundary provides more choices when training models.