# Lab 03: Text Classification on the DBpedia14 dataset

### Objectives:
1. Build a Naive Bayes classification model from scratch
2. Evaluate the performance of your model on the DBpedia14 dataset
3. Train an off-the-shelf NB classifier and compare its performance to your implementation
4. Train off-the-shelf implementations of the linear-SVM, RBF-kernel-SVM, and perceptron and compare their performance with the NB models

### Suggested Reading

1. https://arxiv.org/pdf/1811.12808.pdf

### Download the dataset

In [12]:
import datasets
import pandas as pd

train_ds, test_ds = datasets.load_dataset('dbpedia_14', split=['train', 'test'])
df_train: pd.DataFrame = train_ds.to_pandas().sample(frac = 0.1).reset_index(drop=True)
df_test: pd.DataFrame = test_ds.to_pandas().sample(frac = 0.1).reset_index(drop=True)

Reusing dataset dbpedia_14 (/Users/jieyisun/.cache/huggingface/datasets/dbpedia_14/dbpedia_14/2.0.0/01dab9e10d969eadcdbc918be5a09c9190a24caeae33b10eee8f367a1e3f1f0c)
100%|██████████| 2/2 [00:00<00:00, 255.82it/s]


In [13]:
len(df_train), len(df_test)

(56000, 7000)

In [14]:
df_train.head()

| | label | title | content |
|---|---|---|---|
| 0 | 13 | Scroogenomics | Scroogenomics is a non-fiction book written b... |
| 1 | 4 | John McGee (politician) | John McGee (born January 27 1973) is a former... |
| 2 | 11 | Spiral Ascent | Spiral Ascent is the debut album by Indian He... |
| 3 | 12 | Lost Horizon (1973 film) | Lost Horizon is a 1973 American musical film ... |
| 4 | 7 | Swift River (New Zealand) | The Swift River is a river of the Canterbury ... |


In [7]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56000 entries, 0 to 55999
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   label    56000 non-null  int64 
 1   title    56000 non-null  object
 2   content  56000 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.3+ MB


In [8]:
label = df_train["label"]
max(label) #13

13

In [9]:
df_train.describe(include=['object'])

| | title | content |
|---|---|---|
| count | 56000 | 56000 |
| unique | 56000 | 55999 |
| top | Lar Corbett | Alturas Lake is an alpine lake in Blaine Coun... |
| freq | 1 | 2 |


# Part I: Build your own Naive Bayes classification model

### (10 pts) Task I: Build a model from scratch
Using your notes from lecture-02, implement a Naive Bayes model and train it on the DBpedia dataset. Also, feel free to use any text preprocessing you wish, such as the pipeline from Lab02. 

Below is a template class to help you think about the structure of this problem (feel free to design your own code if you like). It contains methods for each inference step in NB. It also has a classmethod that you could use to instantiate the class from a list of documents and a corresponding list of labels. Here we suggest creating a dictionary that maps each word to a unique index $i$ in the $\phi_{i,k}$ probability matrix, which you need to estimate. Because the labels are a set of 0-indexed integers, they naturally map to a unique position $k$ in $\mu_{k}$ (you should check this to make sure).
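
As a minimal sketch of the bookkeeping described above, the toy example below builds a word-to-index map and estimates smoothed `mu` and `phi` from a tiny hand-made corpus. The corpus, the whitespace tokenizer, and the variable names are illustrative assumptions, not part of the lab template.

```python
import numpy as np

# Toy corpus (hypothetical): two classes, labels are 0-indexed integers.
docs = ["red red blue", "blue green", "green green red"]
labels = [0, 1, 1]

# Map each word to a unique row index i in phi.
vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d.split()}))}
n_words, n_classes = len(vocab), len(set(labels))

# Check that the labels really are 0..K-1, so they can index mu directly.
assert sorted(set(labels)) == list(range(n_classes))

alpha = 1.0
counts = np.zeros((n_words, n_classes))   # word counts per class
class_docs = np.zeros(n_classes)          # document counts per class
for doc, k in zip(docs, labels):
    class_docs[k] += 1
    for w in doc.split():
        counts[vocab[w], k] += 1

# Additive (Laplace) smoothing; each column of phi sums to 1.
mu = (class_docs + alpha) / (class_docs.sum() + n_classes * alpha)
phi = (counts + alpha) / (counts.sum(axis=0) + n_words * alpha)
```

Normalizing each column of `phi` separately keeps it a proper distribution over words for every class, which is the property the template's `estimate_phi` also relies on.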

In [17]:
# Your code goes here
from typing import Union, List
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

class NaiveBayesModel:
    
    """Multinomial NB model class template"""
    
    phi: np.ndarray # (N, K)
    
    mu: np.ndarray  # (K,)
    
    vocab: dict     # vocabulary map from word to row index in phi
    
    n_class: int    # number of classes   
    
    DF_Count_T: pd.core.frame.DataFrame   # word count of docs
    
    labels_list: List[int]
    
    
    def __init__(self, vocabulary: dict, num_classes: int, DF_Count_T: pd.core.frame.DataFrame, labels_list: List[int]):
        """
        Parameters
        ----------
        vocabulary: {str: int} <- {word: index}
        num_classes: Number of classes
        """
        self.vocab = vocabulary
        self.n_class = num_classes
        self.mu = np.zeros(shape = (num_classes,))
        N = len(vocabulary.keys())
        self.phi = np.zeros(shape = (N, num_classes))
        self.labels_list = labels_list
        self.DF_Count_T = DF_Count_T
    
    @classmethod
    def from_preprocessed_data(cls, docs_list: List[str], labels_list: List[int]):
        # extracting features from text using countvectorizer
        # create a vectorizer object
        cv = CountVectorizer(input='content', stop_words = "english", max_features=400)
        # tokenize and build vocab, encode document
        X = cv.fit_transform(docs_list)
        #print(cv.vocabulary_)

        # get_feature_names() was removed in newer scikit-learn releases
        ColumnNames = cv.get_feature_names_out()
        DF_Count_T = pd.DataFrame(X.toarray(),columns=ColumnNames).T
        DF_Count = DF_Count_T.T
        
        labels_set = set(labels_list)
        num_classes = len(labels_set)

        vocabulary = dict()
        for i in range(len(DF_Count_T.index)):
            vocabulary[DF_Count.columns[i]] = i
        return cls(vocabulary, num_classes, DF_Count_T, labels_list)

    
    def estimate_mu(self, alpha: float = 1.):
        """
        Estimate P(Y), the prior over labels
        
        Parameters
        ----------
        alpha: smoothing parameter
        """
        for i in self.labels_list:
            self.mu[i] += 1
        self.mu = (self.mu + alpha) / (sum(self.mu) + self.n_class * alpha)
        return self.mu
    
    def estimate_phi(self, alpha: float = 1.):
        """
        Estimate phi, the N x K matrix 
        describing the probability of
        the nth word in the kth class.
        
        Parameters
        ----------
        alpha: smoothing parameter
        """
        # collect the document (column) indices belonging to each class
        NK = [[] for _ in range(self.n_class)]
        for doc_idx, k in enumerate(self.labels_list):
            NK[k].append(doc_idx)

        for i in range(self.n_class):
            k = self.DF_Count_T.iloc[:, NK[i]]
            # total count of each word over the documents in class i
            k_sum = k.sum(axis=1)
            for word in k_sum.index:
                # k_sum is indexed by word, so look it up by label, not by position
                self.phi[self.vocab[word], i] = k_sum[word]
        
        self.phi = np.array(pd.DataFrame(self.phi).apply(lambda x: (x+alpha)/(x.sum()+len(x)*alpha), axis=0))
        
        return self.phi
    
    def predict_label(self, text: str) -> int:
        """
        Compute label given some input text
        
        Parameters
        ----------
        text: raw input text
        
        Returns
        -------
        int: corresponding to the predicted label
        """
        text = text.split(" ")
        value = list(self.mu)
        for word in text:
            if word in self.vocab.keys():
                for j in range(self.n_class):
                    value[j] = value[j] * self.phi[self.vocab[word],j]
        return value.index(max(value))

model = NaiveBayesModel.from_preprocessed_data(df_train["content"], df_train["label"])



In [6]:
NK = [] #save the result in a N x K matrix

N = [1,1,1,1]
K = [1,1,1]
#create an empty N x K matrix
for i in range(len(N)):
    NK.append([])
    for j in range(len(K)):
        NK[i].append(0)

NK

[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]]

In [18]:
mu = model.estimate_mu()
mu

array([0.07246403, 0.07035741, 0.07305317, 0.07339237, 0.071982  ,
       0.07094655, 0.06841147, 0.07119649, 0.07276752, 0.07173207,
       0.06991109, 0.07060735, 0.07121434, 0.07196415])

In [19]:
phi = model.estimate_phi()
phi

array([[0.0015015 , 0.00143091, 0.00389731, ..., 0.00312975, 0.00073335,
        0.00119646],
       [0.00080636, 0.00269237, 0.00362937, ..., 0.00201846, 0.00052382,
        0.0010877 ],
       [0.00108442, 0.00788884, 0.00302041, ..., 0.00272152, 0.0008905 ,
        0.00125085],
       ...,
       [0.00606162, 0.00363376, 0.00604082, ..., 0.00129272, 0.002331  ,
        0.00372536],
       [0.00130686, 0.00054601, 0.00211916, ..., 0.00124736, 0.00440009,
        0.00397009],
       [0.00144589, 0.00062132, 0.00129098, ..., 0.00099789, 0.0002881 ,
        0.00043508]])

In [20]:
NB_Result = []
for doc in df_test['content']:
    NB_Result.append(model.predict_label(doc))
NB_Result = np.array(NB_Result)
NB_Result

array([ 8, 12,  5, ...,  7,  5,  3])
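
`predict_label` above multiplies raw probabilities, which can underflow to zero on long documents. A common alternative is to sum log-probabilities; the argmax is the same whenever the product does not underflow. The sketch below is a standalone illustration with made-up `mu`, `phi`, and `vocab` values, not a drop-in replacement for the class method.

```python
import numpy as np

def predict_log_space(text, mu, phi, vocab):
    """Return argmax_k of log mu_k + sum of log phi[i, k] over in-vocabulary words."""
    scores = np.log(mu).copy()
    for word in text.split(" "):
        idx = vocab.get(word)
        if idx is not None:          # skip out-of-vocabulary words
            scores += np.log(phi[idx])
    return int(np.argmax(scores))

# Hypothetical parameters: 2 words, 2 classes.
vocab = {"river": 0, "album": 1}
mu = np.array([0.5, 0.5])
phi = np.array([[0.9, 0.2],
                [0.1, 0.8]])

pred_short = predict_log_space("album album river", mu, phi, vocab)
# A 5000-word document: the raw product 0.5 * 0.8**5000 underflows to 0.0,
# but the log-space score stays finite.
pred_long = predict_log_space(" ".join(["album"] * 5000), mu, phi, vocab)
```

With the raw product, every class would score 0.0 on the long document and the prediction would be decided by tie-breaking rather than by the model.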

# Part II: Model performance evaluation

Evaluating the performance of a classification model may seem as simple as computing accuracy, and in some cases that is sufficient, but in general accuracy is not a reliable metric by itself. Typically we need to evaluate our model using several different metrics.

One common issue is class imbalance, which occurs when the label distribution in the data is far from uniform. In this case a high accuracy can be misleading because low-frequency labels don't contribute equally to the score. More generally, this is one of the biggest drawbacks of using MLE in NLP: models tend to be much less sensitive to low-probability labels than to higher-probability labels. Later in this class we will explore models that learn by predicting words given their context. Can you think of reasons why this can be problematic? Hint: remember Zipf's law?
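
To make the imbalance point concrete, the sketch below scores a degenerate classifier that always predicts the majority class on a made-up 95/5 label split. The data and the helper function are illustrative assumptions; accuracy looks strong while macro-averaged F1 exposes the ignored minority class.

```python
def f1_for_class(y_true, y_pred, k):
    """F1 of class k treated as the positive class (0.0 if there are no true positives)."""
    tp = sum(t == k and p == k for t, p in zip(y_true, y_pred))
    fp = sum(t != k and p == k for t, p in zip(y_true, y_pred))
    fn = sum(t == k and p != k for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Hypothetical labels: 95 examples of class 0, 5 of class 1.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100          # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(f1_for_class(y_true, y_pred, k) for k in (0, 1)) / 2
print(accuracy, round(macro_f1, 3))
```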

Another reason to use multiple evaluation methods is that it can help you better understand your data. Evaluating performance on individual classes often reveals problems with the data that would otherwise go unnoticed. For example, if you observe an abundance of misclassified data specific to only a few classes, chances are you have inconsistent labels for those classes in the training set. This is very common in third-party Mechanical Turk data, where quality can vary wildly.

In this lab we will use three metrics and one visualization tool:

1. [Accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision)
2. [F1 score](https://en.wikipedia.org/wiki/F-score)
3. [AUC ROC score](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)
4. [The confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix)

The [metrics module](https://scikit-learn.org/stable/modules/model_evaluation.html) within sklearn provides support for nearly any evaluation metric that you will need.
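
AUC ROC needs per-class scores rather than hard labels. As a hedged sketch, `roc_auc_score` accepts an `(n_samples, n_classes)` array of probabilities when `multi_class` is set to `'ovr'` or `'ovo'`; the three-class labels and probabilities below are invented for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted class probabilities (rows sum to 1).
y_true = np.array([0, 1, 2, 0, 1, 2])
y_proba = np.array([
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
    [0.5, 0.3, 0.2],
    [0.3, 0.5, 0.2],
    [0.2, 0.3, 0.5],
])

# One-vs-rest AUC, macro-averaged over the three classes.
auc = roc_auc_score(y_true, y_proba, multi_class="ovr", average="macro")
```

For sklearn's `MultinomialNB`, an array of this shape is available from `predict_proba`; a from-scratch model would need to return normalized per-class posteriors instead of only the argmax.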

In [21]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix

In [22]:
# Accuracy (sklearn's signature is accuracy_score(y_true, y_pred))
Acc = accuracy_score(df_test['label'], NB_Result)
print(Acc)

0.849


In [23]:
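# F1 Score
# Note: f1_score expects (y_true, y_pred). With average='weighted' the per-class
# weights come from the first argument, so this call weights classes by
# predicted-label counts rather than true-label counts.
F1 = f1_score(NB_Result, df_test['label'], average='weighted')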
print(F1)

0.8503527856550249


In [24]:
#the confusion matrix
cm = confusion_matrix(df_test['label'], NB_Result)
print(cm)

[[367   6  11  11   5   9  16   6   0   1   5  10   5  13]
 [ 22 465   4   4   0   0  11   2   1   0   0   0   1   3]
 [ 17   1 306  60  20   2   3   0   1   2   2  49  16  33]
 [  5   0  12 457  11   2   0   3   0   0   0   6   0   2]
 [  6  12  17  54 360   5   3   0   3   1   5   1   0   2]
 [ 30   0   4  24   6 406   8   7   1   0   1   7  10   4]
 [ 27  15   9  16   3  14 411  12   3   3   0   1   3   5]
 [  4   0   0   3   0   3  11 449   7   3   0   0   0   0]
 [  0   3   0   0   0   0   4  10 507   0   0   0   0   0]
 [  4   1   1  14   2   0   1   6   0 372  67   0   0   1]
 [  6   1   0   6   0   1   2   1   0  79 409   5   0   0]
 [  2   0   1   3   0   0   0   0   0   0   0 474   5   1]
 [  1   0   3   6   0   1   0   0   0   0   0  14 473   5]
 [  7   4   7   9   1   2   0   1   1   1   1   3  17 487]]
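
Per-class recall and precision can be read off a confusion matrix directly: with sklearn's convention, rows are true labels and columns are predictions. The sketch below uses a small made-up 3-class matrix rather than the output above.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class.
cm_toy = np.array([
    [8, 1, 1],
    [2, 6, 2],
    [0, 0, 10],
])

recall = np.diag(cm_toy) / cm_toy.sum(axis=1)     # correct / all true members
precision = np.diag(cm_toy) / cm_toy.sum(axis=0)  # correct / all predicted members
worst_class = int(np.argmin(recall))              # class the model misses most often
```

In the matrix printed above, for instance, class 2 has the smallest diagonal entry (306), with 60 of its documents predicted as class 3 and 49 as class 11.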


# Part III: Compare your performance to an off-the-shelf NB classifier
Open-source implementations of the multinomial NB classifier you built in Part I already exist, of course. One such implementation is [`sklearn.naive_bayes.MultinomialNB`](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB) from the sklearn library.

### (5 pts) Task II: NB model comparison
Train this model on the same data and compare its performance with your model using the metrics from Part II.

In [25]:
from sklearn.naive_bayes import MultinomialNB

In [26]:

def process(content):
    vc = CountVectorizer(input='content', stop_words = "english", max_features=400)
    DTM_Count = vc.fit_transform(content)
    # get_feature_names() was removed in newer scikit-learn releases
    ColumnNames = vc.get_feature_names_out()
    DF_Count = pd.DataFrame(DTM_Count.toarray(),columns=ColumnNames)
    return DF_Count

In [27]:
x_train = process(df_train['content'])
y_train = df_train['label']
clf = MultinomialNB()
sk_NB = clf.fit(x_train, y_train)



In [28]:
print(len(df_test))
test = process(df_test['content'])
SKNB_Result = sk_NB.predict(test)

7000


Feature names unseen at fit time:
- 1972
- 31
- added
- assembly
- campus
- ...
Feature names seen at fit time, yet now missing:
- 1986
- 1987
- 1988
- albums
- bank
- ...
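
The feature-name warning above is what sklearn emits when the columns passed to `predict` differ from those seen during `fit`. Because `process` calls `fit_transform` on whatever text it receives, the test set gets its own 400-word vocabulary, so column `j` no longer refers to the same word. A minimal sketch of the usual pattern, fitting the vectorizer once on training text and only calling `transform` on test text, is shown below on a made-up corpus; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical miniature corpus standing in for df_train / df_test.
train_texts = ["the river flows north", "the album was released",
               "a river in the valley", "debut album by the band"]
train_labels = [0, 1, 0, 1]
test_texts = ["a new album", "the longest river"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # learn the vocabulary here only
X_test = vectorizer.transform(test_texts)        # reuse it; unseen words are dropped

clf_toy = MultinomialNB().fit(X_train, train_labels)
pred = clf_toy.predict(X_test)
```

Both matrices then share the same columns in the same order, which is what every downstream sklearn estimator assumes.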



In [29]:
#accuracy
sk_acc = accuracy_score(df_test['label'], SKNB_Result)
print(Acc, sk_acc)

0.849 0.265


In [30]:
# F1 Score
sk_f1 = f1_score(SKNB_Result, df_test['label'], average='weighted')
print(F1, sk_f1)

0.8503527856550249 0.2838799876012286


In [31]:
#the confusion matrix
sk_cm = confusion_matrix(df_test['label'], SKNB_Result)
print(cm, sk_cm)

[[367   6  11  11   5   9  16   6   0   1   5  10   5  13]
 [ 22 465   4   4   0   0  11   2   1   0   0   0   1   3]
 [ 17   1 306  60  20   2   3   0   1   2   2  49  16  33]
 [  5   0  12 457  11   2   0   3   0   0   0   6   0   2]
 [  6  12  17  54 360   5   3   0   3   1   5   1   0   2]
 [ 30   0   4  24   6 406   8   7   1   0   1   7  10   4]
 [ 27  15   9  16   3  14 411  12   3   3   0   1   3   5]
 [  4   0   0   3   0   3  11 449   7   3   0   0   0   0]
 [  0   3   0   0   0   0   4  10 507   0   0   0   0   0]
 [  4   1   1  14   2   0   1   6   0 372  67   0   0   1]
 [  6   1   0   6   0   1   2   1   0  79 409   5   0   0]
 [  2   0   1   3   0   0   0   0   0   0   0 474   5   1]
 [  1   0   3   6   0   1   0   0   0   0   0  14 473   5]
 [  7   4   7   9   1   2   0   1   1   1   1   3  17 487]] [[322   5  35   3   5  25  24   3   0   5   2   9  10  17]
 [ 57  98  38   1  19  64  18   1   3   5  40   0   1 168]
 [ 90   5 198   6   8  28  24   3   1   2   4  11   7 1

# Part IV: Compare NB to other classification models

Now that we've built and validated our NB classifier, we want to evaluate other models on this task.

### (5 pts) Task III: Evaluate the perceptron, SVM (linear), and SVM (RBF kernel)
Train and evaluate the following models on this dataset, and compare them with the NB models.

1. [Perceptron](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html#sklearn.linear_model.Perceptron)
2. [Linear-SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)
3. [RBF-Kernel-SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)

In [32]:
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC, LinearSVC

In [33]:
# Your code goes here
# Perceptron
per = Perceptron(tol=1e-3, random_state=0)
per_model = per.fit(x_train, df_train['label'])
result_per = per.predict(test)
per_acc = accuracy_score(df_test['label'], result_per)
#compare the accuracy of perceptron and my NB model
print(Acc, per_acc)

0.849 0.308


Feature names unseen at fit time:
- 1972
- 31
- added
- assembly
- campus
- ...
Feature names seen at fit time, yet now missing:
- 1986
- 1987
- 1988
- albums
- bank
- ...



In [34]:
# Linear-SVM
L_SVM = LinearSVC(C=5)
L_SVM.fit(x_train, df_train['label'])
result_lsvm = L_SVM.predict(test)
L_SVM_acc = accuracy_score(df_test['label'], result_lsvm)
#compare the accuracy of Linear-SVM and my NB model
print(Acc, L_SVM_acc)

0.849 0.2905714285714286


Feature names unseen at fit time:
- 1972
- 31
- added
- assembly
- campus
- ...
Feature names seen at fit time, yet now missing:
- 1986
- 1987
- 1988
- albums
- bank
- ...



In [35]:
# RBF-Kernel SVM
RBF_SVM = SVC(kernel = 'rbf', C=5, probability=True)
RBF_SVM.fit(x_train, df_train['label'])
result_rbfsvm = RBF_SVM.predict(test)
RBF_acc = accuracy_score(df_test['label'], result_rbfsvm)
#compare the accuracy of RBF-Kernel SVM and my NB model
print(Acc, RBF_acc)
# This cell did not finish running, so no result is shown
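
Two documented properties of `sklearn.svm.SVC` are worth keeping in mind here: its fit time grows at least quadratically with the number of samples, and `probability=True` adds an internal cross-validation pass. One hedged way to get a result in reasonable time is to train on a subsample and leave `probability` at its default. The sketch below illustrates this on synthetic data from `make_classification`; the data and the subsample size are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# Synthetic stand-in for the count features (not the DBpedia data).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_tr, y_tr, X_te, y_te = X[:1500], y[:1500], X[1500:], y[1500:]

# Train on a random subsample; the size (500) is an arbitrary choice.
rng = np.random.default_rng(0)
subset = rng.choice(len(X_tr), size=500, replace=False)

svm_toy = SVC(kernel="rbf", C=5)  # probability left at its default (False)
svm_toy.fit(X_tr[subset], y_tr[subset])
acc_toy = accuracy_score(y_te, svm_toy.predict(X_te))
```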

### (5 pts) Task IV: Select the best model

1. Which model performed the best overall? 
My from-scratch Naive Bayes model from Part I performed best overall.
2. What metric(s) influence this decision?
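I used only accuracy to make this decision. The from-scratch Naive Bayes model scored 0.849, well above MultinomialNB (0.265), the perceptron (0.308), and the linear SVM (about 0.291). Note, however, that the sklearn models were evaluated on test features built by a re-fitted vectorizer, as the feature-name warnings indicate, so their scores likely understate what those models can do.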
3. Does the model that learns a non-linear decision boundary help?
Unclear. The RBF-kernel SVM did not finish training after more than 3 hours, so no result is available for it.