# Lab 03: Text Classification on the DBpedia14 dataset

### Objectives:
1. Build a Naive Bayes classification model from scratch
2. Evaluate the performance of your model on the DBpedia14 dataset
3. Train an off-the-shelf NB classifier and compare its performance to your implementation
4. Train off-the-shelf implementations of the linear-SVM, RBF-kernel-SVM, and perceptron and compare their performance with the NB models

### Suggested Reading

1. https://arxiv.org/pdf/1811.12808.pdf

### Download the dataset

In [1]:
import datasets
import pandas as pd

train_ds, test_ds = datasets.load_dataset('dbpedia_14', split=['train[:80%]', 'test[80%:]'])
df_train: pd.DataFrame = train_ds.to_pandas()
df_test: pd.DataFrame = test_ds.to_pandas()

Reusing dataset d_bpedia14 (C:\Users\Administrator\.cache\huggingface\datasets\d_bpedia14\dbpedia_14\2.0.0\7f0577ea0f4397b6b89bfe5c5f2c6b1b420990a1fc5e8538c7ab4ec40e46fa3e)


In [2]:
len(df_train),len(df_test)

(448000, 14000)

In [3]:
df_train.head()

| | label | title | content |
|---|---|---|---|
| 0 | 0 | E. D. Abbott Ltd | Abbott of Farnham E D Abbott Limited was a Br... |
| 1 | 0 | Schwan-Stabilo | Schwan-STABILO is a German maker of pens for ... |
| 2 | 0 | Q-workshop | Q-workshop is a Polish company located in Poz... |
| 3 | 0 | Marvell Software Solutions Israel | Marvell Software Solutions Israel known as RA... |
| 4 | 0 | Bergan Mercy Medical Center | Bergan Mercy Medical Center is a hospital loc... |


In [4]:
df_train["content"][100]

' Thunes Mekaniske Værksted A/S Thune for short was a Norwegian manufacturing company that among other things built locomotives. The production facilities were last located at Skøyen.'

In [5]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 448000 entries, 0 to 447999
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   label    448000 non-null  int64 
 1   title    448000 non-null  object
 2   content  448000 non-null  object
dtypes: int64(1), object(2)
memory usage: 10.3+ MB


In [6]:
label = df_train["label"]
max(label)

11

In [7]:
df_train.describe(include=['object'])

| | title | content |
|---|---|---|
| count | 448000 | 448000 |
| unique | 448000 | 447896 |
| top | Protea neriifolia | Steinkopf is a mountain of Hesse Germany. |
| freq | 1 | 4 |


# Process the data

*Note: the dataset is large enough that processing all of it takes a long time, so here I draw a random sample of 10,000 training instances.*

In [8]:
import nltk
from nltk.tokenize import word_tokenize 
from nltk.stem import WordNetLemmatizer
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
stopwords = set(nltk.corpus.stopwords.words('english'))
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [40]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
import numpy as np

In [66]:
import random
new_doc=[]
new_label=[]
index=random.sample(range(0,len(df_train)),10000)
for i in index:
    new_doc.append(df_train["content"][i])
    new_label.append(df_train["label"][i])

In [68]:
new_df = pd.DataFrame({'content':new_doc,'label':new_label})
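
The same kind of random sample can also be drawn in a single pandas call. The sketch below runs on a hypothetical toy frame (`toy_df`, its size, and the seed are illustrative assumptions, not part of the lab); passing `random_state` makes the draw reproducible across runs.

```python
import pandas as pd

# Hypothetical stand-in for df_train: 100 short documents with 4 labels.
toy_df = pd.DataFrame({
    'content': [f'document number {i}' for i in range(100)],
    'label': [i % 4 for i in range(100)],
})

# Draw 20 rows without replacement; random_state fixes the sample.
sample_df = (
    toy_df[['content', 'label']]
    .sample(n=20, random_state=0)
    .reset_index(drop=True)
)
print(sample_df.shape)
```

Resetting the index keeps positional lookups such as `sample_df['content'][0]` working after sampling.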

In [14]:
import re
def normalize_document(doc):
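    # keep only ASCII letters and whitespace; flags must be passed by keyword
    # because the fourth positional argument of re.sub is count
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I|re.A)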
    doc = doc.lower()
    doc = doc.strip()
    
    # tokenize document
    word_tokens = word_tokenize(doc)   
    # filter stopwords out of document
    filtered_sentence = [] 
    for w in word_tokens: 
        if w not in stopwords: 
            filtered_sentence.append(w) 
    # re-create document from filtered tokens
    doc = ' '.join(filtered_sentence)
    
    return doc

In [69]:
#normalize document
norm_doc = new_df['content'].apply(normalize_document)

In [71]:
norm_doc[0]

'rivka galchen born april canadianamerican writer first novel atmospheric disturbances published translated languages awarded william saroyan international prize writing'

In [72]:
norm_doc.head()

0    rivka galchen born april canadianamerican writ...
1    taxonomy triploceras genus algae specifically ...
2    scopula pseudocorrivalaria moth geometridae fa...
3    hayward high school hhs serves students around...
4    fleur pellerin french pronunciation fl pl born...
Name: content, dtype: object

In [73]:
from nltk.corpus import wordnet 

In [74]:

def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None

#initialize lemmatizer
lemmatizer = WordNetLemmatizer()

In [75]:
def lemmatize_sentence(sentence):
    #tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
    #tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:        
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)


In [76]:
#lemmatize document
lemm_doc= norm_doc.apply(lemmatize_sentence)

In [77]:
lemm_doc[0]

'rivka galchen bear april canadianamerican writer first novel atmospheric disturbance publish translated language award william saroyan international prize writing'

In [78]:
lemm_doc.head()

0    rivka galchen bear april canadianamerican writ...
1    taxonomy triploceras genus algae specifically ...
2    scopula pseudocorrivalaria moth geometridae fa...
3    hayward high school hhs serve student around h...
4    fleur pellerin french pronunciation fl pl bear...
Name: content, dtype: object

In [79]:
#convert normalized, lemmatized text to raw term counts
cv = CountVectorizer()
cv_df = cv.fit_transform(lemm_doc) 

In [80]:
#convert normalized, lemmatized text to TF-IDF weights
tdif = TfidfVectorizer()
tdif_df =  tdif.fit_transform(lemm_doc)
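
To make the difference between the two representations concrete, here is a small sketch on a hypothetical three-document corpus (`toy_docs` is made up for illustration). `CountVectorizer` yields integer term counts, while `TfidfVectorizer` with its default settings yields real-valued weights with each row scaled to unit L2 norm.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, already normalized and lemmatized.
toy_docs = ["moth family moth", "school student", "moth school"]

cv_toy = CountVectorizer()
counts = cv_toy.fit_transform(toy_docs)      # sparse matrix of integer counts
print(sorted(cv_toy.vocabulary_.items(), key=lambda kv: kv[1]))
print(counts.toarray())

tfidf_toy = TfidfVectorizer()
weights = tfidf_toy.fit_transform(toy_docs)  # sparse matrix of float weights
# With the default norm='l2', every document row has unit length.
row_norms = np.sqrt(np.asarray(weights.multiply(weights).sum(axis=1))).ravel()
print(weights.toarray().round(3))
```

This matters for Naive Bayes: the multinomial likelihood is defined over counts, so feeding it TF-IDF weights is a heuristic that often works in practice rather than a strict application of the model.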

In [88]:
X_data = tdif_df
y_data = new_df['label']

In [89]:
x_train, x_test, y_train, y_test = train_test_split(X_data, y_data, random_state=4)

In [90]:
x_train.shape

(7500, 43775)
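
When some labels are rare, a plain random split can leave them under-represented in one of the partitions. A hedged sketch of a stratified split follows, using hypothetical toy arrays (`X_toy`, `y_toy`); `stratify` is a standard `train_test_split` parameter that preserves the label proportions in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 samples of class 0, 20 of class 1.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.25,      # same proportion as the default split
    stratify=y_toy,      # keep the 80/20 label ratio in both parts
    random_state=4,
)
print(np.bincount(y_tr), np.bincount(y_te))
```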

# Part I: Build your own Naive Bayes classification model

### (5 pts) Task I: Build a model from scratch
Using your notes from lecture-02, implement a Naive Bayes model and train it on the DBpedia dataset. Also, feel free to use any text preprocessing you wish, such as the pipeline from Lab02. 

Below is a template class to help you think about the structure of this problem (feel free to design your own code if you like). It contains methods for each inference step in NB. It also has a classmethod that you could use to instantiate the class from a list of documents and a corresponding list of labels. Here we suggest that you create a dictionary that maps each word to a unique $i$-th index in the $\phi_{i,k}$ probability matrix, which you need to estimate. Because the labels are a set of 0-indexed integers, they naturally map to a unique position $\mu_{k}$ (you should check this to make sure).
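
For reference, the quantities to estimate can be written out as below. This is a sketch under stated assumptions: $\mu_k$ is taken to be the class prior, $\phi_{i,k}$ the probability of word $i$ under class $k$, and add-one (Laplace) smoothing is used. The symbols $N$ (number of training documents), $N_k$ (documents in class $k$), $c_{i,k}$ (total count of word $i$ in class $k$), $V$ (vocabulary size), and $x_i$ (count of word $i$ in the document being classified) are notation introduced here, so check them against your lecture-02 notes.

$$
\mu_k = \frac{N_k}{N}, \qquad
\phi_{i,k} = \frac{c_{i,k} + 1}{\sum_{j=1}^{V} c_{j,k} + V}, \qquad
\hat{y}(x) = \arg\max_k \left[ \log \mu_k + \sum_{i=1}^{V} x_i \log \phi_{i,k} \right]
$$

Working in log space avoids numerical underflow when many small probabilities are multiplied together.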

In [91]:
class NaiveBayesModel:
    
    def __init__(self):
        self.dictionary = {}
        self.classes = None
    
    def fit(self,x_train,y_train):
        x_train = x_train.toarray()
        self.classes=set(y_train)
        self.dictionary["total_y"]=y_train.shape[0]
    
        for current_class in self.classes:
            self.dictionary[current_class]={}
        
            # dataset selection for a particular class
            criteria=(y_train == current_class)
            x_train_current=x_train[criteria]
            y_train_current=y_train[criteria]
        
            self.dictionary[current_class]["total_word_count_in_class"]=sum(map(sum,x_train_current))
            self.dictionary[current_class]["total_y_in_class"]=y_train_current.shape[0]
             
            num_features=x_train.shape[-1]
            for i in range(1,num_features+1):
                self.dictionary[current_class][i]=sum(x_train_current[:,i-1])
    
    def probability(self,x,current_class):
        # log prior of the class: log(N_k) - log(N)
        output=np.log(self.dictionary[current_class]["total_y_in_class"]) - np.log(self.dictionary["total_y"])
        num_features=len(self.dictionary[current_class].keys()) - 2
        # add-one smoothed denominator; identical for every feature, so compute it once
        count_current_class_with_all_feature_words=self.dictionary[current_class]["total_word_count_in_class"]+num_features
        for i in range(1,num_features + 1):
            if x[i-1]<=0:
                continue  # absent words contribute nothing to the score
            count_current_class_with_feature_word=self.dictionary[current_class][i] + 1
            probability_of_feature_word=np.log(count_current_class_with_feature_word) - np.log(count_current_class_with_all_feature_words)
            # add the log-likelihood once per occurrence, rounding fractional weights up;
            # TF-IDF weights lie in (0, 1], so each present word counts once
            output+=np.ceil(x[i-1])*probability_of_feature_word
        return output
        
    def _predict(self,x):       
        best_class=-1000
        best_p=1000
        first_run=True  
        # storing all the classes in class_values
        class_values=self.dictionary.keys()
    
        # loops over ALL THE CLASSES
        for current_class in class_values:
            if current_class=="total_y":
                continue
            
            # calculate probability corresponding to every class
            current_class_p=self.probability(x,current_class)
        
            # calculating best probability and class
            if first_run or best_p < current_class_p:
                best_p=current_class_p
                best_class=current_class
            first_run=False
        return best_class

    def predict(self,x_test):
        x_test = x_test.toarray()
        y_pred = [] 
        for index in range(x_test.shape[0]):
            y_pred.append(self._predict(x_test[index]))    
        return y_pred
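
The per-feature Python loops above become slow as the vocabulary grows. As a point of comparison, here is a minimal vectorized sketch of multinomial Naive Bayes with add-one smoothing. The function names `fit_nb` and `predict_nb` and the toy data are hypothetical, and unlike the class above this version weights each log-likelihood by the raw feature value, which is the usual multinomial formulation.

```python
import numpy as np
from scipy import sparse

def fit_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from a feature matrix."""
    y = np.asarray(y)
    classes = np.unique(y)
    # One row of summed feature values per class, shape (n_classes, n_features).
    counts = np.vstack([np.asarray(X[y == k].sum(axis=0)).ravel() for k in classes])
    log_prior = np.log(np.array([(y == k).sum() for k in classes]) / y.shape[0])
    smoothed = counts + alpha
    log_phi = np.log(smoothed) - np.log(smoothed.sum(axis=1, keepdims=True))
    return classes, log_prior, log_phi

def predict_nb(X, classes, log_prior, log_phi):
    """Return the class with the highest log posterior score for each row of X."""
    # (n_docs, n_features) @ (n_features, n_classes) -> per-class log likelihoods
    scores = np.asarray(X @ log_phi.T) + log_prior
    return classes[np.argmax(scores, axis=1)]

# Hypothetical toy counts: class 0 favours feature 0, class 1 favours feature 2.
X_toy = sparse.csr_matrix(np.array([[2, 0, 0], [3, 1, 0], [0, 2, 3], [0, 1, 4]]))
y_toy = [0, 0, 1, 1]
classes, log_prior, log_phi = fit_nb(X_toy, y_toy)
pred = predict_nb(sparse.csr_matrix(np.array([[1, 0, 0], [0, 0, 2]])),
                  classes, log_prior, log_phi)
print(pred)
```

Because the matrix stays sparse throughout, this approach avoids the `toarray()` conversion and scores all test documents in one matrix product.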


In [92]:
# Your code goes here
Bayes = NaiveBayesModel()
Bayes.fit(x_train,y_train)

In [93]:
y_pred = Bayes.predict(x_test[:100,:])

# Part II: Model performance evaluation

Evaluating the performance of a classification model may seem as simple as computing accuracy, and in some cases that is sufficient, but in general accuracy is not a reliable metric by itself. Typically we need to evaluate our model using several different metrics.

One common issue is class imbalance, which occurs when the label distribution in the data is far from uniform. In this case a high accuracy can be misleading because low-frequency labels contribute less to the score. More generally, this is one of the biggest drawbacks of using MLE in NLP: models tend to be much less sensitive to low-probability labels than to higher-probability labels. Later in this class we will explore models that learn by predicting words given their context. Can you think of reasons why this can be problematic? Hint: remember Zipf's law?

Another reason to use multiple evaluation methods is that it can help you better understand your data. Evaluating performance on individual classes often reveals problems with the data that would otherwise go unnoticed. For example, if you observe an abundance of misclassified data specific to only a few classes, chances are you have inconsistent labels for those classes in the training set. This is very common in third-party Mechanical Turk data, where quality can vary wildly.

In this lab we will use three metrics and one visualization tool:

1. [Accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision)
2. [F1 score](https://en.wikipedia.org/wiki/F-score)
3. [AUC ROC score](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)
4. [The confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix)

The [metrics module](https://scikit-learn.org/stable/modules/model_evaluation.html) within sklearn provides support for nearly any evaluation metric that you will need.
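
The choice of averaging matters for F1. In single-label multiclass problems, micro-averaged F1 is equal to accuracy, so it adds no information beyond it, whereas macro-averaged F1 weights every class equally and exposes failures on rare classes. A small sketch on hypothetical toy labels (not lab data), where a rare class is always misclassified:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: class 2 is rare and is always predicted as class 0.
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
y_pred_toy = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

acc = accuracy_score(y_true_toy, y_pred_toy)
micro = f1_score(y_true_toy, y_pred_toy, average='micro')
# zero_division=0 scores the never-predicted class as 0 without a warning.
macro = f1_score(y_true_toy, y_pred_toy, average='macro', zero_division=0)
per_class = f1_score(y_true_toy, y_pred_toy, average=None, zero_division=0)

print(f'accuracy={acc:.3f} micro-F1={micro:.3f} macro-F1={macro:.3f}')
print('per-class F1:', per_class.round(3))
```

Passing `average=None` returns one score per class, which is often the quickest way to spot which labels a model is failing on.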

In [43]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix


In [96]:
print(accuracy_score(y_test[:100],y_pred))
print(f1_score(y_test[:100],y_pred,average='micro'))
print(confusion_matrix(y_test[:100],y_pred))

0.92
0.92
[[ 7  0  2  0  0  0  0  0  0  0  0  0]
 [ 0 13  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  8  0  0  0  0  0  0  0  0  0]
 [ 0  0  0 12  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  6  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  6  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  7  0  0  0  0  0]
 [ 0  0  0  0  0  0  1  9  0  0  0  0]
 [ 0  0  0  0  0  0  0  1 11  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  5  1  0]
 [ 0  1  0  0  0  0  0  0  0  0  8  0]
 [ 0  0  2  0  0  0  0  0  0  0  0  0]]


# Part III: Compare your performance to an off-the-shelf NB classifier
Of course, open-source implementations of the NB classifier you built in Part I already exist. One such implementation is [`sklearn.naive_bayes.MultinomialNB`](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB) from the sklearn library. 

### (5 pts) Task II: NB model comparison
Train this model on the same data and compare its performance with your model using the metrics from Part II.

In [98]:
from sklearn.naive_bayes import MultinomialNB

In [99]:
# Your code goes here
mnb = MultinomialNB().fit(x_train, y_train)
y_pred = mnb.predict(x_test)

In [101]:
print(accuracy_score(y_test,y_pred))
print(f1_score(y_test,y_pred,average='micro'))
print(confusion_matrix(y_test,y_pred))

0.9292
0.9292
[[218   3  12   0   1  12   3   0   0   0   0   0]
 [  3 216   0   0   0   1   3   1   0   0   0   0]
 [  3   2 201   2   8   1   2   0   0   0   0   0]
 [  0   0   4 200   3   2   0   0   0   0   0   0]
 [  2   2   0   2 211   1   1   0   0   0   0   0]
 [  5   0   0   0   0 221   1   0   0   0   0   0]
 [  4   6   2   0   1   1 196   3   0   0   0   0]
 [  0   0   0   0   0   1   1 225   0   0   1   0]
 [  0   0   0   0   1   1   5   2 238   0   0   0]
 [  0   0   0   2   0   0   0   0   0 201  12   0]
 [  1   0   0   0   0   1   0   0   0   2 196   0]
 [  0   0  50   0   0   0   0   0   0   0   0   0]]
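
Part II also lists the AUC ROC score, which needs per-class scores rather than hard predictions. Below is a hedged sketch of a one-vs-rest multiclass AUC using `predict_proba` from `MultinomialNB`. The synthetic count data is a hypothetical stand-in, and the score is computed on the training data purely to keep the example short.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)

# Hypothetical toy counts: 3 classes, each favouring its own block of 4 features.
n_per_class, n_features = 60, 12
rates = np.full((3, n_features), 0.5)
for k in range(3):
    rates[k, 4 * k: 4 * k + 4] = 3.0
X_toy = np.vstack([rng.poisson(rates[k], size=(n_per_class, n_features))
                   for k in range(3)])
y_toy = np.repeat(np.arange(3), n_per_class)

clf = MultinomialNB().fit(X_toy, y_toy)
proba = clf.predict_proba(X_toy)   # shape (n_samples, n_classes), rows sum to 1
auc = roc_auc_score(y_toy, proba, multi_class='ovr', average='macro')
print(f'macro one-vs-rest AUC: {auc:.3f}')
```

In the notebook, the analogous call would presumably be `roc_auc_score(y_test, mnb.predict_proba(x_test), multi_class='ovr')`, provided every class appears in `y_test`.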


# Part IV: Compare NB to other classification models

Now that we've built and validated our NB classifier, we want to evaluate other models on this task.

### (5 pts) Task III: Evaluate the perceptron, SVM (linear), and SVM (RBF kernel)
Train and evaluate the following models on this dataset, and compare them with the NB models.

1. [Perceptron](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html#sklearn.linear_model.Perceptron)
2. [Linear-SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)
3. [RBF-Kernel-SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
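
One way to keep the comparison consistent is to fit and score every model in a single loop, so each one sees the same split and the same metrics. This is a sketch on hypothetical synthetic data from `make_classification`; the dictionary keys and hyperparameters are illustrative choices, not prescribed by the lab.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC

# Hypothetical 3-class dataset standing in for the TF-IDF features.
X_toy, y_toy = make_classification(n_samples=600, n_features=20, n_informative=8,
                                   n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, stratify=y_toy,
                                          random_state=4)

models = {
    'Perceptron': Perceptron(random_state=0),
    'LinearSVC': LinearSVC(random_state=0),
    'SVC (RBF)': SVC(kernel='rbf'),
}

results = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    results[name] = (accuracy_score(y_te, pred),
                     f1_score(y_te, pred, average='macro'))
    print(f'{name:<12} accuracy={results[name][0]:.3f} macro-F1={results[name][1]:.3f}')
```

Collecting the scores in `results` makes it easy to sort or tabulate them afterwards instead of reading values off separate cells.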

In [102]:
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC, LinearSVC

In [103]:
# Your code goes here
per = Perceptron().fit(x_train,y_train)
y_pred = per.predict(x_test)

In [106]:
print(accuracy_score(y_test,y_pred))
print(f1_score(y_test,y_pred,average='micro'))
print(confusion_matrix(y_test,y_pred))

0.9504
0.9504
[[225   3  11   0   0   6   3   0   0   1   0   0]
 [  4 215   0   0   1   1   1   0   0   2   0   0]
 [  6   3 197   2   5   0   1   0   0   2   0   3]
 [  0   0   4 198   6   0   0   0   0   0   0   1]
 [  1   2   0   0 212   2   1   0   0   0   0   1]
 [  6   0   0   0   0 219   2   0   0   0   0   0]
 [  6   5   3   1   2   1 191   2   1   0   1   0]
 [  0   0   0   0   0   2   1 224   0   1   0   0]
 [  0   0   0   0   0   0   2   2 243   0   0   0]
 [  1   0   0   0   1   0   0   0   0 209   4   0]
 [  0   1   0   0   0   0   0   0   0   2 197   0]
 [  1   0   3   0   0   0   0   0   0   0   0  46]]


In [107]:
svc = SVC()
svc.fit(x_train,y_train)
y_pred = svc.predict(x_test)

In [108]:
print(accuracy_score(y_test,y_pred))
print(f1_score(y_test,y_pred,average='micro'))
print(confusion_matrix(y_test,y_pred))

0.9436
0.9436
[[234   1   7   0   0   5   2   0   0   0   0   0]
 [  5 212   3   0   0   0   4   0   0   0   0   0]
 [  5   2 207   0   3   0   0   0   0   0   0   2]
 [  0   0  12 192   3   2   0   0   0   0   0   0]
 [  3   2   4   0 208   1   1   0   0   0   0   0]
 [  8   0   0   0   1 217   1   0   0   0   0   0]
 [  6   3   5   0   0   0 197   2   0   0   0   0]
 [  2   0   0   0   1   1   1 223   0   0   0   0]
 [  4   0   1   0   0   0   2   1 239   0   0   0]
 [  1   0   1   0   0   0   0   0   0 210   3   0]
 [  3   0   1   0   0   1   0   0   0  10 185   0]
 [  1   0  14   0   0   0   0   0   0   0   0  35]]


In [109]:
svcLinear = LinearSVC()
svcLinear.fit(x_train,y_train)
y_pred = svcLinear.predict(x_test)

In [110]:
print(accuracy_score(y_test,y_pred))
print(f1_score(y_test,y_pred,average='micro'))
print(confusion_matrix(y_test,y_pred))

0.962
0.962
[[231   1   7   0   0   7   3   0   0   0   0   0]
 [  5 218   0   0   0   0   1   0   0   0   0   0]
 [  5   1 204   0   6   0   0   0   0   1   0   2]
 [  1   0   4 197   4   2   0   0   0   0   0   1]
 [  3   2   1   0 211   1   1   0   0   0   0   0]
 [  6   0   0   0   1 218   1   1   0   0   0   0]
 [  4   4   2   0   0   0 198   3   2   0   0   0]
 [  0   0   0   0   1   1   1 225   0   0   0   0]
 [  0   0   0   0   0   0   0   2 245   0   0   0]
 [  0   0   0   0   0   0   0   0   0 212   3   0]
 [  0   0   0   0   0   1   0   0   0   2 197   0]
 [  0   0   1   0   0   0   0   0   0   0   0  49]]


### (5 pts) Task IV: Select the best model

1. Which model performed the best overall? 
2. What metric(s) influence this decision?
3. Does the model that learns a non-linear decision boundary help?

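Based on the results above, the best model overall is LinearSVC, with an accuracy and micro-averaged F1 of 0.962, ahead of the perceptron (0.9504), the RBF-kernel SVC (0.9436), and MultinomialNB (0.9292). The decision rests on accuracy, micro-F1, and the confusion matrices: MultinomialNB assigns all 50 class-11 test documents to class 2, while LinearSVC gets 49 of them right. The model that learns a non-linear decision boundary does not help here: the RBF-kernel SVC scores below both linear models, which suggests the classes are already close to linearly separable in the TF-IDF feature space.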