# Naive Bayes with Ngram
In this project we use the Naive Bayes algorithm to perform text classification.<br>
The dataset we use can be found on [kaggle](https://www.kaggle.com/datasets/kashnitsky/hierarchical-text-classification).

## Pulling in our dependencies

In [27]:
from sklearn.model_selection import train_test_split
from tqdm import tqdm
import numpy as np
import pandas as pd
import string
import re
import nltk
import spacy

## Extracting the dataset

In [2]:
def load_csv_dataset(path):
    """Load a dataset from a CSV file.

    Args:
        path (str): relative path to the CSV file

    Returns:
        pd.DataFrame: the loaded dataframe
    """
    return pd.read_csv(path)

In [3]:
df = load_csv_dataset("train_40k.csv")

### Check the dataframe we just obtained

In [4]:
df.head()

Unnamed: 0,productId,Title,userId,Helpfulness,Score,Time,Text,Cat1,Cat2,Cat3
0,B000E46LYG,Golden Valley Natural Buffalo Jerky,A3MQDNGHDJU4MK,0/0,3.0,-1,The description and photo on this product need...,grocery gourmet food,meat poultry,jerky
1,B000GRA6N8,Westing Game,unknown,0/0,5.0,860630400,This was a great book!!!! It is well thought t...,toys games,games,unknown
2,B000GRA6N8,Westing Game,unknown,0/0,5.0,883008000,"I am a first year teacher, teaching 5th grade....",toys games,games,unknown
3,B000GRA6N8,Westing Game,unknown,0/0,5.0,897696000,I got the book at my bookfair at school lookin...,toys games,games,unknown
4,B00000DMDQ,I SPY A is For Jigsaw Puzzle 63pc,unknown,2/4,5.0,911865600,Hi! I'm Martine Redman and I created this puzz...,toys games,puzzles,jigsaw puzzles


## Drop all unnecessary data
The dataset we just loaded has many columns that are irrelevant to the text classification task we want to perform. We want to classify the text contained in the Text column and predict the correct Cat1.<br>
Let's rename the Cat1 column to label and the Text column to description to be clearer.

In [5]:
# Keep only the review text and its top-level category
df = df.drop(columns=["productId", "Title", "userId", "Helpfulness",
                      "Score", "Time", "Cat2", "Cat3"])
df = df.rename(columns={"Text": "description", "Cat1": "label"})

In [6]:
df.head()

Unnamed: 0,description,label
0,The description and photo on this product need...,grocery gourmet food
1,This was a great book!!!! It is well thought t...,toys games
2,"I am a first year teacher, teaching 5th grade....",toys games
3,I got the book at my bookfair at school lookin...,toys games
4,Hi! I'm Martine Redman and I created this puzz...,toys games


## Text preprocessing
Since we are using a Naive Bayes algorithm, we do not need to remove the stop words.

### Define the patterns to be removed in the text

In [7]:
lemmatizer = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
stemmer = nltk.SnowballStemmer("english")
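remove_symbols = re.compile(r'[-+/(){}\[\]\|@,;]')  # Raw string so the backslash escapes reach the regex engine
remove_numbers = re.compile(r'[0-9] {,1}')  # A digit, optionally followed by one space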
PUNCTUATION = string.punctuation
emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        "]+", flags=re.UNICODE)

### Define a function to remove all these patterns from a sentence

In [10]:
def lemmatize_sentence(sentence):
    """Function to lemmatize a sentence

    Args:
        sentence (str): the string to lemmatize

    Returns:
        str: the lemmatized string
    """
    doc = lemmatizer(sentence)
    return " ".join([token.lemma_ for token in doc])

def text_preprocess(sentence):
    """Preprocess a sentence: remove punctuation, emoji, symbols and digits, then lemmatize

    Args:
        sentence (str): sentence to be preprocessed

    Returns:
        str: the new sentence
    """
    if isinstance(sentence, str):
        sentence = sentence.lower()  # Make the text lower case
        sentence = sentence.translate(str.maketrans('', '', PUNCTUATION))  # Remove the punctuation
        sentence = emoji_pattern.sub(' ', sentence)  # Remove the emoji
        sentence = remove_symbols.sub(' ', sentence)  # Remove the remaining symbols
        sentence = remove_numbers.sub(' ', sentence)  # Remove the digits
        sentence = lemmatize_sentence(sentence)
        return sentence
    # The original built the exception without raising it, so bad input silently returned None
    raise TypeError("sentence needs to be a string.")
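
To see what the cleaning steps do before lemmatization, here is a minimal sketch that runs without spaCy. The function name `clean_without_lemmas` and the sample sentence are hypothetical, and the final whitespace collapse is an addition made only to keep the result readable.

```python
import re
import string

# Same digit pattern as in the notebook, reproduced so the sketch is self-contained.
remove_numbers = re.compile(r"[0-9] {,1}")
strip_punctuation = str.maketrans("", "", string.punctuation)

def clean_without_lemmas(sentence):
    """Lower-case, strip punctuation and digits (hypothetical helper, no lemmatization)."""
    sentence = sentence.lower()
    sentence = sentence.translate(strip_punctuation)
    sentence = remove_numbers.sub(" ", sentence)
    # Collapse repeated spaces; this step is not part of the original function.
    return " ".join(sentence.split())

sample = "I'm HAPPY with these 2 toys, 100%!"  # made-up sentence
cleaned = clean_without_lemmas(sample)
print(cleaned)
```

Note that stripping punctuation fuses contractions such as "I'm" into a single token, which the lemmatizer then has to deal with.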
    

In [15]:
print(df.description.values[0])
print(text_preprocess(df.description.values[0]))

The description and photo on this product needs to be changed to indicate this product is the BuffalOs version of this beef jerky.
the description and photo on this product need to be change to indicate this product be the buffalos version of this beef jerky


### Apply the function to the whole dataset
This step is slow (about four and a half minutes for the 40,000 rows here) because lemmatizing each sentence is expensive.

In [26]:
tqdm.pandas()  # To display a progress bar
df.description = df.description.progress_apply(text_preprocess)

100%|██████████| 40000/40000 [04:31<00:00, 147.30it/s]


Let's verify the output:

In [32]:
print(df.description[:2])

0    the description and photo on this product need...
1    this be a great book it be well think through ...
Name: description, dtype: object


## Split the dataset between train and test
We split the dataset into 80% for training and 20% for testing. 
We need to check whether the dataset is balanced. If it is not, we need to stratify the split to keep the same proportion of each category in the train and test sets.

In [33]:
train, test = train_test_split(df, test_size=0.2, stratify=df.label)

In [46]:
for val, val2 in zip(train.label.value_counts().items(), train.label.value_counts(normalize=True)):
    print(val[1], val2, val[0])

8213 0.25665625 toys games
7817 0.24428125 health personal care
4677 0.14615625 beauty
4509 0.14090625 baby products
3890 0.1215625 pet supplies
2894 0.0904375 grocery gourmet food


In [47]:
for val, val2 in zip(test.label.value_counts().items(), test.label.value_counts(normalize=True)):
    print(val[1], val2, val[0])

2053 0.256625 toys games
1955 0.244375 health personal care
1169 0.146125 beauty
1128 0.141 baby products
972 0.1215 pet supplies
723 0.090375 grocery gourmet food


We can see that the proportion of each label is the same in the test set and in the train set.
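
As a quick sanity check, the proportions can be compared numerically. This is a small sketch that only reuses the counts printed in the two outputs above; the helper name `proportions` is hypothetical.

```python
# Label counts copied from the train and test outputs above.
train_counts = {
    "toys games": 8213, "health personal care": 7817, "beauty": 4677,
    "baby products": 4509, "pet supplies": 3890, "grocery gourmet food": 2894,
}
test_counts = {
    "toys games": 2053, "health personal care": 1955, "beauty": 1169,
    "baby products": 1128, "pet supplies": 972, "grocery gourmet food": 723,
}

def proportions(counts):
    """Convert absolute label counts into fractions of the total."""
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

train_share = proportions(train_counts)
test_share = proportions(test_counts)

# Largest absolute difference between the train and test share of any label.
largest_gap = max(abs(train_share[label] - test_share[label]) for label in train_share)
print(f"largest gap between train and test proportions: {largest_gap:.6f}")
```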

## Ngram
Now that we have preprocessed our text and split it into train and test sets, we need to cut each sentence into n-grams, which serve as the tokens for our Naive Bayes algorithm.<br>
We need a function that does this for each sentence.
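
To make the idea concrete, here is a tiny illustration on a made-up sentence. The helper `toy_ngrams` is hypothetical and is only a sketch of the sliding-window idea, not the function used in the rest of the notebook.

```python
def toy_ngrams(text, n=1):
    """Return the word n-grams of text as space-joined strings."""
    words = text.split()
    # Slide a window of n words over the sentence.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "red velvet cake for a gift"  # made-up example
unigrams = toy_ngrams(sentence, 1)
bigrams = toy_ngrams(sentence, 2)
trigrams = toy_ngrams(sentence, 3)
print(bigrams)
```

A sentence of 6 words yields 6 unigrams, 5 bigrams and 4 trigrams, and a sentence shorter than `n` yields no token at all.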

In [58]:
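def ngram(text, n=1):
    """Split a sentence into its word n-grams, each returned as a space-joined string"""
    words = text.split()  # A plain list avoids the repeated copies made by np.append
    # zip the word list with its n shifted copies to get every window of n words
    grams = zip(*[words[i:] for i in range(n)])
    return [' '.join(gram) for gram in grams]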

In [72]:
ngram(train.description.values[0], 2)

['I read',
 'read the',
 'the review',
 'review for',
 'for this',
 'this while',
 'while I',
 'I be',
 'be look',
 'look for',
 'for a',
 'a specific',
 'specific flavor',
 'flavor red',
 'red velvet',
 'velvet cake',
 'cake for',
 'for a',
 'a gift',
 'gift and',
 'and I',
 'I be',
 'be sell',
 'sell I',
 'I buy',
 'buy the',
 'the red',
 'red velvet',
 'velvet as',
 'as well',
 'well as',
 'as blueberry',
 'blueberry cheesecake',
 'cheesecake chai',
 'chai pecan',
 'pecan pie',
 'pie and',
 'and zombie',
 'zombie I',
 'I love',
 'love these',
 'these they',
 'they go',
 'go on',
 'on so',
 'so smooth',
 'smooth and',
 'and they',
 'they re',
 're shiny',
 'shiny enough',
 'enough that',
 'that people',
 'people think',
 'think I',
 'I m',
 'm wear',
 'wear glossit',
 'glossit come',
 'come in',
 'in a',
 'a little',
 'little box',
 'box in',
 'in just',
 'just a',
 'a couple',
 'couple day',
 'day and',
 'and they',
 'they send',
 'send pen',
 'pen and',
 'and card',
 'card and',
 ...

## Naive Bayes

In [113]:
class NaiveBayes:
    """Naive Bayes classifier operating on n-gram tokens.
    """
    def __init__(self, classes):
        """
        Args:
            classes (np.array): labels of the dataset (duplicates allowed)
        """
        self.classes = np.unique(classes)
        self.nb_classes = len(self.classes)  # Number of distinct classes
    
    def get_classes_occ(self, Y):
        """Count the occurrences of each class.

        Args:
            Y (np.array): an array containing all the labels of the dataset
        
        Return:
            (dict): a dictionary mapping each class to its number of occurrences
        """
        self.classes_occ = dict()
        for y in Y:
            if y not in self.classes_occ:
                self.classes_occ[y] = 0
            self.classes_occ[y] += 1
        return self.classes_occ

    def compile(self, X, Y, n=1):
        """Build the per-class bag of n-grams used by the Naive Bayes algorithm.

        Args:
            X (np.array): the texts to process
            Y (np.array): the label of each text
            n (int, optional): n-gram size. Defaults to 1.
        """
        if len(X) != len(Y):
            raise ValueError("X and Y need to have the same length.")
        self.X = X  # Store the texts
        self.Y = Y  # Store the labels
        self.n = n
        self.BoW = dict()  # Bag of words initialization
        self.classes_vocab_len = dict()  # Total number of tokens in each class
        self.vocab = dict()  # Unique tokens across all classes
        for label in self.classes:  # Initialize the bag of words of each class
            self.BoW[label] = dict()
            self.classes_vocab_len[label] = 0
        for x, y in tqdm(zip(X, Y), total=len(X)):
            # Get the n-gram tokens of each sentence and count them in the
            # bag of words of the corresponding class.
            ngram_sentence = ngram(x, n=n)
            for token in ngram_sentence:
                if token not in self.BoW[y]:
                    self.BoW[y][token] = 0
                if token not in self.vocab:
                    self.vocab[token] = 1
                self.BoW[y][token] += 1
                self.classes_vocab_len[y] += 1
    
    def train(self):
        """Compute the log prior and the Laplace denominator of each class
        """
        self.classes_occ = dict()
        for y in self.Y:
            # Count the occurrences of each class
            if y not in self.classes_occ:
                self.classes_occ[y] = 0
            self.classes_occ[y] += 1
            
        self.classes_proba_log = dict()
        for y in self.classes_occ:
            # Prior probability of each class, stored as a log so that later
            # sums do not underflow the way products of small numbers would
            self.classes_proba_log[y] = np.log(float(self.classes_occ[y]) / float(len(self.Y)))
        
        self.denominators = dict()
        for y in self.classes:
            # Laplace-smoothed denominator: tokens in the class + vocabulary size
            self.denominators[y] = self.classes_vocab_len[y] + len(self.vocab)
    
    
    def predict(self, text):
        """Compute the score of each class for a given sentence

        Args:
            text (str): a preprocessed sentence to classify
        
        Return:
            (np.array): an array containing the score of each class for the given sentence,
            in the order of self.classes. The scores are unnormalized log probabilities.
        
        """
        likelihood_prob = np.zeros(self.classes.shape[0])  # Initialize the log score of each class at 0
        for i, y in enumerate(self.classes):
            for token in ngram(text, n=self.n):
                # Add the log probability of each token of the sentence.
                # A token that is not in the vocabulary is ignored.
                if token in self.vocab:
                    token_counts = 0
                    if token in self.BoW[y]:
                        token_counts = self.BoW[y][token]
                    token_counts += 1  # Laplace smoothing
                    token_prob = float(token_counts)/float(self.denominators[y])  # Smoothed probability of the token
                    likelihood_prob[i] += np.log(token_prob)  # Sum of the log probabilities of the tokens
        for i, y in enumerate(self.classes):
            likelihood_prob[i] += self.classes_proba_log[y]  # Add the log prior of each class
        return likelihood_prob           
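
To check the arithmetic of the class by hand, here is a minimal pure-Python sketch of the same scoring rule (log prior plus Laplace-smoothed log likelihoods, unknown tokens skipped). The three documents, the labels and the names `toy_docs` and `log_score` are hypothetical and are not taken from the dataset.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (tokens, label) pairs.
toy_docs = [
    ("soft baby blanket".split(), "baby"),
    ("baby bottle".split(), "baby"),
    ("dog chew toy".split(), "pet"),
]

counts = {}              # Per-class token counts (the role of BoW)
class_docs = Counter()   # Number of documents per class
for tokens, label in toy_docs:
    counts.setdefault(label, Counter()).update(tokens)
    class_docs[label] += 1

vocab = {tok for class_counts in counts.values() for tok in class_counts}

def log_score(tokens, label):
    """Log prior plus Laplace-smoothed log likelihood, skipping unknown tokens."""
    total = sum(counts[label].values())               # Tokens seen in this class
    score = math.log(class_docs[label] / len(toy_docs))
    for tok in tokens:
        if tok in vocab:
            score += math.log((counts[label][tok] + 1) / (total + len(vocab)))
    return score

query = "baby toy".split()
scores = {label: log_score(query, label) for label in counts}
best = max(scores, key=scores.get)
print(scores, best)
```

With a vocabulary of 7 tokens, the "baby" score is log(2/3 · 3/12 · 1/12) = log(1/72) and the "pet" score is log(1/3 · 1/10 · 2/10) = log(1/150), so "baby" wins even though "toy" was never seen in that class.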

In [132]:
nb = NaiveBayes(df.label.values)
nb.compile( train.description.values, 
            train.label.values, 
            n=1)

100%|██████████| 32000/32000 [00:19<00:00, 1603.37it/s]


In [133]:
nb.train()

In [134]:
print(test.description.values[0])
print(test.label.values[0])
print(nb.predict(test.description.values[0]))
print(nb.classes[nb.predict(test.description.values[0]).argmax()])

my daughter have acid reflux and be unable to lie flat in her baby bed since we bring she home from the hospital she sleep in her car seat for the first couple of week but I want something more comfortable for she as I know that she would have to sleep upright for a while I call this the cadillac of bouncy seat it be very comfortable and my daughter sleep well in it I would definitely recommend it
baby products
[-473.5694169  -525.77407658 -549.07425379 -506.39120286 -506.10959036
 -505.54038887]
baby products
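
The six numbers printed above are unnormalized log scores, one per class in the order of `nb.classes`. The sketch below shows why working in log space matters and one possible way to turn such scores into probabilities that sum to 1, using the log-sum-exp trick. The helper `normalize_log_scores` is hypothetical and not part of the notebook.

```python
import math

# Multiplying many small probabilities underflows to 0.0 in floating point...
naive_product = 1.0
for _ in range(100):
    naive_product *= 1e-4
# ...while the equivalent sum of logs stays perfectly representable.
log_sum = 100 * math.log(1e-4)

# Log scores copied from the output above.
log_scores = [-473.5694169, -525.77407658, -549.07425379,
              -506.39120286, -506.10959036, -505.54038887]

def normalize_log_scores(scores):
    """Turn unnormalized log scores into probabilities (log-sum-exp trick)."""
    top = max(scores)                              # Shift so the largest exponent is 0
    shifted = [math.exp(s - top) for s in scores]
    total = sum(shifted)
    return [value / total for value in shifted]

probs = normalize_log_scores(log_scores)
best_index = max(range(len(probs)), key=probs.__getitem__)
print(naive_product, log_sum, best_index)
```

The winning index is 0, the first entry of the score array, which matches the "baby products" prediction printed above.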


In [135]:
def test_model(model, test):
    success = 0
    for x_test, y_test in tqdm(zip(test.description.values, test.label.values), total=len(test.label)):
        if model.classes[model.predict(x_test).argmax()] == y_test:
            success += 1
    return (float(success) / len(test.label.values)) * 100.0

In [138]:
test_model(nb, test)

100%|██████████| 8000/8000 [00:36<00:00, 219.07it/s]


82.66250000000001

In [141]:
nb2 = NaiveBayes(df.label.values)
nb2.compile(    train.description.values, 
                train.label.values, 
                n=2)
nb2.train()
test_model(nb2, test)

100%|██████████| 32000/32000 [00:21<00:00, 1509.36it/s]
100%|██████████| 8000/8000 [00:34<00:00, 234.54it/s]


73.32499999999999

In [142]:
nb3 = NaiveBayes(df.label.values)
nb3.compile(    train.description.values, 
                train.label.values, 
                n=3)
nb3.train()
test_model(nb3, test)

100%|██████████| 32000/32000 [00:22<00:00, 1427.61it/s]
100%|██████████| 8000/8000 [00:35<00:00, 223.92it/s]


64.5625

In [144]:
nb4 = NaiveBayes(df.label.values)
nb4.compile(    train.description.values, 
                train.label.values, 
                n=4)
nb4.train()
test_model(nb4, test)

100%|██████████| 32000/32000 [00:23<00:00, 1342.87it/s]
100%|██████████| 8000/8000 [00:32<00:00, 248.18it/s]


55.35

In [145]:
nb5 = NaiveBayes(df.label.values)
nb5.compile(    train.description.values, 
                train.label.values, 
                n=5)
nb5.train()
test_model(nb5, test)

100%|██████████| 32000/32000 [00:25<00:00, 1278.06it/s]
100%|██████████| 8000/8000 [00:32<00:00, 242.60it/s]


44.3125