# Natural Language Processing @ AIMS
# Project: Naive Bayes for text classification (Sentiment Analysis)
**German Shâma Wache**


In this assignment, we explore sentiment analysis using the Naive Bayes classifier. Sentiment analysis is the task of determining the emotional tone of a piece of text. We work with the AfriSenti dataset, which provides sentiment-labelled text in 14 African languages. Our task will involve pre-processing the data, implementing the model, and evaluating performance. This assignment will give us hands-on experience with fundamental NLP techniques and allow us to experiment with different tokenization methods.

**Data**

We will use the AfriSenti dataset for sentiment analysis. It consists of tweets which have been annotated according to their emotional tone as either positive, negative, or neutral. For this task, we implement a binary classifier that predicts whether the tone of a tweet is positive or negative, so we filter out tweets labelled as neutral after loading the data. The dataset covers 14 African languages, and any one of them can be chosen for the analysis. The dataset can be found here: https://github.com/afrisenti-semeval/afrisent-semeval-2023.


# Installations, Imports and Downloads

In [None]:
import os
import warnings
import re
warnings.filterwarnings("ignore", category=UserWarning)

from collections import defaultdict, Counter

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from sklearn.metrics import classification_report

#FS = (8, 4)  # figure size
#RS = 124  # random seed

In [None]:
# Download dataset
PROJECT_DIR = os.getcwd() + '/afrisent-semeval-2023'
print('Current directory: ', PROJECT_DIR)
PROJECT_GITHUB_URL = 'https://github.com/afrisenti-semeval/afrisent-semeval-2023.git'

if not os.path.isdir(PROJECT_DIR):
  !git clone {PROJECT_GITHUB_URL}
else:
  %cd {PROJECT_DIR}
  !git pull {PROJECT_GITHUB_URL}

Current directory:  /content/afrisent-semeval-2023/afrisent-semeval-2023
/content/afrisent-semeval-2023/afrisent-semeval-2023
From https://github.com/afrisenti-semeval/afrisent-semeval-2023
 * branch            HEAD       -> FETCH_HEAD
Already up to date.


<a name="section1"></a>
# 1. Text processing

Sentiment analysis is the task of classifying the emotional tone of a piece of text. Sentiment analysis datasets resemble the following table: they consist of pieces of text which have been annotated as _positive_ or _negative_ (some datasets also allow _neutral_ as a label).


| Text                     | Label |
|-----------------------------------|--------------|
| I haven't heard anything. I'm really worried actually             |  negative     |
| About to go to bed. I am so glad the Tigers won tonight!                     | positive          |

The AfriSenti dataset consists of tweets which have been human-labelled according to their emotional tone as either positive, negative, or neutral.


* [Paper introducing the dataset: _AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages_, Muhammad et al., 2023.](https://aclanthology.org/2023.emnlp-main.862.pdf)
* [Github repository containing the full dataset](https://github.com/afrisenti-semeval/afrisent-semeval-2023)

### AfriSenti languages

| No. | Language                     | Code | Country        |
|-----|------------------------------|--------------|----------------|
| 1   | Algerian Arabic              | arq          | Algeria        |
| 2   | Amharic                      | amh          | Ethiopia       |
| 3   | Hausa                        | hau          | Nigeria        |
| 4   | Igbo                         | ibo          | Nigeria        |
| 5   | Kinyarwanda                  | kin          | Rwanda         |
| 6   | Moroccan Arabic/Darija       | ary          | Morocco        |
| 7   | Mozambique Portuguese        | por        | Mozambique     |
| 8   | Nigerian Pidgin              | pcm          | Nigeria        |
| 9   | Oromo                        | orm          | Ethiopia       |
| 10  | Swahili                      | swa          | Kenya/Tanzania |
| 11  | Tigrinya                     | tir          | Ethiopia       |
| 12  | Twi                          | twi          | Ghana          |
| 13  | Xitsonga                     | tso          | Mozambique     |
| 14  | Yoruba                       | yor          | Nigeria        |


The dataset covers 14 languages spoken across the African continent. For each language, the dataset stores annotated tweets for training (*train.tsv*), validation (*dev.tsv*), and testing (*test.tsv*). In NLP datasets and implementations, languages are often referred to via abbreviations known as language codes. Our first task is to edit the next code cell to enter the language code of the language we want to use going forward in this notebook. The table above lists the language codes of each of the 14 languages.

We choose to work with **Igbo**.


In [None]:
# Choose language
language = 'ibo'  # Can be one of ['arq', 'amh', 'hau', 'ibo', 'kin', 'ary', 'por', 'pcm', 'orm', 'swa', 'tir', 'twi', 'tso', 'yor']

<a name="section1_1"></a>
## 1.1. Data loading

Now we can load the train/dev/test datasets for our chosen language. Each dataset is stored as a .tsv file (tab-separated values) with two data columns (the tweet text and the sentiment label) separated by a tab character. Below we read these datasets into pandas DataFrames, a data structure for storing tabular data. We display a few rows of the training dataframe to show the data format.

In [None]:
# Load data
DATA_DIR = f'{PROJECT_DIR}/data/{language}'
print('Data directory: ', DATA_DIR)

train_df = pd.read_csv(f'{DATA_DIR}/train.tsv', sep='\t', names=['text', 'label'], header=0)
dev_df = pd.read_csv(f'{DATA_DIR}/dev.tsv', sep='\t', names=['text', 'label'], header=0)
test_df = pd.read_csv(f'{DATA_DIR}/test.tsv', sep='\t', names=['text', 'label'], header=0)

print('Train shape: ', train_df.shape)
print('Dev shape: ', dev_df.shape)
print('Test shape: ', test_df.shape)

# Display data
train_df.sample(n=10)

Data directory:  /content/afrisent-semeval-2023/afrisent-semeval-2023/data/ibo
Train shape:  (10192, 2)
Dev shape:  (1841, 2)
Test shape:  (3682, 2)


|      | text | label |
|------|------|-------|
| 8877 | @user @user @user @user @user @user @user @use... | positive |
| 6829 | @user @user @user @user @user @user @user @use... | neutral |
| 2342 | @user ana ekwugheri ka ndi egbuwara isi.. | negative |
| 7398 | Chukwu ekwena ka ngwele gba aji! No. https://t... | positive |
| 9231 | @user Nna ochie Ututu Oma 🙌🙌🙌🙌 | positive |
| 2700 | @user Gịnị ka i chọrọ? | neutral |
| 5031 | @user Hahahha Dokita oñu otu. Ñuruma nwa nna a... | neutral |
| 350 | @user @user Isi gi no?🤷🤣🤣🤣 | negative |
| 9907 | @user Site n'ike chukwu | positive |
| 2969 | Aka n'agbaji igwe https://t.co/1hXb148HvK | neutral |


<a name="section1_2"></a>
## 1.2. Data cleaning

Before we proceed, it's important to ensure that our data is clean and ready for processing. In this section, we perform some basic data cleaning steps to remove unwanted elements from the text and prepare our dataset for NLP modelling.

In [None]:
# Discard neutral examples
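# .copy() gives independent dataframes, so adding or overwriting columns later
# does not trigger pandas' SettingWithCopyWarning
train_df = train_df[train_df['label'] != 'neutral'].copy()
dev_df = dev_df[dev_df['label'] != 'neutral'].copy()
test_df = test_df[test_df['label'] != 'neutral'].copy()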

The extent of data cleaning and preprocessing depends on the quality of the raw dataset, the NLP task we are preparing the data for, and our personal preferences as NLP practitioners. The ``nltk`` library (Natural Language Toolkit) is a popular Python library for text processing and pre-processing. It supports tokenization, stemming, tagging, and parsing for several languages. We do not need it for this notebook, since we stick to rather basic preprocessing strategies:

* Replace all URLs with a special '[URL]' token.
* Replace all numbers with a special '[NUM]' token.
* Remove extra whitespace on either side of the text.

In [None]:
def clean(text):
    # Replace URLS with [URL]
    text = re.sub(r'http\S+', '[URL]', text)

    # Replace numbers with [NUM]
    text = re.sub(r'\d+', '[NUM]', text)

    # Remove leading and trailing whitespace
    text = text.strip()

    return text

train_df['text'] = train_df['text'].apply(clean)
dev_df['text'] = dev_df['text'].apply(clean)
test_df['text'] = test_df['text'].apply(clean)
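As a quick sanity check of the three cleaning rules, the sketch below applies them to a made-up tweet. The function `clean_sketch` and the sample string are illustrative assumptions; the function restates the same rules only so that the snippet runs on its own.

```python
import re

def clean_sketch(text):
    # Same three rules as the cleaning function in this section, restated for a standalone run
    text = re.sub(r'http\S+', '[URL]', text)   # URLs -> [URL]
    text = re.sub(r'\d+', '[NUM]', text)       # runs of digits -> [NUM]
    return text.strip()                        # drop leading/trailing whitespace

# Made-up example, not taken from AfriSenti
sample = '  Nna, 2 goals in 90 minutes! https://t.co/abc123  '
cleaned = clean_sketch(sample)
print(cleaned)   # Nna, [NUM] goals in [NUM] minutes! [URL]
```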

<a name="section1_3"></a>
## 1.3. Word-based tokenization

One of the fundamental steps in text processing for NLP is constructing a vocabulary from our dataset. A vocabulary is a set of unique words or tokens present in the text corpus. In this section, we will create a vocabulary from our training dataset and explore its characteristics.

We refer to vocabulary items as **types** and to particular occurrences of these types in the dataset as **tokens**. For now we simply tokenise the text by splitting it on whitespace.

For NLP purposes, we want to map each type in our vocabulary to an **index**, a unique number identifying that type. Later we can use this index to, for example, look up vector representations for our words using a lookup table. To achieve this, our vocabulary will be represented with three variables:
* index2type: list of unique types in the vocabulary e.g. ['word1', 'word2', 'word3', ...]
* type2index: dictionary mapping types to their index in the index2type vocabulary e.g. {'word1': 0, 'word2': 1, 'word3': 2, ...}
* type2count: dictionary mapping types to the number of token occurrences of that type in the training data e.g. {'word1': 1012, 'word2': 510, 'word3': 45, ...}
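To make the three variables concrete, here is a minimal sketch on a two-sentence toy corpus. The corpus and the `toy_` names are made up for illustration and are not part of the AfriSenti data.

```python
from collections import Counter

toy_corpus = ['nna biko', 'biko nna biko']   # made-up sentences

# Tokens are occurrences; types are the distinct strings among them
tokens = [tok for sentence in toy_corpus for tok in sentence.split()]
toy_type2count = Counter(tokens)

# Order types by descending count, then give each one an integer index
toy_index2type = [t for t, _ in toy_type2count.most_common()]
toy_type2index = {t: i for i, t in enumerate(toy_index2type)}

print(len(tokens), 'tokens,', len(toy_index2type), 'types')   # 5 tokens, 2 types
print(toy_index2type)    # ['biko', 'nna']
print(toy_type2index)    # {'biko': 0, 'nna': 1}
```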


In [None]:
# Store training data text as list of tweets
train_corpus = train_df['text'].tolist()
train_corpus[0:5]

['Nna Ike Gwuru ooo. 😂 [URL]',
 '@user Chineke nna kezi mgbe ole???',
 'Lol. Isi adirokwanu gi nma.. 😐😒😒😒 [URL]',
 '@user haha. Fulani herdsmen. akpa amu gi retweet. Rie nsi 😝',
 'Nna ghetto di gi na aru biko!!! [URL]']

Next we have to decide how we are going to tokenize the tweets. Libraries like ``nltk`` provide regex-based tokenizers that are handcrafted for specific languages.  For now we will use the simple strategy of splitting text on white spaces - so our tokens will be the units of text divided by white spaces.

In [None]:
def whitespace_tokenize(sentences):
    return [sentence.split() for sentence in sentences]

tokenized_train_corpus = whitespace_tokenize(train_corpus)
tokenized_train_corpus[0:5]

[['Nna', 'Ike', 'Gwuru', 'ooo.', '😂', '[URL]'],
 ['@user', 'Chineke', 'nna', 'kezi', 'mgbe', 'ole???'],
 ['Lol.', 'Isi', 'adirokwanu', 'gi', 'nma..', '😐😒😒😒', '[URL]'],
 ['@user',
  'haha.',
  'Fulani',
  'herdsmen.',
  'akpa',
  'amu',
  'gi',
  'retweet.',
  'Rie',
  'nsi',
  '😝'],
 ['Nna', 'ghetto', 'di', 'gi', 'na', 'aru', 'biko!!!', '[URL]']]

In [None]:
# Count number of tokens in corpus
def count_tokens(sentences):
    """
    Count number of tokens in corpus

    param: sentences: list of lists of tokens e.g. [['This', 'is', 'a', 'sentence'], ['This', 'is', 'another', 'sentence'], ...]
    return:
        count: number of tokens in corpus
    """
    # Every sentence contributes as many tokens as it has list elements
    return sum(len(sentence) for sentence in sentences)

In [None]:
num_tokens = count_tokens(tokenized_train_corpus)
print('Number of tokens in corpus: ', num_tokens)

Number of tokens in corpus:  64891


In [None]:
# Collect type counts in corpus
def create_type_counts(sentences):
    """
    Count how often each type occurs in the corpus

    param: sentences: list of lists of tokens e.g. [['This', 'is', 'a', 'sentence'], ['This', 'is', 'another', 'sentence'], ...]
    return:
        type2count: dictionary of type counts in corpus e.g. {'This': 2, 'sentence': 2, ...}
    """
    type2count = defaultdict(int)
    for sentence in sentences:
        for token in sentence:
            type2count[token] += 1
    return type2count

In [None]:
type2count = create_type_counts(tokenized_train_corpus)
print('Number of types in corpus: ', len(type2count))

# Sort types by counts
type2count = dict(sorted(type2count.items(), key=lambda x: x[1], reverse=True))

# Print first few types and counts
for i, (type_, count) in enumerate(type2count.items()):
    print(f'{type_}: {count}')
    if i == 5:
        break

Number of types in corpus:  15516
@user: 8653
[URL]: 1546
na: 1397
gi: 1123
di: 543
Chukwu: 509


In [None]:
# Create vocabulary
def create_vocabulary(type2count, min_count):
    """
    This function creates an indexed vocabulary from vocabulary counts and returns it as a list and a dictionary.

    param:
        type2count: dictionary of type counts in corpus e.g. {'This': 2, 'sentence': 2, ...}
        min_count: minimum count of a word to be included in the vocabulary
    return:
        index2type: list of words in the vocabulary e.g. ['word1', 'word2', 'word3', ...]
        type2index: dictionary mapping words to their index in the index2type vocabulary e.g. {'word1': 0, 'word2': 1, 'word3': 2, ...}
    """
    # Keep only types that occur at least min_count times, in the order of type2count
    index2type = [word for word, count in type2count.items() if count >= min_count]
    # Plain dict: looking up an unseen type raises KeyError instead of silently returning index 0
    type2index = {word: index for index, word in enumerate(index2type)}
    return index2type, type2index

In [None]:
index2type, type2index = create_vocabulary(type2count, min_count=1)

# It's good practice to add a special token for unknown words and padding (to make all sentences in training batches the same length)
type2index['<UNK>'] = len(index2type)
index2type.append('<UNK>')
type2index['<PAD>'] = len(index2type)
index2type.append('<PAD>')

print('Vocabulary size: ', len(index2type))
print('First 10 words in the vocabulary: ', index2type[0:10])

Vocabulary size:  15518
First 10 words in the vocabulary:  ['@user', '[URL]', 'na', 'gi', 'di', 'Chukwu', 'onye', 'Onye', 'ka', 'm']


<a name="section1_4"></a>
## 1.4. BPE tokenisation



So far we have tokenised sentences based on the white spaces separating tokens in raw text. In most modern NLP systems, sentences are tokenised into subword tokens instead of words. This approach helps in handling out-of-vocabulary words and improves the model's ability to capture morphological (subword) information.

Byte Pair Encoding (BPE) is a popular subword tokenisation algorithm in NLP. In this section, we will implement the BPE algorithm and apply it to our dataset.

BPE and related algorithms have two parts:
* A type learner that takes a raw training corpus and induces a vocabulary (a set of types) of prespecified size (e.g. 1000 subwords).
* A token segmenter that takes a raw test sentence and tokenises it according to that subword vocabulary.

## BPE type learner (train on training set)

1. Start with a vocabulary consisting of all individual characters e.g. {A, B, C, D,…, a, b, c, d...}.
2. Repeat until the prespecified vocabulary size has been reached:
    * Choose the two symbols that are most frequently adjacent in the training corpus (say 'A', 'B').
    * Merge these symbols and add the newly merged symbol 'AB' to the vocabulary.
    * Replace every adjacent 'A' 'B' in the corpus with 'AB'.
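The learner steps above can be sketched on a toy word list. This is a minimal illustration, not the notebook's implementation: the function name `learn_bpe_sketch` and the four input words are assumptions chosen so that each step has a single most frequent pair.

```python
from collections import Counter

def learn_bpe_sketch(words, num_merges):
    """Learn up to num_merges merges from a list of words (toy version)."""
    # Each word is a tuple of symbols; start from single characters
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for left, right in zip(symbols, symbols[1:]):
                pair_counts[(left, right)] += freq
        if not pair_counts:
            break
        (left, right), _count = pair_counts.most_common(1)[0]
        merges.append((left, right))
        # Rewrite the corpus with the chosen pair merged into one symbol
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                    merged.append(left + right)
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

toy_merges = learn_bpe_sketch(['nna', 'nna', 'nne', 'onye'], num_merges=2)
print(toy_merges)   # [('n', 'n'), ('nn', 'a')]
```

In the first round the pair ('n', 'n') occurs three times and wins; in the second round the new symbol 'nn' is most often followed by 'a'.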


## BPE token segmenter (apply to train/dev/test set)

Segmenter algorithm: Run each merge learned from the training data greedily, in the order they were learned (test frequencies don't play a role).

So merge every "A" "B" to "AB", then merge "AB" "C" to "ABC", etc.
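A minimal sketch of the segmenter, assuming a merge list has already been learned. The function name `apply_merges_sketch` and the merge list `example_merges` are hypothetical and serve only to show that merges are replayed in learning order.

```python
def apply_merges_sketch(word, merges):
    """Segment one word by replaying merges in the order they were learned."""
    symbols = list(word)
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge list, as if learned from some training corpus
example_merges = [('n', 'n'), ('nn', 'a')]
print(apply_merges_sketch('nna', example_merges))   # ['nna']
print(apply_merges_sketch('nne', example_merges))   # ['nn', 'e']
print(apply_merges_sketch('ana', example_merges))   # ['a', 'n', 'a']
```

A word never seen in training is still segmented, falling back to smaller units wherever no learned merge applies.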

## Other details

* Usually basic tokenization is performed first (space-based tokenization and separating punctuation). BPE is then applied to the initial tokens.
* To enable the algorithm to learn to represent the boundary between tokens, commonly a special end-of-word symbol '_' is added before spaces in the training corpus (or alternatively between the space and the next word).
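As an illustration of the end-of-word symbol, the sketch below appends '_' to every whitespace-separated word before splitting it into characters. The helper name `to_symbols_with_boundary` is an assumption; it only shows the idea and is not used elsewhere in this notebook.

```python
def to_symbols_with_boundary(text, end_of_word='_'):
    """Split text on whitespace and append an end-of-word marker to every word."""
    return [list(word) + [end_of_word] for word in text.split()]

boundary_symbols = to_symbols_with_boundary('nna biko')
print(boundary_symbols)
# [['n', 'n', 'a', '_'], ['b', 'i', 'k', 'o', '_']]
```

With the marker in place, a learned merge such as ('a', '_') would stand for "a at the end of a word", which is distinct from a word-internal 'a'.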

## Pseudocode

```
function BYTE-PAIR ENCODING(strings C, number of merges k) returns vocab V
    V <- all unique characters in C     # initial set of tokens is characters
    for i = 1 to k do                   # merge tokens k times / until vocab size reached
        t_L, t_R <- Most frequent pair of adjacent tokens in C
        t_new <- t_L + t_R              # make new token by concatenating
        V <- V + t_new                  # update the vocabulary
        Replace each occurrence of t_L, t_R in C with t_new   # and update the corpus
    return V
```




In [None]:
# Implement BPE algorithm

class BPETokenizer():

    def __init__(self, sentences, vocab_size):

        """
        Initialize the BPE tokenizer.

        Args:
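            sentences (list[list[str]]): list of lists of tokens e.g. [['This', 'is', 'a', 'sentence'], ['This', 'is', 'another', 'sentence'], ...]
            vocab_size (int): Maximum number of merges to learn. Each merge adds one subword type on top of the
                initial characters, so the resulting vocabulary is typically larger than this number.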
        """

        self.sentences = [[' '.join(word) for word in sentence] for sentence in sentences]
        self.vocab_size = vocab_size
        self.word_freqs = defaultdict(int)
        self.merges = {}


    def train(self):

        """
        Train the BPE tokenizer by iteratively merging the most frequent pairs of symbols.

        Returns:
            dict: A dictionary of merges in the format {(a, b): 'ab'}, where 'a' and 'b' are symbols merged into 'ab'.
        """

        # Prepare initial vocabulary
        for sentence in self.sentences:
            for word in sentence:
                self.word_freqs[word] += 1

        vocab = Counter(self.word_freqs)

        while len(self.merges) < self.vocab_size:
            pair_freqs = self.compute_pair_freqs(vocab)
            if not pair_freqs:
                break
            best_pair = max(pair_freqs, key=pair_freqs.get)
            if pair_freqs[best_pair] < 1:
                break
            self.merges[best_pair] = best_pair[0] + best_pair[1]
            vocab = self.merge_pair(best_pair[0], best_pair[1], vocab)

        return self.merges

    def compute_pair_freqs(self, vocab):

        """
        Compute the frequency of each pair of symbols in the corpus.

        Returns:
            dict: A dictionary of pairs and their frequencies in the format {(a, b): frequency}.
        """

        pair_freqs = defaultdict(int)
        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols)-1):
                pair_freqs[(symbols[i], symbols[i+1])] += freq
        return pair_freqs


    def merge_pair(self, a, b, vocab):

        """
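        Merge the given pair of symbols in all words where they appear adjacent.

        Args:
            a (str): The first symbol in the pair.
            b (str): The second symbol in the pair.
            vocab (dict): Words (as space-separated symbols) mapped to their frequencies.

        Returns:
            Counter: The updated vocabulary after merging.
        """

        merge = a + b
        # Match the pair only at symbol boundaries, so that e.g. the pair ('a', 'b')
        # does not fire inside the two symbols 'ca' 'b'
        pattern = re.compile(r'(?<!\S)' + re.escape(a + ' ' + b) + r'(?!\S)')
        new_vocab = Counter()
        for word, freq in vocab.items():
            new_vocab[pattern.sub(lambda match: merge, word)] += freq
        return new_vocab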

    def tokenize(self, text):

        """
        Tokenize a given text using the trained BPE tokenizer.

        Args:
            text (str): The text to be tokenized.

        Returns:
            list[str]: A list of tokens obtained after applying BPE tokenization.
        """

        pre_tokenized_text = text.split()
        splits_text = [[l for l in word] for word in pre_tokenized_text]

        for pair, merge in self.merges.items():
            for idx, split in enumerate(splits_text):
                i = 0
                while i < len(split) - 1:
                    if split[i] == pair[0] and split[i + 1] == pair[1]:
                        split = split[:i] + [merge] + split[i + 2 :]
                    else:
                        i += 1
                splits_text[idx] = split
        # Flatten the per-word splits into a single list of tokens
        return [token for split in splits_text for token in split]


Now let's train a BPE tokeniser on our AfriSenti training corpus, apply it to our corpus, and see how our vocabulary changes after subword tokenisation.

In [None]:
# Train BPE
bpe = BPETokenizer(tokenized_train_corpus, vocab_size=1000)
merges = bpe.train()
print('Merges: ', merges)

# Tokenize text
text = 'This is a test sentence.'
tokens = bpe.tokenize(text)
print('BPE tokens: ', tokens)

# Apply to our dataset
train_df['bpe_text'] = train_df['text'].apply(lambda x: ' '.join(bpe.tokenize(x)))
dev_df['bpe_text'] = dev_df['text'].apply(lambda x: ' '.join(bpe.tokenize(x)))
test_df['bpe_text'] = test_df['text'].apply(lambda x: ' '.join(bpe.tokenize(x)))

train_df.head()
len(merges)

Merges:  {('e', 'r'): 'er', ('u', 's'): 'us', ('us', 'er'): 'user', ('@', 'user'): '@user', ('m', 'a'): 'ma', ('n', 'a'): 'na', ('n', 'y'): 'ny', ('.', '.'): '..', ('k', 'w'): 'kw', ('d', 'i'): 'di', ('k', 'e'): 'ke', ('g', 'b'): 'gb', ('g', 'i'): 'gi', ('U', 'R'): 'UR', ('[', 'UR'): '[UR', ('[UR', 'L'): '[URL', ('[URL', ']'): '[URL]', ('c', 'h'): 'ch', ('k', 'a'): 'ka', ('w', 'a'): 'wa', ('ny', 'e'): 'nye', ('n', 'e'): 'ne', ('kw', 'u'): 'kwu', ('z', 'i'): 'zi', ('C', 'h'): 'Ch', ('m', 'e'): 'me', ('r', 'a'): 'ra', ('g', 'o'): 'go', ('b', 'u'): 'bu', ('b', 'i'): 'bi', ('r', 'i'): 'ri', ('n', 'u'): 'nu', ('s', 'i'): 'si', ('o', 'o'): 'oo', ('g', 'a'): 'ga', ('g', 'w'): 'gw', ('m', 'ma'): 'mma', ('gb', 'o'): 'gbo', ('l', 'e'): 'le', ('u', 'kwu'): 'ukwu', ('😂', '😂'): '😂😂', ('a', 'l'): 'al', ('h', 'a'): 'ha', ('h', 'e'): 'he', ('b', 'e'): 'be', ('..', '.'): '...', ('n', 'd'): 'nd', ('ny', 'i'): 'nyi', ('n', 'w'): 'nw', ('o', 'r'): 'or', ('y', 'a'): 'ya', ('t', 'a'): 'ta', ('h', 'i'): 'hi'

1000

We now create a vocabulary of BPE tokens, based on our tokenised corpus. The ``vocab_size`` parameter of our BPE training algorithm (used in the implementation above as the number of merges to learn) lets us control the vocabulary size, which enables much smaller vocabularies than word-based tokenisation.

In [None]:
bpe_corpus = train_df['bpe_text'].tolist()
tokenized_bpe_corpus = whitespace_tokenize(bpe_corpus)

# Count number of BPE tokens in corpus
num_tokens = count_tokens(tokenized_bpe_corpus)
print('Number of BPE tokens in corpus: ', num_tokens)

# Collect type counts in BPE corpus
bpe_type2count = create_type_counts(tokenized_bpe_corpus)
print('Number of BPE types in corpus: ', len(bpe_type2count))

# Sort types by counts
bpe_type2count = dict(sorted(bpe_type2count.items(), key=lambda x: x[1], reverse=True))

# Print first few types and counts
for i, (type_, count) in enumerate(bpe_type2count.items()):
    print(f'{type_}: {count}')
    if i == 5:
        break

# Create a vocabulary for BPE tokens
bpe_index2type, bpe_type2index = create_vocabulary(bpe_type2count, min_count=2)
print('Vocabulary size: ', len(bpe_index2type))
print('First 10 BPE tokens in the vocabulary: ', bpe_index2type[0:10])

Number of BPE tokens in corpus:  126551
Number of BPE types in corpus:  1310
@user: 8741
a: 4056
e: 2949
n: 2100
o: 1987
i: 1941
Vocabulary size:  1192
First 10 BPE tokens in the vocabulary:  ['@user', 'a', 'e', 'n', 'o', 'i', 'na', 'm', '.', 'u']


<a name="section2"></a>
# 2. Feature extraction (Count-based features)

We represent each token of a given vocabulary by a pair of integers: its frequency in the set of tweets labelled positive and its frequency in the set of tweets labelled negative.

To help us train our Naive Bayes model, we need a dictionary whose keys are (word, label) tuples and whose values are the corresponding frequencies. The labels we use here are 'positive' and 'negative'.

For example: given the tokenized tweets `[["i", "am", "rather", "excited"], ["you", "are", "rather", "happy"]]`, both labelled 'positive', the function returns a dictionary that contains, among others, the following key-value pairs:

{
    ("rather", 'positive'): 2,
    ("happy", 'positive') : 1,
    ("excited", 'positive') : 1
}

We therefore create a function `count_tweets` that takes a list of cleaned and tokenized tweets and their labels as input and returns such a dictionary.

In [None]:
def count_tweets(tweets, ys, result=None):
    '''
    Input:
        tweets: a list of tokenized tweets (each tweet is a list of tokens)
        ys: a list corresponding to the sentiment of each tweet (either 'positive' or 'negative')
        result: an optional dictionary of existing counts to update; a new one is created if omitted
    Output:
        result: a dictionary mapping each (word, label) pair to its frequency
    '''
    # Create the dictionary here rather than as a default argument,
    # so that counts are not shared between separate calls
    if result is None:
        result = {}

    for y, tweet in zip(ys, tweets):
        for word in tweet:
            # define the key, which is the word and label tuple
            pair = (word, y)
            result[pair] = result.get(pair, 0) + 1

    return result
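The counting step can be checked on a tiny hand-made example. The sketch below restates the same idea with a `Counter` so that it runs on its own; the tweets, labels and the name `toy_freqs` are made up for illustration.

```python
from collections import Counter

toy_tweets = [['i', 'am', 'rather', 'excited'],
              ['you', 'are', 'rather', 'happy'],
              ['rather', 'sad']]
toy_labels = ['positive', 'positive', 'negative']

# One count per (word, label) occurrence, as in count_tweets
toy_freqs = Counter(
    (word, label)
    for tweet, label in zip(toy_tweets, toy_labels)
    for word in tweet
)

print(toy_freqs[('rather', 'positive')])   # 2
print(toy_freqs[('rather', 'negative')])   # 1
print(toy_freqs[('happy', 'negative')])    # 0 (pair never seen)
```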

<a name="section3"></a>
# 3. Training the Model using Naive Bayes

Naive Bayes is a simple probabilistic algorithm that can be used for sentiment analysis. It is fast to train and fast at prediction time.

#### So how do we train a Naive Bayes classifier?
- The first step is to identify the classes that we have (here, positive and negative).
- We then estimate a probability for each class: $P(D_{pos})$ is the probability that a document is positive and $P(D_{neg})$ is the probability that it is negative.

We use the following formulas:

$$P(D_{pos}) = \frac{D_{pos}}{D}\tag{1}$$

$$P(D_{neg}) = \frac{D_{neg}}{D}\tag{2}$$

Where $D$ is the total number of documents, or tweets in this case, $D_{pos}$ is the total number of positive tweets and $D_{neg}$ is the total number of negative tweets.


#### Prior and Logprior

The prior probability represents the underlying probability in the target population that a tweet is positive versus negative.  In other words, if we had no specific information and blindly picked a tweet out of the population set, what is the probability that it will be positive versus that it will be negative? That is the "prior".

In this binary setting we express the prior as the ratio of the class probabilities $\frac{P(D_{pos})}{P(D_{neg})}$.
Taking the log of this ratio rescales it, and we call the result the logprior:

$$\text{logprior} = \log \left( \frac{P(D_{pos})}{P(D_{neg})} \right) = \log \left( \frac{D_{pos}}{D_{neg}} \right)$$

Note that $\log(\frac{A}{B})$ is the same as $\log(A) - \log(B)$, so the logprior can also be calculated as the difference between two logs:

$$\text{logprior} = \log (P(D_{pos})) - \log (P(D_{neg})) = \log (D_{pos}) - \log (D_{neg})\tag{3}$$
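As a small worked example of equations (1) to (3), the sketch below computes the class probabilities and the logprior for a hypothetical list of five labels (not the AfriSenti labels).

```python
import math

# Hypothetical label list: three positive tweets, two negative tweets
toy_labels = ['positive', 'positive', 'positive', 'negative', 'negative']

D = len(toy_labels)
D_pos = toy_labels.count('positive')
D_neg = toy_labels.count('negative')

p_pos = D_pos / D   # equation (1)
p_neg = D_neg / D   # equation (2)
toy_logprior = math.log(D_pos) - math.log(D_neg)   # equation (3)

print(p_pos, p_neg)    # 0.6 0.4
print(toy_logprior)    # log(3/2), roughly 0.405
```

The logprior is positive here because positive tweets outnumber negative ones; with balanced classes it would be zero.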


#### Positive and Negative Probability of a Word
To compute the positive probability and the negative probability for a specific word in the vocabulary, we'll use the following inputs:

- $freq_{pos}$ and $freq_{neg}$ are the frequencies of that specific word in the positive or negative class. In other words, the positive frequency of a word is the number of times the word is counted with the label of 'positive'.
- $N_{pos}$ and $N_{neg}$ are the total numbers of word tokens in all positive and all negative tweets, respectively.
- $V$ is the number of unique words in the entire set of documents, for all classes, whether positive or negative.

We'll use these to compute the positive and negative probability for a specific word using this formula:

$$ P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}\tag{4} $$
$$ P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}\tag{5} $$

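Notice that we add 1 in the numerator and $V$ in the denominator for additive (Laplace) smoothing, so that a word that never occurs in one class still receives a non-zero probability for that class.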

#### Log likelihood
To compute the loglikelihood of that very same word, we can implement the following equations:

$$\text{loglikelihood} = \log \left(\frac{P(W_{pos})}{P(W_{neg})} \right)\tag{6}$$
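To see why the smoothing matters, here is a minimal sketch of equations (4) to (6) for a single word. All counts are hypothetical values chosen for illustration.

```python
import math

# Hypothetical counts for one word in a toy corpus
freq_pos, freq_neg = 3, 0      # the word never occurs in a negative tweet
N_pos, N_neg = 20, 30          # total tokens in positive / negative tweets
V = 10                         # number of distinct types

p_w_pos = (freq_pos + 1) / (N_pos + V)            # equation (4): 4/30
p_w_neg = (freq_neg + 1) / (N_neg + V)            # equation (5): 1/40
toy_loglikelihood = math.log(p_w_pos / p_w_neg)   # equation (6)

print(p_w_pos, p_w_neg, toy_loglikelihood)
```

Without smoothing, the negative probability would be 0/30 = 0 and the log ratio would be undefined; with smoothing the word simply receives a clearly positive score, $\log(16/3) \approx 1.67$.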

##### We will need to create a `freqs` dictionary
- Using the `count_tweets` function, we can compute a dictionary called `freqs` that contains all the frequencies.
- In this `freqs` dictionary, the key is the tuple (word, label).
- The value is the number of times that pair has appeared.

We will use this dictionary in several parts of this notebook.



Note that the `freqs` dictionary differs depending on whether we use **word-based tokenization** or **BPE tokenization**. For now, we define the `freqs` dictionary for the **word-based tokenization** case.

In [None]:
# Word-based tokenized training tweets
train_x = whitespace_tokenize(train_df['text'].apply(clean).tolist())

train_y = train_df['label'].tolist()

# Build the freqs dictionary for later uses
freqs = count_tweets(train_x, train_y, {})

In [None]:
freqs

{('Nna', 'negative'): 64,
 ('Ike', 'negative'): 127,
 ('Gwuru', 'negative'): 7,
 ('ooo.', 'negative'): 2,
 ('😂', 'negative'): 138,
 ('[URL]', 'negative'): 716,
 ('@user', 'negative'): 4131,
 ('Chineke', 'negative'): 41,
 ('nna', 'negative'): 27,
 ('kezi', 'negative'): 1,
 ('mgbe', 'negative'): 28,
 ('ole???', 'negative'): 1,
 ('Lol.', 'negative'): 12,
 ('Isi', 'negative'): 36,
 ('adirokwanu', 'negative'): 1,
 ('gi', 'negative'): 486,
 ('nma..', 'negative'): 1,
 ('😐😒😒😒', 'negative'): 1,
 ('haha.', 'negative'): 1,
 ('Fulani', 'negative'): 1,
 ('herdsmen.', 'negative'): 1,
 ('akpa', 'negative'): 4,
 ('amu', 'negative'): 15,
 ('retweet.', 'negative'): 1,
 ('Rie', 'negative'): 5,
 ('nsi', 'negative'): 22,
 ('😝', 'negative'): 3,
 ('ghetto', 'negative'): 1,
 ('di', 'negative'): 172,
 ('na', 'negative'): 554,
 ('aru', 'negative'): 10,
 ('biko!!!', 'negative'): 1,
 ('Ezigbo', 'negative'): 45,
 ('onye', 'negative'): 228,
 ('iberibe', 'negative'): 27,
 ('Thief!!!!!', 'negative'): 1,
 ('Ole!!!', '

In [None]:
def train_naive_bayes(freqs, train_x, train_y):
    '''
    Input:
        freqs: dictionary from (word, label) to how often the word appears
        train_x: a list of tokenized tweets (not used directly; the counts come from freqs)
        train_y: a list of labels corresponding to the tweets ('negative','positive')
    Output:
        logprior: the log prior. (equation 3 above)
        loglikelihood: the log likelihood of each word under the Naive Bayes model. (equation 6 above)
    '''
    loglikelihood = {}

    # calculate V, the number of unique words in the vocabulary
    # (dict.fromkeys removes duplicates while keeping first-seen order)
    vocab = list(dict.fromkeys(word for word, label in freqs))
    V = len(vocab)

    # calculate N_pos and N_neg, the total token counts per class
    N_pos = N_neg = 0
    for (word, label), count in freqs.items():
        if label == 'positive':
            # add the count for this (word, label) pair to the positive total
            N_pos += count
        else:
            # otherwise the label is negative
            N_neg += count

    # Calculate D_pos and D_neg, the number of positive and negative documents
    D_pos = sum(1 for el in train_y if el == 'positive')
    D_neg = sum(1 for el in train_y if el == 'negative')

    # Calculate logprior
    logprior = np.log(D_pos/D_neg)

    # For each word in the vocabulary...
    for word in vocab:
        # get the positive and negative frequency of the word
        freq_pos = freqs.get((word,'positive'), 0)
        freq_neg = freqs.get((word,'negative'), 0)

        # calculate the smoothed probability of the word in each class (equations 4 and 5)
        p_w_pos = (freq_pos + 1)/(N_pos + V)
        p_w_neg = (freq_neg + 1)/(N_neg + V)

        # calculate the log likelihood of the word
        loglikelihood[word] = np.log(p_w_pos/p_w_neg)

    return logprior, loglikelihood

In [None]:
#Get the logprior and the loglikelihood
logprior, loglikelihood = train_naive_bayes(freqs, train_x, train_y)
print(logprior)
print(loglikelihood)

0.17071601067364667
{'Nna': -0.46627148746775177, 'Ike': -1.6989113235708597, 'Gwuru': -2.360309805816225, 'ooo.': 0.6999609888753373, '😂': -1.2263581507028063, '[URL]': -0.13331430988056134, '@user': -0.190454324898075, 'Chineke': 0.9786743913443576, 'nna': 0.39426041092087805, 'kezi': -0.9740154446963342, 'mgbe': -0.12180356950670167, 'ole???': -0.2808682641363889, 'Lol.': -2.1526704410379804, 'Isi': -0.4577989722954671, 'adirokwanu': -0.9740154446963342, 'gi': -0.010794103874188908, 'nma..': -0.2808682641363889, '😐😒😒😒': -0.9740154446963342, 'haha.': -0.9740154446963342, 'Fulani': -0.9740154446963342, 'herdsmen.': -0.9740154446963342, 'akpa': -0.5040118154505986, 'amu': -1.9548446977080605, 'retweet.': -0.9740154446963342, 'Rie': -1.3794805528044987, 'nsi': -3.4163624800655388, '😝': -0.9740154446963342, 'ghetto': -0.2808682641363889, 'di': 0.4847339956389785, 'na': 0.13831611671313368, 'aru': 0.41227891642355624, 'biko!!!': -0.9740154446963342, 'Ezigbo': 0.03362506576604859, 'onye': 

<a name="section4"></a>
# 4. Prediction

Now that we have the `logprior` and `loglikelihood`, we can test the Naive Bayes classifier by making predictions on some tweets!

We will implement the `naive_bayes_predict` function to make predictions on tweets.
* The function takes in the `tweet` (as a list of tokens), `logprior`, and `loglikelihood`.
* It returns the predicted class of the tweet: 'positive' or 'negative'.
* For each tweet, sum the log-likelihoods of the words in the tweet.
* Add the logprior to this sum to get the score $p$ that determines the predicted sentiment of that tweet.

$$ p = \text{logprior} + \sum_{i=1}^{N} \text{loglikelihood}_i $$

Remember that if $p > 0$, the tweet is classified as 'positive'; otherwise ($p \le 0$) it is classified as 'negative'.
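
As a quick illustration of this decision rule, the sketch below scores two toy tweets. The parameters `toy_logprior` and `toy_loglikelihood` and the helper `score` are hypothetical and chosen only for illustration; they are not the values learned from the training data.

```python
# Hypothetical parameters, chosen only for illustration
toy_logprior = 0.1
toy_loglikelihood = {"good": 1.2, "bad": -1.5, "day": 0.05}

def score(tokens, logprior, loglikelihood):
    """Add the log prior to the log-likelihoods of the known tokens."""
    # Tokens missing from the dictionary contribute nothing to the score.
    return logprior + sum(loglikelihood.get(tok, 0.0) for tok in tokens)

p_first = score(["good", "day"], toy_logprior, toy_loglikelihood)
p_second = score(["bad", "day", "unseen"], toy_logprior, toy_loglikelihood)

print(p_first, "positive" if p_first > 0 else "negative")
print(p_second, "positive" if p_second > 0 else "negative")
```

The first tweet ends up with a score above zero and the second with a score below zero, so they would be labelled 'positive' and 'negative' respectively.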

In [None]:
def naive_bayes_predict(tweet, logprior, loglikelihood):
    '''
    Input:
        tweet: a tokenized tweet (list of tokens)
        logprior: a number
        loglikelihood: a dictionary of words mapping to numbers
    Output:
        y_hat: the predicted class of tweet (either 'positive' or 'negative')

    '''

    # initialize the score with the logprior
    p = logprior

    for word in tweet:
        # check if the word exists in the loglikelihood dictionary
        if word in loglikelihood:
            # add the log likelihood of that word to the score
            p += loglikelihood[word]

    # a score above zero means 'positive'; zero or below means 'negative'
    if p > 0:
        y_hat = 'positive'
    else:
        y_hat = 'negative'

    return y_hat

<a name="section5"></a>
# 5. Evaluation of our model on the development set and reporting its performance using sklearn’s classification_report method.

In this section, we use word-based tokenization.


In [None]:
# Word-based tokenized development tweets
dev_x = whitespace_tokenize(dev_df['text'].apply(clean).tolist())

# True classes of each tweet in the development set
dev_y = dev_df['label'].tolist()

# Predictions (classification) on the development tweets
y_hat = []
for tweet in dev_x:
  y_hat_tweet = naive_bayes_predict(tweet, logprior, loglikelihood)
  y_hat.append(y_hat_tweet)

# Visualising the outputs of our system on the development data
ddf = pd.DataFrame({'Tweet of the development set': dev_df['text'].tolist(), 'True class': dev_y, 'Predicted class': y_hat})
ddf

| | Tweet of the development set | True class | Predicted class |
|---|---|---|---|
| 0 | Uche Chukwu ga emé... regardless . #NigeriaDec... | positive | positive |
| 1 | @user Anuri ubochi ncheta omumu gi. Mazi Zebru... | positive | positive |
| 2 | @user Daalu nwanne. Oseburuwa goziere m gi.. | positive | positive |
| 3 | @user @user @user @user @user @user @user @use... | positive | positive |
| 4 | @user Ka Jehovah Nara mgburu obi gi | positive | positive |
| ... | ... | ... | ... |
| 1025 | @user 😂😂😂😂😂ajọ ndụ mehnn | negative | positive |
| 1026 | Here we go... Ndi e gbuwara isi... [URL] | negative | negative |
| 1027 | Àlà ga Agba ndị àlà ọ... Ma kà chị 😎😎 | negative | positive |
| 1028 | @user NAN adazi ewu sef. Ha zụcha cou and gate... | negative | negative |


### Let's check the performance of Naive Bayes on the development set using sklearn’s classification_report method.

In [None]:
report_word_based = classification_report(dev_y, y_hat)
print(report_word_based)

              precision    recall  f1-score   support

    negative       0.87      0.85      0.86       470
    positive       0.88      0.90      0.89       560

    accuracy                           0.88      1030
   macro avg       0.88      0.87      0.88      1030
weighted avg       0.88      0.88      0.88      1030



Here's a brief interpretation of the metrics.

### Word-Based Tokenization Results:
- **Precision**: High for both classes, slightly better for the positive class (0.88 vs. 0.87).
- **Recall**: Also high, with the positive class ahead (0.90 vs. 0.85).
- **F1-Score**: Consistently high across both classes, with the positive class slightly ahead (0.89 vs. 0.86).
- **Accuracy**: High overall performance at $88\%$.
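
To recall what these per-class numbers measure, the sketch below computes precision, recall and F1 from raw counts for one class. The helper `binary_metrics` and the toy labels are assumptions made for illustration; they are not taken from the development set, and the notebook itself relies on `classification_report`.

```python
def binary_metrics(y_true, y_pred, target):
    """Precision, recall and F1 for the class `target`, from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == target and p == target for t, p in pairs)  # correctly predicted target
    fp = sum(t != target and p == target for t, p in pairs)  # wrongly predicted target
    fn = sum(t == target and p != target for t, p in pairs)  # missed target
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels, not taken from the development set
toy_true = ["positive", "positive", "positive", "negative", "negative"]
toy_pred = ["positive", "positive", "negative", "negative", "positive"]

precision, recall, f1 = binary_metrics(toy_true, toy_pred, "positive")
print(precision, recall, f1)
```

Precision asks how many of the tweets predicted as a class really belong to it, recall asks how many tweets of that class were found, and F1 is their harmonic mean.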



<a name="section6"></a>
# 6. Tokenization: Experiment with word-based tokenization vs Byte Pair Encoding (BPE) and compare performance.

So far we have experimented with word-based tokenization. In this section, we experiment with BPE tokenization and compare its results with the word-based ones. The **BPE Tokenization** algorithm has already been implemented in the **data preprocessing** section of this notebook.

In [None]:
# BPE training tweets
train_x = whitespace_tokenize(train_df['bpe_text'].apply(clean).tolist())

train_y = train_df['label'].tolist()

# Build the freqs dictionary in the BPE case
freqs = count_tweets(train_x, train_y, {})

#Get the logprior and the loglikelihood
logprior, loglikelihood = train_naive_bayes(freqs, train_x, train_y)
print(logprior)
print(loglikelihood)

0.17071601067364667
{'Nna': -0.3732237659060711, 'Ike': -1.517348784757967, 'G': 0.2841917533052488, 'wuru': -0.9060609862782234, 'ooo': -0.47349394839226766, '.': 0.13334822078212505, '😂': -1.4969992453426872, '[URL]': -0.29813360265270344, '@user': -0.3498984789345613, 'Ch': 0.7465856804871137, 'i': 0.12683242605170844, 'neke': 1.0196304363648092, 'nna': 0.11646899593244685, 'ke': 0.2149737664457187, 'zi': 0.14305666478519055, 'm': 0.22980028659684637, 'gb': -1.0777779898233808, 'e': -0.06355534385826876, 'ole': -2.1027344881225476, '???': -0.8595409706433305, 'Lol.': -2.3258780394367573, 'Isi': -0.5653014976453906, 'adiro': -0.8778901093115272, 'kw': -0.7914773233211194, 'anu': -1.0020410332506136, 'gi': -0.10616403500455894, 'nma': 1.2243549213858855, '..': -0.14836115719406132, '😐': -2.533517404215002, '😒😒': -1.6327308588768121, '😒': -2.1588239547735912, 'ha': -0.6006793367270417, 'F': 0.19461955545394527, 'u': -0.2857405477159515, 'la': -0.3693346340800755, 'ni': 0.48169600491081

### Evaluation of Naive Bayes with BPE tokenization on the development set using sklearn's classification_report method

In [None]:
# BPE tokenized development tweets
dev_x_bpe = whitespace_tokenize(dev_df['bpe_text'].apply(clean).tolist())

# True classes of each tweet in the development set
dev_y = dev_df['label'].tolist()

# Predictions (classification) on the development tweets
y_hat_bpe = []
for tweet in dev_x_bpe:
  y_hat_tweet = naive_bayes_predict(tweet, logprior, loglikelihood)
  y_hat_bpe.append(y_hat_tweet)

# Visualising the outputs of our systems (word-based and BPE) on the development data
dddf = pd.DataFrame({'Tweet of the development set': dev_df['text'].tolist(), 'True class': dev_y, 'Predicted class with word-based': y_hat, 'Predicted class BPE': y_hat_bpe})

# Save it as a CSV file. After running this cell, find the CSV file in the directory of the notebook
dddf.to_csv('Tweets_with_predicted_classes.csv', index = False)

#display it
dddf

| | Tweet of the development set | True class | Predicted class with word-based | Predicted class BPE |
|---|---|---|---|---|
| 0 | Uche Chukwu ga emé... regardless . #NigeriaDec... | positive | positive | positive |
| 1 | @user Anuri ubochi ncheta omumu gi. Mazi Zebru... | positive | positive | positive |
| 2 | @user Daalu nwanne. Oseburuwa goziere m gi.. | positive | positive | positive |
| 3 | @user @user @user @user @user @user @user @use... | positive | positive | positive |
| 4 | @user Ka Jehovah Nara mgburu obi gi | positive | positive | positive |
| ... | ... | ... | ... | ... |
| 1025 | @user 😂😂😂😂😂ajọ ndụ mehnn | negative | positive | negative |
| 1026 | Here we go... Ndi e gbuwara isi... [URL] | negative | negative | negative |
| 1027 | Àlà ga Agba ndị àlà ọ... Ma kà chị 😎😎 | negative | positive | negative |
| 1028 | @user NAN adazi ewu sef. Ha zụcha cou and gate... | negative | negative | negative |




In [None]:
# BPE report
report_BPE = classification_report(dev_y, y_hat_bpe)
print(report_BPE)

              precision    recall  f1-score   support

    negative       0.80      0.91      0.85       470
    positive       0.92      0.81      0.86       560

    accuracy                           0.86      1030
   macro avg       0.86      0.86      0.86      1030
weighted avg       0.87      0.86      0.86      1030



Here's a brief interpretation of the metrics.

### BPE Tokenization Results:
- **Precision**: Higher for the positive class than for the negative class (0.92 vs. 0.80).
- **Recall**: Higher for the negative class (0.91 vs. 0.81), meaning the model finds most of the negative tweets.
- **F1-Score**: Reflects the trade-off between precision and recall, with a slightly lower score for the negative class (0.85 vs. 0.86).
- **Accuracy**: Slightly lower than the word-based method at $86\%$.


<a name="section7"></a>

# 7. Conclusion (Comparison)

- The word-based tokenization model has a higher overall accuracy (0.88 vs. 0.86).
- The BPE tokenization model has a higher precision for positive sentiment (0.92 vs. 0.88) but a lower precision for negative sentiment (0.80 vs. 0.87) compared to the word-based model.
- In contrast, the BPE tokenization model has higher recall for negative sentiment (0.91 vs. 0.85), indicating it is less likely to miss negative tweets but more likely to falsely identify positive tweets as negative.
- The F1-scores suggest that the word-based model has a slightly better balance between precision and recall, particularly for the positive class (0.89 vs. 0.86; 0.86 vs. 0.85 for the negative class).

Choosing between these models depends on the specific use case:
- If the cost of missing negative sentiments is high (for example, in monitoring for harmful content), the BPE model might be preferable due to its higher recall for the negative class.
- If the aim is to maintain a balance between precision and recall across both sentiment classes, the word-based model is slightly superior.
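
Beyond the aggregate metrics, it can help to count how often the two models agree on individual tweets. The sketch below is one possible way to do this; the helper `compare_predictions` and the toy labels are hypothetical, and in the notebook the same function could be called with `dev_y`, `y_hat` and `y_hat_bpe`.

```python
from collections import Counter

def compare_predictions(y_true, pred_a, pred_b):
    """Count, tweet by tweet, which of two models predicted the true class."""
    outcome = Counter()
    for t, a, b in zip(y_true, pred_a, pred_b):
        if a == t and b == t:
            outcome["both correct"] += 1
        elif a == t:
            outcome["only A correct"] += 1
        elif b == t:
            outcome["only B correct"] += 1
        else:
            outcome["both wrong"] += 1
    return outcome

# Toy labels for illustration only (A = word-based, B = BPE)
toy_true = ["positive", "positive", "negative", "negative"]
toy_a = ["positive", "negative", "positive", "negative"]
toy_b = ["positive", "positive", "negative", "positive"]

counts = compare_predictions(toy_true, toy_a, toy_b)
print(dict(counts))
```

A large "only A correct" or "only B correct" count would suggest that the two tokenizations make different kinds of errors, which is the situation described in the comparison above.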