# A multinomial model for text

Dante’s “Divina Commedia” is a well-known example of a poem whose author uses different styles
in different parts of the work. The poem is divided into three Cantiche: “Inferno”
(Hell), “Purgatorio” (Purgatory) and “Paradiso” (Heaven). Each part is written in a different linguistic
style, moving from less to more aulic as we progress from Hell towards Heaven. Each Cantica is divided
into Canti: 34 for Inferno, and 33 each for Purgatorio and Paradiso. Each Canto consists of a variable number
of verses (115 to 160), organized in tercets. <br>
In this laboratory we will use statistical methods to analyze how these stylistic differences can be exploited
to identify the Cantica a given tercet belongs to. In particular, we will build a multinomial word model for
the three Cantiche and classify tercet excerpts. To avoid biased results, the tercets used to train the
model will be different from those we evaluate on.


### Loading Data
The files data/inferno.txt, data/purgatorio.txt, data/paradiso.txt contain the tercets of the
three different parts. To simplify parsing, the files are already organized so that each line corresponds
to a tercet. We use $25\%$ of the tercets as validation data, and the remaining ones as
training data.

In [19]:
import numpy as np
from load import load_data, split_data

In [9]:
lInf, lPur, lPar = load_data()
print("lInf - size (number of tercets):", len(lInf))
print("lPur - size (number of tercets):", len(lPur))
print("lPar - size (number of tercets):", len(lPar)) 

lInf - size (number of tercets): 1597
lPur - size (number of tercets): 1608
lPar - size (number of tercets): 1607


In [10]:
#Split the data: reserve 25% for validation, 75% for training
lInf_train, lInf_evaluation = split_data(lInf, 4)
lPur_train, lPur_evaluation = split_data(lPur, 4)
lPar_train, lPar_evaluation = split_data(lPar, 4)

print("lInf_train - size (number of tercets):", len(lInf_train))
print("lInf_evaluation - size (number of tercets):", len(lInf_evaluation))
print("lPur_train - size (number of tercets):", len(lPur_train))
print("lPur_evaluation - size (number of tercets):", len(lPur_evaluation))
print("lPar_train - size (number of tercets):", len(lPar_train))
print("lPar_evaluation - size (number of tercets):", len(lPar_evaluation))

lInf_train - size (number of tercets): 1197
lInf_evaluation - size (number of tercets): 400
lPur_train - size (number of tercets): 1206
lPur_evaluation - size (number of tercets): 402
lPar_train - size (number of tercets): 1205
lPar_evaluation - size (number of tercets): 402
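
The `split_data` helper comes from the `load` module and its code is not shown here. The sizes printed above are consistent with a split that sends every fourth tercet to the evaluation set. The sketch below illustrates that idea under this assumption; the name `split_every_nth` is hypothetical and is not meant to be the actual implementation.

```python
def split_every_nth(items, n):
    """Hypothetical helper: send every n-th item (indices 0, n, 2n, ...) to evaluation."""
    train, evaluation = [], []
    for i, item in enumerate(items):
        if i % n == 0:
            evaluation.append(item)
        else:
            train.append(item)
    return train, evaluation


# Toy input with the same length as lInf (1597 tercets)
toy = [f"tercet {i}" for i in range(1597)]
toy_train, toy_eval = split_every_nth(toy, 4)
print(len(toy_train), len(toy_eval))   # 1197 400
```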


Now, we create a dictionary of all the tercets used for training:

In [11]:
DTR = {
    "lInf": lInf_train,
    "lPur": lPur_train,
    "lPar": lPar_train
}

In [35]:
DVAL = lInf_evaluation + lPur_evaluation + lPar_evaluation

## Model #1
We can assume the document is composed of $N$ words (e.g., if the document is just a simple sentence like "How are you", we have $N = 3$). <br>
More precisely, each word corresponds to a **token**: in the *NLP* field, a token is a single unit of text, and in the simplest case, the one considered here, a token is a word. <br>
Moreover, we can abstract and consider $N$ random variables $X_1, X_2, \dots, X_N \in D$ that describe the $N$ tokens of the document. $D$ is the *set* of all the *distinct* words in the document, has size $M$, and is called the *dictionary*: if, for example, the document is "How are you, you are good", $D$ contains just {"How", "are", "you", "good"} and $M = 4$. <br>
Having said that, we can consider the whole document to be the realization (i.e. observation) of the $N$ R.V. $(X_1, \dots, X_N)$. <br>
We also assume the tokens are *i.i.d.*; in particular they are independent, so for any pair of distinct tokens $X_p$ and $X_z$ we have:
$$
P(X_p, X_z) = P(X_p) \cdot P(X_z) \quad \text{for } p \neq z
$$
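
The token and dictionary definitions can be checked on the example sentence with a few lines of Python. This is only a sketch: it assumes tokens are whitespace-separated words with surrounding punctuation stripped, which is a choice made here for illustration (the notebook code further down uses a plain `split()`).

```python
import string

document = "How are you, you are good"

# Assumption: tokens are whitespace-separated words with surrounding punctuation removed
tokens = [w.strip(string.punctuation) for w in document.split()]
dictionary = set(tokens)

N = len(tokens)       # number of tokens
M = len(dictionary)   # number of distinct words

print(N, M)           # 6 4
print(sorted(dictionary))
```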

We can model everything in terms of a **Categorical Distribution**: <br>
Each $X_i$ (for $i = 1, \ldots, N$) is a categorical random variable that takes values in the dictionary $D = \{D_1, D_2, \ldots, D_M\}$ with probability:

$$
P(X_i = D_j) = \pi_j \quad \text{for } j = 1, \ldots, M
$$

where:

- $\pi_j$ is the probability of observing the $j$-th word from the dictionary,
- $\pi_j \geq 0$ for all $j$,
- and $\sum_{j=1}^{M} \pi_j = 1$.

The whole document can be seen as a sequence of $N$ independent draws from this categorical distribution. 
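
To make the generative view concrete, the sketch below draws a short "document" from a categorical distribution with NumPy. The probability values in `pi` are made up for illustration; only the mechanism (independent draws from a fixed dictionary) reflects the model described above.

```python
import numpy as np

rng = np.random.default_rng(0)

dictionary = ["How", "are", "you", "good"]   # D_1, ..., D_M
pi = np.array([0.1, 0.3, 0.4, 0.2])           # illustrative values, must sum to 1

N = 10
document = rng.choice(dictionary, size=N, p=pi)  # N i.i.d. categorical draws
print(" ".join(document))
```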

### Parameters estimation via Maximum Likelihood approach
$\Pi = (\pi_1, \pi_2, ..., \pi_M)$ are the model parameters that we have to estimate in order to compute the likelihoods. <br>
Recalling the **Categorical Distribution** formulas, we can say that each token $X_i$ is a categorical random variable with probability mass function:

$$
f_{X_i}(x) = P(X_i = x) = \prod_{j=1}^{M} \pi_j^{\mathbb{I}[x = D_j]}
$$

where:

- $\pi_j$ is the probability of word $D_j$ in the dictionary,
- $\mathbb{I}[x = D_j]$ is the indicator function that is $1$ if $x = D_j$ and $0$ otherwise.

If, for example, we have a document having Tokens: ["dog", "cat", "dog", "cat", "dog"] and dictionary $D$ = {dog, cat} (so, $N =5$, $M=2$), we would compute the following probability mass functions:
$$
f_{X_1}(\text{"dog"}) = \pi_{"dog"}^1 \times \pi_{"cat"}^0 = \pi_{"dog"}\\
f_{X_2}(\text{"cat"}) = \pi_{"dog"}^0 \times \pi_{"cat"}^1 = \pi_{"cat"}
$$

As a result, we can then compute the **Likelihood function** as the product of all the probability mass functions for each one of the $N$ independent tokens in the document:
$$
\mathcal{L} \left( \Pi \right) = \mathcal{L}(\pi_1, \pi_2, \dots, \pi_M) = P(X_1, X_2, \dots, X_N) = \prod_{i=1}^{N} \prod_{j=1}^{M} \pi_j^{\mathbb{I}[X_i = D_j]}
$$
Where:

- $\pi_j$ is the probability of observing the $j$-th word in the dictionary $D$,
- $\mathbb{I}[X_i = D_j]$ is the indicator function that is 1 if $X_i = D_j$ and 0 otherwise.

The first product runs over all tokens $X_1, X_2, \dots, X_N$, and the second product runs over all possible words in the dictionary $D_1, D_2, \dots, D_M$. <br>
Shifting to the *log-domain*, we can write:
$$
\mathcal{l}(\Pi) = \log \mathcal{L}(\Pi) = \sum_{i=1}^{N} \sum_{j=1}^{M} \mathbb{I}[x_i = j] \log \pi_j = \sum_{j=1}^{M} N_j \log \pi_j
$$
Where:
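- $\mathbb{I}[x_i = j]$ is the indicator function (1 if token $x_i$ is the $j$-th dictionary word $D_j$, 0 otherwise; here $x_i = j$ is shorthand for $x_i = D_j$),
- $N_j = \sum_{i=1}^{N} \mathbb{I}[x_i = j]$ is the number of times word $j$ appears in the document.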
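
The equality between the sum over tokens and the sum over dictionary words weighted by their counts can be verified numerically on the dog/cat document. The parameter values in `pi` are arbitrary and chosen only for the check.

```python
import math
from collections import Counter

tokens = ["dog", "cat", "dog", "cat", "dog"]
pi = {"dog": 0.6, "cat": 0.4}   # illustrative parameter values

# Sum over tokens: each token contributes log(pi) of the word it equals
ll_tokens = sum(math.log(pi[t]) for t in tokens)

# Sum over dictionary words, weighted by the counts N_j
counts = Counter(tokens)        # N_dog = 3, N_cat = 2
ll_counts = sum(n * math.log(pi[w]) for w, n in counts.items())

print(ll_tokens, ll_counts)     # the two values coincide
```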

We have the constraint that the probabilities must sum to 1:

$$
\sum_{j=1}^{M} \pi_j = 1
$$

To solve the **Maximum Likelihood Estimation (MLE)** problem under this constraint, we use the **method of Lagrange multipliers**.

We define the **Lagrangian function** $\mathcal{l}(\Pi, \lambda)$ to include this constraint when maximizing the log-likelihood:

$$
\mathcal{l}(\Pi, \lambda) = \mathcal{l}(\pi_1, \dots, \pi_M, \lambda) = \sum_{j=1}^{M} N_j \log \pi_j + \lambda \left(1 - \sum_{j=1}^{M} \pi_j \right)
$$

Then, we compute the partial derivatives and set them to zero:

For each $j = 1, \dots, M$:

$$
\frac{\partial \mathcal{l}}{\partial \pi_j} = \frac{N_j}{\pi_j} - \lambda = 0
$$

And for the constraint:

$$
\frac{\partial \mathcal{l}}{\partial \lambda} = 1 - \sum_{j=1}^{M} \pi_j = 0
$$

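Solving this system gives the ML estimates of $\pi_j$: the first set of equations gives $\pi_j = N_j / \lambda$, and substituting into the constraint yields $\sum_{j=1}^{M} N_j / \lambda = 1$, i.e. $\lambda = \sum_{j=1}^{M} N_j = N$, so that $\pi_j = N_j / N$. <br>
The same maximization is carried out separately for each *cantica* $c$, using the training tercets of that cantica as the document.
The ML estimate of the probability of word $j$ in cantica $c$ is therefore: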
$$
\pi_{c,j} = \frac{N_{c,j}}{N_c}
$$
Where:
- $N_{c,j}$ is the number of occurrences of word $j$ (belonging to $D$) in the training tercets of cantica $c$,
- $N_c = \sum_{j=1}^{M} N_{c,j}$ is the total number of tokens in cantica $c$.

This means that:

> The ML estimate $\pi_{c,j}$ is simply the **relative frequency** of word $j$ in cantica $c$.

> This is a fundamental result: under the assumption that words are *i.i.d.* samples from a categorical distribution, the MLE of each word's probability is **just the proportion of times the word appears in the document**. <br>
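
As a quick sanity check of this result, the sketch below compares the log-likelihood at the relative frequencies with its value on a grid of other valid probability vectors, using the counts of the dog/cat toy document. The grid is an arbitrary choice made for illustration.

```python
import numpy as np

counts = np.array([3.0, 2.0])          # N_dog, N_cat from the toy document
pi_mle = counts / counts.sum()         # relative frequencies: [0.6, 0.4]

def log_likelihood(pi):
    return float(np.sum(counts * np.log(pi)))

# Compare against other valid probability vectors (p, 1 - p)
candidates = [np.array([p, 1.0 - p]) for p in np.linspace(0.05, 0.95, 19)]
best = max(candidates, key=log_likelihood)

print(pi_mle, best)   # the best grid point matches the relative frequencies
```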

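So, plugging the ML estimates into the log-likelihood of a document under the model of cantica $c$ (where $N_j$ counts the occurrences of word $j$ in the document being scored, while $N_{c,j}$ and $N_c$ are the training counts of cantica $c$), we get: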
$$
\mathcal{l}(\Pi) = \log \mathcal{L}(\Pi) = \sum_{i=1}^{N} \sum_{j=1}^{M} \mathbb{I}[x_i = j] \log \pi_j = \sum_{j=1}^{M} N_j \log \pi_j = \sum_{j=1}^{M} N_j \left[ \log N_{c,j} - \log N_c \right]
$$

**NOTE**: for some classes word frequencies may be 0. Since the prediction is based on the logarithm of
the frequencies, we get $-\infty$ values for samples that contain words that do not
appear in one cantica, corresponding to a class-conditional probability (likelihood) of 0. This may lead
to numerical issues, since the likelihood of a tercet may be 0 for all three
classes (e.g., a tercet containing a word that does not appear in Inferno, a second word that does not
appear in Purgatorio and a third word that does not appear in Paradiso). <br>
A solution is to introduce augmented counts (pseudo-counts), so instead of using $N_{c,j}$ we use its *smoothed* version:
$$
\tilde{N}_{c,j} = N_{c,j} + \varepsilon
$$
where $\varepsilon$ is a small positive value, treated as a hyperparameter. <br>
So the smoothed formula for the log-likelihood, the one that will be used in the code, is:
$$
\mathcal{l}(\Pi) \approx \mathcal{l} (\Pi, \varepsilon) = \sum_{j=1}^{M} N_j \left[ \log \left( N_{c,j} + \varepsilon \right) - \log N_c \right]
$$
Note that the denominator is left as $N_c$ rather than $N_c + M\varepsilon$, so the smoothed frequencies do not sum exactly to 1; for small $\varepsilon$ the difference is small.
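
The effect of the pseudo-count can be seen on a toy count vector in which one word never appears. The numbers are made up; `eps = 0.001` matches the default used in the code below.

```python
import numpy as np

eps = 0.001
N_c = 5                                   # total tokens in the toy cantica
counts = np.array([3.0, 2.0, 0.0])        # the third word never appears

with np.errstate(divide="ignore"):        # silence the log(0) warning
    log_freq = np.log(counts) - np.log(N_c)
log_freq_smoothed = np.log(counts + eps) - np.log(N_c)

print(log_freq)            # last entry is -inf
print(log_freq_smoothed)   # all entries are finite
```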

To compute the frequencies we first need to build the dictionary of all possible words. We consider only
words that appear in the training set — if test or validation tercets contain unknown words we can
simply discard them, as they would not, in any case, provide support for any of the three hypotheses
under consideration. <br>
A convenient way to do this is to use a dictionary where the keys are the distinct words, and the values are the number of occurrences of each word in the cantica:

In [13]:
def Model1_buildDict(canticaTercets):
    """
    Build a dictionary of word counts from the tercets of a cantica.

    Parameters:
    canticaTercets (list): A list of tercets (strings) from a cantica.

    Returns:
    Dc: A dictionary mapping each distinct word to its number of occurrences.
    Nc: The total number of tokens (i.e. the sum of the occurrences of all words) in canticaTercets.
    """

    Dc = {}
    Nc = 0

    for tercet in canticaTercets:
        # split() with no arguments splits on any whitespace
        for word in tercet.split():
            Nc += 1
            # get() returns 0 the first time a word is seen
            Dc[word] = Dc.get(word, 0) + 1

    return Dc, Nc
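
For comparison, the same counts can be obtained with `collections.Counter` from the standard library. This is an alternative sketch on made-up input, not the code used in the rest of the notebook.

```python
from collections import Counter

# Two made-up lines standing in for tercets
toy_tercets = ["dog cat dog", "cat dog"]

counts = Counter()
for tercet in toy_tercets:
    counts.update(tercet.split())    # add the word counts of this tercet

Nc = sum(counts.values())            # total number of tokens
print(dict(counts), Nc)              # {'dog': 3, 'cat': 2} 5
```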

To compute the model estimators using the *ML* approach, we define a function that returns a dictionary which, for each cantica $c$, contains another dictionary storing the smoothed log relative frequency of each word in the cantica:

In [52]:
def Model1_Estimators(DTR, eps = 0.001):
    """
    This function computes, for each cantica, a dictionary storing the smoothed log relative frequency of each word, corresponding to the (smoothed) MLE.

    Parameters:
    - DTR: A dictionary of key=canticaLabel, value=list of tercets in the cantica
    - eps: epsilon, the pseudo-count added to each word count to smooth the frequencies

    Returned Values:
    - MLE: log Maximum Likelihood Estimators for each word in each cantica c: keys: cantica labels, values: dictionary mapping each word to its SMOOTHED log frequency in cantica c
    - globalDict: set of all the distinct words appearing in the training tercets of any cantica
    """


    MLE = {}

    globalDict = set() #global dict computed as the union between the cantica dicts keys sets
    canticaTokenCounts = {} #dictionary storing each Nc (token count) for each cantica c
    canticaDicts = {}  #dictionary storing canticaDict for each cantica

    #For each cantica c
    #1. Build a dictionary of key: word, value: #times the word appears in the cantica c
    #2. Find Nc which is the number of all tokens (i.e. the sum of appearances of each word) appearing in the canticaTercets
    #3. The MLE are found by calculating the frequency of each word in the cantica c

    #1-2. Dictionary building, Nc for each cantica
    for canticaLabel in DTR.keys():
        # canticaLabel can be lInf, lPur, lPar

        canticaDict, Nc = Model1_buildDict(DTR[canticaLabel])
        globalDict = globalDict.union(canticaDict.keys())
        canticaTokenCounts[canticaLabel] = Nc
        canticaDicts[canticaLabel] = canticaDict




    #3. Compute smoothed MLE
    #Smoothing: to avoid a zero frequency for any word, add the pseudo-count eps to every word count
    for canticaLabel in DTR.keys(): 
        MLE[canticaLabel] = {}
        canticaDict = canticaDicts[canticaLabel]
        Nc = canticaTokenCounts[canticaLabel]

        for word in globalDict:
            #if the word of the globalDict also appears in the canticaDict -> log frequency is log{(Nc,j + eps) / Nc}
            #otherwise, log frequency is log{(0 + eps) / Nc} = log{eps / Nc}

            #use dict.get(word, 0) in order to return Nc,j = 0 if the word is NOT present in the canticaDict
            #simply using canticaDict[word] would raise a KeyError in this case

            N_cj = canticaDict.get(word, 0)

            MLE[canticaLabel][word] = np.log(N_cj + eps) - np.log(Nc)




    return MLE, globalDict


Here are the MLE:

In [53]:
MLE, globalDict = Model1_Estimators(DTR)

In [51]:
MLE["lPar"]["ritrovai"]

KeyError: 'ritrovai'

### Predicting the cantica
We now turn to prediction for a given tercet $t$. Again, we can represent the tercet in terms of its tokens
$x = [x_1, x_2, ..., x_N ]$ (here $N$ refers to the length of the validation tercet, and $x_i$
is the $i$-th word of the tercet). <br>
As said before, the likelihood of a tercet $x$ under cantica $c$ is expressed as:
$$
P(x \mid c) = P(x_1, x_2, \dots, x_N \mid c) = \prod_{i=1}^{N} P(x_i \mid c)
$$
In the *log-domain* we express it as:
$$
\log P(x \mid c) = \sum_{i=1}^{N} \log \pi_{c, x_i}
$$
**So, to compute the log-likelihood of a tercet, we sum the log-probabilities of all its words.** <br>
Let's consider a scenario where we are predicting the cantica for a tercet. Suppose we have the following words in the tercet: 
$$
x = [x_1, x_2, x_3] = [\text{fire}, \text{burning}, \text{night}]
$$
We also have the word probabilities (calculated from the MLE) for a specific cantica, say *Inferno*, stored in $\pi_{\text{Inferno}, x}$, which represents the probability of each word occurring in the *Inferno* cantica.

For this example, suppose the probabilities for each word are as follows:
- $P(\text{fire} \mid \text{Inferno}) = 0.2$
- $P(\text{burning} \mid \text{Inferno}) = 0.3$
- $P(\text{night} \mid \text{Inferno}) = 0.1$

We can calculate the log-likelihood for this tercet in the *Inferno* cantica as:
$$
\log P(x \mid \text{Inferno}) = \log \{P(\text{fire} \mid \text{Inferno}) \cdot P(\text{burning} \mid \text{Inferno}) \cdot P(\text{night} \mid \text{Inferno})\}
$$
Using the formula for the log of a product, we get:
$$
\log P(x \mid \text{Inferno}) = \log P(\text{fire} \mid \text{Inferno}) + \log P(\text{burning} \mid \text{Inferno}) + \log P(\text{night} \mid \text{Inferno})
$$
Substituting the values, we have:
$$
\log P(x \mid \text{Inferno}) = \log 0.2 + \log 0.3 + \log 0.1
$$
This simplifies to:
$$
\log P(x \mid \text{Inferno}) \approx -1.6094 - 1.2040 - 2.3026 = -5.116
$$

Thus, the log-likelihood for the tercet $x = [\text{fire}, \text{burning}, \text{night}]$ in the *Inferno* cantica is approximately $-5.116$.
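
The arithmetic of this worked example can be reproduced in a few lines. The probabilities are the illustrative values used above, not estimates from the data.

```python
import math

probs = {"fire": 0.2, "burning": 0.3, "night": 0.1}   # illustrative values from the example
tercet = ["fire", "burning", "night"]

log_likelihood = sum(math.log(probs[w]) for w in tercet)
print(round(log_likelihood, 3))   # -5.116
```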


In practice (i.e. in the code), we compute a **matrix of scores** $S$, where each row corresponds to a class and each column to a test sample (tercet):

In [70]:
def Model1_compute_LogLikelihoods(logMLE, tercet):
    """
    Compute the log-likelihood of a single tercet under each class (cantica)

    Parameters:
    - logMLE: log estimators for each word within each class
    - tercet: text (string) on which to compute the log-likelihoods

    Returned Values:
    - ll: dictionary of key=canticaLabel, value=log-likelihood of the tercet for that cantica
    """

    ll = {canticaLabel: 0 for canticaLabel in logMLE.keys()}

    for canticaLabel in logMLE.keys():
        logMLECanticaLabel = logMLE[canticaLabel]
        for word in tercet.split():
            #words that never appear in the training set are discarded: they contribute 0 to the log-likelihood of every class
            logProbWord = logMLECanticaLabel.get(word, 0)
            ll[canticaLabel] += logProbWord


    return ll   #3 values: {ll_lInf, ll_lPur, ll_lPar}
            

In [77]:
def Model1_matrixScore(logMLE, tercetsList):
    """
    Compute a score matrix S where each row corresponds to a class (lInf, lPur, lPar) and each column to a test sample (tercet)
    """

    numClasses = len(logMLE.keys())
    S = np.zeros((numClasses, len(tercetsList)))

    for tercetColumnCount, tercet in enumerate(tercetsList):
        #tercetColumnCount is the index of the column where the likelihoods of this tercet are stored
        logLikelihoods = Model1_compute_LogLikelihoods(logMLE, tercet)

        for row, canticaLabel in enumerate(logMLE.keys()):
            S[row, tercetColumnCount] = logLikelihoods[canticaLabel]

    return S


    

In [78]:
Model1_S = Model1_matrixScore(MLE, DVAL)
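
A natural next step, not shown in the cells above, is to assign each tercet to the class with the highest score in its column. Assuming equal prior probabilities for the three classes, this is the same as picking the class with the highest posterior. The sketch below uses a made-up score matrix and hypothetical ground-truth labels, so the numbers say nothing about the actual model.

```python
import numpy as np

# Toy score matrix: 3 classes (rows) x 4 tercets (columns), made-up log-likelihoods
S = np.array([
    [-10.0, -25.0, -30.0, -12.0],   # lInf
    [-12.0, -20.0, -28.0, -11.0],   # lPur
    [-15.0, -22.0, -27.0, -14.0],   # lPar
])
true_labels = np.array([0, 1, 2, 0])   # hypothetical ground truth (row indices)

predicted = np.argmax(S, axis=0)       # best class for each column
accuracy = np.mean(predicted == true_labels)

print(predicted, accuracy)             # [0 1 2 1] 0.75
```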