# A multinomial model for text

Dante’s “Divina Commedia” is a well-known example of a poem in which the author uses different styles
across the different parts of the work. The poem is divided into three Cantiche: “Inferno”
(Hell), “Purgatorio” (Purgatory) and “Paradiso” (Heaven). Each part is written in a different linguistic
style, moving from less to more aulic (elevated) as we progress from Hell towards Heaven. Each Cantica is divided
into Canti: 34 for Inferno, and 33 each for Purgatorio and Paradiso. Each Canto consists of a variable number
of verses (115 to 160), organized in tercets. <br>
In this laboratory we will use statistical methods to analyze how these stylistic differences can be exploited
to identify the Cantica a given tercet belongs to. In particular, we will build a multinomial word model for
the three Cantiche and use it to classify tercet excerpts. To avoid biased results, the tercets used to train the
model will be different from those used for evaluation.


### Loading Data
The files `data/inferno.txt`, `data/purgatorio.txt` and `data/paradiso.txt` contain the tercets of the
three parts. To simplify parsing, the files are already organized so that each line corresponds
to a tercet. We use $25\%$ of the tercets for validation, and the remaining ones as
training data.

In [1]:
from load import load_data, split_data

In [8]:
lInf, lPur, lPar = load_data()
print("lInf - size (number of lines):", len(lInf))
print("lPur - size (number of lines):", len(lPur))
print("lPar - size (number of lines):", len(lPar)) 

lInf - size (number of lines): 1597
lPur - size (number of lines): 1608
lPar - size (number of lines): 1607


In [9]:
#Split the data: reserve 25% for validation, 75% for training
lInf_train, lInf_evaluation = split_data(lInf, 4)
lPur_train, lPur_evaluation = split_data(lPur, 4)
lPar_train, lPar_evaluation = split_data(lPar, 4)

print("lInf_train - size (number of lines):", len(lInf_train))
print("lInf_evaluation - size (number of lines):", len(lInf_evaluation))
print("lPur_train - size (number of lines):", len(lPur_train))
print("lPur_evaluation - size (number of lines):", len(lPur_evaluation))
print("lPar_train - size (number of lines):", len(lPar_train))
print("lPar_evaluation - size (number of lines):", len(lPar_evaluation))

lInf_train - size (number of lines): 1197
lInf_evaluation - size (number of lines): 400
lPur_train - size (number of lines): 1206
lPur_evaluation - size (number of lines): 402
lPar_train - size (number of lines): 1205
lPar_evaluation - size (number of lines): 402
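
The `load` module itself is not shown here. As a rough sketch of what `split_data(l, n)` might do, the hypothetical function `split_data_sketch` below sends every `n`-th line to the evaluation list and all other lines to the training list. This is an assumption, not the actual implementation, although it does reproduce the sizes printed above for a list of 1597 lines.

```python
def split_data_sketch(lines, n):
    """Hypothetical stand-in for load.split_data: every n-th line is held out."""
    train, evaluation = [], []
    for i, line in enumerate(lines):
        if i % n == 0:
            evaluation.append(line)   # 1 line out of every n is held out
        else:
            train.append(line)        # the remaining lines are used for training
    return train, evaluation

# Dummy data with the same number of lines as lInf (1597)
dummy = ["tercet %d" % i for i in range(1597)]
dummy_train, dummy_eval = split_data_sketch(dummy, 4)
print(len(dummy_train), len(dummy_eval))
```

A deterministic split like this keeps the experiment reproducible; a random split with a fixed seed would be an equally reasonable choice.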


## Model #1
We can assume the document is composed of $N$ words (for example, if the document is just a simple sentence like "How are you", we have $N = 3$). <br>
More precisely, each word corresponds to a **token**: in the *NLP* field, a token is a single unit of text, and in the simplest case, the one considered here, a token is a word. <br>
We can then describe the $N$ tokens of the document with $N$ random variables $X_1, X_2, \ldots, X_N$, each taking values in $D$. $D$ is the *set* of all the *distinct* words that appear in the document; it has size $M$ and is called the *dictionary*. For example, if the document is "How are you, you are good", then $N = 6$, $D$ contains just {"How", "are", "you", "good"}, and $M = 4$. <br>
The whole document is then one realization (i.e. observation) of the $N$ random variables $(X_1, \ldots, X_N)$. <br>
We also assume the tokens to be *i.i.d.*, so, for any pair of distinct tokens $X_p$ and $X_z$, the joint probability factorizes:
$$
P(X_p, X_z) = P(X_p) \cdot P(X_z) \quad \text{for } p \neq z
$$
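
To make the notions of token, $N$, dictionary and $M$ concrete, here is a minimal sketch applied to the example sentence. The `tokenize` helper is an assumption introduced for illustration: it simply strips punctuation and splits on whitespace, which is cruder than what a real tokenizer for Dante's text would need.

```python
import string

def tokenize(text):
    """Naive tokenizer (an assumption): strip punctuation, split on whitespace."""
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

document = "How are you, you are good"
tokens = tokenize(document)        # one realization of (X_1, ..., X_N)
dictionary = sorted(set(tokens))   # the distinct words, i.e. D

N = len(tokens)       # number of tokens in the document
M = len(dictionary)   # number of distinct words
print(tokens)
print(dictionary)
print("N =", N, "M =", M)
```

Note that this sketch is case-sensitive, so "How" and "how" would count as two different dictionary entries.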

We can model everything in terms of a **Categorical Distribution**: <br>
Each $X_i$ (for $i = 1, \ldots, N$) is a categorical random variable that takes values in the dictionary $D = \{D_1, D_2, \ldots, D_M\}$ with probability:

$$
P(X_i = D_j) = \pi_j \quad \text{for } j = 1, \ldots, M
$$

where:

- $\pi_j$ is the probability of observing the $j$-th word from the dictionary,
- $\pi_j \geq 0$ for all $j$,
- and $\sum_{j=1}^{M} \pi_j = 1$.

The whole document can be seen as a sequence of $N$ independent draws from this categorical distribution. <br>
Recalling the **Categorical Distribution** formulas, we can say that each token $X_i$ is a categorical random variable with probability mass function:

$$
f_{X_i}(x) = P(X_i = x) = \prod_{j=1}^{M} \pi_j^{\mathbb{I}[x = D_j]}
$$

where:

- $\pi_j$ is the probability of word $D_j$ in the dictionary,
- $\mathbb{I}[x = D_j]$ is the indicator function that is $1$ if $x = D_j$ and $0$ otherwise.

If, for example, we have a document with tokens ["dog", "cat", "dog", "cat", "dog"] and dictionary $D$ = {dog, cat} (so $N = 5$, $M = 2$), the probability mass functions of the first two tokens are:
$$
\begin{aligned}
f_{X_1}(\text{"dog"}) &= \pi_{\text{dog}}^1 \times \pi_{\text{cat}}^0 = \pi_{\text{dog}}\\
f_{X_2}(\text{"cat"}) &= \pi_{\text{dog}}^0 \times \pi_{\text{cat}}^1 = \pi_{\text{cat}}
\end{aligned}
$$
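
The indicator-exponent form of the probability mass function can be checked numerically. The sketch below is only illustrative: the function name `categorical_pmf` and the values $\pi_{\text{dog}} = 0.6$, $\pi_{\text{cat}} = 0.4$ are assumptions chosen for the example, not quantities estimated from data.

```python
def categorical_pmf(x, dictionary, pi):
    """P(X_i = x) written as the product over j of pi_j ** I[x == D_j]."""
    p = 1.0
    for word, prob in zip(dictionary, pi):
        indicator = 1 if x == word else 0   # I[x = D_j]
        p *= prob ** indicator              # factor is 1 unless x == D_j
    return p

dictionary = ["dog", "cat"]
pi = [0.6, 0.4]   # illustrative values only; they must sum to 1

print(categorical_pmf("dog", dictionary, pi))
print(categorical_pmf("cat", dictionary, pi))
```

Every factor whose indicator is 0 equals 1, so the product simply selects the probability of the observed word.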

Because the tokens are independent, the **likelihood function** is the product of the probability mass functions of the $N$ tokens in the document:
$$
\mathcal{L} \left( \Pi \right) = \mathcal{L}(\pi_1, \pi_2, \dots, \pi_M) = P(X_1, X_2, \dots, X_N) = \prod_{i=1}^{N} \prod_{j=1}^{M} \pi_j^{\mathbb{I}[X_i = D_j]}
$$
where:

- $\pi_j$ is the probability of observing the $j$-th word in the dictionary $D$,
- $\mathbb{I}[X_i = D_j]$ is the indicator function that is $1$ if $X_i = D_j$ and $0$ otherwise.

The first product runs over all tokens $X_1, X_2, \dots, X_N$, and the second product runs over all possible words in the dictionary $D_1, D_2, \dots, D_M$.
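
As a small numerical check of the double product, the sketch below evaluates it on the dog/cat document. It works with logarithms, since a product of many small probabilities quickly underflows for long documents. It also uses the fact that, for each token, exactly one indicator equals $1$, so the factors can be grouped by dictionary word into $\prod_{j=1}^{M} \pi_j^{N_j}$, where $N_j$ is the number of occurrences of $D_j$ in the document. The symbol $N_j$, the function names and the values of $\pi$ are assumptions introduced here for illustration.

```python
import math
from collections import Counter

def log_likelihood(tokens, pi):
    """Token by token: sum over i of log(pi) for the word observed at position i."""
    return sum(math.log(pi[token]) for token in tokens)

def log_likelihood_from_counts(tokens, pi):
    """Same quantity grouped by dictionary word: sum over j of N_j * log(pi_j)."""
    counts = Counter(tokens)   # N_j for each distinct word
    return sum(n_j * math.log(pi[word]) for word, n_j in counts.items())

tokens = ["dog", "cat", "dog", "cat", "dog"]
pi = {"dog": 0.6, "cat": 0.4}   # illustrative values only

ll = log_likelihood(tokens, pi)
ll_counts = log_likelihood_from_counts(tokens, pi)
likelihood = math.exp(ll)       # equals 0.6**3 * 0.4**2 up to rounding
print(ll, ll_counts, likelihood)
```

Both functions return the same value up to floating-point rounding; the count-based form only needs the word counts of the document, not the order of its tokens.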