##### Master Degree in Computer Science and Data Science for Economics

# Statistical Language Models

### Alfio Ferrara

Given a corpus $D$ whose vocabulary contains $N$ distinct tokens $w_1, \dots, w_N$, we can estimate the probability of a token $w \in D$ as its relative frequency:

$$
P(w) = \frac{count(w)}{\sum\limits_{i=1}^{N} count(w_i)}
$$

Using this simple statistic, we can sample words from the corpus, each according to its probability $P(w)$. If we generate a text $d = w_1, w_2, \dots, w_{n-1}, w_n$ this way, every word is drawn independently of the others, so we have:

$$
P(w_1, w_2, \dots, w_{n-1}, w_n) = P(w_1)P(w_2) \dots P(w_{n-1})P(w_n) = \prod\limits_{i=1}^{n} P(w_i)
$$
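The unigram model above can be sketched in a few lines of Python. This is only an illustration on a toy corpus: the helper names `unigram_probabilities` and `sample_text` are hypothetical and are not part of the `nlp.langmodel` module used later.

```python
import random
from collections import Counter


def unigram_probabilities(tokens):
    """Relative frequency of each token: count(w) / total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def sample_text(probs, length, seed=None):
    """Draw `length` tokens independently, each according to P(w)."""
    rng = random.Random(seed)
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=length)


# Toy corpus (illustrative only, not the recipe data used below)
tokens = "add salt and stir add water and stir well".split()
probs = unigram_probabilities(tokens)
generated = sample_text(probs, length=5, seed=0)
print(probs["add"])
print(" ".join(generated))
```

Because each token is drawn without looking at the previous ones, the generated sequence has the right word frequencies but no word order.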

However, this is not realistic, because the words of a real text are not independent of one another. A better way to model the process is to take into account the words generated so far, sampling the $i$-th word with a probability that is conditioned on the previous $i-1$ words.

Thus, applying the chain rule:

$$
P(w_1, w_2, \dots, w_{n-1}, w_n) = P(w_1) P(w_2 \mid w_1) \dots P(w_{n-1} \mid w_1 \dots w_{n-2}) P(w_n \mid w_1 \dots w_{n-1}) = \prod\limits_{i=1}^{n} P(w_i \mid w_1 \dots w_{i-1})
$$
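As a concrete instance, for the three-token sequence *bring water to* (chosen here only for illustration) the chain rule expands to:

$$
P(\text{bring}, \text{water}, \text{to}) = P(\text{bring}) \, P(\text{water} \mid \text{bring}) \, P(\text{to} \mid \text{bring} \ \text{water})
$$

Each factor conditions on one more word than the previous one, so the last factor of a sequence of $n$ words requires statistics about a context of $n-1$ words.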


Now, indexing long sequences as required by this method is infeasible for two reasons:
1. It is memory-consuming and computationally intractable.
2. The longer the sequence, the lower the chance of observing it a sufficient number of times.

So, we can apply a Markov approximation by conditioning each word only on the $k$ words that precede it:

$$
P(w_1, w_2, \dots, w_{n-1}, w_n) \approx \prod\limits_{i=1}^{n} P(w_i \mid w_{i-k} \dots w_{i-1})
$$
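The following is a minimal sketch of such a model, not the implementation of `MarkovLM` from `nlp.langmodel`. The class name `TinyMarkovLM` is hypothetical, and the sketch makes several assumptions: it tokenizes on whitespace instead of using a BERT tokenizer, it applies no smoothing, and `k` is the number of context tokens as in the formula above (the notebook's `MarkovLM` may count `k` differently).

```python
import math
import random
from collections import Counter, defaultdict

START, END = "[#S]", "[#E]"


class TinyMarkovLM:
    """Hypothetical minimal k-order Markov model over whitespace tokens."""

    def __init__(self, k=2):
        self.k = k
        # counts[context][token] = times `token` followed the k-token `context`
        self.counts = defaultdict(Counter)

    def _pad(self, tokens):
        # k start symbols give the first real token a full context
        return [START] * self.k + list(tokens) + [END]

    def train(self, corpus):
        for sentence in corpus:
            tokens = self._pad(sentence.lower().split())
            for i in range(self.k, len(tokens)):
                context = tuple(tokens[i - self.k:i])
                self.counts[context][tokens[i]] += 1

    def prob(self, token, context):
        # Maximum-likelihood estimate; unseen contexts or tokens give 0
        seen = self.counts.get(tuple(context))
        if not seen:
            return 0.0
        return seen[token] / sum(seen.values())

    def log_prob(self, sentence):
        tokens = self._pad(sentence.lower().split())
        total = 0.0
        for i in range(self.k, len(tokens)):
            p = self.prob(tokens[i], tokens[i - self.k:i])
            if p == 0.0:
                return float("-inf")  # no smoothing in this sketch
            total += math.log(p)
        return total

    def generate(self, max_len=50, seed=None):
        rng = random.Random(seed)
        tokens = [START] * self.k
        while len(tokens) < max_len + self.k:
            seen = self.counts.get(tuple(tokens[-self.k:]))
            if not seen:
                break
            # Sample the next token in proportion to its count in this context
            nxt = rng.choices(list(seen), weights=list(seen.values()))[0]
            tokens.append(nxt)
            if nxt == END:
                break
        return tokens


# Toy corpus (illustrative only)
lm = TinyMarkovLM(k=1)
lm.train(["stir the sauce", "stir the soup", "heat the sauce"])
print(lm.prob("sauce", ["the"]))
print(lm.log_prob("stir the sauce"))
print(" ".join(lm.generate(seed=1)))
```

Storing counts per context keeps memory proportional to the number of distinct $k$-token contexts actually observed, which is what makes the approximation tractable compared with indexing full prefixes.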

In [1]:
from nlp.langmodel import MarkovLM
import pymongo

## Get a corpus of texts

In [2]:
db = pymongo.MongoClient()['cousine']
recipes = db['foodcom']

In [3]:
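def create_corpus(query: dict = None, numdocs: int = 3000):
    """Collect the preparation steps of up to `numdocs` recipes matching `query`."""
    corpus = []
    # An empty filter matches every document in the collection
    for recipe in recipes.find(query or {}).limit(numdocs):
        # Each step of a recipe becomes one sentence of the corpus
        corpus.extend(recipe['steps'])
    return corpus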

In [4]:
numdocs = 3000
corpus = create_corpus(query={}, numdocs=numdocs)
print(f"Corpus size: {len(corpus)}")
for text in corpus[:4]:
    print(text)

Corpus size: 20262
I a sauce pan, bring water to a boil; slowly add grits and salt, stirring constantly; Reduce heat:simmer, uncovered, for 40-45 minutes or untill thickened, stirrin occasionally.
Add cheese and garlic; stir until cheese is melted, Spray 9-inch baking dish with nonstick cooking spray; Cover and refrigerate for 2 to 2 1/2 hours or until frim.
Before starting the grill, coat the grill rack with nonstick cooking spray; Cut the grits into 3-inch squares; Brush both sides with olive oil.
Grill, covered, over medium heat for 4 to 6 minutes on each side or until lightly browned.


In [5]:
tokenizer = "bert-base-uncased"
brlm = MarkovLM(k=2, tokenizer_model=tokenizer)
frlm = MarkovLM(k=4, tokenizer_model=tokenizer)

In [6]:
brlm.train(corpus=corpus)
frlm.train(corpus=corpus)

100%|██████████| 20262/20262 [00:02<00:00, 8601.39it/s]
100%|██████████| 20262/20262 [00:02<00:00, 7479.36it/s]


## Text generation

In [7]:
print("2gram: ", " ".join(brlm.generate()).replace(" ##", ""))
print("4gram: ", " ".join(frlm.generate()).replace(" ##", ""))

2gram:  [#S] toss well . [#E]
4gram:  [#S] [#S] [#S] in a medium saucepan bring 3 / 4 full . [#E]
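
The `.replace(" ##", "")` step in the cell above undoes the WordPiece splitting of `bert-base-uncased`, which marks word-continuation pieces with a `##` prefix. A small illustration on a hand-written token list (the split shown is an assumption made for the example, not actual tokenizer output, and `join_wordpieces` is a hypothetical helper):

```python
def join_wordpieces(tokens):
    """Join tokens with spaces, then glue '##' continuation pieces to the previous token."""
    return " ".join(tokens).replace(" ##", "")


# Hand-written pieces, for illustration only
tokens = ["[#S]", "sprinkle", "with", "corn", "##star", "##ch", ".", "[#E]"]
text = join_wordpieces(tokens)
print(text)  # [#S] sprinkle with cornstarch . [#E]
```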


## Text classification

In [8]:
italian_q = {'search_terms': 'italian'}
chinese_q = {'search_terms': 'chinese'}
numdocs = 3000
italian_corpus = create_corpus(query=italian_q, numdocs=numdocs)
chinese_corpus = create_corpus(query=chinese_q, numdocs=numdocs)
print(f"Italian: {len(italian_corpus)}")
print(f"Chinese: {len(chinese_corpus)}")

Italian: 30052
Chinese: 24134


In [9]:
it = MarkovLM(k=4, tokenizer_model=tokenizer)
ch = MarkovLM(k=4, tokenizer_model=tokenizer)

In [10]:
it.train(corpus=italian_corpus)
ch.train(corpus=chinese_corpus)

100%|██████████| 30052/30052 [00:03<00:00, 7705.94it/s]
100%|██████████| 24134/24134 [00:03<00:00, 6768.58it/s]


In [12]:
italian_sentence = italian_corpus[6]
chinese_sentence = chinese_corpus[6]

print(f"Italian sentence: {italian_sentence}")
print(f"Italian: {it.log_prob(italian_sentence)}")
print(f"Chinese: {ch.log_prob(italian_sentence)}")
print("========")
print(f"Chinese sentence: {chinese_sentence}")
print(f"Italian: {it.log_prob(chinese_sentence)}")
print(f"Chinese: {ch.log_prob(chinese_sentence)}")

Italian sentence: Place ravioli on a large baking sheet sprinkled with cornstarch.
Italian: -23.0900057470743
Chinese: -21.352223087931076
Chinese sentence: Heat enough oil in a frying pan over medium heat to shallow fry.
Italian: -24.643378642963395
Chinese: -22.686133859693925
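
The comparison above can be turned into a decision rule: assign the sentence to the model that gives it the highest log-probability. A minimal sketch, assuming only that each model exposes a `log_prob(sentence)` method as `it` and `ch` do above; the `classify` function and the `ConstantScorer` stand-in are hypothetical and the scores are made up for the demonstration.

```python
def classify(sentence, models):
    """Return the label whose model gives `sentence` the highest log-probability.

    `models` maps a label to any object exposing `log_prob(sentence) -> float`.
    """
    scores = {label: model.log_prob(sentence) for label, model in models.items()}
    best = max(scores, key=scores.get)
    return best, scores


class ConstantScorer:
    """Stand-in model for the demonstration: returns the same score for any input."""

    def __init__(self, score):
        self.score = score

    def log_prob(self, sentence):
        return self.score


# Made-up scores, not results from the recipe corpora
label, scores = classify("any sentence", {"a": ConstantScorer(-12.0), "b": ConstantScorer(-15.5)})
print(label, scores)
```

With the models trained above, the call would look like `classify(chinese_sentence, {"italian": it, "chinese": ch})`.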


## Combine languages

In [13]:
import copy

In [14]:
mix = copy.deepcopy(it)

In [15]:
mix.train(chinese_corpus)

100%|██████████| 24134/24134 [00:03<00:00, 6980.59it/s]
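
Training a copy of the Italian model on the Chinese corpus mixes the two languages only if `train` adds to the existing counts instead of resetting them. That is an assumption about `MarkovLM`, suggested by the cells above but not verified here. The idea can be sketched with plain bigram counts; `count_bigrams` is a hypothetical helper and the sentences are toy data.

```python
import copy
from collections import Counter


def count_bigrams(corpus, counts=None):
    """Add the bigram counts of `corpus` to `counts` (a new Counter if None)."""
    counts = Counter() if counts is None else counts
    for sentence in corpus:
        tokens = sentence.split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Toy data, for illustration only
italian_counts = count_bigrams(["boil the pasta", "drain the pasta"])
mix_counts = copy.deepcopy(italian_counts)  # leave the original untouched
count_bigrams(["steam the rice", "boil the rice"], counts=mix_counts)

print(italian_counts[("boil", "the")])  # 1
print(mix_counts[("boil", "the")])      # 2
```

The deep copy matters: without it, the second training pass would also change the counts of the original model.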


In [16]:
print("Mix: ", " ".join(mix.generate()).replace(" ##", ""))

Mix:  [#S] [#S] [#S] let the yeast mixture is foaming add the chicken and cook , stirring occasionally , over med heat until golden brown - they should still be quite moist at this stage with salt and pepper if needed to deglaze the skillet so you ' ll have about 1 cup of additional sauce . [#E]
