# Language Modeling 

- Predicting the next word in the sentence
- Developed for many different languages
- What word is likely to appear next?

<b> Example </b>

- I'd like to make a collect... 
    
        "call" is one prediction you can make, which means it may have a higher probability to be selected
        
<b> Training Corpus</b> 

- a collection (finite set) of text that we use to find probability distributions over the set of sentences in the corpus 

<b> Goal when training a language model </b>

- assigning higher probability to sentences that are more likely 
- assigning lower probability to sentences that are less likely 

<b> In the previous Jupyter Book, we got the probability distribution over a vocabulary set, like the example below</b> 

The random variable w can take one of these 5 values in the vocabulary set

vocabulary = {cat, dog, a, the, of}
               
             <.4, .3, .1, .1, .1> 
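
As a quick sanity check, the distribution above can be written as a Python dictionary. This is only a sketch: the name `vocab_probs` is chosen here for illustration, and pairing each word with its probability assumes the two lists are given in the same order.

```python
import math

# Probabilities from the example above, paired with the words
# in the order they are listed (an assumption).
vocab_probs = {"cat": 0.4, "dog": 0.3, "a": 0.1, "the": 0.1, "of": 0.1}

# A valid probability distribution is non-negative and sums to 1.
total = sum(vocab_probs.values())
is_valid = all(p >= 0 for p in vocab_probs.values()) and math.isclose(total, 1.0)

# The most likely value of the random variable w.
most_likely = max(vocab_probs, key=vocab_probs.get)
print(is_valid, most_likely)
```

`math.isclose` is used instead of `==` because adding floating-point numbers can leave a tiny rounding error.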
             

<b>Now we will get distributions of sentences over a training corpus</b>

Essentially, this means that if we had a corpus with 5 sentences, we would have to train a model that predicts the words in each sentence and assigns each sentence a probability of appearing in the corpus. 

For example: 

- I'd like to make a collect...

OR

- collect a make call I'd like to... 

Which is more likely to be a valid sentence in the English language? 

Remember that since we are getting a probability distribution over sentences, the probabilities of all sentences together should sum to 1. 

<b>Distribution over sentences</b>

![ps.png](attachment:ps.png)

# How To Train Your Language Model

<b>Naïve Language Model (Not the best model)</b>

- This approach computes the number of times a sentence occurs in a corpus
- We then say that the probability of a sentence is the count of that sentence divided by the total number of sentences

Let S be the collection of sentences in the corpus and s be a single sentence: 

P(s) = Count(s)/|S|, where |S| is the total number of sentences in the corpus

This isn't the best approach because each sentence is treated as a single, indivisible unit, as if it were one word. That limits the model and makes it underestimate the probability of many valid sentences. Remember, a language model is used to predict the probability of a word or sentence occurring in a language. 

In particular, this model cannot determine the probability of new sentences: any sentence that never appears in the corpus has a count of 0 and therefore a probability of 0. This means the model does not generalize to new sentences. 
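
The naïve model can be sketched in a few lines of Python. The function name `naive_sentence_model` and the toy corpus are made up for illustration and are not part of the original material.

```python
from collections import Counter

def naive_sentence_model(corpus):
    """Return P(s) = Count(s) / (total number of sentences) for each sentence."""
    counts = Counter(corpus)
    total = len(corpus)
    return {sentence: count / total for sentence, count in counts.items()}

# Hypothetical toy corpus of 5 sentences (made up for illustration).
toy_corpus = [
    "i would like to make a collect call",
    "i would like to make a collect call",
    "i am here",
    "i can help",
    "i like cats",
]

model = naive_sentence_model(toy_corpus)
seen = model["i would like to make a collect call"]     # 2 / 5
unseen = model.get("i would like to make a call", 0.0)  # never observed, so 0
print(seen, unseen, sum(model.values()))
```

The unseen sentence is perfectly reasonable English, yet the model gives it probability 0 simply because it never appeared word for word in the corpus.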

<b>Let's look at a different approach</b>

We will take our first example sentence

- I'd like to make a collect...

And tokenize it (expanding "I'd" into "I would" and completing the sentence with "call") to compute its probability

- p(s) = p(I, would, like, to, make, a, collect, call)

We would like to estimate the probability of this sentence. In this example, we will take the joint probability of all the words occurring in this sentence, in this order.

One thing before computing p(s)... we need to learn the chain rule

<b> Chain rule </b>

- The chain rule allows you to write the joint probability of tokens or variables that occur in order as a product of conditional probabilities.

Recall: P(A,B) = P(A) * P(B|A). Do you remember what this formula is? It's the <b>PRODUCT RULE</b>: P(A ∩ B) = P(A) * P(B|A)
    
Okay, so we have P(A,B) = P(A)* P(B|A)... but what about:

- P(A,B,C)??? 

That's easy: using the chain rule, we can expand it as a product this way: 

- P(A,B,C) =  P(A)* P(B|A) * P(C| A,B) 

Another way to write this is:

- P(A,B,C) =  P(B)* P(A|B) * P(C| A,B) 

or 

- P(A,B,C) =  P(C)* P(A|C) * P(B| A,C)
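
To check the three-variable chain rule numerically, here is a minimal sketch. The joint distribution is hypothetical (arbitrary weights over three binary variables), and the helper names `marginal` and `chain_rule` are chosen only for this example.

```python
from itertools import product

# Hypothetical joint distribution over three binary variables (A, B, C).
# The weights are arbitrary; dividing by their sum makes them sum to 1.
weights = [1, 2, 3, 4, 5, 6, 7, 8]
outcomes = list(product([0, 1], repeat=3))
joint = {o: w / sum(weights) for o, w in zip(outcomes, weights)}

def marginal(fixed):
    """Sum the joint over all outcomes that agree with `fixed` ({index: value})."""
    return sum(p for o, p in joint.items()
               if all(o[i] == v for i, v in fixed.items()))

def chain_rule(a, b, c):
    """P(A) * P(B|A) * P(C|A,B), each conditional taken as a ratio of marginals."""
    p_a = marginal({0: a})
    p_b_given_a = marginal({0: a, 1: b}) / p_a
    p_c_given_ab = marginal({0: a, 1: b, 2: c}) / marginal({0: a, 1: b})
    return p_a * p_b_given_a * p_c_given_ab

# The chain-rule product should match the joint probability for every outcome.
max_error = max(abs(chain_rule(*o) - joint[o]) for o in outcomes)
print(max_error)
```

The intermediate marginals cancel when the factors are multiplied, which is why the product recovers the joint probability regardless of the order in which the variables are expanded.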



<b> Back to our tokenized example using the chain rule</b>

- p(s) = p(I, would, like, to, make, a, collect, call)


<b>p(I, would, like, to, make, a, collect, call)</b>=
<br>
<br> p(I)* 
<br> p(would|I)* 
<br>p(like|I, would)* 
<br>p(to| I, would, like)* 
<br>p(make| I, would, like, to)* 
<br>p(a| I, would, like, to, make)* 
<br>p(collect|I, would, like, to, make, a)* 
<br>p(call|I, would, like, to, make, a, collect)
    
This is the chain rule being applied to get the joint probability of all the words in the sentence: each word is conditioned on every word that comes before it
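
The expansion above follows a simple pattern that can be generated for any token list. The function name `chain_rule_factors` is hypothetical, and the sketch only builds the factors as strings; it does not estimate any probabilities.

```python
def chain_rule_factors(tokens):
    """List the factor p(w_i | w_1, ..., w_{i-1}) for each token, in order."""
    factors = []
    for i, word in enumerate(tokens):
        history = tokens[:i]  # every word that comes before the current one
        if history:
            factors.append(f"p({word}|{', '.join(history)})")
        else:
            factors.append(f"p({word})")  # the first word has no history
    return factors

tokens = ["I", "would", "like", "to", "make", "a", "collect", "call"]
factors = chain_rule_factors(tokens)
print(" * ".join(factors))
```

Note how the history grows by one word with every factor, so the last factor conditions on the entire rest of the sentence.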
    
<b> Let's get an estimate of each probability </b>

<br> p(I) = ???
- simply put, it's just count(I)/Total_Words, where Total_Words is the total number of word tokens in the corpus
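
A minimal sketch of this count-based estimate, assuming whitespace tokenization and a made-up three-sentence corpus (both are assumptions for illustration, as is the function name `unigram_probability`):

```python
from collections import Counter

def unigram_probability(word, sentences):
    """Estimate p(word) as count(word) / total number of tokens."""
    tokens = [tok for sentence in sentences for tok in sentence.split()]
    return Counter(tokens)[word] / len(tokens)

# Hypothetical toy corpus; splitting on whitespace is a simplifying assumption.
toy_corpus = ["I would like tea", "I am here", "you like tea"]
p_i = unigram_probability("I", toy_corpus)  # 2 occurrences out of 10 tokens
print(p_i)
```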

<br> p(would|I) = 
- to calculate this, we count how often each word follows "I" in the training corpus, then divide the count for "would" by the total count of "I" followed by any word

Example counts of sentences starting with "I" in the training corpus, grouped by the word that follows: 

- I would: 4
- I am: 20
- I can: 10
- I should: 15
- I may: 3
- I like: 5

Total count of "I" followed by any word = 4 + 20 + 10 + 15 + 3 + 5 = 57

So p(would|I) = 4/57 ≈ 0.07
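
The same calculation in Python, using the counts listed above. The variable names are chosen here for illustration only.

```python
# Counts of the word that follows "I", copied from the list above.
next_word_counts = {"would": 4, "am": 20, "can": 10, "should": 15, "may": 3, "like": 5}

total_after_i = sum(next_word_counts.values())  # 57

# p(w | I) = count(I, w) / count(I followed by any word)
p_given_i = {w: c / total_after_i for w, c in next_word_counts.items()}
p_would_given_i = p_given_i["would"]  # 4 / 57
print(total_after_i, round(p_would_given_i, 4))
```

Because every count is divided by the same total, the conditional probabilities of all words that follow "I" sum to 1.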

<br>p(like|I, would) =

<br>p(to| I, would, like) =

<br>p(make| I, would, like, to) =

<br>p(a| I, would, like, to, make) =

<br>p(collect|I, would, like, to, make, a) =

<br>p(call|I, would, like, to, make, a, collect) =
 
    