# Representing Words as Probabilities

We can represent a sentence (a sequence of words) mathematically as 

$$
\begin{equation}
w = \{{w_0, w_1, w_2, \dots,w_{s-1}}\}
\end{equation}
$$

Here, **$s$** represents the total number of words in the sentence. **$w_{0}$** represents the first word in the sentence, **$w_{1}$** represents the second word in the sentence, and so on.

# Exercise:

`Older people, like everyone else, can benefit from accessing ride-sharing, but many are not comfortable with smart-phones.`

You can ignore punctuation and capitalization for now.

1. What is $s$?
2. What is $w_4$? What is $w_6$?
3. What is $V$ (this corpus' vocabulary size, assuming this is the only sentence in the corpus)? You can do this the hard way, by counting manually.

```python
import re  # regular expressions give a concise way to strip punctuation

sentence = "Older people, like everyone else, can benefit from accessing ride-sharing, but many are not comfortable with smart-phones"

# Strip punctuation from each whitespace-separated token, then lowercase it.
words = [re.sub(r"[^\w\s]", "", word).lower() for word in sentence.split()]
vocabulary = set(words)  # a set keeps one copy of each distinct word
print("The size of the vocabulary is {} words".format(len(vocabulary)))
```

# Independence

In statistics, two events are independent if the outcome of one event does not affect the probability of the outcomes of another event.

You will also often see this written as 
$$
\begin{equation}
P(A,B) = P(A) \cdot P(B)
\end{equation}
$$

In other words, event A is independent of event B if the **probability of event A and event B happening together** is equal to **the probability of event A multiplied by the probability of event B**.
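As a small numerical check of this definition, the sketch below enumerates the outcomes of rolling two fair six-sided dice. The dice setup and the events A, B, and C are illustrative assumptions, not part of the text above. It tests whether the joint probability equals the product of the individual probabilities.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice (hypothetical example).
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event, given as a predicate over (die1, die2)."""
    favorable = sum(1 for o in outcomes if event(o))
    return Fraction(favorable, len(outcomes))

a = lambda o: o[0] % 2 == 0      # A: first die is even
b = lambda o: o[1] > 4           # B: second die shows 5 or 6
c = lambda o: o[0] + o[1] == 12  # C: the two dice sum to 12

p_a, p_b, p_c = prob(a), prob(b), prob(c)
p_ab = prob(lambda o: a(o) and b(o))
p_ac = prob(lambda o: a(o) and c(o))

print(p_ab == p_a * p_b)  # True: A and B concern different dice
print(p_ac == p_a * p_c)  # False: a sum of 12 forces the first die to be 6
```

Exact fractions are used instead of floats so that the equality test is not affected by rounding.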

# Bigram Model

A bigram is a group of two tokens (frequently words) that are treated as one distinct entity. For instance, the distinct bigrams in the sentence `I am home now` would be
```python
bigrams = [
    ("I", "am"),
    ("am", "home"),
    ("home", "now")
]
```


### Exercise:
Write a Python function to find all the bigrams in the sentence
`In recent years, Johnson & Johnson has been focusing more on its high-margin pharmaceutical segment via acquisitions.`

**Hints**:
- split the sentence into a list of individual words (`my_sentence.split()`)
- remove punctuation
- lowercase all the letters
- use a for loop to iterate through this list, getting the **i-th** and **i + 1-th** elements of the list

**Challenge**:
Generalize this function to work with `n-grams`.

Suggested signature: `create_bigrams(sentence: List[str]) -> List[Tuple[str, str]]`
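If you want to compare your answer afterwards, here is one possible sketch that follows the hints and also covers the challenge. The names `tokenize` and `create_ngrams` are hypothetical helpers chosen for illustration. Stripping every non-word character is a simplifying assumption: it drops the standalone `&` and merges hyphenated words such as `high-margin` into a single token.

```python
import re
from typing import List, Tuple

def tokenize(sentence: str) -> List[str]:
    """Lowercase, strip punctuation, and split on whitespace (a simplification)."""
    cleaned = (re.sub(r"[^\w\s]", "", word).lower() for word in sentence.split())
    return [word for word in cleaned if word]  # drop tokens that were only punctuation

def create_ngrams(words: List[str], n: int = 2) -> List[Tuple[str, ...]]:
    """Return every run of n consecutive tokens in `words`."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # range() is empty when the sentence is shorter than n, giving an empty list.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

tokens = tokenize("I am home now")
print(create_ngrams(tokens, 2))
print(create_ngrams(tokens, 3))
```

With `n=2` this reproduces the bigram list shown earlier for `I am home now`, in lowercase.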



## Language Model

Are words in a sentence independent of each other? In other words, does knowing that the first word is `The` change your belief about which word is likely to follow?

Which of the following sentences is more likely?

```python
sentence_A = "Jack went to Wal-Mart."
sentence_B = "at and the be of I"
```
Notice that all the words in sentence B come from Wikipedia's list of the [most common words in the English language](https://en.wikipedia.org/wiki/Most_common_words_in_English). Yet we intuitively know that the sentence is nonsensical and is unlikely to be seen in natural language.

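We can express the likelihood of a sentence $w$ as $p(w)$. Under a bigram model, where each word depends only on the word before it, we define it as
$$
\begin{equation}
p(w) = \prod_{i=0}^{s-1}p(w_{i}|w_{i-1})
\end{equation}
$$

Here, $w_{-1}$ stands for a start-of-sentence marker, so the first factor is the probability that $w_0$ begins a sentence.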
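A minimal sketch of a bigram likelihood follows. It assumes each conditional probability is estimated by relative counts, $\text{count}(w_{i-1}, w_i) / \text{count}(w_{i-1})$, which is one common choice that the text above does not prescribe. The `<s>` start marker, the function names, and the three-sentence toy corpus are all hypothetical.

```python
from collections import Counter
from typing import List

START = "<s>"  # hypothetical start-of-sentence marker

def train_bigram_counts(sentences: List[List[str]]):
    """Count contexts and bigrams over tokenized sentences."""
    context_counts, bigram_counts = Counter(), Counter()
    for tokens in sentences:
        padded = [START] + tokens
        for prev, curr in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, curr)] += 1
    return context_counts, bigram_counts

def sentence_likelihood(tokens, context_counts, bigram_counts) -> float:
    """Product of count(prev, curr) / count(prev) over the sentence."""
    likelihood = 1.0
    padded = [START] + tokens
    for prev, curr in zip(padded, padded[1:]):
        if context_counts[prev] == 0:
            return 0.0  # unseen context: treated as zero under this estimate
        likelihood *= bigram_counts[(prev, curr)] / context_counts[prev]
    return likelihood

corpus = [["i", "am", "home"], ["i", "am", "here"], ["you", "are", "home"]]
contexts, bigrams = train_bigram_counts(corpus)
likely = sentence_likelihood(["i", "am", "home"], contexts, bigrams)
unlikely = sentence_likelihood(["home", "am", "i"], contexts, bigrams)
print(likely, unlikely)
```

The same three words score very differently depending on their order, which is exactly what an independence assumption cannot capture.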

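If we want to generalize this to an **N-Gram** model, each word is conditioned on the $n-1$ words before it:

$$
\begin{equation}
p(w) = \prod_{i=0}^{s-1}p(w_i |w_{i-n+1}, w_{i-n+2}, \dots, w_{i-1})
\end{equation}
$$
$$
\begin{equation}
p(w) = \prod_{i=0}^{s-1}p(w_i | w_{i-n+1}^{i-1})
\end{equation}
$$

Here, $w_{i-n+1}^{i-1} = w_{i-n+1}, w_{i-n+2}, \dots, w_{i-1}$, and any position before the start of the sentence is filled with a start-of-sentence marker.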

## Model Evaluation: Choosing n in an n-Gram Model

- the larger the dataset and, by implication, the richer the corpus, the larger the $n$ we can likely try.
- in practice, $n = 2$, $n = 3$, $n = 4$ work well. Larger values of $n$ tend to overfit (and can be extremely expensive computationally). Remember the **bias-variance** tradeoff:

![http://scott.fortmann-roe.com/docs/BiasVariance.html](images/biasvariance.png)

Here, as $n \rightarrow \infty$, model complexity increases dramatically.
- **tune $n$ based on the performance of the downstream model**: usually n-gram models are the first step in a broader sentiment analysis prediction model, or topic modelling model, recommendation system, or sequence-to-sequence translation task.
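To get a feel for why large $n$ is risky, the sketch below compares the number of distinct n-grams actually observed in a text with the number of n-grams that are possible in principle, $V^n$. It reuses the ride-sharing sentence from the first exercise, cleaned by hand, which is an illustrative choice. The variable names are hypothetical.

```python
# The exercise sentence with punctuation and hyphens already removed.
tokens = (
    "older people like everyone else can benefit from accessing "
    "ridesharing but many are not comfortable with smartphones"
).split()

vocab_size = len(set(tokens))
observed_counts = []

for n in range(1, 5):
    # Distinct n-grams that actually occur in the text.
    observed = len({tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)})
    possible = vocab_size ** n  # every combination of n vocabulary words
    observed_counts.append(observed)
    print(f"n={n}: observed {observed} of {possible} possible n-grams")
```

The number of possible n-grams grows exponentially with $n$, while the number observed can never exceed the length of the text, so almost every n-gram a large-$n$ model could encounter has never been seen in training.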

### Perplexity

Look again at the definition of likelihood for a particular sentence:

$$
\begin{equation}
L = p(w) = \prod_{i=0}^{s-1}p(w_i | w_{i-n+1}, \dots, w_{i-1})
\end{equation}
$$

Which sentence has a higher perplexity?

##### Sentence A:
> *I love to eat.*

##### Sentence B:
> *My escort was an exceptionally genial sixty-seven-year-old man named Don Seely, an electrical engineer who said that he was between jobs and using the unwanted free time to volunteer his services to the Northern Kentucky Tea Party, the rally’s host organization, as a Webmaster.*

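A common way of quantifying how well your n-gram model fits a test corpus, while accounting for test corpora of different sizes, is to use **perplexity**. Because the likelihood is a product of probabilities that are each at most 1, longer texts tend to receive lower likelihoods no matter how natural they are. Perplexity corrects for this by taking the $N$-th root of the likelihood and inverting it:

$$
\begin{equation}
P = \frac{1}{\sqrt[N]{p(w)}}
\end{equation}
$$
$N$ is the total number of words in the test corpus, and a lower perplexity means a better fit. We typically use perplexity, instead of simply likelihood, as the overall model evaluation metric because of this length normalization. Even so, **in order to compare two different models**, they should be evaluated on the same test corpus and vocabulary.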

Suggested signature: `perplexity(likelihood, N)`
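One way this function might look is sketched below, directly following the formula $P = 1 / \sqrt[N]{p(w)}$. The log-space variant `perplexity_from_log` is an added assumption, not something the text asks for. It is included because multiplying many small probabilities can underflow to zero in floating point.

```python
import math

def perplexity(likelihood: float, N: int) -> float:
    """Perplexity = likelihood ** (-1 / N); infinite when the likelihood is 0."""
    if N <= 0:
        raise ValueError("N must be a positive word count")
    if likelihood == 0:
        return math.inf
    return likelihood ** (-1.0 / N)

def perplexity_from_log(log_likelihood: float, N: int) -> float:
    """Same quantity computed from a natural-log likelihood."""
    if N <= 0:
        raise ValueError("N must be a positive word count")
    return math.exp(-log_likelihood / N)

print(perplexity(1 / 3, 3))
print(perplexity_from_log(math.log(1 / 3), 3))
print(perplexity(0.0, 4))
```

The first two calls agree up to floating-point rounding, and the third shows that a likelihood of zero yields an infinite perplexity.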




## Dealing with Out-of-Vocabulary Words

Let's pretend that our training corpus is

> *This is mistaken logic. It is true that a high variance and low bias model can perform well in some sense.*

Our test corpus is

> *This **is not** true.*

Assuming a bi-gram model is used, what is the **perplexity** of our model? 

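We don't need to count each bi-gram. **The answer is infinity**. Why? The bi-gram `is not` never occurs in the training corpus, so its estimated probability is zero:
$$
\begin{equation}
p(w_i = \text{not} \mid w_{i-1} = \text{is}) = 0
\end{equation}
$$

A single zero factor makes the whole likelihood $p(w)$ zero, so the perplexity $1 / \sqrt[N]{p(w)}$ divides by zero.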
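The zero can be confirmed by counting. The sketch below assumes probabilities are estimated by relative bigram counts and, as a simplification, treats the training corpus as one continuous token stream, so a bigram may span the sentence boundary. The `tokenize` helper is hypothetical.

```python
import re
from collections import Counter

train = ("This is mistaken logic. It is true that a high variance and "
         "low bias model can perform well in some sense.")
test = "This is not true."

def tokenize(text):
    """Strip punctuation, lowercase, and split on whitespace."""
    return re.sub(r"[^\w\s]", "", text).lower().split()

train_tokens = tokenize(train)
bigram_counts = Counter(zip(train_tokens, train_tokens[1:]))
context_counts = Counter(train_tokens[:-1])  # words that have a successor

test_tokens = tokenize(test)
for prev, curr in zip(test_tokens, test_tokens[1:]):
    seen = context_counts[prev]
    # Relative-count estimate; an unseen context is treated as probability 0.
    estimate = bigram_counts[(prev, curr)] / seen if seen else 0.0
    print(f"p({curr} | {prev}) = {estimate}")
```

Only the first test bigram, `this is`, was seen in training. The other two involve `not`, which never appears in the training corpus at all.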

What you can do instead:

* Look at the **frequency distribution** of words in your corpus
* Decide upon some **threshold cutoff**, where every word below that threshold frequency will be converted into an `UNKNOWN` token. Now, whenever a new word appears that is out of vocabulary, you simply convert it into `UNKNOWN` and run the tests as usual.
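The two steps above can be sketched as follows. The threshold `min_count=2` and the helper names `build_vocabulary` and `replace_rare` are illustrative assumptions. In practice you would choose the cutoff after inspecting the frequency distribution of your own corpus.

```python
from collections import Counter
from typing import List, Set

UNKNOWN = "UNKNOWN"

def build_vocabulary(tokens: List[str], min_count: int = 2) -> Set[str]:
    """Keep words that appear at least `min_count` times in the training tokens."""
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count >= min_count}

def replace_rare(tokens: List[str], vocabulary: Set[str]) -> List[str]:
    """Map every out-of-vocabulary word to the UNKNOWN token."""
    return [word if word in vocabulary else UNKNOWN for word in tokens]

train_tokens = (
    "this is mistaken logic it is true that a high variance and "
    "low bias model can perform well in some sense"
).split()
vocabulary = build_vocabulary(train_tokens, min_count=2)

train_mapped = replace_rare(train_tokens, vocabulary)
test_mapped = replace_rare("this is not true".split(), vocabulary)
print(test_mapped)
```

Because the same mapping is applied to both training and test text, bigrams such as `("is", "UNKNOWN")` now have non-zero counts in training, and the test perplexity is finite. On a corpus this small almost every word falls below the cutoff, so the example only shows the mechanics.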