# Week 4 Overview

During this week's lessons, you will learn probabilistic retrieval models and statistical language models, particularly the detail of the query likelihood retrieval function with two specific smoothing methods, and how the query likelihood retrieval function is connected with the retrieval heuristics used in the vector space model.

## Key Phrases and Concepts

* p(R=1|q,d) ; query likelihood, p(q|d)
* Statistical and unigram language models
* Maximum likelihood estimate
* Background, collection, and document language models
* Smoothing of unigram language models
* Relation between query likelihood and TF-IDF weighting
* Linear interpolation (i.e., Jelinek-Mercer) smoothing
* Dirichlet Prior smoothing

## Goals and Objectives

* Explain how to interpret p(R=1|q,d) and estimate it based on a large set of collected relevance judgments (or clickthrough information) about query q and document d.
* Explain how to interpret the conditional probability p(q|d) used for scoring documents in the query likelihood retrieval function.
* Explain what a statistical language model and a unigram language model are.
* Explain how to compute the maximum likelihood estimate of a unigram language model.
* Explain how to use unigram language models to discover semantically related words.
* Compute p(q|d) based on a given document language model p(w|d).
* Explain what smoothing does.
* Show that query likelihood retrieval function implements TF-IDF weighting if we smooth the document language model p(w|d) using the collection language model p(w|C) as a reference language model.
* Compute the estimate of p(w|d) using Jelinek-Mercer (JM) smoothing and Dirichlet Prior smoothing, respectively.

## Guiding Questions

A very good note with practical examples can be found on the [Elasticsearch blog](https://www.elastic.co/blog/language-models-in-elasticsearch).

### Given a table of relevance judgments in the form of three columns (query, document, and binary relevance judgments), how can we estimate p(R=1|q,d)?

Let:

- $Q$ be the set of queries, $Q = \{q_1, ..., q_n\}$
- $D$ be the set of documents, $D = \{d_1, ..., d_m\}$
- $R$ be a binary random variable denoting relevance, $R \in \{0,1\}$

We have:

- $U$, the set of users, $U = \{u_1, ..., u_k\}$
- The value of $R$ for each pair $(q,d)$ is judged by users (explicitly, or implicitly through clickthroughs)
- $f(q,d)$ is our ranking function

Then:

- $f(q,d) = P(R=1|q,d)$
- $P(R=1|q,d)$ is the probability that document $d$ is relevant to query $q$. It is estimated as the fraction of the judgments collected for the pair $(q,d)$ that are relevant:

$$P(R=1|q,d) = \frac{count(q,d,R=1)}{count(q,d)}$$

With the constraint:

- $P(R=1|q,d) + P(R=0|q,d) = 1$
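
The counting estimate above can be sketched in a few lines of Python. The judgment table and the function name `estimate_relevance` are hypothetical, chosen only for illustration:

```python
from collections import defaultdict

def estimate_relevance(judgments):
    """Estimate P(R=1|q,d) as the fraction of judgments for (q, d) with R=1."""
    relevant = defaultdict(int)  # count(q, d, R=1)
    total = defaultdict(int)     # count(q, d)
    for query, doc, r in judgments:
        total[(query, doc)] += 1
        relevant[(query, doc)] += r
    return {pair: relevant[pair] / total[pair] for pair in total}

# Hypothetical judgment table: (query, document, binary relevance)
judgments = [
    ("q1", "d1", 1), ("q1", "d1", 1), ("q1", "d1", 0),
    ("q1", "d2", 0), ("q1", "d2", 1),
    ("q2", "d1", 0),
]
p_rel = estimate_relevance(judgments)
print(p_rel[("q1", "d1")])  # 2 of the 3 judgments for (q1, d1) are relevant
```

Note that the estimate is only defined for pairs that appear in the table: a pair such as `("q2", "d2")` has no judgments, so nothing can be said about it from counting alone.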

### How should we interpret the query likelihood conditional probability p(q|d)?

Problem:

- In practice, we cannot afford completely labeled data for every triple $(q,d,R)$.
- We need a way to match a new, unseen query $q$ with relevant documents $\{d_1, ..., d_n\}$.

Assume:

- We already have data about user preferences based on click histories on particular documents. In other words, we know what kinds of documents a user likes.

Then:

- **Query likelihood** $P(q|d)$ is an approximation of $P(R=1|q,d)$: it is the probability that a user who likes document $d$ would pose the query $q$, i.e. $P(q|d,R=1)$.

### What is a statistical language model? What is a unigram language model? How many parameters are there in a unigram language model?

- A **statistical language model** is a probability distribution over word sequences. In general it is sensitive to word order, which means that $P(w_1, w_2, w_3)$ and $P(w_3, w_1, w_2)$ can differ. It is also called a **generative model** because it can generate word sequences, for example with a probabilistic *finite automaton*, and the generated sequences can be used for tasks such as *auto completion*. It can also help *speech recognition* predict the next word, since intuitively the best next word is the one with the highest probability in the context of the previous words. By the chain rule, the probability of a word sequence is:

$$P(w_1, ..., w_n) = \prod_{i=1}^{n} P(w_i|w_1, ..., w_{i-1})$$

- A **unigram language model** assumes that each word is generated independently, so the probability of a word sequence is the product of the individual word probabilities:

$$P(w_1, ..., w_n) = \prod_{i=1}^{n} P(w_i), \quad \sum_{w \in V} P(w) = 1$$

where $V$ is the vocabulary and $N = |V|$ is the vocabulary size. Since the model stores one probability per word, it has $N$ parameters $P(w_i)$, of which $N-1$ are free because they must sum to 1.
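
As a small illustration, the sketch below scores a word sequence under a unigram model. The model probabilities are made up for the example. Because each word is generated independently, reordering the words does not change the probability:

```python
import math

# Hypothetical unigram model: one probability per vocabulary word, summing to 1.
unigram = {"text": 0.4, "mining": 0.3, "retrieval": 0.2, "the": 0.1}

def sequence_probability(words, model):
    """Probability of a word sequence: the product of the word probabilities."""
    return math.prod(model[w] for w in words)

p1 = sequence_probability(["text", "mining", "retrieval"], unigram)
p2 = sequence_probability(["retrieval", "text", "mining"], unigram)
print(p1, p2)        # equal up to floating-point rounding
print(len(unigram))  # number of parameters = vocabulary size
```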

### How do we compute the maximum likelihood estimate of the unigram language model (based on a text sample)?

Let:

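- $d$ be a document containing the text sample, $c(w,d)$ the count of word $w$ in $d$, and $|d|$ the total number of words in $d$

Then:

- The maximum likelihood (ML) estimate is:

$$P(w|\theta) = P(w|d) = \frac{c(w,d)}{|d|}$$

- For the sequence of words $w_1, ..., w_n$ in a query $q$, the likelihood is:

$$\eqalign{
    P(q|\theta) &= \prod_{i=1}^{n} P(w_i|d)\\
                &= \prod_{i=1}^{n} \frac{c(w_i,d)}{|d|}
}$$

- Taking the logarithm avoids numerical underflow. With $c(w,q)$ the count of word $w$ in $q$:

$$\log P(q|\theta) = \sum\limits_{w \in V} c(w,q) \log P(w|d)$$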
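
The estimate and the log-likelihood above can be sketched as follows. The toy document and the function names are assumptions made for illustration. The sketch also shows that an unsmoothed ML estimate gives a query containing an unseen word a probability of zero (a log-likelihood of negative infinity):

```python
import math
from collections import Counter

def mle_unigram(tokens):
    """Maximum likelihood estimate: P(w|d) = c(w,d) / |d|."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

def log_query_likelihood(query_tokens, p_w_d):
    """log P(q|d) = sum over query words of c(w,q) * log P(w|d)."""
    score = 0.0
    for w, c_wq in Counter(query_tokens).items():
        p = p_w_d.get(w, 0.0)
        if p == 0.0:
            return float("-inf")  # an unseen word makes P(q|d) = 0
        score += c_wq * math.log(p)
    return score

doc = "text mining and text retrieval".split()
p_w_d = mle_unigram(doc)
print(p_w_d["text"])                                      # c = 2, |d| = 5
print(log_query_likelihood(["text", "retrieval"], p_w_d))
print(log_query_likelihood(["text", "clustering"], p_w_d))
```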

### What is a background language model? What is a collection language model? What is a document language model?

- **Background language model**: A model $M_B$ that gives the general probability of a word $P(w|M_B)$, based on general text in a specific language, for example an English corpus or resources such as a dictionary or WordNet.
- **Collection language model**: A model $M_C$ that gives the probability of a word $P(w|M_C)$ based on all the documents in the collection.
- **Document language model**: A model $M_D$ that gives the probability of a word $P(w|M_D)$ based only on one specific document.
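
To make the difference between the document and collection models concrete, here is a minimal sketch with a hypothetical two-document collection. Both models use the same maximum likelihood estimate and differ only in the text they are estimated from:

```python
from collections import Counter

def unigram_mle(tokens):
    """Relative word frequencies in the given token list."""
    counts = Counter(tokens)
    return {w: c / len(tokens) for w, c in counts.items()}

# Hypothetical toy collection
docs = {
    "d1": "text mining and text retrieval".split(),
    "d2": "web search and retrieval".split(),
}

# Document language models: one model per document
doc_models = {name: unigram_mle(tokens) for name, tokens in docs.items()}

# Collection language model: pool the tokens of all documents
collection_tokens = [w for tokens in docs.values() for w in tokens]
collection_model = unigram_mle(collection_tokens)

print(doc_models["d1"]["text"])      # estimated from d1 only
print(collection_model["text"])      # estimated from the whole collection
print("search" in doc_models["d1"])  # "search" never occurs in d1
```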

### Why do we need to smooth a document language model in the query likelihood retrieval model? What would happen if we don’t do smoothing?

- We need to smooth the document language model because a query may contain words that do not occur in the document (unseen words). We assume that the probability of an unseen word is proportional to its probability under a reference language model. So we:
    - Assign a probability proportional to $P(w|M_C)$ to each word not found in the document (unseen word).
    - Discount the maximum likelihood estimate $P(w|M_D)$ of each word found in the document, so that all probabilities still sum to 1.
- Without smoothing, the probability of an unseen word is zero, so a document missing even one query word gets $P(q|d) = 0$, and the probabilities of the words that do occur in the document are overestimated.

### When we smooth a document language model using a collection language model as a reference language model, what is the probability assigned to an unseen word in a document?

In retrieval, the collection language model is a natural choice for the reference language model. An unseen word, i.e. a word that does not occur in the document, is assigned a probability proportional to its probability under the collection language model. The word probability is then:

$$P(w|d) = \begin{cases}
P_{seen}(w|d), & \text{if } w \text{ is seen in } d\\
\alpha_d P(w|C), & \text{otherwise}
\end{cases}$$

where $\alpha_d$ is a coefficient that controls the probability mass assigned to unseen words and $C$ is the collection.

### How can we prove that the query likelihood retrieval function implements TF-IDF weighting if we use a collection language model smoothing?

Use the log query likelihood as the ranking function. Then:
$$\eqalign{
    \log P(q|d) &= \sum\limits_{w \in V} c(w,q) \log P(w|d)\\
               &= \text{query words matched in } d + \text{query words not matched in } d\\
               &= \sum\limits_{w \in V, c(w,d)>0} c(w,q) \log P_{seen}(w|d) + \sum\limits_{w \in V, c(w,d)=0} c(w,q) \log \big(\alpha_d P(w|C)\big)\\
               &= \sum\limits_{w \in V, c(w,d)>0} c(w,q) \log P_{seen}(w|d) + \big(\text{all query words} - \text{query words matched in } d \big)\\
               &= \sum\limits_{w \in V, c(w,d)>0} c(w,q) \log P_{seen}(w|d) + \big(\sum\limits_{w \in V} c(w,q) \log \big(\alpha_d P(w|C)\big) - \sum\limits_{w \in V, c(w,d)>0} c(w,q) \log \big(\alpha_d P(w|C)\big) \big)\\
               &= \sum\limits_{w \in V, c(w,d)>0} c(w,q) \log \frac{P_{seen}(w|d)}{\alpha_d P(w|C)} + |q| \log \alpha_d + \sum\limits_{w \in V} c(w,q) \log P(w|C)
}$$

The decomposition can be seen in an image below:

![query likehood heuristics](images/query-likehood-heuristics.png)

So **our ranking function becomes similar to the vector space model, with the difference that the TF-IDF weighting and document length normalization are derived from probabilities, with a penalization parameter $\alpha_d$ that is high for short documents and low for long documents**. In the first sum, $P_{seen}(w|d)$ grows with the term frequency $c(w,d)$, and dividing by $P(w|C)$ penalizes words that are common in the collection, which acts like IDF. We can ignore the last sum because it does not depend on the document, so it is the same for every document and does not affect the ranking. We keep the second term, $|q| \log \alpha_d$, because $\alpha_d$ generally depends on the document; it plays the role of document length normalization, which is needed for ranking accuracy. For example, if we have many short documents and only four digits of precision, two scores such as $0.00013$ and $0.00014$ may become indistinguishable.
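
The decomposition can be checked numerically. The sketch below assumes, purely for illustration, a toy collection and an interpolation-style smoothing with a fixed coefficient $\lambda = 0.5$ (so that $\alpha_d = \lambda$). It computes the log query likelihood directly and through the three-term decomposition:

```python
import math
from collections import Counter

lam = 0.5  # assumed smoothing coefficient, so alpha_d = lam
doc = "text mining and text retrieval".split()
collection = doc + "web search and retrieval".split()
query = "text search retrieval".split()  # "search" is unseen in doc

c_d, c_q = Counter(doc), Counter(query)
p_C = {w: c / len(collection) for w, c in Counter(collection).items()}
alpha_d = lam

def p_seen(w):
    """Smoothed probability of a word that occurs in the document."""
    return (1 - lam) * c_d[w] / len(doc) + lam * p_C[w]

# Direct computation: seen words use p_seen, unseen words use alpha_d * P(w|C)
direct = sum(
    c * math.log(p_seen(w) if c_d[w] > 0 else alpha_d * p_C[w])
    for w, c in c_q.items()
)

# Decomposition: matched-term sum + length normalization + collection term
matched = sum(
    c * math.log(p_seen(w) / (alpha_d * p_C[w]))
    for w, c in c_q.items() if c_d[w] > 0
)
length_norm = len(query) * math.log(alpha_d)
collection_term = sum(c * math.log(p_C[w]) for w, c in c_q.items())
decomposed = matched + length_norm + collection_term

print(direct, decomposed)  # the two values agree up to rounding
```

Only `matched` and `length_norm` depend on the document, which is why `collection_term` can be dropped when ranking.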

### How does linear interpolation (Jelinek-Mercer) smoothing work? What is the formula?

**Jelinek-Mercer smoothing** is a linear interpolation between the maximum likelihood estimate and the collection language model, controlled by a fixed smoothing parameter $\lambda \in [0,1]$:
- A high $\lambda$ increases the importance of the collection model and diminishes the importance of the document model. A higher $\lambda$ may be useful for longer queries.
- A low $\lambda$ gives priority to the document model. A lower $\lambda$ may be useful for shorter queries.

The Jelinek-Mercer smoothing formula is:

$$P_{seen}(w|d) = (1-\lambda)\ P_{MLE}(w|d) + \lambda \ P(w|C)$$

where $P_{MLE}(w|d) = \frac{c(w,d)}{|d|}$ is the maximum likelihood estimate, which is zero if the word does not appear in the document.

For a whole query $q = w_1, ..., w_n$, the formula becomes:
$$P(q|d) = \prod_{i=1}^{n} \big[(1-\lambda)\ P_{MLE}(w_i|d) + \lambda \ P(w_i|C)\big]$$
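
A minimal sketch of the Jelinek-Mercer estimate, using a hypothetical toy collection and $\lambda = 0.5$ chosen only for illustration:

```python
from collections import Counter

def jm_prob(word, doc_counts, doc_len, p_collection, lam):
    """(1 - lam) * P_MLE(w|d) + lam * P(w|C)."""
    p_mle = doc_counts.get(word, 0) / doc_len  # zero for unseen words
    return (1 - lam) * p_mle + lam * p_collection[word]

doc = "text mining and text retrieval".split()
collection = doc + "web search and retrieval".split()
doc_counts = Counter(doc)
p_collection = {w: c / len(collection) for w, c in Counter(collection).items()}

lam = 0.5  # assumed value
print(jm_prob("text", doc_counts, len(doc), p_collection, lam))    # seen in doc
print(jm_prob("search", doc_counts, len(doc), p_collection, lam))  # unseen, but non-zero
```

The unseen word `search` now gets a small non-zero probability, and the smoothed probabilities over the collection vocabulary still sum to 1.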

### How does Dirichlet prior smoothing work? What is the formula?

**Dirichlet prior smoothing**, or *Bayesian smoothing*, is a linear interpolation between the maximum likelihood estimate and the collection language model, controlled by a smoothing parameter $\mu \in [0,+\infty)$. It is also called dynamic coefficient interpolation, because the interpolation coefficient $\frac{\mu}{|d|+\mu}$ depends on the document length:

- For a short document, the coefficient $\frac{\mu}{|d|+\mu}$ is high, so the collection model gets more weight.
- For a long document, the coefficient is low, so the document model gets more weight.
- The two interpolation coefficients sum to 1:
$$\frac{|d|}{|d|+\mu}+\frac{\mu}{|d|+\mu}=1$$

The Dirichlet prior smoothing formula is:

$$\eqalign{
    P(w|d) &= \frac{c(w,d)+\mu P(w|C)}{|d|+\mu}\\
           &= \frac{|d|}{|d|+\mu} \frac{c(w,d)}{|d|} + \frac{\mu}{|d|+\mu}P(w|C)
}$$
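
A minimal sketch of the Dirichlet prior estimate, again with a hypothetical toy collection and an assumed $\mu = 10$. The helper `collection_weight` is introduced here only to show how the interpolation coefficient changes with document length:

```python
from collections import Counter

def dirichlet_prob(word, doc_counts, doc_len, p_collection, mu):
    """(c(w,d) + mu * P(w|C)) / (|d| + mu)."""
    return (doc_counts.get(word, 0) + mu * p_collection[word]) / (doc_len + mu)

def collection_weight(doc_len, mu):
    """Interpolation coefficient mu / (|d| + mu) given to the collection model."""
    return mu / (doc_len + mu)

doc = "text mining and text retrieval".split()
collection = doc + "web search and retrieval".split()
doc_counts = Counter(doc)
p_collection = {w: c / len(collection) for w, c in Counter(collection).items()}

mu = 10.0  # assumed value
print(dirichlet_prob("text", doc_counts, len(doc), p_collection, mu))
print(collection_weight(5, mu), collection_weight(500, mu))  # short vs. long document
```

With the same $\mu$, the short document leans much more on the collection model than the long one, which is the "dynamic" part of the method.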

### What are the similarities and differences between Jelinek-Mercer smoothing and Dirichlet prior smoothing?

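Both smoothing methods lead to a ranking function similar to the vector space model. The difference is that JM uses a fixed coefficient that does not depend on the document, so the $|q| \log \alpha_d$ term drops out of the ranking, whereas Dirichlet prior (Dir) uses a dynamic coefficient that depends on the document length:

>Given a ranking function consisting of TF-IDF weighting and length normalization (the document-independent term $\sum_{w \in V} c(w,q) \log P(w|C)$ is dropped because it does not affect the ranking):

>$$\log P(q|d) \stackrel{rank}{=} \sum\limits_{w \in V, c(w,d)>0} c(w,q) \log \frac{P_{seen}(w|d)}{\alpha_d P(w|C)} + |q| \log \alpha_d$$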

>Jelinek-Mercer smoothing is:
    
>$$P_{seen}(w|d) = (1-\lambda)\ P_{MLE}(w|d) + \lambda \ P(w|C)$$

>If $\alpha_d = \lambda$, then:

>$$\eqalign{
    \frac{P_{seen}(w|d)}{\alpha_d*P(w|C)} &= \frac{(1-\lambda)\ P_{MLE}(w|d) + \lambda \ P(w|C)}{\lambda * P(w|C)}\\
    &= 1 + \frac{1-\lambda}{\lambda} * \frac{c(w,d)}{|d|*P(w|C)}
}$$

>We can ignore the $|q| \log \alpha_d$ term, since $\alpha_d = \lambda$ does not depend on the document being scored.

>Finally, the scoring function, summed over the words that occur in both $q$ and $d$, is:

>$$score_{JM}(q,d) = \sum\limits_{w \in q,d} c(w,q) \log\big(1+\frac{1-\lambda}{\lambda} \cdot \frac{c(w,d)}{|d| P(w|C)}\big)$$

---

>Dirichlet prior smoothing is:

>$$\eqalign{
    P_{seen}(w|d) &= \frac{c(w,d)+\mu * P(w|C)}{|d|+\mu}\\
                  &= \frac{|d|}{|d|+\mu} \frac{c(w,d)}{|d|} + \frac{\mu}{|d|+\mu}P(w|C)
}$$

>If $\alpha_d = \frac{\mu}{|d|+\mu}$, then:

>$$\eqalign{
    \frac{P_{seen}(w|d)}{\alpha_d*P(w|C)} &= \frac{\frac{c(w,d)+\mu * P(w|C)}{|d|+\mu}}{\frac{\mu}{|d|+\mu}P(w|C)}\\
    &= 1 + \frac{c(w,d)}{\mu * P(w|C)}
}$$

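>Here the $|q| \log \alpha_d$ term is kept, since $\alpha_d$ depends on the document length $|d|$. Finally, the scoring function is:

>$$score_{Dir}(q,d) = \sum\limits_{w \in q,d} c(w,q) \log \big(1+\frac{c(w,d)}{\mu P(w|C)}\big) + |q| \log \frac{\mu}{\mu + |d|}$$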
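
Both scoring functions can be sketched side by side. The toy collection, the query, and the parameter values ($\lambda = 0.5$, $\mu = 10$) are assumptions made for illustration, and the sums run only over query words that occur in the document:

```python
import math
from collections import Counter

def score_jm(query, doc, p_C, lam=0.5):
    """Rank-equivalent JM score: sum over matched words only."""
    c_q, c_d = Counter(query), Counter(doc)
    return sum(
        c_wq * math.log(1 + (1 - lam) / lam * c_d[w] / (len(doc) * p_C[w]))
        for w, c_wq in c_q.items() if c_d[w] > 0
    )

def score_dirichlet(query, doc, p_C, mu=10.0):
    """Rank-equivalent Dirichlet score: matched-word sum plus length term."""
    c_q, c_d = Counter(query), Counter(doc)
    matched = sum(
        c_wq * math.log(1 + c_d[w] / (mu * p_C[w]))
        for w, c_wq in c_q.items() if c_d[w] > 0
    )
    return matched + len(query) * math.log(mu / (mu + len(doc)))

# Hypothetical toy collection and query
docs = {
    "d1": "text mining and text retrieval".split(),
    "d2": "web search and retrieval".split(),
}
collection = [w for tokens in docs.values() for w in tokens]
p_C = {w: c / len(collection) for w, c in Counter(collection).items()}
query = "text retrieval".split()

ranking = sorted(docs, key=lambda n: score_dirichlet(query, docs[n], p_C), reverse=True)
print(ranking)
for name in docs:
    print(name, score_jm(query, docs[name], p_C), score_dirichlet(query, docs[name], p_C))
```

Each score differs from the full smoothed log query likelihood only by a term that is the same for every document, so the resulting ranking is unchanged.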

## Additional Readings and Resources

* C. Zhai and S. Massung. Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining, ACM Book Series, Morgan & Claypool Publishers, 2016. Chapter 6 - Section 6.4