# Week 6 Overview

During this week's lessons, you will learn how machine learning can be used to combine multiple scoring factors to optimize ranking of documents in web search (i.e., learning to rank), and learn techniques used in recommender systems (also called filtering systems), including content-based recommendation/filtering and collaborative filtering. You will also have a chance to review the entire course.

## Key Phrases and Concepts

* Learning to rank, features, and logistic regression
* Content-based filtering
* Collaborative filtering
* Beta-gamma threshold learning
* Linear utility
* User profile
* Exploration-exploitation tradeoff
* Memory-based collaborative filtering
* Cold start 

## Goals and Objectives

After you actively engage in the learning experiences in this module, you should be able to:

* Explain the basic idea of using machine learning to combine multiple features for ranking documents (i.e., learning to rank).
* Explain how we can extend a retrieval system to perform content-based information filtering (recommendation).
* Explain how we can use a linear utility function to evaluate an information filtering system.
* Explain the basic idea of collaborative filtering.
* Explain how the memory-based collaborative filtering algorithm works.

## Guiding Questions

### What’s the basic idea of learning to rank?

Use machine learning to combine many different features into a single ranking function that optimizes search results.

Let:

- $(Q, D)$ is a query-document pair.
- $X_i(Q, D)$ is the value of feature $i$ for that pair.
- $X_i(Q, D)$ can be a BM25 score, $p(Q|D)$, PageRank, or a custom scoring function such as URL pattern detection.

Hypothesize:

- Relevance is determined by the features $X_i$, such that:

$$p(R=1|Q,D) = s(X_0(Q, D), ..., X_n(Q,D), \lambda)$$

- $s$ is the fitting function.
- $\lambda$ is the set of parameters that control the weight of each feature in the fitting function $s$.

Then:

- Find the value of $\lambda$ that makes $s$ best fit the labeled training data, for example by maximizing the likelihood of the observed relevance judgments.

### How can logistic regression be used to combine multiple features for improving ranking accuracy of a search engine?

Let:

- $s()$ is a logistic regression model learned from labeled training data.
- Training data consists of $(q_i, d_j, R)$ tuples.
- $q_i$ is a query consisting of terms $\{t_1, ..., t_q\}$.
- $R$ is the user's relevance judgment.

Hypothesize:

- The *logistic inference model* says that we can use a *random sample* of *query-document-term* triples for which binary relevance judgments have been made, and compute the logarithm of the odds of relevance for a term $t_k$ that is present in both document $d_j$ and query $q_i$ by the formula:

$$\log O(R|q_i, d_j, t_k) = \beta_0 + \beta_1 x_1 + ... + \beta_n x_n$$

where $O(R|\cdot) = \frac{P(R|\cdot)}{1 - P(R|\cdot)}$ denotes the odds of relevance.

Find:

- Parameters $\beta$ that fit $\log O(R|q_i, d_j, t_k)$ to the training sample.
- The log odds of relevance for query $q_i = \{t_1, ..., t_q\}$ is the sum of the log odds for all terms:
$$\log O(R|q_i, d_j) = \sum\limits_{k=1}^q \big[ \log O(R|q_i, d_j, t_k) - \log O(R) \big]$$
- $O(R)$, known as the *prior odds of relevance*, is derived from the probability $P(R)$ that a document chosen at random from the collection will be relevant to query $q_i$. For example:
    - $N_j$ is the number of query-document pairs judged relevant.
    - $N_{q,d}$ is the number of query-document pairs in the collection.
    - Then:
    $$P(R) = prior = \frac{N_j}{N_{q,d}}$$
    - The log odds version is:
    $$\log O(R) = \log \frac{P(R)}{1 - P(R)} = \log \frac{prior}{1 - prior}$$

Finally:

- The probability of relevance of a document to a query is:
$$P(R|q_i, d_j) = \frac{1}{1+e^{-\log O(R|q_i, d_j)}}$$

Please read [this paper](http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=B6ADD5FAE7AAB2AFF0030B55AB6AC7DA?doi=10.1.1.64.4681&rep=rep1&type=pdf) for further explanation.
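
The combination step can be sketched in a few lines of Python. This is a minimal illustration, assuming the coefficients have already been fitted; the values of `beta`, `term_features`, and `prior` below are hypothetical and chosen only to make the arithmetic visible.

```python
import math

def log_odds(p):
    """Convert a probability into log odds."""
    return math.log(p / (1.0 - p))

def term_log_odds(beta, x):
    """Log odds of relevance for one matching term: beta_0 + sum_k beta_k * x_k."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))

def relevance_probability(beta, term_features, prior):
    """Sum the per-term log odds (each minus the prior log odds), then
    map the total back to a probability with the logistic function.

    term_features holds one feature vector per query term that also
    occurs in the document.
    """
    log_prior = log_odds(prior)
    total = sum(term_log_odds(beta, x) - log_prior for x in term_features)
    return 1.0 / (1.0 + math.exp(-total))

# Hypothetical fitted coefficients and term features (illustration only).
beta = [-3.0, 1.5, 0.8]
term_features = [[0.9, 0.4], [0.2, 0.7]]
prior = 0.05  # assumed prior probability of relevance

p = relevance_probability(beta, term_features, prior)
print(round(p, 3))
```

Documents can then be ranked for the query by this probability.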

---

For example:

Let:

- $X_1$ is BM25 score of the document for the query.
- $X_2$ is PageRank score of the document.
- $X_3$ is BM25 score on the anchor text of the document.

We know that:

- One relevant and one non-relevant training document:


| Document | $X_1(Q, D)$ | $X_2(Q, D)$ | $X_3(Q, D)$ |
|----------|-------------|-------------|-------------|
| $d_1$ ($R=1$) | 0.7 | 0.11 | 0.65 |
| $d_2$ ($R=0$) | 0.3 | 0.05 | 0.4 |


Find:

- $\beta$ parameters that maximize:

$$P(\{q, d_1, R=1\}, \{q, d_2, R=0\}) = \frac{1}{1 + \exp(-\beta_0 - 0.7 \beta_1 - 0.11 \beta_2 - 0.65 \beta_3)} \cdot \Big(1 - \frac{1}{1 + \exp(-\beta_0 - 0.3 \beta_1 - 0.05 \beta_2 - 0.4 \beta_3)} \Big)$$
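
One way to search for such $\beta$ values is gradient ascent on the log-likelihood. The sketch below uses the two training documents from the table; the optimizer, step count, and learning rate are assumptions made for illustration, not something the course prescribes.

```python
import math

# Training examples from the table above: (features X1..X3, relevance label).
examples = [
    ([0.7, 0.11, 0.65], 1),
    ([0.3, 0.05, 0.4], 0),
]

def predict(beta, x):
    """P(R=1 | Q, D) under logistic regression."""
    z = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def likelihood(beta):
    """Probability of observing all training labels under beta."""
    result = 1.0
    for x, r in examples:
        p = predict(beta, x)
        result *= p if r == 1 else 1.0 - p
    return result

def fit(steps=2000, rate=0.5):
    """Plain gradient ascent on the log-likelihood, starting from zeros."""
    beta = [0.0, 0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for x, r in examples:
            error = r - predict(beta, x)  # gradient of the log-likelihood
            grad[0] += error
            for k, v in enumerate(x, start=1):
                grad[k] += error * v
        beta = [b + rate * g for b, g in zip(beta, grad)]
    return beta

beta = fit()
print(likelihood([0.0] * 4), likelihood(beta))
```

With only two examples the data are perfectly separable, so the coefficients keep growing the longer the loop runs. A realistic training set has many judged pairs and usually adds regularization.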

### What is content-based information filtering?

Let:

- $u$ is a user.
- $x$ is an item.

Find:

- Will user $u$ like item $x$?

Then:

- Learn what $u$ likes.
- Recommend $x$ to $u$ if its content resembles the items $u$ has liked.

![content-based-information-filtering](images/content-based-information-filtering.png)

### How can we use a linear utility function to evaluate a filtering system? <br> How should we set the coefficients in such a linear utility function? 

Problem 1:

- A filtering system cannot wait to pool documents and rank them before deciding what to show the user, so ranking measures such as MAP (Mean Average Precision) or NDCG (Normalized Discounted Cumulative Gain) cannot be used.

Solution:

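- The system must make a deliver-or-not decision for each document in real time, so it needs an evaluation measure defined on the delivered documents.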

Find:

- Threshold $\theta$ that defines an absolute cutoff for relevance from the user's perspective. Documents scoring above $\theta$ are delivered.

Then:

- $R$ is the set of delivered documents that are relevant.
- $R'$ is the set of delivered documents that are non-relevant.

Finally:

- A linear utility function that measures the value of the delivered documents to the user is:
$$U = \alpha_0 * |R| - \alpha_1 * |R'|$$
It rewards each relevant delivered document with $\alpha_0$ and charges $\alpha_1$ for each non-relevant one.
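
As a small illustration, the utility can be computed directly from the set of delivered documents and the set of documents the user judges relevant. The coefficient values 3 and 2 and the document IDs are assumptions for the example.

```python
def linear_utility(delivered, relevant, reward=3.0, penalty=2.0):
    """U = reward * (#delivered and relevant) - penalty * (#delivered and non-relevant)."""
    good = len(delivered & relevant)
    bad = len(delivered - relevant)
    return reward * good - penalty * bad

# Hypothetical document IDs for illustration.
delivered = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d3", "d7"}

utility = linear_utility(delivered, relevant)
print(utility)  # 3 * 2 - 2 * 2 = 2.0
```

Note that `d7` is relevant but never delivered, so it neither adds to nor subtracts from the utility.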

---

Problem 2:

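- How should the utility coefficients $\alpha$ be set?

Solution:

- The coefficients depend on the application: $\alpha_0$ is the reward for delivering a relevant document and $\alpha_1$ is the cost of delivering a non-relevant one. A large $\alpha_0$ relative to $\alpha_1$ encourages the system to deliver more documents; the reverse makes it conservative.
- With the coefficients fixed, the filtering system is built from modules that work together to maximize the utility function $U$.

Three basic modules are:

- **Initialization module**: Sets an initial model of the user's interests based on very limited information.
- **Decision module**: Decides which documents should be delivered to the user over time, which in turn reveals the user's preferences through feedback.
- **Learning module**: Uses the user's feedback on delivered documents to update the user model and improve the decision module's performance.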

### How can we extend a retrieval system to perform content-based information filtering?

Let:

- The retrieval system implements a general **vector space model**.
- $\vec{D}$ is the vector representation of an incoming document.
- $\vec{Q}_+$ is an extended query that holds the user profile (preferences, behavior, etc).

Then:

- Implement a **scoring module** to score each $\vec{D}$ against $\vec{Q}_+$.
- Implement a **thresholding module** to decide whether each document is delivered to the user.

Finally:

- Update the threshold value $\theta$ based on feedback from the user.
- Update $\vec{Q}_+$ by learning from the user's feedback.

![content-based-recommendation-feedback](images/content-based-recommendation-feedback.png)
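
The loop in the figure can be sketched as follows. This is a simplified illustration under stated assumptions: documents and the profile are sparse term-weight dictionaries, scoring is a dot product, the profile update is a Rocchio-style step chosen for the example, and the threshold is kept fixed. `filter_stream`, `judge`, and `step` are hypothetical names.

```python
def score(profile, doc):
    """Dot product between sparse term-weight vectors stored as dicts."""
    return sum(w * doc.get(t, 0.0) for t, w in profile.items())

def filter_stream(profile, docs, threshold, judge, step=0.5):
    """Deliver documents scoring above the threshold and learn from feedback.

    judge(doc) stands in for the user's relevance feedback (True or False).
    """
    delivered = []
    for doc in docs:
        if score(profile, doc) <= threshold:
            continue  # not delivered, so the system gets no feedback on it
        delivered.append(doc)
        sign = 1.0 if judge(doc) else -1.0
        # Rocchio-style update: move the profile toward liked documents
        # and away from disliked ones.
        for term, weight in doc.items():
            profile[term] = profile.get(term, 0.0) + sign * step * weight
    return delivered

# Hypothetical profile, document stream, and user feedback.
profile = {"python": 1.0}
docs = [
    {"python": 1.0, "pandas": 1.0},
    {"football": 1.0},
    {"pandas": 1.0},
]
delivered = filter_stream(profile, docs, threshold=0.4,
                          judge=lambda d: "pandas" in d)
print(len(delivered), profile)
```

The third document is delivered only because feedback on the first one added `pandas` to the profile, which shows how learning changes later decisions.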

### What is the exploration-exploitation tradeoff?

Let:

- **Exploration** means delivering unlabelled documents to the user in order to collect as much feedback as possible and get to know the user's preferences.
- **Exploitation** means delivering documents based only on the user preferences known at the current time.

Constraints:

- If the system explores too much, many non-relevant documents may be delivered to users, which decreases user satisfaction.
- If the system exploits too much, it loses the ability to learn new user preferences.

Then:

- Use **beta-gamma threshold learning** to balance the exploration-exploitation tradeoff.

### How does the beta-gamma threshold learning algorithm work? 

The basic idea of beta-gamma threshold learning is:

1. Start from a ranked list of all documents in the training database, sorted by their scores, together with their relevance and the utility $U$ at each cutoff.
![beta-gamma-threshold](images/beta-gamma-threshold.png)
2. $\theta_{optimal}$ is the score threshold at which the utility on the training data is maximal. Using it directly means pure exploitation.
3. $\theta_{zero}$ is the lower score threshold at which the utility on the training data drops to zero. It is the lowest threshold worth exploring down to.
4. The **cutoff position** is chosen between $\theta_{zero}$ and $\theta_{optimal}$, which is where exploration takes place:
$$\theta = \alpha * \theta_{zero} + (1 - \alpha) * \theta_{optimal}$$
5. $\alpha$ controls how far the threshold deviates from $\theta_{optimal}$ toward $\theta_{zero}$. It is defined as:
$$\alpha = \beta + (1 - \beta) * e^{-N*\gamma}$$
6. $\beta$ controls the baseline deviation from $\theta_{optimal}$ that remains however many examples have been seen, since $\theta_{optimal}$ is estimated from previously observed documents (i.e., training data).
7. $\gamma$ controls the influence of the number of examples in the training data set.
8. $N$ is the number of training examples.
9. The term $e^{-N * \gamma}$ tells us that:
    - There is less exploration as $N$ grows ($\alpha \to \beta$, so $\theta$ moves toward $\theta_{optimal}$).
    - There is more exploration when $N$ is very small ($\alpha \to 1$, so $\theta$ moves toward $\theta_{zero}$).
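
The two formulas translate directly into code. In this sketch the threshold values and the defaults for `beta` and `gamma` are hypothetical; in practice they would be tuned on training data.

```python
import math

def beta_gamma_threshold(theta_optimal, theta_zero, n_examples,
                         beta=0.1, gamma=0.05):
    """Interpolate between theta_zero and theta_optimal.

    alpha shrinks from about 1 toward beta as the number of training
    examples grows, moving the threshold toward theta_optimal.
    """
    alpha = beta + (1.0 - beta) * math.exp(-n_examples * gamma)
    return alpha * theta_zero + (1.0 - alpha) * theta_optimal

# Hypothetical thresholds: utility peaks at score 0.8 and reaches zero at 0.2.
few = beta_gamma_threshold(0.8, 0.2, n_examples=1)
many = beta_gamma_threshold(0.8, 0.2, n_examples=200)
print(round(few, 3), round(many, 3))
```

With one training example the threshold sits close to $\theta_{zero}$ (more exploration); with 200 it sits close to $\theta_{optimal}$, offset only by the $\beta$ term.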

Please read [this paper](http://www.cs.cmu.edu/~czhai/paper/TREC7-filtering.pdf) for further explanation.

### What is the basic idea of collaborative filtering?

Let:

- $u_i$ is a user.
- $U_a$ is the set of users similar to $u_i$.
- $x_i$ is an item that may be recommended to $u_i$.
- $X_a$ is the set of items liked by $U_a$.

Problem:

- Will user $u_i$ like item $x_i$?

Assume:

- Users with the same interest will have similar preferences, and vice versa.

Then:

- Find the users $U_a$ whose preferences are similar to those of $u_i$.
- Predict the preferences of $u_i$ from the common preferences of $U_a$.

Finally:

- Recommend $x_i$ to $u_i$ if $x_i$ belongs to $X_a$ or is similar to the items in it.

---

Below is the sparse matrix representation used for collaborative filtering, where each $o$ is an object (e.g., a document) and the function $f(., .)$ maps a user and an object to a rating.

![collaborative-filtering](images/collaborative-filtering.png)
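
One simple way to hold such a sparse matrix in code is a dictionary of dictionaries, where a missing key is an unknown rating. The users, objects, and ratings below are made up for illustration.

```python
# ratings[user][obj] = rating; absent entries are the unknown cells
# that collaborative filtering tries to predict.
ratings = {
    "u1": {"o1": 3.0, "o2": 1.5, "o4": 2.0},
    "u2": {"o2": 2.0, "o3": 1.0},
    "u3": {"o1": 2.5},
}

def f(user, obj):
    """Return the known rating, or None when the cell is empty."""
    return ratings.get(user, {}).get(obj)

def unknown_cells(users, objects):
    """List every (user, object) pair that has no rating yet."""
    return [(u, o) for u in users for o in objects if f(u, o) is None]

missing = unknown_cells(["u1", "u2", "u3"], ["o1", "o2", "o3", "o4"])
print(len(missing))
```

Storing only the known ratings keeps memory proportional to the number of ratings rather than to users times objects.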

### How does the memory-based collaborative filtering algorithm work?

Let:

- $u_i$ is a user with known ratings.
- $u_a$ is the active user, whose rating for some object is unknown.
- $X_{ij}$ is the rating given by user $u_i$ to object $o_j$.
- $n_i$ is the average rating of all objects rated by user $u_i$.
- $n_a$ is the average rating of all objects rated by user $u_a$.
- Normalized ratings for each user $u_i$:
$$V_{ij} = X_{ij} - n_i$$

Predict:

- The rating $\hat{X}_{aj}$ that user $u_a$ would give to object $o_j$.

Assume:

- If predicted rating of $o_j$ is high, then object $o_j$ may be a good candidate to be recommended to user $u_a$.

Then:

- The predicted normalized rating of object $o_j$ for user $u_a$ is:

$$\hat{V}_{aj} = k * \sum\limits_{i=1}^m w(u_a, u_i) * V_{ij}$$

Where:

- $k$ is a normalizer that makes $\hat{V}_{aj}$ a weighted average of the other users' normalized ratings:
$$k = \frac{1}{\sum_{i=1}^m w(u_a, u_i)}$$
- $w(u_a, u_i)$ is the similarity between user $u_a$ and a particular user $u_i$. The similarity function could be one of the following:
    - Pearson Correlation Coefficient:
    $$w_p(u_a, u_i) = \frac{\sum_j (X_{aj} - n_a)(X_{ij} - n_i)}{\sqrt{\sum_j (X_{aj} - n_a)^2 \sum_j(X_{ij} - n_i)^2}}$$
    - Cosine similarity:
    $$w_c(u_a, u_i) = \frac{\sum_j X_{aj} X_{ij}}{\sqrt{\sum_j X_{aj}^2 \sum_j X_{ij}^2}}$$
    
Finally:

- Predicted rating $\hat{X}_{aj}$ of object $o_j$ given by user $u_a$ is:
$$\hat{X}_{aj} = \hat{V}_{aj} + n_a$$
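
The whole procedure fits in a short Python sketch. It makes a few assumptions that go beyond the formulas above: ratings are stored as nested dictionaries, Pearson sums run only over objects both users have rated, and the normalizer uses absolute weights so that negative correlations do not cancel positive ones. The users and ratings are made up.

```python
import math

def mean_rating(ratings, user):
    """Average of all ratings given by the user (n_i)."""
    values = ratings[user].values()
    return sum(values) / len(values)

def pearson(ratings, a, i):
    """Pearson correlation over the objects rated by both users."""
    common = set(ratings[a]) & set(ratings[i])
    n_a, n_i = mean_rating(ratings, a), mean_rating(ratings, i)
    num = sum((ratings[a][j] - n_a) * (ratings[i][j] - n_i) for j in common)
    den = math.sqrt(sum((ratings[a][j] - n_a) ** 2 for j in common)
                    * sum((ratings[i][j] - n_i) ** 2 for j in common))
    return num / den if den else 0.0

def predict(ratings, a, obj):
    """Predicted rating of obj for user a: mean of a plus weighted deviations."""
    n_a = mean_rating(ratings, a)
    weighted, total = 0.0, 0.0
    for i in ratings:
        if i == a or obj not in ratings[i]:
            continue
        w = pearson(ratings, a, i)
        weighted += w * (ratings[i][obj] - mean_rating(ratings, i))
        total += abs(w)  # assumption: normalize by absolute weights
    if total == 0.0:
        return n_a  # no usable neighbors: fall back to the user's average
    return n_a + weighted / total

# Hypothetical ratings; alice has not rated o4.
ratings = {
    "alice": {"o1": 5, "o2": 3, "o3": 4},
    "bob": {"o1": 4, "o2": 2, "o3": 3, "o4": 5},
    "carol": {"o1": 1, "o2": 5, "o4": 2},
}
print(round(predict(ratings, "alice", "o4"), 2))
```

Here bob agrees with alice and rated `o4` above his own average, while carol disagrees with alice and rated it below hers, so both push the prediction above alice's average rating.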

### What is the “cold start” problem in collaborative filtering?

At the beginning, the system does not have enough information about a user and has very few ratings contributed by users overall. Without that data it cannot find similar users, so collaborative filtering produces few or no useful recommendations.

## Additional Readings and Resources

* C. Zhai and S. Massung. Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining, ACM Book Series, Morgan & Claypool Publishers, 2016. Chapter 10 (Section 10.4) and Chapter 11.
* [Recommender system handbook](https://www.cse.iitk.ac.in/users/nsrivast/HCC/Recommender_systems_handbook.pdf)