## Week 06 - Recommender Systems

#### 6.2 - Learning to Rank
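The log odds of relevance are modeled as a linear combination of features:

$$\log\left( \frac{ P( R = 1 | Q, D ) }{ 1 - P( R = 1 | Q, D ) } \right) = \beta_0 + \sum_{i = 1}^n \beta_iX_i$$

The log odds map a probability between 0 and 1 onto the full range of real values. Solving for the probability gives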

$$P( R = 1 | Q, D ) = \frac{ 1 }{ 1 + exp( -\beta_0 - \sum_{i = 1}^n \beta_iX_i) }$$

and $X_i(Q, D)$ is a feature value. This connects a probability between 0 and 1 to a linear combination of features such as BM25, PageRank, and BM25Anchor. The ultimate goal is to predict relevance from the feature values.
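
As a rough illustration of the logistic formula, the sketch below computes $P(R = 1 | Q, D)$ for one query-document pair and checks that its log odds recover the linear score. The feature values and the coefficients `beta0` and `betas` are made up for the example; they do not come from the notes.

```python
import math

def relevance_probability(features, beta0, betas):
    """P(R = 1 | Q, D) = 1 / (1 + exp(-(beta0 + sum_i beta_i * X_i)))."""
    z = beta0 + sum(b * x for b, x in zip(betas, features))
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """Log odds of a probability p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

# Hypothetical feature values for one (Q, D) pair: [BM25, PageRank, BM25Anchor]
features = [2.0, 0.5, 1.0]
# Hypothetical coefficients, chosen only for illustration
beta0, betas = -1.0, [0.8, 0.4, 0.3]

p = relevance_probability(features, beta0, betas)
print(p)            # a probability between 0 and 1
print(log_odds(p))  # recovers the linear score beta0 + sum_i beta_i * X_i
```

The linear score can be any real number, while the probability always stays inside $(0, 1)$.
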
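
For training data with one relevant document $D_1$ and one non-relevant document $D_2$ for the same query, the likelihood is

$$P( \{ (Q, D_1, 1), (Q, D_2, 0) \} ) = P( R = 1 | Q, D_1 ) * (1 - P( R = 1 | Q, D_2 ))$$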

which is the probability of relevance for $D_1$ times the probability of non-relevance for $D_2$. The goal is to adjust $\beta$ so that $P( \{ (Q, D_1, 1), (Q, D_2, 0) \} )$ reaches its maximum. To do this, we need to make $P( R = 1 | Q, D_1 )$ as large as possible and $P( R = 1 | Q, D_2 )$ as small as possible (because 1 minus a small value is a large value).
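
A minimal sketch of this fitting step follows. The notes do not say how the maximization is carried out, so plain gradient ascent on the log-likelihood is used here as an assumption; the two training examples, the learning rate, and the step count are also made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def likelihood(examples, beta0, betas):
    """Product over examples of P(R=1|Q,D) if the label is 1, else 1 - P(R=1|Q,D)."""
    result = 1.0
    for features, label in examples:
        p = sigmoid(beta0 + sum(b * x for b, x in zip(betas, features)))
        result *= p if label == 1 else (1.0 - p)
    return result

def fit(examples, n_features, learning_rate=0.1, steps=500):
    """Gradient ascent on the log-likelihood (one possible optimizer, assumed here)."""
    beta0, betas = 0.0, [0.0] * n_features
    for _ in range(steps):
        grad0, grads = 0.0, [0.0] * n_features
        for features, label in examples:
            p = sigmoid(beta0 + sum(b * x for b, x in zip(betas, features)))
            error = label - p  # gradient of the log-likelihood w.r.t. the linear score
            grad0 += error
            for i, x in enumerate(features):
                grads[i] += error * x
        beta0 += learning_rate * grad0
        betas = [b + learning_rate * g for b, g in zip(betas, grads)]
    return beta0, betas

# Hypothetical training data: (feature values, relevance label)
examples = [([2.0, 0.9], 1), ([0.3, 0.1], 0)]

before = likelihood(examples, 0.0, [0.0, 0.0])  # 0.5 * 0.5 = 0.25 with all betas at zero
beta0, betas = fit(examples, n_features=2)
after = likelihood(examples, beta0, betas)
print(before, after)
```

With all coefficients at zero both documents get probability 0.5, so the likelihood starts at 0.25; fitting pushes the relevant document's probability up and the non-relevant document's probability down.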

#### 6.4 - Future of Web Search
More specialized search engines (vertical engines)
- Special groups of users (Community engines)
- Personalized (Understand users better)
- Special genre/domain (understand documents better)

Search engines will be able to evolve through learning over time. They will integrate search, navigation, and recommendation/filtering methods. Search engines will also go beyond search by supporting tasks such as shopping.

###### Data-User-Service (DUS) Triangle
Who are you serving, what data are you managing, and what service do you provide?

Current search engines provide search as their service, are operated through queries, and match those queries against bag-of-words representations.
- On the Query side the goal is to reach Personalization (User Modeling)
- On the Bag-of-Words side the goal is to reach large scale semantic analysis (vertical search engines)
- On the search side the goal is to reach intelligent and interactive task support.

#### 6.5/6.6 - Content-Based-Filtering

##### 6.5
A recommender system (filtering system) is built around a user's stable, long-term interests. The system must make a delivery decision immediately as each document arrives, essentially filtering out articles that are not relevant to the user's interests.

Ways to decide whether user $U$ will like item $X$:
- Look at the items that $U$ likes, and check whether $X$ is similar to them.
	- Item similarity => content-based filtering
- Look at the users who like $X$, and check whether $U$ is similar to them.
	- User similarity => collaborative filtering

A content-based filter has a binary classifier (the user-interest profile) that tracks the user's interests. The classifier is first set up by an initialization step based on the user's initial likes, given as categories or keywords. A learning module then takes the documents in the queue and the user's feedback and uses them to optimize the recommendation/filtering results.

Linear utility scores the filter by rewarding successful recommendations and penalizing unsuccessful ones.
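
As a small illustration, linear utility could be computed as below. The notes do not give the coefficients, so the `gain` of 3 and `penalty` of 2 are assumed values for the example only.

```python
def linear_utility(n_good, n_bad, gain=3.0, penalty=2.0):
    """Reward each good delivery and penalize each bad one.

    gain and penalty are illustrative assumptions, not values from the notes.
    """
    return gain * n_good - penalty * n_bad

# 10 relevant deliveries and 4 non-relevant deliveries
print(linear_utility(10, 4))  # 3*10 - 2*4 = 22.0
```

Raising the penalty relative to the gain makes the filter more conservative, since each bad delivery costs more.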

Basic Problems in content-based filtering:
- Making a filter decision (binary classifier)
- Initialization (base start point from examples)
- Learning through relevance judgments and accumulated documents

Extend a retrieval system for filtering
- Reuse retrieval techniques to score documents
- Use a score threshold for filtering decisions
- Learn to improve scoring with traditional feedback
- New approaches to threshold setting and learning

##### 6.6
Difficulties in threshold learning:
- Censored data (judgments are available only on delivered documents)
- Little to no labeled data
- Exploration vs. exploitation (will the user like documents for which they have given no feedback yet?)

Empirical utility optimization computes the utility on the training data for each candidate score threshold and chooses the threshold that gives the maximum utility. A downside is that the training sample is biased, because judgments exist only for delivered documents, so this yields only an upper bound on the true optimal threshold. To overcome this, the threshold is allowed to be adjusted (lowered).
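
The threshold search can be sketched as follows, assuming a linear utility with made-up `gain` and `penalty` coefficients and a small hypothetical set of judged documents. Each distinct training score is tried as a candidate threshold.

```python
def utility_at_threshold(scored_docs, threshold, gain=3.0, penalty=2.0):
    """Utility of delivering every document whose score is at or above the threshold."""
    utility = 0.0
    for score, relevant in scored_docs:
        if score >= threshold:
            utility += gain if relevant else -penalty
    return utility

def best_threshold(scored_docs, gain=3.0, penalty=2.0):
    """Try each distinct score as a cutoff and keep the one with maximum training utility."""
    candidates = sorted({score for score, _ in scored_docs}, reverse=True)
    return max(candidates,
               key=lambda t: utility_at_threshold(scored_docs, t, gain, penalty))

# Hypothetical judged training documents: (score, is_relevant)
training = [(0.9, True), (0.8, True), (0.7, False),
            (0.6, True), (0.4, False), (0.3, False)]

theta = best_threshold(training)
print(theta, utility_at_threshold(training, theta))  # 0.6 7.0
```

Because only delivered documents are ever judged, real training data would be missing the low-scoring documents, which is the bias described above.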

Beta-Gamma Threshold Learning: We have utility on the $y$ axis and cutoff position $k$ on the $x$ axis. $\theta_{optimal}$ is the cutoff point at which one achieves maximal utility. $\theta_{zero}$ is the cutoff point at which the utility drops to zero. Moving the threshold toward $\theta_{zero}$ allows us to explore, because it does not cause a negative utility hit. So we want the threshold to be between these two $\theta$ points.

$\alpha$ is the tuning parameter for choosing this threshold:
$$\theta = \alpha * \theta_{zero} + (1 - \alpha) * \theta_{optimal}$$

where
$$\alpha = \beta + ( 1 - \beta ) * e^{ -N * \gamma }$$

where $\beta, \gamma \in [0, 1]$ and $N$ is the number of training examples. The larger $N$ is, the smaller $\alpha$ becomes, so the threshold moves closer to $\theta_{optimal}$ and there is less exploration.
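
A short sketch of this interpolation between $\theta_{zero}$ and $\theta_{optimal}$ follows. The default values of `beta` and `gamma` and the two example cutoffs are arbitrary choices for illustration.

```python
import math

def beta_gamma_threshold(theta_optimal, theta_zero, n_examples, beta=0.1, gamma=0.05):
    """Interpolate between theta_zero and theta_optimal.

    alpha = beta + (1 - beta) * exp(-N * gamma); beta and gamma lie in [0, 1].
    """
    alpha = beta + (1.0 - beta) * math.exp(-n_examples * gamma)
    return alpha * theta_zero + (1.0 - alpha) * theta_optimal

# Hypothetical cutoffs: theta_optimal = 0.6, theta_zero = 0.3
for n in (0, 10, 100, 1000):
    print(n, beta_gamma_threshold(0.6, 0.3, n))
```

With no training examples $\alpha = 1$, so the threshold sits at $\theta_{zero}$ (maximum exploration). As $N$ grows, $\alpha$ decays toward $\beta$ and the threshold approaches, but never quite reaches, $\theta_{optimal}$ unless $\beta = 0$.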

Pros:
- Explicitly address the exploration-exploitation tradeoff
- Arbitrary utility
- Empirically effective

Cons:
- Purely heuristic
- Zero utility lower bound often too conservative

#### 6.7/6.8 - Recommender System

##### 6.7
Collaborative filtering makes filtering decisions for an individual user based on the judgments of other users, inferring the individual's interests/preferences from those of similar users.
- Given a user $u$ find similar users
- Predict $u$'s preferences based on the preferences of similar users
- User similarity can be judged based on similarity in preferences on a common set of items

Given objects $O$, users $U$, and known values of a function $f: U \times O \to R$ for some $(u, o)$ pairs, the task is to predict $f$ values for other $(u, o)$ pairs. This is essentially function approximation.

##### 6.8
Memory-based Approach
- $X_{ij}$ is rating of object $o_j$ by user $u_i$ (Assume $X$ is a matrix)
- $n_i$ is the average rating of all objects by user $u_i$
- $V_{ij} = X_{ij} - n_i$ is the normalized rating (subtract average rating from all ratings)
- Goal is to predict the rating of object by a user

$$\hat v_{aj} = k \sum_{i = 1}^m w(a, i)v_{ij}$$

says that the predicted normalized rating of object $o_j$ for the active user $u_a$ is a weighted combination of the normalized ratings given by the other users, where the weight $w(a, i)$ is the similarity between users $u_a$ and $u_i$.

$$k = \frac{ 1 }{ \sum_{i = 1}^m w(a, i) }$$

is a normalizer: 1 over the sum of the weights across all users.

$$\hat x_{aj} = \hat v_{aj} + n_a$$

adds the active user's average rating $n_a$ back to the predicted normalized rating, giving the predicted rating on the original scale.
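
The three formulas above can be combined into one prediction routine, sketched below under some assumptions: ratings are stored as nested dictionaries, the weights $w(a, i)$ are supplied precomputed and non-negative, and users who have not rated the object are skipped rather than filled in. The user names, ratings, and weights are hypothetical.

```python
def mean_rating(user_ratings):
    """n_i: the average of all ratings given by one user."""
    return sum(user_ratings.values()) / len(user_ratings)

def predict_rating(ratings, weights, active_user, item):
    """x_hat_aj = n_a + k * sum_i w(a, i) * (X_ij - n_i), with k = 1 / sum_i w(a, i).

    Only users who actually rated the item contribute (an assumption for
    handling missing ratings).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for user, user_ratings in ratings.items():
        if user == active_user or item not in user_ratings:
            continue
        w = weights[user]
        weighted_sum += w * (user_ratings[item] - mean_rating(user_ratings))
        weight_total += w
    n_a = mean_rating(ratings[active_user])
    if weight_total == 0:
        return n_a  # no usable neighbours: fall back to the user's own average
    return n_a + weighted_sum / weight_total

# Hypothetical rating data
ratings = {
    "alice": {"m1": 4, "m2": 2},
    "bob":   {"m1": 5, "m2": 3, "m3": 4},
    "carol": {"m1": 2, "m2": 4, "m3": 1},
}
# Hypothetical similarities w(alice, i)
weights = {"bob": 0.9, "carol": 0.1}

prediction = predict_rating(ratings, weights, "alice", "m3")
print(prediction)
```

Here bob rates `m3` exactly at his own average, so he contributes nothing, while carol rates it below her average and pulls alice's prediction slightly under alice's average of 3.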

Different functions to calculate $w$:

###### Pearson Correlation Coefficient (sum over commonly rated items)
$$w_p(a, i) = \frac{ \sum_j (X_{aj} - n_a)(X_{ij} - n_i) }{ \sqrt{  \sum_j (X_{aj} - n_a)^2  \sum_j (X_{ij} - n_i)^2 } }$$

###### Cosine Measure
$$w_c(a, i) = \frac{ \sum_{j = 1}^n X_{aj}X_{ij} }{  \sqrt{ \sum_{j = 1}^n X_{aj}^2 \sum_{j = 1}^nX_{ij}^2 } }$$

For missing values, set them to default ratings or average ratings. More complex approaches try to predict the missing values.
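
Both weight functions can be sketched in plain Python as below. As a simplifying assumption, this version sums only over items that both users have rated instead of filling in default ratings, and it returns 0 when a denominator is zero. The rating dictionaries are hypothetical.

```python
import math

def pearson_weight(ratings_a, ratings_i):
    """Pearson correlation over commonly rated items, using each user's overall average."""
    common = set(ratings_a) & set(ratings_i)
    n_a = sum(ratings_a.values()) / len(ratings_a)
    n_i = sum(ratings_i.values()) / len(ratings_i)
    num = sum((ratings_a[j] - n_a) * (ratings_i[j] - n_i) for j in common)
    den = math.sqrt(sum((ratings_a[j] - n_a) ** 2 for j in common)
                    * sum((ratings_i[j] - n_i) ** 2 for j in common))
    return num / den if den else 0.0

def cosine_weight(ratings_a, ratings_i):
    """Cosine of the angle between the two raw rating vectors over common items."""
    common = set(ratings_a) & set(ratings_i)
    num = sum(ratings_a[j] * ratings_i[j] for j in common)
    den = math.sqrt(sum(ratings_a[j] ** 2 for j in common)
                    * sum(ratings_i[j] ** 2 for j in common))
    return num / den if den else 0.0

# Hypothetical users
a = {"m1": 5, "m2": 3, "m3": 1}
b = {"m1": 4, "m2": 3, "m3": 2}  # same taste as a, milder ratings
c = {"m1": 1, "m2": 3, "m3": 5}  # opposite taste to a

print(pearson_weight(a, b), pearson_weight(a, c))
print(cosine_weight(a, b), cosine_weight(a, c))
```

Pearson separates `b` and `c` sharply because it centers each user's ratings first, while cosine on raw ratings stays positive for both since all ratings are positive.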

Inverse User Frequency (IUF) adjusts how much each item counts when two users share similar ratings. If an item is popular among many users, agreement on it says little about similarity, but agreement on a rarely rated item is more interesting and should count for more.
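
One common way to turn this idea into a number, by analogy with IDF, is to weight each item by the log of the total number of users divided by the number of users who rated it. The notes do not give a formula, so treat the sketch below as an assumption; the rating data is hypothetical.

```python
import math

def inverse_user_frequency(ratings):
    """Item weight log(M / m_j): M users in total, m_j of whom rated item j (assumed form)."""
    n_users = len(ratings)
    counts = {}
    for user_ratings in ratings.values():
        for item in user_ratings:
            counts[item] = counts.get(item, 0) + 1
    return {item: math.log(n_users / count) for item, count in counts.items()}

# Hypothetical rating data
ratings = {
    "alice": {"m1": 4, "m2": 2},
    "bob":   {"m1": 5, "m2": 3, "m3": 4},
    "carol": {"m1": 2},
}

iuf = inverse_user_frequency(ratings)
print(iuf)
```

An item rated by every user gets weight 0, and the rarest item gets the largest weight. These weights could then scale each item's term inside a similarity function.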