### Week 05 - Feedback in Text Retrieval

##### 5.1 - Feedback in Text Retrieval
Relevance feedback is when users make explicit relevance judgments on the initial results. The feedback module uses these judgments, together with the document collection, to update the query.

Implicit feedback assumes that the documents a user clicks are relevant and the ones they skip are non-relevant. It works by observing how the user interacts with the results, and is used by search engines such as Google and Bing.

Pseudo (also called blind or automatic) feedback assumes that the top-k initial results are relevant and uses that assumption to improve the query: related words are learned from those results and added to the query. It does not involve the user at all.

##### 5.2 - Feedback in Vector Space Model - Rocchio
Modify the query vector by adding new weighted terms and also adjusting weights of old terms.

Picture the query vector as a point surrounded by nearby documents in the vector space, some relevant and some non-relevant. To retrieve more relevant documents, move the query toward the centroid of the relevant documents and away from the centroid of the non-relevant ones.  
$$q_m = \alpha q + \frac{\beta}{\mathbf{|D_r|}} \sum_{d_j \in D_r} d_j - \frac{\gamma}{\mathbf{|D_n|}} \sum_{d_j \in D_n} d_j$$

$\alpha$, $\beta$, and $\gamma$ are parameters that control the movement. $q$ is the original query vector and $q_m$ is the modified query vector. $\alpha$ controls the weight of the original query, $\beta$ controls the influence of the positive (relevant) centroid, and $\gamma$ controls the influence of the negative (non-relevant) centroid.

$D_r$ is the set of relevant documents and $D_n$ is the set of non-relevant documents.

$$\frac{\beta}{\mathbf{|D_r|}} \sum_{d_j \in D_r} d_j$$

is the centroid of the relevant documents, scaled by $\beta$.

$$\frac{\gamma}{\mathbf{|D_n|}} \sum_{d_j \in D_n} d_j$$

is the centroid of the non-relevant documents, scaled by $\gamma$.

We want to truncate the vector so that it keeps only a small number of words, those with the highest weights in the centroid vector.

To avoid overfitting, keep a relatively high weight on the original query terms. Rocchio can be used for both relevance feedback and pseudo feedback. For pseudo feedback, $\beta$ should be set to a smaller value because the relevance of the top-k documents is only assumed and is therefore less reliable than in relevance feedback (which can use a larger value for $\beta$).
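
The Rocchio update can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `rocchio`, the sparse `dict` representation of vectors, the default parameter values, and the clipping of negative weights to zero are all assumptions made for the example. Truncation is applied to the whole modified vector here, which is a simplification.

```python
from collections import defaultdict


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15, top_k=None):
    """Return the modified query vector q_m as a term -> weight dict.

    All vectors are sparse dicts mapping term -> weight.
    The default alpha/beta/gamma values are illustrative only.
    """
    q_m = defaultdict(float)
    # alpha * q: keep the original query terms.
    for term, w in query.items():
        q_m[term] += alpha * w
    # + beta * centroid(D_r) - gamma * centroid(D_n)
    for docs, coeff in ((relevant, beta), (non_relevant, -gamma)):
        if not docs:
            continue  # an empty set contributes no centroid
        for doc in docs:
            for term, w in doc.items():
                q_m[term] += coeff * w / len(docs)
    # Assumption: negative weights are clipped to zero.
    result = {t: w for t, w in q_m.items() if w > 0}
    if top_k is not None:
        # Truncate: keep only the top_k highest-weighted terms.
        kept = sorted(result.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        result = dict(kept)
    return result


query = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "cat": 2.0}, {"jaguar": 1.0, "cat": 1.0, "jungle": 1.0}]
non_relevant = [{"jaguar": 1.0, "car": 3.0}]
new_query = rocchio(query, relevant, non_relevant, top_k=3)
print(new_query)
```

In this toy example the query gains the terms `cat` and `jungle` from the relevant documents, while `car` is pushed below zero by the non-relevant document and dropped.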

##### 5.3 - Feedback in Language Model
Query likelihood cannot naturally support relevance feedback, so we generalize it to Kullback-Leibler (KL) divergence retrieval.

KL-Divergence (Cross Entropy)
$$f(q, d) = \sum_{w \in d, p(w | \hat \theta_Q) > 0} \left[ p(w | \hat \theta_Q) \log \frac{P_{seen}(w | d)}{\alpha_d p(w | C)} \right] + \log \alpha_d$$

Query LM
$$p(w | \hat \theta_Q) = \frac{c(w, Q)}{\mathbf{|Q|}}$$

By plugging the query LM into the KL-divergence function, we recover the query likelihood function from last week. Since the denominator $|Q|$ is a constant for a given query, we can drop it.
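
As a worked version of this step (writing $c(w, Q)$ for the count of $w$ in the query and $|Q|$ for the query length), substituting the query LM into the KL-divergence function gives:

$$f(q, d) = \frac{1}{|Q|} \sum_{w \in d, c(w, Q) > 0} c(w, Q) \log \frac{P_{seen}(w | d)}{\alpha_d p(w | C)} + \log \alpha_d$$

Multiplying by the constant $|Q|$ does not change the ranking and leaves:

$$|Q| \, f(q, d) = \sum_{w \in d, c(w, Q) > 0} c(w, Q) \log \frac{P_{seen}(w | d)}{\alpha_d p(w | C)} + |Q| \log \alpha_d$$

which is the rank-equivalent form of the query likelihood function with smoothing.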

###### Feedback Model Interpretation
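Rank the documents by the divergence $D(\theta_Q || \theta_D)$ between the query model $\theta_Q$ and each document model $\theta_D$ to get initial results. These results provide the feedback documents $F$, from which a feedback model $\theta_F$ is estimated and plugged into the equation: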
$$\theta_Q' = (1 - \alpha) \theta_Q + \alpha \theta_F$$

- If $\alpha = 0$ we get no feedback (the original query model is used unchanged)
- If $\alpha = 1$ we get full feedback (the query model is replaced by the feedback model)
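
The interpolation can be sketched directly in Python. The function name `interpolate` and the toy word distributions are hypothetical, chosen only to show the arithmetic.

```python
def interpolate(theta_q, theta_f, alpha):
    """Return (1 - alpha) * theta_q + alpha * theta_f as a word -> probability dict."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    vocab = set(theta_q) | set(theta_f)
    return {
        w: (1 - alpha) * theta_q.get(w, 0.0) + alpha * theta_f.get(w, 0.0)
        for w in vocab
    }


# Toy models (made-up numbers); each sums to 1.
theta_q = {"airport": 0.5, "security": 0.5}
theta_f = {"security": 0.4, "screening": 0.3, "passenger": 0.3}

updated = interpolate(theta_q, theta_f, alpha=0.5)
print(sorted(updated.items()))
```

Because both inputs are probability distributions, the interpolated model is also a distribution for any $\alpha$ between 0 and 1.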

###### Generative Mixture Model to Get $\theta_F$
The background language model helps the topic word model by absorbing common words such as stopwords. A source selector decides, for each word, which of the two distributions (background or topic word) generates it.
- With probability $\lambda$, the background model $P(w | C)$ is used
	- Generates the common words
	- Helps reduce the probability of common words in the topic word model
	- Used most of the time if $\lambda$ is very large
- With probability $1 - \lambda$, the topic word model $P(w | \theta)$ is used
	- Used most of the time if $\lambda$ is very small
	- Words given a high probability are those that are rare in the background model but common in the feedback documents

$\lambda$ represents the amount of noise in the feedback documents. This is called a mixture model because two distributions are mixed together, and we do not know which distribution generated each word.
$$\log p(F | \theta) = \sum_i \sum_w c(w; d_i) \log[(1 - \lambda) p(w | \theta) + \lambda p(w | C)]$$
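
The notes above do not say how this log-likelihood is maximized. One common choice, assumed here, is the EM algorithm with $\lambda$ and $p(w | C)$ held fixed. The sketch below uses a hypothetical function name `estimate_feedback_model` and made-up toy data, and it assumes the background model assigns a non-zero probability to every word in the feedback documents.

```python
from collections import Counter


def estimate_feedback_model(feedback_docs, background, lam=0.5, iterations=50):
    """Estimate p(w | theta) by EM, with lambda and the background model fixed."""
    # c(w): total count of w over all feedback documents.
    counts = Counter(w for doc in feedback_docs for w in doc)
    total = sum(counts.values())
    # Start from the maximum-likelihood estimate over the feedback documents.
    theta = {w: c / total for w, c in counts.items()}
    for _ in range(iterations):
        # E-step: probability that an occurrence of w came from the topic model.
        z = {
            w: (1 - lam) * theta[w] / ((1 - lam) * theta[w] + lam * background[w])
            for w in counts
        }
        # M-step: re-estimate theta from the counts attributed to the topic model.
        norm = sum(counts[w] * z[w] for w in counts)
        theta = {w: counts[w] * z[w] / norm for w in counts}
    return theta


# Toy data (made up): each document is a list of tokens.
feedback_docs = [
    ["the", "airport", "security", "the"],
    ["the", "security", "screening", "of"],
]
background = {"the": 0.5, "of": 0.3, "airport": 0.05, "security": 0.05, "screening": 0.1}

theta_f = estimate_feedback_model(feedback_docs, background, lam=0.5)
print(sorted(theta_f.items(), key=lambda kv: kv[1], reverse=True))
```

Although `the` is the most frequent word in the toy feedback documents, the background model explains most of its occurrences, so the estimated topic model ranks `security` above it.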

##### 5.4 - Web Search & Web Crawlers
Web search challenges involve:
- Scalability
	- Solved by parallel indexing and searching with MapReduce
- Low quality information and spam
- Dynamics of the web: new pages and page updates appear constantly
	- Both above solved by spam detection and robust ranking

###### Basic Search Engine Technologies - Crawling
A crawler crawls the web to produce cached pages. These cached pages are then indexed into an inverted index, from which relevant information can be retrieved. The user interacts with the inverted index through a browser by issuing queries.
- Crawler: Building a "Toy" Crawler
	- Start with a set of seed pages in a priority queue
	- Fetch pages from the web
	- Parse fetched pages for hyperlinks and add them to the queue
	- Follow hyperlinks in the queue
- A real Crawler is tougher because there are various obstacles to consider
	- Server Failure and Traps
	- Server Load Balancing and Robot Exclusion (Crawler Courtesy)
	- Handling of various file types
	- URL scripts like cgi script, internal references, etc.
	- Recognize redundant pages (identical or duplicate content)
	- Discover hidden URLs (e.g., by truncating a longer URL)
- Strategies
	- Breadth-First to balance server load
	- Parallel Crawling
	- Incremental Crawling
		- Need to minimize resource overhead
		- Learn from past experience 
		- Target at:
			- Frequently updated pages: Daily or monthly depending on how frequently the page updates
			- Frequently accessed pages: Freshness matters more for pages users visit often than for pages users rarely visit
	- Variation
		- Target a subset of pages
		- The subset is typically specified by a query
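
The "toy" crawler steps listed earlier in this section can be sketched as follows. To keep the example self-contained it crawls a hypothetical in-memory web (`FAKE_WEB`) instead of making HTTP requests, and it uses a plain FIFO queue (breadth-first) rather than a priority queue. The names `fetch`, `crawl`, and `LINK_RE` are made up for the illustration, and none of the real-world obstacles (robot exclusion, traps, duplicate detection) are handled.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> HTML. A real crawler would fetch over HTTP.
FAKE_WEB = {
    "http://a.example/": '<a href="http://b.example/">B</a> <a href="http://c.example/">C</a>',
    "http://b.example/": '<a href="http://a.example/">A</a> <a href="http://d.example/">D</a>',
    "http://c.example/": '<a href="http://d.example/">D</a>',
    "http://d.example/": "no links here",
}

# Crude link extraction; a real crawler would use a proper HTML parser.
LINK_RE = re.compile(r'href="([^"]+)"')


def fetch(url):
    """Stand-in for an HTTP fetch; returns None if the page cannot be retrieved."""
    return FAKE_WEB.get(url)


def crawl(seeds, max_pages=100):
    """Breadth-first crawl starting from the seed URLs; returns URL -> cached page."""
    queue = deque(seeds)
    seen = set(seeds)
    cached = {}
    while queue and len(cached) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # fetch failed; skip this page
        cached[url] = html
        # Parse the fetched page for hyperlinks and queue the new ones.
        for link in LINK_RE.findall(html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return cached


pages = crawl(["http://a.example/"])
print(list(pages))
```

The `seen` set is what prevents the crawler from looping forever when pages link back to each other, as `a.example` and `b.example` do here.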

##### 5.5 - Web Indexing 
###### Google File System
Uses a centralized management system (the GFS master) to manage the file namespace and a lookup table of file locations. Files are stored as chunks that are replicated across chunk servers to ensure reliability. Once a client has obtained a chunk's location from the master, it communicates directly with the chunk servers that actually hold the data.

###### MapReduce
Supports parallel processing with fault tolerance and automatic load balancing. In the word-count example, the input keys are the document IDs and the values are the strings representing the documents. Map breaks each string into tokens and emits one pair per token. Reduce then counts how many occurrences there are of each unique token, so the output keys are the words and the values are the word frequencies.
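
The word-count flow can be simulated in a single Python process. This sketch only mirrors the data flow (map, group by key, reduce); it has no parallelism or fault tolerance, and the names `map_fn`, `reduce_fn`, and `map_reduce` are assumptions for the example rather than any framework's API.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: (document ID, document text) -> list of (word, 1) pairs."""
    return [(token, 1) for token in text.lower().split()]


def reduce_fn(word, counts):
    """Reduce: (word, [1, 1, ...]) -> (word, total count)."""
    return word, sum(counts)


def map_reduce(documents):
    # Shuffle step: group all intermediate values by their key (the word).
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for word, one in map_fn(doc_id, text):
            grouped[word].append(one)
    # Reduce each group independently.
    return dict(reduce_fn(word, counts) for word, counts in grouped.items())


docs = {"d1": "hello world", "d2": "hello map reduce world hello"}
word_counts = map_reduce(docs)
print(word_counts)
```

Because each call to `map_fn` and `reduce_fn` depends only on its own inputs, a real framework can run them on different machines and rerun any that fail.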

##### 5.6 - Link Analysis
Hub pages are pages with many outgoing links, and authority pages are pages with many incoming links.

Anchor text describes the page a link points to, so it serves as additional text for that document that a query can match. Links also indicate the utility of a document.
- A hub page has many outgoing links
- An authority page has many incoming links

We are checking for the probability that a page is a hub or an authority page. Indirect Citation is when a page is cited by another highly cited source. This would boost the authority of a page. Smoothing of citations assumes every page to have non-zero pseudo citation count.

###### PageRank Algorithm (Capture a Page's Authority)
Random surfing model: At any page
- With prob $\alpha$, randomly jumping to another page
- With prob $(1 - \alpha)$, randomly picking a link to follow

The web can be represented as a directed graph whose nodes are the pages $d_1, \dots, d_N$. This graph can in turn be represented as a matrix $M$, where $M_{ij}$ is the probability of following a link from $d_i$ to $d_j$ (and $M_{ij} = 0$ if $d_i$ does not link to $d_j$). Each row of $M$ sums to 1:
$$\sum_{j = 1}^N M_{ij} = 1$$

$p(d_i)$: PageRank score of $d_i$ = average probability of visiting page $d_i$

The Equilibrium Equation:
$$p_{t + 1}(d_j) = (1 - \alpha) \sum_{i = 1}^N M_{ij} p_t(d_i) + \alpha \sum_{i = 1}^N \frac{1}{N} p_t(d_i)$$
- $p_{t + 1}(d_j)$ is the probability of visiting page $d_j$ at time $t + 1$
- $p_t(d_i)$ is the probability that the surfer was at page $d_i$ at time $t$
	- $(1 - \alpha) \sum_{i = 1}^N M_{ij} p_t(d_i)$ is the probability of reaching $d_j$ by following a link
	- $\alpha \sum_{i = 1}^N \frac{1}{N} p_t(d_i)$ is the probability of reaching $d_j$ via random jumping. This is a smoothing mechanism that ensures non-zero entries.

$M$ is the transition matrix. Row $i$ of $M$ represents the current page $d_i$ and holds the probabilities of moving to each page it links to, so the row sums to 1. Column $j$ represents the destination page $d_j$. The diagonal would be 0, for example, if none of the pages linked to themselves.

$M_{ij}$ is the transition probability from $d_i$ to $d_j$, so $M_{ij} p_t(d_i)$ is the probability of being at $d_i$ at time $t$ and then following a link to $d_j$.

After dropping the time index we get:
$$p(d_j) = \sum_{i = 1}^N [\frac{1}{N} \alpha + (1 - \alpha)M_{ij}] p(d_i)$$
$$\bar p = (\alpha I + (1 - \alpha) M)^T \bar p$$

where $I_{ij} = \frac{1}{N}$ for all $i, j$ and $^T$ denotes the transpose. Start from the initial value $p(d) = \frac{1}{N}$ and iteratively update the vector $\bar p$ of page scores via matrix multiplication until convergence.

Computation is efficient because $M$ is typically sparse, and normalization does not affect the ranking. One issue is the zero-outlink problem: for a page with no outlinks the row of $M$ is all zeros, so the $p(d_i)$ values no longer sum to 1. One option is to set $\alpha = 1$ for a page with no outlinks.
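
The iterative update can be sketched in plain Python without building the full matrix. This is a minimal illustration under stated assumptions: the function name `pagerank`, the adjacency-list input format, the value $\alpha = 0.15$, and the stopping rule are choices made for the example, and each page's link list is assumed to contain no duplicates. Pages with no outlinks are handled by the $\alpha = 1$ option mentioned above.

```python
def pagerank(links, alpha=0.15, tol=1e-10, max_iter=1000):
    """links maps each page to the list of pages it links to. Returns page -> score."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    p = {d: 1.0 / n for d in pages}  # initial value p(d) = 1/N
    for _ in range(max_iter):
        new_p = {d: 0.0 for d in pages}
        jump_mass = 0.0  # total probability that is spread uniformly by random jumps
        for d_i in pages:
            out = links.get(d_i, [])
            if out:
                jump_mass += alpha * p[d_i]
                # Following a link: M_ij = 1 / (number of outlinks of d_i).
                share = (1 - alpha) * p[d_i] / len(out)
                for d_j in out:
                    new_p[d_j] += share
            else:
                # Zero-outlink page: behave as if alpha = 1 (always jump).
                jump_mass += p[d_i]
        for d_j in pages:
            new_p[d_j] += jump_mass / n
        change = sum(abs(new_p[d] - p[d]) for d in pages)
        p = new_p
        if change < tol:
            break
    return p


# Toy graph (made up): c is linked to by a, b, and d; nothing links to d.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(links)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

Accumulating the random-jump mass once per iteration and then spreading it evenly avoids touching every pair of pages, which is how the sparsity of $M$ keeps each iteration cheap.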

###### HITS (Hypertext-Induced Topic Search) Algorithm - Capture Authorities and Hubs
The intuition is that pages that are widely cited are good authorities and pages that cite many other pages are good hubs. The idea of HITS is to say that good authorities are cited by good hubs and good hubs point to good authorities. Essentially iterative reinforcement.

Process:
- Build an adjacency matrix and set the initial values $a(d_i) = h(d_i) = 1$
- Iteratively: ($h$ = Hub Score and $a$ = Authority score)
	- $h(d_i) = \sum_{d_j \in OUT(d_i)} a(d_j)$
	- $a(d_i) = \sum_{d_j \in IN (d_i)} h(d_j)$
- Normalize: $\sum a(d_i)^2 = \sum h(d_i)^2 = 1$
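
The process above can be sketched in Python using adjacency lists instead of an explicit matrix. The function name `hits`, the fixed iteration count, and the toy graph are assumptions for the example. Within each iteration the hub scores are updated first and the authority scores are then computed from the new hub scores, which is one of several possible update orders.

```python
import math


def hits(links, iterations=50):
    """links maps each page to the list of pages it links to. Returns (a, h)."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    a = {d: 1.0 for d in pages}  # authority scores
    h = {d: 1.0 for d in pages}  # hub scores
    for _ in range(iterations):
        # h(d_i): sum of the authority scores of the pages d_i points to.
        h = {d: sum(a[t] for t in links.get(d, [])) for d in pages}
        # a(d_i): sum of the hub scores of the pages pointing to d_i.
        new_a = {d: 0.0 for d in pages}
        for src, targets in links.items():
            for t in targets:
                new_a[t] += h[src]
        a = new_a
        # Normalize so that the squared scores sum to 1.
        for scores in (a, h):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for d in scores:
                scores[d] /= norm
    return a, h


# Toy graph (made up): three hub-like pages pointing at two authority-like pages.
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
auth, hub = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In this graph `a1` ends up with the highest authority score because every hub points to it, and `h1` and `h2` outrank `h3` as hubs because they point to both authorities.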