# Week 5 Overview

During this week's lessons, you will learn feedback techniques in information retrieval, including the Rocchio feedback method for the vector space model, and a mixture model for feedback with language models. You will also learn how web search engines work, including web crawling, web indexing, and how links between web pages can be leveraged to score web pages.

## Key Phrases and Concepts

- Relevance feedback
- Pseudo-relevance feedback
- Implicit feedback
- Rocchio feedback
- Kullback-Leibler divergence (KL-divergence) retrieval function
- Mixture language model
- Scalability and efficiency
- Spam
- Crawler, focused crawling, and incremental crawling
- Google File System (GFS)
- MapReduce
- Link analysis and anchor text
- PageRank and HITS 

## Goals and Objectives

* Explain the similarity and differences in the three different kinds of feedback, i.e., relevance feedback, pseudo-relevance feedback, and implicit feedback.
* Explain how the Rocchio feedback algorithm works.
* Explain how the Kullback-Leibler (KL) divergence retrieval function generalizes the query likelihood retrieval function.
* Explain the basic idea of using a mixture model for feedback.
* Explain some of the main general challenges in creating a web search engine.
* Explain what a web crawler is and what factors have to be considered when designing a web crawler.
* Explain the basic idea of Google File System (GFS).
* Explain the basic idea of MapReduce and how we can use it to build an inverted index in parallel.
* Explain how links on the web can be leveraged to improve search results.
* Explain how PageRank and HITS algorithms work.

## Guiding Questions

### Explain the similarity and differences in the three different kinds of feedback, i.e., relevance feedback, pseudo-relevance feedback, and implicit feedback.

**Feedback** is any additional input that lets a search engine learn how to improve the relevance of the documents it retrieves.

![Feedback](images/feedback.png)

- **Relevance Feedback**: Also known as explicit feedback. The user explicitly judges the retrieved documents as relevant or non-relevant, and the search engine learns from these judgments.
- **Pseudo Relevance Feedback**: Also known as blind or automatic feedback. The search engine simply assumes that *the top k retrieved documents are relevant* and learns from them, so no user judgment is needed.
- **Implicit Feedback**: In contrast to *relevance feedback*, the user does not judge anything explicitly. The search engine observes user activity and infers which documents the user selected as relevant and which were skipped as non-relevant. A simple implementation uses *user clickthroughs*: clicked documents are assumed to be relevant and skipped documents are assumed to be non-relevant.

### Explain how the Rocchio feedback algorithm works.

Let:

- $D_r$ is the set of relevant document vectors, such as $\{\vec{d_1}, ..., \vec{d_{|D_r|}}\}$
- $D_n$ is the set of non-relevant document vectors, such as $\{\vec{d_1}, ..., \vec{d_{|D_n|}}\}$
- $\vec{q}$ is the initial query vector.
- $\vec{q}_m$ is the modified query vector.

We know that:

- For illustration, all vectors are projected into a two-dimensional space.
- The vectors in $D_r$ may be spread over the space, but they tend to group together around a **relevant centroid $C_r$**.
- The vectors in $D_n$ may be spread over the space, but they tend to group together around a **non-relevant centroid $C_n$**.

Then:

- The Rocchio feedback algorithm produces $\vec{q}_m$ by moving $\vec{q}$ toward $C_r$ and away from $C_n$.
- The basic Rocchio feedback algorithm does this with a linear combination of the query and the two centroids:

$$\vec{q}_m = \alpha * \vec{q} + \frac{\beta}{|D_r|} * \sum\limits_{\vec{d_j} \in D_r} \vec{d_j} - \frac{\gamma}{|D_n|} * \sum\limits_{\vec{d_j} \in D_n} \vec{d_j}$$
- $\alpha$, $\beta$, and $\gamma$ are weights that control how strongly the original query, the relevant documents, and the non-relevant documents influence $\vec{q}_m$.


---

**Example**

Let:

- $V$ is the vocabulary currently used by the search engine, in a fixed order.
- Every query and document is a fixed-length term vector with one weight per term in $V$.
- $\vec{q}$ holds the weight of each term in the query, such that $\vec{q} = \{t_1, ..., t_n \ | \ t_i \in [0,1]\}$

Assume:

- $V = \{\text{text}, \text{mining}, \text{algorithm}, \text{information}, \text{retrieval}\}$
- $\vec{q} = \{1, 1, 1, 0, 0\}$
- We are given five feedback documents with their term weights, where + indicates relevant and - indicates non-relevant:

| relevant | document | $\{ text, mining, algorithm, information, retrieval \}$ |
|----------|----------|-------------------------------------------------------|
| - | $d_1$ | $\{0.2, 0.2, 2.0, 1.5, 1.0\}$ |
| - | $d_2$ | $\{0.2, 0.2, 1.5, 1.0, 1.0\}$ |
| + | $d_3$ | $\{1.5, 1.0, 0.5, 0.5, 0.5\}$ |
| + | $d_4$ | $\{1.5, 1.5, 0.5, 0.2, 0.5\}$ |
| + | $d_5$ | $\{1.5, 1.5, 0.5, 0.2, 0.2\}$ |

Then:

- The centroid of the relevant documents $C_r$ and of the non-relevant documents $C_n$ is the component-wise average of the document vectors:

| relevant | centroid | $\{ text, mining, algorithm, information, retrieval \}$ |
|----------|----------|-------------------------------------------------------|
| + | $C_r$ | $\big\{ \frac{1.5+1.5+1.5}{3}, \frac{1.0+1.5+1.5}{3}, \frac{0.5+0.5+0.5}{3}, \frac{0.5+0.2+0.2}{3}, \frac{0.5+0.5+0.2}{3} \big\} = \{1.5, 1.333, 0.5, 0.3, 0.4\}$ |
| - | $C_n$ | $\big\{ \frac{0.2+0.2}{2}, \frac{0.2+0.2}{2}, \frac{2.0+1.5}{2}, \frac{1.5+1.0}{2}, \frac{1.0+1.0}{2} \big\} = \{0.2, 0.2, 1.75, 1.25, 1.0\}$ |

- Then compute $\vec{q}_m$ using Rocchio relevance feedback. Since $\vec{q} = \{1, 1, 1, 0, 0\}$, the $\alpha$ term only appears in the first three components:

$\eqalign{
    \vec{q}_m &= \alpha * \vec{q} + \beta * C_r - \gamma * C_n \\
              &= \{ \alpha + 1.5 * \beta - 0.2 * \gamma, \alpha + 1.333 * \beta - 0.2 * \gamma, \alpha + 0.5 * \beta - 1.75 * \gamma, 0.3 * \beta - 1.25 * \gamma, 0.4 * \beta - 1.0 * \gamma \}
}$
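
The example can also be checked numerically. The following is a minimal sketch that computes both centroids with `np.mean` and applies the Rocchio formula; the function name `rocchio` and the weights $\alpha = 1.0$, $\beta = 0.75$, $\gamma = 0.15$ are illustrative assumptions, not values given in the lecture.

```python
import numpy as np

def rocchio(q, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return alpha*q + beta*centroid(relevant) - gamma*centroid(non_relevant)."""
    c_r = np.mean(relevant, axis=0)      # centroid of relevant documents
    c_n = np.mean(non_relevant, axis=0)  # centroid of non-relevant documents
    return alpha * q + beta * c_r - gamma * c_n

# Term order: text, mining, algorithm, information, retrieval
q = np.array([1.0, 1.0, 1.0, 0.0, 0.0])
relevant = np.array([
    [1.5, 1.0, 0.5, 0.5, 0.5],  # d3
    [1.5, 1.5, 0.5, 0.2, 0.5],  # d4
    [1.5, 1.5, 0.5, 0.2, 0.2],  # d5
])
non_relevant = np.array([
    [0.2, 0.2, 2.0, 1.5, 1.0],  # d1
    [0.2, 0.2, 1.5, 1.0, 1.0],  # d2
])

q_m = rocchio(q, relevant, non_relevant)
print(np.round(q_m, 4))
```

With these weights the terms "text" and "mining" gain weight, while "information" and "retrieval" enter the query with small positive weights because the relevant documents contain them.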

### Explain how the Kullback-Leibler (KL) divergence retrieval function generalizes the query likelihood retrieval function.

Let:

- The query likelihood retrieval function (with smoothing) is a sum of TF-IDF-like weights over the words matched in both the document and the query, plus a document length normalization term, where $n$ is the query length:

$f(q,d) = \sum\limits_{w \in d, w \in q} c(w,q) \big[ \log \frac{P_{seen}(w|d)}{\alpha_d P(w|C)} \big] + n \ \log \ \alpha_d$

- The KL-divergence retrieval function measures the divergence between two distributions, the document language model and the query language model $P(w|\hat{\theta}_Q)$:

$f(q,d) = \sum\limits_{w \in d,P(w|\hat{\theta}_Q)>0} P(w|\hat{\theta}_Q) \ \log \frac{P_{seen}(w|d)}{\alpha_d P(w|C)} + \log \ \alpha_d$

Where:

- The document language model is the same smoothed model that query likelihood uses.

- The query language model $\hat{\theta}_Q$ is estimated from the query. Without feedback, it is the relative frequency of each word in the query:

$P(w|\hat{\theta}_Q) = \frac{c(w,Q)}{|Q|}$

Then, **KL-divergence is a generalization of the query likelihood model** because:

- With $P(w|\hat{\theta}_Q) = \frac{c(w,Q)}{|Q|}$ and $n = |Q|$, the KL-divergence function is the query likelihood function divided by the constant $|Q|$, so both rank documents identically. Query likelihood is the special case in which the query language model is the relative frequency of words in the query, not in the collection.
- The KL-divergence function accepts any other query language model as well, so a feedback language model can be used to learn a new weighted query $\hat{\theta}_Q$ from the feedback given by users.
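
The rank equivalence can be illustrated with a small sketch. It assumes Jelinek-Mercer smoothing with weight `lam` (the notes do not fix a smoothing method) and uses the plain sum over query words rather than the "seen words" decomposition above; the toy documents and all function names are hypothetical.

```python
import math
from collections import Counter

def smoothed_doc_model(doc, collection_model, lam=0.5):
    """Jelinek-Mercer smoothing: p(w|d) = (1-lam)*c(w,d)/|d| + lam*p(w|C)."""
    counts = Counter(doc)
    return {w: (1 - lam) * counts[w] / len(doc) + lam * p_c
            for w, p_c in collection_model.items()}

def query_likelihood(query, doc_model):
    """log p(q|d) = sum over query words of c(w,q) * log p(w|d)."""
    return sum(c * math.log(doc_model[w]) for w, c in Counter(query).items())

def kl_score(query_model, doc_model):
    """Rank-equivalent form of negative KL-divergence: sum_w p(w|theta_Q) * log p(w|d)."""
    return sum(p * math.log(doc_model[w]) for w, p in query_model.items() if p > 0)

docs = {"d1": "text mining text algorithm".split(),
        "d2": "information retrieval text".split()}
all_words = [w for d in docs.values() for w in d]
collection_model = {w: c / len(all_words) for w, c in Counter(all_words).items()}

query = "text mining".split()
# Query model without feedback: relative frequency of each word in the query
query_model = {w: c / len(query) for w, c in Counter(query).items()}

for name, doc in docs.items():
    dm = smoothed_doc_model(doc, collection_model)
    ql = query_likelihood(query, dm)
    kl = kl_score(query_model, dm)
    # The two scores differ only by the constant factor |Q|
    print(name, round(ql, 6), round(kl * len(query), 6))
```

Replacing `query_model` with any other distribution (for example one learned from feedback documents) changes the KL score but has no counterpart in plain query likelihood, which is the sense in which KL-divergence is more general.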

---

The feedback language model is combined with the query model by linear interpolation, which works similarly to Rocchio feedback:

$$\theta_Q' = (1 - \alpha) \ \theta_Q + \alpha \ \theta_F$$

- $D(.||.)$ is the simplified notation for KL-divergence.
- $\alpha \in [0,1]$ is the parameter that controls the strength of the feedback documents, such that:
$\alpha = \begin{cases}
0 \ \text{means no feedback},\\
1 \ \text{means full feedback}
\end{cases}$

Since neither of these two extremes is desirable, we expect that $\alpha \in (0,1)$.

- $\theta_F$ is the feedback language model.

![Model based feedback](images/model-based-feedback.png)
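
The interpolation step can be sketched in a few lines. The function name `interpolate` and the feedback model `theta_f` below are hypothetical; in practice $\theta_F$ would be estimated from the feedback documents.

```python
def interpolate(query_model, feedback_model, alpha):
    """Return (1 - alpha) * theta_Q + alpha * theta_F over the union of both vocabularies."""
    words = set(query_model) | set(feedback_model)
    return {w: (1 - alpha) * query_model.get(w, 0.0) + alpha * feedback_model.get(w, 0.0)
            for w in words}

theta_q = {"text": 0.5, "mining": 0.5}
theta_f = {"text": 0.4, "mining": 0.3, "clustering": 0.3}  # hypothetical feedback model

updated = interpolate(theta_q, theta_f, alpha=0.5)
print(updated)
```

With `alpha=0` the result is the original query model, and with `alpha=1` it is the feedback model alone; values in between expand the query with new words such as "clustering" while keeping the original emphasis.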

### Explain the basic idea of using a mixture model for feedback.

Let:

- $\lambda \in [0,1]$ is the mixing parameter of two distributions $P_1$ and $P_2$, weighted as $\lambda P_1$ and $(1 - \lambda) P_2$.
- $P(w|C)$ is the background model, the first distribution.
- $P(w|\theta)$ is the topic model, the second distribution.
- $F$ is the set of feedback documents.

We want:

- To model every word in $F$ as drawn from the mixture of the two distributions: $(1-\lambda)P(w|\theta)+\lambda P(w|C)$.
- To let the background model absorb the common, non-topical words in $F$, so that $\theta$ concentrates on topical words. Here $\lambda$ is the assumed proportion of noise in the feedback documents.

Assume:

- $\lambda$ is fixed in advance to a single value and is not estimated.

Then:

- The word probabilities $\theta$ are fitted to the feedback documents $F$, similar to a unigram language model.
- We need several iterations of the EM algorithm to find the $\theta$ that maximizes the likelihood of $F$ for the given $\lambda$.
- The feedback language model is the solution of an optimization problem over two distributions controlled by the parameter $\lambda$:

$$\eqalign{
    \theta_F &= \arg\max_{\theta} \ \log \ P(F|\theta)\\
             &= \arg\max_{\theta} \ \sum\limits_{d \in F} \sum\limits_{w} \ c(w,d) * \log[(1-\lambda) * P(w|\theta) + \lambda * P(w|C)]
}$$

![Mixture language model](images/mixture-language-model.png)
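
The EM estimation can be sketched as follows. This is a minimal illustration under stated assumptions: $\lambda$ is fixed at 0.5, every feedback word has a nonzero background probability, and the toy documents, the background model, and the name `mixture_feedback` are all hypothetical.

```python
from collections import Counter

def mixture_feedback(feedback_docs, background, lam=0.5, iterations=50):
    """Estimate p(w|theta_F) with EM, keeping the background weight lam fixed."""
    counts = Counter(w for doc in feedback_docs for w in doc)
    total = sum(counts.values())
    theta = {w: c / total for w, c in counts.items()}  # start from the MLE
    for _ in range(iterations):
        # E-step: probability that an occurrence of w came from the topic model
        z = {w: (1 - lam) * theta[w] / ((1 - lam) * theta[w] + lam * background[w])
             for w in counts}
        # M-step: re-estimate theta from the counts attributed to the topic model
        norm = sum(counts[w] * z[w] for w in counts)
        theta = {w: counts[w] * z[w] / norm for w in counts}
    return theta

feedback_docs = [["the", "text", "mining", "the"],
                 ["the", "text", "the", "retrieval"]]
background = {"the": 0.7, "text": 0.1, "mining": 0.1, "retrieval": 0.1}

theta_f = mixture_feedback(feedback_docs, background)
for word, prob in sorted(theta_f.items(), key=lambda item: -item[1]):
    print(word, round(prob, 3))
```

Because the background model already explains most occurrences of "the", the estimated topic model gives it less probability than its raw relative frequency and shifts that mass to topical words such as "text".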

### Explain some of the main general challenges in creating a web search engine.

- **Scalability**
    - Question:
        - How do we store and index data at web scale?
        - How do we serve many users quickly?
    - Answer:
        - Parallelize indexing and searching, for example with **Google's MapReduce framework**.
- **Low quality information and spam**
    - Question:
        - How do we prevent low-quality information, such as repeated text, from getting a high score?
        - How do we prevent spam from getting a high ranking score?
        - How do we identify new kinds of spam?
    - Answer:
        - Use a wide variety of signals to rank pages.
- **Dynamics of the web**:
    - Question:
        - How do we prioritize links: which pages need to be refreshed frequently, and which are rarely updated?
        - How do we crawl dynamic content whose layout differs between versions of a page (for example desktop and mobile)?
        - How do we keep the crawler from going back to links it has already crawled?
    - Answer:
        - Implement **link analysis**, which improves search results by leveraging extra information about the networked nature of the web.
        - Use multiple features for ranking, such as web page layout and anchor text.

### Explain what a web crawler is and what factors have to be considered when designing a web crawler.

A crawler, also called a spider, is a program that crawls the web: it traverses links, downloads pages, and parses them to find new links.

Factors that have to be considered when designing a web crawler:

- Avoid sending heavy request loads to any one server, which could amount to a *denial of service*. For example: crawl breadth-first so that requests are spread across many servers.
- Optimize crawling speed and maximize throughput. For example: use parallel algorithms and a distributed crawler.
- Implement focused crawling where possible. For example: crawl only web pages that match a specific topic, typically described by a query.
- Build effective graph algorithms to improve the crawler's ability to discover new links. For example: find new pages on a web site that are not linked from any previously crawled page.
- Use learning algorithms so that the crawler behaves sensibly, is easy to maintain, and learns from experience. For example: avoid unnecessary crawling by not re-crawling pages that are still fresh; choose the crawl interval per web site, such as daily for news sites and hourly for social media; and keep popular web sites fresh by prioritizing them, for example by analyzing popularity with the HITS algorithm.
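
The breadth-first strategy mentioned above can be sketched as a queue-based traversal. To keep the example runnable without network access, it crawls a hypothetical in-memory "web" (`toy_web`); the function `crawl` and its `fetch_links` callback are illustrative names, and a real crawler would also need politeness delays, robots.txt handling, and error handling.

```python
from collections import deque

def crawl(seed, fetch_links, max_pages=100):
    """Breadth-first crawl: visit pages level by level, never fetching a URL twice."""
    frontier = deque([seed])  # URLs waiting to be fetched, in discovery order
    seen = {seed}             # URLs already queued, to avoid revisiting old links
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

# Hypothetical in-memory "web": each URL maps to the links found on that page.
toy_web = {
    "a.example/": ["a.example/1", "b.example/"],
    "a.example/1": ["a.example/"],
    "b.example/": ["b.example/1"],
    "b.example/1": [],
}

visited = crawl("a.example/", lambda url: toy_web.get(url, []))
print(visited)
```

The `seen` set is what prevents the crawler from looping back to old links, and the FIFO queue is what makes the traversal breadth-first.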

### Explain the basic idea of Google File System (GFS).

GFS is a distributed file system that runs on large clusters of commodity hardware. GFS divides a big file into several fixed-size chunks of 64 MB. Because every chunk has a fixed size, a file that grows past its last chunk is given a newly allocated chunk. A GFS cluster consists of one **Master Node** and a large number of **Chunk Servers**.

As a simplified illustration, think of the data as `(key, value)` tuples, where `key` is a unique identifier and `value` is the data itself. A group of `(key, value)` tuples forms a chunk, and each chunk is stored on a chunk server. For example, with range-based chunking: if there are 1,000,000 keys, a quarter of all `(key, value)` tuples requires a 64 MB allocation, and four chunk servers are connected, then each chunk server stores 250,000 `(key, value)` tuples.

The master node assigns each chunk a unique 64-bit label and maintains the logical mapping of files to their constituent chunks. To ensure reliability, each chunk is replicated on several separate chunk servers. The replication factor of a chunk may vary with demand, but the default is three. The master node does not store the actual chunks; it stores all the metadata associated with them. The master node and the chunk servers exchange heartbeat messages to keep all metadata up to date.

![Google FS](images/google fs1.png)

For more information, please read [Ghemawat, S., Gobioff, H., and Leung, S.-T. The Google File System](https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf).
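
The chunking and replication idea can be illustrated with a toy sketch. The round-robin placement, the server names, and the function `place_chunks` are assumptions made for this example only; they are not the placement policy described in the GFS paper.

```python
import math

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed chunk size

def place_chunks(file_size, servers, replicas=3):
    """Split a file into fixed-size chunks and assign each chunk to `replicas` distinct servers."""
    n_chunks = math.ceil(file_size / CHUNK_SIZE)
    placement = {}
    for chunk in range(n_chunks):
        # Round-robin placement: an assumption of this sketch, not the real GFS policy.
        # Replicas are distinct as long as replicas <= len(servers).
        placement[chunk] = [servers[(chunk + r) % len(servers)] for r in range(replicas)]
    return placement

servers = ["cs0", "cs1", "cs2", "cs3"]          # hypothetical chunk servers
placement = place_chunks(200 * 1024 * 1024, servers)  # a 200 MB file
for chunk, holders in placement.items():
    print(chunk, holders)
```

The dictionary returned here plays the role of the master node's metadata: it records which chunk servers hold each chunk, while the chunk contents themselves would live only on the chunk servers.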

### Explain the basic idea of MapReduce and how we can use it to build an inverted index in parallel.

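MapReduce is a framework for processing large data sets with parallel programming. A MapReduce program consists of a `Map()` function and a `Reduce()` function. `Map()` is executed first; the framework then groups its output by key and feeds each group to `Reduce()`. Each function is described below:

- **Map()** takes an input `(key, value)` pair and emits a list of intermediate `(key, value)` pairs. The framework collects, sorts, and groups these pairs by key.
- **Reduce()** takes one key together with all the values emitted for it and transforms them, for example by filtering, grouping, or aggregating, into the final output.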

![Map Reduce](images/map reduce.png)

We can build an inverted index with MapReduce by writing our indexing algorithm as `Map()` and `Reduce()` functions. The framework executes `Map()` on many documents in parallel. Here is an example:

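In [5]:
class Mapper:
    @staticmethod
    def Map(docid, doc):
        """Count the terms of one document. The framework runs many Map() calls in parallel."""
        H = {}
        for term in doc:
            if term in H:
                H[term] = H[term] + 1
            else:
                H[term] = 1
        for term, count in H.items():
            Emit(term, (docid, count))

class Reducer:
    @staticmethod
    def reduce(term, postings):
        """Sort all postings collected for one term and emit the final posting list."""
        posting = []
        for _posting in postings:
            posting.append(_posting)
        posting.sort()
        Emit(term, posting)

def Emit(term, posting):
    """Hand one output (key, value) pair to the framework."""
    DB.push(term, posting)

class DB:
    """In-memory stand-in for the storage behind the framework."""
    records = []

    @staticmethod
    def pull():
        """Return all (key, value) pairs emitted so far"""
        return DB.records

    @staticmethod
    def push(term, posting):
        """Store one emitted (key, value) pair"""
        DB.records.append((term, posting))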
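
To see the three phases end to end, the following self-contained sketch simulates map, shuffle, and reduce sequentially on one machine. The function names and the two toy documents are hypothetical; in a real MapReduce job the framework performs the shuffle and runs the map and reduce calls on different workers.

```python
from collections import Counter, defaultdict

def map_doc(docid, text):
    """Map: emit one (term, (docid, count)) pair per distinct term in the document."""
    return [(term, (docid, count)) for term, count in Counter(text.split()).items()]

def reduce_term(term, postings):
    """Reduce: sort the postings of one term by docid."""
    return term, sorted(postings)

def build_inverted_index(docs):
    # Map phase: each call is independent, so a framework could run them in parallel.
    intermediate = [pair for docid, text in docs.items() for pair in map_doc(docid, text)]
    # Shuffle phase: group intermediate values by key (done by the framework in MapReduce).
    grouped = defaultdict(list)
    for term, posting in intermediate:
        grouped[term].append(posting)
    # Reduce phase: one call per term.
    return dict(reduce_term(term, postings) for term, postings in grouped.items())

docs = {1: "text mining text", 2: "text retrieval"}
index = build_inverted_index(docs)
for term in sorted(index):
    print(term, index[term])
```

Each posting is a `(docid, count)` pair, so the posting list of "text" records that it occurs twice in document 1 and once in document 2.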

### Explain how links on the web can be leveraged to improve search results.

In a directed graph (digraph), every node has two types of links:

- **Outlink**: an edge that points from the node to another node.
- **Inlink**: an edge that points from another node to the node.

Link analysis distinguishes two types of web pages:

- **Hub** page: a web page that has many *outlinks*. A hub page is a gateway for exploring and discovering more links, since it points to many other pages.
- **Authority** page: a web page that receives many *inlinks*. An authority page is an important web page, since many other pages point to it.

![Link analysis](images/link analysis.png)

### Explain how PageRank and HITS algorithms work.

**PageRank** is the algorithm used by Google for link analysis. PageRank works on an **edge-weighted digraph** of web pages and takes both outlinks and inlinks into account. The idea behind it is that *the popularity of a web page can be quantified by counting its inlinks: the more inlinks, the more popular the web page*. PageRank measures the popularity of a web page not only by counting its own inlinks, but by recursively accumulating the popularity of the pages that link to it. So, *the popularity of web page $d_i$ is inherited by every web page $d_j$ that it links to*.

The **Random Surfer Model** is the original model adopted by Sergey Brin and Larry Page in the development of the PageRank algorithm. It quantifies two assumptions about user behavior: a user either keeps following links, or gets distracted and jumps to a random web page that is not directly connected to the current one. This model is similar to a **Markov chain**. A good visualization of Markov chains can be found [here](http://setosa.io/ev/markov-chains/). The formal definition is given below:

---

Let:

- $M$ is the transition matrix.
- $M$ represents an edge-weighted digraph.
- Row $i$ of $M$ corresponds to web page $d_i$ and column $j$ corresponds to web page $d_j$.
- The weight of an edge is the probability of moving from one node to the other. Thus, $M$ is also a stochastic matrix, where each row of $M$ sums to one:

$$M_{ij} = \text{Probability of going from } d_i \text{ to } d_j$$

$$\sum\limits_{j=1}^N M_{ij} = 1$$

Suppose:

- A simple graph of web pages:

![simple graph](images/simple-graph.png)

- Then the transition matrix $M$ is:

$$M = \begin{bmatrix}
    0 & 0 & 1/2 & 1/2 \\
    1 & 0 & 0 & 0 \\
    0 & 1 & 0 & 0 \\
    1/2 & 1/2 & 0 & 0 \\
\end{bmatrix}$$

Assume:

- Users may keep following links from the current web page $d_i$ to a connected web page $d_j$.
- Users may get distracted and decide to go to another web page $d_j$ that is not directly connected to the current web page $d_i$.
- The random jump is uniformly distributed, so all web pages have an equal probability of being visited. The probability of jumping to web page $d_j$ is therefore $1/N$, where $N$ is the number of web pages.

We know that:

- Some web pages may receive new incoming links, so the transition matrix $M$ may be updated over time.

Then:

- The probability of a user visiting page $d_j$ at time $t+1$ is a linear combination of two possibilities: the user follows a link from some page $d_i$, or the user jumps randomly to page $d_j$:

$$\eqalign{
    p_{t+1} (d_j) &= \text{probability of following links} + \text{probability of random jumping}\\
    p_{t+1} (d_j) &= (1 - \alpha) \sum\limits_{i=1}^N M_{ij} p_t (d_i) + \alpha \sum\limits_{i=1}^N \frac{1}{N} p_t (d_i)
}$$

- $\alpha$ is the probability that the user chooses a random jump. In Sergey Brin and Larry Page's paper [The Anatomy of a Large-Scale Hypertextual Web Search Engine](https://web.archive.org/web/20150927004511/http://infolab.stanford.edu/pub/papers/google.pdf), the **damping factor** is the probability of following a link and is usually set to 0.85, which corresponds to $\alpha = 0.15$ here.
- $p_t (d_i)$ is the unknown probability of the user being at web page $d_i$ at the current time $t$.
- When $p_{t+1}(d_j) = p_t(d_j)$ for every page, the update rule becomes the **equilibrium equation**, with one linear equation per page $d_j$.
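
The update rule can be iterated directly on the example matrix $M$. The sketch below is an illustration under stated assumptions: the random-jump probability is set to `alpha=0.15` (links are followed with probability 0.85), the starting distribution is uniform, and the name `pagerank` and the stopping tolerance are choices made for this example.

```python
import numpy as np

def pagerank(M, alpha=0.15, tol=1e-10, max_iter=1000):
    """Iterate p_{t+1} = (1 - alpha) * M^T p_t + alpha / N until p stops changing."""
    n = M.shape[0]
    p = np.full(n, 1.0 / n)  # start from the uniform distribution
    for _ in range(max_iter):
        # sum_i p_t(d_i) = 1, so the random-jump term is alpha / N for every page.
        p_next = (1 - alpha) * M.T.dot(p) + alpha / n
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

# Transition matrix of the simple four-page graph above
M = np.array([
    [0.0, 0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
])

p = pagerank(M)
print(np.round(p, 4), p.sum())
```

The returned vector is a probability distribution over the four pages, and applying the update rule to it once more leaves it (numerically) unchanged, which is the equilibrium condition.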

---

To find the unknown probabilities $p_t (d_i)$, we drop the time index and solve the resulting system of linear equations, which works like a dynamical system with parameter $\alpha$:

---

Let:

- Collect the probabilities $p(d_j)$ on the left-hand side and the probabilities $p(d_i)$ on the right-hand side. Both sides then contain the same vector of probabilities $\vec{p}$.
- $I$ is the uniform matrix of random-jump probabilities (not the identity matrix). Since the jump is uniform, every entry is $1/N$, such that:
$$I_{ij} = \frac{1}{N} \quad \forall i,j$$
- The probability of a user going to web page $d_j$ depends only on the current page and not on earlier steps, so user behavior can be modeled as a **random walk**.

Then:

- Since $size(\vec{p}) = N$, the equation becomes a linear system of $N$ variables:
$$p(d_j) = \sum\limits_{i=1}^N \Big[ \frac{1}{N} \alpha + (1-\alpha) M_{ij} \Big] * p(d_i) \rightarrow \vec{p} = (\alpha I + (1-\alpha)M)^T \ \vec{p}$$
- Since $I$ and $M$ are both $N \times N$ matrices, we can merge them into a single matrix $A = \alpha I + (1-\alpha)M$.
- We have a linear system of $N$ variables:
$$\begin{bmatrix}
    p_{t+1}(d_1) \\
    \vdots \\
    p_{t+1}(d_N)
\end{bmatrix}        = A^T
\begin{bmatrix}
    p_{t}(d_1) \\
    \vdots \\
    p_{t}(d_N)
\end{bmatrix}$$
Written compactly, with $v$ the vector at time $t$ and $y$ the vector at time $t+1$:
$$y = A^T * v$$
At equilibrium, $v$ is an eigenvector of $A^T$, so there exists an eigenvalue $\lambda$ such that:
$$A^T * v = y = \lambda * v$$

- $v$ is the unknown eigenvector. Use the **power iteration** method to find $v$. Power iteration is an eigenvalue algorithm that applies a recurrence relation to an initial vector until it converges to the eigenvector of the dominant (largest-magnitude) eigenvalue. Since $A$ is a stochastic matrix, the dominant eigenvalue of $A^T$ is $\lambda = 1$, and its nonzero eigenvector $v$ is the PageRank vector. In the steps below, $A$ stands for the matrix whose eigenvector we want, which is $A^T$ in the PageRank system above. The power iteration algorithm is:
    1. Set a threshold $h$ that decides when the estimate of $\lambda$ has stopped changing and the iteration can stop.
    2. $v$ is the unknown eigenvector, which can be initialized randomly or uniformly at the first iteration.
    3. Estimate the eigenvalue for the current $v$ with the Rayleigh quotient, $\lambda_i = \frac{v^T A v}{v^T v}$.
    4. At each step, compute $\hat{v}$ by normalizing the product $A.v$:
    $$\hat{v} = \frac{A.v}{||A.v||}$$
    5. $\hat{v}$ is a unit vector: the magnitude of $A.v$ is normalized to 1, but its direction is unchanged. For more information, please read [this explanation in linear algebra](http://freetext.org/Introduction_to_Linear_Algebra/Basic_Vector_Operations/Normalization/).
    6. Estimate the new eigenvalue $\lambda_j$ from $\hat{v}$ in the same way.
    7. Compute the change between the two estimates, $\delta(\lambda) = |\lambda_j - \lambda_i|$.
    8. If $\delta(\lambda) < h$, stop the iteration.
    9. Otherwise, update $v$ and $\lambda$ simultaneously, such that $v = \hat{v}$ and $\lambda = \lambda_j$, and continue the iteration from step 4.

For further explanation about power iteration, please read [wikipedia](https://en.wikipedia.org/wiki/Power_iteration) or [ML wiki](http://mlwiki.org/index.php/Power_Iteration).

For more information about PageRank algorithm in three different point of views (dynamic models, linear algebra, and random surfer), please read [this lecture note](http://pi.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture3/lecture3.html).

-----

Let's implement the power iteration method in Python to find the PageRank vector:

In [130]:
import numpy as np


def eigenvalue(A, v):
    """
    Estimate the eigenvalue with the Rayleigh quotient.

    @param A: Matrix.
    @param v: Eigenvector estimate (column vector).
    @return: Eigenvalue estimate.
    """
    vt = v.transpose()
    ev = vt.dot(A.dot(v)) / vt.dot(v)
    return ev[0,0]

def power_iteration(A, h=1e-9):
    """
    @param A: Matrix (np.matrix).
    @param h: Threshold on the change of the eigenvalue estimate.
    @return: The eigenvector and eigenvalue.
    """
    n, m = A.shape

    v = np.ones([m, 1]) / m  # Eigenvector estimate at first iteration
    l = eigenvalue(A, v)     # Eigenvalue estimate at first iteration

    # Repeat until the eigenvalue estimate converges
    while True:
        Av = A.dot(v)
        vNorm = Av.A / np.linalg.norm(Av)  # Normalize by ||A.v||
        l1 = eigenvalue(A, vNorm)

        if np.abs(l1 - l) < h:
            break

        v = vNorm
        l = l1

    return v, l

In [135]:
# Test eigenvalue()
print("Should produce 3", end=': ')
A = np.matrix([[1,0], [2,3]])
v = np.array([[0], [1]])
print(eigenvalue(A,v))

print("Should produce 3", end=': ')
A = np.matrix([[1,0], [2,3]])
v = np.array([[0], [1]])
print(eigenvalue(A.transpose(),v))

Should produce 3: 3.0
Should produce 3: 3.0


In [140]:
# Test power_iteration()
A = np.matrix([[2, -12], [1, -5]])
B = power_iteration(A)
print("Eigenvector (v)")
print(B[0])

print()
print("Eigenvalue (lambda)")
print(B[1])

print()
C = A.dot(B[0])
print("A . v")
print(C)

print()
D = B[0] * B[1]
print("lambda . v")
print(D)

Eigenvector (v)
[[-0.9486833 ]
 [-0.31622777]]

Eigenvalue (lambda)
-2.0000000016142927

A . v
[[1.8973666 ]
 [0.63245553]]

lambda . v
[[1.8973666 ]
 [0.63245553]]


-----

The **HITS (Hypertext-Induced Topic Search)** algorithm is designed to use both inlinks and outlinks, so HITS measures the importance of authority pages and also hub pages. HITS uses a **reinforcement mechanism** to improve the scores of both hubs and authorities. It assumes that a good authority is pointed to by good hubs, and a good hub points to good authorities:

Let:

- $A$ is the adjacency matrix of the directed web graph, such that:

$$A_{ij} = \begin{cases}
1 \ \text{if } d_i \rightarrow d_j\\
0 \ \text{if } d_i \not\rightarrow d_j,
\end{cases}$$

- $h(d_i)$ is hub score of web page $d_i$:

$$h(d_i) = \sum\limits_{d_j \in OUT(d_i)} a(d_j)$$

- $a(d_i)$ is the authority score of web page $d_i$:

$$a(d_i) = \sum\limits_{d_j \in IN(d_i)} h(d_j)$$


By vectorization, we get:

- The hub vector $\vec{h}$ is the product of the adjacency matrix $A$ and the authority vector $\vec{a}$:

$$\vec{h} = A \vec{a}$$

- The authority vector $\vec{a}$ is the product of the transposed adjacency matrix $A^T$ and the hub vector $\vec{h}$:

$$\vec{a} = A^T \vec{h}$$


By back substitution, we can:

- Compute hub vector $\vec{h}$ without knowing authority vector $\vec{a}$:

$$\vec{h} = A \cdot A^T \cdot \vec{h}$$

- Compute authority vector $\vec{a}$ without knowing hub vector $\vec{h}$:

$$\vec{a} = A^T \cdot A \cdot \vec{a}$$


Then:

- Initialize every hub score to 1 and compute the initial authority vector from it as $A^T \cdot \vec{h}_0$, such that for every $d_i$:

$$\vec{h}_0 = \begin{bmatrix}
1\\
1\\
\vdots\\
1
\end{bmatrix} \quad \text{and} \quad \vec{a}_0 =  A^T \cdot \begin{bmatrix}
1\\
1\\
\vdots\\
1
\end{bmatrix} $$
A simpler alternative is to start from:
$$a(d_i) = h(d_i) = 1$$

- At each time step $t$, update $\vec{h}$ and $\vec{a}$ for all nodes in the graph, then normalize both vectors so that the scores stay bounded:

$$HITS_{t} = \begin{cases}
\vec{a}_t = (A^T \cdot A) \cdot \vec{a}_{t-1}\\
\vec{h}_t = (A \cdot A^T) \cdot \vec{h}_{t-1}\\
\end{cases}$$


A further explanation of the HITS algorithm can be found [here](http://pi.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture4/lecture4.html).

----

Let's implement the HITS algorithm in Python:

In [1]:
import numpy as np


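def hits(A, h, maxIter):
    """
    Run a fixed number of HITS updates. The scores are not normalized here,
    so they grow with every iteration and only their relative sizes matter.
    
    @param A: Adjacency matrix
    @param h: Initial hub vector
    @param maxIter: Number of iterations
    @return: Authority vector and Hub vector
    """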

    for i in range(maxIter):
        a = A.T.dot(h)
        h = A.dot(a)
        
    return a, h

In [3]:
A = np.matrix([[0, 0, 1], [0, 0, 1], [0, 0, 0]])
h = np.array([[1], [1], [1]])

a, h = hits(A, h, 10)

print("Authority vectors")
print(a)

print("\nHub vector")
print(h)

Authority vectors
[[   0]
 [   0]
 [1024]]

Hub vector
[[1024]
 [1024]
 [   0]]
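
The scores above double at every iteration because nothing rescales them. A common remedy is to normalize both vectors after each update. The sketch below is one way to do this, assuming the scores are rescaled to sum to 1 and that the graph has at least one link (otherwise the sums would be zero); the name `hits_normalized` and the stopping tolerance are choices made for this example.

```python
import numpy as np

def hits_normalized(A, max_iter=100, tol=1e-10):
    """HITS with per-iteration normalization so that the scores stay bounded."""
    n = A.shape[0]
    a = np.ones(n)
    h = np.ones(n)
    for _ in range(max_iter):
        a_new = A.T.dot(h)       # authority: sum of hub scores of pages linking in
        h_new = A.dot(a_new)     # hub: sum of authority scores of pages linked to
        a_new = a_new / a_new.sum()  # rescale so each vector sums to 1
        h_new = h_new / h_new.sum()
        if np.abs(a_new - a).sum() + np.abs(h_new - h).sum() < tol:
            return a_new, h_new
        a, h = a_new, h_new
    return a, h

# Same graph as above: pages 0 and 1 both link to page 2
A = np.array([[0, 0, 1], [0, 0, 1], [0, 0, 0]], dtype=float)
a, h = hits_normalized(A)
print("Authority:", a)
print("Hub:", h)
```

The ranking is the same as in the unnormalized run: page 2 is the only authority, and pages 0 and 1 are equally good hubs. Only the scale of the scores differs.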


## Additional Readings and Resources

- C. Zhai and S. Massung. Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining, ACM Book Series, Morgan & Claypool Publishers, 2016. Chapters 7 & 10.