# TextRank

## TL;DR

The authors extend PageRank, which scores a vertex $V_i$ as
$$S(V_{i}) = (1-d) + d \cdot \sum_{V_{j}\in In(V_{i})}\frac{1}{\vert Out(V_{j}) \vert}S(V_{j})$$
where $d$ is the damping factor, by introducing edge weights $w_{ji}$ into the score update of each vertex:
$$WS(V_{i}) = (1-d) + d \cdot \sum_{V_{j}\in In(V_{i})}\frac{w_{ji}}{\sum_{V_{k}\in Out(V_{j})}w_{jk}}WS(V_{j})$$
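As a rough illustration of the weighted update, the sketch below iterates it on a three-vertex toy graph with NumPy. The matrix `W`, the helper name `weighted_score_step`, and the toy weights are assumptions made for this example; it also assumes every vertex has at least one outgoing edge, so the normalization never divides by zero.

```python
import numpy as np

def weighted_score_step(W, scores, d=0.85):
    """Apply the weighted update once; W[j, i] holds the weight w_ji of edge V_j -> V_i."""
    out_weight = W.sum(axis=1)            # sum_k w_jk for every source vertex V_j
    share = W / out_weight[:, None]       # w_ji / sum_k w_jk (assumes no zero rows)
    return (1 - d) + d * (share.T @ scores)

# Toy symmetric graph: V_0 is linked to V_1 (weight 2) and to V_2 (weight 1).
W = np.array([[0., 2., 1.],
              [2., 0., 0.],
              [1., 0., 0.]])

scores = np.ones(3)                       # every vertex starts at 1
for _ in range(100):
    scores = weighted_score_step(W, scores)

print(scores)
```

In this toy graph the hub $V_0$ ends up with the highest score, and $V_1$ outranks $V_2$ because it receives the larger share of $V_0$'s outgoing weight.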


## Example of TextRank for Keyword Extraction
For keyword extraction, the words of the text become the vertices, and word co-occurrence within a fixed-size window determines the edges and their weights.
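One possible way to turn co-occurrence into edge weights is to count how often two words share a window. The sketch below is only an illustration: the helper `cooccurrence_weights` and the toy token list are hypothetical, and the graph built later in this notebook uses plain 0/1 weights instead of counts.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_weights(tokens, window_width=2):
    """Count, for each unordered word pair, the number of windows containing both words."""
    weights = Counter()
    for start in range(len(tokens) - window_width + 1):
        window = tokens[start:start + window_width]
        for a, b in combinations(window, 2):
            if a != b:                              # skip self-loops
                weights[frozenset((a, b))] += 1     # unordered pair -> undirected edge
    return weights

toy_tokens = ["linear", "constraints", "linear", "equations", "linear", "constraints"]
weights = cooccurrence_weights(toy_tokens)
for pair, count in weights.items():
    print(sorted(pair), count)
```

With `window_width=2` each adjacent pair is counted once; with wider windows the same pair of positions falls into several overlapping windows, so the counts grow accordingly.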

In [None]:
import numpy as np
import pandas as pd
import itertools

Example from [TextRank: Bringing Order into Texts](https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf)

In [None]:
example_text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types."
example_text

Apply naive whitespace tokenization, splitting a trailing comma or period off as its own token.

In [None]:
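def tokenizer(text):
    """Lowercase, split on whitespace, and split off a trailing comma or period."""
    tokens = []
    # Use the `text` argument (not the global); split() also skips empty strings.
    for word in text.lower().split():
        if len(word) > 1 and word[-1] in ",.":
            tokens += word[:-1], word[-1]
        else:
            tokens.append(word)
    return tokens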

In [None]:
tokens = pd.Series(tokenizer(example_text))
tokens

Apply a syntactic filter and build the graph. In this case no filter is applied at all, so every token, punctuation included, can become a vertex.

In [None]:
def dummy_filter(token_a, token_b):
    return True

def construct_graph(tokens, syntactic_filter, window_width=5):
    vocab = tokens.unique()
    # Start from an all-zero float matrix so no NaN filling is needed afterwards.
    mat = pd.DataFrame(0., index=vocab, columns=vocab)
    for window_start in range(len(tokens) - window_width + 1):
        window = tokens.iloc[window_start:window_start + window_width]
        for token_a, token_b in itertools.combinations(window, 2):
            if syntactic_filter(token_a, token_b):
                # Undirected 0/1 edge; .loc[row, col] avoids chained assignment.
                mat.loc[token_a, token_b] = 1.
                mat.loc[token_b, token_a] = 1.
    # Remove isolated vertices
    deg = mat.values.sum(axis=1)
    new_indices = mat.index[deg > 0]
    mat = mat.loc[new_indices, new_indices]
    return mat

In [None]:
adj_mat = construct_graph(tokens, dummy_filter)
adj_mat
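A non-trivial filter with the same `(token_a, token_b) -> bool` signature could look like the sketch below. The paper's filter is based on part-of-speech tags; since no tagger is used in this notebook, the example falls back on a small stop-word list as a crude stand-in. Both `STOP_WORDS` and `stopword_filter` are hypothetical names introduced here, and the word list is illustrative rather than complete.

```python
# Illustrative stop-word list (an assumption, not taken from the paper).
STOP_WORDS = {"of", "the", "a", "and", "for", "in", "over",
              "are", "can", "be", "all", "these"}

def stopword_filter(token_a, token_b):
    """Accept a pair only if both tokens are alphabetic and not stop words."""
    return all(t.isalpha() and t not in STOP_WORDS for t in (token_a, token_b))

print(stopword_filter("linear", "constraints"))   # both content words
print(stopword_filter("of", "systems"))           # contains a stop word
print(stopword_filter("numbers", "."))            # contains punctuation
```

Passing it in place of `dummy_filter`, as in `construct_graph(tokens, stopword_filter)`, should leave stop words and punctuation without edges, so they are dropped together with the other isolated vertices.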

Calculate the score of each vertex.

In [None]:
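def text_rank(adj_mat, d, threshold=1e-5, max_iter=100):
    assert 0 < d < 1
    vertices = adj_mat.index
    # Every vertex starts with a score of 1.
    cur_scores = pd.Series(1., index=vertices)
    # Total edge weight of each vertex (the matrix is symmetric).
    deg_o = adj_mat.sum(axis=1)
    # Divide column j by deg_o[j]: entry (i, j) becomes w_ji / sum_k w_jk.
    norm_adj_mat = adj_mat.div(deg_o, axis=1)
    for _ in range(max_iter):
        # Row i sums the normalized weights times the neighbours' current scores.
        update = norm_adj_mat.mul(cur_scores, axis=1).sum(axis=1)
        new_scores = (1 - d) + d * update
        # Stop once the scores change by less than the threshold.
        if np.linalg.norm(new_scores - cur_scores) < threshold:
            return new_scores
        cur_scores = new_scores
    return cur_scores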

In [None]:
text_rank(adj_mat, d=0.85, max_iter=100).sort_values(ascending=False)
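Once the vertices are ranked, a keyword list can be obtained by keeping only the top-scoring ones. The sketch below keeps the top third of the vertices, which is believed to match the cut-off used in the paper, but treat that fraction as an assumption. The helper `top_keywords` and the toy scores are hypothetical and only illustrate the selection step.

```python
def top_keywords(scores, divisor=3):
    """Return the highest-scoring 1/divisor of the words (at least one word)."""
    k = max(1, len(scores) // divisor)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]

# Made-up scores, only for illustration.
toy_scores = {"linear": 1.4, "systems": 1.2, "of": 0.9,
              "set": 0.8, "types": 0.7, "the": 0.5}
keywords = top_keywords(toy_scores)
print(keywords)
```

Since `text_rank` returns a pandas Series, its result can be converted first, for example `top_keywords(text_rank(adj_mat, d=0.85).to_dict())`.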