# TextRank TFIDF

### Basics
* $|document| = n$: number of sentences in the document
* $|vocab| = m$: number of words in the vocabulary

### TfIdf
* $X$ : tfidf matrix $n \times m$
* $s_i$: one row of the matrix, vector representation of a sentence with tfidf weights.
The vector is $L_2$-normalized, i.e., $\|s_i\|^2 = 1$.

### Cosine similarity or cosine kernel $\mathbf{K}$

For two sentences $s_i$ and $s_j$, the cosine similarity normalizes their dot product by both norms:

$\mathbf{K}(X, X)_{ij} = \large \frac{\langle s_i, s_j \rangle}{\|s_i\| \times \|s_j\|}$

In matrix form, with $D = \mathrm{diag}(\|s_1\|, \dots, \|s_n\|)$:

$\Longleftrightarrow K(X, X) = D^{-1} X X^T D^{-1} = D^{-1}
\begin{pmatrix}
s_1 \\
\vdots \\
s_n 
\end{pmatrix}
\begin{pmatrix}
s_1^T & \dots & s_n^T 
\end{pmatrix} D^{-1}$

$\Longleftrightarrow K(X, X) = D^{-1}
\begin{pmatrix}
\langle s_1, s_1 \rangle & \dots & \langle s_1, s_n \rangle\\
\vdots & \ddots & \vdots \\
\langle s_n, s_1 \rangle & \dots & \langle s_n, s_n \rangle
\end{pmatrix} D^{-1}$ and we know that $\|s_i\|^2 = 1$, so $D = I_n$

$\Longleftrightarrow K(X, X) = X X^T =
\begin{pmatrix}
\langle s_1, s_1 \rangle & \dots & \langle s_1, s_n \rangle\\
\vdots & \ddots & \vdots \\
\langle s_n, s_1 \rangle & \dots & \langle s_n, s_n \rangle
\end{pmatrix}$

$K(X, X)$ defines a Gram matrix: each entry is the cosine similarity between two sentences in the feature space.
It can be seen as the correlation of two sentences over the vocabulary.

This Gram matrix serves as our adjacency matrix. Each cell represents a weighted edge between two sentences.
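
As a quick numerical illustration, here is a minimal sketch with hypothetical toy data (the matrix `X` below is made up, not taken from any corpus). Assuming the rows are $L_2$-normalized, the Gram matrix $X X^T$ should coincide with the pairwise cosine similarity, be symmetric, and have ones on its diagonal.

```python
import numpy as np

# Hypothetical toy data: 3 "sentences" over a 4-word vocabulary (non-negative weights).
X = np.array([
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 3.0, 4.0],
])

# L2-normalize each row so that ||s_i|| = 1.
X = X / np.linalg.norm(X, axis=1, keepdims=True)

# Gram matrix of the normalized rows: K[i, j] = <s_i, s_j>.
K = X @ X.T

# Explicit pairwise cosine similarity, computed entry by entry for comparison.
n = X.shape[0]
K_pairwise = np.array([
    [X[i] @ X[j] / (np.linalg.norm(X[i]) * np.linalg.norm(X[j])) for j in range(n)]
    for i in range(n)
])

print(np.allclose(K, K_pairwise))
```

Read as an adjacency matrix, `K[i, j]` is the weight of the edge between sentences `i` and `j`. Sentences that share no word (rows 0 and 2 here) get a weight of zero.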

### Weakening matrix (not sure this is the right translation)
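
Reading the `summarize` method in the code cell that follows, the weakening factor appears to reduce to a piecewise rule: $F_{ij} = a^{\,j-i}$ if $j > i$, $F_{ij} = b^{\,i-j}$ if $j < i$, and $F_{ii} = 1$. The sketch below restates that rule directly. The helper name `weakening_factor` and the values of `a` and `b` are assumptions chosen for illustration, not part of the original code.

```python
import numpy as np

def weakening_factor(n, a, b):
    """Hypothetical helper: piecewise form of the factor built in `summarize`."""
    idx = np.arange(n)
    # offset[i, j] = j - i: signed distance from sentence i to sentence j.
    offset = idx[None, :] - idx[:, None]
    # Weight applied when sentence j comes after sentence i: a ** (j - i).
    ahead = np.power(float(a), np.maximum(offset, 0))
    # Weight applied when sentence j comes before sentence i: b ** (i - j).
    # On the diagonal the exponent is 0, so the weight is 1.
    behind = np.power(float(b), np.maximum(-offset, 0))
    return np.where(offset > 0, ahead, behind)

# Illustrative values only.
F = weakening_factor(4, a=0.5, b=0.9)
print(F)
```

With `a != b` the factor is not symmetric, so the edge from sentence `i` to a later sentence is weakened differently from the edge to an earlier one. That is presumably what `Assym` in the class name refers to.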

In [None]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity as cos_sim
import inspect
import seaborn
%run Summary_Processes/Generic_Summarizer.ipynb

class TextRank_TFIDF_Summarizer_Assym_process :
    def __init__(self, a, b, weighted=False, method="tr", lsanbcompfun = None, diag = "none",  tag = None, bias = None) :
        
        self.weighted = weighted
        self.method = method
        self.lsanbcompfun = lsanbcompfun
        self.diag = diag
        self.a = a
        self.b = b
        self.vectorizer = TfidfVectorizer()
        self.bias = bias
        
        methodstr = "TextRank" if method == "tr" else "LSA"
        weightedstr = "weighted" if weighted else "unweighted"
        self.__name__ = ( methodstr + "_TFIDF_Summarizer_Assym_process(" + str(self.a) + ","
                         + str(self.b) + "," + weightedstr+ ","
                         + diag + ", " + str(bias) + " )" + (("-" + tag) if tag is not None else ""))
        
    def preprocess(self, corpus):
        """
        Fits the TF-IDF vectorizer on all sentence tokens of the corpus.
        This learns the idf weights and the vocabulary of the corpus,
        a dictionary mapping each word to its feature index.
        
        :param corpus:  Array of strings. Each string is a sentence token.
                        The whole dataset is flattened; documents are not separated.
        """
        # Learn our representation space, ie, its dimension (vocabulary size)
        # and the idf factors.
        self.vectorizer.fit(corpus)

    def summarize(self, corpus, doc_biais=None):
        """
        :param corpus:  One document from the corpus. Array of strings.
                        Each string is a sentence token.
        :param doc_biais:  Not used by this method.
                        
        :return:    The output of `generic_summarizer` for the selected method.
        """
        # Transform the document into a vector representation.
        X = self.vectorizer.transform(corpus)
        
        # Compute the weakening matrix, which attenuates links between sentences that are far apart.
        # dist[i, j] = j - i is the signed offset from sentence i to sentence j.
        dist = np.array([np.arange(X.shape[0]) - i for i in range(0, X.shape[0])])
        pos = dist > 0
        neg = dist < 0
        # factor[i, j] = a ** (j - i) if j > i, b ** (i - j) if j < i, and 1 on the diagonal.
        factor = (np.power(self.a, np.abs(np.multiply(dist,pos)))
                  + np.power(self.b, np.abs(np.multiply(dist,neg)))
                  - 1)
        
        # Build the similarity matrix between sentences with cosine similarity,
        # then apply the "weakening" matrix element-wise.
        matrix = np.multiply(factor, cos_sim(X, X))
        
        # Use of a generic method which calls the correct method
        return generic_summarizer(self.method, matrix, corpus, self.weighted, self.lsanbcompfun, diag =  self.diag, bias = self.bias)
    
    def __str__(self):
        return self.__name__