# Text Summarization

Alice continues her journey and is now in 2015. Things have become easier, since word2vec is now available! This time Alice needs help summarizing news texts.

The goal of summarization is to produce a shorter text that preserves all (or almost all) of the information in the original, that is, to lose as little information as possible.

Methods for solving this problem are usually divided into two categories:
- Extractive Summarization $-$ algorithms based on identifying the most informative parts of the source text (sentences, paragraphs, etc.) and compiling a summary from them.
- Abstractive Summarization $-$ algorithms that generate new text based on the source.

We will work with Extractive Summarization.

## 0. Dataset Preprocessing

In [None]:
import os
import nltk
import numpy as np

from scipy import sparse
from collections import defaultdict
from tqdm.notebook import tqdm
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

### Loading dataset

We will use data from the CNN/DailyMail news corpus.

In [None]:
DATA_DIR = './cnn_stories_short/'

In [None]:
%%capture

!wget https://www.dropbox.com/s/kofxrgod7kl720m/cnn_stories_short.zip
!mkdir cnn_data
!unzip cnn_stories_short.zip -d $DATA_DIR
!rm -r ./cnn_stories_short/__MACOSX

### Dataset preparation

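The dataset consists of source texts and already written summaries for them. In each file the summary lines follow `@highlight` markers, so we split on that marker and keep only the first part, the source text.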

In [None]:
texts = []
for filename in os.listdir(DATA_DIR):
    with open(os.path.join(DATA_DIR,filename),'r') as input_file:
        all_texts = input_file.read().split('@highlight')
        texts.append(all_texts[0])
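
To make the split concrete, here is a minimal sketch on a made-up story. The story text is invented, and the layout (body first, then one or more `@highlight` blocks) is an assumption inferred from the loading code above, not a description of the real files.

```python
# A made-up story in the assumed layout: body first, then "@highlight" blocks.
story = (
    "Alice arrived in 2015. She started reading the news.\n\n"
    "@highlight\n\nAlice arrives in 2015\n\n"
    "@highlight\n\nShe reads the news\n"
)

parts = story.split('@highlight')
body = parts[0].strip()                       # the source text we keep
highlights = [p.strip() for p in parts[1:]]   # the written summary lines

print(body)
print(highlights)
```

The notebook keeps only `all_texts[0]`; the remaining parts are the written summary and are not used further here.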

#### We will need:
* texts split into sentences
* all sentences split into tokens (one flat list over the whole corpus)
* texts split into sentences, with each sentence split into tokens
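
The three structures differ only in nesting. The sketch below shows their shapes on two toy texts. The regex-based `naive_sent_tokenize` and `naive_word_tokenize` are hypothetical stand-ins so the example runs without downloading NLTK data; the notebook itself uses `sent_tokenize` and `word_tokenize`.

```python
import re

def naive_sent_tokenize(text):
    # Stand-in for nltk's sent_tokenize: split after ., ! or ? followed by whitespace.
    return re.split(r'(?<=[.!?])\s+', text.strip())

def naive_word_tokenize(sentence):
    # Stand-in for nltk's word_tokenize: runs of word characters or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)

toy_texts = ["Alice reads news. She needs a summary!", "It is 2015."]

# list of texts -> list of sentence strings
toy_sent_tokenized = [naive_sent_tokenize(t) for t in toy_texts]
# list of texts -> list of sentences -> list of tokens
toy_tokenized_texts = [[naive_word_tokenize(s) for s in sents] for sents in toy_sent_tokenized]
# flat list of sentences -> list of tokens (text boundaries are dropped)
toy_tokenized_sentences = [tokens for text in toy_tokenized_texts for tokens in text]

print(toy_sent_tokenized)
print(toy_tokenized_texts)
print(toy_tokenized_sentences)
```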

In [None]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
sent_tokenized_texts = [sent_tokenize(text) for text in texts]
tokenized_sentences = [word_tokenize(sent) for text in texts for sent in sent_tokenize(text)]
tokenized_texts = [[word_tokenize(sent) for sent in text] for text in sent_tokenized_texts]

### Loading Word Embedding Model

For the TextRank algorithm, we need to obtain a vector representation for each sentence in the text.

We will use pre-trained GloVe vectors. **GloVe** (Global Vectors for Word Representation) is an unsupervised learning algorithm for obtaining vector representations of words, developed at Stanford University. It uses global word-word co-occurrence statistics from a corpus to build dense embeddings that capture word meaning: words with similar meanings end up close together in the vector space, which helps in many natural language processing tasks.

Let's download the vectors:

In [None]:
%%capture

!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip

The downloaded archive contains several files with vectors of different dimensions. Each line of a file holds a word followed by the space-separated values of its vector.
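
As a small illustration of this layout, the sketch below parses two invented lines from an in-memory file. The words and numbers are made up, and the real files hold many more values per line.

```python
import io
import numpy as np

# Two made-up lines in the same "word value value ..." layout.
fake_glove_file = io.StringIO(
    "news 0.1 -0.2 0.3\n"
    "text 0.0 0.5 -1.0\n"
)

toy_embeddings = {}
for line in fake_glove_file:
    word, *values = line.split()   # first field is the word, the rest are numbers
    toy_embeddings[word] = np.asarray(values, dtype='float32')

print(toy_embeddings['news'])
print(toy_embeddings['news'].shape)
```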

In [None]:
word_embeddings = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        word_embeddings[word] = np.asarray(values[1:], dtype='float32')

The vectors are now stored in `word_embeddings`, a dictionary that maps each word to its vector.

## Task 1: Word2Vec text representation

*   For each sentence, obtain its vector representation by averaging the embeddings of its words: sum them component by component and divide by the number of words in the sentence. If the embedding model does not contain a word, use a zero vector for it. Use the word vectors stored in `word_embeddings`.
*   Compute the cosine similarity between every pair of sentences to obtain the cosine similarity matrix **G**.
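
Both steps can be traced by hand on a tiny example. This is only a sketch: `toy_embeddings` and its 3-dimensional vectors are invented for illustration, and the cosine similarity is written out with NumPy rather than taken from a library.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; the real ones come from word_embeddings.
toy_embeddings = {
    'cats': np.array([1.0, 0.0, 0.0]),
    'dogs': np.array([0.8, 0.6, 0.0]),
    'sleep': np.array([0.0, 1.0, 0.0]),
    'bark': np.array([0.0, 0.6, 0.8]),
}
dim = 3

def sentence_vector(tokens):
    # Unknown words contribute a zero vector but still count in the average.
    vecs = [toy_embeddings.get(w, np.zeros(dim)) for w in tokens]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

toy_sentences = [['cats', 'sleep'], ['dogs', 'bark'], ['cats', 'purr']]
X = np.array([sentence_vector(s) for s in toy_sentences])

# Cosine similarity: dot products of the rows divided by the products of their norms.
norms = np.linalg.norm(X, axis=1, keepdims=True)
G_toy = (X @ X.T) / (norms @ norms.T)
print(np.round(G_toy, 3))
```

Note that `'purr'` has no embedding, so the third sentence vector is half of the `'cats'` vector. A sentence made only of unknown words would give a zero row and a division by zero in this hand-written version, so it would need a guard.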

In [None]:
# TODO: complete the transform function. You can add extra arguments to the class constructor if necessary.

class TfidfEmbeddingVectorizer:

    def __init__(self, embedding_model, dim=100):
        self.embedding_model = embedding_model  # word embedding model (word -> vector)
        self.dim = dim  # dimension of word vectors

    def transform(self, X):
        # X is a list of tokenized sentences (or tokenized texts)
        vectors = []

        for tokens in X:
            # List to store embeddings for words in the sentence
            vecs = []

            for word in tokens:
                # Get the embedding vector if it exists, otherwise use zeros
                if word in self.embedding_model:
                    vec = self.embedding_model[word]
                else:
                    vec = np.zeros(self.dim)

                vecs.append(vec)

            # If there are no words in sentence, use a zero vector
            if vecs:
                sentence_vec = np.mean(vecs, axis=0)
            else:
                sentence_vec = np.zeros(self.dim)

            vectors.append(sentence_vec)

        return np.array(vectors)


In [None]:
sentence_vectorizer = TfidfEmbeddingVectorizer(word_embeddings)

### Building the Cosine Similarity Matrix

For the *TextRank* algorithm, we need to build a weighted graph from the text. The graph will be represented as a matrix of cosine similarities between sentences.

As an example, let's choose one text and build this similarity matrix for it.

In [None]:
TEXT_NUM = 5

In [None]:
sentences = tokenized_texts[TEXT_NUM]

Using the vectorizer, we will obtain vectors for all sentences of the text.

In [None]:
vectorized_sentences = sentence_vectorizer.transform(sentences)

In [None]:
vectorized_sentences.shape

(25, 100)

Let's compute the cosine similarity matrix.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

def get_cosine_similarity_matrix(sentences):
    """
    sentences: np.array where each row is a sentence vector
    returns: cosine similarity matrix G
    """
    G = cosine_similarity(sentences)
    return G


G = get_cosine_similarity_matrix(vectorized_sentences)

## Extractive Summarization $-$ TextRank

Now we will implement the text summarization method itself. It will be based on the *PageRank* algorithm.

*PageRank* is an iterative algorithm that scores the importance of each node in a graph based on its connections to other nodes. It was originally used to rank web pages for search engines.

The adaptation of this algorithm for text summarization is called *TextRank*.

The algorithm repeatedly goes through all the nodes in the graph and recalculates the PageRank value of each one using the formula below.

This continues until the process stabilizes, that is, until the *PageRank* values of all nodes stop changing significantly from one iteration to the next.

$$ G = (V,E) \text{ is a graph} $$

$$ PageRank(v) = \frac{(1-d)}{N} +  d \sum_{u} \frac {PageRank(u) * W_{(u, v)}} {C(u)}$$

$$v \text{ is a vertex of the graph, } v \in V $$

$$u \text{ ranges over the vertices such that } (u,v) \in E$$

$$C(u) \text{ is the number of vertices } v \text{ such that } (u,v) \in E$$

$$W_{(u, v)} \text{ is the weight of the edge } (u, v) \in E $$

$$N = |V| \text{ is the number of vertices}$$

$$d = 0.85 \text{ is the damping factor}$$
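
The update rule can be sketched as a short power iteration. This is a minimal illustration, not the NetworkX implementation. The function name `pagerank_power_iteration` is hypothetical, and it assumes that $C(u)$ is taken as the total weight of the edges leaving $u$ (the row sum of the weight matrix), which is an assumption about how weights are normalized rather than something the formula states.

```python
import numpy as np

def pagerank_power_iteration(W, d=0.85, tol=1e-8, max_iter=100):
    """Weighted PageRank by power iteration.

    W[u, v] is the weight of edge (u, v). Assumption: C(u) is the total
    weight leaving u (the row sum), and every row has a positive sum.
    """
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    # Each row becomes a probability distribution over the neighbours of u.
    transition = W / W.sum(axis=1, keepdims=True)
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # new_scores[v] = (1 - d) / N + d * sum_u scores[u] * W[u, v] / C(u)
        new_scores = (1 - d) / n + d * (scores @ transition)
        if np.abs(new_scores - scores).sum() < tol:   # values stopped changing
            return new_scores
        scores = new_scores
    return scores

# A made-up symmetric weight matrix for three nodes.
W_toy = np.array([
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.2],
    [0.1, 0.2, 0.0],
])
toy_scores = pagerank_power_iteration(W_toy)
print(toy_scores, toy_scores.sum())
```

Node 2 is only weakly connected to the other two, so it ends up with the lowest score, and the scores always sum to 1.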

Let's use the NetworkX library to run the PageRank algorithm.

In [None]:
!pip install networkx



In [None]:
import networkx as nx

nx_graph = nx.from_numpy_array(G)
nx_scores = nx.pagerank(nx_graph)
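
One detail worth keeping in mind: `nx.from_numpy_array` creates an edge for every nonzero entry, so the diagonal of a similarity matrix becomes self-loops, and cosine similarities can in principle be negative. The sketch below shows one possible clean-up step before building the graph. It is an optional variation, not part of the original solution, and `clean_similarity_matrix` is a hypothetical helper name.

```python
import numpy as np

def clean_similarity_matrix(G):
    """Return a copy of G with no self-loops and no negative weights."""
    G = np.array(G, dtype=float)      # copy, so the caller's matrix is untouched
    np.fill_diagonal(G, 0.0)          # a sentence should not vote for itself
    np.clip(G, 0.0, None, out=G)      # negative similarities become "no edge"
    return G

# A made-up 3x3 similarity matrix.
G_raw = np.array([
    [1.0, 0.7, -0.1],
    [0.7, 1.0, 0.4],
    [-0.1, 0.4, 1.0],
])
G_clean = clean_similarity_matrix(G_raw)
print(G_clean)
```

Whether this changes the ranking depends on the data, so it is best treated as an experiment to compare against the plain matrix.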

In [None]:
ranked_sentences = sorted(((nx_scores[i], s, i) for i,s in enumerate(sentences)), reverse=True)

Let's print the 5 sentences with the highest TextRank scores. This will be our final summary.

In [None]:
SUMMARY_LEN = 5

for i in range(SUMMARY_LEN):
    print(' '.join(ranked_sentences[i][1]))

It was like the end of Braveheart every time a rebel looked into my eyes and said it .
Filming a documentary for VICE , I was detained for shooting where the authorities thought I should n't , beginning endless rounds of questions , emphatic yelling and head-shaking incredulity at my claims of innocence -- and , of course , the requisite implications that I was a spy .
Heady stuff for a teenager , especially when most of the rebels are n't old enough to have known a political system other than Gaddafism .
See the rest of The Rebels of Libya at VICE.COM When we finally got to Misurata , it was surrounded by Gaddafi 's troops and only accessible by sea .
Beaming , he wondered whether I could `` ask Clinton and Obama for new weapons '' so that they could beat Gaddafi and he could fulfill his dream of playing for the Miami Heat or the Dallas Mavericks .
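
The sentences above appear in score order, not in the order they occur in the text. Since each ranked tuple also carries the sentence index, the selected sentences can be put back into document order. The sketch below is one way to do that, shown as an optional variation. `top_sentences_in_order` is a hypothetical helper, and the scores and sentences are made up.

```python
def top_sentences_in_order(scores, sentences, summary_len=5):
    """Pick the summary_len best-scoring sentences, then restore text order.

    scores: mapping from sentence index to score (like the dict returned by nx.pagerank)
    sentences: list of tokenized sentences
    """
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:summary_len])    # back to original positions
    return [' '.join(sentences[i]) for i in chosen]

toy_scores = {0: 0.10, 1: 0.40, 2: 0.15, 3: 0.35}
toy_sents = [['First', '.'], ['Second', '.'], ['Third', '.'], ['Fourth', '.']]
toy_summary = top_sentences_in_order(toy_scores, toy_sents, summary_len=2)
print(toy_summary)
```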


Now let's combine everything into a single `summarize` function, which takes a text as a list of tokenized sentences and returns the 5 sentences with the highest *TextRank* scores.

In [None]:
def summarize(sentences,summary_len=5):
    vectorized_sentences = sentence_vectorizer.transform(sentences)
    G = get_cosine_similarity_matrix(vectorized_sentences)
    nx_graph = nx.from_numpy_array(G)
    nx_scores = nx.pagerank(nx_graph)
    ranked_sentences = sorted(((nx_scores[i],s,i) for i,s in enumerate(sentences)), reverse=True)
    summary = []
    # a text may contain fewer than summary_len sentences
    for i in range(min(summary_len, len(ranked_sentences))):
        summary.append(' '.join(ranked_sentences[i][1]))
    return summary

In [None]:
summarize(tokenized_texts[5])

['It was like the end of Braveheart every time a rebel looked into my eyes and said it .',
 "Filming a documentary for VICE , I was detained for shooting where the authorities thought I should n't , beginning endless rounds of questions , emphatic yelling and head-shaking incredulity at my claims of innocence -- and , of course , the requisite implications that I was a spy .",
 "Heady stuff for a teenager , especially when most of the rebels are n't old enough to have known a political system other than Gaddafism .",
 "See the rest of The Rebels of Libya at VICE.COM When we finally got to Misurata , it was surrounded by Gaddafi 's troops and only accessible by sea .",
 "Beaming , he wondered whether I could `` ask Clinton and Obama for new weapons '' so that they could beat Gaddafi and he could fulfill his dream of playing for the Miami Heat or the Dallas Mavericks ."]

Let's get summaries for all our texts:

In [None]:
system_summaries = [summarize(text) for text in tqdm(tokenized_texts)]



  0%|          | 0/300 [00:00<?, ?it/s]

Let's look at the summary of the text at index 10:

In [None]:
print("\n".join(system_summaries[10][:5]))

This hammer blow to AIDS research funding will be accompanied by cuts to a range of other HIV/AIDS programs -- cuts that will have negligible effect on the federal deficit but will have real consequences for people living with HIV/AIDS in the United States and around the world .
Much work lies ahead before these and other scientific advances can be parlayed into a broadly applicable cure that can be made available to the 35 million people living with HIV/AIDS worldwide .
Interactive : World AIDS Day and what it means Why , then , are we shortchanging a program that enjoys broad bipartisan and popular support , has done more than any other foreign policy initiative in recent years to burnish America 's image abroad , and has already altered -- though not irreversibly -- the trajectory of the HIV/AIDS pandemic ?
Zero new HIV infections among children can be a reality Similarly , the idea of an `` AIDS-free generation '' today is tossed around with abandon .
But over 30 years , we have de

## Task 2: IDF word2vec modification

Modify your previous solution: for each sentence, obtain its vector representation by averaging the embeddings of its words, each multiplied by the IDF value of that word.
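
To see what the IDF weight does, here is a small worked sketch in plain Python. The documents and the 2-dimensional embeddings are made up. The smoothed formula $\ln\frac{1+n}{1+df} + 1$ is the default used by scikit-learn's `TfidfVectorizer`; everything else (`emb`, `idf_weighted_mean`) is hypothetical and only for illustration.

```python
import math

# Three made-up tokenized "documents".
docs = [['the', 'rebels', 'won'], ['the', 'court', 'ruled'], ['the', 'rebels', 'left']]
n_docs = len(docs)

# Document frequency: in how many documents each word occurs.
df = {}
for doc in docs:
    for word in set(doc):
        df[word] = df.get(word, 0) + 1

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {word: math.log((1 + n_docs) / (1 + count)) + 1 for word, count in df.items()}

# Hypothetical 2-dimensional embeddings, only for illustration.
emb = {'the': [1.0, 0.0], 'court': [0.0, 1.0]}

def idf_weighted_mean(tokens, dim=2):
    # Words without an embedding contribute zeros, as in the Task 1 vectorizer.
    weighted = [[idf[w] * x for x in emb[w]] if w in emb else [0.0] * dim
                for w in tokens]
    return [sum(col) / len(weighted) for col in zip(*weighted)]

weighted_vec = idf_weighted_mean(['the', 'court'])
print(idf['the'], idf['rebels'], idf['court'])
print(weighted_vec)
```

The word `'the'` occurs in every document and gets the smallest weight, so the rarer word `'court'` dominates the averaged vector.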

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import defaultdict
import numpy as np

In [None]:
class TfidfEmbeddingVectorizer:
    def __init__(self, embedding_model, dim=100):
        self.embedding_model = embedding_model  # dictionary: word -> vector
        self.dim = dim  # dimension of the word vectors
        self.word_idf_weight = defaultdict(lambda: 1.0)  # default idf = 1.0 for unknown words (overwritten in fit)

    def fit(self, tokenized_texts):
        """
        tokenized_texts: list of tokenized texts or sentences (list of lists of tokens)
        """
        # Join the tokens back into strings so that TfidfVectorizer can process them
        joined_texts = [' '.join(tokens) for tokens in tokenized_texts]

        # Fit TF-IDF on the joined strings.
        # Note: with default settings TfidfVectorizer lowercases the text and drops
        # single-character tokens, so such tokens get the max_idf fallback below.
        tfidf = TfidfVectorizer()
        tfidf.fit(joined_texts)

        # Store the IDF weight of every word; unseen words get the largest IDF
        max_idf = max(tfidf.idf_)
        self.word_idf_weight = defaultdict(lambda: max_idf,
                                           [(word, idf) for word, idf in zip(tfidf.get_feature_names_out(), tfidf.idf_)])
        return self

    def transform(self, tokenized_texts):
        """
        tokenized_texts: list of tokenized sentences or texts (list of lists of tokens)
        Returns: np.array with the vector representation of each sentence
        """
        vectors = []

        for tokens in tokenized_texts:
            # List of IDF-weighted word vectors
            vecs = []

            for word in tokens:
                if word in self.embedding_model:
                    vec = self.embedding_model[word] * self.word_idf_weight[word]
                else:
                    vec = np.zeros(self.dim)

                vecs.append(vec)

            # Average if there are any words, otherwise return a zero vector
            if vecs:
                sentence_vec = np.mean(vecs, axis=0)
            else:
                sentence_vec = np.zeros(self.dim)

            vectors.append(sentence_vec)

        return np.array(vectors)


In [None]:
sentence_vectorizer = TfidfEmbeddingVectorizer(word_embeddings)
sentence_vectorizer = sentence_vectorizer.fit(tokenized_sentences)

In [None]:
## TODO copy your function for cosine similarity here

def get_cosine_similarity_matrix(sentences):
    """
    sentences: np.array where each row is a sentence vector
    returns: cosine similarity matrix G
    """
    G = cosine_similarity(sentences)
    return G


In [None]:
def summarize(sentences,summary_len=5):
    vectorized_sentences = sentence_vectorizer.transform(sentences)
    G = get_cosine_similarity_matrix(vectorized_sentences)
    nx_graph = nx.from_numpy_array(G)
    nx_scores = nx.pagerank(nx_graph)
    ranked_sentences = sorted(((nx_scores[i],s,i) for i,s in enumerate(sentences)), reverse=True)
    summary = []
    # a text may contain fewer than summary_len sentences
    for i in range(min(summary_len, len(ranked_sentences))):
        summary.append(' '.join(ranked_sentences[i][1]))
    return summary

Summarize your texts:

In [None]:
system_summaries = [summarize(text) for text in tqdm(tokenized_texts)]



  0%|          | 0/300 [00:00<?, ?it/s]

Print the summary for the text at index 7:

In [None]:
system_summaries[7]

["`` I went down ... to bring my son home , '' Goldman said on CNN 's Larry King Live Wednesday , figuring his ex-wife 's death had made the custody issue a moot point , and `` we find out that this man does n't file custody , but he files to remove my name from a Brazilian birth certificate that they had issued for my son , who was born in Red Bank , New Jersey . ''",
 "`` A child belongs with his family , and there is no reason why David Goldman should not get his child back , '' Clinton said in a recent interview on NBC 's Today show .",
 '`` The fact of the matter is that in order to be a parent , you have to be more than just a DNA donor , Mr. King .',
 'Shortly after Bruna Bianchi Goldman arrived in her homeland she called to say she wanted a divorce , which she obtained in Brazil , and would stay there with their son , Sean .',
 "`` I would tell him that he 's been very brave , as he has fought to have his son returned to him , '' Clinton said in the NBC interview ."]