# Goals of this notebook
- What is text summarization?
- Summarizing text using the TextRank algorithm.
- Keyword extraction using the TextRank algorithm.

## 1. What is Text Summarization?
Text summarization is a natural language processing (NLP) technique that involves reducing the length of a document while preserving its main ideas. This is particularly useful in processing and understanding large volumes of text data. The goal is to produce a concise summary that retains the essential information from the original document.

### 1.1 Types of Text Summarization
Text summarization techniques can broadly be categorized into two types:

#### Extractive Summarization
Extractive summarization works by identifying and extracting the most critical sentences, phrases, or sections from the original text to form a summary. The selected content is directly lifted from the source document without alteration.

- This approach is easier to implement because it relies on ranking the importance of sentences rather than generating new content.
- Sentences are ranked based on features such as term frequency (TF), TF-IDF scores, sentence position, or similarity to other sentences.
- The summary may lack coherence or flow, as sentences are extracted from different parts of the text without rephrasing or restructuring.
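As a minimal sketch of the ranking idea described above, the snippet below scores each sentence by the average document-wide frequency of its words and keeps the best ones. The function name `frequency_summary`, the regex-based sentence splitter, and the sample text are assumptions made for illustration; this is a simple frequency baseline, not TextRank.

```python
import re
from collections import Counter

def frequency_summary(text, top_n=1):
    """Return the top_n sentences with the highest average word frequency."""
    # Naive sentence split: whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Term frequency counted over the whole document
    tf = Counter(re.findall(r"[a-z0-9]+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z0-9]+", sentence.lower())
        return sum(tf[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score, keep the best, restore document order
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:top_n])]

text = "Cats chase mice. Cats sleep all day. Dogs bark."
print(frequency_summary(text, top_n=1))
```

Because the selected sentences are copied verbatim, the result shows the typical extractive trade-off: it is easy to compute, but nothing smooths the transitions between sentences.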

#### Abstractive Summarization
Abstractive summarization involves generating new sentences that capture the essence of the original text, often rephrasing or paraphrasing the content to create a more fluid and natural summary.

- This approach is more challenging to implement as it requires a deeper understanding of the text's meaning and the ability to generate new content.
- It involves natural language generation (NLG) techniques to create summaries that are not just a subset of the original text.
- Abstractive summaries tend to be more coherent and fluent, as the sentences are crafted to logically flow from one to another.

## 2. Summarizing Text using TextRank algorithm
TextRank is an unsupervised graph-based ranking algorithm that is used for natural language processing tasks such as keyword extraction and text summarization. It is inspired by the PageRank algorithm, which is used by search engines like Google to rank web pages. The core idea behind TextRank is to model the relationships between different elements of a text (such as sentences or words) as a graph, and then apply ranking techniques to identify the most important elements.

TextRank operates by constructing a graph where sentences or words are represented as nodes, and edges between nodes represent the similarity or relationship between them. The importance of each node is determined based on the principle that connections from other important nodes contribute more significantly to the node's own importance.

### 2.1 Steps Involved in TextRank for Summarization

#### Preprocessing the text
Before constructing the graph, the text is preprocessed. This involves tokenization, stop word removal, and stemming or lemmatization.
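A rough sketch of this preprocessing step using only the standard library is shown here. The tiny `STOP_WORDS` set and the regex tokenizer are assumptions for illustration (a real stop word list is much longer), and stemming or lemmatization is left out to keep the example short.

```python
import re

# Tiny illustrative stop word list (an assumption; real lists are far longer)
STOP_WORDS = {"the", "a", "an", "is", "on", "of", "and", "to", "in"}

def simple_preprocess(text):
    """Split text into sentences, then each sentence into lowercase content words."""
    # Naive sentence split: whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokens = [
        [w for w in re.findall(r"[a-z0-9]+", s.lower()) if w not in STOP_WORDS]
        for s in sentences
    ]
    return sentences, tokens

sentences, tokens = simple_preprocess("The cat sat on the mat. The dog barked!")
print(sentences)
print(tokens)
```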

#### Constructing the graph
In TextRank, each sentence from the document is treated as a node in the graph. An edge is created between two nodes if their corresponding sentences are similar. The weight of the edge reflects the degree of similarity between the sentences.

The similarity between two sentences can be calculated using various methods, such as cosine similarity, which measures the cosine of the angle between two vectors in a multidimensional space. The sentences are typically represented as vectors of word frequencies or TF-IDF values.

For two sentences represented by vectors $ A $ and $ B $, cosine similarity is given by:

$$ \text{Cosine Similarity} = \frac{A \cdot B}{\|A\| \|B\|} $$

Where:
- $ A \cdot B $ is the dot product of the vectors
- $ \|A\| $ and $ \|B\| $ are the magnitudes of the vectors.
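The formula can be sketched in a few lines of Python, assuming each sentence is represented by raw word counts (a bag of words). The function name `cosine_similarity` and the sample token lists are illustrative choices; TF-IDF weights could be used in place of the counts.

```python
import math
from collections import Counter

def cosine_similarity(words_a, words_b):
    """Cosine similarity between two token lists, using word-count vectors."""
    a, b = Counter(words_a), Counter(words_b)
    # Dot product only needs the words present in both sentences
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence is treated as similar to nothing
    return dot / (norm_a * norm_b)

print(cosine_similarity(["cat", "sat", "mat"], ["cat", "sat", "hat"]))
```

Here the two sentences share two of their three words, so the dot product is 2 and each magnitude is $\sqrt{3}$, giving a similarity of $2/3$.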

#### Applying the TextRank algorithm
Once the graph is constructed, TextRank is applied to rank the sentences. The algorithm works as follows:
- **Assign** an initial score to each sentence (node) in the graph. Typically, this is set to a uniform value (e.g., $ \frac{1}{N} $, where $ N $ is the number of sentences).
- **Update** the score of each sentence based on the scores of neighboring sentences and the weights of the edges connecting them. The update rule is similar to that used in PageRank.

$$ S(V_i) = (1 - d) + d \sum_{V_j \in \text{Adj}(V_i)} \frac{W_{ij}}{\sum_{V_k \in \text{Adj}(V_j)} W_{jk}} S(V_j) $$

Where:
- $ S(V_i) $ is the score of sentence $ V_i $.
- $ d $ is the damping factor (typically set to 0.85); $ 1 - d $ accounts for the probability of jumping to a random sentence.
- $ W_{ij} $ is the weight of the edge between sentences $ V_i $ and $ V_j $.
- $ \text{Adj}(V_i) $ denotes the set of sentences adjacent to $ V_i $.
<br>

- **Iterate** until the scores of the sentences converge (i.e., the change in the scores between iterations falls below a certain threshold).
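The three steps above can be sketched in plain Python as follows. The function name `textrank_scores`, the tolerance `tol`, and the toy weight matrix are assumptions for illustration; the matrix is assumed to be symmetric with a zero diagonal, so that $ W_{ij} = W_{ji} $ and no sentence votes for itself.

```python
def textrank_scores(weights, d=0.85, tol=1e-6, max_iter=100):
    """Iterate S(V_i) = (1 - d) + d * sum_j W_ij / (sum_k W_jk) * S(V_j)."""
    n = len(weights)
    # Total edge weight leaving each node (the inner sum over k)
    out_sum = [sum(row) for row in weights]
    # Assign: uniform initial scores
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        # Update: each node collects score from its neighbors
        new_scores = []
        for i in range(n):
            rank = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - d) + d * rank)
        # Iterate: stop once the largest change falls below the threshold
        converged = max(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores

# Toy graph: sentence 0 is similar to sentences 1 and 2, which are unrelated to each other
W = [
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]
print(textrank_scores(W))
```

In this toy graph, sentence 0 ends up with the highest score because both other sentences point to it, while sentences 1 and 2 receive identical scores.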

#### Generating the summary
Once the TextRank scores are computed, the sentences are ranked based on their scores. The top-ranked sentences are then selected to form the summary.
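A small sketch of this selection step is given here. The helper name `select_summary` and the sample scores are hypothetical, and presenting the chosen sentences in their original document order (rather than in rank order) is an assumption that often makes the summary easier to read.

```python
def select_summary(sentences, scores, top_n=2):
    """Join the top_n highest-scoring sentences, kept in document order."""
    # Sentence indices sorted from highest to lowest score
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Keep the best top_n and restore their original order
    chosen = sorted(ranked[:top_n])
    return " ".join(sentences[i] for i in chosen)

sentences = ["Profits rose.", "The firm cut costs.", "Shares climbed."]
scores = [0.2, 0.5, 0.3]
print(select_summary(sentences, scores, top_n=2))
```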

## 3. Keyword Extraction using the TextRank algorithm
Keyword extraction is the process of identifying and selecting the most relevant words or phrases within a document. These keywords provide a quick summary of the content and are essential for tasks such as information retrieval, document classification, and text summarization. TextRank also helps in keyword extraction.

TextRank for keyword extraction follows a similar approach to its use in text summarization but focuses on identifying significant words or phrases within a document.

### 3.1 Steps Involved in Keyword Extraction Using TextRank
The preprocessing step remains the same.

#### Building the co-occurrence graph
In TextRank, each unique word (excluding stopwords) is treated as a node in the graph. An edge is created between two nodes if the corresponding words co-occur within a fixed window size in the text. The window size typically ranges from 2 to 10 words.

The co-occurrence window defines how close two words must be in the text to be considered related. For instance, if a window of 4 consecutive words slides over the sentence "The cat sat on the mat" (keeping stop words for the sake of illustration), the word "cat" shares a window with "The," "sat," "on," and "the," so it would be connected to each of them.
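One way to sketch the sliding window in Python is shown here. The function name `cooccurrence_edges` is hypothetical, and the convention that a window of size `window` links each word to the `window - 1` words that follow it is an assumption; other definitions of the window exist.

```python
from collections import defaultdict

def cooccurrence_edges(words, window=2):
    """Count how often two distinct words fall inside the same sliding window."""
    edges = defaultdict(int)
    for i, w1 in enumerate(words):
        # The next (window - 1) words share a window with words[i]
        for w2 in words[i + 1 : i + window]:
            if w1 != w2:
                # Sort the pair so (a, b) and (b, a) count as the same undirected edge
                edges[tuple(sorted((w1, w2)))] += 1
    return dict(edges)

# Stop words already removed from "The cat sat on the mat"
print(cooccurrence_edges(["cat", "sat", "mat"], window=2))
print(cooccurrence_edges(["cat", "sat", "mat"], window=3))
```

With `window=2` only neighboring words are linked; widening the window to 3 also connects "cat" and "mat".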

#### Applying the TextRank algorithm
This step is the same as the one described in **2.1**, with words as nodes instead of sentences.

#### Extracting keywords
After applying TextRank, the words are ranked based on their final scores. The top-ranked words are selected as the keywords.
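Putting the keyword steps together, a compact sketch might look like the following. It assumes an unweighted, undirected co-occurrence graph, a fixed number of iterations instead of a convergence test, and a made-up list of already preprocessed words; the name `rank_keywords` is hypothetical.

```python
from collections import defaultdict

def rank_keywords(words, window=2, d=0.85, iterations=50, top_n=3):
    """Rank words by TextRank score on an unweighted co-occurrence graph."""
    # Build the graph as adjacency sets
    neighbors = defaultdict(set)
    for i, w1 in enumerate(words):
        for w2 in words[i + 1 : i + window]:
            if w1 != w2:
                neighbors[w1].add(w2)
                neighbors[w2].add(w1)
    # Every node starts with the same score
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        # Each neighbor u splits its score evenly among its own neighbors
        scores = {
            w: (1 - d) + d * sum(scores[u] / len(neighbors[u]) for u in neighbors[w])
            for w in neighbors
        }
    # Highest-scoring words first
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

words = ["market", "shares", "rose", "market", "profit", "fell", "market", "shares"]
print(rank_keywords(words, top_n=1))
```

In this toy input, "market" co-occurs with every other word, so it collects the most score and is returned as the top keyword.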

## 4. Text Summarization & Keyword Extraction in Code

In [16]:
import numpy as np
import pandas as pd
import textwrap
import networkx as nx
import plotly.graph_objects as go
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

The dataset is taken from [BBC News](https://www.kaggle.com/c/learn-ai-bbc).

In [12]:
df = pd.read_csv("./bbc_text_cls.csv")

In [13]:
df.head()

| Unnamed: 0 | text | labels |
|---|---|---|
| 0 | Ad sales boost Time Warner profit\n\nQuarterly... | business |
| 1 | Dollar gains on Greenspan speech\n\nThe dollar... | business |
| 2 | Yukos unit buyer faces loan claim\n\nThe owner... | business |
| 3 | High fuel prices hit BA's profits\n\nBritish A... | business |
| 4 | Pernod takeover talk lifts Domecq\n\nShares in... | business |


In [14]:
doc = df[df.labels == "business"]["text"].sample(random_state=101)

In [15]:
def wrap(x):
    return textwrap.fill(
        x, replace_whitespace=False, fix_sentence_endings=True)

In [17]:
print(wrap(doc.iloc[0]))

Rich grab half Colombia poor fund

Half of the money put aside by the
Colombian government to help the country's poor is benefiting people
who do not need it, a study has found.

A total of 24.2 trillion pesos
($10.2bn; £5.5bn) is earmarked for subsidies for the poor, the
government department for planning said.  But it also found 12.1
trillion pesos was going to the richest part of the population, rather
than to those in need.  Sound distribution of the cash could cut
poverty levels to 36% from 53%, the government believes.  "Resources
are more than enough to reduce poverty and there is no need for more
tax reforms but a better distribution," deputy planning director Jose
Leibovich said.

Colombia has a population of about 44 million and
half lives below poverty line.  However, some large properties are
paying less in tax as they are situated inside poor areas, which
benefit from cheaper utilities such as electricity and water,
government research found.  Government expenditure in ar

In [2]:
def preprocess_text(text):
    sentences = sent_tokenize(text)
    stop_words = set(stopwords.words('english'))
    word_tokens = [word_tokenize(sent) for sent in sentences]
    filtered_words = [[word for word in words if word.isalnum() and word.lower() not in stop_words] for words in word_tokens]
    
    return sentences, filtered_words

In [3]:
def sentence_similarity(sent1, sent2):
    words1 = set(sent1.lower().split())
    words2 = set(sent2.lower().split())
    
    common_words = words1.intersection(words2)
    
    # calculate Jaccard similarity (|intersection| / |union|),
    # used here as a simpler stand-in for cosine similarity
    union_count = len(words1.union(words2))
    sim = len(common_words) / union_count if union_count > 0 else 0
    
    return sim

In [4]:
def build_similarity_matrix(sentences):
    similarity_matrix = np.zeros((len(sentences), len(sentences)))
    
    for i in range(len(sentences)):
        for j in range(len(sentences)):
            if i == j:
                similarity_matrix[i][j] = 1
            else:
                similarity_matrix[i][j] = sentence_similarity(sentences[i], sentences[j])
    
    return similarity_matrix

In [5]:
def page_rank(M, num_iterations=100, d=0.85):
    # number of nodes (sentences)
    N = M.shape[0]
    
    # rank vector with uniform probability
    v = np.ones(N) / N
    
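    # transition matrix with damping; M is row-normalized (each row sums to 1)
    M_hat = (d * M + (1 - d) / N)
    
    # loop to update the rank vector; the transpose lets each node collect
    # score from the nodes linking to it (M_hat @ v would leave a uniform v unchanged)
    for _ in range(num_iterations):
        v = M_hat.T @ v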
        
    return v

Feel free to skip the plotting code below. I put it together with help from the internet and LLMs, and it only draws the graph; it is not part of the TextRank algorithm itself.

In [6]:
import plotly.graph_objects as go

def visualize_graph(matrix, labels, scores, title):
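    G = nx.Graph()
    # Add every label first so G.nodes() follows the order of `labels` and `scores`
    G.add_nodes_from(labels)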
    
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if matrix[i, j] > 0:
                G.add_edge(labels[i], labels[j], weight=matrix[i, j])
    
    pos = nx.spring_layout(G, seed=42)
    
    edge_x = []
    edge_y = []
    for edge in G.edges():
        x0, y0 = pos[edge[0]]
        x1, y1 = pos[edge[1]]
        edge_x.append(x0)
        edge_x.append(x1)
        edge_x.append(None)
        edge_y.append(y0)
        edge_y.append(y1)
        edge_y.append(None)
    
    edge_trace = go.Scatter(
        x=edge_x, y=edge_y,
        line=dict(width=0.5, color='#888'),
        hoverinfo='none',
        mode='lines')
    
    node_x = []
    node_y = []
    for node in G.nodes():
        x, y = pos[node]
        node_x.append(x)
        node_y.append(y)
    
    node_trace = go.Scatter(
        x=node_x, y=node_y,
        mode='markers+text',
        text=[f'{labels[i]}: {scores[i]:.2f}' for i in range(len(labels))],
        textposition="top center",
        hoverinfo='text',
        marker=dict(
            showscale=True,
            colorscale='YlGnBu',
            size=[v * 100 for v in scores],
            color=scores,
            colorbar=dict(
                thickness=15,
                # `titleside` is not accepted by newer Plotly releases; nest the side under `title`
                title=dict(text='Node Importance', side='right'),
                xanchor='left'
            ),
            line_width=2))

    fig = go.Figure(data=[edge_trace, node_trace],
                 layout=go.Layout(
                    title=dict(text=title, font=dict(size=16)),
                    showlegend=False,
                    hovermode='closest',
                    margin=dict(b=20,l=5,r=5,t=40),
                    annotations=[dict(
                        text="",
                        showarrow=False,
                        xref="paper", yref="paper"
                    )],
                    xaxis=dict(showgrid=False, zeroline=False),
                    yaxis=dict(showgrid=False, zeroline=False))
                    )
    fig.show()

In [7]:
def generate_summary(text, top_n=3):
    sentences, _ = preprocess_text(text)
    similarity_matrix = build_similarity_matrix(sentences)
    
    # Normalize the similarity matrix row-wise
    # [:, np.newaxis] converts the row vector to column vector
    norm_sim_matrix = similarity_matrix / similarity_matrix.sum(axis=1)[:, np.newaxis]
    
    # Apply PageRank (will return a vector with page ranks)
    scores = page_rank(norm_sim_matrix)
    
    # Rank sentences based on scores
    ranked_sentences = [sentences[i] for i in scores.argsort()[::-1]]
    
    # Extract top N sentences for the summary
    summary = " ".join(ranked_sentences[:top_n])
    
    # Visualize with the graph
    visualize_graph(norm_sim_matrix, sentences, scores, "Sentence Similarity Graph")
    
    return summary

In [8]:
def extract_keywords(text, top_n=10):
    _, filtered_words = preprocess_text(text)
    words = [word for sublist in filtered_words for word in sublist]
    
    unique_words = list(set(words))
    word_index = {word: idx for idx, word in enumerate(unique_words)}
    
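    # Build the co-occurrence matrix (here the window is the whole sentence:
    # every pair of words in the same sentence counts as co-occurring)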
    co_occurrence_matrix = np.zeros((len(unique_words), len(unique_words)))
    for sentence in filtered_words:
        for i in range(len(sentence)):
            for j in range(i + 1, len(sentence)):
                w1, w2 = word_index[sentence[i]], word_index[sentence[j]]
                co_occurrence_matrix[w1][w2] += 1
                co_occurrence_matrix[w2][w1] += 1
                
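    # Normalize the co-occurrence matrix row-wise, leaving all-zero rows
    # (words with no co-occurring neighbor) as zeros instead of dividing by zero
    row_sums = co_occurrence_matrix.sum(axis=1, keepdims=True)
    norm_co_matrix = np.divide(
        co_occurrence_matrix, row_sums,
        out=np.zeros_like(co_occurrence_matrix), where=row_sums > 0)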
    
    # Apply PageRank
    scores = page_rank(norm_co_matrix)
    
    # Rank words based on scores
    ranked_words = [unique_words[i] for i in scores.argsort()[::-1]]
    
    # Extract top N keywords
    keywords = ranked_words[:top_n]
    
    # Visualize with the graph
    visualize_graph(co_occurrence_matrix, unique_words, scores, "Keyword Co-occurrence Graph")
    
    return keywords

In [20]:
doc = df[df.labels == "business"]["text"].sample(random_state=110)

summary = generate_summary(doc.iloc[0].split("\n", 1)[1], top_n=5)
print("Summary:")
print(summary)
print()
print("Original Title: ")
print(doc.iloc[0].split("\n", 1)[0])

Summary:
Kraft's new advertising policy, which covers advertising on TV, radio and in print publications, is aimed at children between the ages of six and 11. 
Kraft plans to cut back on advertising of products like Oreo cookies and sugary Kool-Aid drinks as part of an effort to promote healthy eating. "We're working on ways to encourage both adults and children to eat wisely by selecting more nutritionally balanced diets," said Lance Friedmann, Kraft senior vice president. The moves come as the firms face criticism from consumer groups concerned at rising levels of obesity in US children. Kraft rival PepsiCo began a similar labelling initiative last year.

Original Title: 
Kraft cuts snack ads for children


In [21]:
doc = df[df.labels == "business"]["text"].sample(random_state=115)

keywords = extract_keywords(doc.iloc[0].split("\n", 1)[1], top_n=5)
print("\nKeywords:")
print(keywords)


Keywords:
['minority', 'Italian', '2002', 'Mediobanca', '1bn']
