![license_header_logo](../../../images/license_header_logo.png)

> **Copyright (c) 2021 CertifAI Sdn. Bhd.**<br>
<br>
This program is part of OSRFramework. You can redistribute it and/or modify
<br>it under the terms of the GNU Affero General Public License as published by
<br>the Free Software Foundation, either version 3 of the License, or
<br>(at your option) any later version.
<br>
<br>This program is distributed in the hope that it will be useful
<br>but WITHOUT ANY WARRANTY; without even the implied warranty of
<br>MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
<br>GNU Affero General Public License for more details.
<br>
<br>You should have received a copy of the GNU Affero General Public License
<br>along with this program.  If not, see <http://www.gnu.org/licenses/>.
<br>

# Introduction

In this notebook, we will learn an **extractive text summarization** technique that uses **cosine similarity** to measure how similar the sentences of an article are to one another and then extracts the most important ones.

# What will we accomplish?

Steps to implement a simple extractive text summarizer:

> Step 1: Input article

> Step 2: Text cleaning

> Step 3: Build similarity matrix

> Step 4: Pick N sentences for summary

# Notebook Content

* [Extractive Text Summarization](#Extractive-Text-Summarization)


* [Getting Started](#Getting-Started)
    * [Import Libraries](#Import-Libraries)
    * [Generate Cleaned Sentences](#Generate-Cleaned-Sentences)
    * [Sentence Similarity](#Sentence-Similarity)
    * [Similarity Matrix](#Similarity-Matrix)
    * [Generate Summary Method](#Generate-Summary-Method)
    * [Summarizing](#Summarizing)

# Extractive Text Summarization


Extractive summarization methods attempt to summarize articles by **selecting a subset of the original text** (here, whole sentences) that retains the most important points. This approach **weights the important sentences** and uses **those same sentences** to form the summary. Different algorithms and techniques are used to assign weights to the sentences and then rank them by importance and by similarity to one another.

> Input Document → Sentences similarity → Weight sentences → Select sentences with higher rank.


Less research is available on abstractive summarization, as it requires a deeper understanding of the text than the extractive approach. Purely extractive summaries often give **better results** than automatic abstractive summaries. This is because abstractive summarization methods must cope with problems such as semantic representation, inference, and natural language generation, which are relatively harder than data-driven approaches such as sentence extraction.


There are many techniques for extractive summarization. To keep it simple, this tutorial uses an **unsupervised learning approach** to compute **sentence similarity** and rank the sentences. One benefit is that you don't need to train and build a model before using it in your project.


It helps to understand **cosine similarity** to make the best use of the code you are about to see. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space: it is the cosine of the angle between them. Since we represent our sentences as vectors of word counts, we can use it to measure the similarity between sentences. The angle is 0 (a cosine similarity of 1) when the two vectors point in the same direction, and the similarity drops toward 0 as the sentences share fewer words.
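To make the definition concrete, here is a minimal sketch that computes the cosine of the angle between two vectors using only the standard library. The function name `cosine_similarity` and the toy vectors are illustrative assumptions and are not part of the notebook's code.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Parallel vectors: the angle is 0, so the similarity is (up to rounding) 1
same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])

# Vectors with no shared component: the angle is 90 degrees, similarity 0
orthogonal = cosine_similarity([1, 0, 0], [0, 1, 0])

print(round(same_direction, 6), orthogonal)
```

Because word counts are never negative, the similarity between two sentence vectors always falls between 0 and 1.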

![text-summarization](../../../images/text-summarization.png)

# Getting Started

## Import Libraries

In [1]:
import nltk
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx

In [2]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\tanch\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Generate Cleaned Sentences

In [3]:
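def read_article(file_name):
    # Read the article; the text is expected on the first line of the file
    with open(file_name, "r") as file:
        filedata = file.readlines()
    article = filedata[0].split(". ")
    sentences = []

    for sentence in article:
        print(sentence)
        # Note: str.replace matches its first argument literally (not as a regex),
        # so this call leaves the sentence unchanged before it is split into words
        sentences.append(sentence.replace("[^a-zA-Z]", " ").split(" "))
    # Discard the last element produced by the split
    sentences.pop()

    return sentences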
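The string `"[^a-zA-Z]"` in `read_article` looks like a regular expression, but `str.replace` only matches it as literal text. The following is a minimal sketch of what regex-based cleaning could look like, assuming the intent is to replace every non-letter character with a space. The helper `clean_sentence` and the sample text are hypothetical.

```python
import re

def clean_sentence(sentence):
    """Hypothetical helper: blank out non-letter characters, then tokenize."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", sentence)
    # split() with no argument also discards the empty strings left by
    # consecutive spaces
    return letters_only.split()

# Made-up sample text, split on ". " the same way read_article does
text = "AI-ready skills matter. Around 100 institutions will join."
cleaned = [clean_sentence(s) for s in text.split(". ") if s]

print(cleaned)
```

Note that this kind of cleaning also removes digits and hyphens. Since the summary is built by joining the tokens back together, you may prefer to keep the original sentence for display and use the cleaned tokens only for the similarity computation.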

## Sentence Similarity

This is where we will be using cosine similarity to find similarity between sentences.

In [4]:
def sentence_similarity(sent1, sent2, stopwords=None):
    if stopwords is None:
        stopwords = []

    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]

    all_words = list(set(sent1 + sent2))

    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)

    # build the vector for the first sentence
    for w in sent1:
        if w in stopwords:
            continue
        vector1[all_words.index(w)] += 1

    # build the vector for the second sentence
    for w in sent2:
        if w in stopwords:
            continue
        vector2[all_words.index(w)] += 1

    return 1 - cosine_distance(vector1, vector2)
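To see what the word-count vectors look like, here is a simplified stand-in for `sentence_similarity` that relies only on the standard library and also returns the intermediate vectors. The function name `bag_of_words_similarity`, the toy sentences, and the choice to return 0.0 when a sentence contains nothing but stop words are assumptions of this sketch, not part of the notebook's code.

```python
import math

def bag_of_words_similarity(sent1, sent2, stop_words=()):
    """Cosine similarity between the word-count vectors of two tokenized sentences."""
    words1 = [w.lower() for w in sent1 if w.lower() not in stop_words]
    words2 = [w.lower() for w in sent2 if w.lower() not in stop_words]

    # One vector position per distinct remaining word
    vocabulary = sorted(set(words1 + words2))
    vector1 = [words1.count(w) for w in vocabulary]
    vector2 = [words2.count(w) for w in vocabulary]

    dot = sum(a * b for a, b in zip(vector1, vector2))
    norm = math.sqrt(sum(a * a for a in vector1)) * math.sqrt(sum(b * b for b in vector2))

    # Assumption: the cosine is undefined for an all-zero vector, so fall back to 0.0
    similarity = dot / norm if norm else 0.0
    return vocabulary, vector1, vector2, similarity

vocabulary, vector1, vector2, similarity = bag_of_words_similarity(
    ["The", "cat", "sat"], ["The", "cat", "ran"], stop_words={"the"}
)
print(vocabulary, vector1, vector2, round(similarity, 3))
```

Here the two sentences share one of their two remaining words, so the vectors are `[1, 0, 1]` and `[1, 1, 0]` over the vocabulary `["cat", "ran", "sat"]`, and the similarity is about 0.5.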

## Similarity Matrix

In [5]:
def build_similarity_matrix(sentences, stop_words):
    # Create an empty similarity matrix
    similarity_matrix = np.zeros((len(sentences), len(sentences)))

    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2:  # ignore if both are same sentences
                continue
            similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stop_words)

    return similarity_matrix
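Cosine similarity is symmetric, so the score for the pair (i, j) equals the score for (j, i). A possible variation, sketched below with plain Python lists, computes each pair once and mirrors it. The function `build_symmetric_matrix` is hypothetical, and the `word_overlap` function is a toy Jaccard overlap swapped in for cosine similarity only to keep the example self-contained.

```python
def build_symmetric_matrix(sentences, similarity):
    """Fill an n x n matrix by scoring each unordered pair of sentences once."""
    n = len(sentences)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only; the diagonal stays 0
            score = similarity(sentences[i], sentences[j])
            matrix[i][j] = score
            matrix[j][i] = score
    return matrix

def word_overlap(sent1, sent2):
    """Toy similarity (Jaccard overlap of the word sets), not cosine similarity."""
    a, b = set(sent1), set(sent2)
    return len(a & b) / len(a | b) if a | b else 0.0

toy_sentences = [["ai", "skills"], ["ai", "cloud"], ["iot", "hub"]]
matrix = build_symmetric_matrix(toy_sentences, word_overlap)

for row in matrix:
    print(row)
```

This roughly halves the number of similarity calls, which matters little for a short article but adds up for long documents.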

## Generate Summary Method

In [6]:
def generate_summary(file_name, top_n=5):
    stop_words = stopwords.words('english')
    summarize_text = []

    # Step 1 - Read text and split it
    sentences = read_article(file_name)

    # Step 2 - Generate similarity matrix across sentences
    sentence_similarity_matrix = build_similarity_matrix(sentences, stop_words)

    # Step 3 - Rank sentences in similarity matrix
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
    scores = nx.pagerank(sentence_similarity_graph)

    # Step 4 - Sort the rank and pick top sentences
    ranked_sentence = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
    print("\nIndexes of top ranked_sentence order are:", ranked_sentence, sep="\n")

    # Never request more sentences than the article contains
    for i in range(min(top_n, len(ranked_sentence))):
        summarize_text.append(" ".join(ranked_sentence[i][1]))

    # Step 5 - Output the summarized text
    print("\nSummarize Text:", ". ".join(summarize_text), sep="\n")
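The ranking step treats the similarity matrix as a weighted graph and scores each sentence with PageRank. To give an intuition for what that computes, here is a simplified power-iteration sketch in plain Python. It is not networkx's implementation: the function `pagerank_scores`, the fixed iteration count, the handling of unconnected sentences, and the toy matrix are all assumptions made for illustration.

```python
def pagerank_scores(matrix, damping=0.85, iterations=100):
    """Simplified PageRank over a weighted adjacency matrix (list of lists)."""
    n = len(matrix)
    # Total outgoing weight per sentence, used to normalize each row
    row_sums = [sum(row) for row in matrix]
    scores = [1.0 / n] * n

    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            incoming = 0.0
            for i in range(n):
                if row_sums[i] > 0:
                    incoming += scores[i] * matrix[i][j] / row_sums[i]
                else:
                    # Assumption: a sentence with no links spreads its score evenly
                    incoming += scores[i] / n
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

# Toy similarity matrix: sentence 0 is strongly similar to both other sentences
similarity = [
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.1],
    [0.5, 0.1, 0.0],
]
scores = pagerank_scores(similarity)

# Sentence indices ordered from highest to lowest score
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranked[0], [round(s, 3) for s in scores])
```

A sentence that is similar to many other sentences receives a high score, which is why the top-ranked sentences tend to capture the main topic of the article.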

## Summarizing

In [7]:
generate_summary("../../../resources/day_08/msft.txt", 1)

In an attempt to build an AI-ready workforce, Microsoft announced Intelligent Cloud Hub which has been launched to empower the next generation of students with AI-ready skills
Envisioned as a three-year collaborative program, Intelligent Cloud Hub will support around 100 institutions with AI infrastructure, course content and curriculum, developer support, development tools and give students access to cloud and AI services
As part of the program, the Redmond giant which wants to expand its reach and is planning to build a strong developer ecosystem in India with the program will set up the core AI infrastructure and IoT Hub for the selected campuses
The company will provide AI development tools and Azure AI services such as Microsoft Cognitive Services, Bot Services and Azure Machine Learning.According to Manish Prakash, Country General Manager-PS, Health and Education, Microsoft India, said, "With AI being the defining technology of our time, it is transforming lives and industry and 

# Extension

There are many more advanced techniques available for text summarization. If you are new to the topic, a good starting point is the research paper [Text Summarization Techniques: A Brief Survey](http://arxiv.org/abs/1707.02268v3).

# Contributors

**Author**
<br>Chee Lam

# References

1. [Extractive Text Summarization](https://towardsdatascience.com/understand-text-summarization-and-create-your-own-summarizer-in-python-b26a9f09fc70)