# Multi-document summarization


### Table of Contents

* [Introduction](#chapter1)
    * [Approaches of Summarisation](#section_1_1)
    * [Scales of Summarisation](#section_1_2)
* [Problem Statement](#chapter2)
* [Importing necessary Packages](#chapter3)
* [Importing the text files](#chapter4)
* [Extractive text summarisation using Sentence Scoring Method](#chapter5)
* [Multi-document summarization](#chapter6)



## Introduction
<a class="anchor" id="chapter1"></a>



Summarization has been and continues to be a hot research topic in the data science arena. While text summarization algorithms have existed for a while, major advances in natural language processing and deep learning have been made in recent years.

One of the challenges with summarization is that it is hard to generalize. For example, summarizing a news article is very different from summarizing a financial earnings report. Text features such as document length or genre (tech, sports, finance, travel, etc.) make summarization a serious data science problem to solve. For this reason, the way summarization works largely depends on the use case and there is no one-size-fits-all solution.

#### Two main approaches to summarization   
<a class="anchor" id="section_1_1"></a>



* *Extractive summarization*: selects the most meaningful sentences in an article and arranges them coherently. The summary sentences are taken from the article without any modification.
* *Abstractive summarization*: paraphrases the most important content of the article, producing its own wording rather than copying sentences.


#### Two scales of document summarization 
 <a class="anchor" id="section_1_2"></a>

* *Single-document summarization*: the task of summarizing a standalone document. Note that a "document" could refer to different things depending on the use case (URL, internal PDF file, legal contract, financial report, email, etc.).
* *Multi-document summarization*: the task of assembling a collection of documents (usually through a query against a database or search engine) and generating a summary that incorporates perspectives from across documents.



Finally, there are two common metrics any summarizer attempts to optimize:

* *Topic coverage*: does the summary incorporate the main topics from the document?
* *Readability*: do the summary sentences flow in a logical way?
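
Neither metric has a single agreed definition. As a rough illustration only, topic coverage can be approximated as the share of a document's most frequent content words that also appear in the summary. The function names, the tiny stopword list and the `top_k` parameter below are assumptions made for this sketch and are not used elsewhere in the notebook.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption; a real list would be much larger).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "on", "for"}


def content_words(text):
    """Lowercase the text and return its alphabetic tokens minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def topic_coverage(document, summary, top_k=5):
    """Fraction of the document's top_k most frequent content words found in the summary."""
    top_words = [w for w, _ in Counter(content_words(document)).most_common(top_k)]
    if not top_words:
        return 0.0
    summary_words = set(content_words(summary))
    return sum(w in summary_words for w in top_words) / len(top_words)


document = "Cats sleep. Cats purr. Dogs bark. Dogs run. Birds sing."
print(topic_coverage(document, "Cats purr.", top_k=2))     # 0.5
print(topic_coverage(document, "Cats and dogs.", top_k=2))  # 1.0
```

A word-overlap score like this says nothing about readability, which usually needs human judgement or a separate measure.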

## Problem Statement 
<a class="anchor" id="chapter2"></a>

The goal is to implement multi-document summarization. The selected domain is online courses.

The main idea of the assignment is that it is important to be able to give an overview of a topic as a whole. \
This matters when a student needs to know what a subject covers without having to go through each of the units in a course plan. In the same way, detailed accounts of what is happening around us are published via newspapers, channels, websites and articles, and a person can easily get an overview of an event by combining the individual reports on it.



## Importing necessary Packages
<a class="anchor" id="chapter3"></a>


* bs4 - Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.
* urllib - the URL handling package for Python. It is used to fetch URLs (Uniform Resource Locators) through the `urlopen` function and supports a variety of protocols. It collects several modules for working with URLs, such as `urllib.parse` for parsing URLs.
* re - Python's regular expression module. A regular expression is a sequence of characters that defines a search pattern. Such patterns are typically used for "find" or "find and replace" operations on strings, or for input validation.
* heapq - the heap queue (priority queue) algorithm, used here to pick the highest-scoring sentences.
* nltk - The Natural Language Toolkit, more commonly NLTK, is a suite of Python libraries and programs for symbolic and statistical natural language processing of English text.
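
As a small illustration of how two of these modules behave, the sketch below uses `re.sub` to collapse whitespace and `heapq.nlargest` to rank dictionary keys by their values. The sample string and the `scores` dictionary are made-up values for demonstration only.

```python
import heapq
import re

# Collapse any run of whitespace (spaces, newlines, tabs) into a single space.
cleaned = re.sub(r"\s+", " ", "too   many\n spaces")
print(cleaned)  # too many spaces

# Hypothetical scores for three items (illustrative values only).
scores = {"first": 0.4, "second": 1.7, "third": 0.9}

# Iterating over a dict yields its keys; key=scores.get ranks them by score.
top_two = heapq.nlargest(2, scores, key=scores.get)
print(top_two)  # ['second', 'third']
```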

In [None]:
import bs4 as bs
import urllib.request
import re
import heapq

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## Importing the text files
<a class="anchor" id="chapter4"></a>

A directory stores the transcripts of the course videos.\
Generally, transcripts are produced by a third-party caption generator, with a lot of manual work involved as well. The transcripts are therefore nearly precise, depending on the accuracy of the captions.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os

# Folder on the mounted Drive that holds one transcript file per video.
d = "drive/My Drive/procastinate"

a = []          # transcript texts, one entry per file
path_list = []  # matching file names, in the same order as `a`

# Read the files in sorted name order so that texts and names stay aligned.
for path in sorted(os.listdir(d)):
    txt = os.path.join(d, path)
    if os.path.isfile(txt):
        # The context manager closes the file once it has been read.
        with open(txt, "r") as f:
            a.append(f.read())
        path_list.append(path)

count = len(a)  # number of transcripts loaded

## Extractive text summarisation using Sentence Scoring Method
<a class="anchor" id="chapter5"></a>

Extractive text summarization selects the most relevant sentences of the text. This method consists of four phases:
1. Pre-processing
2. Sentence scoring
3. Sentence ranking
4. Summary Extraction

We create a function that performs sentence-scoring summarization. We treat roughly 25% of the document as a reasonable size for a summary, so the function keeps only the top quarter of the sentences, ranked in decreasing order of score.
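
Before the full implementation, the four phases can be traced on a toy input. This is a simplified sketch, not the notebook's function: it replaces NLTK with a naive regex sentence splitter and a tiny hand-written stopword list, and the name `sketch_summary` and its `ratio` parameter are assumptions introduced for illustration.

```python
import heapq
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; NLTK's list is far larger).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "it"}


def sketch_summary(text, ratio=0.25):
    """Return the top-scoring sentences, keeping about `ratio` of them."""
    # 1. Pre-processing: naive sentence split after '.', '!' or '?'.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    if not words:
        return []
    counts = Counter(words)
    top_count = max(counts.values())
    weights = {w: c / top_count for w, c in counts.items()}

    # 2. Sentence scoring: sum the weights of the words in each sentence.
    scores = {
        s: sum(weights.get(w, 0.0) for w in re.findall(r"[a-z]+", s.lower()))
        for s in sentences
    }

    # 3. Sentence ranking: decide how many sentences to keep (at least one).
    keep = max(1, int(len(sentences) * ratio))

    # 4. Summary extraction: pick the `keep` highest-scoring sentences.
    return heapq.nlargest(keep, scores, key=scores.get)


text = "Heaps store data. Heaps are fast. Lists are slow. Trees grow."
print(sketch_summary(text))  # ['Heaps store data.']
```

In this toy text "heaps" occurs twice and gets weight 1.0, every other content word gets 0.5, so the first sentence scores 2.0 and wins the single slot that a 25% ratio leaves for four sentences.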

In [None]:
def summarising(text):
  # 1. Pre-processing
  text = str(text)
  # Remove bracketed numeric references such as [12] and collapse whitespace
  text = re.sub(r'\[[0-9]*\]', ' ', text)
  text = re.sub(r'\s+', ' ', text)
  # Keep letters only; lowercase so the counts match the lowercase stopword
  # list and the lowercased sentence tokens used for scoring below
  formatted_text = re.sub('[^a-zA-Z]', ' ', text).lower()
  formatted_text = re.sub(r'\s+', ' ', formatted_text)
  # Split the original (unformatted) text into sentences
  sentence_list = nltk.sent_tokenize(text)
  stopwords = set(nltk.corpus.stopwords.words('english'))

  # Count how often each non-stopword occurs
  word_frequencies = {}
  for word in nltk.word_tokenize(formatted_text):
    if word not in stopwords:
      word_frequencies[word] = word_frequencies.get(word, 0) + 1
  if not word_frequencies:
    return ''  # nothing to score, e.g. empty input
  # Normalise the counts by the highest one so every weight lies in (0, 1]
  maximum_frequency = max(word_frequencies.values())
  for word in word_frequencies:
    word_frequencies[word] /= maximum_frequency

  # 2. Sentence scoring: add up the weights of the words in each sentence,
  # skipping sentences of 30 or more words
  sentence_scores = {}
  for sent in sentence_list:
    if len(sent.split(' ')) >= 30:
      continue
    for word in nltk.word_tokenize(sent.lower()):
      if word in word_frequencies:
        sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[word]

  # 3. Sentence ranking: keep the top quarter of the sentences by score
  summary_sentences = heapq.nlargest(len(sentence_list) // 4, sentence_scores, key=sentence_scores.get)
  # 4. Summary extraction
  summary = ' '.join(summary_sentences)
  return summary

## Multi-document summarization
<a class="anchor" id="chapter6"></a>

We first summarise each document on its own. The per-document summaries are then joined and summarised once more to produce the multi-document overview, which is printed first and followed by the summary of each module.

In [None]:
# Summarise every document on its own
summary_list = [summarising(doc) for doc in a]

# Join the per-document summaries with a space so the sentence tokenizer can
# still split them, then summarise the combined text for the overview
full_text = " ".join(summary_list)
print("A COMPLETE OVERVIEW\nThese will be what we'll be covering throughout the course. The glimpse of the content is given. \n")
full_text_sum = summarising(full_text)
print(full_text_sum)

# Print each document's summary under its file name
for name, doc_summary in zip(path_list, summary_list):
  print("\n\n")
  print('About: ' + name)
  print(doc_summary)



A COMPLETE OVERVIEW
These will be what we'll be covering throughout the course. The glimpse of the content is given. 

Give your confidence a boost, and watch your procrastination habits disappear.,In fact, in one study 250 adults were encouraged to exercise at least one time per week. You walk into the office today 100% intending to sit down and get right to work on that important task. If you get the project done by your deadline, congratulations, you get your money back. If so, choose one or more of these approaches to boost your confidence and get out of the procrastination zone. Can you imagine how quickly you would get that task done? Get the job done, and then, if you still have time and energy and you want to improve the work, that's fine. Even if you work from home, get up and get dressed in the morning. If you want a 56% boost in the likelihood that you'll stay focused a get a job done, make a plan. Finally, try a 10-minute promise to get some traction on a dreaded task. Deci