## Extractive Text Summarization by Sentence Ranking using 1-, 2- and 3-gram weights

We rank the sentences of a text three times, using 1-gram, 2-gram and 3-gram weights respectively, and then combine the three rankings into an overall ranking of the sentences.

First, start with the necessary imports.

In [1]:
# Importing libraries
import numpy as np
import pandas as pd
# BeautifulSoup is used to remove html tags from the text
from bs4 import BeautifulSoup 
import re # For regular expressions
import csv
import string
from datetime import datetime

import nltk
import random
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

In [2]:
# Start time
startTime = datetime.now()

**Open a Text file**

In [3]:
# Read the file from the "summary" folder in the current working directory
name = input("Enter Text File name: \n")
filename = "./summary/" + name + ".txt"
# Use a context manager so the file is closed after reading
with open(filename, "r") as f:
    text = f.read()

Enter Text File name: 
Cambridge Analytica whistleblower Christopher Wylie


**Create raw sentences and convert** *text* **to lower case for processing**

In [4]:
raw_sentences = sent_tokenize(text)
text = text.lower()
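
`sent_tokenize` is called here on the original text and again in section 1.1 on the lowercased text, and the final ranking pairs the two lists by index. A sentence tokenizer may use capitalization as a cue, so the two calls are not guaranteed to split the text identically. Below is a minimal sketch of one way to keep both lists aligned: split once, then lowercase sentence by sentence. `split_sentences` is a hypothetical regex stand-in for `sent_tokenize`, used only to keep the example self-contained.

```python
import re


def split_sentences(text):
    """Naive stand-in for nltk.tokenize.sent_tokenize (assumption for this demo)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


demo_text = "Data is the new oil. Oil is Cheap! Is it?"
# Split once on the original text ...
demo_raw = split_sentences(demo_text)
# ... then lowercase each sentence, so index i refers to the same sentence in both lists
demo_lower = [s.lower() for s in demo_raw]

for raw, low in zip(demo_raw, demo_lower):
    print(repr(raw), "->", repr(low))
```

With this layout the raw and processed lists always have the same length, whatever tokenizer is plugged in.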

## Steps

**Each ranking is a 5-step process**

1. n-gram extraction from the *text*
2. n-gram frequency distribution across the entire text, removing n-grams with leading and/or trailing stop words
3. Writing all n-grams and their frequencies to a dictionary and a csv file for later use
4. Weighting all sentences of the text using n-gram frequency
5. Writing each sentence and its weight to a csv file
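
The five steps can be sketched for a generic n as follows. This is an illustrative sketch rather than the notebook's own code: it uses only the standard library, a tiny hypothetical stop word list instead of nltk's, whitespace splitting instead of `word_tokenize`, and the helper names `ngrams`, `rank_by_ngrams` and `to_csv` are made up for the example.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical, tiny stop word list for illustration only; the notebook uses
# nltk.corpus.stopwords.words("english") instead.
STOP_WORDS = {"the", "a", "of", "is", "and", "to", "in"}


def ngrams(words, n):
    """Return the list of consecutive n-word tuples in `words`."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def rank_by_ngrams(sentences, n):
    """Weight each sentence by the text-wide frequency of its n-grams."""
    # Step 1: extract n-grams sentence by sentence (letters only, lower case).
    tokenized = [re.sub("[^a-z]", " ", s.lower()).split() for s in sentences]
    grams_per_sentence = [ngrams(words, n) for words in tokenized]
    # Step 2: count n-grams over the whole text, dropping those that start
    # or end with a stop word.
    freq = Counter(
        g for grams in grams_per_sentence for g in grams
        if g[0] not in STOP_WORDS and g[-1] not in STOP_WORDS
    )
    # Step 3: `freq` doubles as the n-gram -> frequency dictionary.
    # Step 4: a sentence's weight is the summed frequency of its kept n-grams.
    weights = [sum(freq.get(g, 0) for g in grams) for grams in grams_per_sentence]
    return freq, list(zip(sentences, weights))


def to_csv(rows):
    """Step 5: serialize rows as csv text (a real file object works the same way)."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


demo_sentences = ["The data is big.", "Big data is the new oil.", "Oil is cheap."]
demo_freq, demo_ranked = rank_by_ngrams(demo_sentences, 1)
print(to_csv(demo_ranked))
```

Calling `rank_by_ngrams(demo_sentences, 2)` or `rank_by_ngrams(demo_sentences, 3)` gives the 2-gram and 3-gram weights of the same sentences.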

## 1. 1-gram based Sentence Ranking

### 1.1. 1-gram Extraction from the *text*

First, let's see how sentences are ranked using 1-gram weights. Stop words are removed before counting, so they do not contribute to the sum over a sentence's 1-grams.

In [5]:
# List of all 1-grams in text
one_grams = []
# 1. Remove html tags
text = BeautifulSoup(text, "lxml").get_text()
# 2. Tokenize sentences
sentences = sent_tokenize(text)
# Build the stop word set once; it is reused by later cells
stop_words = set(stopwords.words("english"))
for i in range(len(sentences)):
    # 3. Replace non-letters in the i-th sentence with spaces
    sentence = re.sub("[^a-zA-Z]"," ",sentences[i])
    # 4. Tokenize words of the i-th sentence of text
    words = word_tokenize(sentence)
    # 5. Remove stop words and collect the remaining 1-grams
    one_grams.extend(w for w in words if w not in stop_words)

### 1.2. 1-gram frequency distribution across the entire text - *stop words not included*

Next, compute the frequency distribution of all unique 1-grams using nltk.

In [6]:
# Find the 1-grams and their frequency distribution
one_grams = nltk.FreqDist(one_grams)
# Prepare the rows of the term-frequency file:
# most_common() returns (term, frequency) pairs sorted by decreasing frequency
Distr1 = one_grams.most_common()

### 1.3. Writing all 1-grams and their frequencies as a dictionary and a csv for later use

Create a dictionary mapping each 1-gram to its frequency for later use, and write all pairs to a csv file for reference.

In [7]:
# Build a term -> frequency Python dictionary
dic1 = {row[0]: row[1] for row in Distr1}
# Write the term, frequency rows to a csv file
with open("1-grams of test with stopwords.csv", "w", newline = '') as f:
    writer = csv.writer(f)
    writer.writerows(Distr1)

### 1.4. Weighting of all sentences of the text using 1-gram frequency

Using the 1-gram frequency dictionary, compute the weight of each sentence by adding the frequencies of all its 1-grams other than stop words.

In [8]:
sentence_1gram = []
stop_words = set(stopwords.words("english"))
for i in range(len(sentences)):
    # 1. Replace non-letters in the i-th sentence with spaces
    sentence = re.sub("[^a-zA-Z]"," ",sentences[i])
    # 2. Tokenize words of the i-th sentence of text
    words = word_tokenize(sentence)
    # 3. Remove stop words from the i-th sentence of text
    words = [w for w in words if w not in stop_words]
    # 4. Pair the i-th sentence with its weight: the summed frequency of its 1-grams
    v = sum(dic1.get(w, 0) for w in words)
    sentence_1gram.append([sentence, v])

### 1.5. Writing out the sentence and its weight in a csv file

This is for future reference

In [9]:
# Write Sentence and its 1-gram weight as a csv file
with open("Sentence weight based on 1-grams.csv", "w", newline = '') as f:
    writer = csv.writer(f)
    writer.writerows(sentence_1gram)

## 2. 2-gram based Sentence Ranking

### 2.1. 2-gram Extraction from the *text*

Next, let's see how sentences are ranked differently using 2-gram weights. 2-grams with a leading or trailing stop word are excluded from the sum over a sentence's 2-grams.

In [10]:
# List of all 2-grams
bigrams = []
for i in range(len(sentences)):
    # 1. Removing non-letter from i-th sentence
    sentence = re.sub("[^a-zA-Z]"," ",sentences[i])
    # 2. Tokenize words of i-th sentence of text
    words = word_tokenize(sentence)
    bigrm = nltk.bigrams(words)
    bigrm = list(bigrm)
    # 3. Extend the list of bigrams across all sentences of text
    bigrams.extend(bigrm)

### 2.2.a 2-gram frequency distribution across the entire text

Next, compute the frequency distribution of all unique 2-grams using nltk.

In [11]:
# Find the bigrams and their frequency distribution
bi_grams = nltk.FreqDist(bigrams)
# Prepare the rows of the term-frequency file:
# most_common() returns ((word1, word2), frequency) pairs sorted by
# decreasing frequency, with each bigram kept as a tuple
Distr2 = bi_grams.most_common()

### 2.2.b Removing 2-grams with leading and/or trailing stop words

In this step, remove the 2-grams that have a leading or trailing stop word.

In [12]:
# Keep only the 2-grams whose first and last words are not stop words
b = [line for line in Distr2
     if line[0][0] not in stop_words and line[0][1] not in stop_words]

### 2.3. Writing all 2-grams and their frequencies as a dictionary and a csv for later use

Create a dictionary mapping each 2-gram to its frequency for later use, and write all pairs to a csv file for reference.

In [13]:
# Build a 2-gram -> frequency Python dictionary
dic2 = {row[0]: row[1] for row in b}
# Write the 2-gram, frequency rows to a csv file
with open("2-grams of test with stopwords removed.csv", "w", newline = '') as f:
    writer = csv.writer(f)
    writer.writerows(b)

### 2.4. Weighting of all sentences of the text using 2-gram frequency

Using the 2-gram frequency dictionary, compute the weight of each sentence by adding the frequencies of all its 2-grams that have no leading or trailing stop word.

In [14]:
sentence_2gram = []
for i in range(len(sentences)):
    # 1. Replace non-letters in the i-th sentence with spaces
    sentence = re.sub("[^a-zA-Z]"," ",sentences[i])
    # 2. Tokenize words and form the 2-grams of the i-th sentence of text
    words = word_tokenize(sentence)
    bigrm = list(nltk.bigrams(words))
    # 3. Pair the i-th sentence with its weight; 2-grams missing from dic2
    #    (those with a leading or trailing stop word) contribute 0
    v = sum(dic2.get(w, 0) for w in bigrm)
    sentence_2gram.append([sentence, v])

### 2.5. Writing out the sentence and its weight based on 2-grams in a csv file

This is for future reference

In [15]:
# Write Sentence and its 2-gram weight as a csv file
with open("Sentence weight based on 2-grams.csv", "w", newline = '') as f:
    writer = csv.writer(f)
    writer.writerows(sentence_2gram)

## 3. 3-gram based Sentence Ranking

### 3.1. 3-gram Extraction from the *text*

Next, let's see how sentences are ranked differently using 3-gram weights. 3-grams with a leading or trailing stop word are excluded from the sum over a sentence's 3-grams.

In [16]:
# List of all 3-grams
trigrams = []
for i in range(len(sentences)):
    # 1. Replace non-letters in the i-th sentence with spaces
    sentence = re.sub("[^a-zA-Z]"," ",sentences[i])
    # 2. Tokenize words of the i-th sentence of text
    words = word_tokenize(sentence)
    trigrm = nltk.trigrams(words)
    trigrm = list(trigrm)
    # 3. Extend the list of trigrams across all sentences of text
    trigrams.extend(trigrm)

### 3.2.a 3-gram frequency distribution across the entire text

Next, compute the frequency distribution of all unique 3-grams using nltk.

In [17]:
# Find the trigrams and their frequency distribution
tri_grams = nltk.FreqDist(trigrams)
# Prepare the rows of the term-frequency file:
# most_common() returns ((word1, word2, word3), frequency) pairs sorted by
# decreasing frequency, with each trigram kept as a tuple
Distr3 = tri_grams.most_common()

### 3.2.b Removing 3-grams with leading and/or trailing stop words

In this step, remove the 3-grams that have a leading or trailing stop word.

In [18]:
# Keep only the 3-grams whose first and last words are not stop words
d = [line for line in Distr3
     if line[0][0] not in stop_words and line[0][2] not in stop_words]
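
Only the first and last words are tested, so a 3-gram with a stop word in the middle is kept. The sketch below shows this on made-up rows shaped like those of `Distr2` and `Distr3`; the stop word list, the counts and the helper name `keep_ngram` are assumptions for illustration only.

```python
# Hypothetical stop word list; the notebook uses nltk's English stop words
DEMO_STOP_WORDS = {"the", "of", "is", "in"}


def keep_ngram(gram, stop_words):
    """True if neither the first nor the last word of `gram` is a stop word."""
    return gram[0] not in stop_words and gram[-1] not in stop_words


# Made-up (n-gram, frequency) rows in the same shape as Distr2 / Distr3
demo_rows = [
    (("of", "the"), 5),               # dropped: leading and trailing stop word
    (("cambridge", "analytica"), 3),  # kept
    (("use", "of", "data"), 2),       # kept: the stop word is in the middle
    (("the", "use", "of"), 2),        # dropped: leading and trailing stop word
]
demo_kept = [row for row in demo_rows if keep_ngram(row[0], DEMO_STOP_WORDS)]
print(demo_kept)
```

Indexing with `gram[-1]` rather than a fixed position lets the same test serve 2-grams and 3-grams.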

### 3.3. Writing all 3-grams and their frequencies as a dictionary and a csv for later use

Create a dictionary mapping each 3-gram to its frequency for later use, and write all pairs to a csv file for reference.

In [19]:
# Build a 3-gram -> frequency Python dictionary
dic3 = {row[0]: row[1] for row in d}
# Write the 3-gram, frequency rows to a csv file
with open("3-grams of test with stopwords removed.csv", "w", newline = '') as f:
    writer = csv.writer(f)
    writer.writerows(d)

### 3.4. Weighting of all sentences of the text using 3-gram frequency

Using the 3-gram frequency dictionary, compute the weight of each sentence by adding the frequencies of all its 3-grams that have no leading or trailing stop word.

In [20]:
sentence_3gram = []
for i in range(len(sentences)):
    # 1. Replace non-letters in the i-th sentence with spaces
    sentence = re.sub("[^a-zA-Z]"," ",sentences[i])
    # 2. Tokenize words and form the 3-grams of the i-th sentence of text
    words = word_tokenize(sentence)
    trigrm = list(nltk.trigrams(words))
    # 3. Pair the i-th sentence with its weight; 3-grams missing from dic3
    #    (those with a leading or trailing stop word) contribute 0
    v = sum(dic3.get(w, 0) for w in trigrm)
    sentence_3gram.append([sentence, v])

### 3.5. Writing out the sentence and its weight based on 3-grams in a csv file

This is for future reference

In [21]:
# Write Sentence and its 3-gram weight as a csv file
with open("Sentence weight based on 3-grams.csv", "w", newline = '') as f:
    writer = csv.writer(f)
    writer.writerows(sentence_3gram)

## Now combine the 1-, 2- and 3-gram lists and their frequencies in the *text*

In [22]:
Distr = Distr1 + b + d
with open("All-grams of test with stopwords.csv", "w", newline = '') as f:
    writer = csv.writer(f)
    writer.writerows(Distr)

## And compute the sentence ranking using combined 1-, 2- and 3-gram weights

In [23]:
sentence_weight = []
# Assumes raw_sentences (tokenized from the original text) and sentences
# (tokenized from the lowercased, html-stripped text) align index by index
for i in range(len(sentences)):
    # Combined weight: sum of the 1-gram, 2-gram and 3-gram weights
    weight = sentence_1gram[i][1] + sentence_2gram[i][1] + sentence_3gram[i][1]
    sentence_weight.append([raw_sentences[i], weight])
# Sort the sentences by decreasing combined weight
sentence_weight = sorted(sentence_weight, key=lambda a_entry: a_entry[1], reverse = True)
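
The sorted list puts the highest-weighted sentences first. One common way to turn such a ranking into an extractive summary is to keep the top k sentences and present them in their original order. The notebook itself stops at the ranking, so the function name `top_k_summary`, the demo data and the choice of k below are assumptions.

```python
def top_k_summary(sentences, weights, k):
    """Return the k highest-weighted sentences, in their original order."""
    # Indices of the k largest weights (ties keep the earlier sentence first)
    top = sorted(range(len(sentences)), key=lambda i: weights[i], reverse=True)[:k]
    # Restore document order so the summary reads naturally
    return [sentences[i] for i in sorted(top)]


# Made-up sentences and combined weights for illustration
demo_summary_sentences = ["First point.", "Key finding.", "A side note.", "Main result."]
demo_summary_weights = [4, 7, 3, 5]
demo_summary = top_k_summary(demo_summary_sentences, demo_summary_weights, 2)
print(" ".join(demo_summary))
```

Because each weight is a sum of frequencies, longer sentences tend to score higher; a small k keeps the summary short regardless.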

## Write the sentence ranking using combined 1-, 2- and 3-gram weights

In [24]:
# Write Sentence and its all-gram weight as a csv file
with open("Sentence weight based on all-grams.csv", "w", newline = '') as f:
    writer = csv.writer(f)
    writer.writerows(sentence_weight)

In [25]:
print(datetime.now() - startTime)

0:00:05.323112
