# Text Summarization using LexRank and TextRank

**Downloading the data**

In [1]:
!gdown --id 1epJmR0cAV65GIgflnmJpXHXwkC5BYuU6

Downloading...
From: https://drive.google.com/uc?id=1epJmR0cAV65GIgflnmJpXHXwkC5BYuU6
To: /content/BBC Business News.zip


**Renaming the archive to remove spaces from the file name**

In [2]:
import os
os.rename("BBC Business News.zip","BBC_Business_News.zip")

**Extracting the content**

In [3]:
!unzip /content/BBC_Business_News.zip

Archive:  /content/BBC_Business_News.zip
   creating: BBC Business News/News Articles/
   creating: BBC Business News/News Articles/business/
  inflating: BBC Business News/News Articles/business/001.txt  
  inflating: BBC Business News/News Articles/business/002.txt  
  inflating: BBC Business News/News Articles/business/003.txt  
  inflating: BBC Business News/News Articles/business/004.txt  
  inflating: BBC Business News/News Articles/business/005.txt  
  inflating: BBC Business News/News Articles/business/006.txt  
  inflating: BBC Business News/News Articles/business/007.txt  
  inflating: BBC Business News/News Articles/business/008.txt  
  inflating: BBC Business News/News Articles/business/009.txt  
  inflating: BBC Business News/News Articles/business/010.txt  
  inflating: BBC Business News/News Articles/business/011.txt  
  inflating: BBC Business News/News Articles/business/012.txt  
  inflating: BBC Business News/News Articles/business/013.txt  
  inflating: BBC Business 
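
The download, rename, and unzip steps above rely on shell commands. As a rough alternative, Python's standard `zipfile` module can extract the archive directly, which also sidesteps the space in the file name. This is only a minimal sketch; the helper name `extract_archive` is an assumption and not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

In this notebook's setting the call would presumably be `extract_archive("/content/BBC Business News.zip", "/content")`, with no rename needed.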

**Installing all the required libraries**

In [4]:
pip install lexrank

Collecting lexrank
  Downloading https://files.pythonhosted.org/packages/e1/25/f139d8526e014b6bf6021305492cd7ccffbfa10999802fce4813808b04e4/lexrank-0.1.0-py3-none-any.whl (69kB)
Collecting urlextract>=0.7
  Downloading https://files.pythonhosted.org/packages/c3/24/0f5c690a4ef9b5d30845517ef14c35ce6a3d96e5b0ae0db6895bb194ab10/urlextract-1.2.0-py3-none-any.whl
Collecting path.py>=10.5
  Downloading https://files.pythonhosted.org/packages/8f/04/130b7a538c25693c85c4dee7e25d126ebf5511b1eb7320e64906687b159e/path.py-12.5.0-py3

In [5]:
pip install rouge-score

Collecting rouge-score
  Downloading https://files.pythonhosted.org/packages/1f/56/a81022436c08b9405a5247b71635394d44fe7e1dbedc4b28c740e09c2840/rouge_score-0.0.4-py2.py3-none-any.whl
Installing collected packages: rouge-score
Successfully installed rouge-score-0.0.4


**Importing required packages**

In [6]:
from lexrank import STOPWORDS, LexRank
from path import Path
from rouge_score import rouge_scorer
import logging
from gensim.summarization import summarize  # requires gensim < 4.0; the summarization module was removed in 4.0
import numpy as np

**Loading the articles and their reference summaries as lists of lines**

In [7]:
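documents = []  # one entry per article: the list of lines read from its file
target = []     # one entry per reference summary: the list of lines read from its file
documents_dir = Path('/content/BBC Business News/News Articles/business')
target_dir = Path('/content/BBC Business News/Summaries/business')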

for file_path in documents_dir.files('*.txt'):
    with file_path.open(mode='rt', encoding='utf-8') as fp:
        documents.append(fp.readlines())

for file_path in target_dir.files('*.txt'):
    with file_path.open(mode='rt', encoding='utf-8') as fp:
        target.append(fp.readlines())
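
The two loops above read each directory independently, so `documents[i]` and `target[i]` refer to the same article only if both directories happen to be listed in the same order. The sketch below shows one way to make the pairing explicit by iterating over sorted file names. It assumes that each summary file has the same name as its article; the helper name `load_paired` is hypothetical, and the standard-library `pathlib` is used in place of `path.py`.

```python
from pathlib import Path


def load_paired(doc_dir, summary_dir, pattern="*.txt"):
    """Return (documents, targets) aligned by file name.

    Assumption: a summary is stored under the same file name as its article.
    Articles without a matching summary are skipped.
    """
    doc_dir, summary_dir = Path(doc_dir), Path(summary_dir)
    documents, targets = [], []
    for doc_path in sorted(doc_dir.glob(pattern)):
        summary_path = summary_dir / doc_path.name
        if not summary_path.exists():
            continue
        text = doc_path.read_text(encoding="utf-8")
        # Keep each non-empty line as one unit, mirroring readlines() above.
        documents.append([ln.strip() for ln in text.splitlines() if ln.strip()])
        targets.append(summary_path.read_text(encoding="utf-8").strip())
    return documents, targets
```

Returning each summary as a single string rather than a list of lines is a design choice of this sketch; it keeps later scoring calls free of list-to-string conversions.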

# Summarization with LexRank using the lexrank library

**Initializing the lexrank summarizer**

In [8]:
# The whole corpus is passed in so that the library can compute IDF weights across all documents
lxr = LexRank(documents, stopwords=STOPWORDS['en'])

**Storing the summaries with different sentence counts**



In [9]:
summariesL10 = [] # lexrank summarization for count 10
summariesL15 = [] # lexrank summarization for count 15
summariesL20 = [] # lexrank summarization for count 20
summariesL25 = [] # lexrank summarization for count 25

for sentences in documents:
  summariesL10.append(lxr.get_summary(sentences, summary_size=10))
  summariesL15.append(lxr.get_summary(sentences, summary_size=15))
  summariesL20.append(lxr.get_summary(sentences, summary_size=20))
  summariesL25.append(lxr.get_summary(sentences, summary_size=25))
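
To give a rough idea of what a LexRank-style ranking does, the sketch below builds a sentence similarity graph and scores sentences with a damped power iteration, as in PageRank. It is a simplification and not the `lexrank` library's implementation: it uses plain term-frequency cosine similarity with no IDF weighting, no stopword removal, and no similarity threshold. All function names and the `damping` and `iterations` defaults are assumptions made for illustration.

```python
from collections import Counter

import numpy as np


def cosine_matrix(sentences):
    """Pairwise cosine similarity between bag-of-words sentence vectors."""
    bags = [Counter(s.lower().split()) for s in sentences]
    vocab = sorted(set().union(*bags))
    tf = np.array([[bag[w] for w in vocab] for bag in bags], dtype=float)
    norms = np.linalg.norm(tf, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for empty sentences
    unit = tf / norms
    return unit @ unit.T


def centrality_scores(similarity, damping=0.85, iterations=100):
    """Score sentences by a damped random walk over the similarity graph."""
    n = similarity.shape[0]
    weights = similarity.copy()
    np.fill_diagonal(weights, 0.0)  # ignore self-similarity
    row_sums = weights.sum(axis=1, keepdims=True)
    dangling = row_sums[:, 0] == 0  # sentences similar to no other sentence
    transition = weights / np.where(row_sums == 0, 1.0, row_sums)
    transition[dangling] = 1.0 / n  # dangling rows jump uniformly
    scores = np.full(n, 1.0 / n)
    for _ in range(iterations):
        scores = (1 - damping) / n + damping * (transition.T @ scores)
    return scores


def top_sentences(sentences, k):
    """Return the k highest-scoring sentences, best first."""
    scores = centrality_scores(cosine_matrix(sentences))
    ranked = np.argsort(scores)[::-1][:k]
    return [sentences[i] for i in ranked]
```

Under these assumptions, a sentence that shares no words with the rest of the document receives the lowest score, which is why off-topic sentences tend to be left out of the summary. The sketch expects a non-empty list of sentences.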

# Summarization with TextRank using the gensim library

**Storing the summaries with different ratios**

In [18]:
summariesT10 = [] # TextRank summaries at ratio 0.20
summariesT15 = [] # TextRank summaries at ratio 0.30
summariesT20 = [] # TextRank summaries at ratio 0.40
summariesT25 = [] # TextRank summaries at ratio 0.50

# str(sentences) passes the repr of the line list (brackets, quotes, escaped newlines) to gensim
for sentences in documents:
  summariesT10.append(summarize(str(sentences), ratio=0.20))
  summariesT15.append(summarize(str(sentences), ratio=0.30))
  summariesT20.append(summarize(str(sentences), ratio=0.40))
  summariesT25.append(summarize(str(sentences), ratio=0.50))
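
Because `documents` holds lists of lines, `str(sentences)` produces the Python representation of a list rather than running text. The sketch below contrasts that with joining the lines into one plain string, which is closer to what a text summarizer expects as input. The helper name `to_plain_text` is hypothetical, and switching to it would likely change the scores reported in this notebook.

```python
def to_plain_text(lines):
    """Join the lines of one article into a single string, dropping blank lines."""
    return " ".join(line.strip() for line in lines if line.strip())


lines = ["Profits rose.\n", "\n", "Shares fell.\n"]

# The list repr keeps brackets, quotes, and literal backslash-n sequences.
print(str(lines))

# The joined form is ordinary running text.
print(to_plain_text(lines))
```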

# Calculating ROUGE scores using the rouge-score library

**Initializing the scorer for ROUGE-1 and ROUGE-2**

In [19]:
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2'], use_stemmer=True)
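
For orientation, ROUGE-N compares the n-grams of a candidate summary with those of a reference: precision is the share of candidate n-grams found in the reference, recall is the share of reference n-grams found in the candidate, and the F-measure combines the two. The sketch below is a minimal pure-Python illustration of that definition, assuming whitespace tokenization and no stemming, so its numbers will not exactly match the `rouge-score` library. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """Return (precision, recall, fmeasure) for n-gram overlap."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # clipped n-gram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    fmeasure = 2 * precision * recall / (precision + recall)
    return precision, recall, fmeasure


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n(reference, candidate, n=1))  # 5 of 6 unigrams match
print(rouge_n(reference, candidate, n=2))  # 3 of 5 bigrams match
```

Note that `RougeScorer.score` takes the reference first and the prediction second, so the argument order determines which of precision and recall is which; the F-measure is unaffected by the order.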

**Calculating average scores for different sentence counts of LexRank summaries**



In [20]:
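# Each call scores the string form of the entire summary list against the string
# form of the entire reference list, giving one corpus-level result per metric.
# RougeScorer.score expects (target, prediction); the summaries are passed first here,
# so the reported precision and recall are swapped relative to that convention.
scoresL10 = scorer.score(str(summariesL10),str(target))
scoresL15 = scorer.score(str(summariesL15),str(target))
scoresL20 = scorer.score(str(summariesL20),str(target))
scoresL25 = scorer.score(str(summariesL25),str(target))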

In [21]:
sL10 = {}
sL15 = {}
sL20 = {}
sL25 = {}
for k,v in scoresL10.items():
  # v is a Score(precision, recall, fmeasure) tuple; average its three fields
  sL10[k] = round(sum(v)/ float(len(v)),5)
for k,v in scoresL15.items():
  # v is a Score(precision, recall, fmeasure) tuple; average its three fields
  sL15[k] = round(sum(v)/ float(len(v)),5)
for k,v in scoresL20.items():
  # v is a Score(precision, recall, fmeasure) tuple; average its three fields
  sL20[k] = round(sum(v)/ float(len(v)),5)
for k,v in scoresL25.items():
  # v is a Score(precision, recall, fmeasure) tuple; average its three fields
  sL25[k] = round(sum(v)/ float(len(v)),5)

In [22]:
print("Average Rouge scores for sentence count 10 using lexrank -->", sL10)
print("Average Rouge scores for sentence count 15 using lexrank -->", sL15)
print("Average Rouge scores for sentence count 20 using lexrank -->", sL20)
print("Average Rouge scores for sentence count 25 using lexrank -->", sL25)

Average Rouge scores for sentence count 10 using lexrank --> {'rouge1': 0.71336, 'rouge2': 0.66244}
Average Rouge scores for sentence count 15 using lexrank --> {'rouge1': 0.68259, 'rouge2': 0.65904}
Average Rouge scores for sentence count 20 using lexrank --> {'rouge1': 0.6788, 'rouge2': 0.65807}
Average Rouge scores for sentence count 25 using lexrank --> {'rouge1': 0.6777, 'rouge2': 0.65767}
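
The figures above come from a single scorer call on the whole corpus, followed by averaging the precision, recall, and F-measure of that one result. A more conventional reading of "average ROUGE" scores each summary against its own reference and then averages one field, typically the F-measure, across documents. The sketch below illustrates that aggregation step only. The `Score` tuple is a stand-in that mirrors the field names of the tuple returned by `rouge-score`, and `mean_fmeasure` is a hypothetical helper.

```python
from collections import namedtuple
from statistics import mean

# Stand-in with the same field names as rouge-score's result tuple (assumption for illustration).
Score = namedtuple("Score", ["precision", "recall", "fmeasure"])


def mean_fmeasure(per_document_scores):
    """Average the F-measure of each metric over a list of per-document score dicts."""
    metrics = per_document_scores[0].keys()
    return {
        m: round(mean(doc[m].fmeasure for doc in per_document_scores), 5)
        for m in metrics
    }


# Two made-up documents, used only to show the shape of the data.
example = [
    {"rouge1": Score(0.5, 0.5, 0.5), "rouge2": Score(0.2, 0.4, 0.25)},
    {"rouge1": Score(0.7, 0.7, 0.7), "rouge2": Score(0.3, 0.6, 0.35)},
]
print(mean_fmeasure(example))
```

With the notebook's `scorer`, the per-document list could presumably be built as `[scorer.score(ref, summ) for ref, summ in zip(references, summaries)]`, where `references` and `summaries` are hypothetical lists of plain strings aligned by article.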


**Calculating average scores for different ratios of TextRank summaries**


In [15]:
scoresT10 = scorer.score(str(summariesT10),str(target))
scoresT15 = scorer.score(str(summariesT15),str(target))
scoresT20 = scorer.score(str(summariesT20),str(target))
scoresT25 = scorer.score(str(summariesT25),str(target))

In [16]:
sT10 = {}
sT15 = {}
sT20 = {}
sT25 = {}
for k,v in scoresT10.items():
  # v is a Score(precision, recall, fmeasure) tuple; average its three fields
  sT10[k] = round(sum(v)/ float(len(v)),5)
for k,v in scoresT15.items():
  # v is a Score(precision, recall, fmeasure) tuple; average its three fields
  sT15[k] = round(sum(v)/ float(len(v)),5)
for k,v in scoresT20.items():
  # v is a Score(precision, recall, fmeasure) tuple; average its three fields
  sT20[k] = round(sum(v)/ float(len(v)),5)
for k,v in scoresT25.items():
  # v is a Score(precision, recall, fmeasure) tuple; average its three fields
  sT25[k] = round(sum(v)/ float(len(v)),5)

In [17]:
print("Average Rouge scores for ratio 0.20 using textrank -->", sT10)
print("Average Rouge scores for ratio 0.30 using textrank -->", sT15)
print("Average Rouge scores for ratio 0.40 using textrank -->", sT20)
print("Average Rouge scores for ratio 0.50 using textrank -->", sT25)

Average Rouge scores for ratio 0.20 using textrank --> {'rouge1': 0.72699, 'rouge2': 0.55998}
Average Rouge scores for ratio 0.30 using textrank --> {'rouge1': 0.85103, 'rouge2': 0.64423}
Average Rouge scores for ratio 0.40 using textrank --> {'rouge1': 0.88468, 'rouge2': 0.68823}
Average Rouge scores for ratio 0.50 using textrank --> {'rouge1': 0.82937, 'rouge2': 0.69791}
