## Summarizing BBC news
In this lab, you will study two unsupervised, graph-based summarization methods, LexRank and TextRank, on your own and apply them to summarize news articles.

First, download the [data](http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip) and extract the files.

In [1]:
# importing required modules
from zipfile import ZipFile

# use a name other than `zip` so the built-in zip() is not shadowed
with ZipFile('bbc-fulltext.zip', 'r') as zf:
    # printing all the contents of the zip file
    zf.printdir()

    # extracting all the files
    print('Extracting all the files now...')
    zf.extractall()
    print('Done!')

File Name                                             Modified             Size
bbc/                                           2015-04-05 16:29:08            0
bbc/entertainment/                             2010-03-30 00:45:20            0
bbc/entertainment/289.txt                      2010-03-30 00:45:20         2261
bbc/entertainment/262.txt                      2010-03-30 00:45:20         4810
bbc/entertainment/276.txt                      2010-03-30 00:45:20         2127
bbc/entertainment/060.txt                      2010-03-30 00:45:20         1046
bbc/entertainment/074.txt                      2010-03-30 00:45:20         1586
bbc/entertainment/048.txt                      2010-03-30 00:45:20         2121
bbc/entertainment/114.txt                      2010-03-30 00:45:20         1481
bbc/entertainment/100.txt                      2010-03-30 00:45:20         1821
bbc/entertainment/128.txt                      2010-03-30 00:45:20         1238
bbc/entertainment/316.txt               

Below, the Politics category is selected. *(You are free to use any other category instead, e.g. tech, sports, business, or entertainment.)*

The Politics category contains 417 news articles. The goal is to summarize **each news article** individually, for at least 10 articles. The compression ratio should be between 25% and 30%.
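
The lab does not say how the compression ratio is measured, so the following is only a minimal sketch that assumes it is the summary length divided by the article length, both counted in whitespace-separated words. The helper name `compression_ratio` and the toy strings are hypothetical and are not part of the lab material.

```python
def compression_ratio(summary: str, article: str) -> float:
    """Return summary length divided by article length, counted in words.

    Assumption: length is measured in whitespace-separated words; the lab
    could equally intend sentences or characters.
    """
    article_words = len(article.split())
    if article_words == 0:
        raise ValueError("article is empty")
    return len(summary.split()) / article_words


# Toy strings for illustration only (not taken from the dataset).
article = "one two three four five six seven eight nine ten eleven twelve"
summary = "one two three"

ratio = compression_ratio(summary, article)
in_target = 0.25 <= ratio <= 0.30
print(f"{ratio:.0%}", in_target)  # 3 of 12 words -> prints: 25% True
```

A check like this can be run on each generated summary to see whether it falls inside the 25%-30% band before it is written to the output file.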

In [1]:
# !pip install path
from path import Path

documents = []
documents_dir = Path('bbc/politics')
for file_path in documents_dir.files('*.txt'):
    with file_path.open(mode='rt', encoding='utf-8') as fp:
        documents.append(fp.readlines())

Use the sentences of one article *as an example*.

In [2]:
sentences = documents[0]
print(sentences)

['Labour plans maternity pay rise\n', '\n', 'Maternity pay for new mothers is to rise by £1,400 as part of new proposals announced by the Trade and Industry Secretary Patricia Hewitt.\n', '\n', 'It would mean paid leave would be increased to nine months by 2007, Ms Hewitt told GMTV\'s Sunday programme. Other plans include letting maternity pay be given to fathers and extending rights to parents of older children. The Tories dismissed the maternity pay plan as "desperate", while the Liberal Democrats said it was misdirected.\n', '\n', 'Ms Hewitt said: "We have already doubled the length of maternity pay, it was 13 weeks when we were elected, we have already taken it up to 26 weeks. "We are going to extend the pay to nine months by 2007 and the aim is to get it right up to the full 12 months by the end of the next Parliament." She said new mothers were already entitled to 12 months leave, but that many women could not take it as only six of those months were paid. "We have made a firm co

## LexRank

**TODO #1**: Study the LexRank algorithm and describe how it works.

**TODO #2**: Use the LexRank library to summarize the data as shown in the example below.

Note: Make sure that the selected sentences in your final summary are ordered chronologically.

Reference: [LexRank library](https://pypi.org/project/lexrank/)
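
One way to satisfy the ordering note is to pick sentences by score and then sort the chosen indices before joining them. The sketch below assumes that "chronologically" means the order in which the sentences appear in the article. The helper `top_k_in_document_order` and the toy scores are hypothetical; in practice the scores could come from a call such as `rank_sentences` shown in the example cells.

```python
def top_k_in_document_order(sentences, scores, k):
    """Pick the k highest-scoring sentences, then restore document order."""
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    # Indices sorted from highest to lowest score.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Keep the top k, then sort by position so the summary reads in order.
    chosen = sorted(ranked[:k])
    return [sentences[i] for i in chosen]


# Toy input for illustration only.
toy_sentences = ["A.", "B.", "C.", "D."]
toy_scores = [0.9, 1.1, 0.7, 1.3]

ordered = top_k_in_document_order(toy_sentences, toy_scores, k=2)
print(ordered)  # "D." has the top score, but "B." comes first in the text
```

Sorting by index after selection keeps the ranking step and the presentation step separate, so the same helper works with scores from either method.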

---



Run LexRank to summarize the input document.

In [3]:
# !pip install lexrank
from lexrank import STOPWORDS, LexRank
lxr = LexRank(documents, stopwords=STOPWORDS['en'])

Get the score of each sentence.

In [4]:
# 'fast_power_method' speeds up the calculation, but requires more RAM
scores_cont = lxr.rank_sentences(sentences,
                                 threshold=None,
                                 fast_power_method=False,)
print(scores_cont)

[1.10540489 1.         1.05086576 1.         1.08395518 1.
 1.11241192 1.         1.04556705 1.         0.60179519]


Print the highest-ranked sentences.

In [5]:
summary = lxr.get_summary(sentences, summary_size=2, threshold=.25)
print(summary)

['Ms Hewitt also stressed the plans would be paid for by taxpayers, not employers. But David Frost, director general of the British Chambers of Commerce, warned that many small firms could be "crippled" by the move. "While the majority of any salary costs may be covered by the government\'s statutory pay, recruitment costs, advertising costs, retraining costs and the strain on the company will not be," he said. Further details of the government\'s plans will be outlined on Monday. New mothers are currently entitled to 90% of average earnings for the first six weeks after giving birth, followed by £102.80 a week until the baby is six months old.\n', '\n']


In [6]:
# get summary with continuous LexRank
summary_cont = lxr.get_summary(sentences, threshold=None)
print(summary_cont)

['Ms Hewitt said: "We have already doubled the length of maternity pay, it was 13 weeks when we were elected, we have already taken it up to 26 weeks. "We are going to extend the pay to nine months by 2007 and the aim is to get it right up to the full 12 months by the end of the next Parliament." She said new mothers were already entitled to 12 months leave, but that many women could not take it as only six of those months were paid. "We have made a firm commitment. We will definitely extend the maternity pay, from the six months where it now is to nine months, that\'s the extra £1,400." She said ministers would consult on other proposals that could see fathers being allowed to take some of their partner\'s maternity pay or leave period, or extending the rights of flexible working to carers or parents of older children. The Shadow Secretary of State for the Family, Theresa May, said: "These plans were announced by Gordon Brown in his pre-budget review in December and Tony Blair is now 

## TextRank
**TODO #3**: Study the TextRank algorithm and describe how it works.

**TODO #4**: Use the TextRank library to summarize the data as shown in the example below.

Note: Make sure that the selected sentences in your final summary are ordered chronologically.

Reference: [TextRank library](https://pypi.org/project/summa/)

---

Join all sentences into one piece of text.

In [7]:
text = ' '.join(sentences)
print(text)

Labour plans maternity pay rise
 
 Maternity pay for new mothers is to rise by £1,400 as part of new proposals announced by the Trade and Industry Secretary Patricia Hewitt.
 
 It would mean paid leave would be increased to nine months by 2007, Ms Hewitt told GMTV's Sunday programme. Other plans include letting maternity pay be given to fathers and extending rights to parents of older children. The Tories dismissed the maternity pay plan as "desperate", while the Liberal Democrats said it was misdirected.
 
 Ms Hewitt said: "We have already doubled the length of maternity pay, it was 13 weeks when we were elected, we have already taken it up to 26 weeks. "We are going to extend the pay to nine months by 2007 and the aim is to get it right up to the full 12 months by the end of the next Parliament." She said new mothers were already entitled to 12 months leave, but that many women could not take it as only six of those months were paid. "We have made a firm commitment. We will definitel

In [8]:
from summa.summarizer import summarize
print(summarize(text, ratio=0.25))

Maternity pay for new mothers is to rise by £1,400 as part of new proposals announced by the Trade and Industry Secretary Patricia Hewitt.
The Tories dismissed the maternity pay plan as "desperate", while the Liberal Democrats said it was misdirected.
"We are going to extend the pay to nine months by 2007 and the aim is to get it right up to the full 12 months by the end of the next Parliament." She said new mothers were already entitled to 12 months leave, but that many women could not take it as only six of those months were paid.
We will definitely extend the maternity pay, from the six months where it now is to nine months, that's the extra £1,400." She said ministers would consult on other proposals that could see fathers being allowed to take some of their partner's maternity pay or leave period, or extending the rights of flexible working to carers or parents of older children.


## Lab part

In [9]:
lab_docs = []
documents_dir = Path('bbc/tech')
for file_path in documents_dir.files('*.txt'):
    with file_path.open(mode='rt', encoding='utf-8') as fp:
        lab_docs.append(fp.readlines())

1. Study the LexRank algorithm and describe how it works.<br>
<u>Ans.</u> LexRank is an extractive text summarization algorithm that builds a graph over the sentences and uses eigenvector centrality to choose the most important sentences as the result. It starts by computing TF-IDF weights for each word in the document, then constructs a $|s| \times |s|$ matrix, where $|s|$ is the number of sentences in the document. Each position in the matrix holds the idf-modified cosine similarity of the corresponding pair of sentences, so at this point we have the relationship between every pair of sentences. Finally, it computes a centrality probability for each sentence to determine how important that sentence is to the document's context, with the average probability over all sentences serving as the threshold.
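
The steps described in the answer can be sketched in plain Python. This is an illustrative approximation of continuous LexRank, not the `lexrank` library's implementation. It makes several assumptions: IDF is estimated from the sentences themselves with a `+ 1.0` smoothing term, tokenization is a simple whitespace split, and no damping is applied. The function names and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def tokenize(sentence):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip('.,;:!?"\'()') for w in sentence.lower().split())
    return [w for w in words if w]


def lexrank_scores(sentences, iterations=100):
    """Continuous LexRank sketch; scores are scaled so that their mean is 1."""
    n = len(sentences)
    if n == 0:
        return []
    tfs = [Counter(tokenize(s)) for s in sentences]
    # Assumption: IDF comes from the sentences themselves, smoothed by +1.
    df = Counter(w for tf in tfs for w in tf)
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}
    vecs = [{w: c * idf[w] for w, c in tf.items()} for tf in tfs]
    norms = [math.sqrt(sum(v * v for v in vec.values())) for vec in vecs]

    def cosine(i, j):
        if norms[i] == 0 or norms[j] == 0:
            return 0.0
        dot = sum(v * vecs[j].get(w, 0.0) for w, v in vecs[i].items())
        return dot / (norms[i] * norms[j])

    # |s| x |s| matrix of pairwise similarities.
    sim = [[cosine(i, j) for j in range(n)] for i in range(n)]

    # Row-normalize so that each row is a probability distribution.
    markov = []
    for row in sim:
        total = sum(row)
        markov.append([x / total for x in row] if total else [1.0 / n] * n)

    # Power iteration for the stationary distribution (centrality).
    p = [1.0 / n] * n
    for _ in range(iterations):
        p = [sum(p[i] * markov[i][j] for i in range(n)) for j in range(n)]
    return [x * n for x in p]


# Toy sentences for illustration only; the middle one overlaps with both others.
toy = [
    "Maternity pay will rise.",
    "Maternity pay will rise and small firms may struggle.",
    "Small firms may struggle.",
]
scores = lexrank_scores(toy)
print([round(s, 3) for s in scores])
```

In this toy input the middle sentence shares words with both neighbours, so it receives the highest centrality, while a sentence with no overlap at all would stay at the mean score of 1.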

2. Use the LexRank library to summarize the articles.
- I use continuous LexRank because I want the algorithm to take the context into account.
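
To make the difference between the two variants concrete, the sketch below builds a transition matrix from a toy similarity matrix in both ways. It assumes that the thresholded variant keeps an edge only when the similarity is strictly greater than the threshold, and that the continuous variant keeps the similarity values as edge weights. The function `transition_matrix` and the numbers are hypothetical and are not taken from the `lexrank` library.

```python
def transition_matrix(sim, threshold=None):
    """Row-normalize a similarity matrix; binarize first if a threshold is set."""
    rows = []
    for row in sim:
        if threshold is not None:
            # Thresholded variant: an edge either exists (1) or does not (0).
            row = [1.0 if x > threshold else 0.0 for x in row]
        total = sum(row)
        rows.append([x / total for x in row] if total else [1.0 / len(row)] * len(row))
    return rows


# Toy similarity matrix for illustration only.
toy_sim = [
    [1.0, 0.5, 0.25],
    [0.5, 1.0, 0.25],
    [0.25, 0.25, 1.0],
]

continuous = transition_matrix(toy_sim)
discrete = transition_matrix(toy_sim, threshold=0.3)
print(continuous[0])
print(discrete[0])
```

With the threshold set to 0.3, the weak link of 0.25 to the third sentence disappears entirely, whereas the continuous variant still lets it contribute a small amount to the ranking.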

In [11]:
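def lexRankSummary(num):
    # note: lxr was built from the politics `documents` above
    result = []
    for i in range(num):
        print(f'Article {i + 1}')
        sentences = lab_docs[i]
        scores_cont = lxr.rank_sentences(sentences,
                                         threshold=None,
                                         fast_power_method=True)
        print(scores_cont)

        summary = lxr.get_summary(sentences, threshold=None)
        # strip newlines from each selected line before joining
        summary = [line.replace('\n', '') for line in summary]
        sum_text = ' '.join(summary)
        print(sum_text, '\n')

        # build the article text first: reusing the same quote type inside
        # an f-string expression is a syntax error before Python 3.12
        article_text = ' '.join(sentences)
        result.append(f'## Article {i + 1}\n{article_text}\n### Summary\n{sum_text}\n\n')

    return result


lexRank = lexRankSummary(10)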

Article 1
[1.07991332 1.         0.9151488  1.         0.81387248 1.
 0.78406742 1.         1.09394175 1.         1.05256816 1.
 0.76561659 1.         1.33278971 1.         1.16208175 1.
 1.        ]
The author of one such article began a petition drive against the use of the ink. The greatest part of the opposition to ink has often been sheer ignorance. Local newspapers have carried stories that the ink is harmful, radioactive or even that the ultraviolet readers may cause health problems. Others, such as the aggressively middle of the road, Coalition of Non-governmental Organizations, have lauded the move as an important step forward. This type of ink has been used in many elections in the world, in countries as varied as Serbia, South Africa, Indonesia and Turkey. The other common type of ink in elections is indelible visible ink - but as the elections in Afghanistan showed, improper use of this type of ink can cause additional problems. The use of "invisible" ink is not without its

In [12]:
with open('LexRank.txt', 'w', encoding='utf-8') as f:
    for line in lexRank:
        f.write(line)

3. Study the TextRank algorithm and describe how it works.<br>
<u>Ans.</u> Like LexRank, this algorithm uses a graph-based structure, in which each node represents a sentence from the document. It calculates the similarity between sentences to determine the importance of each sentence. Finally, the algorithm ranks the sentences according to this graph.

4. Use TextRank to summarize the articles.
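
The idea can be sketched in plain Python, following the formulation in the original TextRank paper: edge weights are the number of shared words divided by the sum of the log sentence lengths, and sentences are ranked with a weighted PageRank using a damping factor of 0.85. This is only an illustration under those assumptions. The `summa` library may tokenize, weight, and select sentences differently, and the function names and toy sentences are hypothetical.

```python
import math


def tokenize(sentence):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip('.,;:!?"\'()') for w in sentence.lower().split())
    return [w for w in words if w]


def overlap_similarity(a, b):
    """Shared-word count normalized by the log lengths of both sentences."""
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    if denom == 0:  # both sentences contain a single word
        return 0.0
    return len(set(a) & set(b)) / denom


def textrank_scores(sentences, damping=0.85, iterations=50):
    """Weighted PageRank over the sentence-similarity graph."""
    n = len(sentences)
    tokens = [tokenize(s) for s in sentences]
    # No self-loops: a sentence does not vote for itself.
    weights = [
        [0.0 if i == j else overlap_similarity(tokens[i], tokens[j]) for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


# Toy sentences for illustration only; the middle one overlaps with both others.
toy = [
    "Maternity pay will rise.",
    "Maternity pay will rise and small firms may struggle.",
    "Small firms may struggle.",
]
scores = textrank_scores(toy)
print([round(s, 3) for s in scores])
```

As with the LexRank sketch, the sentence that shares vocabulary with the most other sentences ends up with the highest score.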

In [15]:
def textRankSummary(num):
    result = []
    for i in range(num):
        sentences = lab_docs[i]
        clean_sent = [sentence.replace('\n', '') for sentence in sentences]
        text = ' '.join(clean_sent)

        sum_text = summarize(text, ratio=0.25)
        print(sum_text, '\n')

        # build the article text first: reusing the same quote type inside
        # an f-string expression is a syntax error before Python 3.12
        article_text = ' '.join(sentences)
        record = f'## Article {i + 1}\n {article_text}\n### Summary\n{sum_text}\n\n'
        result.append(record)

    return result


textRank = textRankSummary(10)

Ink helps drive democracy in Asia  The Kyrgyz Republic, a small, mountainous state of the former Soviet republic, is using invisible ink and ultraviolet readers in the country's elections as part of a drive to prevent multiple voting.
In an effort to live up to its reputation in the 1990s as "an island of democracy", the Kyrgyz President, Askar Akaev, pushed through the law requiring the use of ink during the upcoming Parliamentary and Presidential elections.
The use of ink is only one part of a general effort to show commitment towards more open elections - the German Embassy, the Soros Foundation and the Kyrgyz government have all contributed to purchase transparent ballot boxes.
At the entrance to each polling station, one election official will scan voter's fingers with UV lamp before allowing them to enter, and every voter will have his/her left thumb sprayed with ink before receiving the ballot.
The other common type of ink in elections is indelible visible ink - but as the elect

In [16]:
with open('textRank.txt', 'w', encoding='utf-8') as f:
    for line in textRank:
        f.write(line)