## TEXT SUMMARISATION 
* Explore 3 different text summarisation models on a test sample: Australia's UN speech for 2015
    * TF-IDF
    * LSA algorithm
    * TextRank algorithm
* Apply the most robust of these methods for long text to summarise each of Australia's speeches from 1970-2015
    * export the top (5) sentences for each year to either a txt or Word file for reading

In [1]:
# packages 
## other specific packages are imported in their relevant sections 
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import tokenize
import matplotlib.pyplot as plt
import html
import re
import collections 

In [2]:
# read pickled DF from notebook #1 (initial exploratory data visuals)
df = pd.read_pickle('df_un_general_debates.pkl')

In [3]:
# E.G. of data set again (same as from notebook (1) and cleaned in notebook (2)) etc.
df.head()

Unnamed: 0,session,year,country,country_name,speaker,position,text,length,tokens,len_tokens
0,25,1970,ALB,Albania,Mr. NAS,,33: May I first convey to our President the co...,51419,"[may, first, convey, president, congratulation...",4125
1,25,1970,ARG,Argentina,Mr. DE PABLO PARDO,,177.\t : It is a fortunate coincidence that pr...,29286,"[fortunate, coincidence, precisely, time, unit...",2327
2,25,1970,AUS,Australia,Mr. McMAHON,,100.\t It is a pleasure for me to extend to y...,31839,"[pleasure, extend, mr, president, warmest, con...",2545
3,25,1970,AUT,Austria,Mr. KIRCHSCHLAEGER,,155.\t May I begin by expressing to Ambassado...,26616,"[may, begin, expressing, ambassador, hambro, b...",2135
4,25,1970,BEL,Belgium,Mr. HARMEL,,"176. No doubt each of us, before coming up to ...",25911,"[doubt, us, coming, rostrum, wonders, usefulne...",2025


In [20]:
#E.G. for the 3 different models testing, will try on: Australia & year = 2015
df.query('country=="AUS" & year==2015')

Unnamed: 0,session,year,country,country_name,speaker,position,text,length,tokens,len_tokens,paragraphs
7322,70,2015,AUS,Australia,Ms. Julie Bishop,Minister for Foreign Affairs,We meet this day at an important time for the ...,12832,"[meet, day, important, time, united, nations, ...",1095,[We meet this day at an important time for the...


In [45]:
df_australia_2015 = df.query('country=="AUS" & year==2015')

In [3]:
## re-use earlier regex clean function from notebook #2

def regex_clean(text):
    # convert html escapes like &amp; to characters.
    text = html.unescape(text) 
    # tags like <tab>
    text = re.sub(r'<[^<>]*>', ' ', text)
    # markdown URLs like [Some text](https://....)
    text = re.sub(r'\[([^\[\]]*)\]\([^\(\)]*\)', r'\1', text)
    # text or code in brackets like [0]
    text = re.sub(r'\[[^\[\]]*\]', ' ', text)
    # standalone sequences of specials, matches &# but not #cool
    text = re.sub(r'(?:^|\s)[&#<>{}\[\]+|\\:-]{1,}(?:\s|$)', ' ', text)
    # standalone sequences of hyphens like --- or ==
    text = re.sub(r'(?:^|\s)[\-=\+]{2,}(?:\s|$)', ' ', text)
    # sequences of white spaces
    text = re.sub(r'\s+', ' ', text)
    
    return text.strip()

In [127]:
# apply the cleaning function
# .assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the query() result
df_australia_2015 = df_australia_2015.assign(text_clean=df_australia_2015['text'].apply(regex_clean))


* full text of the 2015 speech, if desired
    * https://www.foreignminister.gov.au/minister/julie-bishop/speech/national-statement-united-nations-general-assembly-70th-session
    

## TF-IDF Summarisation
* simplest approach is to summarise sentences based on aggregate TF-IDF values of words in that sentence


* apply the TF-IDF vectorisation, then aggregate values to sentence level
    * generate a score for each sentence as the sum of the TF-IDF values of the words in that sentence
    * a sentence with a high score therefore contains many important words compared to other sentences in the doc
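
To make the scoring step concrete, here is a minimal sketch on three made-up sentences (the `toy_sentences` list is an assumption for illustration, not text from the speech). It prints the per-sentence score, which the cells that follow use only for ranking.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical sentences, only for illustration
toy_sentences = [
    "Terrorism is a threat.",
    "The United Nations must promote human rights, sustainable development and global security.",
    "Peacekeeping is fundamental.",
]

# one row per sentence, one column per word
tfidf = TfidfVectorizer().fit_transform(toy_sentences)

# sentence score = sum of the TF-IDF weights in its row
scores = np.asarray(tfidf.sum(axis=1)).ravel()

# indices of the sentences, highest score first
ranking = np.argsort(scores)[::-1]
for idx in ranking:
    print(f"{scores[idx]:.3f}  {toy_sentences[idx]}")
```

Because the score is a plain sum, sentences with more distinct words tend to score higher, which is worth keeping in mind when reading the results.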

In [118]:
# quickly tokenize text for the tf-idf
sentences = tokenize.sent_tokenize(df_australia_2015.iloc[0]['text_clean'])

In [120]:
# initialise tfidf vectorizer object and apply to tokenized sentences
tfidfVectorizer = TfidfVectorizer()
words_tfidf = tfidfVectorizer.fit_transform(sentences)

In [84]:
# Parameter to specify number of summary sentences required
num_summary_sentence = 5

# Sort the sentence indices in descending order by the sum of TF-IDF values
# (np.asarray(...).ravel() flattens the (n, 1) result of the sparse sum into a 1-D array)
sent_sum = np.asarray(words_tfidf.sum(axis=1)).ravel()
important_sent = np.argsort(sent_sum)[::-1]

# Print the most important sentences in the order they appear in the article
for i in range(0, len(sentences)):
    if i in important_sent[:num_summary_sentence]:
        print (f"Sentence Number: {i}: ", sentences[i], '\n')

Sentence Number: 13:  under that article, united nations members pledge to take action individually and jointly to promote higher standards of living; solutions to international economic, social, health and related problems; and universal respect for, and observance of, human rights and fundamental freedoms. 

Sentence Number: 15:  the 2030 agenda for sustainable development (resolution 70/1), adopted unanimously last friday by the general assembly, is a manifestation of the 16/23 15-29595 29/09/2015 a/70/pv.18 australia pledge and a testament to the fundamental role of the organization. 

Sentence Number: 43:  human rights have been at the very centre of the united nations over the past 70 years, from the united nations charter in 1945 to the sustainable development goals in 2015. with the rise of terrorist groups like daesh, the continuing depredations of the north korean regime, and the persistence of forced labour and other contemporary forms of slavery, the need for the united nat

* Out of curiosity, print the 3 least important sentences
    * they are indeed pretty broad and not particularly useful

In [70]:
# Parameter to specify number of sentences to print
num_summary_sentence = 3

# Sort the sentence indices in ascending order by the sum of TF-IDF values
sent_sum = np.asarray(words_tfidf.sum(axis=1)).ravel()
important_sent = np.argsort(sent_sum)

# Print the three least important sentences in the order they appear in the article
for i in range(0, len(sentences)):
    if i in important_sent[:num_summary_sentence]:
        print (f"Sentence Number: {i}: ", sentences[i], '\n')

Sentence Number: 8:  terrorism today is a global threat. 

Sentence Number: 29:  there is an inescapable truth. 

Sentence Number: 80:  the role of peacekeeping is fundamental. 



## LSA Algorithm
* (Latent Semantic Analysis) 
* LSA assumes words that are close in meaning will occur in the same documents


* First is representing the entire doc in a sentence-term matrix
    * each row = sentence
    * each column = word 
    * the value of each cell in this matrix is the word frequency, often scaled as TF-IDF weights
    
    
* Objective is to reduce all words to a few topics by creating a modified representation of the sentence-term matrix
    * to create this: apply singular value decomposition (SVD), which expresses this matrix as the product of three decomposed matrices; keeping only the largest singular values leaves a representation with fewer rows/columns
    
    
* use the `sumy` package, which integrates stop words, a tokenizer and stemmer functionality from nltk
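
A minimal sketch of the decomposition step, assuming singular value decomposition (the factorisation classically used for LSA) and four made-up sentences. The names `toy_sentences`, `k` and `sentence_topics` are illustrative assumptions, and this is not a reproduction of what `sumy` does internally.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical sentences covering two rough topics
toy_sentences = [
    "Human rights are central to the United Nations.",
    "The United Nations protects human rights.",
    "Peacekeeping missions rebuild damaged societies.",
    "Peacekeeping supports damaged societies after conflict.",
]

# sentence-term matrix: rows = sentences, columns = words, cells = TF-IDF weights
X = TfidfVectorizer(stop_words="english").fit_transform(toy_sentences).toarray()

# decompose the matrix into three factors: X = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# keep only the k largest singular values: each sentence is now described by k topics
k = 2
sentence_topics = U[:, :k] * S[:k]
print(np.round(sentence_topics, 2))
```

Sentences about the same subject should end up with similar rows in `sentence_topics`, even when they do not use exactly the same words.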

In [4]:
# !pip install sumy
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
from sumy.summarizers.lsa import LsaSummarizer

In [5]:
LANGUAGE = "english"
stemmer = Stemmer(LANGUAGE)

In [128]:
# number of summary sentences
num_summary_sentence = 5

parser = PlaintextParser.from_string(df_australia_2015.iloc[0]['text_clean'], Tokenizer(LANGUAGE))

summarizer = LsaSummarizer(stemmer)
summarizer.stop_words = get_stop_words(LANGUAGE)

for sentence in summarizer(parser.document, num_summary_sentence):
    print (str(sentence), '\n')

Australia recognizes the dedicated, often courageous work of the thousands of United Nations personnel in the field who protect vulnerable citizens, deliver vital humanitarian assistance, rebuild damaged societies and support development. 

Under that Article, United Nations Members pledge to take action individually and jointly to promote higher standards of living; solutions to international economic, social, health and related problems; and universal respect for, and observance of, human rights and fundamental freedoms. 

Last week the Australian Government announced a new domestic policy, a $100 million women’s safety package, improving front-line support services and providing educational resources to help change community attitudes to violence and abuse. 

With the rise of terrorist groups like Daesh, the continuing depredations of the North Korean regime, and the persistence of forced labour and other contemporary forms of slavery, the need for the United Nations to prosecute a 

* There is some crossover with TF-IDF, e.g. the 2nd sentence here also appears in the TF-IDF summary
* While TF-IDF seems to have an inclination for facts, LSA seems to represent concepts better
* LSA also picked up on a significant section of the speech about improved policies etc. for women, which TF-IDF did not pick up - perhaps because of the commonness of the gender pronouns

## TextRank Algorithm
* Models a document like a graph, utilising sentences as nodes


* A function is required to compute the similarity of sentences, which builds the edges (node connectors)
    * This function weights the graph edges: the higher the similarity between two sentences, the more important the edge between them in the graph
    
    
    
* The TextRank algo determines the similarity between two sentences based on the content they share
    * This overlap is calculated as: number of common tokens between them
        * Then divided by the sum of the logarithms of the two sentence lengths (to avoid over-promoting long sentences)
        
        
        
* The result is a dense graph representing the document
    * Sentences are scored by running a PageRank-style ranking over this graph
    * The most significant sentences are selected and presented in the same order they appear in the document
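
The steps above can be sketched in plain NumPy. This is a simplified illustration, not the `sumy` implementation: tokens come from a whitespace split with no stemming or stop-word removal, and the function names, the `damping` value of 0.85 and the fixed number of iterations are assumptions made for the example.

```python
import math
import numpy as np

def overlap_similarity(tokens_a, tokens_b):
    """Number of shared tokens, normalised by the log-lengths of both sentences."""
    common = len(set(tokens_a) & set(tokens_b))
    norm = math.log(len(tokens_a)) + math.log(len(tokens_b))
    return common / norm if norm > 0 else 0.0

def rank_sentences(tokenised, damping=0.85, iterations=50):
    """Score sentences with a PageRank-style power iteration over the similarity graph."""
    n = len(tokenised)
    weights = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                weights[i, j] = overlap_similarity(tokenised[i], tokenised[j])
    # normalise each row so its edge weights sum to 1; isolated sentences get a uniform row
    row_sums = weights.sum(axis=1, keepdims=True)
    transition = np.divide(weights, row_sums,
                           out=np.full_like(weights, 1.0 / n), where=row_sums > 0)
    scores = np.full(n, 1.0 / n)
    for _ in range(iterations):
        scores = (1 - damping) / n + damping * transition.T @ scores
    return scores

# hypothetical, pre-lowercased sentences
toy = [
    "australia seeks a seat on the human rights council".split(),
    "human rights are central to the united nations".split(),
    "the council protects human rights".split(),
    "peacekeeping is fundamental".split(),
]
scores = rank_sentences(toy)
for sentence, score in zip(toy, scores):
    print(f"{score:.3f}  {' '.join(sentence)}")
```

The last toy sentence shares no tokens with the others, so it has no weighted edges and ends up with the lowest score.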

In [10]:
# note: also uses some packages imported above in the LSA section
from sumy.summarizers.text_rank import TextRankSummarizer

In [82]:
num_summary_sentence = 5 

# implement with sumy package: textrank implementation 
parser = PlaintextParser.from_string(df_australia_2015.iloc[0]['text_clean'], Tokenizer(LANGUAGE))
summarizer = TextRankSummarizer(stemmer)
summarizer.stop_words = get_stop_words(LANGUAGE)

# call the summarizer on the parsed document (takes the number of summary sentences (int) as parameter)
for sentence in summarizer(parser.document, num_summary_sentence):
    print (str(sentence), '\n')

Australia recognizes the dedicated, often courageous work of the thousands of United Nations personnel in the field who protect vulnerable citizens, deliver vital humanitarian assistance, rebuild damaged societies and support development. 

The United Nations women, peace and security agenda has changed our collective thinking on the role of women in conflict. 

Human rights have been at the very centre of the United Nations over the past 70 years, from the United Nations Charter in 1945 to the Sustainable Development Goals in 2015. 

Australia is standing for a seat on the United Nations Human Rights Council for the 2018-2020 term. 

Our term on the Human Rights Council would reflect Australia’s inclusive, diverse society and build on the Australian Government’s strong domestic human rights agenda. 



* from a human perspective, having read the speech, this seems to capture the essence of Australia's overall speech better
    * particularly the notion of Australia standing for a seat on the Human Rights Council and what it plans to do once there
* The speech covers many different topics, so it is hard to capture all of it, but the TextRank algo does a pretty decent job

## Deconstructing 1 Country over Time (1970-2015)
* Pick 1 country and extract salient points each year over duration
    * Will use Australia for this example
* extract top sentences for each year
    * export those sentences to a document for a human to read and gain a quick, broad overview of the content of Australia's UN speeches over the last 45+ years

In [6]:
# specify DF with just australia for all years in data
df_australia = df.query('country=="AUS"')

In [7]:
# apply text cleaning function (if text_clean doesn't already exist)
# .assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the query() result
df_australia = df_australia.assign(text_clean=df_australia['text'].apply(regex_clean))


In [8]:
#E.G. 
df_australia.head(3)

Unnamed: 0,session,year,country,country_name,speaker,position,text,length,tokens,len_tokens,text_clean
2,25,1970,AUS,Australia,Mr. McMAHON,,100.\t It is a pleasure for me to extend to y...,31839,"[pleasure, extend, mr, president, warmest, con...",2545,"100. It is a pleasure for me to extend to you,..."
73,26,1971,AUS,Australia,Mr. BOWEN,,"38.\t I should like, on behalf of Australia,,...",20857,"[like, behalf, australia, extend, congratulati...",1674,"38. I should like, on behalf of Australia,, to..."
190,27,1972,AUS,Australia,Mr. Bo wen,,"Mr. President, I should like, on behalf of the...",24700,"[mr, president, like, behalf, australian, dele...",2032,"Mr. President, I should like, on behalf of the..."


* loop through the `df_australia` and apply the text-rank algo to summarise top (5) sentences for each year
    * append those sentences to the `collections.defaultdict` object
    * convert dict to DF
    * (optional) export DF to text or word document to read

In [11]:
## collections.defaultdict method:
sum_sents_year_df = collections.defaultdict(list)

# number of summary sentences per year
num_summary_sentence = 5 

for i in df_australia[['year', 'text_clean']].index:
    # sentences = tokenize.sent_tokenize(df_australia['text_clean'][i]) # not needed for textrank actually
    
    # initialise textrank option + stop words etc.
    parser = PlaintextParser.from_string(df_australia['text_clean'][i], Tokenizer(LANGUAGE))
    summarizer = TextRankSummarizer(stemmer)
    summarizer.stop_words = get_stop_words(LANGUAGE)
    
    # append top sentences to collections.dict
    for sentence in summarizer(parser.document, num_summary_sentence):
        sum_sents_year_df['year'].append(df_australia['year'][i])
        sum_sents_year_df['sent'].append(sentence)  

In [12]:
# construct DF from dict 
summary_sents_aus = pd.DataFrame.from_dict(sum_sents_year_df)

In [19]:
# E.G. of DF 
summary_sents_aus

Unnamed: 0,year,sent
0,1970,After the traumatic experience of the Second W...
1,1970,In its twenty-five years the United Nations ha...
2,1970,"In our view, the United Nations has a signific..."
3,1970,The United Nations has a related responsibilit...
4,1970,I have reaffirmed Australia's support for an a...
...,...,...
225,2015,"Australia recognizes the dedicated, often cour..."
226,2015,"The United Nations women, peace and security a..."
227,2015,Human rights have been at the very centre of t...
228,2015,Australia is standing for a seat on the United...


In [186]:
# E.G. of a sentence from the last row 
summary_sents_aus.iloc[229].sent

<Sentence: Our term on the Human Rights Council would reflect Australia’s inclusive, diverse society and build on the Australian Government’s strong domestic human rights agenda.>

In [65]:
# can restructure the sentences to group by year; but doesn't matter if outputting to text or word file

# pd.DataFrame(summary_sents_aus.set_index('year').stack([0]), columns=['sentences'])
# australia_summaries_df = pd.DataFrame(summary_sents_aus.set_index('year').stack([0]), columns=['sentences'])

In [27]:
# write all years summary sentences to text file
summary_sents_aus.to_csv(r'summary_sents_aus_text.txt', header=True, index=False, sep='\t', mode='w+')
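
If a more readable text layout is wanted, the grouping by year mentioned earlier can be sketched as below. The `toy_summary` frame and the `format_by_year` helper are hypothetical stand-ins; the same function should work on `summary_sents_aus`, since it only relies on the `year` and `sent` columns.

```python
import pandas as pd

# hypothetical stand-in for summary_sents_aus (same column names)
toy_summary = pd.DataFrame({
    "year": [1970, 1970, 2015],
    "sent": ["First sentence.", "Second sentence.", "Third sentence."],
})

def format_by_year(summary_df):
    """Return one text block per year, with that year's sentences as bullet lines."""
    blocks = []
    for year, group in summary_df.groupby("year", sort=True):
        # str() also converts sumy Sentence objects to plain text
        lines = [f"{year}"] + [f"  - {str(s)}" for s in group["sent"]]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)

report = format_by_year(toy_summary)
print(report)
```

The resulting string can then be written to a text file with the usual `open(...).write(report)` pattern.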

## (Optional) Save Output to Microsoft Word Document

In [39]:
# !pip install python-docx

import docx
from docx.shared import Cm

In [59]:
# create a new, empty document using docx
# (docx.Document('some_file.docx') would instead open an existing file and fail if it is missing)
doc = docx.Document()

In [60]:
# add table in shape of the summary DF dimensions (+1 row for the header)
t = doc.add_table(summary_sents_aus.shape[0]+1, summary_sents_aus.shape[1])

In [62]:
# optional - refit cell width (python-docx exposes this setting as Table.autofit)
t.autofit = False

for cell in t.columns[0].cells:
    cell.width = Cm(12)

In [63]:
# add the header rows.
for j in range(summary_sents_aus.shape[-1]):
    t.cell(0,j).text = summary_sents_aus.columns[j]

# add the rest of the data frame
for i in range(summary_sents_aus.shape[0]):
    for j in range(summary_sents_aus.shape[-1]):
        t.cell(i+1,j).text = str(summary_sents_aus.values[i,j])

In [64]:
# save to word document
doc.save('summary_sentences_australia.docx')

## Possible Future Scope of Works:
* maybe try implementing the `rouge_scorer` (from the `rouge-score` package) to measure the actual quality of the summarisation techniques
    * would require a human-written summary to compare against or another way to label a proper comparison piece of text 
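
To show what such a comparison would measure, here is a simplified, hand-rolled sketch of unigram overlap in the spirit of ROUGE-1. It is not the `rouge_scorer` implementation: the `rouge1` function, the whitespace tokenisation and both example strings are assumptions for illustration only.

```python
from collections import Counter

def rouge1(reference, candidate):
    """Unigram overlap between a reference summary and a candidate summary."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # each shared word counts as often as it appears in both texts
    overlap = sum((ref_counts & cand_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# hypothetical human-written reference and model-generated candidate
reference = "australia is standing for a seat on the human rights council"
candidate = "australia seeks a seat on the council"
result = rouge1(reference, candidate)
print(result)
```

Recall shows how much of the reference the summary covers, while precision shows how much of the summary is supported by the reference.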