# TEXT SUMMARIZATION

Here we will fetch data from online sources such as wikipedia. Then we will parse those articles and convert them into strings. Then we will summarize each article (which might be 3-4 pages long) into 4-5 sentences.

It can be done by 
1. Simple NLP approach and 
2. A more complex deep NLP approach. 

We will be looking into Simple NLP approach

**Action plan**<br/>
Here we have to atfirst tokenize an article into sentences. Then preprocess. Then prepare histogram (that will contain counts for each word in sentences of the article). Then we have to make weighted histogram i.e. every word in a sentence will have a weight. Then based on the weight of the words, each sentence will have a score(depending on the words present in the sentence). Then based on the score of the sentences in the article, we can select top 2-3 sentences as the summary of entire article.

In [28]:
# Text Summarization using NLP
# Install BeautifulSoup 4 - pip install beautifulsoup4
# Install lxml - pip install lxml

# Importing the libraries
import bs4 as bs
import urllib.request
import re
import nltk
nltk.download('stopwords')

# Gettings the data source that needs to be summarized
source = urllib.request.urlopen('https://en.wikipedia.org/wiki/Global_warming').read()
print(source[:1000])

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sayantan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Global warming - Wikipedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Global_warming","wgTitle":"Global warming","wgCurRevisionId":870505915,"wgRevisionId":870505915,"wgArticleId":5042951,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: Uses authors parameter","CS1 maint: Explicit use of et al.","Webarchive template wayback links","CS1: Julian\xe2\x80\x93Gregorian uncertainty","Pages containi

To fetch the articles from wikipedia, we need to scrape wikipedia. And for that we need `beautifulsoup4` and `lxml`.

`lxml` is a type of parser that beautiful soup uses to parse a HTML document.

In [29]:
# Parsing the data/ creating BeautifulSoup object
soup = bs.BeautifulSoup(source,'lxml') 
#print(soup) 

Soup is more visible than what we got from original HTML. But still a lot of cleaning needs to be done to just get the string part.

In [30]:
# Fetching the data
text = ""
for paragraph in soup.find_all('p'):
    text += paragraph.text # to extract only the text part of the paragraph
text[:1000]

"\n\n\nGlobal warming is a long-term rise in the average temperature of the Earth's climate system, an aspect of climate change shown by temperature measurements and by multiple effects of the warming.[2][3] The term commonly refers to the mainly human-caused observed warming since pre-industrial times and its projected continuation,[4] though there were also much earlier periods of global warming.[5] In the modern context the terms are commonly used interchangeably,[6] but global warming more specifically relates to worldwide surface temperature increases; while climate change is any regional or global statistically identifiable persistent change in the state of climate which lasts for decades or longer, including warming or cooling.[7][8] Many of the observed warming changes since the 1950s are unprecedented in the instrumental temperature record, and in historical and paleoclimate proxy records of climate change over thousands to millions of years.[2]\nIn 2013, the Intergovernmental

`p` is the paragraph tag in the HTML. In wikipedia all the text is within `paragraph` tag and thats why we will be using `p` tag to select the text part of wikipedia article and adding it to `text` variable. [Some sites also put the articles within `div` tag or `span` tag. Thus we might have to change this based on which website we are fetching the data]

After selecting only the text part based on `p` tag we still see some unrequired part in the string. Thus, we need to preprocess these part.

In [31]:
# Preprocessing the data_1
text = re.sub(r'\[[0-9]*\]',' ',text) # to replace the reference with a single space
text = re.sub(r'\s+',' ',text) # remove extra spaces
text[:1000]

" Global warming is a long-term rise in the average temperature of the Earth's climate system, an aspect of climate change shown by temperature measurements and by multiple effects of the warming. The term commonly refers to the mainly human-caused observed warming since pre-industrial times and its projected continuation, though there were also much earlier periods of global warming. In the modern context the terms are commonly used interchangeably, but global warming more specifically relates to worldwide surface temperature increases; while climate change is any regional or global statistically identifiable persistent change in the state of climate which lasts for decades or longer, including warming or cooling. Many of the observed warming changes since the 1950s are unprecedented in the instrumental temperature record, and in historical and paleoclimate proxy records of climate change over thousands to millions of years. In 2013, the Intergovernmental Panel on Climate Change (IPCC

In [32]:
# Preprocessing the data_2
clean_text = text.lower()
clean_text = re.sub(r'\W',' ',clean_text) # remove non-word characters
clean_text = re.sub(r'\d',' ',clean_text) # remove digits
clean_text = re.sub(r'\s+',' ',clean_text) # remove extra spaces
clean_text[:1000]

' global warming is a long term rise in the average temperature of the earth s climate system an aspect of climate change shown by temperature measurements and by multiple effects of the warming the term commonly refers to the mainly human caused observed warming since pre industrial times and its projected continuation though there were also much earlier periods of global warming in the modern context the terms are commonly used interchangeably but global warming more specifically relates to worldwide surface temperature increases while climate change is any regional or global statistically identifiable persistent change in the state of climate which lasts for decades or longer including warming or cooling many of the observed warming changes since the s are unprecedented in the instrumental temperature record and in historical and paleoclimate proxy records of climate change over thousands to millions of years in the intergovernmental panel on climate change ipcc fifth assessment rep

We built `clean_text` and `text` both. We will built the word histogram based on `clean_text` and not `text` because `text` contains a lot of unnecessary stuff including punctuations digits etc. 

But once summarized output we might need some digits like 30 degree centigrade as it is a global warming article. Thus, digits can be important in the summary sentence but not for word histogram. So we are going to do histogram on the `clean_text` and summary on the `text`.

In [39]:
# Tokenize sentences because we want to find most meaningful sentences
sentences = nltk.sent_tokenize(text) # clean_text do NOT contain period so sent_tokenize won't word
print(sentences[:2])
# Stopword list
stop_words = nltk.corpus.stopwords.words('english') 
# need to remove stopwords from word histograms to have only impactful words

[" Global warming is a long-term rise in the average temperature of the Earth's climate system, an aspect of climate change shown by temperature measurements and by multiple effects of the warming.", 'The term commonly refers to the mainly human-caused observed warming since pre-industrial times and its projected continuation, though there were also much earlier periods of global warming.']


In [41]:
# Word counts: [word histogram]
word2count = {} # will contain word histogram

for word in nltk.word_tokenize(clean_text): # we need clean_text as it only contains words
     if word not in stop_words: # ignore stopwords present in the article
        if word not in word2count.keys(): # count the impactful words
            word2count[word] = 1
        else:
            word2count[word] += 1   

print(list(word2count.items())[:5])

[('global', 75), ('warming', 83), ('long', 8), ('term', 11), ('rise', 8)]


In [45]:
# Converting counts to weights: [weighted histogram]
for key in word2count.keys():
    word2count[key] = word2count[key]/max(word2count.values())

print(list(word2count.items())[:5])

[('global', 0.8064516129032258), ('warming', 0.8924731182795699), ('long', 0.08602150537634409), ('term', 0.11827956989247312), ('rise', 0.08602150537634409)]


`word2count[key]` is the count value of a particular word or key. We have to divide the count value with maximum count value of the entire `word2count` dictionary, to build weighted histogram.

In [47]:
# Produce sentence scores    
sent2score = {}
for sentence in sentences:
    for word in nltk.word_tokenize(sentence.lower()):
        if word in word2count.keys():
            if len(sentence.split(' ')) < 25: #
                if sentence not in sent2score.keys():
                    sent2score[sentence] = word2count[word]
                else:
                    sent2score[sentence] += word2count[word]
    
print(list(sent2score.items())[:2])

[('The largest human influence has been the emission of greenhouse gases such as carbon dioxide, methane, and nitrous oxide.', 3.5196808510638293), ('In view of the dominant role of human activity in causing it, the phenomenon is sometimes called "anthropogenic global warming" or "anthropogenic climate change".', 5.169669412033858)]


**Code explanation**
 * `for word in nltk.word_tokenize(sentence.lower()):` - sentence was made from `text` so we need to lower it
 
 * if a word from the sentence is present in `word2count` dictionary, then we check whether the sentence (that contains this word) is present in the `sent2score` dictionary or not? If not, we place the sentence as a key and the weighted value of the word as its value. If it is already present, we will keep on adding the weighted value of the word.
 
 * `if len(sentence.split(' ')) < 25:`- Some sentences might NOT be important, but they might be very long. Longer sentence can get a higher value. So we are excluding sentences that are longer than 25 words, to avoid this from happening. 

In [55]:
# Gettings best 5 lines    
import heapq
best_sentences = heapq.nlargest(5, sent2score, key=sent2score.get)

print('-'*117)
for sentence in best_sentences:
    print(sentence)
print('-'*117)

---------------------------------------------------------------------------------------------------------------------
In 1986 and November 1987, NASA climate scientist James Hansen gave testimony to Congress on global warming.
In the late 19th century, scientists first argued that human emissions of greenhouse gases could change the climate.
The phrase began to come into common use, and in 1976 Mikhail Budyko's statement that "a global warming up has started" was widely reported.
Global oil companies have begun to acknowledge climate change exists and is caused by human activities and the burning of fossil fuels.
Adaptation is especially important in developing countries since those countries are predicted to bear the brunt of the effects of global warming.
---------------------------------------------------------------------------------------------------------------------
