## Simple Natural Language Based Approach

Steps: <br>
1. Split the text into sentences
2. Preprocess the text
3. Build a word-frequency histogram
4. Convert the histogram into a weighted histogram (divide each count by the maximum count)
5. Score each sentence by summing the weights of its words
6. Sort the sentences by score in descending order
7. Pick the N highest-scoring sentences
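
The steps above can be sketched end to end in plain Python. This is a minimal illustration, not the notebook's implementation: the `summarize` function, the regex-based sentence splitter, and the tiny `STOPWORDS` set are assumptions made so the example runs without NLTK or network access.

```python
import heapq
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (the notebook uses NLTK's English list)
STOPWORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in", "it", "on"}


def summarize(text, n=2):
    """Return the n highest-scoring sentences of text, joined by spaces."""
    # Step 1: naive sentence split on ., ! or ? followed by whitespace (assumption)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    # Steps 2-3: lowercase, keep alphabetic tokens, count non-stop-words
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    counts = Counter(words)
    if not counts:
        return ""

    # Step 4: weighted histogram, every count divided by the maximum count
    max_count = max(counts.values())
    weights = {word: count / max_count for word, count in counts.items()}

    # Step 5: a sentence's score is the sum of the weights of its words
    scores = {
        s: sum(weights.get(w, 0.0) for w in re.findall(r"[a-z]+", s.lower()))
        for s in sentences
    }

    # Steps 6-7: pick the n sentences with the largest scores
    return " ".join(heapq.nlargest(n, scores, key=scores.get))


sample = (
    "Climate change affects the climate system. "
    "Cats are nice. "
    "Climate change also affects oceans."
)
print(summarize(sample, n=2))
```

The off-topic sentence about cats scores lowest because its words are rare in the sample, so it is left out of the two-sentence summary.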

In [30]:
import bs4 as bs
import urllib.request 
import re
import nltk
import heapq
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Fetch the data from Wikipedia

In [2]:
source = urllib.request.urlopen('https://en.wikipedia.org/wiki/Climate_change').read()

In [4]:
soup = bs.BeautifulSoup(source, 'lxml')

In [6]:
text = ''
for p in soup.find_all('p'):
    text+= p.text

### Preprocessing the text

In [9]:
text = re.sub(r'\[[0-9]*\]',' ', text) # remove reference links e.g. [1][2]
text = re.sub(r'\s+', ' ' , text)

In [12]:
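clean_text = text.lower()
clean_text = re.sub(r'\W', ' ', clean_text)   # replace punctuation and other non-word characters
clean_text = re.sub(r'\d', ' ', clean_text)   # replace digits
clean_text = re.sub(r'\s+', ' ', clean_text)  # collapse repeated whitespace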

`clean_text` is used to build the word histogram; that histogram is then used to score the sentences in `text`, which keeps its punctuation so it can still be split into sentences.
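
To see what each substitution in the two cells above does, here is a small check on a made-up string. The `sample` text is an assumption chosen for illustration; the regular expressions are the ones used in the notebook.

```python
import re

# Made-up input that mimics Wikipedia text with reference markers
sample = "Warming reached 1.1 degrees[1][23].  It's ongoing!"

# Remove reference markers such as [1] and collapse runs of whitespace
text = re.sub(r'\[[0-9]*\]', ' ', sample)
text = re.sub(r'\s+', ' ', text)

# Lowercase, then replace non-word characters and digits with spaces
clean_text = text.lower()
clean_text = re.sub(r'\W', ' ', clean_text)
clean_text = re.sub(r'\d', ' ', clean_text)
clean_text = re.sub(r'\s+', ' ', clean_text)

print(repr(text))
print(repr(clean_text))
```

Note that replacing the apostrophe splits "it's" into "it" and "s", and that a trailing space can remain in `clean_text`; neither matters once the string is tokenized into words.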

### Prepare for Histogram

In [15]:
sentences = nltk.sent_tokenize(text)
len(sentences)

475

In [18]:
# A set makes the membership test in the histogram loop fast
stopwords = set(nltk.corpus.stopwords.words('english'))

In [20]:
# Basic Histogram
word2count = {}

for word in nltk.word_tokenize(clean_text):
    if word not in stopwords:
        # Count each occurrence of a non-stop-word
        word2count[word] = word2count.get(word, 0) + 1
#print(word2count)

In [22]:
# Weighted Histogram
max_value = max(word2count.values())

for key in word2count.keys():
    word2count[key] = word2count[key]/max_value
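
As a worked example of the normalization, consider a toy token list (made up for illustration). `collections.Counter` is used here as a compact alternative to the manual dictionary loop; it is not what the notebook uses.

```python
from collections import Counter

# Toy tokens standing in for the stop-word-filtered words of clean_text
tokens = ["climate", "change", "climate", "warming", "climate", "change"]

word2count = Counter(tokens)            # raw counts: climate=3, change=2, warming=1
max_value = max(word2count.values())    # 3

# Divide each count by the maximum so the most frequent word has weight 1.0
weights = {word: count / max_value for word, count in word2count.items()}
print(weights)
```

Here "climate" gets weight 3/3 = 1.0, "change" gets 2/3, and "warming" gets 1/3, so every weight falls in the range (0, 1].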

### Sentence Scores

In [37]:
sent2score = {}

for sentence in sentences:
    # Skip sentences of 25 or more words so the summary stays concise
    if len(sentence.split(' ')) >= 25:
        continue
    for word in nltk.word_tokenize(sentence.lower()):
        if word in word2count:
            # Add the word's weight to the running score of its sentence
            sent2score[sentence] = sent2score.get(sentence, 0) + word2count[word]

print(len(sent2score))

374


### Find Summary (select the N highest-scoring sentences)

In [38]:
best_sentences = heapq.nlargest(5, sent2score, key = sent2score.get)
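
`heapq.nlargest` returns the selected sentences ordered by score, not by their position in the article. If a summary that follows the article's order is preferred, the selection can be re-sorted by original position. The sketch below uses made-up sentences and scores; `ordered_summary` is a hypothetical helper, not part of the notebook.

```python
import heapq

# Made-up sentences (in document order) and scores for illustration
sentences = ["First point.", "Minor aside.", "Second point.", "Key finding."]
sent2score = {"First point.": 1.5, "Minor aside.": 0.2,
              "Second point.": 1.1, "Key finding.": 2.4}


def ordered_summary(sentences, sent2score, n):
    """Pick the n highest-scoring sentences, then restore document order."""
    best = set(heapq.nlargest(n, sent2score, key=sent2score.get))
    return [s for s in sentences if s in best]


by_score = heapq.nlargest(2, sent2score, key=sent2score.get)
in_order = ordered_summary(sentences, sent2score, 2)
print(by_score)
print(in_order)
```

Both lists contain the same two sentences; only the ordering differs.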

In [36]:
# Summary from an earlier run with the sentence-length cutoff set to < 30 words
print(' '.join(best_sentences))

Scientifically, global warming refers only to increased surface warming, while climate change describes both global warming and its effects on Earth's climate system, such as precipitation changes.  In common usage, climate change describes global warming—the ongoing increase in global average temperature—and its effects on Earth's climate system. Adapting to climate change through efforts like flood control measures or drought-resistant crops partially reduces climate change risks, although some limits to adaptation have already been reached. People who hold unwarranted doubt about climate change are called climate change "skeptics", although "contrarians" or "deniers" are more appropriate terms. Climate change can also be used more broadly to include changes to the climate that have happened throughout Earth's history.


In [39]:
# Summary with the sentence-length cutoff of < 25 words used above
print(' '.join(best_sentences))

 In common usage, climate change describes global warming—the ongoing increase in global average temperature—and its effects on Earth's climate system. People who hold unwarranted doubt about climate change are called climate change "skeptics", although "contrarians" or "deniers" are more appropriate terms. Climate change can also be used more broadly to include changes to the climate that have happened throughout Earth's history. The long-term effects of climate change on oceans include further ice melt, ocean warming, sea level rise, ocean acidification and ocean deoxygenation. Climate change in a broader sense also includes previous long-term changes to Earth's climate.
