In [1]:
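# Extractive text summarization of a Wikipedia page
import bs4 as bs             # HTML parsing (BeautifulSoup)
import urllib.request        # fetching the page
import re                    # regex-based cleaning
import heapq                 # selecting the top-scoring sentences
import nltk                  # tokenization and stopword list
nltk.download('punkt')       # tokenizer models
nltk.download('stopwords')   # stopword corpora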

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

Fetching the Article from Wikipedia and Displaying It

In [2]:
# Fetch the page from the URL and read the raw HTML
scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Big_Bang') 
article = scraped_data.read() 
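
The fetch above relies on urllib's default settings. As a minimal sketch, assuming the server may expect an identifying User-Agent header and that a timeout is wanted, the request could be built explicitly. The helper names `build_request` and `fetch_html`, the user-agent string, and the 10-second timeout are hypothetical choices for illustration, not part of the original notebook.

```python
import urllib.request

def build_request(url, user_agent="summarizer-demo/0.1"):
    """Return a Request that identifies the client with a User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch_html(url, timeout=10):
    """Download the page and return its raw bytes; timeout is in seconds."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()

# Building the request performs no network access; fetch_html(url) would.
request = build_request("https://en.wikipedia.org/wiki/Big_Bang")
print(request.full_url)
```

The returned bytes can be passed to `bs.BeautifulSoup` exactly as `article` is in the next cell.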

In [3]:
# Parse the HTML with the lxml parser
parsed_article = bs.BeautifulSoup(article, 'lxml')

In [4]:
# Collect the text inside all paragraph <p> tags
paragraphs = parsed_article.find_all('p')
article_text = "".join(p.text for p in paragraphs)
print(article_text)



The Big Bang theory is the prevailing cosmological model of the observable universe from the earliest known periods through its subsequent large-scale evolution.[1][2][3] The model describes how the universe expanded from an initial state of high density and temperature,[4] and offers a comprehensive explanation for a broad range of observed phenomena, including the abundance of light elements, the cosmic microwave background (CMB) radiation, and large-scale structure.
Crucially, the theory is compatible with Hubble–Lemaître law — the observation that the farther away galaxies are, the faster they are moving away from Earth. Extrapolating this cosmic expansion backwards in time using the known laws of physics, the theory describes an increasingly concentrated cosmos preceded by a singularity in which space and time lose meaning (typically named "the Big Bang singularity").[5] Detailed measurements of the expansion rate of the universe place the Big Bang singularity at around 13.8 bil

Preprocessing and Displaying the Output

In [5]:
# Remove bracketed citation markers such as [1] and collapse runs of whitespace
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text) 
article_text = re.sub(r'\s+', ' ', article_text) 
print(article_text) 

 The Big Bang theory is the prevailing cosmological model of the observable universe from the earliest known periods through its subsequent large-scale evolution. The model describes how the universe expanded from an initial state of high density and temperature, and offers a comprehensive explanation for a broad range of observed phenomena, including the abundance of light elements, the cosmic microwave background (CMB) radiation, and large-scale structure. Crucially, the theory is compatible with Hubble–Lemaître law — the observation that the farther away galaxies are, the faster they are moving away from Earth. Extrapolating this cosmic expansion backwards in time using the known laws of physics, the theory describes an increasingly concentrated cosmos preceded by a singularity in which space and time lose meaning (typically named "the Big Bang singularity"). Detailed measurements of the expansion rate of the universe place the Big Bang singularity at around 13.8 billion years ago, 

In [6]:
# Replace everything except ASCII letters (digits, punctuation, accented letters) with a space
formatted_article_text = re.sub(r'[^a-zA-Z]', ' ', article_text)
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text) 
print(formatted_article_text)

 The Big Bang theory is the prevailing cosmological model of the observable universe from the earliest known periods through its subsequent large scale evolution The model describes how the universe expanded from an initial state of high density and temperature and offers a comprehensive explanation for a broad range of observed phenomena including the abundance of light elements the cosmic microwave background CMB radiation and large scale structure Crucially the theory is compatible with Hubble Lema tre law the observation that the farther away galaxies are the faster they are moving away from Earth Extrapolating this cosmic expansion backwards in time using the known laws of physics the theory describes an increasingly concentrated cosmos preceded by a singularity in which space and time lose meaning typically named the Big Bang singularity Detailed measurements of the expansion rate of the universe place the Big Bang singularity at around billion years ago which is thus considered 
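
The two cleaning passes above can be tried on a short string. This is a minimal sketch that applies the same regular expressions as cells 5 and 6; the helper name `clean_text` and the sample sentence are made up for illustration.

```python
import re

def clean_text(raw):
    """Return (sentence_text, word_text) after the two cleaning passes."""
    # Pass 1: drop citation markers like [1], then collapse whitespace.
    text = re.sub(r'\[[0-9]*\]', ' ', raw)
    text = re.sub(r'\s+', ' ', text)
    # Pass 2: keep ASCII letters only, then collapse whitespace again.
    words_only = re.sub(r'[^a-zA-Z]', ' ', text)
    words_only = re.sub(r'\s+', ' ', words_only)
    return text, words_only

sample = "The universe[1] is about 13.8 billion[2][3] years old."
sentence_text, word_text = clean_text(sample)
print(sentence_text)
print(word_text)
```

The first result keeps punctuation and digits, which sentence tokenization needs; the second keeps only letters, which is what the word-frequency count works on.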

Converting Text into Sentences 

In [7]:
# Split the cleaned text (punctuation still intact) into a list of sentences
sentence_list = nltk.sent_tokenize(article_text) 
print(sentence_list) 

[' The Big Bang theory is the prevailing cosmological model of the observable universe from the earliest known periods through its subsequent large-scale evolution.', 'The model describes how the universe expanded from an initial state of high density and temperature, and offers a comprehensive explanation for a broad range of observed phenomena, including the abundance of light elements, the cosmic microwave background (CMB) radiation, and large-scale structure.', 'Crucially, the theory is compatible with Hubble–Lemaître law — the observation that the farther away galaxies are, the faster they are moving away from Earth.', 'Extrapolating this cosmic expansion backwards in time using the known laws of physics, the theory describes an increasingly concentrated cosmos preceded by a singularity in which space and time lose meaning (typically named "the Big Bang singularity").', 'Detailed measurements of the expansion rate of the universe place the Big Bang singularity at around 13.8 billi

Finding the Weighted Frequency of Occurrence and Displaying the Output

In [8]:
# NLTK's English stopword list is all lowercase
stopwords = nltk.corpus.stopwords.words('english')
# Count each non-stopword token; tokens keep their original case,
# so capitalized forms such as 'The' are not filtered out
word_frequencies = {}
for word in nltk.word_tokenize(formatted_article_text):
    if word not in stopwords:
        word_frequencies[word] = word_frequencies.get(word, 0) + 1
# Divide every count by the highest count so that weights fall in (0, 1]
maximum_frequency = max(word_frequencies.values())
for word in word_frequencies:
    word_frequencies[word] = word_frequencies[word] / maximum_frequency
print(word_frequencies)

{'The': 0.3464052287581699, 'Big': 0.45751633986928103, 'Bang': 0.43137254901960786, 'theory': 0.2679738562091503, 'prevailing': 0.006535947712418301, 'cosmological': 0.1437908496732026, 'model': 0.16993464052287582, 'observable': 0.05228758169934641, 'universe': 1.0, 'earliest': 0.032679738562091505, 'known': 0.09803921568627451, 'periods': 0.006535947712418301, 'subsequent': 0.006535947712418301, 'large': 0.10457516339869281, 'scale': 0.1111111111111111, 'evolution': 0.0392156862745098, 'describes': 0.0457516339869281, 'expanded': 0.006535947712418301, 'initial': 0.0392156862745098, 'state': 0.12418300653594772, 'high': 0.05228758169934641, 'density': 0.24836601307189543, 'temperature': 0.11764705882352941, 'offers': 0.013071895424836602, 'comprehensive': 0.013071895424836602, 'explanation': 0.0392156862745098, 'broad': 0.013071895424836602, 'range': 0.0196078431372549, 'observed': 0.08496732026143791, 'phenomena': 0.026143790849673203, 'including': 0.0392156862745098, 'abundance': 0
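
The output above contains the key 'The' with a weight of about 0.35, because the tokens are compared to the lowercase stopword list without being lowercased first. The scoring step further down looks up lowercased tokens, so capitalized keys such as 'Big' or 'Bang' cannot match there either. A possible case-insensitive variant is sketched below. It is not the notebook's method: it assumes a tiny hand-written stopword set and plain whitespace splitting in place of NLTK, and the name `weighted_frequencies` is hypothetical.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword set; the notebook uses NLTK's English list.
STOPWORDS = {"the", "is", "of", "a", "and"}

def weighted_frequencies(text, stopwords=STOPWORDS):
    """Lowercase the tokens, drop stopwords, and scale counts by the largest count."""
    words = [w for w in text.lower().split() if w not in stopwords]
    counts = Counter(words)
    if not counts:
        return {}
    top = max(counts.values())
    return {word: count / top for word, count in counts.items()}

freqs = weighted_frequencies("The universe is expanding and the universe is cooling")
print(freqs)
```

With lowercase keys, the dictionary lines up with the lowercased tokens used when scoring sentences.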

Calculating the Sentence Scores and Displaying the Output

In [9]:
# Score each sentence of fewer than 30 words by summing the weights of its words
sentence_scores = {}
for sent in sentence_list:
    if len(sent.split(' ')) >= 30:
        continue  # skip long sentences
    # Tokens are lowercased here, so only lowercase keys in word_frequencies match
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies:
            sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[word]
print(sentence_scores)

{' The Big Bang theory is the prevailing cosmological model of the observable universe from the earliest known periods through its subsequent large-scale evolution.': 1.862745098039216, 'Crucially, the theory is compatible with Hubble–Lemaître law — the observation that the farther away galaxies are, the faster they are moving away from Earth.': 0.6209150326797386, 'Detailed measurements of the expansion rate of the universe place the Big Bang singularity at around 13.8 billion years ago, which is thus considered the age of the universe.': 2.7450980392156863, 'After its initial expansion, an event that is by itself often called "the Big Bang", the universe cooled sufficiently to allow the formation of subatomic particles, and later atoms.': 1.6862745098039216, 'Besides these primordial building materials, astronomers observe the gravitational effects of an unknown dark matter surrounding galaxies.': 0.9803921568627452, "Measurements of the redshifts of supernovae indicate that the expa
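
The scoring rule can be checked by hand on toy data. The sketch below assumes whitespace splitting with simple punctuation stripping instead of `nltk.word_tokenize`, and a word-count limit based on `split()` rather than `split(' ')`. The function name `score_sentences` and the toy inputs are illustrative only.

```python
def score_sentences(sentences, word_frequencies, max_words=30):
    """Sum the weights of known words in each sentence shorter than max_words."""
    scores = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        if len(tokens) >= max_words:
            continue  # long sentences are left out of the summary candidates
        for token in tokens:
            word = token.strip('.,;:!?"()')  # crude stand-in for a real tokenizer
            if word in word_frequencies:
                scores[sentence] = scores.get(sentence, 0.0) + word_frequencies[word]
    return scores

toy_frequencies = {"universe": 1.0, "expanding": 0.5, "cooling": 0.5}
toy_sentences = ["The universe is expanding.", "It is cooling.", "Nothing matches here."]
scores = score_sentences(toy_sentences, toy_frequencies)
print(scores)
```

A sentence with no weighted words never enters the dictionary, which mirrors the notebook: only sentences with at least one known word receive a score.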

Displaying the Summary 


In [10]:
# Take the 7 highest-scoring sentences (returned in descending score order)
summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get) 
summary = ' '.join(summary_sentences) 
print(summary)

According to theory, the energy density in matter decreases with the expansion of the universe, but the dark energy density remains constant (or nearly so) as the universe expands. When the size of the universe at Big Bang is described, it refers to the size of the observable universe, and not the entire universe. If the mass density of the universe were greater than the critical density, then the universe would reach a maximum size and then begin to collapse. The four possible types of matter are known as cold dark matter, warm dark matter, hot dark matter, and baryonic matter. Detailed measurements of the expansion rate of the universe place the Big Bang singularity at around 13.8 billion years ago, which is thus considered the age of the universe. Measurements of the redshift–magnitude relation for type Ia supernovae indicate that the expansion of the universe has been accelerating since the universe was about half its present age. Extrapolation of the expansion of the universe back
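
`heapq.nlargest` returns the sentences ordered by score, so the summary above does not follow the order of the article. If reading order matters, one option is to select the top sentences first and then emit them in their original sequence. This is a sketch under that assumption; `summarize_in_order` and the toy data are hypothetical.

```python
import heapq

def summarize_in_order(sentence_list, sentence_scores, n=7):
    """Pick the n highest-scoring sentences, then join them in document order."""
    top = set(heapq.nlargest(n, sentence_scores, key=sentence_scores.get))
    return ' '.join(s for s in sentence_list if s in top)

toy_sentences = ["First point.", "Second point.", "Third point."]
toy_scores = {"First point.": 0.2, "Second point.": 0.9, "Third point.": 0.5}
ordered_summary = summarize_in_order(toy_sentences, toy_scores, n=2)
print(ordered_summary)
```

In the notebook this would be called with `sentence_list` and `sentence_scores` from the earlier cells.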