In [1]:
import bs4 as bs
import urllib.request
import re
import nltk
from nltk.tokenize import word_tokenize,sent_tokenize
from nltk.chunk import named_entity,ne_chunk

In [7]:
scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Italy')
article = scraped_data.read()

parsed_article = bs.BeautifulSoup(article,'lxml')

paragraphs = parsed_article.find_all('p')

article_text = ""

for p in paragraphs:
    article_text += p.text


In [8]:
article_text


'\n\nCoordinates: 43°N 12°E\ufeff / \ufeff43°N 12°E\ufeff / 43; 12\n–\xa0in Europe\xa0(light green &\xa0dark grey)–\xa0in the European Union\xa0(light green)\xa0 –\xa0 [Legend]Italy (Italian: Italia [iˈtaːlja] (listen)), officially the Italian Republic (Italian: Repubblica Italiana [reˈpubblika itaˈljaːna]),[12][13][14][15]  is a country consisting of a peninsula delimited by the Alps and surrounded by several islands. Italy is located in south-central Europe,[16][17] and is considered part of western Europe.[18][19] A unitary parliamentary republic with Rome as its capital, the country covers a total area of 301,340\xa0km2 (116,350\xa0sq\xa0mi) and shares land borders with France, Switzerland, Austria, Slovenia, and the enclaved microstates of Vatican City and San Marino. Italy has a territorial enclave in Switzerland (Campione) and a maritime exclave in Tunisian waters (Lampedusa). With around 60 million inhabitants, Italy is the third-most populous member state of the European Union

In [9]:
# Removing bracketed citation markers such as [12] and collapsing extra whitespace
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
article_text = re.sub(r'\s+', ' ', article_text)
#article_text

In [10]:
# Removing special characters and digits
formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text )
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)
#formatted_article_text
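
To see what the two cleaning steps do, here is a minimal sketch that applies the same three regular expressions to a short sample string instead of the downloaded article. The function name `clean_article` and the sample text are assumptions made for illustration; they are not part of the notebook.

```python
import re

def clean_article(raw_text):
    """Apply the notebook's cleaning regexes; return (article_text, formatted_text)."""
    # Step 1: drop numeric citation markers like [12], then collapse whitespace.
    article_text = re.sub(r'\[[0-9]*\]', ' ', raw_text)
    article_text = re.sub(r'\s+', ' ', article_text)
    # Step 2: keep letters only (used later for word counting), collapse whitespace again.
    formatted_text = re.sub('[^a-zA-Z]', ' ', article_text)
    formatted_text = re.sub(r'\s+', ' ', formatted_text)
    return article_text, formatted_text

# Hypothetical sample input, standing in for the scraped paragraphs.
sample = "Italy[12] is a country.\n\nIt covers 301,340 km2.[16]"
article_text, formatted_text = clean_article(sample)
print(repr(article_text))
print(repr(formatted_text))
```

Note that the first result keeps punctuation and digits, which sentence splitting relies on, while the second keeps only letters and spaces.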

In [11]:
#Converting Text To Sentences
sentence_list = nltk.sent_tokenize(article_text)

Find Weighted Frequency of Occurrence
To find the frequency of occurrence of each word, we use the formatted_article_text variable, since it doesn't contain punctuation, digits, or other special characters. Take a look at the following script:

In [12]:
stopwords = nltk.corpus.stopwords.words('english')

word_frequencies = {}
for word in nltk.word_tokenize(formatted_article_text):
    if word not in stopwords:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1
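
The counting loop above can also be sketched with `collections.Counter`. The version below additionally lowercases the text before counting, which may be worth considering because the sentence-scoring step looks words up in lowercase, so capitalised keys such as `Italy` would not be matched there. This is a variation under stated assumptions, not the notebook's cell: `TOY_STOPWORDS` is a hypothetical stand-in for `nltk.corpus.stopwords.words('english')`, and `str.split` replaces `nltk.word_tokenize` so the sketch runs without NLTK data.

```python
from collections import Counter

# Hypothetical, tiny stop-word set used only to keep this sketch self-contained.
TOY_STOPWORDS = {"is", "a", "the", "of", "in", "and"}

def count_words(text, stopwords=TOY_STOPWORDS):
    """Count non-stop-words, ignoring case."""
    tokens = text.lower().split()  # assumption: whitespace tokenization
    return Counter(t for t in tokens if t not in stopwords)

counts = count_words("Italy is a country in Europe and Italy is a republic")
print(counts.most_common())
```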

To find the weighted frequency, we can simply divide the number of occurrences of each word by the frequency of the most frequently occurring word, as shown below:

In [13]:
maximum_frequency = max(word_frequencies.values())

for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word]/maximum_frequency)
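
As a small worked example of this normalization, the sketch below divides each count in a toy dictionary by the largest count, so the most frequent word gets weight 1.0 and every other word a value between 0 and 1. The helper name `normalize` and the toy counts are assumptions for illustration only.

```python
def normalize(freqs):
    """Return a new dict with each count divided by the maximum count."""
    max_freq = max(freqs.values())
    return {word: count / max_freq for word, count in freqs.items()}

# Hypothetical counts, not taken from the article.
weighted = normalize({"italy": 4, "rome": 2, "alps": 1})
print(weighted)
```

Unlike the cell above, this version builds a new dictionary rather than overwriting the counts in place, which keeps the raw counts available if they are needed later.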

In [14]:
#Calculating Sentence Scores
#We have now calculated the weighted frequencies 
#for all the words. Now is the time to calculate the 
#scores for each sentence by adding weighted frequencies of 
#the words that occur in that particular sentence. 
#The following script calculates sentence scores:


sentence_scores = {}
for sent in sentence_list:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 45:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]

In [15]:
sentence_scores

{'Italy is located in south-central Europe, and is considered part of western Europe.': 0.19655172413793104,
 'A unitary parliamentary republic with Rome as its capital, the country covers a total area of 301,340 km2 (116,350 sq mi) and shares land borders with France, Switzerland, Austria, Slovenia, and the enclaved microstates of Vatican City and San Marino.': 0.617241379310345,
 'Italy has a territorial enclave in Switzerland (Campione) and a maritime exclave in Tunisian waters (Lampedusa).': 0.05172413793103448,
 'With around 60 million inhabitants, Italy is the third-most populous member state of the European Union.': 0.24137931034482757,
 'Due to its central geographic location in Southern Europe and the Mediterranean, Italy has historically been home to myriad peoples and cultures.': 0.17586206896551726,
 'An Italic tribe known as the Latins formed the Roman Kingdom in the 8th century BC, which eventually became a republic with a government of the Senate and the People.': 0.6206

In the script above, we first create an empty sentence_scores dictionary. The keys of this dictionary will be the sentences themselves and the values will be the corresponding scores of the sentences. Next, we loop through each sentence in the sentence_list and tokenize the sentence into words.

We then check if the word exists in the word_frequencies dictionary. This check is needed because we created sentence_list from the article_text object, which still contains stop words, numbers, and punctuation; the word frequencies, on the other hand, were calculated from the formatted_article_text object with stop words filtered out, so the dictionary has no entries for such tokens.

We do not want very long sentences in the summary, so we score only sentences with fewer than 45 words (you can tweak this threshold for your own use case). Next, we check whether the sentence already exists in the sentence_scores dictionary. If it doesn't, we add it as a key and assign it the weighted frequency of the first matching word in the sentence as its value. Otherwise, we simply add the weighted frequency of the word to the existing value.
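
The scoring logic described above can be sketched on toy data as follows. This is a simplified illustration, not the notebook's cell: the function name `score_sentences`, the `max_words` parameter, and the toy weights are assumptions, and `str.split` with punctuation stripping stands in for `nltk.word_tokenize`.

```python
def score_sentences(sentences, word_weights, max_words=45):
    """Sum the weights of known words for each sufficiently short sentence."""
    scores = {}
    for sent in sentences:
        # Skip long sentences, mirroring the length check in the notebook.
        if len(sent.split(' ')) >= max_words:
            continue
        for word in sent.lower().split():
            word = word.strip('.,;:()')  # assumption: crude punctuation handling
            if word in word_weights:
                scores[sent] = scores.get(sent, 0) + word_weights[word]
    return scores

# Hypothetical weights and sentences.
toy_weights = {"italy": 1.0, "rome": 0.5}
toy_sentences = [
    "Italy is a republic.",
    "Rome is the capital of Italy.",
    "The Alps are high.",
]
toy_scores = score_sentences(toy_sentences, toy_weights)
print(toy_scores)
```

A sentence with no known words, like the third one here, never enters the dictionary, which matches the behaviour of the notebook's loop.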

# Getting the Summary
Now we have the sentence_scores dictionary that contains sentences with their corresponding scores. To summarize the article, we can take the top N sentences with the highest scores. The following script retrieves the top 15 sentences and prints them on the screen.

In [16]:
import heapq
summary_sentences = heapq.nlargest(15, sentence_scores, key=sentence_scores.get)

summary = ' '.join(summary_sentences)
print(summary)

Notable Renaissance philosophers include: Giordano Bruno, one of the major scientific figures of the western world; Marsilio Ficino, one of the most influential humanist philosophers of the period; and Niccolò Machiavelli, one of the main founders of modern political science. In the last decade, Italy has become one of the world's leading producers of renewable energy, ranking as the world's fourth largest holder of installed solar energy capacity and the sixth largest holder of wind power capacity in 2010. The Roman Empire was among the most powerful economic, cultural, political and military forces in the world of its time, and it was one of the largest empires in world history. Eni, with operations in 79 countries, is one of the seven "Supermajor" oil companies in the world, and one of the world's largest industrial companies. In the last decade, Italy has become one of the world's largest producers of renewable energy, ranking as the second largest producer in the European Union an
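
Because `heapq.nlargest` returns sentences ordered by score, the summary above does not follow the order of the article. If a summary in reading order is preferred, one possible sketch is shown below. The helper `top_sentences` and its `keep_document_order` flag are hypothetical, and the approach assumes the scores dictionary was filled in document order (dictionaries preserve insertion order in Python 3.7+).

```python
import heapq

def top_sentences(scores, n, keep_document_order=False):
    """Pick the n highest-scoring sentences, optionally in insertion order."""
    best = heapq.nlargest(n, scores, key=scores.get)
    if keep_document_order:
        # Position of each sentence in the (insertion-ordered) scores dict.
        position = {sent: i for i, sent in enumerate(scores)}
        best.sort(key=position.get)
    return best

# Hypothetical scores for three toy sentences.
toy = {"A.": 0.5, "B.": 0.2, "C.": 0.9}
by_score = top_sentences(toy, 2)
by_position = top_sentences(toy, 2, keep_document_order=True)
print(' '.join(by_score))
print(' '.join(by_position))
```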