In [7]:
import bs4 as bs
import urllib.request
import re
import nltk
from nltk.tokenize import word_tokenize,sent_tokenize
from nltk.chunk import named_entity,ne_chunk

In [55]:
scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Natural_language_processing')
article = scraped_data.read()

parsed_article = bs.BeautifulSoup(article,'lxml')

paragraphs = parsed_article.find_all('p')

article_text = ""

for p in paragraphs:
    article_text += p.text


In [56]:
article_text


'Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.\nChallenges in natural language processing frequently involve speech recognition, natural language understanding, and natural-language generation.\nNatural language processing has its roots in the 1950s. Already in 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence, a task that involves the automated interpretation and generation of natural language, but at the time not articulated as a problem separate from artificial intelligence.\nThe premise of symbolic NLP is well-summarized by John Searle\'s Chinese room experiment: Given a collection of rules (e.g., a Chinese phrasebook, with questions an

In [57]:
# Removing Square Brackets and Extra Spaces
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
article_text = re.sub(r'\s+', ' ', article_text)
#article_text

In [58]:
# Removing special characters and digits
formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text )
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)
#formatted_article_text
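
To see what the two cleaning passes do, here is a minimal sketch on a made-up string. The `sample` text is hypothetical and is not taken from the scraped article; only the regular expressions are the ones used above. Note that the letters-only pass turns "1950s" into a stray "s", which is harmless here because that text is only used for counting words.

```python
import re

# Hypothetical sample text (an assumption for illustration only).
sample = "NLP has its roots in the 1950s.[1]  It grew\nquickly.[23]"

# Pass 1: replace citation markers such as [1] or [23] with a space,
# then collapse runs of whitespace (including newlines) into one space.
no_citations = re.sub(r'\[[0-9]*\]', ' ', sample)
cleaned = re.sub(r'\s+', ' ', no_citations)

# Pass 2: keep letters only; digits and punctuation become spaces.
letters_only = re.sub('[^a-zA-Z]', ' ', cleaned)
letters_only = re.sub(r'\s+', ' ', letters_only)

print(repr(cleaned))
print(repr(letters_only))
```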

In [59]:
#Converting Text To Sentences
sentence_list = nltk.sent_tokenize(article_text)

# Find Weighted Frequency of Occurrence
To find the frequency of occurrence of each word, we use the formatted_article_text variable, since it doesn't contain punctuation, digits, or other special characters. Take a look at the following script:

In [60]:
stopwords = nltk.corpus.stopwords.words('english')

word_frequencies = {}
for word in nltk.word_tokenize(formatted_article_text):
    if word not in stopwords:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1
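
The same counting step can be sketched more compactly with `collections.Counter` from the standard library. This is an alternative formulation, not the notebook's own code: `toy_stopwords` and `toy_text` are hypothetical stand-ins for the NLTK stop word list and formatted_article_text, and `str.split` is used in place of `nltk.word_tokenize` so the example runs without NLTK data.

```python
from collections import Counter

# Hypothetical stand-ins for the notebook's stop word list and cleaned text.
toy_stopwords = {"is", "a", "of", "the", "and"}
toy_text = "language is a tool and language models model the language of text"

# Count every token that is not a stop word.
toy_frequencies = Counter(
    word for word in toy_text.split() if word not in toy_stopwords
)

print(toy_frequencies.most_common(1))
```

A `Counter` is a `dict` subclass, so the normalization and lookup steps that follow would work on it unchanged.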

To find the weighted frequency, we can simply divide the number of occurrences of each word by the frequency of the most frequently occurring word, as shown below:

In [61]:
maximum_frequency = max(word_frequencies.values())

# Scale every count into the range (0, 1]; the most frequent word gets 1.0.
for word in word_frequencies:
    word_frequencies[word] = word_frequencies[word] / maximum_frequency
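
As a small worked example of this normalization, the sketch below uses made-up counts (`raw_counts` is hypothetical, not data from the article). The most frequent word ends up with weight 1.0 and every other word gets a fraction of it.

```python
# Hypothetical raw counts (an assumption for illustration only).
raw_counts = {"language": 4, "processing": 2, "turing": 1}

max_count = max(raw_counts.values())

# Divide each count by the largest count.
weighted = {word: count / max_count for word, count in raw_counts.items()}

print(weighted)
# {'language': 1.0, 'processing': 0.5, 'turing': 0.25}
```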

In [65]:
#Calculating Sentence Scores
#We have now calculated the weighted frequencies 
#for all the words. Now is the time to calculate the 
#scores for each sentence by adding weighted frequencies of 
#the words that occur in that particular sentence. 
#The following script calculates sentence scores:


sentence_scores = {}
for sent in sentence_list:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 45:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]

In [66]:
sentence_scores

{'Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.': 6.333333333333334,
 'Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural-language generation.': 4.296296296296297,
 'Natural language processing has its roots in the 1950s.': 2.2222222222222223,
 'Up to the 1980s, most natural language processing systems were based on complex sets of hand-written rules.': 3.1111111111111116,
 'Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of machine learning algorithms for language processing.': 5.296296296296297,
 "This was due to both the steady increase in computational power (see Moore's law) and the gradual lessening of the domin

In the script above, we first create an empty sentence_scores dictionary. The keys of this dictionary will be the sentences themselves and the values will be the corresponding scores of the sentences. Next, we loop through each sentence in the sentence_list and tokenize the sentence into words.

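We then check if the word exists in the word_frequencies dictionary. This check is needed because we created sentence_list from the article_text object, whereas the word frequencies were calculated from the formatted_article_text object with stop words filtered out, so word_frequencies contains no stop words, numbers, or punctuation. Note that the sentence tokens are lowercased with sent.lower(), while the keys of word_frequencies keep their original capitalization, so only the lowercase entries are ever matched.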

We do not want very long sentences in the summary, so we calculate the score only for sentences with fewer than 45 words, counted by splitting on spaces (you can tweak this parameter for your own use case). Next, we check whether the sentence exists in the sentence_scores dictionary. If it doesn't, we add it as a key and assign it the weighted frequency of the first matching word in the sentence as its value. If the sentence is already in the dictionary, we simply add the weighted frequency of the word to the existing value.
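
The scoring logic described above can be sketched as a small function. This is an illustrative rewrite under stated assumptions, not the notebook's exact code: the name `score_sentences`, the `max_words` parameter, and the toy data are hypothetical, and a simple regular expression stands in for `nltk.word_tokenize` so the example runs without NLTK. The toy weights are lowercase so that they match the lowercased sentence tokens.

```python
import re

def score_sentences(sentences, weights, max_words=45):
    """Sum the weights of known words for each sufficiently short sentence."""
    scores = {}
    for sentence in sentences:
        # Skip long sentences up front instead of re-checking for every word.
        if len(sentence.split(' ')) >= max_words:
            continue
        # Assumption: a letters-only regex replaces nltk.word_tokenize here.
        for word in re.findall(r"[a-z]+", sentence.lower()):
            if word in weights:
                scores[sentence] = scores.get(sentence, 0.0) + weights[word]
    return scores

# Hypothetical weights and sentences (not taken from the article).
toy_weights = {"language": 1.0, "processing": 0.5, "turing": 0.25}
toy_sentences = [
    "Language processing is hard.",
    "Turing wrote about language.",
    "Nothing relevant here.",
]

toy_scores = score_sentences(toy_sentences, toy_weights)
print(toy_scores)
```

As in the original loop, a sentence with no matching words never gets an entry in the result, so it cannot be selected for the summary.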

# Getting the Summary
Now we have the sentence_scores dictionary, which contains sentences with their corresponding scores. To summarize the article, we can take the top N sentences with the highest scores. The following script retrieves the top 15 sentences and prints them on the screen.

In [67]:
import heapq
summary_sentences = heapq.nlargest(15, sentence_scores, key=sentence_scores.get)

summary = ' '.join(summary_sentences)
print(summary)

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of machine learning algorithms for language processing. However, part-of-speech tagging introduced the use of hidden Markov models to natural language processing, and increasingly, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data. For instance, the term neural machine translation (NMT) emphasizes the fact that deep learning-based approaches to machine translation directly learn sequence-to-sequence transformations, obviating the need for intermediate steps such as word alignment and l
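
To make the selection step concrete, here is a minimal sketch of `heapq.nlargest` on made-up scores (`toy_scores` is hypothetical). Iterating over a dictionary yields its keys, so `nlargest` returns sentences ranked by the value that `key=toy_scores.get` looks up. The last step, restoring article order, is an optional variation that is not part of the script above; it relies on dictionaries preserving insertion order (Python 3.7+).

```python
import heapq

# Hypothetical sentence scores, listed in article order.
toy_scores = {
    "Sentence A.": 2.25,
    "Sentence B.": 0.5,
    "Sentence C.": 3.0,
}

# Pick the two highest-scoring sentences, best first.
top_two = heapq.nlargest(2, toy_scores, key=toy_scores.get)

# Optional variation: put the selected sentences back in article order.
in_article_order = [s for s in toy_scores if s in top_two]

print(' '.join(top_two))
print(' '.join(in_article_order))
```

Joining in score order, as the script above does, can make the summary jump around the article; joining in article order often reads more naturally.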