In [1]:
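import re
import urllib.request

import bs4 as bs
import nltk

# Download the raw HTML of the Wikipedia article
url = 'https://en.wikipedia.org/wiki/Natural_language_processing'
with urllib.request.urlopen(url) as scraped_data:
    article = scraped_data.read()

# Parse the HTML (the 'lxml' parser requires the lxml package)
parsed_article = bs.BeautifulSoup(article, 'lxml')

# Concatenate the text of every <p> element into one string
paragraphs = parsed_article.find_all('p')
article_text = ''.join(p.text for p in paragraphs)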
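
To illustrate what the paragraph extraction above does without any third-party dependency, the sketch below collects the text inside `<p>` elements with the standard library's `html.parser` instead of BeautifulSoup. The class name `ParagraphExtractor` and the sample HTML are made up for this example, and it is not a description of how bs4 works internally.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> still belongs to the open <p>
        if self.in_paragraph:
            self.paragraphs[-1] += data


# Made-up HTML standing in for the downloaded article
sample_html = (
    "<h1>Title</h1><p>First <b>bold</b> paragraph.[1]</p>"
    "<div>menu</div><p>Second paragraph.</p>"
)
extractor = ParagraphExtractor()
extractor.feed(sample_html)
extractor.close()

# Join with no separator, as the notebook does
sample_text = "".join(extractor.paragraphs)
print(sample_text)
```

Only text inside `<p>` survives; the heading and the `<div>` content are dropped. Joining with no separator glues paragraphs together unless the paragraph text itself ends in whitespace, which is worth checking on other pages.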

In [2]:
article_text

'Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.  The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.\nChallenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.\nNatural language processing has its roots in the 1950s. Already in 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence, a task that involves the automated inte

In [3]:
# Remove bracketed citation markers such as [1], then collapse runs of whitespace
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
article_text = re.sub(r'\s+', ' ', article_text)
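
The effect of these two substitutions is easier to see on a short string. The following is a minimal sketch that wraps the same patterns in a helper; the name `clean_text`, the sample sentences, and the final `strip()` are additions for illustration only.

```python
import re


def clean_text(text):
    """Drop [n]-style citation markers and collapse whitespace (illustrative helper)."""
    text = re.sub(r'\[[0-9]*\]', ' ', text)  # [1], [23], and also empty []
    text = re.sub(r'\s+', ' ', text)         # newlines and repeated spaces -> one space
    return text.strip()                      # not done in the notebook; added here


sample = "NLP has its roots in the 1950s.[1]  Already in 1950,\nAlan Turing published an article.[2][3]"
print(clean_text(sample))

# Markers that are not purely numeric are left untouched by this pattern
print(clean_text("A disputed claim.[citation needed]"))
```

Because the pattern only matches digits between the brackets, markers such as `[citation needed]` or `[a]` remain in the text and would need their own rule.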

In [4]:
article_text

'Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves. Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation. Natural language processing has its roots in the 1950s. Already in 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence, a task that involves the automated interpr

In [5]:
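# Replace everything except ASCII letters (punctuation, digits, accented characters) with spaces.
# This copy is used only for word counts; sentences are still taken from article_text.
formatted_article_text = re.sub(r'[^a-zA-Z]', ' ', article_text)
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)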
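
One side effect of the letters-only pattern is visible in the output of the next cell, where "the 1950s" becomes "the s". A small sketch on a single sentence (the variable names are chosen for this example) reproduces it:

```python
import re

sample = "Natural language processing has its roots in the 1950s."

# Same two substitutions as in the notebook, applied to one sentence
letters_only = re.sub(r'[^a-zA-Z]', ' ', sample)
letters_only = re.sub(r'\s+', ' ', letters_only)

print(repr(letters_only))
```

The digits of "1950s" are removed but the trailing "s" stays behind as a stray one-letter token. This is harmless for the frequency table only as long as such fragments are stop words or rare.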

In [6]:
formatted_article_text

'Natural language processing NLP is a subfield of linguistics computer science and artificial intelligence concerned with the interactions between computers and human language in particular how to program computers to process and analyze large amounts of natural language data The goal is a computer capable of understanding the contents of documents including the contextual nuances of the language within them The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves Challenges in natural language processing frequently involve speech recognition natural language understanding and natural language generation Natural language processing has its roots in the s Already in Alan Turing published an article titled Computing Machinery and Intelligence which proposed what is now called the Turing test as a criterion of intelligence a task that involves the automated interpretation and generation of na

In [7]:
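# Split the cleaned text into sentences (nltk was imported in the first cell).
# sent_tokenize needs the Punkt data, e.g. nltk.download('punkt')
# (newer NLTK releases may ask for 'punkt_tab' instead).
sentence_list = nltk.sent_tokenize(article_text)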

In [8]:
sentence_list

['Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.',
 'The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them.',
 'The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.',
 'Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.',
 'Natural language processing has its roots in the 1950s.',
 'Already in 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence, a task that involves 

In [9]:
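# Requires the stop word list: nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')

# Count every token that is not a stop word. The comparison is case-sensitive,
# so capitalized stop words such as 'The' are still counted, and 'Natural'
# and 'natural' get separate entries.
word_frequencies = {}
for word in nltk.word_tokenize(formatted_article_text):
    if word not in stopwords:
        word_frequencies[word] = word_frequencies.get(word, 0) + 1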
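
The counting loop above is case-sensitive, which is why the table in the next cell lists both 'Natural' and 'natural' and still contains 'The'. One possible variant, sketched here with `collections.Counter`, lowercases the text before counting. It is an assumption-laden stand-in rather than the notebook's method: `STOP_WORDS` is a tiny hand-written set instead of NLTK's English list, and a regular expression replaces `nltk.word_tokenize`.

```python
import re
from collections import Counter

# Tiny stand-in stop word list; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "is", "a", "of", "and", "in"}


def count_words(text, stop_words=STOP_WORDS):
    """Count lowercase alphabetic tokens that are not stop words."""
    tokens = re.findall(r'[a-z]+', text.lower())
    return Counter(token for token in tokens if token not in stop_words)


counts = count_words("The goal is natural language. Natural language is hard.")
print(counts)
```

With lowercasing, "Natural" and "natural" share one entry and "The" is filtered out. Switching the notebook to this behavior would change the frequencies and scores shown in the later cells.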

In [10]:
word_frequencies

{'Natural': 3,
 'language': 29,
 'processing': 17,
 'NLP': 17,
 'subfield': 1,
 'linguistics': 9,
 'computer': 4,
 'science': 3,
 'artificial': 2,
 'intelligence': 3,
 'concerned': 1,
 'interactions': 1,
 'computers': 2,
 'human': 2,
 'particular': 1,
 'program': 1,
 'process': 2,
 'analyze': 2,
 'large': 3,
 'amounts': 1,
 'natural': 18,
 'data': 5,
 'The': 6,
 'goal': 1,
 'capable': 1,
 'understanding': 4,
 'contents': 1,
 'documents': 4,
 'including': 1,
 'contextual': 1,
 'nuances': 1,
 'within': 1,
 'technology': 1,
 'accurately': 1,
 'extract': 1,
 'information': 1,
 'insights': 1,
 'contained': 1,
 'well': 2,
 'categorize': 1,
 'organize': 1,
 'Challenges': 1,
 'frequently': 2,
 'involve': 2,
 'speech': 5,
 'recognition': 2,
 'generation': 2,
 'roots': 1,
 'Already': 1,
 'Alan': 1,
 'Turing': 2,
 'published': 1,
 'article': 1,
 'titled': 1,
 'Computing': 1,
 'Machinery': 1,
 'Intelligence': 1,
 'proposed': 2,
 'called': 2,
 'test': 1,
 'criterion': 1,
 'task': 3,
 'involves': 1,

In [11]:
# Divide each count by the highest count so that weights fall in (0, 1]
maximum_frequency = max(word_frequencies.values())

for word in word_frequencies:
    word_frequencies[word] = word_frequencies[word] / maximum_frequency
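
As a worked example of this normalization, the sketch below applies the same division to four counts copied from the frequency table shown earlier. The dictionary names are chosen for this example.

```python
# Raw counts taken from the notebook's word_frequencies output
counts = {"language": 29, "natural": 18, "processing": 17, "subfield": 1}

# The most frequent word gets weight 1.0; every other weight is count / max
max_count = max(counts.values())
weights = {word: count / max_count for word, count in counts.items()}

for word, weight in weights.items():
    print(f"{word}: {weight:.3f}")
```

Building a new dictionary leaves the raw counts available for inspection, whereas the notebook overwrites them in place.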

In [12]:
word_frequencies

{'Natural': 0.10344827586206896,
 'language': 1.0,
 'processing': 0.5862068965517241,
 'NLP': 0.5862068965517241,
 'subfield': 0.034482758620689655,
 'linguistics': 0.3103448275862069,
 'computer': 0.13793103448275862,
 'science': 0.10344827586206896,
 'artificial': 0.06896551724137931,
 'intelligence': 0.10344827586206896,
 'concerned': 0.034482758620689655,
 'interactions': 0.034482758620689655,
 'computers': 0.06896551724137931,
 'human': 0.06896551724137931,
 'particular': 0.034482758620689655,
 'program': 0.034482758620689655,
 'process': 0.06896551724137931,
 'analyze': 0.06896551724137931,
 'large': 0.10344827586206896,
 'amounts': 0.034482758620689655,
 'natural': 0.6206896551724138,
 'data': 0.1724137931034483,
 'The': 0.20689655172413793,
 'goal': 0.034482758620689655,
 'capable': 0.034482758620689655,
 'understanding': 0.13793103448275862,
 'contents': 0.034482758620689655,
 'documents': 0.13793103448275862,
 'including': 0.034482758620689655,
 'contextual': 0.03448275862068

In [13]:
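# Score each sentence as the sum of the weights of the words it contains.
# Tokens are lowercased here, so only lowercase keys of word_frequencies can match
# (entries such as 'NLP' or 'Natural' never contribute).
sentence_scores = {}
for sent in sentence_list:
    # Skip long sentences (30 or more space-separated words)
    if len(sent.split(' ')) >= 30:
        continue
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies:
            sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[word]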
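
The scoring rule can be tried in isolation on a few short sentences. The sketch below is a simplified stand-in: `score_sentences`, its `max_words` parameter, and the sample weights are hypothetical, and a regular expression replaces `nltk.word_tokenize` so that it runs without NLTK data.

```python
import re


def score_sentences(sentences, weights, max_words=30):
    """Sum the weights of known words in each sentence shorter than max_words."""
    scores = {}
    for sentence in sentences:
        # Same length filter as the notebook: count space-separated words
        if len(sentence.split(' ')) >= max_words:
            continue
        for token in re.findall(r'[a-z]+', sentence.lower()):
            if token in weights:
                scores[sentence] = scores.get(sentence, 0.0) + weights[token]
    return scores


# Made-up weights and sentences for illustration
weights = {"language": 1.0, "natural": 0.5, "processing": 0.25}
sentences = [
    "Natural language processing is fun.",
    "Language matters.",
    "Nothing relevant here.",
]
scores = score_sentences(sentences, weights)
print(scores)
```

A sentence with no weighted word never receives an entry, and because the score is a plain sum, longer sentences tend to score higher. The word-count cap is what keeps very long sentences out of the summary.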

In [14]:
sentence_scores

{'The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them.': 1.6551724137931034,
 'The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.': 0.6206896551724139,
 'Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.': 6.03448275862069,
 'Natural language processing has its roots in the 1950s.': 2.2413793103448274,
 'Up to the 1980s, most natural language processing systems were based on complex sets of hand-written rules.': 3.0344827586206895,
 'Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of machine learning algorithms for language processing.': 5.172413793103448,
 "This was due to both the steady increase in computational power (see Moore's law) a

In [15]:
import heapq

# Keep the 10 highest-scoring sentences, ordered by score rather than by position in the article
summary_sentences = heapq.nlargest(10, sentence_scores, key=sentence_scores.get)

summary = ' '.join(summary_sentences)
print(summary)

Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation. Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of machine learning algorithms for language processing. Up to the 1980s, most natural language processing systems were based on complex sets of hand-written rules. The cache language models upon which many speech recognition systems now rely are examples of such statistical models. The following is a list of some of the most commonly researched tasks in natural language processing. Though natural language processing tasks are closely intertwined, they can be subdivided into categories for convenience. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing. Natural language processing has its roots in the 1950s. Media r
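
Because `heapq.nlargest` returns sentences by descending score, the summary above does not follow the article's order, and a fragment such as "transformational grammar), whose ..." appears out of context. One option, sketched below with made-up sentences and scores, is to select the top sentences first and then restore their original order.

```python
import heapq

# Made-up sentences (in document order) and scores for illustration
sentences = ["Intro sentence.", "Key point one.", "Minor aside.", "Key point two."]
scores = {
    "Intro sentence.": 1.0,
    "Key point one.": 2.0,
    "Minor aside.": 0.5,
    "Key point two.": 3.0,
}

# Highest score first, as in the notebook
top = heapq.nlargest(2, scores, key=scores.get)

# Re-emit the selected sentences in the order they appear in the document
top_set = set(top)
in_document_order = [s for s in sentences if s in top_set]

print(top)
print(' '.join(in_document_order))
```

Whether score order or document order reads better depends on the text; document order usually preserves more of the original flow.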