## 9.1 Text Summarization with NLTK

In [2]:
# $ pip install beautifulsoup4
# $ pip install lxml
# $ pip install nltk

### 9.1.1. Scraping a Wikipedia Article

In [4]:
import bs4 as bs
import urllib.request
import re

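# Download the raw HTML of the article
raw_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Natural_language_processing')
document = raw_data.read()

# Parse the HTML with the lxml parser
parsed_document = bs.BeautifulSoup(document, 'lxml')

# Keep only the paragraph (<p>) elements, which hold the article body
article_paras = parsed_document.find_all('p')

# Concatenate the text of all paragraphs into one string
scrapped_data = ""

for para in article_paras:
    scrapped_data += para.text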

In [5]:
print(scrapped_data[:1000])

Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.
Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.
The history of natural language processing (NLP) generally started in the 1950s, although work can be found from earlier periods.
In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence[clarification needed].
The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved p

### 9.1.2. Text Cleaning

In [6]:
# Remove numeric citation markers such as [1], then collapse whitespace
scrapped_data = re.sub(r'\[[0-9]*\]', ' ', scrapped_data)
scrapped_data = re.sub(r'\s+', ' ', scrapped_data)

In [7]:
# Letters-only copy of the text, used only for counting word frequencies
formatted_text = re.sub(r'[^a-zA-Z]', ' ', scrapped_data)
formatted_text = re.sub(r'\s+', ' ', formatted_text)
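To see what the two cleaning passes do, the same regular expressions can be run on a short made-up string. This is a minimal sketch; `sample` is a hypothetical input, not text taken from the article.

```python
import re

# Hypothetical input with citation markers and irregular whitespace
sample = "NLP is a subfield of AI.[1] It dates to the 1950s.[23]\n\nMore text here."

# Pass 1: drop numeric citation markers, then collapse whitespace
no_citations = re.sub(r'\[[0-9]*\]', ' ', sample)
cleaned = re.sub(r'\s+', ' ', no_citations)

# Pass 2: keep letters only, then collapse whitespace again
letters_only = re.sub(r'[^a-zA-Z]', ' ', cleaned)
letters_only = re.sub(r'\s+', ' ', letters_only)

print(cleaned)       # punctuation kept, so sentence splitting still works
print(letters_only)  # punctuation and digits gone; "1950s" leaves a stray "s"
```

The first pattern matches only digits between brackets, so non-numeric markers such as `[clarification needed]` stay in the sentence text. The letters-only copy also leaves fragments like the lone `s` from `1950s`, which end up as entries in the word counts.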

### 9.1.3. Finding Word Frequencies

In [8]:
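import nltk
# One-time setup if the NLTK data is missing (newer releases may ask for 'punkt_tab'):
# nltk.download('punkt'); nltk.download('stopwords')
all_sentences = nltk.sent_tokenize(scrapped_data)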

In [9]:
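stopwords = nltk.corpus.stopwords.words('english')

# Count every non-stopword. Lowercase first so tokens match the lowercase
# stopword list and the lowercased tokens used later for sentence scoring.
word_freq = {}
for word in nltk.word_tokenize(formatted_text.lower()):
    if word not in stopwords:
        word_freq[word] = word_freq.get(word, 0) + 1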

In [10]:
max_freq = max(word_freq.values())

for word in word_freq.keys():
    word_freq[word] = (word_freq[word]/max_freq)
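The normalization step divides every count by the largest count, so the most frequent word gets a weight of 1.0 and every other word a fraction of it. A small sketch on toy data, using `collections.Counter` and a hand-written stopword set (`toy_stopwords` is a stand-in for illustration, not NLTK's list):

```python
from collections import Counter

# Hypothetical mini stopword list standing in for NLTK's English list
toy_stopwords = {"the", "is", "a", "of", "and"}

toy_text = "the cat and the dog and the bird saw a cat"

# Count the words that are not stopwords
tokens = [w for w in toy_text.split() if w not in toy_stopwords]
counts = Counter(tokens)

# Divide by the highest count so weights fall in (0, 1]
max_count = max(counts.values())
weights = {word: count / max_count for word, count in counts.items()}

print(weights)
```

Here `cat` appears twice and gets weight 1.0, while `dog`, `bird`, and `saw` appear once and get 0.5.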

### 9.1.4. Finding Sentence Scores

In [11]:
sentence_scores = {}
for sentence in all_sentences:
    # Skip sentences of 25 or more words to keep the summary concise
    if len(sentence.split(' ')) >= 25:
        continue
    # A sentence's score is the sum of the weights of its known words
    for token in nltk.word_tokenize(sentence.lower()):
        if token in word_freq:
            sentence_scores[sentence] = sentence_scores.get(sentence, 0) + word_freq[token]
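The scoring rule can be checked by hand on a few toy sentences. The sketch below assumes a hypothetical helper `score_sentences` and a simplified tokenizer (`str.split` after stripping periods) in place of `nltk.word_tokenize`, so it runs without NLTK data.

```python
# Hypothetical word weights, as produced by the normalization step
toy_weights = {"cat": 1.0, "dog": 0.5, "bird": 0.5}

toy_sentences = [
    "The cat chased the dog.",
    "A bird sang.",
    "Nothing relevant here.",
]

def score_sentences(sentences, weights, max_words=25):
    """Sum the weights of known words for each sufficiently short sentence."""
    scores = {}
    for sentence in sentences:
        if len(sentence.split(' ')) >= max_words:
            continue  # long sentences are never scored
        # Simplified tokenization: lowercase, drop periods, split on spaces
        for token in sentence.lower().replace('.', ' ').split():
            if token in weights:
                scores[sentence] = scores.get(sentence, 0) + weights[token]
    return scores

scores = score_sentences(toy_sentences, toy_weights)
print(scores)
```

A sentence with no weighted words never receives an entry, so it cannot be selected for the summary. Because scores are sums rather than averages, sentences with more weighted words are favored up to the length cutoff.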

### 9.1.5. Printing Summaries

In [12]:
import heapq
# Pick the 5 highest-scoring sentences, best first
selected_sentences = heapq.nlargest(5, sentence_scores, key=sentence_scores.get)

text_summary = ' '.join(selected_sentences)
print(text_summary)

Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation. Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of machine learning algorithms for language processing. Since the so-called "statistical revolution" in the late 1980s and mid-1990s, much natural language processing research has relied heavily on machine learning. Little further research in machine translation was conducted until the late 1980s when the first statistical machine translation systems were developed. Up to the 1980s, most natural language processing systems were based on complex sets of hand-written rules.
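`heapq.nlargest` returns the sentences sorted by score, so the summary above does not follow the order of the article. If article order is preferred, one possible variation is to filter the score dictionary by the selected sentences, relying on dictionaries preserving insertion order (Python 3.7+). The scores below are made up for illustration.

```python
import heapq

# Hypothetical scores, inserted in the order the sentences appear in the text
toy_scores = {
    "First sentence.": 0.4,
    "Second sentence.": 0.9,
    "Third sentence.": 1.2,
    "Fourth sentence.": 0.1,
}

# Highest score first
top = heapq.nlargest(2, toy_scores, key=toy_scores.get)

# Same sentences, restored to their original order in the text
in_text_order = [s for s in toy_scores if s in top]

print(' '.join(top))
print(' '.join(in_text_order))
```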
