In [3]:
# %pip install beautifulsoup4
# %pip install lxml
# %pip install nltk
# import nltk
# nltk.download('punkt')
# nltk.download('stopwords')

In [4]:
import bs4 as bs
import urllib.request
import re
import nltk

Scrape a Wikipedia article and clean it up a little.

In [5]:

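# Download the raw HTML of the article.
scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Artificial_intelligence')
article = scraped_data.read()

# Parse the HTML with the lxml parser and collect every <p> element.
parsed_article = bs.BeautifulSoup(article, 'lxml')
paragraphs = parsed_article.find_all('p')

# Concatenate the paragraph texts into one string.
article_text = ""
for p in paragraphs:
    article_text += p.text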

In [6]:
paragraphs[3].text

'The term "artificial intelligence" had previously been used to describe machines that mimic and display "human" cognitive skills that are associated with the human mind, such as "learning" and "problem-solving". This definition has since been rejected by major AI researchers who now describe AI in terms of rationality and acting rationally, which does not limit how intelligence can be articulated.[b]\n'

In [7]:
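# Remove numeric citation markers such as [1]; lettered footnotes such as [a] are left in place.
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
# Collapse runs of whitespace (including newlines) into single spaces.
article_text = re.sub(r'\s+', ' ', article_text)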

In [8]:
article_text[:500]

' Artificial intelligence (AI) is intelligence demonstrated by machines, as opposed to the natural intelligence displayed by animals including humans. AI research has been defined as the field of study of intelligent agents, which refers to any system that perceives its environment and takes actions that maximize its chance of achieving its goals.[a] The term "artificial intelligence" had previously been used to describe machines that mimic and display "human" cognitive skills that are associated'

The `article_text` object now has its numeric citation markers (such as `[1]`) removed; lettered footnote markers such as `[a]` remain, as the output above shows. We do not want to remove anything else from it, since this is the original article and the summary sentences will be taken from it. Numbers, punctuation marks, and special characters therefore stay in this text.

To clean the text and calculate weighted frequencies, we will create another object.

In [9]:
formatted_article_text = re.sub(r'[^a-zA-Z]', ' ', article_text)
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)

In [10]:
formatted_article_text[:500]

' Artificial intelligence AI is intelligence demonstrated by machines as opposed to the natural intelligence displayed by animals including humans AI research has been defined as the field of study of intelligent agents which refers to any system that perceives its environment and takes actions that maximize its chance of achieving its goals a The term artificial intelligence had previously been used to describe machines that mimic and display human cognitive skills that are associated with the h'
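The two cleaning passes are easier to compare on a short string. The following is a minimal sketch that applies the same regular expressions to a made-up sample (the `sample` text is an assumption, not taken from the article). It illustrates that the first pass removes only numeric markers, while the second keeps nothing but ASCII letters.

```python
import re

# Hypothetical sample text standing in for the scraped article.
sample = 'AI is broad.[1] It has many goals.[a]  Some date to 1956.[23]'

# Pass 1: strip numeric citation markers, then collapse whitespace.
text = re.sub(r'\[[0-9]*\]', ' ', sample)
text = re.sub(r'\s+', ' ', text)

# Pass 2: replace every non-letter with a space, then collapse whitespace again.
formatted = re.sub(r'[^a-zA-Z]', ' ', text)
formatted = re.sub(r'\s+', ' ', formatted)

print(text)       # numeric markers are gone, '[a]' is still present
print(formatted)  # letters and single spaces only
```

In this sketch the lettered marker survives the first pass and leaves a stray `a` token after the second, which matches what the notebook outputs above show.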

#### Converting Text To Sentences

At this point we have preprocessed the data. Next, we need to tokenize the article into sentences. We use the `article_text` object for this because it still contains full stops. The `formatted_article_text` object contains no punctuation, so there is nothing to mark where one sentence ends and the next begins.

The following script performs sentence tokenization:

In [11]:
sentence_list = nltk.sent_tokenize(article_text)

In [12]:
sentence_list[:2]

[' Artificial intelligence (AI) is intelligence demonstrated by machines, as opposed to the natural intelligence displayed by animals including humans.',
 'AI research has been defined as the field of study of intelligent agents, which refers to any system that perceives its environment and takes actions that maximize its chance of achieving its goals.']
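To see why the punctuation-free text cannot be split, here is a rough sketch using a naive regex splitter. The helper `naive_sentences` is a hypothetical stand-in written for this illustration; it is not how `nltk.sent_tokenize` works internally, and the sample text is made up.

```python
import re

def naive_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (a crude stand-in, not NLTK's algorithm)."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

# Hypothetical sample text.
original = 'AI is broad. It has many goals. Some are old.'
# Same letters-only cleaning as formatted_article_text.
letters_only = re.sub(r'\s+', ' ', re.sub(r'[^a-zA-Z]', ' ', original))

print(naive_sentences(original))      # three separate sentences
print(naive_sentences(letters_only))  # one undivided string
```

With the full stops gone, the splitter has no boundary to work with and returns the whole text as a single item.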

#### Find Weighted Frequency of Occurrence

To find the frequency of occurrence of each word, we use the `formatted_article_text` variable, since it doesn't contain punctuation, digits, or other special characters.

In [13]:
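stopwords = nltk.corpus.stopwords.words('english')

# Count every token that is not a stopword. The comparison is case-sensitive,
# so capitalized forms such as 'Artificial' and 'AI' get their own entries.
word_frequencies = {}
for word in nltk.word_tokenize(formatted_article_text):
    if word not in stopwords:
        if word not in word_frequencies:
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1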

In [14]:
list(word_frequencies.items())[:4]

[('Artificial', 8), ('intelligence', 97), ('AI', 156), ('demonstrated', 3)]

In [15]:
maximum_frequency = max(word_frequencies.values())

# Divide each count by the highest count so every weight falls in (0, 1].
for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word]/maximum_frequency)

In [16]:
list(word_frequencies.items())[:4]

[('Artificial', 0.05128205128205128),
 ('intelligence', 0.6217948717948718),
 ('AI', 1.0),
 ('demonstrated', 0.019230769230769232)]
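The same count-then-normalize step can be sketched more compactly with `collections.Counter`. The stopword set and token list below are toy assumptions chosen for illustration; the notebook itself uses NLTK's English stopword list and tokenizer.

```python
from collections import Counter

# Hypothetical toy inputs.
toy_stopwords = {'is', 'the', 'of', 'a'}
toy_tokens = 'AI is the study of agents AI agents plan AI'.split()

# Count the non-stopword tokens.
counts = Counter(w for w in toy_tokens if w not in toy_stopwords)

# Normalize by the most frequent word, so that word gets weight 1.0.
max_count = max(counts.values())
weighted = {w: c / max_count for w, c in counts.items()}

print(weighted)
```

Here `AI` occurs three times and becomes 1.0, while `agents` occurs twice and becomes 2/3. This mirrors how `intelligence` (97 occurrences) becomes 97/156 in the output above.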

#### Calculating Sentence Scores

We have now calculated the weighted frequencies for all the words. The next step is to score each sentence by adding up the weighted frequencies of the words that occur in it. The following script calculates sentence scores:

In [17]:
print(sentence_list[2].lower())
nltk.word_tokenize(sentence_list[2].lower())

[a] the term "artificial intelligence" had previously been used to describe machines that mimic and display "human" cognitive skills that are associated with the human mind, such as "learning" and "problem-solving".


['[',
 'a',
 ']',
 'the',
 'term',
 '``',
 'artificial',
 'intelligence',
 "''",
 'had',
 'previously',
 'been',
 'used',
 'to',
 'describe',
 'machines',
 'that',
 'mimic',
 'and',
 'display',
 '``',
 'human',
 "''",
 'cognitive',
 'skills',
 'that',
 'are',
 'associated',
 'with',
 'the',
 'human',
 'mind',
 ',',
 'such',
 'as',
 '``',
 'learning',
 "''",
 'and',
 '``',
 'problem-solving',
 "''",
 '.']

In [18]:
sentence_scores = {}
for sent in sentence_list:
    # Tokens are lowercased before being looked up in word_frequencies.
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            # Only sentences shorter than 30 space-separated words are scored.
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]

In [19]:
list(sentence_scores.items())[:4]

[(' Artificial intelligence (AI) is intelligence demonstrated by machines, as opposed to the natural intelligence displayed by animals including humans.',
  2.6217948717948723),
 ('As machines become increasingly capable, tasks considered to require "intelligence" are often removed from the definition of AI, a phenomenon known as the AI effect.',
  1.1538461538461537),
 ('For instance, optical character recognition is frequently excluded from things considered to be AI, having become a routine technology.',
  0.3525641025641026),
 ('The various sub-fields of AI research are centered around particular goals and the use of particular tools.',
  0.7948717948717949)]
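The scoring rule can be checked by hand on a toy example. The sketch below makes two simplifying assumptions: it tokenizes on whitespace instead of calling `nltk.word_tokenize`, and it uses a hypothetical weight table with lowercase keys. The second assumption matters because the notebook's frequency table keeps the original capitalization while the scoring loop lowercases each sentence, so capitalized keys such as `'AI'` appear not to contribute to any score there.

```python
def score_sentences(sentences, word_weights, max_words=30):
    """Sum the weights of known words for each sentence shorter than max_words."""
    scores = {}
    for sent in sentences:
        if len(sent.split(' ')) >= max_words:
            continue  # skip long sentences, as the notebook does
        for word in sent.lower().split():
            word = word.strip('.,;:!?"()[]')  # crude punctuation removal
            if word in word_weights:
                scores[sent] = scores.get(sent, 0.0) + word_weights[word]
    return scores

# Hypothetical weights and sentences.
toy_weights = {'ai': 1.0, 'agents': 0.5, 'plan': 0.25}
toy_sentences = ['AI agents plan.', 'Agents act.', 'Nothing relevant here.']

scores = score_sentences(toy_sentences, toy_weights)
print(scores)
```

A sentence with no weighted words never gets an entry, which is also true of the notebook's loop.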

#### Getting the Summary

Now we have the `sentence_scores` dictionary, which maps each sentence to its score. To summarize the article, we take the N sentences with the highest scores. The following script retrieves the top 7 sentences and prints them:

In [20]:
import heapq
summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)

summary = ' '.join(summary_sentences)
print(summary)

 Artificial intelligence (AI) is intelligence demonstrated by machines, as opposed to the natural intelligence displayed by animals including humans. The artificial intelligence algorithms will also be used to further improve diagnosis over time, via an application of machine learning called precision medicine. A machine with general intelligence can solve a wide variety of problems with breadth and versatility similar to human intelligence. Deep learning has drastically improved the performance of programs in many important subfields of artificial intelligence, including computer vision, speech recognition, image classification and others. [r] AI founder John McCarthy said: "Artificial intelligence is not, by definition, simulation of human intelligence". A superintelligence, hyperintelligence, or superhuman intelligence, is a hypothetical agent that would possess intelligence far surpassing that of the brightest and most gifted human mind. The main agenda for these scientific diploma
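`heapq.nlargest` returns the selected sentences ordered by score, not by their position in the article, which is why the summary above jumps between topics. As a possible variation (not part of the original notebook), the chosen sentences can be put back into document order. The scores below are made-up values.

```python
import heapq

# Hypothetical scores, inserted in document order.
toy_scores = {
    'First sentence.': 0.9,
    'Second sentence.': 0.1,
    'Third sentence.': 1.2,
}

# Top 2 sentences, highest score first.
top = heapq.nlargest(2, toy_scores, key=toy_scores.get)

# Restore document order; dicts preserve insertion order in Python 3.7+.
in_order = [s for s in toy_scores if s in top]

print(top)
print(' '.join(in_order))
```

Whether document order reads better than score order depends on the article; this is a design choice rather than a correction.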