Preparing the environment

In [1]:
!pip3 install networkx





In [2]:
import re
import nltk
import string
import numpy as np
import networkx as nx
from nltk.cluster.util import cosine_distance

In [3]:
nltk.download('punkt')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ITS\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ITS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [5]:
stopwords = nltk.corpus.stopwords.words('english')
print(stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [6]:
def preprocess(text):
  # Lowercase, tokenize, then drop stopwords and punctuation tokens
  formatted_text = text.lower()
  tokens = nltk.word_tokenize(formatted_text)
  tokens = [word for word in tokens if word not in stopwords and word not in string.punctuation]
  return ' '.join(tokens)

In [7]:
original_text = """Artificial intelligence is human like intelligence. It is the study of intelligent artificial agents. Science and engineering to produce intelligent machines. Solve problems and have intelligence. An AI system is made up of an agent and its surroundings. An agent is anything that can perceive its surroundings through sensors and act on them through effectors. Intelligent agents must be capable of setting and achieving goals. In classical planning problems, the agent can assume that it is the only system acting in the world, allowing the agent to be confident in the outcomes of its actions. However, if the agent is not the only actor, the agent must be able to reason under uncertainty. This necessitates an agent that not only assesses its surroundings and makes predictions, but also evaluates its predictions and adapts based on its findings."""
original_text = re.sub(r'\s+', ' ', original_text)
original_text

'Artificial intelligence is human like intelligence. It is the study of intelligent artificial agents. Science and engineering to produce intelligent machines. Solve problems and have intelligence. An AI system is made up of an agent and its surroundings. An agent is anything that can perceive its surroundings through sensors and act on them through effectors. Intelligent agents must be capable of setting and achieving goals. In classical planning problems, the agent can assume that it is the only system acting in the world, allowing the agent to be confident in the outcomes of its actions. However, if the agent is not the only actor, the agent must be able to reason under uncertainty. This necessitates an agent that not only assesses its surroundings and makes predictions, but also evaluates its predictions and adapts based on its findings.'

Function to calculate similarity between sentences

In [8]:
original_sentences = nltk.sent_tokenize(original_text)
original_sentences

['Artificial intelligence is human like intelligence.',
 'It is the study of intelligent artificial agents.',
 'Science and engineering to produce intelligent machines.',
 'Solve problems and have intelligence.',
 'An AI system is made up of an agent and its surroundings.',
 'An agent is anything that can perceive its surroundings through sensors and act on them through effectors.',
 'Intelligent agents must be capable of setting and achieving goals.',
 'In classical planning problems, the agent can assume that it is the only system acting in the world, allowing the agent to be confident in the outcomes of its actions.',
 'However, if the agent is not the only actor, the agent must be able to reason under uncertainty.',
 'This necessitates an agent that not only assesses its surroundings and makes predictions, but also evaluates its predictions and adapts based on its findings.']

In [9]:
formatted_sentences = [preprocess(original_sentence) for original_sentence in original_sentences]
formatted_sentences

['artificial intelligence human like intelligence',
 'study intelligent artificial agents',
 'science engineering produce intelligent machines',
 'solve problems intelligence',
 'ai system made agent surroundings',
 'agent anything perceive surroundings sensors act effectors',
 'intelligent agents must capable setting achieving goals',
 'classical planning problems agent assume system acting world allowing agent confident outcomes actions',
 'however agent actor agent must able reason uncertainty',
 'necessitates agent assesses surroundings makes predictions also evaluates predictions adapts based findings']

In [10]:
def calculate_sentence_similarity(sentence1, sentence2):
  words1 = nltk.word_tokenize(sentence1)
  words2 = nltk.word_tokenize(sentence2)

  # An empty sentence yields a zero vector, for which cosine distance is undefined
  if not words1 or not words2:
    return 0.0

  # Shared vocabulary of both sentences
  all_words = list(set(words1 + words2))

  vector1 = [0] * len(all_words)
  vector2 = [0] * len(all_words)

  # Bag of words: count how often each vocabulary word occurs in each sentence
  for word in words1:
    vector1[all_words.index(word)] += 1
  for word in words2:
    vector2[all_words.index(word)] += 1

  return 1 - cosine_distance(vector1, vector2)

In [11]:
calculate_sentence_similarity(formatted_sentences[0], formatted_sentences[1])

0.18898223650461365
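
The value above can be checked by hand. The sketch below recomputes the cosine similarity of the first two preprocessed sentences using only the standard library. `cosine_similarity` is a hypothetical helper introduced for illustration, and whitespace splitting stands in for `nltk.word_tokenize`, which should give the same tokens for these already-cleaned sentences.

```python
import math
from collections import Counter

def cosine_similarity(sentence1, sentence2):
    """Cosine similarity between the bag-of-words counts of two sentences."""
    counts1 = Counter(sentence1.split())
    counts2 = Counter(sentence2.split())
    # Dot product over shared words only; every other term is zero.
    dot = sum(counts1[word] * counts2[word] for word in counts1.keys() & counts2.keys())
    norm1 = math.sqrt(sum(c * c for c in counts1.values()))
    norm2 = math.sqrt(sum(c * c for c in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty sentence shares nothing with any other
    return dot / (norm1 * norm2)

s0 = 'artificial intelligence human like intelligence'
s1 = 'study intelligent artificial agents'
# Only 'artificial' is shared, so dot = 1; the norms are sqrt(7) and 2.
similarity = cosine_similarity(s0, s1)
print(round(similarity, 8))  # 1 / (2 * sqrt(7)), about 0.18898224
```

The guard for zero-length vectors is a choice made for this sketch; it avoids a division by zero when a sentence consists only of stopwords.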

In [12]:
test = ['human', 'study', 'intelligence', 'agents', 'intelligent', 'artificial', 'like']
test.index('agents')

3

Function to create the similarity matrix

In [13]:
def calculate_similarity_matrix(sentences):
  similarity_matrix = np.zeros((len(sentences), len(sentences)))
  #print(similarity_matrix)
  for i in range(len(sentences)):
    for j in range(len(sentences)):
      if i == j:
        continue
      similarity_matrix[i][j] = calculate_sentence_similarity(sentences[i], sentences[j])
  return similarity_matrix
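
Because cosine similarity is symmetric, each pair only needs to be computed once and can then be mirrored across the diagonal. A minimal sketch of that idea follows, assuming the caller supplies a pairwise function. Here a hypothetical word-overlap function, `jaccard_overlap`, stands in for `calculate_sentence_similarity`, and nested lists stand in for the NumPy array.

```python
def build_symmetric_matrix(sentences, similarity):
    """Fill an n x n similarity matrix, computing each pair only once."""
    n = len(sentences)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            value = similarity(sentences[i], sentences[j])
            matrix[i][j] = value
            matrix[j][i] = value  # mirror across the diagonal
    return matrix

def jaccard_overlap(a, b):
    """Hypothetical stand-in similarity: shared words over all words."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

demo = build_symmetric_matrix(
    ['ai system agent', 'agent surroundings', 'solve problems'],
    jaccard_overlap,
)
```

For n sentences this performs n(n-1)/2 similarity calls instead of n(n-1), which matters most on long articles.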

In [14]:
calculate_similarity_matrix(formatted_sentences)

array([[0.        , 0.18898224, 0.        , 0.43643578, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.18898224, 0.        , 0.2236068 , 0.        , 0.        ,
        0.        , 0.37796447, 0.        , 0.        , 0.        ],
       [0.        , 0.2236068 , 0.        , 0.        , 0.        ,
        0.        , 0.16903085, 0.        , 0.        , 0.        ],
       [0.43643578, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.1490712 , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.3380617 , 0.        , 0.34641016, 0.28284271, 0.23904572],
       [0.        , 0.        , 0.        , 0.        , 0.3380617 ,
        0.        , 0.        , 0.19518001, 0.23904572, 0.20203051],
       [0.        , 0.37796447, 0.16903085, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.11952286, 0.        ],
       [0.        , 0.        , 0.       

Function to summarize the texts

In [15]:
for i, sentence in enumerate(original_sentences):
    print(i, sentence)

0 Artificial intelligence is human like intelligence.
1 It is the study of intelligent artificial agents.
2 Science and engineering to produce intelligent machines.
3 Solve problems and have intelligence.
4 An AI system is made up of an agent and its surroundings.
5 An agent is anything that can perceive its surroundings through sensors and act on them through effectors.
6 Intelligent agents must be capable of setting and achieving goals.
7 In classical planning problems, the agent can assume that it is the only system acting in the world, allowing the agent to be confident in the outcomes of its actions.
8 However, if the agent is not the only actor, the agent must be able to reason under uncertainty.
9 This necessitates an agent that not only assesses its surroundings and makes predictions, but also evaluates its predictions and adapts based on its findings.







In [17]:
!pip3 install pyg-nightly





In [18]:
!pip3 install scipy





In [19]:
def summarize(text, number_of_sentences, percentage = 0):
  original_sentences = nltk.sent_tokenize(text)
  formatted_sentences = [preprocess(original_sentence) for original_sentence in original_sentences]
  similarity_matrix = calculate_similarity_matrix(formatted_sentences)

  # Sentences become nodes; similarities become weighted edges
  similarity_graph = nx.from_numpy_array(similarity_matrix)

  # PageRank score for each sentence index
  scores = nx.pagerank(similarity_graph)
  ordered_scores = sorted(((scores[i], sentence) for i, sentence in enumerate(original_sentences)), reverse=True)

  # A positive percentage takes precedence over number_of_sentences
  if percentage > 0:
    number_of_sentences = int(len(formatted_sentences) * percentage)
  # Never request more sentences than the text contains
  number_of_sentences = min(number_of_sentences, len(ordered_scores))

  best_sentences = [sentence for _, sentence in ordered_scores[:number_of_sentences]]

  return original_sentences, best_sentences, ordered_scores
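
`nx.pagerank` treats each sentence as a node and each similarity as a weighted edge, then scores nodes by how much weight flows into them. The sketch below is a plain power iteration that illustrates the idea. It is not NetworkX's implementation: the function name and the fixed iteration count are choices made for this example, and the damping factor of 0.85 mirrors NetworkX's default `alpha`.

```python
def pagerank_sketch(weights, damping=0.85, iterations=100):
    """Approximate PageRank scores for a weighted adjacency matrix."""
    n = len(weights)
    scores = [1.0 / n] * n
    row_sums = [sum(row) for row in weights]
    for _ in range(iterations):
        # Every node receives a small baseline share of the total score.
        new_scores = [(1.0 - damping) / n] * n
        for i in range(n):
            if row_sums[i] == 0:
                # A sentence with no similar neighbours spreads its score evenly.
                for j in range(n):
                    new_scores[j] += damping * scores[i] / n
            else:
                # Otherwise its score is shared in proportion to edge weight.
                for j in range(n):
                    new_scores[j] += damping * scores[i] * weights[i][j] / row_sums[i]
        scores = new_scores
    return scores

# Sentence 0 is similar to both others; sentences 1 and 2 are unrelated.
toy_matrix = [
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.0],
    [0.5, 0.0, 0.0],
]
toy_scores = pagerank_sketch(toy_matrix)
ranking = sorted(range(len(toy_scores)), key=lambda i: toy_scores[i], reverse=True)
```

In this toy graph the sentence connected to both others ranks first, which is the same intuition that puts broadly overlapping sentences at the top of `ordered_scores`.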

In [20]:
original_sentences, best_sentences, scores = summarize(original_text, 3)

In [21]:
original_sentences

['Artificial intelligence is human like intelligence.',
 'It is the study of intelligent artificial agents.',
 'Science and engineering to produce intelligent machines.',
 'Solve problems and have intelligence.',
 'An AI system is made up of an agent and its surroundings.',
 'An agent is anything that can perceive its surroundings through sensors and act on them through effectors.',
 'Intelligent agents must be capable of setting and achieving goals.',
 'In classical planning problems, the agent can assume that it is the only system acting in the world, allowing the agent to be confident in the outcomes of its actions.',
 'However, if the agent is not the only actor, the agent must be able to reason under uncertainty.',
 'This necessitates an agent that not only assesses its surroundings and makes predictions, but also evaluates its predictions and adapts based on its findings.']

In [22]:
best_sentences

['An AI system is made up of an agent and its surroundings.',
 'In classical planning problems, the agent can assume that it is the only system acting in the world, allowing the agent to be confident in the outcomes of its actions.',
 'However, if the agent is not the only actor, the agent must be able to reason under uncertainty.']

In [23]:
summary = ' '.join(best_sentences)
summary

'An AI system is made up of an agent and its surroundings. In classical planning problems, the agent can assume that it is the only system acting in the world, allowing the agent to be confident in the outcomes of its actions. However, if the agent is not the only actor, the agent must be able to reason under uncertainty.'

In [24]:
len(summary)

322

In [25]:
scores

[(0.1245571289537778,
  'An AI system is made up of an agent and its surroundings.'),
 (0.12310223605619375,
  'In classical planning problems, the agent can assume that it is the only system acting in the world, allowing the agent to be confident in the outcomes of its actions.'),
 (0.12116024545245735,
  'However, if the agent is not the only actor, the agent must be able to reason under uncertainty.'),
 (0.11497935278324921, 'It is the study of intelligent artificial agents.'),
 (0.10279926964350226,
  'An agent is anything that can perceive its surroundings through sensors and act on them through effectors.'),
 (0.09570610397641657,
  'Intelligent agents must be capable of setting and achieving goals.'),
 (0.09039359458072403, 'Artificial intelligence is human like intelligence.'),
 (0.08211996000504274, 'Solve problems and have intelligence.'),
 (0.0819075007757374,
  'This necessitates an agent that not only assesses its surroundings and makes predictions, but also evaluates its 

In [26]:
from IPython.display import HTML, display
def visualize(title, sentence_list, best_sentences):
  text = ''

  display(HTML(f'<h1>Summary - {title}</h1>'))
  for sentence in sentence_list:
    if sentence in best_sentences:
      # Highlight the sentences selected for the summary
      text += ' ' + f"<mark>{sentence}</mark>"
    else:
      text += ' ' + sentence
  display(HTML(f""" {text} """))
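
Text scraped from the web can contain characters such as `<` or `&` that a browser would interpret as markup. The sketch below shows one way to build the highlighted string with each sentence escaped first. `build_highlighted_html` is a hypothetical helper that only returns the string, which could then be passed to `HTML(...)`.

```python
from html import escape

def build_highlighted_html(sentence_list, best_sentences):
    """Return the sentences joined as HTML, with the best ones in <mark> tags."""
    best = set(best_sentences)  # set lookup instead of scanning a list each time
    parts = []
    for sentence in sentence_list:
        safe = escape(sentence)  # neutralize <, >, & and quotes
        parts.append(f'<mark>{safe}</mark>' if sentence in best else safe)
    return ' '.join(parts)

html_text = build_highlighted_html(
    ['Agents act.', 'If a < b, stop.', 'Goals matter.'],
    ['If a < b, stop.'],
)
```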

In [27]:
visualize('Artificial intelligence', original_sentences, best_sentences)

Extracting texts from the Internet

In [28]:
!pip3 install goose3








In [29]:
from goose3 import Goose
g = Goose()
url = 'https://en.wikipedia.org/wiki/Automatic_summarization'
article = g.extract(url)

In [30]:
article.cleaned_text

'Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.\n\nIn addition to text, images and videos can also be summarized. Text summarization finds the most informative sentences in a document;[1] various methods of image summarization are the subject of ongoing research, with some looking to display the most representative images from a given collection or generating a video;[2][3][4] video summarization extracts the most important frames from the video content.[5]\n\nThere are two general approaches to automatic summarization: extraction and abstraction.\n\nHere, content is extracted from the original data, but the extracted content is not modified in any way. Examples of extracted content include key-phrases that can be used to "tag" or index a text document, or key sentences (including headings) that collectively comprise an abstract, an

In [31]:
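# percentage=0.2 takes precedence inside summarize, so the 120 is ignored and 20% of the sentences are kept
original_sentences, best_sentences, scores = summarize(article.cleaned_text, 120, 0.2)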

In [32]:
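# Share of the text that 120 sentences would represent (the summary itself is sized by the percentage argument)
(120 / len(original_sentences)) * 100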

40.0
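
In `summarize`, a positive `percentage` replaces `number_of_sentences`, so the call above keeps 20% of the sentences rather than 120 of them. The sketch below isolates that rule. `resolve_summary_length` is a hypothetical helper, the clamp to the number of available sentences is a defensive choice for this example, and the total of 300 sentences is inferred from the 40.0 printed above (120 / 300).

```python
def resolve_summary_length(total_sentences, number_of_sentences, percentage=0):
    """Number of sentences to keep; a positive percentage takes precedence."""
    if percentage > 0:
        number_of_sentences = int(total_sentences * percentage)
    # Never ask for more sentences than the text contains.
    return max(0, min(number_of_sentences, total_sentences))

kept = resolve_summary_length(300, 120, 0.2)  # 60 sentences, not 120
share = kept / 300 * 100                      # about 20 percent of the text
```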

In [33]:
original_sentences

['Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.',
 'In addition to text, images and videos can also be summarized.',
 'Text summarization finds the most informative sentences in a document;[1] various methods of image summarization are the subject of ongoing research, with some looking to display the most representative images from a given collection or generating a video;[2][3][4] video summarization extracts the most important frames from the video content.',
 '[5]\n\nThere are two general approaches to automatic summarization: extraction and abstraction.',
 'Here, content is extracted from the original data, but the extracted content is not modified in any way.',
 'Examples of extracted content include key-phrases that can be used to "tag" or index a text document, or key sentences (including headings) that collectively comprise
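
The fourth sentence in the output above begins with a leftover citation marker (`[5]`) and a paragraph break, which can distort the similarity scores. One possible cleanup, applied to the text before calling `summarize`, is sketched below. `strip_citation_markers` is a hypothetical helper, and the pattern assumes the markers are purely numeric, as in Wikipedia-style references.

```python
import re

def strip_citation_markers(text):
    """Remove bracketed numeric citations such as [5] and collapse whitespace."""
    without_markers = re.sub(r'\[\d+\]', '', text)
    return re.sub(r'\s+', ' ', without_markers).strip()

sample = 'frames from the video content.[5]\n\nThere are two general approaches.'
cleaned = strip_citation_markers(sample)
print(cleaned)
```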

In [34]:
best_sentences

['The main difficulty in supervised extractive summarization is that the known summaries must be manually created by extracting sentences so the sentences in an original training document can be labeled as "in summary" or "not in summary".',
 '"Summarizing Conceptual Graphs for Automatic Summarization Task".',
 'Some unsupervised summarization approaches are based on finding a "centroid" sentence, which is the mean word vector of all the sentences in the document.',
 'Another important distinction is that TextRank was used for single document summarization, while LexRank has been applied to multi-document summarization.',
 'Instead of trying to learn explicit features that characterize keyphrases, the TextRank algorithm[12] exploits the structure of the text itself to determine keyphrases that appear "central" to the text in the same way that PageRank selects important Web pages.',
 'For example, if we rank unigrams and find that "advanced", "natural", "language", and "processing" all 

In [35]:
scores

[(0.008204982222197498,
  'The main difficulty in supervised extractive summarization is that the known summaries must be manually created by extracting sentences so the sentences in an original training document can be labeled as "in summary" or "not in summary".'),
 (0.00731691148230209,
  '"Summarizing Conceptual Graphs for Automatic Summarization Task".'),
 (0.007170985765395681,
  'Some unsupervised summarization approaches are based on finding a "centroid" sentence, which is the mean word vector of all the sentences in the document.'),
 (0.006900476592854994,
  'Another important distinction is that TextRank was used for single document summarization, while LexRank has been applied to multi-document summarization.'),
 (0.006748052778874626,
  'Instead of trying to learn explicit features that characterize keyphrases, the TextRank algorithm[12] exploits the structure of the text itself to determine keyphrases that appear "central" to the text in the same way that PageRank selects 

In [36]:
visualize(article.title, original_sentences, best_sentences)