<a href="https://colab.research.google.com/github/Al3xGROS/WebScappingML/blob/main/WebScrappingML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping & Article Summarization

## Introduction

The goal of this project is to scrape any English Wikipedia page the user chooses and summarize it.
We first cover the scraping part of the project, then look at how to summarize a text in Python.


## Web Scraping

There are many libraries and methods for scraping a page in Python; we chose ```Requests``` and ```BeautifulSoup```.
The first fetches the HTML content of a web page from its URL, and the second makes it easier to navigate that HTML.

First we need to import the libraries:

In [None]:
# LIBRARIES
import requests
from bs4 import BeautifulSoup
import re

Next comes the function that scrapes the Wikipedia page and returns all of its paragraphs as a single string.
We collect the text of every ```<p>``` node, keep only the text from the first occurrence of the article title onward (the title is taken from the first ```<b>``` element), and then use a regex to strip bracketed markers such as ```[1]```.

In [None]:
URL = "https://en.wikipedia.org/wiki/Margaret_Hamilton_(software_engineer)"

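# SCRAPING
def scrape(url):
  page = requests.get(url)

  soup = BeautifulSoup(page.content, 'html.parser')

  # the first bold element is expected to hold the article title
  title = soup.find("b")
  realTitle = title.get_text()

  # concatenate the text of every <p> node
  text = ""
  for paragraph in soup.find_all("p"):
    text += paragraph.get_text()

  # start at the title; str.find returns -1 when it is missing, so fall back to the full text
  start = text.find(realTitle)
  result = text[start:] if start != -1 else text
  # strip bracketed markers such as [1]
  return re.sub(r"\[[^\]]*\]", "", result)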

scrape(URL)

'Margaret Heafield Hamilton (born August 17, 1936) is an American computer scientist, systems engineer, and business owner. She was director of the Software Engineering Division of the MIT Instrumentation Laboratory, which developed on-board flight software for NASA\'s Apollo program. She later founded two software companies—Higher Order Software in 1976 and Hamilton Technologies in 1986, both in Cambridge, Massachusetts.\nHamilton has published more than 130 papers, proceedings, and reports, about sixty projects, and six major programs. She is one of the people credited with coining the term "software engineering".\nOn November 22, 2016, Hamilton received the Presidential Medal of Freedom from president Barack Obama for her work leading to the development of on-board flight software for NASA\'s Apollo Moon missions.\nMargaret Elaine Heafield was born August 17, 1936, in Paoli, Indiana, to Kenneth Heafield and Ruth Esther Heafield (née Partington). The family later moved to Michigan, w
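
The two clean-up steps (slicing from the title and removing bracketed markers) can be checked offline on a short string. The sketch below is only an illustration: the helper name `strip_citations` and the sample sentence are hypothetical and not part of the notebook, and it uses the same regex as `scrape`.

```python
import re

def strip_citations(text, title):
    """Hypothetical helper mirroring the clean-up done at the end of scrape()."""
    # Slice from the first occurrence of the title; str.find returns -1
    # when the title is absent, in which case the whole text is kept.
    start = text.find(title)
    body = text[start:] if start != -1 else text
    # Remove every bracketed marker: "[", any characters except "]", then "]".
    return re.sub(r"\[[^\]]*\]", "", body)

# Made-up sample standing in for the concatenated <p> contents.
sample = ("Navigation text. Ada Lovelace was a mathematician.[1] "
          "She wrote notes[note 2] on the Analytical Engine.[3][4]")
cleaned = strip_citations(sample, "Ada Lovelace")
print(cleaned)
```

Because the pattern matches any bracketed span, it also removes legitimate bracketed text, not only citation numbers.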

If you only need to scrape Wikipedia pages, though, there is a library called ```Wikipedia``` that is well suited to the job.

Let's see how it works.

We install the library.

In [None]:
pip install wikipedia

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11695 sha256=e36f70eea026ae41dda9c86d4de317e79cb285c7a455ba17fc7aad376a8792d7
  Stored in directory: /root/.cache/pip/wheels/07/93/05/72c05349177dca2e0ba31a33ba4f7907606f7ddef303517c6a
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


We import it.
We can then fetch the page we want with ```wikipedia.page``` and read its text from the ```content``` attribute of the returned page object.

In [None]:
import wikipedia
wikisearch = wikipedia.page("Margaret_Hamilton_(software_engineer)")
wikicontent = wikisearch.content

wikicontent

'Margaret Heafield Hamilton (born August 17, 1936) is an American computer scientist, systems engineer, and business owner. She was director of the Software Engineering Division of the MIT Instrumentation Laboratory, which developed on-board flight software for NASA\'s Apollo program. She later founded two software companies—Higher Order Software in 1976 and Hamilton Technologies in 1986, both in Cambridge, Massachusetts.\nHamilton has published more than 130 papers, proceedings, and reports, about sixty projects, and six major programs. She is one of the people credited with coining the term "software engineering".On November 22, 2016, Hamilton received the Presidential Medal of Freedom from president Barack Obama for her work leading to the development of on-board flight software for NASA\'s Apollo Moon missions.\n\n\n== Early life and education ==\nMargaret Elaine Heafield was born August 17, 1936, in Paoli, Indiana, to Kenneth Heafield and Ruth Esther Heafield (née Partington). The

In [None]:
## SUMMARIZATION PART

# imports
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from heapq import nlargest



In [None]:
# preparation of the text
stopwords = list(STOP_WORDS)
punctuation = punctuation + '\n'
nlp = spacy.load('en_core_web_sm')
doc = nlp(scrape(URL))
tokens = [token.text for token in doc]

tokens

['Margaret',
 'Heafield',
 'Hamilton',
 '(',
 'born',
 'August',
 '17',
 ',',
 '1936',
 ')',
 'is',
 'an',
 'American',
 'computer',
 'scientist',
 ',',
 'systems',
 'engineer',
 ',',
 'and',
 'business',
 'owner',
 '.',
 'She',
 'was',
 'director',
 'of',
 'the',
 'Software',
 'Engineering',
 'Division',
 'of',
 'the',
 'MIT',
 'Instrumentation',
 'Laboratory',
 ',',
 'which',
 'developed',
 'on',
 '-',
 'board',
 'flight',
 'software',
 'for',
 'NASA',
 "'s",
 'Apollo',
 'program',
 '.',
 'She',
 'later',
 'founded',
 'two',
 'software',
 'companies',
 '—',
 'Higher',
 'Order',
 'Software',
 'in',
 '1976',
 'and',
 'Hamilton',
 'Technologies',
 'in',
 '1986',
 ',',
 'both',
 'in',
 'Cambridge',
 ',',
 'Massachusetts',
 '.',
 '\n',
 'Hamilton',
 'has',
 'published',
 'more',
 'than',
 '130',
 'papers',
 ',',
 'proceedings',
 ',',
 'and',
 'reports',
 ',',
 'about',
 'sixty',
 'projects',
 ',',
 'and',
 'six',
 'major',
 'programs',
 '.',
 'She',
 'is',
 'one',
 'of',
 'the',
 'people'

In [None]:
# word frequencies
word_frequencies = {}
for word in doc:
    # skip stop words and punctuation (compared in lowercase)
    if word.text.lower() not in stopwords and word.text.lower() not in punctuation:
        # counts are keyed by the original, case-sensitive token text
        word_frequencies[word.text] = word_frequencies.get(word.text, 0) + 1

word_frequencies

{'Margaret': 4,
 'Heafield': 4,
 'Hamilton': 34,
 'born': 3,
 'August': 2,
 '17': 2,
 '1936': 2,
 'American': 1,
 'computer': 16,
 'scientist': 2,
 'systems': 12,
 'engineer': 2,
 'business': 1,
 'owner': 1,
 'director': 1,
 'Software': 5,
 'Engineering': 2,
 'Division': 2,
 'MIT': 8,
 'Instrumentation': 1,
 'Laboratory': 2,
 'developed': 8,
 'board': 4,
 'flight': 8,
 'software': 38,
 'NASA': 5,
 'Apollo': 13,
 'program': 5,
 'later': 4,
 'founded': 3,
 'companies': 1,
 '—': 1,
 'Higher': 2,
 'Order': 2,
 '1976': 2,
 'Technologies': 2,
 '1986': 2,
 'Cambridge': 3,
 'Massachusetts': 3,
 'published': 2,
 '130': 1,
 'papers': 1,
 'proceedings': 1,
 'reports': 1,
 'projects': 1,
 'major': 2,
 'programs': 3,
 'people': 1,
 'credited': 2,
 'coining': 1,
 'term': 6,
 'engineering': 13,
 'November': 2,
 '22': 1,
 '2016': 1,
 'received': 1,
 'Presidential': 1,
 'Medal': 1,
 'Freedom': 1,
 'president': 1,
 'Barack': 1,
 'Obama': 1,
 'work': 6,
 'leading': 1,
 'development': 8,
 'Moon': 4,
 'mis

In [None]:
# normalized word frequencies
max_frequency = max(word_frequencies.values())
for word in word_frequencies:
    word_frequencies[word] = word_frequencies[word] / max_frequency

word_frequencies

{'Margaret': 0.10526315789473684,
 'Heafield': 0.10526315789473684,
 'Hamilton': 0.8947368421052632,
 'born': 0.07894736842105263,
 'August': 0.05263157894736842,
 '17': 0.05263157894736842,
 '1936': 0.05263157894736842,
 'American': 0.02631578947368421,
 'computer': 0.42105263157894735,
 'scientist': 0.05263157894736842,
 'systems': 0.3157894736842105,
 'engineer': 0.05263157894736842,
 'business': 0.02631578947368421,
 'owner': 0.02631578947368421,
 'director': 0.02631578947368421,
 'Software': 0.13157894736842105,
 'Engineering': 0.05263157894736842,
 'Division': 0.05263157894736842,
 'MIT': 0.21052631578947367,
 'Instrumentation': 0.02631578947368421,
 'Laboratory': 0.05263157894736842,
 'developed': 0.21052631578947367,
 'board': 0.10526315789473684,
 'flight': 0.21052631578947367,
 'software': 1.0,
 'NASA': 0.13157894736842105,
 'Apollo': 0.34210526315789475,
 'program': 0.13157894736842105,
 'later': 0.10526315789473684,
 'founded': 0.07894736842105263,
 'companies': 0.026315789

In [None]:
# sentence tokens
sentence_tokens= [sent for sent in doc.sents]

sentence_tokens

[Margaret Heafield Hamilton (born August 17, 1936) is an American computer scientist, systems engineer, and business owner.,
 She was director of the Software Engineering Division of the MIT Instrumentation Laboratory, which developed on-board flight software for NASA's Apollo program.,
 She later founded two software companies—Higher Order Software in 1976 and Hamilton Technologies in 1986, both in Cambridge, Massachusetts.,
 Hamilton has published more than 130 papers, proceedings, and reports, about sixty projects, and six major programs.,
 She is one of the people credited with coining the term "software engineering".,
 On November 22, 2016, Hamilton received the Presidential Medal of Freedom from president Barack Obama for her work leading to the development of on-board flight software for NASA's Apollo Moon missions.,
 Margaret Elaine Heafield was born August 17, 1936, in Paoli, Indiana, to Kenneth Heafield and Ruth Esther Heafield (née Partington).,
 The family later moved to Mi

In [None]:
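# evaluate the sentences by giving them a score
# NOTE: word_frequencies is keyed by the original token text, so the lowercased
# lookup below only matches keys that are already lowercase; a word that only ever
# appears capitalized adds nothing to the score. Look up word.text to count it too.
sentence_scores = {}
for sent in sentence_tokens:
    for word in sent:
        if word.text.lower() in word_frequencies:
            if sent not in sentence_scores:
                sentence_scores[sent] = word_frequencies[word.text.lower()]
            else:
                sentence_scores[sent] += word_frequencies[word.text.lower()]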

sentence_scores

{Margaret Heafield Hamilton (born August 17, 1936) is an American computer scientist, systems engineer, and business owner.: 1.0789473684210529,
 She was director of the Software Engineering Division of the MIT Instrumentation Laboratory, which developed on-board flight software for NASA's Apollo program.: 3.026315789473684,
 She later founded two software companies—Higher Order Software in 1976 and Hamilton Technologies in 1986, both in Cambridge, Massachusetts.: 2.421052631578948,
 Hamilton has published more than 130 papers, proceedings, and reports, about sixty projects, and six major programs.: 0.3157894736842105,
 She is one of the people credited with coining the term "software engineering".: 1.6052631578947367,
 On November 22, 2016, Hamilton received the Presidential Medal of Freedom from president Barack Obama for her work leading to the development of on-board flight software for NASA's Apollo Moon missions.: 1.8947368421052633,
 Margaret Elaine Heafield was born August 17, 

In [None]:
# keep the top 30% of sentences, ranked by score
select_length = int(len(sentence_tokens) * 0.3)
summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)

summary

[Her areas of expertise include: systems design and software development, enterprise and process modeling, development paradigm, formal systems modeling languages, system-oriented objects for systems modeling and development, automated life-cycle environments, methods for maximizing software reliability and reuse, domain analysis, correctness by built-in language properties, open-architecture techniques for robust systems, full life-cycle automation, quality assurance, seamless integration, error detection and recovery techniques, human-machine interface systems, operating systems, end-to-end testing techniques, and life-cycle management techniques.,
 The asynchronous executive designed by J. Halcombe Laning was used by Hamilton's team to develop asynchronous flight software:
 Because of the flight software's system-software's error detection and recovery techniques that included its system-wide "kill and recompute" from a "safe place" restart approach to its snapshot and rollback tech

In [None]:
# get the final summary
# each selected sentence is a spaCy Span, so take its .text before joining
final_summary = [sent.text for sent in summary]
summary = ''.join(final_summary)
summary

'Her areas of expertise include: systems design and software development, enterprise and process modeling, development paradigm, formal systems modeling languages, system-oriented objects for systems modeling and development, automated life-cycle environments, methods for maximizing software reliability and reuse, domain analysis, correctness by built-in language properties, open-architecture techniques for robust systems, full life-cycle automation, quality assurance, seamless integration, error detection and recovery techniques, human-machine interface systems, operating systems, end-to-end testing techniques, and life-cycle management techniques.The asynchronous executive designed by J. Halcombe Laning was used by Hamilton\'s team to develop asynchronous flight software:\nBecause of the flight software\'s system-software\'s error detection and recovery techniques that included its system-wide "kill and recompute" from a "safe place" restart approach to its snapshot and rollback tech
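
The frequency-based pipeline above can be condensed into one function. The version below is a minimal sketch, not the notebook's code: it replaces spaCy with a naive regex sentence splitter and a tiny made-up stop-word list (`STOP`), counts words case-insensitively, and returns the selected sentences in document order. The function name `summarize_by_frequency` and the sample text are assumptions for illustration.

```python
import re
from collections import Counter
from heapq import nlargest

# Tiny illustrative stop-word list (an assumption; spaCy's STOP_WORDS is far larger).
STOP = {"the", "a", "an", "is", "are", "was", "of", "and", "in", "to", "it", "for", "on"}

def summarize_by_frequency(text, ratio=0.3):
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def words(sentence):
        # Lowercased word tokens with stop words removed.
        return [w for w in re.findall(r"[a-z0-9']+", sentence.lower()) if w not in STOP]

    # Word counts, normalized by the count of the most frequent word.
    counts = Counter(w for s in sentences for w in words(s))
    if not counts:
        return ""
    top = max(counts.values())
    weights = {w: c / top for w, c in counts.items()}

    # A sentence's score is the sum of the weights of its words.
    scores = {i: sum(weights[w] for w in words(s)) for i, s in enumerate(sentences)}

    # Keep the best-scoring sentences, then restore document order.
    k = max(1, int(len(sentences) * ratio))
    best = sorted(nlargest(k, scores, key=scores.get))
    return " ".join(sentences[i] for i in best)

text = ("Software engineering is a discipline. "
        "Hamilton led the software team. "
        "The weather was nice. "
        "Software errors were detected by the flight software.")
print(summarize_by_frequency(text, ratio=0.5))
```

Summing word weights favors long sentences, which is why long enumerations tend to rank highly with this kind of scoring; dividing each score by the sentence length is one possible adjustment.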

Next, we try the Gensim library, whose summarizer implements a variation of the TextRank algorithm. TextRank is a graph-based ranking algorithm that measures the relationships between text units and finds the most relevant sentences and keywords in a text. It is an extractive, unsupervised model.

It works as follows: the input text is first split into sentences, which are then represented as vectors. The similarities between those vectors are computed and stored in a similarity matrix. This matrix is converted into a graph whose vertices are the sentences and whose edges are weighted by the similarity scores. A PageRank-style ranking is then run on the graph, and the best-ranked sentences are selected for the summary.
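
The steps described above can be sketched in plain Python. This is an illustration under stated assumptions, not Gensim's implementation: sentences are represented as bag-of-words counts, similarity is cosine similarity, and the ranking is a weighted PageRank computed by power iteration. The function names, the `damping` and `iterations` defaults, and the sample text are all hypothetical.

```python
import math
import re
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def textrank_sentences(text, top_n=2, damping=0.85, iterations=50):
    # 1. Split into sentences and represent each one as a word-count vector.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    vectors = [Counter(re.findall(r"[a-z0-9]+", s.lower())) for s in sentences]
    n = len(sentences)

    # 2. Similarity matrix: sim[i][j] is the edge weight between sentences i and j.
    sim = [[cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    out_weight = [sum(row) for row in sim]

    # 3. Weighted PageRank by power iteration over the sentence graph.
    rank = [1.0 / n] * n if n else []
    for _ in range(iterations):
        rank = [
            (1 - damping) / n
            + damping * sum(sim[j][i] / out_weight[j] * rank[j]
                            for j in range(n) if out_weight[j])
            for i in range(n)
        ]

    # 4. Keep the top-ranked sentences and return them in document order.
    best = sorted(sorted(range(n), key=lambda i: rank[i], reverse=True)[:top_n])
    return [sentences[i] for i in best]

sample = ("The cat sat on the mat. The cat ate the fish. "
          "Stock prices fell sharply today. The cat chased the fish on the mat.")
result = textrank_sentences(sample, top_n=2)
for sentence in result:
    print(sentence)
```

A sentence that shares no words with the others has no edges, so it only receives the baseline `(1 - damping) / n` score and is ranked last.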

In [None]:
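# NOTE: the gensim.summarization module was removed in Gensim 4.0,
# so these imports require a Gensim 3.x release.
from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords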

In this example, we use the ```ratio``` parameter, which specifies the proportion of the original text's sentences to include in the summary: in our case, 20%.

In [None]:
summ_pre = summarize(wikicontent, ratio = 0.2)
print("Percent summary")
print(summ_pre)

Percent summary
Margaret Heafield Hamilton (born August 17, 1936) is an American computer scientist, systems engineer, and business owner.
She was director of the Software Engineering Division of the MIT Instrumentation Laboratory, which developed on-board flight software for NASA's Apollo program.
She later founded two software companies—Higher Order Software in 1976 and Hamilton Technologies in 1986, both in Cambridge, Massachusetts.
She is one of the people credited with coining the term "software engineering".On November 22, 2016, Hamilton received the Presidential Medal of Freedom from president Barack Obama for her work leading to the development of on-board flight software for NASA's Apollo Moon missions.
She developed software for predicting weather, programming on the LGP-30 and the PDP-1 computers at Marvin Minsky's Project MAC.
At the time, computer science and software engineering were not yet established disciplines; instead, programmers learned on the job with hands-on ex

In [None]:
summ_words = summarize(wikicontent, word_count=200)
print("Word count summary")
print(summ_words)

Word count summary
She was director of the Software Engineering Division of the MIT Instrumentation Laboratory, which developed on-board flight software for NASA's Apollo program.
She is one of the people credited with coining the term "software engineering".On November 22, 2016, Hamilton received the Presidential Medal of Freedom from president Barack Obama for her work leading to the development of on-board flight software for NASA's Apollo Moon missions.
From 1961 to 1963, Hamilton worked on the Semi-Automatic Ground Environment (SAGE) Project at the MIT Lincoln Lab, where she was one of the programmers who wrote software for the prototype AN/FSQ-7 computer (the XD-1), used by the U.S. Air Force to search for possibly unfriendly aircraft.
Hamilton's team was responsible for developing in-flight software, which included algorithms designed by various senior scientists for the Apollo command module, Apollo lunar module, and the subsequent Skylab.
This included error detection and reco

In [None]:
pip install sumy


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sumy
  Downloading sumy-0.11.0-py2.py3-none-any.whl (97 kB)
Collecting breadability>=0.1.20
  Downloading breadability-0.1.20.tar.gz (32 kB)
  Preparing metadata (setup.py) ... done
Collecting docopt<0.7,>=0.6.1
  Downloading docopt-0.6.2.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Collecting pycountry>=18.2.23
  Downloading pycountry-22.3.5.tar.gz (10.1 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: breadability, docopt, pycount

In [None]:
import sumy
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
parser=PlaintextParser.from_string(wikicontent, Tokenizer("english"))

In [None]:
summarizer = LexRankSummarizer()
summary = summarizer(parser.document, 10)
for sentence in summary:
  print(sentence)

She was director of the Software Engineering Division of the MIT Instrumentation Laboratory, which developed on-board flight software for NASA's Apollo program.
She studied mathematics at the University of Michigan in 1955 before transferring to Earlham College, where her mother was a student; she earned a BA in mathematics with a minor in philosophy in 1958.
I was the first one to get it to work.
It was her efforts on this project that made her a candidate for the position at NASA as the lead developer for Apollo flight software.
Hamilton was initially hired as a programmer for this process but moved on into systems design.
=== Apollo program === In one of the critical moments of the Apollo 11 mission, the Apollo Guidance Computer, together with the on-board flight software, averted an abort of the landing on the Moon.
It then sent out an alarm, which meant to the astronaut, 'I'm overloaded with more tasks than I should be doing at this time and I'm going to keep only the more importa

In [None]:
pip install bert-extractive-summarizer

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bert-extractive-summarizer
  Downloading bert_extractive_summarizer-0.10.1-py3-none-any.whl (25 kB)
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers, bert-extractive-summa

In [None]:
from summarizer import Summarizer
model = Summarizer()

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [None]:
result = model(wikicontent, ratio=0.2)

In [None]:
print(result)

Margaret Heafield Hamilton (born August 17, 1936) is an American computer scientist, systems engineer, and business owner. She was director of the Software Engineering Division of the MIT Instrumentation Laboratory, which developed on-board flight software for NASA's Apollo program. The family later moved to Michigan, where Margaret graduated from Hancock High School in 1954. She says her poet father and headmaster grandfather inspired her to include a minor in philosophy in her studies. Hamilton said:

What they used to do when you came into this organization as a beginner, was to assign you this program which nobody was able to ever figure out or get to run. When I was the beginner they gave it to me as well. Another part of her team designed and developed the systems software. She worked to gain hands-on experience during a time when computer science courses were uncommon and software engineering courses did not exist. Hamilton had thought long and hard about this. By some accounts,

In [None]:
result1 = model(wikicontent, num_sentences=5)
print(result1)

Margaret Heafield Hamilton (born August 17, 1936) is an American computer scientist, systems engineer, and business owner. === SAGE Project ===
From 1961 to 1963, Hamilton worked on the Semi-Automatic Ground Environment (SAGE) Project at the MIT Lincoln Lab, where she was one of the programmers who wrote software for the prototype AN/FSQ-7 computer (the XD-1), used by the U.S. Air Force to search for possibly unfriendly aircraft. === Businesses ===
In 1976, Hamilton co-founded with Saydean Zeldin a company called Higher Order Software (HOS) to further develop ideas about error prevention and fault tolerance emerging from their experience at MIT working on the Apollo program. In 2019, she was awarded The Washington Award. The relationship between design and verification".
