<a href="https://colab.research.google.com/github/Al3xGROS/WebScappingML/blob/main/WebScrappingML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping & Article Summarization

## Introduction

The goal of this project is to scrape any English Wikipedia page the user chooses and summarize it.
We first cover the scraping part of the project, then show how to summarize a text in Python.


## Web Scraping

Python offers many libraries and methods for scraping a page; we chose the libraries ```Requests``` and ```BeautifulSoup```.
The first one fetches the HTML content of a web page from its URL. The second one makes it easier to navigate the resulting HTML.

First we need to import the libraries:

In [1]:
# LIBRARIES
import requests
from bs4 import BeautifulSoup
import re

Next comes the function that scrapes the Wikipedia page and returns all of its paragraphs as one single string.
We collect the text of every ```<p>``` node and keep only the text from the beginning of the article onward, located by the first bold text on the page, which we take to be the title. A regex then removes bracketed markers such as reference numbers.

In [None]:
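URL = "https://en.wikipedia.org/wiki/Margaret_Hamilton_(software_engineer)"

# SCRAPING
def scrape(url):
  # Download the page, fail early on an HTTP error, and parse the HTML.
  page = requests.get(url)
  page.raise_for_status()
  soup = BeautifulSoup(page.content, 'html.parser')

  # The first <b> element is assumed to hold the article title in the lead sentence.
  title = soup.find("b")
  realTitle = title.get_text() if title else ""

  # Concatenate the text of every <p> node.
  text = ""
  for paragraph in soup.find_all("p"):
    text += paragraph.get_text()

  # Keep the text from the first occurrence of the title onward;
  # fall back to the full text if the title is not found.
  start = text.find(realTitle)
  result = text[start:] if start != -1 else text
  # Remove bracketed markers such as [1].
  return re.sub(r"\[[^\]]*\]", "", result)

scrape(URL)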

'Margaret Heafield Hamilton (born August 17, 1936) is an American computer scientist, systems engineer, and business owner. She was director of the Software Engineering Division of the MIT Instrumentation Laboratory, which developed on-board flight software for NASA\'s Apollo program. She later founded two software companies—Higher Order Software in 1976 and Hamilton Technologies in 1986, both in Cambridge, Massachusetts.\nHamilton has published more than 130 papers, proceedings, and reports, about sixty projects, and six major programs. She is one of the people credited with coining the term "software engineering".\nOn November 22, 2016, Hamilton received the Presidential Medal of Freedom from president Barack Obama for her work leading to the development of on-board flight software for NASA\'s Apollo Moon missions.\nMargaret Elaine Heafield was born August 17, 1936, in Paoli, Indiana, to Kenneth Heafield and Ruth Esther Heafield (née Partington). The family later moved to Michigan, w
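
The two text-cleaning steps used in ```scrape``` can be tried offline on a small string. The sketch below is only an illustration: the sample text and the helper names ```keep_from_title``` and ```strip_citations``` are hypothetical and do not appear in the notebook.

```python
import re

# Hypothetical sample; the bracketed markers mimic Wikipedia footnotes.
sample = "Coordinates: none. Margaret Hamilton coined the term.[1] She led the team[note 2] at MIT."

def keep_from_title(text, title):
    """Return the text starting at the first occurrence of the title."""
    start = text.find(title)
    # str.find returns -1 when the title is missing; keep the whole text then.
    return text if start == -1 else text[start:]

def strip_citations(text):
    """Remove every [...] group, using the same pattern as scrape()."""
    return re.sub(r"\[[^\]]*\]", "", text)

cleaned = strip_citations(keep_from_title(sample, "Margaret Hamilton"))
print(cleaned)
```

The guard on ```-1``` matters because slicing with ```text[-1:]``` would otherwise keep only the last character.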

If all you need is to scrape a Wikipedia page, there is a library called ```Wikipedia``` that is well suited to the task.

Let's see how it works.

We install the library.

In [2]:
pip install wikipedia

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11695 sha256=a96dfc30ff36947318c69016048c4d62b711a331cb48ddbad733d8d67a502f17
  Stored in directory: /root/.cache/pip/wheels/07/93/05/72c05349177dca2e0ba31a33ba4f7907606f7ddef303517c6a
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


We import it.
We can then get the page we want with ```wikipedia.page``` and read its text from the ```content``` attribute of the returned object.

In [6]:
import wikipedia
wikisearch = wikipedia.page("Margaret_Hamilton_(software_engineer)")
wikicontent = wikisearch.content

wikicontent

'Margaret Heafield Hamilton (born August 17, 1936) is an American computer scientist, systems engineer, and business owner. She was director of the Software Engineering Division of the MIT Instrumentation Laboratory, which developed on-board flight software for NASA\'s Apollo program. She later founded two software companies—Higher Order Software in 1976 and Hamilton Technologies in 1986, both in Cambridge, Massachusetts.\nHamilton has published more than 130 papers, proceedings, and reports, about sixty projects, and six major programs. She is one of the people credited with coining the term "software engineering".On November 22, 2016, Hamilton received the Presidential Medal of Freedom from president Barack Obama for her work leading to the development of on-board flight software for NASA\'s Apollo Moon missions.\n\n\n== Early life and education ==\nMargaret Elaine Heafield was born August 17, 1936, in Paoli, Indiana, to Kenneth Heafield and Ruth Esther Heafield (née Partington). The
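
As the output shows, the text returned by the library contains section markers such as ```== Early life and education ==```. If those headings should not reach the summarizer, they can be stripped with a regex. This is a minimal sketch on a hypothetical excerpt shaped like ```wikicontent```; the helper name ```drop_headings``` is an assumption, and so is the idea that subsections use more ```=``` signs in the same layout.

```python
import re

# Hypothetical excerpt in the same shape as `wikicontent`.
sample_content = (
    "She is credited with coining the term.\n\n\n"
    "== Early life and education ==\n"
    "Margaret Elaine Heafield was born in Paoli, Indiana.\n\n\n"
    "=== Career ===\n"
    "She joined MIT."
)

def drop_headings(content):
    """Remove '== Title ==' style lines, then collapse runs of newlines."""
    no_headings = re.sub(r"^=+ .*? =+[ \t]*$", "", content, flags=re.MULTILINE)
    return re.sub(r"\n{2,}", "\n", no_headings).strip()

plain = drop_headings(sample_content)
print(plain)
```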

In [None]:
## SUMMARIZATION PART

# imports
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from heapq import nlargest

In [None]:
# prepare the text: stop words, punctuation set, spaCy pipeline, tokens
stopwords = list(STOP_WORDS)
punctuation = punctuation + '\n'    # also treat newlines as punctuation
nlp = spacy.load('en_core_web_sm')
doc = nlp(scrape(URL))
tokens = [token.text for token in doc]

tokens

['Margaret', 'Heafield', 'Hamilton', '(', 'born', 'August', '17', ',', '1936', ')', 'is', 'an', 'American', 'computer', 'scientist', ',', 'systems', 'engineer', ',', 'and', 'business', 'owner', '.', 'She', 'was', 'director', 'of', 'the', 'Software', 'Engineering', 'Division', 'of', 'the', 'MIT', 'Instrumentation', 'Laboratory', ',', 'which', 'developed', 'on', '-', 'board', 'flight', 'software', 'for', 'NASA', "'s", 'Apollo', 'program', '.', 'She', 'later', 'founded', 'two', 'software', 'companies', '—', 'Higher', 'Order', 'Software', 'in', '1976', 'and', 'Hamilton', 'Technologies', 'in', '1986', ',', 'both', 'in', 'Cambridge', ',', 'Massachusetts', '.', '\n', 'Hamilton', 'has', 'published', 'more', 'than', '130', 'papers', ',', 'proceedings', ',', 'and', 'reports', ',', 'about', 'sixty', 'projects', ',', 'and', 'six', 'major', 'programs', '.', 'She', 'is', 'one', 'of', 'the', 'people', 'credited', 'with', 'coining', 'the', 'term', '"', 'software', 'engineering', '"', '.', '\n', 'On', 

In [None]:
# word frequencies: count every token that is neither a stop word nor punctuation
# (keys keep the original casing, so 'Software' and 'software' are counted separately)
word_frequencies = {}
for word in doc:
    lowered = word.text.lower()
    if lowered not in stopwords and lowered not in punctuation:
        word_frequencies[word.text] = word_frequencies.get(word.text, 0) + 1

word_frequencies

{'Margaret': 4, 'Heafield': 4, 'Hamilton': 34, 'born': 3, 'August': 2, '17': 2, '1936': 2, 'American': 1, 'computer': 16, 'scientist': 2, 'systems': 12, 'engineer': 2, 'business': 1, 'owner': 1, 'director': 1, 'Software': 5, 'Engineering': 2, 'Division': 2, 'MIT': 8, 'Instrumentation': 1, 'Laboratory': 2, 'developed': 8, 'board': 4, 'flight': 8, 'software': 38, 'NASA': 5, 'Apollo': 13, 'program': 5, 'later': 4, 'founded': 3, 'companies': 1, '—': 1, 'Higher': 2, 'Order': 2, '1976': 2, 'Technologies': 2, '1986': 2, 'Cambridge': 3, 'Massachusetts': 3, 'published': 2, '130': 1, 'papers': 1, 'proceedings': 1, 'reports': 1, 'projects': 1, 'major': 2, 'programs': 3, 'people': 1, 'credited': 2, 'coining': 1, 'term': 6, 'engineering': 13, 'November': 2, '22': 1, '2016': 1, 'received': 1, 'Presidential': 1, 'Medal': 1, 'Freedom': 1, 'president': 1, 'Barack': 1, 'Obama': 1, 'work': 6, 'leading': 1, 'development': 8, 'Moon': 4, 'missions': 2, 'Elaine': 1, 'Paoli': 1, 'Indiana': 2, 'Kenneth': 1, 'R

In [None]:
# normalized word frequencies: divide each count by the highest count
max_frequency = max(word_frequencies.values())
for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word] / max_frequency

word_frequencies

{'Margaret': 0.10526315789473684, 'Heafield': 0.10526315789473684, 'Hamilton': 0.8947368421052632, 'born': 0.07894736842105263, 'August': 0.05263157894736842, '17': 0.05263157894736842, '1936': 0.05263157894736842, 'American': 0.02631578947368421, 'computer': 0.42105263157894735, 'scientist': 0.05263157894736842, 'systems': 0.3157894736842105, 'engineer': 0.05263157894736842, 'business': 0.02631578947368421, 'owner': 0.02631578947368421, 'director': 0.02631578947368421, 'Software': 0.13157894736842105, 'Engineering': 0.05263157894736842, 'Division': 0.05263157894736842, 'MIT': 0.21052631578947367, 'Instrumentation': 0.02631578947368421, 'Laboratory': 0.05263157894736842, 'developed': 0.21052631578947367, 'board': 0.10526315789473684, 'flight': 0.21052631578947367, 'software': 1.0, 'NASA': 0.13157894736842105, 'Apollo': 0.34210526315789475, 'program': 0.13157894736842105, 'later': 0.10526315789473684, 'founded': 0.07894736842105263, 'companies': 0.02631578947368421, '—': 0.0263157894736
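
The counting and normalization steps can also be written compactly with ```collections.Counter```. The sketch below is a variant under stated assumptions, not the notebook's code: it uses a hypothetical token list and stop-word set in place of spaCy, and it lowercases the keys so that ```Software``` and ```software``` share one count.

```python
from collections import Counter
from string import punctuation

# Hypothetical inputs; the notebook uses spaCy tokens and STOP_WORDS instead.
tokens = ["Software", "engineering", "is", "the", "software",
          "discipline", ".", "Software", "matters", "."]
stop_words = {"is", "the"}

def normalized_frequencies(tokens, stop_words):
    """Count lowercased content words and scale counts to the range (0, 1]."""
    counts = Counter(
        t.lower() for t in tokens
        if t.lower() not in stop_words and t not in punctuation
    )
    if not counts:
        return {}
    top = max(counts.values())
    return {word: n / top for word, n in counts.items()}

freqs = normalized_frequencies(tokens, stop_words)
print(freqs)
```

The most frequent word always gets the value 1.0, which matches the ```'software': 1.0``` entry in the output above.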

In [None]:
# sentence tokens
sentence_tokens= [sent for sent in doc.sents]

sentence_tokens

[Margaret Heafield Hamilton (born August 17, 1936) is an American computer scientist, systems engineer, and business owner.,
 She was director of the Software Engineering Division of the MIT Instrumentation Laboratory, which developed on-board flight software for NASA's Apollo program.,
 She later founded two software companies—Higher Order Software in 1976 and Hamilton Technologies in 1986, both in Cambridge, Massachusetts.,
 Hamilton has published more than 130 papers, proceedings, and reports, about sixty projects, and six major programs.,
 She is one of the people credited with coining the term "software engineering".,
 On November 22, 2016, Hamilton received the Presidential Medal of Freedom from president Barack Obama for her work leading to the development of on-board flight software for NASA's Apollo Moon missions.,
 Margaret Elaine Heafield was born August 17, 1936, in Paoli, Indiana, to Kenneth Heafield and Ruth Esther Heafield (née Partington).,
 The family later moved to Mi

In [None]:
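# score each sentence by summing the normalized frequencies of its words
# (the lookup uses the lowercased text, so keys stored with capital letters,
#  such as 'Hamilton', never match and do not contribute to the score)
sentence_scores = {}
for sent in sentence_tokens:
    for word in sent:
        if word.text.lower() in word_frequencies.keys():
            if sent not in sentence_scores.keys():
                sentence_scores[sent] = word_frequencies[word.text.lower()]
            else:
                sentence_scores[sent] += word_frequencies[word.text.lower()]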

sentence_scores

{Margaret Heafield Hamilton (born August 17, 1936) is an American computer scientist, systems engineer, and business owner.: 1.0789473684210529,
 She was director of the Software Engineering Division of the MIT Instrumentation Laboratory, which developed on-board flight software for NASA's Apollo program.: 3.026315789473684,
 She later founded two software companies—Higher Order Software in 1976 and Hamilton Technologies in 1986, both in Cambridge, Massachusetts.: 2.421052631578948,
 Hamilton has published more than 130 papers, proceedings, and reports, about sixty projects, and six major programs.: 0.3157894736842105,
 She is one of the people credited with coining the term "software engineering".: 1.6052631578947367,
 On November 22, 2016, Hamilton received the Presidential Medal of Freedom from president Barack Obama for her work leading to the development of on-board flight software for NASA's Apollo Moon missions.: 1.8947368421052633,
 Margaret Elaine Heafield was born August 17, 
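
The scoring rule is easy to check by hand on a toy input. This is a minimal sketch with hypothetical frequencies and sentences; it splits on whitespace instead of using spaCy, and it assumes the frequency keys are lowercase so that every occurrence of a word is matched.

```python
def score_sentences(sentences, frequencies):
    """Sum the normalized frequency of every known word in each sentence."""
    scores = {}
    for sentence in sentences:
        # Crude tokenization: split on spaces and trim surrounding punctuation.
        words = [w.strip(".,;:!?\"'()").lower() for w in sentence.split()]
        score = sum(frequencies.get(w, 0.0) for w in words)
        if score > 0:  # sentences without any known word get no entry
            scores[sentence] = score
    return scores

# Hypothetical values chosen so the sums are easy to verify.
frequencies = {"software": 1.0, "apollo": 0.5, "flight": 0.25}
sentences = [
    "Software for Apollo flight software.",
    "She was born in Indiana.",
    "The Apollo program ended.",
]
scores = score_sentences(sentences, frequencies)
print(scores)
```

Because the score is a plain sum, long sentences that repeat frequent words tend to rank highest.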

In [None]:
# keep the top-scoring 30% of the sentences
select_length = int(len(sentence_tokens) * 0.3)
summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)

summary

[Her areas of expertise include: systems design and software development, enterprise and process modeling, development paradigm, formal systems modeling languages, system-oriented objects for systems modeling and development, automated life-cycle environments, methods for maximizing software reliability and reuse, domain analysis, correctness by built-in language properties, open-architecture techniques for robust systems, full life-cycle automation, quality assurance, seamless integration, error detection and recovery techniques, human-machine interface systems, operating systems, end-to-end testing techniques, and life-cycle management techniques.,
 The asynchronous executive designed by J. Halcombe Laning was used by Hamilton's team to develop asynchronous flight software:
 Because of the flight software's system-software's error detection and recovery techniques that included its system-wide "kill and recompute" from a "safe place" restart approach to its snapshot and rollback tech

In [None]:
# get the final summary: convert each spaCy span to a string and concatenate
final_summary = [sent.text for sent in summary]
summary = ''.join(final_summary)
summary

'Her areas of expertise include: systems design and software development, enterprise and process modeling, development paradigm, formal systems modeling languages, system-oriented objects for systems modeling and development, automated life-cycle environments, methods for maximizing software reliability and reuse, domain analysis, correctness by built-in language properties, open-architecture techniques for robust systems, full life-cycle automation, quality assurance, seamless integration, error detection and recovery techniques, human-machine interface systems, operating systems, end-to-end testing techniques, and life-cycle management techniques.The asynchronous executive designed by J. Halcombe Laning was used by Hamilton\'s team to develop asynchronous flight software:\nBecause of the flight software\'s system-software\'s error detection and recovery techniques that included its system-wide "kill and recompute" from a "safe place" restart approach to its snapshot and rollback tech