# Extracting data from a website

To make HTTP requests, we will use the `Requests` package. Its `get` method retrieves the HTML code of the page.


In [4]:
import requests
url = "https://www.gov.uk/search/news-and-communications"

page = requests.get(url)

print(page.content)

b'<!DOCTYPE html>\n<!--[if lt IE 9]><html class="lte-ie8 govuk-template" lang="en"><![endif]--><!--[if gt IE 8]><!--><html class="govuk-template" lang="en">\n<!--<![endif]-->\n  <head>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n<meta property="og:description" content="Find news and communications from government">\n<meta property="og:title" content="News and communications">\n<meta property="og:url" content="https://www.gov.uk/search/news-and-communications">\n<meta property="og:type" content="article">\n<meta property="og:site_name" content="GOV.UK">\n<meta name="govuk:base_title" content="News and communications - GOV.UK">\n<meta name="govuk:search-result-count" content="117112">\n<meta name="twitter:card" content="summary">\n<meta name="govuk:public-updated-at" content="2021-02-09T10:00:56.000+00:00">\n<meta name="govuk:updated-at" content="2021-02-09T10:00:56.756Z">\n<meta name="govuk:first-published-at" content="2019-02-01T12:31:51.000+00:00">\n<meta nam
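Before parsing, it can be worth checking that the request actually succeeded, since `requests.get` does not raise an error for a 404 or 500 response on its own. A minimal sketch, assuming a hypothetical helper named `fetch_html` and an arbitrary 10-second timeout (neither appears in the original notebook):

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw HTML bytes of `url` (hypothetical helper, not from the notebook)."""
    # The timeout value is an arbitrary choice; without one, a stalled server can block forever.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on a 4xx/5xx status instead of silently parsing an error page.
    response.raise_for_status()
    return response.content
```

It would be called as `fetch_html("https://www.gov.uk/search/news-and-communications")`, and the returned bytes can be passed to the parser in the same way as `page.content`.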

Now that we have the full HTML code, we need to parse the data before we can use it.

Parsing means "going through the content of a text or file and analyzing it in order to check its syntax or extract elements from it."

To do this, we will use the Beautiful Soup package:

## Parsing the data with Beautiful Soup

We will select elements by their HTML `class` and `id` attributes.

`pip install beautifulsoup4`

The `soup` variable below exposes the methods that make it easier to extract data from HTML.



In [5]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, 'html.parser')

print(soup)

<!DOCTYPE html>

<!--[if lt IE 9]><html class="lte-ie8 govuk-template" lang="en"><![endif]--><!--[if gt IE 8]><!--><html class="govuk-template" lang="en">
<!--<![endif]-->
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="Find news and communications from government" property="og:description"/>
<meta content="News and communications" property="og:title"/>
<meta content="https://www.gov.uk/search/news-and-communications" property="og:url"/>
<meta content="article" property="og:type"/>
<meta content="GOV.UK" property="og:site_name"/>
<meta content="News and communications - GOV.UK" name="govuk:base_title"/>
<meta content="117112" name="govuk:search-result-count"/>
<meta content="summary" name="twitter:card"/>
<meta content="2021-02-09T10:00:56.000+00:00" name="govuk:public-updated-at"/>
<meta content="2021-02-09T10:00:56.756Z" name="govuk:updated-at"/>
<meta content="2019-02-01T12:31:51.000+00:00" name="govuk:first-published-at"/>
<meta content="t

In [6]:
soup.title

<title>News and communications - GOV.UK</title>

In [7]:
soup.title.string

'News and communications - GOV.UK'

In [8]:
soup.find_all('a')

[<a class="gem-c-skip-link govuk-skip-link govuk-!-display-none-print" data-module="govuk-skip-link" href="#content">Skip to main content</a>,
 <a class="govuk-link" href="/help/cookies">View cookies</a>,
 <a class="govuk-link" data-module="gem-track-click" data-track-action="Cookie banner settings clicked from confirmation" data-track-category="cookieBanner" href="/help/cookies">change your cookie settings</a>,
 <a class="govuk-header__link govuk-header__link--homepage" data-track-action="logoLink" data-track-category="headerClicked" data-track-dimension="GOV.UK" data-track-dimension-index="29" data-track-label="https://www.gov.uk" href="https://www.gov.uk" id="logo" title="Go to the GOV.UK homepage">
 <span class="govuk-header__logotype">
 <!--[if gt IE 8]><!-->
 <svg aria-hidden="true" class="govuk-header__logotype-crown gem-c-layout-super-navigation-header__logotype-crown" focusable="false" height="30" viewbox="0 0 132 97" width="36" xmlns="http://www.w3.org/2000/svg">
 <path d="M2

In [11]:
soup.find(id="lien1")

In [14]:
soup.find_all("p", class_="title")

[]
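Both lookups above come back empty: `find` returns `None` when nothing matches (which is why the `id="lien1"` cell shows no output), and `find_all` returns an empty list. The sketch below runs the same kinds of lookups on a made-up HTML fragment, so that the matching and non-matching cases can be compared side by side. The fragment and the `demo_soup` name are illustrative assumptions, not taken from GOV.UK.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment, used only for illustration (not taken from GOV.UK).
html = """
<div>
  <a id="lien1" class="title" href="/first">First link</a>
  <a class="title" href="/second">Second link</a>
  <p class="intro">Some text</p>
</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None when nothing matches.
first_link = demo_soup.find(id="lien1")
missing = demo_soup.find(id="does-not-exist")

# `class` is a reserved word in Python, hence the `class_` keyword argument.
title_links = demo_soup.find_all("a", class_="title")
title_paragraphs = demo_soup.find_all("p", class_="title")

print(first_link.string)   # First link
print(missing)             # None
print(len(title_links))    # 2
print(title_paragraphs)    # []
```

Because `find` can return `None`, it is safer to test the result before reading attributes such as `.string` from it.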

In [20]:
titres = soup.find_all("a", class_="gem-c-document-list__item-title")
print(titres)

[<a class="gem-c-document-list__item-title govuk-link" data-ecommerce-index="1" data-ecommerce-path="/government/news/uk-and-us-launch-innovation-prize-challenges-in-privacy-enhancing-technologies-to-tackle-financial-crime-and-public-health-emergencies" data-ecommerce-row="1" data-track-action="News and communications.1" data-track-category="navFinderLinkClicked" data-track-label="/government/news/uk-and-us-launch-innovation-prize-challenges-in-privacy-enhancing-technologies-to-tackle-financial-crime-and-public-health-emergencies" data-track-options='{"dimension28":20,"dimension29":"UK and US launch innovation prize challenges in privacy-enhancing technologies to tackle financial crime and public health emergencies"}' href="/government/news/uk-and-us-launch-innovation-prize-challenges-in-privacy-enhancing-technologies-to-tackle-financial-crime-and-public-health-emergencies">UK and US launch innovation prize challenges in privacy-enhancing technologies to tackle financial crime and publ

In [21]:
for title in titres:
  print(title.string)

UK and US launch innovation prize challenges in privacy-enhancing technologies to tackle financial crime and public health emergencies
UK launches Israel talks to boost trade between services superpowers
Independent review of UKRI published
The Sizewell C Project development consent decision announced
Government to strengthen and modernise reservoir safety regime
Defra response to OEP report Taking stock: protecting, restoring and improving the environment in England
Further support for small businesses feeling the squeeze as £4.5 billion Recovery Loan Scheme extended
Draft statutory instrument: The Renewable Transport Fuels Obligations (Amendment) Order 2022
Easier access to locally-applied HRT to treat postmenopausal vaginal symptoms in landmark MHRA reclassification
Crackdown on corrupt elites abusing UK legal system to silence critics
UK House Price Index for May 2022
Major milestone as work is set to begin at York Central
DVLA announces change in the law to enable more healthcare 
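The title links also carry an `href` attribute, and on this page those paths are relative (they start with `/government/...`). A short sketch of how the link target could be collected next to the title, assuming a made-up link whose path is hypothetical:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://www.gov.uk/search/news-and-communications"

# Made-up link for illustration; the href path is hypothetical.
html = (
    '<a class="gem-c-document-list__item-title govuk-link" '
    'href="/government/news/example-article">Example article</a>'
)
demo_soup = BeautifulSoup(html, "html.parser")

links = []
for a in demo_soup.find_all("a", class_="gem-c-document-list__item-title"):
    # Tag attributes are read like dictionary keys;
    # urljoin resolves the relative path against the page URL.
    links.append((a.get_text(strip=True), urljoin(base_url, a["href"])))

print(links)
# [('Example article', 'https://www.gov.uk/government/news/example-article')]
```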

In [23]:
descriptions = soup.find_all("p", class_="gem-c-document-list__item-description")

for description in descriptions:
  print(description.string)

UK and US launch innovation prize challenges in privacy-enhancing technologies.
Trade Secretary Anne-Marie Trevelyan will today [20 July] launch negotiations between the UK and Israel for a new, innovation-focused trade deal.
Independent review of UK Research and Innovation (UKRI), led by Sir David Grant, published by Department for Business Energy & Industrial Strategy.
Today, 20 July 2022, the Sizewell C Project application has been granted development consent by the Secretary of State for Business, Energy and Industrial Strategy.   
Reform of the regulatory programme will be delivered in collaboration with reservoir owners and engineers over the coming years
Response from Defra to the Office for Environmental Protection’s first monitoring report on the government’s 25 Year Environment Plan
The Government has extended a vital support scheme offering Government-backed loans to small businesses for a further two years
Announces publication of a draft statutory instrument that will help
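One caveat with `.string`: it only returns text when the tag has a single child, and it keeps surrounding whitespace (note the trailing spaces in one of the descriptions above). When a tag contains nested markup, `.string` is `None`, whereas `get_text()` still returns the combined text. A small sketch on made-up paragraphs (the class names `simple` and `nested` are illustrative assumptions):

```python
from bs4 import BeautifulSoup

# Made-up paragraphs, used only for illustration.
html = """
<p class="simple">  Plain description.   </p>
<p class="nested">Published by <strong>Defra</strong> today</p>
"""
demo_soup = BeautifulSoup(html, "html.parser")
simple = demo_soup.find("p", class_="simple")
nested = demo_soup.find("p", class_="nested")

# .string keeps the whitespace, and is None when the tag has several children.
print(simple.string == "  Plain description.   ")  # True
print(nested.string)                               # None

# get_text() concatenates all text pieces; strip=True trims each piece,
# and the first argument is the separator placed between pieces.
print(simple.get_text(strip=True))       # Plain description.
print(nested.get_text(" ", strip=True))  # Published by Defra today
```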

# Complete code for extraction and saving to CSV:

In [24]:
import requests
from bs4 import BeautifulSoup
import csv

# URL of the page to scrape
url = "https://www.gov.uk/search/news-and-communications"
reponse = requests.get(url)
page = reponse.content

# display the HTML page
# print(page)

# parse the HTML into a BeautifulSoup object
soup = BeautifulSoup(page, "html.parser")

# collect all the titles
titres = soup.find_all("a", class_="gem-c-document-list__item-title")
titre_textes = []
for titre in titres:
    titre_textes.append(titre.string)

# collect all the descriptions
descriptions = soup.find_all("p", class_="gem-c-document-list__item-description")
description_textes = []
for description in descriptions:
    description_textes.append(description.string)

# create the data.csv file
en_tete = ['titre', 'description']
# newline='' is what the csv module expects; it avoids blank lines between rows on Windows
with open('data.csv', 'w', newline='', encoding='utf-8') as fichier_csv:
    writer = csv.writer(fichier_csv, delimiter=',')
    writer.writerow(en_tete)
    # zip iterates over both lists in parallel
    for titre, description in zip(titre_textes, description_textes):
        writer.writerow([titre, description])
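Pairing two independent lists with `zip` works as long as every result has both a title and a description. If one item had no description, the two lists would drift out of step and `zip` would silently pair the wrong rows. One possible alternative, sketched here on a made-up fragment, is to loop over each result container and look up the title and description inside it. The container class `gem-c-document-list__item` is an assumption about the page markup, and the CSV is written to an in-memory buffer only to keep the example self-contained.

```python
import csv
import io

from bs4 import BeautifulSoup

# Made-up fragment; the container class name is an assumption about the page markup.
html = """
<ul>
  <li class="gem-c-document-list__item">
    <a class="gem-c-document-list__item-title" href="/x">Title one</a>
    <p class="gem-c-document-list__item-description">Description one</p>
  </li>
  <li class="gem-c-document-list__item">
    <a class="gem-c-document-list__item-title" href="/y">Title two</a>
  </li>
</ul>
"""
demo_soup = BeautifulSoup(html, "html.parser")

rows = []
for item in demo_soup.find_all("li", class_="gem-c-document-list__item"):
    # Search inside the current item only, so title and description stay paired.
    title = item.find("a", class_="gem-c-document-list__item-title")
    description = item.find("p", class_="gem-c-document-list__item-description")
    rows.append([
        title.get_text(strip=True) if title else "",
        description.get_text(strip=True) if description else "",
    ])

# Write to an in-memory buffer instead of a file for this illustration.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["titre", "description"])
writer.writerows(rows)
print(buffer.getvalue())
```

The second item has no description, so its row gets an empty string in that column instead of borrowing the description of a later item.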
