# **Scraping Data from a Static Website: A Wikipedia Example**

**In this project, we will learn how to scrape HTML tables from a webpage, step by step.**

## Install Necessary Libraries

In [32]:
!pip install requests
!pip install bs4
!pip install pandas



## Import libraries

In [1]:
# Requests is a library for sending HTTP requests and fetching web page content
import requests 

# BeautifulSoup is a library to parse HTML content
from bs4 import BeautifulSoup 
# Pandas is a library to handle and organize data in DataFrame format
import pandas as pd

# time is a module that provides various time-related functions; in this project we use its sleep() function
import time

# re is a module for regular expressions in Python
import re


## Define the headers and the website link to scrape

In [34]:
# The 'user-agent' value is specific to your system and browser. Sending it makes the request look like it comes from a browser.
HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"
}

# Website link
url = "https://en.wikipedia.org/wiki/List_of_largest_French_companies"

## Make a request and check whether it succeeded

In [40]:
# Sends the GET request to the URL with the 'user-agent' header set
response = requests.get(url, headers=HEADERS)

# Check the return status code
response.status_code

200

**response.status_code** returns 200, which means the HTTP request was executed successfully.

The **User-Agent** header is not always required, but it is recommended to include it to prevent some websites from blocking you.
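To make the status check explicit in a script, a small helper can turn a status code into a readable message. The following is a minimal sketch using only the standard library; the function name `describe_status` is hypothetical and is not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short, readable description of an HTTP status code."""
    # HTTPStatus maps a known code to its standard phrase, e.g. 200 -> "OK"
    phrase = HTTPStatus(code).phrase
    # Codes from 200 to 299 indicate that the request succeeded
    outcome = "success" if 200 <= code < 300 else "not successful"
    return f"{code} {phrase} ({outcome})"


print(describe_status(200))  # 200 OK (success)
print(describe_status(404))  # 404 Not Found (not successful)
```

With `requests`, calling `response.raise_for_status()` offers a comparable check: it raises an exception for 4xx and 5xx responses instead of returning a message.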



## Parse HTML with BeautifulSoup

In [41]:
# Parses the HTML content of the response using BeautifulSoup with the 'html.parser' to create a BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")

The variable **soup** contains the BeautifulSoup object.
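As a quick illustration of what a BeautifulSoup object offers, the sketch below parses a small hand-written HTML string (an assumption standing in for `response.text`) and reads two elements from it.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written HTML standing in for response.text
html = """
<html>
  <head><title>List of largest French companies - Wikipedia</title></head>
  <body><h1>List of largest French companies</h1></body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# The first tag of a given name can be reached as an attribute of the soup object
page_title = soup.title.get_text(strip=True)
first_heading = soup.h1.get_text(strip=True)

print(type(soup).__name__)  # BeautifulSoup
print(page_title)
print(first_heading)
```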

In [42]:
# Prints the prettified HTML content, making it more readable with proper indentation
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of largest French companies - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature

## Selecting Tags and Class or ID

With **BeautifulSoup** and its methods **find()** and **find_all()**, you can retrieve specific data by passing the tag, along with the class or ID as parameters to target the desired data.

* **find()**: returns the first HTML element found

* **find_all()**: returns a list of all matching HTML elements
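The difference between the two methods can be seen on a small example. The HTML below is hand-written for illustration and is not the Wikipedia page; the class name `does-not-exist` is deliberately made up to show what happens when nothing matches.

```python
from bs4 import BeautifulSoup

html = """
<table class="wikitable sortable"><tr><td>first table</td></tr></table>
<table class="wikitable sortable"><tr><td>second table</td></tr></table>
<table class="infobox"><tr><td>other table</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element, or None when nothing matches
first_match = soup.find("table", class_="wikitable sortable")
no_match = soup.find("table", class_="does-not-exist")

# find_all() returns every matching element (an empty list when nothing matches)
all_matches = soup.find_all("table", class_="wikitable sortable")

print(first_match.td.text)  # first table
print(len(all_matches))     # 2
print(no_match)             # None
```

Because `find()` can return `None`, it is worth checking the result before calling methods on it.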


In [59]:
# Find table with the class "wikitable sortable"
table = soup.find("table", class_="wikitable sortable")

In [60]:
table

<table class="wikitable sortable">
<caption>
</caption>
<tbody><tr>
<th>Rank
</th>
<th>Fortune 500<br/>rank
</th>
<th>Name
</th>
<th>Industry
</th>
<th>Revenue<br/><small>(USD millions)</small>
</th>
<th>Profits<br/><small>(USD millions)</small>
</th>
<th>Employees
</th>
<th>Headquarters
</th></tr>
<tr>
<td>1
</td>
<td>20
</td>
<td><a href="/wiki/TotalEnergies" title="TotalEnergies">TotalEnergies</a>
</td>
<td><a class="mw-redirect" href="/wiki/Oil_and_gas" title="Oil and gas">Oil and gas</a>
</td>
<td style="text-align:center;">218,945
</td>
<td style="text-align:center;">21,384
</td>
<td style="text-align:center;">102,579
</td>
<td><a href="/wiki/Courbevoie" title="Courbevoie">Courbevoie</a>
</td></tr>
<tr>
<td>2
</td>
<td>49
</td>
<td><a href="/wiki/%C3%89lectricit%C3%A9_de_France" title="Électricité de France">Électricité de France</a>
</td>
<td><a href="/wiki/Electric_utility" title="Electric utility">Electric utility</a>
</td>
<td style="text-align:center;">151,040
</td>
<td styl

The `table` variable contains the **first** table on the page of the given link, determined by the `table` tag and its class `"wikitable sortable"`.

In [62]:
# Finds all tables with the class "wikitable sortable"
tables = soup.find_all("table", class_="wikitable sortable")

The `tables` variable contains **all** the tables on the page of the given link that match the `table` tag and the class `"wikitable sortable"`.

In [63]:
len(tables)

2

In [65]:
tables

[<table class="wikitable sortable">
 <caption>
 </caption>
 <tbody><tr>
 <th>Rank
 </th>
 <th>Fortune 500<br/>rank
 </th>
 <th>Name
 </th>
 <th>Industry
 </th>
 <th>Revenue<br/><small>(USD millions)</small>
 </th>
 <th>Profits<br/><small>(USD millions)</small>
 </th>
 <th>Employees
 </th>
 <th>Headquarters
 </th></tr>
 <tr>
 <td>1
 </td>
 <td>20
 </td>
 <td><a href="/wiki/TotalEnergies" title="TotalEnergies">TotalEnergies</a>
 </td>
 <td><a class="mw-redirect" href="/wiki/Oil_and_gas" title="Oil and gas">Oil and gas</a>
 </td>
 <td style="text-align:center;">218,945
 </td>
 <td style="text-align:center;">21,384
 </td>
 <td style="text-align:center;">102,579
 </td>
 <td><a href="/wiki/Courbevoie" title="Courbevoie">Courbevoie</a>
 </td></tr>
 <tr>
 <td>2
 </td>
 <td>49
 </td>
 <td><a href="/wiki/%C3%89lectricit%C3%A9_de_France" title="Électricité de France">Électricité de France</a>
 </td>
 <td><a href="/wiki/Electric_utility" title="Electric utility">Electric utility</a>
 </td>
 <td st

The **tables** variable contains a list of HTML `<table>` elements.

## Retrieve the table headers

In [74]:
# Retrieve all rows in the table, including the header row
table_rows = table.find_all("tr")
table_rows

[<tr>
 <th>Rank
 </th>
 <th>Fortune 500<br/>rank
 </th>
 <th>Name
 </th>
 <th>Industry
 </th>
 <th>Revenue<br/><small>(USD millions)</small>
 </th>
 <th>Profits<br/><small>(USD millions)</small>
 </th>
 <th>Employees
 </th>
 <th>Headquarters
 </th></tr>,
 <tr>
 <td>1
 </td>
 <td>20
 </td>
 <td><a href="/wiki/TotalEnergies" title="TotalEnergies">TotalEnergies</a>
 </td>
 <td><a class="mw-redirect" href="/wiki/Oil_and_gas" title="Oil and gas">Oil and gas</a>
 </td>
 <td style="text-align:center;">218,945
 </td>
 <td style="text-align:center;">21,384
 </td>
 <td style="text-align:center;">102,579
 </td>
 <td><a href="/wiki/Courbevoie" title="Courbevoie">Courbevoie</a>
 </td></tr>,
 <tr>
 <td>2
 </td>
 <td>49
 </td>
 <td><a href="/wiki/%C3%89lectricit%C3%A9_de_France" title="Électricité de France">Électricité de France</a>
 </td>
 <td><a href="/wiki/Electric_utility" title="Electric utility">Electric utility</a>
 </td>
 <td style="text-align:center;">151,040
 </td>
 <td style="text-align:c

In [75]:
len(table_rows)

25

The list contains `25` rows, and the first row holds the headers.

In [None]:
# first row in list
table_rows[0]

<tr>
<th>Rank
</th>
<th>Fortune 500<br/>rank
</th>
<th>Name
</th>
<th>Industry
</th>
<th>Revenue<br/><small>(USD millions)</small>
</th>
<th>Profits<br/><small>(USD millions)</small>
</th>
<th>Employees
</th>
<th>Headquarters
</th></tr>

In [81]:
# Display the second row
table_rows[1]

<tr>
<td>1
</td>
<td>20
</td>
<td><a href="/wiki/TotalEnergies" title="TotalEnergies">TotalEnergies</a>
</td>
<td><a class="mw-redirect" href="/wiki/Oil_and_gas" title="Oil and gas">Oil and gas</a>
</td>
<td style="text-align:center;">218,945
</td>
<td style="text-align:center;">21,384
</td>
<td style="text-align:center;">102,579
</td>
<td><a href="/wiki/Courbevoie" title="Courbevoie">Courbevoie</a>
</td></tr>

In [82]:
# Display the last row
table_rows[-1]

<tr>
<td>24
</td>
<td>401
</td>
<td><a class="mw-redirect" href="/wiki/Air_France-KLM" title="Air France-KLM">Air France-KLM</a>
</td>
<td><a href="/wiki/Airline" title="Airline">Airline</a>
</td>
<td style="text-align:center;">32,452
</td>
<td style="text-align:center;">1,010
</td>
<td style="text-align:center;">76,271
</td>
<td>Paris
</td></tr>

As you can see from the Wikipedia page, the table has one header row and 24 rows of values.
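The split between header rows and value rows can also be checked in code rather than by eye. The sketch below uses a small hand-written table (an assumption, not the real page) and treats any row without a `<td>` cell as a header row.

```python
from bs4 import BeautifulSoup

html = """
<table class="wikitable sortable">
<tr><th>Rank</th><th>Name</th></tr>
<tr><td>1</td><td>TotalEnergies</td></tr>
<tr><td>2</td><td>Électricité de France</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")
rows = table.find_all("tr")

# Rows without any <td> cell are treated as header rows; the others hold values
header_rows = [row for row in rows if row.find("td") is None]
data_rows = [row for row in rows if row.find("td") is not None]

print(len(rows), len(header_rows), len(data_rows))  # 3 1 2
```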

In [None]:
# Use list comprehension to get the headers
tables_headers = [header.text.strip() for header in table_rows[0].find_all("th")]
tables_headers

['Rank',
 'Fortune 500rank',
 'Name',
 'Industry',
 'Revenue(USD millions)',
 'Profits(USD millions)',
 'Employees',
 'Headquarters']

In [94]:
# The list comprehension above is the concise form.
# The explicit loop below produces the same result:
list_headers = []
list_words = table_rows[0].find_all("th")
for i in range(len(list_words)):
    # Use the 'text' attribute to get just the text data and 'strip()' to remove '\n' or extra spaces
    word = list_words[i].text.strip()
    list_headers.append(word)

list_headers

['Rank',
 'Fortune 500rank',
 'Name',
 'Industry',
 'Revenue(USD millions)',
 'Profits(USD millions)',
 'Employees',
 'Headquarters']

The two techniques are shown side by side to make list comprehension easier to understand, since it is very useful for writing concise code.
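In the header lists, `'Fortune 500rank'` has lost the space between `500` and `rank` because the `<br/>` tag contributes no text. One possible remedy, shown here as an optional sketch on a hand-written fragment of the header row, is BeautifulSoup's `get_text()` with a separator. Note that using it would change the column names used in the rest of this notebook.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that mirrors two header cells of the table
html = """
<table><tr>
<th>Fortune 500<br/>rank
</th>
<th>Revenue<br/><small>(USD millions)</small>
</th>
</tr></table>
"""
header_cells = BeautifulSoup(html, "html.parser").find_all("th")

# .text concatenates the text pieces directly, so the line break disappears
joined = [cell.text.strip() for cell in header_cells]

# get_text() can strip each text piece and join the pieces with a separator
spaced = [cell.get_text(separator=" ", strip=True) for cell in header_cells]

print(joined)  # ['Fortune 500rank', 'Revenue(USD millions)']
print(spaced)  # ['Fortune 500 rank', 'Revenue (USD millions)']
```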

## Retrieve the table values

In [96]:
# Using slicing to remove the header row from the list
table_values = table_rows[1:]
table_values

[<tr>
 <td>1
 </td>
 <td>20
 </td>
 <td><a href="/wiki/TotalEnergies" title="TotalEnergies">TotalEnergies</a>
 </td>
 <td><a class="mw-redirect" href="/wiki/Oil_and_gas" title="Oil and gas">Oil and gas</a>
 </td>
 <td style="text-align:center;">218,945
 </td>
 <td style="text-align:center;">21,384
 </td>
 <td style="text-align:center;">102,579
 </td>
 <td><a href="/wiki/Courbevoie" title="Courbevoie">Courbevoie</a>
 </td></tr>,
 <tr>
 <td>2
 </td>
 <td>49
 </td>
 <td><a href="/wiki/%C3%89lectricit%C3%A9_de_France" title="Électricité de France">Électricité de France</a>
 </td>
 <td><a href="/wiki/Electric_utility" title="Electric utility">Electric utility</a>
 </td>
 <td style="text-align:center;">151,040
 </td>
 <td style="text-align:center;">10,828
 </td>
 <td style="text-align:center;">171,862
 </td>
 <td>Paris
 </td></tr>,
 <tr>
 <td>3
 </td>
 <td>64
 </td>
 <td><a href="/wiki/BNP_Paribas" title="BNP Paribas">BNP Paribas</a>
 </td>
 <td>Banking
 </td>
 <td style="text-align:center;"

In [None]:
# For each row, collect its <td> cells into a list
data_rows = [row.find_all("td") for row in table_values]
data_rows

[[<td>1
  </td>,
  <td>20
  </td>,
  <td><a href="/wiki/TotalEnergies" title="TotalEnergies">TotalEnergies</a>
  </td>,
  <td><a class="mw-redirect" href="/wiki/Oil_and_gas" title="Oil and gas">Oil and gas</a>
  </td>,
  <td style="text-align:center;">218,945
  </td>,
  <td style="text-align:center;">21,384
  </td>,
  <td style="text-align:center;">102,579
  </td>,
  <td><a href="/wiki/Courbevoie" title="Courbevoie">Courbevoie</a>
  </td>],
 [<td>2
  </td>,
  <td>49
  </td>,
  <td><a href="/wiki/%C3%89lectricit%C3%A9_de_France" title="Électricité de France">Électricité de France</a>
  </td>,
  <td><a href="/wiki/Electric_utility" title="Electric utility">Electric utility</a>
  </td>,
  <td style="text-align:center;">151,040
  </td>,
  <td style="text-align:center;">10,828
  </td>,
  <td style="text-align:center;">171,862
  </td>,
  <td>Paris
  </td>],
 [<td>3
  </td>,
  <td>64
  </td>,
  <td><a href="/wiki/BNP_Paribas" title="BNP Paribas">BNP Paribas</a>
  </td>,
  <td>Banking
  </td>,

In [None]:
# Define a list to store row values
list_values = []

# Loop to retrieve the row values
for i in range(len(data_rows)):
    # Define a variable that contains the row for processing
    row = data_rows[i]
    row_values = [element.text.strip() for element in row]

    # Add cleaned values to the list
    list_values.append(row_values)

list_values
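Before building a DataFrame, it can help to confirm that every cleaned row has as many values as there are headers. A row with a different cell count (for example because of merged cells) can leave values misaligned or, when it has more cells than there are headers, make the DataFrame construction fail. The sketch below uses hypothetical sample data shaped like `tables_headers` and `list_values`.

```python
# Hypothetical header list and cleaned rows, shaped like tables_headers and list_values
headers = ["Rank", "Name", "Headquarters"]
rows = [
    ["1", "TotalEnergies", "Courbevoie"],
    ["2", "Électricité de France"],  # one cell missing
    [],                              # row without any <td> cell
]

# Keep only the rows whose number of values matches the number of headers
complete_rows = [row for row in rows if len(row) == len(headers)]
skipped_rows = [row for row in rows if len(row) != len(headers)]

print(complete_rows)      # [['1', 'TotalEnergies', 'Courbevoie']]
print(len(skipped_rows))  # 2
```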

## Put the data into a DataFrame

In [118]:
data = pd.DataFrame(data=list_values, columns=tables_headers)
data

|  | Rank | Fortune 500rank | Name | Industry | Revenue(USD millions) | Profits(USD millions) | Employees | Headquarters |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20 | TotalEnergies | Oil and gas | 218945 | 21384 | 102579 | Courbevoie |
| 1 | 2 | 49 | Électricité de France | Electric utility | 151040 | 10828 | 171862 | Paris |
| 2 | 3 | 64 | BNP Paribas | Banking | 136073 | 11864 | 182656 | Paris |
| 3 | 4 | 104 | Société Générale | Banking | 99163 | 2695 | 124089 | Paris |
| 4 | 5 | 118 | Crédit Agricole | Banking | 93358 | 6863 | 75125 | Paris |
| 5 | 6 | 119 | Dior - LVMH | Apparel | 93137 | 6815 | 197141 | Paris |
| 6 | 7 | 122 | Carrefour | Retail | 91790 | 1794 | 305333 | Massy |
| 7 | 8 | 126 | Axa | Insurance | 90406 | 7772 | 94705 | Paris |
| 8 | 9 | 130 | Engie | Electric utility | 89258 | 2387 | 97297 | Courbevoie |
| 9 | 10 | 166 | Vinci | Construction | 75551 | 5083 | 279426 | Nanterre |
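Every value scraped with `.text` is a string, including figures such as `218,945` in the table's HTML. If the numbers are needed for calculations, the columns can be converted after the DataFrame is built. The sketch below works on two sample rows and a reduced set of columns chosen for illustration.

```python
import pandas as pd

# Two sample rows in the same shape as list_values (all values are strings)
rows = [
    ["1", "TotalEnergies", "218,945"],
    ["2", "Électricité de France", "151,040"],
]
df = pd.DataFrame(data=rows, columns=["Rank", "Name", "Revenue(USD millions)"])

# Remove the thousands separator, then convert the text to numbers
for column in ["Rank", "Revenue(USD millions)"]:
    df[column] = pd.to_numeric(df[column].str.replace(",", "", regex=False))

print(df.dtypes)
print(df["Revenue(USD millions)"].sum())  # 369985
```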


## Save the data into a CSV or XLSX file

In [119]:
# Save data into csv file
data.to_csv("largest-companies-in-french.csv", index=False, encoding="utf-8")

In [120]:
# Save data into xlsx file (to_excel needs an Excel writer such as openpyxl to be installed)
data.to_excel("largest-companies-in-french.xlsx", index=False)
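To confirm that a saved CSV file can be read back with its accented characters intact, a round trip through a temporary folder is a simple check. This is only a sketch: the two-row DataFrame and the file name `example.csv` are made up for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small sample DataFrame with an accented name
data = pd.DataFrame(
    {"Rank": [1, 2], "Name": ["TotalEnergies", "Électricité de France"]}
)

with tempfile.TemporaryDirectory() as folder:
    csv_path = Path(folder) / "example.csv"  # hypothetical file name
    # index=False leaves the row index out of the file
    data.to_csv(csv_path, index=False, encoding="utf-8")
    # Read the file back while the temporary folder still exists
    reloaded = pd.read_csv(csv_path, encoding="utf-8")

print(list(reloaded.columns))     # ['Rank', 'Name']
print(reloaded["Name"].tolist())  # ['TotalEnergies', 'Électricité de France']
```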

In [None]:
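# Rename the first column to include the year.
# liste_variables (the column headers) and rank_annees[ann] (the year) are built
# as in the scrape_tableau function defined later in this notebook.
liste_variables[0] = "Rank " + str(rank_annees[ann])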
liste_variables[0]

'Rank 2019'

The following code returns all rows of the table without the header.

In [None]:
# Get the data rows, skipping the first row (the headers).
# As the output shows, `table` here refers to the second table on the page.
lignes_data = table.find_all("tr")[1:]
# Display the first row
lignes_data[0]

<tr>
<td>1
</td>
<td>25
</td>
<td align="left"><a class="mw-redirect" href="/wiki/Total_SA" title="Total SA">Total</a>
</td>
<td align="left">Courbevoie
</td>
<td>184.2
</td>
<td>11.4
</td>
<td>256.8
</td>
<td>149.5
</td>
<td align="left">Oil and gas
</td></tr>

In [None]:
# List to store the extracted data rows
valeurs_lignes = []
for row in lignes_data:
    row_df = row.find_all("td")
    # Extract the value of each cell in the row
    ligne = [content.text.strip() for content in row_df]
    valeurs_lignes.append(ligne)

# Display the first two rows
valeurs_lignes[:2]

[['1',
  '25',
  'Total',
  'Courbevoie',
  '184.2',
  '11.4',
  '256.8',
  '149.5',
  'Oil and gas'],
 ['2',
  '34',
  'BNP Paribas',
  'Paris',
  '101.6',
  '8.4',
  '2,333.0',
  '68.7',
  'Banking']]

## Put the scraped data into a DataFrame

In [None]:
# Create a DataFrame from the extracted data and the column headers
df = pd.DataFrame(data=valeurs_lignes, columns=liste_variables)
df.head()

|  | Rank 2019 | Fortune 2000rank | Name | Headquarters | Revenue(USD billions) | Profits(USD billions) | Total Assets(USD billions) | Market Value(USD billions) | Industry |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | Total | Courbevoie | 184.2 | 11.4 | 256.8 | 149.5 | Oil and gas |
| 1 | 2 | 34 | BNP Paribas | Paris | 101.6 | 8.4 | 2333.0 | 68.7 | Banking |
| 2 | 3 | 85 | Axa | Paris | 139.7 | 2.2 | 1034.5 | 63.6 | Insurance |
| 3 | 4 | 104 | Crédit Agricole | Paris | 52.2 | 4.7 | 1856.9 | 38.4 | Banking |
| 4 | 5 | 114 | Sanofi | Paris | 40.7 | 5.1 | 127.4 | 102.0 | Pharmaceuticals |


## Save the data to a file

In [None]:
# Save the data to a CSV file
df.to_csv(f"List_of_largest_French_companies-{rank_annees[ann]}.csv", index=False)

In [None]:
# Save the data to an Excel file
# df.to_excel(f"List_of_largest_French_companies-{rank_annees[ann]}.xlsx", index=False)

=================================================================================================================================================

# Function to retrieve data tables from Wikipedia

In [None]:
import requests 
from bs4 import BeautifulSoup 
import pandas as pd
import time  # built-in Python module
import re  # built-in Python module


# The 'user-agent' headers are specific to your system. This helps mimic a browser.
HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"
}


def scrape_tableau(url):
    """
    Scrape the tables of a web page.
    Args:
        url (str): Link of the page to scrape.
    Returns: A list of DataFrames, each DataFrame representing one table extracted from the page.
        The list is empty if the request does not succeed.
    """
    # List to store the DataFrames extracted from the tables
    # (created before the status check so that the function can always return it)
    liste_tableaux = []

    # Send the GET request to the URL with the 'user-agent' header set
    response = requests.get(url, headers=HEADERS)
    
    # Pause so that consecutive requests do not overload the server
    time.sleep(3)
    
    # Check whether the request succeeded
    if response.status_code == 200:
        # Parse the response with BeautifulSoup using the HTML parser
        soup = BeautifulSoup(response.text, "html.parser")

        # Find all tables with the class "wikitable sortable"
        tables = soup.find_all("table", class_="wikitable sortable")
        
        # Find the headings of the year sections
        annee = soup.find_all("div", class_="mw-heading mw-heading2")
        
        # Get the heading texts as a list
        annees = [annee[i].find("h2").text for i in range(len(annee))]
        
        # List of the extracted years (digits only); headings without digits are skipped
        rank_annees = ["".join(re.findall(r'\d+', annees[i])) for i in range(len(annees))
                        if "".join(re.findall(r'\d+', annees[i])) != ""]

        # Loop over the tables and their corresponding years
        for table, ann in zip(tables, range(len(rank_annees))):
            # Extract the column header cells of the table
            variables = table.find_all("th")

            # List comprehension to get the header texts
            liste_variables = [variable.text.strip() for variable in variables]
            
            # Rename the first column to include the year
            liste_variables[0] = "Rank of " + str(rank_annees[ann])

            # Get the data rows, skipping the first row (the headers)
            lignes_data = table.find_all("tr")[1:]

            # List to store the extracted data rows
            valeurs_lignes = []
            for row in lignes_data:
                row_df = row.find_all("td")
                # Extract the value of each cell in the row
                ligne = [content.text.strip() for content in row_df]
                valeurs_lignes.append(ligne)

            # Create a DataFrame from the extracted data and the column headers
            df = pd.DataFrame(data=valeurs_lignes, columns=liste_variables)

            # Save the data to a CSV file
            df.to_csv(f"List_of_largest_French_companies-{rank_annees[ann]}.csv", index=False)
            # You can also save in Excel format by uncommenting the following line:
            # df.to_excel(f"List_of_largest_French_companies-{rank_annees[ann]}.xlsx", index=False)
            
            # Add the DataFrame to the list
            liste_tableaux.append(df)

    # Return the list of DataFrames
    return liste_tableaux
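The year extraction in `scrape_tableau` joins every run of digits found in a section title. The sketch below reproduces that step on hypothetical titles (the real headings on the page may differ) and compares it with a stricter pattern that keeps only a standalone four-digit year. The helper name `extraire_annee` is an assumption made for this example.

```python
import re

# Hypothetical section titles, similar to what the <h2> headings might contain
titres = ["2024 Fortune list", "2019 Forbes list", "See also", "References"]

# Same approach as in scrape_tableau: join every run of digits found in a title
annees_jointes = ["".join(re.findall(r"\d+", titre)) for titre in titres
                  if re.search(r"\d", titre)]


def extraire_annee(titre):
    """Return the first standalone year (19xx or 20xx) in the title, or None."""
    match = re.search(r"\b(?:19|20)\d{2}\b", titre)
    return match.group(0) if match else None


print(annees_jointes)                                  # ['2024', '2019']
# Joining all digits merges unrelated numbers into one string
print("".join(re.findall(r"\d+", "Top 10 in 2024")))   # 102024
print(extraire_annee("Top 10 in 2024"))                # 2024
print(extraire_annee("See also"))                      # None
```

The joined-digits approach works as long as each title contains a single number; the stricter pattern is a possible safeguard if titles ever include other figures.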

In [None]:
# The link in the "url" variable points to a Wikipedia page containing two data tables.
# The 'scrape_tableau' function is defined to retrieve both tables and store them in a list.

url = "https://en.wikipedia.org/wiki/List_of_largest_French_companies" 

# Call 'scrape_tableau' with the given URL to retrieve the tables.
liste_tableaux = scrape_tableau(url)

In [None]:

# Get the first extracted table
df0 = liste_tableaux[0]

# Display the first 5 rows of the DataFrame
df0.head(5)

|  | Rank of 2024 | Fortune 500rank | Name | Industry | Revenue(USD millions) | Profits(USD millions) | Employees | Headquarters |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20 | TotalEnergies | Oil and gas | 218945 | 21384 | 102579 | Courbevoie |
| 1 | 2 | 49 | Électricité de France | Electric utility | 151040 | 10828 | 171862 | Paris |
| 2 | 3 | 64 | BNP Paribas | Banking | 136073 | 11864 | 182656 | Paris |
| 3 | 4 | 104 | Société Générale | Banking | 99163 | 2695 | 124089 | Paris |
| 4 | 5 | 118 | Crédit Agricole | Banking | 93358 | 6863 | 75125 | Paris |


In [None]:
df1 = liste_tableaux[1]
df1.head()

|  | Rank of 2019 | Fortune 2000rank | Name | Headquarters | Revenue(USD billions) | Profits(USD billions) | Total Assets(USD billions) | Market Value(USD billions) | Industry |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | Total | Courbevoie | 184.2 | 11.4 | 256.8 | 149.5 | Oil and gas |
| 1 | 2 | 34 | BNP Paribas | Paris | 101.6 | 8.4 | 2333.0 | 68.7 | Banking |
| 2 | 3 | 85 | Axa | Paris | 139.7 | 2.2 | 1034.5 | 63.6 | Insurance |
| 3 | 4 | 104 | Crédit Agricole | Paris | 52.2 | 4.7 | 1856.9 | 38.4 | Banking |
| 4 | 5 | 114 | Sanofi | Paris | 40.7 | 5.1 | 127.4 | 102.0 | Pharmaceuticals |
