# Data Scraping
Let's download everything telecom-related.

Main sources:
- www.telecom-paris.fr
- https://synapses.telecom-paris.fr/catalogue/2024-2025
- https://fr.wikipedia.org/wiki/T%C3%A9l%C3%A9com_Paris
- https://eole.telecom-paris.fr/ (requires authentication, so we will not use it for now)
- https://doc.telecom-paris.fr/

To run the code, run the first cell, then the cells of the section you want.
Every saved page follows this template:

> ```md
> # Title: title
> 
> ## this > is > a > path
> 
> content
> ```
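
As a minimal sketch of the template above, the helper below assembles a page from its three parts. The name `format_page` and its arguments are hypothetical; the notebook builds these strings inline with f-strings instead.

```python
def format_page(title, path, content):
    """Assemble a page following the "# Title / ## path / content" template.

    `path` is a list of breadcrumb labels (hypothetical argument).
    """
    breadcrumb = " > ".join(path)
    return f"# Title: {title}\n\n## {breadcrumb}\n\n{content}\n"


print(format_page("title", ["this", "is", "a", "path"], "content"))
```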

In [1]:
# Imports
from bs4 import BeautifulSoup
import requests
import html2text as h2t
import os
import pandas

BASE_PATH = "data/"
show_examples = True

## Let's download synapses
This site is hard to scrape because its content is generated dynamically. I used a browser extension to collect, by hand, the links to every page from the [catalogue](https://synapses.telecom-paris.fr/catalogue/2024-2025) (19 pages).
The result is a CSV file with the links to all course units (UE).

In [2]:
def get_clean_text_synapses(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract title
    title = soup.title.string

    # Remove elements with class "container header"
    for container in soup.find_all(class_="container header"):  # This is the stuff at the top
        container.decompose()
    for footer in soup.find_all(id = "footer"): # This is the stuff at the bottom
        footer.decompose()
    for script in soup.find_all('script'):
        script.decompose()
    for style in soup.find_all('style'):
        style.decompose()
        
    # Convert the HTML content to Markdown
    h = h2t.HTML2Text()
    h.ignore_links = False
    h.ignore_images = True
    h.ignore_emphasis = True
    
    # Format the text
    cleaned_content = h.handle(str(soup))
    final_markdown = f"# Title: {title}\n\n{cleaned_content}"
    return final_markdown

# Example usage
if show_examples:
    url = "https://synapses.telecom-paris.fr/catalogue/2024-2025/ue/2225/CSC-4IG01-TP-interactive-3d-application-development?from=P5173"
    clean_text = get_clean_text_synapses(url)
    print(clean_text)

print("Done")

# Title: UE CSC_4IG01_TP | Catalogue 2024-2025 | SynapseS

#  [ ](/catalogue/2024-2025) Enseignement scientifique & technique \-
CSC_4IG01_TP : Interactive 3D Application Development

## Domaine > Informatique, Image-Données-Signal.

## Descriptif

This course introduces the fundamental knowledge of developing interactive
applications using OpenGL. The basics of computer graphics and graphics
processor unit (GPU) programming are discussed. In training, C++, OpenGL, and
object-oriented programming are used for practical exercises.

  * Informations générales
  * Pré-requis
  * Acquisition de l'UE
  * Descriptif & programme

24 heures en présentiel (16 blocs ou créneaux)

### Diplôme(s) concerné(s)

  * [Echange international non diplomant](/catalogue/2024-2025/diplome/1/PEI-echange-international-non-diplomant)
  * [Master M2 - Interaction, Graphic & Design](/catalogue/2024-2025/diplome/26/M2IGD-master-m2-interaction-graphic-design)
  * [Diplôme d'ingénieur](/catalogue/2024-2025/diplome/

In [10]:
# Combine the csv files into one

csv_folder = "synapses_raw/links"

files = os.listdir(csv_folder)

combined_csv = []

for file in files:
    with open(f"{csv_folder}/{file}", "r") as f:
        lines = f.readlines()
        for line in lines:
            # Remove first and last character
            line = line[1:-1]
            # Remove everything after a ? or a #
            line = line.split("?")[0]
            line = line.split("#")[0]
            # Keep only links that start with the catalogue URL
            if line.startswith("https://synapses.telecom-paris.fr/catalogue/2024-2025"):
                combined_csv.append(line)

combined_csv = list(set(combined_csv))

print(f"Number of links: {len(combined_csv)}")

# Save the links to a new csv file
df = pandas.DataFrame(combined_csv)
df.to_csv("synapses_raw/combined_links.csv", index=False, header=False)

print("Done")


Number of links: 407
Done
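
The link clean-up in the cell above (drop the query string and the fragment, keep only catalogue links) could also be written with `urllib.parse`. This is a sketch under assumptions: `clean_link` is a hypothetical helper, and each CSV line is assumed to hold one URL, possibly wrapped in double quotes.

```python
from urllib.parse import urlsplit, urlunsplit

CATALOGUE_PREFIX = "https://synapses.telecom-paris.fr/catalogue/2024-2025"


def clean_link(raw_line):
    """Return the URL without query or fragment, or None if it is off-catalogue."""
    # Assumption: one URL per line, optionally surrounded by double quotes.
    url = raw_line.strip().strip('"')
    parts = urlsplit(url)
    # Rebuild the URL from scheme, host and path only.
    url = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return url if url.startswith(CATALOGUE_PREFIX) else None


raw = '"https://synapses.telecom-paris.fr/catalogue/2024-2025/ue/2225/CSC-4IG01-TP-interactive-3d-application-development?from=P5173"\n'
print(clean_link(raw))
```

Collecting the results in a `set` and sorting them before saving would keep the CSV stable from one run to the next.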


In [3]:
# Let's now download the full website (raw html).

synapses_path = "synapses_raw/raw_html"

# Load the links
df = pandas.read_csv("synapses_raw/combined_links.csv", header=None)
links = df[0].tolist()

# Create the folder
if not os.path.exists(synapses_path):
    os.makedirs(synapses_path)

# Download the html
for i, link in enumerate(links):
    print(f"Downloading {i+1}/{len(links)}")
    response = requests.get(link)
    with open(f"{synapses_path}/{i}.html", "w") as f:
        f.write(response.text)

print("Done")

Downloading 1/407
Downloading 2/407
Downloading 3/407

In [None]:
# Now, let's convert the html to markdown

synapses_path = "synapses_raw/raw_html"
markdown_path = BASE_PATH + "synapses"

# Create the folder
if not os.path.exists(markdown_path):
    os.makedirs(markdown_path)

# Load the html files
files = os.listdir(synapses_path)

for i, file in enumerate(files):
    print(f"Converting {i+1}/{len(files)}")
    with open(f"{synapses_path}/{file}", "r") as f:
        html = f.read()
        soup = BeautifulSoup(html, 'html.parser')

        # Extract title
        title = soup.title.string

        # Remove elements with class "container header"
        for container in soup.find_all(class_="container header"):  # This is the stuff at the top
            container.decompose()
        for footer in soup.find_all(id = "footer"): # This is the stuff at the bottom
            footer.decompose()
        for script in soup.find_all('script'):
            script.decompose()
        for style in soup.find_all('style'):
            style.decompose()
            
        # Convert the HTML content to Markdown
        h = h2t.HTML2Text()
        h.ignore_links = False
        h.ignore_images = True
        h.ignore_emphasis = True
        
        # Format the text
        cleaned_content = h.handle(str(soup))
        final_markdown = f"# Title: {title}\n\n{cleaned_content}"
        with open(f"{markdown_path}/{i}.md", "w", encoding="utf-8") as f:
            f.write(final_markdown)

print("Done")


## Let's download telecom-paris.fr
No trouble this time. The pages only need a little processing before they can be converted to Markdown.

In [89]:
def get_clean_text_telecom(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    h = h2t.HTML2Text()
    h.ignore_images = True
    h.ignore_emphasis = True
    h.ignore_links = False

    # Extract title
    title = soup.title.string

    # Stuff at the top
    for container in soup.find_all(class_="bandeau-imt"): 
        container.decompose()
    for footer in soup.find_all(id = "masthead"):
        footer.decompose()
    
    # Isolate the breadcrumb ("fil d'Ariane"), convert it to Markdown and format it as a path
    fil = soup.find(class_ = "fil-ariane liste")
    fil_markdown = h.handle(str(fil))

    fil_markdown = fil_markdown.replace("\n", " ").replace("*", ">")
    fil_markdown = " ".join(fil_markdown.split())
    fil_markdown = fil_markdown.replace(">", "##", 1)

    # Remove all classes with "fil-ariane" inside
    for container in soup.find_all(class_="fil-ariane"):
        container.decompose()
    for container in soup.find_all(class_ = "local-nav"):
        container.decompose()
    
    # Stuff at the bottom
    for container in soup.find_all(class_ = "site-footer"):
        container.decompose()
    for container in soup.find_all(class_ = "block-bas-de-page"):
        container.decompose()
    for container in soup.find_all(id = "poucette"):
        container.decompose()

    # Remove all scripts and styles
    for script in soup.find_all('script'):
        script.decompose()
    for style in soup.find_all('style'):
        style.decompose()
    
    # Remove cookies
    for cookie in soup.find_all(class_ = "cli-row"):
        cookie.decompose()
    for cookie in soup.find_all(class_ = "wt-cli-cookie-bar-container"):
        cookie.decompose()

    # Format the text
    cleaned_content = h.handle(str(soup))

    final_markdown = f"# Title: {title}\n\n{fil_markdown}\n\n{cleaned_content}"

    return final_markdown

# Example usage
if show_examples:
    url = "https://www.telecom-paris.fr/fr/ingenieur/formation/2e-annee-orientation/3d-systemes-interactifs/"
    clean_text = get_clean_text_telecom(url)
    print(clean_text)

print("Done")

# Title: Filière 3D et systèmes interactifs

## [Accueil](https://www.telecom-paris.fr "https://www.telecom-paris.fr") > [fr](https://www.telecom-paris.fr/fr "fr") > [Ingénieurs](https://www.telecom-paris.fr/fr/ingenieur "Ingénieurs") > [Votre formation d’ingénieur](https://www.telecom-paris.fr/fr/ingenieur/formation "Votre formation d’ingénieur") > [Votre 2e année : une orientation à la carte](https://www.telecom-paris.fr/fr/ingenieur/formation/2e-annee-orientation "Votre 2e année : une orientation à la carte") > [Filière 3D et systèmes interactifs](https://www.telecom-paris.fr/fr/ingenieur/formation/2e-annee-orientation/3d-systemes-interactifs)

[](https://www.telecom-paris.fr/fr/accueil)

# Filière 3D et systèmes interactifs

### Pour celles et ceux qui aiment

  * La conception 3D et la réalité virtuelle
  * Les dispositifs et systèmes interactifs 
  * Les interfaces tactiles, mobiles, gestuelles, etc.
  * Les jeux vidéo et les effets spéciaux

  * 

## Objectifs

Cette filière vis
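
The breadcrumb step in `get_clean_text_telecom` turns a Markdown bullet list into a single `## a > b > c` heading through a chain of `replace` calls. The sketch below does the same thing line by line, which avoids touching a `*` or `>` that appears inside a label. The name `breadcrumb_heading` and the sample input are hypothetical.

```python
def breadcrumb_heading(markdown_list):
    """Convert a Markdown bullet list into a '## a > b > c' heading."""
    items = []
    for raw in markdown_list.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("* "):
            # A new bullet starts a new breadcrumb item.
            items.append(line[2:].strip())
        elif items:
            # A wrapped line continues the previous item.
            items[-1] += " " + line
    return "## " + " > ".join(items)


sample = "  * [Accueil](https://example.org)\n  * [fr](https://example.org/fr)\n"
print(breadcrumb_heading(sample))
```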

In [91]:
def download_website_telecom(root_url, folder_name="telecom-paris", visited=None):
    global i
    if visited is None:
        visited = set()
    
    if root_url in visited:
        return
    
    visited.add(root_url)
    
    base_folder = os.path.join(BASE_PATH, folder_name)
    response = requests.get(root_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Find all links on the root page
    links = [a['href'] for a in soup.find_all('a', href=True)]
    
    # Filter out external links and duplicates
    links = list(set([link for link in links if link.startswith(root_url)]))
    
    # Create directory if it doesn't exist
    if not os.path.exists(base_folder):
        os.makedirs(base_folder)
    
    for link in links:
        try:
            clean_text = get_clean_text_telecom(link)
            # Create a filename from the link
            filename = os.path.join(base_folder, link.replace(root_url, "").replace("/", "_") + ".md")
            with open(filename, 'w', encoding='utf-8') as file:
                file.write(clean_text)
            i+=1
            print(f"{i}: Downloaded and saved: {link}")
            
            # Recursive call to follow links on the current page
            download_website_telecom(link, folder_name, visited)
        except Exception as e:
            print(f"Failed to download {link}: {e}")


# Download all webpages from the Telecom website, starting from the root page
root_url = "https://www.telecom-paris.fr/"

i = 0
# Download the website
download_website_telecom(root_url)
print("Done")

1: Downloaded and saved: https://www.telecom-paris.fr/fr/ecole/bref
2: Downloaded and saved: https://www.telecom-paris.fr/fr/ecole/bref
3: Downloaded and saved: https://www.telecom-paris.fr/fr/ecole/bref/gouvernance
4: Downloaded and saved: https://www.telecom-paris.fr/fr/ecole/bref/gouvernance
5: Downloaded and saved: https://www.telecom-paris.fr/fr/ecole/bref/brochures
6: Downloaded and saved: https://www.telecom-paris.fr/fr/ecole/bref/brochures
7: Downloaded and saved: https://www.telecom-paris.fr/fr/ecole/bref/marches-publics
8: Downloaded and saved: https://www.telecom-paris.fr/fr/ecole/bref/marches-publics
9: Downloaded and saved: https://www.telecom-paris.fr/fr/ecole/bref/chiffres-cles
10: Downloaded and saved: https://www.telecom-paris.fr/fr/ecole/bref/chiffres-cles
11: Downloaded and saved: https://www.telecom-paris.fr/fr/ecole/bref/logos
12: Downloaded and saved: https://www.telecom-paris.fr/fr/ecole/bref/logos
13: Downloaded and saved: https://www.telecom-paris.fr/fr/ecole/b

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


266: Downloaded and saved: https://www.telecom-paris.fr/fr/entreprise/former-collaborateurs
Failed to download https://www.telecom-paris.fr/wp-content-EvDsK19/uploads/2024/09/rentree-tse-2tonnes-2024.jpg: 'NoneType' object has no attribute 'string'
267: Downloaded and saved: https://www.telecom-paris.fr/fr/campus/vie/acces-orientation
268: Downloaded and saved: https://www.telecom-paris.fr/fr/campus/vie/acces-orientation
269: Downloaded and saved: https://www.telecom-paris.fr/fr/masteres-specialises/ia-big-data
270: Downloaded and saved: https://www.telecom-paris.fr/fr/masteres-specialises/ia-big-data
271: Downloaded and saved: https://www.telecom-paris.fr/fr/ecole/alumni/portraits/benedicte-david
272: Downloaded and saved: https://www.telecom-paris.fr/fr/ecole/alumni/portraits/benedicte-david
273: Downloaded and saved: https://www.telecom-paris.fr/restauration-qualite-respect
274: Downloaded and saved: https://www.telecom-paris.fr/restauration-qualite-respect
275: Downloaded and saved

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Failed to download https://www.telecom-paris.fr/wp-content-EvDsK19/uploads/2024/10/lancement-parrainage-promo-2027-actu.jpg: 'NoneType' object has no attribute 'string'
1338: Downloaded and saved: https://www.telecom-paris.fr/fr/ecole/bref/ecosysteme
1339: Downloaded and saved: https://www.telecom-paris.fr/agenda/french-tech-paris-saclay
1340: Downloaded and saved: https://www.telecom-paris.fr/fr/ingenieur/pedagogie/mooc
1341: Downloaded and saved: https://www.telecom-paris.fr/fr/ingenieur/notre-vision
1342: Downloaded and saved: https://www.telecom-paris.fr/news/pressroom
1343: Downloaded and saved: https://www.telecom-paris.fr/fr/ideas/quantum-secure-networks-partnership
1344: Downloaded and saved: https://www.telecom-paris.fr/fr/ideas/performance-algorithmes-maitrise-impact-carbone
1345: Downloaded and saved: https://www.telecom-paris.fr/fr/international
1346: Downloaded and saved: https://www.telecom-paris.fr/fr/international/strategie/partenariats
1347: Downloaded and saved: https

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


1459: Downloaded and saved: https://www.telecom-paris.fr/fr/recherche/research-and-innovation-webinars
Failed to download https://www.telecom-paris.fr/wp-content-EvDsK19/uploads/2024/10/Mosaik-2024-57.jpg: 'NoneType' object has no attribute 'string'
1460: Downloaded and saved: https://www.telecom-paris.fr/fr/recherche/labos/traitement-information-ltci
1461: Downloaded and saved: https://www.telecom-paris.fr/en/home
1462: Downloaded and saved: https://www.telecom-paris.fr/en/home
1463: Downloaded and saved: https://www.telecom-paris.fr/fr/recherche/axes-strategiques/innovation-numerique
1464: Downloaded and saved: https://www.telecom-paris.fr/fr/ideas/contre-biais-algorithmes-recommandation
1465: Downloaded and saved: https://www.telecom-paris.fr/fr/ideas/ai-act-game
1466: Downloaded and saved: https://www.telecom-paris.fr/fr/ecole/mecenat
1467: Downloaded and saved: https://www.telecom-paris.fr/fr/ecole/alumni/portraits
1468: Downloaded and saved: https://www.telecom-paris.fr/fr/format

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


1484: Downloaded and saved: https://www.telecom-paris.fr/fr/ideas/numerique-societe
Failed to download https://www.telecom-paris.fr/wp-content-EvDsK19/uploads/2024/09/restauration.jpg: 'NoneType' object has no attribute 'string'
1485: Downloaded and saved: https://www.telecom-paris.fr/fr/ideas/migrant-connecte
1486: Downloaded and saved: https://www.telecom-paris.fr/fr/ideas/impacts-sociaux-numerique-travailleurs-clic
1487: Downloaded and saved: https://www.telecom-paris.fr/fr/ingenieur/comment-integrer/apprentissage-ats-dut
1488: Downloaded and saved: https://www.telecom-paris.fr/fr/innovation-entrepreneuriat/lieux-innovation
1489: Downloaded and saved: https://www.telecom-paris.fr/fr/ideas/numerique-confiance
1490: Downloaded and saved: https://www.telecom-paris.fr/fr/ideas/regulation-intelligence-artificielle
1491: Downloaded and saved: https://www.telecom-paris.fr/fr/ideas/ia-hybride-explicable-imagerie-medicale
1492: Downloaded and saved: https://www.telecom-paris.fr/fr/internatio

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


1500: Downloaded and saved: https://www.telecom-paris.fr/fr/doctorat/sujets-de-theses
Failed to download https://www.telecom-paris.fr/wp-content-EvDsK19/uploads/2024/10/decisions-incertain-VF-TH-Dunod-6b.jpg: 'NoneType' object has no attribute 'string'
1501: Downloaded and saved: https://www.telecom-paris.fr/fr/entreprise/partenaires/parrains-campus-logos
1502: Downloaded and saved: https://www.telecom-paris.fr/fr/campus/bibliotheque/ressources
1503: Downloaded and saved: https://www.telecom-paris.fr/fr/ideas/explicabilite-confiance-intelligence-artificielle
1504: Downloaded and saved: https://www.telecom-paris.fr/fr/international/etudiants/programmes
1505: Downloaded and saved: https://www.telecom-paris.fr/fr/ecole/responsabilite-sociale/egalite-femmes-hommes
1506: Downloaded and saved: https://www.telecom-paris.fr/fr/entreprise/evenements/retour-activites
1507: Downloaded and saved: https://www.telecom-paris.fr/fr/recherche/axes-strategiques/sciences-donnees-intelligence-artificielle
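
The log above shows several pages saved twice, which suggests that the same page is reached through slightly different links (for example with and without a trailing slash). As a hedged alternative to the recursive function, here is a breadth-first sketch that deduplicates on a normalized key and takes the download step as a parameter. All names (`LinkParser`, `crawl`, `fetch`, `fake_site`) are hypothetical; in the notebook, `fetch` would wrap `requests.get`.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def _key(url):
    """Normalized form used for deduplication: no fragment, no trailing slash."""
    return urldefrag(url)[0].rstrip("/")


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl of the pages located under start_url.

    `fetch` is any callable mapping a URL to its HTML text.
    Returns a dict {url: html}; each normalized URL is fetched once.
    """
    prefix = _key(start_url)
    queue = deque([start_url])
    seen = {prefix}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            absolute = urljoin(url, href)  # resolves relative links
            key = _key(absolute)
            internal = key == prefix or key.startswith(prefix + "/")
            if internal and key not in seen:
                seen.add(key)
                queue.append(urldefrag(absolute)[0])
    return pages


# In-memory stand-in for a website, so the sketch runs without network access.
fake_site = {
    "https://example.org/": '<a href="/a">A</a><a href="https://example.org/a/">A</a>',
    "https://example.org/a": '<a href="/">home</a><a href="/a#top">top</a>',
}
print(sorted(crawl("https://example.org/", fake_site.__getitem__)))
```

Passing `fetch` in keeps the traversal logic testable offline, and the `max_pages` cap guards against a crawl that never terminates.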

## Let's download doc.telecom-paris.fr
It is hard to get any simpler: we only strip the navigation bar. Better still, every page is already linked from the home page, so there is no need to adjust the crawl depth.

In [112]:
def get_clean_text_doc(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract title
    title = soup.title.string

    # Extract the path as an array
    path = url.split("/")

    # Remove nav bar
    for container in soup.find_all(class_="wy-nav-side"):
        container.decompose()

    # Remove footer
    for container in soup.find_all("footer"):
        container.decompose()

    # Replace relative links (here using markdown syntax, meaning ./ or ../ on the original page) with absolute links
    for link in soup.find_all('a'):
        if link.get('href') and link['href'].startswith("./"):
            new_path = path[:-1]
            new_path.append(link['href'][2:])
            link['href'] = "/".join(new_path)
        if link.get('href') and link['href'].startswith("../"):
            new_path = path[:-2]
            new_path.append(link['href'][3:])
            link['href'] = "/".join(new_path)

        
    # Convert the HTML content to Markdown
    h = h2t.HTML2Text()
    h.ignore_links = False
    h.ignore_images = True
    h.ignore_emphasis = True
    
    # Format the text
    cleaned_content = h.handle(str(soup))
    final_markdown = f"# Title: {title}\n\n## {title.replace('—', '>')} \n\n {cleaned_content}"
    return final_markdown

# Example usage
if show_examples:
    url = "https://doc.telecom-paris.fr/reseau/param.html"
    clean_text = get_clean_text_doc(url)
    print(clean_text)

# Title: Configuration réseau — Aide informatique  documentation

## Configuration réseau > Aide informatique  documentation 

 [Aide informatique](https://doc.telecom-paris.fr/index.html)

  * [](https://doc.telecom-paris.fr/index.html) »
  * Configuration réseau
  * 

* * *

# Configuration réseau¶

## Site de Palaiseau¶

L’accès au réseau filaire et sans-fil de l’école est protégé par une
authentification 802.1x. Le paramétrage est décrit ci-dessous. Suivant celle-
ci vous obtiendrez des droits d’accès aux différentes ressources.

Pour les personnels et étudiants deux connexions sont possibles :

    

  * connexion filaire ou réseau WiFi Campus-Telecom :

    * accès possible aux ressources internes protégées selon vos droits

    * accès protégé à Internet permettant un usage web classique

    * conseillé pour les postes de travail de l’école

  * réseau WiFi eduroam :

    * seules les ressources internes publiques sont accessibles (ex : sites web ouvertes, messagerie)

    * ac
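
The relative-link rewriting in `get_clean_text_doc` handles the `./` and `../` prefixes by hand. A possible alternative, sketched here, is `urllib.parse.urljoin` from the standard library, which also resolves nested `../../` references and leaves absolute URLs untouched. The helper name `absolutize` is hypothetical.

```python
from urllib.parse import urljoin


def absolutize(page_url, href):
    """Resolve href against the URL of the page it was found on."""
    return urljoin(page_url, href)


page_url = "https://doc.telecom-paris.fr/reseau/param.html"
print(absolutize(page_url, "./param.html"))
print(absolutize(page_url, "../telephonie/wifi.html"))
print(absolutize(page_url, "https://example.org/"))
```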

In [113]:
def download_website_doc(root_url, folder_name = "doc"):
    base_folder = os.path.join(BASE_PATH, folder_name)
    response = requests.get(root_url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all links on the root page
    links = [a['href'] for a in soup.find_all('a', href=True)]

    # In this specific case, all internal links are relative.
    for i in range(len(links)):
        if not links[i].startswith("http"):
            links[i] = root_url + links[i]
    
    # Remove the fragment (the "#..." part) at the end of the links
    for i in range(len(links)):
        if links[i].find("#") != -1:
            links[i] = links[i][:links[i].find("#")]

    # Filter out external links and duplicates
    links = list(set([link for link in links if link.startswith(root_url)]))


    print(links)
    # Create directory if it doesn't exist
    if not os.path.exists(base_folder):
        os.makedirs(base_folder)
    
    for link in links:
        try:
            clean_text = get_clean_text_doc(link)
            # Create a filename from the link
            filename = os.path.join(base_folder, link.replace(root_url, "").replace("/", "_") + ".md")
            with open(filename, 'w', encoding='utf-8') as file:
                file.write(clean_text)
            print(f"Downloaded and saved: {link}")
        except Exception as e:
            print(f"Failed to download {link}: {e}")

# Download all webpages from the doc website, starting from the root page
root_url = "https://doc.telecom-paris.fr/"

# Download the website
download_website_doc(root_url)
print("Done")

['https://doc.telecom-paris.fr/wifi_invite/param.html', 'https://doc.telecom-paris.fr/impression/imprimantes.html', 'https://doc.telecom-paris.fr/telephonie/wifi.html', 'https://doc.telecom-paris.fr/impression/web.html', 'https://doc.telecom-paris.fr/', 'https://doc.telecom-paris.fr/salles_tp/salles_tp.html', 'https://doc.telecom-paris.fr/zimbra/acces.html', 'https://doc.telecom-paris.fr/zimbra/conseils/index.html', 'https://doc.telecom-paris.fr/impression/postes.html', 'https://doc.telecom-paris.fr/zimbra/index.html', 'https://doc.telecom-paris.fr/reseau/param.html', 'https://doc.telecom-paris.fr/impression/index.html']
Downloaded and saved: https://doc.telecom-paris.fr/wifi_invite/param.html
Downloaded and saved: https://doc.telecom-paris.fr/impression/imprimantes.html
Downloaded and saved: https://doc.telecom-paris.fr/telephonie/wifi.html
Downloaded and saved: https://doc.telecom-paris.fr/impression/web.html
Downloaded and saved: https://doc.telecom-paris.fr/
Downloaded and saved: h
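
File names in the cell above are derived with `link.replace(root_url, "").replace("/", "_")`, which appears to give the root page an empty name (the file `.md`). Below is a sketch of a slightly more defensive mapping; `url_to_filename` is a hypothetical helper, and falling back to `index` for the root page is an assumption. Distinct URLs can still collide after sanitizing.

```python
import re
from urllib.parse import urlsplit


def url_to_filename(url, extension=".md"):
    """Build a file name from the path of a URL."""
    path = urlsplit(url).path.strip("/")
    # Replace every run of characters outside [A-Za-z0-9._-] with "_".
    name = re.sub(r"[^A-Za-z0-9._-]+", "_", path) or "index"
    return name + extension


print(url_to_filename("https://doc.telecom-paris.fr/"))
print(url_to_filename("https://doc.telecom-paris.fr/reseau/param.html"))
```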

## Let's download a few wikipedia pages
We will try to retrieve the most important pages, along with the pages they link to.

In [120]:
def get_clean_text_wikipedia(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract title
    title = soup.title.string

    # Only keep main content
    soup = soup.find(class_="mw-parser-output")

    # Remove useless stuff
    for container in soup.find_all(class_ = "navbox-container"):
        container.decompose()

    for container in soup.find_all(class_ = "bandeau_portail"):
        container.decompose()

    for container in soup.find_all(class_ = "mw-normal-catlinks"):
        container.decompose()
    
    for container in soup.find_all(class_ = "mw-editsection"):
        container.decompose()

    # Remove style and scripts
    for script in soup.find_all('script'):
        script.decompose()
    for style in soup.find_all('style'):
        style.decompose()
        
    # Convert the HTML content to Markdown
    h = h2t.HTML2Text()
    h.ignore_links = False
    h.ignore_images = True
    h.ignore_emphasis = True
    
    # Format the text
    cleaned_content = h.handle(str(soup))
    final_markdown = f"# Title: {title}\n\n## {title.replace('—', '>')} {cleaned_content}"
    return final_markdown

# Example usage
if show_examples:
    url = "https://fr.wikipedia.org/wiki/T%C3%A9l%C3%A9com_Paris"
    clean_text = get_clean_text_wikipedia(url)
    print(clean_text)

# Title: Télécom Paris — Wikipédia

## Télécom Paris > Wikipédia [](/wiki/Aide:Paronymie "Aide:Paronymie")

Cet article possède un [paronyme](/wiki/Paronymie "Paronymie"), voir [Télécom
SudParis](/wiki/T%C3%A9l%C3%A9com_SudParis "Télécom SudParis").

Institut Mines-Télécom, Télécom Paris

[](/w/index.php?title=Fichier:Logo_T%C3%A9l%C3%A9com_ParisTech.svg&lang=fr)

[](/wiki/Fichier:Telecom_Paris_main_court.jpg)

HistoireFondation|  [1878](/wiki/1878 "1878")  
---|---  
StatutType|  [École
d'ingénieurs](/wiki/%C3%89tudes_d%27ing%C3%A9nieurs_en_France "Études
d'ingénieurs en France") et [Grand
établissement](/wiki/Grand_%C3%A9tablissement "Grand établissement")
([Institut Mines-Télécom](/wiki/Institut_Mines-T%C3%A9l%C3%A9com "Institut
Mines-Télécom"))  
---|---  
Nom officiel|  École supérieure de télégraphie, École nationale supérieure des
télécommunications (ENST)  
Régime linguistique|  [Français](/wiki/Fran%C3%A7ais
"Français")[](https://www.wikidata.org/wiki/Q2311820?uselang=fr#P2936

In [126]:
def download_website_wikipedia(root_url, folder_name = "wikipedia"):
    base_folder = os.path.join(BASE_PATH, folder_name)
    response = requests.get(root_url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Only keep main content
    soup = soup.find(class_="mw-parser-output")

    # Remove useless stuff
    for container in soup.find_all(class_ = "navbox-container"):
        container.decompose()

    for container in soup.find_all(class_ = "bandeau_portail"):
        container.decompose()

    for container in soup.find_all(class_ = "mw-normal-catlinks"):
        container.decompose()
    
    for container in soup.find_all(class_ = "mw-editsection"):
        container.decompose()



    # Find all links on the root page
    links = [a['href'] for a in soup.find_all('a', href=True)]

    for i in range(len(links)):
        if links[i].startswith("/wiki/"):
            links[i] = "https://fr.wikipedia.org" + links[i]

    # Filter out external links and duplicates
    links = list(set([link for link in links if link.startswith("https://fr.wikipedia.org")]))


    # Create directory if it doesn't exist
    if not os.path.exists(base_folder):
        os.makedirs(base_folder)
    
    for link in links:
        try:
            clean_text = get_clean_text_wikipedia(link)
            # Create a filename from the link
            filename = os.path.join(base_folder, link.replace(root_url, "").replace("/", "_") + ".md")
            with open(filename, 'w', encoding='utf-8') as file:
                file.write(clean_text)
            print(f"Downloaded and saved: {link}")
        except Exception as e:
            print(f"Failed to download {link}: {e}")

# Download the pages linked from the Télécom Paris article and from the alumni category page
root_url = "https://fr.wikipedia.org/wiki/T%C3%A9l%C3%A9com_Paris"
download_website_wikipedia(root_url)

root_url = "https://fr.wikipedia.org/wiki/Cat%C3%A9gorie:%C3%89l%C3%A8ve_de_T%C3%A9l%C3%A9com_Paris"
download_website_wikipedia(root_url)

print("Done")

Downloaded and saved: https://fr.wikipedia.org/wiki/Liste_des_%C3%A9coles_d%27ing%C3%A9nieurs_en_France
Downloaded and saved: https://fr.wikipedia.org/wiki/Paronymie
Downloaded and saved: https://fr.wikipedia.org/wiki/Brest
Downloaded and saved: https://fr.wikipedia.org/wiki/Portail:Paris
Downloaded and saved: https://fr.wikipedia.org/wiki/Cat%C3%A9gorie:%C3%89l%C3%A8ve_de_T%C3%A9l%C3%A9com_Paris
Downloaded and saved: https://fr.wikipedia.org/wiki/Universit%C3%A9_Paris_Descartes
Failed to download https://fr.wikipedia.org/w/index.php?title=T%C3%A9l%C3%A9com_Paris&action=edit&section=0: 'NoneType' object has no attribute 'find_all'
Downloaded and saved: https://fr.wikipedia.org/wiki/1876
Downloaded and saved: https://fr.wikipedia.org/wiki/Aide:Paronymie
Downloaded and saved: https://fr.wikipedia.org/wiki/Mast%C3%A8re_sp%C3%A9cialis%C3%A9
Downloaded and saved: https://fr.wikipedia.org/wiki/%C3%89cole_polytechnique_f%C3%A9d%C3%A9rale_de_Lausanne
Downloaded and saved: https://fr.wikipedia.
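
The log above includes help and portal pages as well as a failing edit link (`/w/index.php?...&action=edit`). If only regular articles are wanted, links could be filtered before downloading. This is a sketch, not the notebook's behaviour: `is_article_url` is a hypothetical helper, and treating any title that contains a colon as a namespaced page is a heuristic that would also reject the category page the notebook downloads on purpose.

```python
from urllib.parse import unquote, urlsplit


def is_article_url(url):
    """Heuristic: True for fr.wikipedia.org/wiki/<title> with no namespace prefix."""
    parts = urlsplit(url)
    if parts.netloc != "fr.wikipedia.org" or not parts.path.startswith("/wiki/"):
        return False
    title = unquote(parts.path[len("/wiki/"):])
    # Assumption: a colon marks a namespace such as "Aide:" or "Portail:".
    return bool(title) and ":" not in title


for candidate in [
    "https://fr.wikipedia.org/wiki/Brest",
    "https://fr.wikipedia.org/wiki/Aide:Paronymie",
    "https://fr.wikipedia.org/w/index.php?title=T%C3%A9l%C3%A9com_Paris&action=edit&section=0",
]:
    print(candidate, is_article_url(candidate))
```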