# Exercise 08 - Ingestion from the Web (API)

## Objectives
- Call a REST API with Python
- Convert JSON data into a DataFrame
- Ingest web data into Bronze
- Handle errors and retries

---

## 1. Web ingestion architecture

```
+------------------+                    +------------------------+
|                  |                    |        MinIO           |
|    INTERNET      |     PYTHON         |                        |
|                  |    + SPARK         |  +------------------+  |
|  +------------+  |  =============>    |  |     BRONZE       |  |
|  | API REST   |  |                    |  +------------------+  |
|  | (JSON)     |  |                    |  | /api_users/      |  |
|  +------------+  |                    |  |   /2024-01-15/   |  |
|                  |                    |  | /api_posts/      |  |
|  +------------+  |                    |  |   /2024-01-15/   |  |
|  | CSV/JSON   |  |                    |  +------------------+  |
|  | files      |  |                    |                        |
|  +------------+  |                    +------------------------+
|                  |
+------------------+

Steps:
1. Call the API with requests
2. Parse the JSON
3. Create a Spark DataFrame
4. Save to Bronze
```
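
The Bronze layout in the diagram (one folder per dataset, one subfolder per ingestion date) can be captured in a small helper. This is a minimal sketch: the function name `bronze_path` and its `bucket` argument are assumptions for illustration, not part of the notebook.

```python
from datetime import date, datetime


def bronze_path(dataset, ingestion_date=None, bucket="bronze"):
    """Build s3a://<bucket>/<dataset>/<YYYY-MM-DD> (hypothetical helper)."""
    if ingestion_date is None:
        ingestion_date = datetime.now().date()
    if isinstance(ingestion_date, date):
        # date and datetime objects are formatted as YYYY-MM-DD
        ingestion_date = ingestion_date.strftime("%Y-%m-%d")
    else:
        # A string date is validated; ValueError is raised if it does not parse
        datetime.strptime(ingestion_date, "%Y-%m-%d")
    return f"s3a://{bucket}/{dataset.strip('/')}/{ingestion_date}"


print(bronze_path("api_users", "2024-01-15"))
```

Keeping the path logic in one place helps every dataset follow the same date-partitioned convention.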

## 2. Configuration

In [1]:
from pyspark.sql import SparkSession
from datetime import datetime
import requests
import json

# Create the SparkSession
spark = SparkSession.builder \
    .appName("Ingestion Web") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.4.1,com.amazonaws:aws-java-sdk-bundle:1.12.262") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000") \
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin") \
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin123") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .getOrCreate()

date_ingestion = datetime.now().strftime("%Y-%m-%d")
print(f"Spark ready - Date: {date_ingestion}")

Spark ready - Date: 2026-01-14
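
The access keys in the configuration cell are hardcoded, which is convenient for a local MinIO but risky in a shared repository. A minimal sketch, assuming hypothetical environment variables `MINIO_ENDPOINT`, `MINIO_ACCESS_KEY` and `MINIO_SECRET_KEY` (these names are not defined by the notebook), shows how the same settings could be read from the environment:

```python
import os


def s3a_settings(env=None):
    """Return the s3a Spark options, read from environment variables.

    The variable names are assumptions for illustration; the fallback values
    are the ones used in the configuration cell.
    """
    env = os.environ if env is None else env
    return {
        "spark.hadoop.fs.s3a.endpoint": env.get("MINIO_ENDPOINT", "http://minio:9000"),
        "spark.hadoop.fs.s3a.access.key": env.get("MINIO_ACCESS_KEY", "minioadmin"),
        "spark.hadoop.fs.s3a.secret.key": env.get("MINIO_SECRET_KEY", "minioadmin123"),
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


# Print the option names only, never the secret values
for key in s3a_settings():
    print(key)
```

Each pair can then be passed to `SparkSession.builder.config(key, value)` in a loop before calling `getOrCreate()`.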


## 3. Calling a simple REST API

In [2]:
# Test API: JSONPlaceholder (free public API)
url_users = "https://jsonplaceholder.typicode.com/users"

# Call the API (the timeout keeps the cell from hanging if the server does not respond)
response = requests.get(url_users, timeout=30)

print(f"Status code: {response.status_code}")
print(f"Content-Type: {response.headers.get('Content-Type')}")

Status code: 200
Content-Type: application/json; charset=utf-8
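
Before calling `response.json()`, it can help to confirm that the response really is JSON, since an error page is often returned as HTML. A minimal sketch (the helper name `is_json_content_type` is an assumption) that inspects a `Content-Type` header value such as the one printed above:

```python
def is_json_content_type(content_type):
    """Return True if a Content-Type header value announces JSON."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and normalize case
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/json" or media_type.endswith("+json")


print(is_json_content_type("application/json; charset=utf-8"))  # True
print(is_json_content_type("text/html"))                        # False
```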


In [3]:
# Parse the JSON
data = response.json()

print(f"Type: {type(data)}")
print(f"Number of elements: {len(data)}")
print("\nFirst element:")
print(json.dumps(data[0], indent=2))

Type: <class 'list'>
Number of elements: 10

First element:
{
  "id": 1,
  "name": "Leanne Graham",
  "username": "Bret",
  "email": "Sincere@april.biz",
  "address": {
    "street": "Kulas Light",
    "suite": "Apt. 556",
    "city": "Gwenborough",
    "zipcode": "92998-3874",
    "geo": {
      "lat": "-37.3159",
      "lng": "81.1496"
    }
  },
  "phone": "1-770-736-8031 x56442",
  "website": "hildegard.org",
  "company": {
    "name": "Romaguera-Crona",
    "catchPhrase": "Multi-layered client-server neural-net",
    "bs": "harness real-time e-markets"
  }
}


## 4. Converting to a Spark DataFrame

In [4]:
# Create a DataFrame from a Python list
# The API data contains nested structures (address, company, geo)
# The schema is defined explicitly to avoid inference errors

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Schema for the JSONPlaceholder /users data
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("username", StringType(), True),
    StructField("email", StringType(), True),
    StructField("address", StructType([
        StructField("street", StringType(), True),
        StructField("suite", StringType(), True),
        StructField("city", StringType(), True),
        StructField("zipcode", StringType(), True),
        StructField("geo", StructType([
            StructField("lat", StringType(), True),
            StructField("lng", StringType(), True)
        ]), True)
    ]), True),
    StructField("phone", StringType(), True),
    StructField("website", StringType(), True),
    StructField("company", StructType([
        StructField("name", StringType(), True),
        StructField("catchPhrase", StringType(), True),
        StructField("bs", StringType(), True)
    ]), True)
])

df_users = spark.createDataFrame(data, schema=schema)

print("Schema:")
df_users.printSchema()

Schema:
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- username: string (nullable = true)
 |-- email: string (nullable = true)
 |-- address: struct (nullable = true)
 |    |-- street: string (nullable = true)
 |    |-- suite: string (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- zipcode: string (nullable = true)
 |    |-- geo: struct (nullable = true)
 |    |    |-- lat: string (nullable = true)
 |    |    |-- lng: string (nullable = true)
 |-- phone: string (nullable = true)
 |-- website: string (nullable = true)
 |-- company: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- catchPhrase: string (nullable = true)
 |    |-- bs: string (nullable = true)



In [5]:
# Display the data
df_users.select("id", "name", "email", "phone").show()

+---+--------------------+--------------------+--------------------+
| id|                name|               email|               phone|
+---+--------------------+--------------------+--------------------+
|  1|       Leanne Graham|   Sincere@april.biz|1-770-736-8031 x5...|
|  2|        Ervin Howell|   Shanna@melissa.tv| 010-692-6593 x09125|
|  3|    Clementine Bauch|  Nathan@yesenia.net|      1-463-123-4447|
|  4|    Patricia Lebsack|Julianne.OConner@...|   493-170-9623 x156|
|  5|    Chelsey Dietrich|Lucio_Hettinger@a...|       (254)954-1289|
|  6|Mrs. Dennis Schulist|Karley_Dach@jaspe...|1-477-935-8478 x6430|
|  7|     Kurtis Weissnat|Telly.Hoeger@bill...|        210.067.6132|
|  8|Nicholas Runolfsd...|Sherwood@rosamond.me|   586.493.6943 x140|
|  9|     Glenna Reichert|Chaim_McDermott@d...|(775)976-6794 x41206|
| 10|  Clementina DuBuque|Rey.Padberg@karin...|        024-648-3804|
+---+--------------------+--------------------+--------------------+



In [6]:
# Flatten the nested structures
from pyspark.sql import functions as F

df_users_flat = df_users.select(
    "id",
    "name",
    "username",
    "email",
    "phone",
    "website",
    F.col("address.street").alias("street"),
    F.col("address.city").alias("city"),
    F.col("address.zipcode").alias("zipcode"),
    F.col("company.name").alias("company_name")
)

df_users_flat.show(truncate=False)

+---+------------------------+----------------+-------------------------+---------------------+-------------+-----------------+--------------+----------+------------------+
|id |name                    |username        |email                    |phone                |website      |street           |city          |zipcode   |company_name      |
+---+------------------------+----------------+-------------------------+---------------------+-------------+-----------------+--------------+----------+------------------+
|1  |Leanne Graham           |Bret            |Sincere@april.biz        |1-770-736-8031 x56442|hildegard.org|Kulas Light      |Gwenborough   |92998-3874|Romaguera-Crona   |
|2  |Ervin Howell            |Antonette       |Shanna@melissa.tv        |010-692-6593 x09125  |anastasia.net|Victor Plains    |Wisokyburgh   |90566-7771|Deckow-Crist      |
|3  |Clementine Bauch        |Samantha        |Nathan@yesenia.net       |1-463-123-4447       |ramiro.info  |Douglas Extension|McKenzie
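
The same flattening can also be done on the Python side, before the DataFrame is created. A minimal sketch, assuming a hypothetical helper `flatten` that joins nested keys with an underscore (so the resulting names, such as `address_city`, differ from the aliases chosen in the Spark cell):

```python
def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts: {"address": {"city": "X"}} -> {"address_city": "X"}."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into the nested object, carrying the key prefix along
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat


# Shortened version of the first /users record shown earlier
user = {
    "id": 1,
    "address": {"city": "Gwenborough", "geo": {"lat": "-37.3159", "lng": "81.1496"}},
    "company": {"name": "Romaguera-Crona"},
}
print(flatten(user))
```

Flattening in Python keeps the Spark schema simple, while flattening in Spark keeps the raw nested structure available. Which one fits better depends on how much of the original payload should be preserved.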

## 5. Saving to Bronze

In [7]:
# Add the metadata columns
df_final = df_users_flat \
    .withColumn("_source", F.lit("api_jsonplaceholder")) \
    .withColumn("_endpoint", F.lit("users")) \
    .withColumn("_ingestion_date", F.lit(date_ingestion))

# Save
chemin_bronze = f"s3a://bronze/api_users/{date_ingestion}"
df_final.write.mode("overwrite").parquet(chemin_bronze)

print(f"Save succeeded: {chemin_bronze}")
print(f"Rows ingested: {df_final.count()}")

Save succeeded: s3a://bronze/api_users/2026-01-14
Rows ingested: 10


## 6. Generic API ingestion function

In [8]:
def ingerer_api(url, nom_dataset, date=None):
    """
    Ingest data from a REST API into Bronze.

    Args:
        url: URL of the API
        nom_dataset: Dataset name used in the Bronze path
        date: Ingestion date (optional, defaults to today)

    Returns:
        dict: Ingestion statistics
    """
    if date is None:
        date = datetime.now().strftime("%Y-%m-%d")

    try:
        # Call the API
        response = requests.get(url, timeout=30)
        response.raise_for_status()

        # Parse the JSON
        data = response.json()

        # Create the DataFrame (a single JSON object is wrapped in a list)
        if isinstance(data, list):
            df = spark.createDataFrame(data)
        else:
            df = spark.createDataFrame([data])

        # Add the metadata columns
        df = df.withColumn("_source", F.lit(url)) \
               .withColumn("_ingestion_date", F.lit(date))

        # Save
        chemin = f"s3a://bronze/{nom_dataset}/{date}"
        df.write.mode("overwrite").parquet(chemin)

        nb_lignes = df.count()
        print(f"[OK] {nom_dataset}: {nb_lignes} rows -> {chemin}")

        return {"status": "OK", "lignes": nb_lignes, "chemin": chemin}

    except requests.exceptions.RequestException as e:
        print(f"[ERROR] {nom_dataset}: API error - {e}")
        return {"status": "ERREUR", "erreur": str(e)}
    except Exception as e:
        print(f"[ERROR] {nom_dataset}: {e}")
        return {"status": "ERREUR", "erreur": str(e)}

In [9]:
# Test with other endpoints
endpoints = [
    ("https://jsonplaceholder.typicode.com/posts", "api_posts"),
    ("https://jsonplaceholder.typicode.com/comments", "api_comments"),
    ("https://jsonplaceholder.typicode.com/todos", "api_todos")
]

print("Multi-endpoint ingestion:")
print("=" * 50)

for url, nom in endpoints:
    ingerer_api(url, nom, date_ingestion)

Multi-endpoint ingestion:
==================================================
[OK] api_posts: 100 rows -> s3a://bronze/api_posts/2026-01-14
[OK] api_comments: 500 rows -> s3a://bronze/api_comments/2026-01-14
[OK] api_todos: 200 rows -> s3a://bronze/api_todos/2026-01-14
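
The loop above discards the dictionaries returned by `ingerer_api`. Collecting them makes it possible to report, or fail the run, when any endpoint is in error. A minimal sketch, where the helper `summarize` and the sample `results` are assumptions for illustration:

```python
def summarize(results):
    """Count successes and failures in the dicts returned by ingerer_api."""
    ok = [r for r in results.values() if r.get("status") == "OK"]
    failed = {
        name: r.get("erreur")
        for name, r in results.items()
        if r.get("status") != "OK"
    }
    return {
        "datasets_ok": len(ok),
        "datasets_failed": len(failed),
        "total_rows": sum(r.get("lignes", 0) for r in ok),
        "errors": failed,
    }


# Sample results shaped like the return values of ingerer_api (illustrative values)
results = {
    "api_posts": {"status": "OK", "lignes": 100, "chemin": "s3a://bronze/api_posts/2026-01-14"},
    "api_todos": {"status": "ERREUR", "erreur": "timeout"},
}
print(summarize(results))
```

In the notebook, `results` could be built as `{nom: ingerer_api(url, nom, date_ingestion) for url, nom in endpoints}`.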


## 7. Error handling and retries

In [10]:
import time

def appeler_api_avec_retry(url, max_retries=3, delay=2):
    """
    Call an API, retrying on error.

    Args:
        url: URL of the API
        max_retries: Maximum number of attempts
        delay: Initial delay between attempts in seconds (doubled after each failure)

    Returns:
        The parsed JSON data, or None if every attempt failed
    """
    for tentative in range(max_retries):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response.json()

        except requests.exceptions.RequestException as e:
            print(f"  Attempt {tentative + 1}/{max_retries} failed: {e}")
            if tentative < max_retries - 1:
                print(f"  Waiting {delay} seconds...")
                time.sleep(delay)
                delay *= 2  # Exponential backoff

    print(f"  Failed after {max_retries} attempts")
    return None

In [11]:
# Test with a valid URL
data = appeler_api_avec_retry("https://jsonplaceholder.typicode.com/albums")
if data is not None:
    print(f"Success: {len(data)} elements retrieved")

Success: 100 elements retrieved
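
The backoff logic can be separated from the HTTP call, which makes it testable without a network. A minimal sketch, assuming a hypothetical helper `retry_call` whose `sleep` argument can be replaced in tests. Unlike `appeler_api_avec_retry`, it re-raises the last exception instead of returning `None`:

```python
import time


def retry_call(func, max_retries=3, delay=2, retry_on=(Exception,), sleep=time.sleep):
    """Call func() until it succeeds or max_retries attempts have failed.

    The wait doubles after each failure (delay, 2*delay, 4*delay, ...).
    Returns the result of func(), or re-raises the last exception.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_retries:
                raise
            sleep(delay)
            delay *= 2


# Demonstration with a fake call that fails twice, then succeeds
attempts = []
waits = []


def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("temporary failure")
    return {"ok": True}


# waits.append stands in for time.sleep, so the demo records the delays instead of waiting
print(retry_call(flaky, sleep=waits.append), waits)  # {'ok': True} [2, 4]
```

With `requests`, the wrapped call could be a small function that performs `requests.get(url, timeout=30)`, calls `raise_for_status()` and returns `response.json()`, with `retry_on=(requests.exceptions.RequestException,)`.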


## 8. Ingesting a CSV file from the Web

In [12]:
# Download a CSV from a URL
url_csv = "https://raw.githubusercontent.com/datasets/covid-19/main/data/countries-aggregated.csv"

# Download and read the CSV
response = requests.get(url_csv, timeout=30)
if response.status_code == 200:
    # Save temporarily (explicit encoding, so the result does not depend on the platform default)
    with open("/tmp/covid_data.csv", "w", encoding="utf-8") as f:
        f.write(response.text)

    # Read with Spark (assumes the driver and executors share this local filesystem, as in local mode)
    df_covid = spark.read.csv("/tmp/covid_data.csv", header=True, inferSchema=True)
    print(f"COVID data: {df_covid.count()} rows")
    df_covid.show(5)

COVID data: 161568 rows
+----------+-----------+---------+---------+------+
|      Date|    Country|Confirmed|Recovered|Deaths|
+----------+-----------+---------+---------+------+
|2020-01-22|Afghanistan|        0|        0|     0|
|2020-01-23|Afghanistan|        0|        0|     0|
|2020-01-24|Afghanistan|        0|        0|     0|
|2020-01-25|Afghanistan|        0|        0|     0|
|2020-01-26|Afghanistan|        0|        0|     0|
+----------+-----------+---------+---------+------+
only showing top 5 rows
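
A fixed path such as `/tmp/covid_data.csv` can collide when two runs execute at the same time. A minimal sketch using the standard `tempfile` module (the helper name `save_text_to_tempfile` is an assumption); the returned path could then be passed to `spark.read.csv`:

```python
import os
import tempfile


def save_text_to_tempfile(text, suffix=".csv"):
    """Write text to a uniquely named temporary file and return its path."""
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=suffix, delete=False, encoding="utf-8", newline=""
    ) as f:
        f.write(text)
        return f.name


path = save_text_to_tempfile("Date,Country,Confirmed\n2020-01-22,Afghanistan,0\n")
print(path.endswith(".csv"), os.path.exists(path))
os.remove(path)  # clean up once the file has been read
```

Because `delete=False` is used, the file survives the `with` block and has to be removed explicitly after Spark has finished reading it.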


In [13]:
# Save to Bronze
chemin = f"s3a://bronze/covid_data/{date_ingestion}"
df_covid.write.mode("overwrite").parquet(chemin)
print(f"Saved: {chemin}")

Saved: s3a://bronze/covid_data/2026-01-14


---

## Exercise

**Goal**: Ingest data from a public API

**Instructions**:
1. Use the REST Countries API: `https://restcountries.com/v3.1/all` (the solution cell below limits the response with the `fields` query parameter)
2. Extract the fields: name, capital, region, population, area
3. Save to Bronze

Your turn:

In [15]:
# TODO: Call the REST Countries API
import requests
from pyspark.sql import functions as F
from datetime import datetime

url_countries = "https://restcountries.com/v3.1/all?fields=name,capital,region,population,area"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

response = requests.get(url_countries, headers=headers, timeout=30)

# Check the status
if response.status_code == 200:
    data_raw = response.json()
    print(f"Data retrieved: {len(data_raw)} countries found.")
    
    if len(data_raw) > 0:
        print("Example:", data_raw[0])
else:
    print(f"API error: {response.status_code}")
    print(response.text)

Data retrieved: 250 countries found.
Example: {'name': {'common': 'Antigua and Barbuda', 'official': 'Antigua and Barbuda', 'nativeName': {'eng': {'official': 'Antigua and Barbuda', 'common': 'Antigua and Barbuda'}}}, 'capital': ["Saint John's"], 'region': 'Americas', 'area': 442.0, 'population': 103603}


In [16]:
# TODO: Extract the useful fields and create a DataFrame
pays_clean = []

if 'data_raw' in locals() and data_raw:
    for item in data_raw:
        nom = item.get("name", {}).get("common", "Inconnu")
        capital_list = item.get("capital", [])
        capitale = capital_list[0] if capital_list else None
        region = item.get("region")
        population = item.get("population")
        superficie = item.get("area")
        
        # Append to the list
        pays_clean.append({
            "nom": nom,
            "capitale": capitale,
            "region": region,
            "population": population,
            "superficie": superficie
        })

    # Create the Spark DataFrame
    df_countries = spark.createDataFrame(pays_clean)

    print(f"DataFrame created with {df_countries.count()} rows.")
    df_countries.show(5)
    df_countries.printSchema()

else:
    print("Error: the variable 'data_raw' is missing or empty. Check step 1.")

DataFrame created with 250 rows.
+------------+-------------------+----------+--------+----------+
|    capitale|                nom|population|  region|superficie|
+------------+-------------------+----------+--------+----------+
|Saint John's|Antigua and Barbuda|    103603|Americas|     442.0|
|     Thimphu|             Bhutan|    784043|    Asia|   38394.0|
|        Rome|              Italy|  58927633|  Europe|  301336.0|
|    Funafuti|             Tuvalu|     10643| Oceania|      26.0|
|  The Valley|           Anguilla|     16010|Americas|      91.0|
+------------+-------------------+----------+--------+----------+
only showing top 5 rows
root
 |-- capitale: string (nullable = true)
 |-- nom: string (nullable = true)
 |-- population: long (nullable = true)
 |-- region: string (nullable = true)
 |-- superficie: double (nullable = true)
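
The extraction loop can be factored into a function that is easy to test on a single record. A minimal sketch, assuming a hypothetical helper `extract_country`; it additionally coerces `population` to `int` and `area` to `float`, which keeps the Python types uniform across rows before Spark infers the schema:

```python
def extract_country(item):
    """Keep the five exercise fields from one REST Countries record."""
    capitals = item.get("capital") or []
    population = item.get("population")
    area = item.get("area")
    return {
        "nom": (item.get("name") or {}).get("common", "Inconnu"),
        "capitale": capitals[0] if capitals else None,
        "region": item.get("region"),
        # Coerce numeric types so every row carries the same Python type
        "population": int(population) if population is not None else None,
        "superficie": float(area) if area is not None else None,
    }


# Shortened version of the example record printed by the API call
sample = {
    "name": {"common": "Antigua and Barbuda"},
    "capital": ["Saint John's"],
    "region": "Americas",
    "area": 442.0,
    "population": 103603,
}
print(extract_country(sample))
```

The list for `spark.createDataFrame` could then be built as `[extract_country(item) for item in data_raw]`.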



In [18]:
# TODO: Save to Bronze
if 'df_countries' in locals():
    from datetime import datetime
    date_ingestion_exo = datetime.now().strftime("%Y-%m-%d")
    
    chemin_bronze = f"s3a://bronze/api_countries/{date_ingestion_exo}"

    # Add traceability metadata
    df_final = df_countries \
        .withColumn("_source", F.lit("restcountries_api")) \
        .withColumn("_ingestion_date", F.lit(date_ingestion_exo))

    # Write in overwrite mode
    df_final.write.mode("overwrite").parquet(chemin_bronze)

    print(f"✅ Save succeeded in: {chemin_bronze}")
    
    # Quick read-back check
    print("Read check:")
    spark.read.parquet(chemin_bronze).select("nom", "capitale", "_ingestion_date").show(3)

else:
    print("Error: DataFrame 'df_countries' not found.")

✅ Save succeeded in: s3a://bronze/api_countries/2026-01-15
Read check:
+-------------+--------------+---------------+
|          nom|      capitale|_ingestion_date|
+-------------+--------------+---------------+
|        India|     New Delhi|     2026-01-15|
|   Martinique|Fort-de-France|     2026-01-15|
|New Caledonia|        Nouméa|     2026-01-15|
+-------------+--------------+---------------+
only showing top 3 rows


---

## Summary

In this notebook, you learned:
- How to **call a REST API** with Python requests
- How to **convert JSON** into a Spark DataFrame
- How to **flatten** nested structures
- How to **handle errors** with retries
- How to **ingest CSV files** from the Web

### Next step
In the next notebook, we will learn how to clean the data for the Silver layer.