# ANIP Challenge – Task 1: Data Collection & Preparation

## 🎯 Objective
This first task consists of collecting, cleaning, and preparing the data made available
(by ANIP or, where necessary, other open sources) to produce a usable, consistent,
and documented dataset.
The output of this step must allow analysis and visualization (Task 2) to begin
without ambiguity or inconsistency.

## Expected deliverables
- One or more **final datasets** (CSV/Excel), cleaned, harmonized, and documented.
- A **preparation notebook** (this file), covering:
  - Collection (scraping / API / file import)
  - Cleaning (handling missing values, duplicates, inconsistencies)
  - Harmonization (date formats, units, data types, column renaming, etc.)
- A **glossary/data dictionary** describing each field:
  - Variable name
  - Definition
  - Unit of measurement (if applicable)
  - Source and period covered
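
As an illustration of the glossary deliverable, here is a minimal sketch that renders variable descriptions as a Markdown table. The function name `build_glossary_markdown`, the dictionary keys, and the example entry are assumptions made for this sketch; the notebook does not prescribe a glossary format.

```python
from typing import Dict, List


def build_glossary_markdown(variables: List[Dict[str, str]]) -> str:
    """Render a list of variable descriptions as a Markdown table."""
    header = "| Variable | Definition | Unit | Source | Period |"
    separator = "| --- | --- | --- | --- | --- |"
    rows = [
        "| {name} | {definition} | {unit} | {source} | {period} |".format(
            name=v["name"],
            definition=v["definition"],
            unit=v.get("unit", "n/a"),  # the unit is optional
            source=v["source"],
            period=v["period"],
        )
        for v in variables
    ]
    return "\n".join([header, separator, *rows])


# Hypothetical entry, for illustration only.
example_variables = [{
    "name": "value",
    "definition": "Observed value of the indicator",
    "unit": "depends on the indicator",
    "source": "World Bank API",
    "period": "2015-2025",
}]

glossary = build_glossary_markdown(example_variables)
print(glossary)
```

The resulting string could be written to `docs/glossaire.md`, one row per column of the final dataset.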

## Project structure
- `data/raw/`: raw data (as collected)
- `data/processed/`: cleaned and harmonized data (Task 1 outputs)
- `docs/glossaire.md`: glossary/data dictionary
- `notebooks/Tache_1_Nom_Prenom_JJMM.ipynb`: this notebook

## Import configuration

In [18]:
import os
import re
import json
import time
import zipfile
import warnings
import logging
from io import BytesIO, StringIO
from pathlib import Path
from datetime import datetime
from typing import Optional, Dict, List, Union, Any
from urllib.parse import urljoin, urlparse
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from bs4 import BeautifulSoup

## Global configuration

In [19]:
@dataclass
class GlobalConfig:
    """
    GlobalConfig serves as a centralized configuration data class that stores essential
    constants and settings for the application. It provides default values for parameters related
    to country details, API configurations, request handling, and datasets. It aims to reduce
    hardcoding of key values and centralize configuration management for better maintainability.

    :ivar COUNTRY_CODE: ISO 3166-1 alpha-2 code representing the country.
    :type COUNTRY_CODE: str
    :ivar COUNTRY_NAME: The official name of the country.
    :type COUNTRY_NAME: str
    :ivar START_YEAR: The starting year for data processing or analysis.
    :type START_YEAR: int
    :ivar END_YEAR: The ending year for data processing or analysis.
    :type END_YEAR: int
    :ivar WORLD_BANK_API_URL: The base URL for the World Bank API.
    :type WORLD_BANK_API_URL: str
    :ivar INSTAD_BASE_URL: The base URL for the INSTAD application.
    :type INSTAD_BASE_URL: str
    :ivar OVERPASS_API_URL: The base URL for the Overpass API.
    :type OVERPASS_API_URL: str
    :ivar DEFAULT_PER_PAGE: The default number of items per page for paginated results.
    :type DEFAULT_PER_PAGE: int
    :ivar REQUEST_TIMEOUT: The duration (in seconds) before a request times out.
    :type REQUEST_TIMEOUT: int
    :ivar RETRY_ATTEMPTS: The number of retry attempts allowed after a failed request.
    :type RETRY_ATTEMPTS: int
    :ivar DELAY_BETWEEN_REQUESTS: The delay (in seconds) between successive requests to APIs,
        to avoid throttling or rate-limiting issues.
    :type DELAY_BETWEEN_REQUESTS: float
    :ivar DEFAULT_WB_INDICATORS: A list of default World Bank indicators used for data
        retrieval, represented by their respective codes.
    :type DEFAULT_WB_INDICATORS: List[str]
    :ivar OSM_ADMIN_LEVELS: A dictionary mapping administrative levels to their corresponding
        numeric codes as used in OpenStreetMap (OSM) data.
    :type OSM_ADMIN_LEVELS: Dict[str, str]
    :ivar EXTERNAL_CSV_URLS: A list of URLs pointing to external CSV resources used for
        additional data ingestion.
    :type EXTERNAL_CSV_URLS: List[str]
    """

    COUNTRY_CODE: str = "BJ"
    COUNTRY_NAME: str = "Bénin"

    START_YEAR: int = 2015
    END_YEAR: int = 2025

    WORLD_BANK_API_URL: str = "https://api.worldbank.org/v2"
    INSTAD_BASE_URL: str = "https://instad.bj"
    OVERPASS_API_URL: str = "https://overpass-api.de/api/interpreter"

    DEFAULT_PER_PAGE: int = 100
    REQUEST_TIMEOUT: int = 30
    RETRY_ATTEMPTS: int = 3
    DELAY_BETWEEN_REQUESTS: float = 0.5

    DEFAULT_WB_INDICATORS: List[str] = field(default_factory=lambda: [
        "SP.POP.TOTL",  # Total population
        "NY.GDP.MKTP.CD",  # GDP (current US$)
        "NY.GDP.PCAP.CD",  # GDP per capita (current US$)
        "SE.PRM.NENR",  # Net primary school enrollment rate
        "SH.DYN.MORT",  # Under-5 mortality rate (per 1,000 live births)
        "AG.LND.TOTL.K2",  # Land area (km²)
        "SL.TLF.TOTL.IN",  # Total labor force
        "SP.DYN.TFRT.IN",  # Total fertility rate
    ])

    OSM_ADMIN_LEVELS: Dict[str, str] = field(default_factory=lambda: {
        "pays": "2",
        "département": "4",
        "commune": "6"
    })

    EXTERNAL_CSV_URLS: List[str] = field(default_factory=lambda: [
        "https://data.uis.unesco.org/medias/education/SDG4.csv",
    ])
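
The sketch below shows how a dataclass configuration of this kind can be overridden per run without touching the defaults. `MiniConfig` is a hypothetical, trimmed-down stand-in for `GlobalConfig`, used only so the example runs on its own.

```python
from dataclasses import dataclass, field, replace
from typing import List


@dataclass
class MiniConfig:
    """Hypothetical, trimmed-down stand-in for GlobalConfig."""
    COUNTRY_CODE: str = "BJ"
    START_YEAR: int = 2015
    END_YEAR: int = 2025
    DEFAULT_WB_INDICATORS: List[str] = field(
        default_factory=lambda: ["SP.POP.TOTL", "NY.GDP.MKTP.CD"]
    )


default_config = MiniConfig()

# dataclasses.replace builds a new instance with selected fields overridden.
recent_config = replace(default_config, START_YEAR=2020)

# default_factory creates a fresh list for every instance built from defaults,
# so mutating one instance's list does not leak into another instance.
other_config = MiniConfig()
other_config.DEFAULT_WB_INDICATORS.append("SP.DYN.TFRT.IN")

print(default_config.START_YEAR, recent_config.START_YEAR)
print(len(default_config.DEFAULT_WB_INDICATORS), len(other_config.DEFAULT_WB_INDICATORS))
```

Using `field(default_factory=...)` for list and dict fields is what makes this safe; a plain mutable default is rejected by `dataclass` at class-definition time.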

## Logging and pandas configuration

In [20]:
def configuration_environnement():
    """
    Configures the runtime environment by suppressing specific warnings, setting logging
    parameters, configuring pandas display options, and customizing matplotlib and seaborn
    styles.

    The function is designed to improve the clarity and readability of outputs during
    data analysis and visualization tasks.

    :return: None
    """

    warnings.filterwarnings("ignore", category=UserWarning, module="bs4")
    warnings.filterwarnings("ignore", category=FutureWarning, module="pandas")

    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s | %(levelname)-8s | %(name)s | %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S"
    )

    pd.set_option("display.max_rows", 100)
    pd.set_option("display.max_columns", None)
    pd.set_option("display.float_format", "{:.2f}".format)
    pd.set_option("display.expand_frame_repr", False)
    pd.set_option("display.precision", 2)

    plt.style.use("seaborn-v0_8-whitegrid")
    plt.rcParams.update({
        "figure.figsize": (12, 8),
        "axes.titlesize": 14,
        "axes.labelsize": 12,
        "xtick.labelsize": 10,
        "ytick.labelsize": 10,
        "legend.fontsize": 10,
    })
    sns.set_palette("Set2")

    logging.info("Environnement configuré avec succès")
    logging.info(f"Début de collecte: {datetime.now().strftime('%d/%m/%Y %H:%M:%S')}")
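
One caveat worth illustrating: `logging.basicConfig` does nothing when the root logger already has handlers, which can happen in some notebook environments or when the cell is re-run. A minimal sketch, assuming Python 3.8 or later, uses `force=True` to replace existing handlers. The in-memory stream and the logger name `demo` are illustrative choices, not part of the notebook.

```python
import io
import logging

# Capture log output in memory so the effect of the configuration is visible.
stream = io.StringIO()

logging.basicConfig(
    level=logging.INFO,
    format="%(levelname)-8s | %(name)s | %(message)s",
    stream=stream,
    force=True,  # Python 3.8+: remove handlers already attached to the root logger
)

logging.getLogger("demo").info("Environment configured")

output = stream.getvalue()
print(output)
```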

## Directory manager configuration

In [21]:
class GestionnaireRepertoire:
    """
    Manages a directory structure, providing functionality to initialize, create, and retrieve
    specific directories. This class encapsulates logic for directory handling, logging
    processes, and organizing the structure based on pre-defined specifications.

    Intended for use in applications requiring consistent folder structures, such as
    data processing, logging, and result exports.

    :ivar base_dir: The base directory from which the folder structure is created
        and managed. Defaults to the current directory if not specified.
    :type base_dir: Path
    :ivar logger: Logger instance used for logging activities and debug information
        during directory operations.
    :type logger: logging.Logger
    """

    def __init__(self, base_dir: Optional[Path] = None):
        """
        Initializes an instance of the class, setting up a base directory and a logger for
        the instance. The constructor allows an optional base directory to be specified
        or defaults to the current directory.

        :param base_dir: Optional base directory for the instance.
        :type base_dir: Optional[Path]
        """
        self.base_dir = base_dir or Path(".")
        self.logger = logging.getLogger(self.__class__.__name__)
        self._directory = {}

    def _create_directory(self, name: str, path: Path) -> None:
        """
        Creates a directory at the specified path if it doesn't already exist.

        This method attempts to create a directory while logging the process. If
        the directory already exists, it logs that information as a debug statement.
        If the directory is successfully created, it logs this as an info statement.
        In case of an error during the creation process, the error is logged as an
        error statement, and the exception is re-raised.

        :param name: Name of the directory for descriptive logging.
        :type name: str
        :param path: Path object representing the location where the directory
            needs to be created.
        :type path: Path
        """
        try:
            if path.exists():
                self.logger.debug(f"{name} - {path} (existe déjà)")
            else:
                path.mkdir(parents=True, exist_ok=True)
                self.logger.info(f"{name} - {path} créé avec succès")
        except Exception as e:
            self.logger.error(f"Erreur lors de la création de {name}: {e}")
            raise

    def initialize_directory_structure(self) -> Dict[str, Path]:
        """
        Initializes and creates the directory structure for the application, ensuring
        all required folders exist based on the pre-defined structure. If a folder
        does not exist, it will be created. The directory structure includes folders
        for storing data, logs, and exported files. This function logs the number
        of directories created upon completion.

        :return: A dictionary mapping directory names to their corresponding `Path`
                 objects after initialization.
        :rtype: Dict[str, Path]
        """
        structure = {
            "data": "data",
            "raw": "data/raw",
            "processed": "data/processed",
            "final_data": "data/final_data",
            "logs": "logs",
            "exports": "exports"
        }

        # Create each directory, then record its path under its logical name.
        self._directory = {}
        for name, path in structure.items():
            full_path = self.base_dir / path
            self._create_directory(name, full_path)
            self._directory[name] = full_path

        self.logger.info(f"Structure de {len(self._directory)} dossiers créée")

        return self._directory

    def get_path(self, name: str) -> Path:
        """
        Retrieve the Path associated with a given name. If the name is not found in the
        directory, the base directory is returned as a fallback.

        :param name: The name to look up in the directory.
        :type name: str
        :return: The Path associated with the given name, or the base directory if not found.
        :rtype: Path
        """
        return self._directory.get(name, self.base_dir)
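
The directory setup relies on `Path.mkdir(parents=True, exist_ok=True)` being idempotent. The following sketch demonstrates that property in a temporary directory; `create_structure` is a hypothetical, simplified counterpart of `initialize_directory_structure` with a reduced folder list.

```python
import tempfile
from pathlib import Path
from typing import Dict


def create_structure(base_dir: Path) -> Dict[str, Path]:
    """Create a small folder tree under base_dir and return name -> path."""
    structure = {"raw": "data/raw", "processed": "data/processed", "logs": "logs"}
    paths: Dict[str, Path] = {}
    for name, relative in structure.items():
        path = base_dir / relative
        # parents=True creates "data" as needed; exist_ok=True makes reruns safe.
        path.mkdir(parents=True, exist_ok=True)
        paths[name] = path
    return paths


with tempfile.TemporaryDirectory() as tmp:
    first = create_structure(Path(tmp))
    second = create_structure(Path(tmp))  # second call must not raise
    all_exist = all(p.is_dir() for p in second.values())
    same_mapping = first == second

print(all_exist, same_mapping)
```

Because the call is safe to repeat, the notebook cell can be re-executed without first checking whether the folders exist.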

## Data collectors

### Centralized abstract base class for all collector types

In [22]:
class AbstractCollector(ABC):
    """
    Abstract base class for data collection.

    This base class provides a framework for collecting data, making HTTP requests with
    retries, and saving collected data in various formats. Subclasses must implement
    the abstract `collect_data` method to fetch specific data as required. Additional
    utility methods assist in creating HTTP sessions, handling retries, and logging
    information.

    :ivar config: Configuration object containing settings like retry attempts and
                  request timeouts.
    :type config: GlobalConfig
    :ivar logger: Logger instance for capturing information, warnings, and errors during
                  the collector's lifecycle.
    :type logger: logging.Logger
    :ivar session: Configured HTTP session for making requests with customized headers
                   for user agents and language preferences.
    :type session: requests.Session
    """

    def __init__(self, config: GlobalConfig):
        """
        Initializes the object with a configuration, sets up a logger,
        and creates an HTTP session for subsequent usage.

        :param config: The global configuration object to initialize the instance with
        :type config: GlobalConfig
        """
        self.config = config
        self.logger = logging.getLogger(self.__class__.__name__)
        self.session = self._create_session()

    @staticmethod
    def _create_session() -> requests.Session:
        """
        Create and configure a new session with custom headers.

        The session is initialized with a user-agent string tailored for
        educational research bots, and default headers for accepting JSON,
        HTML, and other content types. Default language preference is set
        to French with English as a fallback.

        :return: A configured `requests.Session` with updated headers.
        :rtype: requests.Session
        """
        session = requests.Session()
        session.headers.update({
            "User-Agent": "Mozilla/5.0 (Educational Research Bot/1.0)",
            "Accept": "application/json, text/html, */*",
            "Accept-Language": "fr,en;q=0.9",
        })
        return session

    def _make_request_with_retry(self, url: str, **kwargs) -> Optional[requests.Response]:
        """
        Retries an HTTP request a set number of times in case of failure. Logs each
        attempt and, upon persistent failure, logs an error message. Incorporates
        exponential backoff in case of retries.

        :param url: The URL to which the HTTP request is made
        :param kwargs: Additional keyword arguments to customize the request, such as
                       headers, data, or query parameters
        :return: A requests.Response object if the request is successful; None if all
                 retry attempts failed
        """
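        # Extract the HTTP method once, so retries reuse it instead of silently falling back to GET.
        method = kwargs.pop("method", "get")

        for attempt in range(self.config.RETRY_ATTEMPTS):
            try:
                response = self.session.request(method=method,
                                                url=url,
                                                timeout=self.config.REQUEST_TIMEOUT,
                                                **kwargs)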
                response.raise_for_status()
                return response
            except requests.RequestException as e:
                self.logger.warning(f"🔄 Tentative {attempt + 1}/{self.config.RETRY_ATTEMPTS} échouée pour {url}: {e}")
                if attempt < self.config.RETRY_ATTEMPTS - 1:
                    time.sleep(2 ** attempt)

        self.logger.error(f"❌ Échec définitif pour {url}")
        return None

    @abstractmethod
    def collect_data(self) -> pd.DataFrame:
        """
        Defines an abstract method to collect data, which must be implemented by
        subclasses. This method is expected to return data in the form of a
        DataFrame object from the pandas library.

        :raises TypeError: When instantiating a subclass that does not implement this method.

        :returns: A pandas DataFrame containing the collected data.
        :rtype: pd.DataFrame
        """
        pass

    def save_data(self, data: pd.DataFrame, file_path: Path, format_type: str = "csv") -> bool:
        """
        Saves a provided dataset to a chosen file format at a specified file path. The supported file
        formats are CSV, Excel, JSON, and Parquet. Metadata such as the collection time, date, and
        collector class is appended to the dataset prior to saving. Logs will capture the success
        or failure of the operation along with the file size and number of rows, if applicable.

        :param data: The pandas DataFrame containing the data to be saved.
        :param file_path: Path object representing the file to which the data will be saved.
        :param format_type: Desired file format for saving the data. Supported values are 'csv',
            'excel', 'json', and 'parquet'. Default is 'csv'.
        :return: A boolean indicating whether the data was successfully saved.
        """
        if data.empty:
            self.logger.warning("Aucune donnée à sauvegarder")
            return False

        try:
            # Work on a copy so the caller's DataFrame is not mutated, and write that copy
            # so the collection metadata actually reaches the output file.
            now = datetime.now()
            meta_data = data.copy()
            meta_data['collection_time'] = now.time()
            meta_data['collection_date'] = now.date()
            meta_data['collector_class'] = self.__class__.__name__

            if format_type.lower() == "csv":
                meta_data.to_csv(file_path, index=False, encoding="utf-8")
            elif format_type.lower() == "excel":
                meta_data.to_excel(file_path, index=False, engine="xlsxwriter")
            elif format_type.lower() == "json":
                meta_data.to_json(file_path, orient="records", force_ascii=False, indent=4, date_format="iso")
            elif format_type.lower() == "parquet":
                meta_data.to_parquet(file_path, index=False, engine="pyarrow")
            else:
                raise ValueError("Format non valide. Vous ne pouvez choisir que csv, excel, json ou parquet.")

            size_mb = file_path.stat().st_size / (1024 * 1024)
            self.logger.info(f"Sauvegardé: {file_path.name} ({len(data)} lignes, {size_mb:.2f} MB)")
            return True
        except Exception as e:
            self.logger.error(f"Erreur lors de la sauvegarde {file_path}: {e}")
            return False
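
The retry logic in `_make_request_with_retry` follows a generic pattern: try, wait `2 ** attempt` seconds, try again. The sketch below isolates that pattern without any network access. `retry_with_backoff`, the injectable `sleep` parameter, and the `flaky` function are assumptions introduced for illustration.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(func: Callable[[], T], attempts: int = 3,
                       sleep: Callable[[float], None] = time.sleep) -> Optional[T]:
    """Call func until it succeeds or attempts run out; return None on final failure."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            # Wait 1s, 2s, 4s, ... but not after the last attempt.
            if attempt < attempts - 1:
                sleep(2 ** attempt)
    return None


calls = {"count": 0}


def flaky() -> str:
    """Simulated operation that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated failure")
    return "ok"


# Record the requested delays instead of actually sleeping.
waits: List[float] = []
result = retry_with_backoff(flaky, attempts=3, sleep=waits.append)
print(result, waits)
```

Injecting `sleep` keeps the example fast and makes the backoff schedule observable, which is also convenient when unit-testing a collector.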

### World Bank data collector

In [23]:
class WorldBankDataCollector(AbstractCollector):
    """
    Handles data collection from the World Bank API.

    This class provides methods for fetching data from the World Bank for specific indicators
    over a range of years. The data is formatted into pandas DataFrames for further analysis
    and processing, with a configurable page size and retries on request failures.

    :ivar config: Configuration object containing API settings, such as base URL, country
        code, default indicators, start and end year, and delay between requests.
    :type config: GlobalConfig
    :ivar logger: Logger instance used for logging information, warnings, and errors during
        the data collection process.
    :type logger: logging.Logger
    """

    def _fetch_indicator_data(self, indicator: str, start_year: int = None, end_year: int = None,
                              per_page: int = None) -> pd.DataFrame:
        """
        Fetches indicator data from the World Bank API for a specific country within a specified time range
        and formats the data into a pandas DataFrame. The function allows filtering by start and end years
        and configuring the number of entries per page.

        :param indicator: The unique identifier of the indicator data to fetch.
        :type indicator: str
        :param start_year: The starting year for the data range (default is derived from configuration).
        :type start_year: int, optional
        :param end_year: The ending year for the data range (default is derived from configuration).
        :type end_year: int, optional
        :param per_page: The number of results per page to fetch (default is derived from configuration).
        :type per_page: int, optional

        :return: A pandas DataFrame containing the requested indicator data with columns such as:
            `indicator_code`, `indicator_name`, `country_code`, `country_name`, `year`, `value`,
            and `source`.
        :rtype: pd.DataFrame
        """
        url = f"{self.config.WORLD_BANK_API_URL}/country/{self.config.COUNTRY_CODE}/indicator/{indicator}"

        params = {
            # The World Bank API expects the year range under the "date" key.
            "date": f"{start_year or self.config.START_YEAR}:{end_year or self.config.END_YEAR}",
            "format": "json",
            "per_page": per_page or self.config.DEFAULT_PER_PAGE,
        }

        response = self._make_request_with_retry(url, params=params)

        if response is None:
            return pd.DataFrame()

        try:
            data = response.json()
            entries = data[1] if isinstance(data, list) and len(data) > 1 and data[1] else []

            record = []

            for entry in entries:
                record.append({
                    "indicator_code": entry["indicator"]["id"],
                    "indicator_name": entry["indicator"]["value"],
                    "country_code": entry["country"]["id"],
                    "country_name": entry["country"]["value"],
                    "year": pd.to_numeric(entry["date"], errors="coerce"),
                    "value": pd.to_numeric(entry["value"], errors="coerce"),
                    "source": "World Bank API",
                })

            return pd.DataFrame(record)
        except (ValueError, KeyError) as e:
            self.logger.error(f"Erreur parsing JSON pour {indicator}: {e}")
            return pd.DataFrame()

    def collect_data(self, indicators: Optional[List[str]] = None, start_year: int = None, end_year: int = None,
                     per_page: int = None) -> pd.DataFrame:
        """
        Collects data from the World Bank based on the provided indicators and time frame.

        This method facilitates fetching and aggregating data for a given list of
        indicators over a specified range of years. The data is retrieved from the
        World Bank API, consolidated into a dataframe, and returned for further
        processing.

        :param indicators: Optional list of indicators to collect data for. If not
            provided, default indicators defined in the configuration will be used.
        :type indicators: List[str], optional
        :param start_year: The start year of the data range to collect. If None,
            `START_YEAR` from the configuration is used.
        :type start_year: int, optional
        :param end_year: The end year of the data range to collect. If None,
            `END_YEAR` from the configuration is used.
        :type end_year: int, optional
        :param per_page: The number of results to retrieve per page. If None,
            `DEFAULT_PER_PAGE` from the configuration is used.
        :type per_page: int, optional
        :return: A DataFrame containing the collected data for the specified
            indicators and years. If no data is collected, an empty DataFrame is
            returned.
        :rtype: pandas.DataFrame
        """
        indicators = indicators or self.config.DEFAULT_WB_INDICATORS

        donnees = []
        self.logger.info(f"Début collecte World Bank ({len(indicators)} indicateurs)")

        for i, indicator in enumerate(indicators, start=1):
            self.logger.info(f"[{i}/{len(indicators)}] Collecte: {indicator}")

            fetch_data = self._fetch_indicator_data(indicator, start_year, end_year, per_page)
            if not fetch_data.empty:
                donnees.append(fetch_data)
                self.logger.info(f"{len(fetch_data)} enregistrements pour {indicator}")

            time.sleep(self.config.DELAY_BETWEEN_REQUESTS)

        datasets = pd.concat(donnees, ignore_index=True) if donnees else pd.DataFrame()
        self.logger.info(f"Fin collecte World Bank ({len(datasets)} enregistrements)")
        return datasets
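
To make the response handling in `_fetch_indicator_data` concrete, here is a minimal sketch that parses a payload shaped like `[metadata, entries]` into flat records. The payload below is hand-written with dummy values for illustration, not real World Bank data, and `parse_world_bank_payload` is a hypothetical helper.

```python
from typing import Any, Dict, List


def parse_world_bank_payload(payload: Any) -> List[Dict[str, Any]]:
    """Flatten a [metadata, entries] payload; return [] for any other shape."""
    if not (isinstance(payload, list) and len(payload) > 1 and payload[1]):
        return []
    records = []
    for entry in payload[1]:
        records.append({
            "indicator_code": entry["indicator"]["id"],
            "country_code": entry["country"]["id"],
            "year": int(entry["date"]),
            "value": entry["value"],  # may be None when no observation exists
        })
    return records


# Hand-written sample with dummy values (assumption: illustrative shape only).
sample_payload = [
    {"page": 1, "pages": 1, "per_page": 100, "total": 2},
    [
        {"indicator": {"id": "SP.POP.TOTL", "value": "Population, total"},
         "country": {"id": "BJ", "value": "Benin"},
         "date": "2016", "value": None},
        {"indicator": {"id": "SP.POP.TOTL", "value": "Population, total"},
         "country": {"id": "BJ", "value": "Benin"},
         "date": "2015", "value": 1.0},
    ],
]

records = parse_world_bank_payload(sample_payload)
print(len(records), records[0]["year"], records[0]["value"])
```

If the metadata element reports more than one page, a collector would need to request the remaining pages as well; with one country and an eleven-year range, a page size of 100 is normally enough.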

### Data collection via web scraping

In [24]:
class WebScrapingDataCollector(AbstractCollector):
    """
    Collects and aggregates data by scraping tables from predefined HTML web pages.

    This class is used for web scraping tasks where data is extracted from HTML tables
    and aggregated into a unified pandas DataFrame. It supports scraping from multiple
    URLs and handles metadata tagging for each scraped table, including identifying
    the source, the URL, and the table's index. The collected data can be further
    processed or analyzed.

    :ivar logger: Logger used for logging the scraping process info, warnings, and errors.
    :type logger: logging.Logger
    """

    def _scrape_html_tables(self, url: str, source_name: str, max_tables: int = 5) -> pd.DataFrame:
        """
        Scrapes tables from an HTML page at the specified URL and returns them as a pandas DataFrame.
        Each table is assigned metadata, including its source URL, source name, and an index indicating its
        position among the other scraped tables.

        :param url: The URL of the webpage containing the HTML tables.
        :type url: str
        :param source_name: The name of the data source for added metadata within the resulting DataFrame.
        :type source_name: str
        :param max_tables: The maximum number of tables to scrape from the webpage. Defaults to 5.
        :type max_tables: int
        :return: A pandas DataFrame containing the combined data from the extracted HTML tables.
                 If no tables are found or scraping fails, an empty DataFrame is returned.
        :rtype: pd.DataFrame
        """
        response = self._make_request_with_retry(url)

        if not response:
            return pd.DataFrame()

        try:
            soup = BeautifulSoup(response.content, "html.parser")
            tables = soup.find_all("table")

            self.logger.info(f"{len(tables)} tableaux trouvés sur {url}")

            scraped_tables = []
            for i, table in enumerate(tables[:max_tables]):
                try:
                    dataframe = pd.read_html(StringIO(str(table)))[0]
                    dataframe["source_url"] = url
                    dataframe["source_name"] = source_name
                    dataframe['table_index'] = i
                    scraped_tables.append(dataframe)
                    self.logger.debug(f"Tableau {i + 1}: {dataframe.shape}")
                except Exception as e:
                    self.logger.error(f"Erreur tableau {i + 1} sur {url}: {e}")

            dataset = pd.concat(scraped_tables, ignore_index=True) if scraped_tables else pd.DataFrame()
            return dataset
        except Exception as e:
            self.logger.error(f"Erreur scraping {url}: {e}")
            return pd.DataFrame()

    def collect_data(self, max_tables: int = 5) -> pd.DataFrame:
        """
        Scrapes data from predefined URLs and aggregates the results into a single DataFrame.

        The method retrieves tables from multiple sources, processes them, and concatenates
        the scraped data into one unified DataFrame. If no data is collected, an empty
        DataFrame is returned.

        :param max_tables: Maximum number of tables to scrape from each URL. Defaults to 5.
        :type max_tables: int
        :return: Aggregated dataset containing the scraped data from all specified URLs.
        :rtype: pd.DataFrame
        """
        self.logger.info("Début du web scraping")

        urls_for_scraping = {
            "instad_trimestres": "https://instad.bj/publications/publications-trimestrielles",
            "instad_mois": "https://instad.bj/publications/publications-mensuelles",
            "demographic_external": [
                "https://hub.worldpop.org/project/categories?id=3",
                "https://dhsprogram.com/data/available-datasets.cfm",
            ]
        }

        scraped_data = []

        for source_name, url_for_scraping in urls_for_scraping.items():
            # A source maps to either a single URL or a list of URLs; normalize to a list.
            urls = [url_for_scraping] if isinstance(url_for_scraping, str) else list(url_for_scraping)

            for url in urls:
                donnees = self._scrape_html_tables(url, source_name, max_tables)
                if not donnees.empty:
                    scraped_data.append(donnees)

        datasets = pd.concat(scraped_data, ignore_index=True) if scraped_data else pd.DataFrame()
        self.logger.info(f"Scraping terminé: {len(datasets)} enregistrements")
        return datasets
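
The source mapping used for scraping mixes single URLs and lists of URLs. A small generator, sketched below, flattens both cases into `(source_name, url)` pairs. `iter_source_urls` is a hypothetical helper; the URLs are taken from the mapping in `collect_data`.

```python
from typing import Dict, Iterator, List, Tuple, Union


def iter_source_urls(sources: Dict[str, Union[str, List[str]]]) -> Iterator[Tuple[str, str]]:
    """Yield (source_name, url) pairs, whether a source holds one URL or several."""
    for source_name, value in sources.items():
        urls = [value] if isinstance(value, str) else list(value)
        for url in urls:
            yield source_name, url


sources = {
    "instad_mois": "https://instad.bj/publications/publications-mensuelles",
    "demographic_external": [
        "https://hub.worldpop.org/project/categories?id=3",
        "https://dhsprogram.com/data/available-datasets.cfm",
    ],
}

pairs = list(iter_source_urls(sources))
for name, url in pairs:
    print(name, url)
```

Checking `isinstance(value, str)` on each value matters because iterating over a bare string would yield its characters rather than one URL.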

### Geographic data collection

In [25]:
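class GeographicDataCollector(AbstractCollector):
    """
    Collects geographic data (populated places and administrative boundaries)
    from OpenStreetMap through the Overpass API.
    """
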
    def _execute_overpass_query(self, query: str, data_type: str) -> pd.DataFrame:
        # For a POST request, send the query as form data in the body rather than in the URL.
        response = self._make_request_with_retry(self.config.OVERPASS_API_URL, data={"data": query}, method="post")

        if response is None:
            return pd.DataFrame()

        try:
            data = response.json()
            elements = data.get("elements", [])

            records = []
            if elements:  # `elements` is a plain list, which has no `.empty` attribute
                for element in elements:
                    if "tags" in element:
                        record = {
                            "name": element["tags"].get("name"),
                            "osm_id": element.get("id"),
                            "latitude": element.get("lat") or element.get("center", {}).get("lat"),
                            "longitude": element.get("lon") or element.get("center", {}).get("lon"),
                            "data_type": data_type,
                            "source": "OpenStreetMap",
                        }

                        if data_type == "cities":
                            record.update({
                                "place_type": element["tags"].get("place"),
                                "population": pd.to_numeric(element["tags"].get("population"), errors="coerce")
                            })
                        else:
                            record.update({
                                "admin_level": element["tags"].get("admin_level"),
                                "wikidata": element["tags"].get("wikidata")
                            })

                        records.append(record)

                self.logger.info(f"📍 {len(records)} éléments {data_type} collectés")
                return pd.DataFrame(records)
            else:
                return pd.DataFrame()

        except (ValueError, KeyError) as e:
            self.logger.error(f"Erreur parsing Overpass pour {data_type}: {e}")
            return pd.DataFrame()

    def _collecte_cities(self) -> pd.DataFrame:
        query = f"""
        [out:json][timeout:60];
        area["ISO3166-1"="{self.config.COUNTRY_CODE}"][admin_level="2"];
        (
          node(area)["place"~"city|town|village"];
          way(area)["place"~"city|town|village"];
          relation(area)["place"~"city|town|village"];
        );
        out center tags;
        """

        return self._execute_overpass_query(query, "cities")

    def _collecte_administratives_boundaries(self, admin_level: str, level_name: str) -> pd.DataFrame:
        query = f"""
        [out:json][timeout:60];
        relation["boundary"="administrative"]["admin_level"="{admin_level}"]["name"~"{self.config.COUNTRY_NAME}|Benin"];
        out center tags;
        """

        return self._execute_overpass_query(query, f"admin_boundaries_{level_name}")

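    def collect_data(self) -> Dict[str, pd.DataFrame]:
        # Minimal sketch (assumption: one DataFrame per data type is the intended result,
        # as the return annotation suggests): cities first, then each configured admin level.
        results = {"cities": self._collecte_cities()}
        for level_name, admin_level in self.config.OSM_ADMIN_LEVELS.items():
            time.sleep(self.config.DELAY_BETWEEN_REQUESTS)
            results[f"admin_boundaries_{level_name}"] = self._collecte_administratives_boundaries(
                admin_level, level_name)
        return results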
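
When reading coordinates from Overpass elements, nodes carry `lat`/`lon` directly while ways and relations requested with `out center` carry them under `center`. The sketch below shows one way to handle both, using key checks rather than `or` so that a legitimate coordinate of `0.0` is not treated as missing. `extract_coordinates` and the two sample elements are hypothetical.

```python
from typing import Any, Dict, Optional, Tuple


def extract_coordinates(element: Dict[str, Any]) -> Tuple[Optional[float], Optional[float]]:
    """Return (lat, lon) from a node, or from the center of a way/relation."""
    center = element.get("center", {})
    # Test for key presence: `a or b` would discard a valid 0.0 coordinate.
    lat = element["lat"] if "lat" in element else center.get("lat")
    lon = element["lon"] if "lon" in element else center.get("lon")
    return lat, lon


# Hypothetical elements, for illustration only.
node = {"type": "node", "id": 1, "lat": 6.5, "lon": 0.0, "tags": {"name": "A"}}
way = {"type": "way", "id": 2, "center": {"lat": 6.4, "lon": 2.3}, "tags": {"name": "B"}}

print(extract_coordinates(node))
print(extract_coordinates(way))
```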