# 📊 TourismIQ - Notebook 1: Data Collection & EDA

**Goal**: Explore the collected data and lay the groundwork for the Quality Scorer

## Available data
- **DATAtourisme**: 50k POIs (labels, descriptions, geolocation, types)
- **Opendatasoft**: 10k communes (population, area, density)
- **Open-Meteo**: 13 regions (climate, temperature, precipitation)

## Key questions
1. What is the structure of the DATAtourisme POIs?
2. What is the completeness rate of the important fields?
3. How are the POIs distributed geographically?
4. Which POI types are present?
5. Which features should the Quality Scorer use?

In [None]:
# Imports
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Plot configuration
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
%matplotlib inline

print("✅ Imports OK")

## 1. Loading the data

In [None]:
# File paths
DATA_DIR = Path("../data/raw")

pois_file = DATA_DIR / "datatourisme_pois_50k.parquet"
communes_file = DATA_DIR / "communes_population_all.parquet"
climate_file = DATA_DIR / "climate_regions.parquet"

# Check that the files exist
print("📁 Available files:")
print(f"  POIs: {pois_file.exists()} ({pois_file.stat().st_size / 1024 / 1024:.2f} MB)" if pois_file.exists() else "  POIs: ❌")
print(f"  Communes: {communes_file.exists()} ({communes_file.stat().st_size / 1024:.2f} KB)" if communes_file.exists() else "  Communes: ❌")
print(f"  Climate: {climate_file.exists()} ({climate_file.stat().st_size / 1024:.2f} KB)" if climate_file.exists() else "  Climate: ❌")

In [None]:
# Load POIs
print("📂 Loading POIs...")
df_pois = pd.read_parquet(pois_file)
print(f"✅ {len(df_pois):,} POIs loaded")
print(f"   Columns: {len(df_pois.columns)}")
print(f"   Memory: {df_pois.memory_usage(deep=True).sum() / 1024 / 1024:.2f} MB")

In [None]:
# Load communes
print("📂 Loading communes...")
df_communes = pd.read_parquet(communes_file)
print(f"✅ {len(df_communes):,} communes loaded")
print(f"   Columns: {list(df_communes.columns)}")

In [None]:
# Load climate data
print("📂 Loading climate data...")
df_climate = pd.read_parquet(climate_file)
print(f"✅ {len(df_climate)} regions loaded")
print(f"   Columns: {list(df_climate.columns)}")
df_climate.head()

## 2. Structure of the DATAtourisme POIs

In [None]:
# General overview
print("📊 General information:")
print(f"  Shape: {df_pois.shape}")
print(f"  Columns: {len(df_pois.columns)}")
print("\n📋 Column list:")
for i, col in enumerate(df_pois.columns, 1):
    dtype = df_pois[col].dtype
    non_null = df_pois[col].notna().sum()
    pct = non_null / len(df_pois) * 100
    print(f"  {i:2d}. {col:30s} - {str(dtype):10s} - {non_null:6,} ({pct:5.1f}%) non-null")

In [None]:
# Data sample
print("🔍 Sample (first POI):")
df_pois.iloc[0]

## 3. Completeness Analysis

For the Quality Scorer, we need to measure how complete the important fields are.
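
One caveat: a presence check such as `notna()` counts empty strings and empty lists as filled. The sketch below shows a stricter completeness rate on plain Python records. The helper names `is_filled` and `completeness_rate` and the toy records are hypothetical and are not part of the dataset or the notebook.

```python
def is_filled(value):
    """Return True if a value carries actual content."""
    if value is None:
        return False
    if isinstance(value, float) and value != value:  # NaN is the only float not equal to itself
        return False
    if isinstance(value, (str, list, dict, tuple)) and len(value) == 0:
        return False
    return True


def completeness_rate(records, field):
    """Share (in %) of records whose `field` is filled."""
    if not records:
        return 0.0
    filled = sum(is_filled(r.get(field)) for r in records)
    return filled / len(records) * 100


# Hypothetical toy records, not taken from the dataset
toy_pois = [
    {"rdfs:label": "Museum A", "hasContact": [{"tel": "01"}]},
    {"rdfs:label": "", "hasContact": []},
    {"rdfs:label": "Beach B", "hasContact": None},
    {"rdfs:label": "Hotel C"},
]

label_rate = completeness_rate(toy_pois, "rdfs:label")
contact_rate = completeness_rate(toy_pois, "hasContact")
print(f"rdfs:label: {label_rate:.1f}%")
print(f"hasContact: {contact_rate:.1f}%")
```

Whether empty containers should count as missing is a design choice; the stricter rule tends to lower the rates reported for list-valued fields.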

In [None]:
# Key fields for the Quality Score
key_fields = [
    '@id',           # Unique ID
    '@type',         # POI type
    'rdfs:label',    # Name/label
    'hasDescription',# Description
    'isLocatedAt',   # Location
    'hasContact',    # Contact
    'hasBeenCreatedBy', # Creator/source
    'hasBeenPublishedBy', # Publisher
]

print("📊 Completeness of key fields:")
print("\n" + "="*60)
completeness = {}
for field in key_fields:
    if field in df_pois.columns:
        non_null = df_pois[field].notna().sum()
        pct = non_null / len(df_pois) * 100
        completeness[field] = pct
        print(f"{field:25s}: {non_null:6,} / {len(df_pois):6,} ({pct:5.1f}%)")
    else:
        print(f"{field:25s}: MISSING")
print("="*60)

In [None]:
# Completeness chart
fig, ax = plt.subplots(figsize=(12, 6))
fields = list(completeness.keys())
values = list(completeness.values())

colors = ['green' if v > 80 else 'orange' if v > 50 else 'red' for v in values]
bars = ax.barh(fields, values, color=colors, alpha=0.7)

ax.set_xlabel('Completeness (%)', fontsize=12)
ax.set_title('Completeness of Key Fields - DATAtourisme POIs', fontsize=14, fontweight='bold')
ax.set_xlim(0, 100)
ax.axvline(x=80, color='green', linestyle='--', alpha=0.3, label='Excellent (>80%)')
ax.axvline(x=50, color='orange', linestyle='--', alpha=0.3, label='Medium (>50%)')
ax.grid(axis='x', alpha=0.3)
ax.legend()

# Add value labels
for i, (field, value) in enumerate(zip(fields, values)):
    ax.text(value + 2, i, f'{value:.1f}%', va='center', fontsize=10)

plt.tight_layout()
plt.show()

## 4. Extracting and Parsing Complex Fields

DATAtourisme fields are often stored as nested JSON, so they need to be parsed before use.
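
To illustrate what "nested" means here, the sketch below parses a JSON string and walks it with a small safe getter. The `dig` helper and the sample value are hypothetical; the sample only mimics the `schema:geo` shape that the extraction code in this notebook expects.

```python
import json


def dig(obj, *path, default=None):
    """Walk nested dicts/lists along `path`; return `default` if a step is missing."""
    for key in path:
        if isinstance(obj, dict) and key in obj:
            obj = obj[key]
        elif isinstance(obj, list) and isinstance(key, int) and -len(obj) <= key < len(obj):
            obj = obj[key]
        else:
            return default
    return obj


# Hypothetical value shaped like an isLocatedAt field stored as a JSON string
raw = '[{"schema:geo": {"schema:latitude": "48.8566", "schema:longitude": "2.3522"}}]'
parsed = json.loads(raw)

lat = dig(parsed, 0, "schema:geo", "schema:latitude")
missing = dig(parsed, 0, "schema:address", "schema:postalCode", default="n/a")
print(lat, missing)
```

A getter like this avoids chains of `isinstance` checks when a key may be absent at any level.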

In [None]:
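# Helper: parse JSON strings
def parse_json_field(value):
    """Parse a field that may be a JSON string or an already-parsed object."""
    # pd.isna() on a list/array returns an array, so only test scalars here
    if value is None or (isinstance(value, float) and pd.isna(value)):
        return None
    # Parquet list columns are typically loaded as NumPy arrays
    if isinstance(value, np.ndarray):
        return value.tolist()
    if isinstance(value, str):
        try:
            return json.loads(value)
        except json.JSONDecodeError:
            return value  # Not JSON: keep the raw string
    return value

print("✅ Helper parse_json_field defined")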

In [None]:
# Example: inspect @type (POI type)
print("🔍 Analysis of the @type field:")
print("\nSample values:")
for i in range(min(5, len(df_pois))):
    type_val = df_pois['@type'].iloc[i] if '@type' in df_pois.columns else None
    print(f"  {i+1}. {type_val}")

In [None]:
# Extract main types
def extract_main_type(type_val):
    """Extract the main type of a POI."""
    type_val = parse_json_field(type_val)
    if isinstance(type_val, list):
        # Take the most specific type (often the first non-generic one)
        for t in type_val:
            if t not in ['schema:Thing', 'schema:Place', 'olo:OrderedList']:
                return t
        return type_val[0] if type_val else None
    return type_val

if '@type' in df_pois.columns:
    df_pois['type_principal'] = df_pois['@type'].apply(extract_main_type)
    print("✅ Column 'type_principal' created")
    print("\n📊 Type distribution (top 20):")
    print(df_pois['type_principal'].value_counts().head(20))
else:
    print("⚠️  Column @type is missing")

In [None]:
# Visualize the type distribution
if 'type_principal' in df_pois.columns:
    top_types = df_pois['type_principal'].value_counts().head(15)

    fig, ax = plt.subplots(figsize=(12, 8))
    top_types.plot(kind='barh', ax=ax, color='steelblue', alpha=0.7)
    ax.set_xlabel('Number of POIs', fontsize=12)
    ax.set_title('Top 15 POI Types - DATAtourisme', fontsize=14, fontweight='bold')
    ax.grid(axis='x', alpha=0.3)

    # Add value labels
    for i, v in enumerate(top_types.values):
        ax.text(v + 50, i, f'{v:,}', va='center', fontsize=10)
    
    plt.tight_layout()
    plt.show()

## 5. GPS Extraction & Geographic Distribution

In [None]:
# Extract GPS coordinates
def extract_coordinates(located_at):
    """Extract (latitude, longitude) from isLocatedAt."""
    located_at = parse_json_field(located_at)
    if not isinstance(located_at, list) or not located_at:
        return None, None

    for location in located_at:
        if isinstance(location, dict) and 'schema:geo' in location:
            geo = location['schema:geo']
            if isinstance(geo, dict):
                lat = geo.get('schema:latitude')
                lon = geo.get('schema:longitude')
                # Compare with None: 0.0 is a valid coordinate
                if lat is not None and lon is not None:
                    try:
                        return float(lat), float(lon)
                    except (TypeError, ValueError):
                        pass  # Malformed value: try the next location
    return None, None

print("🗺️  Extracting GPS coordinates...")
if 'isLocatedAt' in df_pois.columns:
    coords = df_pois['isLocatedAt'].apply(extract_coordinates)
    df_pois['latitude'] = coords.apply(lambda x: x[0])
    df_pois['longitude'] = coords.apply(lambda x: x[1])

    # Stats
    pois_with_coords = df_pois['latitude'].notna().sum()
    print(f"✅ POIs with coordinates: {pois_with_coords:,} / {len(df_pois):,} ({pois_with_coords/len(df_pois)*100:.1f}%)")
else:
    print("⚠️  Column isLocatedAt is missing")

In [None]:
# Geographic distribution
if 'latitude' in df_pois.columns and df_pois['latitude'].notna().sum() > 0:
    df_geo = df_pois[df_pois['latitude'].notna()].copy()

    print(f"📍 Geographic distribution ({len(df_geo):,} POIs):")
    print(f"  Latitude: {df_geo['latitude'].min():.4f} to {df_geo['latitude'].max():.4f}")
    print(f"  Longitude: {df_geo['longitude'].min():.4f} to {df_geo['longitude'].max():.4f}")

    # Density map
    fig, ax = plt.subplots(figsize=(14, 10))

    # Density with hexbin
    hb = ax.hexbin(df_geo['longitude'], df_geo['latitude'], 
                   gridsize=50, cmap='YlOrRd', alpha=0.7, mincnt=1)

    ax.set_xlabel('Longitude', fontsize=12)
    ax.set_ylabel('Latitude', fontsize=12)
    ax.set_title('Geographic Distribution of POIs in France', fontsize=14, fontweight='bold')
    ax.grid(alpha=0.3)

    # Colorbar
    cb = plt.colorbar(hb, ax=ax)
    cb.set_label('POI density', fontsize=11)

    plt.tight_layout()
    plt.show()
else:
    print("⚠️  No GPS coordinates available")

## 6. Description Extraction & Text Analysis

In [None]:
# Extract French description
def extract_description_fr(desc_list):
    """Extract the short description in French."""
    desc_list = parse_json_field(desc_list)
    if not isinstance(desc_list, list) or not desc_list:
        return None

    for desc in desc_list:
        if isinstance(desc, dict) and 'shortDescription' in desc:
            short_desc = desc['shortDescription']
            if isinstance(short_desc, dict):
                # Try several possible keys for French
                return short_desc.get('fr') or short_desc.get('@fr') or short_desc.get('fr-FR')
            elif isinstance(short_desc, str):
                return short_desc
    return None

print("📝 Extracting descriptions...")
if 'hasDescription' in df_pois.columns:
    df_pois['description'] = df_pois['hasDescription'].apply(extract_description_fr)

    # Stats
    pois_with_desc = df_pois['description'].notna().sum()
    print(f"✅ POIs with a description: {pois_with_desc:,} / {len(df_pois):,} ({pois_with_desc/len(df_pois)*100:.1f}%)")

    # Description length
    df_pois['description_length'] = df_pois['description'].fillna('').str.len()

    print("\n📊 Description length statistics:")
    print(df_pois['description_length'].describe())
else:
    print("⚠️  Column hasDescription is missing")

In [None]:
# Distribution of description lengths
if 'description_length' in df_pois.columns:
    fig, axes = plt.subplots(1, 2, figsize=(15, 5))

    # Histogram
    desc_lengths = df_pois[df_pois['description_length'] > 0]['description_length']
    axes[0].hist(desc_lengths, bins=50, color='steelblue', alpha=0.7, edgecolor='black')
    axes[0].set_xlabel('Length (characters)', fontsize=11)
    axes[0].set_ylabel('Number of POIs', fontsize=11)
    axes[0].set_title('Description Length Distribution', fontsize=12, fontweight='bold')
    axes[0].axvline(desc_lengths.median(), color='red', linestyle='--', label=f'Median: {desc_lengths.median():.0f}')
    axes[0].legend()
    axes[0].grid(alpha=0.3)

    # Boxplot
    axes[1].boxplot(desc_lengths, vert=True)
    axes[1].set_ylabel('Length (characters)', fontsize=11)
    axes[1].set_title('Description Length Boxplot', fontsize=12, fontweight='bold')
    axes[1].grid(alpha=0.3)
    
    plt.tight_layout()
    plt.show()

In [None]:
# Sample descriptions
if 'description' in df_pois.columns:
    print("📝 Sample descriptions:\n")
    print("="*80)

    # Sample only among POIs that have a description
    df_with_desc = df_pois[df_pois['description'].notna()]
    sample_indices = df_with_desc.sample(min(3, len(df_with_desc))).index

    for i, idx in enumerate(sample_indices, 1):
        row = df_pois.loc[idx]
        nom = row.get('rdfs:label', 'N/A')
        if isinstance(nom, dict):
            nom = nom.get('fr', nom.get('@fr', 'N/A'))
        desc = row['description']
        length = len(desc) if desc else 0

        print(f"\n{i}. {nom}")
        print(f"   Type: {row.get('type_principal', 'N/A')}")
        print(f"   Length: {length} characters")
        print(f"   Description: {desc[:200]}..." if length > 200 else f"   Description: {desc}")
        print("-"*80)

## 7. Features for the Quality Scorer

Identify the features that matter for scoring POI quality.

In [None]:
# Compute completeness features
print("⚙️  Computing quality features...\n")

features_quality = {}

# 1. Has name/label
if 'rdfs:label' in df_pois.columns:
    features_quality['has_name'] = df_pois['rdfs:label'].notna().astype(int)
    print(f"✅ has_name: {features_quality['has_name'].sum():,} / {len(df_pois):,}")

# 2. Has description
if 'description' in df_pois.columns:
    features_quality['has_description'] = df_pois['description'].notna().astype(int)
    features_quality['description_length'] = df_pois['description_length']
    print(f"✅ has_description: {features_quality['has_description'].sum():,} / {len(df_pois):,}")
    print(f"✅ description_length: mean {features_quality['description_length'].mean():.1f} characters")

# 3. Has GPS
if 'latitude' in df_pois.columns:
    features_quality['has_gps'] = df_pois['latitude'].notna().astype(int)
    print(f"✅ has_gps: {features_quality['has_gps'].sum():,} / {len(df_pois):,}")

# 4. Has type
if 'type_principal' in df_pois.columns:
    features_quality['has_type'] = df_pois['type_principal'].notna().astype(int)
    print(f"✅ has_type: {features_quality['has_type'].sum():,} / {len(df_pois):,}")

# 5. Has contact
if 'hasContact' in df_pois.columns:
    features_quality['has_contact'] = df_pois['hasContact'].notna().astype(int)
    print(f"✅ has_contact: {features_quality['has_contact'].sum():,} / {len(df_pois):,}")

print(f"\n📊 Total features computed: {len(features_quality)}")

In [None]:
# Simple completeness score (baseline)
df_features = pd.DataFrame(features_quality)

# Simplified score (0-100): weighted sum of the presence flags
weights = {'has_name': 20, 'has_description': 30, 'has_gps': 20,
           'has_type': 15, 'has_contact': 15}
# Skip flags whose source column was absent, instead of raising a KeyError
df_features['completeness_score_simple'] = sum(
    df_features[col] * w for col, w in weights.items() if col in df_features.columns
)

print("🎯 Completeness score (baseline):")
print(f"  Mean: {df_features['completeness_score_simple'].mean():.1f} / 100")
print(f"  Median: {df_features['completeness_score_simple'].median():.1f} / 100")
print(f"  Std dev: {df_features['completeness_score_simple'].std():.1f}")
print("\n📊 Distribution:")
print(df_features['completeness_score_simple'].describe())

In [None]:
# Visualize the score distribution
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Histogram
axes[0].hist(df_features['completeness_score_simple'], bins=20, 
             color='steelblue', alpha=0.7, edgecolor='black')
axes[0].set_xlabel('Completeness Score (0-100)', fontsize=11)
axes[0].set_ylabel('Number of POIs', fontsize=11)
axes[0].set_title('Completeness Score Distribution (Baseline)', fontsize=12, fontweight='bold')
axes[0].axvline(df_features['completeness_score_simple'].mean(), 
                color='red', linestyle='--', label=f"Mean: {df_features['completeness_score_simple'].mean():.1f}")
axes[0].legend()
axes[0].grid(alpha=0.3)

# Bar chart by quality category
# right=False gives [0, 40), [40, 60), [60, 80), [80, 101): scores of 0 and 100 are both kept
score_categories = pd.cut(df_features['completeness_score_simple'], 
                          bins=[0, 40, 60, 80, 101], right=False,
                          labels=['Low (<40)', 'Medium (40-60)', 'Good (60-80)', 'Excellent (80-100)'])

category_counts = score_categories.value_counts().sort_index()
category_counts.plot(kind='bar', ax=axes[1], color=['red', 'orange', 'lightgreen', 'green'], alpha=0.7)
axes[1].set_xlabel('Quality Category', fontsize=11)
axes[1].set_ylabel('Number of POIs', fontsize=11)
axes[1].set_title('Breakdown by Quality Category', fontsize=12, fontweight='bold')
axes[1].grid(axis='y', alpha=0.3)
axes[1].tick_params(axis='x', rotation=45)

# Add value labels
for i, v in enumerate(category_counts.values):
    axes[1].text(i, v + 100, f'{v:,}\n({v/len(df_features)*100:.1f}%)', 
                 ha='center', fontsize=9)

plt.tight_layout()
plt.show()

## 8. Demographic Analysis (Communes)

In [None]:
# Communes overview
print("🏘️  Commune data (Opendatasoft):\n")
print(f"  Total communes: {len(df_communes):,}")
print(f"  Total population: {df_communes['population'].sum():,.0f}")
print(f"  Mean population: {df_communes['population'].mean():,.0f}")
print("\n📊 Top 10 communes by population:")
print(df_communes.nlargest(10, 'population')[['nom_commune', 'population', 'densite_hab_km2']])

In [None]:
# Population distribution
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Histogram (log scale on the y-axis)
axes[0].hist(df_communes['population'], bins=50, color='steelblue', alpha=0.7, edgecolor='black')
axes[0].set_xlabel('Population', fontsize=11)
axes[0].set_ylabel('Number of communes', fontsize=11)
axes[0].set_title('Commune Population Distribution', fontsize=12, fontweight='bold')
axes[0].set_yscale('log')
axes[0].grid(alpha=0.3)

# Density
axes[1].hist(df_communes['densite_hab_km2'], bins=50, color='coral', alpha=0.7, edgecolor='black')
axes[1].set_xlabel('Density (inhabitants/km²)', fontsize=11)
axes[1].set_ylabel('Number of communes', fontsize=11)
axes[1].set_title('Commune Density Distribution', fontsize=12, fontweight='bold')
axes[1].set_yscale('log')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

## 9. Climate Analysis (Regions)

In [None]:
# Climate overview
print("🌤️  Climate data (Open-Meteo):\n")
print(df_climate[['region', 'temp_avg_annual', 'precipitation_annual_mm', 'climate_type']].to_string(index=False))

In [None]:
# Climate charts
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# 1. Mean temperature by region
df_climate_sorted = df_climate.sort_values('temp_avg_annual')
axes[0, 0].barh(df_climate_sorted['region'], df_climate_sorted['temp_avg_annual'], 
                color='orangered', alpha=0.7)
axes[0, 0].set_xlabel('Mean Annual Temperature (°C)', fontsize=10)
axes[0, 0].set_title('Temperature by Region', fontsize=11, fontweight='bold')
axes[0, 0].grid(axis='x', alpha=0.3)

# 2. Precipitation by region
df_climate_sorted = df_climate.sort_values('precipitation_annual_mm')
axes[0, 1].barh(df_climate_sorted['region'], df_climate_sorted['precipitation_annual_mm'], 
                color='steelblue', alpha=0.7)
axes[0, 1].set_xlabel('Annual Precipitation (mm)', fontsize=10)
axes[0, 1].set_title('Precipitation by Region', fontsize=11, fontweight='bold')
axes[0, 1].grid(axis='x', alpha=0.3)

# 3. Distribution of climate types
climate_counts = df_climate['climate_type'].value_counts()
climate_counts.plot(kind='bar', ax=axes[1, 0], color='seagreen', alpha=0.7)
axes[1, 0].set_xlabel('Climate Type', fontsize=10)
axes[1, 0].set_ylabel('Number of Regions', fontsize=10)
axes[1, 0].set_title('Climate Type Distribution', fontsize=11, fontweight='bold')
axes[1, 0].tick_params(axis='x', rotation=45)
axes[1, 0].grid(axis='y', alpha=0.3)

# 4. Scatter plot: temperature vs precipitation
colors_map = {'mediterranean': 'orangered', 'oceanic': 'steelblue', 
              'continental': 'gold', 'mountain': 'purple'}
for climate_type in df_climate['climate_type'].unique():
    data = df_climate[df_climate['climate_type'] == climate_type]
    axes[1, 1].scatter(data['temp_avg_annual'], data['precipitation_annual_mm'], 
                       label=climate_type, s=100, alpha=0.7, 
                       color=colors_map.get(climate_type, 'gray'))

axes[1, 1].set_xlabel('Mean Temperature (°C)', fontsize=10)
axes[1, 1].set_ylabel('Precipitation (mm)', fontsize=10)
axes[1, 1].set_title('Temperature vs Precipitation by Type', fontsize=11, fontweight='bold')
axes[1, 1].legend()
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

## 10. Conclusions & Next Steps

### 📊 Data Summary

**DATAtourisme POIs:**
- 50k POIs collected
- Completeness varies by field (to be improved for scoring)
- Geographic distribution covers all of France
- Varied types (accommodation, restaurants, culture, activities)

**Communes & Climate:**
- 10k communes with demographic data
- 13 regions with climate data
- Ready for geographic enrichment

### 🎯 Features Identified for the Quality Scorer

**Completeness (40%):**
- has_name, has_description, has_gps, has_type, has_contact
- description_length

**Text Richness (30%):**
- description_richness (to be computed with textstat)
- nb_languages (multilingual coverage)

**Geo Context (20%):**
- commune_population, commune_density
- poi_density_radius (number of similar POIs nearby)
- climate_type
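
One possible way to compute `poi_density_radius` is to count same-type POIs within a great-circle distance. The sketch below assumes the haversine formula with a mean Earth radius of 6371 km and a brute-force scan. The record layout, the 5 km radius and the toy coordinates are hypothetical, and a spatial index would be needed at 50k POIs.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption of the haversine model)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def poi_density_radius(pois, index, radius_km):
    """Count other POIs of the same type within radius_km of pois[index]."""
    ref = pois[index]
    return sum(
        1
        for i, p in enumerate(pois)
        if i != index
        and p["type"] == ref["type"]
        and haversine_km(ref["lat"], ref["lon"], p["lat"], p["lon"]) <= radius_km
    )


# Hypothetical toy POIs
toy_pois = [
    {"type": "Museum", "lat": 48.8606, "lon": 2.3376},  # central Paris
    {"type": "Museum", "lat": 48.8600, "lon": 2.3266},  # under 1 km away
    {"type": "Hotel", "lat": 48.8610, "lon": 2.3380},   # close, but a different type
    {"type": "Museum", "lat": 43.2965, "lon": 5.3698},  # Marseille, far away
]
density = poi_density_radius(toy_pois, 0, radius_km=5.0)
print(density)
```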

**Freshness (10%):**
- last_update_days (if available)
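
The four percentages can be read as group weights in a weighted average. The sketch below assumes each group yields a sub-score in [0, 1]. Renormalising when a group is unavailable (for instance freshness) is an assumption, not something this notebook specifies, and the names `GROUP_WEIGHTS` and `quality_score` are hypothetical.

```python
# Group weights taken from the percentages listed above
GROUP_WEIGHTS = {
    "completeness": 0.40,
    "text_richness": 0.30,
    "geo_context": 0.20,
    "freshness": 0.10,
}


def quality_score(sub_scores, weights=GROUP_WEIGHTS):
    """Combine per-group sub-scores in [0, 1] into a 0-100 score.

    Groups whose sub-score is None or absent are dropped and the
    remaining weights are renormalised (an assumption of this sketch).
    """
    available = {g: w for g, w in weights.items() if sub_scores.get(g) is not None}
    total_weight = sum(available.values())
    if total_weight == 0:
        return None
    weighted = sum(sub_scores[g] * w for g, w in available.items())
    return 100 * weighted / total_weight


# Hypothetical sub-scores for one POI, with freshness unavailable
score = quality_score(
    {"completeness": 1.0, "text_richness": 0.5, "geo_context": 0.5, "freshness": None}
)
print(f"{score:.1f}")
```

Without renormalisation, a POI lacking an update date would be capped at 90, which may or may not be the intended behaviour.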

### ⏭️ Next Steps

1. **Feature Engineering** (Notebook 02):
   - Compute the 20 Quality Scorer features
   - Enrich with commune/climate data
   - Create a synthetic target (score 0-100)

2. **Quality Scorer Training** (Notebook 03):
   - Train LightGBM/XGBoost
   - Hyperparameter tuning
   - Validation (target R² > 0.75)

3. **Gap Detection** (Notebook 04):
   - HDBSCAN clustering
   - Analysis of type distribution
   - Random Forest for opportunities
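
For reference, the R² validation target can be checked with the usual coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below is a plain-Python version with hypothetical toy scores; in practice a library implementation would likely be used.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when y_true is constant")
    return 1 - ss_res / ss_tot


# Hypothetical target scores and model predictions (0-100 scale)
y_true = [20, 50, 65, 85, 100]
y_pred = [25, 45, 70, 80, 95]

r2 = r2_score(y_true, y_pred)
print(f"R² = {r2:.3f}, meets target: {r2 > 0.75}")
```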

In [None]:
# Save the enriched version for the next notebook
print("💾 Saving enriched data...")

# Combine POIs with the baseline features
df_pois_enriched = df_pois.copy()
for col in df_features.columns:
    df_pois_enriched[col] = df_features[col]

output_file = Path("../data/processed/pois_enriched_eda.parquet")
output_file.parent.mkdir(parents=True, exist_ok=True)

df_pois_enriched.to_parquet(output_file, index=False, compression='snappy')

print(f"✅ Saved: {output_file}")
print(f"   Records: {len(df_pois_enriched):,}")
print(f"   Columns: {len(df_pois_enriched.columns)}")
print(f"   Size: {output_file.stat().st_size / 1024 / 1024:.2f} MB")

In [None]:
print("\n" + "="*80)
print("✅ NOTEBOOK 1 - EDA COMPLETE")
print("="*80)
print("\n📈 Next step: Notebook 02 - Feature Engineering for the Quality Scorer")
print("\n🎯 Goal: Create the 20 ML features and the synthetic target score (0-100)")