# 🔍 Comparative Analysis of Churn Datasets

**Objective**: Identify and validate all available datasets

**Datasets to analyze**:
1. `telco_customer_churn_ibm.csv` (IBM - 948 KB)
2. `orange_telco_churn.csv` (955 KB)
3. `WA_Fn-UseC_-Telco-Customer-Churn.csv` (UCI - 955 KB)
4. `churn-bigml-80.csv` (Orange Kaggle Train - 219 KB)
5. `churn-bigml-20.csv` (Orange Kaggle Test - 56 KB)
6. `customer_churn_orange.csv` (129 KB)

---

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
import hashlib

# Configuration: show every column when a DataFrame is displayed
pd.set_option('display.max_columns', None)

print("✅ Imports succeeded")

✅ Imports succeeded


In [2]:
def analyser_dataset(fichier):
    """Run a full analysis of one dataset file."""

    chemin = Path(f"../donnees/brutes/{fichier}")

    # Load the CSV; on failure, return the error instead of raising
    try:
        df = pd.read_csv(chemin)
    except Exception as e:
        return {
            'fichier': fichier,
            'erreur': str(e),
            'charge': False
        }

    # Content hash (row values and index) used to identify duplicates
    hash_md5 = hashlib.md5(
        pd.util.hash_pandas_object(df, index=True).values
    ).hexdigest()

    # Identify the churn column
    churn_cols = [col for col in df.columns if 'churn' in col.lower()]

    return {
        'fichier': fichier,
        'charge': True,
        'shape': df.shape,
        'lignes': df.shape[0],
        'colonnes': df.shape[1],
        'taille_mo': chemin.stat().st_size / (1024**2),
        'hash': hash_md5[:8],  # first 8 characters
        'colonnes_list': list(df.columns[:5]),
        'churn_col': churn_cols[0] if churn_cols else 'Not found',
        'memoire_mb': df.memory_usage(deep=True).sum() / (1024**2),
        'df': df  # keep a reference to the DataFrame
    }

print("✅ Analysis function created")

✅ Analysis function created
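
`analyser_dataset` hashes the parsed DataFrame, so two files that differ on disk (for example in line endings) can still receive the same hash. A byte-level digest is a stricter check that could complement it. The sketch below is only an illustration under that assumption: `hash_fichier` and its `taille_bloc` parameter are hypothetical names, not part of the notebook.

```python
import hashlib
from pathlib import Path


def hash_fichier(chemin, taille_bloc=1 << 20):
    """Return the MD5 hex digest of a file's raw bytes (hypothetical helper)."""
    md5 = hashlib.md5()
    with Path(chemin).open("rb") as f:
        # Read in blocks so large files never have to fit in memory
        for bloc in iter(lambda: f.read(taille_bloc), b""):
            md5.update(bloc)
    return md5.hexdigest()
```

If two files share the DataFrame hash but not the byte-level digest, they likely hold the same table stored with different formatting.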


In [3]:
# List of files
fichiers = [
    'telco_customer_churn_ibm.csv',
    'orange_telco_churn.csv',
    'WA_Fn-UseC_-Telco-Customer-Churn.csv',
    'churn-bigml-80.csv',
    'churn-bigml-20.csv',
    'customer_churn_orange.csv'
]

# Analyze each file
print("="*80)
print("🔍 ANALYSIS OF ALL DATASETS")
print("="*80)

resultats = {}
for fichier in fichiers:
    print(f"\n📊 Analysis: {fichier}")
    resultat = analyser_dataset(fichier)
    resultats[fichier] = resultat

    if resultat['charge']:
        print(f"   ✅ Shape      : {resultat['shape']}")
        print(f"   ✅ Hash       : {resultat['hash']}")
        print(f"   ✅ Churn col  : {resultat['churn_col']}")
        print(f"   ✅ Size       : {resultat['taille_mo']:.2f} MB")
    else:
        print(f"   ❌ Error      : {resultat['erreur']}")

print("\n" + "="*80)

🔍 ANALYSIS OF ALL DATASETS

📊 Analysis: telco_customer_churn_ibm.csv
   ✅ Shape      : (7043, 21)
   ✅ Hash       : 6859fc30
   ✅ Churn col  : Churn
   ✅ Size       : 0.93 MB

📊 Analysis: orange_telco_churn.csv
   ✅ Shape      : (7043, 21)
   ✅ Hash       : 6859fc30
   ✅ Churn col  : Churn
   ✅ Size       : 0.93 MB

📊 Analysis: WA_Fn-UseC_-Telco-Customer-Churn.csv
   ✅ Shape      : (7043, 21)
   ✅ Hash       : 6859fc30
   ✅ Churn col  : Churn
   ✅ Size       : 0.93 MB

📊 Analysis: churn-bigml-80.csv
   ✅ Shape      : (2666, 20)
   ✅ Hash       : e3559936
   ✅ Churn col  : Churn
   ✅ Size       : 0.21 MB

📊 Analysis: churn-bigml-20.csv
   ✅ Shape      : (667, 20)
   ✅ Hash       : 4c8f2efc
   ✅ Churn col  : Churn
   ✅ Size       : 0.05 MB

📊 Analysis: customer_churn_orange.csv
   ✅ Shape      : (3150, 14)
   ✅ Hash       : 414fbd07
   ✅ Churn col  : Churn
   ✅ Size       : 0.13 MB
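
Three of the files above (`telco_customer_churn_ibm.csv`, `orange_telco_churn.csv` and `WA_Fn-UseC_-Telco-Customer-Churn.csv`) report the same hash, `6859fc30`, and the same shape, so they appear to hold identical content. A minimal sketch of how such duplicates could be listed automatically follows. `grouper_par_hash` and `resultats_demo` are hypothetical names that do not exist in the notebook; the demo dictionary only copies the hashes printed above and mimics the structure of `resultats`.

```python
from collections import defaultdict


def grouper_par_hash(resultats):
    """Group successfully loaded files by content hash (hypothetical helper)."""
    groupes = defaultdict(list)
    for fichier, res in resultats.items():
        if res.get('charge'):  # skip files that failed to load
            groupes[res['hash']].append(fichier)
    return dict(groupes)


# Stand-in for `resultats`, with hashes copied from the output above
resultats_demo = {
    'telco_customer_churn_ibm.csv': {'charge': True, 'hash': '6859fc30'},
    'orange_telco_churn.csv': {'charge': True, 'hash': '6859fc30'},
    'WA_Fn-UseC_-Telco-Customer-Churn.csv': {'charge': True, 'hash': '6859fc30'},
    'churn-bigml-80.csv': {'charge': True, 'hash': 'e3559936'},
    'churn-bigml-20.csv': {'charge': True, 'hash': '4c8f2efc'},
    'customer_churn_orange.csv': {'charge': True, 'hash': '414fbd07'},
}

for hash_, groupe in grouper_par_hash(resultats_demo).items():
    if len(groupe) > 1:
        print(f"{hash_}: {groupe[0]} duplicated by {groupe[1:]}")
```

In the notebook itself, the same function could be called on `resultats` directly, since it only reads the `'charge'` and `'hash'` keys.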



In [4]:
# Show a preview of the UNIQUE datasets
print("\n👀 OVERVIEW OF UNIQUE DATASETS")
print("="*80)

hashes_vus = set()
for fichier, res in resultats.items():
    # Skip files that failed to load or whose content hash was already seen
    if res['charge'] and res['hash'] not in hashes_vus:
        hashes_vus.add(res['hash'])

        print(f"\n📊 {fichier}")
        print(f"   Shape: {res['shape']}")
        print(f"   Columns: {res['colonnes_list']}...")
        print("\n   First rows:")
        display(res['df'].head(3))
        print("-"*80)


👀 OVERVIEW OF UNIQUE DATASETS

📊 telco_customer_churn_ibm.csv
   Shape: (7043, 21)
   Columns: ['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents']...

   First rows:

|  | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |

--------------------------------------------------------------------------------

📊 churn-bigml-80.csv
   Shape: (2666, 20)
   Columns: ['State', 'Account length', 'Area code', 'International plan', 'Voice mail plan']...

   First rows:

|  | State | Account length | Area code | International plan | Voice mail plan | Number vmail messages | Total day minutes | Total day calls | Total day charge | Total eve minutes | Total eve calls | Total eve charge | Total night minutes | Total night calls | Total night charge | Total intl minutes | Total intl calls | Total intl charge | Customer service calls | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | No | Yes | 25 | 265.1 | 110 | 45.07 | 197.4 | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | False |
| 1 | OH | 107 | 415 | No | Yes | 26 | 161.6 | 123 | 27.47 | 195.5 | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False |
| 2 | NJ | 137 | 415 | No | No | 0 | 243.4 | 114 | 41.38 | 121.2 | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |

--------------------------------------------------------------------------------

📊 churn-bigml-20.csv
   Shape: (667, 20)
   Columns: ['State', 'Account length', 'Area code', 'International plan', 'Voice mail plan']...

   First rows:

|  | State | Account length | Area code | International plan | Voice mail plan | Number vmail messages | Total day minutes | Total day calls | Total day charge | Total eve minutes | Total eve calls | Total eve charge | Total night minutes | Total night calls | Total night charge | Total intl minutes | Total intl calls | Total intl charge | Customer service calls | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LA | 117 | 408 | No | No | 0 | 184.5 | 97 | 31.37 | 351.6 | 80 | 29.89 | 215.8 | 90 | 9.71 | 8.7 | 4 | 2.35 | 1 | False |
| 1 | IN | 65 | 415 | No | No | 0 | 129.1 | 137 | 21.95 | 228.5 | 83 | 19.42 | 208.8 | 111 | 9.4 | 12.7 | 6 | 3.43 | 4 | True |
| 2 | NY | 161 | 415 | No | No | 0 | 332.9 | 67 | 56.59 | 317.8 | 97 | 27.01 | 160.6 | 128 | 7.23 | 5.4 | 9 | 1.46 | 4 | True |

--------------------------------------------------------------------------------

📊 customer_churn_orange.csv
   Shape: (3150, 14)
   Columns: ['Call  Failure', 'Complains', 'Subscription  Length', 'Charge  Amount', 'Seconds of Use']...

   First rows:

|  | Call Failure | Complains | Subscription Length | Charge Amount | Seconds of Use | Frequency of use | Frequency of SMS | Distinct Called Numbers | Age Group | Tariff Plan | Status | Age | Customer Value | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 0 | 38 | 0 | 4370 | 71 | 5 | 17 | 3 | 1 | 1 | 30 | 197.64 | 0 |
| 1 | 0 | 0 | 39 | 0 | 318 | 5 | 7 | 4 | 2 | 1 | 2 | 25 | 46.035 | 0 |
| 2 | 10 | 0 | 37 | 0 | 2453 | 60 | 359 | 24 | 3 | 1 | 1 | 30 | 1536.52 | 0 |

--------------------------------------------------------------------------------
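
The previews show the `Churn` column encoded three different ways: `Yes`/`No` in the Telco file, `True`/`False` in the two `churn-bigml` files, and integers in `customer_churn_orange.csv`. Any comparison across datasets would need a common encoding. The sketch below shows one possible mapping. `normaliser_churn` is a hypothetical helper, not part of the notebook, and it assumes that `1` marks a churned customer in the integer-coded file; the preview only shows `0`, so that assumption should be checked against the data.

```python
def normaliser_churn(valeur):
    """Map a raw churn label to 1 (churned) or 0 (retained); hypothetical helper."""
    texte = str(valeur).strip().lower()
    # Assumption: 'yes', True and 1 all mean the customer churned
    if texte in {'yes', 'true', '1'}:
        return 1
    if texte in {'no', 'false', '0'}:
        return 0
    # Fail loudly on anything unexpected rather than guessing
    raise ValueError(f"Unrecognized churn label: {valeur!r}")


# Labels as they appear in the previews above, plus the assumed integer 1
exemples = ['No', 'Yes', False, True, 0, 1]
print([normaliser_churn(v) for v in exemples])
```

Raising on unknown labels is a deliberate choice here: a silent default would hide encoding differences that this analysis is meant to surface.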
