# üéµ Pr√©diction de Churn - Service de Streaming Musical

---

## Objectif

Pr√©dire si un utilisateur va **churner** (r√©silier son abonnement) dans les **10 jours** suivant le 2018-11-20.

Un utilisateur est consid√©r√© comme ayant churn√© s'il visite la page `'Cancellation Confirmation'`.

---

## Plan du notebook

1. **Chargement et exploration des donn√©es**
2. **Nettoyage et pr√©paration**
3. **Feature Engineering**
4. **Analyse exploratoire**
5. **Mod√©lisation**
6. **√âvaluation et comparaison**
7. **Analyse des features importantes**
8. **Pr√©diction sur donn√©es de test**

## 1. Imports et Configuration

In [7]:
# Manipulation de donn√©es
import pandas as pd
import numpy as np
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (
    classification_report, confusion_matrix, 
    roc_auc_score, roc_curve, precision_recall_curve,
    average_precision_score, f1_score, balanced_accuracy_score
)

# Configuration
plt.style.use('seaborn-v0_8-whitegrid')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Couleurs personnalis√©es
COLORS = {
    'primary': '#3498db',
    'success': '#2ecc71',
    'danger': '#e74c3c',
    'warning': '#f39c12',
    'purple': '#9b59b6'
}

print("‚úÖ Imports charg√©s avec succ√®s!")

‚úÖ Imports charg√©s avec succ√®s!


## 2. Chargement des Donn√©es

In [8]:
# Charger les donn√©es
df_train_test = pd.read_parquet("data/test.parquet")
df_train = pd.read_parquet("data/train.parquet")
df = df_train.copy()
print(f"üìä Dimensions du dataset: {df.shape[0]:,} √©v√©nements, {df.shape[1]} colonnes")
print(f"üë• Nombre d'utilisateurs uniques: {df['userId'].nunique():,}")

üìä Dimensions du dataset: 17,499,636 √©v√©nements, 19 colonnes
üë• Nombre d'utilisateurs uniques: 19,140


In [9]:
# Aper√ßu des premi√®res lignes
df.head(10)

Unnamed: 0,status,gender,firstName,level,lastName,userId,ts,auth,page,sessionId,location,itemInSession,userAgent,method,length,song,artist,time,registration
0,200,M,Shlok,paid,Johnson,1749042,1538352001000,Logged In,NextSong,22683,"Dallas-Fort Worth-Arlington, TX",278,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",PUT,524.32934,Ich mache einen Spiegel - Dream Part 4,Popol Vuh,2018-10-01 00:00:01,2018-08-08 13:22:21
992,200,M,Shlok,paid,Johnson,1749042,1538352525000,Logged In,NextSong,22683,"Dallas-Fort Worth-Arlington, TX",279,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",PUT,178.02404,Monster (Album Version),Skillet,2018-10-01 00:08:45,2018-08-08 13:22:21
1360,200,M,Shlok,paid,Johnson,1749042,1538352703000,Logged In,NextSong,22683,"Dallas-Fort Worth-Arlington, TX",280,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",PUT,232.61995,Seven Nation Army,The White Stripes,2018-10-01 00:11:43,2018-08-08 13:22:21
1825,200,M,Shlok,paid,Johnson,1749042,1538352935000,Logged In,NextSong,22683,"Dallas-Fort Worth-Arlington, TX",281,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",PUT,265.50812,Under The Bridge (Album Version),Red Hot Chili Peppers,2018-10-01 00:15:35,2018-08-08 13:22:21
2366,200,M,Shlok,paid,Johnson,1749042,1538353200000,Logged In,NextSong,22683,"Dallas-Fort Worth-Arlington, TX",282,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",PUT,471.69261,Circlesong 6,Bobby McFerrin,2018-10-01 00:20:00,2018-08-08 13:22:21
3271,200,M,Shlok,paid,Johnson,1749042,1538353671000,Logged In,NextSong,22683,"Dallas-Fort Worth-Arlington, TX",283,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",PUT,266.86649,Who Can Compare,Foolish Things,2018-10-01 00:27:51,2018-08-08 13:22:21
3802,200,M,Shlok,paid,Johnson,1749042,1538353937000,Logged In,NextSong,22683,"Dallas-Fort Worth-Arlington, TX",284,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",PUT,1400.65914,Angel Dust,Gil Scott Heron,2018-10-01 00:32:17,2018-08-08 13:22:21
6585,200,M,Shlok,paid,Johnson,1749042,1538355337000,Logged In,NextSong,22683,"Dallas-Fort Worth-Arlington, TX",285,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",PUT,186.98404,Sweet And Dandy,Toots & The Maytals,2018-10-01 00:55:37,2018-08-08 13:22:21
6675,200,M,Shlok,paid,Johnson,1749042,1538355388000,Logged In,Downgrade,22683,"Dallas-Fort Worth-Arlington, TX",286,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",GET,,,,2018-10-01 00:56:28,2018-08-08 13:22:21
6961,200,M,Shlok,paid,Johnson,1749042,1538355523000,Logged In,NextSong,22683,"Dallas-Fort Worth-Arlington, TX",287,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",PUT,306.05016,On The Moon,Peter Cincotti,2018-10-01 00:58:43,2018-08-08 13:22:21


In [10]:
# Informations sur les colonnes
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 17499636 entries, 0 to 25661583
Data columns (total 19 columns):
 #   Column         Dtype         
---  ------         -----         
 0   status         int64         
 1   gender         object        
 2   firstName      object        
 3   level          object        
 4   lastName       object        
 5   userId         object        
 6   ts             int64         
 7   auth           object        
 8   page           object        
 9   sessionId      int64         
 10  location       object        
 11  itemInSession  int64         
 12  userAgent      object        
 13  method         object        
 14  length         float64       
 15  song           object        
 16  artist         object        
 17  time           datetime64[us]
 18  registration   datetime64[us]
dtypes: datetime64[us](2), float64(1), int64(4), object(12)
memory usage: 2.6+ GB


In [11]:
# Statistiques descriptives
df.describe()

Unnamed: 0,status,ts,sessionId,itemInSession,length,time,registration
count,17499640.0,17499640.0,17499640.0,17499640.0,14291430.0,17499636,17499636
mean,209.1387,1540428000000.0,84802.94,105.5937,248.7135,2018-10-25 00:47:01.161927,2018-08-25 04:40:21.543066
min,200.0,1538352000000.0,1.0,0.0,0.522,2018-10-01 00:00:01,2017-10-14 22:05:25
25%,200.0,1539340000000.0,25159.0,26.0,199.8885,2018-10-12 10:33:57.750000,2018-08-10 21:14:59
50%,200.0,1540397000000.0,79038.0,66.0,234.0828,2018-10-24 15:58:54,2018-09-05 18:35:50
75%,200.0,1541500000000.0,138368.0,144.0,276.8714,2018-11-06 10:25:35,2018-09-20 17:24:57
max,404.0,1542672000000.0,207003.0,1426.0,3024.666,2018-11-20 00:00:00,2018-11-19 23:34:34
std,30.2305,1233485000.0,61414.27,116.8854,97.22845,,


## 3. Nettoyage et Pr√©paration

In [12]:
# Conversion des timestamps
df['time'] = pd.to_datetime(df['time'])
df['registration'] = pd.to_datetime(df['registration'])

# Date de r√©f√©rence (fin de la p√©riode d'observation)
REFERENCE_DATE = pd.to_datetime('2018-11-20')

print(f"üìÖ P√©riode des donn√©es: du {df['time'].min()} au {df['time'].max()}")
print(f"üìÖ Date de r√©f√©rence: {REFERENCE_DATE}")

üìÖ P√©riode des donn√©es: du 2018-10-01 00:00:01 au 2018-11-20 00:00:00
üìÖ Date de r√©f√©rence: 2018-11-20 00:00:00


In [13]:
# V√©rification des valeurs manquantes
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({'Manquants': missing, 'Pourcentage (%)': missing_pct})
missing_df[missing_df['Manquants'] > 0]

Unnamed: 0,Manquants,Pourcentage (%)
length,3208203,18.33
song,3208203,18.33
artist,3208203,18.33


In [17]:
# Creating cancellation in following ten days column
import numpy as np

cancellation_events = df_train[df_train['page'] == 'Cancellation Confirmation'].copy()
cancellation_events = cancellation_events[['userId', 'time']].rename(columns={'time': 'churn_time'})

df_train = df_train.merge(cancellation_events, on='userId', how='left')

df_train['days_until_churn'] = (df_train['churn_time'] - df_train['time']).dt.total_seconds() / (24 * 3600)

df_train['will_churn_10days'] = ((df_train['days_until_churn'] >= 0) & 
                                   (df_train['days_until_churn'] <= 10)).astype(int)

df = df_train.drop(['churn_time', 'days_until_churn'], axis=1)

In [18]:
# Distribution de la variable cible
churn_by_user = df.groupby('userId')['will_churn_10days'].first()
churn_counts = churn_by_user.value_counts()

print("üéØ Distribution de la variable cible (par utilisateur):")
print(f"   Non-Churn (0): {churn_counts[0]:,} ({churn_counts[0]/len(churn_by_user)*100:.2f}%)")
print(f"   Churn (1):     {churn_counts[1]:,} ({churn_counts[1]/len(churn_by_user)*100:.2f}%)")

üéØ Distribution de la variable cible (par utilisateur):
   Non-Churn (0): 17,738 (92.68%)
   Churn (1):     1,402 (7.32%)


In [None]:
# Types de pages visit√©es
print("üìÑ Types de pages visit√©es:")
df['page'].value_counts()

## 4. Feature Engineering

Nous allons cr√©er des **features agr√©g√©es au niveau utilisateur** √† partir des √©v√©nements.

In [None]:
def create_user_features(df, reference_date):
    """
    Cr√©e des features agr√©g√©es au niveau utilisateur.
    
    Parameters:
    -----------
    df : DataFrame
        DataFrame contenant les √©v√©nements utilisateur
    reference_date : datetime
        Date de r√©f√©rence pour le calcul des features temporelles
    
    Returns:
    --------
    DataFrame avec une ligne par utilisateur et les features calcul√©es
    """
    
    # Filtrer les √©v√©nements avant la date de r√©f√©rence
    df_filtered = df[df['time'] <= reference_date].copy()
    
    # DataFrame pour stocker les features
    user_features = pd.DataFrame()
    user_features['userId'] = df_filtered['userId'].unique()
    
    # =========================================================================
    # FEATURES D√âMOGRAPHIQUES
    # =========================================================================
    user_info = df_filtered.groupby('userId').agg({
        'gender': 'first',
        'level': 'last',  # Dernier niveau connu
        'registration': 'first',
        'will_churn_10days': 'first'
    }).reset_index()
    
    user_features = user_features.merge(user_info, on='userId', how='left')
    
    # Anciennet√© (jours depuis inscription)
    user_features['days_since_registration'] = (
        reference_date - user_features['registration']
    ).dt.days
    
    # =========================================================================
    # FEATURES D'ACTIVIT√â GLOBALE
    # =========================================================================
    activity = df_filtered.groupby('userId').agg({
        'ts': 'count',  # Nombre total d'√©v√©nements
        'sessionId': 'nunique',  # Nombre de sessions uniques
        'time': ['min', 'max'],  # Premi√®re et derni√®re activit√©
        'length': ['sum', 'mean', 'count'],  # Dur√©e d'√©coute
    })
    activity.columns = ['_'.join(col).strip() for col in activity.columns]
    activity = activity.reset_index()
    activity.columns = ['userId', 'total_events', 'unique_sessions', 
                        'first_activity', 'last_activity',
                        'total_listening_time', 'avg_song_length', 'songs_with_length']
    
    user_features = user_features.merge(activity, on='userId', how='left')
    
    # R√©cence (jours depuis la derni√®re activit√©)
    user_features['days_since_last_activity'] = (
        reference_date - pd.to_datetime(user_features['last_activity'])
    ).dt.days
    
    # Dur√©e d'activit√© (jours entre premi√®re et derni√®re activit√©)
    user_features['activity_span_days'] = (
        pd.to_datetime(user_features['last_activity']) - 
        pd.to_datetime(user_features['first_activity'])
    ).dt.days + 1
    
    # Fr√©quence d'utilisation
    user_features['events_per_day'] = (
        user_features['total_events'] / user_features['activity_span_days']
    ).replace([np.inf, -np.inf], 0)
    
    user_features['sessions_per_day'] = (
        user_features['unique_sessions'] / user_features['activity_span_days']
    ).replace([np.inf, -np.inf], 0)
    
    # =========================================================================
    # FEATURES PAR TYPE DE PAGE
    # =========================================================================
    important_pages = [
        'NextSong', 'Home', 'Thumbs Up', 'Thumbs Down', 
        'Add to Playlist', 'Add Friend', 'Roll Advert',
        'Downgrade', 'Cancel', 'Submit Downgrade', 'Error',
        'Help', 'Settings', 'Logout', 'Upgrade', 'Submit Upgrade'
    ]
    
    page_counts = df_filtered.groupby(['userId', 'page']).size().unstack(fill_value=0)
    
    for page in important_pages:
        col_name = f'page_{page.lower().replace(" ", "_")}'
        if page in page_counts.columns:
            page_counts_temp = page_counts[[page]].reset_index()
            page_counts_temp.columns = ['userId', col_name]
            user_features = user_features.merge(page_counts_temp, on='userId', how='left')
            user_features[col_name] = user_features[col_name].fillna(0)
        else:
            user_features[col_name] = 0
    
    # =========================================================================
    # FEATURES DE SIGNAUX DE CHURN
    # =========================================================================
    # Visites sur pages de r√©siliation/downgrade
    user_features['churn_signals'] = (
        user_features.get('page_downgrade', 0) + 
        user_features.get('page_cancel', 0) + 
        user_features.get('page_submit_downgrade', 0)
    )
    
    # Ratio Thumbs Down / Thumbs Up
    user_features['thumbs_ratio'] = (
        user_features.get('page_thumbs_down', 0) / 
        (user_features.get('page_thumbs_up', 0) + 1)
    )
    
    # Taux d'erreurs
    user_features['error_rate'] = (
        user_features.get('page_error', 0) / user_features['total_events']
    ).fillna(0)
    
    # =========================================================================
    # FEATURES D'ENGAGEMENT
    # =========================================================================
    # Ratio de chansons √©cout√©es sur total d'√©v√©nements
    user_features['song_event_ratio'] = (
        user_features.get('page_nextsong', 0) / user_features['total_events']
    ).fillna(0)
    
    # Interactions positives (Thumbs Up + Add to Playlist + Add Friend)
    user_features['positive_interactions'] = (
        user_features.get('page_thumbs_up', 0) + 
        user_features.get('page_add_to_playlist', 0) + 
        user_features.get('page_add_friend', 0)
    )
    
    # Taux d'interactions positives
    user_features['positive_interaction_rate'] = (
        user_features['positive_interactions'] / user_features['total_events']
    ).fillna(0)
    
    # =========================================================================
    # FEATURES TEMPORELLES
    # =========================================================================
    # Derni√®re semaine vs reste de la p√©riode
    one_week_before = reference_date - pd.Timedelta(days=7)
    
    last_week = df_filtered[df_filtered['time'] >= one_week_before].groupby('userId').size()
    last_week = last_week.reset_index()
    last_week.columns = ['userId', 'events_last_week']
    
    user_features = user_features.merge(last_week, on='userId', how='left')
    user_features['events_last_week'] = user_features['events_last_week'].fillna(0)
    
    # Tendance d'activit√© (activit√© r√©cente vs ancienne)
    user_features['activity_trend'] = (
        user_features['events_last_week'] / (user_features['total_events'] + 1)
    )
    
    # =========================================================================
    # FEATURES PAR NIVEAU (paid/free)
    # =========================================================================
    # Historique des changements de niveau
    level_changes = df_filtered.groupby('userId')['level'].nunique().reset_index()
    level_changes.columns = ['userId', 'level_changes']
    user_features = user_features.merge(level_changes, on='userId', how='left')
    user_features['has_changed_level'] = (user_features['level_changes'] > 1).astype(int)
    
    return user_features

In [None]:
# Cr√©ation des features
print("‚öôÔ∏è Cr√©ation des features utilisateur...")
user_df = create_user_features(df, REFERENCE_DATE)

print(f"‚úÖ Dataset agr√©g√©: {user_df.shape[0]:,} utilisateurs, {user_df.shape[1]} colonnes")

In [None]:
# Aper√ßu du dataset agr√©g√©
user_df.head(10)

In [None]:
# Liste des features cr√©√©es
feature_cols = [col for col in user_df.columns 
                if col not in ['userId', 'will_churn_10days', 'registration', 
                               'first_activity', 'last_activity', 'gender', 'level']]

print(f"üìä {len(feature_cols)} features cr√©√©es:")
for i, col in enumerate(feature_cols, 1):
    print(f"  {i:2d}. {col}")

## 5. Analyse Exploratoire

In [None]:
# Distribution du churn
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# 1. Pie chart du churn
ax1 = axes[0]
churn_counts = user_df['will_churn_10days'].value_counts()
colors = [COLORS['success'], COLORS['danger']]
ax1.pie(churn_counts, labels=['Non-Churn', 'Churn'], autopct='%1.1f%%', 
        colors=colors, startangle=90, explode=[0, 0.05])
ax1.set_title('Distribution du Churn', fontsize=14, fontweight='bold')

# 2. Churn par niveau (paid/free)
ax2 = axes[1]
churn_by_level = user_df.groupby(['level', 'will_churn_10days']).size().unstack(fill_value=0)
churn_by_level_pct = churn_by_level.div(churn_by_level.sum(axis=1), axis=0) * 100
churn_by_level_pct.plot(kind='bar', ax=ax2, color=colors, edgecolor='black')
ax2.set_title('Taux de Churn par Niveau', fontsize=14, fontweight='bold')
ax2.set_xlabel('Niveau', fontsize=12)
ax2.set_ylabel('Pourcentage (%)', fontsize=12)
ax2.legend(['Non-Churn', 'Churn'], loc='upper right')
ax2.tick_params(axis='x', rotation=0)

# 3. Churn par genre
ax3 = axes[2]
churn_by_gender = user_df.groupby(['gender', 'will_churn_10days']).size().unstack(fill_value=0)
churn_by_gender_pct = churn_by_gender.div(churn_by_gender.sum(axis=1), axis=0) * 100
churn_by_gender_pct.plot(kind='bar', ax=ax3, color=colors, edgecolor='black')
ax3.set_title('Taux de Churn par Genre', fontsize=14, fontweight='bold')
ax3.set_xlabel('Genre', fontsize=12)
ax3.set_ylabel('Pourcentage (%)', fontsize=12)
ax3.legend(['Non-Churn', 'Churn'], loc='upper right')
ax3.tick_params(axis='x', rotation=0)

plt.tight_layout()
plt.show()

In [None]:
# Comparaison des distributions par statut de churn
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

features_to_plot = [
    ('total_events', '√âv√©nements Totaux'),
    ('days_since_last_activity', 'Jours depuis derni√®re activit√©'),
    ('unique_sessions', 'Sessions Uniques'),
    ('positive_interactions', 'Interactions Positives'),
    ('events_per_day', '√âv√©nements par Jour'),
    ('churn_signals', 'Signaux de Churn')
]

for idx, (feature, title) in enumerate(features_to_plot):
    ax = axes[idx // 3, idx % 3]
    
    # Boxplot
    data_no_churn = user_df[user_df['will_churn_10days'] == 0][feature]
    data_churn = user_df[user_df['will_churn_10days'] == 1][feature]
    
    bp = ax.boxplot([data_no_churn, data_churn], 
                    labels=['Non-Churn', 'Churn'],
                    patch_artist=True)
    
    bp['boxes'][0].set_facecolor(COLORS['success'])
    bp['boxes'][1].set_facecolor(COLORS['danger'])
    
    ax.set_title(title, fontsize=12, fontweight='bold')
    ax.set_ylabel(feature)

plt.tight_layout()
plt.show()

In [None]:
# Statistiques comparatives
print("üìà Statistiques comparatives Churn vs Non-Churn:")
print("="*70)

comparison_cols = ['total_events', 'unique_sessions', 'days_since_last_activity', 
                   'events_per_day', 'page_nextsong', 'positive_interactions',
                   'churn_signals', 'thumbs_ratio', 'activity_trend']

comparison = user_df.groupby('will_churn_10days')[comparison_cols].mean().T
comparison.columns = ['Non-Churn', 'Churn']
comparison['Diff (%)'] = ((comparison['Churn'] - comparison['Non-Churn']) / comparison['Non-Churn'] * 100).round(1)
comparison = comparison.round(2)
comparison

In [None]:
# Matrice de corr√©lation
fig, ax = plt.subplots(figsize=(14, 12))

# S√©lectionner les features num√©riques
numeric_cols = user_df[feature_cols].select_dtypes(include=[np.number]).columns.tolist()
numeric_cols.append('will_churn_10days')

corr_matrix = user_df[numeric_cols].corr()

# Heatmap
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=False, cmap='RdYlBu_r', 
            center=0, ax=ax, linewidths=0.5)
ax.set_title('Matrice de Corr√©lation des Features', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

In [None]:
# Corr√©lations avec la cible
target_corr = user_df[numeric_cols].corr()['will_churn_10days'].drop('will_churn_10days').sort_values(key=abs, ascending=False)

fig, ax = plt.subplots(figsize=(10, 8))
colors = [COLORS['danger'] if x > 0 else COLORS['primary'] for x in target_corr.values]
target_corr.plot(kind='barh', ax=ax, color=colors)
ax.set_xlabel('Corr√©lation avec le Churn', fontsize=12)
ax.set_title('Corr√©lation des Features avec la Variable Cible', fontsize=14, fontweight='bold')
ax.axvline(x=0, color='black', linestyle='-', linewidth=0.5)
ax.grid(True, axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

## 6. Pr√©paration pour la Mod√©lisation

In [None]:
# Encodage des variables cat√©gorielles
user_df['gender_encoded'] = LabelEncoder().fit_transform(user_df['gender'].fillna('Unknown'))
user_df['level_encoded'] = LabelEncoder().fit_transform(user_df['level'].fillna('Unknown'))

# Liste finale des features
final_features = feature_cols + ['gender_encoded', 'level_encoded']

print(f"üìä Nombre total de features: {len(final_features)}")

In [None]:
# Pr√©paration X et y
X = user_df[final_features].fillna(0)
y = user_df['will_churn_10days']

print(f"üìä Dimensions:")
print(f"   X: {X.shape}")
print(f"   y: {y.shape}")
print(f"   Taux de churn: {y.mean()*100:.2f}%")

In [None]:
# Division train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"üìÇ Division des donn√©es:")
print(f"   Train: {X_train.shape[0]:,} utilisateurs ({y_train.mean()*100:.2f}% churn)")
print(f"   Test:  {X_test.shape[0]:,} utilisateurs ({y_test.mean()*100:.2f}% churn)")

In [None]:
# Normalisation
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("‚úÖ Normalisation effectu√©e")

## 7. Mod√©lisation

In [None]:
# D√©finition des mod√®les
models = {
    'Logistic Regression': LogisticRegression(
        class_weight='balanced', 
        max_iter=1000, 
        random_state=42
    ),
    'Random Forest': RandomForestClassifier(
        n_estimators=200,
        max_depth=10,
        min_samples_split=5,
        class_weight='balanced',
        random_state=42,
        n_jobs=-1
    ),
    'Gradient Boosting': GradientBoostingClassifier(
        n_estimators=200,
        max_depth=5,
        learning_rate=0.1,
        random_state=42
    )
}

print(f"üì¶ {len(models)} mod√®les √† entra√Æner")

In [None]:
# Entra√Ænement et √©valuation
results = {}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, model in models.items():
    print(f"\n{'='*60}")
    print(f"üìå {name}")
    print(f"{'='*60}")
    
    # Utiliser les donn√©es scal√©es pour la r√©gression logistique
    if name == 'Logistic Regression':
        X_tr, X_te = X_train_scaled, X_test_scaled
    else:
        X_tr, X_te = X_train, X_test
    
    # Cross-validation
    cv_scores = cross_val_score(model, X_tr, y_train, cv=cv, scoring='roc_auc')
    print(f"   CV ROC-AUC: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")
    
    # Entra√Ænement final
    model.fit(X_tr, y_train)
    
    # Pr√©dictions
    y_pred = model.predict(X_te)
    y_pred_proba = model.predict_proba(X_te)[:, 1]
    
    # M√©triques
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    f1 = f1_score(y_test, y_pred)
    ap = average_precision_score(y_test, y_pred_proba)
    
    print(f"   Test ROC-AUC: {roc_auc:.4f}")
    print(f"   Test F1-Score: {f1:.4f}")
    print(f"   Test Average Precision: {ap:.4f}")
    
    # Stockage des r√©sultats
    results[name] = {
        'model': model,
        'cv_auc': cv_scores.mean(),
        'cv_std': cv_scores.std(),
        'test_auc': roc_auc,
        'test_f1': f1,
        'test_ap': ap,
        'y_pred': y_pred,
        'y_pred_proba': y_pred_proba
    }

In [None]:
# Rapport de classification d√©taill√© pour chaque mod√®le
for name, res in results.items():
    print(f"\n{'='*60}")
    print(f"üìä Classification Report - {name}")
    print(f"{'='*60}")
    print(classification_report(y_test, res['y_pred'], target_names=['Non-Churn', 'Churn']))

## 8. Comparaison des Mod√®les

In [None]:
# Tableau r√©capitulatif
results_df = pd.DataFrame({
    'Model': list(results.keys()),
    'CV ROC-AUC': [r['cv_auc'] for r in results.values()],
    'CV Std': [r['cv_std'] for r in results.values()],
    'Test ROC-AUC': [r['test_auc'] for r in results.values()],
    'Test F1': [r['test_f1'] for r in results.values()],
    'Test AP': [r['test_ap'] for r in results.values()]
}).sort_values('Test ROC-AUC', ascending=False)

print("üìä R√©capitulatif des performances:")
results_df.round(4)

In [None]:
# Meilleur mod√®le
best_model_name = results_df.iloc[0]['Model']
best_model = results[best_model_name]['model']

print(f"üèÜ Meilleur mod√®le: {best_model_name}")
print(f"   ROC-AUC: {results[best_model_name]['test_auc']:.4f}")

In [None]:
# Courbes ROC et Precision-Recall
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# ROC Curves
ax1 = axes[0]
colors_list = [COLORS['primary'], COLORS['success'], COLORS['purple']]

for (name, res), color in zip(results.items(), colors_list):
    fpr, tpr, _ = roc_curve(y_test, res['y_pred_proba'])
    ax1.plot(fpr, tpr, label=f"{name} (AUC={res['test_auc']:.3f})", 
             linewidth=2, color=color)

ax1.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random')
ax1.set_xlabel('False Positive Rate', fontsize=12)
ax1.set_ylabel('True Positive Rate', fontsize=12)
ax1.set_title('Courbes ROC', fontsize=14, fontweight='bold')
ax1.legend(loc='lower right')
ax1.grid(True, alpha=0.3)

# Precision-Recall Curves
ax2 = axes[1]
for (name, res), color in zip(results.items(), colors_list):
    precision, recall, _ = precision_recall_curve(y_test, res['y_pred_proba'])
    ax2.plot(recall, precision, label=f"{name} (AP={res['test_ap']:.3f})", 
             linewidth=2, color=color)

ax2.set_xlabel('Recall', fontsize=12)
ax2.set_ylabel('Precision', fontsize=12)
ax2.set_title('Courbes Precision-Recall', fontsize=14, fontweight='bold')
ax2.legend(loc='upper right')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Matrices de confusion
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for idx, (name, res) in enumerate(results.items()):
    ax = axes[idx]
    cm = confusion_matrix(y_test, res['y_pred'])
    
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax,
                xticklabels=['Non-Churn', 'Churn'],
                yticklabels=['Non-Churn', 'Churn'])
    ax.set_xlabel('Pr√©dit')
    ax.set_ylabel('R√©el')
    ax.set_title(f'{name}', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

## 9. Importance des Features

In [None]:
# Feature importance du Random Forest
rf_model = results['Random Forest']['model']

feature_importance = pd.DataFrame({
    'feature': final_features,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print("üîç Top 15 features les plus importantes (Random Forest):")
feature_importance.head(15)

In [None]:
# Visualisation de l'importance des features
fig, ax = plt.subplots(figsize=(10, 8))

top_features = feature_importance.head(15)
colors = plt.cm.RdYlGn_r(np.linspace(0.2, 0.8, len(top_features)))

bars = ax.barh(range(len(top_features)), top_features['importance'], color=colors)
ax.set_yticks(range(len(top_features)))
ax.set_yticklabels(top_features['feature'])
ax.invert_yaxis()
ax.set_xlabel('Importance', fontsize=12)
ax.set_title('Top 15 Features les Plus Importantes (Random Forest)', 
             fontsize=14, fontweight='bold')
ax.grid(True, axis='x', alpha=0.3)

# Ajouter les valeurs
for i, (idx, row) in enumerate(top_features.iterrows()):
    ax.text(row['importance'] + 0.005, i, f"{row['importance']:.3f}", 
            va='center', fontsize=9)

plt.tight_layout()
plt.show()

In [None]:
# Coefficients de la r√©gression logistique
lr_model = results['Logistic Regression']['model']

lr_coef = pd.DataFrame({
    'feature': final_features,
    'coefficient': lr_model.coef_[0]
}).sort_values('coefficient', key=abs, ascending=False)

print("üîç Top 15 coefficients (Logistic Regression):")
lr_coef.head(15)

## 10. Sauvegarde du Mod√®le

In [None]:
import pickle

# Sauvegarde du meilleur mod√®le
model_data = {
    'model': best_model,
    'scaler': scaler,
    'feature_cols': final_features,
    'model_name': best_model_name
}

with open('best_model.pkl', 'wb') as f:
    pickle.dump(model_data, f)

print(f"‚úÖ Mod√®le '{best_model_name}' sauvegard√© dans 'best_model.pkl'")

## 11. Pr√©diction sur Donn√©es de Test

Voici le code pour appliquer le mod√®le sur de nouvelles donn√©es.

In [None]:
def predict_churn(df_test, reference_date=REFERENCE_DATE):
    """
    Applique le mod√®le de churn sur de nouvelles donn√©es.
    
    Parameters:
    -----------
    df_test : DataFrame
        Donn√©es de test au format √©v√©nement
    reference_date : datetime
        Date de r√©f√©rence
    
    Returns:
    --------
    DataFrame avec userId et probabilit√© de churn
    """
    # Charger le mod√®le
    with open('best_model.pkl', 'rb') as f:
        saved = pickle.load(f)
    
    model = saved['model']
    scaler = saved['scaler']
    feature_cols = saved['feature_cols']
    model_name = saved['model_name']
    
    print(f"üì¶ Mod√®le charg√©: {model_name}")
    
    # Pr√©parer les timestamps
    df_test = df_test.copy()
    df_test['time'] = pd.to_datetime(df_test['time'])
    df_test['registration'] = pd.to_datetime(df_test['registration'])
    
    # Cr√©er les features
    test_features = create_user_features(df_test, reference_date)
    
    # Encodage
    test_features['gender_encoded'] = LabelEncoder().fit_transform(
        test_features['gender'].fillna('Unknown')
    )
    test_features['level_encoded'] = LabelEncoder().fit_transform(
        test_features['level'].fillna('Unknown')
    )
    
    # S'assurer que toutes les colonnes existent
    for col in feature_cols:
        if col not in test_features.columns:
            test_features[col] = 0
    
    # Pr√©parer X
    X_new = test_features[feature_cols].fillna(0)
    
    # Pr√©dictions
    if model_name == 'Logistic Regression':
        X_new_scaled = scaler.transform(X_new)
        predictions = model.predict_proba(X_new_scaled)[:, 1]
    else:
        predictions = model.predict_proba(X_new)[:, 1]
    
    # R√©sultats
    submission = pd.DataFrame({
        'userId': test_features['userId'],
        'churn_probability': predictions
    })
    
    print(f"‚úÖ Pr√©dictions g√©n√©r√©es pour {len(submission):,} utilisateurs")
    
    return submission

In [None]:
# Exemple d'utilisation (d√©commenter pour utiliser)
# df_test = pd.read_csv('df_test.csv')
# submission = predict_churn(df_test)
# submission.to_csv('submission.csv', index=False)
# submission.head(10)

## 12. R√©sum√© et Conclusions

In [None]:
print("""
‚ïî‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïó
‚ïë                    R√âSUM√â DU PROJET DE CHURN                         ‚ïë
‚ïö‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïù
""")

print(f"""
üìä DONN√âES
{'‚îÄ'*50}
‚Ä¢ √âv√©nements totaux: {df.shape[0]:,}
‚Ä¢ Utilisateurs uniques: {user_df.shape[0]:,}
‚Ä¢ Features cr√©√©es: {len(final_features)}
‚Ä¢ Taux de churn: {y.mean()*100:.2f}%

üèÜ MEILLEUR MOD√àLE: {best_model_name}
{'‚îÄ'*50}
‚Ä¢ ROC-AUC: {results[best_model_name]['test_auc']:.4f}
‚Ä¢ F1-Score: {results[best_model_name]['test_f1']:.4f}
‚Ä¢ Average Precision: {results[best_model_name]['test_ap']:.4f}

üîë TOP 5 FEATURES PR√âDICTIVES
{'‚îÄ'*50}""")

for i, row in feature_importance.head(5).iterrows():
    print(f"  ‚Ä¢ {row['feature']}: {row['importance']:.4f}")

print(f"""
üí° INSIGHTS CL√âS
{'‚îÄ'*50}
1. La r√©cence d'activit√© est le facteur #1 de pr√©diction du churn
2. Les utilisateurs inactifs depuis longtemps ont ~50% plus de risque
3. L'engagement positif (likes, playlists) prot√®ge contre le churn
4. Les visites sur Downgrade/Cancel sont des signaux d'alerte

üöÄ PISTES D'AM√âLIORATION
{'‚îÄ'*50}
1. Ajouter des features de tendance temporelle (7j, 14j, 30j)
2. Features musicales: diversit√© genres/artistes
3. Hyperparameter tuning avec Optuna/GridSearch
4. Essayer XGBoost/LightGBM avec SMOTE
5. Stacking/Ensemble de mod√®les
""")