# 01 - Data Loading

This notebook loads the fraud detection dataset and takes a quick first look at it.

## 1. Importing libraries

In [1]:
# Data manipulation libraries
import pandas as pd
import numpy as np

# Display configuration
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

print("✓ Libraries imported successfully")

✓ Libraries imported successfully


## 2. Loading the data

In [2]:
# Path to the data file
data_path = '../data/raw/fraud_synth_10000.csv'

# Load the CSV file
df = pd.read_csv(data_path)

print(f"✓ Data loaded: {df.shape[0]} rows, {df.shape[1]} columns")

✓ Data loaded: 10000 rows, 9 columns
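
A quick schema check right after loading can catch a wrong or truncated file early. The following is a minimal sketch, assuming the nine column names that `df.info()` reports in this notebook; the helper `check_columns` and the in-memory sample are illustrative and not part of the original notebook.

```python
import io

import pandas as pd

# Expected schema, taken from the column list reported by df.info() in this notebook
EXPECTED_COLUMNS = [
    "transaction_amount", "transaction_hour", "num_transactions_24h",
    "account_age_days", "avg_amount_30d", "country_risk",
    "device_type", "is_foreign_transaction", "fraud",
]


def check_columns(frame, expected):
    """Return (missing, unexpected) column names relative to the expected list."""
    missing = [c for c in expected if c not in frame.columns]
    unexpected = [c for c in frame.columns if c not in expected]
    return missing, unexpected


# Small in-memory stand-in for the CSV (two rows copied from the df.head() output)
sample_csv = io.StringIO(
    "transaction_amount,transaction_hour,num_transactions_24h,account_age_days,"
    "avg_amount_30d,country_risk,device_type,is_foreign_transaction,fraud\n"
    "15.01,17,3,3615,15.68,low,mobile,1,0\n"
    "24.67,13,2,209,49.99,low,mobile,0,0\n"
)
sample_df = pd.read_csv(sample_csv)

missing, unexpected = check_columns(sample_df, EXPECTED_COLUMNS)
print(f"Missing: {missing}, unexpected: {unexpected}")
```

In the notebook itself, the same helper could be called on `df` instead of `sample_df`.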


## 3. Quick look at the data

In [3]:
# Show the first 5 rows
print("📊 First rows of the dataset:")
df.head()

📊 First rows of the dataset:


|   | transaction_amount | transaction_hour | num_transactions_24h | account_age_days | avg_amount_30d | country_risk | device_type | is_foreign_transaction | fraud |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 15.01 | 17 | 3 | 3615 | 15.68 | low | mobile | 1 | 0 |
| 1 | 24.67 | 13 | 2 | 209 | 49.99 | low | mobile | 0 | 0 |
| 2 | 92.79 | 13 | 4 | 438 | 10.65 | low | mobile | 0 | 0 |
| 3 | 38.67 | 22 | 6 | 1830 | 9.62 | low | desktop | 1 | 0 |
| 4 | 69.14 | 10 | 3 | 1686 | 21.91 | medium | mobile | 0 | 0 |


In [4]:
# Column names and data types
print("📋 Dataset information:")
df.info()

📋 Dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   transaction_amount      10000 non-null  float64
 1   transaction_hour        10000 non-null  int64  
 2   num_transactions_24h    10000 non-null  int64  
 3   account_age_days        10000 non-null  int64  
 4   avg_amount_30d          10000 non-null  float64
 5   country_risk            10000 non-null  object 
 6   device_type             10000 non-null  object 
 7   is_foreign_transaction  10000 non-null  int64  
 8   fraud                   10000 non-null  int64  
dtypes: float64(2), int64(5), object(2)
memory usage: 703.3+ KB


In [5]:
# Descriptive statistics
print("📈 Descriptive statistics:")
df.describe()

📈 Descriptive statistics:


|   | transaction_amount | transaction_hour | num_transactions_24h | account_age_days | avg_amount_30d | is_foreign_transaction | fraud |
|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 45.614928 | 11.4699 | 2.4984 | 2502.3157 | 31.161326 | 0.2546 | 0.0551 |
| std | 41.264072 | 6.910389 | 1.57291 | 1441.672572 | 24.688433 | 0.435658 | 0.228187 |
| min | 5.0 | 0.0 | 0.0 | 1.0 | 5.0 | 0.0 | 0.0 |
| 25% | 19.4475 | 5.0 | 1.0 | 1252.0 | 15.33 | 0.0 | 0.0 |
| 50% | 33.675 | 12.0 | 2.0 | 2505.0 | 24.41 | 0.0 | 0.0 |
| 75% | 57.4 | 17.0 | 3.0 | 3768.25 | 39.03 | 1.0 | 0.0 |
| max | 593.0 | 23.0 | 10.0 | 4999.0 | 346.73 | 1.0 | 1.0 |


## 4. Checking for missing values

In [6]:
# Count missing values per column
missing_values = df.isnull().sum()
missing_percent = (missing_values / len(df)) * 100

# Build a summary DataFrame
missing_df = pd.DataFrame({
    'Missing values': missing_values,
    'Percentage (%)': missing_percent
})

# Keep only the columns that have missing values
missing_df = missing_df[missing_df['Missing values'] > 0]

if len(missing_df) > 0:
    print("⚠️ Missing values detected:")
    print(missing_df)
else:
    print("✓ No missing values detected")

✓ No missing values detected
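
Because this dataset has no gaps, the non-empty branch of the check never runs here. As a hedged illustration of what the summary looks like when values are missing, the sketch below wraps the same logic in a hypothetical helper, `missing_summary`, and applies it to a small made-up frame (the rows are invented for the example and do not come from the dataset).

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Return per-column missing counts and percentages, only for columns with gaps."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing_values": counts,
        "percentage": counts / len(frame) * 100,
    })
    return summary[summary["missing_values"] > 0]


# Invented toy data: one gap in transaction_amount, two in device_type
toy = pd.DataFrame({
    "transaction_amount": [15.01, np.nan, 92.79, 38.67],
    "device_type": ["mobile", "mobile", None, None],
    "fraud": [0, 0, 0, 1],
})

report = missing_summary(toy)
print(report)
```

The `fraud` column is complete, so it is filtered out of the report.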


## 5. Distribution of the target variable (fraud)

In [7]:
# Count fraudulent and non-fraudulent transactions
fraud_counts = df['fraud'].value_counts()
fraud_percent = df['fraud'].value_counts(normalize=True) * 100

print("🎯 Distribution of the target variable (fraud):")
print("\nCount:")
print(fraud_counts)
print("\nPercentage:")
print(fraud_percent.round(2))

# Check whether the dataset is imbalanced
# (the ratio is fraud count divided by non-fraud count, not the fraud share of all rows)
fraud_ratio = fraud_counts[1] / fraud_counts[0]
if fraud_ratio < 0.1:
    print(f"\n⚠️ Imbalanced dataset detected (ratio: {fraud_ratio:.2%})")
else:
    print(f"\n✓ Balanced dataset (ratio: {fraud_ratio:.2%})")

🎯 Distribution of the target variable (fraud):

Count:
fraud
0    9449
1     551
Name: count, dtype: int64

Percentage:
fraud
0    94.49
1     5.51
Name: proportion, dtype: float64

⚠️ Imbalanced dataset detected (ratio: 5.83%)
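
The output shows two different figures for fraud, 5.51% and 5.83%, because they use different denominators. The short calculation below reproduces both from the counts printed above; the variable names are chosen for illustration.

```python
# Counts copied from the value_counts() output above
n_fraud = 551
n_normal = 9449

# Share of all transactions that are fraudulent (denominator: every row)
fraud_share = n_fraud / (n_fraud + n_normal)

# Ratio of fraudulent to non-fraudulent transactions (denominator: non-fraud rows only)
fraud_ratio = n_fraud / n_normal

print(f"Fraud share of all rows: {fraud_share:.2%}")   # 5.51%
print(f"Fraud-to-normal ratio:   {fraud_ratio:.2%}")   # 5.83%
```

The 0.1 threshold in the cell above is applied to the second quantity, the fraud-to-normal ratio.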


## 6. Overview of variable types

In [8]:
# Split the columns into numeric and categorical
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

print(f"📊 Numeric variables ({len(numeric_cols)}):")
print(numeric_cols)
print(f"\n📝 Categorical variables ({len(categorical_cols)}):")
print(categorical_cols)

📊 Numeric variables (7):
['transaction_amount', 'transaction_hour', 'num_transactions_24h', 'account_age_days', 'avg_amount_30d', 'is_foreign_transaction', 'fraud']

📝 Categorical variables (2):
['country_risk', 'device_type']
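
A dtype-based split counts `is_foreign_transaction` and `fraud` as numeric, although the descriptive statistics show that both only range from 0 to 1. One possible refinement, sketched here on the five rows from `df.head()`, is to flag numeric columns whose values are all 0 or 1; the name `binary_cols` is an assumption made for this example.

```python
import pandas as pd

# The five rows shown by df.head() earlier in this notebook
sample = pd.DataFrame({
    "transaction_amount": [15.01, 24.67, 92.79, 38.67, 69.14],
    "transaction_hour": [17, 13, 13, 22, 10],
    "num_transactions_24h": [3, 2, 4, 6, 3],
    "account_age_days": [3615, 209, 438, 1830, 1686],
    "avg_amount_30d": [15.68, 49.99, 10.65, 9.62, 21.91],
    "country_risk": ["low", "low", "low", "low", "medium"],
    "device_type": ["mobile", "mobile", "mobile", "desktop", "mobile"],
    "is_foreign_transaction": [1, 0, 0, 1, 0],
    "fraud": [0, 0, 0, 0, 0],
})

numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Numeric columns that only ever take the values 0 or 1
binary_cols = [c for c in numeric_cols if sample[c].isin([0, 1]).all()]
continuous_cols = [c for c in numeric_cols if c not in binary_cols]

print("Binary:", binary_cols)
print("Other numeric:", continuous_cols)
```

On a sample this small, a count column could pass the test by chance, so the check is more trustworthy when run on the full `df`.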


## 7. Saving the key information

In [9]:
# Build a dictionary with the key information
data_info = {
    'n_rows': df.shape[0],
    'n_columns': df.shape[1],
    'n_fraud': fraud_counts[1],
    'n_normal': fraud_counts[0],
    'fraud_percentage': fraud_percent[1],
    'columns': df.columns.tolist(),
    'numeric_columns': numeric_cols,
    'categorical_columns': categorical_cols,
    'missing_values': df.isnull().sum().sum()
}

print("✓ Data summary:")
for key, value in data_info.items():
    if isinstance(value, list):
        print(f"  {key}: {len(value)} items")
    else:
        print(f"  {key}: {value}")

✓ Data summary:
  n_rows: 10000
  n_columns: 9
  n_fraud: 551
  n_normal: 9449
  fraud_percentage: 5.510000000000001
  columns: 9 items
  numeric_columns: 7 items
  categorical_columns: 2 items
  missing_values: 0
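
The cell above keeps the summary in memory and prints it; nothing is written to disk. If the summary needs to be reused by a later notebook, one option is to write it as JSON. The sketch below is an assumption rather than the notebook's own method: the file name `data_info.json`, the temporary directory, and the helper `to_builtin` are all hypothetical. The helper is needed because values taken from pandas results are NumPy scalars, and `json` cannot serialize `np.int64` on its own.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def to_builtin(value):
    """Convert NumPy scalars to plain Python numbers so json can serialize them."""
    if isinstance(value, np.integer):
        return int(value)
    if isinstance(value, np.floating):
        return float(value)
    raise TypeError(f"Cannot serialize {type(value).__name__}")


# Subset of the data_info dictionary, with values copied from the summary above
data_info = {
    "n_rows": 10000,
    "n_columns": 9,
    "n_fraud": np.int64(551),
    "n_normal": np.int64(9449),
    "categorical_columns": ["country_risk", "device_type"],
}

# Write to a temporary directory here; a real project would pick its own location
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "data_info.json"
    out_path.write_text(
        json.dumps(data_info, default=to_builtin, indent=2), encoding="utf-8"
    )
    restored = json.loads(out_path.read_text(encoding="utf-8"))

print(restored["n_fraud"], restored["n_normal"])
```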


## ✅ Conclusion

The data loaded successfully. Continue to the next notebook for the detailed exploratory analysis.