# 03 - Feature Engineering

This notebook selects the most important features and encodes the categorical variables.

## 1. Importing libraries

In [1]:
# Data manipulation
import pandas as pd
import numpy as np

# Encoding and scaling
from sklearn.preprocessing import LabelEncoder, StandardScaler

print("✓ Libraries imported successfully")

✓ Libraries imported successfully


## 2. Loading the data

In [2]:
# Load the raw data
df = pd.read_csv('../data/raw/fraud_synth_10000.csv')

print(f"✓ Dataset loaded: {df.shape[0]} rows, {df.shape[1]} columns")
print(f"Columns: {df.columns.tolist()}")

✓ Dataset loaded: 10000 rows, 9 columns
Columns: ['transaction_amount', 'transaction_hour', 'num_transactions_24h', 'account_age_days', 'avg_amount_30d', 'country_risk', 'device_type', 'is_foreign_transaction', 'fraud']
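
Before selecting features, it can help to confirm that the loaded file has the expected columns and to see where values are missing. The sketch below is only an illustration: `check_schema` is a hypothetical helper, and the two-row CSV is made-up stand-in data rather than the real `fraud_synth_10000.csv`.

```python
import io
import pandas as pd

# Column names copied from the output above
EXPECTED_COLUMNS = [
    'transaction_amount', 'transaction_hour', 'num_transactions_24h',
    'account_age_days', 'avg_amount_30d', 'country_risk',
    'device_type', 'is_foreign_transaction', 'fraud',
]

def check_schema(df, expected=EXPECTED_COLUMNS):
    """Hypothetical helper: return (missing columns, NaN counts for columns that have any)."""
    missing = [c for c in expected if c not in df.columns]
    nan_counts = df.isna().sum()
    return missing, nan_counts[nan_counts > 0].to_dict()

# Tiny in-memory stand-in for the real CSV (made-up values, one missing hour)
sample_csv = io.StringIO(
    "transaction_amount,transaction_hour,num_transactions_24h,account_age_days,"
    "avg_amount_30d,country_risk,device_type,is_foreign_transaction,fraud\n"
    "15.01,13,3,3615,20.5,low,mobile,1,0\n"
    "24.67,,2,209,30.1,low,mobile,0,0\n"
)
sample = pd.read_csv(sample_csv)

missing_cols, nan_report = check_schema(sample)
print("Missing columns:", missing_cols)
print("Columns with NaN:", nan_report)
```

On the real data, the same call would be `check_schema(df)` right after `pd.read_csv`.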


## 3. Selecting the important features

Based on the earlier analysis, we keep only the features with a significant impact:

In [3]:
# List of features to keep (based on the analysis)
selected_features = [
    'is_foreign_transaction',  # ⭐⭐⭐⭐⭐ Impact +78.5%
    'account_age_days',        # ⭐⭐⭐⭐⭐ Impact -23.9%
    'country_risk',            # ⭐⭐⭐⭐⭐ Massive impact
    'num_transactions_24h',    # ⭐⭐⭐⭐ Impact +15.1%
    'transaction_amount',      # ⭐⭐⭐ Impact +15.3%
    'device_type',             # ⭐⭐⭐ Moderate impact
    'fraud'                    # Target variable (Y)
]

# Create a new dataframe containing only the selected features
df_selected = df[selected_features].copy()

print(f"✓ Selected features: {len(selected_features) - 1} features + 1 target (Y)")
print("\nFeatures kept:")
for i, feat in enumerate(selected_features[:-1], 1):
    print(f"  {i}. {feat}")
print("  Y. fraud (target variable)")

✓ Selected features: 6 features + 1 target (Y)

Features kept:
  1. is_foreign_transaction
  2. account_age_days
  3. country_risk
  4. num_transactions_24h
  5. transaction_amount
  6. device_type
  Y. fraud (target variable)
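
The notebook does not say how the "Impact" percentages in the comments were computed. One plausible reading, shown here purely as an assumption, is the relative change in fraud rate between the two values of a binary feature. `relative_fraud_lift` is a hypothetical helper and the toy data is invented for the illustration.

```python
import pandas as pd

def relative_fraud_lift(df, feature, target='fraud'):
    """Hypothetical metric: fraud rate when feature == 1, relative to feature == 0."""
    rates = df.groupby(feature)[target].mean()
    return rates.loc[1] / rates.loc[0] - 1.0

# Invented toy data: 2 frauds out of 10 domestic, 3 out of 10 foreign transactions
toy = pd.DataFrame({
    'is_foreign_transaction': [0] * 10 + [1] * 10,
    'fraud': [1, 1, 0, 0, 0, 0, 0, 0, 0, 0] + [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
})

lift = relative_fraud_lift(toy, 'is_foreign_transaction')
print(f"Relative change in fraud rate: {lift:+.1%}")
```

With fraud rates of 20% and 30%, the relative change is about +50%. The actual analysis may have used a different definition.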


In [4]:
# Show a preview of the selected data
print("📊 Preview of the selected data:")
df_selected.head(10)

📊 Preview of the selected data:


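| | is_foreign_transaction | account_age_days | country_risk | num_transactions_24h | transaction_amount | device_type | fraud |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 3615 | low | 3 | 15.01 | mobile | 0 |
| 1 | 0 | 209 | low | 2 | 24.67 | mobile | 0 |
| 2 | 0 | 438 | low | 4 | 92.79 | mobile | 0 |
| 3 | 1 | 1830 | low | 6 | 38.67 | desktop | 0 |
| 4 | 0 | 1686 | medium | 3 | 69.14 | mobile | 0 |
| 5 | 1 | 997 | low | 4 | 52.55 | mobile | 0 |
| 6 | 0 | 1236 | high | 1 | 19.9 | mobile | 0 |
| 7 | 0 | 1114 | low | 4 | 51.09 | mobile | 0 |
| 8 | 1 | 318 | high | 4 | 25.71 | desktop | 0 |
| 9 | 1 | 3414 | low | 5 | 25.59 | desktop | 0 |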


## 4. Encoding the categorical variables

We have 2 categorical variables to encode:
- `country_risk`: 3 categories (low, medium, high)
- `device_type`: 3 categories (desktop, mobile, tablet)
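
Since `country_risk` has a natural order (low < medium < high), an ordinal mapping is one alternative to consider. The sketch below is not what this notebook does; the integer codes in `risk_order` are an assumption chosen for illustration, and the toy values are invented.

```python
import pandas as pd

# Assumed ordering: a higher code means a higher risk level
risk_order = {'low': 0, 'medium': 1, 'high': 2}

toy = pd.DataFrame({'country_risk': ['low', 'high', 'medium', 'low']})
toy['country_risk_ordinal'] = toy['country_risk'].map(risk_order)

# Categories absent from the mapping would become NaN, so fail loudly
assert toy['country_risk_ordinal'].notna().all()

print(toy['country_risk_ordinal'].tolist())  # [0, 2, 1, 0]
```

An ordinal code keeps a single column but imposes equal spacing between levels. One-hot encoding avoids that assumption at the cost of extra columns.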

### 4.1 One-Hot Encoding for country_risk

In [5]:
# Show the value counts before encoding
print("📊 country_risk values before encoding:")
print(df_selected['country_risk'].value_counts())

# Apply One-Hot Encoding to country_risk
# pd.get_dummies creates one binary column per category (boolean True/False in the output below)
country_risk_encoded = pd.get_dummies(df_selected['country_risk'], prefix='country_risk', drop_first=False)

print("\n✓ One-Hot Encoding applied to 'country_risk'")
print(f"New columns created: {country_risk_encoded.columns.tolist()}")

# Show a preview
country_risk_encoded.head()

📊 country_risk values before encoding:
country_risk
low       6007
medium    2551
high      1442
Name: count, dtype: int64

✓ One-Hot Encoding applied to 'country_risk'
New columns created: ['country_risk_high', 'country_risk_low', 'country_risk_medium']


| | country_risk_high | country_risk_low | country_risk_medium |
|---|---|---|---|
| 0 | False | True | False |
| 1 | False | True | False |
| 2 | False | True | False |
| 3 | False | True | False |
| 4 | False | False | True |
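
The three columns above always sum to one, so one of them is redundant. For linear models, a common variant is to drop one category and request integer output. The sketch below uses invented values and shows the `drop_first` and `dtype` parameters of `pd.get_dummies`; it is an alternative, not the notebook's configuration.

```python
import pandas as pd

# Invented sample values
risk = pd.Series(['low', 'low', 'medium', 'high'], name='country_risk')

# drop_first=True removes the first category in sorted order ('high'),
# dtype=int yields 0/1 integers instead of booleans
encoded = pd.get_dummies(risk, prefix='country_risk', drop_first=True, dtype=int)

print(encoded.columns.tolist())  # ['country_risk_low', 'country_risk_medium']
print(encoded.values.tolist())   # [[1, 0], [1, 0], [0, 1], [0, 0]]
```

The dropped category (`high`) is then represented by a row of zeros.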


### 4.2 One-Hot Encoding for device_type

In [6]:
# Show the value counts before encoding
print("📊 device_type values before encoding:")
print(df_selected['device_type'].value_counts())

# Apply One-Hot Encoding to device_type
device_type_encoded = pd.get_dummies(df_selected['device_type'], prefix='device_type', drop_first=False)

print("\n✓ One-Hot Encoding applied to 'device_type'")
print(f"New columns created: {device_type_encoded.columns.tolist()}")

# Show a preview
device_type_encoded.head()

📊 device_type values before encoding:
device_type
mobile     5498
desktop    3483
tablet     1019
Name: count, dtype: int64

✓ One-Hot Encoding applied to 'device_type'
New columns created: ['device_type_desktop', 'device_type_mobile', 'device_type_tablet']


| | device_type_desktop | device_type_mobile | device_type_tablet |
|---|---|---|---|
| 0 | False | True | False |
| 1 | False | True | False |
| 2 | False | True | False |
| 3 | True | False | False |
| 4 | False | True | False |


## 5. Assembling the final dataset

In [7]:
# Drop the original categorical columns
df_final = df_selected.drop(['country_risk', 'device_type'], axis=1)

# Append the encoded columns
df_final = pd.concat([df_final, country_risk_encoded, device_type_encoded], axis=1)

print("✓ Final dataset assembled")
print(f"Dimensions: {df_final.shape[0]} rows × {df_final.shape[1]} columns")
print("\nFinal columns:")
for i, col in enumerate(df_final.columns, 1):
    print(f"  {i}. {col}")

✓ Final dataset assembled
Dimensions: 10000 rows × 11 columns

Final columns:
  1. is_foreign_transaction
  2. account_age_days
  3. num_transactions_24h
  4. transaction_amount
  5. fraud
  6. country_risk_high
  7. country_risk_low
  8. country_risk_medium
  9. device_type_desktop
  10. device_type_mobile
  11. device_type_tablet
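
The encode, drop, and concatenate steps above can also be written as a single `pd.get_dummies` call with the `columns` argument. A minimal sketch on invented toy rows, with column names borrowed from the notebook:

```python
import pandas as pd

# Invented toy rows using the notebook's column names
toy = pd.DataFrame({
    'account_age_days': [3615, 209, 1686],
    'country_risk': ['low', 'low', 'medium'],
    'device_type': ['mobile', 'desktop', 'mobile'],
    'fraud': [0, 0, 0],
})

# Encodes the listed columns, removes the originals, and appends the dummies
toy_final = pd.get_dummies(toy, columns=['country_risk', 'device_type'])

print(toy_final.columns.tolist())
```

The untouched columns come first, followed by the dummy columns in the order given in `columns`, which matches the layout listed above. Categories that do not occur in the data (here `high` and `tablet`) get no column, which matters if training and inference data are encoded separately.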


In [8]:
# Show the first rows of the final dataset
print("📊 Preview of the final encoded dataset:")
df_final.head(10)

📊 Preview of the final encoded dataset:


| | is_foreign_transaction | account_age_days | num_transactions_24h | transaction_amount | fraud | country_risk_high | country_risk_low | country_risk_medium | device_type_desktop | device_type_mobile | device_type_tablet |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3615 | 3 | 15.01 | 0 | False | True | False | False | True | False |
| 1 | 0 | 209 | 2 | 24.67 | 0 | False | True | False | False | True | False |
| 2 | 0 | 438 | 4 | 92.79 | 0 | False | True | False | False | True | False |
| 3 | 1 | 1830 | 6 | 38.67 | 0 | False | True | False | True | False | False |
| 4 | 0 | 1686 | 3 | 69.14 | 0 | False | False | True | False | True | False |
| 5 | 1 | 997 | 4 | 52.55 | 0 | False | True | False | False | True | False |
| 6 | 0 | 1236 | 1 | 19.9 | 0 | True | False | False | False | True | False |
| 7 | 0 | 1114 | 4 | 51.09 | 0 | False | True | False | False | True | False |
| 8 | 1 | 318 | 4 | 25.71 | 0 | True | False | False | True | False | False |
| 9 | 1 | 3414 | 5 | 25.59 | 0 | False | True | False | True | False | False |


## 6. Checking the data types

In [9]:
# Check the data types
print("📋 Final data types:")
print(df_final.dtypes)

# Check that no object (text) columns remain; bool columns are accepted as-is
object_cols = df_final.select_dtypes(include=['object']).columns.tolist()
if len(object_cols) == 0:
    print("\n✓ All variables are numeric or boolean (ready for ML)")
else:
    print(f"\n⚠️ Remaining non-numeric variables: {object_cols}")

📋 Final data types:
is_foreign_transaction      int64
account_age_days            int64
num_transactions_24h        int64
transaction_amount        float64
fraud                       int64
country_risk_high            bool
country_risk_low             bool
country_risk_medium          bool
device_type_desktop          bool
device_type_mobile           bool
device_type_tablet           bool
dtype: object

✓ All variables are numeric or boolean (ready for ML)
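
The check above only looks for `object` columns, so the one-hot columns pass even though their dtype is `bool`. Most estimators accept booleans, but if strictly numeric columns are wanted, they can be cast to integers. A minimal sketch on invented toy data:

```python
import pandas as pd

# Invented toy frame mixing a float column with boolean one-hot columns
toy = pd.DataFrame({
    'transaction_amount': [15.01, 24.67],
    'country_risk_high': [False, True],
    'country_risk_low': [True, False],
})

# Cast every boolean column to integer 0/1
bool_cols = toy.select_dtypes(include='bool').columns
toy = toy.astype({col: int for col in bool_cols})

# After the cast, no column falls outside the numeric dtypes
non_numeric = toy.select_dtypes(exclude='number').columns.tolist()
print(toy.dtypes)
print("Non-numeric columns:", non_numeric)
```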


## 7. Splitting X (features) and Y (target)

In [10]:
# Separate the features (X) from the target variable (y)
X = df_final.drop('fraud', axis=1)  # Features
y = df_final['fraud']               # Target

print("✓ Data split:")
print(f"  X (features): {X.shape[0]} rows × {X.shape[1]} columns")
print(f"  y (target)  : {y.shape[0]} values")

print("\n📊 Columns of X (features):")
for i, col in enumerate(X.columns, 1):
    print(f"  {i}. {col}")

print("\n📊 Distribution of y (target):")
print(y.value_counts())
print(f"Fraud percentage: {(y.sum() / len(y) * 100):.2f}%")

✓ Data split:
  X (features): 10000 rows × 10 columns
  y (target)  : 10000 values

📊 Columns of X (features):
  1. is_foreign_transaction
  2. account_age_days
  3. num_transactions_24h
  4. transaction_amount
  5. country_risk_high
  6. country_risk_low
  7. country_risk_medium
  8. device_type_desktop
  9. device_type_mobile
  10. device_type_tablet

📊 Distribution of y (target):
fraud
0    9449
1     551
Name: count, dtype: int64
Fraud percentage: 5.51%
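
With only 5.51% fraud, a plain random split can leave the train and test sets with noticeably different fraud rates. A stratified split keeps the ratio nearly identical in both. The sketch below rebuilds a synthetic target with the class counts shown above. The 80/20 split and `random_state=42` are assumptions, not values taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target with the same class counts as the output above
y_toy = np.array([0] * 9449 + [1] * 551)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

# stratify=y_toy preserves the class proportions in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print(f"Train fraud rate: {y_train.mean():.4f}")
print(f"Test fraud rate:  {y_test.mean():.4f}")
```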


## 8. Normalizing the numeric features (optional but recommended)

In [11]:
# Identify the original numeric columns (those present before one-hot encoding)
numeric_features = [
    'is_foreign_transaction',
    'account_age_days',
    'num_transactions_24h',
    'transaction_amount'
]

# Work on a copy for the normalization
X_normalized = X.copy()

# Initialize the scaler (StandardScaler rescales to zero mean and unit variance)
scaler = StandardScaler()

# Normalize only the numeric columns (not the one-hot encoded ones)
X_normalized[numeric_features] = scaler.fit_transform(X[numeric_features])

print("✓ Normalization applied to the numeric features")
print(f"Normalized features: {numeric_features}")
print("\n📊 Statistics after normalization:")
print(X_normalized[numeric_features].describe())

✓ Normalization applied to the numeric features
Normalized features: ['is_foreign_transaction', 'account_age_days', 'num_transactions_24h', 'transaction_amount']

📊 Statistics after normalization:
       is_foreign_transaction  account_age_days  num_transactions_24h  \
count            1.000000e+04      1.000000e+04          1.000000e+04   
mean             2.273737e-17     -7.318590e-17         -9.467982e-17   
std              1.000050e+00      1.000050e+00          1.000050e+00   
min             -5.844327e-01     -1.735096e+00         -1.588473e+00   
25%             -5.844327e-01     -8.673108e-01         -9.526770e-01   
50%             -5.844327e-01      1.862028e-03         -3.168808e-01   
75%              1.711061e+00      8.781450e-01          3.189154e-01   
max              1.711061e+00      1.731884e+00          4.769489e+00   

       transaction_amount  
count        1.000000e+04  
mean         3.215206e-17  
std          1.000050e+00  
min         -
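
The cell above fits the scaler on the full dataset. When a train/test split follows, a common precaution is to fit the scaler on the training rows only and reuse it on the test rows, so that test statistics do not leak into preprocessing. The sketch below illustrates that pattern on randomly generated toy data. The 160/40 split and the distributions are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Random toy data standing in for two of the numeric features
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'account_age_days': rng.integers(1, 3650, size=200),
    'transaction_amount': rng.gamma(2.0, 25.0, size=200),
})
train, test = toy.iloc[:160], toy.iloc[160:]

scaler = StandardScaler()
# Learn the mean and standard deviation from the training rows only
train_scaled = pd.DataFrame(
    scaler.fit_transform(train), columns=toy.columns, index=train.index
)
# Apply the same training statistics to the test rows
test_scaled = pd.DataFrame(
    scaler.transform(test), columns=toy.columns, index=test.index
)

print(train_scaled.mean().round(6).tolist())
print(train_scaled.std(ddof=0).round(6).tolist())
print(train_scaled.std(ddof=1).round(6).tolist())
```

`StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). That accounts for the `std` of 1.000050 in the output above, which is approximately sqrt(10000/9999).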

## 9. Saving the prepared data

In [12]:
# Save the full encoded dataset (not normalized)
df_final.to_csv('../data/processed/fraud_encoded.csv', index=False)
print("✓ Encoded dataset saved: data/processed/fraud_encoded.csv")

# Save X and y separately (not normalized)
X.to_csv('../data/processed/X_features.csv', index=False)
y.to_csv('../data/processed/y_target.csv', index=False)
print("✓ Features (X) saved: data/processed/X_features.csv")
print("✓ Target (y) saved: data/processed/y_target.csv")

# Save the normalized X (y is unchanged, so y_target.csv is reused)
X_normalized.to_csv('../data/processed/X_features_normalized.csv', index=False)
print("✓ Normalized features saved: data/processed/X_features_normalized.csv")

✓ Encoded dataset saved: data/processed/fraud_encoded.csv
✓ Features (X) saved: data/processed/X_features.csv
✓ Target (y) saved: data/processed/y_target.csv
✓ Normalized features saved: data/processed/X_features_normalized.csv
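
A later notebook will need to read these files back. The sketch below shows the round trip on invented toy data, written to a temporary directory instead of `../data/processed/`. The point to note is that `y` is saved as a one-column CSV with a `fraud` header, so it comes back as a DataFrame and has to be turned into a Series again.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Invented toy data mirroring the structure of X and y
X_toy = pd.DataFrame({
    'transaction_amount': [15.01, 24.67],
    'country_risk_low': [True, False],
})
y_toy = pd.Series([0, 1], name='fraud')

with tempfile.TemporaryDirectory() as tmp:
    x_path = Path(tmp) / 'X_features.csv'
    y_path = Path(tmp) / 'y_target.csv'
    X_toy.to_csv(x_path, index=False)
    y_toy.to_csv(y_path, index=False)

    X_back = pd.read_csv(x_path)
    # The CSV has a single 'fraud' column; select it to get a Series
    y_back = pd.read_csv(y_path)['fraud']

print(X_back.dtypes.to_dict())
print(y_back.tolist())  # [0, 1]
```

The boolean one-hot columns are written as `True`/`False` text and parsed back as `bool` by `pd.read_csv`.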


## ✅ Feature Engineering summary

In [13]:
print("=" * 80)
print("FEATURE ENGINEERING SUMMARY")
print("=" * 80)

print("\n1. SELECTED FEATURES")
print("-" * 80)
print("✓ 6 of the 8 original features kept")
print("✓ 2 features dropped (transaction_hour, avg_amount_30d)")

print("\n2. CATEGORICAL VARIABLE ENCODING")
print("-" * 80)
print("✓ country_risk → One-Hot Encoding (3 columns)")
print("✓ device_type → One-Hot Encoding (3 columns)")

print("\n3. FINAL DATASET")
print("-" * 80)
print(f"✓ Dimensions: {X.shape[0]} rows × {X.shape[1]} features")
print(f"✓ Target variable: {y.shape[0]} values")
print("✓ All variables are numeric or boolean")

print("\n4. SAVED FILES")
print("-" * 80)
print("✓ fraud_encoded.csv (full dataset)")
print("✓ X_features.csv (features, not normalized)")
print("✓ X_features_normalized.csv (normalized features)")
print("✓ y_target.csv (target variable)")

print("\n5. NEXT STEPS")
print("-" * 80)
print("→ Notebook 04: Train/test split")
print("→ Notebook 05: Model training")
print("→ Notebook 06: Performance evaluation")

print("\n" + "=" * 80)

FEATURE ENGINEERING SUMMARY

1. SELECTED FEATURES
--------------------------------------------------------------------------------
✓ 6 of the 8 original features kept
✓ 2 features dropped (transaction_hour, avg_amount_30d)

2. CATEGORICAL VARIABLE ENCODING
--------------------------------------------------------------------------------
✓ country_risk → One-Hot Encoding (3 columns)
✓ device_type → One-Hot Encoding (3 columns)

3. FINAL DATASET
--------------------------------------------------------------------------------
✓ Dimensions: 10000 rows × 10 features
✓ Target variable: 10000 values
✓ All variables are numeric or boolean

4. SAVED FILES
--------------------------------------------------------------------------------
✓ fraud_encoded.csv (full dataset)
✓ X_features.csv (features, not normalized)
✓ X_features_normalized.csv (normalized features)
✓ y_target.csv (target variable)

5. PROCHAINES √â