## 🧠 Modeling

In this section, we begin the modeling phase, which consists of:

- Choosing one or more models suited to the problem type (classification or regression),
- Training these models on the training set,
- Optimizing their performance with cross-validation and hyperparameter search (GridSearchCV, RandomizedSearchCV).
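
As a point of comparison with the exhaustive grid search used in this notebook, the randomized variant mentioned above can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's actual pipeline: the dataset, the sampled ranges, and `n_iter=5` are all assumptions chosen to keep the example fast.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic binary classification data standing in for the real dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

# Distributions and lists to sample from (illustrative values only)
param_distributions = {
    "n_estimators": randint(20, 60),
    "max_depth": [3, 5, None],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,        # number of sampled configurations
    cv=3,            # 3-fold cross-validation
    random_state=0,
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Validation accuracy:", search.score(X_va, y_va))
```

Randomized search evaluates a fixed number of sampled configurations, so its cost does not grow with the size of the search space the way a full grid does.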


In [1]:
import pandas as pd
Path_Data='../Data/Processed/'
X_train = pd.read_csv(Path_Data+'X_train.csv')
X_val = pd.read_csv(Path_Data+'X_val.csv')
X_test = pd.read_csv(Path_Data+'X_test.csv')

y_train = pd.read_csv(Path_Data+'y_train.csv')
y_val = pd.read_csv(Path_Data+'y_val.csv')
y_test = pd.read_csv(Path_Data+'y_test.csv')
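
The label files are read as DataFrames, while scikit-learn estimators generally expect 1-D targets. Assuming each label file holds a single column, one way to flatten it is sketched below. It uses an in-memory CSV, and the column name `target` is hypothetical, not taken from the real files.

```python
import io

import pandas as pd

# In-memory stand-in for y_train.csv; the column name "target" is hypothetical
csv_text = "target\n0\n1\n1\n0\n"

y_frame = pd.read_csv(io.StringIO(csv_text))   # DataFrame of shape (4, 1)
y_series = y_frame.squeeze("columns")          # Series of shape (4,)

print(y_frame.shape, y_series.shape)
```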


### Model Initialization

We will use the following models for our classification task:

- **CatBoost**
- **LightGBM (LGBM)**
- **XGBoost**
- **Random Forest**
- **KNN (K-Nearest Neighbors)**
- Other models as needed (here, a custom logistic regression and an SVM).


In [2]:
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
from sklearn.neighbors import KNeighborsClassifier
# Custom implementation from the project's local LogisticRegression module (not scikit-learn)
from LogisticRegression import LogisticRegression
from sklearn.svm import SVC
import numpy as np



models = {
    "Logistic Regression": LogisticRegression(),
    "XGBoost":XGBClassifier(),
    "CatBoost":CatBoostClassifier(),
    "Random Forest": RandomForestClassifier(),
    "LightGBM": LGBMClassifier(),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC()
}

##### Defining the parameter grids for each model


In [3]:
param_grids = {
    "Logistic Regression": {
        'alpha': [0.001, 0.01, 0.1],
        'iterations': [500, 1000],
        'use_l2': [True, False],
        'lambda_': [0.01, 0.1, 1.0],
        'use_decay': [True, False],
        'decay': [0.001, 0.01, 0.1],
        'early_stopping': [True, False],
        'tol': [1e-4, 1e-5]
    },
    "XGBoost": {
        'max_depth': [3, 5, 7],
        'learning_rate': [0.01, 0.1, 0.2],
        'n_estimators': [50, 100, 200]
    },
    "CatBoost": {
        'iterations': [500, 1000],
        'depth': [3, 5, 7],
        'learning_rate': [0.01, 0.1, 0.2]
    },
    "Random Forest": {
        'n_estimators': [100, 200, 300],
        'max_depth': [10, 20, 30, None],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4]
    },
    "LightGBM": {
        'num_leaves': [31, 50, 100],
        'learning_rate': [0.01, 0.1, 0.2],
        'n_estimators': [50, 100, 200]
    },
    "KNN": {
        'n_neighbors': [3, 5, 7, 10],
        'weights': ['uniform', 'distance'],
        'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']
    },

}

best_model_results = {"model_name": [], "best_params": [], "val_accuracy": []}
best_overall = {"model_name": None, "val_accuracy": 0.0, "best_params": None}
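
Before launching a search, it helps to know how many fits a grid implies. The sketch below counts the combinations for a grid with the same shape as the XGBoost one above, using scikit-learn's `ParameterGrid`; the 3-fold setting mirrors the `cv=3` used later in this notebook, and the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same shape as the XGBoost grid: 3 x 3 x 3 values
example_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.2],
    "n_estimators": [50, 100, 200],
}

n_candidates = len(ParameterGrid(example_grid))  # number of combinations
n_folds = 3                                      # cross-validation folds
total_fits = n_candidates * n_folds              # fits run during the search

print(f"{n_candidates} candidates x {n_folds} folds = {total_fits} fits")
```

With the default `refit=True`, GridSearchCV adds one final fit of the best configuration on the full training set.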


## Detecting the execution environment (CPU, NVIDIA GPU, or Apple M1/M2 chip)

This Python script **automatically detects** the hardware environment the code runs on, so that model training can be adapted accordingly (for example, enabling the GPU when one is available).
In our case, we have two powerful machines: one with an M1 Pro chip and the other with an RTX 4070 graphics card.

In [None]:
import platform
import subprocess

def detect_environment():
    system = platform.system().lower()
    machine = platform.machine().lower()

    # Apple Silicon (M1/M2)
    if system == 'darwin' and 'arm' in machine:
        return "M1"

    # CUDA-compatible NVIDIA GPU
    try:
        result = subprocess.run(['nvidia-smi'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        if result.returncode == 0:
            return "CUDA"
    except FileNotFoundError:
        pass

    return "CPU"

env = detect_environment()
print(f"Detected environment: {env}")


### 🔍 Hyperparameter optimization with GridSearchCV and GPU acceleration

To get the best performance out of each classification model, we use **GridSearchCV** to fine-tune the hyperparameters. The steps are as follows:

1. **Excluding custom models**:
   - The manually implemented logistic regression model is excluded because it does not support `GridSearchCV` directly.

2. **Making full use of the CPU and GPU**:
   - `n_jobs=-1` lets GridSearchCV use all CPU cores for parallel fits; for GPU-backed models, the search is kept at `n_jobs=1` so that parallel workers do not compete for the same GPU.
   - For GPU-compatible models, we enable acceleration explicitly:
     - **XGBoost**: `tree_method='hist'`, `device='cuda:0'` (XGBoost 2.0+; older releases used `tree_method='gpu_hist'`)
     - **LightGBM**: `device='gpu'`
     - **CatBoost**: `task_type='GPU'`, `devices='0'`

3. **Verbose training output**:
   - `verbose=2` prints detailed progress for each fit, which makes it possible to follow the search in real time.

4. **Cross-validation**:
   - 3-fold cross-validation (`cv=3`) is used so that each hyperparameter combination is scored on held-out folds, which reduces the risk of overfitting and makes model selection more robust.

5. **Evaluation**:
   - After training, we extract the best hyperparameters and evaluate the model on the validation set using `accuracy`.
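
The steps above can be sketched end to end on a small CPU-only example. This is a minimal illustration under stated assumptions, not the notebook's actual run: the synthetic dataset, the reduced KNN grid, and the variable names are all chosen for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the processed train/validation splits
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=42)

# Reduced grid: 3 x 2 = 6 candidate configurations
grid = {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}

search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid=grid,
    cv=3,        # 3-fold cross-validation on the training set
    n_jobs=1,    # single worker; -1 would use all CPU cores
    verbose=0,   # set to 2 for per-fit progress messages
)
search.fit(X_tr, y_tr)

# Evaluate the refitted best model on the held-out validation set
best_model = search.best_estimator_
val_acc = accuracy_score(y_va, best_model.predict(X_va))

print("Best parameters:", search.best_params_)
print(f"Validation accuracy: {val_acc:.4f}")
```

The cross-validated score picks the configuration, while the separate validation set gives an estimate that was not used during the search.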




In [None]:
import platform
import subprocess
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
import json
import os

env = detect_environment()
print(f"Detected environment: {env}")

# Create results directory if it doesn't exist
results_dir = "./model_results"
os.makedirs(results_dir, exist_ok=True)

# Initialize results file
results_file = os.path.join(results_dir, "model_results.json")
if not os.path.exists(results_file):
    with open(results_file, 'w') as f:
        json.dump({"models": []}, f)

# Load existing results
with open(results_file, 'r') as f:
    all_results = json.load(f)

# Track best overall model
best_overall_model = {
    "model_name": None,
    "best_params": None,
    "val_accuracy": 0.0
}

def save_model_result(model_name, best_params, val_accuracy):
    """Save the result of a single model to the JSON file"""
    model_result = {
        "model_name": model_name,
        "best_params": best_params,
        "val_accuracy": val_accuracy
    }
    
    # Update all_results
    all_results["models"].append(model_result)
    
    # Update best overall model
    global best_overall_model
    if val_accuracy > best_overall_model["val_accuracy"]:
        best_overall_model = model_result.copy()
    
    # Save to file
    with open(results_file, 'w') as f:
        json.dump(all_results, f, indent=4)
    
    print(f"\n✅ Results for {model_name} saved to {results_file}")

def wait_for_input(model_name):
    """Wait for user input before proceeding to next model"""
    print(f"\n{'='*50}")
    print(f"Completed training for {model_name}")
    print("Press Enter to continue to next model or 'q' to quit...")
    user_input = input()
    if user_input.lower() == 'q':
        print("\nExiting early...")
        print_summary()
        exit()

def print_summary():
    """Print summary of all results"""
    print("\n\n📊 Final Summary of Results:")
    for model in all_results["models"]:
        print(f"\n{model['model_name']}:")
        print(f"  Validation Accuracy: {model['val_accuracy']:.4f}")
        print(f"  Best Parameters: {model['best_params']}")
    
    print("\n🏆 Best Overall Model:")
    if best_overall_model["model_name"]:
        print(f"{best_overall_model['model_name']} with accuracy {best_overall_model['val_accuracy']:.4f}")
        print(f"Parameters: {best_overall_model['best_params']}")
    else:
        print("No models completed yet.")

def setup_gpu_for_model(model, model_name):
    """Properly configure GPU settings for each model"""
    if env == "CUDA":
        if model_name == "XGBoost":
            model.set_params(
                tree_method='hist',
                device='cuda:0'
            )
        elif model_name == "LightGBM":
            model.set_params(
                device='gpu',
                gpu_platform_id=0,
                gpu_device_id=0
            )
        elif model_name == "CatBoost":
            model.set_params(
                task_type='GPU',
                devices='0',  # single GPU with index 0
                verbose=0
            )
  
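    elif env == "M1":
        # LightGBM has no Metal backend (device_type accepts 'cpu', 'gpu', 'cuda'),
        # and CatBoost GPU training requires CUDA, so both run on the CPU here.
        if model_name == "LightGBM":
            model.set_params(device_type='cpu')
        elif model_name == "CatBoost":
            model.set_params(task_type='CPU', verbose=0)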
    else:  # CPU fallback
        if model_name == "XGBoost":
            model.set_params(tree_method='hist', device='cpu')
        elif model_name == "LightGBM":
            model.set_params(device_type='cpu')
    
    return model


def monitor_gpu():
    """Print the nvidia-smi report when running on a CUDA machine"""
    try:
        if env == "CUDA":
            result = subprocess.run(['nvidia-smi'], stdout=subprocess.PIPE)
            print("GPU Usage:\n", result.stdout.decode())
    except Exception:
        print("Could not monitor GPU usage")




In [None]:


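for model_name, model in models.items():
    # Skip the hand-written logistic regression (excluded from GridSearchCV, as
    # described above) and any model without a parameter grid (e.g. SVM).
    if model_name == "Logistic Regression" or model_name not in param_grids:
        print(f"\nSkipping {model_name}: not tuned with GridSearchCV")
        continue

    print(f"\n{'='*50}")
    print(f"🚀 Starting GridSearch for: {model_name}")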
   
    # Proper GPU setup
    model = setup_gpu_for_model(model, model_name)
    
    # Monitor before training
    monitor_gpu()
    
    try:
        # Exhaustive search over the model's grid with 3-fold cross-validation
        grid_search = GridSearchCV(
            estimator=model,
            param_grid=param_grids[model_name],
            cv=3,
            n_jobs=1,  # Keep this as 1 for GPU models
            verbose=2
        )
        
        # Prepare data - convert to numpy arrays properly
        if env == "CUDA" and model_name in ["XGBoost", "LightGBM", "CatBoost"]:
            X_train_gpu = X_train.values.astype(np.float32)  # Proper conversion
            y_train_gpu = y_train.values.astype(np.float32).ravel()
            
            # For validation data too
            X_val_gpu = X_val.values.astype(np.float32)
            y_val_gpu = y_val.values.astype(np.float32).ravel()
            
            grid_search.fit(X_train_gpu, y_train_gpu)
            best_model = grid_search.best_estimator_
            y_pred_val = best_model.predict(X_val_gpu)  # Predict on GPU data
            val_acc = accuracy_score(y_val_gpu, y_pred_val)
        else:
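            # For CPU models: flatten the single-column label frames to 1-D arrays
            y_train_1d = y_train.values.ravel()
            y_val_1d = y_val.values.ravel()
            grid_search.fit(X_train, y_train_1d)
            best_model = grid_search.best_estimator_
            y_pred_val = best_model.predict(X_val)
            val_acc = accuracy_score(y_val_1d, y_pred_val)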

        # Monitor after training
        monitor_gpu()
        
        print(f"\n🎯 Best parameters for {model_name}: {grid_search.best_params_}")
        print(f"📊 {model_name} - Validation Accuracy: {val_acc:.4f}")

        save_model_result(model_name, grid_search.best_params_, val_acc)
        
    except Exception as e:
        print(f"\n❌ Error training {model_name}: {str(e)}")
        save_model_result(model_name, {"error": str(e)}, 0.0)
    
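    if env == "CUDA":
        import gc
        # Drop the fitted search object to release GPU memory. Rebinding (rather
        # than `del`) avoids a NameError if GridSearchCV construction failed.
        grid_search = None
        gc.collect()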
    
    wait_for_input(model_name)

print_summary()
