In [1]:
pip install -r requirements.txt


Note: you may need to restart the kernel to use updated packages.


In [2]:
import optuna
import pandas as pd
import os
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import log_loss, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)




In [3]:
from pathlib import Path
import pandas as pd

base_dir = Path().resolve()  
data_dir = base_dir 

file_path_train = data_dir / "train.csv"
file_path_test = data_dir / "test.csv"

train_df = pd.read_csv(file_path_train)
test_df = pd.read_csv(file_path_test)


print(test_df.head())

              id AP     creation_date_answer  situation  ctc  location  gc_id  \
0  cb7a4e0dd0777  f  2019-03-13 07:00:52.562         30  NaN       100     40   
1  e78e3915f3e30  f  2019-01-07 13:45:55.741         -1    f        95     40   
2  8e65ba155f983  f  2019-01-25 14:01:07.041         -1    f        34     20   
3  701e90ca03ce2  f  2019-01-16 14:35:11.432         10    f        45     40   
4  768fefec8609a  f  2019-02-11 14:25:37.331         10    f        95    100   

  gc_label     creation_date_global       id_group  ... fruit_situation_label  \
0        B  2019-03-13 07:03:13.632  b6a3d931cbbaf  ...                   jzy   
1        B  2018-12-18 18:28:41.942  1b35749232404  ...                  hetz   
2        D  2018-01-17 13:12:05.124  8f7612ff2c9cc  ...                    ag   
3        B  2018-11-07 13:21:33.877  2e3620e03b5f3  ...                    ag   
4        H  2018-10-16 10:17:01.716  ac19c1e8abd0d  ...                  hetz   

  fruits_or_vegetables  nu

### Preprocessing

1.  **Categorical Variables**:

    *   I applied **One-Hot Encoding** to categorical columns with a small number of distinct values.

    *   categorical\_columns = \[ 'AP', 'ctc', 'gc\_label', 'favorite\_fruit', 'fruit\_situation\_label', 'fruits\_or\_vegetables', 'hobby', 'green\_vegetables', 'vegetable\_type'\]
        
    
            
2.  **Quantitative Variables**:
    
    *   Some numeric columns take only a small number of distinct values, which can be large or small, positive or negative (for example, 'situation' contains -1, 10, and 30). I treated these columns as categories and applied **Frequency Encoding** to them.

    *   For categorical columns with a large number of unique values (such as 'ville'), I also applied **Frequency Encoding**. Each value is replaced by its relative frequency, i.e., its number of occurrences divided by the total number of non-missing values in the column. Every encoded value therefore lies between 0 and 1, which keeps the scale consistent across all features.
        
    *   columns\_to\_check = \['situation', 'location', 'gc\_id', 'fruit\_situation\_id', 'number\_of\_fruit'\] + \['ville'\]
        

        
3.  **Date Columns**:
    
    *   I extracted the year, month, and day from each date column, then applied **Frequency Encoding** to each of the newly created columns to encode the temporal information.

    *   date\_columns = \['creation\_date\_answer', 'creation\_date\_global', 'creation\_date\_request'\]
        


4.  **Id Columns**:
    
    *   id\_columns = \['id\_group', 'id\_group\_2', 'id\_group\_3', 'id\_group\_4'\]
        
    *   These columns had a large number of unique values, with approximately 15,000 unique IDs for some, making them likely not very informative for model training. Therefore, I decided to **drop** these columns, assuming they did not provide significant value.
        
5.  **Columns to Remove**:
    
    *   columns\_to\_remove = \[ 'creation\_date\_answer', 'creation\_date\_global', 'id\_group', 'id\_group\_2', 'id\_group\_3', 'creation\_date\_request', 'id\_group\_4'\]
        
6.  **Final Design Matrix**:
    
    *   Finally, I replaced all missing values with 0. With every variable encoded as a real number between 0 and 1, the final **design matrix** was ready for model training, with the **'id'** column retained to preserve the identity of each observation.
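
To make the frequency-encoding step concrete, here is a minimal sketch on toy data. The column name 'ville' comes from this notebook; the values are made up. One assumption differs from the encoder below: the frequencies are computed on the training rows only and then reused for the test rows, whereas the encoder in this notebook recomputes them on each frame it transforms.

```python
import pandas as pd

# Toy data: the values are invented for illustration only.
train = pd.DataFrame({"ville": ["Paris", "Paris", "Lyon", "Nice", None]})
test = pd.DataFrame({"ville": ["Paris", "Lyon", "Lille"]})

# Relative frequency of each category among the non-missing training values
# (value_counts skips missing values by default).
freq = train["ville"].value_counts(normalize=True)

# Map the training frequencies onto both sets; unseen or missing values become 0.
train["ville_encoded"] = train["ville"].map(freq).fillna(0.0)
test["ville_encoded"] = test["ville"].map(freq).fillna(0.0)

print(train["ville_encoded"].tolist())  # [0.5, 0.5, 0.25, 0.25, 0.0]
print(test["ville_encoded"].tolist())   # [0.5, 0.25, 0.0]
```

Reusing the training frequencies keeps the meaning of an encoded value identical across train and test, at the cost of mapping unseen categories such as "Lille" to 0.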

In [4]:
class DummyEncoder(BaseEstimator, TransformerMixin):
    """
    Dummy Encoder for performing One-Hot Encoding on categorical columns,
    Frequency Encoding on 'ville', numeric columns, and date-related columns,
    and dropping specific columns. Also, extracts date components for specific date columns.
    """

    def __init__(self, categorial_columns: list = None,
                 columns_to_drop: list = None, 
                 columns_to_check: list = None,
                 date_columns: list = None):
        """
        Initializes the DummyEncoder.

        Parameters:
        categorial_columns : list, optional
            List of categorical column names to encode (except 'ville').
        columns_to_drop : list, optional
            List of column names to drop.
        columns_to_check : list, optional
            List of columns to apply frequency encoding to (including 'ville').
        date_columns : list, optional
            List of date columns to extract year, month, and day components.
        """
        self.categorial_columns = categorial_columns
        self.columns_to_drop = columns_to_drop if columns_to_drop is not None else []
        self.columns_to_check = columns_to_check if columns_to_check is not None else []
        self.date_columns = date_columns if date_columns is not None else []

    def fit(self, X: pd.DataFrame, y: np.ndarray = None) -> 'DummyEncoder':
        """
        Fits the encoder to the input data, learning the unique categories for each categorical column.

        Parameters:
        X : pandas DataFrame
            The input data to fit the encoder on.
        y : None
            Not used, but required for compatibility with scikit-learn API.

        Returns:
        self : returns an instance of self.
        """
        self.unique_categories_ = {}
        if self.categorial_columns is None:
            self.categorial_columns = X.select_dtypes(include=['object', 'category']).columns
        for col in self.categorial_columns:
            self.unique_categories_[col] = X[col].dropna().unique()
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        """
        Transforms the input data by applying One-Hot Encoding to the categorical columns,
        Frequency Encoding to 'ville' and numeric columns, and dropping specific columns.

        Parameters:
        X : pandas DataFrame
            The input data to transform.

        Returns:
        X_transformed : pandas DataFrame
            The transformed data with encoded columns and specific columns dropped.
        """
        X_transformed = X.copy()
        X_transformed = self.extract_date_components(X_transformed)
        
        # Drop specified columns
        X_transformed = self.drop_columns(X_transformed)
        
        # Frequency encoding for specific columns including date components
        for col in self.columns_to_check + ['ville'] + [col + '_year' for col in self.date_columns] + \
            [col + '_month' for col in self.date_columns] + [col + '_day' for col in self.date_columns]:
            if col in X_transformed:
                X_transformed = self.encode_by_frequency(X_transformed, col)
        
        # One-Hot Encoding for categorical columns
        for col in self.unique_categories_.keys():
            if col in X_transformed:
                X_transformed = self.encode_by_onehot(X_transformed, col)
        
        # Replace NaN with 0 and boolean values with 0/1
        X_transformed.fillna(0, inplace=True)
        X_transformed = X_transformed.replace({True: 1, False: 0})
        
        return X_transformed

    def encode_by_frequency(self, X: pd.DataFrame, column: str) -> pd.DataFrame:
        """
        Encodes a categorical or numeric column by the frequency of each category or value.

        Parameters:
        X : pandas DataFrame
            The input data.
        column : str
            The column to be encoded.

        Returns:
        X : pandas DataFrame
            The data with the frequency-encoded column.
        """
        if column in X.columns:
            freq_encoding = X[column].value_counts(normalize=True)
            X[column + '_encoded'] = X[column].map(freq_encoding)
            X = X.drop(columns=[column])
        return X

    def encode_by_onehot(self, X: pd.DataFrame, column: str) -> pd.DataFrame:
        """
        Encodes a categorical column by One-Hot Encoding.

        Parameters:
        X : pandas DataFrame
            The input data.
        column : str
            The column to be encoded.

        Returns:
        X : pandas DataFrame
            The data with One-Hot encoded column.
        """
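        if column in X.columns:
            # Reuse the categories learned in fit() so that train and test
            # always produce the same dummy columns, in the same order.
            dtype = pd.CategoricalDtype(categories=self.unique_categories_.get(column))
            dummies = pd.get_dummies(X[column].astype(dtype), prefix=column)
            X = X.drop(columns=[column])
            X = pd.concat([X, dummies], axis=1)
        return X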

    def drop_columns(self, X: pd.DataFrame) -> pd.DataFrame:
        """
        Drops specific columns from the DataFrame.

        Parameters:
        X : pandas DataFrame
            The input data containing columns to be dropped.

        Returns:
        X : pandas DataFrame
            The input data with specific columns removed.
        """
        X = X.drop(columns=self.columns_to_drop, errors='ignore')
        return X
    
    def extract_date_components(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Extracts year, month, and day from specified date columns and creates new columns.
        
        Parameters:
        df : pandas DataFrame
            The DataFrame with date columns.
        
        Returns:
        df : pandas DataFrame
            The DataFrame with new year, month, and day columns for each specified date column.
        """
        for col in self.date_columns:
            if col in df.columns:
                df[col] = pd.to_datetime(df[col], errors='coerce')
                
                if df[col].isnull().all():
                    print(f"Warning: Column {col} could not be converted to datetime.")
                    continue

                df[col + '_year'] = df[col].dt.year
                df[col + '_month'] = df[col].dt.month
                df[col + '_day'] = df[col].dt.day
        return df


columns_to_remove = [
    'creation_date_answer', 'creation_date_global', 'id_group', 
    'id_group_2', 'id_group_3', 'creation_date_request', 'id_group_4'
]
columns_to_check = ['situation', 'location', 'gc_id', 'fruit_situation_id', 'number_of_fruit']
date_columns = ['creation_date_answer', 'creation_date_global', 'creation_date_request']

# Initialize the DummyEncoder
encoder = DummyEncoder(
    categorial_columns=[
        'AP', 'ctc', 'gc_label', 'favorite_fruit', 'fruit_situation_label', 
        'fruits_or_vegetables', 'hobby', 'green_vegetables', 'vegetable_type'
    ], 
    columns_to_drop=columns_to_remove,
    columns_to_check=columns_to_check,
    date_columns=date_columns  
)

# Prepare and encode the data
df_train = train_df.copy()
df_train.columns = df_train.columns.str.strip()  # Ensure columns are stripped of extra spaces
df_encoded_train = encoder.fit_transform(df_train)

# Save the final encoded dataframe with the new date columns
file_path_dir = data_dir / "design_matrix.csv"
df_encoded_train.to_csv(file_path_dir, index=False)

print(df_encoded_train.head())


              id  target  situation_encoded  location_encoded  gc_id_encoded  \
0  a46cfa61ea20a       0             0.9586           0.00908        0.02100   
1  c3d0cb8f0c5e2       1             0.9586           0.01876        0.55252   
2  05dfbe0ec3a8b       0             0.9586           0.04980        0.55252   
3  952e869ee1076       1             0.9586           0.01848        0.55252   
4  5bd0e71b1395b       1             0.9586           0.01016        0.15416   

   fruit_situation_id_encoded  number_of_fruit_encoded  ville_encoded  \
0                     0.03732                  0.71672            0.0   
1                     0.28764                  0.71672            0.0   
2                     0.42604                  0.71672            0.0   
3                     0.28764                  0.21260            0.0   
4                     0.28764                  0.71672            0.0   

   creation_date_answer_year_encoded  creation_date_global_year_encoded  ...  \


#### **Data Splitting**

*   **Training Set**: 60% of the original dataset.
    
*   **Validation Set**: 20%.
    
*   **Test Set**: 20%.
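
The 60/20/20 split can be obtained with two successive calls to `train_test_split`. The sketch below uses synthetic data as a stand-in for the design matrix. The `stratify` argument is an addition that the notebook's `split_data` does not use; it keeps class proportions equal across the three sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the design matrix: 1000 rows, 4 balanced classes.
X = np.arange(1000).reshape(-1, 1)
y = np.tile(np.arange(4), 250)

# First cut: 60% train, 40% held out.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)
# Second cut: halve the held-out 40% into 20% validation and 20% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```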
    

#### **Models Used**

1.  Logistic Regression
    
2.  Random Forest
    
3.  Gradient Boosting
    
4.  K Nearest Neighbors
    
5.  Extra Trees
    
6.  AdaBoost
    
7.  XGBoost (with **Optuna** for hyperparameter tuning due to compatibility issues with scikit-learn).

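I tuned the hyperparameters of each scikit-learn model with 3-fold cross-validation (`GridSearchCV`) on the training set and kept the best parameter set for each model.
XGBoost was tuned with Optuna, which minimized the log-loss on the validation set.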
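
The grid search in this notebook ranks candidates by accuracy, while the models are also compared on log-loss. As a hedged variant, the sketch below runs the same kind of 3-fold search with `scoring="neg_log_loss"`. The data is synthetic (`make_classification`) and the grid is a reduced, illustrative one.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic 4-class problem standing in for the real design matrix.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=4, random_state=42
)

# 3-fold cross-validated search; scikit-learn maximizes the score,
# so log-loss is exposed with a negative sign.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=3,
    scoring="neg_log_loss",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Cross-validated log-loss:", -search.best_score_)
```

Selecting on log-loss favors well-calibrated probabilities, which matters here because the saved predictions are class probabilities rather than hard labels.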

    

#### **Evaluation Metrics**

Each model was evaluated using:

*   Log-Loss for train, validation, and test sets.
    
*   Accuracy for train, validation, and test sets.
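
As a small worked example of the two metrics, the sketch below computes log-loss by hand and compares it with `sklearn.metrics.log_loss`. The labels and probabilities are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

# Made-up labels and predicted probabilities: 3 samples, 4 classes.
y_true = np.array([0, 2, 3])
proba = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.2, 0.5, 0.1],
    [0.1, 0.6, 0.2, 0.1],
])

# Log-loss: mean negative log of the probability assigned to the true class.
manual = -np.mean(np.log(proba[np.arange(len(y_true)), y_true]))

# 'labels' is needed because class 1 never appears in y_true.
sk = log_loss(y_true, proba, labels=[0, 1, 2, 3])

# Accuracy: share of samples whose most probable class is the true one.
acc = accuracy_score(y_true, proba.argmax(axis=1))

print(f"manual log-loss: {manual:.4f}, sklearn log-loss: {sk:.4f}, accuracy: {acc:.2f}")
```

The third sample shows why both metrics are reported: a confident wrong prediction (probability 0.1 on the true class) costs one error in accuracy but dominates the log-loss.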
    

#### **Predictions**

*   Saved in a directory named **predictions**:
    
    *   **all-1**: Contains predictions from all models except the best-performing one.
        
    *   **best\_predictions**: Contains predictions from the model with the best performance.
        

#### **Format of Predictions**

Each file includes:

*   id: Identifier for the sample.
    
*   class\_0, class\_1, class\_2, class\_3: Predicted probability of each class.

In [5]:
# Prepare data function
def prepare_data(df):
    """
    Prepares the data by separating the features (X) and target (y).
    
    Parameters:
    df : pandas DataFrame
        The dataset containing the target column and features.

    Returns:
    X : pandas DataFrame
        The feature set excluding target and id.
    y : pandas Series
        The target column.
    """
    X = df.drop(columns=['target', 'id']).copy()  
    y = df['target'].copy()
    return X, y

# Split data into train, validation, and test sets
def split_data(X, y):
    """
    Splits the data into training, validation, and test sets.
    
    Parameters:
    X : pandas DataFrame
        The features.
    y : pandas Series
        The target variable.
    
    Returns:
    X_train, X_val, X_test : pandas DataFrame
        The training, validation, and test feature sets.
    y_train, y_val, y_test : pandas Series
        The training, validation, and test target sets.
    """
    X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
    return X_train, X_val, X_test, y_train, y_val, y_test

# Optuna hyperparameter optimization for XGBoost
def objective(trial, X_train, y_train, X_val, y_val):
    """
    Objective function for Optuna to optimize XGBoost hyperparameters.
    
    Parameters:
    trial : optuna.Trial
        The trial object for hyperparameter search.
    X_train : pandas DataFrame
        The training features.
    y_train : pandas Series
        The training target variable.
    X_val : pandas DataFrame
        The validation features.
    y_val : pandas Series
        The validation target variable.
    
    Returns:
    float
        The log loss value for the validation set.
    """
    model = XGBClassifier(
        objective='multi:softprob',
        num_class=len(y_train.unique()),
        max_depth=trial.suggest_int('max_depth', 3, 10),
        learning_rate=trial.suggest_float('learning_rate', 0.01, 0.2),
        n_estimators=trial.suggest_int('n_estimators', 50, 200),
        subsample=trial.suggest_float('subsample', 0.8, 1.0),
        colsample_bytree=trial.suggest_float('colsample_bytree', 0.8, 1.0),
        random_state=42
    )

    model.fit(X_train, y_train)
    y_pred_val = model.predict_proba(X_val)
    return log_loss(y_val, y_pred_val)

# Train XGBoost using Optuna for hyperparameter tuning
def tune_xgb_with_optuna(X_train, y_train, X_val, y_val):
    """
    Tunes hyperparameters of the XGBoost model using Optuna.
    
    Parameters:
    X_train : pandas DataFrame
        The training features.
    y_train : pandas Series
        The training target variable.
    X_val : pandas DataFrame
        The validation features.
    y_val : pandas Series
        The validation target variable.
    
    Returns:
    best_model : XGBClassifier
        The XGBoost model with the best hyperparameters.
    best_params : dict
        The best hyperparameters found by Optuna.
    """
    study = optuna.create_study(direction='minimize')
    study.optimize(lambda trial: objective(trial, X_train, y_train, X_val, y_val), n_trials=50)

    best_params = study.best_trial.params
    best_model = XGBClassifier(
        objective='multi:softprob',
        num_class=len(y_train.unique()),
        **best_params,
        random_state=42
    )
    best_model.fit(X_train, y_train, verbose=False)
    return best_model, best_params

# Fine-tuning for other models using GridSearchCV
def tune_model(X_train, y_train, model_type):
    """
    Tunes hyperparameters of different models using GridSearchCV.
    
    Parameters:
    X_train : pandas DataFrame
        The training features.
    y_train : pandas Series
        The training target variable.
    model_type : str
        The type of model to be tuned.
    
    Returns:
    best_estimator : estimator
        The model with the best hyperparameters.
    """
    if model_type == "xgb":
        return None
    elif model_type == "logreg":
        model = LogisticRegression(max_iter=1000, random_state=42)
        param_grid = {
            'C': [0.01, 0.1, 1, 10],
            'penalty': ['l2'],
            'solver': ['lbfgs', 'liblinear']
        }
    elif model_type == "rf":
        model = RandomForestClassifier(random_state=42)
        param_grid = {
            'n_estimators': [50, 100],
            'max_depth': [10, 20],
            'min_samples_split': [2, 5],
            'min_samples_leaf': [1, 2]
        }
    elif model_type == "gb":
        model = GradientBoostingClassifier(random_state=42)
        param_grid = {
            'n_estimators': [100, 200],
            'learning_rate': [0.01, 0.1],
            'max_depth': [3, 6],
            'subsample': [0.8, 1.0]
        }
    elif model_type == "knn":
        model = KNeighborsClassifier()
        param_grid = {
            'n_neighbors': [3, 5, 7],
            'weights': ['uniform', 'distance'],
            'p': [1, 2]
        }
    elif model_type == "et":
        model = ExtraTreesClassifier(random_state=42)
        param_grid = {
            'n_estimators': [50, 100],
            'max_depth': [10, 20],
            'min_samples_split': [2, 5],
            'min_samples_leaf': [1, 2]
        }
    elif model_type == "ada":
        model = AdaBoostClassifier(random_state=42)
        param_grid = {
            'n_estimators': [50, 100],
            'learning_rate': [0.01, 0.1]
        }
    else:
        raise ValueError("Model type not recognized")

    grid_search = GridSearchCV(model, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
    grid_search.fit(X_train, y_train)
    print(f"Best parameters for {model_type}: {grid_search.best_params_}")
    return grid_search.best_estimator_

# Evaluate model performance
def evaluate_model(model, X_train, X_val, X_test, y_train, y_val, y_test, model_name):
    """
    Evaluates the performance of the model on train, validation, and test datasets.
    
    Parameters:
    model : estimator
        The trained model to evaluate.
    X_train, X_val, X_test : pandas DataFrame
        The feature sets for train, validation, and test data.
    y_train, y_val, y_test : pandas Series
        The target values for train, validation, and test data.
    model_name : str
        The name of the model being evaluated.
    
    Returns:
    None
    """
    log_loss_train = log_loss(y_train, model.predict_proba(X_train))
    log_loss_val = log_loss(y_val, model.predict_proba(X_val))
    log_loss_test = log_loss(y_test, model.predict_proba(X_test))

    accuracy_train = accuracy_score(y_train, model.predict(X_train))
    accuracy_val = accuracy_score(y_val, model.predict(X_val))
    accuracy_test = accuracy_score(y_test, model.predict(X_test))

    print(f"{model_name} Train Log-Loss: {log_loss_train:.4f}")
    print(f"{model_name} Validation Log-Loss: {log_loss_val:.4f}")
    print(f"{model_name} Test Log-Loss: {log_loss_test:.4f}")
    print(f"{model_name} Train Accuracy: {accuracy_train * 100:.2f}%")
    print(f"{model_name} Validation Accuracy: {accuracy_val * 100:.2f}%")
    print(f"{model_name} Test Accuracy: {accuracy_test * 100:.2f}%")

# Save predictions
def make_predictions(model, df_encoded_test, encoder, model_name):
    """
    Makes predictions on the test dataset and saves the results.
    
    Parameters:
    model : estimator
        The trained model to make predictions.
    df_encoded_test : pandas DataFrame
        The test data features.
    encoder : transformer
        The encoder used to transform the data.
    model_name : str
        The name of the model for saving the predictions.
    
    Returns:
    output_filename : str
        The path to the saved prediction file.
    """
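    ids = df_encoded_test['id']
    # df_encoded_test was already transformed in encode_data(), so it is not
    # passed through the encoder again; only the identifier column is removed.
    features = df_encoded_test.drop(columns=['id'])

    y_test_pred = model.predict_proba(features)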

    output = pd.DataFrame(y_test_pred, columns=[f'class_{i}' for i in range(y_test_pred.shape[1])])
    output['id'] = ids
    output = output[['id'] + [f'class_{i}' for i in range(y_test_pred.shape[1])]]
    
    output_filename = data_dir / f"predictions_{model_name}.csv"
    output.to_csv(output_filename, index=False)
    print(f"Predictions for {model_name} saved to {output_filename}")

    return output_filename

# Encode data function
def encode_data(train_df, test_df, encoder):
    """
    Encodes the training and test datasets using the given encoder.
    
    Parameters:
    train_df : pandas DataFrame
        The training dataset.
    test_df : pandas DataFrame
        The test dataset.
    encoder : transformer
        The encoder to transform the data.
    
    Returns:
    df_encoded_train : pandas DataFrame
        The encoded training dataset.
    df_encoded_test : pandas DataFrame
        The encoded test dataset.
    """
    df_train = train_df.copy()
    df_train.columns = df_train.columns.str.strip()
    df_encoded_train = encoder.fit_transform(df_train)
    
    df_test = test_df.copy()
    df_test.columns = df_test.columns.str.strip()
    df_encoded_test = encoder.transform(df_test)
    
    return df_encoded_train, df_encoded_test

# Create directories to save predictions
def create_prediction_directories():
    """
    Creates directories for saving the prediction files.
    
    Returns:
    all_predictions_path : str
        Path to the "all-1" directory.
    best_predictions_path : str
        Path to the "best_predictions" directory.
    """
    base_path = data_dir / "predictions"
    all_predictions_path = os.path.join(base_path, "all-1")
    best_predictions_path = os.path.join(base_path, "best_predictions")

    os.makedirs(all_predictions_path, exist_ok=True)
    os.makedirs(best_predictions_path, exist_ok=True)

    return all_predictions_path, best_predictions_path

# Move prediction files to appropriate directories
def move_predictions_to_directories(predictions, model_name, all_predictions_path, best_predictions_path):
    """
    Moves prediction files to the appropriate directories based on the model.
    
    Parameters:
    predictions : dict
        Dictionary containing model names and their respective prediction file paths.
    model_name : str
        The name of the model to move.
    all_predictions_path : str
        Path to the "all-1" directory.
    best_predictions_path : str
        Path to the "best_predictions" directory.
    
    Returns:
    None
    """
    for model_type, prediction_file in predictions.items():
        # The model named by model_name is the best one; all others go to "all-1".
        if model_type == model_name:
            target_dir = best_predictions_path
        else:
            target_dir = all_predictions_path
        os.rename(prediction_file, os.path.join(target_dir, os.path.basename(prediction_file)))

# Main pipeline
def main(train_df, test_df, encoder):
    """
    Main pipeline for training and evaluating models, and saving predictions.
    
    Parameters:
    train_df : pandas DataFrame
        The training dataset.
    test_df : pandas DataFrame
        The test dataset.
    encoder : transformer
        The encoder used to transform the data.
    
    Returns:
    None
    """
    df_encoded_train, df_encoded_test = encode_data(train_df, test_df, encoder)
    
    X, y = prepare_data(df_encoded_train)
    X_train, X_val, X_test, y_train, y_val, y_test = split_data(X, y)

    all_predictions_path, best_predictions_path = create_prediction_directories()

    predictions = {}

    model_types = ["logreg", "rf", "gb", "knn", "et", "ada"] 
    for model_type in model_types:
        print(f"Training {model_type} model...")
        model = tune_model(X_train, y_train, model_type) 
        evaluate_model(model, X_train, X_val, X_test, y_train, y_val, y_test, model_type)
        pred_filename = make_predictions(model, df_encoded_test, encoder, model_type)
        predictions[model_type] = pred_filename

    print("Training XGBoost with Optuna...")
    best_xgb_model, best_params = tune_xgb_with_optuna(X_train, y_train, X_val, y_val)
    evaluate_model(best_xgb_model, X_train, X_val, X_test, y_train, y_val, y_test, "xgb")
    xgb_pred_filename = make_predictions(best_xgb_model, df_encoded_test, encoder, "xgb")
    predictions["xgb"] = xgb_pred_filename
    
    move_predictions_to_directories(predictions, "xgb", all_predictions_path, best_predictions_path)

# Call the main function
main(train_df, test_df, encoder)


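# Silence error messages and other non-essential output from this point on
import logging
import sys

sys.stdout = open(os.devnull, 'w')
sys.stderr = open(os.devnull, 'w')
# Deactivate Optuna's INFO logs (Optuna uses its own logger, so the root
# logging level alone does not silence it)
logging.basicConfig(level=logging.ERROR)
optuna.logging.set_verbosity(optuna.logging.WARNING)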

Training logreg model...
Best parameters for logreg: {'C': 1, 'penalty': 'l2', 'solver': 'lbfgs'}
logreg Train Log-Loss: 0.8283
logreg Validation Log-Loss: 0.8334
logreg Test Log-Loss: 0.8823
logreg Train Accuracy: 68.37%
logreg Validation Accuracy: 68.36%
logreg Test Accuracy: 66.00%
Predictions for logreg saved to /Users/matthieu/Downloads/candidat_2/predictions_logreg.csv
Training rf model...
Best parameters for rf: {'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 50}
rf Train Log-Loss: 0.4321
rf Validation Log-Loss: 0.7920
rf Test Log-Loss: 0.8221
rf Train Accuracy: 84.42%
rf Validation Accuracy: 69.76%
rf Test Accuracy: 67.32%
Predictions for rf saved to /Users/matthieu/Downloads/candidat_2/predictions_rf.csv
Training gb model...
Best parameters for gb: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 200, 'subsample': 0.8}
gb Train Log-Loss: 0.6358
gb Validation Log-Loss: 0.7086
gb Test Log-Loss: 0.7396
gb Train Accuracy: 73.93%
gb Validation

[I 2024-12-26 18:54:47,672] A new study created in memory with name: no-name-2b0a04fd-6f3f-4d0d-9471-7c27052ebfb4


Predictions for ada saved to /Users/matthieu/Downloads/candidat_2/predictions_ada.csv
Training XGBoost with Optuna...


[I 2024-12-26 18:54:48,660] Trial 0 finished with value: 0.6884694128690173 and parameters: {'max_depth': 5, 'learning_rate': 0.19967460122857034, 'n_estimators': 101, 'subsample': 0.9717421661042149, 'colsample_bytree': 0.8122778692171073}. Best is trial 0 with value: 0.6884694128690173.
[I 2024-12-26 18:54:49,871] Trial 1 finished with value: 0.7128756212996743 and parameters: {'max_depth': 5, 'learning_rate': 0.07284851557451076, 'n_estimators': 120, 'subsample': 0.9411676055756003, 'colsample_bytree': 0.8742518081125685}. Best is trial 0 with value: 0.6884694128690173.
[I 2024-12-26 18:54:50,814] Trial 2 finished with value: 0.7035637076901017 and parameters: {'max_depth': 7, 'learning_rate': 0.09840995495586018, 'n_estimators': 58, 'subsample': 0.8569691137962188, 'colsample_bytree': 0.8840337391313009}. Best is trial 0 with value: 0.6884694128690173.
[I 2024-12-26 18:54:53,296] Trial 3 finished with value: 0.7226665195517653 and parameters: {'max_depth': 10, 'learning_rate': 0.15

xgb Train Log-Loss: 0.4631
xgb Validation Log-Loss: 0.6821
xgb Test Log-Loss: 0.7182
xgb Train Accuracy: 81.83%
xgb Validation Accuracy: 71.86%
xgb Test Accuracy: 69.94%


Predictions for xgb saved to /Users/matthieu/Downloads/candidat_2/predictions_xgb.csv


### Why XGBoost is the Best

As expected, **XGBoost** outperformed the other models, reaching a test accuracy of about **70%** (69.94%) with a test log-loss of 0.7182, which makes it the best-performing model for this task. Several factors contribute to XGBoost's superior performance:

1.  **Boosting Technique**: XGBoost uses the **gradient boosting** framework, which builds an ensemble of weak models (decision trees) sequentially. Each subsequent model corrects the errors made by previous ones, leading to better generalization and higher accuracy.
    
2.  **Regularization**: XGBoost includes built-in regularization (L1 and L2), which helps to prevent overfitting, a common issue with tree-based models. This leads to more robust and accurate predictions, especially on unseen data.
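
For reference, the regularized objective that XGBoost minimizes can be written as follows, where $l$ is the training loss, $f_k$ is the $k$-th tree, $T$ is its number of leaves, and $w_j$ are its leaf weights:

$$
\mathcal{L} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} \lvert w_j \rvert
$$

In `XGBClassifier`, $\gamma$, $\lambda$, and $\alpha$ correspond to the `gamma`, `reg_lambda`, and `reg_alpha` parameters. The Optuna search in this notebook does not tune them, so they stay at the library defaults.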

### Further Development

While XGBoost has delivered solid results, there are several opportunities to further enhance its performance:

1.  **Increase the Number of Optimization Steps**: Currently, the hyperparameter tuning process using Optuna has a limited number of trials (50). Increasing this number would provide a deeper exploration of the hyperparameter space and might lead to a better combination of parameters, further improving the model's performance.
    
2.  **Use a Different Search Strategy**: Optuna's default sampler is TPE (Tree-structured Parzen Estimator). Other samplers it provides, such as CMA-ES, or a pruner that stops unpromising trials early, might explore the hyperparameter space more efficiently. Gradient-based optimizers such as Adam are not applicable here, because the tree hyperparameters being tuned are not differentiable.
    
3.  **Feature Engineering**: Exploring additional feature engineering techniques (e.g., interaction features or polynomial features) could help capture more complex patterns in the data, improving model performance.
    
4.  **Model Interpretability**: XGBoost already exposes feature importances. Techniques such as SHAP (SHapley Additive exPlanations) could give more detailed insight into how the model makes its decisions and make its predictions more transparent and explainable.