# 1. Machine Leaning Problem
>The absence of a credit history might mean a lot of things, including young age or a preference for cash. Without traditional data, someone with little to no credit history is likely to be denied. Consumer finance providers must accurately determine which clients can repay a loan and which cannot and data is key. If data science could help better predict one’s repayment capabilities, loans might become more accessible to those who may benefit from them the most.
Currently, consumer finance providers use various statistical and machine learning methods to predict loan risk. These models are generally called scorecards. In the real world, clients' behaviors change constantly, so every scorecard must be updated regularly, which takes time. The scorecard's stability in the future is critical, as a sudden drop in performance means that loans will be issued to worse clients on average. The core of the issue is that loan providers aren't able to spot potential problems any sooner than the first due dates of those loans are observable. Given the time it takes to redevelop, validate, and implement the scorecard, stability is highly desirable. There is a trade-off between the stability of the model and its performance, and a balance must be reached before deployment.
Founded in 1997, competition host Home Credit is an international consumer finance provider focusing on responsible lending primarily to people with little or no credit history. Home Credit broadens financial inclusion for the unbanked population by creating a positive and safe borrowing experience. We previously ran a competition with Kaggle that you can see here.
Your work in helping to assess potential clients' default risks will enable consumer finance providers to accept more loan applications. This may improve the lives of people who have historically been denied due to lack of credit history.

In [1]:
import sys
from pathlib import Path
import subprocess
import os
import gc
from glob import glob
import time

import numpy as np
import pandas as pd
import polars as pl
from datetime import datetime
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import  train_test_split,TimeSeriesSplit, GroupKFold, StratifiedGroupKFold
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.metrics import roc_auc_score
import lightgbm as lgb

from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import KNNImputer

ROOT            = Path("/kaggle/input/home-credit-credit-risk-model-stability")
TRAIN_DIR       = ROOT / "csv_files" / "train" 
TEST_DIR        = ROOT / "csv_files" / "test" 

# 2. Data Loading

## 2.1 Aggregration*
> Tables that are depth 1 and depth 2 will be handled in this section

Currently the mean, max and last statistics are used to aggregrate. But it is possible to use more methods. Refer the polars documentation  here https://docs.pola.rs/py-polars/html/reference/dataframe/aggregation.html

In [2]:
class Aggregator:
    """ This class handles the aggregration of the columns in depth > 0 columns
    """
    #Please add or subtract features yourself, be aware that too many features will take up too much space.
    def num_expr(df):
        """Handles the aggregrated new columns for columns which are "P" or "A" type
        
        Args : 
        -df: not aggregrated dataframe (Polars object)
        
        Return:
        - Polars DataFrame: aggregrated new columns
        """
        
        # Select the columns with the P and A suffixes
        cols = [col for col in df.columns if col[-1] in ("P", "A")]
        
        # Create a column conatining the max value of the each selected column
        expr_max = [pl.max(col).alias(f"max_{col}") for col in cols]
        
        # Create a column conatining the last value of the each selected column
        expr_last = [pl.last(col).alias(f"last_{col}") for col in cols]
        
        #expr_first = [pl.first(col).alias(f"first_{col}") for col in cols]
        
        # Create a column conatining the maen value of the each selected column
        expr_mean = [pl.mean(col).alias(f"mean_{col}") for col in cols]
        return expr_max +expr_last+expr_mean
    
    def date_expr(df):
        """Handles the aggregrated new columns for columns which are "D" type
        
        Args : 
        -df: not aggregrated dataframe (Polars object)
        
        Return:
        - Polars DataFrame: aggregrated new columns
        """
        
        # Select the columns with the D suffix
        cols = [col for col in df.columns if col[-1] in ("D")]
        
        # Select the columns with the 
        expr_max = [pl.max(col).alias(f"max_{col}") for col in cols]
        #expr_min = [pl.min(col).alias(f"min_{col}") for col in cols]
        expr_last = [pl.last(col).alias(f"last_{col}") for col in cols]
        #expr_first = [pl.first(col).alias(f"first_{col}") for col in cols]
        expr_mean = [pl.mean(col).alias(f"mean_{col}") for col in cols]
        return  expr_max +expr_last+expr_mean
    
    def str_expr(df):
        """Handles the aggregrated new columns for columns which are "M" type
        
        Args : 
        -df: not aggregrated dataframe (Polars object)
        
        Return:
        - Polars DataFrame: aggregrated new columns
        """
        cols = [col for col in df.columns if col[-1] in ("M",)]
        expr_max = [pl.max(col).alias(f"max_{col}") for col in cols]
        #expr_min = [pl.min(col).alias(f"min_{col}") for col in cols]
        expr_last = [pl.last(col).alias(f"last_{col}") for col in cols]
        #expr_first = [pl.first(col).alias(f"first_{col}") for col in cols]
        #expr_count = [pl.count(col).alias(f"count_{col}") for col in cols]
        return  expr_max +expr_last#+expr_count
    
    def other_expr(df):
        """Handles the aggregrated new columns for columns which are "T" or "L" type
        
        Args : 
        -df: not aggregrated dataframe (Polars object)
        
        Return:
        - Polars DataFrame: aggregrated new columns
        """
        cols = [col for col in df.columns if col[-1] in ("T", "L")]
        expr_max = [pl.max(col).alias(f"max_{col}") for col in cols]
        #expr_min = [pl.min(col).alias(f"min_{col}") for col in cols]
        expr_last = [pl.last(col).alias(f"last_{col}") for col in cols]
        #expr_first = [pl.first(col).alias(f"first_{col}") for col in cols]
        return  expr_max +expr_last
    
    def count_expr(df,filename):
        """Handles the aggregrated new columns for columns "num_group1" and "num_group2"
        
        Args : 
        -df: not aggregrated dataframe (Polars object)
        
        Return:
        - Polars DataFrame: aggregrated new columns
        """
        file_parts = filename.split("_")
        filename =  ""
        for part in file_parts[1:]:
            filename += part
            filename += "_"
            
        cols = [col for col in df.columns if "num_group" in col]
        expr_max = [pl.max(col).alias(f"{str(filename).split('/')[-1][:-4]}max_{col}") for col in cols] 
        #expr_min = [pl.min(col).alias(f"min_{col}") for col in cols]
        expr_last = [pl.last(col).alias(f"{str(filename).split('/')[-1][:-4]}last_{col}") for col in cols]
        #expr_first = [pl.first(col).alias(f"first_{col}") for col in cols]
        return  expr_max +expr_last
    
    def get_exprs(df,filename):
        """Aggregrate the entire dataframe
        
        Args : 
        -df: not aggregrated dataframe (Polars object)
        
        Return:
        - Polars DataFrame: aggregrated new columns
        """
        
     
        # Add all aggregrated columns to create the aggregrated dataframe
        exprs = Aggregator.num_expr(df) +  Aggregator.date_expr(df) + Aggregator.str_expr(df) + Aggregator.other_expr(df) + Aggregator.count_expr(df,filename)
        return exprs
    

            

## 2.2 Concatination
> Tables that are divided into several files will be concatinated in this section.

Handling the credit_bureau_a_1 and credit_bureau_a_2 was a nightmare. I initially tried to load only the important features which is about 10 features for the both tables. And yet it ran out memory during the aggregration stage.It is reasonable as for 10 features it creates 90 features when the depth is 2. So my approch was to load the specified columns and do the aggregration before the concatination. And it worked.  

In [3]:
def concatinate(filename):
    """
    Given the filepath of this concatinate all the files that associated with it
    
    Args : 
    filename : Filepath to be aggregrated

    Return:
    Polars DataFrame: Concatinated Tables
    
    """
    filename = str(filename)
    path = ""
    if filename.split('/')[-1][:2] == "tr":
        file_type = "train"
        path = TRAIN_DIR
    elif filename.split('/')[-1][:2] == "te":
        file_type = "test"
        path = TEST_DIR
    else:
        raise ValueError(f"Unrecognized file {filename.split('/')[-1]}")
    
    file_groups = {
        f"{file_type}_applprev_1_0.csv" : [
            f"{file_type}_applprev_1_1.csv",
            f"{file_type}_applprev_1_2.csv" 
        ],
        f"{file_type}_credit_bureau_a_1_0.csv": [
            f"{file_type}_credit_bureau_a_1_1.csv",
            f"{file_type}_credit_bureau_a_1_2.csv",
            f"{file_type}_credit_bureau_a_1_3.csv",
            f"{file_type}_credit_bureau_a_1_4.csv"
        ],
        f"{file_type}_credit_bureau_a_2_0.csv": [
            f"{file_type}_credit_bureau_a_2_1.csv",
            f"{file_type}_credit_bureau_a_2_2.csv",
            f"{file_type}_credit_bureau_a_2_3.csv",
            f"{file_type}_credit_bureau_a_2_4.csv",
            f"{file_type}_credit_bureau_a_2_5.csv",
            f"{file_type}_credit_bureau_a_2_6.csv",
            f"{file_type}_credit_bureau_a_2_7.csv",
            f"{file_type}_credit_bureau_a_2_8.csv",
            f"{file_type}_credit_bureau_a_2_9.csv",
            f"{file_type}_credit_bureau_a_2_10.csv",
        ],
        f"{file_type}_static_0_0": [
            f"{file_type}_static_0_1",
            f"{file_type}_static_0_2"
        ],  
    }
    
    selected_features = {
        f"{file_type}_credit_bureau_a_1_0.csv": [
            'case_id',
            'dateofcredend_289D',
            'refreshdate_3813885D',
            'numberofinstls_320L',
            'numberofcontrsvalue_358L',
            'nominalrate_281L',
            'numberofoutstandinstls_59L',
            'numberofoverdueinstlmax_1039L',
            'num_group1'
        ],
        f"{file_type}_credit_bureau_a_2_0.csv": [
            'case_id',
            'collater_typofvalofguarant_407M',
            'collaterals_typeofguarante_359M',
            'subjectroles_name_541M',
            'collater_typofvalofguarant_298M',
            'collaterals_typeofguarante_669M',
            'subjectroles_name_838M',
            'pmts_year_1139T',
            'pmts_year_507T',
            'num_group2',
            'num_group1'
        ]
    }
    
    
    if filename.split('/')[-1] in list(selected_features.keys()):
        concatinated_df = pl.read_csv(f"{filename}", null_values="a55475b1", columns=selected_features[filename.split('/')[-1]])
        if "num_group2" in concatinated_df.columns:
            concatinated_df = concatinated_df.group_by("case_id","num_group1").agg(Aggregator.get_exprs(concatinated_df,filename.split('/')[-1])).sort("case_id")
            concatinated_df = concatinated_df.group_by("case_id").agg(Aggregator.get_exprs(concatinated_df, filename.split('/')[-1])).sort("case_id")
        
        elif "num_group1" in concatinated_df.columns:
            concatinated_df = concatinated_df.group_by("case_id").agg(Aggregator.get_exprs(concatinated_df,filename.split('/')[-1])).sort("case_id")
    else:
        concatinated_df = pl.read_csv(f"{filename}", null_values="a55475b1")

        
    
    if filename.split('/')[-1] in list(file_groups.keys()):
        
        
            
        for file in file_groups[f"{filename.split('/')[-1]}"]:
            
            if file not in os.listdir(path):
                continue
            
            df = pl.read_csv(f"{path}/{file}", null_values="a55475b1") 
            if "num_group2" in df.columns:
                df = df.group_by("case_id","num_group1").agg(Aggregator.get_exprs(df,filename.split('/')[-1])).sort("case_id")
                df = df.group_by("case_id").agg(Aggregator.get_exprs(df, filename.split('/')[-1])).sort("case_id")
        
            elif "num_group1" in df.columns:
                df = df.group_by("case_id").agg(Aggregator.get_exprs(df,filename.split('/')[-1])).sort("case_id")

            
            concatinated_df = pl.concat([concatinated_df, df], how="diagonal_relaxed")
 
    
    return concatinated_df

## 2.3 Handling Invaluble Attributes
> Columns that have higher null value ratio might give higher error when it is imputed. Therefore those columns need to be removed. Also there might be columns that has only one unique values or one unique value and missing values (After imputing there will be only one unique value). Since these do not provide any insights to the model we can remove them. 

In [4]:
def drop_columns_with_high_missing_values(df, threshold=50):
    """
    Drop columns with missing values exceeding a given threshold percentage.
    
    Args:
    - df: DataFrame: Input DataFrame
    - threshold: float: Threshold percentage for missing values
    
    Returns:
    - DataFrame: Modified DataFrame with columns dropped
    """
    
    if isinstance(df, pl.DataFrame):
        df = df.to_pandas()
        
    # Calculate the percentage of missing values for each column
    missing_percentage = (df.isna().sum() / len(df)) * 100

    # Filter columns with missing percentage greater than the threshold
    columns_to_drop = missing_percentage[missing_percentage > threshold].index

    # Drop columns with more than the threshold percentage of missing values
    df_cleaned = df.drop(columns=columns_to_drop)

    # Display information about dropped columns
#     print("Columns dropped due to more than {}% missing values:".format(threshold))
#     print(columns_to_drop)

#     # Display information about the cleaned DataFrame
#     print("Shape of cleaned DataFrame:", df_cleaned.shape)

    return df_cleaned


def remove_single_unique_value_columns(df):
    """Remove the columns with only one unique value as it does not provide any insights to the model.
    
    Args:
    - df: DataFrame: Input Polars DataFrame
    
    Returns:
    - DataFrame: Modified Polars DataFrame with columns dropped    
    
    """
    if isinstance(df, pl.DataFrame):
        df = df.to_pandas()
    unique_value_counts = df.nunique()
    single_unique_value_columns = unique_value_counts[unique_value_counts == 1].index
    df = df.drop(columns=single_unique_value_columns)
    return df


def remove_dominated_columns(df, threshold=0.8):
    """Remove the columns that has higher amount of null values than the given threshold.
    
    Args:
    - df: DataFrame: Input Polars DataFrame
    - threshold: Maximum null value presentage
    
    Returns:
    - DataFrame: Modified Polars DataFrame with columns dropped    
    
    """
    
    if isinstance(df, pl.DataFrame):
        df = df.to_pandas()
        
    num_rows = len(df)
    dominated_columns = []
    
    for col in df.columns:
        # Skip 'case_id' column
        if col == 'case_id':
            continue
        
        value_counts = df[col].value_counts()
        dominant_value_count = value_counts.max()
        dominant_value_percentage = dominant_value_count / num_rows
        
        if dominant_value_percentage >= threshold:
            dominated_columns.append(col)
    
    df = df.drop(columns=dominated_columns)
    return df

## 2.4 Handling Missing Values*
> This section responsible for removing the null values.

As you know there are three ways to remove missing values. We can try doing other ways
*     Dropping null rows/columns
*     Imputing using a statistic (Current Method)
*     **Using a ML model to predic the values**

In [5]:
def impute_missing_values(df):
    """
    Impute missing values in categorical columns with mode and numerical columns with mean.
    
    Args:
    - df: DataFrame: Input DataFrame
    
    Returns:
    - DataFrame: DataFrame with missing values handled using imputation
    """
    # Identify categorical and numerical columns
    categorical_columns = df.select_dtypes(include=['object','category']).columns
    numerical_columns = df.select_dtypes(exclude=['object','category']).columns

    # Impute missing values for categorical columns with the most frequent value (mode)
    for col in categorical_columns:
        if df[col].isna().sum() != len(df):
            df[col].fillna(df[col].mode()[0], inplace=True)
        else:
            df[col].fillna(np.nan, inplace=True)

    # Impute missing values for numerical columns with the mean
    for col in numerical_columns:
        if df[col].isna().sum() != len(df):
            df[col].fillna(df[col].mean(), inplace=True)
        else:
            df[col].fillna(np.nan, inplace=True)

    return df

## 2.5 Handling Datatypes to Optimize the Memory
> Some columns are not assigned properly in the dataframe. In the first part of this sectiion those columns will be converted to corresponding data type according to the data it carries or according to the column suffix. A drawback of this process is it does not optimize the assigned data type (Eg: An integer column that has values which can be interpreted using only 8 bits will be assigned 64 bits.). This issue will be handled in the second part od the section where it checks the minimum  and the maximun values of the column and assign the corresponding data subtype.  

Category datatype reduces the memory consumption of the nominal data in most cases. Check out what is going in there.                                           https://medium.com/analytics-vidhya/unleash-the-power-of-pandas-category-dtype-encode-categorical-data-in-smarter-ways-eb787cd274df 


In [6]:
def handle_data_types(df, pandas = False):
    """Convert the column data types according to thier suffixes and change it back to optimize the memory
    
    Args:
    - df: Input DataFrame (Polars or Pandas)
    - pandas: If True return the pandas dataframe else return the polar dataframe. Default False
    
    Returns:
    - DataFrame: DataFrame (Polaris or Pandas) with correct and optimised data types 
    
    """
    
    if isinstance(df, pl.DataFrame):
        df = df.to_pandas()
    
    for col in df.columns:
        # Convert integer columns to integer columns
        if col in ["case_id", "WEEK_NUM", "num_group1", "num_group2","target"]:
            df[col] = df[col].astype(np.int64) 
            
        # Convert date columns to corresponding timestamps
        elif col in ["date_decision"] or  col[-1] in ("D",):
            # Handles imputed date columns. Faster
#             df[col] = pd.to_datetime(df[col]).apply(lambda x: int(x.timestamp()))
            # Handles missing value one by one. Slower
            df[col] = pd.to_datetime(df[col]).apply(lambda x: int(x.timestamp()) if not pd.isnull(x) else pd.NaT)
        
        # Convert numerical columns to float columns
        elif col[-1] in ("P", "A"):
            df[col] = df[col].astype(np.float64) 
        
        # Convert masked data columns to category data type 
        elif col[-1] in ("M"):
            df[col] = df[col].astype("category")
            
        # If there are columns with strings try converting it to the categorey columns
        elif df[col].dtype == "string" or "object":
            df[col] = df[col].astype("category")
            
        elif df[col].dtype == "datetime64":
            df[col] = pd.to_datetime(df[col]).apply(lambda x: int(x.timestamp()) if not pd.isnull(x) else pd.NaT)

            
            
            
        # Converting String or objects to category can reduce memory space. But when a String or Object column 
        # has null values, it does not convert into category type. Threrefore it is best practice to first 
        # impute the null values and then transform the data types
        
    
    # Iterate throgh the columns    
    for col in df.columns:
        
        # find the column data type
        col_type = df[col].dtype
        
        # If the type is "category" do not optimize
        if str(col_type)=="category":
            continue
        
        # If the type is not "object" optimize
        if col_type != object and col_type == "datetime64":
            
            # min and max of the numerical values
            c_min = df[col].min()
            c_max = df[col].max()
            
            # Check if it an integer
            if str(col_type)[:3] == 'int':
                # Check whether it is possible to reduce the bit amount of the type by checking the min max values
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            # Handles floats
            else:
                print(df[col].dtype)
                # Check whether it is possible to reduce the bit amount of the type by checking the min max values
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
                    
        
        # whatever the other data types do not optimize
        else:
            continue
    
    # If panda attibite is True return the corresponding pandas dataframe else return the polars dataframe
    if pandas:
        return df
    else:
        return pl.from_pandas(df)

## 2.6 Data Reduction*
> Data reduction can be carried out to decrease the memory consumption. Currently an attributes selection is carried out using the correlation for numerical columns

Can try out other attribute data methods like PCA, t-sne, SVD. Or can try out other attribute selection methods.

In [7]:
def reduce_data(df):
    """
    Given a dataframe remove the correlated attributes
    
    Args:
    -df : Pandas dataframe 
    
    Return:
    -pandas.DataFrame : Filtered pandas dataframe
    """
    numerics = ['int8','int16', 'int32', 'int64','float16', 'float32', 'float64']
    numerical_cols = []
    for col in df.columns:
        if df[col].dtype in numerics:
            numerical_cols.append(col)
            
    cor_matrix = df[numerical_cols].corr()
    columns_to_drop = set()
    for i in range(len(cor_matrix.columns)):
        for j in range(i):
            if abs(cor_matrix.iloc[i, j]) > 0.8:
                colname = cor_matrix.columns[i]
                columns_to_drop.add(colname)
    df_filtered = df.drop(columns=columns_to_drop)
    return df_filtered
    

## 2.7 Handling Outliers
> This section will handle the outliers in the numerical columns.

This follows the way we learn in DS module. we are considering 1.5 IQR because it is the value used commonly. Refer https://python.plainenglish.io/identifying-and-handling-outliers-in-pandas-a-step-by-step-guide-fcecd5c6cd3b

In [8]:
def handle_outliers(df):
    """
    Handling the outliers by removing the data instance.
    
    Args :
    -df : Pandas dataframe with outliers
    
    Return:
    - pandas.DataFrame : Pandas dataframes with no autliers 
    """
    numerics = ['int8','int16', 'int32', 'int64','float16', 'float32', 'float64']
    numerical_cols = []
    for col in df.columns:
        if df[col].dtype in numerics:
            numerical_cols.append(col)
#     print(numerical_cols)
    for col in numerical_cols:
        # Calculate 25% percentile and 75% percentile
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)

        # Calculate Interquartile Range (IQR)
        IQR = Q3 - Q1
        
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        
        cleaned_df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
        
    return cleaned_df

## 2.8 Data Transformations
> Data Encodings,and other transformations. Handling nominal data in not possible as onehot encoding is taking too much space. When the nominalcoumn is converted to category type the model accepts it eventhough it is not encoded. Currenly we are handling the ordinal data in the same manner.

Can codider ordinal encoding the ordinal attributes

## 2.9 Main Function of Data Preprocessing
> Main function calls all the preprocessing functionalities for the file specified in filepath and it serialize the preprossesed dataframe while returnning the dataframe.

Instead of serializing it is possible to just write the dataframe to a csv file. But it changes our data type back to original data types(int64, float64 and object). Therefore serialization ( **pickle** ) is used. Refer the following stackoverflow https://stackoverflow.com/questions/72610814/why-do-pandas-dataframes-data-types-change-after-exporting-into-a-csv-file

In [9]:
def data_preprocess(filepath, test = False, enforce=False, save = True):
    """
    Preprocess the data given the table file path.
    
    Args:
    - filepath: Filepath of the csv file to be preprocessed.
    - test: Specify whether the filepath is a test table or not
    - enforce: Enforce data preprocessing eventhough a serielized file exists
    - save: If True serialize the dataframe
    
    Returns:
    - pd.DataFrame: Preprocessed dataframe 
    
    """
    
    start_time = time.time()
    
    print(f"Preprocessing {str(filepath).split('/')[-1]}", end="    ")
    
    if not enforce and os.path.isfile(f"/kaggle/working/{str(filepath).split('/')[-1][:-4]}_preprocessed.pkl"):
        end_time = time.time()
        print(f"Finished in {end_time-start_time}")
        return pd.read_pickle(f"/kaggle/working/{str(filepath).split('/')[-1][:-4]}_preprocessed.pkl")
    
    aggregated_df = None
    for filepath in glob(str(filepath)):
        df = pl.read_csv(filepath, null_values="a55475b1")

        if "num_group2" in df.columns:
            intermediate_aggregated_df = df.group_by(["case_id", "num_group1"]).agg(Aggregator.get_exprs(df,str(filepath).split('/')[-1]))
            aggregated_df = intermediate_aggregated_df.group_by("case_id").agg(Aggregator.get_exprs(intermediate_aggregated_df,str(filepath).split('/')[-1])) if aggregated_df is None else pl.concat([aggregated_df, intermediate_aggregated_df.group_by("case_id").agg(Aggregator.get_exprs(intermediate_aggregated_df,str(filepath).split('/')[-1]))], how="diagonal_relaxed")
        elif "num_group1" in df.columns:
            aggregated_df = df.group_by("case_id").agg(Aggregator.get_exprs(df,str(filepath).split('/')[-1])) if aggregated_df is None else pl.concat([aggregated_df, df.group_by("case_id").agg(Aggregator.get_exprs(df,str(filepath).split('/')[-1]))], how="diagonal_relaxed")
        else: 
            aggregated_df = df if aggregated_df is None else pl.concat([aggregated_df, df], how="diagonal_relaxed")

    # Remove null dominant columns
    if not test:
        df = drop_columns_with_high_missing_values(remove_dominated_columns(aggregated_df))
    
    if isinstance(df, pl.DataFrame):
        df = aggregated_df.to_pandas()
    
    # Impute missing values
    df = impute_missing_values(df)
    
    # Assign and Optimize the data types
    df = handle_data_types(df, pandas=True).sort_values("case_id")
    
    
    if not test:
        # Data Reduction
        df = reduce_data(df)
        # Handling the outliers
        # df = handle_outliers(df)
    
    if save:
        df.to_pickle(f"{str(filepath).split('/')[-1][:-4]}_preprocessed.pkl")  
        
    end_time = time.time()
    
    print(f"Finished in {end_time-start_time}")
    return df   

In [10]:
unique_files = [
    "train_credit_bureau_a_2_*.csv",
    "train_credit_bureau_a_1_*.csv",
    "train_credit_bureau_b_1.csv",
    "train_credit_bureau_b_2.csv",
    "train_applprev_1_*.csv",
    "train_applprev_2.csv",
    "train_debitcard_1.csv",
    "train_deposit_1.csv",
    "train_other_1.csv",
    "train_person_1.csv",
    "train_person_2.csv",
    "train_static_0_*.csv",
    "train_static_cb_*.csv",
    "train_tax_registry_a_1.csv",
    "train_tax_registry_b_1.csv",
    "train_tax_registry_c_1.csv"
]

base = pl.read_csv(TRAIN_DIR / "train_base.csv", null_values="a55475b1")
base = base.to_pandas()
base = handle_data_types(base, pandas=True)

for table in unique_files:
    preproceesed_df = data_preprocess(TRAIN_DIR / table)
    base = base.merge(preproceesed_df, how="left", on="case_id")
    del preproceesed_df
    gc.collect()


base.to_pickle(f"final_dataframe.pkl")  

Preprocessing train_credit_bureau_a_2_*.csv    Finished in 318.3732600212097
Preprocessing train_credit_bureau_a_1_*.csv    Finished in 246.41113305091858
Preprocessing train_credit_bureau_b_1.csv    Finished in 2.7569522857666016
Preprocessing train_credit_bureau_b_2.csv    Finished in 1.5861260890960693
Preprocessing train_applprev_1_*.csv    Finished in 126.11295366287231
Preprocessing train_applprev_2.csv    Finished in 17.267749071121216
Preprocessing train_debitcard_1.csv    Finished in 1.8593082427978516
Preprocessing train_deposit_1.csv    Finished in 3.1188876628875732
Preprocessing train_other_1.csv    Finished in 0.15191292762756348
Preprocessing train_person_1.csv    Finished in 59.530781507492065
Preprocessing train_person_2.csv    Finished in 7.865896224975586
Preprocessing train_static_0_*.csv    Finished in 104.5550262928009
Preprocessing train_static_cb_*.csv    Finished in 19.471354961395264
Preprocessing train_tax_registry_a_1.csv    Finished in 10.240002393722534
Pr

In [11]:
base.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1526659 entries, 0 to 1526658
Columns: 344 entries, case_id to tax_registry_c_1.last_num_group1
dtypes: category(213), float64(122), int64(9)
memory usage: 1.9 GB


In [12]:
base.isnull().mean().mean()

0.3078099582311805

In [13]:
# This section loads the test data set

unique_files = [
    "test_credit_bureau_a_2_*.csv",
    "test_credit_bureau_a_1_*.csv",
    "test_credit_bureau_b_1.csv",
    "test_credit_bureau_b_2.csv",
    "test_applprev_1_*.csv",
    "test_applprev_2.csv",
    "test_debitcard_1.csv",
    "test_deposit_1.csv",
    "test_other_1.csv",
    "test_person_1.csv",
    "test_person_2.csv",
    "test_static_0_*.csv",
    "test_static_cb_0.csv",
    "test_tax_registry_a_1.csv",
    "test_tax_registry_b_1.csv",
    "test_tax_registry_c_1.csv"
]

submission_base = pl.read_csv(TEST_DIR / "test_base.csv", null_values="a55475b1")
submission_base = submission_base.to_pandas()
submission_base = handle_data_types(submission_base, pandas=True)

for table in unique_files:
    preproceesed_df = data_preprocess(TEST_DIR / table, test=True, enforce=True)
    submission_base = submission_base.merge(preproceesed_df, how="left", on="case_id")
    del preproceesed_df
    gc.collect()
    

submission_base.to_pickle(f"submission_dataframe.pkl")  

Preprocessing test_credit_bureau_a_2_*.csv    Finished in 0.4523591995239258
Preprocessing test_credit_bureau_a_1_*.csv    Finished in 0.30438947677612305
Preprocessing test_credit_bureau_b_1.csv    Finished in 0.13110756874084473
Preprocessing test_credit_bureau_b_2.csv    Finished in 0.05401325225830078
Preprocessing test_applprev_1_*.csv    Finished in 0.15424394607543945
Preprocessing test_applprev_2.csv    Finished in 0.03508329391479492
Preprocessing test_debitcard_1.csv    Finished in 0.025413990020751953
Preprocessing test_deposit_1.csv    Finished in 0.028754234313964844
Preprocessing test_other_1.csv    Finished in 0.02477550506591797
Preprocessing test_person_1.csv    Finished in 0.1134943962097168
Preprocessing test_person_2.csv    Finished in 0.08369874954223633
Preprocessing test_static_0_*.csv    Finished in 0.2550642490386963
Preprocessing test_static_cb_0.csv    Finished in 0.08885526657104492
Preprocessing test_tax_registry_a_1.csv    Finished in 0.023255348205566406


In [14]:
submission_base.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Columns: 1098 entries, case_id to tax_registry_c_1.last_num_group1
dtypes: category(633), datetime64[ns](60), float64(386), int64(19)
memory usage: 140.6 KB


In [15]:
submission_base.isnull().mean().mean()

0.5724043715846995

# 3. Data Preprocessing

## 3.1 Handling Null Features

> Currently not handling the null values as null values are accepted by the model.

Can drop, impute or user a **ML model**

In [16]:
df = drop_columns_with_high_missing_values(base)
df.isnull().mean().mean()

0.07485520473137748

In [17]:
df = impute_missing_values(df)
df.isnull().mean().mean()

0.0

In [18]:
cat_cols = list(df.select_dtypes("category").columns)
len(cat_cols)

161

In [19]:
import pandas as pd
import numpy as np

# Assuming df is your pandas DataFrame and cat_cols is the list of categorical columns

# Step 1: Calculate correlation matrix


# Step 2: Mask out correlations involving categorical features
numerical_cols = [col for col in df.columns if col not in cat_cols and col not in ['WEEK_NUM', 'target', 'case_id']]
corr_matrix = df[numerical_cols].corr()
corr_matrix = corr_matrix.loc[numerical_cols, numerical_cols]

# Step 3: Identify highly correlated features
highly_correlated = (corr_matrix.abs() > 0.8) & (corr_matrix.abs() < 1.0)
correlated_features = set()
for col in highly_correlated.columns:
    correlated_features.update(set(highly_correlated.index[highly_correlated[col]]))
correlated_features = list(correlated_features)

# Step 4: Remove highly correlated features from the DataFrame
df_filtered = df.drop(columns=correlated_features)

df_filtered.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1526659 entries, 0 to 1526658
Columns: 230 entries, case_id to thirdquarter_1082L
dtypes: category(161), float64(64), int64(5)
memory usage: 1.1 GB


In [20]:
df= df_filtered

In [21]:
currently_selected_columns = list(df.columns)
currently_selected_columns.remove("target")
submission_base = submission_base[currently_selected_columns]
submission_base.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Columns: 229 entries, case_id to thirdquarter_1082L
dtypes: category(161), float64(65), int64(3)
memory usage: 44.6 KB


In [22]:
submission_base.isnull().mean().mean()

0.3707423580786026

In [23]:
df_test = submission_base

## 3.2 Handling Redundent Features
> Not handling it yet. Can redo the things we did during data loading section.

# 4. Feature Selection
> Currently doing nothing. Can use cross validation

# 5 Model Training

## 5.1 Handling Class Imbalance*
> Currently not handling

Can try out SMOTE or balancing by removing

In [24]:
from sklearn.utils import resample

# Separate majority and minority classes
majority_class = df[df['target'] == 0]
minority_class = df[df['target'] == 1]

# Downsample majority class
majority_downsampled = resample(majority_class, 
                                 replace=False,     # Sample without replacement
                                 n_samples=len(minority_class),    # Match minority class size
                                 random_state=42)  # Reproducible results

# Combine minority class with downsampled majority class
df_downsampled = pd.concat([majority_downsampled, minority_class])

# Shuffle the dataset
df_downsampled = df_downsampled.sample(frac=1, random_state=42)

# Now df_downsampled contains a balanced dataset
df_downsampled

Unnamed: 0,case_id,MONTH,WEEK_NUM,target,max_mean_pmts_dpd_303P,max_mean_pmts_overdue_1152A,last_max_pmts_dpd_1073P,last_max_pmts_dpd_303P,last_max_pmts_overdue_1140A,last_max_pmts_overdue_1152A,...,days180_256L,days30_165L,days360_512L,days90_310L,firstquarter_103L,fourthquarter_440L,maritalst_385M,numberofqueries_373L,secondquarter_766L,thirdquarter_1082L
223173,607706,201901,1,1,99.456545,7503.658123,9.56664,48.231823,1414.021089,3639.015467,...,2.388656,0.517708,4.777066,1.21142,2.86059,2.851214,3439d993,4.777066,2.688482,2.918342
829869,1423898,201906,25,0,0.000000,0.000000,0.00000,3.000000,0.000000,11998.400000,...,2.000000,0.000000,4.000000,0.00000,4.00000,1.000000,3439d993,4.000000,3.000000,2.000000
616358,1000891,202007,80,1,99.456545,7503.658123,9.56664,48.231823,1414.021089,3639.015467,...,2.388656,0.517708,4.777066,1.21142,2.86059,2.851214,3439d993,4.777066,2.688482,2.918342
525389,909922,201912,51,0,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,...,1.000000,1.000000,2.000000,1.00000,2.00000,2.000000,3439d993,2.000000,3.000000,4.000000
994980,1589009,201910,41,0,6.727273,288.028571,0.00000,2.000000,0.000000,2016.400000,...,0.000000,0.000000,3.000000,0.00000,1.00000,6.000000,3439d993,3.000000,0.000000,3.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1115157,1709186,201912,51,0,3.384615,1801.338492,0.00000,0.000000,0.000000,0.000000,...,1.000000,0.000000,3.000000,1.00000,2.00000,3.000000,3439d993,3.000000,0.000000,0.000000
194137,239571,202007,81,1,0.083333,225.483337,0.00000,1.000000,0.000000,2714.400100,...,5.000000,2.000000,8.000000,4.00000,3.00000,2.000000,3439d993,8.000000,2.000000,4.000000
961998,1556027,201909,38,1,30.615385,1406.276931,0.00000,110.000000,0.000000,5393.000000,...,4.000000,3.000000,5.000000,3.00000,0.00000,1.000000,3439d993,5.000000,1.000000,3.000000
398477,783010,201908,34,0,786.058824,4901.272471,0.00000,1251.000000,0.000000,5219.026400,...,1.000000,1.000000,1.000000,1.00000,0.00000,0.000000,3439d993,1.000000,1.000000,1.000000


In [25]:
df_train = df_downsampled
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 95988 entries, 223173 to 1294342
Columns: 230 entries, case_id to thirdquarter_1082L
dtypes: category(161), float64(64), int64(5)
memory usage: 77.5 MB


## 5.2 Splitting Data

> Train, val, test

In [26]:
# X_train, X_val_test, y_train, y_val_test = train_test_split(data.copy().drop(["target"], axis = 1), data["target"], test_size=0.4, random_state=42)
# X_valid, X_test, y_valid, y_test = train_test_split(X_val_test, y_val_test, test_size=0.5, random_state=42)

## 5.3 Setting Model Parameters*

> Can use cross validation with gridsearch or random search.

## 5.4 Training the Best Model*

> Should maximize the AUC

There can be more models that provide better results. And also wecan consider model stacking and boosting

In [27]:
from sklearn.model_selection import train_test_split

# Split the data into 70% training and 30% testing
X_train, X_test = train_test_split(df_train, test_size=0.3, random_state=42)

# Optionally, you can print the shape of the training and testing sets
print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)
df_train = X_train

Training set shape: (67191, 230)
Testing set shape: (28797, 230)


In [28]:
y = df_train["target"]
weeks = df_train["WEEK_NUM"]
df_train= df_train.drop(columns=["target", "case_id", "WEEK_NUM"])
cv = StratifiedGroupKFold(n_splits=5, shuffle=False)

In [29]:
y.value_counts()

target
0    33658
1    33533
Name: count, dtype: int64

In [30]:
df_train[cat_cols] = df_train[cat_cols].astype(str)
df_test[cat_cols] = df_test[cat_cols].astype(str)

In [31]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Columns: 229 entries, case_id to thirdquarter_1082L
dtypes: float64(65), int64(3), object(161)
memory usage: 18.0+ KB


In [32]:
params = {
    "boosting_type": "gbdt",
    "objective": "binary",
    "metric": "auc",
    "max_depth": 10,  
    "learning_rate": 0.05,
    "n_estimators": 2000,  
    "colsample_bytree": 0.8,
    "colsample_bynode": 0.8,
    "verbose": -1,
    "random_state": 42,
    "reg_alpha": 0.1,
    "reg_lambda": 10,
    "extra_trees":True,
    'num_leaves':64,
}

In [33]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

def test(y_test, y_pred_binary):

    # actual values
    actual = y_test
    # predicted values
    predicted = y_pred_binary

    # confusion matrix
    matrix = confusion_matrix(actual,predicted, labels=[1,0])
    print('Confusion matrix : \n',matrix)

    # outcome values order in sklearn
    tp, fn, fp, tn = confusion_matrix(actual,predicted,labels=[1,0]).reshape(-1)
    print('Outcome values : \n', tp, fn, fp, tn)

    # classification report for precision, recall f1-score and accuracy
    matrix = classification_report(actual,predicted,labels=[1,0])
    print('Classification report : \n',matrix)

In [34]:
%%time
from catboost import CatBoostClassifier, Pool
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, accuracy_score


fitted_models_cat = []
fitted_models_lgb = []

cv_scores_cat = []
cv_scores_lgb = []


for idx_train, idx_valid in cv.split(df_train, y, groups=weeks):#
    X_train, y_train = df_train.iloc[idx_train], y.iloc[idx_train]# 
    X_valid, y_valid = df_train.iloc[idx_valid], y.iloc[idx_valid]
    train_pool = Pool(X_train, y_train,cat_features=cat_cols)
    val_pool = Pool(X_valid, y_valid,cat_features=cat_cols)
    clf = CatBoostClassifier(
    eval_metric='AUC',
    learning_rate=0.03,
    iterations=1200)
    random_seed=3107
    clf.fit(train_pool, eval_set=val_pool,verbose=300)
    fitted_models_cat.append(clf)
    y_pred_valid = clf.predict_proba(X_valid)[:,1]
    auc_score = roc_auc_score(y_valid, y_pred_valid)
    cv_scores_cat.append(auc_score)
    y_pred_binary = [1 if pred > 0.5 else 0 for pred in y_pred_valid]
    test(y_valid, y_pred_binary)
    accuracy = accuracy_score(y_valid, y_pred_binary)
    mae_score_cat = mean_absolute_error(y_valid, y_pred_valid)
    rmse_score_cat = mean_squared_error(y_valid, y_pred_valid, squared=False)
    r2_score_cat = r2_score(y_valid, y_pred_valid)
    print("Accuracy:", accuracy)
    print("mae score: ", mae_score_cat)
    print("rmse score: ", rmse_score_cat)
    print("r2 score: ", r2_score_cat)
    
    
    X_train[cat_cols] = X_train[cat_cols].astype("category")
    X_valid[cat_cols] = X_valid[cat_cols].astype("category")
    
    model = lgb.LGBMClassifier(**params)
    model.fit(
        X_train, y_train,
        eval_set = [(X_valid, y_valid)],
        callbacks = [lgb.log_evaluation(200), lgb.early_stopping(100)] )
    
    fitted_models_lgb.append(model)
    y_pred_valid = model.predict_proba(X_valid)[:,1]
    auc_score = roc_auc_score(y_valid, y_pred_valid)
    cv_scores_lgb.append(auc_score)
    y_pred_binary = [1 if pred > 0.5 else 0 for pred in y_pred_valid]
    test(y_valid, y_pred_binary)
    accuracy = accuracy_score(y_valid, y_pred_binary)
    mae_score_lgb = mean_absolute_error(y_valid, y_pred_valid)
    rmse_score_lgb = mean_squared_error(y_valid, y_pred_valid, squared=False)
    r2_score_lgb = r2_score(y_valid, y_pred_valid)
    print("Accuracy:", accuracy)
    print("mae score: ", mae_score_lgb)
    print("rmse score: ", rmse_score_lgb)
    print("r2 score: ", r2_score_lgb)
    
    
print("CV AUC scores: ", cv_scores_cat)
print("Maximum CV AUC score: ", max(cv_scores_cat))


print("CV AUC scores: ", cv_scores_lgb)
print("Maximum CV AUC score: ", max(cv_scores_lgb))

0:	test: 0.7205481	best: 0.7205481 (0)	total: 1.64s	remaining: 32m 44s
300:	test: 0.8287195	best: 0.8287195 (300)	total: 6m 17s	remaining: 18m 47s
600:	test: 0.8362330	best: 0.8362330 (600)	total: 12m 48s	remaining: 12m 46s
900:	test: 0.8390473	best: 0.8390473 (900)	total: 19m 17s	remaining: 6m 24s
1199:	test: 0.8404054	best: 0.8404054 (1199)	total: 25m 24s	remaining: 0us

bestTest = 0.8404054285
bestIteration = 1199

Confusion matrix : 
 [[5040 1634]
 [1597 5014]]
Outcome values : 
 5040 1634 1597 5014
Classification report : 
               precision    recall  f1-score   support

           1       0.76      0.76      0.76      6674
           0       0.75      0.76      0.76      6611

    accuracy                           0.76     13285
   macro avg       0.76      0.76      0.76     13285
weighted avg       0.76      0.76      0.76     13285

Accuracy: 0.7567933759879564
mae score:  0.33618672129100835
rmse score:  0.40384579378242474
r2 score:  0.3476196284056833
Training until

In [35]:
# %%time
# from catboost import CatBoostClassifier, Pool

# fitted_models_cat = []
# fitted_models_lgb = []

# cv_scores_cat = []
# cv_scores_lgb = []


# for idx_train, idx_valid in cv.split(df_train, y, groups=weeks):#
#     X_train, y_train = df_train.iloc[idx_train], y.iloc[idx_train]# 
#     X_valid, y_valid = df_train.iloc[idx_valid], y.iloc[idx_valid]
#     train_pool = Pool(X_train, y_train,cat_features=cat_cols)
#     val_pool = Pool(X_valid, y_valid,cat_features=cat_cols)
#     clf = CatBoostClassifier(
#     eval_metric='AUC',
#     learning_rate=0.03,
#     iterations=200)
#     random_seed=3107
#     clf.fit(train_pool, eval_set=val_pool,verbose=300)
#     fitted_models_cat.append(clf)
#     y_pred_valid = clf.predict_proba(X_valid)[:,1]
#     auc_score = roc_auc_score(y_valid, y_pred_valid)
#     cv_scores_cat.append(auc_score)
    
    
#     X_train[cat_cols] = X_train[cat_cols].astype("category")
#     X_valid[cat_cols] = X_valid[cat_cols].astype("category")
    
#     model = lgb.LGBMClassifier(**params)
#     model.fit(
#         X_train, y_train,
#         eval_set = [(X_valid, y_valid)],
#         callbacks = [lgb.log_evaluation(200), lgb.early_stopping(100)] )
    
#     fitted_models_lgb.append(model)
#     y_pred_valid = model.predict_proba(X_valid)[:,1]
#     auc_score = roc_auc_score(y_valid, y_pred_valid)
#     cv_scores_lgb.append(auc_score)
    
    
# print("CV AUC scores: ", cv_scores_cat)
# print("Maximum CV AUC score: ", max(cv_scores_cat))


# print("CV AUC scores: ", cv_scores_lgb)
# print("Maximum CV AUC score: ", max(cv_scores_lgb))

In [36]:
class VotingModel(BaseEstimator, RegressorMixin):
    def __init__(self, estimators):
        super().__init__()
        self.estimators = estimators
        
    def fit(self, X, y=None):
        return self
    
    def predict(self, X):
        y_preds = [estimator.predict(X) for estimator in self.estimators]
        return np.mean(y_preds, axis=0)
    
    def predict_proba(self, X):
        
        y_preds = [estimator.predict_proba(X) for estimator in self.estimators[:5]]
        
        X[cat_cols] = X[cat_cols].astype("category")
        y_preds += [estimator.predict_proba(X) for estimator in self.estimators[5:]]
        
        return np.mean(y_preds, axis=0)

model = VotingModel(fitted_models_cat+fitted_models_lgb)

### Validating the performance on the ensemble model

In [37]:
df_train = X_test
df_train[cat_cols] = df_train[cat_cols].astype(str)
df_train[cat_cols] = df_train[cat_cols].astype("category")
y_valid = df_train["target"]
weeks = df_train["WEEK_NUM"]
df_train= df_train.drop(columns=["target", "case_id", "WEEK_NUM"])

In [38]:
y_pred_valid = model.predict_proba(df_train)[:,1]
auc_score = roc_auc_score(y_valid, y_pred_valid)
y_pred_binary = [1 if pred > 0.5 else 0 for pred in y_pred_valid]
test(y_valid, y_pred_binary)
accuracy = accuracy_score(y_valid, y_pred_binary)
mae_score_lgb = mean_absolute_error(y_valid, y_pred_valid)
rmse_score_lgb = mean_squared_error(y_valid, y_pred_valid, squared=False)
r2_score_lgb = r2_score(y_valid, y_pred_valid)
print("AUC score:", auc_score)
print("Accuracy:", accuracy)
print("mae score: ", mae_score_lgb)
print("rmse score: ", rmse_score_lgb)
print("r2 score: ", r2_score_lgb)

Confusion matrix : 
 [[11148  3313]
 [ 3669 10667]]
Outcome values : 
 11148 3313 3669 10667
Classification report : 
               precision    recall  f1-score   support

           1       0.75      0.77      0.76     14461
           0       0.76      0.74      0.75     14336

    accuracy                           0.76     28797
   macro avg       0.76      0.76      0.76     28797
weighted avg       0.76      0.76      0.76     28797

AUC score: 0.8374193180920111
Accuracy: 0.7575441886307601
mae score:  0.3394647049860567
rmse score:  0.4054277987098332
r2 score:  0.3425008115753131


In [39]:
# params = {
#     "boosting_type": "gbdt",
#     "objective": "binary",
#     "metric": "auc",
#     "max_depth": 10,  
#     "learning_rate": 0.05,
#     "n_estimators": 2000,  
#     "colsample_bytree": 0.8,
#     "colsample_bynode": 0.8,
#     "verbose": -1,
#     "random_state": 42,
#     "reg_alpha": 0.1,
#     "reg_lambda": 10,
#     "extra_trees":True,
#     'num_leaves':64,
# }

# fitted_models = []
# cv_scores = []


# for idx_train, idx_valid in cv.split(df_train, y, groups=weeks):#   Because it takes a long time to divide the data set, 
#     X_train, y_train = df_train.iloc[idx_train], y.iloc[idx_train]# each time the data set is divided, two models are trained to each other twice, which saves time.
#     X_valid, y_valid = df_train.iloc[idx_valid], y.iloc[idx_valid]
#     model = lgb.LGBMClassifier(**params)
#     model.fit(
#         X_train, y_train,
#         eval_set = [(X_valid, y_valid)],
#         callbacks = [lgb.log_evaluation(200), lgb.early_stopping(100)] )
#     fitted_models.append(model)
#     y_pred_valid = model.predict_proba(X_valid)[:,1]
#     auc_score = roc_auc_score(y_valid, y_pred_valid)
#     cv_scores.append(auc_score)
    
# print("CV AUC scores: ", cv_scores)
# print("Maximum CV AUC score: ", max(cv_scores))

# 6. Evaluation

## 6.1 Accuracy

## 6.2 AUC

In [40]:

# auc = roc_auc_score(y_test, model.predict_proba(X_test)[:,1])
# auc


## 6.3 GINI Stability

# 7. Submission

In [41]:
df_test = df_test.drop(columns=["WEEK_NUM"])
df_test = df_test.set_index("case_id")

In [42]:
df_test[cat_cols] = df_test[cat_cols].astype("category")

In [43]:
y_pred = pd.Series(model.predict_proba(df_test)[:, 1], index=df_test.index)
df_subm = pd.read_csv(ROOT / "sample_submission.csv")
df_subm = df_subm.set_index("case_id")

df_subm["score"] = y_pred
df_subm.to_csv("submission.csv")
df_subm

Unnamed: 0_level_0,score
case_id,Unnamed: 1_level_1
57543,0.16804
57549,0.487776
57551,0.148497
57552,0.226961
57569,0.760877
57630,0.31798
57631,0.356183
57632,0.118311
57633,0.329008
57634,0.244189
