### Description

In today’s digital age, problematic internet use among children and adolescents is a growing concern. Better understanding this issue is crucial for addressing mental health problems such as depression and anxiety.

Current methods for measuring problematic internet use in children and adolescents are often complex and require professional assessments. This creates access, cultural, and linguistic barriers for many families. Due to these limitations, problematic internet use is often not measured directly, but is instead associated with issues such as depression and anxiety in youth.

Conversely, physical & fitness measures are extremely accessible and widely available with minimal intervention or clinical expertise. Changes in physical habits, such as poorer posture, irregular diet, and reduced physical activity, are common in excessive technology users. We propose using these easily obtainable physical fitness indicators as proxies for identifying problematic internet use, especially in contexts lacking clinical expertise or suitable assessment tools.

This competition challenges you to develop a predictive model capable of analyzing children's physical activity data to detect early indicators of problematic internet and technology use. This will enable prompt interventions aimed at promoting healthier digital habits.

Your work will contribute to a healthier, happier future where children are better equipped to navigate the digital landscape responsibly.

Acknowledgments
The data used for this competition was provided by the Healthy Brain Network, a landmark mental health study based in New York City that will help children around the world. In the Healthy Brain Network, families, community leaders, and supporters are partnering with the Child Mind Institute to unlock the secrets of the developing brain. In addition to the generous support provided by the Kaggle team, financial support has been provided by the California Department of Health Care Services (DHCS) as part of the Children and Youth Behavioral Health Initiative (CYBHI).

Sponsorship
Dell Technologies and NVIDIA are thrilled to partner with the Child Mind Institute, recognizing the profound impact this collaboration will have on advancing mental health support for children and adolescents. This partnership aligns perfectly with our commitment to leveraging technology for social good and fostering a healthier, more inclusive future.

Dell Technologies offers AI solutions from desktop to datacenter to cloud. NVIDIA pioneered accelerated computing to tackle challenges no one else can solve. Our work in AI and digital twins is transforming the world's largest industries and profoundly impacting society.

Evaluation
Submissions are scored based on the quadratic weighted kappa, which measures the agreement between two outcomes. This metric typically varies from 0 (random agreement) to 1 (complete agreement). In the event that there is less agreement than expected by chance, the metric may go below 0.

To compute the quadratic weighted kappa, we construct three matrices, $O$, $W$, and $E$, with $N$ the number of distinct labels.

The matrix $O$ is an $N \times N$ histogram matrix such that $O_{i,j}$ corresponds to the number of instances that have an actual value $i$ and a predicted value $j$.

The matrix $W$ is an $N \times N$ matrix of weights, calculated based on the squared difference between actual and predicted values:

$$W_{i,j} = \frac{(i-j)^2}{(N-1)^2}$$

The matrix $E$ is an $N \times N$ histogram matrix of expected outcomes, calculated assuming that there is no correlation between values. This is calculated as the outer product between the actual histogram vector of outcomes and the predicted histogram vector, normalized such that $E$ and $O$ have the same sum.

From these three matrices, the quadratic weighted kappa is calculated as:

$$\kappa = 1 - \frac{\sum_{i,j} W_{i,j} O_{i,j}}{\sum_{i,j} W_{i,j} E_{i,j}}$$
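The construction of the three matrices can be sketched directly in NumPy. This is a minimal illustration under the assumption that labels are the integers 0 to N−1; the function name `quadratic_weighted_kappa` and its `n_labels` parameter are illustrative and not part of the competition's scoring code.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_labels):
    """Quadratic weighted kappa built from the O, W and E matrices."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)

    # O: observed N x N histogram of (actual, predicted) pairs
    O = np.zeros((n_labels, n_labels))
    np.add.at(O, (y_true, y_pred), 1)

    # W: squared label distance, scaled by (N - 1)^2
    idx = np.arange(n_labels)
    W = (idx[:, None] - idx[None, :]) ** 2 / (n_labels - 1) ** 2

    # E: outer product of the actual and predicted histograms,
    # normalized so that E and O have the same sum
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()

    return 1.0 - (W * O).sum() / (W * E).sum()

# Perfect agreement versus fully reversed predictions on four labels
perfect = quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], n_labels=4)
reversed_kappa = quadratic_weighted_kappa([0, 1, 2, 3], [3, 2, 1, 0], n_labels=4)
print(perfect, reversed_kappa)
```

Perfect agreement gives a kappa of 1, while the reversed predictions score about −1, which illustrates how the metric drops below 0 when agreement is worse than chance.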

Submission File
For each id in the test set, you must predict the corresponding sii (described on the Data page). The file should contain a header and have the following format:

id,sii
000046df,0
000089ff,1
00012558,2
00017ccd,3
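A quick sanity check of a submission frame before writing it to disk might look like the sketch below. The helper `submission_problems` is hypothetical, and the assumption that valid `sii` values are exactly 0, 1, 2, and 3 is inferred from the example rows above rather than stated by the competition.

```python
import pandas as pd

def submission_problems(submission, expected_ids):
    """Return a list of human-readable problems; an empty list means the frame looks fine."""
    problems = []
    if list(submission.columns) != ["id", "sii"]:
        problems.append("columns must be exactly: id, sii")
        return problems
    if submission["id"].duplicated().any() or set(submission["id"]) != set(expected_ids):
        problems.append("ids must match the test set exactly, with no duplicates")
    # Assumption: sii takes the integer values 0..3
    if not submission["sii"].isin([0, 1, 2, 3]).all():
        problems.append("sii must be one of 0, 1, 2, 3")
    return problems

ids = ["000046df", "000089ff", "00012558", "00017ccd"]
good = pd.DataFrame({"id": ids, "sii": [0, 1, 2, 3]})
bad = pd.DataFrame({"id": ids, "sii": [0, 1, 2, 4]})
print(submission_problems(good, ids))
print(submission_problems(bad, ids))
```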

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/child-mind-institute-problematic-internet-use/sample_submission.csv
/kaggle/input/child-mind-institute-problematic-internet-use/data_dictionary.csv
/kaggle/input/child-mind-institute-problematic-internet-use/train.csv
/kaggle/input/child-mind-institute-problematic-internet-use/test.csv
/kaggle/input/child-mind-institute-problematic-internet-use/series_test.parquet/id=00115b9f/part-0.parquet
/kaggle/input/child-mind-institute-problematic-internet-use/series_test.parquet/id=001f3379/part-0.parquet
/kaggle/input/child-mind-institute-problematic-internet-use/series_train.parquet/id=0745c390/part-0.parquet
/kaggle/input/child-mind-institute-problematic-internet-use/series_train.parquet/id=eaab7a96/part-0.parquet
/kaggle/input/child-mind-institute-problematic-internet-use/series_train.parquet/id=8ec2cc63/part-0.parquet
/kaggle/input/child-mind-institute-problematic-internet-use/series_train.parquet/id=b2987a65/part-0.parquet
/kaggle/input/child-mind-institute-problematic-intern

# Importing Libraries
This section imports all the necessary libraries needed for data processing, model training, and evaluation.

In [2]:
# Importing necessary libraries
import numpy as np
import pandas as pd
import os
from sklearn.base import clone
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from scipy.optimize import minimize
from concurrent.futures import ProcessPoolExecutor
from tqdm import tqdm
from sklearn.ensemble import StackingRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from sklearn.ensemble import VotingRegressor
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
import warnings
from sklearn.linear_model import Ridge
from imblearn.over_sampling import SMOTE
from collections import Counter

import tensorflow as tf
from tensorflow.keras import layers, models


warnings.filterwarnings('ignore')
pd.options.display.max_columns = None

# Loading Datasets
In this section, we define a decorator that returns a default value when the wrapped function raises an error. We also load the training, testing, and sample submission datasets from the specified paths.

In [3]:
SEED = 0
n_splits = 5
fails = []  # To track failed operations

# Error-handling decorator: on failure, print the error, record it in `fails`,
# and return a default value instead of raising
def return_default_value_if_fails(default_value):
    def decorator(func):
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as e:
                fails.append((func, (args, kwargs), e))  # Track failed operations
                print(f"Error in {func.__name__}: {e}")
                return default_value
        return wrapper
    return decorator

# Load datasets
train_csv = pd.read_csv('/kaggle/input/child-mind-institute-problematic-internet-use/train.csv')
test_csv = pd.read_csv('/kaggle/input/child-mind-institute-problematic-internet-use/test.csv')
sample_submission = pd.read_csv('/kaggle/input/child-mind-institute-problematic-internet-use/sample_submission.csv')
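To illustrate how a default-value decorator of this kind behaves, here is a small self-contained sketch. The `safe_ratio` function is a hypothetical example, and this version records failures in a local `fails` list, which is one possible design rather than necessarily what the cell above does.

```python
fails = []  # collects (function name, exception) pairs

def return_default_value_if_fails(default_value):
    def decorator(func):
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as e:
                fails.append((func.__name__, e))  # remember what went wrong
                return default_value
        return wrapper
    return decorator

@return_default_value_if_fails(default_value=-1.0)
def safe_ratio(a, b):
    # Hypothetical helper: raises ZeroDivisionError when b == 0
    return a / b

ok = safe_ratio(6, 3)      # normal call, the result passes through
failed = safe_ratio(1, 0)  # the exception is swallowed, default is returned
print(ok, failed, len(fails))
```

Because the exception is swallowed, callers have to inspect `fails` (or the returned default) to notice that something went wrong.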

# Feature Engineering
This section processes the time series data and merges it with the training and testing datasets to create a comprehensive feature set.

In [4]:
# Define the function to load parquet and handle errors manually
def process_parquet_file(filename, directory):
    try:
        df = pd.read_parquet(os.path.join(directory, filename, 'part-0.parquet'))
        df.drop('step', axis=1, inplace=True)
        return df.describe().values.reshape(-1), filename.split('=')[1]
    except Exception as e:
        # If there's an error, return empty results
        return pd.DataFrame().values, None

def process_file(file, directory):
    return process_parquet_file(file, directory)

def load_time_series_data(directory):
    files = os.listdir(directory)
    results = []
    
    with ProcessPoolExecutor() as executor:
        for result in tqdm(executor.map(process_file, files, [directory] * len(files)), total=len(files)):
            results.append(result)

    stats, ids = zip(*[r for r in results if r[1] is not None])
    df = pd.DataFrame(stats, columns=[f"stat_{i}" for i in range(len(stats[0]))])
    df['id'] = ids
    return df

# Loading the time series data
train_pq = load_time_series_data("/kaggle/input/child-mind-institute-problematic-internet-use/series_train.parquet")
test_pq = load_time_series_data("/kaggle/input/child-mind-institute-problematic-internet-use/series_test.parquet")

# Merging with the main dataset
train_sii = train_csv.merge(train_pq, how="left", on="id").drop('id', axis=1)
test_com = test_csv.merge(test_pq, how="left", on="id").drop('id', axis=1)

# Drop rows where 'sii' column has NaN values
train_sii = train_sii.dropna(subset=['sii'])

100%|██████████| 996/996 [01:13<00:00, 13.55it/s]
100%|██████████| 2/2 [00:00<00:00,  7.19it/s]
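The `stat_{i}` features come from flattening `df.describe()` row by row. The sketch below shows the resulting layout on a toy frame; the column names `enmo` and `light` and their values are illustrative assumptions, not taken from the actual parquet files.

```python
import numpy as np
import pandas as pd

# Toy stand-in for one participant's time series
toy = pd.DataFrame({
    "step": range(5),
    "enmo": [0.1, 0.2, 0.0, 0.4, 0.3],
    "light": [10.0, 20.0, 30.0, 40.0, 50.0],
})

summary = toy.drop(columns="step").describe()  # 8 statistics x 2 columns
flat = summary.values.reshape(-1)              # row-major flattening

# Readable names in the same order as the flattened vector:
# all columns for 'count', then all columns for 'mean', and so on
names = [f"{stat}_{col}" for stat in summary.index for col in summary.columns]

for name, value in zip(names[:4], flat[:4]):
    print(name, value)
```

Each file therefore contributes eight statistics per sensor column, and position `i` in the flattened vector always refers to the same statistic and column as long as every file has the same columns.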


In [5]:
# Function to perform random oversampling to balance the 'sii' column in the train_sii DataFrame
def apply_oversampling(train_sii, target_column='sii'):
    """
    Applies random oversampling to the target column in the given DataFrame to handle class imbalance.
    
    Args:
    train_sii (pd.DataFrame): The input DataFrame containing features and the target column.
    target_column (str): The name of the target column to balance (default is 'sii').
    
    Returns:
    pd.DataFrame: The DataFrame with balanced target classes after applying oversampling.
    """
    # Separate features (X) and target (y)
    X = train_sii.drop(columns=[target_column])
    y = train_sii[target_column]
    
    # Count the number of occurrences of each class
    class_counts = y.value_counts()
    max_count = class_counts.max()
    
    # Perform oversampling
    resampled_X = X.copy()
    resampled_y = y.copy()
    
    for cls, count in class_counts.items():
        if count < max_count:
            # Calculate how many samples are needed
            n_samples_to_add = max_count - count
            
            # Randomly sample with replacement
            sampled_indices = y[y == cls].sample(n=n_samples_to_add, replace=True, random_state=SEED).index
            
            # Append the new samples to the resampled DataFrame
            resampled_X = pd.concat([resampled_X, X.loc[sampled_indices]], ignore_index=True)
            resampled_y = pd.concat([resampled_y, y.loc[sampled_indices]], ignore_index=True)
    
    # Combine resampled X and y into a new DataFrame
    resampled_df = pd.DataFrame(resampled_X, columns=X.columns)
    resampled_df[target_column] = resampled_y
    
    print(f"Oversampling applied. Class distribution after resampling: {Counter(resampled_y)}")
    
    return resampled_df

# Example usage
train_sii_balanced = apply_oversampling(train_sii, target_column='sii')


# Now drop the 'sii' column from train_sii_balanced to get the feature matrix
train_com = train_sii_balanced.drop(['sii'], axis=1)

Oversampling applied. Class distribution after resampling: Counter({2.0: 1594, 0.0: 1594, 1.0: 1594, 3.0: 1594})


# Feature Processing 
Here, we process the features that will be used for training and testing our models, ensuring they are relevant to the target variable.
We convert categorical variables into numeric codes to make them suitable for model training.

In [6]:
"""# Feature selection
selected_features = [
    'Basic_Demos-Age', 'Basic_Demos-Sex', 'CGAS-CGAS_Score', 'Physical-BMI', 
    'Physical-Height', 'Physical-Weight', 'Fitness_Endurance-Max_Stage', 
    'Fitness_Endurance-Time_Mins', 'FGC-FGC_CU', 'FGC-FGC_PU', 
    'FGC-FGC_SRR', 'BIA-BIA_BMC', 'BIA-BIA_Fat', 'PAQ_A-PAQ_A_Total', 
    'SDS-SDS_Total_Raw', 'PreInt_EduHx-computerinternet_hoursday'
] + [f"stat_{i}" for i in range(train_pq.shape[1] - 1)]

train_com = train_com[selected_features + ['sii']].dropna(subset=['sii'])
train_com = train_com.drop('sii', axis=1)
test_com = test_com[selected_features]
---------------------------------------------------------------

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(X_train):
    # Your training logic

----------------------------------------------------------------------

import category_encoders as ce

target_encoder = ce.TargetEncoder(cols=categorical_cols)
X_train[categorical_cols] = target_encoder.fit_transform(X_train[categorical_cols], y_train)
X_test[categorical_cols] = target_encoder.transform(X_test[categorical_cols])


for col in categorical_cols:
    freq = X_train[col].value_counts() / len(X_train)
    X_train[col+'_freq'] = X_train[col].map(freq)
    X_test[col+'_freq'] = X_test[col].map(freq)
------------------------------------------------------------------------------------

# Example: Interaction between 'Physical-BMI' and 'Physical-Height'
X_train['BMI_Height'] = X_train['Physical-BMI'] * X_train['Physical-Height']
X_test['BMI_Height'] = X_test['Physical-BMI'] * X_test['Physical-Height']
------------------------------------------------------------------------------

# Example for XGBoost using GridSearchCV
from sklearn.model_selection import GridSearchCV

xgb_params = {
    'max_depth': [6, 7, 8],
    'learning_rate': [0.01, 0.03, 0.05],
    'n_estimators': [300, 500],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9]
}

xgb = XGBRegressor(random_state=SEED)
grid_search = GridSearchCV(estimator=xgb, param_grid=xgb_params, cv=3, scoring='neg_mean_squared_error', verbose=2, n_jobs=-1)
grid_search.fit(X_train, y_train)
best_xgb_params = grid_search.best_params_
print("Best XGBoost Parameters:", best_xgb_params)

# Update XGBoost model with best parameters
xgb_model = XGBRegressor(**best_xgb_params, random_state=SEED)
------------------------------------------------------------------------


from sklearn.ensemble import StackingRegressor

estimators = [
    ('lgb', lgb_model),
    ('xgb', xgb_model),
    ('cat', cat_model)
]

stacking_model = StackingRegressor(
    estimators=estimators,
    final_estimator=XGBRegressor(random_state=SEED),
    cv=5,
    passthrough=True
)

# Train stacking model
stacking_model.fit(X_train, y_train)
stacking_preds = stacking_model.predict(X_test)


"""

In [7]:
# Function to remove columns with NaN values above the threshold
def remove_high_nan_columns(df, nan_threshold=0.5):
    """
    Removes columns that have more NaN values than the provided threshold.
    
    Parameters:
    df (pd.DataFrame): DataFrame to process.
    nan_threshold (float): Proportion threshold of NaN values for removing columns.
    
    Returns:
    pd.DataFrame: DataFrame with columns removed.
    """
    nan_proportion = df.isna().mean()
    columns_to_keep = nan_proportion[nan_proportion <= nan_threshold].index
    return df[columns_to_keep]

train_nan_thrshd = remove_high_nan_columns(train_com)
test_nan_thrshd = remove_high_nan_columns(test_com)
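Because the threshold is applied to train and test independently, the two frames can end up with different columns. The toy example below (made-up data, same function logic as above) shows one such case.

```python
import numpy as np
import pandas as pd

def remove_high_nan_columns(df, nan_threshold=0.5):
    # Keep columns whose NaN share is at or below the threshold
    nan_proportion = df.isna().mean()
    return df[nan_proportion[nan_proportion <= nan_threshold].index]

toy_train = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],  # 75% missing -> dropped
    "b": [1.0, 2.0, np.nan, 4.0],        # 25% missing -> kept
    "c": [np.nan, np.nan, 3.0, 4.0],     # 50% missing -> kept (<= threshold)
})
toy_test = pd.DataFrame({
    "a": [1.0, 2.0],                     # 0% missing -> kept
    "b": [np.nan, np.nan],               # 100% missing -> dropped
    "c": [5.0, np.nan],                  # 50% missing -> kept
})

train_cols = list(remove_high_nan_columns(toy_train).columns)
test_cols = list(remove_high_nan_columns(toy_test).columns)
print(train_cols, test_cols)
```

Here the train frame keeps `b` and `c` while the test frame keeps `a` and `c`, so a later step has to realign the columns before modeling.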

In [8]:
# Function to preprocess data, handle NaNs, encode, and scale columns
def preprocess_data(train_df, test_df):
    """
    Preprocesses the train and test dataframes by:
    1. Filling missing values in numerical columns with the median.
    2. Filling missing values in categorical columns with proportions of existing values.
    3. Ensuring all columns in test are present in train.
    4. Encoding categorical variables.
    5. Scaling non-binary columns.
    
    Parameters:
    train_df (pd.DataFrame): The training dataframe.
    test_df (pd.DataFrame): The testing dataframe (base dataframe).
    
    Returns:
    pd.DataFrame, pd.DataFrame: Preprocessed train and test dataframes.
    """

    # Fill missing values in numerical columns with the median
    def fill_nan_with_median(df):
        return df.apply(lambda col: col.fillna(col.median()) if pd.api.types.is_numeric_dtype(col) else col)
    
    train_df = fill_nan_with_median(train_df)
    test_df = fill_nan_with_median(test_df)

    # Fill missing values in categorical columns with proportional sampling
    def fill_nan_categorical_proportion(df):
        def fill_column_proportionally(col):
            if col.dtype == 'object' and col.isna().sum() > 0:
                # Calculate value proportions
                value_counts = col.value_counts(normalize=True)
                # Fill missing values proportionally
                missing_indices = col[col.isna()].index
                col.loc[missing_indices] = np.random.choice(value_counts.index, size=len(missing_indices), p=value_counts.values)
            return col
        
        return df.apply(fill_column_proportionally)

    train_df = fill_nan_categorical_proportion(train_df)
    test_df = fill_nan_categorical_proportion(test_df)
    
    # Ensure train has the same columns as test (using test_df as the base)
    def columns_match(test_df, train_df):
        missing_columns = set(test_df.columns) - set(train_df.columns)
        for col in missing_columns:
            train_df[col] = np.nan
        return train_df[test_df.columns]
    
    train_df = columns_match(test_df, train_df)
    
    # Encode categorical variables (label encoding)
    def encode_categorical_columns(train_df, test_df):
        label_encoders = {}
        categorical_cols = test_df.select_dtypes(include=['object']).columns
        for col in categorical_cols:
            label_encoders[col] = LabelEncoder()
            train_df[col] = label_encoders[col].fit_transform(train_df[col])
            test_df[col] = label_encoders[col].transform(test_df[col])
        return train_df, test_df
    
    train_df, test_df = encode_categorical_columns(train_df, test_df)
    
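    # Scale non-binary columns (MinMax scaling); fit on train only and reuse
    # the fitted scaler on test so both sets share the same scale
    def scale_non_binary_columns(train_df, test_df):
        scaler = MinMaxScaler()
        non_binary_cols = [col for col in train_df.columns if train_df[col].nunique() > 2 or train_df[col].max() > 1]
        train_df[non_binary_cols] = scaler.fit_transform(train_df[non_binary_cols])
        test_df[non_binary_cols] = scaler.transform(test_df[non_binary_cols])
        return train_df, test_df
    
    train_df, test_df = scale_non_binary_columns(train_df, test_df)
    
    return train_df, test_df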

train, test = preprocess_data(train_nan_thrshd, test_nan_thrshd)

In [9]:
# Function to create interaction features based on identified interactions
def add_interaction_features(df):
    # Interaction between 'Physical-BMI' and 'Physical-Height'
    df['BMI_Height'] = df['Physical-BMI'] * df['Physical-Height']
    
    # Interaction between Physical-BMI and Physical-Weight
    df['BMI_Weight'] = df['Physical-BMI'] * df['Physical-Weight']
    
    # Interaction between Physical-HeartRate and Physical-Systolic_BP
    df['HeartRate_SystolicBP'] = df['Physical-HeartRate'] * df['Physical-Systolic_BP']
    
    # Interaction between Physical-HeartRate and Physical-Diastolic_BP
    df['HeartRate_DiastolicBP'] = df['Physical-HeartRate'] * df['Physical-Diastolic_BP']
    
    # Interaction between Physical-Height and Physical-Weight
    df['Height_Weight'] = df['Physical-Height'] * df['Physical-Weight']
    
    # Interaction between SDS-SDS_Total_Raw and SDS-SDS_Total_T
    df['SDS_Raw_T_Interaction'] = df['SDS-SDS_Total_Raw'] * df['SDS-SDS_Total_T']
    
    return df

# Applying the interaction feature function to both train and test datasets
train = add_interaction_features(train)
test = add_interaction_features(test)


# Model Definition, Training, Final Predictions and Submission
We define our models (LightGBM, XGBoost, and CatBoost) along with their optimized hyperparameters.
This section also covers the training process, predictions on the test dataset, and optimization of the rounding thresholds for the final output. Finally, we create the submission file in the required format.
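The threshold optimization step can be sketched on synthetic data as follows. The out-of-fold predictions here are simulated (true label plus Gaussian noise) purely for illustration; in the notebook they would come from cross-validated regressors.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import cohen_kappa_score

def round_predictions(preds, thresholds):
    # Map continuous predictions to classes 0..3 using three cut points
    return np.where(preds < thresholds[0], 0,
                    np.where(preds < thresholds[1], 1,
                             np.where(preds < thresholds[2], 2, 3)))

# Synthetic stand-in for out-of-fold regression outputs (assumption)
rng = np.random.default_rng(0)
y_true = rng.integers(0, 4, size=300)
oof_preds = y_true + rng.normal(0.0, 0.8, size=300)

def neg_kappa(thresholds):
    # minimize() minimizes, so return the negative kappa
    rounded = round_predictions(oof_preds, thresholds)
    return -cohen_kappa_score(y_true, rounded, weights="quadratic")

x0 = [0.5, 1.5, 2.5]
result = minimize(neg_kappa, x0=x0, method="Nelder-Mead")

base_kappa = -neg_kappa(x0)
tuned_kappa = -result.fun
print(f"default thresholds: {base_kappa:.4f}, tuned thresholds: {tuned_kappa:.4f}")
```

Nelder-Mead is used because the objective is piecewise constant in the thresholds, so gradient-based optimizers have nothing to work with. Since the thresholds are tuned on the same predictions they are scored on, the tuned kappa is an optimistic estimate.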

In [10]:
""" Best Score - 438 V7
# prepare training data
X_train = train
y_train = train_sii['sii']
X_test = test

# Define a helper function for GridSearchCV to train and find the best params
def grid_search_model(model, param_grid, X_train, y_train):
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, scoring='neg_mean_squared_error', verbose=2, n_jobs=-1)
    grid_search.fit(X_train, y_train)
    return grid_search.best_params_

# XGBoost parameter grid and tuning
xgb_params = {
    'max_depth': [6, 7, 8],
    'learning_rate': [0.01, 0.03, 0.05],
    'n_estimators': [50, 150],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9]
}

xgb = XGBRegressor(random_state=SEED)
best_xgb_params = grid_search_model(xgb, xgb_params, X_train, y_train)
xgb_model = XGBRegressor(**best_xgb_params, random_state=SEED)

# LightGBM parameter grid and tuning
lgb_params = {
    'max_depth': [7, 9, 11],
    'learning_rate': [0.01, 0.05, 0.1],
    'num_leaves': [50, 100, 150],
    'n_estimators': [50, 150],
    'feature_fraction': [0.7, 0.8, 0.9],
    'bagging_fraction': [0.7, 0.8, 0.9],
    'bagging_freq': [5, 7, 10]
}

lgb = LGBMRegressor(random_state=SEED)
best_lgb_params = grid_search_model(lgb, lgb_params, X_train, y_train)
lgb_model = LGBMRegressor(**best_lgb_params, random_state=SEED)

# CatBoost parameter grid and tuning
cat_params = {
    'depth': [6, 7, 8],
    'learning_rate': [0.01, 0.03, 0.05],
    'iterations': [50, 150],
    'l2_leaf_reg': [3, 5, 7]
}

cat = CatBoostRegressor(random_seed=SEED, verbose=0)
best_cat_params = grid_search_model(cat, cat_params, X_train, y_train)
cat_model = CatBoostRegressor(**best_cat_params, random_seed=SEED, verbose=0)

# Ensemble voting model using the best parameters for all models
voting_model = VotingRegressor([('lgb', lgb_model), ('xgb', xgb_model), ('cat', cat_model)])

# Function to train the ensemble model
def train_ensemble_model(model, X_train, y_train, X_test):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=SEED)
    oof_preds = np.zeros(len(X_train))
    test_preds = np.zeros((len(X_test), n_splits))

    for fold, (train_idx, val_idx) in enumerate(skf.split(X_train, y_train)):
        X_tr, X_val = X_train.iloc[train_idx], X_train.iloc[val_idx]
        y_tr, y_val = y_train.iloc[train_idx], y_train.iloc[val_idx]

        cloned_model = clone(model)
        cloned_model.fit(X_tr, y_tr)

        oof_preds[val_idx] = cloned_model.predict(X_val)
        test_preds[:, fold] = cloned_model.predict(X_test)

    return oof_preds, test_preds.mean(axis=1)


# Train the ensemble model
oof_preds, test_preds = train_ensemble_model(voting_model, X_train, y_train, X_test)

# Custom Quadratic Weighted Kappa function
def quadratic_kappa(y_true, y_pred):
    return cohen_kappa_score(y_true, y_pred, weights='quadratic')

# Custom rounding thresholds
def round_predictions(preds, thresholds):
    return np.where(preds < thresholds[0], 0, 
                    np.where(preds < thresholds[1], 1, 
                             np.where(preds < thresholds[2], 2, 3)))

# Optimize thresholds using the custom Kappa metric
opt_thresholds = minimize(lambda x: -quadratic_kappa(y_train, round_predictions(oof_preds, x)), 
                          x0=[0.5, 1.5, 2.5], method='Nelder-Mead').x

# Final predictions using optimized thresholds
final_preds = round_predictions(test_preds, opt_thresholds)


# Create submission file with error handling
@return_default_value_if_fails(default_value=None)
def create_submission_file():
    submission = pd.DataFrame({'id': sample_submission['id'], 'sii': final_preds})
    submission.to_csv('submission.csv', index=False)
    return "Submission file created successfully."

# Try to create the submission file
submission_message = create_submission_file()
if submission_message:
    print(submission_message)
else:
    print("Submission file creation failed.")

# Log any failed processes
if fails:
    print(f"{len(fails)} operations failed. Logs: {fails}")
    
 ---------------------------------------------------------------------------------
 0.332 Score V8
 
 # Model parameters
best_xgb_params = {'colsample_bytree': 0.7, 'learning_rate': 0.05, 'max_depth': 6, 'n_estimators': 50, 'subsample': 0.7}
best_lgb_params = {'bagging_fraction': 0.8, 'bagging_freq': 5, 'feature_fraction': 0.8, 'learning_rate': 0.05, 'max_depth': 7, 'n_estimators': 50, 'num_leaves': 50}
best_cat_params = {'depth': 7, 'iterations': 150, 'l2_leaf_reg': 3, 'learning_rate': 0.05}

# Train data
X_train = train
y_train = train_sii['sii']
X_test = test

# Initialization of models with the best parameters
xgb_model = XGBRegressor(**best_xgb_params, random_state=SEED)
lgb_model = LGBMRegressor(**best_lgb_params, random_state=SEED)
cat_model = CatBoostRegressor(**best_cat_params, random_seed=SEED, verbose=0)

# Helper function to train the ensemble model
def train_ensemble_model(model, X_train, y_train, X_test):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=SEED)
    oof_preds = np.zeros(len(X_train))
    test_preds = np.zeros((len(X_test), n_splits))

    for fold, (train_idx, val_idx) in enumerate(skf.split(X_train, y_train)):
        X_tr, X_val = X_train.iloc[train_idx], X_train.iloc[val_idx]
        y_tr, y_val = y_train.iloc[train_idx], y_train.iloc[val_idx]

        cloned_model = clone(model)
        cloned_model.fit(X_tr, y_tr)

        oof_preds[val_idx] = cloned_model.predict(X_val)
        test_preds[:, fold] = cloned_model.predict(X_test)

    return oof_preds, test_preds.mean(axis=1)

# Custom Quadratic Weighted Kappa function
def quadratic_kappa(y_true, y_pred):
    return cohen_kappa_score(y_true, y_pred, weights='quadratic')

# Custom rounding thresholds
def round_predictions(preds, thresholds):
    return np.where(preds < thresholds[0], 0, 
                    np.where(preds < thresholds[1], 1, 
                             np.where(preds < thresholds[2], 2, 3)))

# Generate out-of-fold predictions for LightGBM model (you can repeat this for other models)
oof_preds, test_preds_lgb = train_ensemble_model(lgb_model, X_train, y_train, X_test)

# Optimize thresholds using the oof_preds generated from the LGBM model
opt_thresholds = minimize(lambda x: -quadratic_kappa(y_train, round_predictions(oof_preds, x)), 
                          x0=[0.5, 1.5, 2.5], method='Nelder-Mead').x

# Stacking model using the provided best parameters
estimators = [
    ('lgb', lgb_model),
    ('xgb', xgb_model),
    ('cat', cat_model)
]

stacking_model = StackingRegressor(
    estimators=estimators,
    final_estimator=XGBRegressor(random_state=SEED),  
    cv=5,
    passthrough=True
)

# Train the stacking model
stacking_model.fit(X_train, y_train)
stacking_preds = stacking_model.predict(X_test)

# Final predictions using optimized thresholds
final_preds = round_predictions(stacking_preds, opt_thresholds)

# Create submission file with error handling
@return_default_value_if_fails(default_value=None)
def create_submission_file():
    submission = pd.DataFrame({'id': sample_submission['id'], 'sii': final_preds})
    submission.to_csv('submission.csv', index=False)
    return "Submission file created successfully."

# This will create the submission file
submission_message = create_submission_file()
if submission_message:
    print(submission_message)
else:
    print("Submission file creation failed.")

# Log any failed processes
if fails:
    print(f"{len(fails)} operations failed. Logs: {fails}")
else:
    print("All operations executed successfully.")

-----------------------------------------------------------------
0.423 V13

# Prepare training data
X_train = train
y_train = train_sii_balanced['sii']
X_test = test

# Model parameters
best_xgb_params = {'colsample_bytree': 0.7, 'learning_rate': 0.05, 'max_depth': 6, 'n_estimators': 50, 'subsample': 0.7}
best_lgb_params = {'bagging_fraction': 0.8, 'bagging_freq': 5, 'feature_fraction': 0.8, 'learning_rate': 0.05, 'max_depth': 7, 'n_estimators': 50, 'num_leaves': 50}
best_cat_params = {'depth': 7, 'iterations': 150, 'l2_leaf_reg': 3, 'learning_rate': 0.05}

# Initialize models with provided best parameters
xgb_model = XGBRegressor(**best_xgb_params, random_state=SEED)
lgb_model = LGBMRegressor(**best_lgb_params, random_state=SEED)
cat_model = CatBoostRegressor(**best_cat_params, random_seed=SEED, verbose=0)

# Ensemble voting model using the best parameters for all models
voting_model = VotingRegressor([('lgb', lgb_model), ('xgb', xgb_model), ('cat', cat_model)])


# Function to train the ensemble model with cross-validation
def train_ensemble_model(model, X_train, y_train, X_test):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=SEED)
    oof_preds = np.zeros(len(X_train))
    test_preds = np.zeros((len(X_test), n_splits))

    for fold, (train_idx, val_idx) in enumerate(skf.split(X_train, y_train)):
        X_tr, X_val = X_train.iloc[train_idx], X_train.iloc[val_idx]
        y_tr, y_val = y_train.iloc[train_idx], y_train.iloc[val_idx]

        cloned_model = clone(model)
        cloned_model.fit(X_tr, y_tr)

        oof_preds[val_idx] = cloned_model.predict(X_val)
        test_preds[:, fold] = cloned_model.predict(X_test)

    return oof_preds, test_preds.mean(axis=1)

# Train the ensemble model
oof_preds, test_preds = train_ensemble_model(voting_model, X_train, y_train, X_test)

# Custom Quadratic Weighted Kappa function
def quadratic_kappa(y_true, y_pred):
    return cohen_kappa_score(y_true, y_pred, weights='quadratic')

# Custom rounding thresholds
def round_predictions(preds, thresholds):
    return np.where(preds < thresholds[0], 0, 
                    np.where(preds < thresholds[1], 1, 
                             np.where(preds < thresholds[2], 2, 3)))

# Optimize the cut points on the out-of-fold predictions; minimize() searches for a
# minimum, so the kappa is negated in order to maximize agreement
opt_thresholds = minimize(lambda x: -quadratic_kappa(y_train, round_predictions(oof_preds, x)),
                          x0=[0.5, 1.5, 2.5], method='Nelder-Mead').x

# Final predictions using optimized thresholds
final_preds = round_predictions(test_preds, opt_thresholds)


# Create submission file with error handling
@return_default_value_if_fails(default_value=None)
def create_submission_file():
    submission = pd.DataFrame({'id': sample_submission['id'], 'sii': final_preds})
    submission.to_csv('submission.csv', index=False)
    return "Submission file created successfully."

# Try to create the submission file
submission_message = create_submission_file()
if submission_message:
    print(submission_message)
else:
    print("Submission file creation failed.")

# Log any failed processes
if fails:
    print(f"{len(fails)} operations failed. Logs: {fails}")

----------------------------------------------------------------------------------

Evaluation
Submissions are scored based on the quadratic weighted kappa, which measures the agreement between two outcomes. This metric typically varies from 0 (random agreement) to 1 (complete agreement). In the event that there is less agreement than expected by chance, the metric may go below 0.

To compute the quadratic weighted kappa, we construct three matrices, $O$, $W$, and $E$, with $N$ the number of distinct labels.

The matrix $O$ is an $N \times N$ histogram matrix such that $O_{i,j}$ corresponds to the number of instances that have an actual value $i$ and a predicted value $j$.

The matrix $W$ is an $N \times N$ matrix of weights, calculated based on the squared difference between actual and predicted values:

$$W_{i,j} = \frac{(i-j)^2}{(N-1)^2}$$

The matrix $E$ is an $N \times N$ histogram matrix of expected outcomes, calculated assuming that there is no correlation between values. This is calculated as the outer product between the actual histogram vector of outcomes and the predicted histogram vector, normalized such that $E$ and $O$ have the same sum.

From these three matrices, the quadratic weighted kappa is calculated as:

$$\kappa = 1 - \frac{\sum_{i,j} W_{i,j} O_{i,j}}{\sum_{i,j} W_{i,j} E_{i,j}}.$$
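
As a rough illustration of the definition above, the following pure-Python sketch builds the three matrices by hand and combines them into the kappa value. The function name `qwk_from_matrices` and its arguments are assumptions made for this example and are not part of the competition code; labels are assumed to be integers from 0 to N-1. In practice, `cohen_kappa_score(..., weights='quadratic')` from scikit-learn, as used elsewhere in this notebook, is the more convenient choice.

```python
def qwk_from_matrices(actual, predicted, n_labels):
    # Hypothetical helper: quadratic weighted kappa from the O, W and E matrices.
    # Assumes labels are integers in 0..n_labels-1 and that at least two
    # distinct labels occur, otherwise the denominator below can be zero.
    n = n_labels
    total = len(actual)

    # O[i][j]: number of instances with actual value i and predicted value j
    O = [[0.0] * n for _ in range(n)]
    for a, p in zip(actual, predicted):
        O[a][p] += 1.0

    # W[i][j]: squared distance between labels, scaled to the range [0, 1]
    W = [[(i - j) ** 2 / (n - 1) ** 2 for j in range(n)] for i in range(n)]

    # Marginal histograms of the actual and the predicted labels
    hist_actual = [sum(O[i][j] for j in range(n)) for i in range(n)]
    hist_pred = [sum(O[i][j] for i in range(n)) for j in range(n)]

    # E: outer product of the histograms, normalized so that sum(E) == sum(O)
    E = [[hist_actual[i] * hist_pred[j] / total for j in range(n)] for i in range(n)]

    numerator = sum(W[i][j] * O[i][j] for i in range(n) for j in range(n))
    denominator = sum(W[i][j] * E[i][j] for i in range(n) for j in range(n))
    return 1.0 - numerator / denominator


# Perfect agreement puts all counts on the diagonal of O, where W is zero
print(qwk_from_matrices([0, 1, 2, 3], [0, 1, 2, 3], n_labels=4))
```

Because the weights grow with the squared distance between labels, predicting 3 for a true 0 is penalized far more heavily than predicting 1, which is why the rounding thresholds applied to the regression output matter for this metric.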

"""


In [None]:
# Train data
X_train = train
y_train = train_sii_balanced['sii']
X_test = test

# Define a helper function for GridSearchCV to train and find the best params
def grid_search_model(model, param_grid, X_train, y_train):
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, scoring='neg_mean_squared_error', verbose=2, n_jobs=-1)
    grid_search.fit(X_train, y_train)
    return grid_search.best_params_

# XGBoost parameter grid and tuning
xgb_params = {
    'max_depth': [6, 7, 8],
    'learning_rate': [0.01, 0.03, 0.05],
    'n_estimators': [50, 150],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9]
}

xgb = XGBRegressor(random_state=SEED)
best_xgb_params = grid_search_model(xgb, xgb_params, X_train, y_train)
xgb_model = XGBRegressor(**best_xgb_params, random_state=SEED)

# LightGBM parameter grid and tuning
lgb_params = {
    'max_depth': [7, 9, 11],
    'learning_rate': [0.01, 0.05, 0.1],
    'num_leaves': [50, 100, 150],
    'n_estimators': [50, 150],
    'feature_fraction': [0.7, 0.8, 0.9],
    'bagging_fraction': [0.7, 0.8, 0.9],
    'bagging_freq': [5, 7, 10]
}

lgb = LGBMRegressor(random_state=SEED)
best_lgb_params = grid_search_model(lgb, lgb_params, X_train, y_train)
lgb_model = LGBMRegressor(**best_lgb_params, random_state=SEED)

# CatBoost parameter grid and tuning
cat_params = {
    'depth': [6, 7, 8],
    'learning_rate': [0.01, 0.03, 0.05],
    'iterations': [50, 150],
    'l2_leaf_reg': [3, 5, 7]
}

cat = CatBoostRegressor(random_seed=SEED, verbose=0)
best_cat_params = grid_search_model(cat, cat_params, X_train, y_train)
cat_model = CatBoostRegressor(**best_cat_params, random_seed=SEED, verbose=0)

# Ensemble voting model using the best parameters for all models
voting_model = VotingRegressor([('lgb', lgb_model), ('xgb', xgb_model), ('cat', cat_model)])

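# Train the ensemble with stratified K-fold cross-validation.
# Returns out-of-fold predictions for X_train and fold-averaged predictions for X_test.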
def train_ensemble_model(model, X_train, y_train, X_test):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=SEED)
    oof_preds = np.zeros(len(X_train))
    test_preds = np.zeros((len(X_test), n_splits))

    for fold, (train_idx, val_idx) in enumerate(skf.split(X_train, y_train)):
        X_tr, X_val = X_train.iloc[train_idx], X_train.iloc[val_idx]
        y_tr, y_val = y_train.iloc[train_idx], y_train.iloc[val_idx]

        cloned_model = clone(model)
        cloned_model.fit(X_tr, y_tr)

        oof_preds[val_idx] = cloned_model.predict(X_val)
        test_preds[:, fold] = cloned_model.predict(X_test)

    return oof_preds, test_preds.mean(axis=1)


# Train the ensemble model
oof_preds, test_preds = train_ensemble_model(voting_model, X_train, y_train, X_test)

# Custom Quadratic Weighted Kappa function
def quadratic_kappa(y_true, y_pred):
    return cohen_kappa_score(y_true, y_pred, weights='quadratic')

# Map continuous predictions to the ordinal sii classes 0-3 using three cut points
def round_predictions(preds, thresholds):
    return np.where(preds < thresholds[0], 0,
                    np.where(preds < thresholds[1], 1,
                             np.where(preds < thresholds[2], 2, 3)))

# Optimize the cut points on the out-of-fold predictions; minimize() searches for a
# minimum, so the kappa is negated in order to maximize agreement
opt_thresholds = minimize(lambda x: -quadratic_kappa(y_train, round_predictions(oof_preds, x)),
                          x0=[0.5, 1.5, 2.5], method='Nelder-Mead').x

# Final predictions using optimized thresholds
final_preds = round_predictions(test_preds, opt_thresholds)


# Create submission file with error handling
@return_default_value_if_fails(default_value=None)
def create_submission_file():
    submission = pd.DataFrame({'id': sample_submission['id'], 'sii': final_preds})
    submission.to_csv('submission.csv', index=False)
    return "Submission file created successfully."

# Try to create the submission file
submission_message = create_submission_file()
if submission_message:
    print(submission_message)
else:
    print("Submission file creation failed.")

# Log any failed processes
if fails:
    print(f"{len(fails)} operations failed. Logs: {fails}")

Fitting 3 folds for each of 162 candidates, totalling 486 fits
[CV] END colsample_bytree=0.7, learning_rate=0.01, max_depth=6, n_estimators=50, subsample=0.7; total time=   0.4s
[CV] END colsample_bytree=0.7, learning_rate=0.01, max_depth=6, n_estimators=50, subsample=0.8; total time=   0.4s
[CV] END colsample_bytree=0.7, learning_rate=0.01, max_depth=6, n_estimators=50, subsample=0.9; total time=   0.4s
[CV] END colsample_bytree=0.7, learning_rate=0.01, max_depth=6, n_estimators=150, subsample=0.8; total time=   0.9s
[CV] END colsample_bytree=0.7, learning_rate=0.01, max_depth=6, n_estimators=150, subsample=0.9; total time=   1.0s
[CV] END colsample_bytree=0.7, learning_rate=0.01, max_depth=7, n_estimators=50, subsample=0.7; total time=   0.6s
[CV] END colsample_bytree=0.7, learning_rate=0.01, max_depth=7, n_estimators=50, subsample=0.8; total time=   0.6s
[CV] END colsample_bytree=0.7, learning_rate=0.01, max_depth=7, n_estimators=150, subsample=0.7; total time=   1.5s
[CV] END colsa

In [None]:
for dirname, _, filenames in os.walk('/kaggle/working'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
dff = pd.read_csv('/kaggle/working/submission.csv')

In [None]:
dff

In [None]:
best_cat_params

In [None]:
best_lgb_params

In [None]:
best_xgb_params