# Dataset Description

The Healthy Brain Network (HBN) dataset is a clinical sample of about five thousand participants aged 5 to 22 who have undergone both clinical and research screenings. The objective of the HBN study is to find biological markers that will improve the diagnosis and treatment of mental health and learning disorders. Two elements of this study are used in this competition: physical activity data (wrist-worn accelerometer data, fitness assessments, and questionnaires) and internet usage behavior data. The goal of the competition is to predict, from this data, a participant's Severity Impairment Index (sii), a standard measure of problematic internet use.

Note that this is a Code Competition, in which the actual test set is hidden. In this public version, we give some sample data in the correct format to help you author your solutions. The full test set comprises about 3800 instances.

The competition data is compiled into two sources: parquet files containing the accelerometer (actigraphy) series, and csv files containing the remaining tabular data. The majority of measures are missing for most participants. In particular, the target sii is missing for a portion of the participants in the training set, so you may wish to apply unsupervised learning techniques to this data. The sii value is present for all instances in the test set.
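As a rough illustration of what the missing values imply in practice, the sketch below measures the share of missing entries per column and separates rows with a known target from rows without one. The toy table and its values are made up for this example; only the `sii` and `Basic_Demos-Age` column names come from the competition data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training table; the values are invented for illustration.
toy_train = pd.DataFrame(
    {
        "Basic_Demos-Age": [7, 12, 15, 9, 18],
        "sii": [0.0, np.nan, 2.0, np.nan, 1.0],
    },
    index=pd.Index(["a", "b", "c", "d", "e"], name="id"),
)

# Fraction of missing values per column, highest first.
missing_share = toy_train.isna().mean().sort_values(ascending=False)

# Rows with a known target can be used for supervised training;
# the remaining rows are only usable by methods that do not need labels.
labeled = toy_train[toy_train["sii"].notna()]
unlabeled = toy_train[toy_train["sii"].isna()]

print(missing_share)
print(len(labeled), len(unlabeled))
```

Whether the unlabeled rows are dropped, imputed, or used some other way is a modelling choice; the sketch only shows how to separate them.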



In [1]:
import numpy as np 
import pandas as pd
import os

train_data = pd.read_csv('/kaggle/input/handling-sii/impute_train_data.csv', index_col='id')
test_data = pd.read_csv('/kaggle/input/child-mind-institute-problematic-internet-use/test.csv', index_col='id')

In [2]:
train_df = train_data.copy()
test_df = test_data.copy()

train_df.shape, test_df.shape

((3960, 81), (20, 58))

In [3]:
train_cols = train_data.columns.tolist()
test_cols = test_data.columns.tolist()

In [4]:
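# Use only the columns that are also available in the test set as model inputs.
features = test_cols.copy()

# Float columns, plus age, are treated as numeric; all remaining columns as categorical.
num_features = [f for f in features if test_df[f].dtype == 'float' or f == 'Basic_Demos-Age']
cat_features = [f for f in features if f not in num_features]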

len(features), len(num_features), len(cat_features)

(58, 47, 11)
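A dtype-based split like the one above can also be written with `DataFrame.select_dtypes`. The sketch below is a minimal variant on a toy frame: `some_measure` and `some_season` are hypothetical column names. Unlike the cell above, it treats every numeric dtype (integers included) as numeric, so the resulting lists may differ on the real data.

```python
import pandas as pd

# Toy frame; only 'Basic_Demos-Age' is a real column name, the others are hypothetical.
toy = pd.DataFrame(
    {
        "Basic_Demos-Age": [7, 12, 15],              # integer column
        "some_measure": [16.2, None, 21.4],          # float column with a missing value
        "some_season": ["Fall", "Summer", None],     # string column with a missing value
    }
)

# All numeric dtypes (int and float) go to the numeric list.
num_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything else is handled as categorical.
cat_cols = [c for c in toy.columns if c not in num_cols]

print(num_cols, cat_cols)
```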

In [5]:
from tqdm import tqdm
from IPython.display import clear_output
from concurrent.futures import ThreadPoolExecutor

import warnings
warnings.filterwarnings('ignore')
pd.options.display.max_columns = None

In [6]:
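def process_file(filename, dirname):
    # Each participant's series is stored in a directory named 'id=<participant id>'.
    data = pd.read_parquet(os.path.join(dirname, filename, 'part-0.parquet'))
    data.drop('step', axis=1, inplace=True)
    # Flatten the describe() table (count, mean, std, min, quartiles, max per column)
    # into one feature vector and return it together with the participant id.
    return data.describe().values.reshape(-1), filename.split('=')[1]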

def load_time_series(dirname) -> pd.DataFrame:
    ids = os.listdir(dirname)
    
    with ThreadPoolExecutor() as executor:
        results = list(tqdm(executor.map(lambda fname: process_file(fname, dirname), ids), total=len(ids)))
    stats, indexes = zip(*results)
    
    data = pd.DataFrame(stats, columns=[f"stat_{i}" for i in range(len(stats[0]))])
    data['id'] = indexes
    return data

train_ts = load_time_series('/kaggle/input/child-mind-institute-problematic-internet-use/series_train.parquet')
test_ts = load_time_series('/kaggle/input/child-mind-institute-problematic-internet-use/series_test.parquet')

time_series_cols = train_ts.columns.tolist()
time_series_cols.remove('id')

100%|██████████| 996/996 [01:26<00:00, 11.58it/s]
100%|██████████| 2/2 [00:00<00:00,  9.44it/s]
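The `stat_i` columns produced above are anonymous, which makes them hard to interpret later. Because `describe().values.reshape(-1)` flattens the summary table row by row, each position corresponds to one (statistic, column) pair. The sketch below shows that mapping on a toy series; the column names `X` and `Y` are placeholders and not necessarily the real accelerometer columns.

```python
import pandas as pd

# Toy time series with two hypothetical signal columns.
toy_series = pd.DataFrame(
    {"X": [0.0, 1.0, 2.0, 3.0], "Y": [10.0, 10.0, 20.0, 20.0]}
)

summary = toy_series.describe()     # rows: count, mean, std, min, 25%, 50%, 75%, max
flat = summary.values.reshape(-1)   # row-major: all columns for 'count', then 'mean', ...

# Build names in the same row-major order as the flattened values.
names = [f"{col}_{stat}" for stat in summary.index for col in summary.columns]
named_features = pd.Series(flat, index=names)

print(named_features)
```

With two columns, `stat_0` is the count of `X`, `stat_1` the count of `Y`, `stat_2` the mean of `X`, and so on.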


In [7]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.feature_selection import SelectKBest, r_regression

In [8]:
num_transformer = Pipeline(steps=[
    ('KNNimputer', KNNImputer(n_neighbors=2, weights='uniform')),
    ('MinMaxScaler', MinMaxScaler())
])

In [9]:
cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

In [10]:
ts_transformer = Pipeline(steps=[
    ('MinMaxScaler', MinMaxScaler()),
    ('imputer', SimpleImputer(strategy='median'))
])
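Scaling before imputing, as in `ts_transformer`, works because `MinMaxScaler` ignores NaNs when fitting and passes them through when transforming. The median is therefore computed on the scaled values. A minimal sketch on a made-up single-column array:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# One toy feature with a missing entry, e.g. a participant without accelerometer data.
toy = np.array([[0.0], [np.nan], [5.0], [10.0]])

pipe = Pipeline(steps=[
    ("scale", MinMaxScaler()),                     # NaN is ignored in fit, kept in transform
    ("impute", SimpleImputer(strategy="median")),  # median of the scaled, non-missing values
])

out = pipe.fit_transform(toy)
print(out.ravel())
```

The observed values 0, 5, and 10 scale to 0, 0.5, and 1, so the missing entry is filled with the scaled median 0.5.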

In [11]:
preprocessor = ColumnTransformer(transformers=[
    ('numerical', num_transformer, num_features),
    ('categorical', cat_transformer, cat_features),
    ('time_series', ts_transformer, time_series_cols)
])

In [12]:
from lightgbm import LGBMRegressor

# Note: in LightGBM, 'subsample' (bagging_fraction) only takes effect when
# 'subsample_freq' (bagging_freq) is greater than 0, as it is in params3.
params1 = {
    'metric': 'rmse',
    'objective': 'regression',
    'learning_rate': 0.04,
    'max_depth': 12,
    'num_leaves': 59,
    'subsample': 0.70,
    'colsample_bytree': 0.50,
    'min_child_weight': 12,
    'min_child_samples': 14,
    'reg_alpha': 0.23,
    'reg_lambda': 0.36,
}
params2 = {
    'metric': 'rmse',
    'objective': 'regression',
    'learning_rate': 0.05,
    'max_depth': 9,
    'num_leaves': 59,
    'subsample': 0.80,
    'colsample_bytree': 0.50,
    'min_child_weight': 12,
    'min_child_samples': 14,
    'reg_alpha': 0.23,
    'reg_lambda': 0.36,
}
params3 = {
    'metric': 'rmse',
    'objective': 'regression',
    'learning_rate': 0.046,
    'max_depth': 12,
    'num_leaves': 478,
    'min_data_in_leaf': 13,
    'feature_fraction': 0.893,
    'bagging_fraction': 0.784,
    'bagging_freq': 4,
    'lambda_l1': 10,
    'lambda_l2': 0.01,
}

model1 = LGBMRegressor(**params1, n_estimators=300, verbose=-1)
model2 = LGBMRegressor(**params2, n_estimators=300, verbose=-1)
model3 = LGBMRegressor(**params3, n_estimators=300, verbose=-1)

In [13]:
pipeline1 = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('feature_selection', SelectKBest(score_func=r_regression, k=117)),
    ('model', model1)
])
pipeline2 = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('feature_selection', SelectKBest(score_func=r_regression, k=117)),
    ('model', model2)
])
pipeline3 = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('feature_selection', SelectKBest(score_func=r_regression, k=117)),
    ('model', model3)
])

In [14]:
main_train_df = pd.merge(train_df[features], train_ts, how="left", on='id')
main_test_df = pd.merge(test_df, test_ts, how="left", on='id')
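A left merge keeps every participant from the tabular data and leaves the time-series statistics as NaN where no accelerometer series exists. The sketch below uses two toy frames (made-up values, with `id` as an ordinary column in both) to show how the `validate` and `indicator` arguments of `DataFrame.merge` can check the merge and count such participants.

```python
import pandas as pd

# Toy tabular data and toy time-series statistics; participant 'b' has no series.
tab = pd.DataFrame({"id": ["a", "b", "c"], "Basic_Demos-Age": [7, 12, 15]})
ts = pd.DataFrame({"id": ["a", "c"], "stat_0": [1.0, 2.0]})

merged = tab.merge(
    ts,
    how="left",
    on="id",
    validate="one_to_one",  # raises if an id is duplicated on either side
    indicator=True,         # adds a '_merge' column: 'both' or 'left_only'
)

n_without_series = int((merged["_merge"] == "left_only").sum())
print(merged)
print(n_without_series)
```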

In [15]:
X = main_train_df.copy()
y = train_df['sii']
XX = main_test_df.copy()

In [16]:
pipeline1.fit(X, y)

In [17]:
pipeline2.fit(X, y)

In [18]:
pipeline3.fit(X, y)

In [19]:
pred = np.zeros(len(XX))

In [20]:
pred += pipeline1.predict(XX)
pred += pipeline2.predict(XX)
pred += pipeline3.predict(XX)

In [21]:
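# Average the three models, round to the nearest level, and store the labels as integers.
sub = pd.DataFrame({'id': XX['id'], 'sii': np.round(pred / 3).astype(int)})
sub.to_csv('submission.csv', index=False)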
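Because the models are regressors, an averaged prediction can fall outside the valid label range before rounding. The sketch below averages made-up predictions from three models, rounds them, and clips the result. The label range 0 to 3 is an assumption made for this example and should be checked against the competition's data dictionary.

```python
import numpy as np

# Made-up predictions from three models for four participants.
preds = np.array([
    [0.2, 1.4, 2.6, 3.3],
    [0.4, 1.6, 2.9, 3.4],
    [0.3, 1.2, 2.2, 3.5],
])

# Average across models (rows), one value per participant.
mean_pred = preds.mean(axis=0)

# Round to the nearest level and clip to the assumed label range [0, 3].
labels = np.clip(np.round(mean_pred), 0, 3).astype(int)

print(labels)
```

Rounding at fixed 0.5 boundaries is the simplest choice. Other cut points are possible but would need to be tuned on validation data.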