# Problem Definition — Predicting Problematic Internet Use from Physical Activity

## Overview
Can you predict the level of problematic internet usage exhibited by children and adolescents, based on their physical activity?  
The goal of this competition is to develop a predictive model that analyzes children's physical activity and fitness data to identify early signs of problematic internet use. Identifying these patterns can help trigger interventions to encourage healthier digital habits.

- **Start:** Sep 19, 2024  
- **Close:** Dec 20, 2024

---

## Problem statement
In today’s digital age, problematic internet use among children and adolescents is a growing concern. The competition challenges participants to develop a model that analyzes physical activity and fitness data from children and adolescents to detect early indicators of problematic internet and technology use. Early detection enables prompt interventions aimed at promoting healthier digital habits.

---

## What we are predicting
- For each `id` in the test set you must predict the corresponding `sii` (defined in later markdown cells).  
- Submission format (CSV):  
  ```csv
  id,sii
  000046df,0
  000089ff,1
  00012558,2
  00017ccd,3
  ...
  ```


In [None]:
# Standard library
import os
import re
from concurrent.futures import ThreadPoolExecutor
from functools import partial

# Third-party
import matplotlib.pyplot as plt
import numpy as np  # linear algebra
import optuna
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from imblearn.over_sampling import RandomOverSampler
from scipy.optimize import minimize
from sklearn.base import clone
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from tqdm import tqdm
from xgboost import XGBClassifier, XGBRegressor

ex = px  # second alias for plotly.express, used by a later cell


def kappa_score(y_true, y_pred):
    """Quadratic weighted Cohen's kappa on class labels."""
    return cohen_kappa_score(y_true, y_pred, weights='quadratic')


pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)




# Dataset Definition — Healthy Brain Network (HBN)

## Overview
The **Healthy Brain Network (HBN)** dataset is a clinical sample of approximately 5,000 children and adolescents (ages 5–22) who underwent both clinical and research screenings.  
The HBN study aims to identify biological markers that improve the diagnosis and treatment of mental health and learning disorders from an objective perspective.  

For this competition, two main elements are used:  
1. **Physical activity data** — wrist-worn accelerometer readings, fitness assessments, and questionnaires.  
2. **Internet usage behavior data** — measures of compulsive or problematic internet use.  

The goal is to predict each participant's **Severity Impairment Index (sii)**, a standard measure of problematic internet use.

> **Note:** This is a **Code Competition**. The actual test set is hidden; a sample public version is provided for solution development. The full test set contains ~3,800 participants.

---

## Data Sources
The competition data is provided in **two formats**:  

1. **Parquet files** — accelerometer (actigraphy) time series for each participant.  
2. **CSV files** — remaining tabular measurements from various instruments.

> Many measures are missing for participants; unsupervised techniques may be useful.  
> The target `sii` is present for all test set instances but missing for some training participants.

---

## HBN Instruments (Tabular Data)
Tabular data is in `train.csv` and `test.csv`. Each instrument is described in `data_dictionary.csv`. Key instruments include:

1. **Demographics** — age, sex.  
2. **Internet Use** — hours per day using computer/internet.  
3. **Children's Global Assessment Scale** — clinician-rated general functioning (<18 years).  
4. **Physical Measures** — blood pressure, heart rate, height, weight, waist, and hip measurements.  
5. **FitnessGram Vitals and Treadmill** — cardiovascular fitness (NHANES treadmill protocol).  
6. **FitnessGram Child** — aerobic capacity, muscular strength, endurance, flexibility, body composition.  
7. **Bio-electric Impedance Analysis** — BMI, fat, muscle, and water content.  
8. **Physical Activity Questionnaire** — participation in vigorous activities over last 7 days.  
9. **Sleep Disturbance Scale** — measures sleep disorders.  
10. **Actigraphy** — objective ecological physical activity via wrist-worn accelerometer.  
11. **Parent-Child Internet Addiction Test (PCIAT)** — 20-item scale measuring compulsivity, escapism, and dependency.  

> Target `sii` is derived from `PCIAT-PCIAT_Total`:  
> - `0` = None  
> - `1` = Mild  
> - `2` = Moderate  
> - `3` = Severe  

Each participant has a **unique identifier `id`**.

---

## Actigraphy Files
Participants wore accelerometers for up to **30 days**. Each participant's recording is stored as a Parquet time series (see Data Sources above).  



In [2]:
pciat_cols = ['PCIAT-PCIAT_03',
 'PCIAT-PCIAT_14',
 'PCIAT-PCIAT_11',
 'PCIAT-PCIAT_19',
 'PCIAT-PCIAT_15',
 'PCIAT-PCIAT_08',
 'PCIAT-PCIAT_06',
 'PCIAT-PCIAT_17',
 'PCIAT-PCIAT_07',
 'PCIAT-PCIAT_10',
 'PCIAT-Season',
 'PCIAT-PCIAT_05',
 'PCIAT-PCIAT_02',
 'PCIAT-PCIAT_12',
 'PCIAT-PCIAT_09',
 'PCIAT-PCIAT_18',
 'PCIAT-PCIAT_16',
 'PCIAT-PCIAT_04',
 'sii',
 'PCIAT-PCIAT_Total',
 'PCIAT-PCIAT_13',
 'PCIAT-PCIAT_01',
 'PCIAT-PCIAT_20']


selected = ['Physical-Height',
 'Basic_Demos-Age',
 'PreInt_EduHx-computerinternet_hoursday',
 'Physical-Weight',
 'Physical-Waist_Circumference',
 'FGC-FGC_CU',
 'BIA-BIA_BMI',
 'SDS-SDS_Total_T',
 'PAQ_A-Season',
 'FGC-FGC_PU',
 'BIA-BIA_Frame_num',
 'FGC-FGC_GSD',
 'Physical-Systolic_BP',
 'FGC-FGC_GSND',
 'FGC-FGC_TL',
 'PAQ_C-Season',
 'BIA-BIA_FFMI',
 'FGC-FGC_SRR_Zone',
 'FGC-FGC_SRL_Zone']

# Selected Columns for Modeling

For this project, we have chosen a subset of columns from the HBN dataset that are most relevant for predicting **problematic internet use (`sii`)**.  

### Reason for Selection
- **Physical & fitness measures:** height, weight, waist circumference, BMI, systolic blood pressure, and FitnessGram strength and flexibility scores (curl-ups, push-ups, grip strength, trunk lift, sit-and-reach zones) may correlate with behavioral patterns linked to excessive internet use.  
- **Demographics:** age is included as a key factor influencing both activity levels and internet behavior.  
- **Internet usage behavior:** the number of hours spent using computer/internet per day is a direct behavioral measure.  
- **Sleep and activity questionnaires:** the Sleep Disturbance Scale T-score (`SDS-SDS_Total_T`) and the season in which each Physical Activity Questionnaire was administered (`PAQ_A-Season`, `PAQ_C-Season`) provide additional context for lifestyle and functional patterns.  

These features were selected to balance **predictive relevance** and **data availability**, while minimizing missingness and redundancy.

---

### PCIAT Columns
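The **Parent-Child Internet Addiction Test (PCIAT)** is a 20-item scale measuring compulsivity, escapism, and dependency. Its item scores, season, total score, and the derived `sii` are collected in `pciat_cols`. None of them appear in `selected`, because the target is computed from them.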

# Filtering Training Data

For modeling, we will **only use rows where the target label `sii` is not `NaN`**.  
This ensures that the supervised models are trained on participants for whom the Severity Impairment Index is available.



In [7]:
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
train = train.drop('id', axis=1)
id_test = test.pop("id")  # keep the test ids for the submission file

train1 = train.loc[train["sii"].notna()]  # labeled rows, used for supervised training
train2 = train.loc[train["sii"].isna()]   # unlabeled rows
train1

Unnamed: 0,Basic_Demos-Enroll_Season,Basic_Demos-Age,Basic_Demos-Sex,CGAS-Season,CGAS-CGAS_Score,Physical-Season,Physical-BMI,Physical-Height,Physical-Weight,Physical-Waist_Circumference,Physical-Diastolic_BP,Physical-HeartRate,Physical-Systolic_BP,Fitness_Endurance-Season,Fitness_Endurance-Max_Stage,Fitness_Endurance-Time_Mins,Fitness_Endurance-Time_Sec,FGC-Season,FGC-FGC_CU,FGC-FGC_CU_Zone,FGC-FGC_GSND,FGC-FGC_GSND_Zone,FGC-FGC_GSD,FGC-FGC_GSD_Zone,FGC-FGC_PU,FGC-FGC_PU_Zone,FGC-FGC_SRL,FGC-FGC_SRL_Zone,FGC-FGC_SRR,FGC-FGC_SRR_Zone,FGC-FGC_TL,FGC-FGC_TL_Zone,BIA-Season,BIA-BIA_Activity_Level_num,BIA-BIA_BMC,BIA-BIA_BMI,BIA-BIA_BMR,BIA-BIA_DEE,BIA-BIA_ECW,BIA-BIA_FFM,BIA-BIA_FFMI,BIA-BIA_FMI,BIA-BIA_Fat,BIA-BIA_Frame_num,BIA-BIA_ICW,BIA-BIA_LDM,BIA-BIA_LST,BIA-BIA_SMM,BIA-BIA_TBW,PAQ_A-Season,PAQ_A-PAQ_A_Total,PAQ_C-Season,PAQ_C-PAQ_C_Total,PCIAT-Season,PCIAT-PCIAT_01,PCIAT-PCIAT_02,PCIAT-PCIAT_03,PCIAT-PCIAT_04,PCIAT-PCIAT_05,PCIAT-PCIAT_06,PCIAT-PCIAT_07,PCIAT-PCIAT_08,PCIAT-PCIAT_09,PCIAT-PCIAT_10,PCIAT-PCIAT_11,PCIAT-PCIAT_12,PCIAT-PCIAT_13,PCIAT-PCIAT_14,PCIAT-PCIAT_15,PCIAT-PCIAT_16,PCIAT-PCIAT_17,PCIAT-PCIAT_18,PCIAT-PCIAT_19,PCIAT-PCIAT_20,PCIAT-PCIAT_Total,SDS-Season,SDS-SDS_Total_Raw,SDS-SDS_Total_T,PreInt_EduHx-Season,PreInt_EduHx-computerinternet_hoursday,sii
0,Fall,5,0,Winter,51.0,Fall,16.877316,46.0,50.8,,,,,,,,,Fall,0.0,0.0,,,,,0.0,0.0,7.0,0.0,6.0,0.0,6.0,1.0,Fall,2.0,2.66855,16.8792,932.498,1492.00,8.25598,41.5862,13.8177,3.061430,9.21377,1.0,24.4349,8.89536,38.9177,19.5413,32.6909,,,,,Fall,5.0,4.0,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,4.0,0.0,4.0,4.0,4.0,4.0,4.0,4.0,2.0,4.0,55.0,,,,Fall,3.0,2.0
1,Summer,9,0,,,Fall,14.035590,48.0,46.0,22.0,75.0,70.0,122.0,,,,,Fall,3.0,0.0,,,,,5.0,0.0,11.0,1.0,11.0,1.0,3.0,0.0,Winter,2.0,2.57949,14.0371,936.656,1498.65,6.01993,42.0291,12.8254,1.211720,3.97085,1.0,21.0352,14.97400,39.4497,15.4107,27.0552,,,Fall,2.340,Fall,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Fall,46.0,64.0,Summer,0.0,0.0
2,Summer,10,1,Fall,71.0,Fall,16.648696,56.5,75.6,,65.0,94.0,117.0,Fall,5.0,7.0,33.0,Fall,20.0,1.0,10.2,1.0,14.7,2.0,7.0,1.0,10.0,1.0,10.0,1.0,5.0,0.0,,,,,,,,,,,,,,,,,,,,Summer,2.170,Fall,5.0,2.0,2.0,1.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,2.0,2.0,1.0,1.0,28.0,Fall,38.0,54.0,Summer,2.0,0.0
3,Winter,9,0,Fall,71.0,Summer,18.292347,56.0,81.6,,60.0,97.0,117.0,Summer,6.0,9.0,37.0,Summer,18.0,1.0,,,,,5.0,0.0,7.0,0.0,7.0,0.0,7.0,1.0,Summer,3.0,3.84191,18.2943,1131.430,1923.44,15.59250,62.7757,14.0740,4.220330,18.82430,2.0,30.4041,16.77900,58.9338,26.4798,45.9966,,,Winter,2.451,Summer,4.0,2.0,4.0,0.0,5.0,1.0,0.0,3.0,2.0,2.0,3.0,0.0,3.0,0.0,0.0,3.0,4.0,3.0,4.0,1.0,44.0,Summer,31.0,45.0,Winter,0.0,1.0
5,Spring,13,1,Winter,50.0,Summer,22.279952,59.5,112.2,,60.0,73.0,102.0,,,,,Summer,12.0,0.0,16.5,2.0,17.9,2.0,6.0,0.0,10.0,1.0,11.0,1.0,8.0,0.0,Summer,2.0,4.33036,30.1865,1330.970,1996.45,30.21240,84.0285,16.6877,13.498800,67.97150,2.0,32.9141,20.90200,79.6982,35.3804,63.1265,,,Spring,4.110,Summer,3.0,3.0,3.0,0.0,2.0,1.0,0.0,2.0,2.0,1.0,0.0,1.0,3.0,3.0,2.0,1.0,3.0,1.0,2.0,1.0,34.0,Summer,40.0,56.0,Spring,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3953,Fall,8,0,,,Fall,17.139810,52.5,67.2,25.0,60.0,65.0,112.0,,,,,Fall,0.0,0.0,,,,,0.0,0.0,8.0,1.0,10.0,1.0,12.0,1.0,Fall,3.0,3.20303,17.1417,1035.270,1759.96,11.00630,52.5331,13.4004,3.741300,14.66690,1.0,25.7118,15.81500,49.3301,20.2645,36.7181,,,Fall,3.440,Fall,3.0,3.0,3.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,3.0,0.0,2.0,2.0,1.0,22.0,Fall,41.0,58.0,Fall,2.0,0.0
3954,Summer,7,1,,,Summer,13.927006,48.5,46.6,23.0,65.0,75.0,105.0,,,,,Summer,0.0,0.0,,,,,0.0,0.0,9.0,0.0,8.5,0.0,4.5,0.0,Fall,1.0,2.36680,13.6457,966.287,1256.17,9.98802,45.1853,13.2315,0.414263,1.41470,1.0,20.0572,15.14000,42.8185,18.0937,30.0453,,,,,Summer,1.0,3.0,3.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,5.0,1.0,0.0,5.0,3.0,3.0,3.0,0.0,33.0,Summer,48.0,67.0,Summer,0.0,1.0
3955,Fall,13,0,Spring,60.0,Fall,16.362460,59.5,82.4,,71.0,70.0,104.0,,,,,Fall,16.0,0.0,18.0,1.0,19.9,2.0,10.0,1.0,8.0,1.0,9.0,1.0,12.0,1.0,Fall,3.0,4.52277,16.3642,1206.880,2051.70,19.46110,70.8117,14.0629,2.301380,11.58830,1.0,33.3709,17.97970,66.2889,29.7790,52.8320,,,Winter,3.260,Winter,3.0,3.0,3.0,2.0,3.0,2.0,2.0,2.0,2.0,1.0,2.0,0.0,2.0,0.0,1.0,0.0,2.0,1.0,1.0,0.0,32.0,Winter,35.0,50.0,Fall,1.0,1.0
3957,Fall,11,0,Spring,68.0,Winter,21.441500,60.0,109.8,,79.0,99.0,116.0,,,,,Winter,15.0,1.0,18.5,2.0,15.8,2.0,0.0,0.0,10.0,1.0,10.0,1.0,14.0,1.0,Winter,2.0,4.41305,21.4438,1253.740,2005.99,20.48250,75.8033,14.8043,6.639520,33.99670,2.0,33.9805,21.34030,71.3903,28.7792,54.4630,,,Winter,2.729,Winter,5.0,5.0,3.0,0.0,5.0,1.0,0.0,2.0,0.0,2.0,1.0,0.0,1.0,3.0,0.0,0.0,1.0,1.0,0.0,1.0,31.0,Winter,56.0,77.0,Fall,0.0,1.0


# Relationship Between PCIAT Total Score and `sii`

The **Severity Impairment Index (`sii`)** is derived directly from the **Parent-Child Internet Addiction Test (PCIAT)** total score (`PCIAT-PCIAT_Total`).  

Specifically, `sii` is a categorical transformation of the total PCIAT score:
- `0` → None  
- `1` → Mild  
- `2` → Moderate  
- `3` → Severe  

This means that the target variable we aim to predict (`sii`) is fundamentally based on questionnaire responses measuring compulsive and problematic internet behaviors.
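The binning can be sketched in a few lines. The cut-offs below are an assumption: they are inferred from the `[30, 50, 80]` thresholds used by this notebook's `convert` function and checked against the rows displayed above. `pciat_total_to_sii` is a hypothetical helper name, and the sketch assumes integer totals.

```python
import numpy as np


def pciat_total_to_sii(total):
    """Map integer PCIAT totals to sii classes (assumed cut-offs).

    0-30 -> 0 (None), 31-49 -> 1 (Mild), 50-79 -> 2 (Moderate), 80+ -> 3 (Severe)
    """
    # np.digitize returns the index i such that bins[i-1] <= x < bins[i]
    return np.digitize(total, bins=[31, 50, 80])


# PCIAT totals of the nine rows shown in the training-data preview
totals = np.array([55, 0, 28, 44, 34, 22, 33, 32, 31])
print(pciat_total_to_sii(totals))  # [2 0 0 1 1 0 1 1 1], matching the sii column
```

The result agrees with the `sii` values of those nine rows, which supports the assumed cut-offs without proving them for the full dataset.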


In [4]:
ex.scatter(
    train1,
    x="PCIAT-PCIAT_Total",
    y="sii",
    marginal_x="box",
    marginal_y="histogram",
)

# Feature–Target Split and Categorical Encoding

In this step, we prepare the dataset for modeling by:

1. Defining the **target variable (`y`)**  
2. Selecting the **feature matrix (`X`)**  
3. Encoding categorical variables (in this case, **Season**) into numeric values  

Since most of our selected features are numeric, the only categorical variable we need to encode is **Season**. We map seasons to ordinal numeric values for model compatibility.

---


In [8]:
    
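seasons_mapping = {
    'Winter': 1,
    'Spring': 2,
    'Summer': 3,
    'Fall': 4,
}

# Only the *-Season columns hold strings. Mapping them explicitly avoids the
# pandas FutureWarning that DataFrame.replace raises about silent downcasting.
season_cols = [col for col in selected if col.endswith('Season')]


def encode_seasons(df):
    """Return the selected features with season names mapped to 1-4."""
    out = df[selected].copy()
    out[season_cols] = out[season_cols].apply(lambda s: s.map(seasons_mapping))
    return out


y = train1["PCIAT-PCIAT_Total"]  # regression target: raw PCIAT total
X = encode_seasons(train1)
test_m = encode_seasons(test)
X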





Unnamed: 0,Physical-Height,Basic_Demos-Age,PreInt_EduHx-computerinternet_hoursday,Physical-Weight,Physical-Waist_Circumference,FGC-FGC_CU,BIA-BIA_BMI,SDS-SDS_Total_T,PAQ_A-Season,FGC-FGC_PU,BIA-BIA_Frame_num,FGC-FGC_GSD,Physical-Systolic_BP,FGC-FGC_GSND,FGC-FGC_TL,PAQ_C-Season,BIA-BIA_FFMI,FGC-FGC_SRR_Zone,FGC-FGC_SRL_Zone
0,46.0,5,3.0,50.8,,0.0,16.8792,,,0.0,1.0,,,,6.0,,13.8177,0.0,0.0
1,48.0,9,0.0,46.0,22.0,3.0,14.0371,64.0,,5.0,1.0,,122.0,,3.0,4.0,12.8254,1.0,1.0
2,56.5,10,2.0,75.6,,20.0,,54.0,,7.0,,14.7,117.0,10.2,5.0,3.0,,1.0,1.0
3,56.0,9,0.0,81.6,,18.0,18.2943,45.0,,5.0,2.0,,117.0,,7.0,1.0,14.0740,0.0,0.0
5,59.5,13,0.0,112.2,,12.0,30.1865,56.0,,6.0,2.0,17.9,102.0,16.5,8.0,2.0,16.6877,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3953,52.5,8,2.0,67.2,25.0,0.0,17.1417,58.0,,0.0,1.0,,112.0,,12.0,4.0,13.4004,1.0,1.0
3954,48.5,7,0.0,46.6,23.0,0.0,13.6457,67.0,,0.0,1.0,,105.0,,4.5,,13.2315,0.0,0.0
3955,59.5,13,1.0,82.4,,16.0,16.3642,50.0,,10.0,1.0,19.9,104.0,18.0,12.0,1.0,14.0629,1.0,1.0
3957,60.0,11,0.0,109.8,,15.0,21.4438,77.0,,0.0,2.0,15.8,116.0,18.5,14.0,1.0,14.8043,1.0,1.0


 {'n_estimators': 659, 'max_depth': 3, 'learning_rate': 0.029251614971879027, 'subsample': 0.7643501775628869, 'colsample_bytree': 0.9190514070739568, 'min_child_weight': 1, 'gamma': 5.644694039386449, 'reg_alpha': 1.4610565581059354, 'reg_lambda': 6.667807875722974, 'scale_pos_weight': 6.0887844342387085, 'max_bin': 10}

# Custom Evaluation, Cross-Validation, and Threshold Optimization

In this section, we define three core utilities that control how our model is evaluated and how final **SII classes** are derived from predicted PCIAT total scores.

Even though the model predicts a **continuous PCIAT total score**, the competition evaluation is based on **ordinal severity classes (SII: 0–3)**.  
Therefore, we need a structured way to:

1. Convert continuous scores into ordinal categories  
2. Evaluate predictions using Quadratic Weighted Kappa  
3. Perform Stratified K-Fold cross-validation  
4. Optimize classification thresholds for best final performance  

---

## Score Conversion Function

The first function converts continuous PCIAT scores into **ordinal SII bins (0–3)**.

### What it does:

- Multiplies raw scores by a **scaling factor**
- Applies predefined **threshold cutoffs**
- Assigns each sample to one of four severity classes:
  - 0 → No / minimal risk
  - 1 → Mild
  - 2 → Moderate
  - 3 → Severe

### Why this is necessary:

Our model outputs a **regression prediction** (continuous score).  
However, evaluation is performed on **ordinal categories**.  

So this function acts as a bridge:

$\text{PCIAT}_{\text{pred}} \rightarrow \text{SII}_{\text{pred}}$


The scaling factor stretches (or shrinks) the range of the predictions before binning.
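A small worked example may help show what the scaling factor changes. This is a sketch, not the notebook's own function: `to_bins` is a hypothetical name, the prediction values are made up, and the boundaries follow the description above with cut-offs 30, 50 and 80.

```python
import numpy as np


def to_bins(scores, scaling_fact=1.0, thresholds=(30, 50, 80)):
    """Scale continuous scores, then bin them into classes 0-3."""
    a, b, c = thresholds
    scaled = np.asarray(scores, dtype=float) * scaling_fact
    # Conditions are checked in order; the first one that holds wins:
    # <= a -> 0, (a, b) -> 1, [b, c) -> 2, everything else -> 3
    return np.select([scaled <= a, scaled < b, scaled < c], [0, 1, 2], default=3)


preds = np.array([12.0, 27.5, 41.0, 58.0])   # made-up regression outputs
print(to_bins(preds))                        # [0 0 1 2]
print(to_bins(preds, scaling_fact=1.3))      # [0 1 2 2]
```

With a factor of 1.3, the predictions 27.5 and 41.0 each move up one class. Regression outputs often cluster near the mean of the target, so a factor above 1 is one way to let the higher classes be predicted at all.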

---

## Quadratic Kappa Function

This function computes the **Quadratic Weighted Cohen’s Kappa**, which is the main evaluation metric.

### What it does:

- Optionally converts both true and predicted scores into ordinal bins
- Computes **Quadratic Weighted Kappa (QWK)**

### Why Quadratic Kappa?

Unlike accuracy, QWK:

- Penalizes larger classification mistakes more heavily  
- Respects the **ordinal structure** (0 < 1 < 2 < 3)  
- Corrects for chance agreement, so always predicting the majority class scores 0  

For example:
- Predicting 3 instead of 2 → small penalty  
- Predicting 3 instead of 0 → large penalty  

This makes it ideal for severity prediction problems.
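To make the penalty structure concrete, here is a minimal NumPy sketch of the metric, written from the standard definition $\kappa = 1 - \sum_{ij} w_{ij} O_{ij} / \sum_{ij} w_{ij} E_{ij}$ with $w_{ij} = (i-j)^2 / (N-1)^2$. The function name `qwk` and the toy labels are illustrative assumptions; the notebook itself relies on `cohen_kappa_score(..., weights='quadratic')`.

```python
import numpy as np


def qwk(y_true, y_pred, n_classes=4):
    """Quadratic weighted kappa for integer labels in [0, n_classes)."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    # Observed confusion matrix O
    observed = np.zeros((n_classes, n_classes))
    np.add.at(observed, (y_true, y_pred), 1)
    # Expected matrix E under independence of the two label marginals
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / len(y_true)
    # Quadratic disagreement weights w_ij
    idx = np.arange(n_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()


y_true = [0, 0, 0, 1, 1, 2, 2, 3]
near = [0, 0, 0, 1, 1, 2, 2, 2]      # the single 3 is predicted as 2
far = [0, 0, 0, 1, 1, 2, 2, 0]       # the single 3 is predicted as 0
constant = [0] * 8                   # always predict the majority class

print(round(qwk(y_true, near), 3))      # 0.933
print(round(qwk(y_true, far), 3))       # 0.419
print(round(qwk(y_true, constant), 3))  # 0.0
```

Both `near` and `far` contain exactly one mistake, yet the far miss costs much more, and a constant prediction earns no credit.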

We also wrap this into a scorer object so it can be used directly inside sklearn pipelines and model selection workflows.

---

## Cross-Validation Function (cv)

This function performs **Stratified K-Fold Cross-Validation**.

### What it does:

- Splits the dataset into multiple folds
- Preserves class distribution across folds 
- Trains a fresh cloned model on each fold
- Predicts on validation data
- Computes Quadratic Kappa for each fold
- Optionally stores:
  - Out-of-fold numeric predictions
  - Converted ordinal predictions

### Why this matters:

- Gives a reliable estimate of model generalization
- Prevents overfitting to a single validation split
- Produces out-of-fold predictions for further calibration or ensembling

This ensures that performance estimates reflect real-world behavior.
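The fold loop described above can be sketched on synthetic data. Everything here is an assumption for illustration: the totals are random integers, the "model" simply predicts the training-fold mean, and the folds are stratified on the four binned classes (cut-offs 31, 50, 80 for integer totals) rather than on the raw totals.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
totals = rng.integers(0, 94, size=200)             # synthetic stand-in for PCIAT totals
classes = np.digitize(totals, bins=[31, 50, 80])   # 4 classes used only for stratification

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof = np.full(len(totals), np.nan)                 # out-of-fold predictions
fold_class_counts = []

# StratifiedKFold ignores the feature values when splitting, so a dummy X is enough
for train_idx, val_idx in skf.split(np.zeros((len(totals), 1)), classes):
    # Stand-in "model": predict the mean total of the training fold
    oof[val_idx] = totals[train_idx].mean()
    fold_class_counts.append(np.bincount(classes[val_idx], minlength=4))

counts = np.array(fold_class_counts)
print(counts)                # one row per fold, one column per class
print(np.isnan(oof).any())   # False: every sample was predicted exactly once
```

Each row of `counts` is nearly identical, which is what "preserves class distribution across folds" means in practice. Stratifying on binned classes instead of raw totals is a design choice of this sketch.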

---

# Threshold Optimization with Optuna

Even after training a strong regression model, the final performance depends heavily on **where we place the decision thresholds** that map PCIAT totals to SII classes.

Instead of manually choosing thresholds, we will:

### Use Optuna to optimize:
- Scaling factor  
- Threshold cut points  

### Objective:
Maximize **Quadratic Weighted Kappa** on out-of-fold predictions.

### Why optimize thresholds?

Because small shifts in cut boundaries can significantly impact:
- Class distribution
- Misclassification severity
- Final competition score

In other words:

$$
\text{Better thresholds} \Rightarrow \text{Better SII classification} \Rightarrow \text{Higher QWK}
$$
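The idea can be illustrated without Optuna. The sketch below swaps in a plain random search over the three cut points, on synthetic "out-of-fold" predictions; the helper names (`to_bins`, `qwk`, `score`), the data, and the search ranges are all assumptions made for the example. The cut points are sampled as `a`, `a + step`, `b + step`, which keeps them ordered by construction.

```python
import numpy as np


def to_bins(scores, thresholds):
    a, b, c = thresholds
    s = np.asarray(scores, dtype=float)
    return np.select([s <= a, s < b, s < c], [0, 1, 2], default=3)


def qwk(y_true, y_pred, n_classes=4):
    observed = np.zeros((n_classes, n_classes))
    np.add.at(observed, (y_true, y_pred), 1)
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / len(y_true)
    idx = np.arange(n_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()


rng = np.random.default_rng(0)
true_total = rng.integers(0, 94, size=500)           # synthetic PCIAT totals
mean = true_total.mean()
# Synthetic predictions: shrunk toward the mean, plus noise
oof_pred = mean + 0.5 * (true_total - mean) + rng.normal(0, 8, size=500)
y_true = to_bins(true_total, (30, 50, 80))


def score(thresholds):
    return qwk(y_true, to_bins(oof_pred, thresholds))


baseline = score((30, 50, 80))
best_t, best_s = (30, 50, 80), baseline
for _ in range(500):
    a = rng.uniform(0, 60)
    b = a + rng.uniform(0, 40)    # step parametrization keeps a <= b <= c
    c = b + rng.uniform(0, 40)
    s = score((a, b, c))
    if s > best_s:
        best_t, best_s = (a, b, c), s

print("baseline QWK:", round(baseline, 3))
print("tuned QWK:   ", round(best_s, 3), "at", np.round(best_t, 1))
```

Because the thresholds are tuned and scored on the same predictions, the tuned value is an optimistic estimate; a held-out split would give a fairer number.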

---

# Overall Pipeline Logic

1. Train regression model → Predict PCIAT total  
2. Convert predictions → SII classes  
3. Evaluate using Quadratic Kappa  
4. Optimize thresholds with Optuna  
5. Generate final SII predictions for submission  

This structured approach allows us to:
- Leverage regression stability
- Respect ordinal severity structure
- Directly optimize for the competition metric



In [9]:
def convert(scores, scaling_fact=1.3, thresholds=(30, 50, 80)):
    """Scale continuous PCIAT scores and bin them into sii classes 0-3."""
    a, b, c = thresholds
    scores = np.asarray(scores, dtype=float) * scaling_fact
    bins = np.zeros_like(scores)
    bins[scores <= a] = 0
    bins[(scores > a) & (scores < b)] = 1
    bins[(scores >= b) & (scores < c)] = 2
    bins[scores >= c] = 3
    return bins


def quadratic_kappa(y_true, y_pred, scaling_fact=1.3, convert_=True, thresholds=(30, 50, 80)):
    """Quadratic weighted kappa between true and predicted PCIAT scores."""
    if convert_:
        # True totals are binned with the fixed cut-offs and no scaling;
        # only the predictions are scaled and binned with the tunable thresholds.
        y_true = convert(y_true, scaling_fact=1, thresholds=(30, 50, 80))
        y_pred = convert(y_pred, scaling_fact=scaling_fact, thresholds=thresholds)

    return cohen_kappa_score(y_true, y_pred, weights='quadratic')


kappa_scorer = make_scorer(quadratic_kappa, greater_is_better=True)


def cv(model, X, y, n_splits=5, random_state=42, scaling_fact=2, oof=False, thresholds=(30, 50, 80)):
    """Stratified K-fold CV; returns per-fold QWK and optionally OOF predictions."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)

    scores = []
    oof_numeric = np.zeros(len(y))
    oof_converted = np.zeros(len(y))
    # y holds raw PCIAT totals, so the split is stratified on individual score
    # values; sklearn warns when a value occurs fewer than n_splits times.
    for train_idx, val_idx in skf.split(X, y):
        # Split the data into training and validation sets
        X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
        y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

        # Clone the model to ensure each fold has an independent model
        cloned_model = clone(model)
        cloned_model.fit(X_train, y_train)

        # Predict on the validation set and score the fold
        y_pred = cloned_model.predict(X_val)
        score = quadratic_kappa(y_val, y_pred, scaling_fact=scaling_fact, thresholds=thresholds)
        scores.append(score)
        oof_numeric[val_idx] = y_pred
        # Use the same thresholds for the stored classes as for the fold score
        oof_converted[val_idx] = convert(y_pred, scaling_fact=scaling_fact, thresholds=thresholds)

    if not oof:
        return np.array(scores)
    return np.array(scores), oof_numeric, oof_converted



In [None]:
def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 20, 1200),  # Number of trees in the ensemble
        'max_depth': trial.suggest_int('max_depth', 1, 20),  # Maximum depth of a tree
        'learning_rate': trial.suggest_float('learning_rate', 1e-4, 0.3),  # Step size shrinkage
        'subsample': trial.suggest_float('subsample', 0.2, 1.0),  # Fraction of samples used for fitting the trees
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.2, 1.0),  # Fraction of features used for fitting each tree
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 30),  # Minimum sum of instance weight (hessian) needed in a child
        'gamma': trial.suggest_float('gamma', 1e-8, 10.0),  # Minimum loss reduction required to make a further partition
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 10.0),  # L1 regularization term on weights
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 10.0),  # L2 regularization term on weights
        'scale_pos_weight': trial.suggest_float('scale_pos_weight', 1, 10),  # Controls the balance of positive and negative weights
        'max_bin': trial.suggest_int('max_bin', 10, 255),  # Maximum number of discrete bins to bucket continuous features
        'n_jobs': -1,  # Use all available cores
        'random_state': 42,  # Ensures consistency
        'objective': 'reg:squarederror',  # Learning task and objective function
    }
    
    a = trial.suggest_float("a", 0, 60)
    

    b = trial.suggest_float("b",  a, 100)  

    c = trial.suggest_float("c", b,100) 
    
    
    model = XGBRegressor(**params ,device="cuda")
    

    score = cv(model , X , y 
               , n_splits = 20
               , scaling_fact=1 
               , oof = False 
               , thresholds =  [a ,b, c])
    
    return score.mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=200 )
print(study.best_params)

[I 2026-02-16 20:52:27,222] A new study created in memory with name: no-name-2b3749a5-9fc2-42e3-a6ad-fb5a1cc5991f

The least populated class in y has only 1 members, which is less than n_splits=20.


Potential solutions:
- Use a data structure that matches the device ordinal in the booster.
- Set the device for booster before call to inplace_predict.



[I 2026-02-16 20:52:31,066] Trial 0 finished with value: 0.0 and parameters: {'n_estimators': 101, 'max_depth': 3, 'learning_rate': 0.0014331578023502264, 'subsample': 0.6869234408364867, 'colsample_bytree': 0.29490380079606543, 'min_child_weight': 9, 'gamma': 2.553312856091873, 'reg_alpha': 3.313370324176538, 'reg_lambda': 5.792877762545293, 'scale_pos_weight': 8.746168848164661, 'max_bin': 177, 'a': 49.32720264918349, 'b': 88.73870587748492, 'c': 95.76140919472765}. Best is trial 0 with value: 0.0.

[I 2026-02-16 20:52:42

KeyboardInterrupt: 

In [12]:
best_trial = study.best_trial
best_params = best_trial.params


best_thresholds = [
    best_params.pop("a"),
    best_params.pop("b"),
    best_params.pop("c"),
]

model_params = best_params

In [14]:

model = XGBRegressor(**model_params 
                     ,device="cuda" 
                     , random_state = 42)

scores , oof_num , oof_con = cv(model, X, y, n_splits=  5, scaling_fact=1 , oof = True , thresholds = best_thresholds)
scores.mean()


The least populated class in y has only 1 members, which is less than n_splits=5.



np.float64(0.37030218004876125)

{'a': 15.752152654678644,
 'step_b': 26.421950777863188,
 'step_c': 44.16970771424839}

In [16]:
fig = px.scatter(
    train1,
    x=oof_num,
    y="PCIAT-PCIAT_Total",
    color=oof_con,
    marginal_x="histogram",
    marginal_y="histogram",
    hover_data="sii",
)

# Dashed horizontal lines at the PCIAT totals 30, 50 and 80
for cutoff in (30, 50, 80):
    fig.add_shape(
        type="line",
        x0=min(oof_num), x1=max(oof_num),
        y0=cutoff, y1=cutoff,
        line=dict(color="Blue", width=2, dash="dash"),
    )

# Show the plot
fig.show()

In [17]:
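model.fit(X, y)

# Thresholds are hard-coded here
final_thresholds = [28.569125484672444, 36.17314930012063, 80.42835371337772]
pred = convert(model.predict(test_m), scaling_fact=1, thresholds=final_thresholds)

# convert() returns floats; the submission format expects integer classes
sub = pd.DataFrame({"id": id_test, "sii": pred.astype(int)})
sub.to_csv("submission.csv", index=False)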

# 🙏 Thank You

Thank you for taking the time to explore this notebook.

This project focused on predicting **Problematic Internet Use severity** using behavioral, physical, and questionnaire-based features.  
We built a regression-based pipeline, converted predictions into ordinal severity classes, and optimized thresholds directly for **Quadratic Weighted Kappa**, ensuring alignment with the evaluation metric.

Through careful cross-validation, threshold tuning, and out-of-fold analysis, we aimed to create a robust and generalizable modeling framework.

If you found this notebook helpful or insightful, feel free to upvote ⭐ and share feedback — it is always appreciated!

Thank you for reading 