A repository containing the submission code for the Kaggle competition Child Mind Institute — Problematic Internet Use (Relating Physical Activity to Problematic Internet Use).
Objective: Develop a model that integrates tabular and actigraphy data, manages missing values, and optimizes for the Quadratic Weighted Kappa (QWK) score.
- Impute Missing Values:
- Apply iterative imputation methods (e.g., MICE) for complex relationships.
- Use unsupervised learning (e.g., K-means, PCA, or autoencoders) to impute values based on the underlying structure of the data.
- For missing target values, consider filtering them out or using models that can handle partial supervision.
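A minimal sketch of MICE-style imputation with scikit-learn's `IterativeImputer`; the toy DataFrame and column names are placeholders for the competition's tabular features:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy frame standing in for the competition's tabular features.
train_df = pd.DataFrame({
    "age": [10, 12, np.nan, 9, 14],
    "screen_time": [3.5, np.nan, 2.0, 4.1, np.nan],
    "bmi": [17.2, 19.0, 18.1, np.nan, 21.3],
})

# IterativeImputer models each feature as a function of the others,
# cycling through columns (a MICE-style approach).
imputer = IterativeImputer(max_iter=10, random_state=42)
train_imputed = pd.DataFrame(
    imputer.fit_transform(train_df), columns=train_df.columns
)
print(train_imputed)
```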
- Outlier Detection and Noise Reduction:
- Identify and handle outliers in both tabular and actigraphy data, potentially using isolation forests or statistical thresholds.
- Denoise the actigraphy data (e.g., with rolling-window smoothing or Fourier-based filtering) to retain essential patterns while reducing noise.
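One way to implement both sub-steps, using an `IsolationForest` for tabular outliers and a rolling mean for the actigraphy signal (the data here is synthetic and the window size is an untuned assumption):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Tabular outliers: IsolationForest labels anomalous rows as -1.
X_tab = pd.DataFrame(rng.normal(size=(200, 4)), columns=list("abcd"))
iso = IsolationForest(contamination=0.05, random_state=0)
mask = iso.fit_predict(X_tab) == 1          # keep only inliers
X_clean = X_tab[mask]

# Actigraphy denoising: a centered rolling mean smooths the signal.
signal = pd.Series(np.sin(np.linspace(0, 20, 500)) + rng.normal(0, 0.3, 500))
smoothed = signal.rolling(window=25, center=True, min_periods=1).mean()
print(f"kept {mask.sum()} of {len(X_tab)} rows; smoothed std "
      f"{smoothed.std():.3f} vs raw {signal.std():.3f}")
```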
- Correlation-Based Feature Selection:
- Calculate correlations between features and the target variable to identify and keep only the most relevant features.
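A small illustration of the idea, ranking features by absolute Pearson correlation with a synthetic target; the 0.1 cutoff is an arbitrary placeholder threshold:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(300, 5)), columns=[f"f{i}" for i in range(5)])
df["target"] = 2 * df["f0"] - df["f3"] + rng.normal(0, 0.5, 300)

# Rank features by absolute Pearson correlation with the target and
# keep those above a (tunable) threshold.
corr = df.corr()["target"].drop("target").abs().sort_values(ascending=False)
selected = corr[corr > 0.1].index.tolist()
print(corr, "\nselected:", selected)
```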
- Feature Interaction Creation:
- Generate new features by interacting key features with each other, such as polynomial features or multiplicative combinations.
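For example, scikit-learn's `PolynomialFeatures` generates squares and pairwise products in one call (the column names here are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

X = pd.DataFrame({"screen_time": [3.0, 5.5, 2.0], "sleep_hours": [8.0, 6.5, 9.0]})

# degree=2 adds squares and pairwise products of the input columns.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = pd.DataFrame(poly.fit_transform(X),
                      columns=poly.get_feature_names_out(X.columns))
print(X_poly.columns.tolist())
# ['screen_time', 'sleep_hours', 'screen_time^2',
#  'screen_time sleep_hours', 'sleep_hours^2']
```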
- Advanced Feature Selection:
- Use methods like Lasso regression, Recursive Feature Elimination (RFE), or tree-based feature importance to refine the feature set further.
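A brief RFE sketch with a Lasso base estimator on synthetic data; the alpha and the number of features to keep are illustrative choices, not tuned values:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       random_state=0)

# RFE recursively drops the weakest features according to the
# estimator's coefficients until 8 remain.
selector = RFE(Lasso(alpha=0.1), n_features_to_select=8)
selector.fit(X, y)
print("kept feature indices:", np.where(selector.support_)[0])
```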
- Feature Importance Evaluation:
- Use feature importances from gradient boosting models (e.g., CatBoost, XGBoost) or SHAP values to validate and refine features based on their contribution to the model's predictions.
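A sketch of SHAP-based importance with an XGBoost regressor, assuming the `shap` and `xgboost` packages are installed; averaging absolute SHAP values gives a global ranking:

```python
import numpy as np
import shap
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=300, n_features=10, random_state=0)
model = xgb.XGBRegressor(n_estimators=100, max_depth=3).fit(X, y)

# TreeExplainer gives per-sample, per-feature attributions; averaging
# their absolute values yields a global importance ranking.
shap_values = shap.TreeExplainer(model).shap_values(X)
importance = np.abs(shap_values).mean(axis=0)
print("feature ranking:", np.argsort(importance)[::-1])
```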
- Actigraphy Data Encoding with a Neural Network:
- Build a neural network encoder (e.g., with RNNs or LSTMs) to transform actigraphy sequences into fixed-size representations.
- Optionally, integrate attention mechanisms to help the model focus on important time points within the sequences.
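One possible encoder along these lines, written in PyTorch (an assumption; any deep learning framework works), with a simple additive attention pooling over time steps:

```python
import torch
import torch.nn as nn

class ActigraphyEncoder(nn.Module):
    """LSTM over the raw sequence plus additive attention pooling,
    producing one fixed-size vector per recording."""

    def __init__(self, n_channels: int, hidden: int = 64, out_dim: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(n_channels, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)        # scores each time step
        self.proj = nn.Linear(hidden, out_dim)

    def forward(self, x):                       # x: (batch, time, channels)
        h, _ = self.lstm(x)                     # (batch, time, hidden)
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over time
        pooled = (w * h).sum(dim=1)             # weighted sum over time
        return self.proj(pooled)                # (batch, out_dim)

# Toy batch: 4 recordings, 500 time steps, 3 accelerometer channels.
enc = ActigraphyEncoder(n_channels=3)
z = enc(torch.randn(4, 500, 3))
print(z.shape)  # torch.Size([4, 32])
```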
- Combining Encoded Actigraphy with Tabular Data:
- Concatenate the actigraphy encoding with selected tabular features, creating a unified feature set for the final model.
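The combination step itself is a plain concatenation; the shapes and variable names below are hypothetical:

```python
import numpy as np

# Hypothetical shapes: 4 samples, 32-dim actigraphy encodings,
# 12 selected tabular features.
actigraphy_enc = np.random.randn(4, 32)
tabular_feats = np.random.randn(4, 12)

X_full = np.hstack([tabular_feats, actigraphy_enc])  # (4, 44)
print(X_full.shape)
```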
- Model Selection and Initial Training:
- Use gradient boosting models like CatBoost, XGBoost, or LightGBM to leverage their robustness with tabular data.
- Optionally, employ a voting regressor that combines multiple models (e.g., CatBoost, XGBoost, NN encoder) to improve generalization.
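A sketch of such an ensemble with scikit-learn's `VotingRegressor`, assuming the `catboost`, `xgboost`, and `lightgbm` packages are installed; the hyperparameters are placeholders:

```python
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=20, random_state=0)

# Averaging three boosted models tends to smooth out their individual
# biases; per-model weights could also be tuned.
ensemble = VotingRegressor([
    ("cat", CatBoostRegressor(iterations=300, verbose=0)),
    ("xgb", XGBRegressor(n_estimators=300)),
    ("lgb", LGBMRegressor(n_estimators=300)),
])
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```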
- Custom QWK Loss Function (if feasible):
- Adjust training to optimize directly for QWK by implementing it as a custom loss function if the model framework allows.
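Exact QWK is non-differentiable, so a common workaround (not necessarily what this repository does) is to train a regressor with a smooth loss and then tune the rounding thresholds directly against QWK, as sketched here on toy data:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import cohen_kappa_score

def qwk(y_true, y_pred):
    return cohen_kappa_score(y_true, y_pred, weights="quadratic")

def apply_thresholds(raw, th):
    # Map continuous predictions to ordinal classes 0..3.
    return np.digitize(raw, np.sort(th))

# Toy data: true ordinal labels and raw regression outputs.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 4, 500)
raw = y_true + rng.normal(0, 0.6, 500)

# Tune the three cut points to maximize QWK (minimize its negative).
res = minimize(lambda th: -qwk(y_true, apply_thresholds(raw, th)),
               x0=[0.5, 1.5, 2.5], method="Nelder-Mead")
print("best thresholds:", np.sort(res.x), "QWK:", -res.fun)
```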
- Hyperparameter Optimization:
- Perform hyperparameter tuning for gradient boosting models and neural networks using grid search or Bayesian optimization to maximize the QWK score.
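A compact example using Optuna's TPE-based (Bayesian-style) search over a LightGBM classifier, with cross-validated QWK as the objective; the search space is illustrative:

```python
import optuna
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)
qwk_scorer = make_scorer(cohen_kappa_score, weights="quadratic")

def objective(trial):
    # Illustrative search space; real runs would tune more parameters.
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 600),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 15, 127),
    }
    model = LGBMClassifier(**params, verbose=-1)
    return cross_val_score(model, X, y, cv=3, scoring=qwk_scorer).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params, study.best_value)
```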
- Cross-Validation:
- Use K-Fold cross-validation to train the model on multiple splits of the dataset, improving robustness and reliability in performance evaluation.
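A minimal stratified K-Fold loop reporting per-fold QWK on synthetic data (stratification is an assumption that suits an ordinal, imbalanced target):

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)

# Stratified folds preserve the class balance of the ordinal target,
# which matters for a QWK-evaluated problem.
scores = []
for tr, va in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    model = LGBMClassifier(n_estimators=200, verbose=-1).fit(X[tr], y[tr])
    scores.append(cohen_kappa_score(y[va], model.predict(X[va]),
                                    weights="quadratic"))
print(f"QWK per fold: {np.round(scores, 3)}, mean {np.mean(scores):.3f}")
```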
- Final Model Evaluation:
- After training, evaluate the model on held-out data and calculate the QWK score to validate its effectiveness (the hidden test set's labels are not available locally).
- Generate Submission File:
- Using the final model configuration, generate predictions for the test set.
- Save the predictions in the required submission format, ensuring that all necessary columns are included.
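A sketch of the final step; the competition's submission format uses `id` and `sii` columns, and the raw predictions and thresholds below are stand-ins for a trained model's outputs:

```python
import numpy as np
import pandas as pd

# Stand-ins for the test ids, the model's raw outputs, and the
# QWK-tuned rounding thresholds from the earlier step.
test_ids = [f"id_{i}" for i in range(5)]
raw_preds = np.array([0.2, 1.7, 2.9, 0.9, 3.4])
thresholds = np.array([0.5, 1.5, 2.5])

submission = pd.DataFrame({
    "id": test_ids,
    "sii": np.digitize(raw_preds, thresholds),  # ordinal class 0-3
})
submission.to_csv("submission.csv", index=False)
print(submission)
```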