<a href="https://www.kaggle.com/code/daniyalatta/ps-s5-09-lightgbm-a-z-best-for-dataset-understand?scriptVersionId=260162362" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>



# <div style="text-align:center; border-radius:25px 70px; padding:9px; color:#ffffff; margin:0; font-size:120%; font-family:'Quicksand', sans-serif; background:linear-gradient(to right, #141e30, #243b55); overflow:hidden"><b>🌌  LightGBM Regression Pipeline for Kaggle Playground Series S5E9</b>
<p style="text-align:center; border-radius:25px 70px; padding:9px; color:#ffffff; margin:0; font-size:120%; font-family:'Quicksand', sans-serif; background:linear-gradient(to right, #141e30, #243b55); overflow:hidden"> This Python script replicates a machine learning pipeline originally written in R for a regression task using the LightGBM algorithm. The pipeline processes data from the Kaggle Playground Series S5E9 dataset, performs feature engineering, and trains a LightGBM model with 10-fold cross-validation, optimizing for RMSE. The script generates predictions and saves them in the same output formats as the original.</p>
</div>


## Dependencies

In [1]:
# Dependencies
import pandas as pd
import numpy as np
from lightgbm import LGBMRegressor, early_stopping, log_evaluation
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
import gc
import warnings
warnings.filterwarnings("ignore")

## Utility Functions
### Free Memory
Runs Python's garbage collector to release unreferenced objects between folds.

In [2]:
# Free Memory
def free():
    gc.collect()

### Calculate Mode
Computes the mode of a series, handling missing values.

In [3]:
# Calculate Mode
def calc_mode(x):
    x = x.dropna()
    if len(x) == 0:
        return np.nan
    return pd.Series(x).mode()[0]
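
A minimal usage sketch of the mode helper, on made-up series. It restates the same logic so the snippet runs on its own; the variable names are illustrative only. Note that `Series.mode()` returns its values sorted, so a tie resolves to the smallest value.

```python
import numpy as np
import pandas as pd

def calc_mode(x):
    # Same logic as the helper above: ignore NaN, return NaN for an empty series
    x = x.dropna()
    if len(x) == 0:
        return np.nan
    return x.mode().iloc[0]

mode_with_gaps = calc_mode(pd.Series([3.0, np.nan, 3.0, 7.0]))  # NaN is ignored
mode_all_missing = calc_mode(pd.Series([np.nan, np.nan]))       # nothing left to count
mode_tie = calc_mode(pd.Series([2, 1, 1, 2]))                   # tie between 1 and 2

print(mode_with_gaps, mode_all_missing, mode_tie)
```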

### RMSE Metric
Calculates the Root Mean Squared Error (RMSE) for evaluation.

In [4]:
# RMSE Metric
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))
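
As a worked example, RMSE is $\sqrt{\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2}$. The sketch below computes it with NumPy alone on invented numbers, and also shows that always predicting the mean of the target gives an RMSE equal to the target's population standard deviation, which is a useful baseline when reading fold scores.

```python
import numpy as np

# Invented values, not taken from the competition data
y_true = np.array([100.0, 110.0, 120.0, 130.0])
y_pred = np.array([106.0, 102.0, 120.0, 130.0])

# Errors are -6, 8, 0, 0, so the mean squared error is (36 + 64) / 4 = 25
rmse_value = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Baseline: a constant prediction equal to the target mean
baseline_pred = np.full_like(y_true, y_true.mean())
baseline_rmse = np.sqrt(np.mean((y_true - baseline_pred) ** 2))

print(rmse_value, baseline_rmse)
```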

## Data Loading
Loads the training, test, and sample submission files from the Kaggle dataset.

In [5]:
# Data Loading
PATH = "/kaggle/input/playground-series-s5e9/"
dt = pd.read_csv(f"{PATH}train.csv")
dtest = pd.read_csv(f"{PATH}test.csv")
sub = pd.read_csv(f"{PATH}sample_submission.csv")
sub_temp = sub.iloc[:, 1].copy()
# Debug: Inspect raw data
print("Raw train data shape:", dt.shape)
print("Raw train data non-NaN counts:", dt.notna().sum())
print("Raw train data types:", dt.dtypes)

Raw train data shape: (524164, 11)
Raw train data non-NaN counts: id                           524164
RhythmScore                  524164
AudioLoudness                524164
VocalContent                 524164
AcousticQuality              524164
InstrumentalScore            524164
LivePerformanceLikelihood    524164
MoodScore                    524164
TrackDurationMs              524164
Energy                       524164
BeatsPerMinute               524164
dtype: int64
Raw train data types: id                             int64
RhythmScore                  float64
AudioLoudness                float64
VocalContent                 float64
AcousticQuality              float64
InstrumentalScore            float64
LivePerformanceLikelihood    float64
MoodScore                    float64
TrackDurationMs              float64
Energy                       float64
BeatsPerMinute               float64
dtype: object


## Data Preprocessing
### Identify Target
Treats every column that appears in the training set but not in the test set as a target, then takes the row-wise maximum. With a single target column, this is simply that column's values.

In [6]:
# Identify Target
target = [col for col in dt.columns if col not in dtest.columns]
Y = dt[target].max(axis=1)
dt = dt.drop(columns=target)
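
The column-difference trick can be checked on two toy frames. The column names are borrowed from the training set printed earlier, and the assumption that `BeatsPerMinute` is the column missing from the test set is mine; the values are invented.

```python
import pandas as pd

train = pd.DataFrame({
    "id": [0, 1],
    "Energy": [0.2, 0.8],
    "BeatsPerMinute": [118.0, 121.5],
})
test = pd.DataFrame({"id": [2, 3], "Energy": [0.5, 0.4]})

# Any column present in train but absent from test is treated as a target
target = [col for col in train.columns if col not in test.columns]

# With one target column, the row-wise max is just that column
y = train[target].max(axis=1)
features = train.drop(columns=target)

print(target, y.tolist(), features.columns.tolist())
```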

### Merge Train and Test
Combines train and test datasets for consistent preprocessing.

In [7]:
# Merge Train and Test
dt['fuente'] = 'train'
dtest['fuente'] = 'test'
dt_total = pd.concat([dt, dtest], ignore_index=True)

### Remove Duplicated Columns
Identifies and removes columns that are duplicates.

In [8]:
# Remove Duplicated Columns
cols_duplicated = dt_total.columns[dt_total.T.duplicated()].tolist()
if cols_duplicated:
    dt_total = dt_total.drop(columns=cols_duplicated)
    print(f"Removed duplicated columns: {cols_duplicated}")
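
A small sketch of how the transpose-based check flags duplicates, using a hypothetical three-column frame. `DataFrame.duplicated()` compares rows, so transposing first makes it compare columns; the first occurrence is kept and later copies are flagged. On a frame with hundreds of thousands of rows the transpose can be memory-hungry, so this is best seen as a convenience rather than a scalable approach.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "a_copy": [1, 2, 3],  # identical values to "a" under a different name
})

# Rows of df.T are the columns of df; duplicated() marks repeats after the first
dup_cols = df.columns[df.T.duplicated()].tolist()
deduped = df.drop(columns=dup_cols)

print(dup_cols, deduped.columns.tolist())
```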

### Remove High-Null Columns
Drops columns in which 95% or more of the values are missing.

In [9]:
# Remove High-Null Columns
null_cols = dt_total.columns[dt_total.isnull().mean() >= 0.95].tolist()
if null_cols:
    dt_total = dt_total.drop(columns=null_cols)
    print(f"Removed columns with >= 95% nulls: {null_cols}")
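
To make the threshold concrete: `isnull().mean()` gives the fraction of missing values per column, so comparing against `0.95` means 95 percent, not 0.95 percent. The frame below is invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mostly_missing": [np.nan] * 24 + [1.0],      # 24 of 25 missing -> 0.96
    "some_missing": [np.nan] * 5 + [1.0] * 20,    # 5 of 25 missing  -> 0.20
    "complete": np.arange(25.0),                  # nothing missing  -> 0.00
})

null_fraction = df.isnull().mean()
high_null_cols = df.columns[null_fraction >= 0.95].tolist()

print(null_fraction.round(2).to_dict(), high_null_cols)
```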

### Remove Single-Value Columns
Removes columns with only one unique value.

In [10]:
# Remove Single-Value Columns
n_unique = dt_total.nunique()
one_value_cols = n_unique[n_unique == 1].index.tolist()
if one_value_cols:
    dt_total = dt_total.drop(columns=one_value_cols)
    print(f"Removed columns with one unique value: {one_value_cols}")

### Add Null Count Feature
Adds a column counting missing values per row.

In [11]:
# Add Null Count Feature
if dt_total.isnull().sum().sum() > 0:
    dt_total['FilasNulas'] = dt_total.isnull().sum(axis=1)

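### Convert Potential Numeric Columns to Numeric Type
Casts a column to a numeric dtype only when every non-missing value parses as a number, so genuinely categorical columns are left for the encoding step.

In [12]:
# Convert potential numeric columns to numeric type
for col in dt_total.columns:
    if col not in ['id', 'fuente']:
        converted = pd.to_numeric(dt_total[col], errors='coerce')
        # Keep the conversion only if it introduces no new missing values;
        # otherwise the column holds real categories and stays unchanged
        if converted.isna().sum() == dt_total[col].isna().sum():
            dt_total[col] = converted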

### Encode Categorical Columns
Converts categorical columns to integer codes using LabelEncoder.

In [13]:
# Encode Categorical Columns
cat_cols = dt_total.select_dtypes(include=['object']).columns.tolist()
cat_cols = [col for col in cat_cols if col != 'fuente']  # Exclude 'fuente' from encoding
for col in cat_cols:
    dt_total[col] = LabelEncoder().fit_transform(dt_total[col].astype(str))

### Remove Low-Variance Features
Drops features in which a single value accounts for 95% or more of the rows.

In [14]:
# Remove Low-Variance Features
cols_total = dt_total.columns[3:]
i_cols_borradas = 0
for c in cols_total:
    mode_val = calc_mode(dt_total[c])
    prc_repetido = (dt_total[c] == mode_val).mean()
    if prc_repetido >= 0.95:
        print(f"Removing {c} with repeated value proportion {prc_repetido:.3f}")
        dt_total = dt_total.drop(columns=c)
        i_cols_borradas += 1
print(f"Total columns removed: {i_cols_borradas}")

Total columns removed: 0
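
The dominant-value check can be sketched on a hypothetical frame: for each column, find the mode and measure the share of rows equal to it. The column names and the data are invented; the 0.95 cut-off matches the code above.

```python
import pandas as pd

df = pd.DataFrame({
    "nearly_constant": [0] * 97 + [1, 2, 3],  # the mode 0 covers 97 of 100 rows
    "varied": list(range(100)),               # every value appears once
})

# Share of rows equal to the most frequent value, per column
dominant_share = {
    c: float((df[c] == df[c].mode().iloc[0]).mean()) for c in df.columns
}
to_drop = [c for c, share in dominant_share.items() if share >= 0.95]

print(dominant_share, to_drop)
```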


### Frequency Encoding
Applies frequency encoding to categorical features with 15 or fewer unique values.

In [15]:
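# Frequency Encoding
# Skip columns dropped by the low-variance step and the 'fuente' split marker
freq_candidates = [c for c in cols_total if c in dt_total.columns and c != 'fuente']
n_categories = dt_total[freq_candidates].nunique()
cat_features = n_categories[n_categories <= 15].index.tolist()
for c in cat_features:
    freq = dt_total[c].value_counts().to_dict()
    dt_total[f"{c}_FreqEnc"] = dt_total[c].map(freq)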
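
For readers new to frequency encoding, here is a toy example: each value is replaced by the number of rows in which it occurs. The `genre` column is hypothetical and does not exist in the competition data.

```python
import pandas as pd

df = pd.DataFrame({"genre": ["rock", "pop", "rock", "jazz", "rock", "pop"]})

# Count how often each category occurs, then map the counts back onto the rows
freq = df["genre"].value_counts().to_dict()
df["genre_FreqEnc"] = df["genre"].map(freq)

print(df)
```

Because the counts here come from the combined train and test frame, the encoded value reflects how common a category is across both sets.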

### Handle Missing Values
Imputes missing values with the median of each column.

In [16]:
# Handle Missing Values
null_cols = dt_total.columns[dt_total.isnull().sum() > 0]
for c in null_cols:
    if dt_total[c].notna().sum() > 0:  # If column has at least one non-NaN value
        median_val = dt_total[c].median()
        # Assign the result back; in-place fillna on a selected column is unreliable in recent pandas
        dt_total[c] = dt_total[c].fillna(median_val)
    else:  # If column is entirely NaN
        print(f"Column {c} is entirely NaN, imputing with 0")
        dt_total[c] = dt_total[c].fillna(0)
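
A short sketch of the imputation rule on invented data: columns with at least one observed value get their median, and fully empty columns fall back to 0. The median is used rather than the mean because it is not pulled by outliers such as the 100.0 below.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 100.0],  # median of the observed values is 3.0
    "empty": [np.nan] * 4,           # no observed values at all
})

for col in df.columns[df.isnull().any()]:
    fill_value = df[col].median() if df[col].notna().any() else 0
    # Assign the filled series back instead of relying on inplace=True
    df[col] = df[col].fillna(fill_value)

print(df)
```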

### Split Train and Test
Separates the combined dataset back into train and test sets.

In [17]:
# Split Train and Test
dt = dt_total[dt_total['fuente'] == 'train'].drop(columns='fuente')
dtest = dt_total[dt_total['fuente'] == 'test'].drop(columns='fuente')

### Preprocessing Pipeline
Standardizes numeric columns with StandardScaler, fitting on the training set only and applying the same transform to the test set.

In [18]:
# Preprocessing Pipeline
numeric_cols = dt.select_dtypes(include=[np.number]).columns.drop(['id'], errors='ignore')
numeric_cols = [col for col in numeric_cols if dt[col].notna().sum() > 0]
if len(numeric_cols) > 0:
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), numeric_cols)
        ],
        remainder='passthrough'
    )
    dt_processed = preprocessor.fit_transform(dt)
    dtest_processed = preprocessor.transform(dtest)
    all_cols = numeric_cols + [col for col in dt.columns if col not in numeric_cols]
    dt = pd.DataFrame(dt_processed, columns=all_cols)
    dtest = pd.DataFrame(dtest_processed, columns=all_cols)
else:
    print("No valid numeric columns to scale. Using all columns as-is.")
    dt = dt.copy()
    dtest = dtest.copy()
    if dt.drop(columns=['id'], errors='ignore').shape[1] == 0:
        raise ValueError("No features available for modeling after preprocessing.")
dt['id'] = dt['id'].astype(int)
dtest['id'] = dtest['id'].astype(int)
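
The fit-on-train, transform-on-test pattern is easy to verify on a tiny array. The numbers are invented; the point is that the test set is scaled with the training mean and standard deviation, so it is not centered on zero itself.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [5.0]])

scaler = StandardScaler().fit(train)      # learns mean 2.0 and population std from train
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)      # reuses the training statistics

print(train_scaled.ravel(), test_scaled.ravel())
```

Tree-based models such as LightGBM split on feature thresholds, so a monotonic rescaling like this is unlikely to change their predictions much; the step matters more if the same features are later fed to a linear model or a neural network.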

## Model Training and Prediction
### Setup
Initializes out-of-fold (OOF) predictions and submission DataFrames.

In [19]:
# Setup
oof = pd.DataFrame({'target': Y})
sub_seeds = sub.copy()
sub_seeds.iloc[:, 1:] = 0
scores = []
SEEDS = [1975, 2000, 2503, 1511, 2604]

### K-Fold Cross-Validation
Trains LightGBM models for each seed and fold, saving predictions and feature importances.

In [20]:
N_SPLITS = 10
# Collect every seed's fold-level test predictions in one frame
sub_temp = pd.DataFrame({'id': sub['id']})
if dt.shape[0] == 0:
    raise ValueError("Training dataset (dt) is empty. Check data loading or preprocessing steps.")

for s in SEEDS:
    print(f"\nSeed: {s}")
    np.random.seed(s)
    name_seed = f"Seed_{s}"
    sub_seeds[name_seed] = 0.0
    kf = KFold(n_splits=N_SPLITS, shuffle=True, random_state=s)
    seed_scores = []  # fold RMSEs for the current seed only

    for i_fold, (train_idx, val_idx) in enumerate(kf.split(dt), 1):
        print(f"\nLGBM Fold {i_fold} - {N_SPLITS}")
        X_train, y_train = dt.iloc[train_idx].drop(columns='id'), Y.iloc[train_idx]
        X_val, y_val = dt.iloc[val_idx].drop(columns='id'), Y.iloc[val_idx]

        # LightGBM model
        lgb_params = {
            'random_state': 0,
            'n_estimators': 4500,
            'learning_rate': 0.005,
            'boosting_type': 'gbdt',
            'objective': 'regression',
            'metric': 'rmse'
        }

        model = LGBMRegressor(**lgb_params)
        model.fit(
            X_train, y_train,
            eval_set=[(X_train, y_train), (X_val, y_val)],
            eval_metric='rmse',
            callbacks=[early_stopping(stopping_rounds=500), log_evaluation(period=250)]
        )

        # Out-of-fold predictions on the validation rows
        val_pred = model.predict(X_val)
        oof.loc[val_idx, f"LGBM_Seed_{s}"] = val_pred
        fold_rmse = rmse(y_val, val_pred)
        seed_scores.append(fold_rmse)
        scores.append(fold_rmse)
        print(f"LGBM Fold {i_fold} RMSE: {fold_rmse:.4f}, Avg RMSE: {np.mean(scores):.4f}")

        # Test predictions
        pred_lgbm = model.predict(dtest.drop(columns='id'))
        sub_temp[f"Seed_s{s}_Fold_{i_fold}"] = pred_lgbm
        sub_seeds[name_seed] += pred_lgbm / N_SPLITS

        # Feature importance (the file is overwritten on every fold, so only the last fit is kept)
        importance = pd.DataFrame({
            'Feature': X_train.columns,
            'Importance': model.feature_importances_
        })
        importance.to_csv("ImportanciaLGBM_v1.csv", index=False)
        free()

    print(f"RMSE LGBM Seed {s}: {np.mean(seed_scores):.4f}")


Seed: 1975

LGBM Fold 1 - 10
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.024131 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2295
[LightGBM] [Info] Number of data points in the train set: 471747, number of used features: 9
[LightGBM] [Info] Start training from score 119.035388
Training until validation scores don't improve for 500 rounds
[250]	training's rmse: 26.4407	valid_1's rmse: 26.4797
[500]	training's rmse: 26.4241	valid_1's rmse: 26.4799
[750]	training's rmse: 26.4087	valid_1's rmse: 26.4809
Early stopping, best iteration is:
[296]	training's rmse: 26.4377	valid_1's rmse: 26.4795
LGBM Fold 1 RMSE: 26.4795, Avg RMSE: 26.4795

LGBM Fold 2 - 10
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.025064 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2295
[LightGBM] [Info] Number of data points in the tra
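
The out-of-fold (OOF) bookkeeping in the loop above can be isolated in a NumPy-only sketch. Everything here is an assumption made for illustration: the data are synthetic, the folds are built by hand rather than with `KFold`, and an ordinary least-squares fit stands in for `LGBMRegressor` so the example runs without LightGBM.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
X_test = rng.normal(size=(50, 3))

n_splits = 5
# Shuffled, disjoint validation folds that together cover every training row
folds = np.array_split(rng.permutation(len(X)), n_splits)

oof_pred = np.full(len(X), np.nan)
test_pred = np.zeros(len(X_test))
for val_idx in folds:
    train_idx = np.setdiff1d(np.arange(len(X)), val_idx)
    # Stand-in model: least squares fitted on the training part of the fold
    coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    # Each row is predicted by a model that never saw it during fitting
    oof_pred[val_idx] = X[val_idx] @ coef
    # Test predictions are averaged over the fold models
    test_pred += (X_test @ coef) / n_splits

oof_rmse = np.sqrt(np.mean((y - oof_pred) ** 2))
print(f"OOF RMSE: {oof_rmse:.4f}")
```

Scoring the full OOF vector once, as done in the evaluation cell, generally gives a slightly different number from averaging the per-fold RMSEs, because RMSE is not linear in the squared errors.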

## Save Outputs
Saves predictions in multiple formats: mean, median, fold-wise, OOF, and seed-wise.

In [21]:
# Save Outputs
sub.iloc[:, 1] = sub_temp.iloc[:, 1:].mean(axis=1)
sub.to_csv("subLGBMFit_v1.csv", index=False)

sub.iloc[:, 1] = sub_temp.iloc[:, 1:].median(axis=1)
sub.to_csv("subLGBMFitMedian_v1.csv", index=False)

sub_temp.to_csv("subFoldLGBMFit_v1.csv", index=False)
oof.to_csv("OOF_LGBMFit_v1.csv", index=False)
sub_seeds.to_csv("subSeedsLGBM_v1.csv", index=False)
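
The difference between the mean and median submissions shows up when one fold disagrees with the others. The frame below is a hypothetical stand-in for the fold-wise prediction table, with invented values and shortened column names.

```python
import pandas as pd

fold_preds = pd.DataFrame({
    "id": [0, 1],
    "Fold_1": [118.0, 120.0],
    "Fold_2": [119.0, 121.0],
    "Fold_3": [126.0, 119.0],  # the first row is an outlying prediction
})

# Skip the id column, then aggregate across the fold columns row by row
blend_mean = fold_preds.iloc[:, 1:].mean(axis=1)
blend_median = fold_preds.iloc[:, 1:].median(axis=1)

print(blend_mean.tolist(), blend_median.tolist())
```

For the first row the mean is pulled toward the outlying fold while the median is not; for the second row, where the folds agree, the two blends coincide.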

## Evaluate Performance
Prints RMSE for each seed and the mean RMSE across seeds.

In [22]:
# Evaluate Performance
for s in SEEDS:
    print(f"RMSE Seed {s}: {rmse(oof['target'], oof[f'LGBM_Seed_{s}']):.4f}")

rmse_mean = np.mean([rmse(oof['target'], oof[f'LGBM_Seed_{s}']) for s in SEEDS])
print(f"Mean RMSE: {rmse_mean:.4f}")

RMSE Seed 1975: 26.4583
RMSE Seed 2000: 26.4578
RMSE Seed 2503: 26.4581
RMSE Seed 1511: 26.4584
RMSE Seed 2604: 26.4585
Mean RMSE: 26.4582
