# Task 2: Player Segment Classification

Problem: Marketing runs generic campaigns across all players and achieves only 2-5% engagement. The company spends ฿60 million per month on promotions but cannot identify the top 5% of players (whales), who generate 60% of revenue.

Your Task: Classify players into four behavioral segments:

- Class 0: Casual Player (relaxed play, low spending)
- Class 1: Competitive Grinder (high playtime, ranked focus)
- Class 2: Social Player (friend-focused, cosmetic spending)
- Class 3: Whale (high spending, status-driven)

Metric: F₁ Score (Macro)
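
Macro F₁ is the unweighted mean of the per-class F₁ scores, so a small class such as Whale counts as much as the large Casual class. The following is a minimal sketch on made-up labels (the `y_true` and `y_pred` values are hypothetical, not taken from the dataset) showing that `average="macro"` is simply the mean of the per-class scores:

```python
from sklearn.metrics import f1_score

# Hypothetical toy labels for the four segments (0=Casual, 1=Grinder, 2=Social, 3=Whale)
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0, 3, 2]

# One F1 score per class, in label order 0..3
per_class = f1_score(y_true, y_pred, average=None)

# Macro F1: every class contributes equally, regardless of its size
macro = f1_score(y_true, y_pred, average="macro")

print("Per-class F1:", per_class)
print("Macro F1:", macro)
```

Because each class carries equal weight, a model that ignores the rarest segment is penalized heavily even if overall accuracy looks fine.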

In [1]:
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import StackingClassifier
from sklearn.metrics import f1_score

from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier


In [2]:
# Load Data
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

print(train_df.head())
print(test_df.head())

         id player_id  play_frequency  avg_session_duration  \
0  PLY00001   P050236        5.495437             24.837349   
1  PLY00002   P108696        9.991089             88.376322   
2  PLY00003   P113532       14.234225            101.712292   
3  PLY00004   P123930        3.373683            191.975841   
4  PLY00005   P068623       22.469353             28.042509   

   total_playtime_hours  login_streak  days_since_last_login  \
0           2740.945124          60.0              56.034052   
1                   NaN          22.0              75.036888   
2           2828.479467          66.0                    NaN   
3           1915.082950          80.0               0.127910   
4            517.921948           NaN              45.078460   

   total_spending_thb  avg_monthly_spending  spending_frequency  ...  \
0        58219.915660            434.038311           17.790970  ...   
1        28966.163953           4233.532935           28.862134  ...   
2        44478.82383

In [3]:
# Data Preparation

# Separate features and target 
y_train = train_df['segment']
X_train = train_df.drop(columns=['id', 'player_id', 'segment'])

X_test = test_df.drop(columns=['id', 'player_id'])

In [4]:
# Data Preprocessing

categorical_features = X_train.select_dtypes(include=['object']).columns
numerical_features = X_train.select_dtypes(include=['number']).columns

# Create pipelines for numerical and categorical features
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean'))
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])
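
To make the behavior of this preprocessor concrete, here is a small sketch on a toy frame. The `platform` column and all values are hypothetical; `sparse_threshold=0` is set only so the result is a dense array that is easy to inspect. It shows mean imputation of a missing numeric value and how `handle_unknown='ignore'` encodes an unseen category as all zeros instead of raising an error:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; "platform" is an illustrative column name
toy_train = pd.DataFrame({
    "login_streak": [60.0, 22.0, np.nan, 80.0],
    "platform": ["pc", "mobile", "pc", "console"],
})
toy_test = pd.DataFrame({
    "login_streak": [np.nan],
    "platform": ["vr"],  # category never seen during fit
})

toy_pre = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="mean"), ["login_streak"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["platform"]),
    ],
    sparse_threshold=0,  # always return a dense array (for readability only)
)

train_out = np.asarray(toy_pre.fit_transform(toy_train))
test_out = np.asarray(toy_pre.transform(toy_test))

print(train_out)  # the missing streak is replaced by the training mean
print(test_out)   # unseen "vr" yields zeros in every one-hot column
```

The imputation mean is learned from the training rows only, which is why the transformer lives inside the pipeline rather than being applied to the full data up front.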

In [5]:
# Create Pipeline

xgb = XGBClassifier(
    n_estimators=400,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric="mlogloss",
    random_state=42,
    n_jobs=-1
)

lgbm = LGBMClassifier(
    n_estimators=400,
    learning_rate=0.05,
    num_leaves=31,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    n_jobs=-1
)

cat = CatBoostClassifier(
    iterations=400,
    learning_rate=0.05,
    depth=6,
    loss_function="MultiClass",
    verbose=0,
    random_state=42,
    thread_count=-1
)

# Stacking all models together
stacked_model = StackingClassifier(
    estimators=[
        ('xgb', xgb),
        ('lgbm', lgbm),
        ('cat', cat)
    ],
    final_estimator=LGBMClassifier(
        learning_rate=0.05,
        n_estimators=300
    ),
    stack_method='predict_proba',
    n_jobs=-1,
    passthrough=False
)


model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', stacked_model)
])
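
With `stack_method='predict_proba'` and `passthrough=False`, the final estimator never sees the original columns. Its inputs are the out-of-fold class probabilities from each base model, so three base models and four classes give 12 meta-features. The sketch below illustrates this with two lightweight scikit-learn models standing in for the boosted trees (a substitution made only to keep the example self-contained) on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class problem (hypothetical data, not the player dataset)
X_toy, y_toy = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=4, random_state=0,
)

toy_stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
        ("logreg", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",
    passthrough=False,  # meta-learner sees base-model probabilities only
)
toy_stack.fit(X_toy, y_toy)

# transform() returns what the final estimator receives:
# one probability column per (base model, class) pair
meta_features = toy_stack.transform(X_toy)
print(meta_features.shape)  # 2 base models x 4 classes
```

Setting `passthrough=True` would append the original features to these columns, at the cost of a larger and potentially more overfit meta-learner.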

In [6]:
# Cross Validation

# Use StratifiedKFold cross-validation to preserve class distribution
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
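
As a quick illustration of what "preserve class distribution" means, the sketch below splits a hypothetical imbalanced label vector (a 40/25/20/15 split chosen for illustration) and counts the classes in each validation fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 40 / 25 / 20 / 15 samples for classes 0..3
y_demo = np.repeat([0, 1, 2, 3], [40, 25, 20, 15])
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for the split itself

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Class counts inside each validation fold
fold_counts = [
    np.bincount(y_demo[val_idx], minlength=4)
    for _, val_idx in skf_demo.split(X_demo, y_demo)
]

for i, counts in enumerate(fold_counts):
    print(f"fold {i}: {counts}")
```

Every fold keeps the same class proportions as the full label vector, so the macro F₁ of each fold is computed on a representative share of the rare classes.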

In [7]:
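# Cross-Validated Evaluation and Final Training

# Estimate macro F1 with 5-fold stratified CV; each fold fits a clone of the pipeline
f1_scores = cross_val_score(
    model,
    X_train,
    y_train,
    cv=skf,
    scoring='f1_macro'
)

# cross_val_score leaves `model` itself unfitted, so train it on the full training set
model.fit(X_train, y_train)

print("F1 Macro Scores:", f1_scores)
print("Mean F1 Macro Score:", f1_scores.mean())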

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.045066 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 7212
[LightGBM] [Info] Number of data points in the train set: 81326, number of used features: 84
[LightGBM] [Info] Start training from score -0.931137
[LightGBM] [Info] Start training from score -1.386959
[LightGBM] [Info] Start training from score -1.598809
[LightGBM] [Info] Start training from score -1.871298
F1 Macro Scores: [0.75753592 0.752543   0.75116626 0.75137816 0.75058918]
Mean F1 Macro Score: 0.7526425040662016
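
The cell above calls `model.fit` separately because `cross_val_score` fits clones of the estimator and leaves the object passed to it untouched. A minimal sketch on synthetic data, with a plain `LogisticRegression` standing in for the stacked pipeline (an assumption made to keep the example fast), demonstrates this:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 4-class problem (hypothetical data, not the player dataset)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=4, random_state=0,
)

clf = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Each fold trains a fresh clone of `clf`; `clf` itself is not fitted
scores = cross_val_score(clf, X_toy, y_toy, cv=cv, scoring="f1_macro")
fitted_after_cv = hasattr(clf, "coef_")

# An explicit fit on all data is needed before calling predict()
clf.fit(X_toy, y_toy)
fitted_after_fit = hasattr(clf, "coef_")

print("Fold scores:", scores)
print("Fitted after CV:", fitted_after_cv)
print("Fitted after fit():", fitted_after_fit)
```

The cross-validation scores therefore estimate how the final model should generalize, while the final model itself is trained on every available row.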


In [8]:
# Predict on the Test Set and Save the Submission

predictions = model.predict(X_test)
submission_df = pd.DataFrame({'id': test_df['id'], 'task2': predictions})
submission_df.to_csv('submission.csv', index=False)
print("Submission saved to submission.csv")



Submission saved to submission.csv
