# Model 2: Supervised Learning - Predicting Spending Segment

### Objective
The goal of this notebook is to build and evaluate a supervised classification model to predict a player's `SpendingSegment` ('Whale', 'Dolphin', or 'Minnow'). This is a valuable business tool that could be used to identify potential high-value customers early in their lifecycle.

### Key Challenge: Severe Class Imbalance
Our primary challenge is the severe class imbalance in our target variable. Of the 3,024 players in the dataset, 2,544 (about 84%) are 'Minnow', 412 (about 14%) are 'Dolphin', and only 68 (about 2%) are 'Whale'. A naive model could reach roughly 84% accuracy by always predicting 'Minnow', which would make it useless for our goal of finding high-value players.


In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

In [11]:
# We repeat the full cleaning process to make this notebook self-contained
df = pd.read_csv('data/mobile_game_inapp_purchases.csv')

# --- Full Data Cleaning ---
df.fillna(df.median(numeric_only=True), inplace=True)
categorical_cols_with_na = df.select_dtypes(include='object').columns
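# Assign the result back rather than calling fillna(inplace=True) on a column
# selection, which can act on a temporary copy in recent pandas versions
df[categorical_cols_with_na] = df[categorical_cols_with_na].fillna('Unknown')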

df['LastPurchaseDate'] = pd.to_datetime(df['LastPurchaseDate'], errors='coerce')
df['Age'] = df['Age'].astype('int64')
df['SessionCount'] = df['SessionCount'].astype('int64')
df['FirstPurchaseDaysAfterInstall'] = df['FirstPurchaseDaysAfterInstall'].astype('int64')
df['LastPurchaseYear'] = df['LastPurchaseDate'].dt.year.fillna(df['LastPurchaseDate'].dt.year.median()).astype('int64')
df['LastPurchaseMonth'] = df['LastPurchaseDate'].dt.month.fillna(df['LastPurchaseDate'].dt.month.median()).astype('int64')
df['LastPurchaseDayOfWeek'] = df['LastPurchaseDate'].dt.dayofweek.fillna(df['LastPurchaseDate'].dt.dayofweek.median()).astype('int64')

# Drop columns not needed for modeling
df = df.drop(columns=['UserID', 'LastPurchaseDate'])

print("Data loaded and cleaned successfully.")
df.head()

Data loaded and cleaned successfully.


| | Age | Gender | Country | Device | GameGenre | SessionCount | AverageSessionLength | SpendingSegment | InAppPurchaseAmount | FirstPurchaseDaysAfterInstall | PaymentMethod | LastPurchaseYear | LastPurchaseMonth | LastPurchaseDayOfWeek |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Male | Norway | Android | Battle Royale | 9 | 12.83 | Minnow | 11.4 | 28 | Apple Pay | 2025 | 3 | 2 |
| 1 | 15 | Male | Switzerland | iOS | Action RPG | 11 | 19.39 | Minnow | 6.37 | 18 | Debit Card | 2025 | 6 | 6 |
| 2 | 23 | Male | China | Android | Fighting | 9 | 8.87 | Minnow | 15.81 | 30 | Apple Pay | 2025 | 6 | 0 |
| 3 | 31 | Male | Mexico | Android | Racing | 12 | 19.56 | Minnow | 13.49 | 9 | Debit Card | 2025 | 4 | 1 |
| 4 | 37 | Female | India | Android | Battle Royale | 10 | 15.23 | Minnow | 10.86 | 15 | Paypal | 2025 | 5 | 0 |


In [12]:
# Define our features (X) and the target variable (y)
X = df.drop(columns=['SpendingSegment', 'InAppPurchaseAmount'])
y_text = df['SpendingSegment']

In [13]:
le = LabelEncoder()
y = le.fit_transform(y_text)
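
`LabelEncoder` assigns integer codes in sorted order of the class names, so the codes do not follow spending level. The following is a minimal sketch of how the mapping could be inspected. It uses a small hand-written label list and a separate encoder (`le_demo`, a hypothetical name) rather than the real `y_text`:

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for y_text (illustrative, not the real data)
labels = ["Minnow", "Whale", "Dolphin", "Minnow"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

# classes_ is sorted; the position of a name in classes_ is its integer code
mapping = {str(name): int(code) for code, name in enumerate(le_demo.classes_)}
print(mapping)
print(encoded.tolist())
```

With this ordering, rows `0`, `1`, and `2` in the classification reports correspond to Dolphin, Minnow, and Whale respectively.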

In [14]:
print("Distribution of Spending Segments:")
print(y_text.value_counts())
print("-" * 30)

Distribution of Spending Segments:
SpendingSegment
Minnow     2544
Dolphin     412
Whale        68
Name: count, dtype: int64
------------------------------
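
These counts imply a majority-class baseline that is useful when reading any accuracy figure. The following is a minimal sketch that uses only the three counts printed above; the helper name `majority_baseline` is hypothetical and not part of the notebook.

```python
# Segment counts copied from the value_counts() output above
counts = {"Minnow": 2544, "Dolphin": 412, "Whale": 68}

def majority_baseline(counts):
    """Return (majority label, accuracy, macro-F1) for always predicting the majority class."""
    total = sum(counts.values())
    majority = max(counts, key=counts.get)
    accuracy = counts[majority] / total
    # For the majority class, recall is 1.0 and precision equals its share of the data
    precision, recall = accuracy, 1.0
    f1_majority = 2 * precision * recall / (precision + recall)
    # Every other class is never predicted, so its F1 is taken as 0
    macro_f1 = f1_majority / len(counts)
    return majority, accuracy, macro_f1

majority, accuracy, macro_f1 = majority_baseline(counts)
print(f"Always predicting {majority!r}: accuracy={accuracy:.3f}, macro-F1={macro_f1:.3f}")
```

The gap between the two numbers is why per-class F1, not accuracy, is the more informative metric for this problem.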


In [15]:
# Train/test split, stratified so each segment keeps its share in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("Data split successfully:")
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)


Data split successfully:
X_train shape: (2419, 12)
X_test shape: (605, 12)
y_train shape: (2419,)
y_test shape: (605,)


In [16]:
numerical_features = X_train.select_dtypes(include=np.number).columns
categorical_features = X_train.select_dtypes(include="object").columns

In [17]:
numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

# Create a preprocessor object using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

In [18]:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline # Use the imblearn pipeline to handle SMOTE correctly

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "K-Neighbors Classifier": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVC": SVC(random_state=42),
    "AdaBoost Classifier": AdaBoostClassifier(random_state=42),
    "XGBoost": XGBClassifier(random_state=42)
}

def evaluate_models_smote(X_train, y_train, X_test, y_test, models):
    model_report = {}
    for name, model in models.items():
        pipeline = ImbPipeline(steps=[('preprocessor', preprocessor),
                                      ('smote', SMOTE(random_state=42)),
                                      ('classifier', model)])
        
        pipeline.fit(X_train, y_train)
        y_pred = pipeline.predict(X_test)
        report = classification_report(y_test, y_pred, output_dict=True)
        
        print(f"--- {name} (with SMOTE) ---")
        print(classification_report(y_test, y_pred))
        print("-" * 50)
        
        # Store the key performance metrics
        # Note: the report keys are the encoded labels as strings. LabelEncoder sorts
        # class names alphabetically, so '0' = Dolphin, '1' = Minnow, '2' = Whale.
        model_report[name] = {
            'accuracy': report['accuracy'],
            'whale_f1_score': report.get('2', {}).get('f1-score', 0.0), 
            'dolphin_f1_score': report.get('0', {}).get('f1-score', 0.0), 
            'minnow_f1_score': report.get('1', {}).get('f1-score', 0.0) 
        }
        
    return model_report

model_performance_smote = evaluate_models_smote(X_train, y_train, X_test, y_test, models)

# Create a DataFrame to easily compare the model performances
performance_df_smote = pd.DataFrame(model_performance_smote).T.sort_values(by='whale_f1_score', ascending=False)

print("\n--- Model Performance Summary (with SMOTE) ---")
print("Models are ranked by their F1-Score for the 'Whale' class.")
performance_df_smote

--- Logistic Regression (with SMOTE) ---
              precision    recall  f1-score   support

           0       0.13      0.32      0.19        82
           1       0.85      0.43      0.57       509
           2       0.02      0.21      0.04        14

    accuracy                           0.41       605
   macro avg       0.33      0.32      0.26       605
weighted avg       0.73      0.41      0.51       605

--------------------------------------------------
--- K-Neighbors Classifier (with SMOTE) ---
              precision    recall  f1-score   support

           0       0.15      0.52      0.23        82
           1       0.90      0.34      0.49       509
           2       0.03      0.21      0.04        14

    accuracy                           0.36       605
   macro avg       0.36      0.36      0.25       605
weighted avg       0.77      0.36      0.44       605

--------------------------------------------------
--- Decision Tree (with SMOTE) ---
[output truncated]

| Model | accuracy | whale_f1_score | dolphin_f1_score | minnow_f1_score |
|---|---|---|---|---|
| K-Neighbors Classifier | 0.358678 | 0.044776 | 0.228723 | 0.488571 |
| Logistic Regression | 0.409917 | 0.036585 | 0.18638 | 0.571056 |
| Decision Tree | 0.652893 | 0.0 | 0.12381 | 0.790072 |
| Random Forest | 0.836364 | 0.0 | 0.0 | 0.910891 |
| SVC | 0.771901 | 0.0 | 0.072993 | 0.872521 |
| AdaBoost Classifier | 0.533884 | 0.0 | 0.167883 | 0.688073 |
| XGBoost | 0.826446 | 0.0 | 0.021505 | 0.904805 |
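
One way to read the accuracy column is against the majority-class baseline on the test set. The sketch below uses the accuracies from the summary table and the test-set support for class 1 (Minnow, 509 of 605 rows) from the classification reports; the variable names are illustrative.

```python
# Accuracies copied from the summary table above
accuracy = {
    "K-Neighbors Classifier": 0.358678,
    "Logistic Regression": 0.409917,
    "Decision Tree": 0.652893,
    "Random Forest": 0.836364,
    "SVC": 0.771901,
    "AdaBoost Classifier": 0.533884,
    "XGBoost": 0.826446,
}

# Always predicting Minnow gets every Minnow row in the test set right
baseline = 509 / 605

beats_baseline = [name for name, acc in accuracy.items() if acc > baseline]
print(f"Majority-class baseline accuracy: {baseline:.3f}")
print(f"Models above the baseline: {beats_baseline}")
```

No model clears the baseline, which suggests that the higher accuracies of Random Forest and XGBoost mostly reflect predicting 'Minnow' rather than any skill at finding Whales.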


# Model 2: Final Conclusion & Learnings

### What We Learned
This supervised analysis produced an important "negative result."

1.  **Behavioral Data is Not a Strong Predictor:** Our key finding is that the features available in this dataset are **not sufficient to reliably predict** a player's spending segment. After removing the data leak from the `InAppPurchaseAmount` feature, even our most powerful models struggled to find a strong signal: five of the seven models scored a 'Whale' F1 of 0.0, and the two most accurate models (Random Forest at 0.84 and XGBoost at 0.83) correctly identified almost no Dolphins or Whales.

2.  **The Limits of SMOTE:** SMOTE is a useful technique for handling class imbalance, but it cannot create a signal where none exists. Even with SMOTE applied to the training data, our final model comparison showed very low F1-scores for the minority classes: at most 0.045 for 'Whale' and 0.23 for 'Dolphin'. This is consistent with the features lacking the predictive power needed for this task.

3.  **The Importance of an Honest Evaluation:** This analysis is itself a useful business insight. Building a model is not enough; its performance must be evaluated critically on realistic data. Our conclusion is not that the models failed, but that **we would need to collect different and more predictive data** (e.g., specific in-game actions or social interactions) to build a successful spending tier prediction model.
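
To make the low 'Whale' scores concrete, recall that the F1-score is the harmonic mean of precision and recall. The sketch below plugs in the rounded Whale-row values from the Logistic Regression report (precision 0.02, recall 0.21, support 14). Because those inputs are rounded to two decimals, the derived counts are only approximate, and the helper name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded Whale-row values from the Logistic Regression report
precision, recall, support = 0.02, 0.21, 14

f1 = f1_from(precision, recall)
true_positives = round(recall * support)       # Whales correctly flagged
predicted_whales = true_positives / precision  # rough number of players flagged as Whale

print(f"F1 = {f1:.3f}")
print(f"About {true_positives} of {support} Whales found among roughly "
      f"{predicted_whales:.0f} flagged players")
```

In other words, a model at this level would flag on the order of fifty players for every real Whale it finds, which is too noisy to act on.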

This concludes our classification analysis. While we did not produce a production-ready predictive model, we have generated a valuable strategic insight.