# Model Training
---
In this notebook, I train the model on the prepared dataset using scikit-learn.

In [1]:
from Utils import setup_database_connection, true, false
from Utils import load_all_players
from Utils import COLOUR_BANNED, COLOUR_NON_BANNED, COLOUR_BLUE
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from imblearn.combine import SMOTEENN

engine = setup_database_connection()
player_data = load_all_players(engine)
banned_player_data = player_data[player_data['has_ban'] == true]
non_banned_player_data = player_data[player_data['has_ban'] == false]

Connecting to database...
Connection successful!
Loaded 214688 players


## Data Cleaning
---
Apply data cleaning to prepare the dataset for training:
1. Remove features with >50% zero values in banned player data
2. Remove players with >2 zero values across features

In [2]:
# Step 1: collect numeric features that are zero for more than 50% of banned players
features_to_exclude = []
for feature in banned_player_data.select_dtypes(include=['int64', 'float64']).columns:
    banned_zeros = (banned_player_data[feature] == 0).sum()
    banned_zero_pct = (banned_zeros / len(banned_player_data)) * 100

    if banned_zero_pct > 50:
        features_to_exclude.append(feature)

thresholded_player_data = player_data.drop(columns=features_to_exclude)

# Step 2: keep only players with at most ZERO_THRESHOLD zero-valued numeric features
ZERO_THRESHOLD = 2

numeric_features = thresholded_player_data.select_dtypes(include=['int64', 'float64']).columns.tolist()
zero_counts_per_player = (thresholded_player_data[numeric_features] == 0).sum(axis=1)

mask = zero_counts_per_player <= ZERO_THRESHOLD

filtered_player_data = thresholded_player_data[mask].copy()

original_banned_count = (thresholded_player_data['has_ban'] == true).sum()
original_non_banned_count = (thresholded_player_data['has_ban'] == false).sum()
filtered_banned_count = (filtered_player_data['has_ban'] == true).sum()
filtered_non_banned_count = (filtered_player_data['has_ban'] == false).sum()

total_original = len(thresholded_player_data)
total_filtered = len(filtered_player_data)
total_removed = total_original - total_filtered
banned_removed = original_banned_count - filtered_banned_count
non_banned_removed = original_non_banned_count - filtered_non_banned_count

print(f"{'Category':<20} {'Original':<15} {'Filtered':<15} {'Removed':<15} {'% Retained':<15}")
print("-" * 80)
print(f"{'Banned Players':<20} {original_banned_count:<15,} {filtered_banned_count:<15,} {banned_removed:<15,} {(filtered_banned_count/original_banned_count*100):.2f}%")
print(f"{'Non-Banned Players':<20} {original_non_banned_count:<15,} {filtered_non_banned_count:<15,} {non_banned_removed:<15,} {(filtered_non_banned_count/original_non_banned_count*100):.2f}%")
print(f"{'Total Players':<20} {total_original:<15,} {total_filtered:<15,} {total_removed:<15,} {(total_filtered/total_original*100):.2f}%")

print(f"\nClass Balance:")
print("-" * 50)
print(f"Original - Banned: {(original_banned_count/total_original*100):.2f}% | Non-Banned: {(original_non_banned_count/total_original*100):.2f}%")
print(f"Filtered - Banned: {(filtered_banned_count/total_filtered*100):.2f}% | Non-Banned: {(filtered_non_banned_count/total_filtered*100):.2f}%")

print(f"\nData cleaning complete. Ready for training with {len(filtered_player_data):,} players and {filtered_player_data.shape[1]} features")

Category             Original        Filtered        Removed         % Retained     
--------------------------------------------------------------------------------
Banned Players       43,964          22,232          21,732          50.57%
Non-Banned Players   170,724         168,744         1,980           98.84%
Total Players        214,688         190,976         23,712          88.96%

Class Balance:
--------------------------------------------------
Original - Banned: 20.48% | Non-Banned: 79.52%
Filtered - Banned: 11.64% | Non-Banned: 88.36%

Data cleaning complete. Ready for training with 190,976 players and 29 features


## Handling Class Imbalance
---
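I will use a combination of oversampling and undersampling to handle the class imbalance in the dataset, which has too few banned players compared with non-banned players. I will use the SMOTEENN method from imbalanced-learn (`imblearn.combine`), which first oversamples the banned players with SMOTE and then applies Edited Nearest Neighbours to remove samples whose neighbours disagree with their label. In practice the cleaning step mostly trims the non-banned players.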

In [10]:
columns_to_exclude = ['steam_id', 'created_at', 'name', 'total_matches', 'updated_at', 'has_ban']
X = filtered_player_data.drop(columns=columns_to_exclude)
y = filtered_player_data['has_ban'].map({true: 1, false: 0})

smote_enn = SMOTEENN(random_state=42)
X_res, y_res = smote_enn.fit_resample(X, y)

print(f"Original dataset shape: {X.shape}")
print(f"Resampled dataset shape: {X_res.shape}")
print(f"\nOriginal class distribution:")
print(f"  Banned: {(y == 1).sum():,} ({(y == 1).sum()/len(y)*100:.2f}%)")
print(f"  Non-banned: {(y == 0).sum():,} ({(y == 0).sum()/len(y)*100:.2f}%)")
print(f"\nResampled class distribution:")
print(f"  Banned: {(y_res == 1).sum():,} ({(y_res == 1).sum()/len(y_res)*100:.2f}%)")
print(f"  Non-banned: {(y_res == 0).sum():,} ({(y_res == 0).sum()/len(y_res)*100:.2f}%)")



Original dataset shape: (190976, 23)
Resampled dataset shape: (303573, 23)

Original class distribution:
  Banned: 22,232 (11.64%)
  Non-banned: 168,744 (88.36%)

Resampled class distribution:
  Banned: 167,552 (55.19%)
  Non-banned: 136,021 (44.81%)


## Data Resampling Results
---
The banned class grew from 22,232 players to 167,552, and the non-banned class was reduced from 168,744 to 136,021. This gives me a much more balanced dataset to train on: roughly a 55/45 split in favour of banned players.

## Performance Metrics
---
Before choosing an algorithm and training the model, I will decide which metrics to prioritise. The goal is to identify cheaters without incorrectly flagging legitimate players, so I want to minimise false positives and will treat Precision as my main metric. I would rather miss some cheaters (false negatives) than wrongly flag legitimate players (false positives). I will also monitor Recall as a secondary metric; ideally the model strikes a good balance between the two.
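To make the trade-off concrete: Precision is TP / (TP + FP) and Recall is TP / (TP + FN), where a "positive" is a player flagged as banned. The small example below uses made-up labels and scores (ten hypothetical players, not real data) to show how raising the decision threshold can buy precision at the cost of recall.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels (1 = banned) and predicted ban probabilities for 10 players.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.95, 0.80, 0.55, 0.30, 0.60, 0.40, 0.20, 0.10, 0.05, 0.02])

results = {}
for threshold in (0.5, 0.7):
    # Flag a player as banned when the score reaches the threshold.
    y_pred = (y_score >= threshold).astype(int)
    precision = precision_score(y_true, y_pred, zero_division=0)
    recall = recall_score(y_true, y_pred, zero_division=0)
    results[threshold] = (precision, recall)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
# threshold=0.5: precision=0.75, recall=0.75
# threshold=0.7: precision=1.00, recall=0.50
```

At the stricter threshold no legitimate player is flagged, but half of the cheaters are missed, which is the direction of compromise described above.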

## Algorithm Selection
---
I will start with XGBoost, as it is a high-performance algorithm that works well with tabular data like mine. If I am not happy with its performance, I may try others such as Random Forest.