# Grid.gg Esports Match Prediction Pipeline
This notebook implements a machine learning pipeline for predicting esports match outcomes from Grid.gg API data. The pipeline covers data cleaning, preprocessing, feature engineering, model training, and evaluation.

## Pipeline Structure
The pipeline is organized into the following main sections:
1. Data Collection and Cleaning
    - Reading in combined_player_stats_20241117_1343.csv
    - Handling missing values and outliers

2. Feature Engineering

3. Model Development and Evaluation
Three models will be developed and compared:

    - Neural Network: Designed for complex pattern recognition in player performance
    - XGBoost: Has performed well on similar prediction tasks with our data and in Shayne's models
    - [Additional Model TBD]: To be selected based on feature characteristics and data structure

4. Model Optimization Possibilities 
    - Hyperparameter tuning
    - Feature importance analysis
    - Performance validation
    - Cross-validation strategies

## Data Cleaning and Preprocessing

### Data Loading and Initial Assessment
- Loading data from 'combined_player_stats_20241117_1343.csv'
- Initial examination of data structure and completeness
- Documentation of current data shape and basic statistics

### Data Quality Issues to Address

1. Known Invalid Records
    - Identification and removal of ~55 batches of players with all zero statistics
    - Documentation of removed records for future data collection improvement

2. Partial Zero Statistics
    - Analysis of players with some zero statistics but otherwise valid data
    - Determination of valid zero values vs. missing/error data
    - Strategy for handling partially complete player records

3. Team Information Linking
    - Assessment of available team information
    - Preparation for linking player statistics with match results
    - Identification of any missing team affiliations
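
One way to separate the first two categories above is to label each row by how many of its statistic columns are zero. The sketch below is a minimal illustration on a toy frame; `classify_zero_rows` and the toy values are hypothetical and are not part of `src.data_preprocessing`.

```python
import pandas as pd


def classify_zero_rows(df: pd.DataFrame, stat_columns: list[str]) -> pd.Series:
    """Label each row 'all_zero', 'partial_zero', or 'no_zero' (hypothetical helper)."""
    zero_mask = df[stat_columns].eq(0)   # True where a statistic is exactly 0
    n_zero = zero_mask.sum(axis=1)       # number of zero statistics per row
    labels = pd.Series("no_zero", index=df.index)
    labels[n_zero > 0] = "partial_zero"
    labels[n_zero == len(stat_columns)] = "all_zero"
    return labels


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "player_id": [1, 2, 3],
    "games_played": [0, 10, 4],
    "total_kills": [0, 0, 57],
})
labels = classify_zero_rows(toy, ["games_played", "total_kills"])
print(labels.value_counts())
```

Rows labelled `partial_zero` are the ones that likely need a per-column decision, since a zero may be a genuine value in some statistics and a collection error in others.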

In [18]:
# Imports
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
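# Import only the helpers used in this notebook instead of a wildcard import
from src.data_preprocessing import extract_nested_stats, remove_zero_stat_players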

In [19]:
# Set path and read data
file_path = '../grid_collector/data/combined_player_stats_20241117_1343.csv'
df = pd.read_csv(file_path)

In [20]:
# Display basic information about the dataset
print("Dataset Shape:", df.shape)
print("\nFirst few rows of the dataset:")
print(df.head())

print("\nColumn names:")
print(df.columns.tolist())

print("\nBasic statistics:")
print(df.describe())

# Check for missing values
print("\nMissing values in each column:")
print(df.isnull().sum())


Dataset Shape: (5452, 7)

First few rows of the dataset:
   player_id                                  general  \
0       3435  {'series_played': 0, 'games_played': 0}   
1       3436  {'series_played': 0, 'games_played': 0}   
2       3437  {'series_played': 0, 'games_played': 0}   
3       3438  {'series_played': 0, 'games_played': 0}   
4       3439  {'series_played': 0, 'games_played': 0}   

                                              combat  \
0  {'kills': {'total': 0, 'average': 0.0, 'best':...   
1  {'kills': {'total': 0, 'average': 0.0, 'best':...   
2  {'kills': {'total': 0, 'average': 0.0, 'best':...   
3  {'kills': {'total': 0, 'average': 0.0, 'best':...   
4  {'kills': {'total': 0, 'average': 0.0, 'best':...   

                                         performance  \
0  {'wins': {'count': 0, 'percentage': 0, 'curren...   
1  {'wins': {'count': 0, 'percentage': 0, 'curren...   
2  {'wins': {'count': 0, 'percentage': 0, 'curren...   
3  {'wins': {'count': 0, 'percentage': 

In [None]:
# Flatten nested structures (updated with new extraction logic)
# The function has been updated to ensure data that was previously lost is now extracted properly.
flat_df = extract_nested_stats(df)
print("\nFlattened data shape:", flat_df.shape)
print("\nFlattened data columns:")
print(flat_df.columns.tolist())


Flattened data shape: (5452, 13)

Flattened data columns:
['player_id', 'series_played', 'games_played', 'total_kills', 'avg_kills', 'best_kills', 'total_deaths', 'avg_deaths', 'wins_count', 'win_percentage', 'current_streak', 'avg_net_worth', 'max_net_worth']
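
The implementation of `extract_nested_stats` lives in `src.data_preprocessing` and is not shown in this notebook. As a rough sketch of the general technique, assuming each nested column is stored in the CSV as the string form of a Python dict (as the `head()` output suggests), the values can be parsed with `ast.literal_eval` and flattened recursively. The helper names and the underscore-joined column names below are assumptions and differ from the real function's output names (for example `total_kills`).

```python
import ast

import pandas as pd


def flatten_dict(d: dict, prefix: str = "") -> dict:
    """Recursively flatten nested dicts, joining keys with underscores."""
    flat = {}
    for key, value in d.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_dict(value, name))
        else:
            flat[name] = value
    return flat


def flatten_stat_columns(df: pd.DataFrame, nested_columns: list[str]) -> pd.DataFrame:
    """Hypothetical stand-in for extract_nested_stats; not the real function."""
    rows = []
    for _, row in df.iterrows():
        flat = {"player_id": row["player_id"]}
        for col in nested_columns:
            parsed = row[col]
            if isinstance(parsed, str):
                # literal_eval only accepts Python literals, so it is safe on untrusted text
                parsed = ast.literal_eval(parsed)
            # The top-level column name is dropped; inner keys form the new names
            flat.update(flatten_dict(parsed))
        rows.append(flat)
    return pd.DataFrame(rows)


# Toy rows shaped like the head() output above; values are invented
toy = pd.DataFrame({
    "player_id": [1, 2],
    "general": ["{'series_played': 2, 'games_played': 5}",
                "{'series_played': 0, 'games_played': 0}"],
    "combat": ["{'kills': {'total': 40, 'average': 8.0}}",
               "{'kills': {'total': 0, 'average': 0.0}}"],
})
flat_toy = flatten_stat_columns(toy, ["general", "combat"])
print(flat_toy.columns.tolist())
```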


In [12]:
# Remove zero-stat players
# Define columns to check for zeros
stat_columns = [
    'series_played',
    'games_played',
    'total_kills',
    'avg_kills',
    'best_kills',
    'total_deaths',
    'avg_deaths',
    'wins_count',
    'win_percentage',
    'current_streak',
    'avg_net_worth',
    'max_net_worth'
]  # Note: Excluded player_id as it's an identifier, not a stat

# Clean the data
cleaned_df = remove_zero_stat_players(flat_df, stat_columns)

# Print stats about the cleaning
print(f"Original shape: {flat_df.shape}")
print(f"Cleaned shape: {cleaned_df.shape}")
print(f"Removed {flat_df.shape[0] - cleaned_df.shape[0]} rows")

# Let's see the distribution of values in cleaned data
print("\nSummary statistics of cleaned data:")
print(cleaned_df[stat_columns].describe())

# Check if we still have any zeros in individual columns
zero_counts = {col: (cleaned_df[col] == 0).sum() for col in stat_columns}
print("\nRemaining zero values in each column:")
for col, count in zero_counts.items():
    print(f"{col}: {count} zeros")


Original shape: (5452, 13)
Cleaned shape: (1768, 13)
Removed 3684 rows

Summary statistics of cleaned data:
       series_played  games_played  total_kills    avg_kills   best_kills  \
count    1768.000000   1768.000000  1768.000000  1768.000000  1768.000000   
mean        9.728507     22.657805   312.673643    27.626575    40.990385   
std        17.301542     41.640428   649.261385    11.880559    19.033715   
min         1.000000      1.000000     0.000000     0.000000     0.000000   
25%         2.000000      5.000000    46.750000    19.611538    27.000000   
50%         5.000000     10.000000   120.000000    30.870833    43.500000   
75%        10.000000     23.000000   307.500000    36.000000    55.000000   
max       168.000000    403.000000  6682.000000    60.666667    89.000000   

       total_deaths   avg_deaths  wins_count  win_percentage  current_streak  \
count   1768.000000  1768.000000      1768.0          1768.0          1768.0   
mean     315.968326    28.927882      
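
`remove_zero_stat_players` is imported from `src.data_preprocessing`, so its body is not visible in this notebook. A minimal sketch of the behaviour the output suggests, assuming it drops only rows in which every listed statistic is zero, could look like this. The function name `drop_all_zero_rows` is hypothetical, and returning the removed rows is an addition intended to support documenting removed records.

```python
import pandas as pd


def drop_all_zero_rows(df: pd.DataFrame, stat_columns: list[str]):
    """Split df into (kept, removed); removed rows have 0 in every stat column."""
    all_zero = df[stat_columns].eq(0).all(axis=1)
    kept = df.loc[~all_zero].reset_index(drop=True)
    removed = df.loc[all_zero].reset_index(drop=True)
    return kept, removed


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "player_id": [10, 11, 12],
    "games_played": [0, 3, 7],
    "total_kills": [0, 0, 90],
})
kept, removed = drop_all_zero_rows(toy, ["games_played", "total_kills"])
print(f"Kept {len(kept)} rows, removed {len(removed)} rows")
print("Removed player_ids:", removed["player_id"].tolist())
```

Under this rule a row with some zero statistics is kept, which would be consistent with the zeros that remain in individual columns after cleaning.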

In [17]:
# Further handling of uninformative columns
# The output above shows that 'wins_count', 'win_percentage', 'current_streak',
# 'avg_net_worth', and 'max_net_worth' are zero in every row left after removing
# zero-stat players: the zero count for each of these columns equals the row count.
# I verified against our original data file and the batch files that these values
# are also 0 in the source data, so the columns carry no information and can be dropped.

# Remove the problematic columns that are all zeros
columns_to_drop = [
    'wins_count',
    'win_percentage', 
    'current_streak',
    'avg_net_worth',
    'max_net_worth'
]

final_df = cleaned_df.drop(columns=columns_to_drop)

# Display results
print(f"Original shape: {flat_df.shape}")
print(f"Final shape: {final_df.shape}")
print("\nRemaining columns:")
print(final_df.columns.tolist())
print("\nSummary statistics of remaining columns:")
print(final_df.describe())


Original shape: (5452, 13)
Final shape: (1768, 8)

Remaining columns:
['player_id', 'series_played', 'games_played', 'total_kills', 'avg_kills', 'best_kills', 'total_deaths', 'avg_deaths']

Summary statistics of remaining columns:
           player_id  series_played  games_played  total_kills    avg_kills  \
count    1768.000000    1768.000000   1768.000000  1768.000000  1768.000000   
mean   110817.961538       9.728507     22.657805   312.673643    27.626575   
std     14015.940317      17.301542     41.640428   649.261385    11.880559   
min     19537.000000       1.000000      1.000000     0.000000     0.000000   
25%    111640.500000       2.000000      5.000000    46.750000    19.611538   
50%    114123.500000       5.000000     10.000000   120.000000    30.870833   
75%    116721.500000      10.000000     23.000000   307.500000    36.000000   
max    120827.000000     168.000000    403.000000  6682.000000    60.666667   

        best_kills  total_deaths   avg_deaths  
count  17
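
The list in `columns_to_drop` was written out by hand after inspecting the summary statistics. As an alternative sketch, columns that hold a single constant value can be detected with `DataFrame.nunique`; the helper name `constant_columns` and the toy data are hypothetical.

```python
import pandas as pd


def constant_columns(df: pd.DataFrame, candidates: list[str]) -> list[str]:
    """Return the candidate columns that contain at most one distinct value."""
    n_unique = df[candidates].nunique(dropna=False)
    return n_unique[n_unique <= 1].index.tolist()


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "player_id": [1, 2, 3],
    "total_kills": [40, 0, 12],
    "wins_count": [0, 0, 0],
    "avg_net_worth": [0.0, 0.0, 0.0],
})
to_drop = constant_columns(toy, ["total_kills", "wins_count", "avg_net_worth"])
print(to_drop)
```

A check like this only flags columns for review; whether a constant column reflects a collection problem still has to be confirmed against the source data, as was done above.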