# Intro

I was unsure exactly how to format this. I didn't want to make it a blog article or Kaggle post, since that is not really me. I'll take you through my process for approaching this problem pretty much as it unfolded. 

Modeling is an iterative process, and for me anyway, can be messy. I did not want to sanitize the notebook too much, since I felt the whole point of this was to gain an understanding of how I solve problems.

Testing and exploration over time is the key for me. The goal is always to build a strong intuitive understanding of the data and the sport, then go from there. 

I tried to get things done in about a week, so I would say this is not a "final" model, but I think it's a great place to start. I make some comments and annotations along the way. As you will see, presenting my work is new for me, but this was a fun project nonetheless. Very excited to talk about this when we meet.

I'll try not to give you too many tedious comments to read. I figure you guys know most of this.  :)

# Loading, Cleaning and Exploration

In [75]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor
from sklearn.model_selection import cross_val_predict, KFold, cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

pd.set_option('display.max_columns', None)

In [76]:
data = pd.read_csv("mlbb_data_oddin.csv")
data.head() #first looks, can see null cols etc.. dataset is pretty small

Unnamed: 0.1,Unnamed: 0,tournament_id,tournament_name,region,tier,date_start,match_id,map_id,map_order,prematch_home,home_team_id,away_team_id,map_win_team_id,results_away_kills,results_home_kills,results_duration
0,0,5649,WSL Season 8,asia,tier2,2024-02-16 08:00:27.641336+00:00,713573,816775,1,0.757978,10320,10322,10320,,,
1,1,5649,WSL Season 8,asia,tier2,2024-02-16 08:35:37.039547+00:00,713573,816776,2,0.757978,10320,10322,10322,,,
2,2,5649,WSL Season 8,asia,tier2,2024-02-16 09:52:39.177390+00:00,713573,816777,3,0.738443,10320,10322,10320,,,
3,3,5649,WSL Season 8,asia,tier2,2024-02-16 09:45:53.364959+00:00,713574,816779,1,0.703136,10138,10321,10138,,,
4,4,5649,WSL Season 8,asia,tier2,2024-02-16 12:22:35.717617+00:00,713574,816780,2,0.703136,10138,10321,10321,,,


I have worked with MOBA data for a while, so I'm pretty familiar with the game data and its nature. I will explain some things, but I might skip over others. Ask me about anything you see me do; I have no problem talking about it.

In [77]:
data.drop('Unnamed: 0',axis=1,inplace=True) # pre cleaning

data['date_start'] = pd.to_datetime(data['date_start'])
data = data.sort_values('date_start')

The first thing that jumps out is that there is no in-game stat data. However, there is a prematch probability, which I can use as a stand-in for relative power rating. Let's see how accurate it is...

In [78]:
win_rate = data.dropna(subset=['prematch_home','map_win_team_id']) # drop rows missing the probability or the winner
c1 = win_rate['prematch_home'] == .50 # coin-flip lines have no favorite to score
win_rate = win_rate[~c1]

x1 = win_rate['home_team_id'] == win_rate['map_win_team_id']
x2 = win_rate['prematch_home'] > .50

y1 = win_rate['home_team_id'] != win_rate['map_win_team_id']
y2 = win_rate['prematch_home'] < .50

len(win_rate[(x1&x2|(y1&y2))])/len(win_rate)

0.6509025270758123

65% is very good for map predictions. Are these opening lines? Closing? Something else?
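A single hit rate hides whether the quoted probabilities are well calibrated. One way to look at that is to bucket maps by the quoted home probability and compare each bucket against how often the home side actually won. This is a minimal sketch on synthetic data, not something run on the real file; `calibration_table`, `toy_prob`, and `toy_won` are hypothetical names introduced here for illustration.

```python
import numpy as np
import pandas as pd

def calibration_table(prob_home, home_won, n_bins=5):
    """Compare the mean quoted probability with the observed win rate per bin."""
    frame = pd.DataFrame({"prob": prob_home, "won": home_won})
    # Equal-width bins over [0, 1]; include_lowest keeps prob == 0 in the first bin.
    edges = np.linspace(0, 1, n_bins + 1)
    frame["bin"] = pd.cut(frame["prob"], bins=edges, include_lowest=True)
    return (
        frame.groupby("bin", observed=True)
        .agg(n=("won", "size"), mean_prob=("prob", "mean"), win_rate=("won", "mean"))
        .reset_index()
    )

# Synthetic stand-in data (assumption): outcomes are drawn from the quoted
# probability itself, so this toy "market" is calibrated by construction.
rng = np.random.default_rng(0)
toy_prob = rng.uniform(0.2, 0.8, size=5000)
toy_won = (rng.uniform(size=5000) < toy_prob).astype(int)

table = calibration_table(toy_prob, toy_won)
print(table)
```

On the notebook's data the inputs would be `win_rate['prematch_home']` and `(win_rate['home_team_id'] == win_rate['map_win_team_id']).astype(int)`. Large gaps between `mean_prob` and `win_rate` in a bucket would suggest the lines are less reliable in that range.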

In [79]:
data.isna().sum()

tournament_id           0
tournament_name         0
region                  0
tier                    0
date_start              1
match_id                0
map_id                  0
map_order               0
prematch_home           1
home_team_id            0
away_team_id            0
map_win_team_id         0
results_away_kills    469
results_home_kills    469
results_duration      469
dtype: int64

In [80]:
data.dropna(subset=['date_start','prematch_home','results_away_kills','results_home_kills','results_duration'],inplace=True)

nill_game = (data['results_away_kills'] == 0) & (data['results_home_kills'] == 0) # maps with 0-0 score probably not valid
good_time = data['results_duration'] > 5 # maps under 5 seconds probably not valid

data = data[~nill_game&good_time]
data.reset_index(drop=True,inplace=True)

Only having about 2000 good rows to work with is tough. It makes a lot of the feature engineering I like to do moot. 

I can't really generate team-specific data, since filtering for teams that have a 3-map history cuts the data almost in half. Overfitting also becomes more of a problem, although I'd argue that overfitting these models is hard if you use common sense.

I am already thinking of how to get more data, but let's continue.

# Feature Engineering and Modeling Cycle

This is where the process is very iterative. I'll constantly be going back and changing things, deleting, etc. My real process usually means generating sometimes hundreds of features, then slowly peeling away useless and redundant ones until I find the core.

In [81]:
#target variable
data['kill_differential'] = data['results_home_kills'] - data['results_away_kills'] #target

#will be main feature for baseline model
data['prematch_away'] = 1 - data['prematch_home']
data['rating_differential'] = data['prematch_home'] - data['prematch_away']

#will be used as feature later on
data['abs_kill_diff'] = abs(data['results_home_kills'] - data['results_away_kills'])

## EDA example

In [82]:
abs(data['kill_differential']).describe()

count    2071.000000
mean        8.193626
std         4.536026
min         0.000000
25%         5.000000
50%         8.000000
75%        11.000000
max        28.000000
Name: kill_differential, dtype: float64

In [83]:
# Do lower-level teams have bigger kill differentials? It almost looks like it...
# (note: this is the mean of the signed home-minus-away differential, not the absolute margin)
data.groupby('tier')['kill_differential'].mean()

tier
tier1    0.597403
tier2    0.724044
tier3    0.929348
Name: kill_differential, dtype: float64

In [84]:
#not enough data here I feel to make a feature, but we can try.
data['tier'].value_counts()

tier
tier1    1155
tier2     732
tier3     184
Name: count, dtype: int64

In [85]:
data.groupby('tier')['rating_differential'].mean()

tier
tier1    0.047066
tier2    0.107215
tier3    0.027606
Name: rating_differential, dtype: float64

Now I ask questions like: do lower-ranked teams have more blowouts? Stuff like this. Looking at tiers is interesting, but I do not feel like we have enough data, so it's not something I pursue. Also, in my previous models tier never really made a difference for me. However, there I was working on match prediction.

This is where I start to think about team stats, moving averages etc.. however I do not feel that we have enough data to generate those stats.

## Baseline

In [86]:
#Let's get a baseline with a LR simply using rating differential
#test = data.dropna(subset=['hist_abs_90D_mean', 'home_hist_dur_6M', 'away_hist_dur_6M', 'hist_6M_std']).copy()
#the test is code that I would use as I make new features that would have nulls, so I would have to remove them from the baseline model as well
X = data[['rating_differential']]
y = data['kill_differential']

model = LinearRegression()

cv_scores = cross_val_score(
    estimator=model,
    X=X,
    y=y,
    cv=5,              
    scoring='neg_root_mean_squared_error'  
)

print("--- 5-Fold Cross-Validation Scores (negative RMSE) ---")
print(cv_scores)
print(f"\nMean CV negative RMSE: {np.mean(cv_scores):.4f}")

--- 5-Fold Cross-Validation Scores (negative RMSE) ---
[-8.67901234 -8.79667425 -8.72654683 -8.4512976  -9.25246344]

Mean CV negative RMSE: -8.7812


This is where it hits me that I really do not have a solid metric to go by. In my other experience I would back test the model vs the betting odds and get my baseline, but I realize here I do not have a special metric like ROI.


I always evaluated my models somewhat intuitively: if a model had lower log loss and higher accuracy but a lower ROI, you had to find a compromise. 


Now here I am dealing with even more ambiguous metrics, MAE or RMSE. How good is good? I have a feeling you must have some "cash" metric that evaluates the "value" of a line somehow. Interesting to think about.
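One rough way to give an RMSE some context is to compare it against a naive model that always predicts the training mean, scored on the same folds. The sketch below does that on synthetic data; `rmse_vs_naive`, `toy_X`, `toy_y`, and the "skill" ratio are illustrative names and choices I am assuming here, not an established house metric.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

def rmse_vs_naive(model, X, y, n_splits=5, seed=99):
    """Mean CV RMSE of `model` next to a predict-the-mean baseline on the same folds."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)

    def mean_rmse(estimator):
        scores = cross_val_score(
            estimator, X, y, cv=cv, scoring="neg_root_mean_squared_error"
        )
        return -scores.mean()  # flip the sign so lower is better

    naive = mean_rmse(DummyRegressor(strategy="mean"))
    fitted = mean_rmse(model)
    # "skill" is the fraction of the naive error removed by the model.
    return {"naive_rmse": naive, "model_rmse": fitted, "skill": 1 - fitted / naive}

# Synthetic stand-in (assumption): a weak linear signal buried in heavy noise.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(1000, 1))
toy_y = 3.0 * toy_X[:, 0] + rng.normal(scale=8.0, size=1000)

report = rmse_vs_naive(LinearRegression(), toy_X, toy_y)
print(report)
```

A skill near zero means the features add little beyond the average kill differential. That still says nothing about money, so it complements rather than replaces an ROI-style backtest.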

In [87]:
# I usually wouldn't do this, since I like to be flexible, but I will use this basic model to show you my process, and I'd like to call it a few times.
# Simple model with CV etc.
def evaluate_model(model, features, data, target='kill_differential'):
    """
    Runs 5-fold CV for a given model and feature set,
    printing the mean RMSE. 
    """
    
    # 1. Prepare data - drop NaNs only from relevant columns
    eval_data = data.dropna(subset=features + [target])
    X = eval_data[features]
    y = eval_data[target]
    
    # 2. Define CV strategy (consistent across all models)
    cv = KFold(n_splits=5, shuffle=True, random_state=99)
    
    # 3. Run CV and get scores
    scores = cross_val_score(
        estimator=model,
        X=X,
        y=y,
        cv=cv,              
        scoring='neg_root_mean_squared_error'  
    )
    # 4. Print results
    print(f"--- Testing Features: {features} ---")
    print(f"CV Scores (RMSE): {scores}")
    print(f"Mean CV RMSE: {np.mean(scores):.4f}\n")
    print("-" * 30)

## Creation of some Team and Population stats

These are the stats that I have to work with when the dataset is small and sparse. The cell below gives me population-level data on duration and kill differential. Duration matters for the obvious reason: a longer map means more kills. I already have an understanding of what creates high-kill maps, so here I am exploring.

In [88]:
historical_stats_dict = {} #this will hold the data we produce

# I iterate through the dataset since I have to only pull data prematch to prevent data leakage, using vectors here will not work
for index, row in data.iterrows():

    #these variables help us create the historical window below with no data leakage, 6 months
    current_date = row['date_start']
    current_match_id = row['match_id']
    cutoff_date = current_date - pd.DateOffset(days=180)
    
    #historical window where we generate the stats
    historical_data_180D = data[
        (data['date_start'] < current_date) &     # Data is in the past
        (data['date_start'] >= cutoff_date) &    # Data is within the 180-day window
        (data['match_id'] != current_match_id)  #EXCLUDE maps from the current match
    ]

    
    stats_kill_diff = historical_data_180D['abs_kill_diff'].describe() # getting the describe data
    stats_duration = historical_data_180D['results_duration'].describe()
        
    historical_stats_dict[row['map_id']] = {
        'hist_abs_180D_mean': stats_kill_diff['mean'],
        'hist_abs_180D_std': stats_kill_diff['std'],
        'hist_abs_180D_min': stats_kill_diff['min'],
        'hist_abs_180D_max': stats_kill_diff['max'],
        'hist_abs_180D_q25': stats_kill_diff['25%'],
        'hist_abs_180D_q50': stats_kill_diff['50%'],
        'hist_abs_180D_q75': stats_kill_diff['75%'],
        
        'hist_dur_180D_mean': stats_duration['mean'],
        'hist_dur_180D_std': stats_duration['std'],
        'hist_dur_180D_min': stats_duration['min'],
        'hist_dur_180D_max': stats_duration['max'],
        'hist_dur_180D_q25': stats_duration['25%'],
        'hist_dur_180D_q50': stats_duration['50%'],
        'hist_dur_180D_q75': stats_duration['75%']
    }

#Convert results and merge back
features_df = pd.DataFrame.from_dict(historical_stats_dict, orient='index')
features_df = features_df.reset_index().rename(columns={'index': 'map_id'})
data = data.merge(features_df, on='map_id', how='left')
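
For comparison, pandas time-based rolling windows can produce a similar trailing statistic without a Python loop. This is only a sketch on a hypothetical `toy` frame, and it is not equivalent to the loop above: `closed="left"` excludes the current timestamp, but it does not drop earlier maps from the same match, which is the part that is hard to vectorize.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame standing in for the real data.
toy = pd.DataFrame(
    {
        "date_start": pd.to_datetime(
            ["2024-01-01", "2024-02-01", "2024-03-01", "2024-08-01"]
        ),
        "abs_kill_diff": [2.0, 4.0, 6.0, 8.0],
    }
).sort_values("date_start")

# closed="left" makes each window [t - 180 days, t), so a row never sees itself.
rolled = (
    toy.set_index("date_start")["abs_kill_diff"]
    .rolling("180D", closed="left")
    .mean()
)
toy["hist_abs_180D_mean"] = rolled.to_numpy()
print(toy)
```

The first row has no history, so its value is NaN, just as the empty window in the loop yields NaN from `describe()`.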

In [89]:
# This dataframe makes it easy to create team history data, I have done this many ways, but trying this out.

# This creates a new df with just the home team's perspective
df_home = data.assign(
    team_id=data['home_team_id'],
    map_duration=data['results_duration'],
    team_kill_diff=data['kill_differential'] # Home's perspective
)[['map_id', 'date_start', 'team_id', 'map_duration', 'team_kill_diff']]


# This creates a new df with the away team's perspective
df_away = data.assign(
    team_id=data['away_team_id'],
    map_duration=data['results_duration'],
    team_kill_diff=-data['kill_differential'] # Flipped for Away's perspective
)[['map_id', 'date_start', 'team_id', 'map_duration', 'team_kill_diff']]


# This stacks them, so a 1000-row 'data' becomes a 2000-row 'team_game_data'
team_game_data = pd.concat([df_home, df_away])

# Sort by date
team_game_data = team_game_data.sort_values('date_start')

In [92]:
team_game_data.head() #we use this in the code below to generate team stats...

Unnamed: 0,map_id,date_start,team_id,map_duration,team_kill_diff
0,946460,2024-04-11 07:02:56.516165+00:00,10697,1234.0,-8.0
0,946460,2024-04-11 07:02:56.516165+00:00,10691,1234.0,8.0
1,946461,2024-04-11 07:39:30.213183+00:00,10691,879.0,-8.0
1,946461,2024-04-11 07:39:30.213183+00:00,10697,879.0,8.0
2,946462,2024-04-11 08:14:51.920556+00:00,10697,673.0,9.0


In [93]:
#dictionary to store new features
historical_stats_dict = {}

#same deal as before, but now we input rolling averages of a teams kill diff and duration
for index, row in data.iterrows():
    
    # Get current match info
    current_date = row['date_start']
    current_match_id = row['match_id']
    home_team_id = row['home_team_id']
    away_team_id = row['away_team_id']
    

    cutoff_date = current_date - pd.DateOffset(days=180)

    #historical stats for home team
    home_hist_slice = team_game_data[
        (team_game_data['team_id'] == home_team_id) & # Filter for this team
        (team_game_data['date_start'] < current_date) &  # Data is in the past
        (team_game_data['date_start'] >= cutoff_date) & # Data is in 6-month window
        (team_game_data['map_id'] != row['map_id'])      # <-- No data leakage


    ]
    
    #historical stats for away team
    away_hist_slice = team_game_data[
        (team_game_data['team_id'] == away_team_id) & 
        (team_game_data['date_start'] < current_date) & 
        (team_game_data['date_start'] >= cutoff_date) & 
        (team_game_data['map_id'] != row['map_id'])  
    ]
    
    historical_stats_dict[row['map_id']] = { 
        'home_hist_kd_6M': home_hist_slice['team_kill_diff'].mean(),
        'home_hist_dur_6M': home_hist_slice['map_duration'].mean(),
        'away_hist_kd_6M': away_hist_slice['team_kill_diff'].mean(),
        'away_hist_dur_6M': away_hist_slice['map_duration'].mean()
    }

features_df = pd.DataFrame.from_dict(historical_stats_dict, orient='index')
features_df = features_df.reset_index().rename(columns={'index': 'map_id'})
data = data.merge(features_df, on='map_id', how='left')

In [94]:
data.columns

Index(['tournament_id', 'tournament_name', 'region', 'tier', 'date_start',
       'match_id', 'map_id', 'map_order', 'prematch_home', 'home_team_id',
       'away_team_id', 'map_win_team_id', 'results_away_kills',
       'results_home_kills', 'results_duration', 'kill_differential',
       'prematch_away', 'rating_differential', 'abs_kill_diff',
       'hist_abs_180D_mean', 'hist_abs_180D_std', 'hist_abs_180D_min',
       'hist_abs_180D_max', 'hist_abs_180D_q25', 'hist_abs_180D_q50',
       'hist_abs_180D_q75', 'hist_dur_180D_mean', 'hist_dur_180D_std',
       'hist_dur_180D_min', 'hist_dur_180D_max', 'hist_dur_180D_q25',
       'hist_dur_180D_q50', 'hist_dur_180D_q75', 'home_hist_kd_6M',
       'home_hist_dur_6M', 'away_hist_kd_6M', 'away_hist_dur_6M'],
      dtype='object')

When I am creating these rolling windows, I usually make many more. Now I prefer to use decay curves and weighted averages. I've found decay functions to be a little better than bagging a bunch of static windows, but I wanted to keep this a little simpler. 

With only a week, it wasn't practical to go with my usual testing of decay curves and time windows. You get the idea.
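To make the decay idea concrete, here is a minimal sketch of an exponentially decayed average, where an observation's weight halves every `half_life_days`. The function name, the 60-day half-life, and the toy values are all assumptions for illustration; this is not the exact weighting I use elsewhere.

```python
import numpy as np
import pandas as pd

def decay_weighted_mean(values, dates, as_of, half_life_days=60.0):
    """Weighted mean of strictly-past values; weights halve every `half_life_days`."""
    values = np.asarray(values, dtype=float)
    age = (pd.Timestamp(as_of) - pd.to_datetime(dates)) / pd.Timedelta(days=1)
    age_days = np.asarray(age, dtype=float)
    past = age_days > 0  # only use observations before `as_of` (no leakage)
    if not past.any():
        return float("nan")
    weights = 0.5 ** (age_days[past] / half_life_days)
    return float(np.sum(weights * values[past]) / np.sum(weights))

# Toy history: two past maps plus one played exactly at `as_of` (must be ignored).
as_of = pd.Timestamp("2024-06-01")
toy_dates = [
    as_of - pd.Timedelta(days=120),
    as_of - pd.Timedelta(days=60),
    as_of,
]
toy_values = [4.0, 8.0, 100.0]

decayed = decay_weighted_mean(toy_values, toy_dates, as_of)
print(decayed)  # weights 0.25 and 0.5 -> (1 + 4) / 0.75
```

Compared with a hard 180-day cutoff, old maps fade out gradually instead of dropping off a cliff, and the half-life becomes the single parameter to tune.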

## Model 2

In [95]:
data['sum_dur'] = data['home_hist_dur_6M'] + data['away_hist_dur_6M'] #new feat, sum of both teams rolling mean duration

#calling model again with new feats
features = ['rating_differential','sum_dur']
model = LinearRegression()
evaluate_model(model = model , features=features, data=data, target='kill_differential')

--- Testing Features: ['rating_differential', 'sum_dur'] ---
CV Scores (RMSE): [-8.89905808 -9.09714769 -8.74134163 -8.773837   -8.21536366]
Mean CV RMSE: -8.7453

------------------------------


#### -8.7812 ---> -8.7453

Sorry if things get messy and repetitive with the code. I am trying to show you visually how I would test.

In [96]:
data.columns #what we have going.

Index(['tournament_id', 'tournament_name', 'region', 'tier', 'date_start',
       'match_id', 'map_id', 'map_order', 'prematch_home', 'home_team_id',
       'away_team_id', 'map_win_team_id', 'results_away_kills',
       'results_home_kills', 'results_duration', 'kill_differential',
       'prematch_away', 'rating_differential', 'abs_kill_diff',
       'hist_abs_180D_mean', 'hist_abs_180D_std', 'hist_abs_180D_min',
       'hist_abs_180D_max', 'hist_abs_180D_q25', 'hist_abs_180D_q50',
       'hist_abs_180D_q75', 'hist_dur_180D_mean', 'hist_dur_180D_std',
       'hist_dur_180D_min', 'hist_dur_180D_max', 'hist_dur_180D_q25',
       'hist_dur_180D_q50', 'hist_dur_180D_q75', 'home_hist_kd_6M',
       'home_hist_dur_6M', 'away_hist_kd_6M', 'away_hist_dur_6M', 'sum_dur'],
      dtype='object')

In [97]:
#calling model again with new feats
features = ['rating_differential','sum_dur','hist_abs_180D_std']
model = make_pipeline(
    StandardScaler(),
    LinearRegression()
) # adding scaling for the different features we are now adding....


evaluate_model(model = model , features=features, data=data, target='kill_differential')

--- Testing Features: ['rating_differential', 'sum_dur', 'hist_abs_180D_std'] ---
CV Scores (RMSE): [-8.96406381 -8.51710277 -8.56680673 -9.00187629 -8.7680716 ]
Mean CV RMSE: -8.7636

------------------------------


Now I try adding other feats one by one, just playing around. The end goal is to have diverse feature sets, feeding each one to its "best fit" model, then taking OOF predictions to LR. No luck here so we move on. I will come back to these versions again and again.
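The "OOF predictions into an LR" idea can be sketched with scikit-learn's `StackingRegressor`, which builds the out-of-fold predictions internally and fits the final linear model on them. This is a swapped-in illustration on synthetic data, not the exact workflow in this notebook: here every base model sees the same features, and `toy_X`, `toy_y`, and the model choices are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): one linear and one non-linear signal plus noise.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(400, 4))
toy_y = 2.0 * toy_X[:, 0] + np.sin(3.0 * toy_X[:, 1]) + rng.normal(size=400)

stack = StackingRegressor(
    estimators=[
        ("linear", make_pipeline(StandardScaler(), LinearRegression())),
        ("forest", RandomForestRegressor(n_estimators=50, random_state=99)),
    ],
    final_estimator=LinearRegression(),  # meta-model fit on out-of-fold predictions
    cv=5,  # inner folds used to build those out-of-fold predictions
)

# Outer CV scores the whole stack, so the meta-model is never evaluated on
# base-model predictions for rows the base models were trained on.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=99)
stack_scores = cross_val_score(
    stack, toy_X, toy_y, cv=outer_cv, scoring="neg_root_mean_squared_error"
)
stack_rmse = -stack_scores.mean()
print(f"Stacked mean CV RMSE: {stack_rmse:.4f}")
```

Wrapping the stack in an outer CV keeps the base models and the meta-model inside the same evaluation loop, which is one way to guard against leakage between the two stages.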

## Creation of RF OOF Predictions

In [98]:
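# With these features, I find an RF is a good starting point; we can of course tune this set to make these preds more powerful.
# NOTE: the original cell sliced `test.columns`, but `test` only exists in a commented-out line above.
# Assumption: slicing `data.columns` instead, which with the column order shown above selects
# 'hist_abs_180D_std' through 'hist_dur_180D_q75'.
rf_feats = list(data.columns)[20:-5]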
X = data[rf_feats]
y = data['kill_differential']


model_rf = RandomForestRegressor(
    n_estimators=100,  
    random_state=99,  
)

cv = KFold(n_splits=5, shuffle=True, random_state=99)


oof_predictions = cross_val_predict(
    estimator=model_rf,
    X=X,
    y=y,
    cv=cv,
)

data['rf_preds'] = oof_predictions

In [49]:
data.columns

Index(['tournament_id', 'tournament_name', 'region', 'tier', 'date_start',
       'match_id', 'map_id', 'map_order', 'prematch_home', 'home_team_id',
       'away_team_id', 'map_win_team_id', 'results_away_kills',
       'results_home_kills', 'results_duration', 'kill_differential',
       'prematch_away', 'rating_differential', 'abs_kill_diff',
       'hist_abs_180D_mean', 'hist_abs_180D_std', 'hist_abs_180D_min',
       'hist_abs_180D_max', 'hist_abs_180D_q25', 'hist_abs_180D_q50',
       'hist_abs_180D_q75', 'hist_dur_180D_mean', 'hist_dur_180D_std',
       'hist_dur_180D_min', 'hist_dur_180D_max', 'hist_dur_180D_q25',
       'hist_dur_180D_q50', 'hist_dur_180D_q75', 'home_hist_kd_6M',
       'home_hist_dur_6M', 'away_hist_kd_6M', 'away_hist_dur_6M', 'sum_dur',
       'rf_preds'],
      dtype='object')

In [63]:
#calling model again with new OOF preds
features = ['rating_differential','sum_dur','rf_preds']

model = make_pipeline(
    StandardScaler(),
    LinearRegression()
) # adding scaling for the different features we are now adding....


evaluate_model(model = model , features=features, data=data, target='kill_differential')

--- Testing Features: ['rating_differential', 'sum_dur', 'rf_preds'] ---
CV Scores (RMSE): [-8.84924525 -9.11120555 -8.71216334 -8.7346379  -8.23354314]
Mean CV RMSE: -8.7282

------------------------------


#### -8.7453 ---> -8.7282

In [64]:
base_model_pipeline = make_pipeline(
    StandardScaler(),
    LinearRegression()
)

bagging_model = BaggingRegressor(
    estimator=base_model_pipeline,
    n_estimators=20,   
    random_state=99,
    max_samples = .90,
    max_features = 1.0,
)

# --- 3. Define Feature List ---
features = ['rating_differential', 'rf_preds', 'sum_dur',]


evaluate_model(model=bagging_model, features=features, data=data,target='kill_differential')

--- Testing Features: ['rating_differential', 'rf_preds', 'sum_dur'] ---
CV Scores (RMSE): [-8.85488123 -9.11341929 -8.71138789 -8.72693628 -8.2366708 ]
Mean CV RMSE: -8.7287

------------------------------


No luck with bagging, but it usually works with more features.



This is where I would then return to feature engineering, working my way through this pipeline again and again: adding new model OOF preds, creating new features, combining them, removing them, and placing them into different models, usually all to end with a bagged LR. When the features get very numerous I'd use algos to build feature lists. There are some very cool auto-ML and auto-ensemble packages I've been wanting to test. All the while I am also looking for signs of overfitting, data leakage, etc.



This is something that is hard to show in a week of work, but I think this gives a good idea of the approach. The process would remain the same, escalating in complexity over the next weeks. With a few more weeks, I'd spend time watching the sport and researching. I have no doubt that further work would lead to better scores.

## Creation of New Datasets

I felt really limited by my lack of instances and features, and really could not find a serious database, or even the ability to scrape one. 



So I had an idea to pull match data from YouTube streams. In retrospect, this was a bit too large a project to complete in the time allowed, but I am happy with it. However... it worked until YouTube flagged my cloud account and I lost my data. 



Here was how it worked. I spent a lot of energy on this, I thought it was worth it, even though I did not get to use it. I think it is a solid idea to pursue.



This is where my edge has come from over the years. Doing the things that others would consider not worth it or a pain in the butt. Which this was, but data is really the easiest and biggest way to make an impact.

## YouTube to Detailed Data Pipeline

I will not get into too much detail, but this is basically how it worked. 



### 2024 fixtures from Liquipedia - All S, A, and B tier tournament fixtures.

* contains basic fixture data, plus hero selections. Around 6000 solid fixtures now.

* This serves as the skeleton to fill



### Find the streams for every tournament

* easier than it sounds. Found over a thousand YouTube URLs from the 2024 tournaments



### Download youtube videos and extract "still" Frames

* frames are graded on pixel motion, so it's easy to extract what we need. 
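Grading frames on pixel motion can be sketched as a mean absolute difference between consecutive frames, keeping the frames that barely change. This is a toy illustration with NumPy arrays standing in for decoded video; the function names, the threshold, and the 8x8 grayscale clip are assumptions, and the real pipeline's video decoding is not shown.

```python
import numpy as np

def motion_scores(frames):
    """Mean absolute pixel change between each frame and the one before it."""
    frames = np.asarray(frames, dtype=np.float32)
    diffs = np.abs(frames[1:] - frames[:-1])
    return diffs.reshape(len(diffs), -1).mean(axis=1)

def still_frame_indices(frames, threshold=1.0):
    """Indices of frames that changed less than `threshold` from the previous frame."""
    scores = motion_scores(frames)
    return [i + 1 for i, score in enumerate(scores) if score < threshold]

# Toy clip: frames 0-2 are random noise ("motion"), frames 3-5 are one repeated
# image, like a static end-of-map stat screen.
rng = np.random.default_rng(0)
moving = rng.integers(0, 256, size=(3, 8, 8))
static = np.repeat(rng.integers(0, 256, size=(1, 8, 8)), 3, axis=0)
clip = np.concatenate([moving, static])

still = still_frame_indices(clip)
print(still)  # frames 4 and 5 repeat frame 3, so they score zero motion
```

Frame 3 itself is not flagged because it differs from the noisy frame before it; in practice keeping one frame from each still run is enough to pass along to the classifier.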



### Frames are sent to a CNN classifier

* removes 90% of junk screens to save money



### Final map stat screens are sent to a Gemini model

* stat screens are parsed, and JSON is saved with the team gold etc...



### YouTube data is matched to Liquipedia to produce the dataset

Since that did not work out, I decided to work with the Liquipedia dataset and have some fun results modeling duration. See the next notebook.