### Liquipedia Data Duration Modeling


Since I could not get the dataset I wanted, I moved on to modeling map duration, which was actually very fun. I got to use some models I had wanted to try but never had. I think this is a great market to model. I wonder what the margins are here and how good the market is at predicting these durations.

Liquipedia had lots of instances, along with hero data, side, etc. Perfect for what I had in mind.

In [1]:
import pandas as pd
import numpy as np
import json
from rankit.Table import Table
from rankit.Ranker import EloRanker
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pd.set_option('display.max_columns', None)

In [2]:
%run ModelHelpers.ipynb # this runs the help notebook to make things a bit cleaner here.

## Loading and Cleaning

In [4]:
# Pulled 2023 and 2024 data. I feel that for duration I needed more data.
# Duration feels like it is more a function of playstyle and game mechanics than of team rating difference.

mlbb2024 = pd.read_csv("mlbb_2024_clean.csv")
side_data = pd.read_csv('mlbb_2024_side_data.csv')
mlbb2024 = mlbb2024.merge(side_data, on='map_id', how='left')

mlbb2023 = pd.read_csv("mlbb_2023_clean.csv")
side_data = pd.read_csv('mlbb_2023_side_data.csv')
mlbb2023 = mlbb2023.merge(side_data, on='map_id', how='left')

df = pd.concat([mlbb2023,mlbb2024]).reset_index(drop=True)
df['match_date'] = pd.to_datetime(df['match_date'])

df.head()

Unnamed: 0,match_id,map_id,game_number,tournament_name,tournament_url,tier,tournament_stage,liquipedia_path,match_date,team1,team2,winner,duration,team1_picks,team1_bans,team2_picks,team2_bans,team1_side,team2_side
0,MATCH5ebf6583a5,MAP7466b3e7f36b,2,WSL Season 6,https://liquipedia.net/mobilelegends/WSL/Season_6,B-Tier,Regular Season,WSL/Season 6/Regular Season,2023-01-17,Foes Win,GPX Basreng,2,12m 56s,"hilda, bene, lylia, brody, lolita","wan, joy, gloo, estes, martis","lapu, akai, yve, karrie, atlas","aamon, fred, haya, ling, gs",red,blue
1,MATCH5ebf6583a5,MAP76eb4613f945,1,WSL Season 6,https://liquipedia.net/mobilelegends/WSL/Season_6,B-Tier,Regular Season,WSL/Season 6/Regular Season,2023-01-17,Foes Win,GPX Basreng,2,13m 19s,"masha, barats, cecilion, karrie, diggie","wan, haya, joy, valen, fara","fred, ling, xavier, brody, lolita","gloo, yve, aamon, akai, martis",blue,red
2,MATCH85626e8803,MAPb668e4b7d592,2,WSL Season 6,https://liquipedia.net/mobilelegends/WSL/Season_6,B-Tier,Regular Season,WSL/Season 6/Regular Season,2023-01-17,Aura Phoenix,Bigetron Era,2,9m 6s,"yz, alpha, faramis, bea, mathil","karrie, gloo, yve, claude, estes","grock, fred, pharsa, bruno, lolita","wan, joy, haya, martis, barats",red,blue
3,MATCH85626e8803,MAP84a6a2dc8f24,1,WSL Season 6,https://liquipedia.net/mobilelegends/WSL/Season_6,B-Tier,Regular Season,WSL/Season 6/Regular Season,2023-01-17,Aura Phoenix,Bigetron Era,2,11m 35s,"fred, ling, xavier, melis, grock","karrie, haya, mathil, pharsa, lolita","joy, martis, kadita, brody, chou","wan, gloo, yve, bea, kaja",blue,red
4,MATCHdb73da8857,MAP08d334f9487f,1,WSL Season 6,https://liquipedia.net/mobilelegends/WSL/Season_6,B-Tier,Regular Season,WSL/Season 6/Regular Season,2023-01-18,Tiger Wong Seiren,RRQ Mika,1,29m 10s,"lapu, akai, valen, karrie, atlas","joy, haya, faramis, ling, lolita","bene, fred, xavier, brody, diggie","wan, yve, gloo, kaja, mathil",blue,red


In [5]:
#fixing time formatting
df['duration'] = df['duration'].str.replace(r'h', ' hour ')
df['duration'] = df['duration'].str.replace(r'm', ' min ')
df['duration'] = df['duration'].str.replace(r's', ' sec ')

#fixing ill-formatted times
df.loc[610,'duration'] = '13 min 4 sec'
df.loc[1149,'duration'] = '22 min 16 sec'
df.loc[2645,'duration'] = '14 min 54 sec'
df.loc[2644,'duration'] = '14 min 54 sec'
df.loc[6497,'duration'] = '18 min 22 sec'
df.loc[6498,'duration'] = '14 min 44 sec'
df.loc[6499,'duration'] = '16 min 47 sec'
df.loc[6500,'duration'] = '16 min 00 sec'
df.loc[6501,'duration'] = '10 min 28 sec'
df.loc[9895,'duration'] = '14 min 19 sec'
df.loc[9988,'duration'] = '13 min 14 sec'
df.loc[4986,'duration'] = '12 min 00 sec'

df.drop(1637,inplace=True)

#fixing the side data (blue/red)
df.dropna(subset=['team1_side','team2_side'],inplace=True)  # check both side columns (team1_side was listed twice)
df['team2_side'] = df['team2_side'].replace('bkye','blue')
df['team1_side'] = df['team1_side'].replace('b;ue','blue')
df = df.reset_index(drop=True)

# convert the duration strings to total seconds
df['duration'] = pd.to_timedelta(df['duration']).dt.total_seconds()
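
The hand-patched rows above could also be found programmatically. Below is a minimal sketch, assuming the raw strings look like `12m 56s` or `1h 2m 3s`; the helper name `duration_to_seconds` and the sample values are illustrative and not part of the original notebook. Anything that does not match the pattern becomes NaN, so it can be listed and reviewed by hand.

```python
import re
import pandas as pd

# Optional hours, minutes and seconds, e.g. "12m 56s" or "1h 2m 3s".
_DURATION_RE = re.compile(r"^\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?\s*(?:(\d+)\s*s)?\s*$")

def duration_to_seconds(text):
    """Convert a Liquipedia-style duration string to seconds; NaN if malformed."""
    if not isinstance(text, str):
        return float("nan")
    match = _DURATION_RE.match(text)
    # An empty string matches the pattern with no groups filled, so reject it too.
    if match is None or not any(match.groups()):
        return float("nan")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return float(hours * 3600 + minutes * 60 + seconds)

# Illustrative values only; the last two are deliberately malformed.
sample = pd.Series(["12m 56s", "9m 6s", "1h 2m 3s", "12:56", None])
parsed = sample.map(duration_to_seconds)

# Rows that failed to parse and need a manual fix.
bad_rows = sample[parsed.isna()]
print(parsed.tolist())
print(bad_rows.index.tolist())
```

Listing `bad_rows` first makes the manual corrections easier to audit than hard-coded row numbers, which shift whenever the source CSVs change.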

In [6]:
# I made a team mapping file to deal with team naming issues: lots of misspellings and other problems, but I sorted it out.
# I could improve things by tracking players; I noticed that teams often combined, which is strange.
with open('team_name_mappings.json', 'r', encoding='utf-8') as f:
    team_mapping = json.load(f)

# Apply mapping to team columns
df['team1'] = df['team1'].map(lambda x: team_mapping.get(x, x))
df['team2'] = df['team2'].map(lambda x: team_mapping.get(x, x))

df['team1'] = df['team1'].str.replace('fnatic onic ph','Fnatic ONIC')# missed these ;)
df['team2'] = df['team2'].str.replace('fnatic onic ph','Fnatic ONIC')

In [7]:
# Load hero mapping; again, the hero names were not standardized either
with open('hero_name_mappings.json', 'r', encoding='utf-8') as f:
    hero_mapping = json.load(f)

# Apply mapping to hero columns (splits comma-separated strings)
for col in ['team1_picks', 'team1_bans', 'team2_picks', 'team2_bans']:
    df[col] = df[col].map(lambda x: ', '.join([hero_mapping.get(h.strip().lower(), h.strip().lower()) 
                                                for h in str(x).split(',')]) if pd.notna(x) else x)

## Team Rating Creation

In [8]:
#each teams "score" for the map, for calculating ELO
df['team1_map_winner'] = np.where(df['winner']==1,1,0)
df['team2_map_winner'] = np.where(df['winner']==2,1,0)

# days column so we can loop through the data and assign Elo ratings based on the last 6 months of play
df['days'] = df['match_date'].sub(df['match_date'].min()).dt.total_seconds()/60/60//24 +1
df['team1_elo'] = np.nan
df['team2_elo'] = np.nan

I wanted some sort of team rating, so I used Elo, although I could have gone with a more advanced method. I am currently working with a custom TrueSkill 2 rating system, which is really fun but maybe better suited to pre-match modeling. Ask me about it.
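
For reference, the textbook Elo update can be sketched in a few lines. This is the standard formula, not necessarily what `rankit`'s `EloRanker` does internally (its baseline rating and scale defaults may differ); the function names and the 1500 starting rating are assumptions for illustration. `k=89` matches the K value used in this notebook.

```python
def elo_expected(rating_a, rating_b, scale=400.0):
    """Expected score of A against B under the standard logistic Elo curve."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))

def elo_update(rating_a, rating_b, score_a, k=89.0):
    """Return the updated (rating_a, rating_b) after one map.

    score_a is 1.0 if A won the map and 0.0 if A lost.
    """
    expected_a = elo_expected(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta

# Two equally rated teams; team A wins the map.
new_a, new_b = elo_update(1500.0, 1500.0, score_a=1.0)
print(new_a, new_b)
```

With equal ratings the expected score is 0.5, so a win moves each team by half of K. A large K makes the ratings react quickly to recent results.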

In [9]:
# I usually begin with at least 3 months of data for the first Elo ratings and then drop the nulls when training, but there is no real reason to here.
# I use a great ratings package for Elo here: rankit. A great book goes along with it.
# Note: I would also have used Whole History Rating (WHR), but it is slower and Elo is a good baseline.
for d in df['days'].loc[4:].unique():
    cut_off = d - 180
    six_months = df[(df['days']<d)&(df['days']>cut_off)]
    sm_table = Table(six_months, col = ['team1', 'team2', 'team1_map_winner', 'team2_map_winner'])
    
    eloRanker = EloRanker(K=89)#faster moving K value works better for esports
    eloRanker.update(sm_table)
    eloRank = eloRanker.leaderboard()
    
    rating = eloRank.set_index('name')['rating']
    current_day_index = df[df['days'] == d].index
    
    team1_elos = df.loc[current_day_index, 'team1'].map(rating)
    team2_elos = df.loc[current_day_index, 'team2'].map(rating)

    df.loc[current_day_index, 'team1_elo'] = team1_elos
    df.loc[current_day_index, 'team2_elo'] = team2_elos

In [10]:
#elo accuracy for fun
df['elo_diff'] = df['team1_elo'] - df['team2_elo']
df['map_winner'] = np.where(df['winner'] == 1,1,0)

test = df.dropna()

X = test[['elo_diff']]
y = test['map_winner']

model = LogisticRegression()

cv_scores = cross_val_score(
    estimator=model,
    X=X,
    y=y,
    cv=5,              
    scoring='accuracy'  
)

print(f"--- 5-Fold Cross-Validation Scores (Accuracy) ---")
print(cv_scores)
print(f"\nMean CV Accuracy: {np.mean(cv_scores):.4f}")

--- 5-Fold Cross-Validation Scores (Accuracy) ---
[0.61722956 0.61135585 0.6165524  0.61018609 0.625857  ]

Mean CV Accuracy: 0.6162


Not as good as you guys ;), but it could be improved with WHR, or by using multiple K values and bagging. Obviously data like gold etc. could improve things, but it is a good baseline, as I said before. I think I could get to around 65% accuracy.

In [11]:
df.columns

Index(['match_id', 'map_id', 'game_number', 'tournament_name',
       'tournament_url', 'tier', 'tournament_stage', 'liquipedia_path',
       'match_date', 'team1', 'team2', 'winner', 'duration', 'team1_picks',
       'team1_bans', 'team2_picks', 'team2_bans', 'team1_side', 'team2_side',
       'team1_map_winner', 'team2_map_winner', 'days', 'team1_elo',
       'team2_elo', 'elo_diff', 'map_winner'],
      dtype='object')

## Further FE

In [12]:
#same thing as the other notebook, calculating population data, two windows this time

historical_stats_dict = {}
window_list = [90, 180]


for index, row in df.iterrows():
    
    current_date = row['match_date']
    current_match_id = row['match_id']
    
    
    current_map_stats = {}
    
    # getting 90 and 180 days this time
    for window_days in window_list:
        
        cutoff_date = current_date - pd.DateOffset(days=window_days)
        
        
        historical_data = df[
            (df['match_date'] < current_date) &      # Data is in the past
            (df['match_date'] >= cutoff_date) &     # Data is within the window
            (df['match_id'] != current_match_id)  # EXCLUDE other maps from the current match
        ]
        
        
        stats_duration = historical_data['duration'].describe()
            
        # Add the stats for the current window to this map's dictionary
        current_map_stats.update({
            f'hist_dur_{window_days}D_mean': stats_duration['mean'],
            f'hist_dur_{window_days}D_std': stats_duration['std'],
            f'hist_dur_{window_days}D_min': stats_duration['min'],
            f'hist_dur_{window_days}D_max': stats_duration['max'],
            f'hist_dur_{window_days}D_q25': stats_duration['25%'],
            f'hist_dur_{window_days}D_q50': stats_duration['50%'],
            f'hist_dur_{window_days}D_q75': stats_duration['75%']
        })

    #Store all collected stats for this map_id
    historical_stats_dict[row['map_id']] = current_map_stats


# This part is the same as before.
features_df = pd.DataFrame.from_dict(historical_stats_dict, orient='index')
features_df = features_df.reset_index().rename(columns={'index': 'map_id'})
df = df.merge(features_df, on='map_id', how='left')
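
The row-by-row filter above rescans the whole frame for every map. A faster way to get the same window mean is sketched below, assuming all maps of a match share one `match_date` (so the `match_id` exclusion is already covered by excluding the current day). The function name `rolling_mean_before` and the demo values are illustrative. It aggregates per day first, then uses a time-based rolling window with `closed="left"`, which covers `[date - window, date)` and so leaves out the current day.

```python
import pandas as pd

def rolling_mean_before(df, value_col="duration", date_col="match_date", window="90D"):
    """Mean of value_col over the `window` before each row's date, current day excluded."""
    # One row per calendar day: total duration and number of maps.
    daily = df.groupby(date_col)[value_col].agg(["sum", "count"]).sort_index()
    # closed="left" keeps [date - window, date), i.e. strictly earlier days only.
    rolled = daily.rolling(window, closed="left").sum()
    daily_mean = rolled["sum"] / rolled["count"]
    # Broadcast the per-day value back to every map played on that day.
    return df[date_col].map(daily_mean)

# Illustrative data only.
demo = pd.DataFrame({
    "match_date": pd.to_datetime(["2023-01-01", "2023-01-01", "2023-01-02", "2023-05-01"]),
    "duration": [600.0, 800.0, 900.0, 1000.0],
})
demo["hist_dur_90D_mean"] = rolling_mean_before(demo)
print(demo)
```

This only covers the mean. The standard deviation could be built the same way from a rolling sum of squares, but the quantiles still need the raw rows.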

In [13]:
#same thing as the other notebook, calculating team data, two windows this time

historical_stats_dict = {}
window_list = [90, 180] 

for index, row in df.iterrows():
    
    # Get current match info
    current_date = row['match_date']
    current_match_id = row['match_id']
    team1_name = row['team1']
    team2_name = row['team2']
    
    
    current_map_stats = {}
    
    
    for window_days in window_list:
        
        
        cutoff_date = current_date - pd.DateOffset(days=window_days)
        
        
        # This gets all maps played by any team in the window
        # It excludes maps from the current match to prevent data leakage
        historical_data_all = df[
            (df['match_date'] < current_date) &      # Data is in the past
            (df['match_date'] >= cutoff_date) &     # Data is within the window
            (df['match_id'] != current_match_id) 
        ]
        
        
        # Find all historical maps where 'team1' played
        team1_hist_slice = historical_data_all[
            (historical_data_all['team1'] == team1_name) | 
            (historical_data_all['team2'] == team1_name)
        ]
        
        # Find all historical maps where 'team2' played
        team2_hist_slice = historical_data_all[
            (historical_data_all['team1'] == team2_name) | 
            (historical_data_all['team2'] == team2_name)
        ]
        
        # adding each team's mean historical duration to the current map's stats
        current_map_stats[f'team1_hist_dur_{window_days}D'] = team1_hist_slice['duration'].mean()
        current_map_stats[f'team2_hist_dur_{window_days}D'] = team2_hist_slice['duration'].mean()


    # After loop, add this map's stats to the main dictionary
    historical_stats_dict[row['map_id']] = current_map_stats



# Convert the dictionary of stats into a DataFrame
features_df = pd.DataFrame.from_dict(historical_stats_dict, orient='index')

# Make 'map_id' (which is the dict key) a regular column
features_df = features_df.reset_index().rename(columns={'index': 'map_id'})

# Merge the new historical features back into your main df
df = df.merge(features_df, on='map_id', how='left')

In [14]:
## Feature importance, getting ready for RF OOF preds

I never normally use this, but why not look at the feature importances for the features we just built.

In [15]:
df['sum_90D'] = df['team1_hist_dur_90D'] + df['team2_hist_dur_90D']#creating our sum feat from the other notebook
df['sum_180D'] = df['team1_hist_dur_180D'] + df['team2_hist_dur_180D']#creating our sum feat from the other notebook

feature_cols = list(df.columns[26:]) #all the feats we just generated.
all_cols_to_check = feature_cols + ['duration'] #cols to make sure are clean, although not really needed for RF

test = df.dropna(subset=all_cols_to_check).copy()

X = test[feature_cols]
y = test['duration']


model = RandomForestRegressor(
    n_estimators=100,  
    random_state=99, #red balloons
)

model.fit(X, y)

importances = model.feature_importances_

feature_scores = pd.Series(importances, index=feature_cols)

feature_scores = feature_scores.sort_values(ascending=False)

#for you guys
print("--- Random Forest Feature Importances ---")
print(feature_scores) # these scores partly reflect feature interactions, but I think it is clear that the sum of rolling durations is decent

--- Random Forest Feature Importances ---
sum_90D                0.122989
sum_180D               0.122574
team2_hist_dur_180D    0.098420
team1_hist_dur_90D     0.097620
team2_hist_dur_90D     0.094596
team1_hist_dur_180D    0.093926
hist_dur_90D_std       0.055437
hist_dur_180D_std      0.052427
hist_dur_180D_q75      0.035606
hist_dur_90D_mean      0.035084
hist_dur_180D_mean     0.032849
hist_dur_90D_q75       0.032800
hist_dur_90D_q25       0.029738
hist_dur_90D_q50       0.025863
hist_dur_180D_q50      0.023433
hist_dur_180D_q25      0.020054
hist_dur_90D_max       0.008319
hist_dur_90D_min       0.007989
hist_dur_180D_min      0.005423
hist_dur_180D_max      0.004853
dtype: float64
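
Impurity-based importances like the ones above are computed on the training data and tend to be split between correlated features, which may be why `sum_90D` and `sum_180D` score almost identically. Permutation importance on held-out rows is a common cross-check. The sketch below is a swapped-in technique, not what the notebook ran, and it uses synthetic data with made-up column names purely to show the call.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: one informative feature and one pure-noise feature.
rng = np.random.default_rng(99)
n_rows = 400
X_demo = pd.DataFrame({
    "informative": rng.normal(size=n_rows),
    "noise": rng.normal(size=n_rows),
})
y_demo = 3.0 * X_demo["informative"] + rng.normal(scale=0.5, size=n_rows)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=99
)
rf = RandomForestRegressor(n_estimators=100, random_state=99).fit(X_train, y_train)

# Shuffle each column on the held-out rows and measure how much the RMSE degrades.
result = permutation_importance(
    rf, X_test, y_test,
    n_repeats=10, random_state=99, scoring="neg_root_mean_squared_error",
)
perm_scores = pd.Series(result.importances_mean, index=X_demo.columns)
print(perm_scores.sort_values(ascending=False))
```

On the real data, `X_demo` and `y_demo` would be replaced by `test[feature_cols]` and `test['duration']`.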


## Baseline Model

In [16]:
#figure I'd use this as a baseline to work from
features = ['sum_90D']
model = LinearRegression()
evaluate_model(model=model, features=features, data=df, target='duration')

CV Scores (RMSE): [-279.47514643 -265.91317409 -270.48578163 -303.69355137 -273.85870685]
Mean CV RMSE: -278.6853
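
`evaluate_model` lives in `ModelHelpers.ipynb` and is not shown here. The negative numbers are consistent with scikit-learn's `neg_root_mean_squared_error` scorer, which returns RMSE with the sign flipped so that higher is always better. A hypothetical stand-in, with the name `evaluate_model_sketch`, the NaN handling and the fold setup all assumed, might look like this:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

def evaluate_model_sketch(model, features, data, target, n_splits=5, seed=99):
    """Hypothetical stand-in for evaluate_model: K-fold CV scored by negated RMSE."""
    # Assumption: rows with missing features or target are dropped first.
    clean = data.dropna(subset=features + [target])
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(
        model, clean[features], clean[target],
        cv=cv, scoring="neg_root_mean_squared_error",
    )
    print(f"CV Scores (RMSE): {scores}")
    print(f"Mean CV RMSE: {scores.mean():.4f}")
    return scores

# Synthetic data just to exercise the function.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"sum_90D": rng.normal(1800.0, 120.0, size=200)})
demo["duration"] = 0.5 * demo["sum_90D"] + rng.normal(0.0, 60.0, size=200)
scores = evaluate_model_sketch(LinearRegression(), ["sum_90D"], demo, "duration")
```

Read the reported values by magnitude: a mean of -278.7 is an RMSE of about 279 seconds, so a smaller absolute value is better.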



In [17]:
feature_cols[:-1]

['hist_dur_90D_mean',
 'hist_dur_90D_std',
 'hist_dur_90D_min',
 'hist_dur_90D_max',
 'hist_dur_90D_q25',
 'hist_dur_90D_q50',
 'hist_dur_90D_q75',
 'hist_dur_180D_mean',
 'hist_dur_180D_std',
 'hist_dur_180D_min',
 'hist_dur_180D_max',
 'hist_dur_180D_q25',
 'hist_dur_180D_q50',
 'hist_dur_180D_q75',
 'team1_hist_dur_90D',
 'team2_hist_dur_90D',
 'team1_hist_dur_180D',
 'team2_hist_dur_180D',
 'sum_90D']

## Creation of RF OOF Predictions

In [18]:
# generate some RF preds; a future model could use XGBoost or some other model
X = df[feature_cols[:-1]]# every feature except sum_180D (the last entry); sum_90D stays in
y = df['duration']

model_rf = RandomForestRegressor(
    n_estimators=100, 
    random_state=99,  
)

cv = KFold(n_splits=5, shuffle=True, random_state=99)

oof_predictions = cross_val_predict(
    estimator=model_rf,
    X=X,
    y=y,
    cv=cv,
)

df['rf_pred'] = oof_predictions

In [19]:
print("--- Iteration 2: Stacked LR (Sum + RF OOF + ELO) ---")
features_v2 = ['sum_90D','rf_pred','elo_diff']
model = make_pipeline(StandardScaler(), LinearRegression())

# df carries the 'rf_pred' column from the cell above
evaluate_model(model=model, features=features_v2, data=df, target='duration')

--- Iteration 2: Stacked LR (Sum + RF OOF + ELO) ---
CV Scores (RMSE): [-277.59300092 -265.2461245  -269.61551957 -302.7984314  -272.87817541]
Mean CV RMSE: -277.6263



#### -278.6853 --> -277.6263

## Embedding Model







This is my first time working with this type of model, although it has been on my shelf for a while and I have been planning to use it for my personal modeling. I like to try models quickly first; if I see potential, then I dive deeper, unlocking complexity and new potential. I built this in Cursor with the help of AI, which I treat like working with a more technical subordinate.

The architecture and ideas are mine, but I use AI to prototype quickly. If I were to continue working on this model, I would dig deeper and read long-form research on this type of model to use it better.

The code is all commented by AI; however, I have looked over every line myself and understand it fully. Feel free to ask me.

In [20]:
# 1. Concatenate all hero string columns
all_heroes_raw = pd.concat([
    df['team1_picks'], 
    df['team2_picks'], 
    df['team1_bans'], 
    df['team2_bans']
])

# 2. Split strings into lists, explode, and strip whitespace
all_heroes = all_heroes_raw \
    .str.split(',') \
    .explode() \
    .str.strip() # turns " bene" and "lolita\n" into "bene" and "lolita"

# 3. Create the "lookup maps" (dictionaries)
# We add +1 to reserve 0 for "padding" or "unknown"
team_map = {name: i+1 for i, name in enumerate(pd.concat([df['team1'], df['team2']]).unique())}
hero_map = {name: i+1 for i, name in enumerate(all_heroes.unique())}
side_map = {'blue': 1, 'red': 2} # 0 will be our padding/unknown

# --- 4. Apply the maps to your DataFrame ---

# Convert team names to their new IDs
df['team1_id'] = df['team1'].map(team_map)
df['team2_id'] = df['team2'].map(team_map)

# Convert side to its ID (assuming these column names)
df['team1_side_id'] = df['team1_side'].map(side_map).fillna(0)
df['team2_side_id'] = df['team2_side'].map(side_map).fillna(0)

# Helper function to turn a list of hero names into a list of hero IDs
def map_hero_list(hero_string):
    if not isinstance(hero_string, str):
        return [] # Handle NaNs
    
    # Split the string and clean it, just like we did above
    hero_list = [name.strip() for name in hero_string.split(',')]
    
    # Map to IDs
    return [hero_map.get(hero, 0) for hero in hero_list] # 0 = unknown hero

df['team1_picks_ids'] = df['team1_picks'].apply(map_hero_list)
df['team2_picks_ids'] = df['team2_picks'].apply(map_hero_list)
df['team1_bans_ids'] = df['team1_bans'].apply(map_hero_list)
df['team2_bans_ids'] = df['team2_bans'].apply(map_hero_list)

# 5. Store the "vocab size" (total count)
n_teams = len(team_map) + 1 
n_heroes = len(hero_map) + 1 # This number should now be reasonable (e.g., 120-130)
n_sides = 3 

print(f"Total unique teams (vocab size): {n_teams}")
print(f"Total unique heroes (vocab size): {n_heroes}") # <-- Check this number!
print(f"Total unique sides (vocab size): {n_sides}")

Total unique teams (vocab size): 740
Total unique heroes (vocab size): 133
Total unique sides (vocab size): 3


In [21]:
# --- CONFIGURATION ---
N_SPLITS = 5
N_EPOCHS = 10 
BATCH_SIZE = 64
LEARNING_RATE = 0.001
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {DEVICE}")

TARGET_COLUMN = 'duration' 

# --- INITIALIZATION ---
gkf = GroupKFold(n_splits=N_SPLITS)
oof_preds = np.zeros(len(df))
oof_indices = []

# --- 3. Start the K-Fold Loop ---
groups = df['match_id']

for fold, (train_idx, val_idx) in enumerate(gkf.split(df, groups=groups)):
    print("-" * 30)
    print(f"--- FOLD {fold + 1}/{N_SPLITS} ---")
    print("-" * 30)
    
    # 1. Create DataLoaders
    train_df = df.iloc[train_idx]
    val_df = df.iloc[val_idx]
    
    train_dataset = MatchDataset(train_df, target_column=TARGET_COLUMN)
    val_dataset = MatchDataset(val_df, target_column=TARGET_COLUMN)
    
    train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)
    
    model = SimpleModel(n_teams, n_heroes, n_sides).to(DEVICE)
    
    # 3. Define Loss Function and Optimizer
    criterion = nn.MSELoss()  
    optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
    
    best_val_mae = float('inf') # Set to infinity so any error is better
    best_model_state = None

    # 4. Start Training Loop for this model
    for epoch in range(N_EPOCHS):
        train_loss = train_loop(model, train_loader, criterion, optimizer, DEVICE)
        
        val_preds, val_targets = get_predictions(model, val_loader, DEVICE)
        
        #Calculate regression metrics ---
        val_mae = mean_absolute_error(val_targets, val_preds)
        val_rmse = np.sqrt(mean_squared_error(val_targets, val_preds))
        
        print(f"Epoch {epoch+1:02d} | Train Loss (MSE): {train_loss:.4f} | Val MAE: {val_mae:.4f} | Val RMSE: {val_rmse:.4f}")
        
        #Save model with the lowest MAE
        if val_mae < best_val_mae:
            best_val_mae = val_mae
            best_model_state = copy.deepcopy(model.state_dict())

    # 5. Get OOF predictions from the best model
    print(f"Best Val MAE for Fold {fold+1}: {best_val_mae:.4f}")
    model.load_state_dict(best_model_state) 
    
    oof_preds_fold, _ = get_predictions(model, val_loader, DEVICE)
    
    # 6. Store OOF predictions
    oof_preds[val_idx] = oof_preds_fold
    oof_indices.extend(val_idx)

#Final OOF Score and Saving to Series ---

#Calculate total OOF regression score ---
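# oof_preds is in df row order while the targets are reordered by oof_indices,
# so index the predictions the same way before scoring
total_oof_targets = df[TARGET_COLUMN].iloc[oof_indices]
total_oof_mae = mean_absolute_error(total_oof_targets, oof_preds[oof_indices])
total_oof_rmse = np.sqrt(mean_squared_error(total_oof_targets, oof_preds[oof_indices]))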

print("\n" + "=" * 30)
print(f"TOTAL OOF MAE: {total_oof_mae:.5f}")
print(f"TOTAL OOF RMSE: {total_oof_rmse:.5f}")
print("=" * 30)

# Create a new pandas Series for the OOF predictions
oof_pred_series = pd.Series(
    oof_preds,
    index=df.index,
    name='oof_pred_simple_model_duration' # Changed name
)

print("\nOOF duration predictions saved to a new Series 'oof_pred_series'.")
print(oof_pred_series.head())

Using device: cuda
------------------------------
--- FOLD 1/5 ---
------------------------------
Epoch 01 | Train Loss (MSE): 733686.7885 | Val MAE: 237.7484 | Val RMSE: 300.6569
Epoch 02 | Train Loss (MSE): 123591.4273 | Val MAE: 218.8891 | Val RMSE: 288.5551
Epoch 03 | Train Loss (MSE): 119817.9876 | Val MAE: 214.6337 | Val RMSE: 283.5437
Epoch 04 | Train Loss (MSE): 118679.0078 | Val MAE: 212.9408 | Val RMSE: 279.4671
Epoch 05 | Train Loss (MSE): 115560.2436 | Val MAE: 211.1536 | Val RMSE: 277.5276
Epoch 06 | Train Loss (MSE): 110819.6746 | Val MAE: 209.2192 | Val RMSE: 276.5268
Epoch 07 | Train Loss (MSE): 112725.2691 | Val MAE: 210.7617 | Val RMSE: 274.2731
Epoch 08 | Train Loss (MSE): 110351.5127 | Val MAE: 207.7843 | Val RMSE: 274.3021
Epoch 09 | Train Loss (MSE): 108018.8015 | Val MAE: 209.1261 | Val RMSE: 272.8521
Epoch 10 | Train Loss (MSE): 107423.1145 | Val MAE: 210.2384 | Val RMSE: 272.2852
Best Val MAE for Fold 1: 207.7843
------------------------------
--- FOLD 2/5 ---
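
The loop above splits with `GroupKFold` on `match_id`, so maps from one match never land on both sides of a fold. That matters here because maps in the same series share teams, date and often similar drafts. The toy frame below, with made-up IDs and durations, simply demonstrates that property.

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

# Toy data: five matches with one to three maps each (values are illustrative).
demo = pd.DataFrame({
    "match_id": ["A", "A", "A", "B", "B", "C", "C", "D", "E", "E"],
    "duration": [776, 799, 810, 546, 695, 1750, 900, 840, 1020, 960],
})

gkf = GroupKFold(n_splits=5)
overlaps = []
for train_idx, val_idx in gkf.split(demo, groups=demo["match_id"]):
    train_matches = set(demo["match_id"].iloc[train_idx])
    val_matches = set(demo["match_id"].iloc[val_idx])
    # Any match appearing on both sides would be leakage.
    overlaps.append(train_matches & val_matches)

print(overlaps)
```

Note that the random forest OOF predictions earlier use a shuffled `KFold` instead, so maps from the same match can end up in both train and validation there.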


In [23]:
df['embed_oof'] = oof_pred_series
print("--- Iteration 3: Stacked LR (All features + Simple Embed OOF) ---")
features_v3 = ['sum_90D', 'rf_pred', 'elo_diff', 'embed_oof']
model = make_pipeline(StandardScaler(), LinearRegression())

# df now carries the 'embed_oof' column assigned at the top of this cell
evaluate_model(model=model, features=features_v3, data=df, target='duration')

--- Iteration 3: Stacked LR (All features + Simple Embed OOF) ---
CV Scores (RMSE): [-277.05464956 -264.22660597 -269.33851168 -301.85784724 -271.37134773]
Mean CV RMSE: -276.7698



#### This is the same model, but now it takes into account the pick order of the draft.

In [25]:
# --- 2. Clean and Explode Hero Strings ---
all_heroes_raw = pd.concat([
    df['team1_picks'], 
    df['team2_picks'], 
    df['team1_bans'], 
    df['team2_bans']
])

all_heroes = all_heroes_raw \
    .astype(str) \
    .str.split(',') \
    .explode() \
    .str.strip() \
    .unique()

all_teams = pd.concat([df['team1'], df['team2']]).unique()

# --- 3. Create Lookup Maps (Vocabularies) ---
team_map = {name: i+1 for i, name in enumerate(all_teams)}
hero_map = {name: i+1 for i, name in enumerate(all_heroes)}
side_map = {'blue': 1, 'red': 2}

# --- 4. Apply Maps to Create ID Columns ---
df['team1_id'] = df['team1'].map(team_map).fillna(0)
df['team2_id'] = df['team2'].map(team_map).fillna(0)
df['team1_side_id'] = df['team1_side'].map(side_map).fillna(0)
df['team2_side_id'] = df['team2_side'].map(side_map).fillna(0)

# Helper function to map hero lists
def map_hero_list(hero_string):
    if not isinstance(hero_string, str):
        return []
    hero_list = [name.strip() for name in hero_string.split(',')]
    return [hero_map.get(hero, 0) for hero in hero_list]

df['team1_picks_ids'] = df['team1_picks'].apply(map_hero_list)
df['team2_picks_ids'] = df['team2_picks'].apply(map_hero_list)
df['team1_bans_ids'] = df['team1_bans'].apply(map_hero_list)
df['team2_bans_ids'] = df['team2_bans'].apply(map_hero_list)

# --- 5. Store Vocab Sizes ---
n_teams = len(team_map) + 1
n_heroes = len(hero_map) + 1
n_sides = 3 # (blue, red) + (padding_idx 0)

print(f"Total unique teams: {n_teams}")
print(f"Total unique heroes: {n_heroes}")
print(f"Total unique sides: {n_sides}")
print("Step 1 Complete.")

Total unique teams: 740
Total unique heroes: 133
Total unique sides: 3
Step 1 Complete.


In [26]:
print("--- Running Step 2: Defining Dataset ---")

# --- 2. Define the Dataset Class ---
class MatchDataset(Dataset):
    def __init__(self, dataframe, target_column):
        self.df = dataframe
        
        # Single-value features
        self.team1_ids = self.df['team1_id'].values
        self.team2_ids = self.df['team2_id'].values
        self.team1_side_ids = self.df['team1_side_id'].values
        self.team2_side_ids = self.df['team2_side_id'].values
        
        
        # List-based features
        self.t1_picks = self.df['team1_picks_ids'].values
        self.t2_picks = self.df['team2_picks_ids'].values
        self.t1_bans = self.df['team1_bans_ids'].values
        self.t2_bans = self.df['team2_bans_ids'].values
        
        self.target = self.df[target_column].values

    def __len__(self):
        return len(self.df)

    def _pad_hero_list(self, hero_list):
        """Pads or truncates a hero list to MAX_LEN."""
        padded_list = hero_list + [PADDING_VALUE] * (HERO_LIST_MAX_LEN - len(hero_list))
        return padded_list[:HERO_LIST_MAX_LEN]

    def _get_positions(self, hero_list):
        """Creates a position list [1, 2, 3, ...] for non-padded heroes."""
        positions = list(range(1, HERO_LIST_MAX_LEN + 1))
        # Set position to 0 if the hero_id is 0 (padded)
        for i, hero_id in enumerate(hero_list):
            if hero_id == PADDING_VALUE:
                positions[i] = PADDING_VALUE
        return positions

    def __getitem__(self, idx):
        # Get hero ID lists
        t1_picks_list = self._pad_hero_list(self.t1_picks[idx])
        t2_picks_list = self._pad_hero_list(self.t2_picks[idx])
        t1_bans_list = self._pad_hero_list(self.t1_bans[idx])
        t2_bans_list = self._pad_hero_list(self.t2_bans[idx])
        
        # Create corresponding position lists
        t1_picks_pos = self._get_positions(t1_picks_list)
        t2_picks_pos = self._get_positions(t2_picks_list)
        t1_bans_pos = self._get_positions(t1_bans_list)
        t2_bans_pos = self._get_positions(t2_bans_list)
        
        features = {
            # Single IDs
            'team1_id': torch.tensor(self.team1_ids[idx], dtype=torch.long),
            'team2_id': torch.tensor(self.team2_ids[idx], dtype=torch.long),
            'team1_side_id': torch.tensor(self.team1_side_ids[idx], dtype=torch.long),
            'team2_side_id': torch.tensor(self.team2_side_ids[idx], dtype=torch.long),
            
            
            # Hero IDs
            't1_picks': torch.tensor(t1_picks_list, dtype=torch.long),
            't2_picks': torch.tensor(t2_picks_list, dtype=torch.long),
            't1_bans': torch.tensor(t1_bans_list, dtype=torch.long),
            't2_bans': torch.tensor(t2_bans_list, dtype=torch.long),
            
            # Position IDs
            't1_picks_pos': torch.tensor(t1_picks_pos, dtype=torch.long),
            't2_picks_pos': torch.tensor(t2_picks_pos, dtype=torch.long),
            't1_bans_pos': torch.tensor(t1_bans_pos, dtype=torch.long),
            't2_bans_pos': torch.tensor(t2_bans_pos, dtype=torch.long)
        }
        
        target = torch.tensor(self.target[idx], dtype=torch.float)
        return features, target

print("Step 2 Complete (MatchDataset defined).")

--- Running Step 2: Defining Dataset ---
Step 2 Complete (MatchDataset defined).
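
`PositionalModel` itself is not shown in this notebook (presumably it is defined in `ModelHelpers.ipynb`). The sketch below is a hypothetical stand-in, not my actual architecture: the class name, the embedding size, the mean pooling and the MLP head are all assumptions. It only illustrates how the hero IDs and position IDs produced by `MatchDataset` can be combined, with index 0 reserved for padding. `n_positions=6` matches the value printed at model init (five draft slots plus padding).

```python
import torch
import torch.nn as nn

class PositionalModelSketch(nn.Module):
    """Hypothetical stand-in: hero + draft-position embeddings feeding a small MLP."""

    def __init__(self, n_teams, n_heroes, n_sides, n_positions=6, dim=16):
        super().__init__()
        # padding_idx=0 keeps the vector for "unknown/padded" fixed at zero.
        self.team_emb = nn.Embedding(n_teams, dim, padding_idx=0)
        self.side_emb = nn.Embedding(n_sides, dim, padding_idx=0)
        self.hero_emb = nn.Embedding(n_heroes, dim, padding_idx=0)
        self.pos_emb = nn.Embedding(n_positions, dim, padding_idx=0)
        # 2 teams + 2 sides + 4 pooled hero lists = 8 vectors of size dim.
        self.head = nn.Sequential(nn.Linear(8 * dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def _pool(self, hero_ids, pos_ids):
        # Add hero and position embeddings, then average over the non-padded slots.
        vectors = self.hero_emb(hero_ids) + self.pos_emb(pos_ids)
        mask = (hero_ids != 0).unsqueeze(-1).float()
        return (vectors * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

    def forward(self, batch):
        parts = [
            self.team_emb(batch["team1_id"]),
            self.team_emb(batch["team2_id"]),
            self.side_emb(batch["team1_side_id"]),
            self.side_emb(batch["team2_side_id"]),
        ]
        for key in ["t1_picks", "t2_picks", "t1_bans", "t2_bans"]:
            parts.append(self._pool(batch[key], batch[key + "_pos"]))
        # One predicted duration (in seconds) per map in the batch.
        return self.head(torch.cat(parts, dim=1)).squeeze(-1)

# Fake batch of two maps, using the same keys that MatchDataset emits.
torch.manual_seed(0)
batch = {
    "team1_id": torch.tensor([1, 2]),
    "team2_id": torch.tensor([3, 4]),
    "team1_side_id": torch.tensor([1, 2]),
    "team2_side_id": torch.tensor([2, 1]),
}
for key in ["t1_picks", "t2_picks", "t1_bans", "t2_bans"]:
    batch[key] = torch.randint(1, 133, (2, 5))
    batch[key + "_pos"] = torch.arange(1, 6).repeat(2, 1)

sketch_model = PositionalModelSketch(n_teams=740, n_heroes=133, n_sides=3)
preds = sketch_model(batch)
print(preds.shape)
```

Adding the position embedding to the hero embedding is what lets the same hero carry a different signal as a first pick than as a last pick.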


In [27]:
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
from sklearn.model_selection import GroupKFold
from sklearn.metrics import mean_absolute_error, mean_squared_error

print("--- Running Step 4: Starting K-Fold Training ---")


gkf = GroupKFold(n_splits=N_SPLITS)
oof_preds = np.zeros(len(df))
oof_indices = [] # To ensure alignment
groups = df['match_id']

# --- 3. Start the K-Fold Loop ---
for fold, (train_idx, val_idx) in enumerate(gkf.split(df, groups=groups)):
    print("-" * 30)
    print(f"--- FOLD {fold + 1}/{N_SPLITS} ---")
    print("-" * 30)
    
    # 1. Create DataLoaders
    train_df = df.iloc[train_idx]
    val_df = df.iloc[val_idx]
    train_dataset = MatchDataset(train_df, target_column=TARGET_COLUMN)
    val_dataset = MatchDataset(val_df, target_column=TARGET_COLUMN)
    train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)
    
    model = PositionalModel(n_teams, n_heroes, n_sides).to(DEVICE)
    
    # 3. Define Loss Function and Optimizer
    criterion = nn.MSELoss()  
    optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
    
    best_val_mae = float('inf') 
    best_model_state = None

    # 4. Start Training Loop
    for epoch in range(N_EPOCHS):
        train_loss = train_loop(model, train_loader, criterion, optimizer, DEVICE)
        val_preds, val_targets = get_predictions(model, val_loader, DEVICE)
        val_mae = mean_absolute_error(val_targets, val_preds)
        
        print(f"Epoch {epoch+1:02d} | Train Loss (MSE): {train_loss:.4f} | Val MAE: {val_mae:.4f}")
        
        if val_mae < best_val_mae:
            best_val_mae = val_mae
            best_model_state = copy.deepcopy(model.state_dict())

    # 5. Get OOF predictions from the *best* model
    print(f"Best Val MAE for Fold {fold+1}: {best_val_mae:.4f}")
    model.load_state_dict(best_model_state) 
    oof_preds_fold, _ = get_predictions(model, val_loader, DEVICE)
    
    # 6. Store OOF predictions
    oof_preds[val_idx] = oof_preds_fold
    oof_indices.extend(val_idx)

# --- 4. Final OOF Score and Saving to Series ---
total_oof_targets = df[TARGET_COLUMN].iloc[oof_indices]
total_oof_mae = mean_absolute_error(total_oof_targets, oof_preds[oof_indices])
total_oof_rmse = np.sqrt(mean_squared_error(total_oof_targets, oof_preds[oof_indices]))

print("\n" + "=" * 30)
print(f"TOTAL OOF MAE: {total_oof_mae:.5f}")
print(f"TOTAL OOF RMSE: {total_oof_rmse:.5f}")
print("=" * 30)

# Create a new pandas Series for the OOF predictions
oof_pred_series = pd.Series(
    oof_preds,
    index=df.index,
    name='oof_pred_positional_model'
)

print("\nOOF duration predictions saved to a new Series 'oof_pred_series'.")
print(oof_pred_series.head())

--- Running Step 4: Starting K-Fold Training ---
------------------------------
--- FOLD 1/5 ---
------------------------------
  [Model Init] n_teams: 740
  [Model Init] n_heroes: 133
  [Model Init] n_sides: 3
  [Model Init] n_positions: 6
Epoch 01 | Train Loss (MSE): 598852.4436 | Val MAE: 210.3272
Epoch 02 | Train Loss (MSE): 114309.2963 | Val MAE: 216.5254
Epoch 03 | Train Loss (MSE): 113519.3875 | Val MAE: 209.4435
Epoch 04 | Train Loss (MSE): 111868.2330 | Val MAE: 209.2863
Epoch 05 | Train Loss (MSE): 110472.8319 | Val MAE: 206.9114
Epoch 06 | Train Loss (MSE): 109491.9091 | Val MAE: 204.7323
Epoch 07 | Train Loss (MSE): 109333.5511 | Val MAE: 206.5630
Epoch 08 | Train Loss (MSE): 107673.7203 | Val MAE: 206.9942
Epoch 09 | Train Loss (MSE): 105930.7020 | Val MAE: 206.8546
Epoch 10 | Train Loss (MSE): 106418.9358 | Val MAE: 204.0603
Best Val MAE for Fold 1: 204.0603
------------------------------
--- FOLD 2/5 ---
------------------------------
  [Model Init] n_teams: 740
  [Model

In [28]:
df['embed_oof'] = oof_pred_series #generating the OOF preds, this time with the pick order taken into account.

In [29]:
features_v3 = ['sum_90D', 'rf_pred', 'elo_diff', 'embed_oof']
model = make_pipeline(StandardScaler(), LinearRegression())

# df carries the positional 'embed_oof' column from the cell above
evaluate_model(model=model, features=features_v3, data=df, target='duration')

CV Scores (RMSE): [-276.06846585 -262.93091256 -268.60517461 -301.91729575 -271.55615534]
Mean CV RMSE: -276.2156



#### -278.6853 --> -276.2156

It's not bad for a first iteration.

All in all, there is obviously a lot of room for improvement. I think I can improve the model by engineering better features, especially by trying to model each team's "speed factor".

Clustering (possibly k-means) could be used here. Ask me about it; I have several ways forward. Thank you :)