# Model Training for Street Figther 6 Bracket Predictor
This is the notebook demonstrating how to create the actual prediction model for the Street Figther 6 Bracket Predictor Model. For more on how to create the initial data, see the README for this project.

## Step 1: Data Processing
Before we can feed the data into the prediction model, we need to take our set data, which has winners and losers in their own rows, and join the data onto itself so that all relevant data for a given set are in the same row. We also need to join the player data in the form of ELO ratings and tournament participation data onto this set data too so the model has a better idea of each player's respective skill level.

In [167]:
# Import initial packages for data manipulation
import pandas as pd
import numpy as np

In [168]:
# Import set data. Note that 1 signifies a win and 2 signifies a loss.
df = pd.read_csv('data\\all_sets.csv')
df.head()

Unnamed: 0,set_id,entrant_id,entrant_name,standing,user_id,event_id,player_id,gamerTag,player_prefix,source
0,74141170.0,16483462,ADTerminal,2,221788.0,1129069,324331.0,ADTerminal,,startgg
1,74141170.0,16485711,Lyser,1,2568953.0,1129069,3998295.0,Lyser,,startgg
2,74141171.0,16485711,Lyser,1,2568953.0,1129069,3998295.0,Lyser,,startgg
3,74141171.0,16483462,ADTerminal,2,221788.0,1129069,324331.0,ADTerminal,,startgg
4,74141169.0,16483462,ADTerminal,1,221788.0,1129069,324331.0,ADTerminal,,startgg


In [169]:
# Import ELO data.
elo = pd.read_csv('data\\elo_records.csv')
elo.head()

Unnamed: 0,user_id,event_id,elo,tier1,tier2,tier3,tier5
0,59134.0,372265.0,143.211697,0.0,0.0,0.0,1.0
1,59135.0,372265.0,212.985028,0.0,0.0,0.0,1.0
2,59136.0,372265.0,202.474206,0.0,0.0,0.0,1.0
3,59137.0,372265.0,240.640067,0.0,0.0,0.0,1.0
4,59138.0,372265.0,200.689002,0.0,0.0,0.0,1.0


### A note on tournament tiers
An issue that was encountered in the early stages of this project was that ELO was not adequate to determine the true skill level of each player. It's suspected this was partially due to both a lack of data where not enough match result data could differentiate the top players against more average players. Including participation rates in higher level tournaments was thought to mediate this issue. 

Each "tier" column refers to the competition tier of each tournament a player is known to have participated in. These definitions are taken from the Liquipedia and are quoted from their [website](https://liquipedia.net/fighters/Street_Fighter_6/Tier_1_Tournaments) as follows:
* "**Tier 1 Tournaments**, sometimes called premier tournaments or super-majors, offer high prize pools and feature the best players from all over the world."
* "**Tier 2 Tournaments**, often called major tournaments or just majors, feature some top-tier players and offer good prize pools or seeding slots for premier events. They don't have the prestige or prize pools of a premier event, but they are usually notable enough to attract international attention."
* "**Tier 3 Tournaments**, often called minor tournaments or just minors, offer a smaller prize pool and less prestige than Tier 2 Tournaments, but still feature a significant level of competition. Smaller regional tournaments tend to fall into this category. "

However, to account for the amount of smaller, less formal events (such as local meetups) recorded from start.gg, a **Tier 5** category has been included to refer to these events.

### Days Between Events
I've also opted to include the days between each event for each player. The thought process behind including this variable is that players who participate more often in tournaments are more likely to win since they are playing more which is an indicator of experience.

In [170]:
# Import events data
events = pd.read_csv('data\\events.csv')[['event_id', 'start_at', 'game']]
events['start_at'] = pd.to_datetime(events['start_at'], format='ISO8601')
print(events.dtypes)
events = events.loc[events['game'] == 'SF6', :].drop(columns=['game'])

# Merge event data on ELO to calculate days between events for each player
elo2 = elo.merge(events, on='event_id').sort_values(by=['user_id', 'start_at'],
                                                    ascending= [True, True])
elo2['days_between_events'] = elo2.groupby(['user_id', 'event_id'])['start_at'].diff().dt.days

# Default value for days between events is 999 days as filler value
elo2['days_between_events'] = elo2['days_between_events'].fillna(999)
elo2.head()

event_id                float64
start_at    datetime64[ns, UTC]
game                     object
dtype: object


Unnamed: 0,user_id,event_id,elo,tier1,tier2,tier3,tier5,start_at,days_between_events
44328,0.0,867114.0,234.063434,1.0,0.0,0.0,0.0,2023-08-18 09:00:00+00:00,999.0
174194,0.0,1088942.0,253.70353,1.0,0.0,0.0,1.0,2024-03-10 20:00:00+00:00,999.0
192719,0.0,1118565.0,306.569468,1.0,0.0,0.0,2.0,2024-04-17 01:45:00+00:00,999.0
30569,1.0,864717.0,187.451839,1.0,0.0,0.0,0.0,2023-08-04 15:00:00+00:00,999.0
49950,1.0,963000.0,179.137104,1.0,0.0,0.0,1.0,2023-08-20 21:00:00+00:00,999.0


In [171]:
# Import player data; This is neccesary to graft the elo data to event data
players = pd.read_csv('data\\players.csv')
players = players[['startgg_pid', 'uid', 'liquidpedia_name']]

# Merge data and drop UID; including the specific UID as a category is highly likely to cause overfitting
elo3 = elo2.merge(players, how='left', left_on=['user_id'], right_on=['uid'])
elo3 = elo3.drop(columns=['uid'])
elo3.head()

  players = pd.read_csv('data\\players.csv')


Unnamed: 0,user_id,event_id,elo,tier1,tier2,tier3,tier5,start_at,days_between_events,startgg_pid,liquidpedia_name
0,0.0,867114.0,234.063434,1.0,0.0,0.0,0.0,2023-08-18 09:00:00+00:00,999.0,324331.0,
1,0.0,1088942.0,253.70353,1.0,0.0,0.0,1.0,2024-03-10 20:00:00+00:00,999.0,324331.0,
2,0.0,1118565.0,306.569468,1.0,0.0,0.0,2.0,2024-04-17 01:45:00+00:00,999.0,324331.0,
3,1.0,864717.0,187.451839,1.0,0.0,0.0,0.0,2023-08-04 15:00:00+00:00,999.0,985993.0,
4,1.0,963000.0,179.137104,1.0,0.0,0.0,1.0,2023-08-20 21:00:00+00:00,999.0,985993.0,


After getting elo data for each event and event tier counts, it's important to shift the data down by each `user_id` group and include the default values for the first rows in each group. (We don't want to use the ELO calculated from data for that specific event since we shouldn't know that score yet!)

In [172]:
default_elo = 200
elo3['elo'] = elo3.groupby('user_id')['elo'].shift(1).fillna(default_elo)
for tier in ['tier1', 'tier2', 'tier3', 'tier5']:
    elo3[tier] = elo3.groupby('user_id')[tier].shift(1).fillna(0)
elo3.head()

Unnamed: 0,user_id,event_id,elo,tier1,tier2,tier3,tier5,start_at,days_between_events,startgg_pid,liquidpedia_name
0,0.0,867114.0,200.0,0.0,0.0,0.0,0.0,2023-08-18 09:00:00+00:00,999.0,324331.0,
1,0.0,1088942.0,234.063434,1.0,0.0,0.0,0.0,2024-03-10 20:00:00+00:00,999.0,324331.0,
2,0.0,1118565.0,253.70353,1.0,0.0,0.0,1.0,2024-04-17 01:45:00+00:00,999.0,324331.0,
3,1.0,864717.0,200.0,0.0,0.0,0.0,0.0,2023-08-04 15:00:00+00:00,999.0,985993.0,
4,1.0,963000.0,187.451839,1.0,0.0,0.0,0.0,2023-08-20 21:00:00+00:00,999.0,985993.0,


Now we're ready to merge all of the data onto itself so that we end up with the data for each player and their set result.

In [173]:
df3 = df.merge(elo2, on=['user_id', 'event_id'])
df3 = df3.merge(df, left_on = ['set_id', 'event_id'], right_on=['set_id', 'event_id'])
df3 = df3[(df3['standing_y'] != df3['standing_x'])].drop_duplicates(subset=['set_id', 'event_id'], keep='first')

df4 = df3.loc[df3['event_id'] > 60, :]
df4 = df4.merge(elo3, how='left', left_on=['user_id_y', 'event_id'], right_on=['startgg_pid', 'event_id'])

df5 = df3.loc[df['event_id'] <= 60, :]
df5['match_name'] = df['entrant_name'].str.lower().str.replace(' ','', regex=True)
elo4 = elo3.copy()
elo4['match_name'] = elo4['liquidpedia_name'].str.lower().str.replace(' ','', regex=True)
df5 = df5.merge(elo4, how='left', on=['match_name'])
df5 = df5.drop(columns=['match_name'])

df3 = pd.concat([df4, df5], ignore_index=True)

for val in ['elo_x', 'elo_y']:
    df3[val] = df3[val].fillna(200)

for val in ['tier1_x', 'tier1_y', 'tier2_x', 'tier2_y',
            'tier3_x', 'tier3_y', 'tier5_x', 'tier5_y']:
    df3[val] = df3[val].fillna(0)

df3.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df5['match_name'] = df['entrant_name'].str.lower().str.replace(' ','', regex=True)


Unnamed: 0,set_id,entrant_id_x,entrant_name_x,standing_x,user_id_x,event_id,player_id_x,gamerTag_x,player_prefix_x,source_x,...,tier1_y,tier2_y,tier3_y,tier5_y,start_at_y,days_between_events_y,startgg_pid,liquidpedia_name,event_id_x,event_id_y
0,73485780.0,16337647.0,AlukardNY,2.0,196.0,1117616.0,3484.0,AlukardNY,,startgg,...,0.0,0.0,0.0,0.0,NaT,,,,,
1,73485594.0,16337647.0,AlukardNY,2.0,196.0,1117616.0,3484.0,AlukardNY,,startgg,...,0.0,0.0,0.0,0.0,NaT,,,,,
2,72612176.0,16118157.0,AlukardNY,2.0,196.0,1102572.0,3484.0,AlukardNY,,startgg,...,0.0,0.0,0.0,0.0,NaT,,,,,
3,72612483.0,16118157.0,AlukardNY,2.0,196.0,1102572.0,3484.0,AlukardNY,,startgg,...,0.0,0.0,0.0,0.0,NaT,,,,,
4,72612162.0,16118157.0,AlukardNY,1.0,196.0,1102572.0,3484.0,AlukardNY,,startgg,...,0.0,0.0,0.0,0.0,NaT,,,,,


At this point, we can filter out unneccesary columns and calculate a new column that includes all competitions as a potential column to use for our model training.

In [174]:
filter_cols = ['set_id','event_id', 'elo_x', 'elo_y', 'tier1_x', 'tier1_y', 'tier2_x', 'tier2_y',
               'tier3_x', 'tier3_y', 'tier5_x', 'tier5_y', 'standing_x', 'days_between_events_x',
               'days_between_events_y']

df3 = df3[filter_cols]
df3['result'] = 0
df3.loc[df3['standing_x'] == 1, 'result'] = 0
df3.loc[df3['standing_x'] == 2, 'result'] = 1

df3['all_comps_x'] = df3['tier1_x'] + df3['tier2_x'] + df3['tier3_x'] + df3['tier5_x']
df3['all_comps_y'] = df3['tier1_y'] + df3['tier2_y'] + df3['tier3_y'] + df3['tier5_y']

df3['days_between_events_x'] = df3['days_between_events_x'].fillna(999)
df3['days_between_events_y'] = df3['days_between_events_y'].fillna(999)

df3.drop(columns=['standing_x'], inplace=True)
df3.head()

Unnamed: 0,set_id,event_id,elo_x,elo_y,tier1_x,tier1_y,tier2_x,tier2_y,tier3_x,tier3_y,tier5_x,tier5_y,days_between_events_x,days_between_events_y,result,all_comps_x,all_comps_y
0,73485780.0,1117616.0,422.432621,200.0,0.0,0.0,1.0,0.0,3.0,0.0,127.0,0.0,999.0,999.0,1,131.0,0.0
1,73485594.0,1117616.0,422.432621,200.0,0.0,0.0,1.0,0.0,3.0,0.0,127.0,0.0,999.0,999.0,1,131.0,0.0
2,72612176.0,1102572.0,394.198831,200.0,0.0,0.0,1.0,0.0,3.0,0.0,118.0,0.0,999.0,999.0,1,122.0,0.0
3,72612483.0,1102572.0,394.198831,200.0,0.0,0.0,1.0,0.0,3.0,0.0,118.0,0.0,999.0,999.0,1,122.0,0.0
4,72612162.0,1102572.0,394.198831,200.0,0.0,0.0,1.0,0.0,3.0,0.0,118.0,0.0,999.0,999.0,0,122.0,0.0


Lastly, I also use a function to shuffle the order of winners and losers in the data since my initial data had a disproportionate amount of winners on the "x" side. This function splits the data more evenly.

In [175]:
def shuffle_rows(df):
    shuffle = ['set_id', 'event_id', 'elo_y', 'elo_x', 'tier1_y', 'tier1_x', 
                    'tier2_y', 'tier2_x', 'tier3_y', 'tier3_x', 'tier5_y', 'tier5_x',
                    'days_between_events_y', 'days_between_events_x', 'result', 'all_comps_y', 'all_comps_x']

    mask = np.random.rand(len(df)) < 0.5
    df_shuffled = df.copy()
    df_shuffled.loc[mask, :] = df_shuffled.loc[mask, shuffle].values
    df_shuffled.loc[mask, 'result'] = 1 - df_shuffled.loc[mask, 'result'].values
    return df_shuffled
    
df3 = shuffle_rows(df3)

## Step 2: Initial Model Training

In [176]:
# Linear Regression Model
from sklearn.linear_model import LinearRegression
model = LinearRegression()

X = df3.copy()[['all_comps_x', 'all_comps_y', 'elo_x', 'elo_y', 'tier1_x', 'tier1_y',
                'tier2_x', 'tier2_y', 'tier3_x', 'tier3_y', 'tier5_x', 'tier5_y',
                'days_between_events_x', 'days_between_events_y']]
y = df3['result']

In [177]:
from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split

X_norm = normalize(X, norm='l1')

X_train, X_test, y_train, y_test = train_test_split(X_norm, y.astype(int), shuffle = True, train_size=0.3, random_state=24)
print(len(X_train))
print(len(X_test))

2067742
4824732


In [178]:
y.value_counts()

result
0    3446355
1    3446119
Name: count, dtype: int64

In [179]:
# Linear Regression Model
from sklearn.linear_model import Lasso
model = Lasso(alpha=0.1)

model.fit(X_train, y_train)

from sklearn.metrics import mean_squared_error, r2_score
predictions = model.predict(X_test)
r2 = r2_score(y_test, predictions)
rmse = mean_squared_error(y_test, predictions, squared=False)

print('The r2 is: ', r2)
print('The rmse is: ', rmse)

The r2 is:  -5.788201125067616e-07
The rmse is:  0.5000001352959642


In [180]:
from sklearn.linear_model import TweedieRegressor
reg = TweedieRegressor(power=1, alpha=0.5, link='log')
reg.fit(X_train, y_train)

print(reg.coef_)
print(reg.intercept_)

y_preds = reg.predict(X_test)
print(pd.Series(y_preds).sort_values().value_counts())

[ 5.43819000e-03 -5.46065513e-03  1.69912882e-02 -1.69594632e-02
  6.22567252e-05 -6.21376220e-05  4.48792177e-05 -4.47707248e-05
  1.47068064e-04 -1.47338672e-04  5.18398599e-03 -5.20640811e-03
  5.32176230e-06  5.32176230e-06]
-0.6937141431072299
0.499720    1412846
0.498641        759
0.498542        754
0.500415        751
0.500878        750
             ...   
0.499172          1
0.500315          1
0.499608          1
0.500501          1
0.500686          1
Name: count, Length: 294056, dtype: int64


In [181]:
from sklearn.metrics import classification_report
target_names = ['P1 win', 'P2 win']
print(classification_report(y_test.astype(int), y_preds.round(), target_names=target_names))

              precision    recall  f1-score   support

      P1 win       0.61      0.97      0.75   2411898
      P2 win       0.92      0.38      0.54   2412834

    accuracy                           0.67   4824732
   macro avg       0.77      0.67      0.64   4824732
weighted avg       0.77      0.67      0.64   4824732



Feature Selection

In [182]:
from sklearn.feature_selection import SelectKBest, chi2
X_new = SelectKBest(chi2, k=7).fit_transform(X_norm + 1, y)

X_train, X_test, y_train, y_test = train_test_split(X_new, y.astype(int), shuffle = True, train_size=0.3)

reg = TweedieRegressor(power=1, alpha=0.5, link='log')
reg.fit(X_train, y_train.astype(int))

target_names = ['P1 win', 'P2 win']
print(classification_report(y_test, y_preds.round(), target_names=target_names))

              precision    recall  f1-score   support

      P1 win       0.50      0.80      0.61   2413402
      P2 win       0.50      0.20      0.29   2411330

    accuracy                           0.50   4824732
   macro avg       0.50      0.50      0.45   4824732
weighted avg       0.50      0.50      0.45   4824732



In [183]:
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
                    hidden_layer_sizes=(15,), random_state=1)

clf.fit(X_train, y_train)

y_preds = clf.predict(X_test)
print(y_preds)

print(classification_report(y_test, y_preds, target_names=target_names))

[1 1 1 ... 1 1 0]
              precision    recall  f1-score   support

      P1 win       1.00      0.71      0.83   2413402
      P2 win       0.77      1.00      0.87   2411330

    accuracy                           0.85   4824732
   macro avg       0.89      0.85      0.85   4824732
weighted avg       0.89      0.85      0.85   4824732

