# League of Legends Project - Machine Learning Predictions
In this notebook I will make a predictive model for the results of League of Legends games using various ML (machine learning) algorithms.

I will predict the result of professional games at several points, firstly before the game starts, then at 10, 15, 20, and 25 minutes into the game.

I will use the ML models of LogistRegression and RandomForest from Scikit-learn as well as XGBRegressor from XGBoost.  I will also engage in PCA (principal component analysis) and data transformation to enhance the predictive accuracy.
***

However, before I do this I need to engage in data engineering, to reshape the dataframe to include all of the data for each game on a single row.  Because the elo ratings I calculated in the LoL_Elo_System notebook, and stored in elolol.pkl, are reshaped so that the data is split by winning and losing team rather than by Red and Blue side, I will have to reshape this dataframe as well.

Furthermore, for each player's stats, I will standardise their in-game statistics to the champion selected and their position.  This is due to information gathered in LoL_Data_Exploration about how different champions and positions accumulate resources.  I will also include the non-standardised in-game statistics.

# Step 1: Dataset Creation

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score, f1_score

import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Reading in the data
dflol = pd.read_pickle("dflol.pkl")

In [3]:
dflol.head()

Unnamed: 0_level_0,datacompleteness,url,league,year,split,playoffs,date,game,patch,participantid,...,opp_csat25,golddiffat25,xpdiffat25,csdiffat25,killsat25,assistsat25,deathsat25,opp_killsat25,opp_assistsat25,opp_deathsat25
gameid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
TRLH3/33,complete,http://matchhistory.na.leagueoflegends.com/en/...,EU LCS,2014,Spring,0,2014-01-14 17:52:02,1.0,3.15,1,...,206.0,76.0,-512.0,-18.0,3.0,4.0,0.0,1.0,2.0,2.0
TRLH3/33,complete,http://matchhistory.na.leagueoflegends.com/en/...,EU LCS,2014,Spring,0,2014-01-14 17:52:02,1.0,3.15,2,...,140.0,-888.0,351.0,-42.0,0.0,5.0,3.0,2.0,1.0,1.0
TRLH3/33,complete,http://matchhistory.na.leagueoflegends.com/en/...,EU LCS,2014,Spring,0,2014-01-14 17:52:02,1.0,3.15,3,...,225.0,621.0,733.0,8.0,1.0,5.0,1.0,1.0,2.0,0.0
TRLH3/33,complete,http://matchhistory.na.leagueoflegends.com/en/...,EU LCS,2014,Spring,0,2014-01-14 17:52:02,1.0,3.15,4,...,161.0,3265.0,1950.0,50.0,6.0,2.0,0.0,0.0,0.0,4.0
TRLH3/33,complete,http://matchhistory.na.leagueoflegends.com/en/...,EU LCS,2014,Spring,0,2014-01-14 17:52:02,1.0,3.15,5,...,28.0,1780.0,2397.0,-19.0,0.0,7.0,0.0,0.0,1.0,3.0


### Feature Selection

I am interested only in data where I know at what time it was gathered.  This includes data such as the side a team was on (before the game has started) and data gathered at particular times.  Whilst data that is gathered during the game such as "firsttothreetowers" is recorded during the game, it could happen before or after 25 minutes (the latest time, timed data is gathered) and so I will not use it as a feature in my predictive model.

I am also dropping non-informative columns such as "url" and "split" as well as features where the data is gathered at the end of games and so they cannot be used to predict games such as "kills" and "totalgold".

In [4]:

dflol = dflol.drop(labels=["datacompleteness", "url", "league", "year", "split", "playoffs", "date", "game", "patch", "playername", "playerid", "teamname", "teamid", "ban1", "ban2", "ban3", "ban4", "ban5", "pick1", "pick2", "pick3", "pick4", "pick5", "gamelength", "teamkills", "teamdeaths", "chemtechs", "hextechs", "dragons (type unknown)", "void_grubs", "opp_void_grubs", "turretplates", "opp_turretplates", "elementaldrakes", "opp_elementaldrakes", "monsterkillsownjungle", "monsterkillsenemyjungle", "participantid", "firstdragon", "dragons", "opp_dragons", "infernals", "mountains", "clouds", "oceans", "elders", "opp_elders", "firstherald", "heralds", "opp_heralds", "firstbaron", "barons", "opp_barons", "firsttower", "towers", "opp_towers", "firstmidtower", "firsttothreetowers", "gspd", "gpr", "team kpm", "ckpm", "kills", "deaths", "assists", "doublekills", "triplekills", "quadrakills", "pentakills", "firstblood", "firstbloodkill", "firstbloodassist", "firstbloodvictim", "inhibitors", "opp_inhibitors", "damagetochampions", "dpm", "damageshare", "damagetakenperminute", "damagemitigatedperminute", "wardsplaced", "wpm", "wardskilled", "wcpm", "controlwardsbought", "visionscore", "vspm", "totalgold", "earnedgold", "earned gpm", "earnedgoldshare", "goldspent", "total cs", "minionkills", "monsterkills", "cspm"],
            axis=1)

In [5]:
for time in ["10","15","20","25"]:
    # Creating derivative columns
    dflol[f"killparticipationsat{time}"] = dflol[f"killsat{time}"] + dflol[f"assistsat{time}"]
    dflol[f"opp_killparticipationsat{time}"] = dflol[f"opp_killsat{time}"] + dflol[f"opp_assistsat{time}"]
    dflol[f"killparticipationsdiffat{time}"] = dflol[f"killparticipationsat{time}"] - dflol[f"opp_killparticipationsat{time}"]
    dflol[f"killdiffat{time}"] = dflol[f"killsat{time}"] - dflol[f"opp_killsat{time}"]

    # Dropping columns that are mirrored, like opponent stats.  These stats are duplicated because one team's opponent deaths (f"opp_deathsat{time}") is identical to the opponent team's deaths (f"deathsat{time}".
    dflol = dflol.drop([f"opp_goldat{time}", f"opp_xpat{time}", f"opp_csat{time}",
                        f"opp_killsat{time}", f"opp_assistsat{time}", f"opp_deathsat{time}",
                        f"opp_killparticipationsat{time}"], axis=1)

In [6]:
dflol.head(12)

Unnamed: 0_level_0,side,position,champion,result,goldat10,xpat10,csat10,golddiffat10,xpdiffat10,csdiffat10,...,killdiffat10,killparticipationsat15,killparticipationsdiffat15,killdiffat15,killparticipationsat20,killparticipationsdiffat20,killdiffat20,killparticipationsat25,killparticipationsdiffat25,killdiffat25
gameid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
TRLH3/33,Blue,top,Trundle,1,3080.0,3907.0,57.0,-6.0,-87.0,0.0,...,1.0,2.0,0.0,1.0,3.0,0.0,2.0,7.0,4.0,2.0
TRLH3/33,Blue,jng,Vi,1,2335.0,2732.0,32.0,-1102.0,-1425.0,-27.0,...,-2.0,2.0,0.0,-2.0,3.0,0.0,-2.0,5.0,2.0,-2.0
TRLH3/33,Blue,mid,Orianna,1,2817.0,4216.0,73.0,-325.0,-87.0,5.0,...,0.0,1.0,-1.0,0.0,2.0,-1.0,-1.0,6.0,3.0,0.0
TRLH3/33,Blue,bot,Jinx,1,3487.0,3259.0,78.0,977.0,840.0,27.0,...,2.0,3.0,3.0,3.0,4.0,4.0,3.0,8.0,8.0,6.0
TRLH3/33,Blue,sup,Annie,1,2132.0,3079.0,5.0,337.0,296.0,-9.0,...,0.0,4.0,4.0,0.0,4.0,3.0,0.0,7.0,6.0,0.0
TRLH3/33,Red,top,Dr. Mundo,0,3086.0,3994.0,57.0,6.0,87.0,0.0,...,-1.0,2.0,0.0,-1.0,3.0,0.0,-2.0,3.0,-4.0,-2.0
TRLH3/33,Red,jng,Shyvana,0,3437.0,4157.0,59.0,1102.0,1425.0,27.0,...,2.0,2.0,0.0,2.0,3.0,0.0,2.0,3.0,-2.0,2.0
TRLH3/33,Red,mid,LeBlanc,0,3142.0,4303.0,68.0,325.0,87.0,-5.0,...,0.0,2.0,1.0,0.0,3.0,1.0,1.0,3.0,-3.0,0.0
TRLH3/33,Red,bot,Lucian,0,2510.0,2419.0,51.0,-977.0,-840.0,-27.0,...,-2.0,0.0,-3.0,-3.0,0.0,-4.0,-3.0,0.0,-8.0,-6.0
TRLH3/33,Red,sup,Thresh,0,1795.0,2783.0,14.0,-337.0,-296.0,9.0,...,0.0,0.0,-4.0,0.0,1.0,-3.0,0.0,1.0,-6.0,0.0


Because dflol at this stage includes the champion column, by using the dropna() method, I can remove all rows that include the team stats whilst keeping all of the player stats and the champions selected in order to standardise the statistics for each champion and position.

I will standardise the statistics in the following section: Standardised Statistics

In [7]:
dflol_no_teams = dflol.dropna()

### Base Statistics
Like in the previous notebooks, I am reshaping the data by reducing the number of rows (and increasing the number of columns) so that all of the relevant data for each game is on 1 row.

Rather than just keeping the team data, I am including the player data as well even though the sum of the player features equal the corresponding team feature.  This is because, as I know from my data exploration, the positions have different roles and so if the difference in gold between the teams is due to a Bot player having a disproportionate amount of gold, that will indicate that their team has a higher chance of winning compared to if the Support player has a disproportionate amount of gold.

Furthermore, from the analysis of the features in LoL_Elo_System, I know that the player in the Top position has less impact on their team winning than players in other positions.  This may indicate a lower importance of gold and kills for the Top position.

In [8]:
# Drop the "champion" column and rows with NaN values.  Set the index in preparation for reshaping the data.
dflol = (dflol.drop("champion", axis=1)
         .dropna()
         .reset_index()
         .set_index(["gameid", "side"]))
# I am using a multi-level index with both "gameid" and "side" and will first shape the dataframe by position and then by side.

In [9]:
positions = ["top", "jng", "mid", "bot", "sup", "team"]
positions_list = []

for i in range(6):
    # Create 6 dataframes for the 5 positions plus the team stats.
    positions_list.append(dflol[dflol["position"] == positions[i]])


for i in range(len(positions_list)):

    positions_list[i] = positions_list[i].drop("position", axis=1)

    # Keep only 1 column with the result
    if i > 0:
        positions_list[i] = positions_list[i].drop("result", axis=1)

    # Add prefixes of the positions for the column names.
    new_columns = []
    for c in positions_list[i].columns.to_list():
        c = positions[i] + "_" + c
        new_columns.append(c)
    positions_list[i].columns = new_columns

# Merge the 6 dataframes together.
df_blue_red_stats = pd.concat([positions_list[0],
                              positions_list[1],
                              positions_list[2],
                              positions_list[3],
                              positions_list[4],
                              positions_list[5]],
                              axis=1)

In [10]:
# Change the index for the purposes of moving all the data from both red and blue teams onto the same row.
df_blue_red_stats = df_blue_red_stats.reset_index().set_index("gameid")

# Split the data into 2 dataframes, one dataframe for the team on blue side, the other for the team on red side.
blue_df = (df_blue_red_stats[df_blue_red_stats["side"] == "Blue"]
           .drop("side",axis=1)
           .add_prefix("blue_", axis=1))

# In order to keep only 1 result column, I will drop "top_result" from red_df.
red_df = (df_blue_red_stats[df_blue_red_stats["side"] == "Red"]
          .drop(["side", "top_result"], axis=1)
          .add_prefix("red_", axis=1))

# Merge the 2 dataframes together so that each game occupies only a single row.
stats_df = blue_df.merge(red_df, on="gameid").rename({"blue_top_result":"blue_result"}, axis=1)

In [11]:
stats_df.head()

Unnamed: 0_level_0,blue_result,blue_top_goldat10,blue_top_xpat10,blue_top_csat10,blue_top_golddiffat10,blue_top_xpdiffat10,blue_top_csdiffat10,blue_top_killsat10,blue_top_assistsat10,blue_top_deathsat10,...,red_team_killdiffat10,red_team_killparticipationsat15,red_team_killparticipationsdiffat15,red_team_killdiffat15,red_team_killparticipationsat20,red_team_killparticipationsdiffat20,red_team_killdiffat20,red_team_killparticipationsat25,red_team_killparticipationsdiffat25,red_team_killdiffat25
gameid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
TRLH3/33,1,3080.0,3907.0,57.0,-6.0,-87.0,0.0,2.0,0.0,0.0,...,-1.0,6.0,-6.0,-2.0,10.0,-6.0,-2.0,10.0,-23.0,-6.0
TRLH3/44,1,3268.0,4433.0,83.0,69.0,97.0,1.0,1.0,0.0,1.0,...,1.0,12.0,6.0,2.0,17.0,-4.0,0.0,17.0,-6.0,-1.0
TRLH3/76,0,2912.0,4257.0,72.0,-188.0,-311.0,-8.0,0.0,0.0,1.0,...,1.0,6.0,1.0,0.0,7.0,2.0,1.0,11.0,4.0,1.0
TRLH3/85,1,2990.0,4699.0,86.0,-237.0,0.0,-4.0,0.0,0.0,0.0,...,-1.0,2.0,-4.0,-3.0,13.0,-14.0,-6.0,16.0,-38.0,-12.0
TRLH3/10072,0,2404.0,3087.0,46.0,-438.0,-893.0,-23.0,0.0,0.0,0.0,...,0.0,6.0,5.0,2.0,10.0,9.0,4.0,18.0,17.0,7.0


### Adjusted Statistics
I will standardise the statistics of the players by the champions selected.

In [12]:
# Drop the result column, as it will be identical to the "blue_result" column already in the stats_df dataframe.
dflol_no_teams = dflol_no_teams.drop("result", axis=1).reset_index()
dflol_no_teams.head()

Unnamed: 0,gameid,side,position,champion,goldat10,xpat10,csat10,golddiffat10,xpdiffat10,csdiffat10,...,killdiffat10,killparticipationsat15,killparticipationsdiffat15,killdiffat15,killparticipationsat20,killparticipationsdiffat20,killdiffat20,killparticipationsat25,killparticipationsdiffat25,killdiffat25
0,TRLH3/33,Blue,top,Trundle,3080.0,3907.0,57.0,-6.0,-87.0,0.0,...,1.0,2.0,0.0,1.0,3.0,0.0,2.0,7.0,4.0,2.0
1,TRLH3/33,Blue,jng,Vi,2335.0,2732.0,32.0,-1102.0,-1425.0,-27.0,...,-2.0,2.0,0.0,-2.0,3.0,0.0,-2.0,5.0,2.0,-2.0
2,TRLH3/33,Blue,mid,Orianna,2817.0,4216.0,73.0,-325.0,-87.0,5.0,...,0.0,1.0,-1.0,0.0,2.0,-1.0,-1.0,6.0,3.0,0.0
3,TRLH3/33,Blue,bot,Jinx,3487.0,3259.0,78.0,977.0,840.0,27.0,...,2.0,3.0,3.0,3.0,4.0,4.0,3.0,8.0,8.0,6.0
4,TRLH3/33,Blue,sup,Annie,2132.0,3079.0,5.0,337.0,296.0,-9.0,...,0.0,4.0,4.0,0.0,4.0,3.0,0.0,7.0,6.0,0.0


dflol_no_teams contains the name of each champion.  This is for standardising the data for each champion.


In [13]:
#Gather the list of stats present in the data.  The first 3 columns are "side", "position", and "champion".
list_of_column_stats = dflol_no_teams.columns[4:]

# Create 2 dataframes, 1 for the mean, and 1 for the standard deviation of each champion in each position (that it has been selected in).
mean_df = dflol_no_teams.groupby(["champion", "position"])[list_of_column_stats].mean().dropna()
std_df = dflol_no_teams.groupby(["champion", "position"])[list_of_column_stats].std().dropna()

In [14]:
# We can see that for each champion's position, we can see the mean of all the different features for that champion in that position.  We can also see that certain champions have not been selected in certain positions (such as Ahri which has been selected in all positions other than Support.
mean_df.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,goldat10,xpat10,csat10,golddiffat10,xpdiffat10,csdiffat10,killsat10,assistsat10,deathsat10,goldat15,...,killdiffat10,killparticipationsat15,killparticipationsdiffat15,killdiffat15,killparticipationsat20,killparticipationsdiffat20,killdiffat20,killparticipationsat25,killparticipationsdiffat25,killdiffat25
champion,position,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
Aatrox,bot,3237.5,4062.0,63.5,234.0,950.0,-0.5,0.5,1.0,0.0,4757.5,...,0.0,3.0,0.0,-1.0,5.5,1.0,-0.5,9.5,-1.5,0.5
Aatrox,jng,3250.504505,3380.342342,57.418919,10.698198,51.806306,1.504505,0.481982,0.545045,0.27027,5041.779279,...,-0.004505,2.121622,-0.337838,0.0,3.193694,-0.54955,0.036036,4.527027,-0.752252,-0.099099
Aatrox,mid,3348.27,4660.512,80.65,14.092,103.092,-0.434,0.484,0.452,0.34,5365.688,...,0.09,2.002,0.188,0.094,3.24,0.188,0.132,4.984,0.278,0.164
Aatrox,sup,2969.333333,4187.666667,64.0,838.0,1385.333333,53.0,0.333333,0.333333,0.666667,4805.666667,...,0.0,2.333333,-0.333333,-0.333333,2.666667,-3.0,-0.333333,4.333333,-6.333333,0.0
Aatrox,top,3224.064512,4609.365828,76.460967,-24.947739,52.447152,0.433338,0.331813,0.304835,0.440115,5182.507624,...,-0.008732,1.411834,-0.067379,0.040662,2.545028,-0.069725,0.096703,4.149225,-0.07155,0.17529
Ahri,bot,3421.625,3451.5,78.25,9.25,306.125,-4.125,0.75,0.75,0.875,5456.0,...,0.0,2.5,0.125,0.125,4.0,0.25,0.5,4.75,-1.0,-0.25
Ahri,jng,3749.0,4836.0,88.0,330.0,844.0,16.0,1.0,0.0,0.0,6315.0,...,0.0,3.0,0.0,-1.0,6.0,2.0,0.0,8.0,4.0,0.0
Ahri,mid,3390.835682,4661.425904,84.277372,-16.223561,-7.922424,-1.378204,0.467493,0.513156,0.308267,5428.932439,...,0.045833,2.146155,0.307588,0.107622,3.794602,0.478357,0.150229,5.951282,0.633678,0.17705
Ahri,top,3333.777778,4743.777778,82.222222,201.555556,113.666667,9.777778,0.333333,0.444444,0.222222,5310.333333,...,0.222222,1.777778,1.222222,0.222222,3.0,0.111111,-0.111111,5.555556,1.555556,0.777778
Akali,bot,3508.0,4926.333333,87.333333,-9.0,1505.0,10.666667,0.666667,0.0,0.333333,5671.0,...,-0.333333,2.0,-1.333333,-0.666667,5.333333,0.333333,-1.333333,9.333333,2.666667,-1.333333


In [15]:
# Combine mean and std dataframes with dflol_no_teams which holds the data for each game.
adjusted_stats_calc_df =(dflol_no_teams.merge(mean_df, on=["champion", "position"], suffixes=["", "_mean"])
                         .merge(std_df, on=["champion", "position"], suffixes=["", "_std"]))
# Total number of stats to be standardised
print(f"There are {len(list_of_column_stats)} stats to be standardised")

There are 48 stats to be standardised


In [16]:
# Isolate the mean and std rows within the dataframe.
list_of_mean_column_stats = adjusted_stats_calc_df.columns[52:100]
list_of_std_column_stats = adjusted_stats_calc_df.columns[100:148]

# Engage in Z-score standardisation of the statistics by taking away the mean and dividing by the standard deviation.
for i in range(48):
    adjusted_stats_calc_df[list_of_column_stats[i]] = adjusted_stats_calc_df[list_of_column_stats[i]] -\
                                                      adjusted_stats_calc_df[list_of_mean_column_stats[i]]

    adjusted_stats_calc_df[list_of_column_stats[i]] = adjusted_stats_calc_df[list_of_column_stats[i]] /\
                                                      adjusted_stats_calc_df[list_of_std_column_stats[i]]

# Drop the columns that held the mean and standard deviation.
# Set a multi-level index, as I did for dflol, to reshape the data first by position and then by side.
adjusted_stats_calc_df = (adjusted_stats_calc_df.iloc[:,:52]
                          .drop("champion", axis=1)
                          .set_index(["gameid", "side"]))

In [17]:
positions_list2 = []
for i in range(5):
    positions_list2.append(adjusted_stats_calc_df[adjusted_stats_calc_df["position"] == positions[i]])

# Rename the column titles.
for i in range(len(positions_list2)):
    positions_list2[i] = positions_list2[i].drop("position", axis=1)

    new_columns = []
    for c in positions_list2[i].columns.to_list():
        c = positions[i] + "_" + c
        new_columns.append(c)
    positions_list2[i].columns = new_columns

# Merge the 5 dataframes together.
df_blue_red_adjusted_stats = pd.concat([positions_list2[0],
                              positions_list2[1],
                              positions_list2[2],
                              positions_list2[3],
                              positions_list2[4]],
                              axis=1).reset_index().set_index("gameid")

I will now use the found adjusted statistics that are  in df_blue_red_adjusted_stats to create a derivative column for the team's adjusted stats.

In [18]:
# Split the df_blue_red_adjusted_stats dataframe into 2 dataframes, one for the players on red side and the other for the players on Blue side.
blue_adjusted_df = df_blue_red_adjusted_stats[df_blue_red_adjusted_stats["side"] == "Blue"].drop("side",axis=1)
red_adjusted_df = df_blue_red_adjusted_stats[df_blue_red_adjusted_stats["side"] == "Red"].drop("side",axis=1)

# List of team stats to derive.  The name raw_stats refers to the names of the columns (that there are no prefixes such as "top_" or "blue_").
raw_stats = dflol.columns[2:]

# List of positional stats to derive the team stats from.
positional_stats = blue_adjusted_df.columns

# Create the derived team stats by combining the standardised (adjusted for champion) statistics.
for i in range(48):
   blue_adjusted_df[f"team_{raw_stats[i]}"] = (blue_adjusted_df[positional_stats[i]] +
                                                     blue_adjusted_df[positional_stats[i + 48]] +
                                                     blue_adjusted_df[positional_stats[i + 96]] +
                                                     blue_adjusted_df[positional_stats[i + 144]] +
                                                     blue_adjusted_df[positional_stats[i + 192]])

   red_adjusted_df[f"team_{raw_stats[i]}"] = (red_adjusted_df[positional_stats[i]] +
                                                     red_adjusted_df[positional_stats[i + 48]] +
                                                     red_adjusted_df[positional_stats[i + 96]] +
                                                     red_adjusted_df[positional_stats[i + 144]] +
                                                     red_adjusted_df[positional_stats[i + 192]])

# Add a prefix for the side.
blue_adjusted_df = blue_adjusted_df.add_prefix("blue_")
red_adjusted_df = red_adjusted_df.add_prefix("red_")

# Combine the 2 dataframes together and add a prefix to mark the column as adjusted.
adjusted_stats_df = (blue_adjusted_df.merge(red_adjusted_df, on="gameid")).add_prefix("adj_")

In [19]:
adjusted_stats_df.head()

Unnamed: 0_level_0,adj_blue_top_goldat10,adj_blue_top_xpat10,adj_blue_top_csat10,adj_blue_top_golddiffat10,adj_blue_top_xpdiffat10,adj_blue_top_csdiffat10,adj_blue_top_killsat10,adj_blue_top_assistsat10,adj_blue_top_deathsat10,adj_blue_top_goldat15,...,adj_red_team_killdiffat10,adj_red_team_killparticipationsat15,adj_red_team_killparticipationsdiffat15,adj_red_team_killdiffat15,adj_red_team_killparticipationsat20,adj_red_team_killparticipationsdiffat20,adj_red_team_killdiffat20,adj_red_team_killparticipationsat25,adj_red_team_killparticipationsdiffat25,adj_red_team_killdiffat25
gameid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
TRLH3/33,-0.135168,-0.021855,-0.520006,-0.132154,-0.151887,-0.166628,3.024814,-0.512389,-0.581798,-0.212773,...,-1.433629,-1.731643,-2.403074,-1.630137,-2.181986,-1.745944,-1.482228,-3.98839,-4.772925,-2.571911
TRLH3/20057,-0.707841,0.982862,0.751272,-0.065393,0.725057,0.320334,-0.457858,-0.512389,-0.581798,-0.492693,...,-0.209948,-3.311287,-0.555006,-0.46719,-1.743161,1.586179,0.15606,-1.089217,1.357828,0.512618
TRLH3/20094,0.098575,0.628928,-0.02255,2.006275,1.603745,1.850784,-0.457858,2.779747,-0.581798,-0.071018,...,-3.95256,-3.931522,-3.627975,-2.367295,-5.379623,-2.478421,-1.574944,-5.378668,-0.987231,-0.444217
TRLH3/20100,-2.069398,-0.391013,-1.238555,-1.304638,-0.913765,-1.766644,-0.457858,-0.512389,-0.581798,-0.826444,...,1.763452,-2.165335,0.409398,0.304882,-2.239106,0.642069,0.771471,-4.085636,0.44048,0.616665
TRLH3/20115,-0.368912,-0.531826,-0.630552,0.011799,-1.483865,-2.114473,1.283478,-0.512389,-0.581798,-0.076401,...,1.041818,1.529538,3.632455,2.52417,0.260071,2.997219,2.460345,1.550889,1.864937,2.760491


In [20]:
print(f"There are {adjusted_stats_df.shape[1]} columns in adjusted_stats_df and {stats_df.shape[1]} columns in stats_df\n\nThat is {adjusted_stats_df.shape[1]} columns of adjusted stats, {stats_df.shape[1] -1} columns of unadjusted stats and 1 column for the result of the game")

There are 576 columns in adjusted_stats_df and 577 columns in stats_df

That is 576 columns of adjusted stats, 576 columns of unadjusted stats and 1 column for the result of the game


In [21]:
# Merge the adjusted and unadjusted stats
stats_df = stats_df.merge(adjusted_stats_df, on="gameid")

# Drop all columns that show the team on Red side's data minus the team on Blue side's data because this is the exact inverse of the columns that show the Blue side's data minus the Red side's data.
stats_df = stats_df[stats_df.columns.drop(list(stats_df.filter(regex="red.*?diff.*?")))]

### Elo Data
I will incorporate the elo ratings created in LoL_Elo_System into my dataset.

In [22]:
# Read in the elo_record dataframe.
elo_record_df = pd.read_pickle("elolol.pkl").drop("predict", axis=1)

# Drop all "id" columns and the "year" column.
elo_record_df = elo_record_df[elo_record_df.columns.drop(list(elo_record_df.filter(regex="playerid|champion")))].drop("year", axis=1)

In [23]:
elo_record_df.head()

Unnamed: 0_level_0,side,win_top_player_elo,win_jng_player_elo,win_mid_player_elo,win_bot_player_elo,win_sup_player_elo,win_team_elo,win_top_champ_elo,win_jng_champ_elo,win_mid_champ_elo,...,lose_mid_player_elo,lose_bot_player_elo,lose_sup_player_elo,lose_team_elo,lose_top_champ_elo,lose_jng_champ_elo,lose_mid_champ_elo,lose_bot_champ_elo,lose_sup_champ_elo,lose_total_elo
gameid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
TRLH3/33,1,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,...,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,39710.535446
TRLH3/44,1,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,...,1500.0,1500.0,1500.0,1500.0,1500.0,1502.305301,1500.0,1500.0,1497.695549,39710.520451
TRLH3/76,-1,1500.396867,1500.611265,1500.702681,1500.858904,1500.926882,1501.127031,1497.649661,1499.998669,1500.0,...,1499.296914,1499.1406,1499.072583,1498.872319,1502.350339,1500.0,1497.624497,1500.0,1495.389768,39691.187622
TRLH3/85,1,1499.603133,1499.388735,1499.297319,1499.141096,1499.073118,1498.872969,1500.0,1497.628714,1500.0,...,1500.703086,1500.8594,1500.927417,1501.127681,1499.93412,1500.0,1500.0,1500.001487,1502.305781,39729.38726
TRLH3/10072,-1,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.06588,1502.369955,1497.625867,...,1500.0,1500.0,1500.0,1500.0,1497.31512,1500.19901,1502.442088,1497.128939,1502.370412,39710.810896


Splitting the dataframe from having the elo ratings defined by the winning and losing team, to the columns being defined by Blue side and Red side.

In [24]:
# Split the dataframe into red and blue.  in LoL_Elo_System I coded the "side" column into 1 for Blue and -1 for Red.
elo_blue_df = elo_record_df[elo_record_df["side"] == 1]
elo_red_df = elo_record_df[elo_record_df["side"] == -1]

# Add changed column titles.
# The games in elo_blue_df are all the games that the team on Blue side team won.
# The games in elo_red_df are all the games that the team on Red side won.
elo_blue_df.columns = (elo_blue_df.columns
                       .str.replace("win", "blue")
                       .str.replace("lose", "red"))

elo_red_df.columns = (elo_red_df.columns
                      .str.replace("win", "red")
                      .str.replace("lose", "blue"))

# Join the 2 dataframes back together.
blue_red_elo_record_df = pd.concat([elo_blue_df, elo_red_df]).drop("side", axis=1)

# Add a new column for the difference between the total elo ratings of each team.
blue_red_elo_record_df["elo_diff"] = blue_red_elo_record_df["blue_total_elo"] - blue_red_elo_record_df["red_total_elo"]

# Merge the elo_record with the other features.  Drop all rows with missing data
stats_df = stats_df.merge(blue_red_elo_record_df, on="gameid")

In [25]:
print(f"There are {stats_df.shape[0]} games recorded in the dataset")
stats_df = stats_df.dropna()
print(f"There are {stats_df.shape[0]} games- without missing data- recorded in the dataset")
print(f"There are {stats_df.shape[1]} features in the dataset")

There are 66799 games recorded in the dataset
There are 66436 games- without missing data- recorded in the dataset
There are 938 features in the dataset


In [26]:
stats_df.head()

Unnamed: 0_level_0,blue_result,blue_top_goldat10,blue_top_xpat10,blue_top_csat10,blue_top_golddiffat10,blue_top_xpdiffat10,blue_top_csdiffat10,blue_top_killsat10,blue_top_assistsat10,blue_top_deathsat10,...,red_bot_player_elo,red_sup_player_elo,red_team_elo,red_top_champ_elo,red_jng_champ_elo,red_mid_champ_elo,red_bot_champ_elo,red_sup_champ_elo,red_total_elo,elo_diff
gameid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
TRLH3/33,1,3080.0,3907.0,57.0,-6.0,-87.0,0.0,2.0,0.0,0.0,...,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,39710.535446,0.0
TRLH3/44,1,3268.0,4433.0,83.0,69.0,97.0,1.0,1.0,0.0,1.0,...,1500.0,1500.0,1500.0,1500.0,1502.305301,1500.0,1500.0,1497.695549,39710.520451,-0.187854
TRLH3/76,0,2912.0,4257.0,72.0,-188.0,-311.0,-8.0,0.0,0.0,1.0,...,1500.858904,1500.926882,1501.127031,1497.649661,1499.998669,1500.0,1502.576391,1500.0,39728.812156,-37.624535
TRLH3/10072,0,2404.0,3087.0,46.0,-438.0,-893.0,-23.0,0.0,0.0,0.0,...,1500.0,1500.0,1500.0,1500.06588,1502.369955,1497.625867,1500.0,1502.304451,39711.083963,-0.273067
TRLH3/10087,1,2828.0,4325.0,71.0,-444.0,-417.0,-14.0,0.0,0.0,1.0,...,1498.257111,1498.119171,1497.713029,1500.0,1500.0,1500.341335,1502.946805,1504.940218,39674.451841,34.15075


Formatting the table

In [27]:
%%html
<style>
  table, th, td {
    text-align: left !important;
  }
</style>

### Feature Rundown

| Stat      | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | Recorded at What Time (minutes) |
|:----------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------------------------------|
| Kills     | This records the number of kills by the player/team.<br>- A kill is assigned to whatever player on the killing team dealt the final blow.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | z10, 15, 20, 25                 |
| Assists   | This records the number of assists by the player/team.<br>- An assist is assigned to any players on the killing team who took part in the kill but did not deal the final blow.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | z10, 15, 20, 25                 |
| Deaths    | This records the number of deaths by the player/team.<br>- A death is assigned to a player whenever they die.  They can be killed by players but also by turrets, neutral monsters and minions.<br>- When a player dies to a non-player, a kill is not recorded for the other team but a death is recorded.                                                                                                                                                                                                                                                                                                                                                          | 10, 15, 20, 25                  |
| CS        | This records the number of "things" killed by the player/team.<br>- Whenever a player kills a minion, champion, neutral monster, or ward, their CS increases by 1.- <br>(CS stands for "Creep Score", "creeps" being the old label for "minions".)                                                                                                                                                                                                                                                                                                                                                                                                                   | 10, 15, 20, 25                  |
| Gold      | This records the amount of gold the player/team has.<br>- Players receive a gold whenever they kill minions, champions, neutral monsters, wards, and turrets, as well as when they get "assists" for killing champions.  Killing "bigger" things gives more gold: killing a champion gives around 300 gold whereas killing a minion, around 20 gold.<br>- Players can use gold to buy items which increases their power.<br>- If a player has more gold than the opponent it doesn't just mean that they have killed more things than their opponent, but that they are more powerful.  This allows the player to further increase their positive gold differential. | 10, 15, 20, 25                  |
| XP        | This records the amount of XP (Experience Points) the player/team has.<br/>- Players receive XP whenever they are near enemy minions, enemy champions, and neutral monsters that die.  Like with gold, being near "bigger" things that die means receiving more XP.<br>- XP automatically increases the power of players that receive it.<br>- Like with gold, if a player has more XP than the opponent it means they are currently stronger than the enemy.  This allows the player to further increase their positive XP differential.                                                                                                                            | 10, 15, 20, 25                  |
| Elo       | This is the relative strength of the player/team/champion before the game starts.<br>- It is based on the previous results of the player/team/champion.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | Before the game starts          |

# Step 2: Machine Learning
I will use 3 different ML algorithms as well as PCA and data transformation.  I will also provide a brief explanation of the algorithms when I use them.
### Preparation for Machine Learning

In [28]:
# Split stats_df into train and test arrays
blue_result = stats_df["blue_result"]

array_train, array_test, result_train, result_test =train_test_split(
    stats_df.drop(labels=["blue_result"], axis=1),
    blue_result,
    test_size = 0.2,
    random_state=1)

In [29]:
times = ["0", "10", "15", "20", "25"]

# My comparison will include the accuracy of the model on:
# The test data.
# The F1 Score, to evaluate the model's balance of false positives and false negatives (due to there being more results equalling 1 than 0 in the dataset), on the test data.
# The accuracy on the train data to check for overfitting.
results_df = pd.DataFrame(columns=["Model", "Time", "Test Accuracy", "F-1", "Train Accuracy"])

# Recording the predictive effectiveness of Kills, Gold, and XP at the timed intervals.
for stat in ["kill", "gold", "xp"]:
    for time in times[1:]:
        temp_series = array_test[f"blue_team_{stat}diffat{time}"] > 0
        accuracy_test = accuracy_score(result_test, temp_series)
        f1 = f1_score(result_test, temp_series)
        accuracy_train = accuracy_score(result_train, array_train[f"blue_team_{stat}diffat{time}"] > 0)
        results_df.loc[f"{stat}_diff_at{time}"] = [f"{stat}_diff", time, accuracy_test, f1, accuracy_train]

# Recording the predictive effectiveness of "elo_diff".
# The Blue Side Advantage calculated in LoL_Elo_System
bsa = 23.463215
results_df.loc["elo_diff"] = ["elo_diff",
                              "0",
                              accuracy_score(result_test, array_test["elo_diff"]+ bsa > 0),
                              f1_score(result_test, array_test["elo_diff"]+ bsa > 0),
                              accuracy_score(result_train, array_train["elo_diff"]+ bsa > 0)]
results_df
# A full analysis of the different models will be conducted at the end of the notebook

Unnamed: 0,Model,Time,Test Accuracy,F-1,Train Accuracy
kill_diff_at10,kill_diff,10,0.618603,0.582124,0.622168
kill_diff_at15,kill_diff,15,0.675497,0.663125,0.677241
kill_diff_at20,kill_diff,20,0.734347,0.730122,0.729736
kill_diff_at25,kill_diff,25,0.795756,0.798005,0.793783
gold_diff_at10,gold_diff,10,0.670831,0.686766,0.674908
gold_diff_at15,gold_diff,15,0.721252,0.736894,0.727064
gold_diff_at20,gold_diff,20,0.775662,0.787147,0.773011
gold_diff_at25,gold_diff,25,0.833534,0.842383,0.833992
xp_diff_at10,xp_diff,10,0.64118,0.655442,0.649846
xp_diff_at15,xp_diff,15,0.700256,0.70655,0.701362


## Model 1: Logistic Regression

### The Basics
Logistic Regression multiplies each feature (x) by a constant (w).

The sum of all of these wx (feature multiplied by constant) is then put through a sigmoid function which finds the probability, a number between 0 and 1, of the Blue team winning.

Because this is a Logistic Regression function, and the predictions are binary, any value above 0.5 is classified as 1, and the rest are classified as 0.  These 1s and 0s are then put into a logloss function (the cost function that LogisticRegression uses).  This measures the difference between the predicted results and the actual results.

The partial derivative of the cost function for each individual w value is then calculated (as well as the partial derivative of the bias).

This is then multiplied by the learning rate (a constant) and then taken away from the w value so that on the next iteration every single w value has been modified by its own partial derivative of the cost function.  This process is known as gradient descent.

This process is repeated until the cost function reaches a minimum.

### Regularisation

Regularisation is used to prevent the model from overfitting (learning the training data so well that it can't effectively predict new data).

My model utilises both L1 and L2 regularisation.

L1 regularisation adjusts the w values, each iteration, to reduce the size of them.  Whatever value w is, it is pushed a fixed amount towards 0.<br>For small w values, its partial derivative can be a smaller value than the value of the adjustment of L1 regularisation.  This means that those smaller (less important) w values are pushed to 0 and so L1 regularisation engages in feature selection ny eliminating less important features.

The adjuster for L2 regularisation pushes w values towards 0 but instead of using a fixed value, it does so proportionately to the w values it is adjusting.  This means that it prevents the model from relying too much on a few large features.

In [75]:
from sklearn.linear_model import LogisticRegression

def logistic_regression(array_train, array_test, new_columns, info):

    # Scale the data.
    scaler = StandardScaler()
    array_train = scaler.fit_transform(array_train)
    array_test = scaler.transform(array_test)

    # Define my model and my parameters for my grid search.
    model = LogisticRegression(max_iter=100000,
                               penalty="elasticnet",
                               solver="saga")

    param_grid = {"C": [0.001, 0.01, 0.1],
                  "l1_ratio": [0.3, 0.5, 0.7]}

    lr_grid_search = GridSearchCV(estimator=model,
                                  param_grid=param_grid,
                                  cv=3,
                                  n_jobs=-1)

    # Train and test the regression model.
    lr_grid_search.fit(array_train, result_train)

    pred_result_test = lr_grid_search.predict(array_test)
    pred_result_train = lr_grid_search.predict(array_train)

    print(lr_grid_search.best_params_)

    # Extract the multipliers for each feature.
    rank = pd.Series(lr_grid_search.coef_.flatten(), index=new_columns).sort_values(ascending=False)

    # Find and record the accuracy and F1 score.
    accuracy_test = accuracy_score(result_test, pred_result_test)
    f1 = f1_score(result_test, pred_result_test)
    accuracy_train = accuracy_score(result_train, pred_result_train)
    results_df.loc[f"{info}log_reg_at{time}"] = ["log_reg", time, accuracy_test, f1, accuracy_train]

    return (accuracy_test,
            f1,
            accuracy_train,
            rank)

In [76]:
def timed_logistic_regression(array_train, array_test, time="None", info=""):

    def timed_feature_selection(array_train, array_test, string, info):

        # Selecting only correct columns to be used for the time interval.
        for c in array_train.columns:
            if "elo" in c or c[-2:] in string:
                new_columns.append(c)

        return logistic_regression(array_train[new_columns], array_test[new_columns], new_columns, info)

    new_columns = []

    # Assigning a string, with the selected and previous interval, in order to filter for the correct columns.
    if time == "10":
        return timed_feature_selection(array_train, array_test, "10", info)
    elif time == "15":
        return timed_feature_selection(array_train, array_test, "10 15", info)
    elif time == "20":
        return timed_feature_selection(array_train, array_test, "15 20", info)
    elif time == "25":
        return timed_feature_selection(array_train, array_test, "20 25", info)
    else:
        return timed_feature_selection(array_train, array_test, "None", info)

In [44]:
print("Test Accuracy", "F1 Score", "Train Accuracy")
for time in times:
    test_accuracy, f1, train_accuracy, rank = timed_logistic_regression(array_train, array_test, time)
    print(test_accuracy, f1, train_accuracy, time)
# We can see that as the game gets longer, the accuracy of the predictions increase as well, with slightly larger increases in accuracy between 20 and 25 than the others.

Test Accuracy F1 Score Train Accuracy
0.63237507525587 0.6659829059829059 0.6377286069090088 0
0.7056742925948224 0.725794012479843 0.7123504177015128 10
0.7417971101745936 0.756234458259325 0.754459245879431 15
0.792519566526189 0.8031698436496038 0.7938022126890946 20
0.845048163756773 0.8532744245706548 0.8466546248212539 25


In [45]:
print("gold:", rank.loc[rank.index.str.contains("gold")].abs().sum())
print("elo:",rank[rank.index.str.contains("elo")].abs().sum())
print("kills:",rank[rank.index.str.contains("kill")].abs().sum())
print("xp:",rank[rank.index.str.contains("xp")].abs().sum())
print("cs:",rank[rank.index.str.contains("cs")].abs().sum())

gold: 2.0959819563026705
elo: 0.6306501678069353
kills: 0.7275182637756563
xp: 1.0642165165305428
cs: 0.4717730343717551


We can see that for the prediction at 25 minutes, the most important feature is gold.

### Principal Component Analysis (PCA)
Principal Componenet Analysis is a way to reduce the dimensionality of the data.

First it finds the covariance matrix of the data.

Then it finds the eigenvectors (each with its eigenvalue) that make up the covariance matrix.

Each eigenvector represents a "principal component" (PC) of the data.  The larger the eigenvector's eigenvalue, the more variance it explains in the data.  The data is then projected onto these eigenvectors to create principal components.

By selecting the most important PCs, I can explain 95% of the data whilst massively reducing the data's dimensionality because there will be many PCs that don't explain much of the variation in the data (much more than 5%).

PCA doesn't standardise the principal components.  This means that if a principal component is 16x better at explaining variance in the data than a different one, the more important component's data will be 16x larger than the less important component's.  This reduces the time that the LogisticRegression ML algorithm needs to find the correct w values as the features are already scaled.

I would note that depending on the data, the principal components that explain the most variance may not be the principal components that best explain the target variable.

In [83]:
from sklearn.decomposition import PCA
def pca_logistic_regression(array_train, array_test, info):

    # Scale the data.
    scaler = StandardScaler()
    array_train = scaler.fit_transform(array_train)
    array_test = scaler.transform(array_test)

    # Apply PCA.
    pca = PCA(n_components=array_train.shape[1])
    pca_array_train = pca.fit_transform(array_train)
    pca_array_test = pca.transform(array_test)

    # Maintain 95% of the data of the original dataframe.
    percent95_of_data = np.argwhere(np.cumsum(pca.explained_variance_ratio_)>0.95).min()
    pca_array_train = pca_array_train[:,:percent95_of_data]
    pca_array_test = pca_array_test[:,:percent95_of_data]

    # Define my model and my parameters for my grid search.
    model = LogisticRegression(max_iter=100000,
                               penalty="elasticnet",
                               solver="saga")

    param_grid = {"C": [0.001, 0.005, 0.01, 0.02, 0.1],
                  "l1_ratio": [0.2, 0.4, 0.6, 0.8]}

    lr_grid_search = GridSearchCV(estimator=model,
                                  param_grid=param_grid,
                                  cv=6,
                                  n_jobs=-1)

    # Train and test the regression model.
    lr_grid_search.fit(pca_array_train, result_train)

    pred_result_test = lr_grid_search.predict(pca_array_test)
    pred_result_train = lr_grid_search.predict(pca_array_train)

    # Printing the best parameters.
    print(lr_grid_search.best_params_)


    # Find and record the accuracy and F1 score.
    accuracy_test = accuracy_score(result_test, pred_result_test)
    f1 = f1_score(result_test, pred_result_test)
    accuracy_train = accuracy_score(result_train, pred_result_train)
    results_df.loc[f"{info}pca_log_reg_at{time}"] = ["pca_log_reg", time, accuracy_test, f1, accuracy_train]

    return (accuracy_test,
            f1,
            accuracy_train)

In [84]:
def timed_pca_logistic_regression(array_train, array_test, time="None", info=""):

    def timed_pca_feature_selection(array_train, array_test, string, info):

        # Selecting only correct columns to be used for the time interval.
        for c in array_train.columns:
            if "elo" in c or c[-2:] in string:
                new_columns.append(c)

        return pca_logistic_regression(array_train[new_columns], array_test[new_columns], info)

    new_columns = []

    # Assigning a string, with the selected and previous interval, in order to filter for the correct columns.
    if time == "10":
        return timed_pca_feature_selection(array_train, array_test, "10", info)
    elif time == "15":
        return timed_pca_feature_selection(array_train, array_test, "10 15", info)
    elif time == "20":
        return timed_pca_feature_selection(array_train, array_test, "15 20", info)
    elif time == "25":
        return timed_pca_feature_selection(array_train, array_test, "20 25", info)
    else:
        return timed_pca_feature_selection(array_train, array_test, "None", info)

In [85]:
for time in times:
    test_accuracy, f1, train_accuracy = timed_pca_logistic_regression(array_train, array_test, time)
    print(test_accuracy, f1, train_accuracy, time)
# Comparable to the non-PCA results.  But the time to compute is much shorter

{'C': 0.01, 'l1_ratio': 0.8}
0.6295906080674293 0.664119011873891 0.6388387145330022 0
{'C': 0.1, 'l1_ratio': 0.8}
0.7055990367248646 0.7255507226041813 0.7124821253857153 10
{'C': 0.01, 'l1_ratio': 0.6}
0.742098133654425 0.7568986309143789 0.7539888612929931 15
{'C': 0.02, 'l1_ratio': 0.6}
0.7900361228175797 0.8011120615911036 0.7914502897569052 20
{'C': 0.01, 'l1_ratio': 0.6}
0.8436183022275737 0.8515926296243393 0.844359148039437 25


### Manual Feature Selection
The manual feature selection I will engage in is different from PCA in one main way.  Whilst Principal Component Analysis selects the eigenvectors of the data that best explain its variation, I will include the correlation of individual features with the result of games in my selection.

To do this, I will find all the features that have a high correlation to other features.  I will create a dataframe that has each pair of highly correlated features on a row.  A feature that is highly correlated with several others will appear in several pairs.

I will then drop features in each pair that has a lower correlation with the result.  Through this, I will be able to reduce the dimensionality of my data whilst maintaining the features most correlated to the result.

In [35]:
# Find the correlation of different features with each other.
stats_corr = pd.concat([array_train, result_train], axis=1).corr()

# Get rid of duplicate values by creating a triangular matrix.
correlation_df = (stats_corr.where(np.triu(np.ones(stats_corr.shape), k=1).astype(bool))
               .stack()
               .reset_index())

correlation_df.columns = ["feature_1", "feature_2", "R^2"]

# Creation of a dataframe to find the correlation of each statistic with the result.
blue_result_corr = (correlation_df[correlation_df["feature_2"] == "blue_result"]
                    .drop("feature_2", axis=1)
                    .rename({"feature_1":"feature", "R^2":"R^2"}, axis=1)
                    .set_index("feature"))#.squeeze().sort_values()

In [36]:
# Finding the features that least correlate to winning.
blue_result_corr[blue_result_corr["R^2"].abs() < 0.1].sort_values(by="R^2", ascending=False)

Unnamed: 0_level_0,R^2
feature,Unnamed: 1_level_1
adj_red_mid_deathsat10,0.099806
red_mid_deathsat10,0.097572
adj_red_sup_deathsat10,0.097337
blue_mid_assistsat10,0.097101
adj_blue_mid_assistsat10,0.096278
...,...
blue_top_deathsat10,-0.093810
adj_blue_top_deathsat10,-0.094754
red_bot_csat15,-0.095407
adj_red_mid_csat10,-0.097043


The features that least correlate to winning are the champion elo ratings, and the cs of the Support role at all time intervals.

Combined with champions having both a fast rate of change in the Elo System, a low impact on their team's overall rating within the Elo System, and their elo rating's correlation to the result being so low could mean several things for the use of champions in predicting games.<br>It could simply mean that champions are not very useful at predicting the result of games.  However, it could mean that the elo system is particularly bad at finding the strength of each champion due to the regular changes to champions in the fortnightly "patches" and that simply treating "champion" as a categorical variable may have produced better results.

We can also see that many of the stats at the 10 minute interval have a very weak correlation to the result.

In [37]:
# Features that are highly correlated.
correlation_df = correlation_df[correlation_df["R^2"] > 0.9]
print(f"There are {len(correlation_df)} combinations of features that have an r correlation between themselves of above 0.9.")
correlation_df = correlation_df[correlation_df["feature_1"].str[-2:] == correlation_df["feature_2"].str[-2:]]
print(f"Only including features that are recorded at the same time interval, there are {len(correlation_df)} combinations of features that have an r correlation between themselves of above 0.9.")

There are 777 combinations of features that have an r correlation between themselves of above 0.9.
Only including features that are recorded at the same time interval, there are 688 combinations of features that have an r correlation between themselves of above 0.9.


In [38]:
# Merge the 2 dataframes together.  The correlation between the features and the result (blue_result_corr) and the correlation of the features between themselves (correlation_df).

# Adds the blue_result_corr for all features in the column "feature_1" in correlation_df.
result_corr_feature_1 = pd.merge(correlation_df, blue_result_corr,
                                 how="left",
                                 left_on="feature_1",
                                 right_on="feature",
                                 suffixes=("", "_result_feature_1"))
# Adds the blue_result_corr for all features in the column "feature_2" in result_corr_feature_1 (renamed from correlation_df in previous line).
result_corr_both_features = pd.merge(result_corr_feature_1, blue_result_corr,
                                     how="left",
                                     left_on="feature_2",
                                     right_on="feature",
                                     suffixes=("", "_result_feature_2"))

# Feature Selection
# Find the features that have a high correlation with another feature but are less correlated with the result than the other feature.
list_to_drop = result_corr_both_features.apply(lambda row: row["feature_1"] if row["R^2_result_feature_1"] < row['R^2_result_feature_2'] else row["feature_2"], axis=1).to_list()

# Dropping these features from the dataset.
reduced_array_train = array_train.drop(columns=list_to_drop)
reduced_array_test = array_test.drop(columns=list_to_drop)

In [39]:
print(f"The original array_train has {array_train.shape[1]} features.")
print(f"The reduced_array_train has {reduced_array_train.shape[1]} features.")

The original array_train has 937 features.
The reduced_array_train has 490 features.


In [53]:
for time in times:
    test_accuracy, f1, train_accuracy, rank = timed_logistic_regression(reduced_array_train, reduced_array_test, time, "reduced_")
    print(test_accuracy, f1, train_accuracy, time)

Fitting 3 folds for each of 9 candidates, totalling 27 fits
0.6319987959060807 0.6656638862300014 0.6380672838112441 0
Fitting 3 folds for each of 9 candidates, totalling 27 fits
0.7056742925948224 0.7263312574347491 0.712613833069918 10
Fitting 3 folds for each of 9 candidates, totalling 27 fits
0.7428506923540036 0.7571946280110851 0.7562090765409799 15
Fitting 3 folds for each of 9 candidates, totalling 27 fits
0.7928205900060205 0.8035395703989153 0.7951757356814931 20
Fitting 3 folds for each of 9 candidates, totalling 27 fits
0.8447471402769416 0.8533655554765798 0.8471814555580642 25


In [86]:
for time in times:
    test_accuracy, f1, train_accuracy = timed_pca_logistic_regression(reduced_array_train, reduced_array_test, time, "reduced_")
    print(test_accuracy, f1, train_accuracy, time)

{'C': 0.01, 'l1_ratio': 0.8}
0.6295906080674293 0.664119011873891 0.6388575299164597 0
{'C': 0.005, 'l1_ratio': 0.6}
0.7071794099939795 0.7276926306949402 0.7123316023180553 10
{'C': 0.005, 'l1_ratio': 0.4}
0.740593016255268 0.7556185749734137 0.7529728305862874 15
{'C': 0.02, 'l1_ratio': 0.6}
0.7882299819385912 0.7990574121679521 0.7909610897870099 20
{'C': 0.1, 'l1_ratio': 0.4}
0.8430915111378687 0.8511245983577294 0.8439640249868292 25


In [None]:
# COMPARE THE RESULTS AND USE REDUCED DATA

### Data Transformation

Because I am looking for linear relationships between each feature and the result, if there are features that have a different type of correlation with the result, such as an exponential relationship, logistic regression will be unable to properly utilise those features.

Therefore, I will engage in data transformation and will square, cube, and find the log of each feature because based on my exploratory data analysis, these are the transformations that could improve the predictive value of features in my dataset.

I will also use the reduced_array_train rather than the original array_train because by transforming the data I am increasing its dimensionality by 3x and do not want the array to be too large.

In [42]:
def transform_stats(array):

    # Using MinMaxScaler because I will need all the values to be positive in order to effectively square and log them.
    scaler = MinMaxScaler()
    minmax_reduced_array = scaler.fit_transform(array)


    # I am adding a very small number before I use np.log to eliminate any 0s.
    transformed_array = np.hstack([minmax_reduced_array,
                                   np.square(minmax_reduced_array),
                                   np.power(3, minmax_reduced_array),
                                   np.log(minmax_reduced_array + 1e-8)])

    # Adding a prefix for all columns for identification.
    transformed_array_cols = ((("raw_") + array.columns)
    .append([("sqr_") + array.columns,
             ("cbe_") + array.columns,
             ("log_") + array.columns]))

    # Convert to a dataframe.
    return pd.DataFrame(transformed_array, columns=transformed_array_cols)

transformed_array_train = transform_stats(reduced_array_train)
transformed_array_test = transform_stats(reduced_array_test)

In [80]:
# Logistic regression using polynomials gives a very similar result to using the original linear data.
# I am not worried about overfitting due to the size of the dataset and the small number of polynomials used
for time in times:
    test_accuracy, f1, train_accuracy = timed_pca_logistic_regression(transformed_array_train, transformed_array_test, time, "transformed_")
    print(test_accuracy, f1, train_accuracy, time)

0.6082931968693558 0.6968373230822995 0.637371114623316 0
0.7003311258278145 0.7394320115168171 0.7121998946338527 10
0.7414960866947622 0.7551849476159932 0.7539888612929931 15
0.7782209512341962 0.7687357765047478 0.7924475050801535 20
0.8444461167971101 0.8536429936982227 0.8453563633626854 25


In [None]:
# Using the transformed statistics
for time in times:
    test_accuracy, f1, train_accuracy, rank = timed_logistic_regression(transformed_array_train, transformed_array_test, time, "transformed_")
    print(test_accuracy, f1, train_accuracy, time)
# Comparable to using just the raw original data.  Not particularly useful.  The relationships appear to be linear

0.618151715833835 0.6917375455650061 0.6382366222623617 0


# Model 2: Random Forest

The random forest ML algorithm uses decision trees rather than the linear model of logistic regression.

### Bagging
First, bootstrap samples are taken from the dataset.  This process, called bagging, selects a row at random and then returns the row to the dataset so that it can be selected again.  It selects as many rows for each bootstrap sample as there are rows in the dataset.  This means that each bootstrap sample is the same size as the original data but can have repeated rows.  Each individual tree will be trained on its own bootstrap sample.

### Decision Tree
Then, the model selects a random subset of features (equal in number to the square root of the total number of features), and evaluates every possible way to split the features and selects the split that reduces the impurity of the result the most.<br>Simply, it selects the feature that best separates the wins and the losses.<br>This increases the number of nodes (from 1 to 2).

The process repeats itself on each of the new nodes, splitting them as well, and continues to repeat until:
- The "max_depth" (number of successive splits) is reached.
- The number of rows at a node is less than "min_samples_split".
- The number of rows that would be left after a split is below the "min_samples_leaf".

The model trains many decision trees (determined by "n_estimators") each with their own bootstrap sample.

### Classification

To predict the test data, each row is passed through and classified by every single decision tree.  Each tree makes its own classification of whether the row will result in a win or loss (for the Blue team).  The model then takes the majority decision of the trees.


In [95]:
# Random Forests
from sklearn.ensemble import RandomForestClassifier
def random_forest(array_train, array_test):

    # Define my model and my parameters for my grid search.
    model = RandomForestClassifier(n_jobs=-1)

    param_grid = {"n_estimators": [100],
                  "max_depth": [15],
                  "min_samples_split": [40, 70, 100],
                  "min_samples_leaf": [20, 25, 30]}

    rf_grid_search = GridSearchCV(estimator=model,
                                  param_grid=param_grid,
                                  cv=3)

    # Train and Test.
    rf_grid_search.fit(array_train, result_train)

    pred_result_test = rf_grid_search.predict(array_test)
    pred_result_train = rf_grid_search.predict(array_train)

    # Printing the best parameters.
    print(rf_grid_search.best_params_)

    # Find and record the accuracy and F1 score.
    accuracy_test = accuracy_score(result_test, pred_result_test)
    f1 = f1_score(result_test, pred_result_test)
    accuracy_train = accuracy_score(result_train, pred_result_train)
    results_df.loc[f"random_forest_at{time}"] = ["random_forest", time, accuracy_test, f1, accuracy_train]

    return (accuracy_test,
            f1,
            accuracy_train)

In [96]:
def timed_random_forest(array_train, array_test, time="None"):

    def timed_random_forest_feature_selection(array_train, array_test, time):

        # Selecting only correct columns to be used for the time interval.
        for c in array_train.columns:
            if "result" in c or "elo" in c or c[-2:] in time:
                new_columns.append(c)

        return random_forest(array_train[new_columns], array_test[new_columns])

    new_columns = []

    # Assigning a string, with the selected and previous interval, in order to filter for the correct columns.
    if time == "10":
        return timed_random_forest_feature_selection(array_train, array_test, "10")
    elif time == "15":
        return timed_random_forest_feature_selection(array_train, array_test, "10 15")
    elif time == "20":
        return timed_random_forest_feature_selection(array_train, array_test, "15 20")
    elif time == "25":
        return timed_random_forest_feature_selection(array_train, array_test, "20 25")
    else:
        return timed_random_forest_feature_selection(array_train, array_test, "None")

In [97]:
accuracy_test, f1, accuracy_train = timed_random_forest(reduced_array_train, reduced_array_test, "25")

print(accuracy_test, f1, accuracy_train)
# For 0 mins:
# Best Params: {'n_estimators': 150, 'min_samples_split': 10, 'max_depth': 8} Best Accuracy: -0.63930909911944
# For 10 mins:
# Best Params: {'max_depth': 20, 'min_samples_leaf': 7, 'min_samples_split': 8, 'n_estimators': 450} Best Accuracy: -0.7090389102129901

{'max_depth': 15, 'min_samples_leaf': 20, 'min_samples_split': 40, 'n_estimators': 100}
0.8409843467790488 0.8497048154207268 0.8831000225784601


In [68]:
for time in times[1:]:
    accuracy_test, f1_score, accuracy_train = timed_random_forest(array_train, array_test, time)
    print(accuracy_test, f1_score, accuracy_train)
# For random forests we can see that the result is not largely different but slightly worse than logistic regression


no valid time given.  Using only data available before the match
Best Params: {'n_estimators': 150, 'min_samples_split': 10, 'max_depth': 8} Best Accuracy: -0.63930909911944
None
0.6319987959060807 0.6599683901557913 0.629174037182641 0
Best Params: {'n_estimators': 250, 'min_samples_split': 15, 'max_depth': 10} Best Accuracy: -0.7064047565289381
None
0.7013094521372667 0.7845074132610823 0.6990418510885114 10
Best Params: {'n_estimators': 250, 'min_samples_split': 15, 'max_depth': 10} Best Accuracy: -0.7438473696093926
None
0.7354756170981337 0.8167381651238053 0.7339268710393128 15


KeyboardInterrupt: 

# Model 3: Gradient Boost


### Creating the Decision Trees
The model first creates an initial prediction by predicting the mean probability of result_train (~0.53) for every single row.  It does this by putting the log-odds of result_train and putting that into a sigmoid function.  The model will be calculating log-odds

The model then finds the residuals (the difference between the predicted probability and the actual result) of each row.  The residual is also called the gradient.

Then, the model selects a random subset of features (determined by colsample_by_tree), and evaluates every possible way to split the features and selects the split that:
- Best maximises the square of the sum of residual values (to group together rows with similar errors).
- And minimises the sum of Hessian values (to prioritise correcting rows where the model is confidently incorrect).

This gives more weight to predicted probabilities that are very wrong.  It increases the number of nodes (from 1 to 2).

The process repeats itself on each of the new nodes, splitting them as well, and continues to repeat until:
- The "max_depth" (number of successive splits) is reached.
- The sum of hessian values (of rows) at a node is less than "min_child_weight".

### Calculating the Probability
Once the tree has finished splitting the data, the model goes through each individual leaf and calculates for each row in that leaf a new probability.

It first calculates:
- The Hessian value: probability * (1 - probability)
- The log-odds: log( probability / (1 - probability) )

Calculate Delta/Leaf Score:
- All of the residual values in the leaf are added together.
- All of the hessian values in the leaf are added together.
- The sum of the residuals is then divided by: the sum of the hessians + L2 regularisation constant.  The L2 regularisation constant is determined by "reg_lambda".
- This value is then multiplied by -1 (in order to minimise, rather than maximise, the loss function).  This value is Delta, and it is stored in the leaf as the Leaf Score.

Calculate updated probability for each individual row:
- Delta is then multiplied by the "learning_rate".
- This is added to the log-odds for each individual row.  (Each row has its own log-odds but the delta is shared by all rows in that leaf.)
- Then the new probability is calculated by putting the log-dds in the sigmoid function.
- Finally, the new residual value is calculated by finding the difference between the new probability and the actual result.

Then, the model takes the new residual values and moves onto the next tree.  The number of trees is determined by "n_estimators".

### Predicting the result
Each row is put through the first tree and follows the decision nodes until it makes it to a leaf.  Once it is at a leaf, the Leaf Score of that leaf is added to the row's log-odds (which starts at 0).

Then each row moves on to the next tree and gets a Leaf Score added to its log-odds, and then the next tree and Leaf Score, and so on.  When each row has passed through every single tree, the log-odds of each row is converted to probability through the sigmoid function.  If the output of the sigmoid function is above 0.5 the model predicts, for that row, that the Blue team will win, if not, it predicts that the red team will win.

In [40]:
from xgboost import XGBClassifier

def gradient_boost(array_train, array_test):

    # Define my model and the parameters for my grid search.
    model = XGBClassifier(objective="binary:logistic",
                          eval_metric="logloss",
                          n_jobs=-1)

    param_grid = {"n_estimators": [100],
                  "max_depth": [10],
                  "min_child_weight": [15, 25],
                  "learning_rate": [0.1, 0.2],
                  "reg_lambda": [0.1, 1]}

    xg_grid_search = GridSearchCV(estimator=model,
                                  param_grid=param_grid,
                                  cv=3)

    # Train and Test.
    xg_grid_search.fit(array_train, result_train)

    pred_result_test = xg_grid_search.predict(array_test)
    pred_result_train = xg_grid_search.predict(array_train)

    # Printing the best parameters.
    print(xg_grid_search.best_params_)

    # Find the accuracy
    accuracy_test = accuracy_score(result_test, pred_result_test)
    f1 = f1_score(result_test, pred_result_test)
    accuracy_train = accuracy_score(result_train, pred_result_train)
    results_df.loc[f"x_grad_boost_at{time}"] = ["x_grad_boost", time, accuracy_test, f1, accuracy_train]

    return (accuracy_test,
            f1,
            accuracy_train)

In [43]:
def timed_gradient_boost(array_train, array_test, time="None"):

    def timed_gradient_boost_feature_selection(array_train, array_test, time):

        # Selecting only correct columns to be used for the time interval.
        for c in array_train.columns:
            if "result" in c or "elo" in c or c[-2:] in time:
                new_columns.append(c)

        return gradient_boost(array_train[new_columns], array_test[new_columns])

    new_columns = []

    # Assigning a string, with the selected and previous interval, in order to filter for the correct columns.
    if time == "10":
        return timed_gradient_boost_feature_selection(array_train, array_test, "10")
    elif time == "15":
        return timed_gradient_boost_feature_selection(array_train, array_test, "10 15")
    elif time == "20":
        return timed_gradient_boost_feature_selection(array_train, array_test, "15 20")
    elif time == "25":
        return timed_gradient_boost_feature_selection(array_train, array_test, "20 25")
    else:
        return timed_gradient_boost_feature_selection(array_train, array_test, "None")

In [44]:
accuracy_test, f1, accuracy_train = timed_gradient_boost(reduced_array_train, reduced_array_test, "25")
print(accuracy_test, f1, accuracy_train, time)

{'learning_rate': 0.1, 'max_depth': 10, 'min_child_weight': 15, 'n_estimators': 100, 'reg_lambda': 0.1}
0.8422636965683323 0.8502215235100756 0.9341649732821555 25


In [52]:
for time in times:
    accuracy_test, f1, accuracy_train = timed_gradient_boost(array_train, array_test, time)
    print(accuracy_test, f1, accuracy_train, time)

no valid time given.  Using only data available before the match
0.6298163756773029 0.6605479262990822 0.6454617295100474 0
0.7034166164960867 0.7222104743779516 0.7196884172499436 10
0.7427001806140879 0.7575693114940083 0.7602731993678031 15
0.7901113786875377 0.8011691737363655 0.7982990893354407 20
0.8432420228777845 0.8514371300192569 0.8496462707909987 25


In [72]:
# Data analysis and inferences from regression