# Preprocessing match data
In this notebook we preprocess the ranked_matches.csv file to obtain a suitable input file for the models we want to run.

The ranked_matches.csv file is generated by the get_match_data.ipynb notebook.

The model input is the set of champions on the blue and red teams, encoded as a single vector with one entry per champion: +1 if the blue team picked that champion, -1 if the red team picked it, and 0 otherwise. The models then predict which team won, blue or red. In this notebook the result is stored as a single label per match: 1 for a blue win and 0 for a red win.
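
To make the encoding concrete, here is a minimal sketch using a made-up pool of six champions and a 2v2 match (real matches are 5v5 over the full champion list). The names `toy_champions`, `toy_index`, and `encode_match` are illustrative only and are not used elsewhere in this notebook.

```python
# Hypothetical toy pool; the notebook itself uses the full champion list from Data Dragon.
toy_champions = ['ahri', 'garen', 'jhin', 'kayn', 'lux', 'zyra']
toy_index = {name: i for i, name in enumerate(toy_champions)}


def encode_match(blue_team, red_team):
    """Return one vector per match: +1 for blue picks, -1 for red picks, 0 otherwise."""
    vector = [0] * len(toy_champions)
    for name in blue_team:
        vector[toy_index[name]] = 1
    for name in red_team:
        vector[toy_index[name]] = -1
    return vector


example = encode_match(blue_team=['ahri', 'jhin'], red_team=['garen', 'zyra'])
print(example)  # [1, -1, 1, 0, 0, -1]
```

The sketch assumes that no champion appears on both teams in the same match, so no position ever needs to hold +1 and -1 at once.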

### 1. Import packages and data

In [1]:
"""
@author: Mark Bugden
March 2023

Part of a ML project in predicting win rates for League of Legends games based on team composition.
Current update available on GitHub: https://github.com/Mark-Bugden
"""

# Import necessary packages
import requests
import pandas as pd
from ratelimit import limits, sleep_and_retry
import pickle
import math
import numpy as np
from matplotlib import pyplot as plt
import os
import glob
import h5py



from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression


# This gives us a progress bar for longer computations. 
from tqdm.notebook import tqdm
# To use it, just wrap any iterable with tqdm(iterable).
# Eg: 
# for i in tqdm(range(100)):
#     ....

In [13]:
# Set this to the location of the data folder on your computer
data_location = 'C:\\Users\\Mark\\Code\\LoL Win Prediction\\Data Collection\\'

In [14]:
# Here are the tiers and divisions
tier_list = ['DIAMOND', 'PLATINUM', 'GOLD', 'SILVER', 'BRONZE', 'IRON']
division_list = ['I', 'II', 'III', 'IV']

# Load the champion information
champion_url = 'http://ddragon.leagueoflegends.com/cdn/12.14.1/data/en_US/champion.json'
r = requests.get(champion_url)
json_data = r.json()
champion_data = json_data['data']

champions = list(champion_data.keys())
num_champs = len(champions)

# Some of the other data lists Fiddlesticks as FiddleSticks.
# To avoid mismatches like this, convert all champion names to lowercase.
champions = [champ.lower() for champ in champions]


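# Map each champion name to its index in the champions list, and back again
champ_to_num = {champ: idx for idx, champ in enumerate(champions)}
num_to_champ = {idx: champ for idx, champ in enumerate(champions)}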

# We can get champion information by accessing the champion_data dict
# Eg:
# champion_data['Zyra']

In [15]:
# Load the CSV file produced by the get_match_data.ipynb notebook
rankeddf = pd.read_csv(data_location + 'ranked_matches.csv') 



# Convert the champion names to lowercase as well, so they match our champions list.
rankeddf['championName'] = rankeddf['championName'].str.lower()
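
Before encoding, it can help to confirm that every lowercased name in the match data actually appears in the champion list, since a name missing from `champ_to_num` would raise a `KeyError` during encoding. The following is a minimal sketch on a tiny made-up DataFrame. `toy_df`, `known_champions`, and `find_unknown_champions` are hypothetical names; in the notebook the same check would be applied to `rankeddf` and `champions`.

```python
import pandas as pd

# Made-up stand-ins for rankeddf and champions, for illustration only.
known_champions = ['fiddlesticks', 'garen', 'jhin']
toy_df = pd.DataFrame({
    'matchId': ['M1', 'M1', 'M1'],
    'championName': ['FiddleSticks', 'Garen', 'NotAChampion'],
})
toy_df['championName'] = toy_df['championName'].str.lower()


def find_unknown_champions(df, champion_list):
    """Return the sorted champion names in df that are missing from champion_list."""
    seen = set(df['championName'].unique())
    return sorted(seen - set(champion_list))


unknown = find_unknown_champions(toy_df, known_champions)
print(unknown)  # ['notachampion']
```

An empty result means every row can be encoded. Any name that does show up points to a spelling difference between the two data sources.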

### 2. Format the data

In [16]:
# Here are three matches
rankeddf.head(30)

Unnamed: 0,matchId,team,win,championName,summonerName,gameMode
0,EUN1_3205133592,Blue,False,tahmkench,mariogrzyb321,420
1,EUN1_3205133592,Blue,False,elise,xBakuu,420
2,EUN1_3205133592,Blue,False,azir,Vecrone,420
3,EUN1_3205133592,Blue,False,jhin,UNfriendlyEwok,420
4,EUN1_3205133592,Blue,False,yuumi,metrosexual,420
5,EUN1_3205133592,Red,True,garen,DEMACI4,420
6,EUN1_3205133592,Red,True,kayn,zombieldtv,420
7,EUN1_3205133592,Red,True,cassiopeia,whimsical 11 19,420
8,EUN1_3205133592,Red,True,varus,ΞΞ ZielU ΞΞ,420
9,EUN1_3205133592,Red,True,xerath,savo2kk,420


In [17]:
rankeddf.shape

(213330, 6)

In [18]:
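# One-hot encode a list of champions
def onehotencodechampions(champs):
    '''One-hot encode a list of champion names.

    Returns a list of length len(champions) with a 1 at the index of each
    champion in champs and a 0 everywhere else.
    '''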
    integer_encoded = [champ_to_num[champ] for champ in champs]
    onehot_encoded = [0] * len(champions)
    for value in integer_encoded:
        onehot_encoded[value] = 1
    return onehot_encoded

In [19]:
# Put the data into the format used for training and testing: a features list and a labels list.
# Each element of features corresponds to one game and is the one-hot encoding of the champions in that game,
# with blue team champions represented by +1 and red team champions by -1.

# Each element of labels also corresponds to one game: 1 for a blue win, 0 for a red win.

matches = rankeddf['matchId'].unique().tolist()

features = []
labels = []

for match in tqdm(matches):
    blueteam = rankeddf[(rankeddf['matchId'] == match) & (rankeddf['team'] == 'Blue')]['championName'].tolist()
    redteam = rankeddf[(rankeddf['matchId'] == match) & (rankeddf['team'] == 'Red')]['championName'].tolist()
    
    blueteam = onehotencodechampions(blueteam)
    redteam = [value*-1 for value in onehotencodechampions(redteam)]
    
    features.append([sum(value) for value in zip(blueteam, redteam)])
    
    # All players on a team share the same result, so checking one blue-team row is enough
    blue_won = rankeddf[(rankeddf['matchId'] == match) & (rankeddf['team'] == 'Blue')].iloc[0]['win']
    labels.append(1 if blue_won else 0)
    
    
    # Since we are interested in team composition regardless of whether the team is red or blue,
    # we can double the number of matches by swapping the teams and the winner.
    # That is, the champion data swaps +1 and -1, and the label swaps 1 and 0.
    # We then randomly select matches later for testing and training.
    
#    features.append([-1*sum(value) for value in zip(blueteam, redteam)])
#    if rankeddf[(rankeddf['matchId'] == match) & (rankeddf['team'] == 'Blue')].iloc[0]['win'] == True:
#        labels.append(0)
#    else:
#        labels.append(1)
    
    

features = np.array(features)        
labels = np.array(labels)
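
The commented-out doubling step can also be applied after the arrays are built, because swapping sides just negates each feature vector and flips each label. Below is a minimal sketch on small made-up arrays. `toy_features`, `toy_labels`, and `augment_by_swapping_sides` are hypothetical names and not part of the notebook.

```python
import numpy as np

# Made-up data: 2 matches over a 4-champion pool (1v1 for brevity).
toy_features = np.array([[1, -1, 0, 0],
                         [0, 1, 0, -1]])
toy_labels = np.array([1, 0])  # 1 = blue win, 0 = red win


def augment_by_swapping_sides(features, labels):
    """Append a mirrored copy of every match with the teams and the winner swapped."""
    mirrored_features = -features   # +1 (blue) <-> -1 (red)
    mirrored_labels = 1 - labels    # 1 (blue win) <-> 0 (red win)
    return (np.concatenate([features, mirrored_features]),
            np.concatenate([labels, mirrored_labels]))


aug_features, aug_labels = augment_by_swapping_sides(toy_features, toy_labels)
print(aug_features.shape)  # (4, 4)
print(aug_labels)          # [1 0 0 1]
```

Unlike the commented-out code, which interleaves each match with its mirror, this sketch appends all mirrored matches at the end. If the doubled data is split randomly afterwards, a match and its mirror may land on opposite sides of the split, so it may be safer to split first and mirror only the training portion.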


In [20]:
features.shape

(21333, 161)
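
The loop above filters the whole DataFrame several times per match, which gets slow as the number of matches grows. One possible alternative, sketched here on made-up data, computes a row index per match, a column index per champion, and a sign per team, then writes all values into a preallocated array in one step. `toy_df`, `toy_champions`, `toy_champ_to_num`, and `build_features_labels` are hypothetical names. The sketch assumes each champion appears at most once per match and that every match has at least one blue-team row.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for champions and rankeddf (1v1 matches for brevity).
toy_champions = ['garen', 'jhin', 'kayn', 'zyra']
toy_champ_to_num = {champ: idx for idx, champ in enumerate(toy_champions)}
toy_df = pd.DataFrame({
    'matchId': ['M1', 'M1', 'M2', 'M2'],
    'team': ['Blue', 'Red', 'Blue', 'Red'],
    'win': [True, False, False, True],
    'championName': ['garen', 'jhin', 'zyra', 'kayn'],
})


def build_features_labels(df, champ_to_num):
    """Build the signed one-hot feature matrix and 0/1 labels without a per-match loop."""
    # factorize gives each match a row index in order of first appearance,
    # which is the same order as df['matchId'].unique().
    match_idx, match_ids = pd.factorize(df['matchId'])
    champ_idx = df['championName'].map(champ_to_num).to_numpy()
    sign = np.where(df['team'].to_numpy() == 'Blue', 1, -1)

    features = np.zeros((len(match_ids), len(champ_to_num)), dtype=int)
    features[match_idx, champ_idx] = sign

    # Every blue-team row of a match carries the same result, so take the first.
    blue = df[df['team'] == 'Blue']
    blue_win = blue.groupby('matchId', sort=False)['win'].first()
    labels = blue_win.reindex(match_ids).to_numpy().astype(int)
    return features, labels


toy_features, toy_labels = build_features_labels(toy_df, toy_champ_to_num)
print(toy_features.tolist())  # [[1, -1, 0, 0], [0, 0, -1, 1]]
print(toy_labels.tolist())    # [1, 0]
```

One behavioural difference from the loop: if the same champion were listed on both teams of a match, the loop would sum +1 and -1 to 0, while this sketch would keep whichever value is written last.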

In [25]:
# SAVE
with open('featureslabels.npy', 'wb') as f:
    np.save(f, features)
    np.save(f, labels)

In [27]:
# LOAD (the arrays come back in the order they were saved: features first, then labels)
#with open('featureslabels.npy', 'rb') as f:
#    features = np.load(f)
#    labels = np.load(f)
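
Saving two arrays back to back into one `.npy` file works, but the arrays must then be loaded in the same order they were saved. As an alternative sketch (not what the notebook does), `np.savez` stores both arrays under explicit names in a single `.npz` archive. The file name `featureslabels_example.npz` and the toy arrays are made up for illustration.

```python
import os
import tempfile

import numpy as np

toy_features = np.array([[1, -1, 0], [0, 1, -1]])
toy_labels = np.array([1, 0])

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'featureslabels_example.npz')
    np.savez(path, features=toy_features, labels=toy_labels)

    # np.load on an .npz file returns a dict-like object keyed by the names above.
    with np.load(path) as data:
        loaded_features = data['features']
        loaded_labels = data['labels']

print(np.array_equal(loaded_features, toy_features))  # True
print(loaded_labels)  # [1 0]
```

With named keys, the loading code no longer depends on remembering which array was written first.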