# Quest 1: Build a model for win prediction


![Banner image](assets/notebook_intro_any.png)

Welcome to AWS GameDay: LoL Esports Edition!

You are part of the highly capable, highly motivated, Demacia Data team tasked with developing secret technologies that can see into the future and predict events that have yet to happen.

2023 Worlds Pick’em has taken Demacia by storm this year; it’s all anyone talks about, and it's everywhere! Who do you think is going to win? What are everyone's Crystal Ball picks? Demacia is obsessed with winning Pick’em (Task 1)! Once we know all the secret answers, your team will be tasked with engineering your own Crystal Ball to predict League of Legends matches (Task 2). Maybe it can help us know who wins Worlds before anyone else!

Your team at Demacia Data has been developing a new experimental technology called Machine Learning that may be the answer. Your team's task is to create the most accurate win prediction model for professional League of Legends matches in the shortest amount of time.

For this quest we are using Amazon SageMaker Studio. It provides a single, web-based visual interface where we can perform all of the machine learning development steps, such as visualizing data, training a model, and making model predictions.



**Good Luck!!!**




## The cells marked Challenge Questions are harder problems that require coding knowledge; they can be skipped. The [Pandas documentation](https://pandas.pydata.org/) and [Stack Overflow](https://stackoverflow.com/) will be helpful for the Challenge Questions. You can also ask Amazon Q in the console for help, or use external services such as claude.ai.

# Task 1: Pick'em Sneak Peek (Warmup)

To warm up, let's see if we can answer some of the Crystal Ball questions from this year's Pick'em. For this we will use a different dataset: all games from the 2023 World Championship.


## Library Imports
The first step is to load all the Python libraries we will need for data analysis and model training.

In [None]:
%%capture --no-display
!sudo apt-get update
!sudo apt install font-manager -qq -y
!rm ~/.cache/matplotlib -rf

In [None]:
%%capture --no-display
!pip install yellowbrick
print("Installed yellowbrick library for model visualizations")

In [None]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
import utils
import plotly.express as px
from IPython.display import Markdown as md
print("Imported libraries for data analysis")

## Load the data
Now let's load the dataset. It contains match data from every game at Worlds.

In [None]:
df = pd.read_csv("LiveStats.csv")
df

## Visualize the data
The following cells render visualizations of the data. Enter your answer to each question on the GameDay Quest website.

# Diving into the data

When working with large datasets, it helps to dive into the data, visualize it, and understand the fields you are working with. Throughout the notebook you will reference three files:

### `LiveStats.csv` - `df` defined above
Contains a breakdown of every player's performance in every game at Worlds 2023.

To get column names for the challenge questions you can run `list(df)`

### `KDA.csv`
This is used in the KDA section. It contains the overall KDA of every player at Worlds 2023.

### `LoL_Data.csv`
Contains several years of professional games, including Worlds 2023 games. It is used only in the model training section. Each row compares team_x against team_y, and the data is sanitized to remove any bias from which team or region is playing.
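Before answering questions about a file, it can help to list its fields. The sketch below shows the idea on a small hypothetical DataFrame; the three column names come from this notebook, but the rows are made up and the real files have many more columns.

```python
import pandas as pd

# Hypothetical toy rows; the real LiveStats.csv has many more columns and rows.
toy = pd.DataFrame(
    {
        "playerName": ["PlayerA", "PlayerB"],
        "championName": ["Ahri", "Rell"],
        "CHAMPIONS_KILLED": [7, 1],
    }
)

# list(df) iterates over the column labels, same as list(df.columns)
column_names = list(toy)
# dtypes shows one data type per column, handy for spotting text vs numeric fields
column_types = toy.dtypes

print(column_names)
print(column_types)
```

The same two calls work on `df` once a CSV file is loaded.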


## Question 1.1: Who had the most champion kills in a single game?

During a game of League of Legends, the objective is to destroy the other team's Nexus. You play as a champion who can earn kills and assists, which bring in gold and help you gain an advantage to push down one of the three lanes. A higher number of kills and more gold earned by a player usually show who is winning a lane and usually reflect a lead.

For this first question, let's look at which player had the most champion kills in one game.

In [None]:
# You can set the column names to any valid data field
# To show figures in a notebook you can just use fig.show()
fig = px.bar(df, x="playerName", y="CHAMPIONS_KILLED", orientation="v")
fig.show()

#### Does this help us a lot? Yes but no.

While this graph does let us drill down to the answer, that might take longer than writing some code to sort the data and output the answer. So let's do that.

In [None]:
# sort_values sorts a DataFrame by a column, which is very helpful for digging into the data.
# Selecting .playerName after the sort drops every other column, so list(...) gives the player
# names ordered by CHAMPIONS_KILLED (highest first). The first entry, index 0, is our answer.

# A notebook cell displays the value of its last line only when that line is an expression
# that is not assigned to a variable and does not end with a semicolon.

# Nothing is displayed here because we are assigning to a variable
playername = list(df.sort_values(by="CHAMPIONS_KILLED", ascending=False).playerName)[0]

# but we can put the variable on the last line to display it
playername
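
Sorting the whole DataFrame is one way to find the top row. As an alternative sketch, pandas also offers `idxmax` and `nlargest`, shown here on hypothetical toy data (the player names and numbers are made up for illustration).

```python
import pandas as pd

# Hypothetical toy data using the same column names as LiveStats.csv
toy = pd.DataFrame(
    {
        "playerName": ["PlayerA", "PlayerB", "PlayerC"],
        "CHAMPIONS_KILLED": [4, 11, 7],
    }
)

# idxmax returns the index label of the row holding the largest value
top_row = toy.loc[toy["CHAMPIONS_KILLED"].idxmax()]
top_player = top_row["playerName"]

# nlargest keeps the top n rows, which also shows the runners-up
top_two = toy.nlargest(2, "CHAMPIONS_KILLED")

print(top_player)
print(top_two)
```

`nlargest` is convenient when you want to check how close second place was rather than only the winner.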

## Challenge Question 1.2: Which player had the most assists in a single game?

Use what you learned above to find out which player had the most assists in a single game.

In [None]:
# Code here

## Challenge Question 1.3: Which team has played the most different (unique) champions?

In [None]:
# Enter code

## Question 1.4: Which champion with at least 5 games played has the highest win-rate at Worlds?

### Champion Win Rate
The following visualization plots each champion's win rate against the number of games played.

In [None]:
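# Each row is one player in one game, so summing a column of ones counts games per champion
df["games_played"] = 1
# Combine several palettes so champions are less likely to share a color
color_seq = px.colors.qualitative.Dark24 + px.colors.qualitative.Alphabet + px.colors.qualitative.Pastel1 + px.colors.qualitative.Pastel2
# Per champion: total games played and win rate (the mean of WINNER)
df_champions = df.groupby("championName").agg({"games_played": "sum", "WINNER": "mean"}).reset_index()
fig = px.scatter(df_champions, x="games_played", y="WINNER", color="championName", size="games_played", color_discrete_sequence=color_seq)
fig.show()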
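
The scatter plot can get crowded, so a table filtered by a minimum number of games may be easier to read. This is a minimal sketch on hypothetical toy data; it assumes `WINNER` is stored as 0/1, and `MIN_GAMES` is an illustrative threshold chosen for the toy rows.

```python
import pandas as pd

# Hypothetical toy data: one row per champion per game, WINNER assumed to be 0 or 1
toy = pd.DataFrame(
    {
        "championName": ["Ahri", "Ahri", "Ahri", "Rell", "Rell", "Renekton"],
        "WINNER": [1, 1, 0, 0, 1, 1],
    }
)

# Count the rows per champion and average WINNER to get a win rate
stats = (
    toy.groupby("championName")
    .agg(games_played=("WINNER", "size"), win_rate=("WINNER", "mean"))
    .reset_index()
)

MIN_GAMES = 2  # illustrative threshold for the toy data
eligible = stats[stats["games_played"] >= MIN_GAMES]
ranked = eligible.sort_values("win_rate", ascending=False)

print(ranked)
```

The filter matters because a champion with a single win has a perfect win rate that says little about its strength.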

## Challenge Question 1.5: Which champion has the most total deaths over all games?

In [None]:
# Enter code

## Kills/Deaths/Assists (KDA) Data
The following visualization aggregates all the games in the dataset and highlights each player's average KDA. For the next few questions, we will look at KDA and how it relates to T1 winning Worlds 2023.

### What is KDA and how is it calculated:
KDA stands for kills/deaths/assists and is calculated with a simple formula: (kills + assists) / deaths. Players generally believe this stat reflects a player's performance in a game. At the pro level, however, KDA can be a misleading measure of how a player is performing overall in a game or tournament, and typical values differ between roles.
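
The formula can be written as a small function. This is a sketch only: the function name `kda` is hypothetical, and the zero-deaths handling is an assumed convention, since the ratio is undefined when a player never dies.

```python
def kda(kills: int, deaths: int, assists: int) -> float:
    """Return (kills + assists) / deaths.

    Assumption: with zero deaths the ratio is undefined, so this sketch
    returns kills + assists, which is the same as dividing by 1.
    """
    if deaths == 0:
        return float(kills + assists)
    return (kills + assists) / deaths


print(kda(5, 2, 9))  # (5 + 9) / 2
print(kda(3, 0, 4))  # zero deaths: 3 + 4
```

Note that both calls give the same value even though the second player never died, which is one reason KDA alone can mislead.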


## Question 1.6: Which player has the highest KDA in Worlds 2023?

In [None]:
df_kda_10 = pd.read_csv("KDA.csv")
fig = px.bar(df_kda_10, x="name", y="kda", orientation="v")
fig.show()

## Challenge Question 1.7: What was the best KDA for a single game? 

If the player had 0 deaths, consider that a perfect game (the highest possible KDA); for this question, use the sum of the player's kills and assists as the KDA in that case.

**Notes**: 
- Use `df`, as it has the per-player breakdown. The fields you will need are: `CHAMPIONS_KILLED`, `ASSISTS`, `NUM_DEATHS`
- You will need to create a new column from the above fields. We recommend looking at the documentation of `DataFrame.apply`
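
As a hint for how `DataFrame.apply` builds a new column row by row, here is a sketch on an unrelated toy example. The columns `gold` and `minutes` and the function `gold_per_minute` are hypothetical and are not fields of `LiveStats.csv`.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the challenge fields
toy = pd.DataFrame({"gold": [9000, 12000], "minutes": [30, 0]})


def gold_per_minute(row: pd.Series) -> float:
    # Guard against dividing by zero, the same kind of special case
    # that a ratio-based stat needs
    if row["minutes"] == 0:
        return 0.0
    return row["gold"] / row["minutes"]


# axis=1 passes each row to the function; the results become the new column
toy["gold_per_minute"] = toy.apply(gold_per_minute, axis=1)

print(toy)
```

With `axis=1` the function receives one row at a time, so it can combine several columns and handle special cases with ordinary `if` statements.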


In [None]:
# Write code

## Challenge Question 1.8: How many total kills did players playing the champions Renekton, Ahri, and Rell get across all games at Worlds 2023?

In [None]:
#Write code here

## Challenge Question 1.9: Which player played Alistar the most at Worlds 2023?

In [None]:
#Write code here

# Task 2: Model Training

Now that we have honed some of our data analysis skills on the Worlds dataset, let's jump to something more ambitious: using all games from Worlds-qualifying regions in 2022 to build a model that predicts the outcome of games using only data from the 15-minute mark.

In [None]:
df = pd.read_csv("LoL_Data.csv")
df

## Model Training
This dataset strips out information such as team names and other identifiers. Each row has team-based stats such as tower kills or champion kills. The target column is `winningteamoutput`.

The following cell trains a model that uses the values of all the other columns to predict which team will win.
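
To make the split between the target and the features concrete, here is a sketch on hypothetical toy rows. It only illustrates the idea and is not the implementation of `utils.train_and_eval_model`, whose internals this notebook does not show; the values in `winningteamoutput` are made up.

```python
import pandas as pd

# Hypothetical toy rows; the column names appear in LoL_Data.csv, the values are made up
toy = pd.DataFrame(
    {
        "winningteamoutput": [1, 0, 1],
        "dragonkills_x": [2, 0, 1],
        "dragonkills_y": [0, 1, 1],
    }
)

TARGET = "winningteamoutput"

# y is the column the model learns to predict
y = toy[TARGET]
# X is every other column, the inputs the model predicts from
X = toy.drop(columns=[TARGET])

print(list(X.columns))
print(list(y))
```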


In [None]:
# Train and evaluate the model with our helper function
init_model, model_score = utils.train_and_eval_model(df)
md(f"## The model accuracy is {round(model_score,3)}")

## Model Results
The initial results indicate that, using data from the 15-minute mark, this model predicts the winner of a game correctly around 73% of the time.

Here is a more detailed breakdown of what each of the numbers means.

### Precision
Precision can be seen as a measure of a classifier’s exactness. For each class, it is defined as the ratio of true positives to the sum of true and false positives. Said another way, “for all instances classified positive, what percent was correct?”

### Recall
Recall is a measure of the classifier’s completeness; the ability of a classifier to correctly find all positive instances. For each class, it is defined as the ratio of true positives to the sum of true positives and false negatives. Said another way, “for all instances that were actually positive, what percent was classified correctly?”

### F1
The F1 score is a weighted harmonic mean of precision and recall such that the best score is 1.0 and the worst is 0.0. Generally speaking, F1 scores are lower than accuracy measures as they embed precision and recall into their computation. As a rule of thumb, the weighted average of F1 should be used to compare classifier models, not global accuracy.
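
The three definitions can be checked with a small worked example. This is a sketch using made-up counts for a single class and the standard unweighted F1 formula; the helper name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp: int, fp: int, fn: int) -> tuple:
    """Return (precision, recall, f1) for one class from raw counts."""
    # Precision: of everything predicted positive, how much was right
    precision = tp / (tp + fp)
    # Recall: of everything actually positive, how much was found
    recall = tp / (tp + fn)
    # F1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts: 8 true positives, 2 false positives, 4 false negatives
precision, recall, f1 = classification_metrics(tp=8, fp=2, fn=4)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Because the harmonic mean is pulled toward the smaller of the two values, F1 drops quickly when either precision or recall is poor.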

## Choose Your Features!!!

![Chaos event](assets/lol_gameday_team_v_team_banner.png)

To test how well you understand League of Legends, we want to see if you can train a model using at most **6 features**.

Your team must decide which features you want to use for the model. 

We will provide some sample models for you to get started, but ultimately your team must decide which model you want to submit for this quest.
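
While experimenting, it is easy to go over the feature limit by accident. The helper below is a hypothetical sketch (the name `build_columns` is not part of `utils`) that checks a candidate feature list and returns the column list to select from the DataFrame.

```python
TARGET = "winningteamoutput"
MAX_FEATURES = 6


def build_columns(features):
    """Return [TARGET] + features after checking the feature limit."""
    # dict.fromkeys removes duplicates while keeping the original order
    unique = list(dict.fromkeys(features))
    if TARGET in unique:
        raise ValueError("the target column cannot also be a feature")
    if len(unique) > MAX_FEATURES:
        raise ValueError(f"{len(unique)} features given, at most {MAX_FEATURES} allowed")
    return [TARGET] + unique


columns = build_columns(["dragonkills_x", "dragonkills_y"])
print(columns)
```

The returned list can be used to select columns, for example `df[columns]`.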



In [None]:
## This model is based on baron kills

# Team stats
df_baron = df[
    [
        "winningteamoutput",
        "baronkills_x",
        "baronkills_y",
    ]
]

baron_model, model_score = utils.train_and_eval_model(df_baron)
md(f"## The model accuracy is {round(model_score,3)}")

## Bad Model?
Why do you think the baron-only model didn’t produce good results?  
**Note:** This is not a question on the Quest website

In [None]:
# Dragon and Minions
df_db = df[
    [
        "winningteamoutput",
        "dragonkills_x",
        "dragonkills_y",
        'team_x_minions_killed',
        'team_y_minions_killed',
    ]
]

db_model, model_score = utils.train_and_eval_model(df_db)
md(f"## The model accuracy is {round(model_score,3)}")

In [None]:
# Wards placed and killed
df_w = df[
    [
        "winningteamoutput",
        "team_x_ward_kills",
        "team_y_ward_kills",
        "team_x_ward_placed",
        "team_y_ward_placed",
    ]
]

w_model, model_score = utils.train_and_eval_model(df_w)
md(f"## The model accuracy is {round(model_score,3)}")

In [None]:
# Damage Model

df_dhc = df[
    [
        "winningteamoutput",
        "team_x_physical_damage_dealt_to_champions",
        "team_y_physical_damage_dealt_to_champions",
        "team_x_magic_damage_dealt_to_champions",
        "team_y_magic_damage_dealt_to_champions",
        "team_x_true_damage_dealt_to_champions",
        "team_y_true_damage_dealt_to_champions",
    ]
]

dhc_model, model_score = utils.train_and_eval_model(df_dhc)
md(f"## The model accuracy is {round(model_score,3)}")

## Feature List
Here is the list of features. Which will you choose for your model?

In [None]:
list(df.columns)

## Training your model
Try different feature sets before submitting. Once your feature set is submitted, you will not be able to change it, and you must train your final model with exactly that feature set.

In [None]:
## Create your own model to test
df_my_data = df[
    [
        "winningteamoutput", # REQUIRED FIELD
        ### Insert up to 6 features
    ]
]

MY_MODEL, model_score = utils.train_and_eval_model(df_my_data)
md(f"## The model accuracy is {round(model_score,3)}")

## 2.1 Model features
The code below builds a comma-separated string of your model's features to enter on the GameDay website.

In [None]:
# Join every column except the required target into a comma-separated string
input_string = ",".join(col for col in df_my_data.columns if col != "winningteamoutput")
input_string

## 2.2 Upload your model
When you’re ready to upload your model, run the following cell to create a model endpoint.  
In the Quest UI, enter the features you have chosen and the model’s name.

This will run the evaluation script that will score your model based on the number of correct predictions from the Riot evaluation set (correct_predictions/total_number_of_games).

This will take around 5 minutes.
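
The score described above, correct predictions divided by the total number of games, can be sketched as follows. The function name `accuracy` and the sample labels are hypothetical; the actual evaluation script is not shown in this notebook.

```python
def accuracy(predicted, actual):
    """Return correct_predictions / total_number_of_games."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one game is required")
    # Count the games where the predicted winner matches the real winner
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Made-up labels for four games: two predictions match the real outcome
score = accuracy([1, 0, 1, 1], [1, 1, 1, 0])
print(score)
```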


In [None]:
## Enter your model
utils.create_model_endpoint(MY_MODEL)