# S13 Group 8: Project - Phase 1

## Members:
- Ano, Joseph Thomas M.
- Limjoco, Jared Ethan D. 
- Nadela, Cymon Radjh O.

In this notebook, we will be using the [League of Legends Worlds 2021 Main Event - Champion Stats](https://www.kaggle.com/datasets/vincentbarletta/league-of-legends-worlds-champion-pb-dataset?select=Worlds+2021+Main+Event+-+Champion+Stats+-+OraclesElixir.csv) data set. The notebook covers four main parts: the **dataset description**, **data cleaning**, the **exploratory data analysis questions**, and the group's **proposed research questions**.



## Data Set Description

### Brief description of dataset
This data set lists every champion that was either picked or banned during the 2021 League of Legends World tournament. As mentioned on its Kaggle page, the main statistics in this data set include the champion name, position, pick rate, ban rate, pick+ban rate, and further individual champion statistics (kills, deaths, assists, KDA ratio, etc.). Each column also has a brief description explaining the variable.

### Description of the data collection process
The data was collected from a popular website called [Oracle's Elixir](https://oracleselixir.com/about), the premier source for advanced League of Legends esports statistics. All of the data comes from the analysts and data scrapers at Oracle's Elixir, who retrieve it from several sources, including Match History pages, [lolesports.com](https://lolesports.com), [lpl.QQ.com](https://lpl.qq.com), [Leaguepedia](https://lol.fandom.com/wiki/League_of_Legends_Esports_Wiki), the Riot Games solo queue APIs, and more.

### Structure of the data set

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# sets the theme of the charts
#plt.style.use('seaborn-darkgrid')

%matplotlib inline

In [2]:
lol_df = pd.read_csv('Worlds 2021 Main Event - Champion Stats - OraclesElixir.csv')

In [3]:
lol_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94 entries, 0 to 93
Data columns (total 25 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Champion  94 non-null     object
 1   Pos       94 non-null     object
 2   GP        94 non-null     int64 
 3   P%        94 non-null     object
 4   B%        94 non-null     object
 5   P+B%      94 non-null     object
 6   W%        94 non-null     object
 7   CTR%      94 non-null     object
 8   K         94 non-null     object
 9   D         94 non-null     object
 10  A         94 non-null     object
 11  KDA       94 non-null     object
 12  KP        94 non-null     object
 13  DTH%      94 non-null     object
 14  FB%       94 non-null     object
 15  GD10      94 non-null     object
 16  XPD10     94 non-null     object
 17  CSD10     94 non-null     object
 18  CSPM      94 non-null     object
 19  CS%P15    94 non-null     object
 20  DPM       94 non-null     object
 21  DMG%      94 non-n

The data set contains a total of **94 observations** and is made up of **25 variables**. Each row in the data set represents the game statistics of a champion that was either picked or banned during the tournament. You may notice that some observations share the same value for the **`Champion`** variable. This is because a champion can be played in more than one position, so the data set has one row/observation for each unique combination of the **`Champion`** and **`Pos`** variables.
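As a minimal sketch of how one might list the champions that appear in more than one position, the example below counts the distinct **`Pos`** values per champion. The frame `sample_df` and its champion names are hypothetical stand-ins for illustration; in this notebook the same calls would be applied to `lol_df`.

```python
import pandas as pd

# Hypothetical sample rows (not taken from the actual data set);
# in the notebook, lol_df would be used instead.
sample_df = pd.DataFrame({
    'Champion': ['Champion A', 'Champion A', 'Champion B', 'Champion C'],
    'Pos': ['Top', 'Middle', 'Jungle', 'Support'],
})

# Count how many distinct positions each champion appears in.
positions_per_champion = sample_df.groupby('Champion')['Pos'].nunique()

# Keep only the champions that appear in more than one position.
multi_position = positions_per_champion[positions_per_champion > 1]
print(multi_position)
```

In this toy frame only `Champion A` is kept, since it is the only champion listed under two positions.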

### Description of each variable
- **`Champion`**: Name of the champion picked/banned
- **`Pos`**: Position the champion played in. Values include Top, Middle, ADC, Jungle, and Support
- **`GP`**: Total number of games the champion was picked/played in this role
- **`P%`**: Percentage of games champion was picked in this role
- **`B%`**: Percentage of games in which the champion was banned (not tied to a specific role)
- **`P+B%`**: Percentage of games in which the champion was either banned or picked in any role
- **`W%`**: Win percentage of champion in a specific role
- **`CTR%`**: Counter-pick rate: percentage of games in which this player/champion was picked after their lane opponent 
- **`K`**: Total kills a champion had 
- **`D`**: Total deaths a champion had 
- **`A`**: Total assists a champion had 
- **`KDA`**: Total Kill/Death/Assist ratio for a champion
- **`KP`**: Kill participation, which is the percentage of the team's kills in which the champion earned a kill or assist
- **`DTH%`**: Average share of team’s deaths 
- **`FB%`**: Percentage of games in which the champion participated in First Blood (either a kill or an assist)
- **`GD10`**: Average gold difference at 10 minutes 
- **`XPD10`**: Average experience difference at 10 minutes
- **`CSD10`**: Average creep score difference at 10 minutes
- **`CSPM`**: Average monsters + minions killed per minute
- **`CS%P15`**: Average share of team's total CS post-15-minutes
- **`DPM`**: Average damage to champions per minute
- **`DMG%`**: Average share of team’s total damage to champions
- **`GOLD%`**: Average share of team’s total gold earned (excludes starting gold and inherent gold generation)
- **`WPM`**: Average wards placed per minute
- **`WCPM`**: Average wards cleared per minute
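
To make the **`KDA`** variable more concrete, here is a small worked sketch. It assumes the commonly used definition, (kills + assists) / deaths; the data set description itself does not spell out the formula, so treat this as an assumption. The function name `kda_ratio` and the numbers are hypothetical.

```python
def kda_ratio(kills, deaths, assists):
    """Return (kills + assists) / deaths.

    Assumed definition of KDA; not confirmed by the data set description.
    """
    if deaths == 0:
        # The ratio is undefined without deaths; returning None is an
        # illustrative choice, not how the data set handles this case.
        return None
    return (kills + assists) / deaths


# Hypothetical totals: 10 kills, 5 deaths, 20 assists.
print(kda_ratio(10, 5, 20))  # 6.0
```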

## Data Clean up

### Checking for Duplicate Data
The first thing we'll check is whether duplicate records exist in the data set. Since each record represents the statistics of a champion that was either picked (in a specific position) or banned, each record must have a unique combination of the **`Champion`** and **`Pos`** variables. To check this, we first select the **`Champion`** and **`Pos`** columns, which gives us a **dataframe** that we assign to a variable called `duplicate_test`. We then use the `duplicated` function, along with the `any` function, to check whether any records are duplicated based on these two variables.

In [4]:
duplicate_test = lol_df[['Champion', 'Pos']]
duplicate_data_exists = duplicate_test.duplicated().any()
print('Duplicated data exists: ', duplicate_data_exists)

Duplicated data exists:  False


We select only the **`Champion`** and **`Pos`** columns because calling `duplicated` on the original dataframe compares entire rows. Two records could share the same **`Champion`** and **`Pos`** values but differ in other columns. In that case, the check would return `False`, which would be misleading, since we want each record to have a unique combination of the **`Champion`** and **`Pos`** variables.
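
The difference between the two checks can be sketched on a small, hypothetical frame (`sample_df` and its values are made up for illustration). The `subset` parameter of `duplicated` restricts the comparison to the listed columns, which is equivalent to selecting those columns first.

```python
import pandas as pd

# Hypothetical rows: the first two share Champion and Pos but differ in GP.
sample_df = pd.DataFrame({
    'Champion': ['Champion A', 'Champion A', 'Champion B'],
    'Pos': ['Top', 'Top', 'Jungle'],
    'GP': [3, 5, 2],
})

# Comparing entire rows misses the repeated Champion/Pos pair.
full_row_duplicates = sample_df.duplicated().any()

# Comparing only the key columns catches it.
key_duplicates = sample_df.duplicated(subset=['Champion', 'Pos']).any()

print('Full-row check: ', full_row_duplicates)     # False
print('Champion/Pos check: ', key_duplicates)      # True
```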

### Missing Data

In [5]:
lol_df['Pos'].value_counts()

Top        24
Middle     21
Support    17
Jungle     16
ADC        14
-           2
Name: Pos, dtype: int64

When viewing the data set, you will notice that some records have a value of `-` in some of their variables. This is the placeholder used when a variable of a record would otherwise be empty. For example, the output above shows that two records have a value of `-` in their **`Pos`** variable. This is because these two records (champions) were never picked during the tournament but were banned at least once, which is why they are included in the data set with most of their variables set to `-`.
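
One way to see how widespread the `-` placeholder is would be to count it per column. The sketch below does this on a hypothetical frame (`sample_df` and its values are invented for illustration); in the notebook the same expression could be applied to `lol_df`.

```python
import pandas as pd

# Hypothetical rows; 'Champion B' mimics a champion that was only banned.
sample_df = pd.DataFrame({
    'Champion': ['Champion A', 'Champion B', 'Champion C'],
    'Pos': ['Top', '-', 'Jungle'],
    'GP': [3, 0, 2],
    'W%': ['67%', '-', '50%'],
})

# isin marks every cell equal to '-', and sum counts the marks per column.
placeholder_counts = sample_df.isin(['-']).sum()
print(placeholder_counts)
```

Columns with a non-zero count are the ones that contain placeholder values and would need attention during cleaning.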

Since the majority of their variables do not hold any meaningful value, and since there are only 2 such records out of a total of 94 entries, we believe it is best to remove these two records from the data set entirely. To do so, we first get the indices of the records that have a value of `-` in their **`Pos`** column. After getting the indices, we use the `drop` function to remove these records from the data set.

In [15]:
#Get indices
dropped_index = lol_df[ lol_df['Pos'] == '-' ].index

#Drop records
lol_df = lol_df.drop(dropped_index)

#Check new number of entries
print('Number of entries in the dataframe: ', lol_df.shape[0])

#Re-check the value count of the Pos variable
lol_df['Pos'].value_counts()

Number of entries in the dataframe:  92


Top        24
Middle     21
Support    17
Jungle     16
ADC        14
Name: Pos, dtype: int64

As you can see, the new number of entries in the data set is 92, which means that the two records we selected were removed. We can also confirm this by checking the value counts of the **`Pos`** variable, which now shows only 5 values. We can now be confident that there are no missing values in our data set, as each record will contain a champion in a specific