# Data Description

## Data Source

* PUBG Match Deaths and Statistics, Kaggle 
    \- https://www.kaggle.com/skihikingkevin/pubg-match-deaths

## Data Introduction

This Kaggle dataset contains over 720,000 competitive matches from the popular game PlayerUnknown's Battlegrounds (PUBG). The data was extracted from pubg.op.gg, a game tracker website.


### PlayerUnknown's Battlegrounds

PUBG is a first/third-person shooter battle royale game that matches over 90 players on a large island, where teams and players fight to the death until one remains. Players are airdropped from an airplane onto the island, where they scavenge towns and buildings for weapons, ammo, armor, and first-aid supplies. Players then decide to either fight or hide, with the ultimate goal of being the last one standing. A bluezone appears a few minutes into the game and corrals players closer and closer together by dealing damage to anyone inside the bluezone and sparing whoever is within the safe zone.


### The Dataset

This dataset provides two zips: aggregate and deaths.

In **deaths**, the files record every death that occurred within the 720k matches. That is, each row documents an event where a player has died in the match.

In **aggregate**, each match's meta information and player statistics are summarized (as provided by PUBG). It includes aggregate statistics such as player kills, damage, and distance walked, as well as match metadata such as queue size, fpp/tpp, and date.
The uncompressed data is divided into 5 chunks of approximately 2 GB each.

### Columns in deaths

1. killed_by: Weapon or cause of death that killed the victim
1. killer_name: In-game ID of the killer
1. killer_placement: Final ranking of the killer's team
1. killer_position_x: X coordinate of the killer at the time of the kill
1. killer_position_y: Y coordinate of the killer at the time of the kill
1. map: Game map (Erangel island / Miramar desert)
1. match_id: Unique match ID
1. time: Time of the kill, in seconds after the match starts
1. victim_name: In-game ID of the victim
1. victim_placement: Final ranking of the victim's team
1. victim_position_x: X coordinate of the victim at the time of the kill
1. victim_position_y: Y coordinate of the victim at the time of the kill

### Columns in aggregate

1. date: Start time of the match
1. game_size: Number of teams in the match
1. match_id: Unique match ID
1. match_mode: Game mode (first-person / third-person view)
1. party_size: Squad size (1, 2, or 4 players)
1. player_assists: Number of assists
1. player_dbno: Number of times the player was knocked down
1. player_dist_ride: Distance traveled by vehicle
1. player_dist_walk: Distance traveled on foot
1. player_dmg: Damage dealt
1. player_kills: Number of kills
1. player_name: In-game ID of the player
1. player_survive_time: Player survival time, in seconds
1. team_id: The player's team number
1. team_placement: Final ranking of the player's team

# Libraries and Data Loading

## Libraries

In [None]:
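import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
from scipy import stats        # stats.anderson and stats.kruskal are used in the tier tests
import scikit_posthocs as sp   # sp.posthoc_conover is used in the tier tests
from tqdm.auto import tqdm
tqdm.pandas()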

In [None]:
plt.rc('font', family='AppleGothic')
plt.rc('axes', unicode_minus=False)

In [None]:
pd.set_option('display.float_format', lambda x: '%.3f' % x)

## Data

In [None]:
data_dir = '../dataset/raw/'

In [None]:
agg_0 = pd.read_csv(data_dir + 'aggregate/agg_match_stats_0.csv')
agg_1 = pd.read_csv(data_dir + 'aggregate/agg_match_stats_1.csv')
agg_2 = pd.read_csv(data_dir + 'aggregate/agg_match_stats_2.csv')
agg_3 = pd.read_csv(data_dir + 'aggregate/agg_match_stats_3.csv')
agg_4 = pd.read_csv(data_dir + 'aggregate/agg_match_stats_4.csv')

In [None]:
deaths_0 = pd.read_csv(data_dir + 'deaths/kill_match_stats_final_0.csv')
deaths_1 = pd.read_csv(data_dir + 'deaths/kill_match_stats_final_1.csv')
deaths_2 = pd.read_csv(data_dir + 'deaths/kill_match_stats_final_2.csv')
deaths_3 = pd.read_csv(data_dir + 'deaths/kill_match_stats_final_3.csv')
deaths_4 = pd.read_csv(data_dir + 'deaths/kill_match_stats_final_4.csv')
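
The ten `read_csv` calls above follow one naming pattern, so they could also be written as a loop. The following is a minimal sketch rather than the notebook's own code: `load_chunks` is a hypothetical helper, and the file-name templates in the usage comment are assumed from the paths used above.

```python
from pathlib import Path

import pandas as pd


def load_chunks(data_dir, template, n_chunks=5):
    """Read numbered CSV chunks into a list of DataFrames (hypothetical helper)."""
    base = Path(data_dir)
    return [pd.read_csv(base / template.format(i)) for i in range(n_chunks)]


# Usage, assuming the directory layout used in this notebook:
# agg_list = load_chunks('../dataset/raw/', 'aggregate/agg_match_stats_{}.csv')
# deaths_list = load_chunks('../dataset/raw/', 'deaths/kill_match_stats_final_{}.csv')
```

Keeping the chunks in a list also lets later per-chunk steps run in a single loop instead of five near-identical lines.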

# Data preprocessing

## Dropping NA rows from aggregate

In [None]:
agg_0 = agg_0.dropna()
agg_1 = agg_1.dropna()
agg_2 = agg_2.dropna()
agg_3 = agg_3.dropna()
agg_4 = agg_4.dropna()

## Dropping match_mode
* Every row has the value 'tpp'

In [None]:
del agg_0['match_mode']
del agg_1['match_mode']
del agg_2['match_mode']
del agg_3['match_mode']
del agg_4['match_mode']
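
Before dropping a column on the grounds that it is constant, the claim can be checked programmatically. This is a small sketch on toy data (the frame below is illustrative, not taken from the dataset); it finds every single-valued column and drops it with `drop`, which returns a new frame instead of mutating in place like `del`.

```python
import pandas as pd

# Toy frame standing in for one aggregate chunk (values are illustrative).
agg = pd.DataFrame({
    'match_id': ['m1', 'm1', 'm2'],
    'match_mode': ['tpp', 'tpp', 'tpp'],
    'party_size': [1, 1, 2],
})

# Columns with a single distinct value carry no information for the analysis.
constant_cols = [c for c in agg.columns if agg[c].nunique(dropna=False) == 1]

# drop() returns a new frame and leaves the original untouched.
agg = agg.drop(columns=constant_cols)
```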

## Removing match_ids that do not appear in both datasets

In [None]:
agg_list = [agg_0, agg_1, agg_2, agg_3, agg_4]

In [None]:
match_id_list =[]
for i in agg_list:
    match_id_list += [x for x in i['match_id'].unique()]

In [None]:
deaths_0 = deaths_0[deaths_0['match_id'].isin(match_id_list)]
deaths_1 = deaths_1[deaths_1['match_id'].isin(match_id_list)]
deaths_2 = deaths_2[deaths_2['match_id'].isin(match_id_list)]
deaths_3 = deaths_3[deaths_3['match_id'].isin(match_id_list)]
deaths_4 = deaths_4[deaths_4['match_id'].isin(match_id_list)]
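
The same filter can be sketched with a set of valid IDs, which removes duplicates across chunks before the membership test. The data below is a toy stand-in, and `valid_ids` is a name introduced only for this illustration.

```python
import pandas as pd

agg_chunks = [
    pd.DataFrame({'match_id': ['m1', 'm1', 'm2']}),
    pd.DataFrame({'match_id': ['m3']}),
]
deaths = pd.DataFrame({
    'match_id': ['m1', 'm2', 'm9'],
    'killed_by': ['AKM', 'M416', 'Punch'],
})

# Collect every match_id seen in any aggregate chunk; the set drops duplicates.
valid_ids = set()
for chunk in agg_chunks:
    valid_ids.update(chunk['match_id'].unique())

# Keep only death rows whose match also exists in aggregate.
deaths = deaths[deaths['match_id'].isin(valid_ids)]
```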

## Handling NA values in the deaths data

In [None]:
deaths_0.isnull().mean()*100

In [None]:
deaths_1.isnull().mean()*100

In [None]:
deaths_2.isnull().mean()*100

In [None]:
deaths_3.isnull().mean()*100

In [None]:
deaths_4.isnull().mean()*100

### map na drop

In [None]:
deaths_0['map'].unique()

* Check whether missing map values can be imputed

In [None]:
# Collect the match_ids whose map is NA

d0_map_na_match_id = [x for x in deaths_0.loc[deaths_0['map'].isnull(), 'match_id'].unique()]
d1_map_na_match_id = [x for x in deaths_1.loc[deaths_1['map'].isnull(), 'match_id'].unique()]
d2_map_na_match_id = [x for x in deaths_2.loc[deaths_2['map'].isnull(), 'match_id'].unique()]
d3_map_na_match_id = [x for x in deaths_3.loc[deaths_3['map'].isnull(), 'match_id'].unique()]
d4_map_na_match_id = [x for x in deaths_4.loc[deaths_4['map'].isnull(), 'match_id'].unique()]

map_na_match_id = []
map_na_match_id = d0_map_na_match_id + d1_map_na_match_id + d2_map_na_match_id + d3_map_na_match_id + d4_map_na_match_id 

In [None]:
# Collect the match_ids whose map is present

d0_map_match_id = [x for x in deaths_0.loc[deaths_0['map'].isin(['MIRAMAR', 'ERANGEL']), 'match_id'].unique()]
d1_map_match_id = [x for x in deaths_1.loc[deaths_1['map'].isin(['MIRAMAR', 'ERANGEL']), 'match_id'].unique()]
d2_map_match_id = [x for x in deaths_2.loc[deaths_2['map'].isin(['MIRAMAR', 'ERANGEL']), 'match_id'].unique()]
d3_map_match_id = [x for x in deaths_3.loc[deaths_3['map'].isin(['MIRAMAR', 'ERANGEL']), 'match_id'].unique()]
d4_map_match_id = [x for x in deaths_4.loc[deaths_4['map'].isin(['MIRAMAR', 'ERANGEL']), 'match_id'].unique()]

map_match_id = []
map_match_id = d0_map_match_id + d1_map_match_id + d2_map_match_id + d3_map_match_id + d4_map_match_id

In [None]:
[map_na_match_id.index(x) for x in map_match_id if x in map_na_match_id]
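
The overlap check above asks whether any match has both labelled and unlabelled rows. As a sketch on toy data, the same question can be phrased as a set intersection, which avoids the repeated list scans; the variable names here are illustrative only.

```python
import pandas as pd

deaths = pd.DataFrame({
    'match_id': ['m1', 'm1', 'm2', 'm3'],
    'map': ['ERANGEL', 'ERANGEL', None, 'MIRAMAR'],
})

ids_missing_map = set(deaths.loc[deaths['map'].isnull(), 'match_id'])
ids_with_map = set(deaths.loc[deaths['map'].notnull(), 'match_id'])

# A match could only be imputed if it has both labelled and unlabelled rows.
imputable_ids = ids_missing_map & ids_with_map
```

An empty intersection means no missing `map` value can be recovered from another row of the same match.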

* Judged not imputable -> drop

In [None]:
deaths_0 = deaths_0[deaths_0['map'].notnull()]
deaths_1 = deaths_1[deaths_1['map'].notnull()]
deaths_2 = deaths_2[deaths_2['map'].notnull()]
deaths_3 = deaths_3[deaths_3['map'].notnull()]
deaths_4 = deaths_4[deaths_4['map'].notnull()]

### Dropping NA rows from the deaths data

In [None]:
deaths_0 = deaths_0.dropna()
deaths_1 = deaths_1.dropna()
deaths_2 = deaths_2.dropna()
deaths_3 = deaths_3.dropna()
deaths_4 = deaths_4.dropna()

In [None]:
deaths_0.isnull().mean()*100

## Merging the data

* key columns: agg.match_id = deaths.match_id, agg.player_name = deaths.killer_name

In [None]:
# The key column names must match
# Rename killer_name in deaths to player_name

def chg_cols(df_lst, cols):
    for i in tqdm(df_lst):
        i.columns = cols

In [None]:
deaths_lst = [deaths_0, deaths_1, deaths_2, deaths_3, deaths_4]
agg_lst = [agg_0, agg_1, agg_2, agg_3, agg_4]

In [None]:
col_names = ['killed_by', 'player_name', 'killer_placement', 'killer_position_x',
             'killer_position_y', 'map', 'match_id', 'time', 'victim_name',
             'victim_placement', 'victim_position_x', 'victim_position_y']

In [None]:
# Align the key column names

chg_cols(deaths_lst, col_names)
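
Assigning a full list to `columns` depends on the column order being exactly as expected. A less order-sensitive alternative, sketched here on a toy frame, renames only the join key by name with `DataFrame.rename`.

```python
import pandas as pd

deaths = pd.DataFrame({
    'killer_name': ['alice'],
    'match_id': ['m1'],
    'victim_name': ['bob'],
})

# Rename only the join key; all other columns keep their names and order.
deaths = deaths.rename(columns={'killer_name': 'player_name'})
```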

## Merging the Aggregate and Deaths data

* Confirmed via match_id that files with the same number pair up
    * e.g. agg_0 is merged with deaths_0

In [None]:
df_0 = pd.merge(agg_0, deaths_0, how='left', on=['match_id', 'player_name'])
df_1 = pd.merge(agg_1, deaths_1, how='left', on=['match_id', 'player_name'])
df_2 = pd.merge(agg_2, deaths_2, how='left', on=['match_id', 'player_name'])
df_3 = pd.merge(agg_3, deaths_3, how='left', on=['match_id', 'player_name'])
df_4 = pd.merge(agg_4, deaths_4, how='left', on=['match_id', 'player_name'])
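
A left merge on `['match_id', 'player_name']` has two effects worth keeping in mind: a player with several kills appears once per kill, and a player with no kills keeps a single row whose deaths columns are NA. The sketch below shows both on toy data; the `indicator=True` option is added here for illustration and is not part of the merge above.

```python
import pandas as pd

agg = pd.DataFrame({
    'match_id': ['m1', 'm1'],
    'player_name': ['alice', 'bob'],
    'player_kills': [2, 0],
})
deaths = pd.DataFrame({
    'match_id': ['m1', 'm1'],
    'player_name': ['alice', 'alice'],  # killer name after renaming
    'victim_name': ['carol', 'dave'],
    'map': ['ERANGEL', 'ERANGEL'],
})

merged = pd.merge(agg, deaths, how='left',
                  on=['match_id', 'player_name'], indicator=True)

# alice appears once per kill; bob has no kill rows, so his deaths columns are NaN.
rows_per_player = merged.groupby('player_name').size()
unmatched = merged.loc[merged['_merge'] == 'left_only', 'player_name'].tolist()
```

Because per-match statistics such as `player_kills` are repeated on every kill row, later aggregations need to collapse to one row per player and match first, as the score and KDA steps do with `groupby(['player_name', 'match_id'])`.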

In [None]:
df_0[df_0['party_size']==2].isnull().sum()

## Splitting by map

* Split the dataset into ERANGEL and MIRAMAR

In [None]:
E_match_id_0 = list(df_0.loc[df_0['map']=='ERANGEL', 'match_id'].unique())
E_match_id_1 = list(df_1.loc[df_1['map']=='ERANGEL', 'match_id'].unique())
E_match_id_2 = list(df_2.loc[df_2['map']=='ERANGEL', 'match_id'].unique())
E_match_id_3 = list(df_3.loc[df_3['map']=='ERANGEL', 'match_id'].unique())
E_match_id_4 = list(df_4.loc[df_4['map']=='ERANGEL', 'match_id'].unique())

E_match_id = E_match_id_0 + E_match_id_1 + E_match_id_2 + E_match_id_3 + E_match_id_4

In [None]:
# Check for duplicates

len(E_match_id), len(set(E_match_id))

In [None]:
M_match_id_0 = list(df_0.loc[df_0['map']=='MIRAMAR', 'match_id'].unique())
M_match_id_1 = list(df_1.loc[df_1['map']=='MIRAMAR', 'match_id'].unique())
M_match_id_2 = list(df_2.loc[df_2['map']=='MIRAMAR', 'match_id'].unique())
M_match_id_3 = list(df_3.loc[df_3['map']=='MIRAMAR', 'match_id'].unique())
M_match_id_4 = list(df_4.loc[df_4['map']=='MIRAMAR', 'match_id'].unique())

M_match_id = M_match_id_0 + M_match_id_1 + M_match_id_2 + M_match_id_3 + M_match_id_4

In [None]:
# Check for duplicates

len(M_match_id), len(set(M_match_id))

In [None]:
len(E_match_id + M_match_id), len(set(E_match_id + M_match_id))

In [None]:
# Players with no kill record get NA in the deaths columns after the merge
# map can be filled in from match_id

df_0.loc[(df_0['match_id'].isin(E_match_id))&(df_0['map'].isnull()), 'map'] = 'ERANGEL'
df_1.loc[(df_1['match_id'].isin(E_match_id))&(df_1['map'].isnull()), 'map'] = 'ERANGEL'
df_2.loc[(df_2['match_id'].isin(E_match_id))&(df_2['map'].isnull()), 'map'] = 'ERANGEL'
df_3.loc[(df_3['match_id'].isin(E_match_id))&(df_3['map'].isnull()), 'map'] = 'ERANGEL'
df_4.loc[(df_4['match_id'].isin(E_match_id))&(df_4['map'].isnull()), 'map'] = 'ERANGEL'

In [None]:
df_0.loc[(df_0['match_id'].isin(M_match_id))&(df_0['map'].isnull()), 'map'] = 'MIRAMAR'
df_1.loc[(df_1['match_id'].isin(M_match_id))&(df_1['map'].isnull()), 'map'] = 'MIRAMAR'
df_2.loc[(df_2['match_id'].isin(M_match_id))&(df_2['map'].isnull()), 'map'] = 'MIRAMAR'
df_3.loc[(df_3['match_id'].isin(M_match_id))&(df_3['map'].isnull()), 'map'] = 'MIRAMAR'
df_4.loc[(df_4['match_id'].isin(M_match_id))&(df_4['map'].isnull()), 'map'] = 'MIRAMAR'
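
Filling `map` from `match_id` can also be sketched as a lookup: build one label per match from the rows where it is known, then fill only the missing entries. The frame and the name `match_to_map` below are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    'match_id': ['m1', 'm1', 'm2', 'm2', 'm3'],
    'map': ['ERANGEL', None, None, 'MIRAMAR', None],
})

# One map label per match, taken from the rows where it is known.
match_to_map = (df.dropna(subset=['map'])
                  .drop_duplicates('match_id')
                  .set_index('match_id')['map'])

# Fill only the missing entries; matches with no known label stay NaN.
df['map'] = df['map'].fillna(df['match_id'].map(match_to_map))
```

Matches in which no player has a kill row (here `m3`) still have no label after this step and drop out when the data is split by map.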

In [None]:
df_0_E = df_0[df_0['map']=='ERANGEL']
df_1_E = df_1[df_1['map']=='ERANGEL']
df_2_E = df_2[df_2['map']=='ERANGEL']
df_3_E = df_3[df_3['map']=='ERANGEL']
df_4_E = df_4[df_4['map']=='ERANGEL']

In [None]:
df_0_M = df_0[df_0['map']=='MIRAMAR']
df_1_M = df_1[df_1['map']=='MIRAMAR']
df_2_M = df_2[df_2['map']=='MIRAMAR']
df_3_M = df_3[df_3['map']=='MIRAMAR']
df_4_M = df_4[df_4['map']=='MIRAMAR']

In [None]:
df_4_E.isnull().sum()

In [None]:
df_4_M.isnull().sum()

## Splitting the data by party_size

* Tiers differ by party_size, so the data is split accordingly

In [None]:
df_0_solo_E = df_0_E[df_0_E['party_size']==1]
df_1_solo_E = df_1_E[df_1_E['party_size']==1]
df_2_solo_E = df_2_E[df_2_E['party_size']==1]
df_3_solo_E = df_3_E[df_3_E['party_size']==1]
df_4_solo_E = df_4_E[df_4_E['party_size']==1]

In [None]:
df_0_solo_M = df_0_M[df_0_M['party_size']==1]
df_1_solo_M = df_1_M[df_1_M['party_size']==1]
df_2_solo_M = df_2_M[df_2_M['party_size']==1]
df_3_solo_M = df_3_M[df_3_M['party_size']==1]
df_4_solo_M = df_4_M[df_4_M['party_size']==1]

In [None]:
df_0_duo_E = df_0_E[df_0_E['party_size']==2]
df_1_duo_E = df_1_E[df_1_E['party_size']==2]
df_2_duo_E = df_2_E[df_2_E['party_size']==2]
df_3_duo_E = df_3_E[df_3_E['party_size']==2]
df_4_duo_E = df_4_E[df_4_E['party_size']==2]

In [None]:
df_0_duo_M = df_0_M[df_0_M['party_size']==2]
df_1_duo_M = df_1_M[df_1_M['party_size']==2]
df_2_duo_M = df_2_M[df_2_M['party_size']==2]
df_3_duo_M = df_3_M[df_3_M['party_size']==2]
df_4_duo_M = df_4_M[df_4_M['party_size']==2]

In [None]:
df_0_squad_E = df_0_E[df_0_E['party_size']==4]
df_1_squad_E = df_1_E[df_1_E['party_size']==4]
df_2_squad_E = df_2_E[df_2_E['party_size']==4]
df_3_squad_E = df_3_E[df_3_E['party_size']==4]
df_4_squad_E = df_4_E[df_4_E['party_size']==4]

In [None]:
df_0_squad_M = df_0_M[df_0_M['party_size']==4]
df_1_squad_M = df_1_M[df_1_M['party_size']==4]
df_2_squad_M = df_2_M[df_2_M['party_size']==4]
df_3_squad_M = df_3_M[df_3_M['party_size']==4]
df_4_squad_M = df_4_M[df_4_M['party_size']==4]

## Building one dataset per party_size and map

In [None]:
df_solo_E = pd.concat([df_0_solo_E, df_1_solo_E, df_2_solo_E, df_3_solo_E, df_4_solo_E])

In [None]:
df_duo_E = pd.concat([df_0_duo_E, df_1_duo_E, df_2_duo_E, df_3_duo_E, df_4_duo_E])

In [None]:
df_squad_E = pd.concat([df_0_squad_E, df_1_squad_E, df_2_squad_E, df_3_squad_E, df_4_squad_E])

In [None]:
df_solo_M = pd.concat([df_0_solo_M, df_1_solo_M, df_2_solo_M, df_3_solo_M, df_4_solo_M])

In [None]:
df_duo_M = pd.concat([df_0_duo_M, df_1_duo_M, df_2_duo_M, df_3_duo_M, df_4_duo_M])

In [None]:
df_squad_M = pd.concat([df_0_squad_M, df_1_squad_M, df_2_squad_M, df_3_squad_M, df_4_squad_M])

In [None]:
del df_solo_E['party_size']
del df_duo_E['party_size']
del df_squad_E['party_size']
del df_solo_M['party_size']
del df_duo_M['party_size']
del df_squad_M['party_size']
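
The split by map and party size can be expressed compactly with `groupby`. The sketch below uses toy data; the `mode_names` mapping and the dictionary keys are assumptions chosen to mirror the variable and file names used in this notebook.

```python
import pandas as pd

df = pd.DataFrame({
    'map': ['ERANGEL', 'ERANGEL', 'MIRAMAR', 'MIRAMAR'],
    'party_size': [1, 2, 2, 4],
    'player_name': ['a', 'b', 'c', 'd'],
})

mode_names = {1: 'solo', 2: 'duo', 4: 'squad'}

# One sub-frame per (map, party size) pair, keyed like 'duo_E' or 'squad_M'.
subsets = {}
for (map_name, party_size), group in df.groupby(['map', 'party_size']):
    key = '{}_{}'.format(mode_names[party_size], map_name[0])
    subsets[key] = group.drop(columns='party_size')
```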

## Grouping killed_by values

In [None]:
def killed_by_refine(df):
    df['killed_by'] = df['killed_by'].replace({'death.WeapSawnoff_C': 'sawed_off', 'death.PlayerMale_A_C': 'Punch', 'death.PG117_A_01_C':'Boat' , 'death.RedZoneBomb_C':'RedZone'})
    df['killed_by'] = df['killed_by'].replace(['Pickup Truck','Hit by Car','Buggy','Dacia','Motorbike','Motorbike (SideCar)','Uaz','Van'], 'land_vehicle')
    df['killed_by'] = df['killed_by'].replace(['death.ProjMolotov_C', 'death.ProjMolotov_DamageField_C', 'death.Buff_FireDOT_C'], 'Molotov')
    df['killed_by'] = df['killed_by'].replace(['Aquarail','Boat'], 'water_vehicle')

In [None]:
killed_by_refine(df_solo_E)

In [None]:
killed_by_refine(df_duo_E)

In [None]:
killed_by_refine(df_squad_E)

In [None]:
killed_by_refine(df_solo_M)

In [None]:
killed_by_refine(df_duo_M)

In [None]:
killed_by_refine(df_squad_M)
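
As a sketch of the same grouping idea, the chained replacements can be flattened into a single lookup table and applied by a function that returns a new Series instead of modifying its argument. `KILLED_BY_GROUPS` and `refine_killed_by` are hypothetical names, and the table below covers only a subset of the values handled above.

```python
import pandas as pd

# Subset of the groupings used above, flattened into one lookup table.
KILLED_BY_GROUPS = {
    'death.WeapSawnoff_C': 'sawed_off',
    'death.PlayerMale_A_C': 'Punch',
    'death.RedZoneBomb_C': 'RedZone',
    'death.PG117_A_01_C': 'water_vehicle',
    'Aquarail': 'water_vehicle',
    'Boat': 'water_vehicle',
    'Buggy': 'land_vehicle',
    'Dacia': 'land_vehicle',
    'death.ProjMolotov_C': 'Molotov',
}


def refine_killed_by(series):
    """Return a new Series with raw causes replaced by their group."""
    return series.replace(KILLED_BY_GROUPS)


raw = pd.Series(['Buggy', 'death.PG117_A_01_C', 'AKM', 'death.ProjMolotov_C'])
refined = refine_killed_by(raw)
```

Values that are not in the table, such as `AKM`, pass through unchanged.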

## Exporting to CSV

In [None]:
df_solo_E.to_csv('../dataset/preprocessing/solo_E.csv', index=False)

In [None]:
df_duo_E.to_csv('../dataset/preprocessing/duo_E.csv', index=False)

In [None]:
df_squad_E.to_csv('../dataset/preprocessing/squad_E.csv', index=False)

In [None]:
df_solo_M.to_csv('../dataset/preprocessing/solo_M.csv', index=False)

In [None]:
df_duo_M.to_csv('../dataset/preprocessing/duo_M.csv', index=False)

In [None]:
df_squad_M.to_csv('../dataset/preprocessing/squad_M.csv', index=False)

# Outlier handling

## Duo

In [None]:
df_M = pd.read_csv('../dataset/preprocessing/duo_M.csv')

In [None]:
df_E = pd.read_csv('../dataset/preprocessing/duo_E.csv')

In [None]:
pd.set_option('display.float_format', lambda x: '%.3f' % x)

In [None]:
df_E.shape, df_M.shape

In [None]:
df_E.isnull().sum()

In [None]:
df_M.isnull().sum()

In [None]:
df_M.describe()

In [None]:
df_E.describe()

In [None]:
df_E_raw = df_E.copy()

In [None]:
df_M_raw = df_M.copy()

### game_size
* Fewer than 40 teams

In [None]:
df_E = df_E.loc[df_E['game_size'] >= 40]

In [None]:
df_M = df_M.loc[df_M['game_size'] >= 40]

In [None]:
df_E['team_placement'].value_counts()

In [None]:
df_M['team_placement'].value_counts()

### player_dist_ride, player_dist_walk
* ride: over 30 km
* walk: over 10 km

In [None]:
df_E = df_E.loc[df_E['player_dist_ride'] <= 30000]

In [None]:
df_M = df_M.loc[df_M['player_dist_ride'] <= 30000]

In [None]:
df_E = df_E.loc[df_E['player_dist_walk'] <= 10000]

In [None]:
df_M = df_M.loc[df_M['player_dist_walk'] <= 10000]

In [None]:
df_E['team_placement'].value_counts()

In [None]:
df_M['team_placement'].value_counts()

### player_kills, player_dmg
* kill: over 30 kills
* dmg: over 3000 damage

In [None]:
plt.ticklabel_format(style='plain')
sns.scatterplot(data=df_E[['player_dmg', 'player_kills']], x='player_dmg', y='player_kills')

In [None]:
plt.ticklabel_format(style='plain')
sns.scatterplot(data=df_M[['player_dmg', 'player_kills']], x='player_dmg', y='player_kills')

In [None]:
df_E = df_E.drop(df_E.loc[(df_E['player_kills'] > 30 ) | (df_E['player_dmg'] > 3000)].index)

In [None]:
df_M = df_M.drop(df_M.loc[(df_M['player_kills'] > 30 ) | (df_M['player_dmg'] > 3000)].index)

In [None]:
df_E['team_placement'].value_counts()

In [None]:
df_M['team_placement'].value_counts()

### kill_dist
* Over 400 m (a `kill_dist` above 40000 in the code below)

In [None]:
df_E['kill_dist'] = np.sqrt((df_E['killer_position_x'] - df_E['victim_position_x'])**2 
                             + (df_E['killer_position_y'] - df_E['victim_position_y'])**2)

In [None]:
df_M['kill_dist'] = np.sqrt((df_M['killer_position_x'] - df_M['victim_position_x'])**2 
                             + (df_M['killer_position_y'] - df_M['victim_position_y'])**2)
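
`kill_dist` is the Euclidean distance between killer and victim, $\sqrt{(x_k - x_v)^2 + (y_k - y_v)^2}$. The sketch below computes it on toy coordinates with `np.hypot` and shows how rows without kill data behave against the 40000 threshold used in this section.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'killer_position_x': [0.0, 1000.0, np.nan],
    'killer_position_y': [0.0, 1000.0, 5.0],
    'victim_position_x': [3.0, 1000.0, 1.0],
    'victim_position_y': [4.0, 42000.0, 2.0],
})

# Euclidean distance between killer and victim; np.hypot computes sqrt(dx**2 + dy**2).
df['kill_dist'] = np.hypot(df['killer_position_x'] - df['victim_position_x'],
                           df['killer_position_y'] - df['victim_position_y'])

# NaN distances (players without kill rows) compare False against the threshold,
# so they are not flagged as outliers.
is_outlier = df['kill_dist'] > 40000
```

This is why the drop-based filter on `kill_dist` keeps players who have no kill record, whereas a filter written as `kill_dist <= 40000` would remove them.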

In [None]:
plt.ticklabel_format(style='plain')
df_E.loc[df_E['kill_dist'] < 40000, 'kill_dist'].hist(bins=100)

In [None]:
plt.ticklabel_format(style='plain')
df_M.loc[df_M['kill_dist'] < 40000, 'kill_dist'].hist(bins=100)

In [None]:
df_E = df_E.drop(df_E[df_E['kill_dist'] > 40000].index)

In [None]:
df_M = df_M.drop(df_M[df_M['kill_dist'] > 40000].index)

### player_assists, player_dbno
* assist: used as is
* dbno: more than 11 times

In [None]:
df_E = df_E.drop(df_E.loc[df_E['player_dbno'] > 11].index)

In [None]:
df_M = df_M.drop(df_M.loc[df_M['player_dbno'] > 11].index)

In [None]:
df_E['team_placement'].value_counts()

In [None]:
df_M['team_placement'].value_counts()

### player_survive_time
* Over 1900 seconds

In [None]:
df_E_1 = df_E.copy()
df_M_1 = df_M.copy()

In [None]:
# df_E = df_E_1.copy()
# df_M = df_M_1.copy()

In [None]:
df_E['player_survive_time'].hist(bins=100)

In [None]:
df_E.loc[df_E['team_placement']==1, 'player_survive_time'].hist(bins=100)

In [None]:
df_E.loc[df_E['player_survive_time'] > 1900, 'team_placement'].value_counts()

In [None]:
df_E = df_E.drop(df_E.loc[(df_E['player_survive_time'] > 1900)].index)

In [None]:
df_M = df_M.drop(df_M.loc[(df_M['player_survive_time'] > 1900)].index)

In [None]:
df_E['team_placement'].value_counts()

In [None]:
df_M['team_placement'].value_counts()

## Checking the analysis dataset

In [None]:
df_E.shape

In [None]:
df_E.isnull().sum()

In [None]:
df_M.shape

In [None]:
df_M.isnull().sum()

## Splitting the dataset

* Split into a dataset with outliers removed and a dataset containing only the outliers

In [None]:
E_in_idx = list(df_E.index)
M_in_idx = list(df_M.index)

In [None]:
df_E_in = df_E.reset_index(drop=True)
df_M_in = df_M.reset_index(drop=True)

In [None]:
E_out_idx = list(set(df_E_raw.index) - set(df_E.index))
M_out_idx = list(set(df_M_raw.index) - set(df_M.index))

In [None]:
df_E_out = df_E_raw.loc[E_out_idx].reset_index(drop=True)
df_M_out = df_M_raw.loc[M_out_idx].reset_index(drop=True)
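
An alternative to filtering step by step and recovering the outliers through index differences is to build one boolean mask and use it for both subsets. This is a sketch on toy data using the thresholds stated in this section; the `kill_dist` rule is left out on purpose because it has to keep rows whose distance is NaN.

```python
import pandas as pd

# Upper bounds taken from the outlier rules in this section.
UPPER_BOUNDS = {
    'player_dist_ride': 30000,
    'player_dist_walk': 10000,
    'player_kills': 30,
    'player_dmg': 3000,
    'player_dbno': 11,
    'player_survive_time': 1900,
}

df_raw = pd.DataFrame({
    'game_size': [45, 30, 48],
    'player_dist_ride': [1200.0, 0.0, 35000.0],
    'player_dist_walk': [2500.0, 800.0, 900.0],
    'player_kills': [3, 0, 1],
    'player_dmg': [310, 0, 95],
    'player_dbno': [2, 0, 1],
    'player_survive_time': [1500.0, 200.0, 700.0],
})

# Start from the game_size rule (a lower bound), then apply every upper bound.
keep = df_raw['game_size'] >= 40
for col, bound in UPPER_BOUNDS.items():
    keep &= df_raw[col] <= bound

df_in = df_raw[keep].reset_index(drop=True)
df_out = df_raw[~keep].reset_index(drop=True)
```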

## Exporting to CSV

In [None]:
df_E_out.to_csv('../dataset/duo/duo_E_out.csv', index=False)

In [None]:
df_M_out.to_csv('../dataset/duo/duo_M_out.csv', index=False)

In [None]:
df_E_in.to_csv('../dataset/duo/duo_E_in.csv', index=False)

In [None]:
df_M_in.to_csv('../dataset/duo/duo_M_in.csv', index=False)

## Selecting players with a play_count of 10 or more

In [None]:
df_duo = pd.concat([df_E_in, df_M_in])

In [None]:
df_duo.shape

In [None]:
play_count = df_duo.groupby('player_name')['match_id'].nunique().to_frame()

In [None]:
play_count.loc[play_count['match_id'] >= 10].index

In [None]:
df_duo = df_duo.loc[df_duo['player_name'].isin(play_count[play_count['match_id'] >= 10].index)]

In [None]:
df_duo.shape
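
The play-count filter can also be sketched with `groupby(...).transform('nunique')`, which broadcasts each player's number of distinct matches back onto every row. The data is a toy stand-in, and `MIN_MATCHES` is lowered from the notebook's 10 so that the example is small.

```python
import pandas as pd

df_duo = pd.DataFrame({
    'player_name': ['a', 'a', 'a', 'b', 'b'],
    'match_id': ['m1', 'm1', 'm2', 'm1', 'm1'],
})

MIN_MATCHES = 2  # the notebook uses 10; lowered here to fit the toy data

# Distinct matches per player, broadcast back onto every row.
matches_played = df_duo.groupby('player_name')['match_id'].transform('nunique')
df_frequent = df_duo[matches_played >= MIN_MATCHES]
```

Counting distinct `match_id` values rather than rows matters here, because a player has one row per kill after the merge.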

In [None]:
df_duo.to_csv('../dataset/duo/duo.csv', index=False)

# Creating derived variables

## Duo

In [None]:
df = pd.read_csv('../dataset/duo/duo.csv')

In [None]:
df.shape

In [None]:
df.info()

### date

* Convert to a datetime type

In [None]:
df['date'] = pd.to_datetime(df['date'])

### Score
* each_game_score: (50 - team_placement) * 1 + player_kills * 2 + player_assists * 2
* total_score: sum(each_game_score) by player_name
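
The scoring rule above can be checked by hand with a small worked example. The player and the two matches below are hypothetical.

```python
def each_game_score(team_placement, player_kills, player_assists):
    """Score for one match, following the formula above."""
    return (50 - team_placement) * 1 + player_kills * 2 + player_assists * 2


# Hypothetical player with two matches.
first = each_game_score(team_placement=1, player_kills=5, player_assists=2)    # 49 + 10 + 4
second = each_game_score(team_placement=30, player_kills=0, player_assists=1)  # 20 + 0 + 2
total_score = first + second
```

A first-place finish contributes 49 placement points, so in this example the win alone outweighs the kill and assist points combined.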

In [None]:
df['team_placement'].value_counts()

In [None]:
# Compute each player's score in each game

df['each_game_score'] = (50 - df['team_placement'])*1 + df['player_kills']*2 + df['player_assists']*2
min(df['each_game_score']), max(df['each_game_score'])

In [None]:
# Compute each player's total score

score = df.groupby(['player_name', 'match_id'])['each_game_score'].mean().to_frame()
total_score = score.groupby('player_name')['each_game_score'].sum().to_frame()
total_score.columns = ['total_score']
min(total_score['total_score']), max(total_score['total_score'])

In [None]:
total_score['total_score'].hist(bins=1000)
plt.ticklabel_format(style='plain')

In [None]:
total_score.loc[total_score['total_score'] > 1500, 'total_score'].count()

In [None]:
df['total_score'] = total_score.loc[df['player_name'], 'total_score'].values

In [None]:
df['total_score'] = np.log(df['total_score'])

In [None]:
df.groupby('player_name')['total_score'].mean().hist(bins=1000)

### Tier

In [None]:
def get_tier(score):
    if score < 6:
        return 1 # Bronze
    elif score < 6.75:
        return 2 # Silver
    elif score < 7.5:
        return 3 # Gold
    elif score < 8.25:
        return 4 # Platinum
    elif score < 9:
        return 5 # Diamond
    else: 
        return 6 # Master

In [None]:
df['tier'] = df['total_score'].apply(lambda x: get_tier(x))
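
The tier thresholds apply to the log-transformed `total_score`. As a sketch of a vectorized equivalent of `get_tier`, `pd.cut` with `right=False` reproduces the same half-open bins; the sample scores are made up.

```python
import numpy as np
import pandas as pd

log_scores = pd.Series([5.2, 6.0, 6.74, 7.5, 8.9, 9.0])

# Same cut points as get_tier; right=False makes each bin [low, high).
bins = [-np.inf, 6, 6.75, 7.5, 8.25, 9, np.inf]
tiers = pd.cut(log_scores, bins=bins, labels=[1, 2, 3, 4, 5, 6], right=False).astype(int)
```

A score exactly on a boundary falls into the higher tier, matching the strict `<` comparisons in `get_tier`.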

In [None]:
df.groupby('player_name')['tier'].mean().hist()

In [None]:
df.groupby('tier')['player_name'].nunique()

In [None]:
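# Share of players in each tier (%), relative to the number of distinct players
df.groupby('tier')['player_name'].nunique()/df['player_name'].nunique()*100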

### KDA
* (kills + assists)/deaths

#### kills

In [None]:
kill = df.groupby(['player_name', 'match_id'])['player_kills'].mean().to_frame()

In [None]:
kills = kill.groupby('player_name')['player_kills'].sum().to_frame()

In [None]:
kills.head()

In [None]:
kills.isnull().sum()

#### assists

In [None]:
assist = df.groupby(['player_name', 'match_id'])['player_assists'].mean().to_frame()

In [None]:
assists = assist.groupby('player_name')['player_assists'].sum().to_frame()

In [None]:
assists.head()

In [None]:
assists.isnull().sum()

#### deaths

In [None]:
# Number of games finished in first place
count_rank1 = df[df['team_placement']==1].groupby('player_name')['match_id'].nunique().to_frame()
count_rank1.columns = ['rank1']

In [None]:
count_rank1.isnull().sum()

In [None]:
# Number of games played

game_count = df.groupby('player_name')['match_id'].nunique().to_frame()
game_count.columns = ['games']

In [None]:
game_count.isnull().sum()

In [None]:
deaths = pd.merge(count_rank1, game_count, how='outer', on='player_name')
deaths.head()

In [None]:
deaths.isnull().sum()

In [None]:
deaths = deaths.fillna(0)

In [None]:
deaths['deaths'] = deaths['games'] - deaths['rank1']
deaths.head(1)

#### KDA

In [None]:
kda = pd.merge(kills, assists, how='outer', on='player_name')
kda = pd.merge(kda, deaths['deaths'], how='outer', on='player_name')
kda.isnull().sum()

In [None]:
kda['kda'] = (kda['player_kills'] + kda['player_assists']) / kda['deaths']
kda.head()
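
Since `deaths` is defined as games played minus first-place finishes, a player who won every match has zero deaths and the ratio becomes infinite. The sketch below shows that case on toy data together with one possible convention, counting zero deaths as one; that convention is an assumption for illustration, not something the notebook applies.

```python
import numpy as np
import pandas as pd

kda = pd.DataFrame({
    'player_kills': [12.0, 4.0],
    'player_assists': [3.0, 1.0],
    'deaths': [5.0, 0.0],   # second player won every match played
}, index=pd.Index(['a', 'b'], name='player_name'))

# Plain division yields inf when deaths is zero.
raw_kda = (kda['player_kills'] + kda['player_assists']) / kda['deaths']
has_inf = bool(np.isinf(raw_kda).any())

# One possible convention (an assumption, not the notebook's choice):
# treat zero deaths as one death so the ratio stays finite.
safe_deaths = kda['deaths'].replace(0, 1)
kda['kda'] = (kda['player_kills'] + kda['player_assists']) / safe_deaths
```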

In [None]:
df = pd.merge(df, kda['kda'], how='left', on='player_name')

In [None]:
df['kda'].isnull().sum()

### num_of_match
* Number of games played per player

In [None]:
# Select only the columns needed to examine differences between tiers

cols = ['player_name', 'match_id', 'player_kills', 'player_dmg', 'player_assists', 'player_dbno',  'kda',
        'player_dist_walk', 'player_dist_ride', 'kill_dist', 'player_survive_time', 'team_placement', 'tier']

df_player_match = pd.pivot_table(data=df[cols], index=['player_name', 'match_id'], aggfunc='mean')

In [None]:
# Number of games played per player

num_of_match = df_player_match.groupby('player_name')['tier'].value_counts().to_frame()
num_of_match.columns = ['num_of_match']

In [None]:
df_player = df_player_match.groupby('player_name').mean()
df_player = pd.merge(df_player, num_of_match, how='left', on='player_name')

In [None]:
df_player = df_player[['player_kills', 'player_dmg', 'player_assists', 'player_dbno',  'kda', 'player_dist_walk', 
                       'player_dist_ride', 'kill_dist', 'player_survive_time', 'team_placement',
                       'num_of_match','tier']]
df_player.head()

In [None]:
df_player.shape

In [None]:
df_player.info()

In [None]:
df_player.to_csv('tier_diff_duo.csv', index=False)

# Testing differences between tiers

## Duo

In [None]:
df.corr().style.background_gradient(cmap='Blues')

In [None]:
df.groupby('tier').mean()

### player_kills
* Mean kill count increases with tier

In [None]:
df.groupby('tier')['player_kills'].mean().to_frame()

In [None]:
plt.figure(figsize=(15, 5))
sns.kdeplot(data=df, x='player_kills', hue='tier', multiple='stack')

In [None]:
tier1 = stats.anderson(df.loc[df['tier'] == 1, 'player_kills'], dist='norm')
tier2 = stats.anderson(df.loc[df['tier'] == 2, 'player_kills'], dist='norm')
tier3 = stats.anderson(df.loc[df['tier'] == 3, 'player_kills'], dist='norm')
tier4 = stats.anderson(df.loc[df['tier'] == 4, 'player_kills'], dist='norm')
tier5 = stats.anderson(df.loc[df['tier'] == 5, 'player_kills'], dist='norm')
tier6 = stats.anderson(df.loc[df['tier'] == 6, 'player_kills'], dist='norm')

# True means normality is not rejected at the 5% level (each statistic vs. its own critical value)
print('tier1:', tier1[0] < tier1[1][2], '\n' 
      'tier2:', tier2[0] < tier2[1][2], '\n'
      'tier3:', tier3[0] < tier3[1][2], '\n'
      'tier4:', tier4[0] < tier4[1][2], '\n'
      'tier5:', tier5[0] < tier5[1][2], '\n'
      'tier6:', tier6[0] < tier6[1][2], '\n')

In [None]:
stats.kruskal(df.loc[df['tier'] == 1, 'player_kills'],
              df.loc[df['tier'] == 2, 'player_kills'],
              df.loc[df['tier'] == 3, 'player_kills'],
              df.loc[df['tier'] == 4, 'player_kills'],
              df.loc[df['tier'] == 5, 'player_kills'],
              df.loc[df['tier'] == 6, 'player_kills'])

In [None]:
sp.posthoc_conover(df, val_col ='player_kills', 
                     group_col ='tier', p_adjust = 'holm').style.background_gradient(cmap='Blues')
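
The per-variable procedure in this section is: test each tier for normality with Anderson-Darling, then, since normality is not expected to hold, compare tiers with the Kruskal-Wallis test. The sketch below runs those two steps in a loop on synthetic data (three tiers with Poisson-distributed kill counts, an assumption made only to have something to test); it leaves out the Conover post-hoc step.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'tier': np.repeat([1, 2, 3], 200),
    # Synthetic, right-skewed kill counts whose mean grows with tier.
    'player_kills': np.concatenate([rng.poisson(lam, 200) for lam in (0.5, 1.5, 3.0)]),
})

groups = {tier: g['player_kills'].to_numpy() for tier, g in df.groupby('tier')}

# Anderson-Darling: compare each statistic with that group's own 5% critical value.
looks_normal = {}
for tier, values in groups.items():
    result = stats.anderson(values, dist='norm')
    looks_normal[tier] = result.statistic < result.critical_values[2]

# Kruskal-Wallis across all tiers at once.
h_stat, p_value = stats.kruskal(*groups.values())
```

Looping over the groups keeps each statistic paired with its own critical values and makes it easy to repeat the procedure for another column.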

### player_dmg

In [None]:
df.groupby('tier')['player_dmg'].mean().to_frame()

In [None]:
plt.figure(figsize=(15,5))
sns.violinplot(data=df, x='tier', y='player_dmg')

In [None]:
plt.figure(figsize=(15, 5))
sns.kdeplot(data=df, x='player_dmg', hue='tier', multiple='stack')

In [None]:
tier1 = stats.anderson(df.loc[df['tier'] == 1, 'player_dmg'], dist='norm')
tier2 = stats.anderson(df.loc[df['tier'] == 2, 'player_dmg'], dist='norm')
tier3 = stats.anderson(df.loc[df['tier'] == 3, 'player_dmg'], dist='norm')
tier4 = stats.anderson(df.loc[df['tier'] == 4, 'player_dmg'], dist='norm')
tier5 = stats.anderson(df.loc[df['tier'] == 5, 'player_dmg'], dist='norm')
tier6 = stats.anderson(df.loc[df['tier'] == 6, 'player_dmg'], dist='norm')

# True means normality is not rejected at the 5% level (each statistic vs. its own critical value)
print('tier1:', tier1[0] < tier1[1][2], '\n' 
      'tier2:', tier2[0] < tier2[1][2], '\n'
      'tier3:', tier3[0] < tier3[1][2], '\n'
      'tier4:', tier4[0] < tier4[1][2], '\n'
      'tier5:', tier5[0] < tier5[1][2], '\n'
      'tier6:', tier6[0] < tier6[1][2], '\n')

In [None]:
stats.kruskal(df.loc[df['tier'] == 1, 'player_dmg'],
              df.loc[df['tier'] == 2, 'player_dmg'],
              df.loc[df['tier'] == 3, 'player_dmg'],
              df.loc[df['tier'] == 4, 'player_dmg'],
              df.loc[df['tier'] == 5, 'player_dmg'],
              df.loc[df['tier'] == 6, 'player_dmg'])

In [None]:
sp.posthoc_conover(df, val_col ='player_dmg', 
                     group_col ='tier', p_adjust = 'holm').style.background_gradient(cmap='Blues')

### player_assists
* Mean assist count increases with tier

In [None]:
df.groupby('tier')['player_assists'].mean().to_frame()

In [None]:
plt.figure(figsize=(15,5))
sns.violinplot(data=df, x='tier', y='player_assists')

In [None]:
plt.figure(figsize=(15, 5))
sns.kdeplot(data=df, x='player_assists', hue='tier', multiple='stack')

In [None]:
tier1 = stats.anderson(df.loc[df['tier'] == 1, 'player_assists'], dist='norm')
tier2 = stats.anderson(df.loc[df['tier'] == 2, 'player_assists'], dist='norm')
tier3 = stats.anderson(df.loc[df['tier'] == 3, 'player_assists'], dist='norm')
tier4 = stats.anderson(df.loc[df['tier'] == 4, 'player_assists'], dist='norm')
tier5 = stats.anderson(df.loc[df['tier'] == 5, 'player_assists'], dist='norm')
tier6 = stats.anderson(df.loc[df['tier'] == 6, 'player_assists'], dist='norm')

# True means normality is not rejected at the 5% level (each statistic vs. its own critical value)
print('tier1:', tier1[0] < tier1[1][2], '\n' 
      'tier2:', tier2[0] < tier2[1][2], '\n'
      'tier3:', tier3[0] < tier3[1][2], '\n'
      'tier4:', tier4[0] < tier4[1][2], '\n'
      'tier5:', tier5[0] < tier5[1][2], '\n'
      'tier6:', tier6[0] < tier6[1][2], '\n')

In [None]:
stats.kruskal(df.loc[df['tier'] == 1, 'player_assists'],
              df.loc[df['tier'] == 2, 'player_assists'],
              df.loc[df['tier'] == 3, 'player_assists'],
              df.loc[df['tier'] == 4, 'player_assists'],
              df.loc[df['tier'] == 5, 'player_assists'],
              df.loc[df['tier'] == 6, 'player_assists'])

In [None]:
sp.posthoc_conover(df, val_col ='player_assists', 
                     group_col ='tier', p_adjust = 'holm').style.background_gradient(cmap='Blues')

### player_dbno
* Mean dbno count increases with tier

In [None]:
df.groupby('tier')['player_dbno'].mean().to_frame()

In [None]:
plt.figure(figsize=(15, 5))
sns.kdeplot(data=df, x='player_dbno', hue='tier', multiple='stack')

In [None]:
tier1 = stats.anderson(df.loc[df['tier'] == 1, 'player_dbno'], dist='norm')
tier2 = stats.anderson(df.loc[df['tier'] == 2, 'player_dbno'], dist='norm')
tier3 = stats.anderson(df.loc[df['tier'] == 3, 'player_dbno'], dist='norm')
tier4 = stats.anderson(df.loc[df['tier'] == 4, 'player_dbno'], dist='norm')
tier5 = stats.anderson(df.loc[df['tier'] == 5, 'player_dbno'], dist='norm')
tier6 = stats.anderson(df.loc[df['tier'] == 6, 'player_dbno'], dist='norm')

# True means normality is not rejected at the 5% level (each statistic vs. its own critical value)
print('tier1:', tier1[0] < tier1[1][2], '\n' 
      'tier2:', tier2[0] < tier2[1][2], '\n'
      'tier3:', tier3[0] < tier3[1][2], '\n'
      'tier4:', tier4[0] < tier4[1][2], '\n'
      'tier5:', tier5[0] < tier5[1][2], '\n'
      'tier6:', tier6[0] < tier6[1][2], '\n')

In [None]:
stats.kruskal(df.loc[df['tier'] == 1, 'player_dbno'],
              df.loc[df['tier'] == 2, 'player_dbno'],
              df.loc[df['tier'] == 3, 'player_dbno'],
              df.loc[df['tier'] == 4, 'player_dbno'],
              df.loc[df['tier'] == 5, 'player_dbno'],
              df.loc[df['tier'] == 6, 'player_dbno'])

In [None]:
sp.posthoc_conover(df, val_col ='player_dbno', 
                     group_col ='tier', p_adjust = 'holm').style.background_gradient(cmap='Blues')

### kda
* Mean kda increases with tier

In [None]:
df.groupby('tier')['kda'].mean().to_frame()

In [None]:
plt.figure(figsize=(15,5))
sns.violinplot(data=df, x='tier', y='kda')

In [None]:
plt.figure(figsize=(15, 5))
sns.kdeplot(data=df, x='kda', hue='tier', multiple='stack')

In [None]:
tier1 = stats.anderson(df.loc[df['tier'] == 1, 'kda'], dist='norm')
tier2 = stats.anderson(df.loc[df['tier'] == 2, 'kda'], dist='norm')
tier3 = stats.anderson(df.loc[df['tier'] == 3, 'kda'], dist='norm')
tier4 = stats.anderson(df.loc[df['tier'] == 4, 'kda'], dist='norm')
tier5 = stats.anderson(df.loc[df['tier'] == 5, 'kda'], dist='norm')
tier6 = stats.anderson(df.loc[df['tier'] == 6, 'kda'], dist='norm')

# True means normality is not rejected at the 5% level (each statistic vs. its own critical value)
print('tier1:', tier1[0] < tier1[1][2], '\n' 
      'tier2:', tier2[0] < tier2[1][2], '\n'
      'tier3:', tier3[0] < tier3[1][2], '\n'
      'tier4:', tier4[0] < tier4[1][2], '\n'
      'tier5:', tier5[0] < tier5[1][2], '\n'
      'tier6:', tier6[0] < tier6[1][2], '\n')

In [None]:
stats.kruskal(df.loc[df['tier'] == 1, 'kda'],
              df.loc[df['tier'] == 2, 'kda'],
              df.loc[df['tier'] == 3, 'kda'],
              df.loc[df['tier'] == 4, 'kda'],
              df.loc[df['tier'] == 5, 'kda'],
              df.loc[df['tier'] == 6, 'kda'])

In [None]:
sp.posthoc_conover(df, val_col ='kda', 
                     group_col ='tier', p_adjust = 'holm').style.background_gradient(cmap='Blues')

### player_dist_walk
* Mean distance increases from Tier 1 to Tier 3
* Distance decreases from Tier 3 to Tier 6
* Tiers 2 and 5 show no difference in means

In [None]:
df.groupby('tier')['player_dist_walk'].mean().to_frame()

In [None]:
plt.figure(figsize=(15,5))
sns.violinplot(data=df, x='tier', y='player_dist_walk')

In [None]:
plt.figure(figsize=(15, 5))
sns.kdeplot(data=df, x='player_dist_walk', hue='tier', multiple='stack')

In [None]:
tier1 = stats.anderson(df.loc[df['tier'] == 1, 'player_dist_walk'], dist='norm')
tier2 = stats.anderson(df.loc[df['tier'] == 2, 'player_dist_walk'], dist='norm')
tier3 = stats.anderson(df.loc[df['tier'] == 3, 'player_dist_walk'], dist='norm')
tier4 = stats.anderson(df.loc[df['tier'] == 4, 'player_dist_walk'], dist='norm')
tier5 = stats.anderson(df.loc[df['tier'] == 5, 'player_dist_walk'], dist='norm')
tier6 = stats.anderson(df.loc[df['tier'] == 6, 'player_dist_walk'], dist='norm')

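# True means normality is not rejected at the 5% level (index 2 of critical_values).
# Each tier is compared against its own critical values.
print('tier1:', tier1[0] < tier1[1][2], '\n'
      'tier2:', tier2[0] < tier2[1][2], '\n'
      'tier3:', tier3[0] < tier3[1][2], '\n'
      'tier4:', tier4[0] < tier4[1][2], '\n'
      'tier5:', tier5[0] < tier5[1][2], '\n'
      'tier6:', tier6[0] < tier6[1][2], '\n')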

In [None]:
stats.kruskal(df.loc[df['tier'] == 1, 'player_dist_walk'],
              df.loc[df['tier'] == 2, 'player_dist_walk'],
              df.loc[df['tier'] == 3, 'player_dist_walk'],
              df.loc[df['tier'] == 4, 'player_dist_walk'],
              df.loc[df['tier'] == 5, 'player_dist_walk'],
              df.loc[df['tier'] == 6, 'player_dist_walk'])

In [None]:
# Post-hoc test on the column analysed in this section (was 'player_kills')
sp.posthoc_conover(df, val_col ='player_dist_walk', 
                     group_col ='tier', p_adjust = 'holm').style.background_gradient(cmap='Blues')

### player_dist_ride
* Tier 1 clearly covers a shorter distance by vehicle
* From tier 2 upward the groups do differ, but vehicle distance is not directly proportional to tier

In [None]:
df.groupby('tier')['player_dist_ride'].mean().to_frame()

In [None]:
plt.figure(figsize=(15,5))
sns.violinplot(data=df, x='tier', y='player_dist_ride')

In [None]:
plt.figure(figsize=(15, 5))
sns.kdeplot(data=df, x='player_dist_ride', hue='tier', multiple='stack')

In [None]:
tier1 = stats.anderson(df.loc[df['tier'] == 1, 'player_dist_ride'], dist='norm')
tier2 = stats.anderson(df.loc[df['tier'] == 2, 'player_dist_ride'], dist='norm')
tier3 = stats.anderson(df.loc[df['tier'] == 3, 'player_dist_ride'], dist='norm')
tier4 = stats.anderson(df.loc[df['tier'] == 4, 'player_dist_ride'], dist='norm')
tier5 = stats.anderson(df.loc[df['tier'] == 5, 'player_dist_ride'], dist='norm')
tier6 = stats.anderson(df.loc[df['tier'] == 6, 'player_dist_ride'], dist='norm')

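# True means normality is not rejected at the 5% level (index 2 of critical_values).
# Each tier is compared against its own critical values.
print('tier1:', tier1[0] < tier1[1][2], '\n'
      'tier2:', tier2[0] < tier2[1][2], '\n'
      'tier3:', tier3[0] < tier3[1][2], '\n'
      'tier4:', tier4[0] < tier4[1][2], '\n'
      'tier5:', tier5[0] < tier5[1][2], '\n'
      'tier6:', tier6[0] < tier6[1][2], '\n')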

In [None]:
stats.kruskal(df.loc[df['tier'] == 1, 'player_dist_ride'],
              df.loc[df['tier'] == 2, 'player_dist_ride'],
              df.loc[df['tier'] == 3, 'player_dist_ride'],
              df.loc[df['tier'] == 4, 'player_dist_ride'],
              df.loc[df['tier'] == 5, 'player_dist_ride'],
              df.loc[df['tier'] == 6, 'player_dist_ride'])

In [None]:
sp.posthoc_conover(df, val_col ='player_dist_ride', 
                     group_col ='tier', p_adjust = 'holm').style.background_gradient(cmap='Blues')

### kill_dist
* kill_dist does not grow in proportion to tier, but tiers 1~2 and tiers 3~6 appear to form two distinct groups
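
If tiers 1~2 and 3~6 really behave as two groups, one way to probe that is to pool the tiers into two bands and compare them with a Mann-Whitney U test. The sketch below uses synthetic `kill_dist` values (hypothetical, generated so that the two bands differ). The banding threshold mirrors the observation above and is not a fitted cut-off.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for `df`: tiers 1-2 get a smaller typical kill distance.
rng = np.random.default_rng(3)
tiers = np.repeat(np.arange(1, 7), 100)
log_mean = np.where(tiers <= 2, 3.0, 3.5)
demo = pd.DataFrame({
    'tier': tiers,
    'kill_dist': rng.lognormal(mean=log_mean, sigma=1.0),
})

# Pool tiers into the two bands suggested by the plots; drop missing distances.
low = demo.loc[demo['tier'] <= 2, 'kill_dist'].dropna()
high = demo.loc[demo['tier'] >= 3, 'kill_dist'].dropna()

# Rank-based two-sample test, suitable for skewed distances.
u_stat, p_value = stats.mannwhitneyu(low, high, alternative='two-sided')
print(f'median low={low.median():.1f}, median high={high.median():.1f}')
print(f'U={u_stat:.0f}, p={p_value:.3g}')
```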

In [None]:
df.groupby('tier')['kill_dist'].median().to_frame()

In [None]:
plt.figure(figsize=(15,5))
sns.violinplot(data=df, x='tier', y='kill_dist')
plt.xticks(np.arange(6), ['Bronze', 'Silver', 'Gold', 'Platinum', 'Diamond', 'Master'])

In [None]:
plt.figure(figsize=(15, 5))
sns.kdeplot(data=df, x='kill_dist', hue='tier', multiple='stack')

In [None]:
tier1 = stats.anderson(df.loc[df['tier'] == 1, 'kill_dist'], dist='norm')
tier2 = stats.anderson(df.loc[df['tier'] == 2, 'kill_dist'], dist='norm')
tier3 = stats.anderson(df.loc[df['tier'] == 3, 'kill_dist'], dist='norm')
tier4 = stats.anderson(df.loc[df['tier'] == 4, 'kill_dist'], dist='norm')
tier5 = stats.anderson(df.loc[df['tier'] == 5, 'kill_dist'], dist='norm')
tier6 = stats.anderson(df.loc[df['tier'] == 6, 'kill_dist'], dist='norm')

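# True means normality is not rejected at the 5% level (index 2 of critical_values).
# Each tier is compared against its own critical values.
print('tier1:', tier1[0] < tier1[1][2], '\n'
      'tier2:', tier2[0] < tier2[1][2], '\n'
      'tier3:', tier3[0] < tier3[1][2], '\n'
      'tier4:', tier4[0] < tier4[1][2], '\n'
      'tier5:', tier5[0] < tier5[1][2], '\n'
      'tier6:', tier6[0] < tier6[1][2], '\n')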

In [None]:
stats.kruskal(df.loc[df['tier'] == 1, 'kill_dist'].fillna(0),
              df.loc[df['tier'] == 2, 'kill_dist'].fillna(0),
              df.loc[df['tier'] == 3, 'kill_dist'].fillna(0),
              df.loc[df['tier'] == 4, 'kill_dist'].fillna(0),
              df.loc[df['tier'] == 5, 'kill_dist'].fillna(0),
              df.loc[df['tier'] == 6, 'kill_dist'].fillna(0))

In [None]:
sp.posthoc_conover(df, val_col ='kill_dist', 
                     group_col ='tier', p_adjust = 'holm').style.background_gradient(cmap='Blues')


### player_survive_time
* Higher tiers clearly survive beyond 1,500 seconds more often
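
The 1,500-second observation can be expressed directly as a per-tier share. The sketch below uses a tiny hand-made `demo` frame (hypothetical rows, chosen so the arithmetic is easy to follow) and takes the mean of a boolean mask within each tier.

```python
import pandas as pd

# Hand-made stand-in for `df` (hypothetical rows, four players per tier).
demo = pd.DataFrame({
    'tier': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'player_survive_time': [300, 800, 1200, 1600,
                            500, 1550, 1700, 900,
                            1510, 1800, 1900, 400],
})

# The mean of a boolean mask is the fraction of True values in each tier.
long_survival = demo['player_survive_time'] > 1500
share = long_survival.groupby(demo['tier']).mean()
print(share)  # tier 1: 0.25, tier 2: 0.50, tier 3: 0.75
```

Applied to the real `df`, the same two lines would give the fraction of players per tier who stay alive past 1,500 seconds.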

In [None]:
df.groupby('tier')['player_survive_time'].mean().to_frame()

In [None]:
plt.figure(figsize=(15,5))
sns.violinplot(data=df, x='tier', y='player_survive_time')

In [None]:
plt.figure(figsize=(15, 5))
sns.kdeplot(data=df, x='player_survive_time', hue='tier', multiple='stack')

In [None]:
tier1 = stats.anderson(df.loc[df['tier'] == 1, 'player_survive_time'], dist='norm')
tier2 = stats.anderson(df.loc[df['tier'] == 2, 'player_survive_time'], dist='norm')
tier3 = stats.anderson(df.loc[df['tier'] == 3, 'player_survive_time'], dist='norm')
tier4 = stats.anderson(df.loc[df['tier'] == 4, 'player_survive_time'], dist='norm')
tier5 = stats.anderson(df.loc[df['tier'] == 5, 'player_survive_time'], dist='norm')
tier6 = stats.anderson(df.loc[df['tier'] == 6, 'player_survive_time'], dist='norm')

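# True means normality is not rejected at the 5% level (index 2 of critical_values).
# Each tier is compared against its own critical values.
print('tier1:', tier1[0] < tier1[1][2], '\n'
      'tier2:', tier2[0] < tier2[1][2], '\n'
      'tier3:', tier3[0] < tier3[1][2], '\n'
      'tier4:', tier4[0] < tier4[1][2], '\n'
      'tier5:', tier5[0] < tier5[1][2], '\n'
      'tier6:', tier6[0] < tier6[1][2], '\n')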

In [None]:
stats.kruskal(df.loc[df['tier'] == 1, 'player_survive_time'],
              df.loc[df['tier'] == 2, 'player_survive_time'],
              df.loc[df['tier'] == 3, 'player_survive_time'],
              df.loc[df['tier'] == 4, 'player_survive_time'],
              df.loc[df['tier'] == 5, 'player_survive_time'],
              df.loc[df['tier'] == 6, 'player_survive_time'])

In [None]:
# Post-hoc test on the column analysed in this section (was 'player_kills')
sp.posthoc_conover(df, val_col ='player_survive_time', 
                     group_col ='tier', p_adjust = 'holm').style.background_gradient(cmap='Blues')

### team_placement
* Tier 1 clearly ends up with a low final ranking (a poor placement) more often

In [None]:
df.groupby('tier')['team_placement'].mean().to_frame()

In [None]:
plt.figure(figsize=(15,5))
sns.violinplot(data=df, x='tier', y='team_placement')

In [None]:
tier1 = stats.anderson(df.loc[df['tier'] == 1, 'team_placement'], dist='norm')
tier2 = stats.anderson(df.loc[df['tier'] == 2, 'team_placement'], dist='norm')
tier3 = stats.anderson(df.loc[df['tier'] == 3, 'team_placement'], dist='norm')
tier4 = stats.anderson(df.loc[df['tier'] == 4, 'team_placement'], dist='norm')
tier5 = stats.anderson(df.loc[df['tier'] == 5, 'team_placement'], dist='norm')
tier6 = stats.anderson(df.loc[df['tier'] == 6, 'team_placement'], dist='norm')

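# True means normality is not rejected at the 5% level (index 2 of critical_values).
# Each tier is compared against its own critical values.
print('tier1:', tier1[0] < tier1[1][2], '\n'
      'tier2:', tier2[0] < tier2[1][2], '\n'
      'tier3:', tier3[0] < tier3[1][2], '\n'
      'tier4:', tier4[0] < tier4[1][2], '\n'
      'tier5:', tier5[0] < tier5[1][2], '\n'
      'tier6:', tier6[0] < tier6[1][2], '\n')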

In [None]:
stats.kruskal(df.loc[df['tier'] == 1, 'team_placement'],
              df.loc[df['tier'] == 2, 'team_placement'],
              df.loc[df['tier'] == 3, 'team_placement'],
              df.loc[df['tier'] == 4, 'team_placement'],
              df.loc[df['tier'] == 5, 'team_placement'],
              df.loc[df['tier'] == 6, 'team_placement'])

In [None]:
sp.posthoc_conover(df, val_col ='team_placement', 
                     group_col ='tier', p_adjust = 'holm').style.background_gradient(cmap='Blues')

In [None]:
df_tier = pd.read_csv('../dataset/duo/duo_tier.csv')

In [None]:
del df_count

In [None]:
df_tier.columns

### num_of_match

In [None]:
df.groupby('tier')['num_of_match'].mean().to_frame()

In [None]:
plt.figure(figsize=(15,5))
sns.violinplot(data=df, x='tier', y='num_of_match')

In [None]:
tier1 = stats.anderson(df.loc[df['tier'] == 1, 'num_of_match'], dist='norm')
tier2 = stats.anderson(df.loc[df['tier'] == 2, 'num_of_match'], dist='norm')
tier3 = stats.anderson(df.loc[df['tier'] == 3, 'num_of_match'], dist='norm')
tier4 = stats.anderson(df.loc[df['tier'] == 4, 'num_of_match'], dist='norm')
tier5 = stats.anderson(df.loc[df['tier'] == 5, 'num_of_match'], dist='norm')
tier6 = stats.anderson(df.loc[df['tier'] == 6, 'num_of_match'], dist='norm')

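# True means normality is not rejected at the 5% level (index 2 of critical_values).
# Each tier is compared against its own critical values.
print('tier1:', tier1[0] < tier1[1][2], '\n'
      'tier2:', tier2[0] < tier2[1][2], '\n'
      'tier3:', tier3[0] < tier3[1][2], '\n'
      'tier4:', tier4[0] < tier4[1][2], '\n'
      'tier5:', tier5[0] < tier5[1][2], '\n'
      'tier6:', tier6[0] < tier6[1][2], '\n')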

In [None]:
stats.kruskal(df.loc[df['tier'] == 1, 'num_of_match'],
              df.loc[df['tier'] == 2, 'num_of_match'],
              df.loc[df['tier'] == 3, 'num_of_match'],
              df.loc[df['tier'] == 4, 'num_of_match'],
              df.loc[df['tier'] == 5, 'num_of_match'],
              df.loc[df['tier'] == 6, 'num_of_match'])

In [None]:
sp.posthoc_conover(df, val_col ='num_of_match', 
                     group_col ='tier', p_adjust = 'holm').style.background_gradient(cmap='Blues')

### killed_by

In [None]:
plt.figure(figsize=(5, 15))
sns.countplot(data=df_tier[df_tier['tier']==1], y='killed_by', 
              order=df_tier.loc[df_tier['tier']==1, 'killed_by'].value_counts().index)

In [None]:
plt.figure(figsize=(5, 15))
sns.countplot(data=df_tier[df_tier['tier']==2], y='killed_by',
              order=df_tier.loc[df_tier['tier']==2, 'killed_by'].value_counts().index)

In [None]:
plt.figure(figsize=(5, 15))
sns.countplot(data=df_tier[df_tier['tier']==3], y='killed_by',
              order=df_tier.loc[df_tier['tier']==3, 'killed_by'].value_counts().index)

In [None]:
plt.figure(figsize=(5, 15))
sns.countplot(data=df_tier[df_tier['tier']==4], y='killed_by',
              order=df_tier.loc[df_tier['tier']==4, 'killed_by'].value_counts().index)

In [None]:
plt.figure(figsize=(5, 15))
sns.countplot(data=df_tier[df_tier['tier']==5], y='killed_by',
              order=df_tier.loc[df_tier['tier']==5, 'killed_by'].value_counts().index)

In [None]:
plt.figure(figsize=(5, 15))
sns.countplot(data=df_tier[df_tier['tier']==6], y='killed_by',
              order=df_tier.loc[df_tier['tier']==6, 'killed_by'].value_counts().index)
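
The six count plots are easier to compare when the weapon frequencies are also put into one table. The sketch below uses a tiny hand-made frame (`demo_tier` and its rows are hypothetical, standing in for `df_tier`). It computes each weapon's share of deaths within a tier and the most frequent weapon per tier.

```python
import pandas as pd

# Hand-made stand-in for `df_tier` (hypothetical rows).
demo_tier = pd.DataFrame({
    'tier': [1, 1, 1, 2, 2, 2, 2],
    'killed_by': ['M416', 'M416', 'AKM', 'Kar98k', 'Kar98k', 'M416', 'Kar98k'],
})

# Share of each weapon within its tier, so tiers of different size are comparable.
shares = (demo_tier.groupby('tier')['killed_by']
          .value_counts(normalize=True)
          .rename('share'))

# Most frequent weapon per tier (value_counts sorts in descending order).
top_weapon = demo_tier.groupby('tier')['killed_by'].agg(
    lambda s: s.value_counts().index[0])

print(shares)
print(top_weapon)
```

Because the shares are normalised within each tier, a weapon that dominates a small tier is not hidden by the raw counts of a larger one.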