# PUBG Finish Placement Prediction (Kernels Only)


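* Id - The player's Id.
* groupId - ID that identifies a group within a match. If the same group of players plays in different matches, they get a different groupId each time.
* matchId - ID that identifies a match. No match appears in both the train set and the test set.
* matchDuration - Duration of the match.
* matchType - Game mode such as solo or duo; anything else is an event game (the column values still need to be checked).
* assists - Number of assists.
* boosts - Number of boost items used.
* damageDealt - Total damage dealt. Note: self-inflicted damage is excluded.
* DBNOs - Number of enemies knocked down.
* headshotKills - Number of enemies killed with headshots.
* heals - Number of healing items used.
* killPlace - Ranking in the match by number of enemies killed.
* killPoints - Kills-based external ranking of the player (an Elo-style ranking). If rankPoints holds a value other than -1, a 0 in killPoints should be treated as "None".
* kills - Number of enemies killed.
* killStreaks - Maximum number of enemies killed within a short period of time.
* longestKill - Longest distance at which the player killed an enemy. This can be misleading, because downing an opponent and then driving far away can produce the longest kill.
* maxPlace - Worst placement for which the match has data. It may not match numGroups, because placements can be skipped.
* numGroups - Number of teams in the match.
* rankPoints - Elo-style ranking of the player. Use with caution, because it is scheduled for removal in the next version of the API. A value of -1 means the rank is "None".
* revives - Number of times the player revived teammates.
* rideDistance - Distance traveled in vehicles (unit: meters).
* roadKills - Number of players killed with a vehicle.
* swimDistance - Distance traveled by swimming (unit: meters).
* teamKills - Number of times the player killed a teammate.
* vehicleDestroys - Number of vehicles destroyed.
* walkDistance - Total distance traveled on foot (unit: meters).
* weaponsAcquired - Number of weapons picked up.
* winPoints - Win-based external ranking of the player (an Elo-style ranking). If rankPoints holds a value other than -1, a 0 in winPoints should be treated as "None".
* **winPlacePerc** - The prediction target. It is a percentile placement, where 1 means first place in the match and 0 means last place. It is calculated from maxPlace rather than numGroups, so some placements may be missing.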
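The rule for killPoints and winPoints (a 0 means "None" whenever rankPoints is not -1) and the -1 placeholder in rankPoints can be made explicit before modeling. The following is only a minimal sketch, assuming pandas; the helper name `mask_missing_points` and the three-row demo frame are made up for illustration and are not part of the notebook or the competition data.

```python
import pandas as pd

def mask_missing_points(df):
    """Return a copy in which placeholder ranking values are replaced by NaN."""
    out = df.copy()
    has_rank = out["rankPoints"] != -1
    for col in ["killPoints", "winPoints"]:
        # A 0 only means "None" on rows where rankPoints carries a real value
        is_placeholder = has_rank & (out[col] == 0)
        out[col] = out[col].astype("float64").mask(is_placeholder)
    # In rankPoints itself, -1 is the "None" marker
    out["rankPoints"] = out["rankPoints"].astype("float64").mask(~has_rank)
    return out

# Hypothetical rows, for illustration only
demo = pd.DataFrame({
    "rankPoints": [-1, 1500, -1],
    "killPoints": [1200, 0, 0],
    "winPoints": [1400, 0, 0],
})
masked = mask_missing_points(demo)
print(masked)
```

The columns are converted to float64 because integer columns cannot hold NaN. Whether NaN is a better encoding than the original placeholders depends on the model used later.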

In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


pd.options.display.float_format = '{:.4f}'.format
data_path = "./data/"
train_file = f"{data_path}train_V2.csv"
test_file = f"{data_path}test_V2.csv"

# Load DataSet

In [2]:
# _raw: data as loaded from the original files
train_raw = pd.read_csv(train_file)
test_raw = pd.read_csv(test_file)

print(train_raw.shape, test_raw.shape)

(4446966, 29) (1934174, 28)


In [7]:
train_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4446966 entries, 0 to 4446965
Data columns (total 29 columns):
 #   Column           Dtype  
---  ------           -----  
 0   Id               object 
 1   groupId          object 
 2   matchId          object 
 3   assists          int64  
 4   boosts           int64  
 5   damageDealt      float64
 6   DBNOs            int64  
 7   headshotKills    int64  
 8   heals            int64  
 9   killPlace        int64  
 10  killPoints       int64  
 11  kills            int64  
 12  killStreaks      int64  
 13  longestKill      float64
 14  matchDuration    int64  
 15  matchType        object 
 16  maxPlace         int64  
 17  numGroups        int64  
 18  rankPoints       int64  
 19  revives          int64  
 20  rideDistance     float64
 21  roadKills        int64  
 22  swimDistance     float64
 23  teamKills        int64  
 24  vehicleDestroys  int64  
 25  walkDistance     float64
 26  weaponsAcquired  int64  
 27  winPoints   

## Setting data types

In [33]:
def reduce_mem_usage(df, verbose=True):
    """Downcast numeric columns to the smallest dtype that holds their value range.

    The DataFrame is modified in place and also returned.
    """
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2  # size before downcasting
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Integer column: try int8 first, then int16, int32, int64
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Float column: try float16 first, then float32, else float64.
                # Caution: float16 keeps only about 3 significant decimal digits and
                # tops out at 65504, so values get rounded and large sums can overflow.
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2  # size after downcasting
    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
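As a point of comparison, pandas ships a built-in downcasting option in `pd.to_numeric`. The sketch below is an alternative to the function above, not the notebook's method; the helper name `downcast_numeric` and the demo values are assumptions made for illustration. With `downcast="float"`, pandas goes no lower than float32, which sidesteps the precision limits of float16.

```python
import numpy as np
import pandas as pd

def downcast_numeric(df):
    """Return a copy with numeric columns downcast via pd.to_numeric."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        # Smallest signed integer dtype that can hold every value
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        # Smallest float dtype, but never below float32
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Hypothetical values, for illustration only
demo = pd.DataFrame({
    "assists": [0, 5, 100],
    "killPoints": [0, 1500, 2170],
    "damageDealt": [0.0, 91.5, 6616.0],
})
small = downcast_numeric(demo)
print(small.dtypes)
```

Object columns such as Id or matchType are left untouched by this sketch.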

In [37]:
train_raw = reduce_mem_usage(train_raw)

train_raw.info()

Mem. usage decreased to 288.39 Mb (0.0% reduction)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4446966 entries, 0 to 4446965
Data columns (total 29 columns):
 #   Column           Dtype  
---  ------           -----  
 0   Id               object 
 1   groupId          object 
 2   matchId          object 
 3   assists          uint8  
 4   boosts           int8   
 5   damageDealt      float16
 6   DBNOs            int8   
 7   headshotKills    int8   
 8   heals            int8   
 9   killPlace        int8   
 10  killPoints       int16  
 11  kills            int8   
 12  killStreaks      int8   
 13  longestKill      float16
 14  matchDuration    int16  
 15  matchType        object 
 16  maxPlace         int8   
 17  numGroups        int8   
 18  rankPoints       int16  
 19  revives          int8   
 20  rideDistance     float16
 21  roadKills        int8   
 22  swimDistance     float16
 23  teamKills        int8   
 24  vehicleDestroys  int8   
 25  walkDistance     floa

In [35]:
train_raw.describe()

| | assists | boosts | damageDealt | DBNOs | headshotKills | heals | killPlace | killPoints | kills | killStreaks | ... | revives | rideDistance | roadKills | swimDistance | teamKills | vehicleDestroys | walkDistance | weaponsAcquired | winPoints | winPlacePerc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4446966.0 | 4446966.0 | 4446966.0 | 4446966.0 | 4446966.0 | 4446966.0 | 4446966.0 | 4446966.0 | 4446966.0 | 4446966.0 | ... | 4446966.0 | 4446966.0 | 4446966.0 | 4446966.0 | 4446966.0 | 4446966.0 | 4446966.0 | 4446966.0 | 4446966.0 | 4446965.0 |
| mean | 0.2338 | 1.1069 | NaN | 0.6579 | 0.2268 | 1.3701 | 47.5994 | 505.006 | 0.9248 | 0.544 | ... | 0.1647 | NaN | 0.0035 | NaN | 0.0239 | 0.0079 | NaN | 3.6605 | 606.4601 | NaN |
| std | 0.5886 | 1.7158 | NaN | 1.1457 | 0.6022 | 2.68 | 27.4629 | 627.5049 | 1.5584 | 0.711 | ... | 0.4722 | NaN | 0.0734 | NaN | 0.1674 | 0.0926 | NaN | 2.4565 | 739.7004 | 0.0 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 24.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 155.125 | 2.0 | 0.0 | 0.2 |
| 50% | 0.0 | 0.0 | 84.25 | 0.0 | 0.0 | 0.0 | 47.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 685.5 | 3.0 | 0.0 | 0.4583 |
| 75% | 0.0 | 2.0 | 186.0 | 1.0 | 0.0 | 2.0 | 71.0 | 1172.0 | 1.0 | 1.0 | ... | 0.0 | 0.191 | 0.0 | 0.0 | 0.0 | 0.0 | 1976.0 | 5.0 | 1495.0 | 0.7407 |
| max | 22.0 | 33.0 | 6616.0 | 53.0 | 64.0 | 80.0 | 101.0 | 2170.0 | 72.0 | 20.0 | ... | 39.0 | 40704.0 | 18.0 | 3824.0 | 12.0 | 5.0 | 25776.0 | 236.0 | 2013.0 | 1.0 |
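The missing mean and std values in the float columns are plausibly a side effect of the float16 downcast, although the notebook does not confirm this. The largest finite float16 value is 65504, so an aggregate over millions of rows cannot be represented in that dtype. The sketch below uses a made-up NumPy array (not the competition data) to show the overflow and one possible workaround: upcasting before aggregating.

```python
import numpy as np

# Hypothetical column: 100,000 rows that each hold the value 100.0
values = np.full(100_000, 100.0, dtype=np.float16)

with np.errstate(over="ignore"):
    # The result dtype is float16, which cannot represent 10,000,000
    total_f16 = values.sum()

# Upcasting first keeps the aggregate inside the representable range
total_f64 = values.astype(np.float64).sum()
mean_f64 = values.astype(np.float64).mean()

print(total_f16, total_f64, mean_f64)
```

On the notebook's frame, the same idea would amount to something like `train_raw["damageDealt"].astype("float32").mean()`; the exact call is an assumption here rather than something taken from the notebook.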


In [36]:
train_raw.describe(include="O")

| | Id | groupId | matchId | matchType |
|---|---|---|---|---|
| count | 4446966 | 4446966 | 4446966 | 4446966 |
| unique | 4446966 | 2026745 | 47965 | 16 |
| top | 7f96b2f878858a | 14d6b54cdec6bc | 4b5db40aec4797 | squad-fpp |
| freq | 1 | 74 | 100 | 1756186 |


# Data Preprocessing

## Creating derived variables

- boosts + heals -> boosts_heals
- matchType -> matchType_game, matchType_team, matchType_fpp
    - matchType_game = {0: "rank", 1: "normal", 2: "event"}
    - matchType_team = {0: "solo", 1: "duo", 2: "squad", 3: "event"}
    - matchType_fpp = {0: "tpp", 1: "fpp"}

[['solo', 'duo', 'squad'], ['solo-fpp', 'duo-fpp', 'squad-fpp'], ['normal-solo', 'normal-duo', 'normal-squad'],
['normal-solo-fpp', 'normal-duo-fpp', 'normal-squad-fpp'], ['crashtpp', 'crashfpp', 'flaretpp', 'flarefpp']]

- swimDistance + walkDistance -> nonrideDistance


In [11]:
train_dv = train_raw.copy()
print(train_dv.shape)
display(train_dv.head())

(4446966, 29)


| | Id | groupId | matchId | assists | boosts | damageDealt | DBNOs | headshotKills | heals | killPlace | ... | revives | rideDistance | roadKills | swimDistance | teamKills | vehicleDestroys | walkDistance | weaponsAcquired | winPoints | winPlacePerc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7f96b2f878858a | 4d4b580de459be | a10357fd1a4a91 | 0 | 0 | 0.0 | 0 | 0 | 0 | 60 | ... | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 244.8 | 1 | 1466 | 0.4444 |
| 1 | eef90569b9d03c | 684d5656442f9e | aeb375fc57110c | 0 | 0 | 91.47 | 0 | 0 | 0 | 57 | ... | 0 | 0.0045 | 0 | 11.04 | 0 | 0 | 1434.0 | 5 | 0 | 0.64 |
| 2 | 1eaf90ac73de72 | 6a4a42c3245a74 | 110163d8bb94ae | 1 | 0 | 68.0 | 0 | 0 | 0 | 47 | ... | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 161.8 | 2 | 0 | 0.7755 |
| 3 | 4616d365dd2853 | a930a9c79cd721 | f1f1f4ef412d7e | 0 | 0 | 32.9 | 0 | 0 | 0 | 75 | ... | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 202.7 | 3 | 0 | 0.1667 |
| 4 | 315c96c26c9aac | de04010b3458dd | 6dc8ff871e21e6 | 0 | 0 | 100.0 | 0 | 0 | 0 | 45 | ... | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 49.75 | 2 | 0 | 0.1875 |


### matchType

In [32]:
train_raw["matchType"].unique()

array(['squad-fpp', 'duo', 'solo-fpp', 'squad', 'duo-fpp', 'solo',
       'normal-squad-fpp', 'crashfpp', 'flaretpp', 'normal-solo-fpp',
       'flarefpp', 'normal-duo-fpp', 'normal-duo', 'normal-squad',
       'crashtpp', 'normal-solo'], dtype=object)

In [33]:
train_raw["matchType"].value_counts()

squad-fpp           1756186
duo-fpp              996691
squad                626526
solo-fpp             536762
duo                  313591
solo                 181943
normal-squad-fpp      17174
crashfpp               6287
normal-duo-fpp         5489
flaretpp               2505
normal-solo-fpp        1682
flarefpp                718
normal-squad            516
crashtpp                371
normal-solo             326
normal-duo              199
Name: matchType, dtype: int64

In [22]:
# matchType_game = {0: "rank", 1: "normal", 2: "event"}

def temp(x):
    rank = ['solo', 'duo', 'squad', 'solo-fpp', 'duo-fpp', 'squad-fpp']
    normal = ['normal-solo', 'normal-duo', 'normal-squad', 'normal-solo-fpp', 'normal-duo-fpp', 'normal-squad-fpp']
    # Each event mode must be its own list item: without the comma between
    # 'flaretpp' and 'flarefpp', Python joins them into a single string and
    # both modes fall through to None.
    event = ['crashtpp', 'crashfpp', 'flaretpp', 'flarefpp']
    if x in event:
        return 2
    elif x in normal:
        return 1
    elif x in rank:
        return 0
    else:
        return None

train_dv["matchType_game"] = train_dv["matchType"].map(temp)
train_dv["matchType_game"].value_counts()

0    4411699
1      25386
2       9881
Name: matchType_game, dtype: int64

In [23]:
# matchType_team = {0: "solo", 1: "duo", 2: "squad", 3: "event"}

def temp(x):
    if "solo" in x:
        return 0
    elif 'duo' in x:
        return 1
    elif 'squad' in x:
        return 2
    else:
        return 3

train_dv["matchType_team"] = train_dv["matchType"].map(temp)
train_dv["matchType_team"].value_counts()

2    2400402
1    1315970
0     720713
3       9881
Name: matchType_team, dtype: int64

In [21]:
# matchType_fpp = {0: "tpp", 1: "fpp"}

def temp(x):
    if "fpp" in x:
        return 1
    else:
        return 0

train_dv["matchType_fpp"] = train_dv["matchType"].map(temp)
train_dv["matchType_fpp"].value_counts()

1    3320989
0    1125977
Name: matchType_fpp, dtype: int64
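The three matchType encodings can also be derived in a single pass. The sketch below is one possible arrangement, not the notebook's code; the names `encode_match_type`, `RANK`, `NORMAL` and `EVENT` are made up for illustration. It raises an error on any value that matches none of the known groups, which makes a mistyped mode name visible immediately instead of silently producing missing values.

```python
import pandas as pd

RANK = {"solo", "duo", "squad", "solo-fpp", "duo-fpp", "squad-fpp"}
NORMAL = {"normal-" + mode for mode in RANK}
EVENT = {"crashtpp", "crashfpp", "flaretpp", "flarefpp"}

def encode_match_type(match_type):
    """Return (game, team, fpp) codes for one matchType string."""
    if match_type in RANK:
        game = 0
    elif match_type in NORMAL:
        game = 1
    elif match_type in EVENT:
        game = 2
    else:
        raise ValueError(f"unmapped matchType: {match_type!r}")
    # 0: solo, 1: duo, 2: squad, 3: event (no team-size keyword in the name)
    team = next(
        (code for code, key in enumerate(["solo", "duo", "squad"]) if key in match_type),
        3,
    )
    fpp = int("fpp" in match_type)  # 1: first-person, 0: third-person
    return game, team, fpp

# Hypothetical rows, for illustration only
demo = pd.DataFrame({"matchType": ["squad-fpp", "normal-duo", "flaretpp", "solo"]})
encoded = pd.DataFrame(
    demo["matchType"].map(encode_match_type).tolist(),
    columns=["matchType_game", "matchType_team", "matchType_fpp"],
    index=demo.index,
)
demo = demo.join(encoded)
print(demo)
```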

### boosts + heals -> boosts_heals
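The notebook leaves this step empty. A minimal sketch of the intended sum follows, using a made-up demo frame; on the notebook's data the same expression would presumably be applied to `train_dv`. Widening to int16 first is a precaution that is assumed here rather than taken from the notebook: both columns are int8 after the downcast, and an int8 sum above 127 would wrap around.

```python
import pandas as pd

# Hypothetical rows stored as int8, mirroring the downcast dtypes
demo = pd.DataFrame({"boosts": [0, 2, 100], "heals": [1, 0, 100]}, dtype="int8")

# Widen before adding so the result cannot wrap around at 127
demo["boosts_heals"] = demo["boosts"].astype("int16") + demo["heals"].astype("int16")
print(demo)
```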

### swimDistance + walkDistance -> nonrideDistance
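This step is also left empty in the notebook. A minimal sketch of the sum, again on a made-up demo frame rather than on `train_dv`. The float32 dtype is an assumption made for the sketch: distances in this data set reach tens of thousands of meters, where float16 would round noticeably.

```python
import pandas as pd

# Hypothetical rows, for illustration only
demo = pd.DataFrame(
    {"swimDistance": [0.0, 11.04], "walkDistance": [244.8, 1434.0]},
    dtype="float32",
)

# Distance covered without a vehicle: swimming plus walking
demo["nonrideDistance"] = demo["swimDistance"] + demo["walkDistance"]
print(demo)
```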

# ML

In [3]:
# The raw data is too large, so draw a random sample to use for modeling
# _rs: random-sample data
train_rs = train_raw.sample(n=10000)
display(train_rs.head())
print(train_rs.shape)

| | Id | groupId | matchId | assists | boosts | damageDealt | DBNOs | headshotKills | heals | killPlace | ... | revives | rideDistance | roadKills | swimDistance | teamKills | vehicleDestroys | walkDistance | weaponsAcquired | winPoints | winPlacePerc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 607585 | b2fa3690f26eb2 | 2b6af95ab2e05f | e8850d318cb2d0 | 0 | 0 | 100.0 | 1 | 1 | 0 | 35 | ... | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 666.9 | 1 | 1491 | 0.24 |
| 4087103 | 90d4fa513b09c2 | 04590fa7b281dd | ad86940e23ec2c | 0 | 2 | 78.69 | 0 | 0 | 1 | 62 | ... | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 207.2 | 4 | 1500 | 0.4639 |
| 3721251 | 45178bc216b0a8 | 40f62d4fd4ab60 | b369a6830e0e75 | 0 | 0 | 217.4 | 2 | 1 | 1 | 19 | ... | 1 | 0.0 | 0 | 0.0 | 0 | 0 | 533.1 | 7 | 0 | 0.1852 |
| 1856326 | 0b5ab35299ac06 | dd1c4d2e1f3bcd | b27dc971aaa3e7 | 0 | 0 | 0.0 | 0 | 0 | 0 | 89 | ... | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 11.59 | 2 | 1466 | 0.0745 |
| 1816324 | 8150efbf3e2d20 | ed8c023e2616e8 | 65efd4df7edf9e | 0 | 0 | 83.85 | 0 | 0 | 0 | 83 | ... | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 80.91 | 1 | 0 | 0.1064 |


(10000, 29)


In [26]:
label = 'winPlacePerc'
features = train_raw.columns.tolist()
features.remove(label)
print(label)
print(features)

winPlacePerc
['Id', 'groupId', 'matchId', 'assists', 'boosts', 'damageDealt', 'DBNOs', 'headshotKills', 'heals', 'killPlace', 'killPoints', 'kills', 'killStreaks', 'longestKill', 'matchDuration', 'matchType', 'maxPlace', 'numGroups', 'rankPoints', 'revives', 'rideDistance', 'roadKills', 'swimDistance', 'teamKills', 'vehicleDestroys', 'walkDistance', 'weaponsAcquired', 'winPoints']


In [None]:
# `train` is not defined anywhere above; use the random sample prepared for modeling
X_train = train_rs[features]
y_train = train_rs[label]
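Two details from the outputs above matter before any model is fitted: the feature list still contains the object columns Id, groupId, matchId and matchType, and the describe() output reports one fewer value for winPlacePerc than there are rows, so one label is missing. The sketch below shows one possible way to handle both on a made-up demo frame; it is an assumption about a sensible next step, not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, for illustration only
demo = pd.DataFrame({
    "Id": ["a", "b", "c"],
    "matchType": ["solo", "duo", "squad"],
    "kills": [0, 3, 1],
    "walkDistance": [244.8, 1434.0, 161.8],
    "winPlacePerc": [0.4444, np.nan, 0.7755],
})

label = "winPlacePerc"

# Rows without a target cannot be used for supervised training
clean = demo.dropna(subset=[label])

# Keep only numeric predictors; identifier and text columns are left out
numeric_features = (
    clean.drop(columns=[label]).select_dtypes(include="number").columns.tolist()
)

X_train = clean[numeric_features]
y_train = clean[label]
print(numeric_features, X_train.shape, y_train.shape)
```

Dropping matchType entirely loses information; the encoded matchType_game, matchType_team and matchType_fpp columns could take its place once they exist on the frame used for training.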

## Modeling

In [None]:
from 