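## PUBG Finish Placement Prediction
- In a PUBG game, up to 100 players start in each match (matchId). Players can be on teams (groupId), which are ranked at the end of the game (winPlacePerc) by how many other teams are still alive when they are eliminated. In game, players can pick up different munitions, revive teammates who are downed but not out, drive vehicles, swim, run, shoot, and experience all of the consequences, such as falling too far or running themselves over and eliminating themselves.
- You are given a large number of anonymized PUBG game stats, formatted so that each row contains one player's post-game stats. The data comes from matches of all types: solos, duos, squads, and custom. There is no guarantee of 100 players per match, nor of at most 4 players per group.
- You must create a model that predicts each player's finishing placement from their final stats, on a scale from 1 (first place) to 0 (last place).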

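### Evaluation
- The submission has two columns: the test-set player ID (Id) and the predicted final placement percentile (winPlacePerc). The prediction is a placement on the 0-1 scale, not a probability.
- The evaluation metric is the mean absolute error (MAE) between the predicted winPlacePerc and the observed winPlacePerc.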
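
To make the metric concrete, here is a minimal sketch of MAE computed with NumPy. The `observed` and `predicted` values are made up for illustration and do not come from the dataset.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average of |observed - predicted| over all rows."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical winPlacePerc values, for illustration only
observed = [0.50, 0.25, 1.00, 0.00]
predicted = [0.40, 0.35, 0.90, 0.20]

mae = mean_absolute_error(observed, predicted)
print(round(mae, 4))
```

Because every error is weighted linearly, a prediction that is off by 0.2 costs exactly twice as much as one that is off by 0.1.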

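### Feature definitions
- assists - Number of enemy players this player damaged that were then killed by teammates
- boosts - Number of boost items used
- damageDealt - Total damage dealt; self-inflicted damage is subtracted
- DBNOs - Number of enemy players knocked down
- headshotKills - Number of enemy players killed with headshots
- heals - Number of healing items used
- Id - Player ID
- killPlace - Ranking in the match by number of enemy players killed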
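- killPoints - Kills-based external ranking of the player. Think of it as an Elo ranking where only kills matter. If rankPoints holds a value other than -1, any 0 in killPoints should be treated as "None".
- killStreaks - Maximum number of enemy players killed in a short amount of time
- kills - Number of enemy players killed
- longestKill - Longest distance between the player and a killed player at the time of death. This can be misleading, because downing a player and then driving away can produce a large longestKill value.
- matchDuration - Duration of the match in seconds
- matchId - ID of the match (one ID per match)
- matchType - Solo, duo, or squad mode. The standard modes are "solo", "duo", "squad", "solo-fpp", "duo-fpp", and "squad-fpp"; other modes come from events or custom matches.
- rankPoints - Elo-like ranking of the player. This ranking is inconsistent and is deprecated in the next version of the API, so use it with caution. A value of -1 stands for "None".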
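- revives - Number of times this player revived teammates
- rideDistance - Total distance traveled in vehicles, in meters
- roadKills - Number of kills made while in a vehicle
- swimDistance - Total distance traveled by swimming, in meters
- teamKills - Number of times this player killed a teammate
- vehicleDestroys - Number of vehicles destroyed
- walkDistance - Total distance traveled on foot, in meters
- groupId - ID of the group within a match. If the same group of players plays in different matches, they get a different groupId each time.
- numGroups - Number of groups in the match for which we have data
- maxPlace - Worst placement in the match for which we have data. This may not match numGroups, because the data sometimes skips placements.
- weaponsAcquired - Number of weapons picked up
- winPoints - Win-based external ranking of the player. Think of it as an Elo ranking where only winning matters. If rankPoints holds a value other than -1, any 0 in winPoints should be treated as "None".
- winPlacePerc - The prediction target. It is a percentile placement between 0 and 1, where 1 corresponds to first place and 0 to last place. It is calculated from maxPlace, not numGroups, so a match may be missing some groups.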
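
The "None" rules for killPoints, winPoints, and rankPoints can be applied mechanically before modelling. The following is a minimal sketch, assuming that NaN is an acceptable way to mark missing values; the helper name `mask_missing_points` is hypothetical, and the two demo rows reuse values from the first two rows of the training sample.

```python
import numpy as np
import pandas as pd

def mask_missing_points(df):
    """Return a copy in which the 'None' sentinels are replaced by NaN."""
    out = df.copy()
    has_rank = out["rankPoints"] != -1
    for col in ("killPoints", "winPoints"):
        # A 0 only means "None" when rankPoints carries a real value
        is_sentinel = has_rank & (out[col] == 0)
        out[col] = out[col].astype(float).mask(is_sentinel)
    # In rankPoints itself, -1 stands for "None"
    out["rankPoints"] = out["rankPoints"].astype(float).mask(~has_rank)
    return out

# killPoints / rankPoints / winPoints taken from two sample training rows
demo = pd.DataFrame({
    "killPoints": [1241, 0],
    "rankPoints": [-1, 1484],
    "winPoints": [1466, 0],
})
cleaned = mask_missing_points(demo)
print(cleaned)
```

Whether NaN, a separate indicator column, or a merged rating column works best depends on the model; this sketch only makes the sentinel convention explicit.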

In [7]:
# Setup
import os
import numpy as np          # numerical arrays
import pandas as pd         # tabular data handling and analysis
from scipy import stats     # statistical functions
from scipy.stats import norm, skew
from collections import Counter # dict subclass for counting: keys are the items, values are their counts
from sklearn.preprocessing import LabelEncoder # encodes categorical labels as integer codes
from sklearn.preprocessing import MinMaxScaler

# Import the modelling libraries
from sklearn.linear_model import ElasticNet, Lasso,  BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb

# Data visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
color = sns.color_palette() # default color palette
sns.set_style('darkgrid') # available themes: darkgrid, whitegrid, dark, white, ticks
# Maximum number of rows and columns to display
pd.set_option('display.max_rows', 200) 
pd.set_option('display.max_columns', 100) 

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

In [8]:
# Load the training dataset
train = pd.read_csv('./data/pubg/train_V2.csv')
train.head(5)

Unnamed: 0,Id,groupId,matchId,assists,boosts,damageDealt,DBNOs,headshotKills,heals,killPlace,killPoints,kills,killStreaks,longestKill,matchDuration,matchType,maxPlace,numGroups,rankPoints,revives,rideDistance,roadKills,swimDistance,teamKills,vehicleDestroys,walkDistance,weaponsAcquired,winPoints,winPlacePerc
0,7f96b2f878858a,4d4b580de459be,a10357fd1a4a91,0,0,0.0,0,0,0,60,1241,0,0,0.0,1306,squad-fpp,28,26,-1,0,0.0,0,0.0,0,0,244.8,1,1466,0.4444
1,eef90569b9d03c,684d5656442f9e,aeb375fc57110c,0,0,91.47,0,0,0,57,0,0,0,0.0,1777,squad-fpp,26,25,1484,0,0.0045,0,11.04,0,0,1434.0,5,0,0.64
2,1eaf90ac73de72,6a4a42c3245a74,110163d8bb94ae,1,0,68.0,0,0,0,47,0,0,0,0.0,1318,duo,50,47,1491,0,0.0,0,0.0,0,0,161.8,2,0,0.7755
3,4616d365dd2853,a930a9c79cd721,f1f1f4ef412d7e,0,0,32.9,0,0,0,75,0,0,0,0.0,1436,squad-fpp,31,30,1408,0,0.0,0,0.0,0,0,202.7,3,0,0.1667
4,315c96c26c9aac,de04010b3458dd,6dc8ff871e21e6,0,0,100.0,0,0,0,45,0,1,1,58.53,1424,solo-fpp,97,95,1560,0,0.0,0,0.0,0,0,49.75,2,0,0.1875
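
As a quick sanity check on the winPlacePerc definition, the five rows above are all consistent with winPlacePerc being a whole number of steps of size 1 / (maxPlace - 1). This relation is inferred from these sample rows only and is not stated in the feature definitions, so treat it as an assumption rather than the official formula.

```python
import pandas as pd

# maxPlace and winPlacePerc copied from the five rows printed above
sample = pd.DataFrame({
    "maxPlace": [28, 26, 50, 31, 97],
    "winPlacePerc": [0.4444, 0.64, 0.7755, 0.1667, 0.1875],
})

# Number of 1/(maxPlace - 1) steps above last place (assumed relation)
steps = sample["winPlacePerc"] * (sample["maxPlace"] - 1)
sample["steps"] = steps.round().astype(int)

# Rebuild the target from the integer step count and compare at 4 decimals
sample["rebuilt"] = (sample["steps"] / (sample["maxPlace"] - 1)).round(4)
print(sample)
```

If the relation holds across the full data, predictions could be snapped to the nearest valid step within each match, but that needs to be verified on all rows first.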


In [9]:
# Load the test dataset
test = pd.read_csv('./data/pubg/test_V2.csv')
test.head(5)

Unnamed: 0,Id,groupId,matchId,assists,boosts,damageDealt,DBNOs,headshotKills,heals,killPlace,killPoints,kills,killStreaks,longestKill,matchDuration,matchType,maxPlace,numGroups,rankPoints,revives,rideDistance,roadKills,swimDistance,teamKills,vehicleDestroys,walkDistance,weaponsAcquired,winPoints
0,9329eb41e215eb,676b23c24e70d6,45b576ab7daa7f,0,0,51.46,0,0,0,73,0,0,0,0.0,1884,squad-fpp,28,28,1500,0,0.0,0,0.0,0,0,588.0,1,0
1,639bd0dcd7bda8,430933124148dd,42a9a0b906c928,0,4,179.1,0,0,2,11,0,2,1,361.9,1811,duo-fpp,48,47,1503,2,4669.0,0,0.0,0,0,2017.0,6,0
2,63d5c8ef8dfe91,0b45f5db20ba99,87e7e4477a048e,1,0,23.4,0,0,4,49,0,0,0,0.0,1793,squad-fpp,28,27,1565,0,0.0,0,0.0,0,0,787.8,4,0
3,cf5b81422591d1,b7497dbdc77f4a,1b9a94f1af67f1,0,0,65.52,0,0,0,54,0,0,0,0.0,1834,duo-fpp,45,44,1465,0,0.0,0,0.0,0,0,1812.0,3,0
4,ee6a295187ba21,6604ce20a1d230,40754a93016066,0,4,330.2,1,2,1,7,0,3,1,60.06,1326,squad-fpp,28,27,1480,1,0.0,0,0.0,0,0,2963.0,4,0


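### Inspecting the data
- The ID columns (Id, groupId, matchId) are of object dtype, and so is matchType
- winPlacePerc is a floating-point column
- The dataset is very large: it takes up nearly 1 GB in memory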
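
These observations can also be checked programmatically. The sketch below uses a tiny stand-in frame (the real `train` frame is assumed to behave the same way) and relies on `select_dtypes` and `memory_usage`. Note that `memory_usage()` without `deep=True` counts only the pointers of object columns, so the true footprint of the string columns is larger than the default figure.

```python
import pandas as pd

# Tiny stand-in for the training frame; object dtype is set explicitly
demo = pd.DataFrame({
    "Id": pd.Series(["7f96b2f878858a", "eef90569b9d03c"], dtype=object),
    "matchType": pd.Series(["squad-fpp", "squad-fpp"], dtype=object),
    "kills": [0, 0],
    "winPlacePerc": [0.4444, 0.64],
})

# Columns stored as generic Python objects (strings here)
object_cols = demo.select_dtypes(include="object").columns.tolist()

# Shallow size counts 8-byte pointers; deep size also counts the strings
shallow_bytes = demo.memory_usage().sum()
deep_bytes = demo.memory_usage(deep=True).sum()

print(object_cols)
print(shallow_bytes, deep_bytes)
```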

In [11]:
# Check the dtype of each feature
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4446966 entries, 0 to 4446965
Data columns (total 29 columns):
 #   Column           Dtype  
---  ------           -----  
 0   Id               object 
 1   groupId          object 
 2   matchId          object 
 3   assists          int64  
 4   boosts           int64  
 5   damageDealt      float64
 6   DBNOs            int64  
 7   headshotKills    int64  
 8   heals            int64  
 9   killPlace        int64  
 10  killPoints       int64  
 11  kills            int64  
 12  killStreaks      int64  
 13  longestKill      float64
 14  matchDuration    int64  
 15  matchType        object 
 16  maxPlace         int64  
 17  numGroups        int64  
 18  rankPoints       int64  
 19  revives          int64  
 20  rideDistance     float64
 21  roadKills        int64  
 22  swimDistance     float64
 23  teamKills        int64  
 24  vehicleDestroys  int64  
 25  walkDistance     float64
 26  weaponsAcquired  int64  
 27  winPoints   

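### Reducing memory usage
- The main idea is to downcast each numeric column from a dtype that takes a lot of space to the smallest dtype whose range still holds its values. We go through the columns one by one and end up cutting memory usage by more than half. For example, kills was found never to exceed 50 in absolute value, so its original int64 dtype can be converted to int8. Integer downcasts lose no data. Float downcasts keep the value range but round the values: after the float16 cast, 244.8 is stored as 244.75.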

In [13]:
def reduce_mem_usage(df):
    """Downcast numeric columns in place to the smallest dtype whose range holds their values.

    Integer downcasts are lossless; float16/float32 casts keep the range but round the values.
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:  # skip the string columns (Id, groupId, matchId, matchType)
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Try the integer dtypes from narrowest to widest; anything wider stays int64
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
            else:
                # Float columns: only the range is checked, so float16 rounds the stored values
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
reduce_mem_usage(train)

Memory usage of dataframe is 983.90 MB
Memory usage after optimization is: 288.39 MB
Decreased by 70.7%


Unnamed: 0,Id,groupId,matchId,assists,boosts,damageDealt,DBNOs,headshotKills,heals,killPlace,killPoints,kills,killStreaks,longestKill,matchDuration,matchType,maxPlace,numGroups,rankPoints,revives,rideDistance,roadKills,swimDistance,teamKills,vehicleDestroys,walkDistance,weaponsAcquired,winPoints,winPlacePerc
0,7f96b2f878858a,4d4b580de459be,a10357fd1a4a91,0,0,0.00000,0,0,0,60,1241,0,0,0.00000,1306,squad-fpp,28,26,-1,0,0.000000,0,0.000000,0,0,244.7500,1,1466,0.444336
1,eef90569b9d03c,684d5656442f9e,aeb375fc57110c,0,0,91.50000,0,0,0,57,0,0,0,0.00000,1777,squad-fpp,26,25,1484,0,0.004501,0,11.039062,0,0,1434.0000,5,0,0.640137
2,1eaf90ac73de72,6a4a42c3245a74,110163d8bb94ae,1,0,68.00000,0,0,0,47,0,0,0,0.00000,1318,duo,50,47,1491,0,0.000000,0,0.000000,0,0,161.7500,2,0,0.775391
3,4616d365dd2853,a930a9c79cd721,f1f1f4ef412d7e,0,0,32.90625,0,0,0,75,0,0,0,0.00000,1436,squad-fpp,31,30,1408,0,0.000000,0,0.000000,0,0,202.7500,3,0,0.166748
4,315c96c26c9aac,de04010b3458dd,6dc8ff871e21e6,0,0,100.00000,0,0,0,45,0,1,1,58.53125,1424,solo-fpp,97,95,1560,0,0.000000,0,0.000000,0,0,49.7500,2,0,0.187500
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4446961,afff7f652dbc10,d238e426f50de7,18492834ce5635,0,0,0.00000,0,0,0,74,1029,0,0,0.00000,1873,squad-fpp,29,28,-1,0,1292.000000,0,0.000000,0,0,1019.0000,3,1507,0.178589
4446962,f4197cf374e6c0,408cdb5c46b2ac,ee854b837376d9,0,1,44.15625,0,0,0,69,0,0,0,0.00000,1435,solo,93,93,1501,0,0.000000,0,0.000000,0,0,81.6875,6,0,0.293457
4446963,e1948b1295c88a,e26ac84bdf7cef,6d0cd12784f1ab,0,0,59.06250,0,0,0,66,0,0,0,0.00000,1321,squad-fpp,28,28,1500,0,0.000000,0,2.183594,0,0,788.5000,4,0,0.481445
4446964,cc032cdd73b7ac,c2223f35411394,c9c701d0ad758a,0,4,180.37500,1,1,2,11,0,2,1,98.50000,1373,squad-fpp,26,25,1418,2,0.000000,0,0.000000,0,0,2748.0000,8,0,0.799805
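
The float columns in the output above no longer match the raw CSV values exactly (walkDistance 244.8 became 244.75, for instance). The sketch below reproduces that rounding on the five walkDistance values from the first training rows and compares it with a float32 cast. Using float32 as the floor for float columns is a suggested alternative, not something the original notebook does.

```python
import numpy as np
import pandas as pd

# walkDistance values from the first five training rows
walk = pd.Series([244.8, 1434.0, 161.8, 202.7, 49.75])

as_f16 = walk.astype(np.float16)   # what reduce_mem_usage produces
as_f32 = walk.astype(np.float32)   # alternative: twice the memory, far less rounding

# Largest absolute rounding error introduced by each cast
err16 = (as_f16.astype(np.float64) - walk).abs().max()
err32 = (as_f32.astype(np.float64) - walk).abs().max()

print(as_f16.tolist())
print(err16, err32)
```

float16 has only an 11-bit significand, so its spacing between representable values is 0.125 for numbers between 128 and 256 and grows to 1 above 1024. Whether that rounding matters depends on the model; tree-based models that split on thresholds are often tolerant of it, but this is worth checking rather than assuming.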
