# Imports

In [14]:
import numpy as np
import pandas as pd

# Data exploration and preparation

In [27]:
df_train = pd.read_csv('data/rocketskillshots_train.csv')
df_test = pd.read_csv('data/rocketskillshots_test.csv')

In [34]:
df_train.head()

Unnamed: 0,id,window_id,BallAcceleration,Time,DistanceWall,DistanceCeil,DistanceBall,PlayerSpeed,BallSpeed,up,...,slow,goal,left,boost,camera,down,right,slide,jump,label
0,0,,0.0,2.205022,3817.38,2013.0,,150959.239888,145648.06166,0.0,...,0,0,0,0,0,0,1,0.0,0.0,6
1,0,0.0,1636.798772,0.0,3498.01,2012.98,,104267.426232,99035.849337,0.0,...,0,0,0,0,0,0,1,0.0,1.0,6
2,0,1.0,3198.029397,0.138893,3494.08,2012.98,229.89678,124248.031988,102233.878734,0.0,...,0,0,0,1,0,0,1,0.0,1.0,6
3,0,2.0,0.0,0.173617,3494.08,2012.98,,124248.031988,102968.35899,0.0,...,0,0,0,1,0,0,0,0.0,0.0,6
4,0,3.0,9914.766242,0.31251,3500.08,2012.98,,115248.016009,112883.125231,0.0,...,0,0,0,0,0,0,1,0.0,0.0,6


First, let's check for missing values in the dataset.

In [29]:
df_train.isna().sum()

id                          0
window_id                 178
BallAcceleration           45
Time                        0
DistanceWall               93
DistanceCeil              116
DistanceBall             3112
PlayerSpeed                 0
BallSpeed                  36
up                          0
accelerate                  0
slow                        0
goal                        0
left                        0
boost                       0
camera                      0
down                        0
right                       0
slide                       0
jump                        0
BallAcceleration_skew    3959
Time_skew                3959
DistanceWall_skew        3959
DistanceCeil_skew        3959
DistanceBall_skew        3959
PlayerSpeed_skew         3959
BallSpeed_skew           3959
up_skew                  3959
accelerate_skew          3959
slow_skew                3959
goal_skew                3959
left_skew                3959
boost_skew               3959
camera_ske

From this overview, we see that all *\*\_skew* parameters (eg. BallAcceleration_skew, Time_skew, etc.) have a large number of missing values (only 178 non-null values per parameter).

Aside from these, the Distance_Ball parameter has a very large number of missing values (3112 null values).

Most other parameters have either none or a small number of missing values:
- *window\_id*: 178 null values
- *BallAcceleration*: 45 null values
- *Distance_Wall*: 93 null values
- *Distance_Ceil*: 116 null values
- *Distance_Ball*: 3112 null values (!)
- *BallSpeed*: 36 null values
- *\*\_skew*: 3959 null values (!)
- all other parameters: 0 null values

Given such a large amount of missing values, the *Distance_Ball* parameter and all *\*\_skew* parameters will be excluded from further analysis and model training.

In [30]:
df_train = df_train.loc[:,~df_train.columns.str.endswith('_skew')]
df_train.drop('DistanceBall', axis=1)

Unnamed: 0,id,window_id,BallAcceleration,Time,DistanceWall,DistanceCeil,PlayerSpeed,BallSpeed,up,accelerate,slow,goal,left,boost,camera,down,right,slide,jump,label
0,0,,0.000000,2.205022,3817.38,2013.00,150959.239888,145648.061660,0.0,0.0,0,0,0,0,0,0,1,0.0,0.0,6
1,0,0.0,1636.798772,0.000000,3498.01,2012.98,104267.426232,99035.849337,0.0,0.0,0,0,0,0,0,0,1,0.0,1.0,6
2,0,1.0,3198.029397,0.138893,3494.08,2012.98,124248.031988,102233.878734,0.0,0.0,0,0,0,1,0,0,1,0.0,1.0,6
3,0,2.0,0.000000,0.173617,3494.08,2012.98,124248.031988,102968.358990,0.0,0.0,0,0,0,1,0,0,0,0.0,0.0,6
4,0,3.0,9914.766242,0.312510,3500.08,2012.98,115248.016009,112883.125231,0.0,0.0,0,0,0,0,0,0,1,0.0,0.0,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4132,297,10.0,0.000000,2.057068,3038.61,1945.02,219368.460358,0.000000,0.0,0.0,0,0,0,0,0,0,1,1.0,0.0,-1
4133,297,11.0,0.000000,2.126637,3082.31,,220365.796688,0.000000,0.0,0.0,0,0,0,0,1,0,1,1.0,0.0,-1
4134,297,12.0,0.000000,2.196237,3126.01,1986.26,221520.167700,0.000000,0.0,1.0,0,0,0,1,0,0,1,1.0,0.0,-1
4135,297,13.0,0.000000,2.300599,3160.95,2005.86,221196.340110,0.000000,0.0,1.0,0,0,0,1,1,0,1,1.0,0.0,-1


Next, let's get an overview of values in the dataset's columns.

In [32]:
for col in df_train.columns:
    print(f"{col} value counts:")
    print(df_train[col].value_counts())
    print("\n")

id value counts:
id
134    65
155    50
278    49
146    46
265    45
       ..
57      9
87      8
255     7
244     5
231     5
Name: count, Length: 178, dtype: int64


window_id value counts:
window_id
0.0     178
1.0     178
2.0     178
3.0     178
4.0     176
       ... 
59.0      1
60.0      1
61.0      1
62.0      1
63.0      1
Name: count, Length: 64, dtype: int64


BallAcceleration value counts:
BallAcceleration
 0.000000       718
-1393.000000      4
-57.023293        2
 2157.000000      2
-2503.000000      2
               ... 
 1981.432894      1
 716.754644       1
 3070.897867      1
 5204.160485      1
-6024.047940      1
Name: count, Length: 3296, dtype: int64


Time value counts:
Time
0.000000    178
0.382700     12
0.347900      9
0.034800      7
0.417500      6
           ... 
0.835022      1
1.009003      1
1.182963      1
1.704834      1
2.474537      1
Name: count, Length: 3540, dtype: int64


DistanceWall value counts:
DistanceWall
35.99      26
0.00       19
36.

Here, we can easily see the distribution of values in the categorical parameters (from the parameter *up* onward). Most of these parameters have an unbalanced distribution. This could possibly mean that they can be used for easier differentiation of trickshots (i.e. if the parameter has one of the more rare values, it is likely that the trickshot belongs to a specific category).

The goal of the task is to predict the *label* column, which is a categorical property with 7 categories.

In [33]:
df_train.describe()

Unnamed: 0,id,window_id,BallAcceleration,Time,DistanceWall,DistanceCeil,DistanceBall,PlayerSpeed,BallSpeed,up,...,slow,goal,left,boost,camera,down,right,slide,jump,label
count,4137.0,3959.0,4092.0,4137.0,4044.0,4021.0,1025.0,4137.0,4101.0,4137.0,...,4137.0,4137.0,4137.0,4137.0,4137.0,4137.0,4137.0,4137.0,4137.0,4137.0
mean,143.800822,12.901743,-3814.882875,2.021066,3780.541356,1695.130315,1081.380738,150372.747353,123900.647991,0.044356,...,0.033358,0.119652,0.01692,0.254774,0.126904,0.012328,0.924583,0.263113,0.403312,3.35533
std,84.764944,10.064004,46389.387153,1.729769,13671.927555,540.974674,1535.02081,49072.15042,72183.192982,0.205468,...,0.17959,0.324593,0.128989,0.435787,0.332905,0.110357,0.264095,0.440171,0.490437,2.588084
min,0.0,0.0,-298303.227932,0.0,0.0,0.07,129.800236,27.037012,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0
25%,70.0,5.0,-1846.379236,0.731333,1219.25,1473.77,234.093144,122054.835787,87607.044272,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
50%,146.0,11.0,0.0,1.598579,3154.4,1976.27,429.132909,148003.167986,129003.890178,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,3.0
75%,214.0,18.0,985.247663,2.8879,3765.1375,2013.0,929.411923,185378.187706,168797.513992,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,6.0
max,297.0,63.0,287269.750948,13.470363,223799.815054,4039.97,9194.156158,229999.958811,309832.16491,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,7.0


This output shows us the distribution of values of continuous parameters.

# Model creation

In the process of finding the best prediction model for this task, several different machine learning approaches will be taken into consideration:
- Decision tree
- Random forest
- ...

## Decision tree

## Random forest