# Problem Statement

The Australian Open is looking to better automate how tennis points are categorized into three outcomes:

* `Winner`: the point-winning player hits a shot that the opponent does not touch
* `Forced error`: the point-winning player hits a shot that the opponent cannot return, i.e. a good shot that is hard to handle
* `Unforced error`: the player attempting to return the ball makes an error on an otherwise normal-looking rally shot

## Dataset Description

The dataset covers only points that developed into rallies, i.e. points with more than two shots (the serve and the return). All points were played at a past Australian Open.

## Attribute Description

| Variable | Description | Value Range |
| :- | :- | :- |
| rally | The number of shots in the point, including the serve and the point-ending shot | An integer from 1, 2, 3...
| serve | A number indicating whether the point was played on a first or second serve.  | 1 = First, 2 = Second
| hitpoint | Shot category for point-ending shot | F = Forehand, B = Backhand, V = Volley, U = Unknown
| speed | Speed of point-ending shot | Continuous (m/s)
| net.clearance | Distance above the net as point-ending shot passed the net | Continuous (cm) distance above net. Can be negative if shot did not pass above the net.
| distance.from.sideline | Lateral distance of the point-ending shot bounce from the nearest singles sideline. | Perpendicular distance in meters (always positive even if out)
| depth | Distance of the point-ending shot bounce from the baseline | Perpendicular distance in meters (always positive even if out)
| outside.sideline | Logical indicator of whether point-ending shot landed outside of the in-play singles sideline | TRUE, FALSE
| outside.baseline | Logical indicator of whether point-ending shot landed beyond the in-play baseline | TRUE, FALSE
| player.distance.travelled | Distance player who made the point-ending shot travelled between the impact of the penultimate shot and the impact of the point-ending shot | Euclidean distance in meters
| player.impact.depth | Distance of player who made point-ending shot from the net at the time the point-ending shot was made | Perpendicular distance along the length of court from net in meters
| player.impact.distance.from.center | Distance of player who made point-ending shot from the center line at the time the point-ending shot was made | Perpendicular distance from the center line in meters
| player.depth | Distance of player who made point-ending shot from the net at the time the penultimate shot was made | Perpendicular distance along the length of court from net in meters
| player.distance.from.center | Distance of player who made point-ending shot from the center line at the time the penultimate shot was made | Perpendicular distance from the center line in meters
| opponent.depth | Distance of opponent from the net at the time the penultimate shot was made | Perpendicular distance along the length of court from net in meters
| opponent.distance.from.center | Distance of opponent from the center line at the time the penultimate shot was made | Perpendicular distance from the center line in meters
| same.side | Logical indicator if both player and opponent were positioned on the same side of the center line (ad or deuce court) at the time the penultimate shot was made | TRUE, FALSE
| previous.speed | Speed of penultimate shot | Continuous (m/s)
| previous.net.clearance | Distance above the net as penultimate shot passed the net | Continuous (cm) distance above net. Can be negative if shot did not pass above the net.
| previous.distance.from.sideline | Lateral distance of the penultimate  shot bounce from the nearest singles sideline. | Perpendicular distance in meters (always positive even if out)
| previous.depth | Distance of the penultimate shot bounce from the baseline | Perpendicular distance in meters (always positive even if out)
| previous.hitpoint | Shot category for penultimate shot | F = Forehand, B = Backhand, V = Volley, U = Unknown
| previous.time.to.net | Time for penultimate shot to be hit and pass the net | Continuous number in seconds
| server.is.impact.player | Logical if player who made point-ending shot was the server of the point | TRUE, FALSE
| outcome | Target variable, character with three categories indicating the type of shot that ended the point  | W (Winner), FE (Forced Error), UE (Unforced Error)
| id | A 10-character unique identifier for the point | Character
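
The value ranges in the table can double as a quick sanity check on incoming rows. Below is a minimal sketch, assuming a record is represented as a plain dict keyed by the column names above; `check_point` is a hypothetical helper and only a handful of columns are checked.

```python
# Allowed codes taken from the attribute table above.
ALLOWED = {
    "serve": {1, 2},
    "hitpoint": {"F", "B", "V", "U"},
    "previous.hitpoint": {"F", "B", "V", "U"},
    "outcome": {"W", "FE", "UE"},
}

def check_point(record):
    """Return (column, value) pairs that fall outside the documented range.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    problems = []
    for column, allowed in ALLOWED.items():
        if column in record and record[column] not in allowed:
            problems.append((column, record[column]))
    # Bounce distances are documented as always positive, even for balls that land out.
    for column in ("distance.from.sideline", "depth"):
        if column in record and record[column] < 0:
            problems.append((column, record[column]))
    return problems

good = {"serve": 1, "hitpoint": "B", "depth": 6.797621, "outcome": "UE"}
bad = {"serve": 3, "hitpoint": "X", "depth": -0.5, "outcome": "UE"}
print(check_point(good))
print(check_point(bad))
```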

# Import libraries

In [4]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, classification_report

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, GradientBoostingClassifier

from xgboost import XGBClassifier

from sklearn.model_selection import GridSearchCV

In [6]:
!pip install vecstack
from vecstack import stacking


# Load data

In [8]:
data = pd.read_csv("C:/Users/gsk44/OneDrive/Desktop/Stacking/tennis.csv")

# Data Understanding

## Number of records and columns

In [9]:
data.shape

(8001, 27)

## See the first five records

In [10]:
data.head()

Unnamed: 0,rally,serve,hitpoint,speed,net.clearance,distance.from.sideline,depth,outside.sideline,outside.baseline,player.distance.travelled,...,previous.depth,opponent.depth,opponent.distance.from.center,same.side,previous.hitpoint,previous.time.to.net,server.is.impact.player,outcome,gender,ID
0,4,1,B,35.515042,-0.021725,3.474766,6.797621,False,False,1.46757,...,0.705435,12.5628,2.0724,True,F,0.445318,False,UE,mens,8644
1,4,2,B,33.38264,1.114202,2.540801,2.608708,False,True,2.311931,...,3.8566,12.3544,5.1124,False,B,0.432434,False,FE,mens,1182
2,23,1,B,22.31669,-0.254046,3.533166,9.435749,False,False,3.903728,...,2.908892,13.862,1.6564,False,F,0.397538,True,FE,mens,9042
3,9,1,F,36.837309,0.766694,0.586885,3.34218,True,False,0.583745,...,0.557554,14.2596,0.1606,True,B,0.671984,True,UE,mens,1222
4,4,1,B,35.544208,0.116162,0.918725,5.499119,False,False,2.333456,...,3.945317,11.3658,1.1082,False,F,0.340411,False,W,mens,4085


## Different classes in Outcome variable

In [11]:
data.outcome.value_counts()

UE    3501
W     2682
FE    1818
Name: outcome, dtype: int64

In [12]:
data.outcome.value_counts(normalize=True) * 100

UE    43.75703
W     33.52081
FE    22.72216
Name: outcome, dtype: float64
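
These shares give a useful floor: a classifier that always predicts the most frequent class would be right about 43.8% of the time, so any model worth keeping should beat that. A small sketch of the arithmetic, using the counts printed above (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above.
counts = {"UE": 3501, "W": 2682, "FE": 1818}
total = sum(counts.values())

# Percentage share of each class, and the accuracy of always guessing the largest one.
shares = {label: 100 * n / total for label, n in counts.items()}
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(total)
print(majority_label, round(baseline_accuracy, 4))
```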

## Display data type of each variable

In [13]:
data.dtypes

rally                                   int64
serve                                   int64
hitpoint                               object
speed                                 float64
net.clearance                         float64
distance.from.sideline                float64
depth                                 float64
outside.sideline                         bool
outside.baseline                         bool
player.distance.travelled             float64
player.impact.depth                   float64
player.impact.distance.from.center    float64
player.depth                          float64
player.distance.from.center           float64
previous.speed                        float64
previous.net.clearance                float64
previous.distance.from.sideline       float64
previous.depth                        float64
opponent.depth                        float64
opponent.distance.from.center         float64
same.side                                bool
previous.hitpoint                 

## Identifying categorical attributes

In [14]:
categorical_list = ["hitpoint","outside.sideline",
                    "outside.baseline","same.side",
                    "previous.hitpoint",
                    "server.is.impact.player",
                    "gender","outcome"]

## Converting to appropriate datatype

In [15]:
data[categorical_list] = data[categorical_list].astype("category")    

## Display data type of each variable after conversion

In [16]:
data.dtypes

rally                                    int64
serve                                    int64
hitpoint                              category
speed                                  float64
net.clearance                          float64
distance.from.sideline                 float64
depth                                  float64
outside.sideline                      category
outside.baseline                      category
player.distance.travelled              float64
player.impact.depth                    float64
player.impact.distance.from.center     float64
player.depth                           float64
player.distance.from.center            float64
previous.speed                         float64
previous.net.clearance                 float64
previous.distance.from.sideline        float64
previous.depth                         float64
opponent.depth                         float64
opponent.distance.from.center          float64
same.side                             category
previous.hitp

## Dropping the ID column and checking the number of columns

In [17]:
len(data['ID'].unique())

8001

In [18]:
data.shape

(8001, 27)

In [19]:
data.drop(["ID"], axis=1, inplace=True)

len(data.columns)

26

## Display summary statistics 

In [20]:
data.describe()

Unnamed: 0,rally,serve,speed,net.clearance,distance.from.sideline,depth,player.distance.travelled,player.impact.depth,player.impact.distance.from.center,player.depth,player.distance.from.center,previous.speed,previous.net.clearance,previous.distance.from.sideline,previous.depth,opponent.depth,opponent.distance.from.center,previous.time.to.net
count,8001.0,8001.0,8001.0,8001.0,8001.0,8001.0,8001.0,8001.0,8001.0,8001.0,8001.0,8001.0,8001.0,8001.0,8001.0,8001.0,8001.0,8001.0
mean,5.966004,1.3987,30.806938,0.629658,1.46763,4.421146,2.690463,11.899694,1.919544,12.253954,1.213795,28.763676,0.821562,2.19342,4.218717,12.61681,2.367952,0.549988
std,3.548182,0.489661,7.298917,0.982504,1.108697,3.144965,1.713136,2.788231,1.205449,2.039085,0.964364,6.47747,0.674663,1.038942,2.052946,2.075401,1.313927,0.186788
min,3.0,1.0,5.176078,-0.998184,0.000497,0.003135,0.0,2.156,0.0002,1.3898,0.0004,8.449117,0.028865,0.000164,0.000467,2.1612,0.0002,0.003201
25%,3.0,1.0,26.77029,-0.027092,0.5395,1.641161,1.444233,11.2214,0.9424,11.3742,0.5518,24.033218,0.404815,1.354458,2.733674,12.0824,1.3522,0.432164
50%,5.0,1.0,32.41769,0.44587,1.210847,3.860266,2.360894,12.6918,1.8294,12.5516,0.9838,29.793417,0.658382,2.168822,4.126864,12.9016,2.332,0.507559
75%,7.0,2.0,35.681431,0.970844,2.215955,7.029345,3.565853,13.553,2.7452,13.498,1.5966,33.581003,1.021397,3.022677,5.595515,13.7128,3.259,0.624135
max,38.0,2.0,55.052795,12.815893,7.569757,11.886069,14.480546,18.1256,7.7462,18.7458,9.3526,54.207506,6.730275,4.114361,9.997963,20.211,6.8526,1.635257


In [21]:
data.describe(include=['category'])

Unnamed: 0,hitpoint,outside.sideline,outside.baseline,same.side,previous.hitpoint,server.is.impact.player,outcome,gender
count,8001,8001,8001,8001,8001,8001,8001,8001
unique,4,2,2,2,4,2,3,2
top,F,False,False,False,F,True,UE,mens
freq,4402,6500,6380,6036,3684,4670,3501,4005


## Check the distribution of all categorical attributes

In [22]:
for i in categorical_list:
    print(data[i].value_counts(normalize=True)*100)

F    55.018123
B    37.920260
U     5.361830
V     1.699788
Name: hitpoint, dtype: float64
False    81.239845
True     18.760155
Name: outside.sideline, dtype: float64
False    79.740032
True     20.259968
Name: outside.baseline, dtype: float64
False    75.44057
True     24.55943
Name: same.side, dtype: float64
F    46.044244
B    40.969879
V     8.998875
U     3.987002
Name: previous.hitpoint, dtype: float64
True     58.367704
False    41.632296
Name: server.is.impact.player, dtype: float64
mens      50.056243
womens    49.943757
Name: gender, dtype: float64
UE    43.75703
W     33.52081
FE    22.72216
Name: outcome, dtype: float64


## Checking for null values

In [23]:
data.isnull().sum()

rally                                 0
serve                                 0
hitpoint                              0
speed                                 0
net.clearance                         0
distance.from.sideline                0
depth                                 0
outside.sideline                      0
outside.baseline                      0
player.distance.travelled             0
player.impact.depth                   0
player.impact.distance.from.center    0
player.depth                          0
player.distance.from.center           0
previous.speed                        0
previous.net.clearance                0
previous.distance.from.sideline       0
previous.depth                        0
opponent.depth                        0
opponent.distance.from.center         0
same.side                             0
previous.hitpoint                     0
previous.time.to.net                  0
server.is.impact.player               0
outcome                               0


# Divide the data into train and test

In [24]:
y = data["outcome"]
X = data.drop('outcome', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=123, stratify=y)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(5600, 25)
(2401, 25)
(5600,)
(2401,)
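
With `stratify=y`, each class is split in roughly the same 70/30 proportion as the whole dataset. The sketch below reproduces that arithmetic with a simple largest-remainder allocation. It illustrates proportional allocation and is not scikit-learn's actual implementation; `allocate` is a hypothetical helper.

```python
import math

def allocate(class_counts, n_draw):
    """Split n_draw across classes in proportion to class_counts (largest remainder)."""
    total = sum(class_counts.values())
    exact = {k: v * n_draw / total for k, v in class_counts.items()}
    result = {k: math.floor(x) for k, x in exact.items()}
    leftover = n_draw - sum(result.values())
    # Hand the remaining draws to the classes with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda k: exact[k] - result[k], reverse=True)
    for k in by_remainder[:leftover]:
        result[k] += 1
    return result

counts = {"UE": 3501, "W": 2682, "FE": 1818}
n_total = sum(counts.values())
n_test = math.ceil(0.30 * n_total)   # test_size=0.30, rounded up
n_train = n_total - n_test

train_counts = allocate(counts, n_train)
test_counts = {k: counts[k] - train_counts[k] for k in counts}
print(train_counts)
print(test_counts)
```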


# Preprocessing

## Display all the columns

In [25]:
data.columns

Index(['rally', 'serve', 'hitpoint', 'speed', 'net.clearance',
       'distance.from.sideline', 'depth', 'outside.sideline',
       'outside.baseline', 'player.distance.travelled', 'player.impact.depth',
       'player.impact.distance.from.center', 'player.depth',
       'player.distance.from.center', 'previous.speed',
       'previous.net.clearance', 'previous.distance.from.sideline',
       'previous.depth', 'opponent.depth', 'opponent.distance.from.center',
       'same.side', 'previous.hitpoint', 'previous.time.to.net',
       'server.is.impact.player', 'outcome', 'gender'],
      dtype='object')

## Creating lists of numerical and categorical attributes

In [26]:
numeric_list = ['rally', 'serve', 'speed', 'net.clearance',
                'distance.from.sideline', 'depth',
                'player.distance.travelled', 'player.impact.depth',
                'player.impact.distance.from.center',
                'player.depth', 'player.distance.from.center',
                'previous.speed', 'previous.net.clearance',
                'previous.distance.from.sideline', 'previous.depth',
                'opponent.depth', 'opponent.distance.from.center',
                'previous.time.to.net']

categorical_list = ["hitpoint", "outside.sideline", "outside.baseline", "same.side", 
                    "previous.hitpoint", "server.is.impact.player", "gender"]

In [27]:
len(numeric_list)

18

In [28]:
len(categorical_list)

7

## LabelEncoder: Target Attribute

In [29]:
y_train.value_counts()

UE    2450
W     1877
FE    1273
Name: outcome, dtype: int64

In [30]:
le = LabelEncoder()

le.fit(y_train)

y_train = le.transform(y_train)
y_test = le.transform(y_test)

In [31]:
print(le.classes_)

['FE' 'UE' 'W']


In [32]:
print(le.inverse_transform([0, 1, 2]))

['FE' 'UE' 'W']


In [33]:
np.unique(y_train, return_counts=True)
# 0 - Forced Error, 1 - Unforced Error, 2 - Winner

(array([0, 1, 2]), array([1273, 2450, 1877], dtype=int64))

In [34]:
np.unique(y_test, return_counts=True)

(array([0, 1, 2]), array([ 545, 1051,  805], dtype=int64))
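
`LabelEncoder` assigns integer codes in sorted order of the class labels, which is why `FE` maps to 0, `UE` to 1, and `W` to 2. A plain-Python sketch of the same mapping, for illustration only:

```python
labels = ["UE", "W", "FE", "UE", "W"]

# Sorted unique labels play the role of le.classes_.
classes = sorted(set(labels))
to_code = {label: code for code, label in enumerate(classes)}

# transform() and inverse_transform() are then simple lookups.
encoded = [to_code[label] for label in labels]
decoded = [classes[code] for code in encoded]

print(classes)
print(encoded)
```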

## StandardScaler: Independent Numeric Attributes

In [35]:
scaler = StandardScaler()
scaler.fit(X_train[numeric_list])

X_train_num = pd.DataFrame(scaler.transform(X_train[numeric_list]), columns=numeric_list)
X_test_num = pd.DataFrame(scaler.transform(X_test[numeric_list]), columns=numeric_list)

In [36]:
X_train_num.head()

Unnamed: 0,rally,serve,speed,net.clearance,distance.from.sideline,depth,player.distance.travelled,player.impact.depth,player.impact.distance.from.center,player.depth,player.distance.from.center,previous.speed,previous.net.clearance,previous.distance.from.sideline,previous.depth,opponent.depth,opponent.distance.from.center,previous.time.to.net
0,2.206016,1.228398,1.121552,-0.730696,-0.175183,0.988152,-0.829916,1.120228,0.808667,1.52635,0.482096,0.548899,0.520014,1.276148,-0.584783,0.454714,-0.368646,-0.231291
1,4.136322,-0.814069,1.170424,0.163662,-0.86293,-1.148001,0.014953,1.102595,0.362499,0.70273,-1.249078,0.196423,0.152115,-1.020791,-1.205634,-0.887901,0.331354,-0.808893
2,-4.9e-05,1.228398,0.182863,0.532959,-1.015778,-1.404791,0.531566,0.340187,0.245007,0.377467,0.154493,1.550284,-0.843765,-0.169719,-0.07968,1.26712,-0.574296,-0.513104
3,-0.827324,-0.814069,0.245147,-0.189687,-0.77174,-0.467465,-0.234643,0.107888,0.884055,-0.367821,-0.410447,0.390784,-0.559132,-0.237628,0.418002,-0.342972,1.316556,-0.54651
4,-0.827324,-0.814069,-0.396162,-0.647252,1.065523,0.586853,-1.105658,-0.557186,-0.463438,-1.017955,-0.630639,0.824545,-0.876408,0.551589,0.361632,-0.14435,1.534831,-0.648883
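
`StandardScaler` subtracts the column mean and divides by the column standard deviation (the population form, dividing by n). Both statistics are estimated on the training split only and then reused for the test split. A minimal standard-library sketch of that idea on made-up speeds:

```python
from statistics import fmean, pstdev

# Made-up values for illustration; not taken from the dataset.
train_speed = [26.0, 30.0, 32.0, 36.0]
test_speed = [31.0, 40.0]

# Fit: the statistics come from the training data only.
mu = fmean(train_speed)
sigma = pstdev(train_speed)

# Transform: the same mu and sigma are applied to both splits.
train_scaled = [(x - mu) / sigma for x in train_speed]
test_scaled = [(x - mu) / sigma for x in test_speed]

print(round(fmean(train_scaled), 6), round(pstdev(train_scaled), 6))
print([round(z, 3) for z in test_scaled])
```

Because the test split is scaled with training statistics, its scaled columns will generally not have mean 0 and standard deviation 1 exactly.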


## OneHotEncoder: Independent Categorical Attributes

In [37]:
ohe = OneHotEncoder()

ohe.fit(X_train[categorical_list])

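# get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0, use it instead
columns_ohe = list(ohe.get_feature_names_out(categorical_list))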
print(columns_ohe)

['hitpoint_B', 'hitpoint_F', 'hitpoint_U', 'hitpoint_V', 'outside.sideline_False', 'outside.sideline_True', 'outside.baseline_False', 'outside.baseline_True', 'same.side_False', 'same.side_True', 'previous.hitpoint_B', 'previous.hitpoint_F', 'previous.hitpoint_U', 'previous.hitpoint_V', 'server.is.impact.player_False', 'server.is.impact.player_True', 'gender_mens', 'gender_womens']




In [38]:
X_train_cat = ohe.transform(X_train[categorical_list])
X_test_cat  = ohe.transform(X_test[categorical_list])

In [39]:
X_train_cat = pd.DataFrame(X_train_cat.toarray(), columns=columns_ohe)
X_test_cat  = pd.DataFrame(X_test_cat.toarray(), columns=columns_ohe)

In [40]:
X_train_cat.head()

Unnamed: 0,hitpoint_B,hitpoint_F,hitpoint_U,hitpoint_V,outside.sideline_False,outside.sideline_True,outside.baseline_False,outside.baseline_True,same.side_False,same.side_True,previous.hitpoint_B,previous.hitpoint_F,previous.hitpoint_U,previous.hitpoint_V,server.is.impact.player_False,server.is.impact.player_True,gender_mens,gender_womens
0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
1,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
2,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
3,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0
4,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0
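
One-hot encoding turns each category level into its own 0/1 column, named `<column>_<level>` as in the output above. A small sketch in plain Python, assuming the levels are learned from the training values and sorted; `one_hot` is a hypothetical helper:

```python
def one_hot(column, values, levels):
    """Encode values as rows of 0.0/1.0 indicators, one indicator per level."""
    names = [f"{column}_{level}" for level in levels]
    rows = [[1.0 if value == level else 0.0 for level in levels] for value in values]
    return names, rows

train_hitpoint = ["F", "B", "F", "U", "V"]
levels = sorted(set(train_hitpoint))   # learned from the training data only

names, rows = one_hot("hitpoint", train_hitpoint, levels)
print(names)
print(rows[0])
```

For two-level columns such as `same.side`, the two indicator columns carry the same information. If that redundancy matters for a given model, scikit-learn's `OneHotEncoder(drop="if_binary")` option is one way to keep a single column.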


## Concatenate

In [41]:
X_train = pd.concat([X_train_num, X_train_cat], axis=1)
X_test = pd.concat([X_test_num, X_test_cat], axis=1)

In [42]:
print(X_train.shape, X_test.shape)

(5600, 36) (2401, 36)


In [43]:
X_train.head()

Unnamed: 0,rally,serve,speed,net.clearance,distance.from.sideline,depth,player.distance.travelled,player.impact.depth,player.impact.distance.from.center,player.depth,...,same.side_False,same.side_True,previous.hitpoint_B,previous.hitpoint_F,previous.hitpoint_U,previous.hitpoint_V,server.is.impact.player_False,server.is.impact.player_True,gender_mens,gender_womens
0,2.206016,1.228398,1.121552,-0.730696,-0.175183,0.988152,-0.829916,1.120228,0.808667,1.52635,...,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
1,4.136322,-0.814069,1.170424,0.163662,-0.86293,-1.148001,0.014953,1.102595,0.362499,0.70273,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
2,-4.9e-05,1.228398,0.182863,0.532959,-1.015778,-1.404791,0.531566,0.340187,0.245007,0.377467,...,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
3,-0.827324,-0.814069,0.245147,-0.189687,-0.77174,-0.467465,-0.234643,0.107888,0.884055,-0.367821,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0
4,-0.827324,-0.814069,-0.396162,-0.647252,1.065523,0.586853,-1.105658,-0.557186,-0.463438,-1.017955,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0


In [44]:
X_test.head()

Unnamed: 0,rally,serve,speed,net.clearance,distance.from.sideline,depth,player.distance.travelled,player.impact.depth,player.impact.distance.from.center,player.depth,...,same.side_False,same.side_True,previous.hitpoint_B,previous.hitpoint_F,previous.hitpoint_U,previous.hitpoint_V,server.is.impact.player_False,server.is.impact.player_True,gender_mens,gender_womens
0,-4.9e-05,-0.814069,1.915802,-0.20941,-0.993875,-1.313488,-1.414688,0.872303,0.124021,1.063491,...,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,-0.827324,-0.814069,0.637883,-0.521776,0.027012,0.172605,-1.042676,-0.261741,-1.57628,-0.42014,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0
2,-0.827324,1.228398,1.285836,-0.255016,-0.385628,0.10308,-0.075567,0.665664,-1.426336,-0.45251,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
3,-0.827324,1.228398,0.191722,-0.185242,-0.566945,-0.193971,-0.799147,0.413152,-0.663473,-0.165681,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
4,-0.827324,-0.814069,-0.6233,-0.431102,-0.535634,0.74006,-1.082177,0.277543,-0.569446,-0.176144,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0


# Error Metrics

## Function to calculate accuracy, recall, precision and F1 score

In [45]:
scores = pd.DataFrame(columns=['Model', 'Train_Accuracy', 'Train_Recall', 'Train_Precision', 'Train_F1_Score', 
                               'Test_Accuracy', 'Test_Recall', 'Test_Precision', 'Test_F1_Score'])

def get_metrics(train_actual, train_predicted, test_actual, test_predicted, model_description, dataframe):

    train_accuracy  = accuracy_score(train_actual, train_predicted)
    train_recall    = recall_score(train_actual, train_predicted, average="weighted")
    train_precision = precision_score(train_actual, train_predicted, average="weighted")
    train_f1score   = f1_score(train_actual, train_predicted, average="weighted")
    
    test_accuracy   = accuracy_score(test_actual, test_predicted)
    test_recall     = recall_score(test_actual, test_predicted, average="weighted")
    test_precision  = precision_score(test_actual, test_predicted, average="weighted")
    test_f1score    = f1_score(test_actual, test_predicted, average="weighted")

    # DataFrame.append was removed in pandas 2.0, so add the new row by label on a copy instead
    dataframe = dataframe.copy()
    dataframe.loc[len(dataframe)] = [model_description,
                                     train_accuracy, train_recall, train_precision, train_f1score,
                                     test_accuracy, test_recall, test_precision, test_f1score]

    return dataframe

## Function for Classification Report

In [46]:
def classifcation_report_train_test(y_train, y_train_pred, y_test, y_test_pred):

    print('''
            =========================================
               CLASSIFICATION REPORT FOR TRAIN DATA
            =========================================
            ''')
    print(classification_report(y_train, y_train_pred, digits=4))

    print('''
            =========================================
               CLASSIFICATION REPORT FOR TEST DATA
            =========================================
            ''')
    print(classification_report(y_test, y_test_pred, digits=4))

# Model Building

## Decision Trees

In [47]:
clf_dt = DecisionTreeClassifier()

In [48]:
clf_dt.fit(X_train, y_train)

DecisionTreeClassifier()

In [49]:
y_pred_train = clf_dt.predict(X_train)
y_pred_test = clf_dt.predict(X_test)

In [50]:
classifcation_report_train_test(y_train, y_pred_train, y_test, y_pred_test)


               CLASSIFICATION REPORT FOR TRAIN DATA
            
              precision    recall  f1-score   support

           0     1.0000    1.0000    1.0000      1273
           1     1.0000    1.0000    1.0000      2450
           2     1.0000    1.0000    1.0000      1877

    accuracy                         1.0000      5600
   macro avg     1.0000    1.0000    1.0000      5600
weighted avg     1.0000    1.0000    1.0000      5600


               CLASSIFICATION REPORT FOR TEST DATA
            
              precision    recall  f1-score   support

           0     0.6292    0.6257    0.6274       545
           1     0.7905    0.7897    0.7901      1051
           2     0.9036    0.9081    0.9058       805

    accuracy                         0.7922      2401
   macro avg     0.7744    0.7745    0.7744      2401
weighted avg     0.7918    0.7922    0.7920      2401
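
The summary rows of the report follow directly from the per-class rows: F1 is the harmonic mean of precision and recall, the macro average is an unweighted mean over classes, and the weighted average uses support as weights. A quick check against the test report above (small differences come from the four-digit rounding of the inputs):

```python
# (precision, recall, support) per class, copied from the test report above.
report = {
    0: (0.6292, 0.6257, 545),
    1: (0.7905, 0.7897, 1051),
    2: (0.9036, 0.9081, 805),
}

# Per-class F1 as the harmonic mean of precision and recall.
f1 = {c: 2 * p * r / (p + r) for c, (p, r, _) in report.items()}
n = sum(s for _, _, s in report.values())

macro_f1 = sum(f1.values()) / len(f1)
weighted_f1 = sum(f1[c] * report[c][2] for c in report) / n

print({c: round(v, 4) for c, v in f1.items()})
print(round(macro_f1, 4), round(weighted_f1, 4))
```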



In [51]:
scores = get_metrics(y_train, y_pred_train, y_test, y_pred_test, "DecisionTrees", scores)
scores


Unnamed: 0,Model,Train_Accuracy,Train_Recall,Train_Precision,Train_F1_Score,Test_Accuracy,Test_Recall,Test_Precision,Test_F1_Score
0,DecisionTrees,1.0,1.0,1.0,1.0,0.79217,0.79217,0.79178,0.791972


## Random Forests

In [52]:
clf_rf = RandomForestClassifier()

In [53]:
clf_rf.fit(X=X_train, y=y_train)

RandomForestClassifier()

In [54]:
y_pred_train = clf_rf.predict(X_train)
y_pred_test = clf_rf.predict(X_test)

In [55]:
classifcation_report_train_test(y_train, y_pred_train, y_test, y_pred_test)


               CLASSIFICATION REPORT FOR TRAIN DATA
            
              precision    recall  f1-score   support

           0     1.0000    1.0000    1.0000      1273
           1     1.0000    1.0000    1.0000      2450
           2     1.0000    1.0000    1.0000      1877

    accuracy                         1.0000      5600
   macro avg     1.0000    1.0000    1.0000      5600
weighted avg     1.0000    1.0000    1.0000      5600


               CLASSIFICATION REPORT FOR TEST DATA
            
              precision    recall  f1-score   support

           0     0.7892    0.7211    0.7536       545
           1     0.8566    0.8754    0.8659      1051
           2     0.9180    0.9453    0.9315       805

    accuracy                         0.8638      2401
   macro avg     0.8546    0.8473    0.8503      2401
weighted avg     0.8619    0.8638    0.8624      2401



In [56]:
scores = get_metrics(y_train, y_pred_train, y_test, y_pred_test, "RandomForest", scores)
scores


Unnamed: 0,Model,Train_Accuracy,Train_Recall,Train_Precision,Train_F1_Score,Test_Accuracy,Test_Recall,Test_Precision,Test_F1_Score
0,DecisionTrees,1.0,1.0,1.0,1.0,0.79217,0.79217,0.79178,0.791972
1,RandomForest,1.0,1.0,1.0,1.0,0.863807,0.863807,0.861873,0.86238
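
In the score table above, the accuracy and recall columns are identical. That is expected: with `average="weighted"`, per-class recall is averaged using class support as weights, and that weighted average works out to plain accuracy. A short standard-library sketch on made-up labels:

```python
from collections import Counter

# Made-up labels for illustration (0 = FE, 1 = UE, 2 = W).
actual    = [0, 0, 1, 1, 1, 2, 2, 2, 2, 1]
predicted = [0, 1, 1, 1, 0, 2, 2, 1, 2, 1]

support = Counter(actual)
correct = Counter(a for a, p in zip(actual, predicted) if a == p)

# Per-class recall: correctly predicted members of a class / members of that class.
recall = {c: correct[c] / support[c] for c in support}

weighted_recall = sum(recall[c] * support[c] for c in support) / len(actual)
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

print(recall)
print(weighted_recall, accuracy)
```

Weighted precision and weighted F1 do not collapse to accuracy in the same way, which is why those columns differ slightly.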


## Gradient Boosting

In [57]:
clf_gbm = GradientBoostingClassifier()

In [None]:
clf_gbm.fit(X=X_train, y=y_train)

In [None]:
y_pred_train = clf_gbm.predict(X_train)
y_pred_test = clf_gbm.predict(X_test)

In [None]:
classifcation_report_train_test(y_train, y_pred_train, y_test, y_pred_test)

In [None]:
scores = get_metrics(y_train, y_pred_train, y_test, y_pred_test, "GBM", scores)
scores

## AdaBoost

In [None]:
clf_adaboost = AdaBoostClassifier()

In [None]:
clf_adaboost.fit(X_train, y_train)

AdaBoostClassifier()

In [None]:
y_pred_train = clf_adaboost.predict(X_train)
y_pred_test = clf_adaboost.predict(X_test)

In [None]:
classifcation_report_train_test(y_train, y_pred_train, y_test, y_pred_test)


               CLASSIFICATION REPORT FOR TRAIN DATA
            
              precision    recall  f1-score   support

           0     0.7465    0.6779    0.7106      1273
           1     0.8365    0.8665    0.8512      2450
           2     0.9297    0.9441    0.9368      1877

    accuracy                         0.8496      5600
   macro avg     0.8376    0.8295    0.8329      5600
weighted avg     0.8473    0.8496    0.8480      5600


               CLASSIFICATION REPORT FOR TEST DATA
            
              precision    recall  f1-score   support

           0     0.7166    0.6587    0.6864       545
           1     0.8341    0.8516    0.8427      1051
           2     0.9117    0.9366    0.9240       805

    accuracy                         0.8363      2401
   macro avg     0.8208    0.8156    0.8177      2401
weighted avg     0.8335    0.8363    0.8345      2401



In [None]:
scores = get_metrics(y_train, y_pred_train, y_test, y_pred_test, "Adaboost", scores)
scores

Unnamed: 0,Model,Train_Accuracy,Train_Recall,Train_Precision,Train_F1_Score,Test_Accuracy,Test_Recall,Test_Precision,Test_F1_Score
0,DecisionTrees,1.0,1.0,1.0,1.0,0.795085,0.795085,0.794354,0.794691
1,RandomForest,1.0,1.0,1.0,1.0,0.865473,0.865473,0.863483,0.863991
2,Adaboost,0.849643,0.849643,0.847281,0.847952,0.836318,0.836318,0.833453,0.834513
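
Comparing train and test accuracy side by side makes the overfitting visible: the decision tree and the random forest fit the training set perfectly, while AdaBoost's train and test scores stay close. A small sketch using the accuracies from the table above:

```python
# (train accuracy, test accuracy) copied from the scores table above.
accuracy = {
    "DecisionTrees": (1.0, 0.795085),
    "RandomForest": (1.0, 0.865473),
    "Adaboost": (0.849643, 0.836318),
}

# Generalisation gap: how much accuracy is lost when moving from train to test.
gap = {model: train - test for model, (train, test) in accuracy.items()}

for model, g in sorted(gap.items(), key=lambda item: item[1]):
    print(f"{model:<14} gap = {g:.4f}")
```

A small gap alone does not make a model the best choice: the random forest still has the highest test accuracy of the three, despite the larger gap.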


## XGBoost

In [None]:
clf_xgb = XGBClassifier()

In [None]:
clf_xgb.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [None]:
y_pred_train = clf_xgb.predict(X_train)
y_pred_test = clf_xgb.predict(X_test)

In [None]:
classifcation_report_train_test(y_train, y_pred_train, y_test, y_pred_test)


               CLASSIFICATION REPORT FOR TRAIN DATA
            
              precision    recall  f1-score   support

           0     1.0000    1.0000    1.0000      1273
           1     1.0000    1.0000    1.0000      2450
           2     1.0000    1.0000    1.0000      1877

    accuracy                         1.0000      5600
   macro avg     1.0000    1.0000    1.0000      5600
weighted avg     1.0000    1.0000    1.0000      5600


               CLASSIFICATION REPORT FOR TEST DATA
            
              precision    recall  f1-score   support

           0     0.7857    0.7468    0.7658       545
           1     0.8757    0.8716    0.8736      1051
           2     0.9164    0.9528    0.9342       805

    accuracy                         0.8705      2401
   macro avg     0.8593    0.8570    0.8579      2401
weighted avg     0.8689    0.8705    0.8695      2401



In [None]:
XGB = XGBClassifier(n_jobs=-1)
 
# Use a grid over parameters of interest
param_grid = {
    'colsample_bytree': np.linspace(0.6, 0.8, 2),
    'n_estimators': [50, 100],
    'max_depth': [5, 6]}

CV_XGB = GridSearchCV(estimator=XGB, param_grid=param_grid, cv=3)
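
`GridSearchCV` tries every combination in `param_grid`. Here `np.linspace(0.6, 0.8, 2)` yields just the two endpoints, so the grid has 2 × 2 × 2 = 8 candidates, each fitted on 3 folds. A standard-library sketch of that enumeration (the extra fit at the end assumes the default `refit=True`):

```python
from itertools import product

param_grid = {
    "colsample_bytree": [0.6, 0.8],   # what np.linspace(0.6, 0.8, 2) evaluates to
    "n_estimators": [50, 100],
    "max_depth": [5, 6],
}
cv_folds = 3

# Cartesian product of all parameter values, one dict per candidate.
names = list(param_grid)
candidates = [dict(zip(names, combo)) for combo in product(*param_grid.values())]

n_fits = len(candidates) * cv_folds + 1   # +1 for the refit of the best candidate
print(len(candidates), n_fits)
```

The fit count grows multiplicatively with each added parameter or value, which is worth keeping in mind before widening the grid.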

In [None]:
CV_XGB.fit(X=X_train, y=y_train)

GridSearchCV(cv=3,
             estimator=XGBClassifier(base_score=None, booster=None,
                                     colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None, gamma=None,
                                     gpu_id=None, importance_type='gain',
                                     interaction_constraints=None,
                                     learning_rate=None, max_delta_step=None,
                                     max_depth=None, min_child_weight=None,
                                     missing=nan, monotone_constraints=None,
                                     n_estimators=100, n_jobs=-1,
                                     num_parallel_tree=None, random_state=None,
                                     reg_alpha=None, reg_lambda=None,
                                     scale_pos_weight=None, subsample=None,
                                     tree_method=None, v

Find the best model

In [None]:
best_xgb_model = CV_XGB.best_estimator_

In [None]:
print(CV_XGB.best_score_, CV_XGB.best_params_)

0.8701788820821882 {'colsample_bytree': 0.8, 'max_depth': 6, 'n_estimators': 100}


In [None]:
best_xgb_model

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.8, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=-1, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [None]:
y_pred_train = best_xgb_model.predict(X_train)
y_pred_test = best_xgb_model.predict(X_test)

In [None]:
classifcation_report_train_test(y_train, y_pred_train, y_test, y_pred_test)


               CLASSIFICATION REPORT FOR TRAIN DATA
            
              precision    recall  f1-score   support

           0     1.0000    1.0000    1.0000      1273
           1     1.0000    1.0000    1.0000      2450
           2     1.0000    1.0000    1.0000      1877

    accuracy                         1.0000      5600
   macro avg     1.0000    1.0000    1.0000      5600
weighted avg     1.0000    1.0000    1.0000      5600


               CLASSIFICATION REPORT FOR TEST DATA
            
              precision    recall  f1-score   support

           0     0.7871    0.7193    0.7517       545
           1     0.8652    0.8792    0.8721      1051
           2     0.9210    0.9553    0.9378       805

    accuracy                         0.8684      2401
   macro avg     0.8578    0.8512    0.8539      2401
weighted avg     0.8662    0.8684    0.8668      2401
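
The two average rows in the report differ only in how the classes are weighted: `macro avg` is the plain mean over the three classes, while `weighted avg` weights each class by its support. A small sketch using the per-class recall and support printed above (the inputs are rounded to four decimals, so the last digit may differ slightly from the report):

```python
# Per-class recall and support copied from the XGBoost test report above.
recall = {0: 0.7193, 1: 0.8792, 2: 0.9553}
support = {0: 545, 1: 1051, 2: 805}

total = sum(support.values())

# Macro average: every class counts equally.
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: every class counts in proportion to its support.
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(f"macro recall:    {macro_recall:.4f}")
print(f"weighted recall: {weighted_recall:.4f}")
```

Support-weighted recall is the same quantity as overall accuracy, which is consistent with the `accuracy` row of the report.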



In [None]:
scores = get_metrics(y_train, y_pred_train, y_test, y_pred_test, "XGBoost", scores)
scores

| | Model | Train_Accuracy | Train_Recall | Train_Precision | Train_F1_Score | Test_Accuracy | Test_Recall | Test_Precision | Test_F1_Score |
| :- | :- | -: | -: | -: | -: | -: | -: | -: | -: |
| 0 | DecisionTrees | 1.0 | 1.0 | 1.0 | 1.0 | 0.795085 | 0.795085 | 0.794354 | 0.794691 |
| 1 | RandomForest | 1.0 | 1.0 | 1.0 | 1.0 | 0.865473 | 0.865473 | 0.863483 | 0.863991 |
| 2 | Adaboost | 0.849643 | 0.849643 | 0.847281 | 0.847952 | 0.836318 | 0.836318 | 0.833453 | 0.834513 |
| 3 | GBM | 0.919286 | 0.919286 | 0.918602 | 0.918517 | 0.866722 | 0.866722 | 0.864293 | 0.864928 |
| 4 | XGBoost | 1.0 | 1.0 | 1.0 | 1.0 | 0.868388 | 0.868388 | 0.866164 | 0.866799 |
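
Several models reach a training accuracy of 1.0 while testing near 0.87, which suggests they memorise the training set to some degree. One rough way to compare them is the train-minus-test accuracy gap. The sketch below copies the accuracies from the table above; the names `gaps`, `best_test` and `smallest_gap` are illustrative.

```python
# (train accuracy, test accuracy) copied from the scores table above.
accuracy = {
    "DecisionTrees": (1.0, 0.795085),
    "RandomForest": (1.0, 0.865473),
    "Adaboost": (0.849643, 0.836318),
    "GBM": (0.919286, 0.866722),
    "XGBoost": (1.0, 0.868388),
}

# Gap between train and test accuracy; a larger gap hints at more overfitting.
gaps = {model: train - test for model, (train, test) in accuracy.items()}

for model, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{model:<14} test={accuracy[model][1]:.4f}  gap={gap:.4f}")

best_test = max(accuracy, key=lambda model: accuracy[model][1])
smallest_gap = min(gaps, key=gaps.get)
print("best test accuracy:", best_test, "| smallest gap:", smallest_gap)
```

A small gap alone does not make a model preferable; it is one signal to weigh against test accuracy.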


### Extracting the important features

In [None]:
best_xgb_model.feature_importances_

array([0.0053074 , 0.00741261, 0.02442907, 0.05360271, 0.01002102,
       0.01178371, 0.01067767, 0.02029161, 0.00647646, 0.00682708,
       0.0056769 , 0.00835767, 0.00773349, 0.01487874, 0.00807926,
       0.01121838, 0.00608563, 0.03635557, 0.00619357, 0.00756241,
       0.00432145, 0.01450878, 0.23255345, 0.14113677, 0.0965234 ,
       0.16265658, 0.00663172, 0.00340748, 0.00499085, 0.00606102,
       0.00974619, 0.02910462, 0.00489438, 0.00242959, 0.00573982,
       0.00632305], dtype=float32)

In [None]:
importances = best_xgb_model.feature_importances_
indices = np.argsort(importances)
print(indices)

[33 27 20 32 28  0 10 34 29 16 18 35  8 26  9  1 19 12 14 11 30  4  6 15
  5 21 13  7  2 31 17  3 24 23 25 22]


In [None]:
# np.argsort returns indices in ascending order of importance;
# reversing them ranks the features from most to least important.
indices = np.argsort(importances)[::-1]
pd.DataFrame([X_train.columns[indices], importances[indices]])

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 |
| :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :-: | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- |
| 0 | outside.sideline_False | outside.baseline_True | outside.sideline_True | outside.baseline_False | net.clearance | previous.time.to.net | previous.hitpoint_V | speed | player.impact.depth | previous.distance.from.sideline | ... | opponent.distance.from.center | previous.hitpoint_F | gender_mens | player.distance.from.center | rally | previous.hitpoint_B | server.is.impact.player_False | hitpoint_U | same.side_True | server.is.impact.player_True |
| 1 | 0.232553 | 0.162657 | 0.141137 | 0.0965234 | 0.0536027 | 0.0363556 | 0.0291046 | 0.0244291 | 0.0202916 | 0.0148787 | ... | 0.00608563 | 0.00606102 | 0.00573982 | 0.0056769 | 0.0053074 | 0.00499085 | 0.00489438 | 0.00432145 | 0.00340748 | 0.00242959 |
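
Because the categorical variables are one-hot encoded, the importance of a single original variable is spread over several columns. The sketch below sums the ten largest importances shown above back onto their parent variables. It assumes that encoded columns are named `<variable>_<level>` and that the original variable names contain no underscore, which holds for the names visible here; since only ten of the 36 columns are included, the totals are partial.

```python
from collections import defaultdict

# Top ten (column, importance) pairs copied from the table above.
top_features = [
    ("outside.sideline_False", 0.232553),
    ("outside.baseline_True", 0.162657),
    ("outside.sideline_True", 0.141137),
    ("outside.baseline_False", 0.0965234),
    ("net.clearance", 0.0536027),
    ("previous.time.to.net", 0.0363556),
    ("previous.hitpoint_V", 0.0291046),
    ("speed", 0.0244291),
    ("player.impact.depth", 0.0202916),
    ("previous.distance.from.sideline", 0.0148787),
]

grouped = defaultdict(float)
for column, importance in top_features:
    # Assumption: a trailing "_<level>" marks a one-hot encoded column.
    variable = column.rsplit("_", 1)[0]
    grouped[variable] += importance

# Rank the original variables by their summed importance.
ranking = sorted(grouped.items(), key=lambda item: item[1], reverse=True)
for variable, importance in ranking:
    print(f"{variable:<32} {importance:.4f}")
```

Grouped this way, the two landing-position indicators (`outside.sideline` and `outside.baseline`) account for most of the importance among the listed columns.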


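## Stacking:

`VotingClassifier` combines the base models by majority vote and does not train a meta-learner, so the ensemble built here is a voting ensemble rather than stacking in the strict sense. The label "Stacking" is kept in the results table for consistency.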

In [None]:
from sklearn.ensemble import VotingClassifier

In [None]:
voting_clf = VotingClassifier(estimators=[('clf_dt', clf_dt), ('clf_rf', clf_rf), ('clf_adaboost', clf_adaboost)]) 

In [None]:
voting_clf

VotingClassifier(estimators=[('clf_dt', DecisionTreeClassifier()),
                             ('clf_rf', RandomForestClassifier()),
                             ('clf_adaboost', AdaBoostClassifier())])

In [None]:
voting_clf.fit(X_train, y_train) 


VotingClassifier(estimators=[('clf_dt', DecisionTreeClassifier()),
                             ('clf_rf', RandomForestClassifier()),
                             ('clf_adaboost', AdaBoostClassifier())])

In [None]:
y_pred_train = voting_clf.predict(X_train)
y_pred_train

array([1, 2, 0, ..., 1, 1, 0])

In [None]:
y_pred_test = voting_clf.predict(X_test)
y_pred_test

array([1, 2, 2, ..., 2, 1, 1])
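
As a rough illustration of how a hard (majority) vote resolves disagreements between three models, the sketch below uses made-up predictions rather than output from the fitted classifiers. The helper `hard_vote` is hypothetical; it breaks ties in favour of the lowest class label, which mirrors the behaviour scikit-learn documents for `VotingClassifier`.

```python
def hard_vote(predictions, n_classes):
    """Majority vote per sample; ties go to the lowest class label."""
    voted = []
    for sample_votes in zip(*predictions):
        counts = [0] * n_classes
        for label in sample_votes:
            counts[label] += 1
        # list.index returns the first (lowest-label) maximum.
        voted.append(counts.index(max(counts)))
    return voted


# Made-up predictions from three models for five points (not real output).
tree_pred = [1, 2, 0, 1, 0]
forest_pred = [1, 2, 2, 1, 1]
boost_pred = [0, 2, 2, 1, 2]

ensemble_pred = hard_vote([tree_pred, forest_pred, boost_pred], n_classes=3)
print(ensemble_pred)  # [1, 2, 2, 1, 0]
```

The last point shows the tie case: all three models disagree, so the lowest label wins.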

Performance metrics

In [None]:
scores = get_metrics(y_train, y_pred_train, y_test, y_pred_test, "Stacking", scores)
scores

| | Model | Train_Accuracy | Train_Recall | Train_Precision | Train_F1_Score | Test_Accuracy | Test_Recall | Test_Precision | Test_F1_Score |
| :- | :- | -: | -: | -: | -: | -: | -: | -: | -: |
| 0 | DecisionTrees | 1.0 | 1.0 | 1.0 | 1.0 | 0.795085 | 0.795085 | 0.794354 | 0.794691 |
| 1 | RandomForest | 1.0 | 1.0 | 1.0 | 1.0 | 0.865473 | 0.865473 | 0.863483 | 0.863991 |
| 2 | Adaboost | 0.849643 | 0.849643 | 0.847281 | 0.847952 | 0.836318 | 0.836318 | 0.833453 | 0.834513 |
| 3 | GBM | 0.919286 | 0.919286 | 0.918602 | 0.918517 | 0.866722 | 0.866722 | 0.864293 | 0.864928 |
| 4 | XGBoost | 1.0 | 1.0 | 1.0 | 1.0 | 0.868388 | 0.868388 | 0.866164 | 0.866799 |
| 5 | Stacking | 1.0 | 1.0 | 1.0 | 1.0 | 0.852978 | 0.852978 | 0.850944 | 0.851724 |
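
For comparison, a stacked ensemble fits a meta-learner on the out-of-fold predictions of the base models. The following is a minimal sketch with scikit-learn's `StackingClassifier`, run on synthetic three-class data rather than the tennis dataset; the logistic-regression meta-learner, the data and all hyperparameters are assumptions chosen for illustration, so its score says nothing about how stacking would perform on the real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the tennis features.
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Same three kinds of base model as in the voting ensemble above.
base_models = [
    ("clf_dt", DecisionTreeClassifier(random_state=0)),
    ("clf_rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("clf_adaboost", AdaBoostClassifier(random_state=0)),
]

# The meta-learner is fitted on out-of-fold predictions of the base models.
stacking_clf = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)
stacking_clf.fit(X_tr, y_tr)

stack_test_accuracy = stacking_clf.score(X_te, y_te)
print(f"stacked test accuracy: {stack_test_accuracy:.4f}")
```

In the notebook, the same construction could be tried by passing `X_train` and `y_train` to `fit` and scoring with `get_metrics` under a separate label.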
