## Problem Statement: 

**Tennis Australia is trying to better automate how tennis points are categorized into three outcomes: winners, forced errors, and unforced errors.**

## Dataset Description:

The dataset includes only the outcomes of rally points, i.e. points in which more than two shots were hit (the first two being the serve and the return). All points were played at a past Australian Open.


### Target

#### Outcome variable - classes
* Winner – the point-winning player hits a shot that is not touched by the opponent
* Forced error – the point-winning player hits a shot that the opponent is unable to return, i.e. a good shot that is hard to handle
* Unforced error – the player attempting to return the ball makes an error on an otherwise normal-looking rally shot

# Attribute description

| Variable | Description| Value Range |
| :- | -: | :-: |
| rally | The number of shots in the point counting serves and point-ending shot | An integer from 1, 2, 3...
| serve | A number indicating whether the point was played on a first or second serve.  | 1 = First, 2 = Second
| hitpoint | Shot category for point-ending shot | F = Forehand, B = Backhand, V = Volley, U = Unknown
| speed | Speed of point-ending shot | Continuous (m/s)
| net.clearance | Distance above the net as point-ending shot passed the net | Continuous (cm) distance above net. Can be negative if shot did not pass above the net.
| distance.from.sideline | Lateral distance of the point-ending shot bounce from the nearest singles sideline. | Perpendicular distance in meters (always positive even if out)
| depth | Distance of the point-ending shot bounce from the baseline | Perpendicular distance in meters (always positive even if out)
| outside.sideline | Logical indicator of whether point-ending shot landed outside of the in-play singles sideline | TRUE, FALSE
| outside.baseline | Logical indicator of whether point-ending shot landed beyond the in-play baseline | TRUE, FALSE
| player.distance.travelled | Distance player who made the point-ending shot travelled between the impact of the penultimate shot and the impact of the point-ending shot | Euclidean distance in meters
| player.impact.depth | Distance of player who made point-ending shot from the net at the time the point-ending shot was made | Perpendicular distance along the length of court from net in meters
| player.impact.distance.from.center | Distance of player who made point-ending shot from the center line at the time the point-ending shot was made | Perpendicular distance from the center line in meters
| player.depth | Distance of player who made point-ending shot from the net at the time the penultimate shot was made | Perpendicular distance along the length of court from net in meters
| player.distance.from.center | Distance of player who made point-ending shot from the center line at the time the penultimate shot was made | Perpendicular distance from the center line in meters
| opponent.depth | Distance of opponent from the net at the time the penultimate shot was made | Perpendicular distance along the length of court from net in meters
| opponent.distance.from.center | Distance of opponent from the center line at the time the penultimate shot was made | Perpendicular distance from the center line in meters
| same.side | Logical indicator if both player and opponent were positioned on the same side of the center line (ad or deuce court) at the time the penultimate shot was made | TRUE, FALSE
| previous.speed | Speed of penultimate shot | Continuous (m/s)
| previous.net.clearance | Distance above the net as penultimate shot passed the net | Continuous (cm) distance above net. Can be negative if shot did not pass above the net.
| previous.distance.from.sideline | Lateral distance of the penultimate  shot bounce from the nearest singles sideline. | Perpendicular distance in meters (always positive even if out)
| previous.depth | Distance of the penultimate shot bounce from the baseline | Perpendicular distance in meters (always positive even if out)
| previous.hitpoint | Shot category for penultimate shot | F = Forehand, B = Backhand, V = Volley, U = Unknown
| previous.time.to.net | Time for penultimate shot to be hit and pass the net | Continuous number in seconds
| server.is.impact.player | Logical if player who made point-ending shot was the server of the point | TRUE, FALSE
| outcome | Target variable, character with three categories indicating the type of shot that ended the point  | W (Winner), FE (Forced Error), UE (Unforced Error)
| id | A 10-character unique identifier for the point | Character

In [2]:
pip install -U scikit-learn

Note: you may need to restart the kernel to use updated packages.




# Import libraries

In [None]:
import pandas as pd
import numpy
from IPython.display import Image
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')

In [31]:
data = pd.read_csv("data/tennis.csv")

#### See the top 5 rows of the data

In [32]:
data.head()

| | rally | serve | hitpoint | speed | net.clearance | distance.from.sideline | depth | outside.sideline | outside.baseline | player.distance.travelled | ... | previous.depth | opponent.depth | opponent.distance.from.center | same.side | previous.hitpoint | previous.time.to.net | server.is.impact.player | outcome | gender | ID |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 4 | 1 | B | 35.515042 | -0.021725 | 3.474766 | 6.797621 | False | False | 1.46757 | ... | 0.705435 | 12.5628 | 2.0724 | True | F | 0.445318 | False | UE | mens | 8644 |
| 1 | 4 | 2 | B | 33.38264 | 1.114202 | 2.540801 | 2.608708 | False | True | 2.311931 | ... | 3.8566 | 12.3544 | 5.1124 | False | B | 0.432434 | False | FE | mens | 1182 |
| 2 | 23 | 1 | B | 22.31669 | -0.254046 | 3.533166 | 9.435749 | False | False | 3.903728 | ... | 2.908892 | 13.862 | 1.6564 | False | F | 0.397538 | True | FE | mens | 9042 |
| 3 | 9 | 1 | F | 36.837309 | 0.766694 | 0.586885 | 3.34218 | True | False | 0.583745 | ... | 0.557554 | 14.2596 | 0.1606 | True | B | 0.671984 | True | UE | mens | 1222 |
| 4 | 4 | 1 | B | 35.544208 | 0.116162 | 0.918725 | 5.499119 | False | False | 2.333456 | ... | 3.945317 | 11.3658 | 1.1082 | False | F | 0.340411 | False | W | mens | 4085 |


#### Dimension of data

In [33]:
data.shape

(8001, 27)

#### Different classes in Outcome variable

In [34]:
data.outcome.unique()

array(['UE', 'FE', 'W'], dtype=object)

In [35]:
data.outcome.value_counts()

UE    3501
W     2682
FE    1818
Name: outcome, dtype: int64

In [36]:
data.outcome.value_counts(normalize= True)*100

UE    43.75703
W     33.52081
FE    22.72216
Name: outcome, dtype: float64
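
The class shares above also give a reference point for later accuracy figures. The following is a minimal sketch, using only the counts printed by `value_counts()`, of the accuracy a trivial classifier would reach if it always predicted the most frequent class. The variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above.
class_counts = {"UE": 3501, "W": 2682, "FE": 1818}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a trivial model that always predicts the most frequent class.
baseline_accuracy = class_counts[majority_class] / total

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
```

Any model trained later should clearly beat this baseline to be considered useful.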

#### Check the number of columns

In [37]:
len(data.columns)

27

#### Display data type of each variable

In [38]:
data.dtypes

rally                                   int64
serve                                   int64
hitpoint                               object
speed                                 float64
net.clearance                         float64
distance.from.sideline                float64
depth                                 float64
outside.sideline                         bool
outside.baseline                         bool
player.distance.travelled             float64
player.impact.depth                   float64
player.impact.distance.from.center    float64
player.depth                          float64
player.distance.from.center           float64
previous.speed                        float64
previous.net.clearance                float64
previous.distance.from.sideline       float64
previous.depth                        float64
opponent.depth                        float64
opponent.distance.from.center         float64
same.side                                bool
previous.hitpoint                 

#### Identifying categorical attributes

In [39]:
categorical_list = ["hitpoint","outside.sideline",
                    "outside.baseline","same.side",
                    "previous.hitpoint",
                    "server.is.impact.player",
                    "gender","outcome"]

#### Converting to appropriate datatype

In [40]:
data[categorical_list] = data[categorical_list].astype("category")    

#### Display data type of each variable after conversion

In [41]:
data.dtypes

rally                                    int64
serve                                    int64
hitpoint                              category
speed                                  float64
net.clearance                          float64
distance.from.sideline                 float64
depth                                  float64
outside.sideline                      category
outside.baseline                      category
player.distance.travelled              float64
player.impact.depth                    float64
player.impact.distance.from.center     float64
player.depth                           float64
player.distance.from.center            float64
previous.speed                         float64
previous.net.clearance                 float64
previous.distance.from.sideline        float64
previous.depth                         float64
opponent.depth                         float64
opponent.distance.from.center          float64
same.side                             category
previous.hitp

#### Dropping ID column and checking the length of columns

In [42]:
len(data['ID'].unique())

8001

In [43]:
data.shape

(8001, 27)

In [44]:
data.drop(["ID"], axis=1, inplace=True)
len(data.columns)

26

#### Display summary statistics 

In [45]:
data.describe()

| | rally | serve | speed | net.clearance | distance.from.sideline | depth | player.distance.travelled | player.impact.depth | player.impact.distance.from.center | player.depth | player.distance.from.center | previous.speed | previous.net.clearance | previous.distance.from.sideline | previous.depth | opponent.depth | opponent.distance.from.center | previous.time.to.net |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 8001.0 | 8001.0 | 8001.0 | 8001.0 | 8001.0 | 8001.0 | 8001.0 | 8001.0 | 8001.0 | 8001.0 | 8001.0 | 8001.0 | 8001.0 | 8001.0 | 8001.0 | 8001.0 | 8001.0 | 8001.0 |
| mean | 5.966004 | 1.3987 | 30.806938 | 0.629658 | 1.46763 | 4.421146 | 2.690463 | 11.899694 | 1.919544 | 12.253954 | 1.213795 | 28.763676 | 0.821562 | 2.19342 | 4.218717 | 12.61681 | 2.367952 | 0.549988 |
| std | 3.548182 | 0.489661 | 7.298917 | 0.982504 | 1.108697 | 3.144965 | 1.713136 | 2.788231 | 1.205449 | 2.039085 | 0.964364 | 6.47747 | 0.674663 | 1.038942 | 2.052946 | 2.075401 | 1.313927 | 0.186788 |
| min | 3.0 | 1.0 | 5.176078 | -0.998184 | 0.000497 | 0.003135 | 0.0 | 2.156 | 0.0002 | 1.3898 | 0.0004 | 8.449117 | 0.028865 | 0.000164 | 0.000467 | 2.1612 | 0.0002 | 0.003201 |
| 25% | 3.0 | 1.0 | 26.77029 | -0.027092 | 0.5395 | 1.641161 | 1.444233 | 11.2214 | 0.9424 | 11.3742 | 0.5518 | 24.033218 | 0.404815 | 1.354458 | 2.733674 | 12.0824 | 1.3522 | 0.432164 |
| 50% | 5.0 | 1.0 | 32.41769 | 0.44587 | 1.210847 | 3.860266 | 2.360894 | 12.6918 | 1.8294 | 12.5516 | 0.9838 | 29.793417 | 0.658382 | 2.168822 | 4.126864 | 12.9016 | 2.332 | 0.507559 |
| 75% | 7.0 | 2.0 | 35.681431 | 0.970844 | 2.215955 | 7.029345 | 3.565853 | 13.553 | 2.7452 | 13.498 | 1.5966 | 33.581003 | 1.021397 | 3.022677 | 5.595515 | 13.7128 | 3.259 | 0.624135 |
| max | 38.0 | 2.0 | 55.052795 | 12.815893 | 7.569757 | 11.886069 | 14.480546 | 18.1256 | 7.7462 | 18.7458 | 9.3526 | 54.207506 | 6.730275 | 4.114361 | 9.997963 | 20.211 | 6.8526 | 1.635257 |


In [46]:
data.describe(include=['category'])

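| | hitpoint | outside.sideline | outside.baseline | same.side | previous.hitpoint | server.is.impact.player | outcome | gender |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 8001 | 8001 | 8001 | 8001 | 8001 | 8001 | 8001 | 8001 |
| unique | 4 | 2 | 2 | 2 | 4 | 2 | 3 | 2 |
| top | F | False | False | False | F | True | UE | mens |
| freq | 4402 | 6500 | 6380 | 6036 | 3684 | 4670 | 3501 | 4005 |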


#### Checking for null values

In [47]:
data.isnull().sum()

rally                                 0
serve                                 0
hitpoint                              0
speed                                 0
net.clearance                         0
distance.from.sideline                0
depth                                 0
outside.sideline                      0
outside.baseline                      0
player.distance.travelled             0
player.impact.depth                   0
player.impact.distance.from.center    0
player.depth                          0
player.distance.from.center           0
previous.speed                        0
previous.net.clearance                0
previous.distance.from.sideline       0
previous.depth                        0
opponent.depth                        0
opponent.distance.from.center         0
same.side                             0
previous.hitpoint                     0
previous.time.to.net                  0
server.is.impact.player               0
outcome                               0


#### Divide the data into train and test

In [48]:
y=data["outcome"]
X=data.drop('outcome', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,random_state=123)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(5600, 25)
(2401, 25)
(5600,)
(2401,)
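
Because the three outcome classes are not equally frequent, a stratified split is one way to keep the class shares the same in both partitions. The sketch below is an illustration only and not part of the original notebook: it uses synthetic labels with the same class counts as `outcome`, a placeholder feature column, and hypothetical `_demo` variable names, and it passes `stratify=` to `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the outcome column.
y_demo = np.array(["UE"] * 3501 + ["W"] * 2682 + ["FE"] * 1818)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature, not real data

X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=123, stratify=y_demo
)

def class_share(labels, cls):
    """Return the fraction of labels equal to cls."""
    return float(np.mean(labels == cls))

# Compare the class shares in the two partitions.
for cls in ("UE", "W", "FE"):
    print(cls, round(class_share(y_train_demo, cls), 4), round(class_share(y_test_demo, cls), 4))
```

In the notebook itself, the equivalent change would be adding `stratify=y` to the existing `train_test_split` call; note that this would alter the split and therefore the reported scores.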


#### Display all the columns

In [49]:
data.columns

Index(['rally', 'serve', 'hitpoint', 'speed', 'net.clearance',
       'distance.from.sideline', 'depth', 'outside.sideline',
       'outside.baseline', 'player.distance.travelled', 'player.impact.depth',
       'player.impact.distance.from.center', 'player.depth',
       'player.distance.from.center', 'previous.speed',
       'previous.net.clearance', 'previous.distance.from.sideline',
       'previous.depth', 'opponent.depth', 'opponent.distance.from.center',
       'same.side', 'previous.hitpoint', 'previous.time.to.net',
       'server.is.impact.player', 'outcome', 'gender'],
      dtype='object')

#### Creating lists of numerical and categorical attributes

In [50]:
numeric_list = ['rally','serve','speed','net.clearance',
                'distance.from.sideline','depth',
                'player.distance.travelled','player.impact.depth',
                'player.impact.distance.from.center',
                'player.depth','player.distance.from.center',
                'previous.speed','previous.net.clearance',
                'previous.distance.from.sideline','previous.depth',
                'opponent.depth','opponent.distance.from.center',
                'previous.time.to.net']

categorical_list = ["hitpoint","outside.sideline",
                    "outside.baseline","same.side",
                    "previous.hitpoint",
                    "server.is.impact.player",
                    "gender"]

In [51]:
len(numeric_list)

18

In [52]:
len(categorical_list)

7

### Implement Pipelines.

Use `pipelines` to clean up modelling code.

**`Pipelines`** are a simple way to keep your data preprocessing and modeling code organized. Specifically, a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step.

Many data scientists hack together models without pipelines, but pipelines have some important benefits. Those include:

- **`Cleaner Code`**: Accounting for data at each step of preprocessing can get messy. With a pipeline, you won't need to manually keep track of your training and validation data at each step.
- **`Fewer Bugs`**: There are fewer opportunities to misapply a step or forget a preprocessing step.
- **`Easier to Productionize`**: It can be surprisingly hard to transition a model from a prototype to something deployable at scale. We won't go into the many related concerns here, but pipelines can help.
- **`More Options for Model Validation`**: Cross-validation becomes easy to apply, because the whole pipeline (preprocessing included) is refit on each training fold.
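
The last benefit in the list can be sketched as follows. This is a minimal illustration on synthetic data, assuming a `StandardScaler` plus `LogisticRegression` pipeline as a stand-in for the models used later; the `_demo` names are hypothetical and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the tennis features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# The scaler is refit on the training part of every fold,
# so no statistics leak from the held-out part.
scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5)
print(scores)
print(scores.mean())
```

Passing the pipeline, rather than a bare model on pre-scaled data, is what keeps the validation folds untouched by preprocessing.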

We construct the full pipeline in three steps.

**Define Preprocessing Steps**

Similar to how a pipeline bundles together preprocessing and modeling steps, we use the `ColumnTransformer` class to bundle together different preprocessing steps. 

The code below:
- standardizes `numerical data` and imputes any missing values with the mean, and
- imputes missing values with the most frequent category and applies a `one-hot encoding` to `categorical data`.

In [53]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder


# Preprocessing for numerical data
numerical_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('imputer', SimpleImputer(strategy='mean')),
])

# Preprocessing for categorical data
categorical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(transformers=[
    ('num_trf', numerical_pipe, numeric_list),
    ('cat_trf', categorical_pipe, categorical_list),
])
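
To see what such a preprocessor produces, here is a small sketch on a toy frame. The values are made up for illustration; only the column names `speed` and `hitpoint` are borrowed from the dataset, and `toy_preprocessor` is a hypothetical name. The steps mirror the ones defined above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with made-up values and one missing entry per column.
toy = pd.DataFrame({
    "speed": [30.0, np.nan, 36.0, 24.0],
    "hitpoint": ["F", "B", np.nan, "F"],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num_trf", Pipeline([
        ("scaler", StandardScaler()),
        ("imputer", SimpleImputer(strategy="mean")),
    ]), ["speed"]),
    ("cat_trf", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["hitpoint"]),
])

transformed = toy_preprocessor.fit_transform(toy)

# The result may be a sparse matrix depending on its density.
if hasattr(transformed, "toarray"):
    transformed = transformed.toarray()

# One scaled numeric column plus one column per hitpoint category.
print(transformed.shape)
print(transformed)
```

The missing `speed` ends up at the mean of the scaled column, and the missing `hitpoint` is filled with the most frequent category before encoding.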





### Model building

Build **`XGBoost`** Model.

In this activity, we'll work with the `XGBoost` library. `XGBoost` stands for `extreme gradient boosting`, which is an implementation of `gradient boosting` with several additional features focused on performance and speed. 

(Scikit-learn has another version of gradient boosting, but `XGBoost` has some technical advantages.)

In the code cell, we import the `scikit-learn` API for `XGBoost (xgboost.XGBClassifier)`. This allows us to build and fit a model just as we would in scikit-learn. 

The `XGBClassifier` class has many tunable parameters.

In [54]:
#!pip install xgboost

from xgboost import XGBClassifier

xgb_model = XGBClassifier()
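
Depending on the installed XGBoost version, `XGBClassifier.fit` may reject string class labels such as `'FE'`, `'UE'`, and `'W'` and expect integers starting at 0 instead. If that happens, one option (an assumption, not something the original notebook does) is to encode the target with scikit-learn's `LabelEncoder` and map predictions back afterwards. A minimal sketch on a few example labels:

```python
from sklearn.preprocessing import LabelEncoder

# A few example labels in the same format as the outcome column.
outcomes = ["UE", "FE", "W", "UE", "W"]

label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(outcomes)

# Classes are sorted alphabetically, so FE -> 0, UE -> 1, W -> 2.
print(list(label_encoder.classes_))
print(list(encoded))

# Map integer predictions back to the original labels.
decoded = label_encoder.inverse_transform(encoded)
print(list(decoded))
```

In the notebook, this would mean fitting the encoder on `y_train`, training on the encoded labels, and calling `inverse_transform` on the predictions before building the reports.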

In [55]:
%%time

# Bundle preprocessing and modeling code in a pipeline
xgb_pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', xgb_model)])

# Preprocess the training data and fit the model
xgb_pipeline.fit(X_train, y_train)

# Preprocess the training data and get predictions
y_pred_train = xgb_pipeline.predict(X_train)

# Evaluate the model on the training set
print("Train Accuracy:", accuracy_score(y_train, y_pred_train))
print("Train Classification Report:", classification_report(y_train, y_pred_train))

# Preprocess the test data and get predictions
y_pred_test = xgb_pipeline.predict(X_test)

# Evaluate the model on the test set
xgb_accuracy = accuracy_score(y_test, y_pred_test)
print("Test Accuracy:", xgb_accuracy)
print("Test Classification Report:", classification_report(y_test, y_pred_test))

Train Accuracy: 1.0
Train Classification Report:               precision    recall  f1-score   support

          FE       1.00      1.00      1.00      1300
          UE       1.00      1.00      1.00      2419
           W       1.00      1.00      1.00      1881

    accuracy                           1.00      5600
   macro avg       1.00      1.00      1.00      5600
weighted avg       1.00      1.00      1.00      5600

Test Accuracy: 0.8754685547688463
Test Classification Report:               precision    recall  f1-score   support

          FE       0.77      0.77      0.77       518
          UE       0.89      0.87      0.88      1082
           W       0.93      0.95      0.94       801

    accuracy                           0.88      2401
   macro avg       0.86      0.86      0.86      2401
weighted avg       0.87      0.88      0.88      2401

Wall time: 2.43 s
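
The report shows that `FE` is the weakest class, but not which classes it gets confused with. A confusion matrix answers that. The sketch below uses hypothetical true and predicted labels purely for illustration; in the notebook the call would take `y_test` and `y_pred_test` instead.

```python
from sklearn.metrics import confusion_matrix

labels = ["FE", "UE", "W"]

# Hypothetical true and predicted labels, for illustration only.
y_true = ["FE", "FE", "UE", "UE", "UE", "W", "W"]
y_hat = ["FE", "UE", "UE", "UE", "FE", "W", "W"]

# Rows are true classes, columns are predicted classes,
# both in the order given by `labels`.
cm = confusion_matrix(y_true, y_hat, labels=labels)
print(cm)
```

Off-diagonal entries count the misclassified points, so a large value in the `FE` row under the `UE` column would indicate forced errors being labelled as unforced.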


In [None]:
%%time

from sklearn.model_selection import GridSearchCV

param_grid = {
    "model__n_estimators": [10, 50, 100, 500],
    "model__learning_rate": [0.05, 0.1, 0.5, 1],
}

searchCV = GridSearchCV(xgb_pipeline, 
                        cv=5, 
                        param_grid=param_grid)

searchCV.fit(X_train, y_train)

In [None]:
searchCV.best_params_

In [None]:
searchCV.cv_results_['mean_test_score']

In [None]:
searchCV.cv_results_['mean_test_score'].mean()

In [None]:
xgb_grid_model = searchCV.best_estimator_
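
The `model__` prefix in the parameter grid routes each setting to the pipeline step named `model`. The following sketch shows the same pattern on synthetic data, with scikit-learn's `GradientBoostingClassifier` swapped in for XGBoost so that it runs without extra packages; the `demo_` names and the small grid are assumptions for illustration. It also reads `best_score_`, the mean cross-validated score of the best candidate, which is usually more informative than averaging over all candidates.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the tennis features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=0
)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", GradientBoostingClassifier(random_state=0)),
])

# "model__" routes each parameter to the pipeline step named "model".
demo_grid = {
    "model__n_estimators": [5, 20],
    "model__learning_rate": [0.1, 0.5],
}

demo_search = GridSearchCV(demo_pipeline, param_grid=demo_grid, cv=3)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(demo_search.best_score_)  # mean CV accuracy of the best candidate
```

By default `GridSearchCV` refits the best configuration on the full training data, which is why `best_estimator_` can be used for prediction directly.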

In [None]:


# Preprocessing of training data, get predictions
y_pred_train = xgb_grid_model.predict(X_train)

# Evaluate the model - training
print("Train Accuracy:",accuracy_score(y_train,y_pred_train))
print("Train Classification report\n", classification_report(y_train,y_pred_train,digits=4))
print("\n")

# Preprocessing of validation data, get predictions
y_pred_test = xgb_grid_model.predict(X_test)

# Evaluate the model - validation
print("Test Accuracy:",accuracy_score(y_test,y_pred_test))
print("Test Classification report\n", classification_report(y_test,y_pred_test,digits=4))

## LightGBM
LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient, and its documentation lists the following advantages:
- Faster training speed and higher efficiency.
- Lower memory usage.
- Better accuracy.
- Support of parallel and GPU learning.
- Capable of handling large-scale data.

In [56]:
%%time

#!pip install lightgbm
from lightgbm import LGBMClassifier


lgbm_model = LGBMClassifier(n_estimators = 500,
                            learning_rate = 0.05,
                            n_jobs = 4,
                            random_state = 2)


# Bundle preprocessing and modeling code in a pipeline
lgbm_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                ('model', lgbm_model)])

Wall time: 2.99 ms


In [57]:
%%time
# Preprocessing of training data, fit model 
lgbm_pipeline.fit(X_train, y_train)

# Preprocessing of training data, get predictions
y_pred_train = lgbm_pipeline.predict(X_train)

# Evaluate the model - training
print("Train Accuracy:",accuracy_score(y_train,y_pred_train))
print("Train Classification report\n", classification_report(y_train,y_pred_train,digits=4))
print("\n")

# Preprocessing of validation data, get predictions
y_pred_test = lgbm_pipeline.predict(X_test)

# Evaluate the model - validation
print("Test Accuracy:",accuracy_score(y_test,y_pred_test))


lgb_accuracy = accuracy_score(y_test,y_pred_test)


print("Test Classification report\n", classification_report(y_test,y_pred_test,digits=4))

Train Accuracy: 1.0
Train Classification report
               precision    recall  f1-score   support

          FE     1.0000    1.0000    1.0000      1300
          UE     1.0000    1.0000    1.0000      2419
           W     1.0000    1.0000    1.0000      1881

    accuracy                         1.0000      5600
   macro avg     1.0000    1.0000    1.0000      5600
weighted avg     1.0000    1.0000    1.0000      5600



Test Accuracy: 0.8767180341524364
Test Classification report
               precision    recall  f1-score   support

          FE     0.7685    0.7819    0.7751       518
          UE     0.8871    0.8641    0.8755      1082
           W     0.9329    0.9551    0.9439       801

    accuracy                         0.8767      2401
   macro avg     0.8628    0.8670    0.8648      2401
weighted avg     0.8768    0.8767    0.8766      2401

Wall time: 2.97 s


#### Combining the model results to one dataframe

In [67]:
#Create a dictionary

score_dict = {"Model 1 Accuracy":xgb_accuracy,"Model 2 Accuracy":lgb_accuracy}

print(score_dict)

{'Model 1 Accuracy': 0.8754685547688463, 'Model 2 Accuracy': 0.8767180341524364}


In [68]:
# Convert to a DataFrame with one row per model

results_df = pd.DataFrame({"Accuracy": [xgb_accuracy, lgb_accuracy]},
                          index=["XGBoost", "LightGBM"])

In [69]:
results_df

| | Accuracy |
| --- | --- |
| XGBoost | 0.875469 |
| LightGBM | 0.876718 |
