# 1. Introduction

In this project, we're going to use the tennis match dataset available at https://www.kaggle.com/jordangoblet/atp-tour-20002016 to build a machine learning model that estimates each player's probability of winning a match.
The goal is to compare these probabilities against the odds available for each player in any given match, to extract value bets and see if there's some room to make an imaginary profit.


In this notebook we're going to load the original csv file with the tennis match data and explore it, to see how consistent the data within the file is.

We're going to import pandas, the library we'll use to manipulate csv files and DataFrame objects:

In [1]:
import pandas as pd

Let's load the 'Data.csv' file downloaded from https://www.kaggle.com/jordangoblet/atp-tour-20002016 and stored in the csv directory within this project:

In [2]:
df = pd.read_csv("csv/Data.csv", engine='python')

Let's take a quick look at the data to see if we imported it correctly:

In [3]:
df.head()

Unnamed: 0,ATP,Location,Tournament,Date,Series,Court,Surface,Round,Best of,Winner,...,UBW,UBL,LBW,LBL,SJW,SJL,MaxW,MaxL,AvgW,AvgL
0,1,Adelaide,Australian Hardcourt Championships,3/01/2000,International,Outdoor,Hard,1st Round,3,Dosedel S.,...,,,,,,,,,,
1,1,Adelaide,Australian Hardcourt Championships,3/01/2000,International,Outdoor,Hard,1st Round,3,Enqvist T.,...,,,,,,,,,,
2,1,Adelaide,Australian Hardcourt Championships,3/01/2000,International,Outdoor,Hard,1st Round,3,Escude N.,...,,,,,,,,,,
3,1,Adelaide,Australian Hardcourt Championships,3/01/2000,International,Outdoor,Hard,1st Round,3,Federer R.,...,,,,,,,,,,
4,1,Adelaide,Australian Hardcourt Championships,3/01/2000,International,Outdoor,Hard,1st Round,3,Fromberg R.,...,,,,,,,,,,


How many matches do we have in this dataset?

In [4]:
df.shape

(46652, 54)

We have as many as 46652 matches and 54 attributes for each match!

Let's display the dataframe columns and their types:

In [5]:
df.dtypes

ATP             int64
Location       object
Tournament     object
Date           object
Series         object
Court          object
Surface        object
Round          object
Best of         int64
Winner         object
Loser          object
WRank          object
LRank          object
W1            float64
L1            float64
W2            float64
L2            float64
W3            float64
L3            float64
W4            float64
L4            float64
W5            float64
L5            float64
Wsets         float64
Lsets         float64
Comment        object
CBW           float64
CBL           float64
GBW           float64
GBL           float64
IWW           float64
IWL           float64
SBW           float64
SBL           float64
B365W         float64
B365L         float64
B&WW          float64
B&WL          float64
EXW           float64
EXL           float64
PSW           float64
PSL           float64
WPts          float64
LPts          float64
UBW           float64
UBL       

Note: in the 'Attributes.txt' file we can look for the meaning of each of these columns.

# Problems of this Dataset

As we can see, there are already some things that don't look very promising.
For example, the rankings of the winner and the loser ('WRank' and 'LRank') are not of type 'int' but 'object', which suggests that these columns contain some non-numeric values, and the 'Date' attribute is not a date type, but an 'object' too.
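
As a rough illustration of how these type problems could be handled, here is a minimal sketch on a tiny made-up sample. The `fix_types` helper, the sample rows and the 'NR' placeholder are assumptions for the example, not part of the dataset description, and the day-first date format is inferred from values like '3/01/2000'. This is not necessarily what the PreProcessing notebook does.

```python
import pandas as pd

def fix_types(df):
    """Return a copy with numeric rankings and a real datetime 'Date' column."""
    out = df.copy()
    # Entries that can't be parsed as numbers become NaN instead of raising
    for col in ["WRank", "LRank"]:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    # Assumption: dates are written day-first, e.g. '3/01/2000' is 3 January 2000
    out["Date"] = pd.to_datetime(out["Date"], format="%d/%m/%Y", errors="coerce")
    return out

# Hypothetical sample that mimics the problematic columns
sample = pd.DataFrame({
    "Date": ["3/01/2000", "4/01/2000", "5/01/2000"],
    "WRank": ["63", "NR", "5"],
    "LRank": ["77", "56", "NR"],
})

cleaned = fix_types(sample)
print(cleaned.dtypes)
```

Using `errors="coerce"` keeps the rows with unparseable rankings, marking them as missing, so we can decide later whether to drop or fill them.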

Moreover, in this file the winner of each match is already placed in a dedicated column ('Winner'), and that's not ideal, since we're going to build a classifier that has to learn from the attributes of each match who's more likely to win and to lose.
In fact, if the winner of the match always appears in the same column, the position of a player in the row already gives away the result, and the model has nothing useful to learn from it.
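
One possible way around this, sketched below, is to randomly assign the winner to either a 'Player1' or a 'Player2' slot and to store the outcome in a separate target column. The function name `randomize_player_order`, the new column names ('Player1', 'Player2', 'P1Rank', 'P2Rank', 'P1Wins') and the sample rows are all hypothetical, and this is only a sketch of the idea, not necessarily the approach used in the PreProcessing notebook.

```python
import numpy as np
import pandas as pd

def randomize_player_order(matches, seed=0):
    """Place the winner in the Player1 or Player2 slot at random, row by row."""
    rng = np.random.default_rng(seed)
    # True means the loser goes into the Player1 slot for that row
    swap = rng.random(len(matches)) < 0.5
    return pd.DataFrame({
        "Player1": np.where(swap, matches["Loser"], matches["Winner"]),
        "Player2": np.where(swap, matches["Winner"], matches["Loser"]),
        "P1Rank": np.where(swap, matches["LRank"], matches["WRank"]),
        "P2Rank": np.where(swap, matches["WRank"], matches["LRank"]),
        # Target: 1 if Player1 won the match, 0 otherwise
        "P1Wins": (~swap).astype(int),
    }, index=matches.index)

# Hypothetical matches, only to show the shape of the result
matches = pd.DataFrame({
    "Winner": ["Player A", "Player C", "Player E", "Player G"],
    "Loser": ["Player B", "Player D", "Player F", "Player H"],
    "WRank": [10, 25, 3, 48],
    "LRank": [40, 12, 90, 51],
})

print(randomize_player_order(matches, seed=42))
```

With this layout, the target 'P1Wins' is roughly balanced between 0 and 1, and the model has to rely on the attributes of the two players rather than on their position in the row.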

We're going to correct these issues, and many more, in the PreProcessing notebook.

# Next Steps

Now that we have looked at the dataset, we can outline our next steps in order to achieve our goal: building a model that estimates each player's chances of winning a match, and finds value bets to beat the bookmakers (or at least tries to)!

So, the next step is going to be pre-processing this original csv file, in order to remove the features we're not interested in and to correct the issues that we find in the data.

Then, using the fixed data, for each match we will 'build' new features from scratch, to help our prediction model do a better job.

Next, we're going to build some machine learning models to learn from these attributes and to make a prediction about who's more likely to win a given match. And we'll compare the models in order to pick the best one.

Finally, we will focus on the betting aspect.
In fact, we'll use the model previously built to calculate each player's chances of winning.
Once we have these chances, we're going to look at the odds available for each match, to extract (if possible) value bets, i.e. bets where the odds offered by a bookmaker are higher than the fair odds implied by the player's chances of winning the match as calculated by our model.
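
To make the idea of a value bet concrete, here is a small sketch. It assumes the odds are in decimal format (a winning bet of 1 unit returns `odds` units, stake included), and the helper names `fair_odds`, `expected_value` and `is_value_bet`, as well as the numbers used, are made up for the example.

```python
def fair_odds(p_win):
    """Decimal odds at which a bet with win probability p_win breaks even."""
    return 1.0 / p_win

def expected_value(p_win, odds):
    """Expected profit per unit staked: win (odds - 1) with p_win, lose 1 otherwise."""
    return p_win * (odds - 1.0) - (1.0 - p_win)

def is_value_bet(p_win, odds):
    """A bet has value when the bookmaker's odds exceed the model's fair odds."""
    return odds > fair_odds(p_win)

# Hypothetical example: the model gives a player a 60% chance of winning
p_model = 0.60
print(fair_odds(p_model))             # about 1.67
print(is_value_bet(p_model, 1.80))    # True: 1.80 is above the fair odds
print(is_value_bet(p_model, 1.50))    # False: 1.50 is below the fair odds
print(expected_value(p_model, 1.80))  # about 0.08 units per unit staked
```

Of course, a bet found this way is only as good as the probabilities coming out of the model: if the model overestimates a player's chances, the apparent value is not real.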