# 1. Introduction

In this project, we're going to use the tennis match dataset available at https://www.kaggle.com/hakeem/atp-and-wta-tennis-data to build a machine learning model that estimates each player's real chances of winning a match.
The goal is to compare these chances against the odds available for each player in any given match, extract value bets, and see whether there's room to make an imaginary profit.


In this notebook we're going to load the original csv file with tennis match data and explore it to see how consistent the data is.

We're going to import pandas as the library to manipulate csv files and DataFrame objects:

In [1]:
import pandas as pd

Let's load the data file downloaded from https://www.kaggle.com/hakeem/atp-and-wta-tennis-data and stored in the csv directory within this project (the cell below reads it as 'csv/df_atp.csv'):

In [2]:
df = pd.read_csv("csv/df_atp.csv", engine='python')

Let's take a quick look at the data to see if we imported it correctly:

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,ATP,AvgL,AvgW,B&WL,B&WW,B365L,B365W,Best of,CBL,...,UBW,W1,W2,W3,W4,W5,WPts,WRank,Winner,Wsets
0,0,1,,,,,,,3,,...,,6.0,6.0,,,,,63,Dosedel S.,2.0
1,1,1,,,,,,,3,,...,,6.0,6.0,,,,,5,Enqvist T.,2.0
2,2,1,,,,,,,3,,...,,6.0,7.0,6.0,,,,40,Escude N.,2.0
3,3,1,,,,,,,3,,...,,6.0,6.0,,,,,65,Federer R.,2.0
4,4,1,,,,,,,3,,...,,7.0,5.0,6.0,,,,81,Fromberg R.,2.0


How many matches do we have in this dataset?

In [4]:
df.shape

(54908, 55)

We have as many as 54908 matches and 55 attributes for each match!

Let's display the dataframe columns and their types:

In [5]:
df.dtypes

Unnamed: 0      int64
ATP             int64
AvgL          float64
AvgW          float64
B&WL          float64
B&WW          float64
B365L         float64
B365W         float64
Best of         int64
CBL           float64
CBW           float64
Comment        object
Court          object
Date           object
EXL           float64
EXW            object
GBL           float64
GBW           float64
IWL           float64
IWW           float64
L1            float64
L2             object
L3             object
L4            float64
L5            float64
LBL           float64
LBW           float64
LPts          float64
LRank          object
Location       object
Loser          object
Lsets          object
MaxL          float64
MaxW          float64
PSL           float64
PSW           float64
Round          object
SBL           float64
SBW           float64
SJL           float64
SJW           float64
Series         object
Surface        object
Tournament     object
UBL           float64
UBW       

# Problems of this Dataset

As we can see, there are already some things that don't look very promising.
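For example, the rankings of the winner and the loser are not of type 'int' but 'object', and the 'Date' attribute is not a date but an 'object' too. In pandas, an 'object' dtype usually means the column holds strings or mixed values, so these columns can't be used for numeric or date operations as they are.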
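
One possible way to repair such columns is sketched below. This is only an illustration on a toy DataFrame: the sample values, the "NR" placeholder and the date format are assumptions, not taken from the real file, and the actual fix belongs to the PreProcessing notebook.

```python
import pandas as pd

# Toy frame mimicking the problem columns; all values are made up for illustration.
sample = pd.DataFrame({
    "Date": ["2000-01-03", "2000-01-04", "not a date"],
    "WRank": ["63", "5", "NR"],
    "LRank": ["77", "NR", "56"],
})

# Entries that can't be parsed as numbers (here the assumed "NR") become NaN
# instead of raising an error.
for col in ["WRank", "LRank"]:
    sample[col] = pd.to_numeric(sample[col], errors="coerce")

# Entries that can't be parsed as dates become NaT.
sample["Date"] = pd.to_datetime(sample["Date"], errors="coerce")

print(sample.dtypes)
```

With `errors="coerce"` the bad entries are turned into missing values, so they can be counted and handled explicitly instead of breaking the conversion.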

Moreover, this file already records who won each match in a dedicated column ('Winner'). That's not ideal, since we're going to build a classifier that has to learn from the attributes of each match who's more likely to win and who's more likely to lose.
In fact, if the winner of the match always appears in the same column, the model can simply learn the column position instead of anything about the players, and that's not going to help it.
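
A common way around this, sketched here under assumptions, is to place the winner at random in one of two neutral player columns and keep a separate target that says which one won. The column names `player_1`, `player_2` and `player_1_wins` are hypothetical, and the opponent names are placeholders; this is not necessarily what the PreProcessing notebook does.

```python
import numpy as np
import pandas as pd

# Toy matches: winners come from the preview above, opponents are placeholders.
matches = pd.DataFrame({
    "Winner": ["Dosedel S.", "Enqvist T.", "Escude N.", "Federer R."],
    "Loser": ["Opponent A", "Opponent B", "Opponent C", "Opponent D"],
})

# For each match, decide at random whether the winner goes in the second slot.
rng = np.random.default_rng(seed=0)
swap = rng.random(len(matches)) < 0.5

# Hypothetical neutral columns: the winner is no longer always in the same place.
matches["player_1"] = np.where(swap, matches["Loser"], matches["Winner"])
matches["player_2"] = np.where(swap, matches["Winner"], matches["Loser"])

# Target for the classifier: 1 if player_1 won, 0 if player_2 won.
matches["player_1_wins"] = (~swap).astype(int)

print(matches[["player_1", "player_2", "player_1_wins"]])
```

The 'Winner' and 'Loser' columns would then be dropped before training, so that the target is the only place where the result is encoded.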

We're going to correct these issues, and many more, in the PreProcessing notebook.

# Next Steps

Now that we have looked at the dataset, we can lay out the next steps towards our goal: building a model that predicts the real chances of each player to win a match, and finds value bets to beat the bookmakers (or at least tries to)!

So, the next step is going to be pre-processing this original csv file, in order to remove the features we're not interested in and to correct the issues we find in the data.

Then, using the fixed data, for each match we will 'build' new features from scratch, to help our prediction model do a better job.

Next, we're going to build some machine learning models that learn from these attributes and predict who's more likely to win a given match. We'll then compare the models in order to pick the best one.

Finally, we will focus on the betting aspect.
In fact, we'll use the model previously built to calculate the chances of each player to win.
Once we have these chances, we're going to look at the odds available for each match to extract (if possible) value bets, i.e. bets where the odds offered by a bookmaker are higher than the odds implied by the player's chances of winning as calculated by our model.
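
To make the definition of a value bet concrete, here is a minimal sketch. It assumes decimal odds (the payout per unit staked, stake included), and the function names `fair_odds`, `is_value_bet` and `expected_profit` are hypothetical helpers, not part of this project.

```python
def fair_odds(probability: float) -> float:
    """Decimal odds implied by a win probability (assumes decimal odds)."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return 1.0 / probability


def is_value_bet(probability: float, bookmaker_odds: float) -> bool:
    """A bet has value when the bookmaker pays more than the fair odds."""
    return bookmaker_odds > fair_odds(probability)


def expected_profit(probability: float, bookmaker_odds: float, stake: float = 1.0) -> float:
    """Expected profit: win stake * (odds - 1) with probability p, lose the stake otherwise."""
    return stake * (probability * bookmaker_odds - 1.0)


# Example with made-up numbers: the model gives a player a 60% chance,
# while the bookmaker offers decimal odds of 1.80.
print(fair_odds(0.6))
print(is_value_bet(0.6, 1.8))
print(expected_profit(0.6, 1.8))
```

In this made-up example the fair odds are about 1.67, so odds of 1.80 would count as a value bet, but only if the model's 60% estimate is actually accurate.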