# Analysis of ATP Tour 2019 season 

### Content
+ Introduction: ATP Tour
+ Data description 
+ Formulation of research questions
+ Data preparation: cleaning and shaping

## 1. Introduction: ATP Tour 
### Content
1. Definition
2. History
3. Tournament types
4. ATP Ranking


### 1.1 Definition
The ATP Tour is a worldwide top-tier tennis tour for men organized by the Association of Tennis Professionals.

### 1.2 History
The Association of Tennis Professionals (ATP) was formed in September 1972 by Donald Dell, Jack Kramer, and Cliff Drysdale to protect the interests of professional tennis players, and Drysdale became its first President. Since 1990, the association has organized the ATP Tour, the worldwide tennis tour for men, and linked the title of the tour with the organization's name. It is the governing body of men's professional tennis.

### 1.3 Tournament types 
The ATP Tour comprises ATP Masters 1000, ATP 500, and ATP 250 tournaments. Grand Slam tournaments do not fall under the auspices of the ATP; they are overseen by the ITF (International Tennis Federation) instead. The image below shows the details of each category.

<img src="ranking1.png" width="900">

The winner of an ATP Masters 1000 tournament earns 1000 ranking points, the winner of an ATP 500 tournament earns 500 ranking points, and the winner of an ATP 250 tournament earns 250 ranking points. The most ranking points go to Grand Slam winners, who earn 2000 points. The players and doubles teams with the most ranking points (collected during the calendar year) play in the season-ending ATP Finals, where the winner earns between 1100 and 1500 ranking points. Moreover, the higher the level of the tournament, the larger the total prize money. From the image, we can see that the winner of the ATP Finals receives a total of 4,450,000 USD.


### 1.4 ATP Ranking 
The ATP Ranking is used to determine qualification for entry and seeding in all tournaments, for both singles and doubles. Points are accumulated over the ranking period, which consists of the past year, with the exception of points from the ATP Finals, which are dropped following the last ATP event of the year.

Points are awarded as follows:

<img src="distribution.png" width="900">

#### The description of the columns from the image:
 1. R128 - Reached the round of the last 128
 2. R64 - Reached the round of the last 64
 3. R32 - Reached the round of the last 32
 4. R16 - Reached the round of the last 16
 5. QF - Quarterfinalist
 6. SF - Semifinalist
 7. F - Finalist
 8. W - Winner

As an illustration, the player who wins a tournament gets the full points for that category: the winner of a Grand Slam receives 2000 ranking points, while the player who loses in the final receives only slightly more than half of that (1200 points).

The number of players varies with the level of the tournament. Ranking points are awarded according to the stage of the tournament reached and the prestige of the tournament, with the four Grand Slams awarding the most points. The rankings are updated every Monday, and points are dropped 52 weeks after being awarded.

For a better result within the same tournament type to count, the player has to wait for the worse result from the previous year to expire. To gain points, the player has to reach a later stage of the tournament than in the previous year; if the player loses at an earlier stage than in the previous year, the points for the stage actually reached are the ones kept in the rankings. Players therefore have to defend their ranking points, or earn more, each year in order to stay high in the rankings. A previous result expires only at the drop date of that tournament, and only if the player reached a worse result or did not enter in the current year.
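
The 52-week rule can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the player and the dates are hypothetical, the ATP Finals exception is ignored, and `ranking_points` is a helper name introduced here, not part of any official ATP tool.

```python
from datetime import date, timedelta

# Winner points per category, as stated in the text above
WINNER_POINTS = {
    "Grand Slam": 2000,
    "ATP Masters 1000": 1000,
    "ATP 500": 500,
    "ATP 250": 250,
}


def ranking_points(results, ranking_date):
    """Sum the points awarded within the 52 weeks before ranking_date."""
    window_start = ranking_date - timedelta(weeks=52)
    return sum(
        points
        for awarded_on, points in results
        if window_start < awarded_on <= ranking_date
    )


# Hypothetical player: wins an ATP 250 title, then a Grand Slam title
results = [
    (date(2018, 1, 8), WINNER_POINTS["ATP 250"]),
    (date(2018, 9, 10), WINNER_POINTS["Grand Slam"]),
]

# Both results are still inside the 52-week window
print(ranking_points(results, date(2018, 12, 31)))
# The ATP 250 points have dropped, only the Grand Slam points remain
print(ranking_points(results, date(2019, 1, 14)))
```

The second call shows why a title has to be defended: once the 52 weeks have passed, the old points disappear unless new ones were earned in the meantime.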

The player with the most points at the end of the season is the world No. 1 of the year. Players finishing in the top eight of the Emirates ATP Rankings qualify for the ATP Finals. At the time of writing, the world No. 1 of the year is Novak Djokovic.

   

1. Source: https://en.wikipedia.org/wiki/ATP_Rankings
2. Source: https://en.wikipedia.org/wiki/ATP_Tour
3. Source: shorturl.at/lAIMP


## 2. Data description


Based on the description of the ATP Tour above, the dataset is expected to contain detailed information about the tour, for example the number of tournaments played during the year or the rise and fall of players in the ATP Rankings. The tournaments come with full information about the results of the players' matches, which gives an opportunity to explore the data from different angles.
    
Our analysis focuses on the 2019 season, as it contains recent information about the ATP Tour. The variables that we are going to analyze are listed below:
+ ATP - the tournament number.
+ Location - the place where the tournament took place.
+ Tournament - the name of the tournament.
+ Date - the date when the match was played.
+ Series - the category of the tennis tournament.
+ Court - the type of court (outdoor or indoor).
+ Surface - the type of surface on which the tournament is played (hard, grass, carpet or clay).
+ Round - the round of the match.
+ Best of - the maximum number of sets playable in the match.
+ Winner - the winner of the match.
+ Loser - the player who lost the match.
+ Comment - a comment on the match (Completed, won through retirement of the loser, or via walkover).
+ WRank - the ATP Entry ranking of the match winner as of the start of the tournament.
+ LRank - the ATP Entry ranking of the match loser as of the start of the tournament.

The dataset also contains some columns that are not needed for this project, so they are not described here. If a column that is not in the list above is unclear, its description can be found through the link below.

<b>Link:</b> <a href='notes.txt' style="text-decoration:None;">Notes</a>

## 3. Formulation of research questions

This project focuses on the five questions listed below, and the data visualization is built around them.
1. Analyze the court, surface and series of tournaments in the ATP Tour 2019 season
2. Analyze the distribution of tournaments by continent
3. Analyze the popularity of tournaments by the time of the year
4. Analyze the players' performance during the season
5. Analyze the relations between the variables in the dataset
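
As a rough idea of how the first question could be approached, the sketch below counts tournaments per surface. It assumes the column names described in the previous section; the `sample` rows and the event names are hypothetical and only stand in for the real dataset.

```python
import pandas as pd

# Tiny hypothetical sample with the same column names as the dataset
sample = pd.DataFrame({
    "Tournament": ["Event A", "Event A", "Event B", "Event C", "Event C"],
    "Surface": ["Hard", "Hard", "Clay", "Hard", "Hard"],
    "Court": ["Outdoor", "Outdoor", "Outdoor", "Indoor", "Indoor"],
})

# Each row is a match, so keep one row per tournament before counting
tournaments = sample.drop_duplicates(subset="Tournament")

# Number of tournaments played on each surface
surface_counts = tournaments["Surface"].value_counts()
print(surface_counts)
```

Counting without `drop_duplicates` would answer a different question, namely how many matches were played on each surface.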

## 4. Data preparation: cleaning and shaping
### Content:
+ The dataset information
+ Data cleaning and shaping

### 4.1 The dataset information
The dataset consists of 2610 matches from all the tournament categories: Grand Slams, ATP Finals, Masters 1000, 500 Series and 250 Series. The data covers the ATP 2019 season, from the end of December 2018 to November 2019. The dataset is provided in Excel format.

In [1]:
# Import modules that will be used for observations 
import pandas as pd
import numpy as np

In [2]:
# The name of dataset
address="2019.xlsx"
# Reading the file in excel format
df=pd.read_excel(address)

In [3]:
# Dropping the columns that are not needed for the analysis (betting odds)
dropped_df = df.drop(columns=['MaxW', 'MaxL', 'AvgW', 'AvgL',
                              'B365W', 'B365L', 'PSW', 'PSL'])

In [4]:
# Showing the total number of rows(goes first) and the number of columns(goes second)
dropped_df.shape

(2610, 28)

In [5]:
# Showing the data type of each column in the dataset
dropped_df.dtypes

ATP                    int64
Location              object
Tournament            object
Date          datetime64[ns]
Series                object
Court                 object
Surface               object
Round                 object
Best of                int64
Winner                object
Loser                 object
WRank                float64
LRank                float64
WPts                 float64
LPts                 float64
W1                   float64
L1                   float64
W2                   float64
L2                   float64
W3                   float64
L3                   float64
W4                   float64
L4                   float64
W5                   float64
L5                   float64
Wsets                float64
Lsets                float64
Comment               object
dtype: object

In [6]:
#Finally, let's take a look at the dataset itself
# Start counting the rows from 1 instead of default 0
dropped_df.index=dropped_df.index+1
dropped_df

Unnamed: 0,ATP,Location,Tournament,Date,Series,Court,Surface,Round,Best of,Winner,...,L2,W3,L3,W4,L4,W5,L5,Wsets,Lsets,Comment
1,1,Brisbane,Brisbane International,2018-12-31,ATP250,Outdoor,Hard,1st Round,3,Dimitrov G.,...,4.0,,,,,,,2.0,0.0,Completed
2,1,Brisbane,Brisbane International,2018-12-31,ATP250,Outdoor,Hard,1st Round,3,Raonic M.,...,3.0,,,,,,,2.0,0.0,Completed
3,1,Brisbane,Brisbane International,2018-12-31,ATP250,Outdoor,Hard,1st Round,3,Kecmanovic M.,...,1.0,,,,,,,2.0,0.0,Completed
4,1,Brisbane,Brisbane International,2018-12-31,ATP250,Outdoor,Hard,1st Round,3,Millman J.,...,7.0,6.0,0.0,,,,,2.0,1.0,Completed
5,1,Brisbane,Brisbane International,2018-12-31,ATP250,Outdoor,Hard,1st Round,3,Uchiyama Y.,...,6.0,,,,,,,2.0,0.0,Completed
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2606,66,London,Masters Cup,2019-11-15,Masters Cup,Indoor,Hard,Round Robin,3,Nadal R.,...,4.0,7.0,5.0,,,,,2.0,1.0,Completed
2607,66,London,Masters Cup,2019-11-15,Masters Cup,Indoor,Hard,Round Robin,3,Zverev A.,...,6.0,,,,,,,2.0,0.0,Completed
2608,66,London,Masters Cup,2019-11-16,Masters Cup,Indoor,Hard,Semifinals,3,Tsitsipas S.,...,4.0,,,,,,,2.0,0.0,Completed
2609,66,London,Masters Cup,2019-11-16,Masters Cup,Indoor,Hard,Semifinals,3,Thiem D.,...,3.0,,,,,,,2.0,0.0,Completed


### 4.2 Data cleaning and shaping

One of the most important steps before starting the analysis is checking that the dataset is clean. \
The following steps are needed to make sure of that:
1. Changing the data types of some columns from float to int
2. Checking the dataset for inconsistent column names
3. Checking the dataset for missing values
4. Checking the dataset for duplicates

#### 4.2.1 Changing the data types of some columns from float to int


In [7]:
# Changing some column data types from float to int for better representation
# pd.Int16Dtype() is a nullable integer type, so it can hold null values
int_columns = ["WRank", "LRank", "WPts", "LPts",
               "W1", "L1", "W2", "L2", "W3", "L3", "W4", "L4", "W5", "L5",
               "Wsets", "Lsets"]
dropped_df = dropped_df.astype({column: pd.Int16Dtype() for column in int_columns})
dropped_df

Unnamed: 0,ATP,Location,Tournament,Date,Series,Court,Surface,Round,Best of,Winner,...,L2,W3,L3,W4,L4,W5,L5,Wsets,Lsets,Comment
1,1,Brisbane,Brisbane International,2018-12-31,ATP250,Outdoor,Hard,1st Round,3,Dimitrov G.,...,4,,,,,,,2,0,Completed
2,1,Brisbane,Brisbane International,2018-12-31,ATP250,Outdoor,Hard,1st Round,3,Raonic M.,...,3,,,,,,,2,0,Completed
3,1,Brisbane,Brisbane International,2018-12-31,ATP250,Outdoor,Hard,1st Round,3,Kecmanovic M.,...,1,,,,,,,2,0,Completed
4,1,Brisbane,Brisbane International,2018-12-31,ATP250,Outdoor,Hard,1st Round,3,Millman J.,...,7,6,0,,,,,2,1,Completed
5,1,Brisbane,Brisbane International,2018-12-31,ATP250,Outdoor,Hard,1st Round,3,Uchiyama Y.,...,6,,,,,,,2,0,Completed
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2606,66,London,Masters Cup,2019-11-15,Masters Cup,Indoor,Hard,Round Robin,3,Nadal R.,...,4,7,5,,,,,2,1,Completed
2607,66,London,Masters Cup,2019-11-15,Masters Cup,Indoor,Hard,Round Robin,3,Zverev A.,...,6,,,,,,,2,0,Completed
2608,66,London,Masters Cup,2019-11-16,Masters Cup,Indoor,Hard,Semifinals,3,Tsitsipas S.,...,4,,,,,,,2,0,Completed
2609,66,London,Masters Cup,2019-11-16,Masters Cup,Indoor,Hard,Semifinals,3,Thiem D.,...,3,,,,,,,2,0,Completed
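
Why a nullable type is needed can be shown with a minimal sketch. The `scores` values below are hypothetical third-set scores, where `NaN` marks a set that was not played.

```python
import numpy as np
import pandas as pd

# Hypothetical third-set scores: NaN marks a set that was not played
scores = pd.Series([6.0, np.nan, 7.0])

try:
    # A plain NumPy integer column cannot represent missing values
    scores.astype("int64")
except ValueError as error:
    print("int64 conversion failed:", error)

# The nullable Int16 type keeps the missing value as <NA>
nullable_scores = scores.astype(pd.Int16Dtype())
print(nullable_scores)
```

The nullable type keeps the scores as whole numbers while still marking the sets that were never played.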


#### 4.2.2 Checking the dataset for inconsistent column names

In [8]:
# Showing the total number of columns
print(f"There is a total of: {len(dropped_df.columns)} columns")
# Showing the column names of the dataset
dropped_df.columns

There is a total of: 28 columns


Index(['ATP', 'Location', 'Tournament', 'Date', 'Series', 'Court', 'Surface',
       'Round', 'Best of', 'Winner', 'Loser', 'WRank', 'LRank', 'WPts', 'LPts',
       'W1', 'L1', 'W2', 'L2', 'W3', 'L3', 'W4', 'L4', 'W5', 'L5', 'Wsets',
       'Lsets', 'Comment'],
      dtype='object')

We can see that the column names are consistent and capitalized. Moreover, they do not contain any invalid characters, so there are no problems with inconsistency. The column names are described in the "Data description" section.
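
The visual check can also be done programmatically. The sketch below uses the column names printed above; the naming convention encoded in `pattern` (an uppercase first letter, then only letters, digits and single spaces) is an assumption made for this illustration, not a rule of the dataset.

```python
import re

# Column names as printed in the output above
columns = ['ATP', 'Location', 'Tournament', 'Date', 'Series', 'Court', 'Surface',
           'Round', 'Best of', 'Winner', 'Loser', 'WRank', 'LRank', 'WPts', 'LPts',
           'W1', 'L1', 'W2', 'L2', 'W3', 'L3', 'W4', 'L4', 'W5', 'L5', 'Wsets',
           'Lsets', 'Comment']

# Assumed convention: uppercase first letter, then letters, digits and single spaces
pattern = re.compile(r"[A-Z][A-Za-z0-9]*( [A-Za-z0-9]+)*")

# Names with stray whitespace or unexpected characters would be collected here
inconsistent = [name for name in columns
                if name != name.strip() or not pattern.fullmatch(name)]
# Repeated names would make the set smaller than the list
has_repeated_names = len(set(columns)) != len(columns)

print(inconsistent)
print(has_repeated_names)
```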

#### 4.2.3 Checking the dataset for missing values

In [9]:
# Showing the number of non-missing values per column (count() ignores nulls)
df_null = dropped_df.count()
df_null

ATP           2610
Location      2610
Tournament    2610
Date          2610
Series        2610
Court         2610
Surface       2610
Round         2610
Best of       2610
Winner        2610
Loser         2610
WRank         2606
LRank         2597
WPts          2607
LPts          2597
W1            2589
L1            2589
W2            2576
L2            2576
W3            1248
L3            1248
W4             265
L4             265
W5              96
L5              96
Wsets         2589
Lsets         2589
Comment       2610
dtype: int64

From the output we can see that there are missing values in every column from WRank to Lsets. The columns W1 to L5 are not going to be used in the analysis, but the reasons for the null values are explained below:
1. The tournament category, which determines whether matches are best of 5 sets or best of 3 sets
2. Player retirements or walkovers
3. The low ranking of some players

Nothing is done about these null values. Changing the data could distort the structure of the dataset, which is filled in consistently tournament by tournament, so the information is left unmodified.
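
The same information can be viewed from the other side, as the number and share of missing values per column. The sketch below uses a small hypothetical `sample` instead of the real dataset, and the "Walkover" label is an assumed value of the Comment column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: NaN marks a score that was never recorded
sample = pd.DataFrame({
    "W1": [6, 7, 6, np.nan],
    "W3": [np.nan, 6, np.nan, np.nan],
    "Comment": ["Completed", "Completed", "Completed", "Walkover"],
})

# Number of missing values per column and their share of all rows
missing = sample.isna().sum()
summary = pd.DataFrame({"missing": missing, "share": missing / len(sample)})

# Show only the columns that actually contain missing values
print(summary[summary["missing"] > 0])
```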


##### The tournament category: best of 5 sets or best of 3 sets
<b>Columns affected by this reason:</b> W1, L1, W2, L2, W3, L3, W4, L4, W5, L5

To begin with, let's look at the tennis tournament categories. In Grand Slam tournaments, the maximum number of sets playable in a match is 5, while in the other tournaments the maximum is 3. The maximum number of sets can be seen in the "Best of" column.

For ATP Finals, Masters 1000, 500 Series and 250 Series, the maximum number of playable sets is 3. As soon as a player wins two sets, the match ends and he is declared the winner.

In those cases the W1, L1, W2, L2, W3, L3 columns (the games won by the match winner and loser in each set) describe the match, as the players can play at most 3 sets. W4, L4, W5, L5 are not used for these series, so these columns contain <b>null values</b>. These null values cannot be changed or dropped, as that would destroy the structure of the dataset and make its values inconsistent.

For Grand Slams, the winner is the player who wins three sets. In those cases all the set columns of the dataset can be used, as the players are playing best of 5 sets.

In [10]:
# Getting the info using the column labels
k=dropped_df.loc[5:6,"Series":"L5"]
k

Unnamed: 0,Series,Court,Surface,Round,Best of,Winner,Loser,WRank,LRank,WPts,...,W1,L1,W2,L2,W3,L3,W4,L4,W5,L5
5,ATP250,Outdoor,Hard,1st Round,3,Uchiyama Y.,Humbert U.,185,102,275,...,6,4,7,6,,,,,,
6,ATP250,Outdoor,Hard,1st Round,3,Kudla D.,Fritz T.,63,49,810,...,7,6,6,7,6.0,4.0,,,,


From the output above, let's look at rows 5 and 6. In row 5 the match lasted 2 sets, as W1, L1, W2, L2 show, so W3, L3, W4, L4, W5, L5 are null: Uchiyama won two sets in a row, and the match was best of 3 in the ATP250 series (see the "Series" and "Best of" columns).

In row 6 the match lasted 3 sets, so the only null values are W4, L4, W5, L5, as the tournament type is ATP250.


In [11]:
# Getting the info using the integer position labels
k=dropped_df.iloc[2138:2140,5:25]
k

Unnamed: 0,Court,Surface,Round,Best of,Winner,Loser,WRank,LRank,WPts,LPts,W1,L1,W2,L2,W3,L3,W4,L4,W5,L5
2139,Outdoor,Hard,2nd Round,5,Bublik A.,Fabbiano T.,75,87,808,610,6,7,5,7,6,4,6.0,3.0,6.0,3.0
2140,Outdoor,Hard,2nd Round,5,Andujar P.,Sonego L.,70,49,850,1029,6,2,6,4,6,2,,,,


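These are examples of Grand Slam matches, where the players play best of 5 and the winner is the first to win 3 sets. The match in row 2139 went to five sets, so every set column is filled, while the match in row 2140 ended after three sets, so W4, L4, W5, L5 are null.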
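
The relation between the "Best of" column and the empty set columns can be checked with a short sketch. The `sample` rows below are hypothetical and only modelled on the outputs above; `late_sets` is a name introduced for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: two best-of-3 matches and two best-of-5 matches
sample = pd.DataFrame({
    "Best of": [3, 3, 5, 5],
    "W3": [np.nan, 6, 6, 6], "L3": [np.nan, 4, 4, 2],
    "W4": [np.nan, np.nan, 6, np.nan], "L4": [np.nan, np.nan, 3, np.nan],
    "W5": [np.nan, np.nan, 6, np.nan], "L5": [np.nan, np.nan, 3, np.nan],
})

late_sets = ["W4", "L4", "W5", "L5"]
best_of_3 = sample[sample["Best of"] == 3]

# True when no best-of-3 match has a value in the fourth or fifth set columns
all_late_sets_empty = bool(best_of_3[late_sets].isna().all().all())
print(all_late_sets_empty)
```

If such a check returned False on the real data, some best-of-3 match would carry a fourth or fifth set score, which would point to an entry error rather than to the tournament format.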

##### Player retirements or walkovers
<b>Columns affected by this reason:</b> W1, L1, W2, L2, W3, L3, W4, L4, W5, L5, Wsets, Lsets

Players can have injuries or personal problems before or during a match. In those cases they usually retire or give a walkover. A retirement is when a player stops playing during the match and concedes it, while a walkover is when a player does not show up for the match in time.

Such cases do occur in the dataset, and the data in these columns cannot be dropped or changed, as that would affect all the columns of the dataset.
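
A minimal sketch of how such matches could be located through the Comment column is given below. The `sample` rows are hypothetical, and the "Retired" and "Walkover" labels are assumed values; only "Completed" is confirmed by the outputs above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; "Retired" and "Walkover" are assumed Comment labels
sample = pd.DataFrame({
    "Comment": ["Completed", "Retired", "Walkover"],
    "W1": [6, 6, np.nan], "L1": [4, 3, np.nan],
    "W2": [6, np.nan, np.nan], "L2": [2, np.nan, np.nan],
    "Wsets": [2, 1, np.nan], "Lsets": [0, 0, np.nan],
})

# Matches that did not finish normally
not_completed = sample[sample["Comment"] != "Completed"]

# Number of empty score columns for each of those matches
missing_per_match = not_completed.set_index("Comment").isna().sum(axis=1)
print(missing_per_match)
```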

##### The low ranking of the players
<b>Columns affected by this reason:</b> WRank, LRank, WPts, LPts

Many professional players compete in tournaments, and sometimes players with very few or no ATP Ranking points qualify for a tournament. The ATP Rankings contain no data about those players, so their rankings are not available.

In the example below we can see that the ranking of Zayid M.S. is not available, because his ranking is very low.

In [12]:
# Getting one match (row 39) in which the loser has no ranking
k=dropped_df.loc[39,"Winner":"L5"]
k

Winner    Garcia-Lopez G.
Loser          Zayid M.S.
WRank                 105
LRank                <NA>
WPts                  543
LPts                 <NA>
W1                      6
L1                      1
W2                      6
L2                      3
W3                   <NA>
L3                   <NA>
W4                   <NA>
L4                   <NA>
W5                   <NA>
L5                   <NA>
Name: 39, dtype: object
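
All matches with an unranked player can be listed in the same way. In the sketch below the first row repeats the values shown above, while the second row ("Player A" against "Player B" and their rankings) is hypothetical.

```python
import pandas as pd

# First row taken from the output above, second row is hypothetical
sample = pd.DataFrame({
    "Winner": ["Garcia-Lopez G.", "Player A"],
    "Loser": ["Zayid M.S.", "Player B"],
    "WRank": pd.array([105, 20], dtype="Int16"),
    "LRank": pd.array([pd.NA, 50], dtype="Int16"),
})

# Keep the matches where at least one of the two rankings is missing
unranked = sample[sample["WRank"].isna() | sample["LRank"].isna()]
print(unranked)
```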

#### 4.2.4 Checking the dataset for duplicates


In [13]:
# duplicated() marks every row that repeats an earlier row with True
k = dropped_df.duplicated()
# any() is True if at least one row is marked as a duplicate
q = k.any()
if not q:
    print("There are no duplicates")
else:
    print("There are duplicates")

There are no duplicates
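
The check above compares complete rows, so two records of the same match that differ in a single cell would not be detected. A stricter variant, sketched below, compares only a few identifying columns. The choice of `key` columns is an assumption, and the `sample` rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows: the third repeats the first except for the Comment spelling
sample = pd.DataFrame({
    "Tournament": ["Event A", "Event A", "Event A"],
    "Round": ["1st Round", "1st Round", "1st Round"],
    "Winner": ["Player A", "Player C", "Player A"],
    "Loser": ["Player B", "Player D", "Player B"],
    "Comment": ["Completed", "Completed", "completed"],
})

# Full-row comparison: the differing Comment hides the repeated match
full_row_duplicates = bool(sample.duplicated().any())

# Comparison on the identifying columns only (assumed key)
key = ["Tournament", "Round", "Winner", "Loser"]
key_duplicates = bool(sample.duplicated(subset=key).any())

print(full_row_duplicates)
print(key_duplicates)
# keep=False shows every row involved in a repetition, not only the later ones
print(sample[sample.duplicated(subset=key, keep=False)])
```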
