# Analyzing the ATP Data

## 1. Project Description
### 1.1. Overview
The _Association of Tennis Professionals_ (_ATP_) is the global governing body of men’s professional tennis, a role comparable to FIFA’s in football. It runs the ATP Tour, whose tournaments showcase the world’s greatest players. More details are available on the [ATP website](https://www.atptour.com/en/corporate/about).

In this project, we will be exploring a dataset containing information about ATP players and matches in order to highlight some facts and extract some useful insights from this data.
### 1.2. Dataset Description
The dataset used in this project (from data.world) consists of two `csv` files: `atp_players.csv` and `atp_matches.csv`. The former contains information about the ATP players, while the latter contains information about the ATP matches from Jan. 3, 2000 to Aug. 15, 2021.

## 2. Data Wrangling
Before we begin, we need to import the necessary tools and libraries that we will use throughout the project.

In [128]:
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import pandas as pd
import re
import numbers

### 2.1. Data Loading
We will begin by loading the two files into pandas DataFrames.

In [129]:
players = pd.read_csv('atp_players.csv')
players.head(3)

Unnamed: 0,player_id,first_name,first_initial,last_name,full_name,player_url,flag_code,residence,birthplace,birthdate,...,birth_month,birth_day,turned_pro,weight_lbs,weight_kg,height_ft,height_inches,height_cm,handedness,backhand
0,a002,Ricardo,R,Acuna,Acuna R,http://www.atpworldtour.com/en/players/ricardo...,CHI,"Jupiter, FL, USA","Santiago, Chile",19580113,...,1.0,13.0,0.0,150.0,68.0,"5'9""",69.0,175.0,,
1,a001,Sadiq,S,Abdullahi,Abdullahi S,http://www.atpworldtour.com/en/players/sadiq-a...,NGR,,,19600202,...,2.0,2.0,0.0,0.0,0.0,"0'0""",0.0,0.0,,
2,a005,Nelson,N,Aerts,Aerts N,http://www.atpworldtour.com/en/players/nelson-...,BRA,,"Cachoeira Do Sul, Brazil",19630425,...,4.0,25.0,0.0,165.0,75.0,"6'2""",74.0,188.0,,


In [130]:
matches = pd.read_csv('atp_matches.csv')
matches.head(3)

Unnamed: 0,location,tournament,match_date,series,court,surface,round,best_of,winner,loser,...,l3,w4,l4,w5,l5,wsets,lsets,tourney_id,wpts,lpts
0,Adelaide,Australian Hardcourt Championships,2000-01-03,International,Outdoor,Hard,1st Round,3,Dosedel S.,Ljubicic I.,...,0,0,0,0,0,2.0,0.0,2000-001,,
1,Adelaide,Australian Hardcourt Championships,2000-01-03,International,Outdoor,Hard,1st Round,3,Enqvist T.,Clement A.,...,0,0,0,0,0,2.0,0.0,2000-001,,
2,Adelaide,Australian Hardcourt Championships,2000-01-03,International,Outdoor,Hard,1st Round,3,Escude N.,Baccanello P.,...,3,0,0,0,0,2.0,1.0,2000-001,,
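
The `Unnamed: 0` column in both previews looks like a row index that was saved along with the data, and `match_date` is read as plain text. The following is a minimal sketch, assuming that reading of `Unnamed: 0` is correct, of how `pd.read_csv` could absorb the index and parse the dates at load time. The inline CSV is a made-up two-row stand-in for `atp_matches.csv`, and `matches_sample` is a hypothetical name used only here.

```python
import io

import pandas as pd

# Made-up stand-in for atp_matches.csv (only a few of the real columns)
sample_csv = io.StringIO(
    ",location,match_date,winner,loser\n"
    "0,Adelaide,2000-01-03,Dosedel S.,Ljubicic I.\n"
    "1,Adelaide,2000-01-03,Enqvist T.,Clement A.\n"
)

matches_sample = pd.read_csv(
    sample_csv,
    index_col=0,                 # use the saved index instead of creating 'Unnamed: 0'
    parse_dates=['match_date'],  # parse the date strings into datetime64 values
    low_memory=False,            # infer each column's dtype from the whole file
)
print(matches_sample.dtypes)
```

Reading the whole file in one pass with `low_memory=False` is also the usual way to avoid mixed-type warnings on large files, at the cost of higher memory use.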


### 2.2. Data Cleaning
In this part, we will get the data ready for the analysis process. In particular, we will do the following:
- Remove any duplicate rows.
- Remove the unwanted columns from both files.
- Put every column in a proper format.
- Identify how we will deal with missing data. This step is postponed till the end, once we have a deeper insight into the structure of the data.

#### 2.2.1. Removing Duplicate Values

_For `players.csv`_

In [131]:
# View the duplicate rows in `players.csv`
players[players.duplicated()]

Unnamed: 0,player_id,first_name,first_initial,last_name,full_name,player_url,flag_code,residence,birthplace,birthdate,...,birth_month,birth_day,turned_pro,weight_lbs,weight_kg,height_ft,height_inches,height_cm,handedness,backhand
10000,v007,Jerome,J,Vanier,Vanier J,http://www.atpworldtour.com/en/players/jerome-...,FRA,,"Boulogne, France",19571102,...,11.0,2.0,,154.0,70.0,"5'9""",69.0,175.0,,


In [132]:
# Remove the duplicates in `players.csv`
players.drop_duplicates(inplace=True)
# Check that no duplicate rows remain, then preview the result
assert not players.duplicated().any()
players.head()

Unnamed: 0,player_id,first_name,first_initial,last_name,full_name,player_url,flag_code,residence,birthplace,birthdate,...,birth_month,birth_day,turned_pro,weight_lbs,weight_kg,height_ft,height_inches,height_cm,handedness,backhand
0,a002,Ricardo,R,Acuna,Acuna R,http://www.atpworldtour.com/en/players/ricardo...,CHI,"Jupiter, FL, USA","Santiago, Chile",19580113,...,1.0,13.0,0.0,150.0,68.0,"5'9""",69.0,175.0,,
1,a001,Sadiq,S,Abdullahi,Abdullahi S,http://www.atpworldtour.com/en/players/sadiq-a...,NGR,,,19600202,...,2.0,2.0,0.0,0.0,0.0,"0'0""",0.0,0.0,,
2,a005,Nelson,N,Aerts,Aerts N,http://www.atpworldtour.com/en/players/nelson-...,BRA,,"Cachoeira Do Sul, Brazil",19630425,...,4.0,25.0,0.0,165.0,75.0,"6'2""",74.0,188.0,,
3,a004,Egan,E,Adams,Adams E,http://www.atpworldtour.com/en/players/egan-ad...,USA,"Palmetto, FL, USA","Miami Beach, FL, USA",19590615,...,6.0,15.0,0.0,160.0,73.0,"5'10""",70.0,178.0,,
4,a006,Ronald,R,Agenor,Agenor R,http://www.atpworldtour.com/en/players/ronald-...,USA,"Beverly Hills, California, USA","Rabat, Morocco",19641113,...,11.0,13.0,1983.0,180.0,82.0,"5'11""",71.0,180.0,,
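
Called without arguments, `duplicated()` only flags rows that are identical in every column. The sketch below is a hedged illustration of a stricter, key-based check. It assumes `player_id` is meant to identify a player uniquely, which the dataset description does not state, and it runs on a made-up toy frame rather than the real file.

```python
import pandas as pd

# Toy stand-in for the players table; the third row repeats a player_id
# but differs in another column, so a whole-row check would miss it
toy_players = pd.DataFrame({
    'player_id': ['a001', 'a002', 'a002'],
    'full_name': ['Abdullahi S', 'Acuna R', 'Acuna R'],
    'height_cm': [0.0, 175.0, 176.0],
})

# Whole-row duplicates: rows that match an earlier row in every column
full_row_dupes = toy_players[toy_players.duplicated()]

# Key-based duplicates: keep=False marks every row that shares a player_id
key_dupes = toy_players[toy_players.duplicated(subset='player_id', keep=False)]
print(len(full_row_dupes), len(key_dupes))
```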


_For `matches.csv`_

In [133]:
# View the duplicate rows in `matches.csv`
matches[matches.duplicated()]

Unnamed: 0,location,tournament,match_date,series,court,surface,round,best_of,winner,loser,...,l3,w4,l4,w5,l5,wsets,lsets,tourney_id,wpts,lpts


#### 2.2.2. Removing Unwanted Columns

_For `players.csv`_

In [134]:
columns_to_remove = ['first_name', 'first_initial', 'last_name', 
        'flag_code', 'residence', 'birthdate', 'birth_year',
        'birth_month', 'birth_day', 
        'weight_lbs', 'height_ft', 'height_inches']
players.drop(columns=columns_to_remove, inplace=True)
players.head(3)

Unnamed: 0,player_id,full_name,player_url,birthplace,turned_pro,weight_kg,height_cm,handedness,backhand
0,a002,Acuna R,http://www.atpworldtour.com/en/players/ricardo...,"Santiago, Chile",0.0,68.0,175.0,,
1,a001,Abdullahi S,http://www.atpworldtour.com/en/players/sadiq-a...,,0.0,0.0,0.0,,
2,a005,Aerts N,http://www.atpworldtour.com/en/players/nelson-...,"Cachoeira Do Sul, Brazil",0.0,75.0,188.0,,


_For `matches.csv`_

In [135]:
columns_to_remove = ['tourney_id']
matches.drop(columns=columns_to_remove, inplace=True)
matches.head(3)

Unnamed: 0,location,tournament,match_date,series,court,surface,round,best_of,winner,loser,...,w3,l3,w4,l4,w5,l5,wsets,lsets,wpts,lpts
0,Adelaide,Australian Hardcourt Championships,2000-01-03,International,Outdoor,Hard,1st Round,3,Dosedel S.,Ljubicic I.,...,0,0,0,0,0,0,2.0,0.0,,
1,Adelaide,Australian Hardcourt Championships,2000-01-03,International,Outdoor,Hard,1st Round,3,Enqvist T.,Clement A.,...,0,0,0,0,0,0,2.0,0.0,,
2,Adelaide,Australian Hardcourt Championships,2000-01-03,International,Outdoor,Hard,1st Round,3,Escude N.,Baccanello P.,...,6,3,0,0,0,0,2.0,1.0,,


#### 2.2.3. Putting Every Column in a Proper Format

_For `players.csv`_

In [136]:
players['handedness'].unique()

array([nan, 'Right-Handed', 'Left-Handed', 'Ambidextrous'], dtype=object)

The output shows that some players are labeled `Right-Handed`, some `Left-Handed`, and a few `Ambidextrous`, while the rest have this feature missing. We will replace `Right-Handed` and `Left-Handed` with '1' and '0' respectively, and treat `Ambidextrous` as missing.

In [111]:
# Encode handedness as the strings '1' (right) and '0' (left)
players['handedness'] = players['handedness'].replace({'Right-Handed': '1', 'Left-Handed': '0'})
# Treat 'Ambidextrous' as missing in this column only; calling players.where(...)
# on the whole frame would blank every column of those rows, including player_id
players['handedness'] = players['handedness'].mask(players['handedness'] == 'Ambidextrous')

right_count = (players['handedness'] == '1').sum()
left_count = (players['handedness'] == '0').sum()
data_available = right_count + left_count
# Rows without a usable handedness value
nan_count = len(players) - data_available

# Share of right-handed players among those with known handedness
print(right_count / data_available)
# Share of all players with known handedness
print(data_available / len(players))

0.8640776699029126
0.10384016130510494


It turns out that the players with a recorded handedness make up only about 10% of the dataset, and roughly 86% of those are right-handed. Given how small that portion is, it seems best to perform any handedness analysis only on the rows where the value is available. Since this column is mostly `NaN`, we will leave it formatted as `str` instead of converting it to `int`.
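
If an integer encoding is still wanted despite the gaps, pandas provides nullable integer dtypes that can hold missing values. The following is a minimal sketch of that alternative, not the approach taken in this notebook. It assumes the same mapping as above (1 for right, 0 for left) and uses a made-up toy series in place of `players['handedness']`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for players['handedness'] before any replacement
handedness = pd.Series(['Right-Handed', np.nan, 'Left-Handed', 'Ambidextrous'])

# Values absent from the mapping (NaN, 'Ambidextrous') come out as missing
encoded = handedness.map({'Right-Handed': 1, 'Left-Handed': 0}).astype('Int8')

print(encoded.isna().sum())  # number of rows without a usable value
print(encoded.mean())        # share of right-handers among the known values
```

Because missing entries are skipped by `mean()`, the second figure is computed over the known values only, which matches the first ratio printed in the cell above.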

_For `matches.csv`_
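
No formatting step is recorded here for the matches table. As a hedged sketch of what one could look like, the example below assumes that `match_date` holds ISO-style strings, as in the preview, and that text columns such as `surface` take only a handful of distinct values. The toy frame and its `'not a date'` entry are made up for illustration.

```python
import pandas as pd

# Toy stand-in for a few columns of the matches table
toy_matches = pd.DataFrame({
    'match_date': ['2000-01-03', '2000-01-04', 'not a date'],
    'surface': ['Hard', 'Hard', 'Clay'],
    'best_of': [3, 3, 5],
})

# Parse ISO-style date strings; unparseable entries become NaT instead of raising
toy_matches['match_date'] = pd.to_datetime(
    toy_matches['match_date'], format='%Y-%m-%d', errors='coerce'
)
# Low-cardinality text columns can be stored as categoricals
toy_matches['surface'] = toy_matches['surface'].astype('category')

print(toy_matches.dtypes)
```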

#### 2.2.4. Dealing with Missing and Corrupted Data

_For `players.csv`_

We will begin by counting the `NaN` values in each column.

In [137]:
for col in players.columns:
    print(f"{col}:{players[col].isna().sum()}")

player_id:0
full_name:0
player_url:0
birthplace:7874
turned_pro:9378
weight_kg:8206
height_cm:8254
handedness:9775
backhand:9775


In [138]:
# Keep only the rows where the identifying columns are present
players = players.dropna(subset=['player_id', 'full_name', 'player_url'])

for col in players.columns:
    print(f"{col}:{players[col].isna().sum()}")

player_id:0
full_name:0
player_url:0
birthplace:7874
turned_pro:9378
weight_kg:8206
height_cm:8254
handedness:9775
backhand:9775
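
The counts above only cover values stored as `NaN`. The earlier previews also show `0.0` in `turned_pro`, `weight_kg`, and `height_cm` (one player has a height of `0'0"`), which suggests that zero is used as a placeholder for unknown values. The sketch below assumes that interpretation holds for these three columns and converts the zeros to `NaN` on a toy frame built from the previewed values.

```python
import numpy as np
import pandas as pd

# Toy stand-in built from values visible in the previews above
toy_players = pd.DataFrame({
    'player_id': ['a002', 'a001', 'a006'],
    'turned_pro': [0.0, 0.0, 1983.0],
    'weight_kg': [68.0, 0.0, 82.0],
    'height_cm': [175.0, 0.0, 180.0],
})

# Assumption: a zero in these columns means "unknown", not a real measurement
placeholder_cols = ['turned_pro', 'weight_kg', 'height_cm']
toy_players[placeholder_cols] = toy_players[placeholder_cols].replace(0, np.nan)

print(toy_players.isna().sum())
```

With the placeholders converted, the per-column `NaN` counts reflect all unusable values rather than only the explicitly missing ones.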


_For `matches.csv`_

## 3. Exploratory Data Analysis

## 4. Conclusions
### 4.1. Observations Summary

### 4.2. Final Thoughts