<span>
<b>Authors:</b> 
<a href="http://------">Ornela Danushi </a>
<a href="http://------">Gerlando Gramaglia </a>
<a href="http://------">Domenico Profumo </a><br/>
<b>Python version:</b>  3.x<br/>
</span>

# Data Understanding & Preparation on Tennis Matches dataset 
Explore the dataset by studying the data quality, their distribution among several different features and the correlations.

**Pandas** is the **central component** of the data science toolkit and is used in conjunction with the other libraries in that collection. Pandas is built on top of the **NumPy package**, so much of NumPy's structure is used or replicated in Pandas. Data in Pandas is often used to feed statistical analysis in **SciPy**, plotting functions from **Matplotlib**, and machine learning algorithms in Scikit-learn.

**Install Pandas**

Pandas is easy to install. Open your terminal (macOS) or command line (Windows) and run either of the following commands: **conda install pandas** or **pip install pandas**.

Alternatively, if you're using a Jupyter notebook, you can run a cell with: **!pip install pandas**

In [1]:

import pandas as pd
import math
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
from collections import defaultdict
from scipy.stats import pearsonr

In [2]:
#df.dtypes #return the type of each attribute but is already included in the df.info() called later

# Types of Attributes and basic checks 
## Data Quality with reference to Syntactic Accuracy

Check the data integrity, that is, whether there are any empty cells or corrupted data.
For this purpose we use the Pandas function **info()**, which reports the number of non-null values in each column, the data type of each column, the number of columns per data type, and the number of observations (rows).

We also check whether each attribute is syntactically correct according to the specifications.

In [3]:
df = pd.read_csv('matches_with_gender.csv', index_col = 0)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 186128 entries, 0 to 186127
Data columns (total 51 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   tourney_id          186073 non-null  object 
 1   tourney_name        186103 non-null  object 
 2   surface             185940 non-null  object 
 3   draw_size           186099 non-null  float64
 4   tourney_level       186099 non-null  object 
 5   tourney_date        186100 non-null  float64
 6   match_num           186101 non-null  float64
 7   winner_id           186073 non-null  float64
 8   winner_entry        25827 non-null   object 
 9   winner_name         186101 non-null  object 
 10  winner_hand         186082 non-null  object 
 11  winner_ht           49341 non-null   float64
 12  winner_ioc          186099 non-null  object 
 13  winner_age          183275 non-null  float64
 14  loser_id            186100 non-null  float64
 15  loser_entry         44154 non-null
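
The non-null counts reported by `info()` can also be turned into an explicit per-column count of missing values. The following is a minimal sketch on a toy frame; the values and the `toy` and `summary` names are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the matches table (values are illustrative only).
toy = pd.DataFrame({
    "tourney_id": ["2019-M020", None, "2019-0451"],
    "winner_ht": [178.0, np.nan, np.nan],
    "surface": ["Hard", "Clay", None],
})

# Count missing values per column and express them as a share of all rows.
missing = toy.isna().sum()
summary = pd.DataFrame({
    "missing": missing,
    "missing_pct": (missing / len(toy) * 100).round(1),
}).sort_values("missing", ascending=False)
print(summary)
```

Sorting by the number of missing values puts the sparsest columns first, which helps decide where imputation or removal deserves attention.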

## Classification of Data Domain

### Tourney

     - *tourney_id* - ***object*** 
     - *tourney_name* - ***object*** 
     - *tourney_level* - ***object***
     - *tourney_spectators* - ***float64*** 
     - *tourney_revenue* - ***float64***
    
 ### Matches
 
     - *match_num*- ***float64***
     - *surface* - ***object***
     - *draw_size* - ***float64*** => ***int64*** 
     - *tourney_date* - ***float64*** => ***Datetime64***
     - *minutes* - ***float64*** 
     - *score* - ***object***
     - *round* - ***object***
     - *best_of* - ***float64***

 ### Players
 
    - *winner_id*- ***float64***           - *loser_id*- ***float64*** 
    - *winner_name* - ***object***         - *loser_name* - ***object***
    - *winner_ioc* - ***object***          - *loser_ioc* - ***object*** 
    - *winner_ht*- ***float64***           - *loser_ht*- ***float64*** 
    - *winner_age*- ***float64***          - *loser_age*- ***float64*** 
    - *winner_hand* - ***object***         - *loser_hand* - ***object***
    - *winner_entry* - ***object***        - *loser_entry* - ***object***
    - *winner_rank* - ***float64***        - *loser_rank* - ***float64***
    - *winner_rank_points* - ***float64*** - *loser_rank_points* - ***float64***
    - *w_ace* - ***float64***              - *l_ace* - ***float64***
    - *w_df* - ***float64***               - *l_df* - ***float64***
    - *w_svpt* - ***float64***             - *l_svpt* - ***float64***
    - *w_1stIn* - ***float64***            - *l_1stIn* - ***float64***
    - *w_1stWon* - ***float64***           - *l_1stWon* - ***float64***
    - *w_2ndWon* - ***float64***           - *l_2ndWon* - ***float64***
    - *w_SvGms* - ***float64***            - *l_SvGms* - ***float64***
    - *w_bpSaved* - ***float64***          - *l_bpSaved* - ***float64***
    - *w_bpFaced* - ***float64***          - *l_bpFaced* - ***float64*** 
    - *winner_gender* - ***object***      - *loser_gender* - ***object*** 
 

## Tourney


### tourney_id

Identifies the tournament. It has to be unique per tournament; the first four characters are always the year, while the remaining part is arbitrary.

Analysis:

Split the value of the attribute into two parts: the first four characters (the year) and the remaining characters (the id).
For each part, build a set of the distinct years and ids.
Count how many years are invalid, i.e. NaN.
Detecting wrong ids is more difficult, since they are arbitrary and do not follow a known structure.
We can only conclude that the wrong ids are those that come with a wrong year.

NOTE: The decision to take is whether to delete or edit the rows with these invalid values.

In [4]:
print(df['tourney_id'].describe())
tourney_year= df['tourney_id'].str[:4]
tourney_id= df['tourney_id'].str[5:]

tourney_year_set= set()
wrong_year= 0
#which_id=0
for i in tourney_year:    
    if math.isnan(float(i)):
        wrong_year += 1
        #print(tourney_id[which_id])
    else:
        tourney_year_set.add(i)
    #which_id += 1
print("Present years: "+ str(tourney_year_set)) #{'2018', '2019', '2016', '2020', '2021', '2017'}
print("Wrong years counting: "+ str(wrong_year)) #55

tourney_id_set=set()
for i in tourney_id:
    tourney_id_set.add(i)
#print(tourney_id_set)  # how can we detect NaN if the values are not necessarily numeric? These NaN values are among those already found through the NaN years

#print("\t"+"tourney_year" +"\t"+" tourney_id")
#print(+ tourney_year + "             " + tourney_id)

count       186073
unique        4853
top       2018-560
freq           478
Name: tourney_id, dtype: object
Present years: {'2019', '2018', '2021', '2017', '2016', '2020'}
Wrong years counting: 55
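
The same split can be sketched without an explicit loop by using a regular expression. This assumes the id has the shape "four-digit year, hyphen, remainder", as in `2018-560`; the sample ids below are made up for illustration.

```python
import pandas as pd

# Illustrative ids: two well-formed, one missing, one with a corrupted year.
ids = pd.Series(["2018-560", "2019-M020", None, "20a9-0451", "2021-W-ITF-TUR-01A"])

# Capture a four-digit year, a hyphen, and a non-empty remainder.
parts = ids.str.extract(r"^(?P<year>\d{4})-(?P<code>.+)$")

# Rows where the pattern did not match (including missing ids) have a NaN year.
invalid = parts["year"].isna()
print("Years found:", sorted(parts["year"].dropna().unique()))
print("Invalid ids:", int(invalid.sum()))
```

Compared with the loop, this also flags years that are present but malformed, not only missing ones.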


### tourney_name
The name of the tourney.

Analysis: just by viewing the values we can see some NaN entries that need to be investigated. We list all the unique names.

In [5]:
print("Distinct Values in tourney_name:", df.tourney_name.unique())
df.tourney_name.describe() 

Distinct Values in tourney_name: ['Brisbane' 'Doha' 'Pune' ... 'W100 Nicholasville KY' 'W25 Las Vegas NV'
 nan]


count          186103
unique           2488
top       W15 Antalya
freq             4634
Name: tourney_name, dtype: object

### tourney_level
The level codes differ between men's and women's events.

○ For men: 'G' = Grand Slams, 'M' = Masters 1000s, 'A' = other tour-level
events, 'C' = Challengers, 'S' = Satellites/ITFs, 'F' = Tour finals and other
season-ending events, and 'D' = Davis Cup.

○ For women, there are several additional tourney_level codes, including 'P' =
Premier, 'PM' = Premier Mandatory, and 'I' = International. The various levels
of ITFs are given by the prize money (in thousands), such as '15' = ITF
$15,000. Other codes, such as 'T1' for Tier I (and so on) are used for older
WTA tournament designations. 'D' is used for the Federation/Fed/Billie Jean
King Cup, and also for the Wightman Cup and Bonne Bell Cup.

○ There are also some competitions that can be for both men and women: 'E' =
exhibition (events not sanctioned by the tour, though the definitions can be
ambiguous), 'J' = juniors, and 'T' = team tennis, which does not yet appear
anywhere in the dataset but will at some point.


Analysis:

In [6]:
tourney_level=df['tourney_level']

print("Distinct Values in tourney_level: \t", df.tourney_level.unique())
print(df.tourney_level.describe())



Distinct Values in tourney_level: 	 ['A' 'P' 'G' 'I' 'M' 'PM' 'F' 'D' 'C' '15' '25' '60' '100' '80' '10' '50'
 '75' 'O' 'W' nan]
count     186099
unique        19
top           15
freq       45807
Name: tourney_level, dtype: object
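
To check the observed codes against the description above, each value can be given a coarse label. This is only a sketch: the label names and the `classify_level` helper are hypothetical, and the set of documented codes is copied from the text above.

```python
import numpy as np
import pandas as pd

# Letter codes listed in the description above.
DOCUMENTED = {"G", "M", "A", "C", "S", "F", "D", "P", "PM", "I", "E", "J", "T"}

def classify_level(code):
    """Return a coarse label for a single tourney_level value."""
    if pd.isna(code):
        return "missing"
    if code in DOCUMENTED:
        return "documented"
    if code.isdigit():
        return "itf_prize_money"  # e.g. '15' = ITF $15,000
    if code[0] == "T" and code[1:].isdigit():
        return "older_wta_tier"   # e.g. 'T1' = Tier I
    return "undocumented"

levels = pd.Series(["A", "PM", "15", "100", "O", "W", np.nan])
labels = levels.map(classify_level)
print(labels.value_counts())
```

With this labelling, codes such as 'O' and 'W', which appear in the output above but not in the description, come out as undocumented and can be investigated separately.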


'''
tourney_level_set= set()
wrong_tourneylevel= 0
men_levels=['G','M','A','C','S','F','D']
women_levels=['P','PM','I','ITF','WTA','D']
both_levels=['E','J','T']
#women_levels.append(men_levels)

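# Gender inferred from the level code: 'M' (men), 'W' (women), '-' (unknown).
# Assumption: the unfinished branch was meant to test women_levels;
# 'D' is in both lists and is therefore labelled 'M' here.
gender=[]

for i in tourney_level:
    if pd.isna(i):
        wrong_tourneylevel+=1
        gender.append('-')
    else:
        if i in men_levels:
            gender.append('M')
        elif i in women_levels:
            gender.append('W')
        else:
            gender.append('-')
        tourney_level_set.add(i)
    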
    
print("tourney_level_set: " + str(tourney_level_set)) # {'M', 'W', 'PM', '100', 'D', 'O', '15', 'I', 'F', '60', 'G', '80', '50', 'A', 'P', '25', '75', 'C', '10'}
print("wrong_tourneylevel: " + str(wrong_tourneylevel)) # 29
'''

## tourney_spectators and tourney_revenue

In [7]:
print(df['tourney_spectators'].describe())
#print(df['tourney_spectators'].unique())
print(df['tourney_revenue'].describe())
#print(df['tourney_revenue'].unique())

count    186101.000000
mean       4108.569153
std        2707.042984
min          91.000000
25%        2836.000000
50%        3340.000000
75%        4008.000000
max       18086.000000
Name: tourney_spectators, dtype: float64
count    1.861020e+05
mean     8.226442e+05
std      6.008570e+05
min      1.786574e+04
25%      5.473662e+05
50%      6.633297e+05
75%      8.340290e+05
max      5.002794e+06
Name: tourney_revenue, dtype: float64


## Matches

### match_num

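A match-specific identifier. Often starting from 1, sometimes counting down from 300, and sometimes arbitrary.

Analysis: the values can be converted to integers, with NaN replaced by 0, because valid values were expected to lie in the range [1, 300]. Note, however, that the observed maximum is 8312, so this range needs to be checked before relying on it.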

In [8]:
print(df['match_num'].describe())
print("null values:",  df['match_num'].isnull().sum())


count    186101.000000
mean        160.627992
std         289.326473
min           1.000000
25%          17.000000
50%         131.000000
75%         272.000000
max        8312.000000
Name: match_num, dtype: float64
null values: 27
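
The integer conversion discussed above can be sketched in two ways. The first replaces NaN with 0, as proposed; the second is an alternative (not the approach chosen in this notebook) that uses the nullable `Int64` dtype of pandas so that missing values stay missing. The sample values are illustrative.

```python
import numpy as np
import pandas as pd

match_num = pd.Series([300.0, 299.0, np.nan, 1.0])

# Option described above: replace NaN with 0 and cast to a plain integer dtype.
as_int = match_num.fillna(0).astype("int64")

# Alternative: the nullable integer dtype keeps the missing value as <NA>.
as_nullable = match_num.astype("Int64")

print(as_int.tolist())
print(as_nullable.isna().sum())
```

The nullable variant avoids introducing a sentinel value that later statistics (mean, min) would silently include.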


### surface
The kind of surface the match is played on.

Analysis: some missing (NaN) values were detected.

In [9]:
print(df['surface'].unique())
print(df['surface'].describe())
print("null values:",  df['surface'].isnull().sum())

['Hard' 'Clay' 'Grass' 'Carpet' nan]
count     185940
unique         4
top         Hard
freq       95243
Name: surface, dtype: object
null values: 188


### tourney_date

Eight digits, YYYYMMDD, usually the Monday of the tournament week.

Analysis:
we convert the type of tourney_date to Datetime64 and then keep only the date part.

In [10]:
df['tourney_date'] = pd.to_datetime(df['tourney_date'], format='%Y%m%d')
df['tourney_date'] = [x.date() for x in df.tourney_date]
#df.tourney_date
#print("convert into Datatime64", df['tourney_date'])
#print("Distinct Values in tourney_date: \t", df.tourney_date.unique())
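
Because the column is stored as float64, a value such as `20190107.0` may not always be parsed the same way across pandas versions. A more defensive sketch, under the assumption that every non-missing value is a whole number, goes through integer strings first; invalid dates are turned into `NaT` instead of raising. The sample values are illustrative.

```python
import numpy as np
import pandas as pd

raw = pd.Series([20190107.0, 20181231.0, np.nan, 20191301.0])

# Drop missing values, cast to integer strings such as '20190107', parse them,
# then restore the original index so that missing rows become NaT.
text = raw.dropna().astype("int64").astype(str)
dates = pd.to_datetime(text, format="%Y%m%d", errors="coerce").reindex(raw.index)
print(dates)
```

Here the last value has month 13, so it is coerced to `NaT` together with the missing entry.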



### draw_size

Number of players in the draw, often rounded up to the nearest power of 2. (For instance, a tournament with 28 players may be shown as 32.)

Analysis: since all powers of 2 are integers, the idea is to convert the column to the 'int' format. Missing values would make the conversion fail, so one option is to treat them as 0 (to be checked whether this is a correct solution).

In [11]:
#print("Distinct Values in draw_size: \t", df.draw_size.unique())

# Round each draw size up to the nearest power of 2 (NaN stays NaN).
df['draw_size'] = 2 ** np.ceil(np.log2(df['draw_size']))
df['draw_size'].unique()

array([ 32., 128.,  64.,   8.,   4.,  16.,   2.,  nan])

In [12]:
print("Distinct Values in surface: \t", df.surface.unique())
print("Distinct Values in tourney_level: \t", df.tourney_level.unique())
print("Distinct Values in winner_entry: \t", df.winner_entry.unique())
print("Distinct Values in best_of: \t", df.best_of.unique())
print("Distinct Values in winner_hand: \t", df.winner_hand.unique())


Distinct Values in surface: 	 ['Hard' 'Clay' 'Grass' 'Carpet' nan]
Distinct Values in tourney_level: 	 ['A' 'P' 'G' 'I' 'M' 'PM' 'F' 'D' 'C' '15' '25' '60' '100' '80' '10' '50'
 '75' 'O' 'W' nan]
Distinct Values in winner_entry: 	 [nan 'PR' 'Q' 'WC' 'Alt' 'LL' 'SE' 'ALT' 'SR' 'JE' 'A' 'ITF' 'P' 'I' 'IR'
 'JR']
Distinct Values in best_of: 	 [ 3.  5. nan]
Distinct Values in winner_hand: 	 ['R' 'L' 'U' nan]


## winner and loser ht

Height in centimetres, where available.

Analysis: we check whether the attributes contain negative values. There are none, but we find some outliers (for example a height of 2 cm) and NaN values.

In [13]:
print(df['loser_ht'].describe())
print(df['winner_ht'].describe())
print(df['loser_ht'].unique())
print(df['winner_ht'].unique())

count    38348.00000
mean       181.56308
std         10.81565
min          2.00000
25%        175.00000
50%        183.00000
75%        188.00000
max        211.00000
Name: loser_ht, dtype: float64
count    49341.000000
mean       181.407106
std         11.630899
min          2.000000
25%        175.000000
50%        183.000000
75%        188.000000
max        211.000000
Name: winner_ht, dtype: float64
[198. 188. 183. 196.  nan 190. 193. 180. 181. 170. 184. 166. 178. 162.
 163. 172. 174. 179. 182. 211. 185. 155. 165. 177. 175. 208. 206. 161.
 168. 169. 173. 171. 164. 203. 176. 159. 167. 191. 189. 157.   2. 194.
 160. 145.]
[178. 198. 188. 183. 196.  nan 190. 193. 180. 174. 184. 182. 185. 177.
 175. 170. 203. 208. 211. 168. 169. 172. 163. 179. 166. 173. 162. 176.
 181. 191. 159. 155. 206. 164. 157. 171. 165. 167. 161. 189. 194.   2.
 160. 145.]
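
One way to handle values such as a height of 2 cm is to flag everything outside a plausibility window and treat it as missing. The window below is a hypothetical choice for illustration, not a threshold taken from the dataset documentation.

```python
import numpy as np
import pandas as pd

heights = pd.Series([178.0, 2.0, np.nan, 211.0, 145.0])

# Hypothetical plausibility window in centimetres; tune it to the data.
MIN_HT, MAX_HT = 140, 230

# between() is False for NaN, so require a non-missing value when flagging.
implausible = heights.notna() & ~heights.between(MIN_HT, MAX_HT)
cleaned = heights.mask(implausible)  # implausible heights become NaN
print(int(implausible.sum()), cleaned.tolist())
```

Masking rather than dropping keeps the match row available for the other attributes.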


## Winner and loser ioc 

three-character country code

In [14]:
print(df['winner_ioc'].describe())
print(df['loser_ioc'].describe())
print(df['winner_ioc'].unique())
print(df['loser_ioc'].unique())

count     186099
unique       124
top          USA
freq       16464
Name: winner_ioc, dtype: object
count     186102
unique       154
top          USA
freq       16728
Name: loser_ioc, dtype: object
['JPN' 'RUS' 'FRA' 'AUS' 'CAN' 'BUL' 'GBR' 'SRB' 'USA' 'LAT' 'CZE' 'EST'
 'UKR' 'NED' 'CRO' 'BLR' 'CHI' 'SUI' 'POL' 'GER' 'LUX' 'ESP' 'ITA' 'GEO'
 'HUN' 'LTU' 'ARG' 'CYP' 'BIH' 'RSA' 'BEL' 'TUN' 'IND' 'BRA' 'AUT' 'POR'
 'NZL' 'URU' 'GRE' 'SVK' 'TPE' 'KAZ' 'PUR' 'KOR' 'ROU' 'MDA' 'SLO' 'CHN'
 'SWE' 'DEN' 'TUR' 'ESA' 'BAR' 'UZB' 'MNE' 'BOL' 'NOR' 'ECU' 'MEX' 'COL'
 'LIE' 'ISR' 'PAR' 'DOM' 'FIN' 'GUA' 'PER' 'INA' 'THA' 'PHI' 'EGY' 'ALG'
 'ZIM' 'PAK' 'MAR' 'HKG' 'IRL' 'LIB' 'SRI' 'VEN' 'MKD' 'PNG' 'SIN' 'GRN'
 'BAH' 'CUB' 'TRI' 'OMA' 'MLT' 'KGZ' 'MAS' 'BDI' 'MRI' 'SAM' 'KEN' 'ARM'
 'NAM' 'REU' 'UNK' 'MON' 'HAI' 'VIE' 'HON' 'PAN' 'CRC' 'SGP' 'TJK' 'POC'
 'IRI' 'PHL' 'MGL' 'GUM' 'GAB' 'NGR' 'GUD' 'CAM' 'CMR' 'KUW' 'MAD' 'DEU'
 'AND' 'NLD' 'NGA' 'GRC' nan]
['RUS' 'FRA' 'AUS' 'CAN' 'JPN' 'BUL' 'GBR

## winner and loser age

The age of the player, in years, as of the date of the
tournament.


What does the decimal part represent?

In [15]:
print(df['winner_age'].describe())
print(df['loser_age'].describe())
print(df['winner_age'].unique())
print(df['loser_age'].unique())

count    183275.000000
mean         23.963517
std           4.462318
min          14.042437
25%          20.492813
50%          23.457906
75%          26.869268
max          95.000000
Name: winner_age, dtype: float64
count    179590.000000
mean         23.765932
std           4.629857
min          14.006845
25%          20.131417
50%          23.227926
75%          26.767967
max          74.485969
Name: loser_age, dtype: float64
[95.         22.88569473 29.00479124 ... 15.29089665 15.4880219
 33.51403149]
[22.88569473 33.70568104 31.88227242 ... 41.46201232 46.85284052
 14.14921287]
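
One plausible reading of the decimal part, stated here as an assumption rather than a documented fact, is that the age is the number of days since birth divided by 365.25. The sketch below checks this on three values taken from the outputs above: multiplying by 365.25 should give (almost) a whole number of days.

```python
# Assumed conversion factor: average length of a year in days.
DAYS_PER_YEAR = 365.25

# Minimum winner_age, one sample winner_age, and minimum loser_age from above.
ages = [14.042437, 22.88569473, 14.006845]

days = [age * DAYS_PER_YEAR for age in ages]
# Distance of each product from the nearest whole number of days.
residuals = [abs(d - round(d)) for d in days]

for age, d, r in zip(ages, days, residuals):
    print(f"{age} years -> {d:.4f} days (residual {r:.4f})")
```

If the residuals stay close to zero on the full column as well, the decimal part can be interpreted as the fraction of a year elapsed since the last birthday.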


## minutes 

Match length in minutes, where available.

From this first analysis, the maximum value (4756 minutes) points to some outliers.

In [16]:
print(df['minutes'].describe())

count    81660.000000
mean        97.675753
std         41.492701
min          0.000000
25%         72.000000
50%         91.000000
75%        119.000000
max       4756.000000
Name: minutes, dtype: float64
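
A common way to make "outlier" concrete is Tukey's rule based on the interquartile range. The sketch below uses a hypothetical helper `iqr_bounds` and made-up durations; the multiplier `k=1.5` is the conventional default, not a value prescribed by this analysis.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) Tukey fences for a numeric Series."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

minutes = pd.Series([72, 91, 119, 65, 130, 0, 4756, 88])
low, high = iqr_bounds(minutes)

# Values outside the fences are flagged as candidate outliers.
outliers = minutes[(minutes < low) | (minutes > high)]
print(low, high, outliers.tolist())
```

In this toy sample the zero-minute match lies inside the fences, so implausibly short matches would need a separate rule. Long matches are also not necessarily errors, so the fence is best treated as a starting point for inspection.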


## w_ace and l_ace

Number of aces served by the winner (w_ace) and by the loser (l_ace).

In [17]:
print(df['w_ace'].describe())

#df['w_ace']=df['w_ace'].fillna(-1)
#df['w_ace']=df['w_ace'].astype(int)

print(df['w_ace'].unique())
print(df['l_ace'].describe())
print(df['l_ace'].unique())


count    82310.000000
mean         4.813425
std          4.387105
min          0.000000
25%          2.000000
50%          4.000000
75%          7.000000
max         75.000000
Name: w_ace, dtype: float64
[ 3. 10.  2. 12.  5. 11.  1. 16. 21. 17. 15.  6.  7. 18. 44.  9.  8.  4.
  0. 14. 13. nan 33. 25. 20. 22. 26. 40. 19. 28. 29. 24. 30. 39. 23. 43.
 27. 32. 38. 35. 31. 48. 53. 36. 42. 34. 64. 37. 49. 45. 75. 41. 51. 61.
 46. 72. 52.]
count    82313.000000
mean         3.527875
std          3.828217
min          0.000000
25%          1.000000
50%          2.000000
75%          5.000000
max         67.000000
Name: l_ace, dtype: float64
[ 8. 17. 10.  1. 29. 12.  3.  5.  6. 13.  7.  4. 27. 22.  2.  0. 11. 20.
  9. nan 14. 36. 26. 15. 16. 47. 24. 18. 25. 21. 59. 28. 19. 23. 67. 38.
 37. 30. 34. 33. 31. 32. 40. 35. 52. 61. 53. 44. 45. 46. 56. 43. 39.]


## w_df and l_df

Number of double faults by the winner (w_df) and by the loser (l_df).

In [18]:
print(df['w_df'].describe())
print(df['w_df'].unique())
print(df['l_df'].describe())
print(df['l_df'].unique())


count    82312.000000
mean         2.858174
std          2.421105
min          0.000000
25%          1.000000
50%          2.000000
75%          4.000000
max        114.000000
Name: w_df, dtype: float64
[  3.   1.   2.   8.   5.   0.   4.   6.   7.   9.  nan  10.  14.  12.
  11.  13.  18.  19.  15.  26.  17.  25.  16.  20.  21.  22.  23.  24.
 114.  72.  28.  45.]
count    82319.000000
mean         3.612556
std          2.608092
min          0.000000
25%          2.000000
50%          3.000000
75%          5.000000
max        114.000000
Name: l_df, dtype: float64
[  6.   2.   3.   5.   7.   0.   1.   4.  10.   8.   9.  nan  12.  11.
  13.  16.  15.  17.  14.  20.  21.  28.  18.  22.  19.  23.  26.  25.
  31. 114.  40.  36.]


## w_svpt and l_svpt

Number of serve points played by the winner (w_svpt) and by the loser (l_svpt).

In [19]:
print(df['w_svpt'].describe())
print(df['w_svpt'].unique())
print(df['l_svpt'].describe())
print(df['l_svpt'].unique())


count    82310.000000
mean        71.288069
std         25.524468
min          0.000000
25%         53.000000
50%         67.000000
75%         87.000000
max       1957.000000
Name: w_svpt, dtype: float64
[7.700e+01 5.200e+01 4.700e+01 6.800e+01 1.050e+02 9.400e+01 5.900e+01
 6.400e+01 4.900e+01 5.400e+01 6.200e+01 8.400e+01 7.500e+01 6.300e+01
 5.600e+01 7.600e+01 4.200e+01 4.300e+01 6.600e+01 4.500e+01 3.800e+01
 1.190e+02 8.700e+01 5.800e+01 9.800e+01 1.170e+02 9.300e+01 8.600e+01
 4.800e+01 4.000e+01 1.140e+02 9.100e+01 7.400e+01 9.700e+01 7.900e+01
 6.500e+01 7.200e+01 6.900e+01 7.800e+01 7.300e+01 6.100e+01 8.900e+01
 1.100e+02 5.300e+01 1.020e+02 9.500e+01 4.100e+01 1.090e+02 4.400e+01
 8.200e+01 9.900e+01 9.600e+01 6.700e+01 1.120e+02 3.300e+01 9.000e+01
 3.200e+01 1.340e+02 8.500e+01 8.800e+01 7.000e+01 6.000e+01 1.000e+02
       nan 1.240e+02 5.100e+01 1.040e+02 8.000e+01 9.200e+01 1.030e+02
 8.300e+01 8.100e+01 1.070e+02 1.310e+02 1.430e+02 5.500e+01 7.100e+01
 3.900e+01 1.3

## w_1stIn and l_1stIn

Number of first serves made by the winner (w_1stIn) and by the loser (l_1stIn).

In [20]:
print(df['w_1stIn'].describe())
#print(df['w_1stIn'].unique())
print(df['l_1stIn'].describe())
#print(df['l_1stIn'].unique())


count    82310.000000
mean        44.270477
std         16.951922
min          0.000000
25%         32.000000
50%         42.000000
75%         54.000000
max       1330.000000
Name: w_1stIn, dtype: float64
count    82304.000000
mean        44.557737
std         16.776201
min          0.000000
25%         33.000000
50%         42.000000
75%         54.000000
max        893.000000
Name: l_1stIn, dtype: float64


## w_1stWon and l_1stWon

Number of first-serve points won by the winner (w_1stWon) and by the loser (l_1stWon).

In [21]:
print(df['w_1stWon'].describe())
#print(df['w_1stWon'].unique())
print(df['l_1stWon'].describe())
#print(df['l_1stWon'].unique())


count    82312.000000
mean        32.130564
std         11.409554
min          0.000000
25%         24.000000
50%         30.000000
75%         38.000000
max        836.000000
Name: w_1stWon, dtype: float64
count    82311.000000
mean        28.028903
std         12.270939
min          0.000000
25%         19.000000
50%         26.000000
75%         35.000000
max        532.000000
Name: l_1stWon, dtype: float64


## w_2ndWon and l_2ndWon

Number of second-serve points won by the winner (w_2ndWon) and by the loser (l_2ndWon).

In [22]:
print(df['w_2ndWon'].describe())
#print(df['w_2ndWon'].unique())
print(df['l_2ndWon'].describe())
#print(df['l_2ndWon'].unique())



count    82309.000000
mean        14.451251
std          5.933102
min          0.000000
25%         10.000000
50%         14.000000
75%         18.000000
max        304.000000
Name: w_2ndWon, dtype: float64
count    82312.000000
mean        12.705681
std          6.320212
min          0.000000
25%          8.000000
50%         12.000000
75%         16.000000
max        399.000000
Name: l_2ndWon, dtype: float64


## w_SvGms and l_SvGms

Number of serve games played by the winner (w_SvGms) and by the loser (l_SvGms).

In [23]:
print(df['w_SvGms'].describe())
#print(df['w_SvGms'].unique())
print(df['l_SvGms'].describe())
#print(df['l_SvGms'].unique())



count    82311.000000
mean        11.114784
std          3.512519
min          0.000000
25%          9.000000
50%         10.000000
75%         14.000000
max         49.000000
Name: w_SvGms, dtype: float64
count    82318.000000
mean        10.940353
std          3.497649
min          0.000000
25%          9.000000
50%         10.000000
75%         13.000000
max         50.000000
Name: l_SvGms, dtype: float64


## w_bpSaved and l_bpSaved

Number of break points saved by the winner (w_bpSaved) and by the loser (l_bpSaved).

In [24]:
print(df['w_bpSaved'].describe())
#print(df['w_bpSaved'].unique())
print(df['l_bpSaved'].describe())
#print(df['l_bpSaved'].unique())



count    82315.000000
mean         3.540861
std          3.109012
min          0.000000
25%          1.000000
50%          3.000000
75%          5.000000
max        209.000000
Name: w_bpSaved, dtype: float64
count    82311.000000
mean         4.660641
std          3.148227
min          0.000000
25%          2.000000
50%          4.000000
75%          6.000000
max        120.000000
Name: l_bpSaved, dtype: float64


## w_bpFaced and l_bpFaced

Number of break points faced by the winner (w_bpFaced) and by the loser (l_bpFaced).

In [25]:
print(df['w_bpFaced'].describe())
#print(df['w_bpFaced'].unique())
print(df['l_bpFaced'].describe())
#print(df['l_bpFaced'].unique())

count    82312.000000
mean         5.410244
std          4.206825
min          0.000000
25%          2.000000
50%          5.000000
75%          8.000000
max        266.000000
Name: w_bpFaced, dtype: float64
count    82306.000000
mean         8.872124
std          3.969575
min          0.000000
25%          6.000000
50%          8.000000
75%         11.000000
max        190.000000
Name: l_bpFaced, dtype: float64
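
Besides looking at the maxima, the serve statistics can be checked against each other: first serves in cannot exceed serve points, first-serve points won cannot exceed first serves in, and break points saved cannot exceed break points faced. The sketch below applies these rules to a toy frame with made-up values; the `rules` and `violations` names are illustrative.

```python
import numpy as np
import pandas as pd

stats_df = pd.DataFrame({
    "w_svpt":    [77, 52, 1957, np.nan],
    "w_1stIn":   [50, 60, 1330, np.nan],
    "w_1stWon":  [38, 30, 836, np.nan],
    "w_bpSaved": [3, 2, 209, np.nan],
    "w_bpFaced": [5, 1, 266, np.nan],
})

# Each rule flags rows where a "part" exceeds its "whole". Comparisons with
# NaN are False, so rows with missing statistics are never flagged.
rules = {
    "1stIn > svpt": stats_df["w_1stIn"] > stats_df["w_svpt"],
    "1stWon > 1stIn": stats_df["w_1stWon"] > stats_df["w_1stIn"],
    "bpSaved > bpFaced": stats_df["w_bpSaved"] > stats_df["w_bpFaced"],
}
violations = pd.DataFrame(rules)
print(violations.sum())
print(stats_df[violations.any(axis=1)])
```

Note that the third row, built from the extreme maxima seen above, is internally consistent and passes these checks, so consistency rules complement outlier detection rather than replace it.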


## winner_rank and loser_rank

The player's ATP or WTA rank as of the tourney_date, or as of the most recent ranking date before the tourney_date.

In [26]:
print(df['winner_rank'].describe())
#print(df['winner_rank'].unique())
print(df['loser_rank'].describe())
#print(df['loser_rank'].unique())

count    166719.000000
mean        383.810723
std         313.996466
min           1.000000
25%         137.000000
50%         298.000000
75%         562.000000
max        2220.000000
Name: winner_rank, dtype: float64
count    150845.000000
mean        434.303736
std         355.803171
min           1.000000
25%         157.000000
50%         325.000000
75%         642.000000
max        2257.000000
Name: loser_rank, dtype: float64


## winner_rank_points and loser_rank_points

Number of ranking points, where available.

In [27]:
print(df['winner_rank_points'].describe())
#print(df['winner_rank_points'].unique())
print(df['loser_rank_points'].describe())
#print(df['loser_rank_points'].unique())

count    166701.000000
mean        470.450789
std        1041.008107
min           1.000000
25%          49.000000
50%         161.000000
75%         438.000000
max       16950.000000
Name: winner_rank_points, dtype: float64
count    150828.000000
mean        356.328692
std         702.626048
min           1.000000
25%          35.000000
50%         138.000000
75%         377.000000
max       16950.000000
Name: loser_rank_points, dtype: float64


### winner_entry and loser_entry

'WC' = wild card, 'Q' = qualifier, 'LL' = lucky loser, 'PR' = protected
ranking, 'ITF' = ITF entry, and there are a few others that are occasionally used.

loser_entry: analogous

Analysis: Correct! nothing to do

In [28]:
df['winner_entry'].head()

0    NaN
1    NaN
2    NaN
3     PR
4    NaN
Name: winner_entry, dtype: object
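
One optional refinement: the distinct values printed earlier for winner_entry include both 'Alt' and 'ALT', which presumably denote the same entry type. A minimal sketch of a case normalization, on illustrative values:

```python
import numpy as np
import pandas as pd

entry = pd.Series([np.nan, "PR", "Q", "WC", "Alt", "ALT", "LL"])

# Strip and upper-case so that 'Alt' and 'ALT' collapse into one category;
# NaN (no special entry) is left untouched by the .str accessor.
normalized = entry.str.strip().str.upper()
print(normalized.dropna().unique())
```

Whether the two spellings really mean the same thing is an assumption that should be confirmed against the dataset documentation.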

### winner_name and loser_name

Name of the player who won the match (winner_name) or lost it (loser_name).

Analysis: Correct! nothing to do

In [29]:
df['winner_name'].head()

0         Kei Nishikori
1       Daniil Medvedev
2         Kei Nishikori
3    Jo-Wilfried Tsonga
4       Daniil Medvedev
Name: winner_name, dtype: object

### winner_hand and loser_hand
R= right, L = left, U = unknown. For ambidextrous players, this is their
serving hand.

Analysis: Correct! nothing to do

In [30]:
df['winner_hand'].head()

0    R
1    R
2    R
3    R
4    R
Name: winner_hand, dtype: object

### winner_ioc and loser_ioc

three-character country code

Analysis: Correct! nothing to do

In [31]:
df['winner_ioc'].head()

0    JPN
1    RUS
2    JPN
3    FRA
4    RUS
Name: winner_ioc, dtype: object
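
The distinct country codes printed earlier contain pairs such as 'GER' and 'DEU' or 'NED' and 'NLD', which suggests that some rows use ISO 3166-1 alpha-3 codes instead of IOC codes. The sketch below maps a few such pairs and checks the three-letter format; the mapping is deliberately partial and the sample values are illustrative.

```python
import numpy as np
import pandas as pd

# ISO 3166-1 alpha-3 codes and the IOC codes of the same countries (partial).
ISO_TO_IOC = {"DEU": "GER", "NLD": "NED", "GRC": "GRE", "NGA": "NGR", "PHL": "PHI"}

ioc = pd.Series(["GER", "DEU", "NED", "NLD", "usa", np.nan])

# Upper-case first, then replace the known ISO codes with their IOC form.
cleaned = ioc.str.upper().replace(ISO_TO_IOC)

# A syntactically valid code is exactly three upper-case letters.
well_formed = cleaned.str.fullmatch(r"[A-Z]{3}", na=False)
print(cleaned.tolist())
print(well_formed.tolist())
```

After such a mapping, the number of distinct winner and loser codes should become more comparable.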

### Score

Analysis: ?

In [32]:
df['score'].head()

0       6-4 3-6 6-2
1        7-6(6) 6-2
2           6-2 6-2
3        6-4 7-6(2)
4    6-7(2) 6-3 6-4
Name: score, dtype: object
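
A possible syntactic check for the score, sketched under the assumption that a regular score is a space-separated list of sets of the form `games-games` with an optional tie-break in parentheses (as in the rows shown above). Other strings, for example retirements or walkovers if the data contains them, would not match and would need their own handling.

```python
import numpy as np
import pandas as pd

# One set: games-games with an optional tie-break score in parentheses.
SET = r"\d+-\d+(?:\(\d+\))?"
# A full score: one or more sets separated by single spaces.
SCORE = rf"{SET}(?: {SET})*"

scores = pd.Series(["6-4 3-6 6-2", "7-6(6) 6-2", "6-7(2) 6-3 6-4", "6-4 x", np.nan])

is_valid = scores.str.fullmatch(SCORE, na=False)
# Count the sets only for scores that passed the check.
n_sets = scores.where(is_valid).str.count(SET)
print(is_valid.tolist())
print(n_sets.tolist())
```

The number of sets obtained this way could later be compared with the best_of attribute.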

### Round

Analysis: Correct! nothing to do

In [33]:
df['round'].head()

0     F
1    SF
2    SF
3    QF
4    QF
Name: round, dtype: object

## Now let's start analyzing each numerical attribute


### tourney_date

eight digits, YYYYMMDD, usually the Monday of the tournament week.

Analysis: use the pandas function to_datetime to convert the values into a date format.

In [34]:
#df['tourney_date']=pd.to_datetime(df['tourney_date'], format='%Y%m%d')

#print(df['tourney_date'])


### winner_id and loser_id

the player_id used in this repo for the winner of the match.

In [35]:
df['winner_id'].describe()

count    186073.000000
mean     180151.623529
std       46547.170898
min      100644.000000
25%      122425.000000
50%      203530.000000
75%      214152.000000
max      245099.000000
Name: winner_id, dtype: float64

### winner_ht and loser_ht

height in centimetres, where available

In [36]:
df['winner_ht']=np.nan_to_num(df['winner_ht']).astype(int)
df['winner_ht'].head()

0    178
1    198
2    178
3    188
4    198
Name: winner_ht, dtype: int64

### winner_age and loser_age

the age of the player, in years, depending on the date of the tournament

In [37]:
df['winner_age'].head()

0    95.000000
1    22.885695
2    29.004791
3    33.705681
4    22.885695
Name: winner_age, dtype: float64

# Data Integration
We load two additional tables containing the names and surnames of male and female players, and use them to assign a gender to each player.

In [38]:
df_male = pd.read_csv('male_players.csv')#, sep=',') 
df_male.dropna(inplace=True)
df_male_names = pd.DataFrame()
df_male_names['Name'] = df_male['name'].astype(str) + " " + df_male['surname'].astype(str)
df_male_names.drop_duplicates(inplace=True)

#print(df_male_names.isna().sum())
#print(df_male_names.isnull().sum())
#print(df_male_names.describe())
#print(df_male_names)

df_male_names.insert(1, 'gender', 'M', allow_duplicates=True)
#print(df_male_names)

df_female = pd.read_csv('female_players.csv')#, sep=',') 
df_female.dropna(inplace=True)
df_female_names = pd.DataFrame()
df_female_names['Name'] = df_female['name'].astype(str) + " " + df_female['surname'].astype(str)
df_female_names.drop_duplicates(inplace=True)
df_female_names.insert(1, 'gender', 'F', allow_duplicates=True)
#print(df_female_names)


result = pd.concat([df_male_names, df_female_names]) #we create a new table to have the global view
#print(result.duplicated(subset=['Name']).any())
#print(result.describe())

pd.options.mode.chained_assignment = None 

# Names that appear in both lists are ambiguous: keep one row each and mark it 'U'.
duplicateDFRow = result[result.duplicated(['Name'], keep=False)].copy()
duplicateDFRow.drop_duplicates(subset=['Name'], inplace=True)
duplicateDFRow['gender'] = duplicateDFRow['gender'].replace({"M": 'U'})

#print(result.describe())
#print(duplicateDFRow.describe())

table_gender = pd.concat([result, duplicateDFRow]) 

table_gender.drop_duplicates(subset=['Name'], keep='last', inplace=True)

print(table_gender['gender'].describe())

count     98581
unique        3
top           M
freq      54492
Name: gender, dtype: object
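
The construction of the gender table can be condensed into a small self-contained sketch. It mirrors the idea of the cell above (names found in both lists are marked 'U') but is not the same code; the helper `full_names` and the names "Alex Smith" and "Jane Doe" are hypothetical.

```python
import pandas as pd

male = pd.DataFrame({"name": ["Kei", "Alex"], "surname": ["Nishikori", "Smith"]})
female = pd.DataFrame({"name": ["Naomi", "Alex"], "surname": ["Osaka", "Smith"]})

def full_names(players, gender):
    """Return unique 'name surname' strings tagged with the given gender."""
    names = (players["name"] + " " + players["surname"]).drop_duplicates()
    return pd.DataFrame({"Name": names, "gender": gender})

both = pd.concat([full_names(male, "M"), full_names(female, "F")], ignore_index=True)

# A name that occurs in both lists cannot be resolved, so mark it 'U'.
ambiguous = both["Name"].duplicated(keep=False)
both.loc[ambiguous, "gender"] = "U"
table_gender = both.drop_duplicates(subset="Name").set_index("Name")["gender"]

# Look up the gender of each winner; unknown names stay NaN.
matches = pd.DataFrame({"winner_name": ["Kei Nishikori", "Alex Smith", "Jane Doe"]})
matches["winner_gender"] = matches["winner_name"].map(table_gender)
print(matches)
```

Using `map` on a Series indexed by name keeps the number of rows of the main table unchanged, which is a useful property when the lookup table is guaranteed to have unique names.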


1. Remove the duplicates from the main table
2. Join the main table with table_gender

In [41]:
#df.drop_duplicates(inplace = True)
tmp = df
#print(tmp.info())
tmp = tmp.join(table_gender.set_index('Name'), on='winner_name')
tmp.rename(columns={'gender': 'winner_gender'}, inplace = True)
#print(tmp['winner_name'].head(), tmp['gender'].head())
tmp = tmp.join(table_gender.set_index('Name'), on='loser_name')
tmp.rename(columns={'gender': 'loser_gender'}, inplace = True)
#print(tmp.head())
#print(tmp.info())