<span>
<b>Authors:</b> 
<a href="http://------">Ornela Danushi </a>
<a href="http://------">Gerlando Gramaglia </a>
<a href="http://------">Domenico Profumo </a><br/>
<b>Python version:</b>  3.x<br/>
</span>

# Data Understanding & Preparation on Tennis Matches dataset 
We explore the dataset by studying its data quality, the distribution of its features, and the correlations among them.

**Pandas** is a **central component** of the data science toolkit and is used in conjunction with the other libraries in that collection. Pandas is built on top of the **NumPy package**, so much of NumPy's structure is used or replicated in Pandas. Data in Pandas is often used to feed statistical analysis in **SciPy**, plotting functions from **Matplotlib**, and machine learning algorithms in **Scikit-learn**.

**Install Pandas**

Pandas is an easy package to install. Open up your terminal program (for Mac users) or command line (for PC users) and install it using either of the following commands: **conda install pandas OR pip install pandas.**

Alternatively, if you're using Jupyter notebook you can run a cell with: **!pip install pandas**

In [1]:
%matplotlib inline
import math
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt

from collections import defaultdict
from scipy.stats import pearsonr

## Loading the data set

Read the .csv file containing the data. The first line contains the list of attributes. The data is assigned to a Pandas dataframe.

In [2]:
df = pd.read_csv('tennis_matches.csv')  # optional arguments: sep=',', index_col=0
# for a JSON source, pd.read_json('filename.json') would be used instead
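
The `Unnamed: 0` column that appears in the output of `df.head()` is a row index that was saved into the file. As a minimal sketch of what the optional `index_col=0` argument changes, the example below reads a small in-memory sample; the sample rows are a hypothetical stand-in for `tennis_matches.csv`, not the real file.

```python
import io
import pandas as pd

# Hypothetical sample standing in for tennis_matches.csv;
# the leading unnamed column is the saved row index.
sample_csv = io.StringIO(
    ",tourney_id,surface\n"
    "0,2019-M020,Hard\n"
    "1,2019-M020,Hard\n"
)

# Default read: the unnamed first column becomes a regular column.
df_default = pd.read_csv(sample_csv)
print(df_default.columns.tolist())

# index_col=0 uses that first column as the dataframe index instead.
sample_csv.seek(0)
df_indexed = pd.read_csv(sample_csv, index_col=0)
print(df_indexed.columns.tolist())
```

Whether to keep the saved index as a column depends on whether it carries information; here it appears to duplicate the default `RangeIndex`.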

In [3]:
print(df.head()) #print the first records of a df, 
# print()
# print(df.tail()) #print the last records of a df.

   Unnamed: 0 tourney_id tourney_name surface  draw_size tourney_level  \
0           0  2019-M020     Brisbane    Hard       32.0             A   
1           1  2019-M020     Brisbane    Hard       32.0             A   
2           2  2019-M020     Brisbane    Hard       32.0             A   
3           3  2019-M020     Brisbane    Hard       32.0             A   
4           4  2019-M020     Brisbane    Hard       32.0             A   

   tourney_date  match_num  winner_id winner_entry  ... l_2ndWon l_SvGms  \
0    20181231.0      300.0   105453.0          NaN  ...     20.0    14.0   
1    20181231.0      299.0   106421.0          NaN  ...      7.0    10.0   
2    20181231.0      298.0   105453.0          NaN  ...      6.0     8.0   
3    20181231.0      297.0   104542.0           PR  ...      9.0    11.0   
4    20181231.0      296.0   106421.0          NaN  ...     19.0    15.0   

   l_bpSaved l_bpFaced  winner_rank  winner_rank_points loser_rank  \
0       10.0      15.0      

In [4]:
#df.dtypes #return the type of each attribute but is already included in the df.info() called later

# Types of Attributes and basic checks 
## Data Quality with reference to Syntactic Accuracy

Check the data integrity, that is, whether there are any empty cells or corrupted data. 
For this purpose we use the Pandas function **info()**, which reports the number of 
non-null values in each column. It also reports the data type of each column, the 
number of columns per data type, and the number of observations (rows).

Moreover, we check whether each attribute is syntactically correct according to the specifications.
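
As a complement to `info()`, the missing values can be counted explicitly per column. The following is a minimal sketch on a hypothetical three-row sample (the values are invented for illustration); on the real data the same two calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same kind of gaps seen in the real data.
sample = pd.DataFrame({
    "tourney_id": ["2019-M020", None, "2019-M020"],
    "winner_ht": [178.0, np.nan, np.nan],
    "surface": ["Hard", "Hard", "Clay"],
})

# Missing values per column, as a count and as a percentage of the rows.
missing = pd.DataFrame({
    "n_missing": sample.isna().sum(),
    "pct_missing": sample.isna().mean() * 100,
})
print(missing.sort_values("n_missing", ascending=False))
```

Sorting by the count puts the sparsest columns first, which helps when deciding which attributes need attention.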

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 186128 entries, 0 to 186127
Data columns (total 50 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   Unnamed: 0          186128 non-null  int64  
 1   tourney_id          186073 non-null  object 
 2   tourney_name        186103 non-null  object 
 3   surface             185940 non-null  object 
 4   draw_size           186099 non-null  float64
 5   tourney_level       186099 non-null  object 
 6   tourney_date        186100 non-null  float64
 7   match_num           186101 non-null  float64
 8   winner_id           186073 non-null  float64
 9   winner_entry        25827 non-null   object 
 10  winner_name         186101 non-null  object 
 11  winner_hand         186082 non-null  object 
 12  winner_ht           49341 non-null   float64
 13  winner_ioc          186099 non-null  object 
 14  winner_age          183275 non-null  float64
 15  loser_id            186100 non-nul

# Attributes category check

- Categorical Attributes:<br>
    - *tourney_id* - ***object*** => ***Date***-***Id***?
    - *tourney_name* - ***object*** 
    - *surface* - ***object***
    - *tourney_level* - ***object***
    - *winner_entry* - ***object***
    - *winner_name* - ***object***
    - *winner_hand* - ***object***
    - *winner_ioc* - ***object***
    - *loser_entry* - ***object***
    - *loser_name* - ***object***
    - *loser_hand* - ***object***
    - *loser_ioc* - ***object*** 
    - *score* - ***object***
    - *round* - ***object***
    
    
- Numerical Attributes:<br>
    - *draw_size* - ***float64*** => ***int64***
    - *tourney_date* - ***float64*** => ***DateTime***
    - *match_num*- ***float64*** =>
    - *winner_id*- ***float64*** =>
    - *winner_ht*- ***float64*** => 
    - *winner_age*- ***float64*** => 
    - *loser_id*- ***float64*** =>
    - *loser_ht*- ***float64*** =>
    - *loser_age*- ***float64*** =>
    - *best_of* - ***float64***
    - *minutes* - ***float64***
    - *w_ace* - ***float64***
    - *w_df* - ***float64***
    - *w_svpt* - ***float64***
    - *w_1stIn* - ***float64***
    - *w_1stWon* - ***float64***
    - *w_2ndWon* - ***float64***
    - *w_SvGms* - ***float64***
    - *w_bpSaved* - ***float64***
    - *w_bpFaced* - ***float64***
    - *l_ace* - ***float64***
    - *l_df* - ***float64***
    - *l_svpt* - ***float64***
    - *l_1stIn* - ***float64***
    - *l_1stWon* - ***float64***
    - *l_2ndWon* - ***float64***
    - *l_SvGms* - ***float64***
    - *l_bpSaved* - ***float64***
    - *l_bpFaced* - ***float64***
    - *winner_rank* - ***float64***
    - *winner_rank_points* - ***float64***
    - *loser_rank* - ***float64***
    - *loser_rank_points* - ***float64***
    - *tourney_spectators* - ***float64*** 
    - *tourney_revenue* - ***float64***
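
The split into categorical and numerical attributes listed above can also be derived from the dtypes. This is a minimal sketch on a hypothetical four-column sample; it assumes that every non-numeric column is to be treated as categorical.

```python
import pandas as pd

# Hypothetical sample with two text columns and two float columns.
sample = pd.DataFrame({
    "tourney_name": ["Brisbane", "Brisbane"],
    "surface": ["Hard", "Hard"],
    "draw_size": [32.0, 32.0],
    "minutes": [124.0, 82.0],
})

# Numeric dtypes (int, float) versus everything else.
numerical_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Categorical:", categorical_cols)
print("Numerical:", numerical_cols)
```

Note that a dtype-based split only reflects how the values are stored: identifiers such as *winner_id* are stored as floats but do not behave like quantities.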


## Now let's analyze each categorical attribute



### tourney_id

It has to be unique, and its first four characters are always the year, while the remaining part is random.

Analysis: 
Split the value of the attribute into two parts: the first four characters and the remaining characters.
For both parts, build a set containing each distinct year and id. 
Count how many years are wrong, that is, appear as a NaN value. 
Detecting the wrong ids is more difficult, since they are random and do not follow a known structure.
We can only conclude that the wrong ids are those belonging to the records with a wrong year.

NOTE: we still have to decide whether to delete or edit these records.

In [25]:
# Split each id into the year (first four characters) and the remaining code,
# skipping the separator in position 4.
tourney_year = df['tourney_id'].str[:4]
tourney_id = df['tourney_id'].str[5:]

# A missing tourney_id yields NaN in both parts: count those as wrong years
# and collect the distinct valid ones.
wrong_year = int(tourney_year.isna().sum())
tourney_year_set = set(tourney_year.dropna())

print("Present years: ")
print(tourney_year_set) #{'2018', '2019', '2016', '2020', '2021', '2017'}
print("Wrong years counting: ")
print(wrong_year) #55

tourney_id_set = set(tourney_id)
#print(tourney_id_set)
# Open question: how can NaN ids be detected if the values are not necessarily numeric?
# These NaN values are among those already identified through the NaN year.

Present years: 
{'2018', '2019', '2016', '2020', '2021', '2017'}
Wrong years counting: 
55
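
A pattern-based check can flag malformed ids as well as missing ones. The sketch below assumes the format is a four-digit year, a `-` separator, and a non-empty code, which matches values such as `2019-M020` shown earlier; the sample ids (including the malformed one) are hypothetical.

```python
import pandas as pd

# Hypothetical sample: two valid ids, one missing, one with a corrupted year.
ids = pd.Series(["2019-M020", "2018-0451", None, "20x9-M020"], name="tourney_id")

# Capture a four-digit year and whatever follows the separator.
parts = ids.str.extract(r"^(?P<year>\d{4})-(?P<code>.+)$")

# Rows where the pattern did not match (missing or malformed ids).
invalid = parts["year"].isna()
print(parts)
print("Invalid ids:", int(invalid.sum()))
```

Unlike slicing by position, the pattern rejects a year such as `20x9`, so the invalid count covers both NaN and syntactically wrong values.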


### tourney_name
Name of the tournament.

Analysis: simply viewing the values reveals some NaN entries that have to be located. We build a set to view all the unique names.

In [30]:
tourney_name = df['tourney_name']
#tourney_name.describe() 
tourney_name_set = set(tourney_name)  # all distinct names
#print(tourney_name_set)
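
To locate the NaN entries mentioned in the analysis, the missing values can be counted separately from the distinct names. This is a minimal sketch on a hypothetical sample of names; on the real data the same calls would be applied to `df['tourney_name']`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample containing one missing name.
names = pd.Series(["Brisbane", "Brisbane", np.nan, "Doha"], name="tourney_name")

# Number of missing names and the sorted list of distinct valid names.
n_missing = int(names.isna().sum())
unique_names = sorted(names.dropna().unique())

# Frequency of each value, with NaN kept as its own entry.
counts = names.value_counts(dropna=False)

print("Missing:", n_missing)
print("Unique names:", unique_names)
print(counts)
```

Sorting the distinct names makes near-duplicates (different spellings of the same tournament) easier to spot by eye.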

### surface
Kind of surface on which the match is played.

Analysis: the values are correct; nothing to do, apart from the missing values (NaN) visible in the output.

In [8]:
df['surface']

0         Hard
1         Hard
2         Hard
3         Hard
4         Hard
          ... 
186123    Hard
186124     NaN
186125     NaN
186126     NaN
186127    Hard
Name: surface, Length: 186128, dtype: object

### tourney_level
The codes differ for men and women.

- For men: 'G' = Grand Slams, 'M' = Masters 1000s, 'A' = other tour-level events, 'C' = Challengers, 'S' = Satellites/ITFs, 'F' = Tour finals and other season-ending events, and 'D' = Davis Cup.

- For women, there are several additional tourney_level codes, including 'P' = Premier, 'PM' = Premier Mandatory, and 'I' = International. The various levels of ITFs are given by the prize money (in thousands), such as '15' = ITF $15,000. Other codes, such as 'T1' for Tier I (and so on), are used for older WTA tournament designations. 'D' is used for the Federation/Fed/Billie Jean King Cup, and also for the Wightman Cup and Bonne Bell Cup.

- Some competitions can be for both men and women: 'E' = exhibition (events not sanctioned by the tour, though the definitions can be ambiguous), 'J' = juniors, and 'T' = team tennis, which does not yet appear anywhere in the dataset but will at some point.


Analysis: Correct! nothing to do

In [9]:
df['tourney_level']

0           A
1           A
2           A
3           A
4           A
         ... 
186123    NaN
186124      C
186125      C
186126      C
186127      C
Name: tourney_level, Length: 186128, dtype: object
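
The codes listed in the description can be used as a domain check. The sketch below is only an illustration on hypothetical values: it treats the letter codes named above as known, accepts purely numeric codes as ITF prize-money levels, and does not handle the older 'T1'-style codes.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: valid codes, an ITF level, a missing value, an unknown code.
levels = pd.Series(["A", "C", "G", "15", "PM", np.nan, "X"], name="tourney_level")

# Letter codes named in the description above.
known_codes = {"G", "M", "A", "C", "S", "F", "D", "P", "PM", "I", "E", "J", "T"}

is_known = levels.isin(known_codes)
# ITF levels are given by the prize money in thousands, e.g. '15'.
is_itf = levels.str.fullmatch(r"\d+", na=False)
is_missing = levels.isna()

# Values that are neither known, nor ITF levels, nor missing.
unexpected = levels[~(is_known | is_itf | is_missing)]
print(unexpected.tolist())
```

Missing values are kept apart from unexpected codes so that the two kinds of problem can be counted and treated separately.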

### winner_entry and loser_entry

'WC' = wild card, 'Q' = qualifier, 'LL' = lucky loser, 'PR' = protected
ranking, 'ITF' = ITF entry, and there are a few others that are occasionally used.

loser_entry: analogous

Analysis: Correct! nothing to do

In [10]:
df['winner_entry'].head()

0    NaN
1    NaN
2    NaN
3     PR
4    NaN
Name: winner_entry, dtype: object

### winner_name

Name of the player who won the match.

Analysis: Correct! nothing to do

In [11]:
df['winner_name'].head()

0         Kei Nishikori
1       Daniil Medvedev
2         Kei Nishikori
3    Jo-Wilfried Tsonga
4       Daniil Medvedev
Name: winner_name, dtype: object

### winner_hand and loser_hand
R = right, L = left, U = unknown. For ambidextrous players, this is their
serving hand.

Analysis: Correct! nothing to do

In [12]:
df['winner_hand'].head()

0    R
1    R
2    R
3    R
4    R
Name: winner_hand, dtype: object

### winner_ioc and loser_ioc

three-character country code

Analysis: Correct! nothing to do

In [13]:
df['winner_ioc'].head()

0    JPN
1    RUS
2    JPN
3    FRA
4    RUS
Name: winner_ioc, dtype: object

### Score

Analysis: ?

In [14]:
df['score'].head()

0       6-4 3-6 6-2
1        7-6(6) 6-2
2           6-2 6-2
3        6-4 7-6(2)
4    6-7(2) 6-3 6-4
Name: score, dtype: object

### Round

Analysis: Correct! nothing to do

In [15]:
df['round'].head()

0     F
1    SF
2    SF
3    QF
4    QF
Name: round, dtype: object

## Now let's analyze each numerical attribute


### draw_size

number of players in the draw, often rounded up to the nearest power of 2. (For instance, a tournament with 28 players may be shown as 32.)

Analysis: since all powers of 2 are integers, the idea is to convert the column to the 'int' type. Undefined values would make the conversion fail, so they are set to 0 (to be checked whether this is a correct solution).

In [16]:
df['draw_size']= df['draw_size'].fillna(0)
df['draw_size']= df['draw_size'].astype(int)
df['draw_size'].head()

0    32
1    32
2    32
3    32
4    32
Name: draw_size, dtype: int64
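
An alternative to filling the undefined values with 0 is the nullable integer dtype `Int64`, which stores integers while keeping missing entries as `<NA>`. The sketch below uses hypothetical values and also shows a power-of-two test on the known entries; it is one possible approach, not necessarily the one to adopt here.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two powers of two, one missing value, one raw player count.
draw_size = pd.Series([32.0, 128.0, np.nan, 28.0], name="draw_size")

# Nullable integer dtype: integers where known, <NA> where missing.
draw_size_int = draw_size.astype("Int64")
print(draw_size_int)

# Power-of-two test on the known values: n & (n - 1) == 0 for n > 0.
known = draw_size_int.dropna().astype("int64")
is_power_of_two = (known > 0) & ((known & (known - 1)) == 0)
print(is_power_of_two)
```

Keeping `<NA>` avoids introducing a draw size of 0, which is not a meaningful value and would distort statistics such as the mean or the minimum.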

### tourney_date

eight digits, YYYYMMDD, usually the Monday of the tournament week.

Analysis: the column cannot be converted to the date type directly because here too we have null values; how should we handle them?

In [17]:
df['tourney_date'].head()

0    20181231.0
1    20181231.0
2    20181231.0
3    20181231.0
4    20181231.0
Name: tourney_date, dtype: float64
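
One possible answer to the question above: `pd.to_datetime` represents missing dates as `NaT`, so the null values do not have to be filled before converting. The following is a minimal sketch on hypothetical values in the same `YYYYMMDD` float format.

```python
import numpy as np
import pandas as pd

# Hypothetical sample in the float format of the column, with one missing value.
tourney_date = pd.Series([20181231.0, 20190107.0, np.nan], name="tourney_date")

# Drop the float decimals via the nullable integer dtype, then render as text.
as_text = tourney_date.astype("Int64").astype(str)

# Entries that do not match the format (including the missing one) become NaT.
parsed = pd.to_datetime(as_text, format="%Y%m%d", errors="coerce")
print(parsed)
```

With `errors="coerce"`, syntactically invalid dates also become `NaT`, so comparing the number of `NaT` values with the number of original NaN values reveals how many dates were malformed.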

# Data Integration
We load two additional tables containing the names and surnames of the male and female players.

In [18]:
df_male = pd.read_csv('male_players.csv', sep=',') 
print(df_male.head())

print()

df_female = pd.read_csv('female_players.csv', sep=',') 
print(df_female.head())

             name   surname
0         Gardnar    Mulloy
1          Pancho    Segura
2           Frank   Sedgman
3        Giuseppe     Merlo
4  Richard Pancho  Gonzales

      name surname
0    Bobby   Riggs
1        X       X
2  Martina  Hingis
3  Mirjana   Lucic
4  Justine   Henin
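
One way these tables could be integrated with the matches is sketched below. It is only an illustration under stated assumptions: the `sex` and `full_name` columns are hypothetical names introduced here, the small tables are samples rather than the real files, and it assumes that `winner_name` is written as name followed by surname.

```python
import pandas as pd

# Small samples in the same two-column layout as the player files.
df_male = pd.DataFrame({"name": ["Gardnar", "Pancho"], "surname": ["Mulloy", "Segura"]})
df_female = pd.DataFrame({"name": ["Martina", "Justine"], "surname": ["Hingis", "Henin"]})

# Tag each table, stack them, and build a full name comparable to winner_name.
players = pd.concat(
    [df_male.assign(sex="M"), df_female.assign(sex="F")],
    ignore_index=True,
)
players["full_name"] = players["name"].str.strip() + " " + players["surname"].str.strip()

# Hypothetical slice of the matches table.
matches = pd.DataFrame({"winner_name": ["Martina Hingis", "Pancho Segura", "Unknown Player"]})

# Left join keeps every match; unmatched names get NaN in the sex column.
merged = matches.merge(
    players[["full_name", "sex"]],
    how="left", left_on="winner_name", right_on="full_name",
)
print(merged)
```

Duplicate full names in the player tables would multiply the matched rows, so it may be necessary to drop duplicates (and placeholder rows such as `X X`) before merging.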
