<span>
<b>Authors:</b> 
<a href="http://------">Ornela Danushi </a>
<a href="http://------">Gerlando Gramaglia </a>
<a href="http://------">Domenico Profumo </a><br/>
<b>Python version:</b>  3.x<br/>
</span>

# Data Understanding & Preparation on Tennis Matches dataset 
Explore the dataset by studying its data quality, the distribution of its features, and the correlations among them.

The **central component** of the data science toolkit is the **Pandas library**, which is used in conjunction with the other libraries in that collection. Pandas is built on top of the **NumPy package**, meaning a lot of the structure of NumPy is used or replicated in Pandas. Data in Pandas is often used to feed statistical analysis in **SciPy**, plotting functions from **Matplotlib**, and machine learning algorithms in **Scikit-learn**.

**Install Pandas**

Pandas is an easy package to install. Open a terminal (macOS/Linux) or a command prompt (Windows) and install it using either of the following commands: **conda install pandas** or **pip install pandas**.

Alternatively, if you're using Jupyter notebook you can run a cell with: **!pip install pandas**

In [6]:
%matplotlib inline
import math
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt

from collections import defaultdict
from scipy.stats import pearsonr  # the scipy.stats.stats path is deprecated

## Loading the data set

Read the .csv file containing the data. The first line contains the list of attributes. The data is assigned to a Pandas dataframe.

In [12]:
# For a JSON source, pd.read_json('filename.json') would be the equivalent reader
df = pd.read_csv('tennis_matches.csv', sep=',', index_col=0)

In [19]:
print(df.head())  # print the first records of the DataFrame
# print()
# print(df.tail())  # print the last records of the DataFrame

  tourney_id tourney_name surface  draw_size tourney_level  tourney_date  \
0  2019-M020     Brisbane    Hard       32.0             A    20181231.0   
1  2019-M020     Brisbane    Hard       32.0             A    20181231.0   
2  2019-M020     Brisbane    Hard       32.0             A    20181231.0   
3  2019-M020     Brisbane    Hard       32.0             A    20181231.0   
4  2019-M020     Brisbane    Hard       32.0             A    20181231.0   

   match_num  winner_id winner_entry         winner_name  ... l_2ndWon  \
0      300.0   105453.0          NaN       Kei Nishikori  ...     20.0   
1      299.0   106421.0          NaN     Daniil Medvedev  ...      7.0   
2      298.0   105453.0          NaN       Kei Nishikori  ...      6.0   
3      297.0   104542.0           PR  Jo-Wilfried Tsonga  ...      9.0   
4      296.0   106421.0          NaN     Daniil Medvedev  ...     19.0   

   l_SvGms l_bpSaved  l_bpFaced  winner_rank winner_rank_points loser_rank  \
0     14.0      10.0
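
The `head()` output above is wrapped and elides the middle columns (`...`). A minimal sketch of how the display options can be widened temporarily so that every column is printed; the frame `wide` and its `col_i` names are hypothetical stand-ins for the tennis data, and the width of 1000 characters is an arbitrary choice:

```python
import pandas as pd

# Hypothetical wide frame standing in for the 49-column tennis dataset.
wide = pd.DataFrame([range(30)], columns=[f"col_{i}" for i in range(30)])

# Temporarily lift the column limit and widen the output;
# the previous settings are restored when the block exits.
with pd.option_context("display.max_columns", None, "display.width", 1000):
    text = repr(wide.head())

print(text)
```

Using `pd.option_context` rather than `pd.set_option` keeps the change local to one cell, so later outputs in the notebook are unaffected.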

In [18]:
df.dtypes  # return the data type of each attribute

tourney_id             object
tourney_name           object
surface                object
draw_size             float64
tourney_level          object
tourney_date          float64
match_num             float64
winner_id             float64
winner_entry           object
winner_name            object
winner_hand            object
winner_ht             float64
winner_ioc             object
winner_age            float64
loser_id              float64
loser_entry            object
loser_name             object
loser_hand             object
loser_ht              float64
loser_ioc              object
loser_age             float64
score                  object
best_of               float64
round                  object
minutes               float64
w_ace                 float64
w_df                  float64
w_svpt                float64
w_1stIn               float64
w_1stWon              float64
w_2ndWon              float64
w_SvGms               float64
w_bpSaved             float64
w_bpFaced 
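
The listing shows `tourney_date` stored as `float64` (for example `20181231.0`), although it encodes a calendar date. One possible conversion is sketched below, assuming the values follow the `YYYYMMDD` pattern seen in the `head()` output; the small frame `sample` is a stand-in built from those values, not the real dataset:

```python
import pandas as pd

# Stand-in frame; the first date is copied from the head() output,
# the second row simulates a missing value.
sample = pd.DataFrame({
    "tourney_id": ["2019-M020", "2019-M020"],
    "tourney_date": [20181231.0, float("nan")],
})

# float -> nullable integer (drops the ".0") -> text such as "20181231".
as_text = sample["tourney_date"].astype("Int64").astype(str)

# Parse the text as YYYYMMDD; anything unparseable (including the
# text form of a missing value) becomes NaT thanks to errors="coerce".
sample["tourney_date"] = pd.to_datetime(as_text, format="%Y%m%d", errors="coerce")

print(sample.dtypes)
```

The same idea could apply to identifier columns such as `winner_id`, which are also read as `float64` because they contain missing values.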

## Types of Attributes and basic checks

Check the data integrity, that is, whether there are any empty cells or corrupted data. 
We use the Pandas method **info()** for this purpose: it reports the number of non-null 
values in each column, so columns with missing data stand out. It also lists the data type 
of each column, the number of columns per data type, and the number of observations (rows).

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 186128 entries, 0 to 186127
Data columns (total 49 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   tourney_id          186073 non-null  object 
 1   tourney_name        186103 non-null  object 
 2   surface             185940 non-null  object 
 3   draw_size           186099 non-null  float64
 4   tourney_level       186099 non-null  object 
 5   tourney_date        186100 non-null  float64
 6   match_num           186101 non-null  float64
 7   winner_id           186073 non-null  float64
 8   winner_entry        25827 non-null   object 
 9   winner_name         186101 non-null  object 
 10  winner_hand         186082 non-null  object 
 11  winner_ht           49341 non-null   float64
 12  winner_ioc          186099 non-null  object 
 13  winner_age          183275 non-null  float64
 14  loser_id            186100 non-null  float64
 15  loser_entry         44154 non-null
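
The non-null counts reported by `info()` can also be turned into an explicit missing-value table. The sketch below assumes a tiny stand-in frame `sample` with made-up values (they are illustrative only, not taken from the dataset); on the real data, the same two expressions would be applied to `df`:

```python
import pandas as pd

# Illustrative stand-in frame; the values are invented for the example.
sample = pd.DataFrame({
    "surface": ["Hard", "Hard", None, "Hard"],
    "winner_entry": [None, None, None, "PR"],
    "winner_ht": [None, 198.0, None, 188.0],
})

# Count and percentage of missing values per column, worst columns first.
missing = pd.DataFrame({
    "n_missing": sample.isna().sum(),
    "pct_missing": (sample.isna().mean() * 100).round(1),
}).sort_values("n_missing", ascending=False)

print(missing)
```

Sorting by `n_missing` puts sparsely populated columns, such as `winner_entry` and `winner_ht` in the `info()` output above, at the top of the table.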

## Data Integration
We load two additional tables containing the names and surnames of male and female players.

In [25]:
df_male = pd.read_csv('male_players.csv', sep=',') 
print(df_male.head())

print()

df_female = pd.read_csv('female_players.csv', sep=',') 
print(df_female.head())

             name   surname
0         Gardnar    Mulloy
1          Pancho    Segura
2           Frank   Sedgman
3        Giuseppe     Merlo
4  Richard Pancho  Gonzales

      name surname
0    Bobby   Riggs
1        X       X
2  Martina  Hingis
3  Mirjana   Lucic
4  Justine   Henin
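
One way the two tables could be combined into a single lookup is sketched below. The `sex` and `full_name` columns are hypothetical additions, and treating the `X X` row as a placeholder to drop is an assumption, not something the source states; the frames are rebuilt inline from the rows printed above so the example runs on its own:

```python
import pandas as pd

# Stand-in frames using rows from the outputs above.
df_male = pd.DataFrame({"name": ["Gardnar", "Pancho"],
                        "surname": ["Mulloy", "Segura"]})
df_female = pd.DataFrame({"name": ["X", "Martina"],
                          "surname": ["X", "Hingis"]})

# Tag each table before stacking them (hypothetical "sex" column).
df_male["sex"] = "M"
df_female["sex"] = "F"
players = pd.concat([df_male, df_female], ignore_index=True)

# Assumption: a row with name "X" and surname "X" is a placeholder.
is_placeholder = (players["name"] == "X") & (players["surname"] == "X")
players = players[~is_placeholder].copy()

# Build a full name that can be compared with winner_name / loser_name.
players["full_name"] = (players["name"].str.strip() + " "
                        + players["surname"].str.strip())

print(players)
```

A `full_name` column of this kind could then be matched against `winner_name` and `loser_name` in the matches table, for instance with `DataFrame.merge`.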
