# Inferential statistics
## Part I - Data Cleaning

Your family is very passionate about basketball. You always have discussions about players, games, statistics and whatnot. As you can imagine, those discussions never reach a conclusion, since everyone is simply sharing their opinion with no statistics to back it up!

![](../images/basket.jpg)

Since you are attending a data analysis bootcamp you'd like to take advantage of your newfound knowledge to finally put an end to your family's discussions. 

Luckily we have found a dataset containing data related to the players of the WNBA for the 2016-2017 season that we can use. 

Let's start with cleaning the data and then we'll continue with a general exploratory analysis and some inferential statistics.

### Dataset

The dataset we will be using contains the statistics from the WNBA players for the 2016-2017 season. You will be able to find more information on the dataset in the [codebook](../data/codebook.md) uploaded to the repository.

### Libraries

First we'll import the necessary libraries and increase the maximum number of displayed columns so that you can see the whole dataset in the same window.

In [1]:
import pandas as pd
import numpy as np
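# use the full option name; the short form 'max_columns' is ambiguous in recent pandas versions
pd.set_option('display.max_columns', 100)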

### Load the dataset

Load the dataset into a DataFrame called `wnba` and take an initial look at it using the `head()` method.

In [2]:
#your code here
wnba = pd.read_csv('../data/wnba.csv')
wnba.head()

Unnamed: 0,Name,Team,Pos,Height,Weight,BMI,Birth_Place,Birthdate,Age,College,Experience,Games Played,MIN,FGM,FGA,FG%,3PM,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TO,PTS,DD2,TD3
0,Aerial Powers,DAL,F,183,71.0,21.200991,US,"January 17, 1994",23,Michigan State,2,8,173,30,85,35.3,12,32,37.5,21,26,80.8,6,22,28,12,3,6,12,93,0,0
1,Alana Beard,LA,G/F,185,73.0,21.329438,US,"May 14, 1982",35,Duke,12,30,947,90,177,50.8,5,18,27.8,32,41,78.0,19,82,101,72,63,13,40,217,0,0
2,Alex Bentley,CON,G,170,69.0,23.875433,US,"October 27, 1990",26,Penn State,4,26,617,82,218,37.6,19,64,29.7,35,42,83.3,4,36,40,78,22,3,24,218,0,0
3,Alex Montgomery,SAN,G/F,185,84.0,24.543462,US,"December 11, 1988",28,Georgia Tech,6,31,721,75,195,38.5,21,68,30.9,17,21,81.0,35,134,169,65,20,10,38,188,2,0
4,Alexis Jones,MIN,G,175,78.0,25.469388,US,"August 5, 1994",23,Baylor,R,24,137,16,50,32.0,7,20,35.0,11,12,91.7,3,9,12,12,7,0,14,50,0,0


### Check NaN values
As you know, one of our first steps is to check whether there are any NaN values in the dataset. Look for the columns that contain NaN values and count how many rows have them.

In [3]:
#your code here
wnba.info()
#4   Weight        142 non-null    float64
#5   BMI           142 non-null    float64
#each of these two columns is missing 1 value (142 non-null out of 143 rows)

wnba[90:100]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 143 entries, 0 to 142
Data columns (total 32 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Name          143 non-null    object 
 1   Team          143 non-null    object 
 2   Pos           143 non-null    object 
 3   Height        143 non-null    int64  
 4   Weight        142 non-null    float64
 5   BMI           142 non-null    float64
 6   Birth_Place   143 non-null    object 
 7   Birthdate     143 non-null    object 
 8   Age           143 non-null    int64  
 9   College       143 non-null    object 
 10  Experience    143 non-null    object 
 11  Games Played  143 non-null    int64  
 12  MIN           143 non-null    int64  
 13  FGM           143 non-null    int64  
 14  FGA           143 non-null    int64  
 15  FG%           143 non-null    float64
 16  3PM           143 non-null    int64  
 17  3PA           143 non-null    int64  
 18  3P%           143 non-null    

Unnamed: 0,Name,Team,Pos,Height,Weight,BMI,Birth_Place,Birthdate,Age,College,Experience,Games Played,MIN,FGM,FGA,FG%,3PM,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TO,PTS,DD2,TD3
90,Maimouna Diarra,LA,C,198,90.0,22.956841,SN,"January 30, 1991",26,Sengal,R,9,16,1,3,33.3,0,0,0.0,1,2,50.0,3,4,7,1,1,0,3,3,0,0
91,Makayla Epps,CHI,G,178,,,US,"June 6, 1995",22,Kentucky,R,14,52,2,14,14.3,0,5,0.0,2,5,40.0,2,0,2,4,1,0,4,6,0,0
92,Marissa Coleman,IND,G/F,185,73.0,21.329438,US,"April 1, 1987",30,Maryland,9,30,539,50,152,32.9,27,79,34.2,27,33,81.8,7,53,60,25,8,4,34,154,0,0
93,Matee Ajavon,ATL,G,173,73.0,24.391059,US,"July 5, 1986",31,Syracruse,R,27,218,22,69,31.9,0,3,0.0,29,35,82.9,8,26,34,27,10,0,26,73,0,0
94,Maya Moore,MIN,F,183,80.0,23.888441,US,"November 6, 1989",27,Connecticut,7,29,904,170,398,42.7,52,132,39.4,98,114,86.0,50,106,156,99,53,13,56,490,3,0
95,Monique Currie,PHO,G/F,183,80.0,23.888441,US,"February 25, 1983",34,Duke,11,32,717,121,284,42.6,37,93,39.8,85,103,82.5,19,103,122,67,22,11,48,364,0,0
96,Morgan Tuck,CON,F,188,91.0,25.746944,US,"April 30, 1994",23,Connecticut,1,17,294,35,101,34.7,8,28,28.6,13,16,81.3,9,34,43,19,7,0,15,91,1,0
97,Moriah Jefferson,SAN,G,168,55.0,19.486961,US,"August 3, 1994",23,Connecticut,1,21,514,81,155,52.3,9,20,45.0,20,27,74.1,6,31,37,92,33,2,43,191,0,0
98,Natalie Achonwa,IND,C,193,83.0,22.282477,CA,"November 22, 1992",24,Notre Dame,3,30,529,82,151,54.3,0,0,0.0,43,55,78.2,31,70,101,21,11,16,25,207,0,0
99,Natasha Cloud,WAS,G,183,73.0,21.798202,US,"February 22, 1992",25,Saint Joseph's,3,24,448,37,118,31.4,12,51,23.5,20,27,74.1,7,52,59,69,17,3,23,106,0,0
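Reading the non-null counts off `info()` works, but the NaN count per column can also be computed directly. The following is a minimal sketch on a small made-up DataFrame (the `toy` frame and its values are illustrative stand-ins, not the real `wnba` data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the wnba DataFrame (values are illustrative only)
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Height": [183, 178, 185],
    "Weight": [71.0, np.nan, 73.0],
    "BMI": [21.2, np.nan, 21.3],
})

# Number of NaN values in each column
nan_counts = toy.isna().sum()

# Keep only the columns that actually contain NaN values
print(nan_counts[nan_counts > 0])
```

Applied to `wnba`, the same two lines should list only the columns with missing values, which avoids scanning the full `info()` output by eye.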


We can see that there are only two NaNs in the whole dataset: one in the Weight column and one in the BMI column. Let's look at the actual rows that contain the NaN values.

In [4]:
#your code here
wnba[wnba.isna().any(axis=1)]

Unnamed: 0,Name,Team,Pos,Height,Weight,BMI,Birth_Place,Birthdate,Age,College,Experience,Games Played,MIN,FGM,FGA,FG%,3PM,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TO,PTS,DD2,TD3
91,Makayla Epps,CHI,G,178,,,US,"June 6, 1995",22,Kentucky,R,14,52,2,14,14.3,0,5,0.0,2,5,40.0,2,0,2,4,1,0,4,6,0,0


It looks like there is only a single row that has NaN values in it, which is good! Just in case, let's check how much removing a single row may influence our dataset by calculating the fraction of rows we will be removing.

In [5]:
#your code here
1/wnba['Name'].count()

0.006993006993006993
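The value above is a fraction (about 0.7% of the rows). The share of affected rows can also be derived from the data itself instead of hard-coding the `1`. This is a sketch on a made-up four-row frame, so the numbers differ from the real dataset:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the wnba DataFrame (values are illustrative only)
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Weight": [71.0, np.nan, 73.0, 80.0],
    "BMI": [21.2, np.nan, 21.3, 23.9],
})

# Boolean Series: True for rows with at least one NaN
rows_with_nan = toy.isna().any(axis=1)

# The mean of a boolean Series is the fraction of True values
share = rows_with_nan.mean()
print(f"{100 * share:.1f}% of rows contain at least one NaN")
```

Because the count comes from the mask, the calculation stays correct if more rows with missing values turn up later.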

It is very important to be as careful as possible when dealing with NaN values and to drop data only when it is strictly necessary. This decision also depends on the nature of our analysis. If, for example, our analysis does not require the Weight and BMI of the players at all, we can simply keep the row, given that the NaN values are only present in the Weight and BMI columns.

In this specific example, let's say our decision is to drop it. Write some code to drop the NaN values. 

In [6]:
#your code here
wnba = wnba.dropna(axis=0).reset_index(drop=True)

**Do you think it is a good decision? Think about a case in which you wouldn't want to drop the value.**

In [7]:
#your answer here
#in the case we only wanted to make an age distribution plot
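To illustrate the trade-off, here is a small sketch comparing a blanket `dropna()` with one restricted to the columns an analysis actually uses, via the `subset` parameter. The `toy` frame is a made-up example, and the choice of `Age` as the only required column is an assumption for illustration:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the wnba DataFrame (values are illustrative only)
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [23, 22, 30],
    "Weight": [71.0, np.nan, 73.0],
    "BMI": [21.2, np.nan, 21.3],
})

# Checking every column removes player B
dropped_all = toy.dropna().reset_index(drop=True)

# Checking only the column needed for an age analysis keeps player B
age_only = toy.dropna(subset=["Age"]).reset_index(drop=True)

print(len(dropped_all), len(age_only))
```

With `subset`, a row is dropped only when it is missing a value the analysis depends on.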


### Let's make an overview of the dataset
First, check the data types of our data:

In [8]:
#your code here
wnba.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142 entries, 0 to 141
Data columns (total 32 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Name          142 non-null    object 
 1   Team          142 non-null    object 
 2   Pos           142 non-null    object 
 3   Height        142 non-null    int64  
 4   Weight        142 non-null    float64
 5   BMI           142 non-null    float64
 6   Birth_Place   142 non-null    object 
 7   Birthdate     142 non-null    object 
 8   Age           142 non-null    int64  
 9   College       142 non-null    object 
 10  Experience    142 non-null    object 
 11  Games Played  142 non-null    int64  
 12  MIN           142 non-null    int64  
 13  FGM           142 non-null    int64  
 14  FGA           142 non-null    int64  
 15  FG%           142 non-null    float64
 16  3PM           142 non-null    int64  
 17  3PA           142 non-null    int64  
 18  3P%           142 non-null    

It looks like most of the data types are correct. The Birthdate column could be cast to a `datetime` type; however, we won't use it in our analysis, so for simplicity let's leave it as an `object`. The Weight column could also be cast to an `int64` type, since all of its values are whole numbers.
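For reference, the Birthdate conversion we are skipping could look roughly like the sketch below. It uses three date strings in the same "Month day, Year" layout as the dataset and assumes English month names (the `%B` directive is locale-dependent):

```python
import pandas as pd

# A few strings in the same layout as the Birthdate column
birthdates = pd.Series(["January 17, 1994", "May 14, 1982", "August 5, 1994"])

# An explicit format avoids ambiguous parsing
parsed = pd.to_datetime(birthdates, format="%B %d, %Y")

print(parsed.dt.year.tolist())
```

Once the column is a `datetime`, the `.dt` accessor gives access to components such as year and month.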

**Let's change the type of Weight column for practice.**

In [9]:
#your code here
wnba['Weight']=wnba['Weight'].astype('int64')
wnba.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142 entries, 0 to 141
Data columns (total 32 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Name          142 non-null    object 
 1   Team          142 non-null    object 
 2   Pos           142 non-null    object 
 3   Height        142 non-null    int64  
 4   Weight        142 non-null    int64  
 5   BMI           142 non-null    float64
 6   Birth_Place   142 non-null    object 
 7   Birthdate     142 non-null    object 
 8   Age           142 non-null    int64  
 9   College       142 non-null    object 
 10  Experience    142 non-null    object 
 11  Games Played  142 non-null    int64  
 12  MIN           142 non-null    int64  
 13  FGM           142 non-null    int64  
 14  FGA           142 non-null    int64  
 15  FG%           142 non-null    float64
 16  3PM           142 non-null    int64  
 17  3PA           142 non-null    int64  
 18  3P%           142 non-null    
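Casting floats to `int64` truncates any fractional part, and it raises an error if NaN values are still present, so it can be worth confirming that the values are whole numbers before converting. A minimal sketch on a made-up Series (not the real Weight column):

```python
import pandas as pd

# Illustrative float weights, all whole numbers
weights = pd.Series([71.0, 73.0, 69.0, 84.0])

# True only if no value has a fractional part
all_whole = (weights % 1 == 0).all()

# Convert only when nothing would be lost
if all_whole:
    weights = weights.astype("int64")

print(weights.dtype)
```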

**After checking the data types, let's check for outliers using the describe() method.**

In [13]:
#your code here
wnba.describe()

Unnamed: 0,Height,Weight,BMI,Age,Games Played,MIN,FGM,FGA,FG%,3PM,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TO,PTS,DD2,TD3
count,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0
mean,184.612676,78.978873,23.091214,27.112676,24.429577,500.105634,74.401408,168.704225,43.102817,14.830986,43.697183,24.978169,39.535211,49.422535,75.828873,22.06338,61.591549,83.65493,44.514085,17.725352,9.78169,32.288732,203.169014,1.140845,0.007042
std,8.698128,10.99611,2.073691,3.66718,7.075477,289.373393,55.980754,117.165809,9.855199,17.372829,46.155302,18.459075,36.743053,44.244697,18.536151,21.519648,49.669854,68.200585,41.49079,13.413312,12.537669,21.447141,153.032559,2.909002,0.083918
min,165.0,55.0,18.390675,21.0,2.0,12.0,1.0,3.0,16.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,0.0,0.0,0.0,2.0,2.0,0.0,0.0
25%,175.75,71.5,21.785876,24.0,22.0,242.25,27.0,69.0,37.125,0.0,3.0,0.0,13.0,17.25,71.575,7.0,26.0,34.25,11.25,7.0,2.0,14.0,77.25,0.0,0.0
50%,185.0,79.0,22.873314,27.0,27.5,506.0,69.0,152.5,42.05,10.5,32.0,30.55,29.0,35.5,80.0,13.0,50.0,62.5,34.0,15.0,5.0,28.0,181.0,0.0,0.0
75%,191.0,86.0,24.180715,30.0,29.0,752.5,105.0,244.75,48.625,22.0,65.5,36.175,53.25,66.5,85.925,31.0,84.0,116.5,66.75,27.5,12.0,48.0,277.75,1.0,0.0
max,206.0,113.0,31.55588,36.0,32.0,1018.0,227.0,509.0,100.0,88.0,225.0,100.0,168.0,186.0,100.0,113.0,226.0,334.0,206.0,63.0,64.0,87.0,584.0,17.0,1.0


**Comment on your result. What do you see?**

In [11]:
#your answer here
# DD2 has a large outlier: the max is 17 while the 75th percentile is 1
# to be more precise, I would display a boxplot for every column
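As a numeric complement to boxplots, the common 1.5 × IQR rule (the same rule boxplot whiskers use by default in matplotlib) can flag candidate outliers per column. This is a sketch on made-up values, not the real dataset, and the 1.5 multiplier is a convention rather than a hard threshold:

```python
import pandas as pd

# Toy numeric columns (values are illustrative only)
toy = pd.DataFrame({
    "DD2": [0, 0, 0, 1, 1, 2, 3, 17],
    "Age": [21, 23, 24, 26, 27, 28, 30, 34],
})

# Interquartile range per column
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outlier_mask = (toy < lower) | (toy > upper)

# Number of flagged values in each column
print(outlier_mask.sum())
```

A flagged value is not necessarily an error; in sports data an extreme value may simply belong to an exceptional player.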

**Now we can save the cleaned data to a new .csv file called `wnba_clean.csv` in the data folder.**

In [19]:
#your code here
# relative path, matching the location the dataset was loaded from
wnba.to_csv('../data/wnba_clean.csv', index=False)
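After saving, a quick round-trip check confirms that the file reads back as expected. The sketch below writes a made-up two-row frame to a temporary directory so that it can run anywhere; in the lab you would point it at the file in the data folder instead:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the cleaned DataFrame (values are illustrative only)
toy = pd.DataFrame({"Name": ["A", "B"], "Weight": [71, 73]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "wnba_clean.csv"

    # index=False keeps the row index out of the file
    toy.to_csv(out_path, index=False)

    # Read the file back to verify its contents
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))
```

Saving without `index=False` would add an extra unnamed index column when the file is read back.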