# Inferential statistics
## Part III - Inferential Analysis

We're now going to look for answers to the ongoing basketball discussions between you and your family. The main ones we want to research are the following:

- Your grandmother says that your sister couldn't play in a professional basketball league (not only the WNBA, but ANY professional basketball league) because she's too skinny and lacks muscle.
- Your sister says that most female professional players fail their free throws.
- Your brother-in-law heard on the TV that the average assists among NBA (male) and WNBA (female) players is 52 for the 2016-2017 season. He is convinced this average would be higher if we only considered the players from the WNBA.

Let's investigate these claims and see if we can find proof to refute or support them.

### Libraries
Import the necessary libraries first.

In [None]:
# Libraries
import math
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from scipy.stats import ttest_1samp
# the full option name avoids an ambiguous-key error in recent pandas versions
pd.set_option('display.max_columns', 50)

### Load the dataset

Load the cleaned dataset.

In [None]:
wnba = pd.read_csv('../data/wnba_clean.csv')
wnba.head()

# Question 1: Can my sister play in a professional female basketball league?

As we said, your grandmother is convinced that your sister couldn't play in a professional league because of her physique and weight (she weighs 67 kg).

To answer the question, we first need to know the average weight of a professional female basketball player. The data we have only covers the WNBA and not every female professional basketball league in the world, so we have no way of calculating it directly.

Still, given that we do have *some* data we can **infer** it using a sample of players like the one we have. 

**How would you do it? Try and think about the requirements that your sample must satisfy in order to be used to infer the average weight. Do you feel it actually fulfills those requirements? Do you need to make any assumptions? We could calculate a confidence interval to do the inference, but do you know any other ways?**
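One other way is a bootstrap: resample the observed weights with replacement many times and read the interval off the distribution of the resampled means. The sketch below is only an illustration under stated assumptions: the `weights` array is synthetic (drawn with a made-up mean and spread), and `bootstrap_mean_ci` is a hypothetical helper name, not something from the notebook.

```python
import numpy as np

# Synthetic stand-in for wnba.Weight (assumed values, NOT the real data)
rng = np.random.default_rng(0)
weights = rng.normal(loc=79, scale=10, size=140)

def bootstrap_mean_ci(values, n_resamples=10_000, confidence=0.95, rng=None):
    """Percentile bootstrap confidence interval for the mean (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values, dtype=float)
    # Draw n_resamples samples of the same size, with replacement
    idx = rng.integers(0, len(values), size=(n_resamples, len(values)))
    means = values[idx].mean(axis=1)
    alpha = 1 - confidence
    lower, upper = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return lower, upper

low, high = bootstrap_mean_ci(weights, rng=rng)
print(f"bootstrap 95% CI for the mean: ({low:.2f}, {high:.2f})")
```

With the real data, `weights` would be replaced by `wnba.Weight.dropna().to_numpy()`. The bootstrap does not require the weights to be normally distributed, but it still assumes the sample is representative of the population.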

In [None]:
"""
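We can infer the average weight of all professional female players by computing a confidence interval
from the WNBA sample.
For this to be valid, the sample should be representative of the whole population (we have to assume
that WNBA players resemble the players of other professional leagues), and the sample mean should be
approximately normally distributed. That holds if the weights are roughly normal or if the sample is
large enough for the central limit theorem to apply.

Alternatively, I could run a hypothesis test on the sample mean to check 
whether the weight of the sister is significantly different from the inferred population average.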

"""

**Now that all the requirements have been taken into account, compute the confidence interval of the average weight with a confidence level of 95%.**

In [None]:
wnba.Weight.agg(['mean','std'])

In [None]:
weight_mean = wnba.Weight.mean()
weight_size = len(wnba.Weight)
se = wnba.Weight.std()/weight_size**0.5

conf_interval = stats.t.interval(.95,df=weight_size-1,loc=weight_mean,scale=se)
conf_interval
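To make explicit what `stats.t.interval` computes, the interval is the sample mean plus or minus the critical t-value times the standard error. The sketch below checks this by hand on synthetic data (the `sample` values are made up for illustration and are not the WNBA weights).

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the weight column (assumed values, NOT the real data)
rng = np.random.default_rng(1)
sample = rng.normal(loc=79, scale=10, size=140)

n = sample.size
mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)      # ddof=1 matches the default of pandas .std()
t_crit = stats.t.ppf(0.975, df=n - 1)     # two-sided 95% interval uses the 0.975 quantile

manual = (mean - t_crit * se, mean + t_crit * se)
library = stats.t.interval(0.95, df=n - 1, loc=mean, scale=se)

print("manual :", manual)
print("library:", library)
```

Both tuples should agree up to floating-point error, which shows that `loc` and `scale` simply shift and stretch the standard t-distribution.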

**What can you say about these results?**

In [None]:
"""
We can see that the interval is quite narrow even though the standard deviation of our sample is about 10,
because the standard error shrinks as the sample size grows.

"""

**If your sister weighs 67kg what would you tell your grandmother in regards to her assumption?**
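One way to look at this question: a confidence interval for the mean says where the average lies, not how individual players are spread, so it can help to compare the 67 kg with the spread of the sample as well. The sketch below is only illustrative: the mean of 79 kg is a made-up placeholder, the standard deviation of 10 kg is an assumed round figure, and the normal shape of the weights is an assumption too.

```python
from scipy import stats

# Hypothetical summary statistics (assumptions for illustration only)
mean_weight = 79.0    # placeholder sample mean in kg
std_weight = 10.0     # assumed sample standard deviation in kg
sister_weight = 67.0

# How many standard deviations below the mean is a 67 kg player?
z = (sister_weight - mean_weight) / std_weight

# Share of players expected to be lighter, IF weights are roughly normal
share_lighter = stats.norm.cdf(z)

print(f"z-score: {z:.2f}")
print(f"expected share of lighter players: {share_lighter:.1%}")
```

With the real data, the empirical share can be read directly with `(wnba.Weight < 67).mean()`, which needs no normality assumption.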

In [None]:
"""
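Our sister's weight is low compared to the average we inferred, so our grandmother is right that she is
lighter than the typical professional player.
That alone does not show she couldn't play as a pro: the interval describes the average, not individual players,
and with a standard deviation of about 10 kg many players weigh well below the average.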

"""

## Bonus: Can you plot the probability distribution of the average weight, indicating where the critical region is?

In [None]:
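# Sampling distribution of the mean weight: a t-distribution centred on the sample mean
x = np.linspace(weight_mean - 4*se, weight_mean + 4*se, 400)
plt.figure(figsize=(20,10))
plt.plot(x, stats.t.pdf(x, df=weight_size-1, loc=weight_mean, scale=se), color='lightblue')
plt.axvline(x=conf_interval[0],color='red',linestyle='dashed',label='critical_values')
plt.axvline(x=conf_interval[1],color='red',linestyle='dashed')
plt.legend()
plt.show()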

# Question 2: Do female professional basketball players fail the majority of their free throws?

You do not agree with your sister when she says that most female players fail their free throws. You decide to try and estimate the percentage of players that fail more than 40% of their free throws using, you guessed it, the WNBA sample.

**How would you do it? Try and think about the requirements that your sample must satisfy in order to be used to infer the proportion of players that miss more than 40% of their free throws. Do you feel it actually fulfills those requirements? Do you need to make any assumptions?**

In [None]:
"""
We want to estimate, with a high confidence level, the proportion of female players who miss more than 40%
of their free throws.
For that I would build a confidence interval for this proportion.

We can use the FT% column, though we should keep in mind that it is affected by playing time:
a player with few attempts can end up with a very high or very low FT%.

The sample should be representative of all professional players, and it should contain enough players on both
sides of the 40% threshold for the normal approximation of a proportion to hold (with a reasonably large sample,
the t-distribution gives nearly the same interval).

Finally, FT% is the percentage of successful free throws, so we have to reverse it:
a player who misses more than 40% of her FT has an FT% below 60.

"""

**Now that all the requirements have been taken into account, compute the confidence interval of the proportion with a confidence level of 95%:**

In [None]:
wnba['FT%'].plot(kind='density') # we can see the FT% is kinda normally distributed.

In [None]:
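n  = len(wnba['FT%'])
# FT% is the success rate, so a player misses more than 40% when 1 - FT%/100 > 0.40
fails_40 = len(wnba[(1 - wnba['FT%']/100)>.40])
p = fails_40/n
se = (p*(1-p)/n)**0.5  # standard error of a sample proportion

# a normal (z) interval is the usual choice for a proportion; for large n the t-interval is nearly identical
conf_interval = stats.t.interval(.95, df=n-1, loc=p, scale=se)
conf_interval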
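For a proportion, the textbook interval uses the normal distribution rather than the t-distribution, and the Wilson score interval is a common alternative when the proportion is close to 0 or 1. The sketch below compares the two. `proportion_ci` is a hypothetical helper, and the counts (14 out of 140) are made up for illustration, not taken from the dataset.

```python
import numpy as np
from scipy import stats

def proportion_ci(successes, n, confidence=0.95):
    """Return (normal-approximation, Wilson) intervals for a proportion (hypothetical helper)."""
    p = successes / n
    z = stats.norm.ppf(1 - (1 - confidence) / 2)

    # Normal approximation: p +/- z * sqrt(p(1-p)/n)
    half = z * np.sqrt(p * (1 - p) / n)
    normal_approx = (p - half, p + half)

    # Wilson score interval: recentred and rescaled, never leaves [0, 1]
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_w = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    wilson = (centre - half_w, centre + half_w)
    return normal_approx, wilson

# Made-up counts: 14 of 140 players missing more than 40% of their free throws
normal_approx, wilson = proportion_ci(14, 140)
print("normal approximation:", normal_approx)
print("Wilson              :", wilson)
```

The normal approximation is usually considered safe when both `n*p` and `n*(1-p)` are at least about 10. Closer to the boundaries, the Wilson interval tends to behave better.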

**What can you comment about our result? What would you tell your sister?**

In [None]:
"""
We can conclude with a 95% confidence level that between 4.9% and 14.8% of female professional basketball 
players miss more than 40% of their FT.

The sister is wrong: only a small minority of female professional players miss that many of their FT,
let alone most of them.

--
Questions: 
1. Should I have normalized the data by taking the time played into account? 
2. How should I handle stats that are equal to 0, probably because some players
didn't play much?

"""

## Bonus: Can you plot the probability distribution of the proportion of missed free throws, indicating where the critical region is?

In [None]:
x = (1 - wnba['FT%']/100)*100

plt.figure(figsize=(20,10))
x.plot(kind='density', color='lightblue')
plt.vlines(x=40,ymin=0,ymax=.035,color='red',linestyle='dashed',label='critical_value')
plt.vlines(x=100,ymin=0,ymax=.035,color='red',linestyle='dashed')
plt.xlim(0,100)
plt.legend()
plt.show()

# Question 3: Is the average number of assists for WNBA players only higher than the average for WNBA and NBA players together?

Your brother-in-law is convinced that the average assists for female professional players is higher than the average of both female and male players combined (which is 52 for the 2016-2017 season). You would like to actually prove if this is true or not but you remember your stats teacher saying "you can't *prove* anything, you just can say that *you are not* saying foolishness".

**How would you do it? Try and think about the requirements that your sample must satisfy in order to do that. Do you feel it actually fulfills those requirements? Do you need to make any assumptions?**

In [None]:
"""
I would use hypothesis testing to check whether the average assists of female professional players
is significantly different from the average of both female and male players (52).
 
I will use a one-sample t-test. It assumes the sample mean is approximately normally distributed,
which is reasonable for a large enough sample even if the assists themselves are skewed.

For the two-sided test, the null hypothesis is H0: mu = 52 and the alternative hypothesis is H1: mu != 52.
For the one-sided test, H0: mu = 52 and the alternative hypothesis is H1: mu > 52,
since the brother-in-law claims the WNBA average is higher.

"""

**Use a two-tailed one-sample t-test to see if we can reject (or not) the null hypothesis with a 95% confidence level.**

In [None]:
# checking the mean and standard deviation of WNBA assists data

wnba['AST'].agg(['mean','std'])

In [None]:
# run the one-sample t-test against the reference mean of 52

ttest_1samp(wnba['AST'],52)

In [None]:
# check the critical values with a 95% confidence level

stats.t.interval(.95,df=len(wnba['AST'])-1)
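The t-statistic and the p-value are two views of the same test, so the decision based on the critical value and the decision based on the p-value must agree. The sketch below computes the statistic by hand and checks both rules. The `assists` array is synthetic (a skewed gamma sample chosen as an assumption) and does not reproduce the WNBA numbers.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed stand-in for wnba['AST'] (assumed values, NOT the real data)
rng = np.random.default_rng(2)
assists = rng.gamma(shape=1.2, scale=37, size=140)
mu0 = 52  # reference mean from the null hypothesis

n = assists.size
# t = (sample mean - mu0) / standard error
t_manual = (assists.mean() - mu0) / (assists.std(ddof=1) / np.sqrt(n))
t_scipy, p_value = stats.ttest_1samp(assists, mu0)

# Two-sided rule: reject H0 when |t| exceeds the 0.975 quantile
t_crit = stats.t.ppf(0.975, df=n - 1)
reject_by_t = abs(t_manual) > t_crit
reject_by_p = p_value < 0.05

print(f"t (manual) = {t_manual:.3f}, t (scipy) = {t_scipy:.3f}, p = {p_value:.4f}")
print("reject by critical value:", reject_by_t, "| reject by p-value:", reject_by_p)
```

Note that the comparison uses the absolute value of the statistic: a strongly negative t is as much evidence against H0 in a two-sided test as a strongly positive one.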

In [None]:
"""
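For a two-sided test we compare the absolute value of the t-statistic with the critical value, 
and that comparison always agrees with the p-value.
Our pvalue is smaller than 0.05, so the t-statistic lies outside the interval above and we reject 
the Null Hypothesis: the WNBA average differs significantly from 52.

This does not tell us whether the difference goes in the direction the brother-in-law claims,
so we should run a one sample one-sided test as well.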

"""

**Now use a one-tailed one-sample t-test to see if we can reject (or not) the null hypothesis with a 95% confidence level.**

In [None]:
# The central 90% interval leaves 5% in each tail, so its upper bound is the one-sided critical value at a 95% confidence level.

stats.t.interval(.90,df=len(wnba['AST'])-1)
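Instead of reading the one-sided critical value off a 90% interval, the test can be run directly with the `alternative` argument of `ttest_1samp` (available in SciPy 1.6 and later). The sketch below uses a synthetic `assists` array as an assumption, so its numbers say nothing about the real WNBA data.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for wnba['AST'] (assumed values, NOT the real data)
rng = np.random.default_rng(3)
assists = rng.gamma(shape=1.2, scale=37, size=140)
mu0 = 52

two_sided = stats.ttest_1samp(assists, mu0)
# H1: mu > 52, the direction of the brother-in-law's claim (needs SciPy >= 1.6)
greater = stats.ttest_1samp(assists, mu0, alternative='greater')

# One-sided critical value at alpha = 0.05: the 0.95 quantile,
# which equals the upper bound of the central 90% interval
dof = assists.size - 1
t_crit_one_sided = stats.t.ppf(0.95, df=dof)
upper_of_90 = stats.t.interval(0.90, df=dof)[1]

print(f"t = {greater.statistic:.3f}")
print(f"two-sided p = {two_sided.pvalue:.4f}, one-sided p = {greater.pvalue:.4f}")
print(f"one-sided critical value = {t_crit_one_sided:.3f}")
```

H0 is rejected in favour of `mu > 52` only when the statistic is above `t_crit_one_sided`. A negative statistic can never support that alternative, however large its absolute value.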

In [None]:
"""
For H1: mu > 52 we reject the Null hypothesis only if the t-statistic is larger than the positive critical value.
Our t-statistic is smaller than the positive critical value, so we can't reject the Null hypothesis. 

At a 95% confidence level, the data do not support the claim that the average assists of WNBA players
is higher than the average of both female and male players in NBA and WNBA.

"""

## Bonus: Can you plot the resulting t-distribution of both tests? Indicate where the critical region is and where your statistic falls.

In [None]:
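# t-distribution under H0 with n - 1 degrees of freedom, plus the observed statistic
dof = len(wnba['AST']) - 1
x = np.linspace(-4, 4, 400)
t_stat = ttest_1samp(wnba['AST'], 52).statistic

In [None]:
plt.figure(figsize=(20,10))
plt.plot(x, stats.t.pdf(x, dof), color='lightblue')
lo, hi = stats.t.interval(.95, df=dof)  # two-tailed critical values
plt.axvline(lo, color='red', linestyle='dashed', label='two-tailed critical values')
plt.axvline(hi, color='red', linestyle='dashed')
plt.axvline(stats.t.ppf(.95, dof), color='orange', linestyle='dashed', label='one-tailed critical value')
plt.axvline(t_stat, color='black', label='t-statistic')
plt.legend()
plt.show()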

# MegaBonus: Satisfying your curiosity

You finally managed to solve your family's debates over basketball! While you were doing that you started to take an interest in the normal distribution.

You read that the normal distribution shows up in a lot of natural phenomena, like blood pressure, IQ, weight and height. If, for example, we could plot the distribution of the weights of every human on the planet right now, it would have roughly the shape of a normal distribution.

In light of this, you would like to check whether the weights of the WNBA players could be a sample drawn from a normally distributed population, because theoretically this should be the case.

**How would you try to demonstrate that our sample fits a normal distribution? What kind of test would you use? Would you have to make any assumptions?**
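One possible approach, sketched here on synthetic data, is to combine a formal normality test with a Q-Q check. The `weights` array below is drawn from a normal distribution with made-up parameters, so it only demonstrates the calls. The choice of tests (Shapiro-Wilk and D'Agostino-Pearson) is a suggestion, not the notebook's prescribed answer.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for wnba.Weight (assumed values, NOT the real data)
rng = np.random.default_rng(4)
weights = rng.normal(loc=79, scale=10, size=140)

# H0 for both tests: the sample comes from a normal distribution
shapiro_stat, shapiro_p = stats.shapiro(weights)
dagostino_stat, dagostino_p = stats.normaltest(weights)

# Q-Q check: correlation between ordered data and theoretical normal quantiles
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(weights, dist="norm")

print(f"Shapiro-Wilk p-value      : {shapiro_p:.3f}")
print(f"D'Agostino-Pearson p-value: {dagostino_p:.3f}")
print(f"Q-Q correlation r         : {r:.4f}")
```

A large p-value only means the test found no evidence against normality. It does not prove that the population is normal, which echoes the stats teacher's remark about not being able to prove anything.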

In [None]:
#your-answer-here

In [None]:
# your code here

**What are your comments in regards to the results of the test?**

In [None]:
#your-answer-here