# Inferential statistics
## Part III - Inferential Analysis

We're now going to look for answers to the ongoing basketball discussions between you and your family. The main ones we want to research are the following:

- Your grandmother says that your sister couldn't play in a professional basketball league (not only the WNBA, but ANY professional basketball league) because she's too skinny and lacks muscle.
- Your sister says that most female professional players fail their free throws.
- Your brother-in-law heard on the TV that the average assists among NBA (male) and WNBA (female) players is 52 for the 2016-2017 season. He is convinced this average would be higher if we only considered the players from the WNBA.

Let's investigate these claims and see if we can find proof to refute or support them.

### Libraries
Import the necessary libraries first.

In [44]:
# Libraries
import math
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from scipy.stats import ttest_1samp
pd.set_option('display.max_columns', 50)

### Load the dataset

Load the cleaned dataset.

In [45]:
#your code here
wnba = pd.read_csv('../data/wnba_clean.csv')
wnba.head()

Unnamed: 0,Name,Team,Pos,Height,Weight,BMI,Birth_Place,Birthdate,Age,College,Experience,Games Played,MIN,FGM,FGA,FG%,3PM,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TO,PTS,DD2,TD3
0,Aerial Powers,DAL,F,183,71,21.200991,US,"January 17, 1994",23,Michigan State,2,8,173,30,85,35.3,12,32,37.5,21,26,80.8,6,22,28,12,3,6,12,93,0,0
1,Alana Beard,LA,G/F,185,73,21.329438,US,"May 14, 1982",35,Duke,12,30,947,90,177,50.8,5,18,27.8,32,41,78.0,19,82,101,72,63,13,40,217,0,0
2,Alex Bentley,CON,G,170,69,23.875433,US,"October 27, 1990",26,Penn State,4,26,617,82,218,37.6,19,64,29.7,35,42,83.3,4,36,40,78,22,3,24,218,0,0
3,Alex Montgomery,SAN,G/F,185,84,24.543462,US,"December 11, 1988",28,Georgia Tech,6,31,721,75,195,38.5,21,68,30.9,17,21,81.0,35,134,169,65,20,10,38,188,2,0
4,Alexis Jones,MIN,G,175,78,25.469388,US,"August 5, 1994",23,Baylor,R,24,137,16,50,32.0,7,20,35.0,11,12,91.7,3,9,12,12,7,0,14,50,0,0


# Question 1: Can my sister play in a professional female basketball league?

As we said, your grandmother is convinced that your sister couldn't play in a professional league because of her physique and weight (her weight is 67kg).

To find an actual answer to the question we first need to know what's the average weight of a professional female basketball player. The data we have only refers to the WNBA league and not to every female professional basketball league in the world, therefore we have no way of actually calculating it.

Still, given that we do have *some* data we can **infer** it using a sample of players like the one we have. 

**How would you do it? Try and think about the requirements that your sample must satisfy in order to be used to infer the average weight. Do you feel it actually fulfills those requirements? Do you need to make any assumptions? We could calculate a confidence interval to do the inference, but do you know any other ways?**

In [46]:
# your answer here
from scipy import stats

# For the first question we could test whether the mean weight of professional players differs
# from the weight of the sister (67 kg) with a one-sample t-test, using these hypotheses:

# H0: Mu = 67
# H1: Mu != 67

# This requires the WNBA sample to be representative of all professional female players and the
# observations to be independent, which we have to assume.
# However, in my opinion success (e.g. making the team) cannot be decided by just one variable.
# A better approach would be to collect more data and run a multiple regression to see how
# several variables relate to the outcome.

**Now that all the requirements have been taken into account, compute the confidence interval of the average weight with a confidence level of 95%.**

In [47]:
# your code here

sample = wnba['Weight']

weight_sis = 67 # kg 
sample_mean = np.mean(sample)
df = len(sample) - 1
confidence_level = 0.95

# Use the t-distribution because the population standard deviation is unknown
stats.t.interval(confidence_level, df, loc=sample_mean, scale=stats.sem(sample))

(77.15461406720749, 80.80313241166576)

In [48]:
# Alternative way to determine confidence interval based on t-test
import statsmodels.stats.api as sms

sms.DescrStatsW(sample).tconfint_mean()

(77.15461406720749, 80.80313241166576)

In [49]:
# With a sample size above 100 the normal distribution gives almost the same (slightly narrower) interval
stats.norm.interval(0.95, loc=sample_mean, scale=stats.sem(sample))

(77.17027122332428, 80.78747525554897)
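The three intervals above follow the same recipe, mean ± critical value × standard error, and differ only in the critical value. The sketch below makes that explicit. It is a minimal illustration, not the notebook's method: `mean_ci` is a hypothetical helper and the weights are synthetic values, not the WNBA data.

```python
import numpy as np
from scipy import stats


def mean_ci(sample, confidence=0.95, use_t=True):
    """Confidence interval for the mean: mean +/- critical value * standard error."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    mean = sample.mean()
    sem = sample.std(ddof=1) / np.sqrt(n)   # same quantity as stats.sem(sample)
    tail = 1 - (1 - confidence) / 2         # 0.975 for a 95% interval
    if use_t:
        critical = stats.t.ppf(tail, n - 1)  # Student t with n - 1 degrees of freedom
    else:
        critical = stats.norm.ppf(tail)      # standard normal
    return mean - critical * sem, mean + critical * sem


# Synthetic weights in kg (hypothetical, for illustration only)
rng = np.random.default_rng(0)
fake_weights = rng.normal(loc=79, scale=11, size=140)

t_low, t_high = mean_ci(fake_weights, use_t=True)
z_low, z_high = mean_ci(fake_weights, use_t=False)

print("t interval:", (t_low, t_high))
print("z interval:", (z_low, z_high))
```

The t interval is always a little wider than the normal one, and the gap shrinks as the sample grows.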

**What can you say about these results?**

In [50]:
#your-answer-here
'''
With 95% confidence, the mean weight of the population lies between 77.15 kg and 80.80 kg (t interval).
The normal approximation gives almost the same range: 77.17 kg to 80.79 kg.
'''

'\nWith 95% confidence, the mean weight of the population lies between 77.15 kg and 80.80 kg (t interval).\nThe normal approximation gives almost the same range: 77.17 kg to 80.79 kg.\n'

**If your sister weighs 67kg what would you tell your grandmother in regards to her assumption?**

In [51]:
#your-answer-here
'''
This doesn't say much about the assumption of the grandmother due to the fact that we are dealing with the 
range of the population mean and not the individual case of the sister.
'''

"\nThis doesn't say much about the assumption of the grandmother due to the fact that we are dealing with the \nrange of the population mean and not the individual case of the sister.\n"

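To say something about an individual rather than about the mean, one option is to look at where 67 kg falls among individual weights. The sketch below contrasts a confidence interval for the mean with a prediction interval for a single player. It runs on synthetic data (hypothetical values, not the WNBA sample), and the prediction interval assumes roughly normal weights.

```python
import numpy as np
from scipy import stats

# Synthetic weights in kg (hypothetical, NOT the WNBA data)
rng = np.random.default_rng(1)
fake_weights = rng.normal(loc=79, scale=11, size=140)
sister_weight = 67

n = fake_weights.size
mean = fake_weights.mean()
sd = fake_weights.std(ddof=1)
t_crit = stats.t.ppf(0.975, n - 1)

# 95% confidence interval for the MEAN weight
ci_margin = t_crit * sd / np.sqrt(n)
ci = (mean - ci_margin, mean + ci_margin)

# 95% prediction interval for ONE new player (assumes roughly normal weights)
pred_margin = t_crit * sd * np.sqrt(1 + 1 / n)
prediction = (mean - pred_margin, mean + pred_margin)

# Share of sampled players weighing no more than the sister
share_lighter = np.mean(fake_weights <= sister_weight)

print("CI for the mean:      ", ci)
print("Prediction interval:  ", prediction)
print("Share at or below 67: ", share_lighter)
```

The prediction interval is much wider than the confidence interval, which is why a weight outside the confidence interval for the mean can still be quite ordinary for an individual player.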
## Bonus: Can you plot the probability distribution of the average weight, indicating where the critical region is?

In [8]:
# your code here


# Question 2: Do female professional basketball players fail the majority of their free throws?

You do not agree with your sister when she says that most female players fail their free throws. You decide to try and estimate the percentage of players that fail more than 40% of their free throws using, you guessed it, the WNBA sample.

**How would you do it? Try and think about the requirements that your sample must satisfy in order to be used to infer the proportion of players that miss more than 40% of their free throws. Do you feel it actually fulfills those requirements? Do you need to make any assumptions?**

In [38]:
# your answer here
# | FTM | Free throws made |
# | FTA | Free throws attempted |

# 1. First we compute each player's free-throw success rate, FTM / FTA
# 2. Then we flag the players whose success rate is below 0.6 (they miss more than 40%)
# 3. Create a confidence interval of the proportion with a confidence level of 95%

# Create a function that takes the made and attempted free throws
def free_throw_rate(made, attempted):
    # returns the share of attempts that were made (NaN when there are no attempts)
    return made / attempted

# Store the success rate in a new column
wnba['FTA-1'] = free_throw_rate(wnba['FTM'], wnba['FTA'])

# View the dataframe
wnba

Unnamed: 0,Name,Team,Pos,Height,Weight,BMI,Birth_Place,Birthdate,Age,College,Experience,Games Played,MIN,FGM,FGA,FG%,3PM,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TO,PTS,DD2,TD3,FTA-1
0,Aerial Powers,DAL,F,183,71,21.200991,US,"January 17, 1994",23,Michigan State,2,8,173,30,85,35.3,12,32,37.5,21,26,80.8,6,22,28,12,3,6,12,93,0,0,0.807692
1,Alana Beard,LA,G/F,185,73,21.329438,US,"May 14, 1982",35,Duke,12,30,947,90,177,50.8,5,18,27.8,32,41,78.0,19,82,101,72,63,13,40,217,0,0,0.780488
2,Alex Bentley,CON,G,170,69,23.875433,US,"October 27, 1990",26,Penn State,4,26,617,82,218,37.6,19,64,29.7,35,42,83.3,4,36,40,78,22,3,24,218,0,0,0.833333
3,Alex Montgomery,SAN,G/F,185,84,24.543462,US,"December 11, 1988",28,Georgia Tech,6,31,721,75,195,38.5,21,68,30.9,17,21,81.0,35,134,169,65,20,10,38,188,2,0,0.809524
4,Alexis Jones,MIN,G,175,78,25.469388,US,"August 5, 1994",23,Baylor,R,24,137,16,50,32.0,7,20,35.0,11,12,91.7,3,9,12,12,7,0,14,50,0,0,0.916667
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
137,Tiffany Hayes,ATL,G,178,70,22.093170,US,"September 20, 1989",27,Connecticut,6,29,861,144,331,43.5,43,112,38.4,136,161,84.5,28,89,117,69,37,8,50,467,0,0,0.844720
138,Tiffany Jackson,LA,F,191,84,23.025685,US,"April 26, 1985",32,Texas,9,22,127,12,25,48.0,0,1,0.0,4,6,66.7,5,18,23,3,1,3,8,28,0,0,0.666667
139,Tiffany Mitchell,IND,G,175,69,22.530612,US,"September 23, 1984",32,South Carolina,2,27,671,83,238,34.9,17,69,24.6,94,102,92.2,16,70,86,39,31,5,40,277,0,0,0.921569
140,Tina Charles,NY,F/C,193,84,22.550941,US,"May 12, 1988",29,Connecticut,8,29,952,227,509,44.6,18,56,32.1,110,135,81.5,56,212,268,75,21,22,71,582,11,0,0.814815


In [41]:
sample_below = wnba.loc[(wnba['FTA-1'] < .6)]
sample_below

Unnamed: 0,Name,Team,Pos,Height,Weight,BMI,Birth_Place,Birthdate,Age,College,Experience,Games Played,MIN,FGM,FGA,FG%,3PM,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TO,PTS,DD2,TD3,FTA-1
11,Alyssa Thomas,CON,F,188,84,23.76641,US,"December 4, 1992",24,Maryland,3,28,833,154,303,50.8,0,3,0.0,91,158,57.6,34,158,192,136,48,11,87,399,4,0,0.575949
32,Cayla George,PHO,C,193,87,23.356332,AU,"April 20, 1987",30,Georgia,1,28,365,40,105,38.1,13,45,28.9,7,12,58.3,10,71,81,15,9,11,13,100,1,0,0.583333
36,Courtney Paris,DAL,C,193,113,30.336385,US,"September 21, 1987",29,Oklahoma,7,16,217,32,57,56.1,0,0,0.0,6,12,50.0,28,34,62,5,6,8,18,70,0,0,0.5
47,Elizabeth Williams,ATL,F/C,191,87,23.84803,US,"June 23, 1993",24,Duke,3,30,377,48,96,50.0,0,1,0.0,32,55,58.2,35,61,96,5,5,4,21,128,0,0,0.581818
80,Kia Vaughn,NY,C,193,90,24.161722,US,"January 24, 1987",30,Rutgers,9,23,455,62,116,53.4,0,0,0.0,10,19,52.6,39,71,110,16,8,9,21,134,1,0,0.526316
90,Maimouna Diarra,LA,C,198,90,22.956841,SN,"January 30, 1991",26,Sengal,R,9,16,1,3,33.3,0,0,0.0,1,2,50.0,3,4,7,1,1,0,3,3,0,0,0.5
108,Rebecca Allen,NY,G/F,188,74,20.937076,AU,"June 11, 1992",25,Australia,3,28,254,31,86,36.0,14,40,35.0,2,6,33.3,13,51,64,15,9,12,17,78,0,0,0.333333
117,Sequoia Holmes,SAN,G,185,70,20.452885,US,"June 13, 1986",31,UNLV,2,24,280,31,89,34.8,13,46,28.3,6,11,54.5,12,12,24,23,13,5,11,81,0,0,0.545455
129,Sydney Wiese,LA,G,183,68,20.305175,US,"July 13, 1992",25,Oregon State,R,25,189,19,50,38.0,13,32,40.6,4,8,50.0,3,18,21,6,4,3,2,55,0,0,0.5


**Now that all the requirements have been taken into account, compute the confidence interval of the proportion with a confidence level of 95%:**

In [36]:
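# your code here
# Free-throw success rates of the players who make more than 60% of their attempts
sample2 = wnba.loc[(wnba['FTA-1'] > .6)]
sample2 = sample2['FTA-1']

loc = np.mean(sample2)
scale = stats.sem(sample2)
confidence_level = 0.95

# Note: this is an interval for the mean success rate in this subset, not for a proportion of players
stats.norm.interval(confidence_level, loc=loc, scale=scale)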

(0.7886506733549887, 0.8211277732097)
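The interval above describes the mean free-throw rate of the players above 60%. An interval for a proportion is built from a count instead. Below is a minimal sketch with a normal-approximation (Wald) interval and a Wilson interval. The helper names are hypothetical, 9 is the number of rows in the filtered table shown earlier, and the total of 142 players is an assumption that should be replaced with the real sample size.

```python
import math
from scipy import stats


def wald_interval(successes, n, confidence=0.95):
    """Normal-approximation interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - margin), min(1.0, p_hat + margin)


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval, better behaved when p_hat is close to 0 or 1."""
    p_hat = successes / n
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - margin, centre + margin


below_60 = 9      # players with a success rate under 0.6 (rows in the filtered table)
n_players = 142   # ASSUMED sample size; use the number of non-missing rates in practice

wald = wald_interval(below_60, n_players)
wilson = wilson_interval(below_60, n_players)

print("Wald interval:  ", wald)
print("Wilson interval:", wilson)
```

With so few players below the threshold the estimated proportion is close to 0, which is the situation where the Wilson interval is usually preferred over the Wald interval.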

In [None]:
len(sample2)

In [54]:
wnba['FTA-1'] = np.where(wnba['FT%'] > 60, 1, 0)  # FT% is on a 0-100 scale
wnba.head(10)

Unnamed: 0,Name,Team,Pos,Height,Weight,BMI,Birth_Place,Birthdate,Age,College,Experience,Games Played,MIN,FGM,FGA,FG%,3PM,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TO,PTS,DD2,TD3,FTA-1
0,Aerial Powers,DAL,F,183,71,21.200991,US,"January 17, 1994",23,Michigan State,2,8,173,30,85,35.3,12,32,37.5,21,26,80.8,6,22,28,12,3,6,12,93,0,0,1
1,Alana Beard,LA,G/F,185,73,21.329438,US,"May 14, 1982",35,Duke,12,30,947,90,177,50.8,5,18,27.8,32,41,78.0,19,82,101,72,63,13,40,217,0,0,1
2,Alex Bentley,CON,G,170,69,23.875433,US,"October 27, 1990",26,Penn State,4,26,617,82,218,37.6,19,64,29.7,35,42,83.3,4,36,40,78,22,3,24,218,0,0,1
3,Alex Montgomery,SAN,G/F,185,84,24.543462,US,"December 11, 1988",28,Georgia Tech,6,31,721,75,195,38.5,21,68,30.9,17,21,81.0,35,134,169,65,20,10,38,188,2,0,1
4,Alexis Jones,MIN,G,175,78,25.469388,US,"August 5, 1994",23,Baylor,R,24,137,16,50,32.0,7,20,35.0,11,12,91.7,3,9,12,12,7,0,14,50,0,0,1
5,Alexis Peterson,SEA,G,170,63,21.799308,US,"June 20, 1995",22,Syracuse,R,14,90,9,34,26.5,2,9,22.2,6,6,100.0,3,13,16,11,5,0,11,26,0,0,1
6,Alexis Prince,PHO,G,188,81,22.91761,US,"February 5, 1994",23,Baylor,R,16,112,9,34,26.5,4,15,26.7,2,2,100.0,1,14,15,5,4,3,3,24,0,0,1
7,Allie Quigley,CHI,G,178,64,20.19947,US,"June 20, 1986",31,DePaul,8,26,847,166,319,52.0,70,150,46.7,40,46,87.0,9,83,92,95,20,13,59,442,0,0,1
8,Allisha Gray,DAL,G,185,76,22.20599,US,"October 20, 1992",24,South Carolina,2,30,834,131,346,37.9,29,103,28.2,104,129,80.6,52,75,127,40,47,19,37,395,0,0,1
9,Allison Hightower,WAS,G,178,77,24.302487,US,"June 4, 1988",29,LSU,5,7,103,14,38,36.8,2,11,18.2,6,6,100.0,3,7,10,10,5,0,2,36,0,0,1


**What can you comment about our result? What would you tell your sister?**

In [52]:
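#your-answer-here
'''
With a confidence level of 95 percent, the mean free-throw success rate of the players who make more than 60%
of their attempts lies between 78.87% and 82.11%. This is an interval for a mean rate, not for the proportion
of players who miss more than 40%. Only 9 players in the sample fall below 60%, so the data do not support
the claim made by your sister.
'''

'\nWith a confidence level of 95 percent, the mean free-throw success rate of the players who make more than 60%\nof their attempts lies between 78.87% and 82.11%. This is an interval for a mean rate, not for the proportion\nof players who miss more than 40%. Only 9 players in the sample fall below 60%, so the data do not support\nthe claim made by your sister.\n'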

## Bonus: Can you plot the probability distribution of the proportion of missed free throws, indicating where the critical region is?

In [70]:
#your code here
from statsmodels.stats.proportion import proportions_ztest

# proportions_ztest expects the number of successes as an integer, not the filtered Series
count = (wnba['FT%'] > 60).sum()
nobs = wnba['FT%'].count()
# H0: the proportion of players making more than 60% of their free throws equals 0.4
stat, pval = proportions_ztest(count, nobs, value = 0.4)
print('{}'.format(pval))

# Question 3: Is the average number of assists for WNBA players only higher than the average for WNBA and NBA players together?

Your brother-in-law is convinced that the average assists for female professional players is higher than the average of both female and male players combined (which is 52 for the 2016-2017 season). You would like to actually prove if this is true or not but you remember your stats teacher saying "you can't *prove* anything, you just can say that *you are not* saying foolishness".

**How would you do it? Try and think about the requirements that your sample must satisfy in order to do that. Do you feel it actually fulfills those requirements? Do you need to make any assumptions?**

In [62]:
#your-answer-here

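# The brother-in-law claims that the average assists for female professional players is higher than the
# average of both female and male players combined (which is 52 for the 2016-2017 season).
# The claim goes in the alternative hypothesis; the null hypothesis contains the equality.

# H0: Mu <= 52
# H1: Mu > 52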

mean_sample = np.mean(wnba['AST'])
mean_H0 = 52

**Use a two-tailed one-sample t-test to see if we can reject (or not) the null hypothesis with a 95% confidence level.**

In [66]:
#your code here
from scipy.stats import ttest_1samp

# Run the test once and reuse both results
t_stat, p_value = ttest_1samp(wnba['AST'], mean_H0)
print("t-statistic: {}".format(t_stat))
print("p-value: {}".format(p_value))

t-statistic: -2.1499947192482898
p-value: 0.033261541354107166


In [18]:
#your-answer-here
'''
The negative t-statistic shows that the sample mean lies below 52.
The p-value is below alpha (0.033 < 0.05), meaning that the result is significant.
The population mean differs significantly from 52 and, since the sample mean is lower, the data do not support the brother-in-law.
'''

**Now use a one-tailed one-sample t-test to see if we can reject (or not) the null hypothesis with a 95% confidence level.**

In [65]:
#your-answer-here
print("t-statistic: {}".format(ttest_1samp(wnba['AST'], mean_H0)[0]))
print("p-value: {}".format(ttest_1samp(wnba['AST'], mean_H0)[1]/2))

'''
The negative t-statistic shows that the sample mean lies below 52.
Halving the two-tailed p-value gives 0.0166, which is the p-value for the left-tailed alternative (Mu < 52).
For the claim of the brother-in-law (Mu > 52) the one-tailed p-value is 1 - 0.0166 = 0.9834, far above 0.05.
We cannot conclude that the WNBA average is higher than 52.
'''

t-statistic: -2.1499947192482898
p-value: 0.016630770677053583
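Halving the two-tailed p-value only gives the right answer when the statistic points in the direction of the alternative. Assuming SciPy 1.6 or newer, `ttest_1samp` accepts an `alternative` argument that handles the direction explicitly. The sketch below uses synthetic assist counts (hypothetical values, not the WNBA data) to show how the three p-values relate.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Synthetic, right-skewed assist counts (hypothetical, for illustration only)
rng = np.random.default_rng(2)
fake_assists = rng.gamma(shape=1.5, scale=30, size=140)
popmean = 52

two_sided = ttest_1samp(fake_assists, popmean)                       # H1: mu != 52
greater = ttest_1samp(fake_assists, popmean, alternative='greater')  # H1: mu > 52
less = ttest_1samp(fake_assists, popmean, alternative='less')        # H1: mu < 52

print("t-statistic:       ", two_sided.statistic)
print("p (two-sided):     ", two_sided.pvalue)
print("p (H1: mu > 52):   ", greater.pvalue)
print("p (H1: mu < 52):   ", less.pvalue)
```

The two one-sided p-values add up to 1, and the two-sided p-value is twice the smaller of them, so the halved value belongs to whichever side the t-statistic falls on.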


## Bonus: Can you plot the resulting t-distribution of both tests? Indicate where the critical region is and where your statistic falls.

In [None]:
#your code here
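One possible way to draw both tests is sketched below. It uses the rounded t-statistic reported above; the degrees of freedom (141) are an assumption and should be replaced with `len(wnba) - 1`. The right-tailed panel corresponds to the alternative Mu > 52.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

t_stat = -2.15   # rounded statistic reported above
dof = 141        # ASSUMED degrees of freedom (n - 1)
alpha = 0.05

x = np.linspace(-4, 4, 500)
density = stats.t.pdf(x, dof)

# Critical values for the two tests
two_tailed_crit = stats.t.ppf(1 - alpha / 2, dof)
right_tailed_crit = stats.t.ppf(1 - alpha, dof)

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# Two-tailed test: reject in both tails
axes[0].plot(x, density)
axes[0].fill_between(x, density, where=np.abs(x) >= two_tailed_crit, alpha=0.4)
axes[0].axvline(t_stat, linestyle="--", color="black")
axes[0].set_title("Two-tailed test")

# Right-tailed test: reject only in the upper tail
axes[1].plot(x, density)
axes[1].fill_between(x, density, where=x >= right_tailed_crit, alpha=0.4)
axes[1].axvline(t_stat, linestyle="--", color="black")
axes[1].set_title("Right-tailed test (H1: Mu > 52)")

for ax in axes:
    ax.set_xlabel("t")
fig.tight_layout()
# In a notebook, plt.show() displays the figure
```

The dashed line marks the statistic: it falls inside the shaded region of the two-tailed test but far from the shaded upper tail of the right-tailed test.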

# Bonus: Satisfying your curiosity

You finally managed to solve your family's debates over basketball! While you were doing that you started to take an interest in the normal distribution.

You read that the normal distribution shows up in a lot of natural phenomena, like blood pressure, IQ, weight and height. If, for example, we could plot the distribution of the weights of every human on the planet right now, it would be expected to look roughly like a normal distribution.

In light of this you would like to see if it's possible to check if the distribution of the weights of the WNBA players is a sample distribution that comes from a population that has a normal distribution, because theoretically this should be the case.

**How would you try to demonstrate that our sample fits a normal distribution? What kind of test would you use? Would you have to make any assumptions?**

In [22]:
#your-answer-here

In [19]:
# your code here
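One common approach is a normality test such as Shapiro-Wilk or D'Agostino-Pearson, both available in SciPy. The sketch below runs them on synthetic weights (hypothetical values, not the WNBA data); with the real data, `wnba['Weight'].dropna()` would take the place of `fake_weights`.

```python
import numpy as np
from scipy import stats

# Synthetic weights in kg (hypothetical, for illustration only)
rng = np.random.default_rng(3)
fake_weights = rng.normal(loc=79, scale=11, size=140)

# H0 for both tests: the sample comes from a normally distributed population
shapiro_stat, shapiro_p = stats.shapiro(fake_weights)
dagostino_stat, dagostino_p = stats.normaltest(fake_weights)

alpha = 0.05
for name, p in [("Shapiro-Wilk", shapiro_p), ("D'Agostino-Pearson", dagostino_p)]:
    verdict = "reject normality" if p < alpha else "cannot reject normality"
    print("{}: p = {:.4f} -> {}".format(name, p, verdict))
```

A large p-value does not demonstrate that the population is normal; it only means the sample gives no evidence against it. The tests also assume independent observations.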

**What are your comments in regards to the results of the test?**

In [24]:
#your-answer-here