# Quantile

In [2]:
import scipy.stats as st
import numpy as np

## Standard Normal

In [None]:
# Standard Normal corresponds to mu = 0 and sigma = 1.
mu = 0
sigma = 1

Question 1 : Use `st.norm.ppf` to calculate the quantile at $\alpha = 0.95$.


The `st.norm.ppf()` method takes a probability (the area under the curve to the left of a point) and returns the corresponding z-score.

In [3]:
# The 0.95 quantile is the critical value for a two-sided 90% confidence interval.
st.norm.ppf(0.95)

1.6448536269514722

Question 2 : Now, use `st.norm.ppf` to calculate the quantile at $\alpha = 0.975$.
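
One possible answer sketch for Question 2, assuming the same `scipy.stats` import as above. The variable name `z_975` is only illustrative.

```python
import scipy.stats as st

# Quantile of the standard normal at alpha = 0.975.
# This is the critical value behind a two-sided 95% confidence interval.
z_975 = st.norm.ppf(0.975)
print(round(z_975, 4))  # 1.96

# ppf is the inverse of cdf, so applying cdf recovers alpha.
print(round(st.norm.cdf(z_975), 4))  # 0.975
```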

## Student's t

The following code evaluates `st.t.ppf` at $\alpha = 0.7$, $0.95$ and $0.975$ for various degrees of freedom:


In [5]:
alphas=[0.7,0.95,0.975]

for alpha in alphas:
    print("with alpha =",alpha)
    liste = [10,50,100,2000,100000]
    for df in liste:
        val = st.t.ppf(alpha,df)
        print( 'Degree of Freedom = %7d,  Quantile = %10f' %(df,val))

with alpha = 0.7
Degree of Freedom =      10,  Quantile =   0.541528
Degree of Freedom =      50,  Quantile =   0.527760
Degree of Freedom =     100,  Quantile =   0.526076
Degree of Freedom =    2000,  Quantile =   0.524484
Degree of Freedom =  100000,  Quantile =   0.524402
with alpha = 0.95
Degree of Freedom =      10,  Quantile =   1.812461
Degree of Freedom =      50,  Quantile =   1.675905
Degree of Freedom =     100,  Quantile =   1.660234
Degree of Freedom =    2000,  Quantile =   1.645616
Degree of Freedom =  100000,  Quantile =   1.644869
with alpha = 0.975
Degree of Freedom =      10,  Quantile =   2.228139
Degree of Freedom =      50,  Quantile =   2.008559
Degree of Freedom =     100,  Quantile =   1.983972
Degree of Freedom =    2000,  Quantile =   1.961151
Degree of Freedom =  100000,  Quantile =   1.959988


Question 1 : Modify the previous code to get the quantiles at $\alpha = 0.95$ and $\alpha = 0.975$.

Question 2 : What do you conclude about the relationship between the quantiles and the degrees of freedom?
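
To help with Question 2, here is a small sketch that measures how far the t quantile is from the standard normal quantile as the degrees of freedom grow. It reuses the same list of degrees of freedom as the cell above; the names `z` and `gaps` are illustrative.

```python
import scipy.stats as st

alpha = 0.975
z = st.norm.ppf(alpha)  # standard normal quantile, used as the reference value

# Gap between the t quantile and the normal quantile for growing df.
gaps = {}
for df in [10, 50, 100, 2000, 100000]:
    gaps[df] = st.t.ppf(alpha, df) - z
    print("Degree of Freedom = %7d,  t quantile - z quantile = %.6f" % (df, gaps[df]))
```

The gap stays positive and shrinks toward zero, which suggests that the t quantiles approach the standard normal quantiles as the degrees of freedom increase.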
 
 


##  Chi-square:

Question 1 : Calculate $P(X \le 8)$ with 5 degrees of freedom.

In [6]:
st.chi2.cdf(8,df=5)

0.8437643724222779

Question 2 : Use `st.chi2.ppf` to get the quantile at $\alpha = 0.843764373$ with 5 degrees of freedom.


In [7]:
st.chi2.ppf(0.843764373,df=5)

8.000000010482703
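
The two cells above are inverses of each other. A short sketch of the round trip, with illustrative names `dof`, `p` and `x_back`:

```python
import scipy.stats as st

dof = 5
p = st.chi2.cdf(8, df=dof)        # P(X <= 8)
x_back = st.chi2.ppf(p, df=dof)   # quantile at that probability
print(p, x_back)                  # x_back should be (numerically) 8 again
```

The tiny difference from 8 in the output above comes from typing a rounded probability (0.843764373) into `ppf`.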



# Interval Estimation of the Mean:

In [9]:
# x contains a sample.
# n = sample size.
x = np.array([25,24,24,27,29,31,28,24,25,26,25,18,30,28,23,26,27,23,16,20,22,22,25,24, 24,25,25,27,26,30,25,25,26,26,25,24])
n = len(x)

Question 1 : Calculate the Standard Error of the Mean (SEM). Note that to use the sample standard deviation (based on the unbiased variance estimate), you should set `ddof=1`.
Hint, where $\sigma$ is estimated by the sample standard deviation:  \begin{equation}
\mathrm{SE}=\frac{\sigma}{\sqrt{n}}
\end{equation}


In [14]:
# SEM = Standard Error of the Mean.
sem=st.sem(x,ddof=1)
#second method
#sem = x.std(ddof=1)/((n)**0.5)
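
As a quick check that the two methods agree, here is a self-contained sketch using the same sample. The names `sem_manual` and `sem_scipy` are illustrative.

```python
import numpy as np
import scipy.stats as st

x = np.array([25,24,24,27,29,31,28,24,25,26,25,18,30,28,23,26,27,23,16,20,22,22,25,24,
              24,25,25,27,26,30,25,25,26,26,25,24])
n = len(x)

# Manual formula: sample standard deviation (ddof=1) divided by sqrt(n).
sem_manual = x.std(ddof=1) / np.sqrt(n)

# SciPy helper, which should give the same number.
sem_scipy = st.sem(x, ddof=1)

print(sem_manual, sem_scipy)
```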

<img width="500" height="500" src="https://analystnotes.com/graph/quan/SS02SDlosn1.gif"/>

<img width="500" height="500" src="https://i.ytimg.com/vi/sJyZ9vRhP7o/maxresdefault.jpg"/>



## 90% Confidence Interval:

Documentation : https://www0.gsb.columbia.edu/faculty/pglasserman/B6014/ConfidenceIntervals.pdf


Question 1 : Using the approximated quantiles of the Standard Normal distribution, use the following equation:

\begin{equation}
\left ( \bar{X}-1.645 *\frac{\sigma}{\sqrt{n}}, \bar{X}+1.645 *\frac{\sigma}{\sqrt{n}} \right )
\end{equation}

In [15]:
(x.mean() - 1.645 * sem , x.mean() + 1.645 * sem )

(24.165832750582954, 25.834167249417046)

Question 2 : Using the exact quantiles of the Standard Normal (hint : replace $1.645 \frac{\sigma}{\sqrt{n}}$ and consider including `st.norm.ppf` )

In [19]:
critical_level=abs(st.norm.ppf(0.05))
(x.mean() - critical_level * sem , x.mean() + critical_level * sem )

(24.1659069752658, 25.8340930247342)

Question 3 : Use the `interval()` function from the SciPy library to get the 90% confidence interval based on the Standard Normal: `st.norm.interval(percentage, loc, scale)`


Documentation:

https://stackoverflow.com/questions/28242593/correct-way-to-obtain-confidence-interval-with-scipy

In [22]:
st.norm.interval(0.9,x.mean(),sem)

(24.1659069752658, 25.8340930247342)

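Question 4 : Same as Question 1, but get the 95% confidence interval (using the exact quantiles) of the Student's t distribution. For a two-sided 95% interval the critical value is `st.t.ppf(0.975, df=n-1)`, or equivalently `abs(st.t.ppf(0.025, df=n-1))`; a 90% interval would use `st.t.ppf(0.95, df=n-1)`.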


Documentation:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.t.html



In [24]:
critical_level=abs(st.t.ppf(0.025,df=n-1))
(x.mean() - critical_level * sem , x.mean() + critical_level * sem )

(23.970547388128676, 26.029452611871324)

Question 5 : Same as Question 4, using the `interval()` function from the SciPy library (Student's t): `st.t.interval(percentage, df, loc, scale)`

In [28]:
st.t.interval(0.95,n-1,x.mean(),sem)

(23.970547388128676, 26.029452611871324)
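
The two Student's t cells above use the 0.025 quantile and a 0.95 level, so they produce a 95% interval. A sketch of the 90% version, which matches this section's heading, could look like the following; the names `t_crit`, `manual` and `builtin` are illustrative.

```python
import numpy as np
import scipy.stats as st

x = np.array([25,24,24,27,29,31,28,24,25,26,25,18,30,28,23,26,27,23,16,20,22,22,25,24,
              24,25,25,27,26,30,25,25,26,26,25,24])
n = len(x)
sem = st.sem(x, ddof=1)

# Two-sided 90% interval: 5% in each tail, so the critical value is the 0.95 quantile.
t_crit = st.t.ppf(0.95, df=n - 1)
manual = (x.mean() - t_crit * sem, x.mean() + t_crit * sem)

# Same interval from the built-in helper.
builtin = st.t.interval(0.90, n - 1, x.mean(), sem)

print(manual)
print(builtin)
```

The t-based interval should come out slightly wider than the normal-based 90% interval computed earlier, because the t distribution has heavier tails.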

## 95% Confidence Interval

Question 1 : Using the approximated quantiles of the Standard Normal.


In [29]:
critical_level=1.96
(x.mean() - critical_level * sem , x.mean() + critical_level * sem )

(24.006098596439266, 25.993901403560734)

Question 2 : Using the exact quantiles of the Standard Normal.

In [30]:
critical_level=abs(st.norm.ppf(0.025))
(x.mean() - critical_level * sem , x.mean() + critical_level * sem )

(24.00611685961079, 25.99388314038921)

Question 3 : Using the `interval()` function from the SciPy library (Standard Normal).


In [31]:
st.norm.interval(0.95,x.mean(),sem)

(24.006116859610792, 25.993883140389208)

Question 4 : Using the exact quantiles of the Student's t.


In [34]:
critical_level=abs(st.t.ppf(0.025,df=n-1))
x.mean() - critical_level * sem , x.mean() + critical_level * sem

(23.970547388128676, 26.029452611871324)

Question 5 : Using the `interval()` function from the SciPy library (Student's t).

In [35]:
st.t.interval(0.95,n-1,x.mean(),sem)

(23.970547388128676, 26.029452611871324)

## 99% Confidence Interval

Question 1 : Using the approximated quantiles of the Standard Normal.


Question 2 : Using the exact quantiles of the Standard Normal.


Question 3 : Using the `interval()` function from the SciPy library (Standard Normal).


Question 4 : Using the exact quantiles of the Student's t.


Question 5 : Using the `interval()` function from the SciPy library (Student's t).
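
A possible sketch for the five 99% questions, following the same pattern as the 95% section. It assumes the usual approximation 2.576 for the 0.995 standard normal quantile; all variable names are illustrative.

```python
import numpy as np
import scipy.stats as st

x = np.array([25,24,24,27,29,31,28,24,25,26,25,18,30,28,23,26,27,23,16,20,22,22,25,24,
              24,25,25,27,26,30,25,25,26,26,25,24])
n = len(x)
mean = x.mean()
sem = st.sem(x, ddof=1)

# Q1: approximated Standard Normal quantile (2.576 is the usual rounded value).
ci_norm_approx = (mean - 2.576 * sem, mean + 2.576 * sem)

# Q2: exact Standard Normal quantile; 0.5% in each tail gives the 0.995 quantile.
z_crit = st.norm.ppf(0.995)
ci_norm_exact = (mean - z_crit * sem, mean + z_crit * sem)

# Q3: built-in interval (Standard Normal).
ci_norm_builtin = st.norm.interval(0.99, mean, sem)

# Q4: exact Student's t quantile with n - 1 degrees of freedom.
t_crit = st.t.ppf(0.995, df=n - 1)
ci_t_exact = (mean - t_crit * sem, mean + t_crit * sem)

# Q5: built-in interval (Student's t).
ci_t_builtin = st.t.interval(0.99, n - 1, mean, sem)

for name, ci in [("normal, approx", ci_norm_approx), ("normal, exact", ci_norm_exact),
                 ("normal, interval()", ci_norm_builtin), ("t, exact", ci_t_exact),
                 ("t, interval()", ci_t_builtin)]:
    print("%-20s %s" % (name, ci))
```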


# Interval Estimation of the Proportion: 

Question 1 : Suppose there are two candidates for the election: T and B (T for Trump, B for Biden, just saying... lol).
Candidate T wants to survey his approval rating.
Out of 100 surveyed, 55 answered positively.
Can T be sure of winning this election?
- Assume that T gets elected with 50% or more.
- Use a 95% confidence interval.

Draw your conclusion using the confidence interval.

In [38]:
# Alternative hypothesis: p >= 50%

# Null hypothesis: p < 50%

# 95% Confidence Interval.
# We apply the standard error (SE) formula for proportions from the lecture.

p_mean=55/100

sem=np.sqrt((p_mean*(1-p_mean))/100)

p_mean - 1.96 * sem , p_mean + 1.96 * sem

(0.4524912311635513, 0.6475087688364488)

Question 2 : Out of 1000 surveyed, 550 answered positively. Can T be sure of winning this election?
- Assume that T gets elected with 50% or more.
- Use a 95% confidence interval.

Draw your conclusion using the confidence interval.

In [39]:
# 95% Confidence Interval.
# We apply the expression for the standard error (SE) from the lecture.
# We ignore additional corrections.
p_mean=550/1000

sem=np.sqrt((p_mean*(1-p_mean))/1000)

p_mean - 1.96 * sem , p_mean + 1.96 * sem

(0.5191650198637976, 0.5808349801362025)
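
The two cells above differ only in the sample size. A minimal sketch that wraps the same formula in a helper and compares both surveys; `proportion_ci` is a hypothetical name, and it uses the exact normal quantile instead of the rounded 1.96.

```python
import numpy as np
import scipy.stats as st

def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a proportion (no corrections)."""
    p_hat = successes / n
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    z = st.norm.ppf(1 - (1 - confidence) / 2)  # e.g. about 1.96 for 95%
    return p_hat - z * se, p_hat + z * se

lo_100, hi_100 = proportion_ci(55, 100)
lo_1000, hi_1000 = proportion_ci(550, 1000)

print("n =  100:", (lo_100, hi_100), "entirely above 50%:", lo_100 > 0.5)
print("n = 1000:", (lo_1000, hi_1000), "entirely above 50%:", lo_1000 > 0.5)
```

With 100 respondents the interval still contains values below 50%, so the survey does not settle the question; with 1000 respondents and the same observed proportion, the whole interval lies above 50%.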

# Correlation



Some resources to learn more about these correlations:

https://datascience.stackexchange.com/questions/64260/pearson-vs-spearman-vs-kendall/64261


https://towardsdatascience.com/clearly-explained-pearson-v-s-spearman-correlation-coefficient-ada2f473b8

In [40]:
import pandas as pd
import numpy as np
import scipy.stats as st
import os
import seaborn as sns

In [41]:
sns.get_dataset_names()


['anagrams',
 'anscombe',
 'attention',
 'brain_networks',
 'car_crashes',
 'diamonds',
 'dots',
 'exercise',
 'flights',
 'fmri',
 'gammas',
 'geyser',
 'iris',
 'mpg',
 'penguins',
 'planets',
 'tips',
 'titanic']

In [42]:
df = sns.load_dataset('iris')

In [43]:
df.head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [44]:
type(df)

pandas.core.frame.DataFrame

In [45]:
x = df.petal_length
y = df.sepal_length

## P-Value

Documentation : 
https://towardsdatascience.com/p-value-basics-with-python-code-ae5316197c52

Question 1 : Using the SciPy function: 
- calculate correlation and p-value. hint : ``np.round(st.pearsonr(..),..)``  
- interpret the results 

In [46]:
corr, p_value = st.pearsonr(x, y)
# Display both values (an assignment alone produces no cell output).
corr, p_value

(0.8717537758865833, 1.0386674194496954e-47)

Question 2 : Using the Pandas function.


In [47]:
x.corr(y)

0.871753775886583

Question 3: Correlation array, use: `np.round(..)`   

In [48]:
# Rounded correlation coefficient (a scalar here, since x.corr(y) compares two Series).
np.round(x.corr(y),2)

0.87
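
Question 3 mentions a correlation array, while the cell above rounds a single coefficient. A sketch of a rounded correlation matrix follows. To stay self-contained it uses a small made-up DataFrame named `toy` (not the iris data), with one text column standing in for `species`; in the notebook the same two calls could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
rng = np.random.default_rng(0)
a = rng.normal(size=50)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.5, size=50),
    "label": ["u", "v"] * 25,   # non-numeric column, like `species`
})

# Keep only numeric columns, then round the pairwise Pearson correlations.
corr_matrix = np.round(toy.select_dtypes("number").corr(), 2)
print(corr_matrix)
```

Selecting the numeric columns first avoids depending on how a given pandas version treats text columns in `corr()`.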

## Spearman

Question 1 : Using the SciPy Spearman function, calculate the correlation and the p-value.
Hint : ``np.round(st.spearmanr(..),..)``

In [49]:
np.round(st.spearmanr(x,y),2)

array([0.88, 0.  ])

## Kendall

Question 1 : Using the SciPy Kendall function, calculate the correlation and the p-value.

In [53]:
np.round(st.kendalltau(x,y),2)

array([0.72, 0.  ])
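
To see how the three coefficients can disagree, here is a sketch on made-up data where the relationship is strictly increasing but far from linear. The arrays `u` and `v` are hypothetical and not part of the iris dataset.

```python
import numpy as np
import scipy.stats as st

# Made-up data: v increases with u, but exponentially rather than linearly.
u = np.arange(1, 21, dtype=float)
v = np.exp(u / 3)

pearson_r, _ = st.pearsonr(u, v)      # measures linear association
spearman_rho, _ = st.spearmanr(u, v)  # rank-based, measures monotonic association
kendall_tau, _ = st.kendalltau(u, v)  # rank-based, counts concordant pairs

print("Pearson  = %.3f" % pearson_r)
print("Spearman = %.3f" % spearman_rho)
print("Kendall  = %.3f" % kendall_tau)
```

Because the ranks of `u` and `v` agree perfectly, the two rank-based coefficients reach their maximum, while the Pearson coefficient is lower because the relationship is not a straight line.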

Question : Calculate the confidence interval of the Pearson correlation.

In [55]:
# Apply the Fisher's z-transformation.
# See the lecture.
# 1.645 is the standard normal critical value for a 90% confidence interval.
r= x.corr(y)
z=np.arctanh(r)
sem = 1/np.sqrt(len(x)-3)
lower_bound =np.tanh(z - 1.645*sem) 
upper_bound = np.tanh(z + 1.645*sem )

lower_bound , upper_bound

(0.8350711628761325, 0.9007190378984153)

In [9]:
# 95% confidence interval. 
# Expressed as a dictionary object.


Question 1 : 99% confidence interval, expressed as a dictionary object.

In [56]:
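# 99% confidence interval.
# Expressed as a dictionary object.
# Note: lower_bound and upper_bound above were computed with 1.645 (a 90% interval);
# for 99%, recompute them with st.norm.ppf(0.995) (about 2.576) before building the dictionary.

{"lower":lower_bound ,"upper": upper_bound}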

{'lower': 0.8350711628761325, 'upper': 0.9007190378984153}
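
The printed values above coincide with the 90% bounds, since they reuse `lower_bound` and `upper_bound` from the 1.645 cell. A sketch of a helper that applies the same Fisher z-transformation at any confidence level; `pearson_ci` is a hypothetical name, and `n = 150` assumes the full iris dataset (150 rows).

```python
import numpy as np
import scipy.stats as st

def pearson_ci(r, n, confidence=0.95):
    """Confidence interval for a Pearson correlation via Fisher's z-transformation."""
    z = np.arctanh(r)                # Fisher z of the sample correlation
    se = 1 / np.sqrt(n - 3)          # standard error on the z scale
    z_crit = st.norm.ppf(1 - (1 - confidence) / 2)
    return {"lower": np.tanh(z - z_crit * se), "upper": np.tanh(z + z_crit * se)}

r, n = 0.8717537758865833, 150       # correlation printed earlier; n assumed to be 150
ci_90 = pearson_ci(r, n, 0.90)
ci_95 = pearson_ci(r, n, 0.95)
ci_99 = pearson_ci(r, n, 0.99)

for level, ci in [(0.90, ci_90), (0.95, ci_95), (0.99, ci_99)]:
    print(level, ci)
```

The 90% result should agree with the bounds above up to the rounding of 1.645, and the intervals widen as the confidence level increases.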