## Creating confidence intervals in Python
In this assessment, you will look at data from a study on toddler sleep habits. 

The confidence intervals you create and the questions you answer in this Jupyter notebook will be used to answer questions in the following graded assignment.

In [None]:
import numpy as np
import pandas as pd
import scipy
from scipy.stats import t
pd.set_option('display.max_columns', 30) # set so can see all columns of the DataFrame

Your goal is to analyse data from a study that examined
differences in a number of sleep variables between napping and non-napping toddlers. These
sleep variables included: Bedtime (lights-off time in decimalized time), Night Sleep Onset Time (in
decimalized time), Wake Time (sleep end time in decimalized time), Night Sleep Duration (interval
between sleep onset and sleep end in minutes), and Total 24-Hour Sleep Duration (in minutes). Note:
[Decimalized time](https://en.wikipedia.org/wiki/Decimal_time) is the representation of the time of day using units that are decimally related.


The 20 study participants were healthy, normally developing toddlers with no sleep or behavioral
problems. These children were categorized as napping or non-napping based upon parental report of
children’s habitual sleep patterns. Researchers then verified napping status with data from actigraphy (a
non-invasive method of monitoring human rest/activity cycles using a sensor worn on the wrist) and
sleep diaries during the 5 days before the study assessments were made.


You are specifically interested in the results for the Bedtime, Night Sleep Duration, and Total 24-Hour
Sleep Duration.



Reference: Akacem LD, Simpkin CT, Carskadon MA, Wright KP Jr, Jenni OG, Achermann P, et al. (2015) The Timing of the Circadian Clock and Sleep Differ between Napping and Non-Napping Toddlers. PLoS ONE 10(4): e0125181. https://doi.org/10.1371/journal.pone.0125181



In [None]:
# Import the data
df = pd.read_csv("https://raw.githubusercontent.com/UMstatspy/UMStatsPy/master/Course_2/nap_no_nap.csv")
df.head()

**Question 1**: What value is used in the column 'napping' to indicate that a toddler takes a nap?


**Question 2**: What is the sample size $n$?

In [None]:
#Q1: 1=nap, 0=no nap
n=df['napping'].count()
print('Sample size:',n) #Q2

## Hypothesis testing
We will look at two hypothesis tests, each with $\alpha = .025$:  


1. Is the average bedtime for toddlers who nap later than the average bedtime for toddlers who don't nap?


$$H_0: \mu_{nap}=\mu_{no\ nap}, \ H_a:\mu_{nap}>\mu_{no\ nap}$$
Or equivalently:
$$H_0: \mu_{nap}-\mu_{no\ nap}=0, \ H_a:\mu_{nap}-\mu_{no\ nap}>0$$


2. Is the average 24 h sleep duration (in minutes) for napping toddlers different from that of toddlers who don't nap?


$$H_0: \mu_{nap}=\mu_{no\ nap}, \ H_a:\mu_{nap}\neq\mu_{no\ nap}$$
Or equivalently:
$$H_0: \mu_{nap}-\mu_{no\ nap}=0, \ H_a:\mu_{nap}-\mu_{no\ nap} \neq 0$$

Aside: This $\alpha$ level is what you get by starting from $\alpha = .05$ and applying the [Bonferroni correction](https://en.wikipedia.org/wiki/Bonferroni_correction) for two tests.

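Before any analysis, we will convert 'night bedtime' into minutes after midnight. The conversion reads the integer part of each value as hours and the two decimal digits as minutes: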

In [None]:
alpha=.05/2
print(alpha)

In [None]:
# Convert 'night bedtime' into minutes after midnight:
# the integer part is read as hours and the two decimal digits as minutes
df.loc[:,'night bedtime'] = np.floor(df['night bedtime'])*60 + np.round(df['night bedtime']%1, 2)*100
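
To make the conversion concrete, here is a small sketch on made-up values (the sample times are assumptions, not rows from the dataset). It applies the same expression as the cell above, reading the integer part as hours and the two decimal digits as minutes, so 20.45 is read as 20:45.

```python
import numpy as np
import pandas as pd

# Hypothetical bedtimes written as hh.mm (illustrative values, not from the study data)
sample_times = pd.Series([19.30, 20.45, 21.05])

# Same expression as the notebook cell: hours * 60 + minutes
minutes_after_midnight = np.floor(sample_times) * 60 + np.round(sample_times % 1, 2) * 100

# Round away tiny floating-point noise before displaying
converted = minutes_after_midnight.round().astype(int).tolist()
print(converted)
```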

Now, isolate the column 'night bedtime' for those who nap into a new variable, and those who didn't nap into another new variable. 

In [None]:
bedtime_nap =df.loc[(df['napping']==1),'night bedtime']
print('bedtime_nap')
print(bedtime_nap)
#print(len(bedtime_nap))

bedtime_no_nap = df.loc[(df['napping']==0),'night bedtime']
print('bedtime_no_nap')
print(bedtime_no_nap)
#print(bedtime_no_nap.count())

Now find the sample mean bedtime for nap and no_nap.


In [None]:
nap_mean_bedtime = bedtime_nap.mean()
print('Sample mean bedtime for nap:',nap_mean_bedtime)

no_nap_mean_bedtime =  bedtime_no_nap.mean()
print('Sample mean bedtime for no_nap:',no_nap_mean_bedtime)

**Question**: What is the sample difference of mean bedtime for nappers minus no nappers?

In [None]:
mean_diff = nap_mean_bedtime - no_nap_mean_bedtime
print(mean_diff)

Now find the sample standard deviation for $X_{nap}$ and $X_{no\ nap}$.

In [None]:
nap_s_bedtime = bedtime_nap.std()
print('Standard deviation for nap:', nap_s_bedtime)

no_nap_s_bedtime = bedtime_no_nap.std()
print('Standard deviation for no nap:', no_nap_s_bedtime)
# Note on the ddof ("delta degrees of freedom") argument:
#   population standard deviation: ddof=0 (divide by n)
#   sample standard deviation:     ddof=1 (divide by n-1)
# np.std() defaults to ddof=0, so it returns the population standard deviation.
# The pandas .std() method defaults to ddof=1, so it returns the sample standard
# deviation, which is what the pooled standard error formula needs.
# Either function can compute the other quantity if ddof is set explicitly.
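
The difference between the NumPy and pandas defaults is easy to check on a tiny made-up sample. The sketch below (the values are illustrative assumptions) compares `np.std`, which divides by n unless told otherwise, with the pandas `std` method, which divides by n - 1.

```python
import numpy as np
import pandas as pd

values = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data, not from the study

population_sd = np.std(values)              # ddof=0 by default: divides by n
sample_sd_numpy = np.std(values, ddof=1)    # ddof=1: divides by n - 1
sample_sd_pandas = pd.Series(values).std()  # pandas defaults to ddof=1

print(population_sd)
print(sample_sd_numpy, sample_sd_pandas)
```

The sample standard deviation is always at least as large as the population version computed from the same numbers, because its denominator is smaller.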

**Question**: What is the s.e.$(\bar{X}_{nap} - \bar{X}_{no\ nap})$?

We assume the variance in bedtime is the same for toddlers who nap and toddlers who don't nap, so we use a pooled standard error.

Calculate the pooled standard error of $\bar{X}_{nap} - \bar{X}_{no\ nap}$ using the formula below.

$s.e.(\bar{X}_{nap} - \bar{X}_{no\ nap}) = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}(\frac{1}{n_1}+\frac{1}{n_2})}$

In [None]:
n1 = df.loc[(df['napping']==1), 'id'].count()  # sample size for toddlers who nap
n2 = df.loc[(df['napping']==0), 'id'].count()  # sample size for toddlers who don't nap
s1 = nap_s_bedtime
s2 = no_nap_s_bedtime
# Pooled variance multiplied by (1/n1 + 1/n2), as in the formula above
term = (((n1-1)*s1**2) + ((n2-1)*s2**2)) * ((1/n1) + (1/n2)) / (n1+n2-2)
pooled_se = np.sqrt(term)
print(pooled_se)
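
As a cross-check on the formula, the pooled standard error can be wrapped in a small function and compared against SciPy. This is a sketch under assumptions: `pooled_standard_error` is a hypothetical helper name and the two samples are made up. Dividing the difference in means by the pooled standard error should reproduce the statistic from `scipy.stats.ttest_ind(..., equal_var=True)`.

```python
import numpy as np
from scipy import stats

def pooled_standard_error(a, b):
    """Pooled s.e. of mean(a) - mean(b), assuming equal population variances."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n1, n2 = len(a), len(b)
    # Sample variances (ddof=1), weighted by their degrees of freedom
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    return np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Made-up samples for illustration only
group_a = [1245, 1230, 1260, 1275, 1250]
group_b = [1200, 1215, 1190, 1225]

se = pooled_standard_error(group_a, group_b)
t_manual = (np.mean(group_a) - np.mean(group_b)) / se
t_scipy = stats.ttest_ind(group_a, group_b, equal_var=True).statistic

print(se, t_manual, t_scipy)
```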

**Question**: Given our sample size of $n$, how many degrees of freedom ($df$) are there for the associated $t$ distribution?

In [None]:
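df_n1 = n1-1
df_n2 = n2-1
print(df_n1, ',', df_n2)
# The pooled t distribution has df_n1 + df_n2 = n1 + n2 - 2 degrees of freedom
print('Pooled df:', df_n1 + df_n2)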

Now calculate the $t$-test statistic for our first hypothesis test using  
* pooled s.e.($\bar{X}_{nap} - \bar{X}_{no\ nap}$)  
* $\bar{X}_{nap} - \bar{X}_{no\ nap}$  
* $\mu_{0,\ nap} - \mu_{0,\ no\ nap}=0$, the population difference in means under the null hypothesis

In [None]:
tstar=mean_diff/pooled_se
print(tstar)

**Question**: What is the p-value for the first hypothesis test?
To find the p-value, we can use the function:
```
t.cdf(y, df)
```
Which for $X \sim t(df)$ returns $P(X \leq y)$.

Because the total probability is 1, we have that 
```
1-t.cdf(y, df)
```
returns $P(X > y)$

The function t.cdf(y, df) will give you the same value as finding the one-tailed probability of y on a t-table with the specified degrees of freedom.
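
Here is a quick numerical check of these tail probabilities, using an arbitrary value of y and arbitrary degrees of freedom (both chosen only for illustration). SciPy also provides `t.sf`, the survival function, which returns $P(X > y)$ directly.

```python
from scipy.stats import t

y, dof = 2.1, 18  # arbitrary illustrative values

lower_tail = t.cdf(y, dof)      # P(X <= y)
upper_tail = 1 - t.cdf(y, dof)  # P(X > y), the complement of the lower tail
upper_tail_sf = t.sf(y, dof)    # same upper-tail probability via the survival function
mirrored = t.cdf(-y, dof)       # P(X <= -y), equal to P(X > y) by symmetry

print(lower_tail, upper_tail, upper_tail_sf, mirrored)
```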

Use the function t.cdf(y, df) to find the p-value for the first hypothesis test.

In [None]:
# This is a one-sided (upper-tail) test, so the p-value is P(T > tstar); no absolute value is needed.
# alpha = .025 is already Bonferroni-corrected, so the p-value is compared to alpha itself.
pval = 1 - t.cdf(tstar, df_n1+df_n2)
print(pval)
print("Alpha:", alpha)

**Question**: What are the t-statistic and p-value for the second hypothesis test?

Calculate the $t$ test statistic and corresponding p-value using the scipy function scipy.stats.ttest_ind(a, b, equal_var=True) and check them against your answer. 

**Question**: Does scipy.stats.ttest_ind return values for a one-sided or two-sided test?

**Question**: Can you think of a way to recover the results you got using 1-t.cdf from the p-value given by scipy.stats.ttest_ind?

Use the scipy function scipy.stats.ttest_ind(a, b, equal_var=True) to find the $t$ test statistic and corresponding p-value for the second hypothesis test.

In [None]:
from scipy import stats

In [None]:
# First hypothesis test
x = scipy.stats.ttest_ind(bedtime_nap, bedtime_no_nap, equal_var=True)
# ttest_ind runs a two-sided test of the null hypothesis that two independent samples have identical average (expected) values.
print(x)
# Doubling the one-tailed p-value should reproduce scipy's two-sided p-value
print(pval*2)
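
One way to recover a one-sided p-value from the two-sided value that `ttest_ind` reports is sketched below, on made-up samples. The helper name `one_sided_greater` is hypothetical. For $H_a: \mu_a > \mu_b$, the one-sided p-value is half the two-sided value when the statistic is positive, and one minus that half when it is negative.

```python
from scipy import stats

def one_sided_greater(a, b):
    """p-value for H_a: mean(a) > mean(b), from the two-sided pooled t-test."""
    result = stats.ttest_ind(a, b, equal_var=True)
    half = result.pvalue / 2
    return half if result.statistic > 0 else 1 - half

# Made-up samples for illustration only
group_a = [1245, 1230, 1260, 1275, 1250]
group_b = [1200, 1215, 1190, 1225]

p_one_sided = one_sided_greater(group_a, group_b)

# Direct computation from the t distribution for comparison
t_stat = stats.ttest_ind(group_a, group_b, equal_var=True).statistic
dof = len(group_a) + len(group_b) - 2
p_direct = stats.t.sf(t_stat, dof)

print(p_one_sided, p_direct)
```

Recent SciPy releases also accept an `alternative='greater'` argument to `ttest_ind`; check the documentation for your installed version before relying on it.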

In [None]:
# Second hypothesis test

# '24 h sleep duration' is described above as already being in minutes,
# so the hh.mm-to-minutes conversion used for 'night bedtime' is not applied here.

# Isolate the column '24 h sleep duration' for those who nap into a new variable, and those who didn't nap into another new variable.
sldur_nap = df.loc[(df['napping']==1), '24 h sleep duration']
#print('24 h sleep duration_nap')
#print(sldur_nap)

sldur_nonap = df.loc[(df['napping']==0), '24 h sleep duration']
#print('24 h sleep duration_nonap')
#print(sldur_nonap)

# Two-sided pooled t-test, matching H_a: the two means differ
y = scipy.stats.ttest_ind(sldur_nap, sldur_nonap, equal_var=True)
print(y)


**Question**: At $\alpha=.025$, do you reject or fail to reject the null hypothesis of the first test?

**Question**: At $\alpha=.025$, do you reject or fail to reject the null hypothesis of the second test?



In [None]:
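# If the p-value is less than alpha, we reject the null hypothesis.
# If the p-value is not less than alpha, we do not reject the null hypothesis.
# alpha = .025 already includes the Bonferroni correction, so it is not halved again.

# First test (one-sided): compare the one-tailed p-value with alpha
print(pval)   # 0.014667 recorded in the original run
print(alpha)  # 0.025
print(x)      # two-sided pvalue=0.02933 recorded in the original run
print('First test:', 'reject H0' if pval < alpha else 'fail to reject H0')

# Second test (two-sided): compare scipy's p-value with alpha directly
print(y)      # pvalue=0.0156 recorded in the original run
print('Second test:', 'reject H0' if y.pvalue < alpha else 'fail to reject H0')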
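
The decision rule can be written out as a short sketch. The p-values here are placeholders chosen for illustration, not results from the study data, and `decide` is a hypothetical helper. The family-wise level of .05 is split across the two tests once (the Bonferroni correction), and each p-value is then compared with the resulting per-test level.

```python
def decide(p_value, alpha):
    """Return the test decision for a given p-value and significance level."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

family_alpha = 0.05
num_tests = 2
alpha_per_test = family_alpha / num_tests  # Bonferroni-corrected level: 0.025

# Placeholder p-values, not taken from the notebook's output
example_p_values = {"test A": 0.010, "test B": 0.040}

for name, p in example_p_values.items():
    print(name, p, decide(p, alpha_per_test))
```

For a one-sided test, the p-value passed in should be the one-tailed probability; the significance level itself stays the same.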