<a href="https://colab.research.google.com/github/Camicb/Statistics-w-python-Coursera/blob/master/c2week2_assessment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Creating confidence intervals in Python
In this assessment, you will look at data from a study on toddler sleep habits. 

The confidence intervals you create and the questions you answer in this Jupyter notebook will be used to answer questions in the following graded assignment.

In [None]:
import numpy as np
import pandas as pd
from scipy.stats import t
pd.set_option('display.max_columns', 30) # set so can see all columns of the DataFrame

Your goal is to analyse data from a study that examined
differences in a number of sleep variables between napping and non-napping toddlers. Some of the
sleep variables were: Bedtime (lights-off time in decimalized time), Night Sleep Onset Time (in
decimalized time), Wake Time (sleep end time in decimalized time), Night Sleep Duration (interval
between sleep onset and sleep end in minutes), and Total 24-Hour Sleep Duration (in minutes). Note:
[Decimalized time](https://en.wikipedia.org/wiki/Decimal_time) is the representation of the time of day using decimally related units.


The 20 study participants were healthy, normally developing toddlers with no sleep or behavioral
problems. These children were categorized as napping or non-napping based upon parental report of
children’s habitual sleep patterns. Researchers then verified napping status with data from actigraphy (a
non-invasive method of monitoring human rest/activity cycles using a sensor worn on the wrist) and
sleep diaries during the 5 days before the study assessments were made.


You are specifically interested in the results for the Bedtime, Night Sleep Duration, and Total
24-Hour Sleep Duration.

Reference: Akacem LD, Simpkin CT, Carskadon MA, Wright KP Jr, Jenni OG, Achermann P, et al. (2015) The Timing of the Circadian Clock and Sleep Differ between Napping and Non-Napping Toddlers. PLoS ONE 10(4): e0125181. https://doi.org/10.1371/journal.pone.0125181

In [None]:
# Import the data
# Look at the DataFrame to get a sense of the data
df = pd.read_csv("https://raw.githubusercontent.com/UMstatspy/UMStatsPy/master/Course_2/nap_no_nap.csv")
df

**Question 1**: What value is used in the column 'napping' to indicate that a toddler takes a nap?  

**Question 2**: What is the sample size $n$? 

**Question 3**: What is the sample size for toddlers who nap, $n_1$, and toddlers who don't nap, $n_2$?

In [None]:
# Q1: 1 = nap, 0 = no nap
print('Sample size:', df['napping'].count())  # Q2
# print(len(df['napping']))  # Q2 too
# Q3:
n1 = df.loc[df['napping'] == 1, 'id'].count()  # sample size for toddlers who nap
n2 = df.loc[df['napping'] == 0, 'id'].count()  # sample size for toddlers who don't nap
print('n1:', n1, ', n2:', n2)
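
As a cross-check on the counts above, `value_counts()` tallies every value of a column at once. The sketch below runs on a small made-up DataFrame (`toy` is hypothetical, not the study data) that uses the same 1/0 coding for `napping`.

```python
import pandas as pd

# Hypothetical toy data, not the study data: 1 = naps, 0 = does not nap
toy = pd.DataFrame({"id": [1, 2, 3, 4, 5], "napping": [1, 0, 1, 1, 0]})

# Total sample size is the number of rows
n = len(toy)

# Group sizes for each value of 'napping'
counts = toy["napping"].value_counts()
n1 = counts.loc[1]  # toddlers who nap
n2 = counts.loc[0]  # toddlers who don't nap

print("n:", n, "n1:", n1, "n2:", n2)
```

The two group sizes should always add up to the total sample size, which is a quick sanity check before building the intervals.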


### Average bedtime confidence interval for napping and non napping toddlers
Create two 95% confidence intervals for the average bedtime, one for toddlers who nap and one for toddlers who don't.

Before any analysis, we will convert 'night bedtime' into decimalized time.

In [None]:
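# Convert 'night bedtime' into decimalized time:
# the integer part is multiplied by 60 and the two-digit fractional part is added as a whole number
# (for example, 20.45 becomes 20*60 + 45 = 1245)
df.loc[:, 'night bedtime'] = np.floor(df['night bedtime'])*60 + np.round(df['night bedtime'] % 1, 2)*100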
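
To see what this expression does step by step, here is a small sketch on made-up values (not the study data). It assumes, as the arithmetic suggests, that the raw values are clock times written as hours.minutes, so that 20.45 is read as 20:45; under that assumption the result is the number of minutes after midnight.

```python
import numpy as np
import pandas as pd

# Hypothetical bedtimes written as hours.minutes (20.45 is read as 20:45)
raw = pd.Series([20.45, 19.30, 21.05])

hours = np.floor(raw)                  # whole-hour part
minutes = np.round(raw % 1, 2) * 100   # two-digit fractional part read as minutes

converted = hours * 60 + minutes

# Round before printing to hide tiny floating-point residue
print(converted.round().tolist())
```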

Now, isolate the column 'night bedtime' for those who nap into a new variable, and for those who don't nap into another new variable. 

In [None]:
bedtime_nap =df.loc[(df['napping']==1),'night bedtime']
print('bedtime_nap')
print(bedtime_nap)
#print(len(bedtime_nap))

In [None]:
bedtime_no_nap = df.loc[(df['napping']==0),'night bedtime']
print('bedtime_no_nap')
print(bedtime_no_nap)
#print(bedtime_no_nap.count())

Now find the sample mean bedtime for nap and no_nap, and the standard error for $\bar{X}_{nap}$ and $\bar{X}_{no\ nap}$.

In [None]:
nap_mean_bedtime = bedtime_nap.mean()
print('Sample mean bedtime for nap:',nap_mean_bedtime)

no_nap_mean_bedtime =  bedtime_no_nap.mean()
print('Sample mean bedtime for no_nap:',no_nap_mean_bedtime)

nap_se_mean_bedtime = np.std(bedtime_nap) / np.sqrt(len(bedtime_nap))
print('Standard error for mean for nap:', nap_se_mean_bedtime)

no_nap_se_mean_bedtime = np.std(bedtime_no_nap) / np.sqrt(len(bedtime_no_nap))
print('Standard error for mean for no nap:', no_nap_se_mean_bedtime)

# Important: the ddof argument sets the divisor of the variance to n - ddof.
# Population standard deviation: ddof=0
# Sample standard deviation: ddof=1
# The standard error of the mean is s / sqrt(n); when s is estimated from a sample, the usual choice is the sample standard deviation (ddof=1).
# With the defaults:
# np.std() uses ddof=0, also when it is given a pandas Series, so the standard errors above are computed with ddof=0 and are slightly smaller than the ddof=1 version.
# pandas.Series.std() and pandas.DataFrame.std() use ddof=1, which gives the sample size minus one (n-1) in the denominator, as wanted for a sample standard deviation.
# Because ddof can be set explicitly, either function can be made to work in any of these situations.
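
The effect of the `ddof` argument is easiest to see on a small made-up sample. The sketch below (the values in `x` are hypothetical, not the study data) compares the NumPy and pandas defaults and the two standard errors that result.

```python
import numpy as np
import pandas as pd

# Hypothetical sample, not the study data
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

sd_population = np.std(x)          # NumPy default: ddof=0, divides by n
sd_sample = x.std()                # pandas default: ddof=1, divides by n - 1
sd_sample_np = np.std(x, ddof=1)   # NumPy made to match the pandas default

se_ddof0 = sd_population / np.sqrt(n)
se_ddof1 = sd_sample / np.sqrt(n)

print("sd (ddof=0):", sd_population)
print("sd (ddof=1):", sd_sample)
print("se (ddof=0):", se_ddof0)
print("se (ddof=1):", se_ddof1)
```

The ddof=0 version is always the smaller of the two, and the gap matters most for small samples such as the groups in this study.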


**Question**: Given our sample sizes of $n_1$ and $n_2$ for napping and non napping toddlers respectively, how many degrees of freedom ($df$) are there for the associated $t$ distributions?

In [None]:
df_n1=n1-1
df_n2=n2-1
print(df_n1, ',', df_n2)

To build a 95% confidence interval, what is the value of $t^*$? You can find this value using the percent point function: 
```
from scipy.stats import t

t.ppf(probability, df)
```
This returns the quantile such that the probability to the left of it equals the input probability (for the specified degrees of freedom). 

Example: to find the $t^*$ for a 90% confidence interval, we want $t^*$ such that 90% of the density of the $t$ distribution lies between $-t^*$ and $t^*$.

Or in other words if $X \sim t(df)$:

$P(-t^* < X < t^*) = 0.90$

Which, because the $t$ distribution is symmetric, is equivalent to finding $t^*$ such that:  

$P(X < t^*) = 0.95$

So the $t^*$ for a 90% confidence interval with, let's say, $df = 10$ will be:

`t_star = t.ppf(.95, df=10)`

The $t$ distribution is symmetric, so it has two tails (left and right). 
For a 95% confidence interval, the 5% of the density that lies outside the interval is split between the left tail and the right tail: 5% = 2.5% + 2.5%. The function `t.ppf()` takes the probability to the left of the quantile, so we pass 100% - 2.5% = 97.5%, that is, 0.975. 
If the whole 5% were placed in a single tail, as for a one-sided bound, we would pass 0.95 to `t.ppf()` instead.
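
The reasoning above can be sketched as a small helper. The name `t_star` and its arguments are illustrative assumptions, not part of the assignment; the helper simply puts half of the leftover probability in each tail and then checks, with `t.cdf`, that the area between $-t^*$ and $t^*$ recovers the confidence level.

```python
from scipy.stats import t

def t_star(confidence, df):
    """Two-sided critical value: puts (1 - confidence) / 2 in each tail."""
    upper_tail = (1 - confidence) / 2
    return t.ppf(1 - upper_tail, df)

t90 = t_star(0.90, df=10)   # same as t.ppf(0.95, 10)
t95 = t_star(0.95, df=10)   # same as t.ppf(0.975, 10)

# Check: the probability between -t* and t* recovers the confidence level
coverage90 = t.cdf(t90, 10) - t.cdf(-t90, 10)
coverage95 = t.cdf(t95, 10) - t.cdf(-t95, 10)

print(t90, coverage90)
print(t95, coverage95)
```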

In [None]:
# Find the t_stars for the 95% confidence intervals
nap_t_star = t.ppf(.975, df=df_n1)
no_nap_t_star = t.ppf(.975, df=df_n2)
print('nap_t_star:', nap_t_star,', ', 'no_nap_t_star:', no_nap_t_star )


Now to create our confidence intervals. For the average bedtime for nap and no nap, find the upper and lower bounds for the respective confidence intervals.

In [None]:
#lcb = mean - tstar * se
#ucb = mean + tstar * se

lcb1 = nap_mean_bedtime - nap_t_star * nap_se_mean_bedtime
ucb1 = nap_mean_bedtime + nap_t_star * nap_se_mean_bedtime
lcb2 = no_nap_mean_bedtime - no_nap_t_star * no_nap_se_mean_bedtime
ucb2 = no_nap_mean_bedtime + no_nap_t_star * no_nap_se_mean_bedtime
CI_nap=[lcb1, ucb1]
CI_no_nap=[lcb2, ucb2]
print('CI nap:', CI_nap)
print('CI no_nap:', CI_no_nap)
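
The same mean ± t\* · se calculation can be wrapped in one function. This is a minimal sketch under stated assumptions: `mean_ci` is a hypothetical helper, `toy` is made-up data rather than the study data, and the default `ddof=1` uses the sample standard deviation (pass `ddof=0` to mirror the `np.std` default used in the cells above). The result is cross-checked against `scipy.stats.t.interval`.

```python
import numpy as np
from scipy.stats import t

def mean_ci(sample, confidence=0.95, ddof=1):
    """Return (lower, upper) bounds for mean +/- t* * se."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    mean = sample.mean()
    se = sample.std(ddof=ddof) / np.sqrt(n)
    t_star = t.ppf(1 - (1 - confidence) / 2, df=n - 1)
    return mean - t_star * se, mean + t_star * se

# Hypothetical bedtimes, not the study data
toy = [1230, 1245, 1215, 1260, 1200, 1240]
lcb, ucb = mean_ci(toy)

# Cross-check against scipy's built-in interval for the same mean and se
se = np.std(toy, ddof=1) / np.sqrt(len(toy))
lcb_ref, ucb_ref = t.interval(0.95, len(toy) - 1, loc=np.mean(toy), scale=se)

print((lcb, ucb))
print((lcb_ref, ucb_ref))
```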


**Question**: What are the 95% confidence intervals, rounded to the nearest ten, for the average bedtime (in decimalized time) for toddlers who nap and for toddlers who don't nap? 

$CI = \bar{X} \pm t^* \cdot s.e.(\bar{X})$


Answer:

CI nap: [1215, 1251]

CI no_nap: [1153, 1229]