# Laboratory - YRBSS

## Getting Started

### Load packages
Let's load the relevant packages here.

In [15]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns

from scipy import stats 

from pathlib import Path

Let's now import the data sample from the _YRBSS_.

In [16]:
yrbss_path = Path("../datasets/yrbss.csv")
yrbss_df = pd.read_csv(yrbss_path)

yrbss_df.sample(5)

Unnamed: 0,age,gender,grade,hispanic,race,height,weight,helmet_12m,text_while_driving_30d,physically_active_7d,hours_tv_per_school_day,strength_training_7d,school_night_hours_sleep
10570,18.0,male,12,hispanic,,1.65,54.43,never,0,6.0,4,7.0,8
8492,16.0,female,10,not,White,1.65,79.38,never,1-2,5.0,3,0.0,6
12220,16.0,female,9,not,White,1.63,88.45,never,1-2,7.0,2,1.0,5
4951,14.0,male,9,not,Black or African American,,,did not ride,0,2.0,3,2.0,5
6338,17.0,male,11,not,,1.83,68.04,did not ride,did not drive,1.0,5+,2.0,6


### Exercise 1 - What are the counts within each category for the number of days these students have texted while driving within the past 30 days?

In [17]:
yrbss_df.groupby("text_while_driving_30d").size()

text_while_driving_30d
0                4792
1-2               925
10-19             373
20-29             298
3-5               493
30                827
6-9               311
did not drive    4646
dtype: int64
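
The counts above come back in alphabetical order, so `10-19` sorts before `3-5`. As a minimal sketch, assuming a small made-up set of answers (the `responses` Series and the ordering in `levels` are illustrative, not part of the original notebook), the categories can be put into their natural order like this:

```python
import pandas as pd

# Hypothetical toy answers; in the notebook this would be yrbss_df["text_while_driving_30d"].
responses = pd.Series(["0", "30", "1-2", "did not drive", "0", "10-19", "3-5", "0", None])

# Assumed natural ordering of the levels seen in the output above.
levels = ["did not drive", "0", "1-2", "3-5", "6-9", "10-19", "20-29", "30"]

# value_counts() drops missing answers; reindex() imposes the order and fills absent levels with 0.
counts = responses.value_counts().reindex(levels, fill_value=0)
print(counts)
```

Reindexing keeps levels that happen to have no observations, which makes tables from different subsets easier to compare.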

### Exercise 2 - What is the proportion of people who have texted while driving every day in the past 30 days and never wear helmets?

In [18]:
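# Caution: this filter keeps never-helmet respondents with any answer other than "did not drive"
# (missing answers included); to count only daily texters, compare the column with "30" instead.
yrbss_df.loc[lambda df: (df["helmet_12m"] == "never") & (df["text_while_driving_30d"] != "did not drive")] \
        .shape[0] / yrbss_df.shape[0]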

0.3578738128543032
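
The exercise asks about students who texted every day, which corresponds to the level `"30"`. The sketch below uses a made-up six-row frame (`toy` is hypothetical) to show two readings of the question: the share of all respondents who are both daily texters and never wear a helmet, and the share of daily texters among those who never wear a helmet.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the YRBSS sample.
toy = pd.DataFrame({
    "helmet_12m": ["never", "never", "always", "never", "did not ride", "never"],
    "text_while_driving_30d": ["30", "0", "30", "did not drive", "30", "30"],
})

never = toy["helmet_12m"] == "never"
daily = toy["text_while_driving_30d"] == "30"

# Joint proportion: never-helmet AND daily texter, out of everyone.
prop_joint = (never & daily).mean()

# Conditional proportion: daily texters among never-helmet respondents only.
prop_conditional = daily[never].mean()

print(prop_joint, prop_conditional)
```

Taking the mean of a boolean Series gives the proportion of `True` values, which avoids dividing row counts by hand.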

Remember that you can use `.loc` to limit the dataset to just non-helmet wearers. Here, we will name the dataset `no_helmet`.

In [19]:
no_helmet = yrbss_df.loc[yrbss_df["helmet_12m"] == "never", :]

Also, it may be easier to calculate the proportion if you create a new variable that specifies whether the individual has texted every day while driving over the past 30 days or not. We will call this variable `text_ind`.

In [27]:
# Vectorized flag; assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
no_helmet = no_helmet.assign(
    text_ind=np.where(no_helmet["text_while_driving_30d"] == "30", "yes", "no")
)

no_helmet.sample(5)


Unnamed: 0,age,gender,grade,hispanic,race,height,weight,helmet_12m,text_while_driving_30d,physically_active_7d,hours_tv_per_school_day,strength_training_7d,school_night_hours_sleep,text_ind
10002,16.0,female,10,not,American Indian or Alaska Native,1.52,41.73,never,did not drive,4.0,5+,0.0,5,no
7561,17.0,male,11,not,White,1.8,58.97,never,30,7.0,3,3.0,6,yes
4551,18.0,female,12,not,,1.7,68.04,never,0,7.0,<1,7.0,8,no
1658,14.0,female,9,not,Asian,1.45,45.36,never,did not drive,2.0,2,0.0,6,no
9166,17.0,female,11,hispanic,,1.65,54.43,never,0,5.0,1,5.0,9,no


## Inference on proportions

When summarizing the YRBSS, the Centers for Disease Control and Prevention seeks insight into the population _parameters_. The question “What proportion of people in your sample reported that they have texted while driving each day for the past 30 days?” is answered with a sample statistic, whereas the question “What proportion of people on earth have texted while driving each day for the past 30 days?” is answered with an estimate of the parameter.

The inferential tools for estimating a population proportion are analogous to those used for means in the last chapter: the confidence interval and the hypothesis test. Here we build a 95% bootstrap percentile confidence interval for the proportion of non-helmet wearers who texted while driving every day.

In [32]:
res = stats.bootstrap((no_helmet["text_ind"].map({"yes": 1, "no": 0}),),
                       statistic=np.mean, 
                       confidence_level=0.95, 
                       n_resamples=1000,
                       method="percentile")

res.confidence_interval

ConfidenceInterval(low=0.06034112082556973, high=0.07252759065500931)
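
For comparison with the bootstrap interval, a proportion also has a closed-form normal-approximation (Wald) interval, $\hat{p} \pm z \sqrt{\hat{p}(1-\hat{p})/n}$. The following is a minimal sketch; the helper name `wald_interval` and the counts passed to it are hypothetical, and the approximation assumes a reasonably large sample with a proportion that is not too close to 0 or 1.

```python
import math

def wald_interval(successes, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of p_hat
    return p_hat - z * se, p_hat + z * se

# Hypothetical counts: 50 successes out of 200 observations.
low, high = wald_interval(50, 200)
print(low, high)
```

With the real data, `successes` would be the number of `"yes"` values in `text_ind` and `n` the number of rows in `no_helmet`.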

### Exercise 3 - What is the margin of error for the estimate of the proportion of non-helmet wearers that have texted while driving each day for the past 30 days based on this survey?

In [34]:
res.standard_error * 1.96

0.006200895632995962
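
Another way to read off a margin of error is to take half the width of the confidence interval. The sketch below uses the interval endpoints reported earlier in this notebook; treating the half-width as the margin of error assumes the interval is roughly symmetric around the estimate.

```python
# Endpoints of the 95% percentile bootstrap interval reported above.
low, high = 0.06034112082556973, 0.07252759065500931

# Half-width of the interval, an alternative estimate of the margin of error.
half_width = (high - low) / 2
print(half_width)
```

The result is close to, but not identical with, `1.96 * res.standard_error`, because the percentile interval is not built from the normal approximation.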

### Exercise 4 - Calculate confidence intervals for two other categorical variables (you’ll need to decide which level to call “success”), and report the associated margins of error. Interpret the intervals in the context of the data. It may be helpful to create a new data set for each of the two variables first, and then use these data sets to construct the confidence intervals.

We can check whether each sampled individual was physically active on at least three of the past seven days, and whether they did strength training on at least three of those days.

In [42]:
# Success = active (or strength training) on at least 3 of the past 7 days.
# Missing answers compare as False here, so they are counted as 0 (not a success).
phy = (yrbss_df["physically_active_7d"] >= 3).astype(int)
strength = (yrbss_df["strength_training_7d"] >= 3).astype(int)

phy_res = stats.bootstrap((phy,),
                          statistic=np.mean,
                          confidence_level=0.95,
                          n_resamples=1000,
                          method="percentile")

strength_res = stats.bootstrap((strength,),
                               statistic=np.mean,
                               confidence_level=0.95,
                               n_resamples=1000,
                               method="percentile")

# The loop variable is named ci_res so it does not overwrite the earlier `res`.
for ci_res in (phy_res, strength_res):
    print(f"Confidence interval is {ci_res.confidence_interval} and margin of error is {1.96 * ci_res.standard_error}")

Confidence interval is ConfidenceInterval(low=0.6477195759405139, high=0.6633291614518148) and margin of error is 0.007981023012333032
Confidence interval is ConfidenceInterval(low=0.4672734300228226, high=0.48435544430538174) and margin of error is 0.008512303181469838


From the above results, we can be 95% confident that the true proportion of individuals who were physically active on at least three of the past seven days is between 0.6477 and 0.6633, and that the true proportion of those who did strength training on at least three of the past seven days is between 0.4673 and 0.4844.
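
One caveat about these proportions: a comparison such as `>= 3` evaluates to `False` for missing values, so respondents who skipped the question are counted as failures. As a rough sketch on made-up data (the `days` Series is hypothetical), dropping missing answers first changes the denominator and therefore the estimate:

```python
import numpy as np
import pandas as pd

# Hypothetical answers to a "days active in the past 7" question, with two missing values.
days = pd.Series([7.0, 0.0, np.nan, 3.0, np.nan, 5.0, 2.0, 4.0])

# Missing answers count as "not a success": 4 successes out of 8 rows.
prop_with_missing = (days >= 3).mean()

# Restrict to observed answers first: 4 successes out of 6 rows.
observed = days.dropna()
prop_observed = (observed >= 3).mean()

print(prop_with_missing, prop_observed)
```

Which denominator is appropriate depends on the question being asked; the point is only that the choice should be made deliberately.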