# Laboratory - YRBSS

## Getting Started

### Load packages
Let's load the relevant packages here.

In [55]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns

from scipy import stats 

from pathlib import Path

Let's now import the data sample from the _YRBSS_.

In [56]:
yrbss_path = Path("../datasets/yrbss.csv")
yrbss_df = pd.read_csv(yrbss_path)

yrbss_df.sample(5)

Unnamed: 0,age,gender,grade,hispanic,race,height,weight,helmet_12m,text_while_driving_30d,physically_active_7d,hours_tv_per_school_day,strength_training_7d,school_night_hours_sleep
3578,18.0,male,11,not,White,1.8,102.06,did not ride,0,4.0,5+,1.0,7
5672,17.0,female,10,not,Black or African American,1.57,56.7,did not ride,,2.0,5+,2.0,6
6415,16.0,female,11,not,Black or African American,1.5,72.58,did not ride,0,0.0,<1,0.0,7
5455,15.0,female,9,not,White,1.57,44.45,never,0,1.0,1,2.0,7
6364,15.0,male,9,not,,1.55,49.9,never,did not drive,0.0,<1,0.0,8


### Exercise 1 - What are the counts within each category for the number of days these students have texted while driving within the past 30 days?

In [57]:
yrbss_df.groupby("text_while_driving_30d").size()

text_while_driving_30d
0                4792
1-2               925
10-19             373
20-29             298
3-5               493
30                827
6-9               311
did not drive    4646
dtype: int64
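Note that `groupby(...).size()` leaves out rows where the grouping key is missing, so the counts above do not show how many students gave no answer. The sketch below illustrates the difference on a small made-up Series (`toy` is a hypothetical stand-in, not the YRBSS data); the same `value_counts(dropna=False)` call should apply to `yrbss_df["text_while_driving_30d"]`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for yrbss_df["text_while_driving_30d"]; the values are made up.
toy = pd.Series(["0", "30", "did not drive", np.nan, "0", "1-2", np.nan])

# groupby(...).size() drops rows whose key is missing (dropna=True by default).
grouped = toy.groupby(toy).size()

# value_counts(dropna=False) keeps a separate count for the missing answers.
counts = toy.value_counts(dropna=False)

print(grouped)
print(counts)
```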

### Exercise 2 - What is the proportion of people who have texted while driving every day in the past 30 days and never wear helmets?

In [58]:
yrbss_df.loc[lambda df: (df["helmet_12m"] == "never") & (df["text_while_driving_30d"] != "did not drive")] \
        .shape[0] / yrbss_df.shape[0]

0.3578738128543032
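The expression above counts never-helmet rows whose texting answer is anything other than "did not drive" (missing answers also pass the `!=` test) and divides by the total number of rows, so it is not quite the proportion the exercise asks for. A minimal sketch of one way to compute that proportion is shown below on made-up rows; `toy_df`, `both`, `prop_all`, and `prop_within_no_helmet` are hypothetical names, and the choice of denominator depends on how the question is read.

```python
import pandas as pd

# Hypothetical toy rows standing in for yrbss_df; the values are made up.
toy_df = pd.DataFrame({
    "helmet_12m": ["never", "never", "always", "never", "did not ride", "never"],
    "text_while_driving_30d": ["30", "0", "30", "did not drive", "30", "30"],
})

# True where the student never wears a helmet AND texted on all 30 days.
# The texting column holds strings, so compare against "30", not 30.
both = (toy_df["helmet_12m"] == "never") & (toy_df["text_while_driving_30d"] == "30")

# The mean of a boolean Series is the proportion of True values (all rows as denominator).
prop_all = both.mean()

# Alternative reading: proportion among never-helmet rows only.
prop_within_no_helmet = both.sum() / (toy_df["helmet_12m"] == "never").sum()

print(prop_all, prop_within_no_helmet)
```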

Remember that you can use `.loc` to limit the dataset to just non-helmet wearers. Here, we will name the dataset `no_helmet`.

In [59]:
no_helmet = yrbss_df.loc[yrbss_df["helmet_12m"] == "never", :]

Also, it may be easier to calculate the proportion if you create a new variable that specifies whether the individual has texted every day while driving over the past 30 days or not. We will call this variable `text_ind`.

In [64]:
# The column holds strings ("0", "1-2", ..., "30"), so compare against "30", not the integer 30.
# .assign returns a new DataFrame, which avoids the SettingWithCopyWarning.
no_helmet = no_helmet.assign(text_ind=np.where(no_helmet["text_while_driving_30d"] == "30", "yes", "no"))

no_helmet

Unnamed: 0,age,gender,grade,hispanic,race,height,weight,helmet_12m,text_while_driving_30d,physically_active_7d,hours_tv_per_school_day,strength_training_7d,school_night_hours_sleep,text_ind
0,14.0,female,9,not,Black or African American,,,never,0,4.0,5+,0.0,8,no
1,14.0,female,9,not,Black or African American,,,never,,2.0,5+,0.0,6,no
2,15.0,female,9,hispanic,Native Hawaiian or Other Pacific Islander,1.73,84.37,never,30,7.0,5+,0.0,<5,yes
3,15.0,female,9,not,Black or African American,1.60,55.79,never,0,0.0,2,0.0,6,no
7,14.0,male,9,not,Black or African American,1.88,71.22,never,,4.0,5+,0.0,6,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13569,17.0,male,12,not,Native Hawaiian or Other Pacific Islander,1.73,69.40,never,30,5.0,do not watch,,,yes
13573,17.0,male,12,not,Native Hawaiian or Other Pacific Islander,1.68,57.61,never,30,6.0,4,,,yes
13575,17.0,female,12,,,1.57,63.50,never,did not drive,3.0,3,,,no
13577,17.0,female,12,not,Native Hawaiian or Other Pacific Islander,1.60,72.58,never,0,4.0,5+,,,no


## Inference on proportions

When summarizing the YRBSS, the Centers for Disease Control and Prevention seeks insight into population _parameters_. The distinction matters: the question “What proportion of people in your sample reported that they have texted while driving each day for the past 30 days?” is answered with a statistic computed directly from the sample, whereas the question “What proportion of people on earth have texted while driving each day for the past 30 days?” can only be answered with an estimate of the parameter.

The inferential tools for estimating a population proportion are analogous to those used for means in the last chapter: the confidence interval and the hypothesis test.

In [None]:
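text_yes = (no_helmet["text_ind"] == "yes").to_numpy(dtype=float)  # 1 = texted every day, 0 = otherwise
stats.bootstrap((text_yes,), np.mean, confidence_level=0.95, method="percentile")  # mean of 0/1 values = sample proportion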
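For comparison with the bootstrap, a confidence interval for a proportion is often built from the normal approximation, $\hat{p} \pm z^{*}\sqrt{\hat{p}(1-\hat{p})/n}$. The sketch below is only an illustration under stated assumptions: the 0/1 indicator `text_yes` is simulated here rather than taken from the YRBSS, and the success rate of 0.07 is an arbitrary made-up value.

```python
import numpy as np
from scipy import stats

# Hypothetical 0/1 indicator (1 = texted every day); simulated, not YRBSS data.
rng = np.random.default_rng(0)
text_yes = rng.binomial(1, 0.07, size=1000)

n = text_yes.size
p_hat = text_yes.mean()                # sample proportion (the statistic)
se = np.sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
z_star = stats.norm.ppf(0.975)         # critical value for a 95% interval

ci_low, ci_high = p_hat - z_star * se, p_hat + z_star * se
print(p_hat, (ci_low, ci_high))
```

The approximation is usually considered reasonable only when the sample contains enough observations in both categories; with very few "yes" answers, a resampling-based interval may be the safer choice.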