In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw02.ipynb")

<img src="data6.png" style="width: 15%; float: right; padding: 1%; margin-right: 2%;"/>

# Homework 2 – Arrays and Table Fundamentals

## Data 6

This homework is due on **Thursday, September 18 at 8:00PM**. 

You must submit the assignment to [Gradescope](https://www.gradescope.com/). See the [syllabus](https://data6.org/fa25/syllabus) for our homework late submission policy.

**Reminder:** This homework assignment contains hidden tests, so passing the visible tests does not guarantee that your answer is correct. The autograder in the homework is intended as a sanity check.

While we encourage you to collaborate with your peers, directly copying solutions is not allowed. Please refer to our [syllabus](https://data6.org/fa25/syllabus) on Academic Honesty.
If you find yourself stuck on a problem, we recommend that you make a post on [Ed](https://edstem.org/us/courses/) or attend [office hours](https://data6.org/fa25/schedule/).

**Recommended readings for Homework 2:**
- [Introduction to Tables](https://inferentialthinking.com/chapters/03/4/Introduction_to_Tables.html)
- [Visualization](https://inferentialthinking.com/chapters/07/Visualization.html)

In [None]:
# Just run this cell to import necessary Python packages
try:
    from datascience import *
except ImportError:
    %pip install -q datascience
    from datascience import *
import numpy as np

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

# Part 1: Exploring the Table

<br/><br/>

## Background on the Data

The dataset that we'll use in this homework comes from the Behavioral Risk Factor Surveillance System (BRFSS), a health survey fielded by the Centers for Disease Control and Prevention (CDC). From the [BRFSS website](https://www.cdc.gov/brfss/index.html):
>The Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services.

>By collecting behavioral health risk data at the state and local level, BRFSS has become a powerful tool for targeting and building health promotion activities. 

While we've wrangled and cleaned the data set you'll use in your investigation, you're welcome to investigate the original source; you can do so via the [Survey Data section](https://www.cdc.gov/brfss/data_documentation/index.htm) of the BRFSS site.

---

## Question 1 Overview

The dataset that you will investigate is a **subset of the 2022 BRFSS Survey**. We've taken all the data points corresponding to fully-completed surveys and some of the more interesting columns (in our opinion). Since the entire data set is so large (over 350,000 respondents), we've sampled a subset of respondents and built a CSV from the original data. Your first task is to make a reasonable guess about what subset we chose.

## Question 1.1

The file `brfss2022.csv` contains our dataset, a subset of the 2022 BRFSS dataset. Each row represents one individual's responses to the BRFSS survey. Load it as a table named `brfss` using the `Table.read_table()` function.

_Hint_: `Table.read_table(...)` takes one argument (data file name in string format) and returns a table.

In [None]:
brfss = ...
brfss

In [None]:
grader.check("q1_1")

## Question 1.2

Find the number of rows in the `brfss` table and assign it to `num_rows_brfss`.

In [None]:
num_rows_brfss = ...
num_rows_brfss

In [None]:
grader.check("q1_2")

## Question 1.3

The `State` column indicates the U.S. state or territory where an individual responded to the BRFSS survey.
Assign `num_ca_rows` to the number of individuals who responded to the BRFSS survey in **California**. You may need to use more lines than provided.

In [None]:
num_ca_rows = ...
num_ca_rows

In [None]:
grader.check("q1_3")

## Question 1.4

For reference, the full BRFSS dataset (which we did not provide you) has over 350,000 respondents. 
    
Based on your answers to the previous parts, fill in the blank to the following statement:

> There are __________ respondents in our subset of the BRFSS dataset.

Fill in the blank by assigning 1 or 2 to the name `takeaway_q1_4` below.

1. non-California
2. only California

In [None]:
takeaway_q1_4 = ...

In [None]:
grader.check("q1_4")

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />


# Part 2: Investigating our Variables

We'll start by looking at the BRFSS Survey Data on the **individual level**. That is, each row in our `brfss` table corresponds to a **unique individual** who responded to the survey. These individuals are not identified by their legal names to protect their privacy.

Each column in the `brfss` table is a variable created from a question asked in the official BRFSS Survey:

In [None]:
# just run this cell
print(brfss.labels)

Based on these column names, it looks like the data includes questions about **telecommunications**, **housing**, **demographic information**, **mental and physical health**, **alcohol and drug consumption**, and **physical exercise**.

We will explore a subset of these variables in this homework. To read about the variables and survey questions used in the BRFSS Survey Data, please visit the official [2022 BRFSS Survey Data and Documentation](https://www.cdc.gov/brfss/annual_data/annual_2022.html).

## Review: Array functions

Before we continue, recall that there are many functions that perform operations on arrays.

Some are built-in functions, like `len` and `sum`:

In [None]:
arr = make_array(1, 2, 3, 0, 4)
print("len is", len(arr), "and sum is", sum(arr))

Others are imported from the NumPy package, like `np.mean`, `np.count_nonzero`, and `np.sum`. Note that `np.sum` and `sum` both return the sum of an array's elements.
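
As a quick illustration of the functions named above, here is a minimal sketch on a small array. It uses `np.array` as a stand-in for `make_array` so that it runs without the `datascience` package, and the variable names are hypothetical.

```python
import numpy as np

# Same values as the `arr` example above, built with NumPy directly.
hours = np.array([1, 2, 3, 0, 4])

builtin_total = sum(hours)          # Python's built-in sum
numpy_total = np.sum(hours)         # NumPy's sum; same result on an array
mean_value = np.mean(hours)         # arithmetic average of the elements
nonzero = np.count_nonzero(hours)   # number of elements that are not 0

print("sum:", builtin_total, "np.sum:", numpy_total)
print("mean:", mean_value, "non-zero count:", nonzero)
```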

In [None]:
print("# non-zero elements:", np.count_nonzero(arr))

Table columns can also be arrays. For example, to get the `Personal Doctor` column as an array, we call the `column` method on the `brfss` table:

In [None]:
# Just run this cell
personal_doctors = brfss.column("Personal Doctor")
personal_doctors

<br/><br/>

<hr style="border: 1px solid #fdb515;" />

## Personal Doctor

First, you will investigate the following question from the BRFSS Survey:
> *Do you have one person you think of as your personal doctor or health care provider?*

The array `personal_doctors` above contains the responses to this question:

* `1` for "Yes"
* `0` for "No" or "Missing Response."

In a later part of the class, we will call "Personal Doctor" a **binary variable** because of these two values. We'll come back to this.

---

## Question 2.1

Using `personal_doctors`, assign `has_personal_doctor` to `1` if the individual with index **37** (i.e., the 38th respondent) has a personal doctor, and `0` if they do not or if they did not respond.

**Note**: Do not ["hardcode"](https://en.wikipedia.org/wiki/Hard_coding) the answer! Instead, you should **index** into the array using the method `item`.
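
If indexing with `item` is unfamiliar, the sketch below shows it on a small made-up array (the values and names are hypothetical, not taken from `brfss`). Remember that indices start at 0, so index 2 is the third element.

```python
import numpy as np

# Hypothetical binary responses, not the real survey data.
responses = np.array([1, 0, 0, 1, 1])

# item(2) returns the element at index 2, i.e., the third element.
third_response = responses.item(2)
print(third_response)  # 0
```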

In [None]:
has_personal_doctor = ...
has_personal_doctor

In [None]:
grader.check("q2_1")

---

## Question 2.2

Using the array `personal_doctors`, assign `percent_personal_doc` to the **fraction** of people with a personal doctor in the dataset above. For the sake of this problem, an individual has a personal doctor only if they responded "yes" to the corresponding survey question.

* You should **only** use Python built-in functions like `sum` and `len` for this problem! In other words, do **not** use any NumPy functions (e.g., do not use `np.count_nonzero`, etc.).
* **Hint**: How can you make use of the *binary* nature of the "Personal Doctor" variable? The only values in the array are 1 (has personal doctor) and 0 (has no known personal doctor).

In [None]:
percent_personal_doc = ...
percent_personal_doc

In [None]:
grader.check("q2_2")

<hr style="border: 1px solid #fdb515;" />

## Age

Now, consider the `Age` column:

In [None]:
ages = brfss.column('Age')
ages


## Question 3.1

Compute the average of the respondent ages in the `ages` array and call it `average_age`.

In [None]:
average_age = ...
average_age

In [None]:
grader.check("q3_1")


## Question 3.2

Compute the maximum of the respondent ages in the `ages` array and call it `max_age`.

In [None]:
max_age = ...
max_age

In [None]:
grader.check("q3_2")


## Question 3.3

Consider your answer above. Do you truly believe that the maximum age of all respondents is 80? 

Let's revisit the definition of this variable:

> `Age`: The reported age of the respondent, collapsed above 80.

So the `Age` variable is not necessarily the respondent's actual age! "Collapsed above 80" means that ages below 80 are reported with their actual value, and any age of 80 or over is recorded as 80 in this variable.
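
To make "collapsed above 80" concrete, here is a small sketch with made-up ages. The list `true_ages` is hypothetical; the real dataset only contains the recorded values.

```python
# Hypothetical true ages of five respondents.
true_ages = [34, 79, 80, 85, 97]

# Collapsing above 80: any age of 80 or over is recorded as 80.
recorded_ages = [min(age, 80) for age in true_ages]

print(recorded_ages)  # [34, 79, 80, 80, 80]
```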

---

With this information, complete the following statement:

> The value `average_age` is __________ compared to the actual average age of respondents.

* higher 
* about the same
* lower

Fill in the blank by assigning one of the above options as a **string** to `takeaway_q3_3`.

In [None]:
takeaway_q3_3 = ...

print("The value average_age is",
      takeaway_q3_3,
      "compared to the actual average age of respondents.")

In [None]:
grader.check("q3_3")


## Question 3.4

This dataset likely does _not_ report the true ages of respondents who are 80 or older because of **respondent privacy**. The BRFSS survey collects detailed health and demographic information from every respondent. With a small enough group, a reasonably informed person may be able to identify exactly who is who. The elderly population is one such small group among the overall respondents.

Compute the percentage of respondents who are 80 or above. Call this percentage `percent_over_80`.

_Hint_: Your expression should use a Table method on `brfss` to identify the number of rows that report an `Age` of 80. With our variable definition, 80 refers to age 80 or over.

In [None]:
percent_over_80 = ...
percent_over_80

In [None]:
grader.check("q3_4")

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

# Part 3: Missing Values

<hr style="border: 1px solid #fdb515;" />

## Sleep

Let's investigate the responses to a question from the BRFSS Survey:
> *On average, how many hours of sleep do you get in a 24-hour period?*

Run the following cell to load in the `sleep_col` array. This array contains the number of hours each respondent reported sleeping in a 24-hour period.

In [None]:
# Just run this cell
sleep_col = brfss.column("Sleep Time")
sleep_col

It turns out that some values in `sleep_col` are `0`! Instead of printing out the entire array, run the cell below to print out a random set of 20 values from the array. You can run it multiple times to print out different sleep times.

In [None]:
# Run this cell to print out 20 random sleep times
# Feel free to run it multiple times.
np.random.choice(sleep_col, 20)

Through the "random sampling" of sleep times above, you observe that some values in the `sleep_col` array are `0`. Since sleeping zero hours on average is near-impossible, this must be a **code** for something else. In this case, if the individual responded **don't know/unsure** or **skipped** this question, their hours of sleep value is `0`. (Note: In this context, "code" means a variable category, not Python code.)

---

## Question 4

Assign `sleeper_count` to an expression that uses a NumPy function to count the number of individuals who provided valid responses to the sleep survey question. Your expression should involve both a NumPy function and the array `sleep_col`.

**Hint**: Please see a list of NumPy functions in the [Data 6 Python reference](https://data6.org/notes/reference/). One of them will work!

In [None]:
sleeper_count = ...
sleeper_count

In [None]:
grader.check("q_4")

<hr style="border: 1px solid #fdb515;" />

## Height and Weight

Now consider the next two variables:

* `Height`: Height in meters [2 implied decimal places].
* `Weight`: Weight in kilograms [2 implied decimal places].

## Question 5.1

Below, use the `brfss` table to construct a new table of these two columns. Call the new table `heights_weights`.

In [None]:
heights_weights = ...
heights_weights.show(5)

In [None]:
grader.check("q_5_1")

## Question 5.2

These numbers seem rather high for human heights and weights. This is because the variable definitions include the phrase **"2 implied decimal places."**

Construct a new table `heights_weights_converted` that adds two new columns to `heights_weights`. The columns should be labeled as follows:
* `Height (m)`, the respondent's height in meters (e.g., a `Height` value of 160 means a `Height (m)` value of 1.60 m)
* `Weight (kg)`, the respondent's weight in kilograms (e.g., a `Weight` value of 6713 means a `Weight (kg)` value of 67.13 kg).

_Hint_: Your answer will likely involve the table method `with_columns`.
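
To see what "2 implied decimal places" means in terms of array arithmetic, here is a minimal sketch on made-up values. The arrays and names below are hypothetical and are not columns of `brfss`. The idea is that the last two digits of each stored number sit after the decimal point.

```python
import numpy as np

# Hypothetical raw values stored with 2 implied decimal places.
raw_heights = np.array([160, 175, 183])
raw_weights = np.array([6713, 8165, 9072])

# Dividing an array by 100 divides every element by 100.
heights_m = raw_heights / 100    # 160 becomes 1.60 (meters)
weights_kg = raw_weights / 100   # 6713 becomes 67.13 (kilograms)

print(heights_m)
print(weights_kg)
```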

In [None]:
heights_weights_converted = ...

In [None]:
grader.check("q_5_2")

---

Unfortunately, there are also invalid values of heights and weights, where the respondent did not report their height or weight. These missing responses are reported as `-1`:

In [None]:
# just run this cell
heights_weights.show(10)

## Question 5.3

Which respondents are missing both height and weight? Create a new table called `missing_heights_weights` that contains only the rows of the **original** `brfss` table with missing height **and** weight values (i.e., values of -1).

_Hint_: To find an exact match across two columns, use method chaining with `where`.

In [None]:
missing_heights_weights = ...
missing_heights_weights

In [None]:
grader.check("q_5_3")

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

# Part 4: Average Sleep


Finally, let's explore how to aggregate the BRFSS data to different units of analysis. In other words, we'd like to move from analyzing individual respondents to groups of individuals, often grouped by demographic. Before we do so, we need to introduce the statistical notion of **weighted averages**.

<hr style="border: 1px solid #fdb515;" />

## [tutorial] Weighted Averages

Suppose we have five numbers: `0`, `2`, `5`, `10`, `12`.

The **arithmetic average** (i.e., mean) of a set of numbers is the sum of those numbers divided by the count of those numbers:

$$\text{average} = \frac{1}{5} \left( 0 + 2 + 5 + 10 + 12 \right)$$

Another way to view the average is that each element contributes an *equal amount* to the final value. In our example, each element contributes an equal 20%:

$$\text{average} = 0.2 (0) + 0.2 (2) + 0.2 (5) + 0.2 (10) + 0.2 (12)$$

By contrast, a **weighted average** is when each element contributes a different amount (weight) to the final average, often based on its frequency of occurrence in a dataset.

For example, if our elements respectively made up 20%, 30%, 10%, 5%, and 35% of our data, then

$$\text{weighted average} = 0.2 (0) + 0.3 (2) + 0.1 (5) + 0.05 (10) + 0.35 (12)$$

where the weights are **frequencies** of occurrence expressed as proportions between 0 and 1 (and not as percentages!).
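
The two views of an average can be checked with a short sketch. The first part uses the five numbers from this tutorial with equal weights. The second part uses made-up `scores` and `shares` (hypothetical names and values) to show a weighted average, and how weights given as percentages must be divided by 100 first.

```python
# Part 1: the plain average equals a weighted average with equal weights.
values = [0, 2, 5, 10, 12]
equal_weights = [0.2] * len(values)

plain_average = sum(values) / len(values)
equal_weight_average = sum(w * v for w, v in zip(equal_weights, values))
print(round(plain_average, 2), round(equal_weight_average, 2))

# Part 2: a weighted average on hypothetical data.
scores = [70, 85, 100]
shares = [0.5, 0.3, 0.2]          # proportions; they sum to 1
weighted_score = sum(w * s for w, s in zip(shares, scores))

# The same weights written as percentages need a division by 100.
percent_shares = [50, 30, 20]
weighted_from_percents = sum(p / 100 * s for p, s in zip(percent_shares, scores))
print(round(weighted_score, 2), round(weighted_from_percents, 2))
```

Forgetting the division by 100 in the last step would give a result 100 times too large, which is one way to spot the mistake.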

---

## Question 6.1

Use array arithmetic and NumPy operations to compute the weighted average shown above. Your expression should use the arrays `elements` and `weights`, declared below. Assign your expression to the name `weighted_average`.


In [None]:
elements = make_array(0, 2, 5, 10, 12)
weights = make_array(0.20, 0.30, 0.10, 0.05, 0.35)

weighted_average = ...
weighted_average

In [None]:
grader.check("q_6_1")

<hr style="border: 1px solid #fdb515;" />

## Sleep Time Frequency Table

The table below is a **frequency table** that shows how California residents responded to the `Sleep Time` question from the BRFSS survey:

> *On average, how many hours of sleep do you get in a 24-hour period?*
    
While our previous `brfss` table had rows that represented **individual respondents**, each row in our frequency table represents a **group of individuals** with the same hours of sleep.

Run the following cell to load the `sleep` table. It contains the following columns:
1. `Sleep Time`: The number of hours slept in a 24-hour period, on average.
1. `# Respondents`: The number of survey respondents corresponding to the sleep time value
1. `% Population (Estimated)`: The **estimated population weights** corresponding to each sleep time value (discussed below)

In [None]:
# Just run this cell
sleep = Table.read_table("sleep.csv")
sleep.show(25)

In [None]:
col = sleep.column("# Respondents")
np.sum(col)

## Question 6.2: Estimated population percentages

The `# Respondents` column is computed by counting the number of respondents who reported a valid daily sleep time (in hours) between 1 and 24 (or 0 if no survey response). In addition, the BRFSS computes a data weight for each individual that estimates how common that individual's data are among the US population. These weights are likely computed from demographic data like age, gender, location, race/ethnicity, income, etc.

The `% Population (Estimated)` variable is the **estimated** percentage of the California population associated with the corresponding sleep time, computed by summing up the data weights of individuals with the same sleep time. These numbers have been precomputed for you.

Note that these `% Population (Estimated)` values can be different from the actual BRFSS survey respondent percentages for each sleep time! After all, there are just under 7000 California respondents to the 2022 BRFSS Survey, compared to the 2022 California population of over 39 million individuals.
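
To see why the two kinds of percentages can differ, here is a rough sketch with five made-up respondents. The weights are hypothetical and are assumed to already be expressed as a percent of the population; this is not how the BRFSS actually computes or stores its data weights.

```python
from collections import Counter, defaultdict

# Hypothetical respondents: (reported sleep hours, data weight).
# Each weight is assumed to already be a percent of the population.
respondents = [(7, 12.5), (8, 20.0), (7, 17.5), (6, 30.0), (8, 20.0)]

# Estimated % population: sum the data weights for each sleep time.
estimated_percent = defaultdict(float)
for hours, weight in respondents:
    estimated_percent[hours] += weight

# % respondents: count respondents for each sleep time, ignoring weights.
counts = Counter(hours for hours, _ in respondents)
respondent_percent = {
    hours: 100 * n / len(respondents) for hours, n in counts.items()
}

for hours in sorted(counts):
    print(hours, respondent_percent[hours], estimated_percent[hours])
```

In this toy data, only 20% of respondents report 6 hours, but their data weight stands for 30% of the population.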

---

Let's compare the estimated California population rates to the California respondent rates. Make a new table called `sleep_percents`. It should have the same rows and columns as `sleep`, with the addition of a new column called `% Respondents` that computes the percentage of BRFSS survey respondents that responded with the provided sleep time (or 0 if no survey response).
                                                                                                    
_Hint_: Use the `# Respondents` column. If it's useful, we've made `num_respondents` for you.

In [None]:
num_respondents = sleep.column("# Respondents")
sleep_percents = ...
sleep_percents.show(25)

In [None]:
grader.check("q_6_2")

---

## Question 6.3

How many hours of sleep is the most common across the **California population**?
Assign `most_common_sleep_time` to the most commonly reported number of hours.

_Hint_: You don't necessarily have to write code to calculate this number; just analyze the `sleep` or `sleep_percents` tables. If you prefer to code and check your work, we have left an extra cell for you as scratch work. The `tbl.sort` method would be useful here.

In [None]:
...

In [None]:
most_common_sleep_time = ...

In [None]:
grader.check("q_6_3")


<hr style="border: 1px solid #fdb515;" />

## Question 6.4 - Weighted Sleep Average

We'd like to analyze sleep at the population level by computing the **weighted average sleep time** of the California population in 2022. Instead of taking the simple average of sleep hours, we will weight each sleep time by the estimated % population for that reported sleep time.

Let's next explore how we can use these estimated population percentages to estimate the average hours of sleep for Californians.

---

## Question 6.4(a)

Assign `sleep_times` and `percent_pops` to arrays of the values in the `Sleep Time` and `% Population (Estimated)` columns, respectively, of the `sleep` table. 

_Hint_: Should we use `tbl.column` or `tbl.select`?

In [None]:
sleep_times = ...
percent_pops = ...

In [None]:
grader.check("q_6_4_a")

## Question 6.4(b)

Assign `avg_sleep` to a weighted average of the `sleep_times` array, where the weights are drawn from the `percent_pops` array.
    
**Hint**: The average sleep time should be between 0 and 24 (the number of hours in a day). If your result is not, check whether you are using percentages where you need rates!

In [None]:
avg_sleep = ...
avg_sleep

In [None]:
grader.check("q_6_4_b")

## Question 6.4(c)

As we saw in an earlier part of this homework, our `sleep` table contains individuals that did not report a sleep time (denoted by 0 hours of average sleep). The `avg_sleep` value above is therefore an underestimate of sleep time, because it incorporates this invalid 0 value into its average.

Using the `percent_pops` array, assign `percent_not_reported` to the estimated % population that would not have reported a sleep time. Do not "hardcode" this number; instead, write an expression that gets the correct element of `percent_pops`. Your expression should evaluate to a percent between 0 and 100.

_Hint_: Index into the array using `item`.

In [None]:
percent_not_reported = ...
percent_not_reported

In [None]:
grader.check("q_6_4_c")

Our homework stops here, but here is some (ungraded) food for thought: How would you recompute the weights to only take the average of nonzero sleep times? We will discuss this in a future class, hopefully...!

In [None]:
# Use this cell to answer the bonus question!

# Done!

Congratulations! You've finished your second Data 6 homework assignment!

| Category | Points |
| --- | --- |
| Autograder (Coding questions) | 24 |
| **Total** | 24 |

## Pets of Data 6
Shefran needs some help on this homework. Can you help after submitting to Gradescope?

<img src="shefran.png" width="50%" alt="Dog on leather couch"/>

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

_Note_: There is no written work to submit for this homework, so you'll only need to submit the `.zip` file to the HW02 portal on Gradescope. 

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)