In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw01.ipynb")

# Homework 1: Causality and Expressions

Please complete this notebook by filling in the cells provided. Before you begin, run the previous cell to load the provided tests.

**Recommended Readings:**

- [What is Data Science?](http://www.inferentialthinking.com/chapters/01/what-is-data-science.html)
- [Causality and Experiments](http://www.inferentialthinking.com/chapters/02/causality-and-experiments.html) 
- [Programming in Python](http://www.inferentialthinking.com/chapters/03/programming-in-python.html)

For every problem that asks for a written explanation, you **must** provide your answer in the designated space. Moreover, throughout this homework and all future ones, please do not reassign variables anywhere in the notebook! For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously!


**Note: This homework has hidden tests on it. That means even though tests may say 100% passed, it doesn't mean your final grade will be 100%. We will be running more hidden tests for correctness once everyone turns in the homework.**

Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. 

You should start early so that you have time to get help if you're stuck.

## 1. Scary Arithmetic

<!-- BEGIN QUESTION -->

An ad for ADT Security Systems says,

> "When you go on vacation, burglars go to work [...] According to FBI statistics, over 25% of home burglaries occur between Memorial Day to Labor Day."

Do the data in the ad support the claim that burglars are more likely to go to work during the time between Memorial Day and Labor Day? Please explain your answer. **(6 Points)**

**Note:** You can assume that "over 25%" means only slightly over. Had it been much over, say closer to 30%, then the marketers would have said so.

**Note:** Memorial Day is observed on the last Monday of May and Labor Day is observed on the first Monday of September.


_Type your answer here, replacing this text._

<!-- END QUESTION -->

## 2. Characters in Little Women


In lecture, we counted the number of times that the literary characters were named in each chapter of the classic book, [*Little Women*](https://inferentialthinking.com/chapters/01/3/1/Literary_Characters.html?highlight=little%20women). In computer science, the word "character" also refers to a letter, digit, space, or punctuation mark; any single element of a text. The following code generates a scatter plot in which each dot corresponds to a chapter of *Little Women*. The horizontal position of a dot measures the number of periods in the chapter. The vertical position measures the total number of characters.

In [None]:
# Just run this cell.

# This cell contains code that hasn't yet been covered in the course,
# but you should be able to interpret the scatter plot it generates.

from datascience import *
from urllib.request import urlopen
import numpy as np
%matplotlib inline

little_women_url = 'https://www.inferentialthinking.com/data/little_women.txt'
chapters = urlopen(little_women_url).read().decode().split('CHAPTER ')[1:]
text = Table().with_column('Chapters', chapters)
Table().with_columns(
    'Periods',    np.char.count(chapters, '.'),
    'Characters', text.apply(len, 0)
    ).scatter(0)
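
As a small illustration of the computer-science sense of "character" described above, the sketch below counts characters and periods in a short made-up sentence. The sentence and the variable names are hypothetical and are not part of the assignment data; the point is only that spaces and punctuation count as characters too.

```python
# A made-up sentence (hypothetical; not taken from Little Women).
sample = "Jo laughed. Meg sighed. Amy drew."

# len counts every character: letters, spaces, and punctuation.
num_characters = len(sample)

# str.count counts only the occurrences of the given substring.
num_periods = sample.count(".")

# Characters per period is the ratio of the two counts.
chars_per_period = num_characters / num_periods

print(num_characters, num_periods, chars_per_period)
```

Each dot in the scatter plot pairs the same two counts for one whole chapter.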

**Question 1.** Around how many periods are there in the chapter with the most characters? Assign either 1, 2, 3, 4, or 5 to the name `characters_q1` below. **(4 Points)**

1. 40,000
2. 32,000
3. 440
4. 390
5. 250


**Note:** If you run into a `NameError: name 'grader' is not defined` error in the autograder cell below (and in any assignment), please re-run the first cell at the very top of this notebook!


In [None]:
characters_q1 = ...

In [None]:
grader.check("q2_1")

The test above checks that your answers are in the correct format. **This test does not check that you answered correctly**, only that you assigned a number successfully in each multiple-choice answer cell.

**Question 2.** Which of the following chapters has the most characters per period? Assign either 1, 2, 3, or 4 to the name `characters_q2` below. **(4 Points)**

1. The chapter with about 460 periods
2. The chapter with about 250 periods
3. The chapter with about 60 periods
4. The chapter with about 90 periods


In [None]:
characters_q2 = ...

In [None]:
grader.check("q2_2")

Again, the test above checks that your answers are in the correct format, but not that you have answered correctly.

To discover more interesting facts from this plot, check out [Section 1.3.2](https://inferentialthinking.com/chapters/01/3/2/Another_Kind_Of_Character.html) in the textbook.

## 3. Names and Assignment Statements

**Question 1.** When you run the following cell, Python produces a cryptic error message.

In [None]:
4 = 2 + 2

Choose the best explanation of what's wrong with the code, and then assign 1, 2, 3, or 4 to `names_q1` below to indicate your answer. **(4 Points)**

1. Python is smart and already knows `4 = 2 + 2`.

2. It should be `2 + 2 = 4`.

3. In Python, the convention dictates that the "=" sign should be accompanied by a variable name on its left side, and "4" doesn't fulfill the 
   criteria of being a variable name.

4. I don't get an error message. This is a trick question.


In [None]:
names_q1 = ...

In [None]:
grader.check("q3_1")

**Question 2.** When you run the following cell, Python will produce another cryptic error message.

In [None]:
three = 4
sixteen = three times three

Choose the best explanation of what's wrong with the code and assign 1, 2, 3, or 4 to `names_q2` below to indicate your answer. **(4 Points)**

1. The `times` operation only applies to numbers, not the word "three".

2. The name "three" cannot be assigned to the number 4.

3. The name `times` isn't a built-in operator; instead, multiplication uses `*`.

4. Three times three is 9, not 16.


In [None]:
names_q2 = ...

In [None]:
grader.check("q3_2")

**Question 3.** Run the following cell.

In [None]:
x = 4
y = 12 - x
x = 5

What is `y` after running this cell, and why? Choose the best explanation and assign 1, 2, 3, or 4 to `names_q3` below to indicate your answer. **(4 Points)**

1. `y` is equal to 8, because the second `x = 5` has no effect since `x` was already defined.

2. `y` is equal to 7, because assigning `x` to 5 will update `y` to 7 since `y` was defined in terms of `x`.

3. `y` is equal to 7, because `x` is 5 and 12 - 5 is 7.

4. `y` is equal to 8, because `x` was 4 when `y` was assigned, and 12 - 4 is 8. 


In [None]:
names_q3 = ...

In [None]:
grader.check("q3_3")

## 4. Differences Between Majors

Berkeley's Office of Planning and Analysis provides data on numerous aspects of the campus. Adapted from the OPA website, the table below displays the number of degree recipients in three majors in the 2020-2021 and 2021-2022 academic years.

| Major                              | 2020-2021    | 2021-2022   |
|------------------------------------|--------------|-------------|
| Public Policy                      |      222     |    231      |
| L&S Data Science                   |      327     |    422      |
| Journalism                         |      92      |    88       |



**Question 1.** Suppose you want to find the **smallest** absolute difference between the number of degree recipients in the two years, among the three majors.

In the cell below, compute this value and call it `smallest_change`. Use a single expression (a single line of code) to compute the answer. Let Python perform all the arithmetic (like subtracting 222 from 231) rather than simplifying the expression yourself. The built-in `abs` function takes a numerical input and returns the absolute value. The built-in `min` function can take in 3 arguments and returns the minimum of the three numbers. **(5 Points)**
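
For reference, here is a minimal sketch of how the built-in `abs` and `min` functions behave. All numbers and names in it are hypothetical and unrelated to the table above; for brevity it also uses two values and several lines rather than the single expression the question asks for.

```python
# Hypothetical counts, unrelated to the degree-recipient table.
before_a, after_a = 10, 14
before_b, after_b = 30, 27

# abs turns a signed difference into a non-negative size of change.
change_a = abs(after_a - before_a)
change_b = abs(after_b - before_b)

# min accepts several arguments and returns the smallest of them.
smallest = min(change_a, change_b)

print(change_a, change_b, smallest)
```

Note that `change_b` is positive even though the hypothetical count went down, which is why absolute differences can be compared directly.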


In [None]:
smallest_change = ...
smallest_change

In [None]:
grader.check("q4_1")

**Question 2.** Which of the three majors had the **largest** absolute difference? Assign `largest_change_major` to 1, 2, or 3 where each number corresponds to the following major:

1. Public Policy  
2. L&S Data Science  
3. Journalism

Choose the number that corresponds to the major with the largest absolute difference.

You should be able to answer by rough mental arithmetic, without having to calculate the exact value for each major. **(4 Points)** 


In [None]:
largest_change_major = ...
largest_change_major

In [None]:
grader.check("q4_2")

**Question 3.** For each major, define the “relative change” to be the following: $\large{\frac{\text{absolute difference}}{\text{value in 2020-2021}} \times 100}$

Fill in the code below such that `pp_relative_change`, `data_relative_change` and `journal_relative_change` are assigned to the relative changes for their respective majors. **(5 Points)**
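
To illustrate the shape of the calculation, here is a small sketch that applies the relative-change definition to made-up numbers. The values and the names `start_value` and `end_value` are hypothetical and do not come from the table.

```python
# Hypothetical starting and ending values (not from the table).
start_value = 40
end_value = 50

# Relative change: absolute difference divided by the starting value,
# scaled by 100 so it reads as a percentage.
relative_change = (abs(end_value - start_value) / start_value) * 100

print(relative_change)
```

Because the result is divided by the starting value, a small major and a large major with the same absolute difference can have very different relative changes.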


In [None]:
pp_relative_change = (abs(...) / 222) * 100
data_relative_change = ...
journal_relative_change = ...
pp_relative_change, data_relative_change, journal_relative_change

In [None]:
grader.check("q4_3")

**Question 4.** Assign `largest_relative_change_major` to 1, 2, or 3 where each number corresponds to the following:

1. Public Policy
2. L&S Data Science
3. Journalism

Choose the number that corresponds to the major with the largest relative change. **(4 Points)**


In [None]:
largest_relative_change_major = ...
largest_relative_change_major

In [None]:
grader.check("q4_4")

## 5. Nearsightedness Study

[Myopia](https://en.wikipedia.org/wiki/Myopia), or nearsightedness, results from a number of genetic and environmental factors. In 1999, Quinn et al. studied the relation between myopia and ambient lighting at night (for example, from nightlights or room lights) during childhood.

<!-- BEGIN QUESTION -->

**Question 1.** The study found that of the children who slept with a room light on before the age of 2, 55% were myopic. Of the children who slept with a night light on before the age of 2, 34% were myopic. Of the children who slept in the dark before the age of 2, 10% were myopic. The study concluded the following: "The prevalence of myopia [...] during childhood was strongly associated with ambient light exposure during sleep at night in the first two years after birth."


Do the data support this statement? Why or why not? You may interpret "strongly" in any reasonable qualitative way. **(5 Points)**

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.** The data were gathered by the following procedure, reported in the study. "Between January and June 1998, parents of children aged 2-16 years [...] that were seen as outpatients in a university pediatric ophthalmology clinic completed a questionnaire on the child's light exposure both at present and before the age of 2 years." Was this study observational, or was it a controlled experiment? Explain. **(5 Points)**


_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.** On May 13, 1999, CNN reported the results of this study under the headline, "Night light may lead to nearsightedness." 
The final paragraph of this report said that "several eye specialists" had pointed out that the study should have accounted for heredity.

Myopia is passed down from parents to children. Myopic parents are more likely to have myopic children, and may also be more likely to leave lights on habitually (since the parents have poor vision). In what way does the knowledge of this possible genetic link affect how we interpret the data from the study? Explain. **(5 Points)**

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 4.** Does the CNN headline, "Night light may lead to nearsightedness," claim that night light causes nearsightedness? **(5 Points)**



_Type your answer here, replacing this text._

<!-- END QUESTION -->

## 6. Studying the Survivors

The Reverend Henry Whitehead was skeptical of John Snow’s conclusion about the Broad Street pump. After the Broad Street cholera epidemic ended, Whitehead set about trying to prove Snow wrong.  (The history of the event is detailed [here](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1034367/pdf/medhist00183-0026.pdf).)

He realized that Snow had focused his analysis almost entirely on those who had died. Whitehead, therefore, investigated the drinking habits of people in the Broad Street area who had not died in the outbreak.

What is the main reason it was important to study this group? Assign either 1, 2, or 3 to the name `survivor_answer` below. **(4 Points)**

1. Through considering the survivors, Whitehead could have identified a cure for cholera.

2. If Whitehead had found that many people had drunk water from the Broad Street pump and not caught cholera, that would have been evidence against Snow's hypothesis.

3. Survivors could provide additional information about what else could have caused the cholera, potentially unearthing another cause.


In [None]:
survivor_answer = ...
survivor_answer

In [None]:
grader.check("q6_1")

**Note:** Whitehead ended up finding further proof that the Broad Street pump played a central role in spreading the disease to the people who lived near it. Eventually, he became one of Snow’s greatest defenders.

You're done with Homework 1!  

**Important submission steps:** 
1. Run the tests and verify that they all pass.
2. Choose **Save Notebook** from the **File** menu, then **run the final cell**. 
3. Click the link to download the zip file.
4. Then submit the zip file to the corresponding assignment according to your instructor's directions. 

**It is your responsibility to make sure your work is saved before running the last cell.**


---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export()