# Homework 5: Estimation and Assessing Models

**Reading**: Textbook chapters [10](https://dukecs.github.io/textbook/chapters/10/sampling-and-empirical-distributions.html) and [11](https://dukecs.github.io/textbook/chapters/11/testing-hypotheses.html).

Please complete this notebook by filling in the cells provided. Before you begin, execute the following cell to load the provided tests. Each time you start your server, you will need to execute this cell again to load the tests.

Homework 5 is due Wednesday, 11/7 at 11:59pm. Start early so that you can come to office hours if you're stuck. Check the website for the office hours schedule. 

You can work in pairs on this homework. Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged.

For all problems that require you to write out explanations and sentences, you **must** provide your answer in the designated space. Moreover, throughout this homework and all future ones, please do not reassign variables within the notebook! For example, if you use `max_temperature` in your answer to one question, do not reassign it later on.

In [None]:
# Don't change this cell; just run it. 

import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

from client.api.notebook import Notebook
ok = Notebook('hw05.ok')
_ = ok.auth(inline=True)

## 1. Earthquakes


The next cell loads a table containing information about **every earthquake above a magnitude of 4.5** in 2017, compiled by the US Geological Survey. (source: https://earthquake.usgs.gov/earthquakes/search/)

In [2]:
earthquakes = Table.read_table('earthquakes_2017.csv').select(['time', 'mag', 'place'])
earthquakes

Many earthquakes of interest occurred during 2017, and in general we won't have access to a population this large. Instead, if we sample correctly, we can take a small subsample of earthquakes from this year to get an idea about the distribution of magnitudes throughout the year!
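
To make the idea of sampling from a large population concrete, here is a minimal sketch on a made-up population (not the earthquake data). It uses only Python's standard library so that it also runs outside the notebook; the names `population` and `random_sample`, the value range, and the sizes are all hypothetical choices for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 made-up measurements (not real earthquake data).
population = [random.uniform(4.5, 8.0) for _ in range(10_000)]

# A simple random sample of 500 values, drawn without replacement.
random_sample = random.sample(population, 500)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(random_sample)
print(round(population_mean, 2), round(sample_mean, 2))
```

In this toy setting the two printed means should be close, because every member of the population had the same chance of being selected.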

**Question 1.** In the following lines of code, we take two different samples from the earthquake table and calculate the mean of the magnitudes of the earthquakes in each. Are these samples representative of the population of earthquakes in the original table (that is, should we expect each sample mean to be close to the population mean)?

*Hint:* Consider the ordering of the `earthquakes` table. 

In [3]:
sample1 = earthquakes.sort('mag', descending = True).take(np.arange(100))
sample1_magnitude_mean = np.mean(sample1.column('mag'))
sample2 = earthquakes.take(np.arange(100))
sample2_magnitude_mean = np.mean(sample2.column('mag'))
[sample1_magnitude_mean, sample2_magnitude_mean]

*Write your answer here, replacing this text.*

**Question 2.** Write code producing a sample of size 500 that should represent the population, then take the mean of the magnitudes of the earthquakes in this sample. Assign these to `representative_sample` and `representative_mean`, respectively.

*Hint:* What sort of sample can properly represent the population?

In [4]:
representative_sample = ...
representative_mean = ...
representative_mean

In [5]:
_ = ok.grade('q1_2')

**Question 3.** Suppose we want to figure out what the biggest magnitude earthquake was in 2017, but we are tasked with doing this only with a sample of 500 from the earthquakes table. 

To determine whether trying to find the biggest magnitude from a sample is a plausible idea, write code that simulates the maximum of a random sample of size 500 from the `earthquakes` table 5000 times. Assign your array of maximums to `maximums`. 
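
The general pattern of repeating a sampling step and recording a statistic each time can be sketched as follows. This is only an illustration on a made-up population with a different statistic (the median); it uses the standard library rather than the `datascience` tools, and `population`, `medians`, and the sizes are hypothetical.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population of 1,000 made-up values.
population = [random.gauss(50, 10) for _ in range(1_000)]

repetitions = 200
medians = []  # start with an empty collection of statistics
for _ in range(repetitions):
    sample = random.sample(population, 50)     # draw one random sample
    medians.append(statistics.median(sample))  # record that sample's statistic

print(len(medians), min(medians), max(medians))
```

The key design point is that the collection is created once before the loop and grows by one value per repetition.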

In [7]:
maximums = ...
for i in np.arange(5000): 
    maximums = ...

In [8]:
#Histogram of your maximums
Table().with_column('Largest magnitude in sample', maximums).hist('Largest magnitude in sample') 

In [9]:
_ = ok.grade('q1_3')

**Question 4.** Is a random sample of size 500 likely to help you determine the largest magnitude earthquake in the population? Find the magnitude of the (actual) strongest earthquake in 2017 to help you determine your answer. After this, explain whether you believe you can accurately use a sample of size 500 to determine the maximum. What is a specific drawback of using the sample maximum as your estimator? Use the histogram above to help answer.

In [10]:
strongest_earthquake_magnitude = ...
strongest_earthquake_magnitude

In [11]:
_ = ok.grade('q1_4')

*Write your answer here, replacing this text.*

**Question 5.** We would like to try and accurately predict the magnitude of the largest earthquake using a sample of 500 by using a different statistic, rather than the maximum. 

Assign `valid_statistic` to either 1, 2, or 3 corresponding to the *best* option below that can be used to predict the maximum using a sample in general (not just in this specific example). 

1. The mean of a sample  
2. The mean of a sample * 2
3. The largest value - the smallest value (the range)

In [12]:
valid_statistic = ...

In [13]:
_ = ok.grade('q1_5')

**Question 6.** Just as we did before with the max, we would like to see if this new statistic is a good idea.

Simulate 5000 times the action of sampling 500 earthquakes from the `earthquakes` table, keeping track of the observed values of the statistic you chose above. Then, make a histogram of these statistics. Be sure to keep track of your statistics in the `other_statistic` variable.

In [14]:
other_statistic = ...
for i in np.arange(5000): 
    other_statistic = ...

In [15]:
#Histogram of your statistics
Table().with_column('New Statistic', other_statistic).hist('New Statistic') 

In [16]:
_ = ok.grade('q1_6')

**Question 7.** Does our new statistic look like a reasonable predictor for the maximum? Explain why or why not. 

*Hint:* Remember what exactly this table is representing. Go back up and read the description of the table. 

*Write your answer here, replacing this text.*

## 2. Assessing Gary's Models
#### Games with Gary

Our friend Gary comes over and asks us to play a game with him. The game works like this: 

> We will flip a fair coin 10 times, and if the number of heads is greater than or equal to 5, we win!
> 
> Otherwise, Gary wins.

We play the game once and we lose, observing 2 heads. We are angry and accuse Gary of cheating! Gary is adamant, however, that the coin is fair.

Gary's model claims that there is an equal chance of getting heads or tails, but we do not believe him. We believe that the coin is clearly rigged, with heads being less likely than tails. 

#### Question 1
Assign `coin_model_probabilities` to a two-item array containing the chance of heads as the first element and the chance of tails as the second element under Gary's model. Make sure your values are between 0 and 1. 

In [2]:
coin_model_probabilities = ...
coin_model_probabilities

In [3]:
_ = ok.grade('q2_1')

#### Question 2

Define the function `coin_simulation_and_statistic`, which, given a sample size and an array of model proportions (like the one you created in Q1), returns the number of heads in one simulation of flipping the coin under the model specified in `model_proportions`. 

*Hint:* Think about how you can use the function `sample_proportions`. 
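
For intuition about what a call like `sample_proportions(sample_size, probabilities)` gives back, here is a rough standard-library stand-in. It is a sketch, not the `datascience` implementation: the name `sample_proportions_sketch` is hypothetical, and it returns a plain list rather than an array.

```python
import random

def sample_proportions_sketch(sample_size, probabilities):
    """Hypothetical stand-in: draw `sample_size` items from categories with the
    given chances and return the proportion of draws that fell in each category."""
    categories = range(len(probabilities))
    draws = random.choices(categories, weights=probabilities, k=sample_size)
    return [draws.count(c) / sample_size for c in categories]

random.seed(2)  # fixed seed so the sketch is reproducible

# Three made-up categories with chances 0.2, 0.3, and 0.5.
proportions = sample_proportions_sketch(100, [0.2, 0.3, 0.5])
print(proportions)
```

The result has one entry per category, in the same order as the chances passed in, and the entries sum to 1.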

In [0]:
def coin_simulation_and_statistic(sample_size, model_proportions):
    ...

coin_simulation_and_statistic(10, coin_model_probabilities)

In [9]:
_ = ok.grade('q2_2')

#### Question 3

Use your function from above to simulate flipping a coin 10 times, repeated 5000 times, under the proportions that you specified in Question 1. Keep track of all of your statistics in `coin_statistics`.

In [None]:
coin_statistics = ...
repetitions = ...

for ... in ...: 
    ...

coin_statistics

In [None]:
_ = ok.grade('q2_3')

Let's take a look at the distribution of statistics, using a histogram. 

In [None]:
#Draw a distribution of statistics 
Table().with_column('Coin Statistics', coin_statistics).hist()

#### Question 4
Given your observed value, do you believe that Gary's model is reasonable, or is our alternative more likely? Explain your answer using the distribution drawn in the previous problem. 
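
As a cross-check on the simulated histogram, the chance under Gary's model of seeing a result at least as low as the one we observed can also be computed exactly. The sketch below assumes the 10 flips are independent and fair; the variable names are illustrative, and the number alone does not settle the question, which still calls for your own interpretation.

```python
from math import comb

flips = 10
observed_heads = 2

# Under a fair coin, the chance of exactly k heads in 10 flips is C(10, k) / 2**10.
# Summing over k = 0, 1, 2 gives the chance of 2 or fewer heads.
chance_at_most_observed = sum(
    comb(flips, k) for k in range(observed_heads + 1)
) / 2**flips

print(chance_at_most_observed)  # 0.0546875, i.e. 56 / 1024
```

The proportion of your simulated statistics that are 2 or lower should land near this value.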

*Write your answer here, replacing this text.*

## 3. Submission


Once you're finished, select "Save and Checkpoint" in the File menu and then execute the `submit` cell below. The result will contain a link that you can use to check that your assignment has been submitted successfully. If you submit more than once before the deadline, we will only grade your final submission. If you mistakenly submit the wrong one, you can head to [okpy.org](https://okpy.org/) and flag the correct version. To do so, go to the website, click on this assignment, and find the version you would like to be graded. There should be an option to flag that submission for grading!

In [None]:
_ = ok.submit()