# Data 80A/180A Data Science for Everyone

# Lab 8: Sampling and Simulations

#### Today's lab

Welcome to Lab 8! In today's lab, you'll learn about sampling and continue practicing simulation.

Reading:

* [9.3 Simulations](https://www.inferentialthinking.com/chapters/09/3/Simulation.html)
* [10. Sampling](https://inferentialthinking.com/chapters/10/Sampling_and_Empirical_Distributions.html)

First, set up the tests and imports by running the cell below.

In [None]:
# Run this cell, but please don't change it.

# These lines import the NumPy and datascience modules.
import numpy as np
from datascience import *

# These lines do some fancy plotting magic
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

import otter
grader = otter.Notebook()

## 1. Sampling Basketball Data

Run the cell below to load player and salary data. We join the two tables to form the `full_data` table from which we will sample.  

In [None]:
# read_table is a class method, so it can be called on Table directly.
player_data = Table.read_table("player_data.csv")
salary_data = Table.read_table("salary_data.csv")
full_data = salary_data.join("PlayerName", player_data, "Name")

# The show method immediately displays the contents of a table. 
player_data.show(3)
salary_data.show(3)
full_data.show(3)

Rather than getting data on every player (as in the tables loaded above), imagine that we had gotten data on only a smaller subset of the players.  

If we want to estimate a numerical property of the population (known as a parameter, e.g. the population mean or median), we may have to base that estimate only on a smaller sample. The corresponding quantity computed from the sample is called a statistic. Whether these estimates are useful or not often depends on how the sample was gathered. We have prepared some example sample datasets to see how they compare to the full NBA dataset. Later we'll ask you to create your own samples to see how they behave.
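
To make the idea of estimating from a sample concrete, here is a minimal sketch using a made-up population of 500 ages. The numbers and the names `population_ages` and `sample_ages` are hypothetical and are not taken from the NBA tables. The mean of the whole population is the quantity we want to know, and the mean of a random sample is our estimate of it.

```python
import numpy as np

# Hypothetical population: 500 made-up ages from 19 to 40 (not the NBA data).
population_ages = np.random.randint(19, 41, size=500)

# Computed from the entire population: the quantity we want to estimate.
population_mean = np.mean(population_ages)

# Computed from a random sample of 50 drawn without replacement: the estimate.
sample_ages = np.random.choice(population_ages, 50, replace=False)
sample_mean = np.mean(sample_ages)

print("Population mean:", population_mean)
print("Sample mean:    ", sample_mean)
```

Running the sketch several times should give a slightly different sample mean each time, while the population mean for a given population stays fixed.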

To save typing and increase the clarity of your code, we will package the analysis code into a few functions. This will be useful in the rest of the lab as we will repeatedly need to create histograms and collect summary statistics from that data.

We've defined the `histograms` function below, which takes a table with columns `Age` and `Salary` and draws a histogram for each one. It uses bin widths of 1 year for `Age` and $1,000,000 for `Salary`.

In [None]:
def histograms(t):
    ages = t.column('Age')
    salaries = t.column('Salary')/1000000
    t1 = t.drop('Salary').with_column('Salary', salaries)
    age_bins = np.arange(min(ages), max(ages) + 2, 1) 
    salary_bins = np.arange(min(salaries), max(salaries) + 1, 1)
    t1.hist('Age', bins=age_bins, unit='year')
    plt.title('Age distribution')
    t1.hist('Salary', bins=salary_bins, unit='million dollars')
    plt.title('Salary distribution') 
    
histograms(full_data)
print('Two histograms should be displayed below')
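
The bin edges in `histograms` come from `np.arange`, which excludes its stop value. The sketch below uses a tiny hypothetical array of ages to show why the age bins stop at `max(ages) + 2`. It uses `np.histogram` to count values per bin, which is an assumption made for illustration and not something the lab itself calls.

```python
import numpy as np

# Hypothetical ages, chosen only to illustrate the bin edges.
ages = np.array([19, 22, 22, 40])

# np.arange excludes its stop value, so stopping at max + 2 makes the
# last edge max + 1, and the final bin [40, 41] contains the oldest age.
age_bins = np.arange(min(ages), max(ages) + 2, 1)

# np.histogram counts how many values fall in each bin.
counts, edges = np.histogram(ages, bins=age_bins)

print("First edge:", edges[0])
print("Last edge: ", edges[-1])
print("Total counted:", counts.sum())
```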

**Question 1.1**. Create a function called `compute_statistics` that takes a table containing ages and salaries and:
- Draws a histogram of ages
- Draws a histogram of salaries
- Returns a two-element array containing the average age and average salary (in that order)

You can call the `histograms` function to draw the histograms! 

In [None]:

def compute_statistics(age_and_salary_data):
    histograms(...)   # call the function histograms to plot Ages and Salaries
    age = ...
    salary = ...
    return ...        # return an array with 2 values: the mean age and mean salary

full_stats = compute_statistics(full_data)
full_stats

In [None]:
grader.check("q11")

### Simple random sampling

In a **simple random sample (SRS) without replacement**, we ensure that when we randomly pick players, each player is selected at most once. Imagine writing down each player's name on a card, putting the cards in a box, and shuffling the box. Then, pull out cards one by one and set them aside, stopping when the specified sample size is reached.
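
The card analogy can be sketched directly in NumPy. This is only an illustration with hypothetical player names. It is not how the `datascience` library draws its samples.

```python
import numpy as np

# Hypothetical "cards": one per player, named only for illustration.
cards = np.array(["Player A", "Player B", "Player C",
                  "Player D", "Player E", "Player F"])
sample_size = 3

# Shuffle the box of cards, then pull cards off the top until
# the sample size is reached. No card can be pulled twice.
shuffled = np.random.permutation(cards)
srs = shuffled[:sample_size]

print(srs)
```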

### Producing simple random samples
Sometimes, it’s useful to take random samples even when we have the data for the whole population. It helps us understand sampling accuracy.

### `sample`

The table method `sample` produces a random sample from the table. By default, it draws at random **with replacement** from the rows of a table. It takes in the sample size as its argument and returns a **table** with only the rows that were selected. 

Run the cell below to see an example call to `sample()` with a sample size of 5, with replacement.

In [None]:
# Just run this cell

salary_data.sample(5)

The optional argument `with_replacement=False` can be passed to `sample()` to specify that the sample should be drawn **without replacement**.

Run the cell below to see an example call to `sample()` with a sample size of 5, without replacement.

In [None]:
# Just run this cell

salary_data.sample(5, with_replacement=False)
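
The difference between the two modes can also be seen with plain NumPy. The sketch below is an analogy, not the `sample` method itself. It draws from hypothetical row labels 0 through 4 using `np.random.choice`, whose `replace` argument plays the same role as `with_replacement`.

```python
import numpy as np

# Hypothetical row labels 0..4, standing in for the rows of a 5-row table.
rows = np.arange(5)

# With replacement: a row may be drawn more than once, so the sample
# can even be larger than the population.
with_repl = np.random.choice(rows, 8, replace=True)

# Without replacement: every drawn row is distinct.
without_repl = np.random.choice(rows, 5, replace=False)

print("With replacement:   ", with_repl)
print("Without replacement:", without_repl)

# Asking NumPy for more distinct rows than exist raises a ValueError.
try:
    np.random.choice(rows, 8, replace=False)
except ValueError as err:
    print("Error:", err)
```

Because 8 draws are taken from only 5 rows, the sample drawn with replacement must contain at least one repeated row.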

**Question 1.2.** Produce a simple random sample of size 44 from `full_data`. Run your analysis on it again.  Run the cell a few times to see how the histograms and statistics change across different samples.

*Note:* Unless specified otherwise, a simple random sample is drawn **without replacement**.

- How much does the average age change across samples? 
- What about average salary?

In [None]:
my_small_data = ...
my_small_stats = compute_statistics(...) 

my_small_stats

*Write your answer here, replacing this text.*

In [None]:
grader.check("q12")

**Question 1.3.** As in the previous question, analyze several simple random samples of size 100 from `full_data` by using the `compute_statistics` function.

- Do the histogram shapes seem to change more or less across samples of size 100 than across samples of size 44?
- Are the sample averages and histograms closer to their true values/shape for age or for salary?  What did you expect to see?

In [None]:

my_large_data = ...
my_large_stats = compute_statistics(...)


# Solution
my_large_data = full_data.sample(100, with_replacement=False) 
my_large_stats = compute_statistics(my_large_data)

my_large_stats

*Write your answer here, replacing this text.*

In [None]:
grader.check("q13")

## 2. More Practice with Simulations

We can use `np.random.choice` to simulate multiple trials.
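
As a warm-up, here is a minimal sketch of `np.random.choice` applied to a different experiment: 10 flips of a fair coin. The names `coin`, `flips`, and `num_heads` are chosen for illustration only.

```python
import numpy as np

# The possible outcomes of a single trial.
coin = np.array(["Heads", "Tails"])

# Simulate 10 independent flips; each outcome is equally likely by default.
flips = np.random.choice(coin, 10)

# Count how many flips came up heads.
num_heads = np.count_nonzero(flips == "Heads")

print(flips)
print("Number of heads:", num_heads)
```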

**Question 2.1.** Stephanie decides to spend her Friday night rolling a standard six-sided die. She wants to know what her total score would be if she rolled the die 100 times. Write code that simulates her total score after 100 rolls. 

*Hint:* First decide which values a single roll can produce (the point values in this case). Then use `np.random.choice` to simulate Stephanie’s rolls. Finally, sum up the rolls to get Stephanie's total score.

In [None]:
possible_point_values = ...
simulated_tosses = ...
total_score = ...
total_score

In [None]:
grader.check("q21")

**Question 2.2.** Write a function called `one_hundred_rolls` to simulate Stephanie's game of rolling a die 100 times and totaling up the scores.

In [None]:
def one_hundred_rolls():
    ...

In [None]:
grader.check("q22")

Run the following cell a few times to see the total score of 100 rolls.

In [None]:
one_hundred_rolls()

**Question 2.3.** Simulate Stephanie playing her `one_hundred_rolls` game 10000 times. Make a histogram of the distribution of total scores (code provided).

In [None]:
total_scores = make_array()
...

# histogram of the total_scores
Table().with_column('Total score of 100 rolls', total_scores).hist('Total score of 100 rolls')

In [None]:
grader.check("q23")
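
The repeat-and-collect pattern used in Question 2.3 applies to any simulation. Here is a minimal sketch on a different experiment: counting heads in 10 coin flips, repeated 1000 times. The function `heads_in_ten_flips` is hypothetical, and `np.array([])` is used here as a stand-in for the empty array that `make_array()` creates.

```python
import numpy as np

def heads_in_ten_flips():
    """Simulate 10 fair coin flips and return the number of heads."""
    flips = np.random.choice(np.array(["Heads", "Tails"]), 10)
    return np.count_nonzero(flips == "Heads")

# Repeat the experiment and collect each result in an array.
repetitions = 1000
results = np.array([])
for _ in np.arange(repetitions):
    results = np.append(results, heads_in_ten_flips())

print("Number of results collected:", len(results))
print("Average number of heads:", np.mean(results))
```

The three ingredients are a function that runs the experiment once, an empty array to hold the results, and a loop that appends one result per repetition.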

Achilles the turtle sits on the number line. Achilles loves long random walks that last a total of 100 time steps. At each time step, Achilles moves based on the following scheme: He flips a coin and moves one step to the right if the coin comes up heads or one step to the left if the coin comes up tails. (from Midterm Review Worksheet)
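
To picture the scheme, here is a minimal sketch of a much shorter walk of 10 steps. Encoding a step to the right as `+1` and a step to the left as `-1` is an assumption made for illustration, as are the names `steps` and `positions`.

```python
import numpy as np

# Each step is +1 (heads, move right) or -1 (tails, move left).
steps = np.random.choice(np.array([1, -1]), 10)

# The running total of the steps gives the position after each step.
positions = np.cumsum(steps)

print("Steps:    ", steps)
print("Positions:", positions)
print("Final position:", positions[-1])
```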

**Question 2.4.** Assuming that Achilles’ coin is fair, write a function called `one_walk` that simulates one random walk of 100 time steps and returns how far from the origin Achilles ends up at the end of his walk. This distance should be a non-negative number, since Achilles can finish exactly at the origin. You may assume that Achilles always starts from the origin.

In [None]:
def one_walk():
    ...


In [None]:
grader.check("q24")

Run the following cell a few times to see the distance of one random walk.

In [None]:
one_walk()

**Question 2.5.** Assuming that Achilles’ coin is fair, we would like to simulate what would happen if Achilles took 10000 different random walks. Complete the simulation below and keep track of how far Achilles ends up from the origin in each of his walks in an array called `distances`. Make a histogram of the distribution of `distances`.

A sample histogram looks like:
<img src="Q2.5_graph.PNG" width="400px">

In [None]:
# this code may take some time to run
distances = make_array()
...
    
# histogram of the distances
...

In [None]:
grader.check("q25")

**Question 2.6.** Achilles goes for a walk and claims that at the end of his walk, he ended up 35 steps away from the origin, at a spot that happens to be the location of an ice cream store. Does it seem to you that Achilles took a **random** walk? What question(s) would you ask Achilles?

*Write your answer here, replacing this text.*

Congratulations, you're done with Lab 8! 
Be sure to:

   * run all the tests,
   * save your notebook and download a pdf version of it,
   * submit your work to Canvas,
   * and ask a lab instructor to check you off.