The contents of this course, including lectures, labs, homework assignments, and exams, have all been adapted from the [Data 8 course at the University of California, Berkeley](https://data.berkeley.edu/education/courses/data-8). Through their generosity and passion for undergraduate education, the Data 8 community at Berkeley has opened its content and expertise for other universities to adapt in the name of undergraduate education.

In [None]:
!pip install datascience
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

# Chapter 9 Review

In [None]:
# Create a function that simulates 10000 flips of a coin.
# Record the number of heads and tails that are flipped in the simulation.
# Save the results in a data table.

# Probability Review

## Equally likely Outcomes

Assuming all outcomes are equally likely, the chance of an event A is:

P(A) = # of outcomes that make A happen / Total number of outcomes
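
As a small illustration of this formula, the sketch below counts outcomes for one roll of a fair six-sided die. The die and the event "the roll is even" are assumptions chosen for the example; they do not come from the course data.

```python
from fractions import Fraction

# All outcomes of rolling one fair six-sided die (assumed equally likely).
outcomes = [1, 2, 3, 4, 5, 6]

# Event A (chosen for illustration): the roll is even.
event_a = [face for face in outcomes if face % 2 == 0]

# P(A) = (# of outcomes that make A happen) / (total number of outcomes)
p_a = Fraction(len(event_a), len(outcomes))

print(p_a)         # 1/2
print(float(p_a))  # 0.5
```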

## Multiplication Rule

Chance that two events, A and B, both happen:

P(A and B) = P(A happens) * P(B happens given that A has happened)

* The more conditions you have to satisfy, the less likely you are to satisfy them all.
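
A minimal sketch of the multiplication rule, assuming a hypothetical urn with 3 red and 2 blue marbles and two draws without replacement. The urn is made up for illustration; only the rule itself comes from the text.

```python
from fractions import Fraction

# Hypothetical urn (an assumption for this example): 3 red, 2 blue marbles.
red, blue = 3, 2
total = red + blue

# A: the first marble is red.  B: the second marble is red.
p_first_red = Fraction(red, total)                         # P(A)
p_second_red_given_first = Fraction(red - 1, total - 1)    # P(B given A)

# Multiplication rule: P(A and B) = P(A) * P(B given A)
p_both_red = p_first_red * p_second_red_given_first

print(p_both_red)                 # 3/10
print(p_both_red < p_first_red)   # True: two conditions are harder to satisfy than one
```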

## Addition Rule

If an event A can happen in exactly one of two ways:

P(A) = P(first way) + P(second way)
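
A short sketch of the addition rule, using the chance of getting exactly one head in two tosses of a fair coin. The example event is an assumption chosen for illustration; the two ways (heads then tails, tails then heads) cannot both happen, which is what lets the chances be added.

```python
from fractions import Fraction
from itertools import product

# The four equally likely outcomes of two fair coin tosses.
tosses = list(product("HT", repeat=2))
p_each = Fraction(1, len(tosses))

p_first_way = p_each    # heads, then tails
p_second_way = p_each   # tails, then heads

# Addition rule: P(exactly one head) = P(first way) + P(second way)
p_one_head = p_first_way + p_second_way

# Cross-check by counting outcomes directly.
counted = Fraction(sum(1 for t in tosses if t.count("H") == 1), len(tosses))

print(p_one_head)             # 1/2
print(p_one_head == counted)  # True
```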

## At least one Success

The chance that something happens at least once comes from the complement rule, P(not A) = 1 - P(A):

P(at least one success) = 1 - P(no successes)

At least one head in three coin tosses?

P(not TTT) = 1 - P(TTT) = 1 - (1/2)^3 = 1 - (1/8) = 87.5%
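
The same answer can be checked by listing all eight outcomes of three tosses. This is a sketch that assumes a fair coin, so every outcome is equally likely.

```python
from itertools import product

# All 2**3 = 8 equally likely outcomes of three fair coin tosses.
outcomes = list(product("HT", repeat=3))

# Complement: the outcomes with no heads at all (only TTT).
no_heads = [o for o in outcomes if "H" not in o]
p_at_least_one_head = 1 - len(no_heads) / len(outcomes)

# Direct count of outcomes that contain at least one head.
p_direct = sum(1 for o in outcomes if "H" in o) / len(outcomes)

print(p_at_least_one_head)  # 0.875
print(p_direct)             # 0.875
```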

### HW 6/7 error
Question 2.6. Compute the chance that the monkey types the letter "t" at least once in the 5 strikes. Call it t_chance. Use algebra and type in an arithmetic equation that Python can evaluate.
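
The algebra follows the at-least-one-success rule. The sketch below assumes 62 equally likely keys, a number taken from <code>np.arange(62)</code> in the simulation cells that follow; if the homework keyboard has a different number of keys, <code>num_keys</code> must change accordingly.

```python
# Assumption: 62 equally likely keys, matching keyboard = np.arange(62)
# in the simulation cells. Adjust if the homework keyboard differs.
num_keys = 62
num_strikes = 5

# Chance that a single strike is NOT the letter "t".
p_not_t_one_strike = (num_keys - 1) / num_keys

# P(at least one "t") = 1 - P(no "t" in any of the strikes)
t_chance = 1 - p_not_t_one_strike ** num_strikes

print(t_chance)  # roughly 0.078 under the 62-key assumption
```

A simulated proportion from many repetitions should land close to this value.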

In [None]:
t_count = 0
simulate_one_type = [2, 1, 1, 2, 2]
simulate_one_type
if 1 in simulate_one_type:
    t_count += 1
t_count

In [None]:
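def simulate_key_strokes(num_simulations):
    """Estimate the chance that key 1 is hit at least once in 5 strikes."""
    keyboard = np.arange(62)  # keys are labeled 0 through 61
    t_count = 0
    for i in np.arange(num_simulations):
        # Five strikes; each key is equally likely on every strike
        simulate_one_type = np.random.choice(keyboard, 5)
        if 1 in simulate_one_type:  # key 1 is treated as the letter "t"
            t_count += 1
    return t_count/num_simulations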

In [None]:
simulate_key_strokes(1*10**6)

### What are the chances we select the Ace of Hearts and the Ace of Diamonds when drawing two cards from a deck of 52 without replacement?

P = 2 * (1/52) * (1/51) ≈ 0.075%

### What are the chances we select neither the Ace of Hearts nor the Ace of Diamonds when drawing two cards from a deck of 52 without replacement?

P = (50/52) * (49/51) = 92.4%
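
Both card questions can be checked by brute force. The sketch below assumes exactly two cards are drawn and lists every ordered two-card draw; the rank and suit labels are made up for the example.

```python
from fractions import Fraction
from itertools import permutations

# Build a standard 52-card deck (labels are illustrative).
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["Hearts", "Diamonds", "Clubs", "Spades"]
deck = [(rank, suit) for rank in ranks for suit in suits]

targets = {("A", "Hearts"), ("A", "Diamonds")}

# Every ordered draw of two cards without replacement: 52 * 51 = 2652.
draws = list(permutations(deck, 2))

both = sum(1 for d in draws if set(d) == targets)
neither = sum(1 for d in draws if not (set(d) & targets))

p_both = Fraction(both, len(draws))
p_neither = Fraction(neither, len(draws))

print(p_both)     # 1/1326
print(p_neither)  # 1225/1326
```

As decimals, these are about 0.00075 and 0.924.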

# Chapter 10: Sampling and Empirical Distributions

An important part of data science is inferring meaning from random samples.  This chapter takes a closer look at sampling and random samples.

In [None]:
top1 = Table.read_table('top_movies.csv')
top2 = top1.with_column('Row Index', np.arange(top1.num_rows))
top = top2.move_to_start('Row Index')

top.set_format(make_array(3, 4), NumberFormatter)

We can create a ***deterministic sample*** by selecting specific elements from the table (i.e. not randomly).

In [None]:
top.take([1, 10, 100])

In [None]:
top.where('Title', are.containing('Harry Potter'))

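A ***population*** is the set of all elements from which a sample can be drawn.

A ***probability sample*** is one for which the chance of selecting any subset of the population can be calculated before the sample is drawn.

A ***systematic sample*** consists of elements taken at evenly spaced positions in the population, for example every 10th row.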
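
A minimal sketch of a systematic sample, assuming a random starting position followed by every <code>step</code>-th element. The helper name <code>systematic_sample</code> and the population of 100 row indices are hypothetical, and the random start is one common convention rather than something stated above.

```python
import random

def systematic_sample(items, step, rng):
    """Pick a random start in [0, step), then take every step-th item."""
    start = rng.randrange(step)
    return items[start::step]

# Hypothetical population: row indices 0 through 99.
population = list(range(100))
rng = random.Random(0)  # seeded so the example is repeatable

sample = systematic_sample(population, 10, rng)
print(sample)

# Consecutive elements are always exactly 10 positions apart.
gaps = [b - a for a, b in zip(sample, sample[1:])]
print(set(gaps))  # {10}
```

Because the start is random, every element has the same 1 in 10 chance of selection, so this particular scheme is also a probability sample.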

## Sampling with or without Replacement

In a random sample with replacement, a value that has been drawn can be drawn again.  This is the default setting with <code>np.random.choice</code> when it samples from an array.

In a random sample without replacement, a value that has been drawn cannot be drawn again.  This is like dealing cards from a deck.
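
The two modes can be compared side by side with <code>np.random.choice</code> and its <code>replace</code> argument. The array of 52 card labels below is a stand-in invented for this sketch.

```python
import numpy as np

# Hypothetical deck: cards labeled 0 through 51.
deck = np.arange(52)

# Default behavior (replace=True): the same card may appear more than once.
with_replacement = np.random.choice(deck, 5)

# replace=False: each card can be drawn at most once, like dealing a hand.
without_replacement = np.random.choice(deck, 5, replace=False)

print(with_replacement)
print(without_replacement)

# Without replacement, the five values are always distinct.
print(len(np.unique(without_replacement)))  # 5
```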

### Sample of Convenience

* Example: sample consists of whoever walks by
* Just because you think you’re sampling “randomly”, doesn’t mean you have a random sample.
* If you can’t figure out ahead of time 
    * what’s the population
    * what’s the chance of selection, for each group in the population
    <p>then you don’t have a random sample

## Empirical Distributions

The word *empirical* means *observed*.  We will consider the empirical (observed) distributions of some data.

In [None]:
die = Table().with_column('Face', np.arange(1, 7, 1))
die

In [None]:
die_bins = np.arange(0.5, 6.6, 1)
die.hist(bins = die_bins)

In [None]:
die.sample(10)

In [None]:
def empirical_hist_die(n):
    die.sample(n).hist(bins = die_bins)

In [None]:
empirical_hist_die(6)

In [None]:
empirical_hist_die(60)

In [None]:
empirical_hist_die(600)

In [None]:
empirical_hist_die(6000)

In [None]:
empirical_hist_die(60000)

In [None]:
empirical_hist_die(600000)

### The law of averages

If a chance experiment is repeated independently and under identical conditions, then in the long run the proportion of times an event occurs gets closer and closer to the theoretical probability of the event.
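
A small simulation can make this concrete. The sketch below uses Python's standard <code>random</code> module instead of the <code>datascience</code> table, and it tracks the proportion of sixes in rolls of a fair die; the event and the sample sizes are choices made for illustration.

```python
import random

rng = random.Random(42)  # seeded so the run is repeatable
theoretical = 1 / 6      # chance of rolling a six with a fair die

for n in [60, 600, 6000, 60000]:
    rolls = [rng.randint(1, 6) for _ in range(n)]
    proportion = rolls.count(6) / n
    # The gap between observed and theoretical tends to shrink as n grows.
    print(n, round(proportion, 4), round(abs(proportion - theoretical), 4))
```

The gap does not have to shrink at every step, but large gaps become increasingly unlikely as the number of rolls grows.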

### Sampling from a Population

The table <code>united</code> contains data for United Airlines domestic flights from San Francisco in the summer of 2015.  The data are made publicly available by the [Bureau of Transportation Statistics](https://www.transtats.bts.gov/nosessionvar.asp) in the United States Department of Transportation.

In [None]:
united = Table.read_table('united.csv')
united

The <code>Delay</code> column is the departure delay in minutes.  Negative values mean those flights left early.

In [None]:
united.column('Delay').min()

In [None]:
united.column('Delay').max()

In [None]:
delay_bins = np.append(np.arange(-20, 301, 10), 600)
united.hist('Delay', bins = delay_bins, unit = 'minute')

In [None]:
united.where('Delay', are.above(200)).num_rows/united.num_rows

Only about 0.8% of the flights had delays over 200 minutes.  We will leave those out of the plot for visual convenience.

In [None]:
delay_bins = np.arange(-20, 201, 10)
united.hist('Delay', bins = delay_bins, unit = 'minute')

What percent of flights had a delay between 0 and 10 minutes (including 0 but not 10)?

In [None]:
united.where('Delay', are.between(0, 10)).num_rows/united.num_rows

In [None]:
def empirical_hist_delay(n):
    united.sample(n).hist('Delay', bins = delay_bins, unit = 'minute')

As we saw with the dice, as the sample size increases, the empirical histogram of the sample more closely resembles the histogram of the population. Compare these histograms to the population histogram above.

In [None]:
empirical_hist_delay(10)

In [None]:
empirical_hist_delay(100)

In [None]:
empirical_hist_delay(1000)

A large random sample is likely to resemble the population from which it was drawn.

* The Law of Averages implies that with high probability, the empirical distribution of a large random sample will resemble the distribution of the population from which the sample was drawn.

Frequently, we are interested in numerical quantities associated with a population.

* In a population of voters, what percent will vote for Candidate A?

* In a population of Facebook users, what is the largest number of Facebook friends that the users have?

* In a population of United flights, what is the median departure delay?

Numerical quantities associated with a population are called parameters. For the population of flights in united, we know the value of the parameter “median delay”:

In [None]:
np.median(united.column('Delay'))

In [None]:
united.where('Delay', are.below_or_equal_to(2)).num_rows / united.num_rows


About half of the United flights had delays of 2 minutes or less. Not too bad.

If we calculate the median on a sample of the data, will we always calculate the same median delay time?

In [None]:
np.median(united.sample(10).column('Delay'))

In [None]:
np.median(united.sample(100).column('Delay'))

In [None]:
np.median(united.sample(1000).column('Delay'))

### Let's simulate the median of a random sample using Python!

Step 1: Decide which statistic to simulate. We have already decided that: we are going to simulate the median of a random sample of size 1000 drawn from the population of flight delays.

Step 2: Write the code to generate one value of the statistic. Draw a random sample of size 1000 and compute the median of the sample. We did this in the code cells above; the function <code>random_sample_median</code> below encapsulates it.

Step 3: Decide how many simulated values to generate. Let’s do 5,000 repetitions.

Step 4: Write the code to generate an array of simulated values. As in all simulations, we start by creating an empty array in which we will collect our results. We will then set up a for loop for generating all the simulated values. The body of the loop will consist of generating one simulated value of the sample median, and appending it to our collection array.
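
The four steps can also be sketched on a made-up population, so the recipe runs without <code>united.csv</code>. Everything about the population below is an assumption for illustration: the values are random integers, not flight delays, and the standard <code>random</code> and <code>statistics</code> modules stand in for the <code>datascience</code> table methods.

```python
import random
import statistics

rng = random.Random(0)  # seeded so the run is repeatable

# Made-up population of 10,000 "delays" in minutes (NOT the united data).
population = [rng.randint(-10, 120) for _ in range(10_000)]

# Steps 1 and 2: the statistic is the median of a random sample of size 1000.
def random_sample_median(sample_size=1000):
    sample = rng.choices(population, k=sample_size)  # drawn with replacement
    return statistics.median(sample)

# Step 3: decide how many simulated values to generate.
repetitions = 5000

# Step 4: start with an empty collection and append one value per repetition.
medians = []
for _ in range(repetitions):
    medians.append(random_sample_median())

print(statistics.median(population))   # the parameter
print(min(medians), max(medians))      # spread of the simulated statistic
```

The simulated medians vary from sample to sample but cluster around the population median.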

In [None]:
def random_sample_median():
    return np.median(united.sample(1000).column('Delay'))

In [None]:
medians = make_array()

for i in np.arange(5000):
    medians = np.append(medians, random_sample_median())

Let's visualize the results.

In [None]:
simulated_medians = Table().with_column('Sample Median', medians)
simulated_medians

In [None]:
simulated_medians.hist(bins=np.arange(0.5, 5, 1))