# Simulation

In this lesson, we will work through several examples of using random numbers to simulate real-world scenarios.

In [209]:
%matplotlib inline
import numpy as np
import pandas as pd

# curriculum example visualizations
# (uncomment if the `viz` helper module is available; the plotting cells below call it)
# import viz
np.random.seed(1349)

### How will we utilize Python to obtain probabilities?

We will utilize Monte Carlo simulations.

A Monte Carlo simulation recreates potential events many times and uses the empirical results of the simulated trials to obtain a reasonably precise estimate of a desired probability.

What does this mean for us here?

In [None]:
# Let's take a hypothetical base probability. 
# What is the probability of rolling a one (1) on a single, standard, fair six-sided die?


In [2]:
# Potential outcomes of a die roll:
possible_outcomes = [1,2,3,4,5,6]

In [None]:
# options that equal 1: just 1, literally one

In [3]:
ideal_roll = 1

In [4]:
theoretical_prob = ideal_roll / len(possible_outcomes)
theoretical_prob

0.16666666666666666

In [None]:
# Now how would we do this with a simulation?


In [None]:
# We will run a large number of simulated trials and calculate the fraction in which the event occurs.

In [None]:
# Let's examine the same problem: the probability of rolling a 1 on a fair six-sided die.

In [16]:
# First, we will set a value for the number of trials that we want to conduct.
# We have the power of computation at our fingertips, so let's shoot for something like ten million.

num_trials = 10 ** 7


In [17]:
# Each trial is a single die roll; that one roll is the event we simulate
n_dice = 1

In [18]:
# We will do a single simulation ten million times, with each simulation being a die roll.

In [19]:
rolls = np.random.choice(possible_outcomes, num_trials*n_dice).reshape(num_trials, n_dice)

In [20]:
# let's make our simulations!

In [21]:
type(rolls)

numpy.ndarray

In [22]:
(rolls == 1).mean()

0.1666628
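
To see how the estimate tightens as the number of trials grows, here is a minimal sketch. It assumes a locally seeded generator (`np.random.default_rng`) rather than the global seed set at the top of the lesson, and the names `rng`, `faces`, `sample`, and `estimates` are illustrative only.

```python
import numpy as np

# Assumption: a locally seeded generator, so the notebook's global seed is left alone.
rng = np.random.default_rng(42)
faces = [1, 2, 3, 4, 5, 6]
theoretical_prob = 1 / 6

estimates = {}
for n in (100, 10_000, 1_000_000):
    sample = rng.choice(faces, size=n)      # n simulated die rolls
    estimates[n] = (sample == 1).mean()     # share of rolls that came up 1
    error = abs(estimates[n] - theoretical_prob)
    print(f"{n:>9,} trials: estimate = {estimates[n]:.4f}, error = {error:.4f}")
```

With only 100 trials the estimate can miss by a few percentage points; with a million it typically agrees with 1/6 to about three decimal places.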

## Generating Random Numbers with Numpy

The `numpy.random` module provides a number of functions for generating random numbers.

- `np.random.choice`: selects random options from a list
- `np.random.uniform`: generates numbers between a given lower and upper bound
- `np.random.random`: generates numbers between 0 and 1
- `np.random.randn`: generates numbers from the standard normal distribution
- `np.random.normal`: generates numbers from a normal distribution with a specified mean and standard deviation
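
As a quick illustration of the functions listed above, the sketch below calls each one once. The argument values (bounds of 10 and 20, a mean of 178 with a standard deviation of 8, five draws each) and the variable names are arbitrary choices for this example. Note that these calls draw from NumPy's global random stream, so running them in the middle of the lesson would shift the numbers produced by later cells.

```python
import numpy as np

picks = np.random.choice([1, 2, 3, 4, 5, 6], size=5)   # five random selections from a list
between = np.random.uniform(10, 20, size=5)            # five floats in [10, 20)
unit = np.random.random(5)                             # five floats in [0, 1)
std_normal = np.random.randn(5)                        # five draws, mean 0 and sd 1
heights = np.random.normal(178, 8, size=5)             # five draws, mean 178 and sd 8

for name, values in [("choice", picks), ("uniform", between), ("random", unit),
                     ("randn", std_normal), ("normal", heights)]:
    print(f"{name:>8}: {values}")
```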

## Example Problems

### Carnival Dice Rolls

> You are at a carnival and come across a person in a booth offering you a game
> of "chance" (as people in booths at carnivals tend to do).

> You pay 5 dollars and roll 3 dice. If the sum of the dice rolls is greater
> than 12, you get 15 dollars. If it's less than or equal to 12, you get
> nothing.

> Assuming the dice are fair, should you play this game? How would this change
> if the winning condition was a sum greater than *or equal to* 12?

To simulate this problem, we'll write the Python code to simulate the scenario described above, then repeat it a large number of times.

One way we can keep track of all the simulations is to use a 2-dimensional matrix. We can create a matrix where each row represents one "trial". Each row will have 3 columns, representing the 3 dice rolls.

In [25]:
n_trials = nrows = 10_000
n_dice = ncols = 3

rolls = np.random.choice(possible_outcomes, n_trials * n_dice).reshape(nrows, ncols)
rolls

array([[1, 1, 4],
       [4, 6, 5],
       [3, 6, 1],
       ...,
       [6, 1, 4],
       [2, 6, 2],
       [2, 1, 2]])

Here we used the `choice` function to randomly select elements from the list of the numbers 1 through 6, effectively simulating die rolls. The second argument supplied to `choice` is the total number of dice to roll. Once we have generated all the dice rolls, we use the `.reshape` method to create our matrix with 10,000 rows and 3 columns.

Now that we have all of the simulated dice rolls, we want to get the sum of the dice rolls for each trial. To do this, we can use the `.sum` method and specify that we want the sum of every row (as opposed to the sum of all the numbers, or the sum by column) with the `axis` keyword argument.

In [26]:
sums_by_trial = rolls.sum(axis=1)
sums_by_trial

array([ 6, 15, 10, ..., 11, 10,  5])
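
The effect of the `axis` argument is easier to see on a tiny array. The sketch below uses a hypothetical 2 x 3 array named `small`, built from the first two trials displayed earlier, and compares the three ways of summing it.

```python
import numpy as np

# Two trials of three dice each (values copied from the first two rows shown above)
small = np.array([[1, 1, 4],
                  [4, 6, 5]])

row_sums = small.sum(axis=1)   # one sum per trial (row)
col_sums = small.sum(axis=0)   # one sum per die position (column)
total = small.sum()            # a single sum over every element

print(row_sums)   # [ 6 15]
print(col_sums)   # [5 7 9]
print(total)      # 21
```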

Let's pause here for a minute and visualize the data we have:

In [None]:
viz.simulation_example1(sums_by_trial)

The area shaded in light blue represents our chance of winning, that is, the share of trials in which the sum of 3 dice rolls is greater than 12.

We can now convert each value in our array to a boolean value indicating whether or not we won:

In [29]:
wins = sums_by_trial > 12

To calculate an overall win rate, we can treat each win as a `1` and each loss as `0`, then take the average of the array:

In [31]:
win_rate = wins.mean()

In [32]:
win_rate

0.2639

Now that we know our win rate, we can calculate the expected profit:

In [34]:
expected_winnings = win_rate * 15
cost = 5
expected_profit = expected_winnings - cost
expected_profit

-1.0414999999999996

So we would expect, based on our simulations, to lose a little over a dollar, on average, every time we play this game.

To answer the last part of the question, we can recalculate our win rate based on the sums being greater than or equal to 12:

In [35]:
wins = sums_by_trial >= 12
win_rate = wins.mean()
expected_winnings = win_rate * 15
cost = 5
expected_profit = expected_winnings - cost
expected_profit

0.6129999999999995

If our win condition changes to the sum being greater than or equal to 12, then, based on our simulations, on average, we expect to win about 61 cents per game.
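
Because three dice have only 216 equally likely outcomes, the simulated figures can be checked by enumerating every outcome. This is a sketch of that check rather than part of the original lesson; `itertools.product` is from the standard library, and the names `sums`, `p_gt_12`, and `p_ge_12` are illustrative.

```python
from itertools import product

# Every possible (die1, die2, die3) outcome, reduced to its sum: 6**3 = 216 entries
sums = [sum(dice) for dice in product(range(1, 7), repeat=3)]

p_gt_12 = sum(s > 12 for s in sums) / len(sums)    # 56 / 216
p_ge_12 = sum(s >= 12 for s in sums) / len(sums)   # 81 / 216

payout, cost = 15, 5
profit_gt_12 = p_gt_12 * payout - cost
profit_ge_12 = p_ge_12 * payout - cost

print(f"sum >  12: P(win) = {p_gt_12:.4f}, expected profit = {profit_gt_12:.4f}")
print(f"sum >= 12: P(win) = {p_ge_12:.4f}, expected profit = {profit_ge_12:.4f}")
```

The exact expected profits are about -1.11 and 0.625 dollars, so the simulated values of roughly -1.04 and 0.61 are within ordinary sampling noise for 10,000 trials.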

## Mini Exercise:

What is the probability of rolling "snake eyes" on a roll of two (fair) dice?

In [49]:
n_trials = n_rows = 10_000
n_dice = n_cols = 2

rolls = np.random.choice(possible_outcomes, n_trials * n_dice).reshape(n_rows, n_cols)
rolls

array([[5, 6],
       [4, 6],
       [2, 4],
       ...,
       [4, 3],
       [4, 3],
       [4, 6]])

In [51]:
sums_by_trial = rolls.sum(axis=1)
sums_by_trial

array([11, 10,  6, ...,  7,  7, 10])

In [52]:
wins = sums_by_trial == 2
wins

array([False, False, False, ..., False, False, False])

In [53]:
win_rate = wins.mean()
win_rate

0.0279
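
Checking that the sum equals 2 works because only (1, 1) produces that sum. A more direct alternative, sketched below under the assumption of a locally seeded generator and illustrative names, is to test whether every die in a row shows a 1.

```python
import numpy as np

rng = np.random.default_rng(0)                       # local generator (assumption of this sketch)
demo_rolls = rng.integers(1, 7, size=(100_000, 2))   # upper bound 7 is exclusive, so faces 1-6

# True for a row only when both dice show a 1
snake_eyes_rate = (demo_rolls == 1).all(axis=1).mean()

print(f"simulated: {snake_eyes_rate:.4f}, exact: {1 / 36:.4f}")
```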

### No Rest or Relaxation

> There's a 30% chance my son takes a nap on any given weekend day. What is the chance that he takes a nap at least one day this weekend? What is the probability that he doesn't nap at all?

Let's first do a little bit of setup:

In [46]:
p_nap = 0.3
ndays = n_cols = 2
n_simulated_weekends = n_rows = 10_000

To simulate the results from many weekends, we'll create a 10,000 x 2 matrix: 10,000 rows, one per simulated weekend, and 2 columns, one per weekend day.

To determine whether or not a nap is taken on a given day, we'll generate a random number between 0 and 1, and say that it is a nap if it is less than our probability of taking a nap.

In [54]:
naps = np.random.random((n_rows, n_cols))

In [55]:
naps[:10]

array([[0.23382281, 0.3297641 ],
       [0.11605216, 0.78718142],
       [0.47561055, 0.80916677],
       [0.61620839, 0.63293275],
       [0.72558482, 0.71991363],
       [0.73446911, 0.29487962],
       [0.32333043, 0.38173145],
       [0.66978573, 0.12105864],
       [0.98754934, 0.4552965 ],
       [0.57135483, 0.37211968]])

In [56]:
naps = naps < p_nap

In [57]:
naps

array([[ True, False],
       [ True, False],
       [False, False],
       ...,
       [ True,  True],
       [False, False],
       [False, False]])

Now that we have each day as either true or false, we can take the sum of each row to find the total number of naps for the weekend. When we sum an array of boolean values, numpy will treat `True` as 1 and `False` as 0.

In [58]:
naps.sum(axis=1)

array([1, 1, 0, ..., 2, 0, 0])

Now we have the results of our simulation, an array where each number in the array represents how many naps were taken in a two day weekend.

In [None]:
viz.simulation_example2(naps)

We can use this to answer our original questions. What is the probability that at least one nap is taken?

In [59]:
(naps.sum(axis=1) > 0).mean()

0.5171

What is the probability no naps are taken?

In [60]:
(naps.sum(axis=1) == 0).mean()

0.4829

In [61]:
# What is the probability that a nap is taken on both days?
(naps.sum(axis=1) > 1).mean()

0.0867
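
Since the two days are treated as independent, these simulated values can be compared with a direct calculation. The sketch below assumes that independence; the variable names are illustrative.

```python
p_nap = 0.3
n_days = 2

p_no_naps = (1 - p_nap) ** n_days   # no nap on either day, about 0.49
p_at_least_one = 1 - p_no_naps      # complement of "no naps", about 0.51
p_both_days = p_nap ** n_days       # a nap on both days, about 0.09

print(round(p_at_least_one, 4), round(p_no_naps, 4), round(p_both_days, 4))
```

The simulated 0.5171, 0.4829, and 0.0867 all land close to these values.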

## Mini Exercise:

There are ten options in a blind-box style collectable, but the one you want most is a little rarer: you only get it in about one out of every twenty boxes.

What is the probability of getting your desired collectable if you buy three blind-box toys?

In [63]:
p_fav = 0.05
nplays = n_cols = 3
n_trials = n_rows = 10_000

In [65]:
tests = np.random.random((n_rows, n_cols))
tests

array([[0.22596562, 0.31973196, 0.86943127],
       [0.17174395, 0.54952612, 0.28639282],
       [0.42943705, 0.01793972, 0.36891534],
       ...,
       [0.06169977, 0.61223719, 0.41337371],
       [0.41772704, 0.1214792 , 0.44829514],
       [0.67195496, 0.27106252, 0.95803998]])

In [67]:
success = tests < p_fav
success

array([[False, False, False],
       [False, False, False],
       [False,  True, False],
       ...,
       [False, False, False],
       [False, False, False],
       [False, False, False]])

In [68]:
(success.sum(axis=1) > 0).mean()

0.143

### One With Dataframes

Let's take a look at one more problem:

> What is the probability of getting at least one 3 in 3 dice rolls?

To simulate this, we'll use a similar strategy to how we modeled the dice rolls in the previous example, but this time, we'll store the results in a pandas dataframe so that we can apply a lambda function that will check to see if one of the rolls was a 3.

In [69]:
n_trials = nrows = 10 ** 6
n_dice_rolled = ncols = 3

rolls = np.random.choice(possible_outcomes, n_trials * n_dice_rolled).reshape(nrows, ncols)

In [70]:
pd.DataFrame(rolls).apply(lambda row: 3 in row.values, axis=1)

0         False
1         False
2          True
3          True
4         False
          ...  
999995     True
999996    False
999997     True
999998     True
999999     True
Length: 1000000, dtype: bool

In [71]:
pd.DataFrame(rolls).apply(lambda row: 3 in row.values, axis=1).mean()

0.421488

Let's break down what's going on here:

1. First we assign values for the number of rows and columns we are going to use
1. Next we create the `rolls` variable that holds a 1,000,000 x 3 matrix (one row per trial, one column per die) where each element is a randomly chosen number from 1 to 6
1. Lastly we create a dataframe from the rolls
    1. `pd.DataFrame(rolls)` converts our 2d numpy matrix to a pandas DataFrame
    1. `.apply(...)` applies a function to each **row** in our dataframe. Because we specified `axis=1`, the function will be called with each row as its argument. The body of the function checks to see if the value `3` is in the values of the row, and will return either `True` or `False`
    1. `.mean()` takes our resulting series of boolean values, and treats `True` as 1 and `False` as 0, to give us the average rate of `True`s, in this case, the simulated probability of getting a 3 in 3 dice rolls.
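
The same question can also be answered without a dataframe. The sketch below is an alternative to the `apply` approach, not a replacement the lesson prescribes: it compares the whole array to 3 at once and uses `.any(axis=1)` to ask whether any die in a row matched. The generator, seed, and trial count are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(1)                       # local generator (assumption of this sketch)
demo_rolls = rng.integers(1, 7, size=(100_000, 3))   # 100,000 trials of 3 dice

has_three = (demo_rolls == 3).any(axis=1)   # True where at least one die in the row is a 3
simulated = has_three.mean()
exact = 1 - (5 / 6) ** 3                    # complement of "no 3 on any of the 3 dice"

print(f"simulated: {simulated:.4f}, exact: {exact:.4f}")
```

Vectorized comparisons like this usually run far faster than a row-by-row `apply`, which matters once the number of trials reaches the millions.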

## Mini Exercise:

Recreate the blindbox problem utilizing the above strategy!


In [72]:
p_fav = 0.05
nplays = n_cols = 3
n_trials = n_rows = 10_000

In [73]:
tests = np.random.random((n_rows, n_cols))
tests

array([[0.45391365, 0.83744253, 0.84064129],
       [0.78509321, 0.15537387, 0.22148265],
       [0.6611318 , 0.07268316, 0.75565033],
       ...,
       [0.31063658, 0.0877315 , 0.36093516],
       [0.42984108, 0.50983725, 0.70044963],
       [0.94927515, 0.16843756, 0.04954599]])

In [108]:
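# Note: applying an elementwise comparison and then calling .mean() averages each column,
# so this gives the per-box success rate, not the chance of at least one success in three boxes.
pd.DataFrame(tests).apply(lambda x: x<p_fav, axis=1).mean()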

0    0.0522
1    0.0506
2    0.0509
dtype: float64
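
To get the probability of at least one success per trial with the dataframe strategy, the function passed to `apply` needs to reduce each row to a single boolean. A minimal sketch follows, assuming a locally seeded generator and the illustrative names `demo_tests` and `got_fav`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)   # local generator (assumption of this sketch)
p_fav = 0.05

demo_tests = pd.DataFrame(rng.random((10_000, 3)))   # 10,000 trials of 3 boxes

# Each row collapses to one True/False: did any of the three boxes succeed?
got_fav = demo_tests.apply(lambda row: (row < p_fav).any(), axis=1)
row_wise_rate = got_fav.mean()

# Equivalent vectorized form, without apply
vectorized_rate = (demo_tests < p_fav).any(axis=1).mean()

exact = 1 - (1 - p_fav) ** 3     # about 0.1426
print(row_wise_rate, vectorized_rate, round(exact, 4))
```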

In [88]:
# different method...
n_rows = 10_000
n_cols = 3
outcomes = [1,2,3,4,5,6,7,8,9,10]
prob_win = 0.05
prob_others = (1-prob_win)/9

In [92]:
data = np.random.choice(outcomes, n_rows * n_cols, p=[prob_win] + [prob_others] * 9).reshape(n_rows, n_cols)

In [93]:
data[:5]

array([[ 1, 10, 10],
       [ 3, 10,  8],
       [ 3,  2, 10],
       [ 7,  6,  3],
       [10,  3, 10]])

In [94]:
pd.DataFrame(data).apply(lambda row: 1 in row.values, axis=1).mean()

0.1418

## Exercises

Within your `codeup-data-science` directory, create a directory named `statistics-exercises`. This will be where you do your work for this module. Create a repository on GitHub with the same name, and link your local repository to GitHub.

Do your work for this exercise in either a Python file named `simulation.py` or a Jupyter notebook named `simulation.ipynb`.

1. How likely is it that you roll doubles when rolling two dice?

In [109]:
# first I am going to solve the problem mathematically, so I have a baseline and can get my bearings...
(1/6)*(1/6)*6

0.16666666666666666

In [100]:
# now programmatically
n_trials = n_rows = 10_000
n_dice = n_cols = 2
possible_outcomes = [1,2,3,4,5,6]

data = np.random.choice(possible_outcomes, n_rows * n_cols).reshape(n_rows, n_cols)
data[:5]

array([[1, 4],
       [2, 3],
       [2, 4],
       [5, 3],
       [5, 4]])

In [110]:
pd.DataFrame(data).apply(lambda x: x[0] == x[1], axis=1).mean()

0.1645

In [115]:
# another way to do this would be to do the snake eyes method, and then multiply by 6
# since each set of doubles has an identical chance of occurring
n_trials = n_rows = 10_000
n_dice = n_cols = 2

rolls = np.random.choice(possible_outcomes, n_trials * n_dice).reshape(n_rows, n_cols)

sums_by_trial = rolls.sum(axis=1)

wins = sums_by_trial == 2

win_rate = wins.mean()

doubles = win_rate*6
doubles


0.165
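
Multiplying the snake-eyes rate by 6 is valid, but it also multiplies that estimate's sampling error by 6. A sketch of a more direct alternative is to compare the two columns of the array; the generator, seed, and trial count below are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(3)                       # local generator (assumption of this sketch)
demo_rolls = rng.integers(1, 7, size=(100_000, 2))   # 100,000 trials of 2 dice

# A trial is "doubles" when the first die equals the second
doubles_rate = (demo_rolls[:, 0] == demo_rolls[:, 1]).mean()

print(f"simulated: {doubles_rate:.4f}, exact: {1 / 6:.4f}")
```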

2. If you flip 8 coins, what is the probability of getting exactly 3 heads? 

In [None]:
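# first let's solve this mathematically, to make sure our work in the future is correct
# there are C(8, 3) = 56 ways you can get exactly 3 heads out of 2**8 = 256 equally likely outcomes,
# so the probability is 56 / 256 = 0.21875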



In [116]:
n_trials = n_rows = 10_000
n_coins = n_cols = 8

heads = 1
tails = 0

outcomes = [heads, tails]

flips = np.random.choice(outcomes, n_coins * n_trials).reshape(n_trials, n_coins)
flips



array([[0, 0, 1, ..., 0, 1, 1],
       [0, 1, 1, ..., 1, 0, 1],
       [1, 1, 1, ..., 1, 0, 0],
       ...,
       [0, 1, 1, ..., 0, 1, 0],
       [1, 0, 0, ..., 0, 1, 1],
       [1, 1, 1, ..., 0, 1, 0]])

In [120]:
sum_by_flips = flips.sum(axis=1)
sum_by_flips

array([4, 5, 5, ..., 3, 3, 6])

In [122]:
exactly_three = (sum_by_flips==3).mean()
exactly_three

0.2226

What is the probability of getting more than 3 heads?

In [123]:
n_trials = n_rows = 10_000
n_coins = n_cols = 8

heads = 1
tails = 0

outcomes = [heads, tails]

flips = np.random.choice(outcomes, n_coins * n_trials).reshape(n_trials, n_coins)
flips

array([[0, 1, 1, ..., 1, 1, 1],
       [1, 0, 1, ..., 1, 0, 1],
       [1, 0, 0, ..., 1, 1, 0],
       ...,
       [0, 0, 1, ..., 0, 0, 1],
       [1, 0, 1, ..., 0, 1, 0],
       [0, 0, 1, ..., 1, 1, 0]])

In [124]:
sum_by_flips = flips.sum(axis=1)
sum_by_flips

array([6, 5, 5, ..., 3, 5, 4])

In [125]:
more_than_three = (sum_by_flips>3).mean()
more_than_three

0.6436
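
Both coin-flip answers can be checked against the binomial counts. The sketch below uses `math.comb` from the standard library (Python 3.8+) and assumes fair, independent coins; the variable names are illustrative.

```python
from math import comb

n_coins = 8
total_outcomes = 2 ** n_coins   # 256 equally likely head/tail sequences

# Exactly 3 heads: choose which 3 of the 8 coins land heads
p_exactly_three = comb(n_coins, 3) / total_outcomes

# More than 3 heads: add up the counts for 4, 5, 6, 7, and 8 heads
p_more_than_three = sum(comb(n_coins, k) for k in range(4, n_coins + 1)) / total_outcomes

print(p_exactly_three)      # 0.21875
print(p_more_than_three)    # 0.63671875
```

The simulated 0.2226 and 0.6436 are both within about a percentage point of these exact values.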

3. There are approximately 3 web development cohorts for every 1 data science cohort at Codeup. Assuming that Codeup randomly selects an alum to put on each billboard, what are the odds that the two billboards I drive past both have data science students on them?

In [130]:
# let's assume each cohort has the same number of students

n_trials = n_rows = 10_000
n_draw = n_cols = 2

outcomes = [3,1]

prob_wb = 0.75
prob_ds = 1-prob_wb

data = np.random.choice(outcomes, n_rows * n_cols, p=[prob_wb, prob_ds]).reshape(n_rows, n_cols)
data

array([[3, 3],
       [1, 3],
       [3, 3],
       ...,
       [3, 3],
       [1, 1],
       [3, 3]])

In [132]:
sum_it_up = data.sum(axis=1)
sum_it_up

array([6, 4, 6, ..., 6, 2, 6])

In [134]:
double_ds = (sum_it_up == 2).mean()
double_ds

0.0655
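
Encoding the programs as the numbers 3 and 1 and testing for a sum of 2 works, but the intent can be made more explicit by sampling labels. The sketch below is one hedged alternative: the label strings, the seed, and the trial count are assumptions of this example, and it keeps the same 75/25 split used above.

```python
import numpy as np

rng = np.random.default_rng(4)   # local generator (assumption of this sketch)
programs = ["web development", "data science"]

# Each row is one drive past two billboards
billboards = rng.choice(programs, size=(100_000, 2), p=[0.75, 0.25])

both_ds_rate = (billboards == "data science").all(axis=1).mean()

print(f"simulated: {both_ds_rate:.4f}, exact: {0.25 ** 2:.4f}")
```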

4. Codeup students buy, on average, 3 poptart packages (± 1.5) a day from the snack vending machine. If on Monday the machine is restocked with 17 poptart packages, how likely is it that I will be able to buy some poptarts on Friday afternoon?

In [136]:
n_trials = n_rows = 10_000
n_days = n_cols = 5

testing = np.random.normal(3, 1.5, size=(n_rows, n_cols))
testing

array([[2.43866448, 4.87013369, 4.8715226 , 2.04130249, 6.85496171],
       [4.05565403, 1.8379895 , 3.19315957, 1.0575082 , 4.4505543 ],
       [3.99840932, 5.8840454 , 2.56779069, 3.27359235, 3.09818416],
       ...,
       [2.16824873, 1.41760355, 0.91106326, 1.83914501, 3.81046331],
       [3.83944191, 2.30066346, 2.86039039, 3.8918159 , 0.86232609],
       [4.26067385, 2.78429632, 3.77184781, 0.82072434, 2.85670489]])

In [137]:
testing_sums = testing.sum(axis=1)
testing_sums

array([21.07658497, 14.5948656 , 18.82202194, ..., 10.14652386,
       13.75463775, 14.49424721])

In [138]:
yes_poptarts = (testing_sums < 17).mean()
yes_poptarts

0.7329
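
Under the same modeling choices as the simulation (five independent, normally distributed days, each with mean 3 and standard deviation 1.5), the weekly total is itself normal, so the answer can be approximated in closed form. The sketch below builds the normal CDF from `math.erf`; the helper name `normal_cdf` is hypothetical.

```python
from math import erf, sqrt

def normal_cdf(x, mean, sd):
    """P(X < x) for a normal distribution with the given mean and standard deviation."""
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))

daily_mean, daily_sd, n_days, stock = 3, 1.5, 5, 17

# Sum of independent normals: means add, variances add
total_mean = daily_mean * n_days        # 15
total_sd = daily_sd * sqrt(n_days)      # about 3.354

p_some_left = normal_cdf(stock, total_mean, total_sd)
print(round(p_some_left, 4))
```

This gives roughly 0.72, in the same range as the simulated 0.7329.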

5. Compare Heights

    - Men have an average height of 178 cm and a standard deviation of 8 cm.
    - Women have an average height of 170 cm and a standard deviation of 6 cm.
    - If a man and a woman are chosen at random, what is P(woman taller than man)?

In [150]:
n_rows = 10_000
n_cols = 2

# create normal distribution of Men's height
men_dist = np.random.normal(178, 8, size=(10_000,1))
# create normal distribution of Women's height
women_dist = np.random.normal(170, 6, size=(10_000,1))

both_dist = np.column_stack((men_dist, women_dist))
both_dist


array([[164.61184159, 165.72115202],
       [186.44955827, 178.54518593],
       [191.20717195, 174.17166328],
       ...,
       [187.85186067, 158.91601912],
       [164.5663132 , 171.69550501],
       [177.9268822 , 175.61132491]])

In [153]:
men_women = np.subtract(men_dist, women_dist)
men_women

array([[-1.10931043],
       [ 7.90437234],
       [17.03550868],
       ...,
       [28.93584154],
       [-7.12919181],
       [ 2.31555729]])

In [155]:
women_men = (men_women < 0).mean()
women_men

0.2257
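
A more compact way to run the same comparison is to keep the heights as two 1-D arrays and compare them directly. This is a sketch with assumed names, seed, and trial count. If heights are independent and normal, the difference (man minus woman) is normal with mean 8 and standard deviation 10, which puts the exact answer near 0.212.

```python
import numpy as np

rng = np.random.default_rng(7)   # local generator (assumption of this sketch)
n_pairs = 100_000

men = rng.normal(178, 8, size=n_pairs)     # one simulated man per pair
women = rng.normal(170, 6, size=n_pairs)   # one simulated woman per pair

# Elementwise comparison: pair i counts when the woman is taller than the man
p_woman_taller = (women > men).mean()
print(round(p_woman_taller, 4))
```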

6. When installing anaconda on a student's computer, there's a 1 in 250 chance
   that the download is corrupted and the installation fails. What are the odds
   that after having 50 students download anaconda, no one has an installation
   issue?  100 students?

    What is the probability that we observe an installation issue within the first
    150 students that download anaconda?

    How likely is it that 450 students all download anaconda without an issue?

In [174]:
# fail = 0
# succeed = 1
n_trials = n_rows = 10_000
n_students = n_cols = 50
fail_prob = (1/250)
no_fail = 1-fail_prob
outcomes = [0,1]

fifty_test = np.random.choice(outcomes, n_cols * n_rows, p = [fail_prob, no_fail]).reshape(n_rows, n_cols)
fifty_test



array([[1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       ...,
       [1, 1, 0, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1]])

In [175]:
fifty_totals = fifty_test.sum(axis=1)
fifty_totals

array([50, 50, 49, ..., 49, 50, 50])

In [176]:
fifty_fail = (fifty_totals<50).mean()
fifty_fail

0.1834

In [177]:
fifty_success = 1-fifty_fail
fifty_success

0.8166

In [178]:
# 100 students
# fail = 0
# succeed = 1
n_trials = n_rows = 10_000
n_students = n_cols = 100
fail_prob = (1/250)
no_fail = 1-fail_prob
outcomes = [0,1]

hundred_test = np.random.choice(outcomes, n_cols * n_rows, p = [fail_prob, no_fail]).reshape(n_rows, n_cols)
hundred_test

array([[1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       ...,
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1]])

In [179]:
hundred_totals = hundred_test.sum(axis=1)
hundred_totals

array([100, 100, 100, ..., 100,  99, 100])

In [180]:
hundred_fail = (hundred_totals<100).mean()
hundred_fail

0.3339

In [181]:
hundred_success = 1-hundred_fail
hundred_success

0.6661

In [182]:
# 150 students
# fail = 0
# succeed = 1
n_trials = n_rows = 10_000
n_students = n_cols = 150
fail_prob = (1/250)
no_fail = 1-fail_prob
outcomes = [0,1]

hundred_fifty_test = np.random.choice(outcomes, n_cols * n_rows, p = [fail_prob, no_fail]).reshape(n_rows, n_cols)
hundred_fifty_test


array([[1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       ...,
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1]])

In [183]:
hundred_fifty_totals = hundred_fifty_test.sum(axis=1)
hundred_fifty_totals

array([150, 150, 150, ..., 150, 150, 150])

In [184]:
hundred_fifty_fail = (hundred_fifty_totals<150).mean()
hundred_fifty_fail

0.4522

In [185]:
hundred_fifty_success = 1-hundred_fifty_fail
hundred_fifty_success

0.5478000000000001

In [186]:
# 450 students
# fail = 0
# succeed = 1
n_trials = n_rows = 10_000
n_students = n_cols = 450
fail_prob = (1/250)
no_fail = 1-fail_prob
outcomes = [0,1]

four_fifty_test = np.random.choice(outcomes, n_cols * n_rows, p = [fail_prob, no_fail]).reshape(n_rows, n_cols)
four_fifty_test

array([[1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       ...,
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1]])

In [187]:
four_fifty_totals = four_fifty_test.sum(axis=1)
four_fifty_totals

array([448, 449, 449, ..., 448, 449, 450])

In [188]:
four_fifty_fail = (four_fifty_totals<450).mean()
four_fifty_fail

0.8327

In [189]:
four_fifty_success = 1-four_fifty_fail
four_fifty_success

0.1673
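
The four blocks above differ only in the number of students, so the repetition can be folded into one function. The sketch below is one way to do that; the helper `p_no_failures`, the `results` dictionary, and the locally seeded generator are all assumptions of this example. It also prints the closed-form value `(1 - fail_prob) ** n` for comparison.

```python
import numpy as np

rng = np.random.default_rng(5)   # local generator (assumption of this sketch)
fail_prob = 1 / 250

def p_no_failures(n_students, n_trials=10_000):
    """Simulated probability that none of n_students hits a corrupted download."""
    failures = rng.random((n_trials, n_students)) < fail_prob   # True marks a failed install
    return (~failures.any(axis=1)).mean()                       # share of trials with no failure

results = {}
for n in (50, 100, 150, 450):
    simulated = p_no_failures(n)
    exact = (1 - fail_prob) ** n
    results[n] = (simulated, exact)
    print(f"{n:>3} students: P(no issues) simulated = {simulated:.4f}, exact = {exact:.4f}, "
          f"P(at least one issue) = {1 - simulated:.4f}")
```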

7. There's a 70% chance on any given day that there will be at least one food
   truck at Travis Park. However, you haven't seen a food truck there in 3 days.
   How unlikely is this?

    How likely is it that a food truck will show up sometime this week?

In [190]:
no_truck = 0
truck = 1

n_trials = n_rows = 10_000
n_days = n_cols = 3
outcomes = [0,1]
truck_prob = 0.7
no_truck_prob = 0.3

three_day_test = np.random.choice(outcomes, n_cols * n_rows, p=[no_truck_prob, truck_prob]).reshape(n_rows, n_cols)
three_day_test


array([[1, 0, 0],
       [0, 1, 1],
       [1, 1, 1],
       ...,
       [1, 1, 1],
       [0, 0, 1],
       [0, 1, 1]])

In [191]:
three_day_sum = three_day_test.sum(axis=1)
three_day_sum

array([1, 2, 3, ..., 3, 1, 2])

In [192]:
bad_luck = (three_day_sum==0).mean()
bad_luck

0.0261

In [193]:
# how likely is it that a truck will show up this week, 
# given that 3 days in, there has not been a truck, 
# what are the chances of a truck showing up at least once in the next 4 days?

no_truck = 0
truck = 1

n_trials = n_rows = 10_000
n_days = n_cols = 4
outcomes = [0,1]
truck_prob = 0.7
no_truck_prob = 0.3

four_day_test = np.random.choice(outcomes, n_cols * n_rows, p=[no_truck_prob, truck_prob]).reshape(n_rows, n_cols)
four_day_test


array([[1, 1, 1, 1],
       [1, 1, 0, 0],
       [1, 1, 0, 1],
       ...,
       [1, 1, 0, 0],
       [0, 1, 0, 1],
       [1, 1, 1, 1]])

In [194]:
four_day_sum = four_day_test.sum(axis=1)
four_day_sum

array([4, 2, 3, ..., 2, 2, 4])

In [195]:
good_luck = (four_day_sum > 0).mean()
good_luck

0.9917

8. If 23 people are in the same room, what are the odds that two of them share a birthday? What if it's 20 people? 40?

In [220]:
# first, we have to assume that each date has an equal chance of being someone's birthdate

b_days = np.arange(1,366)

n_trials = n_rows = 10_000
n_people = n_cols = 23

first_test = np.random.choice(b_days, n_cols * n_rows).reshape(n_rows, n_cols)
first_test

array([[275, 274, 165, ..., 247, 171,  85],
       [ 82, 352,  92, ...,  20, 139, 290],
       [315, 240, 355, ..., 262, 320, 225],
       ...,
       [ 46, 282,  46, ..., 298, 278, 114],
       [ 39, 292, 288, ..., 250, 263, 191],
       [ 64, 347,  31, ..., 315, 167,  16]])

In [222]:
second_test = pd.DataFrame(first_test)
second_test

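|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 275 | 274 | 165 | 167 | 228 | 256 | 310 | 183 | 40 | 122 | ... | 280 | 340 | 224 | 93 | 78 | 238 | 250 | 247 | 171 | 85 |
| 1 | 82 | 352 | 92 | 270 | 167 | 224 | 77 | 353 | 5 | 279 | ... | 356 | 236 | 131 | 69 | 364 | 27 | 271 | 20 | 139 | 290 |
| 2 | 315 | 240 | 355 | 287 | 107 | 262 | 308 | 88 | 54 | 229 | ... | 236 | 317 | 50 | 25 | 173 | 329 | 351 | 262 | 320 | 225 |
| 3 | 196 | 201 | 335 | 116 | 195 | 150 | 30 | 194 | 275 | 281 | ... | 215 | 154 | 127 | 65 | 203 | 153 | 293 | 279 | 243 | 270 |
| 4 | 158 | 179 | 275 | 59 | 246 | 123 | 254 | 326 | 154 | 207 | ... | 126 | 224 | 18 | 308 | 64 | 338 | 358 | 170 | 327 | 187 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 193 | 111 | 80 | 198 | 186 | 128 | 134 | 122 | 98 | 31 | ... | 249 | 316 | 320 | 244 | 123 | 24 | 222 | 268 | 160 | 125 |
| 9996 | 309 | 35 | 172 | 198 | 229 | 54 | 346 | 302 | 96 | 156 | ... | 293 | 283 | 91 | 204 | 342 | 173 | 73 | 308 | 88 | 1 |
| 9997 | 46 | 282 | 46 | 268 | 332 | 283 | 303 | 23 | 319 | 334 | ... | 151 | 48 | 267 | 160 | 253 | 317 | 296 | 298 | 278 | 114 |
| 9998 | 39 | 292 | 288 | 250 | 225 | 337 | 44 | 358 | 214 | 277 | ... | 341 | 162 | 351 | 26 | 75 | 57 | 187 | 250 | 263 | 191 |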


In [224]:
third_test = second_test.nunique(axis=1)
third_test

0       23
1       23
2       22
3       23
4       23
        ..
9995    22
9996    23
9997    22
9998    22
9999    22
Length: 10000, dtype: int64

In [225]:
fourth_test = (third_test < 23).mean()
fourth_test

0.5043

In [226]:
# what if it's only 20 people

b_days = np.arange(1,366)

n_trials = n_rows = 10_000
n_people = n_cols = 20

twenty_test = np.random.choice(b_days, n_cols * n_rows).reshape(n_rows, n_cols)
twenty_test

array([[250, 127, 357, ..., 226, 114, 334],
       [184, 184,  94, ...,  34, 255, 159],
       [323, 238, 243, ..., 357,  25,  98],
       ...,
       [ 60, 205, 206, ...,  79, 140,  19],
       [ 63, 115, 360, ..., 173, 307, 118],
       [252, 112, 192, ..., 156, 242, 195]])

In [227]:
second_twenty = pd.DataFrame(twenty_test)
second_twenty

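|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 250 | 127 | 357 | 13 | 343 | 365 | 282 | 79 | 106 | 10 | 27 | 184 | 27 | 172 | 283 | 171 | 190 | 226 | 114 | 334 |
| 1 | 184 | 184 | 94 | 74 | 282 | 208 | 192 | 259 | 352 | 118 | 261 | 108 | 90 | 233 | 186 | 308 | 211 | 34 | 255 | 159 |
| 2 | 323 | 238 | 243 | 262 | 218 | 130 | 142 | 91 | 239 | 52 | 282 | 144 | 217 | 123 | 295 | 103 | 265 | 357 | 25 | 98 |
| 3 | 251 | 82 | 203 | 73 | 63 | 95 | 1 | 144 | 198 | 301 | 127 | 144 | 251 | 116 | 257 | 233 | 159 | 108 | 168 | 334 |
| 4 | 165 | 2 | 119 | 33 | 51 | 253 | 126 | 199 | 152 | 67 | 120 | 145 | 185 | 290 | 54 | 127 | 231 | 174 | 229 | 87 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 115 | 267 | 361 | 305 | 274 | 360 | 341 | 134 | 230 | 35 | 45 | 306 | 295 | 150 | 319 | 176 | 165 | 332 | 284 | 327 |
| 9996 | 306 | 351 | 344 | 274 | 76 | 166 | 309 | 47 | 88 | 129 | 309 | 71 | 113 | 231 | 243 | 102 | 344 | 358 | 67 | 72 |
| 9997 | 60 | 205 | 206 | 354 | 266 | 280 | 49 | 229 | 258 | 207 | 145 | 280 | 332 | 226 | 164 | 194 | 29 | 79 | 140 | 19 |
| 9998 | 63 | 115 | 360 | 138 | 316 | 2 | 60 | 106 | 216 | 78 | 351 | 234 | 115 | 125 | 202 | 52 | 44 | 173 | 307 | 118 |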


In [228]:
third_twenty = second_twenty.nunique(axis=1)
third_twenty

0       19
1       19
2       20
3       18
4       20
        ..
9995    20
9996    18
9997    19
9998    19
9999    20
Length: 10000, dtype: int64

In [229]:
fourth_twenty = (third_twenty < 20).mean()
fourth_twenty

0.406

In [230]:
# what about 40 people?
b_days = np.arange(1,366)

n_trials = n_rows = 10_000
n_people = n_cols = 40

forty_test = np.random.choice(b_days, n_cols * n_rows).reshape(n_rows, n_cols)
forty_test

array([[258, 200,   8, ..., 172, 285, 329],
       [308, 300, 219, ..., 301, 223,  32],
       [128,  11, 204, ...,   7, 176, 241],
       ...,
       [162, 206,  88, ..., 159, 237, 130],
       [254,  93,  69, ..., 240,  32, 216],
       [124, 189, 177, ..., 168, 218,  13]])

In [231]:
second_forty = pd.DataFrame(forty_test)
second_forty

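|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 258 | 200 | 8 | 75 | 76 | 99 | 52 | 281 | 65 | 326 | ... | 228 | 292 | 255 | 260 | 141 | 304 | 250 | 172 | 285 | 329 |
| 1 | 308 | 300 | 219 | 82 | 339 | 1 | 81 | 283 | 191 | 81 | ... | 347 | 329 | 191 | 269 | 306 | 319 | 68 | 301 | 223 | 32 |
| 2 | 128 | 11 | 204 | 359 | 133 | 175 | 271 | 264 | 102 | 107 | ... | 233 | 295 | 140 | 359 | 169 | 11 | 329 | 7 | 176 | 241 |
| 3 | 329 | 269 | 314 | 233 | 162 | 144 | 81 | 12 | 225 | 213 | ... | 21 | 45 | 362 | 79 | 19 | 183 | 88 | 274 | 102 | 304 |
| 4 | 241 | 355 | 92 | 58 | 138 | 141 | 348 | 100 | 346 | 146 | ... | 204 | 135 | 73 | 108 | 99 | 18 | 72 | 1 | 148 | 31 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 234 | 256 | 215 | 130 | 236 | 219 | 216 | 314 | 36 | 191 | ... | 69 | 332 | 167 | 134 | 87 | 75 | 8 | 364 | 334 | 3 |
| 9996 | 200 | 249 | 81 | 282 | 81 | 69 | 287 | 58 | 203 | 185 | ... | 71 | 310 | 315 | 80 | 125 | 47 | 84 | 235 | 266 | 195 |
| 9997 | 162 | 206 | 88 | 209 | 103 | 297 | 18 | 309 | 36 | 44 | ... | 307 | 159 | 196 | 88 | 124 | 326 | 319 | 159 | 237 | 130 |
| 9998 | 254 | 93 | 69 | 80 | 134 | 290 | 359 | 118 | 183 | 96 | ... | 203 | 207 | 162 | 277 | 343 | 342 | 319 | 240 | 32 | 216 |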


In [232]:
third_forty = second_forty.nunique(axis=1)
third_forty

0       36
1       37
2       37
3       40
4       40
        ..
9995    36
9996    37
9997    37
9998    40
9999    35
Length: 10000, dtype: int64

In [233]:
fourth_forty = (third_forty < 40).mean()
fourth_forty

0.8869
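
The three birthday runs follow the same recipe, so they can be wrapped in a function and compared with the exact answer. In the sketch below, both helper names are hypothetical, the generator is locally seeded, and duplicates are detected by sorting each row and looking for equal neighbors instead of calling `nunique` on a dataframe. As in the cells above, it assumes 365 equally likely birthdays.

```python
import numpy as np

rng = np.random.default_rng(6)   # local generator (assumption of this sketch)

def simulated_shared_birthday(n_people, n_trials=10_000):
    """Share of simulated rooms in which at least two people share a birthday."""
    birthdays = rng.integers(1, 366, size=(n_trials, n_people))   # days 1 through 365
    sorted_days = np.sort(birthdays, axis=1)
    # After sorting, any repeated birthday appears as two equal neighbors
    has_match = (np.diff(sorted_days, axis=1) == 0).any(axis=1)
    return has_match.mean()

def exact_shared_birthday(n_people, n_days=365):
    """1 minus the probability that all n_people have distinct birthdays."""
    p_all_distinct = 1.0
    for k in range(n_people):
        p_all_distinct *= (n_days - k) / n_days
    return 1 - p_all_distinct

for n in (20, 23, 40):
    print(f"{n} people: simulated = {simulated_shared_birthday(n):.4f}, "
          f"exact = {exact_shared_birthday(n):.4f}")
```

The exact values are roughly 0.41, 0.51, and 0.89 for 20, 23, and 40 people, in line with the simulated results above.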

#### Bonus Exercises
- [Mage Duel](https://gist.github.com/ryanorsinger/2996446f02c1bf30fcb3f8fdb88bd51d)
- [Chuck a Luck](https://gist.github.com/ryanorsinger/eac1d7b7e978f90b8390bdc056312123)