In [2]:
# Let's do our imports:
# matplotlib inline for notebook visualization display
%matplotlib inline
# numpy for matrix manipulation
import numpy as np
# pandas for dataframe manipulation
import pandas as pd
# curriculum example visualizations
import viz 
# and setting our random seed
np.random.seed(1349)

### Question #1:
## How likely is it that you roll doubles when rolling two dice?

Mentally visualize the situation:

Two dice, each with six sides.

The probability is evenly distributed among the six sides of each die.

So on a given roll, each side of a die is equally likely to come up.
Therefore, landing on any specific number has a probability of 1/6.

There are 36 possible outcomes when rolling two dice (6 x 6).

And there are 6 possible ways to get doubles (1-1, 2-2, 3-3, 4-4, 5-5, 6-6).

6/36 = 1/6

making for a probability of roughly 0.167
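
The counting argument above can be checked by brute force. This is a minimal sketch that enumerates every ordered pair of faces; the names `all_rolls`, `doubles`, and `exact_prob` are illustrative and not part of the original notebook.

```python
from fractions import Fraction
from itertools import product

# Every ordered (die 1, die 2) outcome for two six-sided dice
sides = range(1, 7)
all_rolls = list(product(sides, repeat=2))

# Doubles are the outcomes where both dice show the same face
doubles = [roll for roll in all_rolls if roll[0] == roll[1]]

# Exact probability as a fraction, then as a float
exact_prob = Fraction(len(doubles), len(all_rolls))
print(len(all_rolls), len(doubles), exact_prob, float(exact_prob))
```

Keeping the result as a `Fraction` makes it easy to see that the answer reduces to 1/6 before any rounding.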

<img src="http://www.stayorswitch.com/blog/wp-content/uploads/2014/06/Screen-Shot-2016-10-27-at-11.39.17-PM.png">

In [52]:
# Let's do it with a simulation in Python:

# Represent our data's possible outcomes:
outcomes = [1, 2, 3, 4, 5, 6]

# Create the data!
n_simulations = 1_000_000
n_trials = 2

In [53]:
# Let's get our rolls. We'll make a simulation of 1 million trials, or simulated rolls, of two dice

# outcomes: possible sides the die could land on
# n_simulations: the number of rolls of our dice to simulate
# n_trials: the number of dice we are rolling in this instance.

In [55]:
rolls = np.random.choice(outcomes, size=(n_simulations, n_trials))

In [56]:
# let's take a peek here
rolls

array([[3, 2],
       [5, 3],
       [4, 4],
       ...,
       [3, 4],
       [6, 1],
       [2, 1]])

Using a sum isn't the best option here: we are looking for two matching elements, so it is easier to count the number of unique elements in each roll.

In [5]:
# Let's see how numpy's unique() operates on a single instance of the array

In [57]:
np.unique(rolls[0])

array([2, 3])

In [58]:
np.unique(rolls[2])

array([4])

In [6]:
# Great, so we can say that when len(np.unique()) == 1, we have a situation of doubles

# Let's make a list of every instance where this is the case in our array of rolls:

In [59]:
# Let's use a list comprehension: 
# a list of the length of the uniques for each instance for the full number of simulations by index,
# but only if the number of uniques is 1
# range(n_simulations) covers every row; stopping at n_simulations-1 would skip the last roll
lst_doubles = [len(np.unique(rolls[n])) for n in range(n_simulations) if len(np.unique(rolls[n])) == 1]

In [60]:
# The length of this is going to be the number of times we rolled doubles, and we can divide that by the total number of simulations:

In [62]:
calculated_prob = len(lst_doubles) / n_simulations
print(f'The probability that we will roll doubles with {n_trials} dice is {calculated_prob}')

The probability that we will roll doubles with 2 dice is 0.166136


In [64]:
# And that approximates to 0.167, which is what we were expecting!
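
The list comprehension above calls `np.unique` on every row, which is slow for a million rolls. As a hedged alternative (not the notebook's own method), doubles can be detected by comparing the two columns directly. The sketch below is self-contained, so it draws its own smaller sample with NumPy's `default_rng`; `demo_rolls` and `vectorized_prob` are names assumed for illustration.

```python
import numpy as np

# Independent generator so this sketch does not disturb the notebook's seed
rng = np.random.default_rng(1349)

# 100,000 simulated rolls of two dice; the upper bound of integers() is exclusive
demo_rolls = rng.integers(1, 7, size=(100_000, 2))

# A roll is doubles when the first die equals the second die
is_double = demo_rolls[:, 0] == demo_rolls[:, 1]

# The mean of a boolean array is the proportion of True values
vectorized_prob = is_double.mean()
print(vectorized_prob)
```

The same comparison would work on the notebook's `rolls` array, since it has the same two-column shape.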

### Question #2:
### If you flip 8 coins, what is the probability of getting exactly 3 heads? What is the probability of getting more than 3 heads?

Mentally visualize the situation:
Eight coins, each with two sides.
On a "fair" coin, the probability is split evenly between the two sides on a given flip
The probability of getting H or T is equal, 1/2
Order does not matter here; it does not matter *when* the heads come up in the mix
Situation = {3H, 5T}

<img src="https://i.ytimg.com/vi/qyd1bQlPW-8/hqdefault.jpg">

In [11]:
# number of ways that we could get three heads out of eight flips, divided by number of possible flip outcomes of eight flips (2 * 2 * 2 * 2 * 2 * 2 * 2 * 2)

In [12]:
# 8C3 / 2^8 = (8! / (3! * (8-3)!)) / 2^8 = 56/256 = 7/32 ≈ 0.219
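
The combinatorial formula above can be evaluated directly with the standard library. This is a small sketch using `math.comb`; the variable names are illustrative.

```python
from math import comb

# Number of ways to choose which 3 of the 8 flips come up heads
ways_three_heads = comb(8, 3)

# Every flip has 2 outcomes, so there are 2^8 equally likely sequences
total_sequences = 2 ** 8

exact_prob_three = ways_three_heads / total_sequences
print(ways_three_heads, total_sequences, exact_prob_three)
```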

In [13]:
# Does that look a little confusing? Let's do it with a simulation in Python!

In [4]:
# Let's make a million simulated flips of 8 trials, or independent coins.
n_simulations = nrows = 1_000_000
n_trials = ncols = 8
heads = 1 
tails = 0
flips = np.random.choice([heads, tails], size=(n_simulations, n_trials))

In [5]:
flips

array([[1, 1, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       [1, 1, 0, ..., 0, 0, 1],
       ...,
       [1, 0, 0, ..., 0, 1, 1],
       [1, 0, 1, ..., 1, 0, 1],
       [1, 1, 0, ..., 1, 1, 0]])

In [6]:
# Since we assigned heads as a value of 1, the sum of any given row of 8 trials will be 3 if there were three heads!
numheads = flips.sum(axis=1)

# And if we take the average number of times where that sum equaled 3:
calculated_prob = (numheads == 3).mean()
print(f'The probability that we will flip exactly 3 heads over {n_trials} coins is {calculated_prob}')

The probability that we will flip exactly 3 heads over 8 coins is 0.21882


In [7]:
# Ta Da! We did the thing! Congratulations!! We subverted math with programming! Computer!

#### And for the second part? If the sum is 3 or more, we know that we flipped at least 3 heads. (Strictly "more than 3 heads" would be `numheads > 3`.) For at least 3:

In [9]:
calculated_prob = (numheads > 2).mean()
print(f'The probability that we will flip at least 3 heads over {n_trials} coins is {calculated_prob}')

The probability that we will flip at least 3 heads over 8 coins is 0.855982
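
The question asks for "more than 3 heads", while the cell above measures "at least 3 heads", so it helps to see both values side by side. The sketch below computes them exactly from the binomial counts; `prob_heads` is a hypothetical helper, not something defined in the notebook.

```python
from math import comb


def prob_heads(k, n=8):
    """Exact probability of exactly k heads in n fair coin flips."""
    return comb(n, k) / 2 ** n


# At least 3 heads: 3, 4, ..., 8 heads
p_at_least_3 = sum(prob_heads(k) for k in range(3, 9))

# More than 3 heads: 4, 5, ..., 8 heads
p_more_than_3 = sum(prob_heads(k) for k in range(4, 9))

print(p_at_least_3, p_more_than_3)
```

The two answers differ by exactly the probability of 3 heads (56/256), which is why the choice between `>` and `>=` matters here.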


### Question #3:
## There are approximately 3 web development cohorts for every 1 data science cohort at Codeup. Assuming that Codeup randomly selects an alumnus to put on a billboard, what are the odds that the two billboards I drive past both have data science students on them?
#### Mentally visualize the situation:
3 Web Dev cohorts for every 1 Data Science cohort, which is a ratio of 3:1,

or think of it this way:

each sign is a biased coin flip, where we know we have a 1 out of 4 chance of getting a data science student

In [11]:
# Let's make a million simulated drives where we pass by two independent billboards

n_simulations = nrows = 1_000_000
n_trials = ncols = 2
data_sci = 1
web_dev = 0
commutes = np.random.choice([data_sci, web_dev], size=(n_simulations, n_trials), p=[0.25, 0.75])

In [12]:
# We want to see a situation where both instances were a 1, which is a sum of 2, so we go about it the same as our previous simulations:
calculated_prob = (commutes.sum(axis=1) == 2).mean()
print(f'The probability that both of the {n_trials} billboards show data science students is {calculated_prob}')

The probability that both of the 2 billboards show data science students is 0.062634


In [13]:
(1/4) * (1/4)

0.0625

## Question 4:
###  Codeup students buy, on average, 3 poptart packages (± 1.5) a day from the snack vending machine. If on Monday the machine is restocked with 17 poptart packages, how likely is it that I will be able to buy some poptarts on Friday afternoon?

In [15]:
# average number of poptarts consumed:

pop_avg = 3

# standard deviation of poptarts consumed: 1.5

pop_std = 1.5
n_trials = n_cols = 5
n_simulations = n_rows = 1_000_000
simulated_consumed_potars = np.random.normal(pop_avg, pop_std, size=(n_simulations, n_trials))

In [16]:
simulated_consumed_potars

array([[-0.21407   ,  6.37625842,  2.0654548 ,  1.57119656,  4.29129215],
       [ 3.47890053,  1.86250156,  3.71099695,  3.18657193,  3.59210198],
       [-0.78278923,  1.46440324,  2.75146115,  1.00791192,  3.03889717],
       ...,
       [ 2.9320415 ,  0.6615184 ,  5.47651384,  2.10245505,  5.77567982],
       [ 2.67986957,  2.24653527,  2.09497628,  1.36642653,  1.47537435],
       [ 4.41742006,  2.2669592 ,  2.21905763,  2.09746598,  3.3080759 ]])

In [19]:
calculated_prob = (simulated_consumed_potars.sum(axis=1) < 17).mean()
print(f'The probability that there will still be poptarts in the vending machine after {n_trials} days is {calculated_prob}')

The probability that there will still be poptarts in the vending machine after 5 days is 0.725162
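
As a cross-check on the simulation, the total of five independent normal daily purchases is itself normal, with mean 5 x 3 and standard deviation 1.5 x sqrt(5). The sketch below uses `statistics.NormalDist` from the standard library and assumes the same five-day window and normal model as the notebook; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

daily_mean, daily_std = 3, 1.5
days, stock = 5, 17

# Sum of independent normals: means add, variances add
weekly_total = NormalDist(mu=daily_mean * days, sigma=daily_std * sqrt(days))

# Probability that fewer than 17 packages were bought over the five days
analytic_prob = weekly_total.cdf(stock)
print(analytic_prob)
```

This should land close to the simulated value, at roughly 0.72. Like the simulation, the normal model allows negative daily purchases, so both are approximations.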


## Question 5:

### Compare Heights: Men have an average height of 178 cm and standard deviation of 8cm. 

Women have a mean of 170, sd = 6cm. 

If a man and woman are chosen at random, P(woman taller than man)?

In [20]:
men_avg = 178
men_std = 8
wmn_avg = 170
wmn_std = 6

In [21]:
# Since we have an average and a standard deviation, let's use np.random.normal

In [22]:
s_men = np.random.normal(men_avg, men_std, 1_000_000)

In [23]:
s_men

array([186.0451691 , 176.86380875, 177.9336578 , ..., 180.2852766 ,
       185.88673255, 188.84939297])

In [24]:
s_men.shape

(1000000,)

In [25]:
s_wmn = np.random.normal(wmn_avg, wmn_std, 1_000_000)

In [27]:
calculated_prob = (s_wmn > s_men).mean()
print(f'The probability that we will have a woman taller than a man presuming a normal distribution is {calculated_prob}')

The probability that we will have a woman taller than a man presuming a normal distribution is 0.211362
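
The simulated answer can be compared against a closed form. If both heights are independent and normal, the difference (woman minus man) is normal with mean 170 - 178 and standard deviation sqrt(6^2 + 8^2). This is a sketch under those assumptions; `height_gap` and `analytic_prob` are illustrative names.

```python
from math import sqrt
from statistics import NormalDist

# Difference of independent normals: means subtract, variances add
height_gap = NormalDist(mu=170 - 178, sigma=sqrt(6 ** 2 + 8 ** 2))

# The woman is taller when the difference is greater than zero
analytic_prob = 1 - height_gap.cdf(0)
print(analytic_prob)
```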


In [29]:
# So I'm pretty tall for a woman, where does this put me?
(s_wmn > 180.34).mean()

0.041965

In [31]:
# Yeah but what if it's the 9ft tall vampire lady from the new Resident Evil game?
(s_wmn > 274).mean()

0.0

## Question 6:

### When installing anaconda on a student's computer, there's a 1 in 250 chance that the download is corrupted and the installation fails. What are the odds that after having 50 students download anaconda, no one has an installation issue? 100 students?

### What is the probability that we observe an installation issue within the first 150 students that download anaconda?

### How likely is it that 450 students all download anaconda without an issue?

In [32]:
n_simulations = nrows = 1_000_000

# n_trials in this case is going to be the number of students installing Anaconda.

n_trials = ncols = 50
conda_failure = 1
great_success = 0

# The one in 250 is going to come up with our probability bias for the two outcomes.  
# 1/250 = 0.004 probability that we will have an anaconda failure.

attempted_class_installs = np.random.choice([conda_failure, great_success], size=(n_simulations, n_trials), p=[0.004, 0.996])

In [33]:
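# (sum > 0) flags a class with at least one failed install; the chance that no one has an issue is the complement, 1 - calculated_prob.
calculated_prob = (attempted_class_installs.sum(axis=1) > 0).mean()
print(f'The probability that we will have one or more failures over {n_trials} installs is {calculated_prob}')

The probability that we will have one or more failures over 50 installs is 0.181629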


In [34]:
n_simulations = nrows = 1_000_000

# n_trials in this case is going to be the number of students installing Anaconda.

n_trials = ncols = 100
conda_failure = 1
great_success = 0

# The one in 250 is going to come up with our probability bias for the two outcomes.  
# 1/250 = 0.004 probability that we will have an anaconda failure.

attempted_class_installs = np.random.choice([conda_failure, great_success], size=(n_simulations, n_trials), p=[0.004, 0.996])

In [35]:
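# (sum > 0) flags a class with at least one failed install; the chance that no one has an issue is the complement, 1 - calculated_prob.
calculated_prob = (attempted_class_installs.sum(axis=1) > 0).mean()
print(f'The probability that we will have one or more failures over {n_trials} installs is {calculated_prob}')

The probability that we will have one or more failures over 100 installs is 0.330362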


In [36]:
n_simulations = nrows = 1_000_000

# n_trials in this case is going to be the number of students installing Anaconda.

n_trials = ncols = 150
conda_failure = 1
great_success = 0

# The one in 250 is going to come up with our probability bias for the two outcomes.  
# 1/250 = 0.004 probability that we will have an anaconda failure.

attempted_class_installs = np.random.choice([conda_failure, great_success], size=(n_simulations, n_trials), p=[0.004, 0.996])

In [37]:
calculated_prob = (attempted_class_installs.sum(axis=1) > 0).mean()
print(f'The probability that we will have one or more failures over {n_trials} installs is {calculated_prob}')

The probability that we will have one or more failures over 150 installs is 0.451355


In [38]:
n_simulations = nrows = 1_000_000

# n_trials in this case is going to be the number of students installing Anaconda.

n_trials = ncols = 450
conda_failure = 1
great_success = 0

# The one in 250 is going to come up with our probability bias for the two outcomes.  
# 1/250 = 0.004 probability that we will have an anaconda failure.

attempted_class_installs = np.random.choice([conda_failure, great_success], size=(n_simulations, n_trials), p=[0.004, 0.996])

In [39]:
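# (sum > 0) flags a class with at least one failed install; the chance that all 450 installs succeed is the complement, 1 - calculated_prob.
calculated_prob = (attempted_class_installs.sum(axis=1) > 0).mean()
print(f'The probability that we will have one or more failures over {n_trials} installs is {calculated_prob}')

The probability that we will have one or more failures over 450 installs is 0.835447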
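
The four cells above repeat the same simulation with a different class size. Because each install is independent, the chance that nobody has an issue is simply 0.996 raised to the number of students, which gives a quick check on all four results. The sketch below is one way to lay that out; `p_all_ok` and `n_students` are assumed names.

```python
p_fail = 1 / 250  # chance that a single install is corrupted

# Probability that every install succeeds, keyed by class size
p_all_ok = {n_students: (1 - p_fail) ** n_students
            for n_students in (50, 100, 150, 450)}

for n_students, p_ok in p_all_ok.items():
    # "At least one failure" is the complement of "no failures"
    print(f'{n_students} students: P(no failures) = {p_ok:.4f}, '
          f'P(at least one failure) = {1 - p_ok:.4f}')
```

The "at least one failure" column is the one to compare with the simulated outputs; the "no failures" column answers the 50, 100, and 450 student questions as worded.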


## Question 7:
### There's a 70% chance on any given day that there will be at least one food truck at Travis Park. However, you haven't seen a food truck there in 3 days. How unlikely is this?

### How likely is it that a food truck will show up sometime this week?

In [34]:
# You haven't been to Travis Park in like a year because we're in the middle of a pancetta and you're attending Codeup from inside your home, so it's 0% unlikely, congratulations.

In [35]:
# Let's pretend it's regular times for the sake of doing some statistics, though.
# We are still looking at these like independent events, so:
# There either will be or will not be a food truck, with a probability of 0.7 in favor of there being a food truck.
# 3 days of the week have passed, with two more left, assuming a regular business week.

In [40]:
n_simulations = nrows = 1_000_000

# n_trials in this case is going to be the number of weekdays. Let's start with the three days you haven't seen a truck.

n_trials = ncols = 3
food_truck = 1
no_truck = 0
lunch_days = np.random.choice([food_truck, no_truck], size=(n_simulations, n_trials), p=[0.7, 0.3])

In [41]:
calculated_prob = (lunch_days.sum(axis=1) == 0).mean()
print(f'The probability that we will not have seen a food truck over the course of {n_trials} days is {calculated_prob}')

The probability that we will not have seen a food truck over the course of 3 days is 0.026845


In [38]:
# The presence of a food truck is not dependent on whether or not one showed up on the previous day; it's independent. Let's see what it's like for the last two days

In [42]:
n_simulations = nrows = 1_000_000
n_trials = ncols = 2
food_truck = 1
no_truck = 0

In [43]:
lunch_days = np.random.choice([food_truck, no_truck], size=(n_simulations, n_trials), p=[0.7, 0.3])

In [44]:
calculated_prob = (lunch_days.sum(axis=1) > 0).mean()
print(f'The probability that we have seen a food truck over the course of {n_trials} days is {calculated_prob}')

The probability that we have seen a food truck over the course of 2 days is 0.909645
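
Both food truck results follow from multiplying independent daily probabilities, so they are easy to verify by hand. A minimal sketch, assuming the same 0.7 daily chance and the same three-day and two-day windows used above; the variable names are illustrative.

```python
p_truck = 0.7           # chance of at least one truck on a given day
p_no_truck = 1 - p_truck

# No truck on three independent days in a row
p_none_in_3_days = p_no_truck ** 3

# At least one truck over the remaining two days = 1 - P(no truck on both days)
p_any_in_2_days = 1 - p_no_truck ** 2

print(round(p_none_in_3_days, 4), round(p_any_in_2_days, 4))
```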


In [41]:
# calculated_prob = 
# print(f'The probability that we will have one or more food truck show up over {n_trials} days is {calculated_prob}')

## Question 8:
### If 23 people are in the same room, what are the odds that two of them share a birthday? What if it's 20 people? 40?

In [42]:
# 365 days in a year (typically)
# 23 people in the room
# we want an instance where at least two of them are the same number!

# Hey, this is exactly the same as our first problem with a few extra steps!

In [62]:
# Represent our data's possible outcomes, the number of days in a year
# People born on leap days don't actually exist, so we are going to exclude them here:

outcomes = range(0, 365)
# Create the data!
n_simulations = 1_000_000
n_trials = 23

In [63]:
# Let's get our simulations. We'll make a simulation of 1 million classrooms of 23 students.
#
# outcomes: possible unique days of the year that a person could have.
# n_simulations: the number of simulated classroom trials
# n_trials: the number of student birthdays
#

In [64]:
classrooms = np.random.choice(outcomes, size=(n_simulations, n_trials))

In [65]:
# let's take a peek at the length of an instance here

In [66]:
len(classrooms[0])

23

##### Great, so we can say that when len(np.unique()) is 22 or less, at least two students share a birthday

#### Let's make a list of every instance where this is the case in our array of simulated classes:

In [59]:
# Let's use a list comprehension: 
# a list of the length of the uniques for each instance for the full number of simulations by index, 
# but only if the number of uniques is less than the number of students in the class

In [69]:
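# range(n_simulations) covers every classroom; comparing against n_trials avoids hard-coding the class size
list_of_twinsies = [len(np.unique(classrooms[n])) for n in range(n_simulations) if len(np.unique(classrooms[n])) < n_trials]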

#### The length of this is going to be the number of times we had a class with shared birthdays, and we can divide that by the total number of simulations:

In [70]:
prop_twinsies = len(list_of_twinsies) / n_simulations
print(f'The probability that we will have one or more shared birthdays over {n_trials} students is {prop_twinsies}')

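
Calling `np.unique` once per classroom is slow at a million rows. One hedged alternative (not the notebook's own approach) is to sort each row and look for adjacent equal values, which flags a shared birthday without a Python loop. The sketch is self-contained and uses a smaller sample of 23-student rooms; `rooms`, `has_shared_birthday`, and `shared_prob` are assumed names.

```python
import numpy as np

# Independent generator so this sketch does not disturb the notebook's seed
rng = np.random.default_rng(1349)

n_rooms, n_students = 100_000, 23

# Birthdays as day-of-year indices 0..364 (leap days excluded, as above)
rooms = rng.integers(0, 365, size=(n_rooms, n_students))

# After sorting a row, any repeated birthday sits next to its twin,
# so a zero in the row-wise differences means a shared birthday
sorted_rooms = np.sort(rooms, axis=1)
has_shared_birthday = (np.diff(sorted_rooms, axis=1) == 0).any(axis=1)

shared_prob = has_shared_birthday.mean()
print(shared_prob)
```

For 23 students the result should come out near one half, which is the classic birthday problem result.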


### 20?

In [42]:
# 365 days in a year (typically)
# 20 people in the room
# we want an instance where at least two of them are the same number!

# Hey, this is exactly the same as our first problem with a few extra steps!

In [62]:
# Represent our data's possible outcomes, the number of days in a year
# People born on leap days don't actually exist, so we are going to exclude them here:

outcomes = range(0, 365)
# Create the data!
n_simulations = 1_000_000
n_trials = 20

In [63]:
# Let's get our simulations. We'll make a simulation of 1 million classrooms of 20 students.
#
# outcomes: possible unique days of the year that a person could have.
# n_simulations: the number of simulated classroom trials
# n_trials: the number of student birthdays
#

In [64]:
classrooms = np.random.choice(outcomes, size=(n_simulations, n_trials))

##### Great, so we can say that when len(np.unique()) is 19 or less, at least two students share a birthday

#### Let's make a list of every instance where this is the case in our array of simulated classes:

In [59]:
# Let's use a list comprehension: 
# a list of the length of the uniques for each instance for the full number of simulations by index, 
# but only if the number of uniques is less than the number of students in the class

In [69]:
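# range(n_simulations) covers every classroom; comparing against n_trials avoids hard-coding the class size
list_of_twinsies = [len(np.unique(classrooms[n])) for n in range(n_simulations) if len(np.unique(classrooms[n])) < n_trials]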

#### The length of this is going to be the number of times we had a class with shared birthdays, and we can divide that by the total number of simulations:

In [70]:
prop_twinsies = len(list_of_twinsies) / n_simulations
print(f'The probability that we will have one or more shared birthdays over {n_trials} students is {prop_twinsies}')



### 40?

In [42]:
# 365 days in a year (typically)
# 40 people in the room
# we want an instance where at least two of them are the same number!

# Hey, this is exactly the same as our first problem with a few extra steps!

In [62]:
# Represent our data's possible outcomes, the number of days in a year
# People born on leap days don't actually exist, so we are going to exclude them here:

outcomes = range(0, 365)
# Create the data!
n_simulations = 1_000_000
n_trials = 40

In [63]:
# Let's get our simulations. We'll make a simulation of 1 million classrooms of 40 students.
#
# outcomes: possible unique days of the year that a person could have.
# n_simulations: the number of simulated classroom trials
# n_trials: the number of student birthdays
#

In [64]:
classrooms = np.random.choice(outcomes, size=(n_simulations, n_trials))

In [65]:
# let's take a peek at the length of an instance here

In [66]:
len(classrooms[0])

40

##### Great, so we can say that when len(np.unique()) is 39 or less, at least two students share a birthday

#### Let's make a list of every instance where this is the case in our array of simulated classes:

In [59]:
# Let's use a list comprehension: 
# a list of the length of the uniques for each instance for the full number of simulations by index, 
# but only if the number of uniques is less than the number of students in the class

In [69]:
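# range(n_simulations) covers every classroom; comparing against n_trials avoids hard-coding the class size
list_of_twinsies = [len(np.unique(classrooms[n])) for n in range(n_simulations) if len(np.unique(classrooms[n])) < n_trials]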

#### The length of this is going to be the number of times we had a class with shared birthdays, and we can divide that by the total number of simulations:

In [70]:
prop_twinsies = len(list_of_twinsies) / n_simulations
print(f'The probability that we will have one or more shared birthdays over {n_trials} students is {prop_twinsies}')

The probability that we will have one or more shared birthdays over 40 students is 0.891583
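
The birthday problem also has an exact answer, which gives a reference point for all three class sizes. The probability that everyone has a distinct birthday is (365/365) x (364/365) x ... for each person, and a shared birthday is the complement. The sketch below assumes 365 equally likely birthdays, as the simulation does; `prob_shared_birthday` is a hypothetical helper.

```python
from math import prod


def prob_shared_birthday(n_people, n_days=365):
    """Exact probability that at least two of n_people share a birthday."""
    # Person i (counting from 0) must avoid the i birthdays already taken
    p_all_distinct = prod((n_days - i) / n_days for i in range(n_people))
    return 1 - p_all_distinct


exact = {n: prob_shared_birthday(n) for n in (20, 23, 40)}
for n, p in exact.items():
    print(f'{n} people: {p:.4f}')
```

The exact values are roughly 0.41 for 20 people, 0.51 for 23, and 0.89 for 40, so the simulated 40-student result above is in line with theory.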
