In [10]:
%run viz.py

In [3]:
%matplotlib inline 
import numpy as np 
import pandas as pd 


np.random.seed(29)

# NOTES 

Generating Random Numbers with Numpy 

*np.random.choice: selects random options from a list

*np.random.uniform: generates numbers between a given lower and upper bound

*np.random.random: generates numbers between 0 and 1

*np.random.randn: generates numbers from the standard normal distribution

*np.random.normal: generates numbers from a normal distribution with a specified mean and standard deviation
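A minimal sketch of each function listed above. The sizes, bounds, mean, and standard deviation are illustrative assumptions, not values taken from these notes.

```python
import numpy as np

np.random.seed(29)  # any fixed seed makes the draws repeatable

# choice: pick 5 values at random (with replacement) from a list
picks = np.random.choice([1, 2, 3, 4, 5, 6], 5)

# uniform: 5 floats in the half-open interval [10, 20)
uniform_draws = np.random.uniform(10, 20, 5)

# random: 5 floats in [0, 1)
unit_draws = np.random.random(5)

# randn: 5 draws from the standard normal (mean 0, standard deviation 1)
std_normal_draws = np.random.randn(5)

# normal: 5 draws from a normal with mean 100 and standard deviation 15
normal_draws = np.random.normal(100, 15, 5)

print(picks, uniform_draws, unit_draws, std_normal_draws, normal_draws, sep="\n")
```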

### Example 1

You are at a carnival and come across a person in a booth offering you a game of "chance" (as people in booths at carnivals tend to do).

You pay 5 dollars and roll 3 dice. If the sum of the dice rolls is greater than 12, you get 15 dollars. If it's less than or equal to 12, you get nothing.

Assuming the dice are fair, should you play this game? How would this change if the winning condition was a sum greater than or equal to 12?

In [4]:
#The rows are the simulated trials, i.e. each row is one roll of 3 dice
#There are three columns, one for each die
#np.random.choice picks from 1 through 6 because a die has 6 sides
#The reshape method turns the flat array into an n_trials x n_dice matrix
n_trials = nrows = 10_000
n_dice = ncols = 3
rolls = np.random.choice([1, 2, 3, 4, 5, 6], n_trials * n_dice).reshape(nrows, ncols)
rolls

array([[6, 4, 5],
       [6, 3, 1],
       [1, 2, 2],
       ...,
       [6, 2, 1],
       [3, 4, 3],
       [4, 2, 4]])

In [5]:
#Now to calculate the sum of each row 
#The axis is set to 1 to indicate that we want the sum of every row not sum of every column
sums_by_trial = rolls.sum(axis=1)
sums_by_trial

array([15, 10,  5, ...,  9, 10, 10])

In [12]:
#Flag each trial whose sum is over 12, giving a boolean array
wins = sums_by_trial > 12
wins

array([ True, False, False, ..., False, False, False])

In [13]:
#Calculate an overall win rate, treat each win as a 1 and each loss as a 0, then take the average of the array 
win_rate = wins.astype(int).mean()
win_rate

0.2633

In [15]:
#Calculate expected profit 
expected_winnings = win_rate * 15 
cost = 5 
expected_profit = expected_winnings - cost 
expected_profit

-1.0505000000000004

#We can expect to lose over a dollar every time we play this game
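The second part of the question (a win on a sum greater than or equal to 12) can be checked without simulation. The sketch below is an exact enumeration added as a cross-check, not part of the original notebook; it assumes fair, independent dice.

```python
from itertools import product

# Every one of the 6**3 = 216 equally likely outcomes for three fair dice
sums = [sum(roll) for roll in product(range(1, 7), repeat=3)]

p_gt_12 = sum(s > 12 for s in sums) / len(sums)   # original rule
p_ge_12 = sum(s >= 12 for s in sums) / len(sums)  # alternative rule

for label, p in [("sum > 12", p_gt_12), ("sum >= 12", p_ge_12)]:
    print(f"{label}: win probability {p:.4f}, expected profit {p * 15 - 5:+.3f}")
```

With the original rule 56 of the 216 outcomes win, so the expected profit is negative. With the relaxed rule 81 of 216 win (37.5%), and the expected profit turns positive at about 62 cents per play.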

### Example 2

There's a 30% chance my son takes a nap on any given weekend day. What is the chance that he takes a nap at least one day this weekend? What is the probability that he doesn't nap at all?

In [17]:
#.3 is the 30% chance of a nap, 2 is the number of days in the weekend, 10**5 is 100,000 simulations
h_nap = .3
ndays = ncols = 2
n_simulated_weekends = nrows = 10**5

In [18]:
#Draw random numbers between 0 and 1; a value below h_nap counts as a nap (True), a value at or above it as no nap (False)
data = np.random.random((nrows, ncols))
data

array([[0.46762045, 0.70078355],
       [0.18897809, 0.54312897],
       [0.253291  , 0.43836437],
       ...,
       [0.15008559, 0.37577491],
       [0.34690321, 0.58934311],
       [0.97135998, 0.57219933]])

In [20]:
naps = data < h_nap
naps

array([[False, False],
       [ True, False],
       [ True, False],
       ...,
       [ True, False],
       [False, False],
       [False, False]])

In [21]:
naps.sum(axis=1)

array([0, 1, 1, ..., 1, 0, 0])

What is the probability that at least one nap is taken?

In [22]:
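#naps.sum(axis=1) counts the naps in each weekend, >= 1 turns that into True/False, and the mean of a boolean array is the fraction of True values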
(naps.sum(axis=1) >= 1).mean()

0.50998

In [23]:
(naps.sum(axis=1) == 0).mean()

0.49002

There is about a 51% chance that at least one nap is taken over the weekend, which dramatically increases if you take that child to the park and run them for the first few hours of the day.
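To see why taking the mean of a comparison gives a probability, here is a small sketch on four made-up weekends (the array is hypothetical, chosen so each step is easy to follow), followed by a closed-form check that assumes the two days are independent.

```python
import numpy as np

# Four hypothetical weekends (rows), two days each (columns)
naps = np.array([[False, False],
                 [True,  False],
                 [True,  True],
                 [False, False]])

naps_per_weekend = naps.sum(axis=1)      # True counts as 1 -> [0, 1, 2, 0]
at_least_one = naps_per_weekend >= 1     # [False, True, True, False]
share = at_least_one.mean()              # mean of booleans = fraction that are True

print(naps_per_weekend, at_least_one, share)

# Closed-form check for the real problem, assuming independent days
p_no_nap = (1 - 0.3) ** 2
print(p_no_nap, 1 - p_no_nap)
```

The closed-form values (about 0.49 and 0.51) sit close to the simulated 0.49002 and 0.50998.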

### Example 3

What is the probability of getting at least one 3 in 3 dice rolls?

In [24]:
n_simulations = nrows = 10**5
n_dice_rolled = ncols = 3

rolls = np.random.choice([1, 2, 3, 4, 5, 6], nrows * ncols).reshape(nrows, ncols)

(pd.DataFrame(rolls)
 .apply(lambda row: 3 in row.values, axis=1)
 .mean())



0.42324

Step 1: Assign values for the number of rows and columns we are going to use

Step 2: Create the rolls variable that holds a 100,000 x 3 matrix where each element is a randomly chosen integer from 1 to 6

Step 3: Create a dataframe from the rolls with pd.DataFrame(rolls), use .apply() to check each row for a 3, then take the .mean() of the resulting True/False values
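The same three steps can also be done without a DataFrame. This is an alternative sketch rather than the method used above; it relies on NumPy's elementwise comparison and `any(axis=1)`, and compares the result with the exact value under the assumption of fair dice.

```python
import numpy as np

np.random.seed(29)
rolls = np.random.choice([1, 2, 3, 4, 5, 6], size=(10**5, 3))

# rolls == 3 marks every die showing a 3; any(axis=1) asks "at least one in this row?"
p_simulated = (rolls == 3).any(axis=1).mean()

# Exact value: 1 minus the chance that all three dice miss
p_exact = 1 - (5 / 6) ** 3

print(p_simulated, p_exact)
```

The vectorized comparison avoids calling a Python lambda once per row, which tends to be noticeably faster on 100,000 rows.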

### Practice Problems 

### Q1. How likely is it that you roll doubles when rolling two dice?

In [25]:
#Total possible outcomes when two dice are rolled 36 
#The number of outcomes rolling doubles is 6 
n_trials = nrows = 10_000
n_dice = ncols = 2
rolls = np.random.choice([1, 2, 3, 4, 5, 6], n_trials * n_dice).reshape(nrows, ncols)
rolls

array([[2, 3],
       [6, 5],
       [2, 4],
       ...,
       [6, 3],
       [2, 1],
       [5, 1]])

In [122]:
rolls = pd.DataFrame(rolls)
rolls

Unnamed: 0,0,1
0,2,3
1,6,5
2,2,4
3,5,6
4,3,1
...,...,...
9995,1,4
9996,3,2
9997,6,3
9998,2,1


# NEED TO KNOW HOW TO PULL NON UNIQUE VALUES FROM DATAFRAME 
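One possible way to finish this problem, sketched under the assumption that "doubles" simply means both columns hold the same value in a row. The simulation is regenerated here so the block runs on its own.

```python
import numpy as np
import pandas as pd

np.random.seed(29)
rolls = pd.DataFrame(np.random.choice([1, 2, 3, 4, 5, 6], size=(10_000, 2)))

# Doubles: the two columns are equal in that row
doubles = rolls[0] == rolls[1]

# Equivalent check that works for any number of columns: one distinct value per row
doubles_alt = rolls.nunique(axis=1) == 1

print(doubles.mean())   # should land near 6/36
```

The comparison of two columns is the simpler route for two dice; `nunique(axis=1)` generalizes to questions like "all three dice match".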

### Q2. If you flip 8 coins, what is the probability of getting exactly 3 heads? What is the probability of getting more than 3 heads?

In [26]:
#I chose to set this up with the choices being 1 and 0 as a numeric version of heads or tails, where 1 means flipping heads and 0 means flipping tails
n_trials = nrows = 10_000 
n_coin = ncols = 8 
flips = np.random.choice([1, 0], n_trials * n_coin).reshape(nrows, ncols)
flips

array([[1, 0, 1, ..., 0, 0, 1],
       [0, 0, 1, ..., 1, 1, 1],
       [0, 0, 1, ..., 0, 0, 0],
       ...,
       [0, 1, 1, ..., 0, 1, 1],
       [0, 0, 1, ..., 1, 0, 1],
       [1, 0, 0, ..., 0, 1, 0]])

In [27]:
sums_by_trial = flips.sum(axis=1)
sums_by_trial

array([3, 6, 2, ..., 6, 3, 4])

In [78]:
three = sums_by_trial == 3
three

array([False, False, False, ..., False, False, False])

In [80]:
#need to know number of true values 

<function ndarray.any>
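Counting the True values in a boolean array takes `.sum()` (each True counts as 1), and `.mean()` gives the share directly. The sketch below regenerates the flips so it runs on its own, and adds the exact binomial values as a cross-check that assumes fair, independent coins.

```python
import numpy as np
from math import comb

np.random.seed(29)
flips = np.random.choice([1, 0], size=(10_000, 8))
heads = flips.sum(axis=1)

exactly_three = heads == 3
print(exactly_three.sum())    # number of True values
print(exactly_three.mean())   # share of trials with exactly 3 heads
print((heads > 3).mean())     # share of trials with more than 3 heads

# Exact binomial values for comparison
p_exactly_3 = comb(8, 3) / 2**8
p_more_than_3 = sum(comb(8, k) for k in range(4, 9)) / 2**8
print(p_exactly_3, p_more_than_3)
```

Note that `three.any` without parentheses only displays the method object, which is why the cell above shows `<function ndarray.any>` rather than a number.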

### Q3. There are approximately 3 web development cohorts for every 1 data science cohort at Codeup. Assuming that Codeup randomly selects an alum to put on a billboard, what are the odds that the two billboards I drive past both have data science students on them?

In [39]:
ds_cohort = 0.25 
nbillboards = ncols = 2 
n_simulated_billboards = nrows = 10**5 

In [40]:
data = np.random.random((nrows, ncols))
data

array([[0.46429388, 0.40885974],
       [0.08326115, 0.02222945],
       [0.41161641, 0.89518895],
       ...,
       [0.3459291 , 0.99321578],
       [0.90523284, 0.62221469],
       [0.03487333, 0.54012269]])

In [41]:
ds_alumni = data < ds_cohort
ds_alumni 

array([[False, False],
       [ True,  True],
       [False, False],
       ...,
       [False, False],
       [False, False],
       [ True, False]])

In [42]:
ds_alumni.sum(axis=1)

array([0, 2, 0, ..., 0, 0, 1])

In [45]:
(ds_alumni.sum(axis=1) >= 2).mean()

0.06402

### There is approximately a 6% chance that both billboards will have a data science alum

### Q4. Codeup students buy, on average, 3 poptart packages a day from the snack vending machine, with a standard deviation of 1.5. If on Monday the machine is restocked with 17 poptart packages, how likely is it that I will be able to buy some poptarts on Friday afternoon? (Remember, if you have a mean and standard deviation, use np.random.normal)

In [49]:
#np.random.normal(mean, standard deviation, size=(number of simulations, number of days)), rounded to whole packages; here 10,000 simulations and 4 days

snack = np.random.normal(3, 1.5, size=(10_000 , 4)).round()
snack

array([[ 3.,  6.,  3.,  2.],
       [ 4.,  3.,  4.,  2.],
       [ 3.,  4.,  3.,  2.],
       ...,
       [ 2.,  2.,  1.,  2.],
       [ 3.,  2.,  2.,  2.],
       [ 4.,  2., -0.,  2.]])

In [51]:
snacks_by_week = snack.sum(axis=1)
snacks_by_week

array([14., 13., 12., ...,  7.,  9.,  8.])

In [52]:
available_tarts = snacks_by_week < 17
available_tarts

array([ True,  True,  True, ...,  True,  True,  True])

In [53]:
hungry_me = available_tarts.astype(int).mean()
hungry_me

0.932

I have a 93% chance of poptarts still being available in the vending machine on Friday

### Q5. Compare Heights

Men have an average height of 178 cm and standard deviation of 8cm.


Women have a mean of 170, sd = 6cm.


Since you have means and standard deviations, you can use np.random.normal to generate observations.


If a man and woman are chosen at random, what is the likelihood the woman is taller than the man?


In [54]:
male = np.random.normal(178, 8, 10_000)
female = np.random.normal(170, 6, 10_000)

In [55]:
(female > male).mean()

0.2064

There is about a 21% chance a female will be taller than a male
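As a cross-check on the simulation, the difference of two independent normal variables is itself normal, so the answer has a closed form. This sketch assumes the two heights are independent and uses only the standard library; `normal_cdf` is a helper defined here for illustration.

```python
from math import erf, sqrt

def normal_cdf(x, mean, sd):
    """P(X <= x) for a normal distribution, via the error function."""
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))

# (woman - man) is normal with mean 170 - 178 and variance 6**2 + 8**2
diff_mean = 170 - 178
diff_sd = sqrt(6**2 + 8**2)

# The woman is taller when the difference is above zero
p_woman_taller = 1 - normal_cdf(0, diff_mean, diff_sd)
print(round(p_woman_taller, 4))
```

The result is close to 0.21, in line with the simulated 0.2064.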

### Q6. When installing anaconda on a student's computer, there's a 1 in 250 chance that the download is corrupted and the installation fails. What are the odds that after having 50 students download anaconda, no one has an installation issue? 

In [57]:
#np.random.choice([possible outcomes for any element], p=[probability of True, probability of False], size=(number of simulations (rows), number of students (columns)))
n_students = 50
simulations = np.random.choice([True, False], p=[1/250, 249/250], size=(10_000, n_students))
simulations

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

In [59]:
#10,000 rows for the sample base, 50 to represent the number of students 
simulations.shape

(10000, 50)

In [69]:
#At least one corrupted download occurs in 1833 of the 10,000 simulated classes
simulations.any(axis=1).sum()

1833

The probability that at least one of 50 students downloads a corrupt file is about 18%

The probability that no one downloads a corrupt file is about 82%

### 100 students?

In [71]:
n_students = 100 
simulations = np.random.choice([True, False], p=[1/250, 249/250], size=(10_000, n_students))
simulations

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

In [72]:
simulations.any(axis=1).sum()

3272

In [73]:
3272/10000

0.3272

In a class of 100, there is about a 33% chance that at least one student will download a corrupt file

In a class of 100, there is about a 67% chance no one will download a corrupt file

### I. What is the probability that we observe an installation issue within the first 150 students that download anaconda?

In [74]:
n_students = 150 
simulations = np.random.choice([True, False], p=[1/250, 249/250], size=(10_000, n_students))
simulations
simulations.any(axis=1).sum()

4558

In [75]:
4558/10000

0.4558

About a 46% probability

### II. How likely is it that 450 students all download anaconda without an issue?

In [76]:
n_students = 450 
simulations = np.random.choice([True, False], p=[1/250, 249/250], size=(10_000, n_students))
simulations
simulations.any(axis=1).sum()

8369

In [77]:
8369/10000

0.8369

About a 16% probability (1 - 0.8369)
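All four class sizes can be checked against the closed form, since the chance that every download succeeds is the single-download success rate raised to the number of students. This sketch assumes the downloads are independent; the dictionary name is illustrative.

```python
p_ok = 249 / 250   # chance that a single download is fine

# Chance that every one of n students installs cleanly, assuming independence
p_all_ok = {n: p_ok ** n for n in (50, 100, 150, 450)}

for n, p in p_all_ok.items():
    print(f"{n:>3} students: no issues {p:.4f}, at least one issue {1 - p:.4f}")
```

The exact values (roughly 0.82, 0.67, 0.55, and 0.16 for no issues) are within simulation noise of the results above.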

### Q7. There's a 70% chance on any given day that there will be at least one food truck at Travis Park. However, you haven't seen a food truck there in 3 days. How unlikely is this?

In [81]:
food_truck = .7
ndays = ncols = 3
iterations = nrows = 10000
data = np.random.random((nrows, ncols))
data



array([[0.99506392, 0.73420189, 0.51775892],
       [0.0800494 , 0.18482718, 0.13014966],
       [0.74251692, 0.65799799, 0.38680959],
       ...,
       [0.36530261, 0.45973899, 0.6073687 ],
       [0.76001429, 0.26114043, 0.54278693],
       [0.9776759 , 0.8250263 , 0.81262429]])

In [86]:
truck = data < 0.70
truck

array([[False, False,  True],
       [ True,  True,  True],
       [False,  True,  True],
       ...,
       [ True,  True,  True],
       [False,  True,  True],
       [False, False, False]])

In [87]:
truck.any(axis=1).sum()

9728

In [88]:
9728/10000

0.9728

The probability of no truck showing up on any of the three days is 2.72% (1 - 0.9728)
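The "no truck in three days" share can also be computed in one step by negating `any(axis=1)`, which avoids the manual subtraction. The sketch below regenerates the simulation so it runs on its own and compares it with the exact value, assuming days are independent.

```python
import numpy as np

np.random.seed(29)
truck = np.random.random((10_000, 3)) < 0.7   # True = a truck showed up that day

# A three-day drought is a row with no True at all
no_truck_3_days = (~truck.any(axis=1)).mean()

print(no_truck_3_days, 0.3 ** 3)   # simulated vs exact
```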

### I. How likely is it that a food truck will show up sometime this week?

In [89]:
food_truck = .7
ndays = ncols = 7
iterations = nrows = 10000
data = np.random.random((nrows, ncols))
data

array([[0.19842854, 0.9064451 , 0.93749594, ..., 0.90026707, 0.48827506,
        0.77701832],
       [0.33798502, 0.27027299, 0.50058364, ..., 0.14922925, 0.79022106,
        0.9981348 ],
       [0.10604764, 0.30832491, 0.99522721, ..., 0.41199631, 0.55972056,
        0.97152305],
       ...,
       [0.80404303, 0.34448305, 0.65407864, ..., 0.76960067, 0.33095556,
        0.39260188],
       [0.90228115, 0.98484751, 0.25866203, ..., 0.10646487, 0.52737205,
        0.18502824],
       [0.24167103, 0.50975916, 0.41583218, ..., 0.53984193, 0.35450694,
        0.83410817]])

In [90]:
truck = data < 0.70
truck
truck.any(axis=1).sum()

9997

It is over 99% likely the truck will be there this week 

### Q8. If 23 people are in the same room, what are the odds that two of them share a birthday? What if it's 20 people? 40?

In [92]:
#Create a sample of 10,000 that uses numbers 1-365 for 23 columns 
same_birthday = np.random.randint(1, 366, size=(10000, 23))
same_birthday

array([[241,  64, 349, ..., 114, 273, 139],
       [ 29,  99, 203, ...,  25, 335, 215],
       [248, 260, 343, ...,  95, 315, 170],
       ...,
       [167, 103, 232, ..., 173, 148, 197],
       [ 41, 137,  47, ..., 120,  63, 114],
       [ 44, 306, 188, ...,  85, 327, 127]])

In [94]:
#Convert the NumPy array to a DataFrame
same_birthday = pd.DataFrame(same_birthday)
same_birthday

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,13,14,15,16,17,18,19,20,21,22
0,241,64,349,31,263,324,246,119,84,205,...,193,335,275,323,94,308,290,114,273,139
1,29,99,203,160,96,50,266,58,306,324,...,230,67,157,183,336,192,336,25,335,215
2,248,260,343,229,340,82,355,243,271,97,...,163,33,284,232,78,95,214,95,315,170
3,219,1,160,16,136,202,14,139,215,345,...,176,225,44,348,37,300,182,222,338,41
4,175,125,29,122,123,22,31,174,218,184,...,180,210,56,349,77,305,122,252,285,199
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,266,274,358,274,240,290,89,339,194,335,...,113,260,130,115,103,323,167,150,225,109
9996,45,221,31,356,15,209,275,268,235,343,...,164,130,24,357,48,64,343,289,222,279
9997,167,103,232,338,52,67,114,291,260,173,...,355,107,327,256,89,310,137,173,148,197
9998,41,137,47,161,47,107,350,100,240,287,...,298,57,324,249,77,104,359,120,63,114


In [96]:
#Use the nunique method to build a boolean Series: True when a row has fewer than 23 unique birthdays (at least two people match), False when all 23 are different
(same_birthday.apply(lambda birthdays: birthdays.nunique(), axis = 1) < 23)

0       False
1        True
2        True
3       False
4        True
        ...  
9995     True
9996     True
9997     True
9998     True
9999    False
Length: 10000, dtype: bool

In [97]:
(same_birthday.apply(lambda birthdays: birthdays.nunique(), axis = 1) < 23).sum()

5093

In [98]:
5093/10000

0.5093

The probability of at least two people sharing a birthday in a room of 23 people is about 51%

In [105]:
same_birthday = np.random.randint(1, 366, size=(10000, 20))

In [106]:
same_birthday = pd.DataFrame(same_birthday)
same_birthday

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,233,364,291,196,207,254,296,299,170,360,68,182,30,362,41,348,233,43,169,310
1,353,265,61,253,235,210,116,144,280,127,12,42,309,189,191,350,280,129,241,147
2,286,193,41,142,259,339,211,67,160,215,263,221,163,54,43,148,347,231,73,329
3,32,194,107,307,65,16,32,172,145,198,28,51,245,189,12,30,275,28,205,223
4,120,138,257,289,69,333,278,57,48,235,339,51,247,176,297,328,361,120,272,291
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,57,228,296,209,23,116,281,42,118,209,233,325,10,75,140,243,292,134,113,128
9996,330,147,328,274,234,291,334,337,171,50,112,269,183,302,98,120,59,58,151,266
9997,312,360,88,27,108,363,212,274,331,319,164,360,326,117,289,183,43,115,158,123
9998,143,37,265,139,296,243,214,103,320,89,299,125,332,179,29,358,102,101,152,283


In [108]:
(same_birthday.apply(lambda birthday: birthday.nunique(), axis=1)<20).sum()

4156

In [109]:
4156/10000

0.4156

In a room of 20 people, there is about a 42% probability that at least two people will share a birthday.

In [113]:
same_birthday = np.random.randint(1, 366, size=(10000, 40))
same_birthday

array([[  6, 205, 127, ...,  67, 337, 213],
       [232,  16, 273, ..., 215, 274, 266],
       [ 42, 342,  42, ..., 200, 203,  12],
       ...,
       [177, 303,  95, ..., 259, 328, 240],
       [  7,  49, 227, ...,  70, 258,  61],
       [212, 336, 285, ...,  56, 351, 188]])

In [115]:
same_birthday = pd.DataFrame(same_birthday)
same_birthday

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,30,31,32,33,34,35,36,37,38,39
0,6,205,127,258,31,235,123,263,360,130,...,322,256,92,332,229,227,250,67,337,213
1,232,16,273,290,70,160,248,359,195,117,...,222,152,223,362,107,196,74,215,274,266
2,42,342,42,130,31,210,264,253,221,85,...,16,165,260,77,86,59,204,200,203,12
3,131,161,80,172,48,295,82,346,23,291,...,97,136,172,152,5,188,263,295,117,348
4,54,201,358,106,79,108,349,46,125,320,...,29,154,244,9,39,352,112,154,117,134
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,157,332,172,26,41,296,63,296,172,22,...,297,50,146,240,323,211,321,63,191,160
9996,281,205,285,151,322,297,143,59,267,37,...,121,352,105,122,312,337,275,293,298,266
9997,177,303,95,207,237,203,323,220,111,3,...,186,146,142,76,2,244,204,259,328,240
9998,7,49,227,244,195,204,33,258,277,73,...,63,90,195,280,221,303,330,70,258,61


In [118]:
(same_birthday.apply(lambda birthday: birthday.nunique(), axis=1) < 40) 

0        True
1       False
2        True
3        True
4        True
        ...  
9995     True
9996     True
9997     True
9998     True
9999     True
Length: 10000, dtype: bool

In [119]:
(same_birthday.apply(lambda birthday: birthday.nunique(), axis=1) < 40).sum()

8897

In [120]:
8897/10000

0.8897

In a room of 40 people, there is almost an 89% probability that at least two people will share a birthday
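The three simulated answers can be compared with the exact birthday-problem values. This is a sketch under the same assumptions as the simulation (365 equally likely birthdays, no leap day); `p_shared_birthday` is a helper defined here for illustration.

```python
def p_shared_birthday(n_people, n_days=365):
    """Exact chance that at least two of n_people share a birthday."""
    p_all_distinct = 1.0
    for k in range(n_people):
        # The (k+1)-th person must avoid the k birthdays already taken
        p_all_distinct *= (n_days - k) / n_days
    return 1 - p_all_distinct

for n in (20, 23, 40):
    print(n, round(p_shared_birthday(n), 4))
```

The exact values come out near 0.41, 0.51, and 0.89, which matches the simulated 0.4156, 0.5093, and 0.8897 to within sampling noise.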