In [1]:
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
import viz # curriculum viz example code

np.random.seed(123)

Exercises
Do your work for this exercise in either a python script named probability_distributions.py or a jupyter notebook named probability_distributions.ipynb.

For the following problems, use python to simulate the problem and calculate an experimental probability, then compare that to the theoretical probability.

1. A bank found that the average number of cars waiting during the noon hour at a drive-up window follows a Poisson distribution with a mean of 2 cars. Make a chart of this distribution and answer these questions concerning the probability of cars waiting at the drive-up window.



In [2]:
# hourly rate of cars in drive-through
rate = 2
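
The exercise asks for a chart of this distribution. A minimal sketch, assuming a bar chart of the probability mass function over 0 to 10 cars is what is wanted (the range and the labels are choices made here, not part of the exercise):

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

rate = 2  # mean number of cars waiting during the noon hour

# car counts to display; the probability of more than 10 cars is negligible
x = np.arange(0, 11)
pmf = stats.poisson(rate).pmf(x)

fig, ax = plt.subplots()
ax.bar(x, pmf)
ax.set_xticks(x)
ax.set_xlabel("Number of cars waiting")
ax.set_ylabel("P(X = x)")
ax.set_title("Poisson distribution, mean = 2")
```

In a notebook the figure renders at the end of the cell; in a script, a call to plt.show() would be needed.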

1a. What is the probability that no cars drive up in the noon hour?


In [3]:
stats.poisson(rate).pmf(0)

0.1353352832366127

1b. What is the probability that 3 or more cars come through the drive through?

In [4]:
stats.poisson(rate).sf(2)

0.32332358381693654

1c. How likely is it that the drive through gets at least 1 car?

In [5]:
stats.poisson(rate).sf(0)

0.8646647167633873
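
The instructions ask for an experimental probability alongside each theoretical one. One possible sketch, assuming 100,000 simulated noon hours drawn with NumPy's Generator API (the seed and the simulation count are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)  # seeded so the run is repeatable
rate = 2
n_simulations = 100_000

# one simulated car count per noon hour
cars = rng.poisson(rate, size=n_simulations)

experimental = {
    "no cars": (cars == 0).mean(),
    "3 or more cars": (cars >= 3).mean(),
    "at least 1 car": (cars >= 1).mean(),
}
theoretical = {
    "no cars": stats.poisson(rate).pmf(0),
    "3 or more cars": stats.poisson(rate).sf(2),
    "at least 1 car": stats.poisson(rate).sf(0),
}

for label in experimental:
    print(f"{label}: simulated {experimental[label]:.4f}, "
          f"theoretical {theoretical[label]:.4f}")
```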

2. Grades of State University graduates are normally distributed with a mean of 3.0 and a standard deviation of .3. Calculate the following:

In [6]:
gpa_mean = 3.0
gpa_std = 0.3

gpa_data=stats.norm(gpa_mean, gpa_std)

2a. What grade point average is required to be in the top 5% of the graduating class?

In [7]:
# top 5%
top_5_percent = gpa_data.ppf(.95)

print('{:,.2f}'.format(top_5_percent))


3.49


2b. What GPA constitutes the bottom 15% of the class?

In [8]:
bottom_15_percent = gpa_data.ppf(.15)

print('{:,.2f}'.format(bottom_15_percent))

2.69


2c. An eccentric alumnus left scholarship money for students in the third decile from the bottom of their class. Determine the range of the third decile. Would a student with a 2.8 grade point average qualify for this scholarship?

In [9]:
# let's look at deciles...
# 1st decile bottom 10%
# 2nd decile = 10-20%
# 3rd decile is 20-30%
# let's find the cutoffs at each end of that range

bottom_20 = gpa_data.ppf(.2)
bottom_30 = gpa_data.ppf(.3)

print(f'The gpa range of the third from bottom decile is {bottom_20:,.2f} to {bottom_30:,.2f}')
# 2.8 lies between the two cutoffs, so that student would qualify

The gpa range of the third from bottom decile is 2.75 to 2.84


In [10]:
# can we do this more efficiently?
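
One way this might be done more compactly: ppf accepts a list of probabilities, so both cutoffs can come from a single call. This is a sketch, and the names lower, upper and qualifies are introduced here for illustration.

```python
from scipy import stats

gpa_data = stats.norm(3.0, 0.3)

# both ends of the third decile (20th and 30th percentiles) in one call
lower, upper = gpa_data.ppf([0.2, 0.3])

# a 2.8 GPA qualifies if it falls inside the decile
qualifies = lower <= 2.8 <= upper

print(f"Third decile: {lower:.2f} to {upper:.2f}; 2.8 qualifies: {qualifies}")
```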


2d. If I have a GPA of 3.5, what percentile am I in?

In [11]:
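# the normal distribution is continuous, so sf(3.5) needs no offset below 3.5
# equivalently, gpa_data.cdf(3.5) * 100 gives the percentile (about 95.2)
gpa_3_5 = gpa_data.sf(3.5) * 100

print(f'A GPA of 3.5 puts a student in the top {gpa_3_5:,.2f}%')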

A GPA of 3.5 puts a student in the top 4.78%
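
The GPA answers can also be checked by simulation. A rough sketch, assuming 100,000 simulated graduates; note that this simple model does not cap GPAs at 4.0.

```python
import numpy as np

rng = np.random.default_rng(123)  # seeded so the run is repeatable
gpas = rng.normal(3.0, 0.3, size=100_000)

# empirical counterparts of ppf(.95), ppf(.15), and the percentile of a 3.5
top_5_cutoff = np.quantile(gpas, 0.95)
bottom_15_cutoff = np.quantile(gpas, 0.15)
percentile_of_3_5 = (gpas <= 3.5).mean() * 100

print(f"top 5% cutoff: {top_5_cutoff:.2f}")
print(f"bottom 15% cutoff: {bottom_15_cutoff:.2f}")
print(f"percentile of a 3.5 GPA: {percentile_of_3_5:.1f}")
```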


3. A marketing website has an average click-through rate of 2%. One day they observe 4326 visitors and 97 click-throughs. How likely is it that this many people or more click through?

In [12]:
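visitors = 4326
ct = 97  # observed click-throughs
acr = 0.02 * visitors  # expected click-throughs at the average 2% rate
test_rate = ct/visitors  # observed click-through rate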

test_rate


0.022422561257512713

In [13]:
marketing = stats.poisson(acr)

In [14]:
# sf(ct - 1) gives P(X >= ct), i.e. this many click-throughs or more
likelihood = marketing.sf(ct-1)
likelihood

0.14211867659283192

In [15]:
# try this with a binomial setup
n_trials = 4326
p = 0.02

ct_data = stats.binom(n_trials, p)

likely = ct_data.sf(ct-1)
likely


0.1397582363130086
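
The Poisson and binomial answers differ slightly because the Poisson model is an approximation to the binomial when the number of trials is large and the success probability is small. A simulation sketch for comparison, assuming 100,000 simulated days (the variable names are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)  # seeded so the run is repeatable
n_visitors = 4326
click_rate = 0.02
observed_clicks = 97

# each simulated day: how many of the 4326 visitors click through
simulated_clicks = rng.binomial(n_visitors, click_rate, size=100_000)
experimental = (simulated_clicks >= observed_clicks).mean()

theoretical_binom = stats.binom(n_visitors, click_rate).sf(observed_clicks - 1)
theoretical_poisson = stats.poisson(n_visitors * click_rate).sf(observed_clicks - 1)

print(f"simulated: {experimental:.4f}")
print(f"binomial:  {theoretical_binom:.4f}")
print(f"poisson:   {theoretical_poisson:.4f}")
```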

4. You are working on some statistics homework consisting of 100 questions where all of the answers are a probability rounded to the hundredths place. Looking to save time, you put down random probabilities as the answer to each question.

In [16]:
# weighted coinflip problem
# 10% chance of correctly guessing 1st digit
# 10% chance of correctly guessing 2nd digit

# 1/100 chance for each question
# binomial dist

n_trials = 100
p = 0.01

guessing = stats.binom(n_trials, p)

4a. What is the probability that at least one of your first 60 answers is correct?

In [17]:
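# only the first 60 answers matter here, so model 60 trials rather than 100;
# sf(0) gives P(X >= 1), the chance that at least one answer is correct
guess_60 = stats.binom(60, p).sf(0)
round(guess_60, 4)

0.4528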
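
A simulation sketch for 4a, treating the first 60 answers as 60 independent guesses that are each correct with probability 0.01 (the 1-in-100 assumption made above). The closed form for comparison is 1 - 0.99**60.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)  # seeded so the run is repeatable
n_questions = 60
p_correct = 0.01

# number of correct guesses among the first 60 answers, per simulated homework
correct_counts = rng.binomial(n_questions, p_correct, size=100_000)
experimental = (correct_counts >= 1).mean()

# P(at least one correct) = 1 - P(none correct)
theoretical = stats.binom(n_questions, p_correct).sf(0)

print(f"simulated {experimental:.4f}, theoretical {theoretical:.4f}")
```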

5. The Codeup staff tends to get upset when the student break area is not cleaned up. Suppose that there's a 3% chance that any one student cleans the break area when they visit it, and, on any given day, about 90% of the 3 active cohorts of 22 students visit the break area. How likely is it that the break area gets cleaned up each day? How likely is it that it goes two days without getting cleaned up? All week?

In [18]:
# 3% chance students don't need their mothers
# visitors per day = 90% of 3*22 = 59.4

In [19]:
# a binomial needs a whole number of trials, so round 59.4 visitors to 59
n_trials = round(0.9 * 3 * 22)
p = 0.03

true_data = stats.binom(n_trials, p)

In [20]:
# likelihood of getting cleaned on any given day
daily_clean = true_data.sf(0)
daily_clean

0.8342199288437355

In [21]:
no_clean = 1-daily_clean

In [22]:
two_days_dirty = no_clean * no_clean
two_days_dirty

0.027483031992576113

In [23]:
# whole week w/o cleaning

no_clean ** 5

0.0001252165138809122
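
These three answers can be checked by simulating whole weeks. A sketch assuming 59 visitors per day, a 5-day week, and independence between days; a full dirty week is rare enough that its simulated estimate will be noisy.

```python
import numpy as np

rng = np.random.default_rng(123)  # seeded so the run is repeatable
n_visitors = 59   # 90% of 3 * 22 students, rounded to a whole number
p_clean = 0.03
n_weeks = 100_000

# cleanings[w, d] = number of students who clean on day d of week w
cleanings = rng.binomial(n_visitors, p_clean, size=(n_weeks, 5))
cleaned = cleanings > 0

daily_clean_sim = cleaned.mean()                        # cleaned on a given day
two_days_dirty_sim = (~cleaned[:, :2]).all(axis=1).mean()  # first two days both dirty
week_dirty_sim = (~cleaned).all(axis=1).mean()          # all five days dirty

print(f"cleaned on a given day: {daily_clean_sim:.4f}")
print(f"two days without cleaning: {two_days_dirty_sim:.4f}")
print(f"whole week without cleaning: {week_dirty_sim:.5f}")
```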

6. You want to get lunch at La Panaderia, but notice that the line is usually very long at lunchtime. After several weeks of careful observation, you notice that the average number of people in line when your lunch break starts is normally distributed with a mean of 15 and standard deviation of 3. If it takes 2 minutes for each person to order, and 10 minutes from ordering to getting your food, what is the likelihood that you have at least 15 minutes left to eat your food before you have to go back to class? Assume you have one hour for lunch, and ignore travel time to and from La Panaderia.

In [24]:
mean_line = 15
std_line = 3
to_order = 2
to_make = 10
lunch_break = 60

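# each person ahead takes to_order minutes, so the wait time is the
# line length scaled by to_order: mean 15*2 = 30, std 3*2 = 6
line_time = stats.norm(mean_line * to_order, std_line * to_order)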

In [25]:
# how much time do I really have?
my_time = 60 - to_make - to_order
my_time

48

In [26]:
max_line_time = my_time - 15
max_line_time

33

In [27]:
victory_chance = line_time.cdf(33)
victory_chance

0.6914624612740131

In [None]:
# let's clean this up
mean_time = 30
std_time = 6
to_order = 2
to_make = 10
lunch_break = 60
eating_time = 15
max_line_time = lunch_break - eating_time - to_order - to_make

lucky = stats.norm(mean_time, std_time).cdf(max_line_time)
lucky
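
A simulation sketch of the same problem, assuming each simulated lunch break draws a line length from the normal model and converts it to minutes. Fractional people are allowed here because the exercise models the line as normally distributed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)  # seeded so the run is repeatable
to_order = 2       # minutes per person, including your own order
to_make = 10       # minutes from ordering to getting food
lunch_break = 60
eating_time = 15

# simulated number of people in line, then minutes spent waiting behind them
people_in_line = rng.normal(15, 3, size=100_000)
wait_minutes = people_in_line * to_order

# minutes left once you have waited, ordered, and received your food
time_left = lunch_break - wait_minutes - to_order - to_make
experimental = (time_left >= eating_time).mean()

max_line_time = lunch_break - eating_time - to_order - to_make
theoretical = stats.norm(15 * to_order, 3 * to_order).cdf(max_line_time)

print(f"simulated {experimental:.4f}, theoretical {theoretical:.4f}")
```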

In [29]:
# setting up connection to SQL database

In [30]:
import pandas as pd
import numpy as np
from pydataset import data

In [37]:
from env import host, user, password
url = f'mysql+pymysql://{user}:{password}@{host}/employees'

In [38]:
def get_db_url(db, user=user, host=host, password=password):
    return f'mysql+pymysql://{user}:{password}@{host}/{db}'
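
If the password contains characters such as @ or /, the URL above can be misparsed. A hedged variant that escapes the credentials with urllib.parse.quote_plus is sketched below; the user, host and password shown are made-up placeholders, not values from env.py.

```python
from urllib.parse import quote_plus


def get_db_url(db, user, host, password):
    """Build a mysql+pymysql connection URL, escaping special characters
    in the credentials so they cannot be confused with URL delimiters."""
    return f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}@{host}/{db}"


# hypothetical credentials, for illustration only
example_url = get_db_url(
    "employees", user="some_user", host="db.example.com", password="p@ss/word"
)
print(example_url)
```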

In [39]:
url = get_db_url('employees')

In [40]:
sql_query = 'SELECT * FROM salaries'

In [41]:
def get_employees_data(db, query=sql_query):
    # run the query against the named database and return the result as a DataFrame
    return pd.read_sql(query, get_db_url(db))

In [42]:
get_employees_data('employees')

         emp_no  salary   from_date     to_date
0         10001   60117  1986-06-26  1987-06-26
1         10001   62102  1987-06-26  1988-06-25
2         10001   66074  1988-06-25  1989-06-25
3         10001   66596  1989-06-25  1990-06-25
4         10001   66961  1990-06-25  1991-06-25
...         ...     ...         ...         ...
2844042  499999   63707  1997-11-30  1998-11-30
2844043  499999   67043  1998-11-30  1999-11-30
2844044  499999   70745  1999-11-30  2000-11-29
2844045  499999   74327  2000-11-29  2001-11-29


In [48]:
salaries_df = pd.read_sql("SELECT * FROM salaries WHERE to_date LIKE '9999%'", url)

7. Connect to the employees database and find the average salary of current employees, along with the standard deviation. For the following questions, calculate the answer based on modeling the employees' salaries with a normal distribution defined by the calculated mean and standard deviation, then compare this answer to the actual values present in the salaries dataset.

In [49]:
# avg salary

mean_salary = salaries_df['salary'].mean()
mean_salary

72012.23585730705

In [50]:
# stddev

std_salary = salaries_df['salary'].std()
std_salary

17309.99538025198

7a. What percent of employees earn less than 60,000?

In [54]:
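# the normal model is continuous, so the one-cent offset below 60,000 is not
# needed; cdf(60_000) would give essentially the same value
less_than_60 = stats.norm(mean_salary, std_salary).cdf(59_999.99)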
less_than_60

0.2438572436502896

7b. What percent of employees earn more than 95,000?

In [55]:
more_than_95 = stats.norm(mean_salary, std_salary).sf(95_000)
more_than_95

0.09208819199804053

7c. What percent of employees earn between 65,000 and 80,000?

In [56]:
# what percent earn more than 65,000?

more_than_65 = stats.norm(mean_salary, std_salary).sf(65_000)
more_than_65

0.6572970780493486

In [57]:
# what percent have more than 80,000?

more_than_80 = stats.norm(mean_salary, std_salary).sf(80_000)
more_than_80

0.32223650950468197

In [58]:
# what percent are in between?

between_65_80 = more_than_65 - more_than_80
between_65_80

0.3350605685446666

7d. What do the top 5% of employees make?

In [59]:
top_5p = stats.norm(mean_salary, std_salary).isf(0.05)
top_5p

100484.64454102777
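
The exercise also asks for a comparison with the actual values in the dataset. A sketch of how that comparison might look is below. The helper name compare_to_model is hypothetical, and because the database is not available here the demonstration uses synthetic stand-in salaries, not real data. With the real data, salaries_df['salary'] would be passed instead, and the two columns may differ more if salaries are not bell-shaped.

```python
import numpy as np
from scipy import stats


def compare_to_model(salaries):
    """Return (normal-model value, observed value) pairs for questions 7a-7d."""
    salaries = np.asarray(salaries, dtype=float)
    # ddof=1 matches the sample standard deviation that pandas .std() reports
    model = stats.norm(salaries.mean(), salaries.std(ddof=1))
    between = (salaries > 65_000) & (salaries < 80_000)
    return {
        "under 60k": (model.cdf(60_000), (salaries < 60_000).mean()),
        "over 95k": (model.sf(95_000), (salaries > 95_000).mean()),
        "65k to 80k": (model.cdf(80_000) - model.cdf(65_000), between.mean()),
        "top 5% cutoff": (model.isf(0.05), np.quantile(salaries, 0.95)),
    }


# synthetic stand-in for salaries_df['salary'] (NOT real data)
rng = np.random.default_rng(123)
fake_salaries = rng.normal(72_000, 17_300, size=50_000)

results = compare_to_model(fake_salaries)
for label, (model_value, observed_value) in results.items():
    print(f"{label}: model {model_value:,.4f}, observed {observed_value:,.4f}")
```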