In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats

#### Probability Distributions: Mathematical functions that we can use to model real-world processes.
- Random variable: a quantity whose value depends on the outcome of a random process (the unknown we want to model)
- A probability distribution is a list of all of the possible outcomes of a random variable along with their corresponding probability values.
    - Defines the probability of every outcome in the sample space (in our experiment)  
    - Sample space: the set of all possible outcomes. For a six-sided die, there are 6 outcomes.
    - Sum of all probabilities in your sample space = 1  (Seems trivial, but incredibly powerful)
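
As a quick check of the last point, the sketch below builds a fair-die distribution with `stats.randint` (the same constructor used later in this lesson) and confirms that the probabilities over the sample space add up to 1. The variable names are illustrative only.

```python
import numpy as np
from scipy import stats

# Fair six-sided die: stats.randint(low, high) covers low..high-1
die = stats.randint(1, 7)

# The sample space is the six faces
outcomes = np.arange(1, 7)

# pmf gives the probability of each individual outcome
probabilities = die.pmf(outcomes)

print(dict(zip(outcomes.tolist(), probabilities.round(4).tolist())))
print(probabilities.sum())  # sums to 1, up to floating-point rounding
```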




![distributions.svg](attachment:distributions.svg)

Discrete distribution:
- Number of customer complaints
- Number of calls received in a call-center per hour
- Number of food trucks at Travis Park in a day

Continuous distribution:
- Height
- Temperature
- Employee salaries


### Types of Distributions:

- Uniform distribution
- Normal distribution
- Binomial distribution
- Poisson distribution


More Probability distributions: https://en.wikipedia.org/wiki/List_of_probability_distributions  
https://www.kdnuggets.com/2020/02/probability-distributions-data-science.html



### Lesson Objectives:
    
- Understand and recognize these distributions
- Understand the parameters we need to generate these distributions
- Given a distribution, calculate probabilities for a given value of the random variable

### Uniform Distribution

Rolling a fair 6-sided die

- What does the probability distribution look like?

![Uniform%20distribution.png](attachment:Uniform%20distribution.png)

#### Working in Scipy stats module

In [None]:
# create a scipy object for underlying distribution

die_distribution = stats.randint(1,7)
die_distribution

#### scipy distribution object: What can we calculate from distribution?


- value -> probability  
    - pmf: probability at a particular value of the random variable (discrete distributions only!)  
    - pdf: probability density at a particular value of the random variable (continuous distributions); this is a density, not a probability  
    - cdf: cumulative probability that the random variable is less than or equal to a value  
    - sf: probability that the random variable is **greater than** a value  
- probability -> value  
    - ppf: the value at or below which the given probability falls (inverse of cdf)  
    - isf: the value above which the given probability falls (inverse of sf)  
- rvs: random values drawn from the distribution
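
The methods listed above can be sketched on the fair die. This is one possible way to call them, not the only one; `die` and the result names are made up for illustration.

```python
from scipy import stats

die = stats.randint(1, 7)  # fair six-sided die, faces 1..6

p_exactly_3 = die.pmf(3)   # P(X == 3) = 1/6
p_at_most_3 = die.cdf(3)   # P(X <= 3) = 3/6
p_above_4 = die.sf(4)      # P(X > 4)  = 2/6

# ppf goes the other way: from a cumulative probability back to a value
median_roll = die.ppf(0.5)  # smallest x with P(X <= x) >= 0.5

print(p_exactly_3, p_at_most_3, p_above_4, median_roll)
```

Note that `sf(4)` excludes 4 itself, which is why "higher than 4" maps directly onto `sf`.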

In [None]:
# rvs gives random values from a given distribution
die_distribution.rvs(10)

In [None]:
# What is probability of rolling 3?



In [None]:
# What is probability of rolling 3 or less?



In [None]:
# Given a probability calculate value of random variable (inverse of cdf)


In [None]:
# What is the likelihood we roll a value higher than 4?



In [None]:
# There is a 1/3 chance a die roll will be higher than what value?



#### Examples of uniform distribution in real life:
- rolling a fair die
- flipping a coin
- lucky draw


### Normal Distribution

- Bell shaped
- Most observations are closer to the mean
- Common in nature. Examples
    - Height
    - time a flight takes from point A to B
    - manufacturing
- 2 parameters
    - mean ($\mu$)
    - std dev ($\sigma$)


#### Suppose that a store's daily sales are normally distributed with a mean of 12,000 dollars and a standard deviation of 2,000 dollars.
 - What is the probability that sales are 10,000 dollars on a certain day?     
 - What is the probability that sales are 10,000 dollars <ins>or less</ins> on a certain day?
 - What is the probability that sales are greater than 15,000 dollars on a certain day?
 - How much would the daily sales have to be to be in the top 10% of all days?
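
A minimal sketch of how these four questions map onto SciPy methods, assuming the mean and standard deviation stated above. The name `sales_dist` is illustrative, and the approximate values in the comments are rounded.

```python
from scipy import stats

# Assumed setup, taken from the problem statement above
sales_dist = stats.norm(12_000, 2_000)

# Exactly 10,000: for a continuous variable this probability is 0.
# pdf returns a density (height of the curve), not a probability.
density_at_10k = sales_dist.pdf(10_000)

p_10k_or_less = sales_dist.cdf(10_000)  # P(X <= 10,000), about 0.159
p_above_15k = sales_dist.sf(15_000)     # P(X > 15,000), about 0.067
top_10_cutoff = sales_dist.isf(0.10)    # exceeded on 10% of days, about 14,563

print(density_at_10k, p_10k_or_less, p_above_15k, top_10_cutoff)
```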

![normal_dist.svg](attachment:normal_dist.svg)


![figure1.svg](attachment:figure1.svg)

#### What is the probability that sales are 10,000 dollars on a certain day?
#### What is the probability that sales are 10,000 dollars or less on a certain day?
#### What is the probability that sales are greater than 15,000 dollars on a certain day?
#### How much would the daily sales have to be to be in the top 10% of all days?

In [None]:
# stats.norm(mean, std_dev)

In [None]:
# Random variable?
# X = daily sales

# parameters defining the normal dist
mean = 12_000
std_dev = 2000

# Use the SciPy stats module to generate a distribution:
sales = stats.norm(mean, std_dev)

# use the appropriate method to calculate a probability
sales

In [None]:
# What is the probability that sales are 10,000 dollars on a certain day?
# For a continuous distribution the probability of any single exact value is 0;
# pdf returns the probability density at that value, not a probability.


In [None]:
# What is the probability that sales are 10,000 dollars or less on a certain day?
# Use cdf to find the probability of any given X or less.


In [None]:
# What is the probability that sales are greater than 15,000 dollars on a certain day?
# Use sf to find the probability of a value greater than any given X.



The survival function (sf) gives the probability that our random variable falls above a certain value. This is the same as 1 minus the cdf at that value.
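
The relationship between `sf` and `cdf` can be checked numerically. The sketch below reuses the sales parameters from the example; `threshold` and the other names are illustrative.

```python
import numpy as np
from scipy import stats

sales_dist = stats.norm(12_000, 2_000)  # daily sales from the example above

threshold = 15_000
via_sf = sales_dist.sf(threshold)
via_cdf = 1 - sales_dist.cdf(threshold)

# The two routes agree up to floating-point rounding
print(via_sf, via_cdf, np.isclose(via_sf, via_cdf))
```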

In [None]:
# Given a probability what is the value of X?
# How much would the daily sales have to be to be in the top 10% of all days?


#### Summary: SciPy provides many different ways of interacting with various statistical distributions through its stats module.

- pmf / pdf
- cdf / ppf
- sf / isf
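
Each pair above consists of a function and its counterpart. As a rough illustration (the distribution and the value `x` are arbitrary choices), `ppf` undoes `cdf` and `isf` undoes `sf`:

```python
import numpy as np
from scipy import stats

dist = stats.norm(12_000, 2_000)  # any continuous distribution would do

x = 13_500
# value -> probability -> value, via cdf and ppf
round_trip_cdf = dist.ppf(dist.cdf(x))
# value -> probability -> value, via sf and isf
round_trip_sf = dist.isf(dist.sf(x))

print(round_trip_cdf, round_trip_sf)  # both recover x up to rounding

# pmf is the discrete counterpart of pdf
die = stats.randint(1, 7)
print(die.pmf(3), dist.pdf(x))
```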


#### Mini Exercise:

The average battery life for a fully charged iPhone 12 is 14 hours, with a standard deviation of 1.5 hours.


1. What kind of probability distribution represents the random variable "battery life in hours"?


2. What are the appropriate defining parameters for this distribution?


3. Create a Scipy object/instance for this distribution


4. Use the object created above and choose the appropriate method (e.g. pmf, cdf, ppf etc.) to calculate the following:  


     a. What is the probability that the cell phone battery lasts more than 16 hours?  
     b. What is the probability that the cell phone battery lasts for exactly 12 hours?  
     c. What is the probability that the cell phone battery lasts for 12 hours or less?  
     d. How many hours does the battery last for the top 25% longest-lasting phones?  
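
One possible approach to this exercise, sketched under the assumption that battery life is normally distributed with the stated mean and standard deviation. The variable names are illustrative.

```python
from scipy import stats

# Assumption: battery life is modeled as normal with mean 14 and std dev 1.5
battery_life = stats.norm(14, 1.5)

p_more_than_16 = battery_life.sf(16)    # a. P(X > 16)
p_exactly_12 = 0.0                      # b. a continuous variable hits any exact value with probability 0
p_12_or_less = battery_life.cdf(12)     # c. P(X <= 12)
top_25_cutoff = battery_life.isf(0.25)  # d. hours exceeded by the top 25% of phones

print(p_more_than_16, p_exactly_12, p_12_or_less, top_25_cutoff)
```

Because 12 and 16 sit the same distance from the mean, answers (a) and (c) come out equal, which is a handy sanity check on the symmetry of the normal curve.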

In [None]:
# distribution?

# Normal distribution

# what are the defining parameters for the distributions?
mean = 14
std = 1.5

# create a scipy object/instance of this distribution
battery = stats.norm(mean, std)

In [None]:
# What is the probability that the cell phone battery lasts more than 16 hours?


In [None]:
# What is the probability that the cell phone battery lasts for 12 hours or less?



In [None]:
# How many hours does the battery last for the top 25% longest-lasting phones?


## Binomial Distribution 

 Binomial distribution is a <ins>discrete</ins> probability distribution.
 
 
Defined by 
 - Number of Trials (sequence of n trials)
 - Probability of 'success' in each trial



### Assumptions:
- Two potential outcomes per trial
- Probability of success is the same across all trials
- Each trial is independent

### Random variable:
X = <span style="color:red">**Number of successes in n trials**</span>
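
For reference, the probability of exactly k successes follows the standard binomial formula, C(n, k) · p^k · (1 − p)^(n − k). The sketch below computes it by hand and compares it with `stats.binom`; `binomial_pmf` is a hypothetical helper written for this illustration, not part of SciPy.

```python
from math import comb
from scipy import stats

def binomial_pmf(k, n, p):
    """Hypothetical helper: P(X = k) for n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 5, 0.5
for k in range(n + 1):
    by_hand = binomial_pmf(k, n, p)
    by_scipy = stats.binom(n, p).pmf(k)
    print(k, by_hand, by_scipy)
```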


#### Example: Suppose we flip a fair coin 5 times in a row. What is the probability of getting exactly 1 head?

Random variable X = Number of heads (success) from flipping a coin 5 times


 What is a trial?

Define what is 'success'.

Total possible outcomes for 5 coin flips = $2 \times 2 \times 2 \times 2 \times 2 = 2^5 = 32$
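
The count of 32 can be confirmed by brute force. This sketch simply lists every sequence of 5 flips and counts those with exactly one head; it is meant as an illustration of the counting argument, not an efficient method.

```python
from itertools import product

# Every possible sequence of 5 flips, e.g. ('H', 'T', 'T', 'H', 'T')
sequences = list(product("HT", repeat=5))
total = len(sequences)

# Keep the sequences with exactly one head
one_head = [s for s in sequences if s.count("H") == 1]

print(total, len(one_head), len(one_head) / total)  # 32 5 0.15625
```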

In [None]:
# Probability (X == 0) i.e. 0 heads in 5 trials?




In [None]:
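# Probability (X = 1) i.e. 1 head in 5 trials
# The single head can land on any of the 5 flips, and each specific
# sequence of 5 flips has probability 0.5 ** 5.
5 * 0.5 ** 5
# 5/32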

#### Using Scipy's stats module

In [None]:
# Binomial Parameters:
n_trials = 5
p =  0.5  # probability of success

flips = stats.binom(n_trials, p)

In [None]:
# prob of getting 0 heads 


In [None]:
# prob of getting 2 heads 



![binom_coin_toss0_5.svg](attachment:binom_coin_toss0_5.svg)

__________________________________________
#### Rigged coin distribution
- Probability of success (getting 'heads' in a coin toss) = 0.7  
- number of trials = 5

In [None]:
# what is probability of getting 5 heads?


![binom_coin_toss0_7.svg](attachment:binom_coin_toss0_7.svg)

___________________________________

#### Rigged coin distribution
Probability of success (getting heads in a coin toss) = 0.2  
Number of trials = 5
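
To see how the success probability shifts the distribution, the sketch below compares the chance of 5 heads in 5 flips for the fair coin and the two rigged coins in this section. Since all five flips must be heads, the result should match `p ** 5` in each case.

```python
from scipy import stats

n_trials = 5
for p_heads in (0.5, 0.7, 0.2):
    coin = stats.binom(n_trials, p_heads)
    # All 5 flips must be heads, so this equals p_heads ** 5
    print(p_heads, coin.pmf(5), p_heads ** n_trials)
```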

In [None]:
# what is probability of getting 5 heads?

![binom_coin_toss0_2.svg](attachment:binom_coin_toss0_2.svg)

#### Example Binomial distribution:  
You are taking a multiple choice test consisting of 30 questions that you forgot to study for. Each question has 4 possible answers and you will choose one at random. What is the probability you get 11 or more questions right?
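
A sketch of one way to answer this, assuming each guess is independent with a 1-in-4 chance of being right. The detail to watch is that "11 or more" is the same as "more than 10", so the survival function is called with 10. The names below are illustrative.

```python
from scipy import stats

n_trials = 30   # questions
p = 0.25        # chance of guessing one question correctly

test_scores = stats.binom(n_trials, p)

# "11 or more" means strictly greater than 10, so pass 10 to sf
p_11_or_more = test_scores.sf(10)

# Same quantity by adding up the individual probabilities for 11..30
check = sum(test_scores.pmf(k) for k in range(11, n_trials + 1))

print(p_11_or_more, check)  # roughly 0.1
```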

In [None]:
# What kind of distribution is this? 
# Binomial dist

#success = answering question correctly

n_trials = 30
p = 0.25

# Random variable X: # Number of questions answered correctly


![binom_example1-2.svg](attachment:binom_example1-2.svg)

#### Mini Exercise

The probability that a visitor will make a purchase when browsing in your web store is 1.5%. You expect 350 web visitors today.


1. What kind of probability distribution do you have for the number of visitors who end up making a purchase?


2. What are the appropriate defining parameters for this distribution?


3. Create a Scipy object/instance for this distribution



4. Use the object created above and choose the appropriate method (e.g. pmf, cdf, ppf etc.) to calculate the following:  


     a. What is the probability that exactly 10 visitors will make a purchase?
     b. What is the probability that 13 or more visitors will make a purchase?  
     c. What is the probability that 10 or fewer visitors will make a purchase?
     d. Visualize the resulting distribution (hint: try a bar chart)
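
One possible way to set up the calculations, assuming visitors decide independently of one another. The name `purchases` and the range chosen for the chart data are illustrative choices; the arrays `x` and `y` could be passed to a bar chart such as `plt.bar(x, y)`.

```python
import numpy as np
from scipy import stats

purchases = stats.binom(350, 0.015)  # n visitors, p purchase probability

p_exactly_10 = purchases.pmf(10)   # a. P(X == 10)
p_13_or_more = purchases.sf(12)    # b. P(X >= 13) = P(X > 12)
p_10_or_fewer = purchases.cdf(10)  # c. P(X <= 10)

# d. values that could feed a bar chart
x = np.arange(0, 21)
y = purchases.pmf(x)

print(p_exactly_10, p_13_or_more, p_10_or_fewer)
```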

In [None]:
sales = stats.binom(350, 0.015)

In [None]:
# What is the probability that exactly 10 visitors will make a purchase?


In [None]:
# What is the probability that 13 or more visitors will make a purchase?


In [None]:
# What is the probability that 10 or fewer visitors will make a purchase?


In [None]:
# plot the data


### Poisson Distribution

- discrete probability distribution 
- expresses the probability of a given **number of events** occurring in a fixed interval of time or space
- No upper bound on the number of events (unlike the binomial distribution)
- Only one parameter ($\lambda$), which is the rate at which the event happens.
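
For reference, the standard formula for the Poisson probability of observing exactly $k$ events when the rate is $\lambda$ is:

$$P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \dots$$

For instance, with $\lambda = 1$ and $k = 0$ this reduces to $e^{-1} \approx 0.368$.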

### Assumptions
- Events occur independently
- The probability that an event occurs in a given length of time does not change through time

#### Real life examples:

Number of deaths by horse kick in the Prussian army

Telecommunications: number of calls arriving at customer service.  

Astronomy: photons arriving at a telescope.

Biology: the number of mutations on a strand of DNA per unit length.  

Management: customers arriving at a counter or call centre.

Finance and insurance: number of losses or claims occurring in a given period of time.  

Radioactivity: number of decays in a given time interval in a radioactive sample.




#### Example
Suppose that astronomers estimate that large meteorites (above a certain size) hit the earth on average once every 100 years (λ = 1 event per 100 years), and that the number of meteorite hits follows a Poisson distribution.

What is the probability of k = 0 meteorite hits in the next 100 years?

In [None]:
# What kind of distribution is this?

λ = 1 # number of strikes per 100 years

stats.poisson(λ).pmf(0)

In [None]:
#plot the data
x = np.arange(0,10)
y = stats.poisson(λ).pmf(x)

plt.bar(x,y)
plt.xlabel('No of meteorite strikes per 100 years')
plt.ylabel('P(X)')
plt.title('Poisson distribution $λ = 1$');

#### Mini Exercise:

The average number of customers going through the CVS drive-through is 8 per hour. 

1. What kind of distribution are we working with?


2. What are the appropriate defining parameters for this distribution?


3. Create a Scipy object/instance for this distribution



4. Use the object created above and choose the appropriate method (e.g. pmf, cdf, ppf etc.) to calculate the probability that 11 customers will go through in the next hour.


In [None]:
# What kind of distribution?

In [None]:
# What are the appropriate defining parameters for this distribution?
λ = 8 # events per hour

In [None]:
# Create a Scipy object/instance for this distribution
stats.poisson(λ)

In [None]:
# calculate the probability that 11 customers will go through in the next hour.
stats.poisson(λ).pmf(11)

## Summary:

Types of Distribution:
1. Normal distribution
    - mean($\mu$)
    - std dev ($\sigma$)
    - stats.norm($\mu$,  $\sigma$ )


2. Binomial distribution
    - number of trials (n)
    - probability of success (p)
    - stats.binom(n, p )
    - X = number of successes in n trials


3. Poisson distribution
    - rate ($\lambda$)
    - stats.poisson($\lambda$)
    - X = number of events per unit time
    
 
For any of the probability distributions defined above, we can answer different questions using the following functions:

- pmf/pdf
- cdf/ppf
- sf/isf


### Bonus Material
### Relationship between Binomial, Normal and Poisson Distribution

![Untitled%20presentation.svg](attachment:Untitled%20presentation.svg)
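
One commonly cited link between these distributions, which may or may not be exactly what the figure shows, is that a binomial with many trials and a small success probability is close to a Poisson with rate λ = n · p. The sketch below compares the two using the web-store numbers from the earlier exercise; the range of k values is an arbitrary choice.

```python
import numpy as np
from scipy import stats

n, p = 350, 0.015               # web-store numbers from the exercise above
binomial = stats.binom(n, p)
poisson = stats.poisson(n * p)  # rate = expected number of successes

k = np.arange(0, 21)
gap = np.abs(binomial.pmf(k) - poisson.pmf(k))

# Largest pointwise difference between the two pmfs over k = 0..20
print(gap.max())
```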


Reference: https://www.youtube.com/watch?v=u9onO78hDlw