# Probability Distributions

What are they?
- Mathematical functions that we can use to model real-world processes

Why do we care?
- We can calculate probabilities using these distributions instead of creating simulations

### Today:

- review some vocabulary 
- understand and recognize the four main distributions
    - uniform
    - normal
    - binomial
    - poisson
- given a distribution, calculate probabilities for certain values of the random variable using the `scipy.stats` module

#### Vocabulary 

- Probability distribution
    - a list of all of the possible outcomes of a random variable along with their corresponding probability values
    - defines the probability of every outcome in the sample space (in our experiment)  


- Random Variable
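    - the unknown quantity whose value is determined by the outcome of a random process (ex. the number rolled on a die)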


- Sample space
    - the set of all possible outcomes
        - ex. for a die, we have 6 outcomes
        - sum of all probabilities in your sample space = 1  (Seems trivial, but incredibly powerful)

#### A probability distribution can be on either discrete values or continuous values. 

![distributions.svg](attachment:distributions.svg)

Examples: 
- Discrete distribution
    - Number of customer complaints
    - Number of calls received in a call-center per hour
    - Number of food trucks at Travis Park in a day


- Continuous distribution
    - Height
    - Temperature
    - Employee salaries

### Types of Distributions:

- Uniform distribution
- Normal distribution
- Binomial distribution
- Poisson distribution


More Probability distributions: 
- https://en.wikipedia.org/wiki/List_of_probability_distributions  
- https://www.kdnuggets.com/2020/02/probability-distributions-data-science.html


In [1]:
#standard imports
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

#stats module
from scipy import stats


## Uniform Distribution

- the probability of every outcome is the same
- here it is a discrete probability distribution (a continuous uniform distribution also exists, `stats.uniform`)

Defined by:
- outcomes

Examples:
- rolling a die
- flipping a coin
- lucky draw


### rolling a fair 6-sided die

- this is the probability distribution
    - aka the probability of every outcome (1,2,3,4,5,6) occurring

![Uniform%20distribution.png](attachment:Uniform%20distribution.png)

In [None]:
#probability of each outcome


In [None]:
#sum of the probabilities of all outcomes


- sum of all probabilities in your sample space = 1
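
As a quick check of the two cells above, here is a minimal sketch that does not use scipy yet. The names `outcomes` and `probabilities` are illustrative and not part of the original notebook.

```python
import numpy as np

# the six outcomes of a fair die (illustrative names)
outcomes = np.arange(1, 7)

# every outcome is equally likely, so each gets 1/6
probabilities = np.full(len(outcomes), 1 / len(outcomes))

# the probabilities over the whole sample space add up to 1
total = probabilities.sum()
print(probabilities)
print(total)
```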

#### scipy stats module - define our distribution

- use `stats.randint()`
    - generate a uniform distribution
    - send inclusive lower bound and exclusive upper bound
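
A minimal sketch of defining the die with `stats.randint`. The variable name `die` is an assumption made for illustration.

```python
from scipy import stats

# fair 6-sided die: lower bound 1 is included, upper bound 7 is excluded
die = stats.randint(1, 7)

# the smallest and largest values the distribution can produce
low, high = die.support()
print(low, high)

# the average roll of a fair die
print(die.mean())
```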

In [None]:
#create distribution


#### scipy distribution object: What can we calculate from distribution?

- `rvs` to generate random values 


- given a **value(s)** -> return the **probability** of that value(s) happening 


- given a **probability** -> return the **values** at that probability  
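
One possible way to use `rvs` is to repeat the simulation approach with scipy generating the rolls. This is a sketch only: the sample size and `random_state` value are arbitrary choices, and `die` and `rolls` are illustrative names.

```python
from scipy import stats

die = stats.randint(1, 7)

# draw 100,000 simulated rolls (random_state makes the sketch reproducible)
rolls = die.rvs(size=100_000, random_state=42)

# share of simulated rolls equal to 3, which should land close to 1/6
estimate = (rolls == 3).mean()
print(estimate)
```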


In [None]:
#generate random variables


#### what is probability of rolling 3?

In [None]:
#get lots of random variables


In [None]:
#calculate probability like we did with simulations


#### shortcut! use our scipy.stats chart!

- the top arm is when we are looking at a single value
- the middle and bottom arm are when we are looking at multiple values

#### what is probability of rolling 3?
- use `pmf`: Probability Mass Function

In [None]:
#longish way


In [None]:
#using probability function


Why are we using `pmf` here? 
 - we are looking at a single value, 3
 - for that single value, we want to know the probability of it occurring
 - lastly, we are looking at a discrete distribution
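
A short sketch comparing the by-hand answer with `pmf`, assuming the die is defined as `stats.randint(1, 7)` (the name `die` is illustrative).

```python
from scipy import stats

die = stats.randint(1, 7)

# longer way: one favorable outcome out of six equally likely outcomes
by_hand = 1 / 6

# pmf: probability of exactly one discrete value
p_three = die.pmf(3)
print(by_hand, p_three)
```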

#### what is probability of rolling 3 or less? (roll 1, 2, or 3)
- use `cdf`: Cumulative Distribution Function

In [None]:
#longish way


In [None]:
#using probability function


Why are we using `cdf` here? 
 - we are looking at a range of values
 - we want all values that are LESS THAN OR EQUAL TO
 - we are given the value (3) and want the probability

Note: the cdf is INCLUDING 3 in its calculation
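
A sketch of both routes to P(roll <= 3), again assuming a die defined as `stats.randint(1, 7)`.

```python
from scipy import stats

die = stats.randint(1, 7)

# longer way: add up the pmf of every value we care about
by_hand = die.pmf([1, 2, 3]).sum()

# cdf: probability of a value LESS THAN OR EQUAL TO 3
p_three_or_less = die.cdf(3)
print(by_hand, p_three_or_less)
```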

#### Given a probability of .5, what value is the random variable less than or equal to?
- use `ppf`: Percent Point Function (inverse of cdf)

- we are giving it a probability of 50% and this corresponds to rolling a value of 3 or less
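
A sketch of the inverse lookup with `ppf`, assuming the same `stats.randint(1, 7)` die. The result should correspond to the value 3 described above.

```python
from scipy import stats

die = stats.randint(1, 7)

# ppf: given a probability, return the value the roll is less than or equal to
value = die.ppf(0.5)
print(value)

# feeding that value back into cdf recovers the probability
print(die.cdf(value))
```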

#### What is the likelihood we roll a value higher than 4? (roll 5 or 6)

- use `sf`: Survival Function (1-cdf)

Why are we using `sf` here? 
 - we are looking at a range of values
 - we want all values that are GREATER THAN
 - we are given the value (4) and want the probability

Note: the sf is NOT INCLUDING 4 in its calculation
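
A sketch showing that `sf` matches `1 - cdf` for the die, assuming it is defined as `stats.randint(1, 7)`.

```python
from scipy import stats

die = stats.randint(1, 7)

# sf: probability of a value strictly GREATER THAN 4 (rolling 5 or 6)
p_higher_than_four = die.sf(4)

# same answer via the complement of the cdf
complement = 1 - die.cdf(4)
print(p_higher_than_four, complement)
```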

In [None]:
#show cdf


#### There is a 1/3 chance a die roll will be higher than what value?

- use `isf`: Inverse Survival Function (inverse of sf)
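
A sketch of the `isf` lookup for the same die. One caveat worth stating as an assumption: 1/3 cannot be stored exactly as a float, and on a discrete distribution that rounding can nudge the returned value at a boundary, so the sketch cross-checks the intended answer (4) with `sf` rather than relying on `isf` alone.

```python
from scipy import stats

die = stats.randint(1, 7)

# isf: given the probability of being GREATER THAN, return the value
value = die.isf(1 / 3)
print(value)

# cross-check: the chance of rolling higher than 4 is 1/3
check = die.sf(4)
print(check)
```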

## Normal Distribution

- bell shaped
- most observations are closer to the mean
- it is a continuous probability distribution

Defined by:
- mean ($\mu$)
- std dev ($\sigma$)


Examples (common in nature):
- Height
- time a flight takes from point A to B
- manufacturing

![image-3.png](attachment:image-3.png)

### Suppose that a store's daily sales are normally distributed with a mean of 12,000 dollars and standard deviation of 2000 dollars.
 - What is the probability that sales are 10,000 dollars on a certain day?    
 - What is the probability that sales are 10,000 dollars <ins>or less</ins> on a certain day?
 - What is the probability that sales are <ins>greater than</ins> 15,000 dollars on a certain day?
 - How much would the daily sales have to be to be in the top 10% of all days?

![normal_dist.svg](attachment:normal_dist.svg)

- this shows the relative likelihood (probability density) of each daily sales amount

![figure1.svg](attachment:figure1.svg)

### Suppose that a store's daily sales are normally distributed with a mean of 12,000 dollars and standard deviation of 2000 dollars.

#### scipy stats module - define our distribution

- `stats.norm` 
    - will generate a normal distribution
    - send in a mean and standard deviation
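
A minimal sketch of defining the sales distribution. The names `mean`, `sd` and `sales` are illustrative, and the `random_state` value is an arbitrary choice for reproducibility.

```python
from scipy import stats

# parameters from the problem statement
mean = 12_000
sd = 2_000

# normal distribution of daily sales
sales = stats.norm(mean, sd)

# a few simulated days of sales
sample = sales.rvs(size=5, random_state=42)
print(sample)
```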

In [None]:
#set mean and sd


In [None]:
#make the distribution


####  generate random variables from the distribution


In [None]:
#plot the random variables


### Answer questions using the same distribution chart!

####  What is the probability that sales are 10,000 dollars on a certain day?

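- use `pdf`: Probability Density Function
- for a continuous distribution, `pdf` returns a density (relative likelihood), not a probability; the probability of any single exact value is 0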
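
A sketch of what `pdf` returns for the sales distribution, assuming it is defined as `stats.norm(12_000, 2_000)`. The one-dollar window is an arbitrary illustration of how a density turns into a probability over an interval.

```python
from scipy import stats

sales = stats.norm(12_000, 2_000)

# pdf returns the height of the density curve at 10,000, not a probability
density = sales.pdf(10_000)
print(density)

# a probability needs an interval, e.g. sales within half a dollar of 10,000
p_window = sales.cdf(10_000.5) - sales.cdf(9_999.5)
print(p_window)
```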

In [None]:
#exactly 10,000!


#### What is the probability that sales are 10,000 dollars or less on a certain day?
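
A sketch using `cdf`, assuming the distribution is defined as `stats.norm(12_000, 2_000)`. Since 10,000 is one standard deviation below the mean, the answer should be roughly 16%.

```python
from scipy import stats

sales = stats.norm(12_000, 2_000)

# cdf: probability that sales are LESS THAN OR EQUAL TO 10,000
p_low_day = sales.cdf(10_000)
print(p_low_day)
```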

In [None]:
#this should INCLUDE 10,000


#### What is the probability that sales are greater than 15,000 dollars on a certain day?
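
A sketch using `sf`, assuming the distribution is defined as `stats.norm(12_000, 2_000)`. Here 15,000 is 1.5 standard deviations above the mean.

```python
from scipy import stats

sales = stats.norm(12_000, 2_000)

# sf: probability that sales are GREATER THAN 15,000
p_big_day = sales.sf(15_000)
print(p_big_day)

# same answer via the complement of the cdf
print(1 - sales.cdf(15_000))
```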

In [None]:
#greater than 15_000


#### How much would the daily sales have to be to be in the top 10% of all days?
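
A sketch using `isf`, assuming the distribution is defined as `stats.norm(12_000, 2_000)`. Being in the top 10% means only 10% of days are greater, which is the same cutoff as the 90th percentile from `ppf`.

```python
from scipy import stats

sales = stats.norm(12_000, 2_000)

# isf: sales value that only 10% of days exceed
top_ten_cutoff = sales.isf(0.10)
print(top_ten_cutoff)

# equivalent lookup: 90% of days are at or below this value
print(sales.ppf(0.90))
```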

In [None]:
#given percent and we want the value


## Recap

Scipy provides many different ways of interacting with various statistical distributions through its stats module.

1. first we determine and define our distribution
    - uniform (discrete values)
        - `stats.randint`
    - normal (continuous values)
        - `stats.norm`
    
    
2. use scipy.stats module to extract data from our distribution
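    - `pmf` / `pdf`: given a single value -> probability (`pmf`) or density (`pdf`) at that value
    - `cdf` / `ppf`: `cdf` gives the probability that a value is LESS THAN OR EQUAL TO; `ppf` is its inverse (probability -> value)
    - `sf`  / `isf`: `sf` gives the probability that a value is GREATER THAN; `isf` is its inverse (probability -> value)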


#### Mini Exercise:

The average battery life for a fully charged iPhone 12 is 14 hours with a standard deviation of 1.5 hours.


1. What kind of probability distribution represents the random variable "battery life in hours"?


2. What are the appropriate defining parameters for this distribution?


3. Create a Scipy object/instance for this distribution


4. Use the object created above and choose the appropriate method (e.g. pmf, cdf, ppf etc.) to calculate the following:  


     a. What is the probability that the cell phone battery lasts more than 16 hours?  
     b. What is the probability that the cell phone battery lasts for exactly 12 hours?  
     c. What is the probability that the cell phone battery lasts for 12 hours or less?  
     d. How many hours does the battery last for the top 25% longest-lasting phones?  

## Binomial Distribution 

- models the number of 'successes' in a fixed number of independent trials, each with two possible outcomes
- it's a <ins>discrete</ins> probability distribution
 
 
Defined by:
 - number of Trials (sequence of n trials)
 - probability of 'success' in each trial



Assumptions:
- Two potential outcomes per trial
- Probability of success is same across all trials
- Each trial is independent

Random variable:
- X = number of successes in n trials

### Suppose we flip a fair coin 5 times in a row

Random variable
- X = Number of heads (success) from flipping a coin 5 times


What is a trial? 
- one coin flip

What are the outcomes?
- heads/tails

What is 'success'?
- getting heads

Total possible outcomes for 5 coin flips
- 2 * 2 * 2 * 2 * 2

In [None]:
#calculate all possible outcomes


#### What is the probability of getting exactly 0 heads in 5 trials? i.e. P(X = 0)
- TTTTT

#### What is the probability of getting exactly 1 head in 5 trials? i.e. P(X = 1)
- HTTTT
- THTTT
- TTHTT
- TTTHT
- TTTTH
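
The counting above can be sketched by listing every possible sequence of 5 flips and counting heads. This enumeration is only an illustration (the names `flips`, `zero_heads` and `one_head` are made up here) and would not scale to a large number of trials.

```python
from itertools import product

# every possible sequence of 5 coin flips, e.g. ('H', 'T', 'T', 'T', 'T')
flips = list(product("HT", repeat=5))
total = len(flips)

# count the sequences with exactly 0 heads and exactly 1 head
zero_heads = sum(seq.count("H") == 0 for seq in flips)
one_head = sum(seq.count("H") == 1 for seq in flips)

# with a fair coin every sequence is equally likely
print(total)
print(zero_heads / total, one_head / total)
```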

#### scipy stats module - define our distribution
- `stats.binom`
- send in number of trials and probability of success

In [None]:
#define number of trials and probability of winning


In [None]:
#create distribution


![binom_coin_toss0_5.svg](attachment:binom_coin_toss0_5.svg)

#### What is probability of getting exactly 0 heads in 5 trials?

#### What is probability of getting exactly 1 head in 5 trials?

Why are we using `pmf` for both of these?
- we are given a single value and need probability
- this is a discrete distribution
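
A sketch of both answers with `stats.binom`, for 5 flips of a fair coin. The names `n_trials`, `p_heads` and `coin` are illustrative.

```python
from scipy import stats

# 5 flips of a fair coin, success = heads
n_trials = 5
p_heads = 0.5
coin = stats.binom(n_trials, p_heads)

# pmf: probability of exactly 0 heads and exactly 1 head
p_zero = coin.pmf(0)
p_one = coin.pmf(1)
print(p_zero, p_one)
```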

### Rigged coin distribution
- probability of success (getting 'heads' in a coin toss) = 0.7  
- number of trials = 5

#### define our distribution

#### what is probability of getting 1 head?


![binom_coin_toss0_7.svg](attachment:binom_coin_toss0_7.svg)

___________________________________

### Rigged coin distribution again
- probability of success (getting heads in a coin toss) = 0.2  
- number of trials = 5

#### define our distribution

#### what is probability of getting 1 head?
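
A sketch comparing the two rigged coins (success rates 0.7 and 0.2, 5 trials each). The names `heavy_heads` and `heavy_tails` are made up for illustration.

```python
from scipy import stats

# rigged coins: same number of trials, different probability of heads
heavy_heads = stats.binom(5, 0.7)
heavy_tails = stats.binom(5, 0.2)

# probability of exactly 1 head in 5 flips for each coin
p_one_heavy_heads = heavy_heads.pmf(1)
p_one_heavy_tails = heavy_tails.pmf(1)
print(p_one_heavy_heads, p_one_heavy_tails)
```

Exactly 1 head is much more likely with the 0.2 coin, since most of its flips are expected to land tails.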

![binom_coin_toss0_2.svg](attachment:binom_coin_toss0_2.svg)

### You are taking a multiple choice test consisting of 30 questions that you forgot to study for. Each question has 4 possible answers and you will choose one at random. What is the probability you get 11 or more questions right?

define: 
- what are the outcomes? 
- what is success? 
- probability of success
- number of trials 
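
One way to set this up, as a sketch: the outcomes are right/wrong, success is a right answer, the probability of success is 1/4 (one correct answer out of 4, chosen at random), and there are 30 trials. Since `sf` is strictly GREATER THAN, "11 or more" is `sf(10)`. The name `quiz` is illustrative.

```python
from scipy import stats

# 30 questions, each guessed at random from 4 choices
n_questions = 30
p_correct = 1 / 4
quiz = stats.binom(n_questions, p_correct)

# 11 or more right is the same as MORE THAN 10 right
p_eleven_or_more = quiz.sf(10)
print(p_eleven_or_more)

# cross-check by adding up the pmf for 11 through 30
print(quiz.pmf(range(11, 31)).sum())
```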

#### define our distribution

#### use our function to answer

![binom_example1-2.svg](attachment:binom_example1-2.svg)

#### Mini Exercise

The probability that a visitor will make a purchase when browsing in your web-store is 1.5%. You expect 350 web visitors today.


1. What kind of probability distribution do you have for "number of visitors who end up making a purchase"?


2. What are the appropriate defining parameters for this distribution?


3. Create a Scipy object/instance for this distribution



4. Use the object created above and choose the appropriate method (e.g. pmf, cdf, ppf etc.) to calculate the following:  


     a. What is the probability that exactly 10 visitors will make a purchase?
     b. What is the probability that 13 or more visitors will make a purchase?  
     c. What is the probability that 10 or fewer visitors will make a purchase?
     d. Visualize the resulting distribution (hint: try a bar chart)
     d. Visualize the resulting distribution (hint: try a bar chart)

## Poisson Distribution

- discrete probability distribution 
- expresses the probability of a given **number of events** occurring in a fixed interval of time or space
- No upper bound on number of events (unlike the Binomial distribution)

Defined by:
- lambda ($\lambda$) which is the rate at which the event happens

Assumptions
- events are occurring independently
- probability that event occurs in a given length of time does not change through time

Examples:
- Number of deaths by horse kicks in Prussian army
- Telecommunications: # of calls arriving in to customer service  
- Astronomy: photons arriving at a telescope
- Biology: the number of mutations on a strand of DNA per unit length  
- Management: customers arriving at a counter or call centre
- Finance and insurance: number of losses or claims occurring in a given period of time  
- Radioactivity: number of decays in a given time interval in a radioactive sample


### Suppose that astronomers estimate that large meteorites (above a certain size) hit the earth on average once every 100 years and that the number of meteorite hits follows a Poisson distribution.

lambda = rate of an event = 1 event per 100 years

#### define our distribution

In [None]:
#lambda
#don't overwrite the lambda function!


In [None]:
#set the distribution


In [None]:
#plot the data
# x = np.arange(0,8)
# y = [meteors.pmf(i) for i in x]

# plt.bar(x,y)

# plt.xlabel('Meteor Events Per 100 Years')
# plt.ylabel('Probability of Meteor Event')
# plt.title('Poisson Distribution of Meteor Events in 100 year Window')
# plt.show()

#### What is the probability of 0 meteorite hits in the next 100 years?
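
A sketch with `stats.poisson`, using a rate of 1 hit per 100-year window. The rate is stored in `lam` (an illustrative name) because `lambda` is a reserved word in Python; `meteors` matches the name used in the plotting cell.

```python
import math

from scipy import stats

# rate: 1 large meteorite hit per 100 years
lam = 1
meteors = stats.poisson(lam)

# pmf: probability of exactly 0 hits in the next 100 years
p_no_hits = meteors.pmf(0)
print(p_no_hits)

# for a Poisson distribution this equals e ** (-lam)
print(math.exp(-lam))
```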

#### Mini Exercise:

Average number of customers going through CVS drive-through is 8 per hour. 

1. What kind of distribution are we working with?


2. What are the appropriate defining parameters for this distribution?


3. Create a Scipy object/instance for this distribution



4. Use the object created above and choose the appropriate method (e.g. pmf, cdf, ppf etc.) to calculate the probability that 11 customers will go through in the next hour.


## Summary:

Types of Distribution:
1. Uniform distribution
    - lower and upper bound
    - stats.randint(lower,upper)


2. Normal distribution
    - mean ($\mu$)
    - std dev ($\sigma$)
    - stats.norm($\mu$,  $\sigma$ )


3. Binomial distribution
    - number of trials (n)
    - probability of success (p)
    - stats.binom(n, p )
    - X = number of successes in n trials


4. Poisson distribution
    - rate ($\lambda$)
    - stats.poisson($\lambda$)
    - X = number of events per unit time
    
 
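For a probability distribution defined above, we can answer different questions using the following functions:

- pmf/pdf
    - an exact value (`pmf` gives a probability, `pdf` gives a density)
- cdf/ppf
    - less than or equal to (`ppf` is the inverse: probability -> value)
- sf/isf
    - greater than (`isf` is the inverse: probability -> value)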

In [None]:
# hint for exercises:
# you can feed more than one value into a cdf or a ppf
# use a list: [low_val, high_val]
# ex: some_distribution.cdf([4, 7])