# Lab 4: Standardized random variables, gamma distribution, ECDF

As usual, the first code cell below imports the packages we'll be using for this lab.

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.image as img
import numpy as np
import scipy as sp
import scipy.stats as st
import pickle as pkl
print ("Modules Imported!")

Modules Imported!


Labs 1 and 3 cover most of the Python needed for any of the labs, so there will be no more Python tutorial sections.  As you've probably noticed in previous labs, some questions require more than just code. You can create new cells and set their type to markdown for this. I suggest learning the basics of LaTeX so that you can more easily write out your mathematical reasoning. You can use LaTeX syntax by placing it between two dollar signs in a markdown cell.

## Standardized Random Variables:

A standard random variable is one that has a mean of zero and a variance of one $(\mu=0, \sigma^2=1)$.  If a random variable $Y$ is not standard, then a standard one can be derived from it
by centering and linear rescaling.   The distribution (e.g. pmf) of the standardized version of $Y$ has the same shape as the distribution of $Y$.    We require two things of the standardized version: a mean of zero and a variance of one. If we let $X$ be the standardized form of $Y$, then $X = \frac{Y-\mu_Y}{\sigma_Y},$ where $\mu_Y$ is the mean of $Y$ and $\sigma_Y^2$ is the variance of $Y$ (so $\sigma_Y$ is its standard deviation). Let's check this:
\begin{align*}
E[X] & = E\left[\frac{Y-\mu_Y}{\sigma_Y}\right] = \frac{1}{\sigma_Y}E[Y-\mu_Y] = \frac{1}{\sigma_Y}(E[Y]-\mu_Y) = 0  \\
\mbox{Var}(X) & = \mbox{Var}\left(\frac{Y-\mu_Y}{\sigma_Y}\right) = \frac{1}{\sigma_Y^2}\mbox{Var}(Y-\mu_Y) = \frac{\mbox{Var}(Y)}{\sigma_Y^2} = 1
\end{align*}
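
The check above can also be done numerically. The following is a minimal sketch on a small made-up distribution (the values in `y_vals` and `y_probs` are illustrative assumptions, not part of the lab); it works directly with NumPy arrays of values and probabilities.

```python
import numpy as np

# Hypothetical example distribution (not from the lab): values of Y and their probabilities.
y_vals = np.array([0.0, 1.0, 4.0])
y_probs = np.array([0.5, 0.3, 0.2])

# Mean, variance, and standard deviation of Y computed from the pmf.
mu_Y = np.sum(y_vals * y_probs)
var_Y = np.sum((y_vals - mu_Y) ** 2 * y_probs)
sigma_Y = np.sqrt(var_Y)

# Standardize: X takes the value (y - mu_Y) / sigma_Y with the same probability as y.
x_vals = (y_vals - mu_Y) / sigma_Y

mu_X = np.sum(x_vals * y_probs)
var_X = np.sum((x_vals - mu_X) ** 2 * y_probs)
print("mean of X:", mu_X)      # zero up to floating-point rounding
print("variance of X:", var_X)  # one up to floating-point rounding
```

Note that only the values change under standardization; the probabilities attached to them stay the same, which is why the pmf keeps its shape.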

So to standardize any random variable, we simply subtract the mean and then divide by the standard deviation. This is useful because if we know the CDF of the standardized version of a random variable, we can find the CDF of the original version. For example, suppose $Y$ is a non-standard random variable and $X$ is the standardized version of $Y$, and suppose we want to determine the CDF of $Y$ but only have the CDF of $X$. Since $\sigma_Y > 0$, dividing by it preserves the inequality, so:

$F_Y(c) = P\{Y \le c\} = P\{Y-\mu_Y \le c-\mu_Y\} = P\left\{\frac{Y-\mu_Y}{\sigma_Y} \le \frac{c-\mu_Y}{\sigma_Y}\right\}= P\left\{X \le \frac{c-\mu_Y}{\sigma_Y}\right\} = F_X\left(\frac{c-\mu_Y}{\sigma_Y}\right)$

Since Python does such a nice job of packaging these distributions, this isn't particularly necessary for our coding purposes. However, when you get to Gaussian distributions in your probability class, you will use this extensively.
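
As a quick sanity check of the CDF relation, here is a small sketch that assumes $Y$ is Gaussian, so that its standardized version is the standard normal. The values of `mu_Y`, `sigma_Y`, and `c` are arbitrary illustrative choices.

```python
import scipy.stats as st

# Hypothetical parameters for a Gaussian Y (chosen only for illustration).
mu_Y, sigma_Y = 3.0, 2.0
c = 4.5

# CDF of Y evaluated directly at c.
F_direct = st.norm.cdf(c, loc=mu_Y, scale=sigma_Y)

# CDF of the standardized variable X evaluated at (c - mu_Y) / sigma_Y.
F_via_standard = st.norm.cdf((c - mu_Y) / sigma_Y)

print(F_direct, F_via_standard)  # the two values should agree
```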

$\textbf{Caveat}$: When you do problem 1, be aware of a machine-dependent feature of the .pmf method of a distribution object created by st.rv_discrete, which has caused bugs and confusion for many students before (especially in part 3). The .pmf method can behave unexpectedly for non-integer values, as this example illustrates:

In [None]:
c = [1.5, 2.0]
p = [0.5, 0.5]
Z = st.rv_discrete(values=(c,p))
print (Z.pmf(2.0))  # Prints 0.5
print (Z.pmf(1.5))  # Prints 0.5 on some machines, Prints 0.0 on some other machines (e.g. your laptops)

This seems to be a design flaw of the scipy library. You are not required to understand it or fix it. Our suggestion is: please avoid using the .pmf method for non-integer values. -- Zeyu Zhou, Feb 2018
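
One possible way to follow this suggestion is to keep the values and probabilities in plain arrays and look probabilities up yourself. The sketch below is only an illustration; `pmf_lookup` is a hypothetical helper name, not a scipy function.

```python
import numpy as np

# Same values and probabilities as in the example above, kept as plain arrays.
c = np.array([1.5, 2.0])
p = np.array([0.5, 0.5])

def pmf_lookup(value, values=c, probs=p):
    """Hypothetical helper: return P{Z = value}, matching values with a float tolerance."""
    match = np.isclose(values, value)
    return float(probs[match].sum())

print(pmf_lookup(2.0))  # 0.5
print(pmf_lookup(1.5))  # 0.5
print(pmf_lookup(1.7))  # 0.0, since 1.7 is not a possible value
```

For plotting, the arrays `c` and `p` can be passed directly to a plotting function, so the .pmf method is not needed at all.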

<br>**<SPAN style="BACKGROUND-COLOR: #C0C0C0">Problem 1:</SPAN>**  To illustrate the standardization procedure, 
<ol>
    <li> For a standard six-sided die, suppose that rolling a number greater than or equal to 5 counts as a success and anything else counts as a failure. You roll the die 10 times. Create a random variable $Y$ that represents the number of successes in these 10 rolls. Print out the mean and variance of $Y$.
    <li> Create another random variable $X$, which is a standardized version of $Y$. Print out the mean and variance of $X$.
    <li> Plot the pmf of $Y$ and the pmf of $X.$   Up to centering and linear scaling, the pmfs should have the same shape. 
</ol>

In [None]:
# Your code here

**<SPAN style="BACKGROUND-COLOR: #C0C0C0">End of Problem 1</SPAN>**

## Continuous Distribution:

### Gamma Distribution

In ECE313 lectures, we introduced some standard continuous probability distributions: the uniform, exponential, and normal distributions. Here, we introduce another continuous probability distribution.

The Gamma distribution is a family of continuous distributions on the positive real numbers; the exponential distribution is the special case $k=1$. It is used to model many real-world quantities that take only positive values, including rainfall amounts and the lifetimes of mechanical tools and machines. Such quantities are often asymmetric, which matches the Gamma distribution’s skewed shape. At its core, the Gamma distribution describes the waiting time until a given number of events occur, given a certain rate at which events happen. It is characterized by two parameters: the shape parameter $k$ (also known as $\alpha$) and the scale parameter $\theta$. Some texts instead use the rate parameter $\beta = 1/\theta$. In each of these forms, both parameters are positive real numbers.

The shape parameter $k$ is the number of events we wait for. For example, if we use the Gamma distribution to model the time until the fourth car accident in a particular city, the shape parameter is four. The scale parameter $\theta$ is the average time between consecutive events; in the car accident example, it is the average time between accidents. (This waiting-time interpretation applies when $k$ is an integer, but the distribution is defined for any positive $k$.) The scale parameter stretches the distribution horizontally: the larger the scale parameter, the more spread out the distribution is and the lower its peak. If the scale parameter is small, the distribution contracts and the peak is higher.
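
To make the parameterization concrete, here is a minimal sketch using `scipy.stats.gamma`, where the shape parameter is passed as `a` and the scale as `scale`. The values `k = 4.0` and `theta = 1.5` are illustrative assumptions. The sketch checks the known formulas mean $= k\theta$ and variance $= k\theta^2$, and that $k=1$ reduces to the exponential distribution with mean $\theta$.

```python
import numpy as np
import scipy.stats as st

# Hypothetical parameters, chosen only for illustration.
k, theta = 4.0, 1.5

# Frozen Gamma distribution: scipy calls the shape parameter `a`.
G = st.gamma(a=k, scale=theta)

print("mean:", G.mean(), "expected k*theta =", k * theta)
print("variance:", G.var(), "expected k*theta^2 =", k * theta**2)

# With k = 1, the Gamma pdf coincides with the exponential pdf with mean theta.
x = np.linspace(0.1, 10.0, 50)
same_as_exponential = np.allclose(
    st.gamma(a=1.0, scale=theta).pdf(x),
    st.expon(scale=theta).pdf(x),
)
print("k=1 matches exponential:", same_as_exponential)
```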

<br>**<SPAN style="BACKGROUND-COLOR: #C0C0C0">Problem 2:</SPAN>** Complete the following problem related to the Gamma distribution. <br> <ol>
<li> Given the following sets of parameters for the Gamma distribution, plot the probability density function (PDF) for each set on the same graph for comparison: <br>
    1.$k$=1.0, $\theta$=2.0 <br>
    2.$k$=2.0, $\theta$=2.0 <br> 
    3.$k$=3.0, $\theta$=2.0 <br>  
    4.$k$=5.0, $\theta$=1.0 <br>  
    5.$k$=9.0, $\theta$=0.5 <br>
    6.$k$=7.5, $\theta$=1.0 <br> 
    7.$k$=0.5, $\theta$=1.0 <br>
    Use a range for $x$ from 0 to 20. Label your axes appropriately and include a legend to differentiate between the plots for each set of parameters. <br>
<li> For the same sets of parameters as in part 1, plot the cumulative distribution function (CDF) for each set on the same graph for comparison. <br>
<li> What do you observe when comparing the plots? Give a brief description after plotting. <br>
</ol>

In [None]:
# Your code here

__Answer:__ (Your answer here)

**<SPAN style="BACKGROUND-COLOR: #C0C0C0">End of Problem 2</SPAN>**

### Normal Distribution and ECDF

The Normal distribution, also known as the Gaussian distribution, is one of the most important and widely used probability distributions in statistics. It is characterized by its bell-shaped curve, symmetric around its mean, $\mu$, and with its spread determined by the standard deviation, $\sigma$. The Normal distribution is used to model a wide range of natural phenomena, from heights and weights in a population to measurement errors in experiments. It is defined by the probability density function (PDF): <br>
\begin{align*}
f(x | \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
\end{align*}
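
As an illustration of the formula, the sketch below evaluates the PDF by hand and compares it with `scipy.stats.norm.pdf`. The function name `normal_pdf` and the parameter values are assumptions made for this example.

```python
import numpy as np
import scipy.stats as st

def normal_pdf(x, mu, sigma):
    """Evaluate the Normal PDF directly from the formula above."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical parameters, chosen only for illustration.
mu, sigma = 1.0, 2.0
x = np.linspace(-5.0, 7.0, 25)

manual = normal_pdf(x, mu, sigma)
library = st.norm.pdf(x, loc=mu, scale=sigma)  # scale is the standard deviation, not the variance

print(np.allclose(manual, library))
```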

The Empirical Cumulative Distribution Function (ECDF) is a tool used in statistics to estimate the cumulative distribution function (CDF) of a random variable. Unlike the CDF, which is theoretical and known in closed form or tabulated for distributions like the Normal distribution, the ECDF is constructed from data and provides a step-wise approximation of the CDF. The ECDF, $F_n(x)$, for a given sample of size $n$ is defined as: <br>
\begin{align*}
F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(x_i \leq x)
\end{align*}
where $n$ is the number of data points, $x_i$ are the observed values, and $I(\cdot)$ is an indicator function that is 1 if $x_i \leq x$ and 0 otherwise. The ECDF gives the proportion of observations less than or equal to $x$. <br>
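
To see the definition in action, here is a small sketch that evaluates $F_n(x)$ at a few points for a tiny hand-checkable data set. The data values and the helper name `ecdf_at` are assumptions made for illustration.

```python
import numpy as np

def ecdf_at(data, x):
    """Fraction of observations in `data` that are less than or equal to x."""
    data = np.asarray(data)
    # (data <= x) is the indicator I(x_i <= x); its mean is (1/n) * sum of indicators.
    return np.mean(data <= x)

# Hypothetical sample of n = 5 observations.
data = [3, 1, 4, 1, 5]

print(ecdf_at(data, 0))    # no observations are <= 0
print(ecdf_at(data, 1))    # two of five observations are <= 1
print(ecdf_at(data, 3.5))  # three of five observations are <= 3.5
print(ecdf_at(data, 5))    # all observations are <= 5
```

The ECDF jumps by $1/n$ at each observed value (or by a multiple of $1/n$ at repeated values) and is constant in between, which is what gives it its step-wise shape.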

<br>**<SPAN style="BACKGROUND-COLOR: #C0C0C0">Problem 3:</SPAN>** Consider a Normal distribution $N(\mu, \sigma^2)$ where $\mu$ is the mean and $\sigma^2$ is the variance. For this question, you will be working with a specific Normal distribution where $\mu = 0$ and $\sigma = 6$.  <br><ol>
<li> Using the normal distribution $N(\mu, \sigma^2)$, generate three separate random samples of sizes 10, 100, and 1000.
<li> For each sample, calculate and plot the ECDF. Ensure that each ECDF plot is clearly labeled and includes a legend indicating the sample size.
<li> Discuss how the ECDF changes with increasing sample size and what this implies about the convergence of the empirical distribution to the theoretical distribution.
</ol>

In [7]:
# Your code here

__Answer:__ (Your answer here)

**<SPAN style="BACKGROUND-COLOR: #C0C0C0">End of Problem 3</SPAN>**

## Lab Questions:

Make sure to complete lab problems 1-3 for this week's lab.

<div class="alert alert-block alert-warning"> 
## Academic Integrity Statement ##

By submitting the lab with this statement, you declare you have written up the lab entirely by yourself, including both code and markdown cells. You also agree that you should not share your code with anyone else. Any violation of the academic integrity requirement may cause an academic integrity report to be filed that could go into your student record. See <a href="https://provost.illinois.edu/policies/policies/academic-integrity/students-quick-reference-guide-to-academic-integrity/">Students' Quick Reference Guide to Academic Integrity</a> for more information. 