<table align="left" width=100%>
    <tr>
        <td width="20%">
            <img src="GL-2.png">
            <img src="faculty.png">
        </td>
        <td>
            <div align="center">
                <font color="#21618C" size=8px>
                  <b> Post Read <br> (Week 1) </b>
                </font>
            </div>
        </td>
    </tr>
</table>

## Table of Contents

1. **[Import Libraries](#lib)**
2. **[Probability Distributions](#dist)**
     - 2.1 - **[Discrete Probability Distributions](#dis)**
         - 2.1.1 - **[Discrete Uniform Distribution](#dis_uni)**
         - 2.1.2 - **[Bernoulli Distribution](#bernoulli)**
         - 2.1.3 - **[Binomial Distribution](#binomial)**
         - 2.1.4 - **[Poisson Distribution](#poisson)**
     - 2.2 - **[Continuous Probability Distributions](#cont)**
         - 2.2.1 - **[Continuous Uniform Distribution](#cont_uni)**
3. **[Sampling](#sample)**
    - 3.1 - **[Stratified Sample](#strata)**
    - 3.2 - **[Systematic Sample](#sys)**
    - 3.3 - **[Cluster Sample](#cluster)**   


<a id="lib"></a>
# 1. Import Libraries

**Let us import the required libraries.**

In [1]:
# import 'pandas' 
import pandas as pd 

# import 'numpy' 
import numpy as np

# import subpackage of matplotlib
import matplotlib.pyplot as plt
from matplotlib import gridspec
%matplotlib inline

# import 'seaborn'
import seaborn as sns

# to suppress warnings 
from warnings import filterwarnings
filterwarnings('ignore')

# import 'factorial' from math library
from math import factorial

# import 'stats' package from scipy library
from scipy import stats
from scipy.stats import randint
from scipy.stats import skewnorm

# import 'random' to generate a random sample
import random

In [2]:
# set the plot size using 'rcParams'
# once the plot size is set using 'rcParams', it sets the size of all the forthcoming plots in the file
# pass width and height in inches to 'figure.figsize' 
plt.rcParams['figure.figsize'] = [15,8]

<a id="App"></a>
# 2. Probability Distributions

<a id="dis_uni"></a>
### 2.1.1 Discrete Uniform Distribution

A discrete variable X taking values 1, 2, ..., n follows a discrete uniform distribution if all the values taken by X are equally likely. The pmf of X is given by:

<p style='text-indent:25em'> <strong> $ P(X = x) = \frac{1}{n}$</strong>$\hspace{2cm}$  x = 1, 2, ..., n </p>

The distribution is symmetric, as the probability is equal for all values of X.  

The mean and variance of the distribution are given as:<br>

Mean = $\frac{n+1}{2}$

Variance = $\frac{n^{2}-1}{12}$ 
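
As a quick sanity check, the closed-form mean and variance can be compared with the values that SciPy reports for `randint`. This is a minimal sketch, assuming n = 6 purely for illustration; the variable names are not part of the original material.

```python
from scipy.stats import randint

# assumed example: X takes the values 1, 2, ..., n with n = 6
n = 6

# closed-form mean and variance of the discrete uniform distribution
mean_formula = (n + 1) / 2
var_formula = (n**2 - 1) / 12

# 'randint' excludes the upper bound, so pass high = n + 1
mean_scipy = randint.mean(low = 1, high = n + 1)
var_scipy = randint.var(low = 1, high = n + 1)

print('Mean (formula, scipy):', mean_formula, mean_scipy)
print('Variance (formula, scipy):', var_formula, var_scipy)
```

Both pairs of values should agree up to floating-point rounding.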

### Example:

#### 1. A factory has 6 machines numbered from 1 to 6. Let the random variable X be the machine number. What is the probability that a machine chosen at random is either machine 3 or machine 4?

In [3]:
# probability of selecting any machine (1 to 6) is 1/6
prob_machine = 1/6

# to find: P(X = 3 or X = 4) = P(X = 3) + P(X = 4)
# 'randint.pmf' returns the pmf of discrete uniform distribution
# pass the value of X to the parameter, 'k'
# as, X = {1,2,3,4,5,6} 
# pass the minimum value that a variable X takes to the parameter, 'low'
# pass the (maximum+1) value that a variable X takes to the parameter, 'high'
req_prob = randint.pmf(k = 3, low = 1, high = 7) + randint.pmf(k = 4, low = 1, high = 7)

print('The probability of selecting a machine 3 or 4 is', req_prob)

The probability of selecting a machine 3 or 4 is 0.3333333333333333


<a id="bernoulli"></a>
### 2.1.2 Bernoulli Distribution

A discrete variable X taking only two values (say 0 and 1) follows a Bernoulli distribution with parameter `p` if the pmf of X is given by:

<p style='text-indent:25em'> <strong> $ P(X = x) = p^{x}q^{1-x}$</strong> $\hspace{2cm}$  x = 0, 1 </p>

`p` denotes the probability of success of an experiment and `q` denotes the probability of failure. (where, p + q = 1)

The experiment associated with variable X is also known as the `Bernoulli trial`.

The mean and variance of the distribution are given as:<br>

Mean = $p$

Variance = $pq$ 

### Example:

#### 1. A soccer player scores a goal from 7 out of every 10 direct free kicks. What is the probability that he scores from the next free kick? 

In [4]:
# probability of success (scoring a goal)
p = 0.7

# calculate the probability that a player scores a goal 
# to find: P(X = 1)
# pass the value of X to the parameter, 'k'
# pass the probability of success to the parameter, 'p'
req_prob = stats.bernoulli.pmf(k = 1, p = p)

print('The probability that the player scores a goal for the next free kick is', req_prob)

The probability that the player scores a goal for the next free kick is 0.7
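
The moments of a Bernoulli variable (mean p and variance p(1 - p)) can also be checked numerically. The sketch below is illustrative only: it reuses p = 0.7 from the free-kick example, and the sample size and seed are arbitrary assumptions.

```python
from scipy import stats

# probability of success taken from the free-kick example
p = 0.7
q = 1 - p

# theoretical mean and variance reported by scipy
mean, var = stats.bernoulli.stats(p, moments = 'mv')

# simulate 10,000 free kicks (1 = goal, 0 = miss); the seed is an arbitrary choice
kicks = stats.bernoulli.rvs(p, size = 10000, random_state = 42)

print('Theoretical mean and variance:', float(mean), float(var))
print('Simulated proportion of goals:', kicks.mean())
```

The simulated proportion should lie close to p, and it gets closer as the number of simulated trials grows.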


<a id="binomial"></a>
### 2.1.3 Binomial Distribution

A discrete variable X taking values 0, 1, 2,..., n follows a binomial distribution with parameters `n` and `p`, if the pmf of X is given by:

<p style='text-indent:25em'> <strong> $ P(X = x) = {n \choose x} p^{x}q^{n-x}$</strong> $\hspace{2cm}$  x = 0, 1, ..., n </p>

`p` denotes the probability of success of an experiment and `q` denotes the probability of failure. (where, p + q = 1)

If `n` independent Bernoulli trials (each with `p` as the probability of success) are performed, then the number of successes follows a binomial distribution.

The mean and variance of the distribution are given as:<br>

Mean = $np$

Variance = $npq$ 
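
To connect the pmf formula with the library call, the sketch below evaluates the formula directly and compares it with `stats.binom.pmf`. The helper `binomial_pmf` is a hypothetical name introduced here for illustration, and n = 20, p = 0.25 are assumed values.

```python
from math import comb
from scipy import stats

def binomial_pmf(x, n, p):
    """Return P(X = x) for X ~ Binomial(n, p) using the pmf formula."""
    q = 1 - p
    return comb(n, x) * p**x * q**(n - x)

# assumed illustrative values: n = 20 trials, success probability p = 0.25
n, p = 20, 0.25
manual = binomial_pmf(3, n, p)
library = stats.binom.pmf(k = 3, n = n, p = p)

print('pmf from the formula:', round(manual, 4))
print('pmf from scipy      :', round(library, 4))
print('mean = np:', n * p, ' variance = npq:', n * p * (1 - p))
```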

### Example:

#### 1. Heaven Furnitures (HF) sells furniture such as sofas, beds and tables. It is observed that 25% of its customers complain, for various reasons, about the furniture they purchased. On Tuesday, 20 customers purchased furniture from HF. 

#### a. Calculate the probability that exactly 3 customers will complain about the purchased products.

In [5]:
# use 'binom.pmf()' to calculate the pmf for binomial distribution 
# pass the required value of customers to the parameter, 'k' 
# pass number of total customers to the parameter, 'n'
# here the success is the customer's complaint about the products with probability 0.25
prob = stats.binom.pmf(k = 3, n = 20, p = 0.25)

# use 'round()' to round-off the value to 2 digits
prob = round(prob, 2)
print('The probability that exactly 3 customers will complain about the purchased products is', prob)

The probability that exactly 3 customers will complain about the purchased products is 0.13


####  b. Calculate the probability that more than 3 customers will complain about the furniture purchased by them.

In [6]:
# use 'binom.sf()' to calculate the value of survival function (1 - cdf). i.e. P(X > x)
# calculate the probability that more than 3 customers will complain
# pass the required value of customers to the parameter, 'k' 
# pass number of total customers to the parameter, 'n'
# here the success is the customer's complaint about the products with probability 0.25
prob = stats.binom.sf(k = 3, n = 20, p = 0.25)

# use 'round()' to round-off the value to 2 digits
req_prob = round(prob, 2)
print('The probability that more than 3 customers will complain about the furniture is', req_prob)

The probability that more than 3 customers will complain about the furniture is 0.77


#### 2. In a shooting academy, data was collected on the precision shooting of a student. Out of 15 shots fired, 11 were on target. For the same student, what is the probability that exactly 35 out of 50 shots fired will hit the target?

In [7]:
# use 'binom.pmf()' to calculate the pmf for binomial distribution 
# pass the required value of shots hit on the target to the parameter, 'k' 
# pass number of total shots fired to the parameter, 'n'
# here the success is hitting the shots on the target with probability 11/15
prob = stats.binom.pmf(k = 35, n = 50, p = 11/15)

# use 'round()' to round-off the value to 2 digits
prob = round(prob, 2)
print('The probability that out of 50 shots fired, exactly 35 will hit the target is', prob)

The probability that out of 50 shots fired, exactly 35 will hit the target is 0.11


<a id="poisson"></a>
### 2.1.4 Poisson Distribution

A discrete variable X taking values 0, 1, 2,... follows a Poisson distribution with parameter `m` (m > 0), if the pmf of X is given by:

<p style='text-indent:25em'> <strong> $ P(X = x) = \frac{e^{-m}m^{x}}{x!}$</strong> $\hspace{2cm}$  x = 0, 1, 2,... </p>

The mean and variance of the distribution are given as:<br>

Mean = $m$ = Variance

**Note:** Consider a variable X that follows a binomial distribution with parameters `n` and `p`. If n$\rightarrow$$\infty$ and p$\rightarrow$0 such that `np` remains constant, then the distribution of X approaches a Poisson distribution with parameter `m = np`.
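
The limiting behaviour in the note can be illustrated numerically. The following is a minimal sketch, assuming m = np is held fixed at 2 and looking at P(X = 3); these values are chosen only for illustration.

```python
from scipy import stats

# assumed illustrative values: m = np is held fixed at 2, and we look at P(X = 3)
m = 2
x = 3

# exact Poisson probability
poisson_prob = stats.poisson.pmf(k = x, mu = m)

# binomial probabilities with growing n and shrinking p = m/n
errors = []
for n in [10, 100, 1000, 10000]:
    binom_prob = stats.binom.pmf(k = x, n = n, p = m / n)
    errors.append(abs(binom_prob - poisson_prob))
    print('n =', n, ' binomial:', round(binom_prob, 5), ' Poisson:', round(poisson_prob, 5))
```

The gap between the two probabilities shrinks as n increases, which is why the Poisson distribution is often used to approximate a binomial with large `n` and small `p`.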

### Example:

#### 1. The number of pizzas sold per day by a food zone "Fapinos" follows a Poisson distribution with a mean of 67 pizzas per day. Calculate the probability that the number of pizza sales exceeds 70 in a day.

In [8]:
# use 'poisson.sf()' to calculate the value of survival function (1 - cdf). i.e. P(X > x)
# calculate the probability that more than 70 pizzas will be sold 
# pass the required value of pizzas to the parameter, 'k' 
# pass the average number of pizzas to the parameter, 'mu'
prob = stats.poisson.sf(k = 70, mu = 67)

# use 'round()' to round-off the value to 2 digits
req_prob = round(prob, 2)
print('The probability that the number of pizza sales exceeds 70 in a day is', req_prob)

The probability that the number of pizza sales exceeds 70 in a day is 0.33


#### 2. The number of calls received at a telephone exchange in a day follows a Poisson distribution. The probability that the exchange receives 5 calls is three times the probability that it receives 10 calls. Obtain the average number of calls that the telephone exchange receives in a day.

In [9]:
# given: P(X = 5) = 3*P(X = 10)
# to find: m = average number of calls
# solving the above equation we get
m_raised_5 = factorial(10) / (3* factorial(5))

# value of 'm'
m = m_raised_5**(1/5)

# the mean of a Poisson distribution need not be an integer, so round the value of 'm' to 2 digits
print('Average calls that the telephone exchange receives in a day', round(m, 2))

Average calls that the telephone exchange receives in a day 6.32
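
Cancelling the common factor $e^{-m}m^{5}$ from both sides of P(X = 5) = 3 P(X = 10) leaves $m^{5} = \frac{10!}{3 \cdot 5!}$. One way to check the solution is to substitute it back into the Poisson pmf. The sketch below does this without rounding `m`.

```python
from math import factorial
from scipy import stats

# P(X = 5) = 3 * P(X = 10) reduces to m**5 = 10! / (3 * 5!)
m = (factorial(10) / (3 * factorial(5))) ** (1 / 5)

# evaluate both sides of the given condition at the solved value of m
lhs = stats.poisson.pmf(k = 5, mu = m)
rhs = 3 * stats.poisson.pmf(k = 10, mu = m)

print('m =', round(m, 4))
print('P(X = 5)      =', round(lhs, 6))
print('3 * P(X = 10) =', round(rhs, 6))
```

The two printed probabilities should match, confirming that the unrounded value of `m` satisfies the stated condition.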


<a id="cont"></a>
## 2.2 Continuous Probability Distributions

A continuous probability distribution is the probability distribution of a continuous random variable. The area under the `probability density function (pdf)` over a range gives the probability that the variable lies in that range. The probability that the variable takes any single specific value is always 0.

The probability that X lies in the interval [a,b] is given by:

<p style='text-indent:25em'> <strong> $P(a \leq X \leq b) = \int_{a}^{b} f(x) dx$</strong> </p> 

The total area under the pdf is 1, i.e. $\int_{-\infty}^{\infty} f(x) dx = 1$

For a continuous random variable X the `cumulative distribution function (cdf)` is denoted by F(x) and defined as:

<p style='text-indent:25em'> <strong> $ F(x) = \int_{-\infty}^{x} f(u) du$</strong> </p>

The cdf of a random variable returns the probability that the variable takes a value less than or equal to the specified value.
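
The relationship between the pdf and the cdf can be checked by numerical integration. This is a sketch under stated assumptions: the standard normal distribution and the interval [-1, 1] are used only as a convenient example, and `scipy.integrate.quad` performs the integration.

```python
from scipy import stats
from scipy.integrate import quad

# assumed example: standard normal distribution and the interval [a, b] = [-1, 1]
a, b = -1, 1

# area under the pdf between a and b, by numerical integration
area, abs_error = quad(stats.norm.pdf, a, b)

# the same probability as a difference of cdf values, F(b) - F(a)
cdf_diff = stats.norm.cdf(b) - stats.norm.cdf(a)

# total area under the pdf over the whole real line
total_area, _ = quad(stats.norm.pdf, -float('inf'), float('inf'))

print('Integral of the pdf over [a, b]:', round(area, 4))
print('F(b) - F(a)                    :', round(cdf_diff, 4))
print('Total area under the pdf       :', round(total_area, 4))
```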

<a id="cont_uni"></a>
### 2.2.1 Continuous Uniform Distribution

A continuous variable X taking values in a range [a,b] follows a continuous uniform distribution if all the values in the range of variable X are equally likely. The pdf of X is given by:

<p style='text-indent:25em'> <strong> $f(x) = \frac{1}{b-a}$</strong> $\hspace{2cm}$ $a \leq x \leq b$ </p>

The mean and variance of the distribution are given as:<br>

Mean = $\frac{a + b}{2}$

Variance = $\frac{(b-a)^{2}}{12}$ 
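
The formulas can be compared with SciPy's `uniform` distribution, which is parameterised on the interval [loc, loc + scale]. The sketch below assumes the interval [0, 200] purely for illustration.

```python
from scipy import stats

# assumed illustrative interval [a, b] = [0, 200]
a, b = 0, 200

# scipy defines the uniform distribution on [loc, loc + scale], so scale = b - a
dist = stats.uniform(loc = a, scale = b - a)

print('Mean    : formula =', (a + b) / 2, ' scipy =', dist.mean())
print('Variance: formula =', (b - a)**2 / 12, ' scipy =', dist.var())
print('pdf at x = 75:', dist.pdf(75), ' 1/(b - a) =', 1 / (b - a))
```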

### Example:

#### 1. A gas supply company has a 200 km pipeline from its supply centre to the city. What is the probability that the pipe leaks in the middle 100 km? (Assume that a leak is equally likely at every point along the route.)

In [10]:
# distance between gas supplying centre and city is 200 kms
# to find: P(50<= X <= 150)
# X follows uniform distribution over the interval [0,200]

# use 'uniform.cdf()' to calculate the cdf for continuous uniform distribution 
# pass the required value of X to the parameter, 'x'
# pass the start point of the interval to the parameter, 'loc' 
# pass the width of the interval (end point - start point) to the parameter, 'scale'
prob_150 = stats.uniform.cdf(x = 150, loc = 0, scale = 200)
prob_50 = stats.uniform.cdf(x = 50, loc = 0, scale = 200)

# calculate the required probability
req_prob = prob_150 - prob_50

# use 'round()' to round-off the value to 2 digits
req_prob = round(req_prob, 2)
print('The probability that the pipe leaks in the middle 100 kms is', req_prob)

The probability that the pipe leaks in the middle 100 kms is 0.5


<a id="Sam"></a>
# 3. Sampling

<a id="strata"></a>
## 3.1 Stratified Sample

Stratified sampling can be used to draw a sample from a heterogeneous dataset. The dataset is divided into homogeneous strata, and a sample is then drawn from each stratum. The final sample therefore contains elements from every stratum.

### Example:

#### 1. Consider the annual income (in dollars) of 15 employees in a company. The socio-economic status of the employee depends on the annual income.

<table>
<tr>
    <th>Status</th>
    <td>Low</td>
    <td>High</td>
    <td>Below Poverty</td>
    <td>Middle</td>
    <td>Low</td>
    <td>High</td>
    <td>Middle</td>
    <td>Low</td>
    <td>Middle</td>
    <td>High</td>
    <td>Middle</td>
    <td>Below Poverty</td>
    <td>Low</td>
    <td>Below Poverty</td>
    <td>High</td>
   </tr>
<tr>
    <th>Income</th>
    <td>4850</td>
    <td>9270</td>
    <td>2520</td>
    <td>6020</td>
    <td>5790</td>
    <td>10400</td>
    <td>7140</td>
    <td>3100</td>
    <td>6850</td>
    <td>9740</td>
    <td>6540</td>
    <td>1230</td>
    <td>4400</td>
    <td>2210</td>
    <td>9880</td>
    </tr>
</table>


Draw a stratified sample containing two observations from each category.

In [11]:
# create a dataframe of given data
df_inc = pd.DataFrame(dict(Status = ['Low', 'High', 'Below Poverty', 'Middle', 'Low', 'High','Middle', 'Low', 'Middle', 
                                      'High', 'Middle', 'Below Poverty', 'Low', 'Below Poverty', 'High'],
                      Income = [4850, 9270, 2520, 6020, 5790, 10400, 7140, 3100, 6850, 9740, 6540, 1230, 4400, 2210, 9880]))

# use 'groupby()' to create the homogeneous strata for each socio-economic status
# 'group_keys = False' will not add the group keys in the index for each group
# choose sample of two observations using 'sample'
# set 'random_state' to obtain the same sample every time you run the code 
# use 'apply()' to apply the lambda function for each socio-economic status
df_inc.groupby('Status', group_keys = False).apply(lambda x: x.sample(2, random_state = 1))

Unnamed: 0,Status,Income
2,Below Poverty,2520
13,Below Poverty,2210
14,High,9880
9,High,9740
12,Low,4400
7,Low,3100
10,Middle,6540
8,Middle,6850
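
As an alternative sketch, pandas 1.1 and later provide `DataFrameGroupBy.sample`, which draws a fixed number of rows per group without an explicit `apply()`. The name `strat_sample` is introduced here for illustration. Because the random state is consumed differently, the rows drawn need not match the output above.

```python
import pandas as pd

# same income data as in the example above
df_inc = pd.DataFrame(dict(
    Status = ['Low', 'High', 'Below Poverty', 'Middle', 'Low', 'High', 'Middle', 'Low', 'Middle',
              'High', 'Middle', 'Below Poverty', 'Low', 'Below Poverty', 'High'],
    Income = [4850, 9270, 2520, 6020, 5790, 10400, 7140, 3100, 6850, 9740, 6540, 1230, 4400, 2210, 9880]))

# draw n = 2 rows from every stratum (requires pandas >= 1.1)
strat_sample = df_inc.groupby('Status').sample(n = 2, random_state = 1)

print(strat_sample)

# confirm that each socio-economic status appears exactly twice
print(strat_sample['Status'].value_counts())
```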


<a id="sys"></a>
## 3.2 Systematic Sample

This technique draws a sample at regular intervals. To draw a sample of size `n` from a population of size `N`, arrange the population row-wise in `k` columns such that N = nk, and then select one column at random as the required sample.

### Example:

#### 1. Consider the data for the number of ice-creams sold per day. Draw a systematic sample of size 6 starting from the 3rd element.

data = [21, 93, 62, 76, 73, 20, 56, 95, 41, 36, 38, 13, 80, 88, 34, 18, 40, 11, 
        25, 29, 61, 23, 82, 10, 92, 69, 60, 87, 14, 91, 94, 49, 57, 83, 96, 55, 
        79, 52, 59, 39, 58, 17, 19, 98, 15, 54, 48, 46, 72, 45, 65, 28, 37, 30, 
        68, 75, 16, 33, 31, 99, 22, 51, 27, 67, 85, 47, 44, 77, 64, 97, 84, 42, 
        90, 70, 74, 89, 32, 26, 24, 12, 81, 53, 50, 35, 71, 63, 43, 86, 78, 66]

In [12]:
# given data
data = np.array([21, 93, 62, 76, 73, 20, 56, 95, 41, 36, 38, 13, 80, 88, 34, 18, 40, 11, 25, 29, 61, 23, 82, 10, 92, 69, 60,
                  87, 14, 91, 94, 49, 57, 83, 96, 55, 79, 52, 59, 39, 58, 17, 19, 98, 15, 54, 48, 46, 72, 45, 65, 28, 37, 30, 
                  68, 75, 16, 33, 31, 99, 22, 51, 27, 67, 85, 47, 44, 77, 64, 97, 84, 42, 90, 70, 74, 89, 32, 26, 24, 12, 81, 
                  53, 50, 35, 71, 63, 43, 86, 78, 66])

# total number of elements 
N = len(data)

# required number of samples
n = 6

# sampling interval, k = N/n = 15
k = N // n

# arrange the data in n = 6 rows and k = 15 columns using 'reshape()'
data = data.reshape(n, k)

# use for loop to select each sample point
for i in range(n):
    
    # select a sample point
    # pass the second index as '2' to get the sales in the 3rd column of the reshaped data
    sample_pt = data[i][2]
    
    # print a sample point
    print(sample_pt)

62
11
57
46
27
26
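
Selecting one column of the reshaped array is equivalent to taking every k-th element from a chosen starting position, which can be written with slicing. The sketch below is illustrative: `systematic_sample` is a hypothetical helper, and the population 0, 1, ..., 89 is made up so that the selected positions are easy to read.

```python
import numpy as np

def systematic_sample(values, n, start = None, seed = None):
    """Return every k-th element of 'values' (k = N // n), beginning at index 'start'."""
    values = np.asarray(values)
    k = len(values) // n                      # sampling interval
    if start is None:
        # choose the starting position at random from the first k elements
        start = np.random.default_rng(seed).integers(0, k)
    return values[start::k][:n]

# hypothetical population of 90 values: 0, 1, ..., 89
population = np.arange(90)

# fixed start at index 2 (the 3rd element), as in the example above
fixed = systematic_sample(population, n = 6, start = 2)
print(fixed)

# random start; the seed is an arbitrary choice for reproducibility
print(systematic_sample(population, n = 6, seed = 0))
```

With a fixed start the result is deterministic; leaving `start` unset gives the random column choice described in the definition.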


<a id="cluster"></a>
## 3.3 Cluster Sample

Cluster sampling can be used when the population is a collection of small groups/clusters. This technique selects a complete cluster randomly as a single sampling unit.

### Example:

#### 1. A pandemic had a severe impact on 17 states in a country. A health organization aims to study the situation in the impacted states. Consider each state as a cluster and use cluster sampling to select 5 states.

    states = ['Arizona', 'Iowa', 'Ohio', 'Nevada', 'Texas', 'Alabama', 'Mississippi', 'Utah', 'Indiana', 'Florida', 
              'New York', 'Nebraska', 'California', 'Colorado', 'Montana', 'Oregon', 'Washington']

In [13]:
# given states 
states = ['Arizona', 'Iowa', 'Ohio', 'Nevada', 'Texas', 'Alabama', 'Mississippi', 'Utah', 'Indiana', 'Florida', 'New York', 
          'Nebraska', 'California', 'Colorado', 'Montana', 'Oregon', 'Washington']

# select 5 states using 'sample()'
# pass the data to the parameter, 'population'
# pass the required sample size to the parameter, 'k' 
cluster_samp = random.sample(population = states, k = 5)

# selected states 
cluster_samp

['California', 'Arizona', 'Iowa', 'Oregon', 'Alabama']
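
In practice, every unit inside a selected cluster is included in the sample. The sketch below illustrates that second step on a small made-up dataset: the `hospitals` table, its column names and the seed are all hypothetical and not part of the original exercise.

```python
import random
import pandas as pd

# hypothetical population: hospitals (units) grouped by state (cluster)
hospitals = pd.DataFrame({
    'state': ['Ohio', 'Ohio', 'Utah', 'Utah', 'Utah', 'Iowa', 'Texas', 'Texas', 'Nevada', 'Nevada'],
    'hospital_id': range(1, 11)})

# choose 2 whole clusters at random; the seed is an arbitrary choice
rng = random.Random(10)
clusters = sorted(hospitals['state'].unique())
chosen = rng.sample(clusters, k = 2)

# keep every unit that belongs to a chosen cluster
cluster_sample = hospitals[hospitals['state'].isin(chosen)]

print('Selected clusters:', chosen)
print(cluster_sample)
```

Using a dedicated `random.Random` instance with a seed makes the selection reproducible without affecting the global random state.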