# Import

In [1]:
from matplotlib import pyplot as plt
import matplotlib.cm as cm
import matplotlib.gridspec as gridspec
from matplotlib.ticker import MaxNLocator
import matplotlib.mlab as mlab
%matplotlib inline

# set fig size; bigger DPI results in bigger fig
plt.rcParams["figure.dpi"] = 80

import seaborn as sns
import pandas as pd
import numpy as np
import math
import scipy.stats as stats
from scipy.stats import norm
import statsmodels.api as sm
import statsmodels.stats.api as sms
import statsmodels.formula.api as smf
import plotly.express as px
import plotly.graph_objects as go

import stemgraphic as stem
from mgt2001 import *
import mgt2001

import random
import itertools
from IPython.display import display_html
plt.style.use('ggplot') # refined style


import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

mgt2001.__version__

'0.1.111'

# Chapter 5 : Data Collection and Sample Distribution

A population has parameters; a sample has statistics  

### Sampling Methods
- non-probabilistic sampling (selection probabilities are **unknown**, so they cannot be assumed equal)
  - Convenience (e.g. volunteers)
  - Judgment (select the elements judged most representative of the population)
- probabilistic sampling (every element has a **known**, nonzero probability of entering the sample; under simple random sampling the probabilities are also **equal**)
  - Simple Random Sampling  
  - Stratified Random Sampling (separate the population into several strata, then draw a random sample from each stratum)
  - Cluster Sampling (divide the population into clusters, each ideally a representative small-scale version of the population, then randomly select whole clusters)
  - Systematic Sampling (sample one element for every N/n elements in the population)
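
The three probabilistic designs that draw individual elements can be sketched with the standard `random` module. This is a minimal illustration, not a full survey design: the population of 100 IDs, the two strata `A` and `B`, and the proportional allocation rule are all assumptions made for the example.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100 element IDs split into two strata.
population = list(range(100))
strata = {"A": population[:60], "B": population[60:]}
n = 10  # desired total sample size

# Simple random sampling: draw n distinct elements from the whole population.
simple = random.sample(population, n)

# Stratified sampling (proportional allocation, assumed here):
# draw a random sample inside each stratum.
stratified = []
for name, members in strata.items():
    k = round(n * len(members) / len(population))
    stratified.extend(random.sample(members, k))

# Systematic sampling: random start, then every (N // n)-th element.
step = len(population) // n
start = random.randrange(step)
systematic = population[start::step]

print(simple, stratified, systematic, sep="\n")
```

Cluster sampling would instead pick whole groups at random and keep every element in the chosen groups.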

### Random sampling using python
random.sample(): Random sample without replacement  
random.choices(): Random sample with replacement  
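random.shuffle(): Random permutation (shuffles the list in place)  
random.seed(): Fix the random number generator seed for reproducible results  
random.randint(a, b): Random integer N with a <= N <= b  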
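
A short sketch of the functions listed above. The list `items` and the variable names are placeholders chosen for the example.

```python
import random

random.seed(42)  # same seed -> same sequence of draws

items = ["a", "b", "c", "d", "e"]  # hypothetical population

without_replacement = random.sample(items, 3)  # 3 distinct elements
with_replacement = random.choices(items, k=3)  # elements may repeat
deck = items.copy()
random.shuffle(deck)                           # shuffles the list in place, returns None
die = random.randint(1, 6)                     # both endpoints are included

print(without_replacement, with_replacement, deck, die)
```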
  
### Sampling Error  
The absolute difference between an unbiased point estimate and the corresponding population parameter  
It is expected to occur, because estimates are developed from a subset of the population (the sample) rather than the entire population  
Increasing the sample size tends to reduce sampling error  
  
Mean Absolute Error (MAE) : $\frac{1}{n} \sum_{k=1}^{n} |e_k|$  
Root Mean Squared Error (RMSE) : $\sqrt{\frac{1}{n} \sum_{k=1}^{n} e_k^2}$  
RMSE is more sensitive to outliers than MAE, because the errors are squared before averaging
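
The two error measures can be compared on a small made-up error vector. The helper names `mae` and `rmse` and the numbers are illustrative only; the second vector replaces one error with an outlier to show how much more RMSE reacts.

```python
import numpy as np

def mae(errors):
    """Mean absolute error: average of |e_k|."""
    return np.mean(np.abs(errors))

def rmse(errors):
    """Root mean squared error: square root of the average of e_k^2."""
    return np.sqrt(np.mean(np.square(errors)))

errors = np.array([1.0, -1.0, 2.0, -2.0])           # hypothetical errors
errors_outlier = np.array([1.0, -1.0, 2.0, -10.0])  # same, with one outlier

print(mae(errors), rmse(errors))
print(mae(errors_outlier), rmse(errors_outlier))
```

With the outlier, MAE grows from 1.5 to 3.5 while RMSE grows from about 1.58 to about 5.15.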

### Non-Sampling Error  
Non-sampling errors arise from mistakes made during data acquisition or from improperly selected samples  
Increasing sample size **will not** reduce this type of error  
Types of non-sampling error :  
- Errors in data acquisition
- Non-response errors
- Selection bias
- Self-selection bias  
#### Self-selection bias  
Individuals select themselves into a group, causing a **biased sample with nonprobability sampling**  



# Chapter 9 : Sampling Distribution


The sampling distribution of a statistic is the probability distribution of that statistic over repeated samples of the same size drawn from the population  
![](https://i.imgur.com/a7Ytvnv.png)
![](https://i.imgur.com/jYGD170.png)  
$\bar{x}$ tends to be closer to $\mu$ as the sample size increases  

### Central Limit Theorem  
The larger the sample size, the more closely the sampling distribution of $\bar{x}$ resembles a normal distribution  
with mean = $\mu$ and variance = $\sigma^2/n$  
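
A small simulation can make this concrete. The sketch below assumes a clearly non-normal population (exponential with mean 2 and standard deviation 2), draws many samples of size 50, and compares the mean and variance of the sample means with $\mu$ and $\sigma^2/n$. The distribution, sample size, and number of repetitions are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 2.0, 2.0  # exponential with scale 2 has mean 2 and std 2
n = 50                # size of each sample
reps = 20000          # number of repeated samples

# Each row is one sample of size n; take the mean of every row.
sample_means = rng.exponential(scale=2.0, size=(reps, n)).mean(axis=1)

print("mean of sample means    :", sample_means.mean())
print("variance of sample means:", sample_means.var(ddof=1))
print("sigma^2 / n             :", sigma**2 / n)
```

A histogram of `sample_means` (for example with `plt.hist`) should look roughly bell-shaped even though the underlying population is skewed.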

#### scipy.stats  
- rvs: Random variates (random draws from the distribution)
- pdf: Probability density function
- cdf: Cumulative distribution function
- ppf: Percent point function (Inverse CDF)
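
The four methods can be tried on a normal distribution. The parameters below (mean 32.2, standard deviation 0.3) are only example values, and the variable names are placeholders.

```python
from scipy import stats

# Frozen normal distribution with example parameters.
dist = stats.norm(loc=32.2, scale=0.3)

draws = dist.rvs(size=5, random_state=0)  # 5 random draws
density = dist.pdf(32.2)                  # density at the mean
prob_below = dist.cdf(32.2)               # P(X <= 32.2) = 0.5
q975 = stats.norm.ppf(0.975)              # standard normal 97.5th percentile, about 1.96

# ppf is the inverse of cdf: this recovers 32.0 up to rounding.
round_trip = dist.ppf(dist.cdf(32.0))

print(draws, density, prob_below, q975, round_trip)
```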


### Example : 
The amount of soda pop in each bottle is normally distributed with a mean of 32.2 ounces and a standard deviation of 0.3 ounces  
Find the probability that the mean of 4 bottles is more than 32 ounces  

In [5]:
xbar = 32  # Threshold for the sample mean
mu = 32.2  # Population mean
std = 0.3  # Population std
n = 4  # Sample size
zcv = (xbar - mu) / (std / math.sqrt(n))  # Standardize using the standard error
p = 1 - stats.norm.cdf(zcv)  # P(sample mean > 32)
print(p)

0.9087887802741352


![](https://i.imgur.com/8l4mZgB.png)
![](https://i.imgur.com/aHb2ZYn.png)


### Standard Deviation of $\bar{x}$  
- Finite Population :  
![](https://i.imgur.com/nXjnA27.png)
- Infinite Population :   
When sampling **without replacement** from a finite population, the finite population correction factor applies  
A finite population can be treated as infinite if n/N <= 0.05  
![](https://i.imgur.com/mL4iKSV.png)  
$\sqrt{(N-n)/(N-1)}$ is the finite population correction factor  
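
As a rough sketch, the standard error of $\bar{x}$ with and without the correction factor can be wrapped in one helper. The function name `std_error_mean` is hypothetical, and the threshold uses the common 5% rule of thumb (apply the correction only when n/N > 0.05), which is an assumption of this example.

```python
import math

def std_error_mean(sigma, n, N=None):
    """Standard error of the sample mean.

    sigma : population standard deviation
    n     : sample size
    N     : population size (None means treat the population as infinite)
    """
    se = sigma / math.sqrt(n)
    # Assumed rule of thumb: correct only when the sample is more than 5% of the population.
    if N is not None and n / N > 0.05:
        se *= math.sqrt((N - n) / (N - 1))
    return se

print(std_error_mean(0.3, 4))        # infinite population
print(std_error_mean(0.3, 4, N=20))  # finite population, n/N = 0.2
```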
 
### Binomial Experiment  
![](https://i.imgur.com/TSEPFVI.png)  

## Normal approximation to the binomial
The normal approximation to the binomial works best when  
- the sample size is large  
- the probability of success is close to 0.5  

Rule of thumb : np > 5 and n(1-p) > 5  
$\mu$ = np  
$\sigma^2$ = np(1-p)  
![](https://i.imgur.com/LYOENLN.png)  
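
To see how close the approximation is, the exact binomial probability can be compared with the normal one. The values n = 100, p = 0.5 and the cutoff 45 are made up for this sketch, and the 0.5 continuity correction is a standard refinement that is assumed here rather than taken from the notes.

```python
import math
from scipy import stats

n, p = 100, 0.5                     # hypothetical binomial experiment
mu = n * p                          # mean of the approximating normal
sigma = math.sqrt(n * p * (1 - p))  # standard deviation of the approximating normal

# Exact P(X <= 45) from the binomial distribution.
exact = stats.binom.cdf(45, n, p)

# Normal approximation with a continuity correction of 0.5 (assumption).
approx = stats.norm.cdf(45 + 0.5, loc=mu, scale=sigma)

print(exact, approx)
```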


## Population and Sample Proportion
The estimate of $p$ is $\hat{p}$ = number of successes / n  
$E[\hat{p}]$ = $p$  
$\sigma$<sub>$\hat{p}$</sub> = $\sqrt{p(1-p)/n}$  
p = population proportion  
n = sample size  
$\sigma$<sub>$\hat{p}$</sub> is the standard error of the proportion  
![](https://i.imgur.com/zJA099N.png)  
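
A minimal sketch of the standard error of the proportion, assuming a population proportion of 0.4 and a sample of 200 (both invented for the example). It then uses the normal approximation to estimate the probability that $\hat{p}$ exceeds 0.45.

```python
import math
from scipy import stats

p = 0.4  # hypothetical population proportion
n = 200  # hypothetical sample size

se = math.sqrt(p * (1 - p) / n)  # standard error of p_hat

# P(p_hat > 0.45), treating p_hat as approximately normal.
z = (0.45 - p) / se
prob = 1 - stats.norm.cdf(z)

print(se, z, prob)
```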


### Standard deviation of the sampling distribution of $\hat{p}$  
![](https://i.imgur.com/5UtccW5.png)  

## Sampling Distribution of the Difference Between Two Means  
![](https://i.imgur.com/XeNWepV.png)  
![](https://i.imgur.com/sEfGJ8R.png)  


# Chapter 10 : Introduction to Estimation  
Estimator Characteristics : 
- Unbiasedness  
The expected value of the sample statistic is equal to the population parameter being estimated  
- Consistency  
The point estimator becomes **closer** to the population parameter as the sample size becomes larger  
- Efficiency  
If there are two or more unbiased estimators of a parameter, the one whose variance (standard deviation) is **smaller** is said to be relatively more efficient
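
These properties can be checked by simulation. The sketch below assumes a normal population (mean 10, standard deviation 2) and compares two estimators of $\mu$, the sample mean and the sample median. For normal data both are unbiased, but the mean has the smaller variance, so it is the more efficient of the two. All numbers are arbitrary illustration values.

```python
import numpy as np

rng = np.random.default_rng(1)

mu, sigma = 10.0, 2.0  # hypothetical normal population
n, reps = 25, 20000    # sample size and number of repeated samples

samples = rng.normal(mu, sigma, size=(reps, n))
means = samples.mean(axis=1)          # sample mean of each sample
medians = np.median(samples, axis=1)  # sample median of each sample

# Unbiasedness: both averages should be close to mu.
print(means.mean(), medians.mean())

# Efficiency: the sample mean should show the smaller variance.
print(means.var(ddof=1), medians.var(ddof=1))
```

Rerunning with a larger `n` shrinks both variances, which illustrates consistency.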


## Estimating $\mu$ When $\sigma$ is known  
$\bar{x}$ = the sample mean  
1 – $\alpha$ = the confidence level  
z<sub>$\alpha$/2</sub> = the z value providing an area of $\alpha$/2 in the upper tail of the standard normal probability distribution  
$\sigma$ = the population standard deviation  
n = the sample size  

![](https://i.imgur.com/ZaVAAf8.png)  


### Code Example :

In [8]:
sample = np.array([180, 130, 150, 165, 90, 130, 120, 60, 200, 180, 80, 240, 210, 150, 125])
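alpha = 0.05  # 1 - alpha = 95% confidence level
Xbar = sample.mean()
zvalue = stats.norm.ppf(1 - alpha / 2)  # z value with alpha/2 in the upper tail
std = 40  # known population standard deviation
n = len(sample)  # sample size (15)
margin = zvalue * (std / (n ** 0.5))  # margin of error
lcl = Xbar - margin  # lower confidence limit
ucl = Xbar + margin  # upper confidence limit

print(f"Confidence interval : {lcl:.4f} to {ucl:.4f}")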

Confidence interval : 127.0909 to 167.5758


## Determining the Sample Size  
To estimate the mean to within w units : $\bar{x} \pm w$  
w = z<sub>$\alpha$/2</sub> * $\sigma$ / $\sqrt{n}$  
n = ( z<sub>$\alpha$/2</sub> * $\sigma$ / w )$^2$, rounded up to the next integer  
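
The sample size formula translates directly into a small helper. The function name `required_sample_size` and the example inputs ($\sigma$ = 40, w = 10) are assumptions for illustration; the result is rounded up because a fractional observation is not possible.

```python
import math
from scipy import stats

def required_sample_size(sigma, w, alpha=0.05):
    """Smallest n that estimates the mean to within w units at confidence 1 - alpha."""
    z = stats.norm.ppf(1 - alpha / 2)  # z value with alpha/2 in the upper tail
    return math.ceil((z * sigma / w) ** 2)

print(required_sample_size(sigma=40, w=10))  # 62
```

Halving w roughly quadruples the required sample size, since n grows with $1/w^2$.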


## Estimation of $\mu$ Using the Sample Median  
![](https://i.imgur.com/YnvQNCW.png)  
![](https://i.imgur.com/akpwPBF.png)  



# Introduction to Hypothesis Testing  
$H$<sub>0</sub> : Null hypothesis  
$H$<sub>1</sub> : Alternative hypothesis (the claim we want to find evidence for)  
### Types of Errors  
![](https://i.imgur.com/fwW08R8.png)  
P(Type I error) = $\alpha$  
P(Type II error) = $\beta$  
1 - $\alpha$ is called the **Confidence Level**  
1 - $\beta$ is called the **Power of the Test** (the probability of rejecting a false null hypothesis)
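
To make $\beta$ and power concrete, here is a sketch for a right-tailed z test. Every number is hypothetical: $H$<sub>0</sub> says the mean is 200, the population standard deviation is 50, the sample size is 25, and the true mean is assumed to be 220.

```python
import math
from scipy import stats

mu0 = 200       # mean under the null hypothesis (hypothetical)
mu_true = 220   # assumed true mean under the alternative (hypothetical)
sigma, n = 50, 25
alpha = 0.05

se = sigma / math.sqrt(n)

# Right-tailed test: reject H0 when the sample mean is at least xbar_c.
xbar_c = mu0 + stats.norm.ppf(1 - alpha) * se

# Type II error: failing to reject H0 although the true mean is mu_true.
beta = stats.norm.cdf((xbar_c - mu_true) / se)
power = 1 - beta

print(xbar_c, beta, power)
```

Increasing n or moving `mu_true` further from `mu0` raises the power.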

#### The Rejection Region Method  
- $H$<sub>1</sub> : $\mu$ > $\mu$<sub>0</sub>  
The Rejection Region is: $\bar{x}$ >= $\bar{x}$<sub>c</sub>  
- $H$<sub>1</sub> : $\mu$ < $\mu$<sub>0</sub>  
The Rejection Region is: $\bar{x}$ <= $\bar{x}$<sub>c</sub>  
- $H$<sub>1</sub> : $\mu$ $\neq$ $\mu$<sub>0</sub>  
The Rejection Region is: $\bar{x}$ <= $\bar{x}$<sub>L</sub> or   $\bar{x}$ >= $\bar{x}$<sub>U</sub>  
For this two-tailed case, $\alpha$ is divided by 2 when computing the critical values  
![](https://i.imgur.com/a2jmmG2.png) (Can vary for different $H$<sub>1</sub> )  
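
The critical values for the three cases can be computed with one helper. This is a sketch for a z test with known $\sigma$; the function name `critical_values`, the `alternative` labels, and the example inputs are all assumptions.

```python
import math
from scipy import stats

def critical_values(mu0, sigma, n, alpha=0.05, alternative="two-sided"):
    """Critical value(s) of the sample mean for a z test with known sigma."""
    se = sigma / math.sqrt(n)
    if alternative == "greater":  # H1: mu > mu0, reject if xbar >= value
        return (mu0 + stats.norm.ppf(1 - alpha) * se,)
    if alternative == "less":     # H1: mu < mu0, reject if xbar <= value
        return (mu0 - stats.norm.ppf(1 - alpha) * se,)
    # H1: mu != mu0, alpha is split between the two tails
    z = stats.norm.ppf(1 - alpha / 2)
    return (mu0 - z * se, mu0 + z * se)

print(critical_values(200, 50, 25))                         # lower and upper
print(critical_values(200, 50, 25, alternative="greater"))  # single upper value
```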

#### Test Statistic  
Use the standardized value $z = (\bar{x} - \mu_0) / (\sigma / \sqrt{n})$, where $\mu_0$ is the value of $\mu$ under $H$<sub>0</sub>


#### $p$-value Method (most commonly used) 
![](https://i.imgur.com/dfRzp1J.png)  
If the p-value < $\alpha$ then reject the null hypothesis  
**Code**

In [10]:
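Xbar = 190  # sample mean
mu = 200  # mean under the null hypothesis
n = 9 # df.shape[0] if using dataframe
std = 50  # population standard deviation
z = (Xbar - mu) / (std / (n**0.5))  # test statistic
p_value = 1 - stats.norm.cdf(z)  # right-tailed p-value (H1: mu > 200)
print(p_value)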

0.7257468822499265
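
The value above, `1 - stats.norm.cdf(z)`, is the upper-tail probability, which matches a right-tailed alternative. As a sketch, the p-value for each form of $H$<sub>1</sub> can be computed with one helper. The function name `z_test_p_value` and the `alternative` labels are assumptions for this example.

```python
import math
from scipy import stats

def z_test_p_value(xbar, mu0, sigma, n, alternative="two-sided"):
    """p-value of a z test with known population standard deviation."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    if alternative == "greater":    # H1: mu > mu0
        return stats.norm.sf(z)     # sf(z) = 1 - cdf(z)
    if alternative == "less":       # H1: mu < mu0
        return stats.norm.cdf(z)
    return 2 * stats.norm.sf(abs(z))  # H1: mu != mu0

for alt in ("greater", "less", "two-sided"):
    print(alt, z_test_p_value(190, 200, 50, 9, alternative=alt))
```

With the same inputs as the cell above, the right-tailed result reproduces 0.7257, while the left-tailed and two-tailed p-values are about 0.2743 and 0.5485.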
