<a id="lib"></a>
# 1. Import Libraries

**Let us import the required libraries.**

In [1]:
from warnings import filterwarnings
filterwarnings('ignore')

In [2]:
# import 'pandas' 
import pandas as pd 

# import 'numpy' 
import numpy as np

# import subpackage of matplotlib
import matplotlib.pyplot as plt

# import 'seaborn'
import seaborn as sns

# import 'random' to generate random sample
import random

# import statistics to perform statistical computation  
import statistics

# import 'stats' package from scipy library
from scipy import stats

# import a library to perform Z-test
from statsmodels.stats import weightstats as stests

# to test the normality 
from scipy.stats import shapiro

# import the function to calculate the power of test
from statsmodels.stats import power

In [3]:
# set the plot size using 'rcParams'
# once the plot size is set using 'rcParams', it sets the size of all the forthcoming plots in the file
# pass width and height in inches to 'figure.figsize' 
plt.rcParams['figure.figsize'] = [15,8]

In [4]:
import scipy.stats as stats
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

<a id="pt"></a>
## 2.1 Point Estimation

This method uses a single value (a sample statistic) as the estimate of the population parameter. 

Let $X_{1}, X_{2}, X_{3},..., X_{n}$ be a random sample drawn from a population with mean $\mu$ and standard deviation $\sigma$. <br>
The point estimation method estimates the population mean as $\hat{\mu} = \overline{X}$, where $\overline{X}$ is the sample mean, and the population standard deviation as $\hat{\sigma} = s$, where $s$ is the standard deviation of the sample.
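
As a minimal sketch of the idea above, the helper below returns both point estimates for a sample. The function name `point_estimates` and the values in `demo_sample` are hypothetical and chosen only for illustration; `ddof=1` assumes that $s$ is the sample standard deviation computed with the $n-1$ divisor.

```python
import numpy as np

def point_estimates(sample):
    """Return (sample mean, sample standard deviation) as point estimates."""
    values = np.asarray(sample, dtype=float)
    x_bar = values.mean()
    # ddof=1 divides by (n - 1), which gives the sample standard deviation
    s = values.std(ddof=1)
    return x_bar, s

# hypothetical sample, used only for illustration
demo_sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
x_bar, s = point_estimates(demo_sample)
print("point estimate of the mean:", x_bar)
print("point estimate of the standard deviation:", s)
```

Note that `np.std()` uses `ddof=0` by default, which is the population formula; that is why `ddof=1` is passed explicitly here.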

### Example:

#### 1. Consider the data of grade points for 35 students in a data science course. Select grades of 20 students randomly from the data and find the point estimate for the population mean.

     Grades: [59.1, 65.0, 75.8, 79.2, 95.0, 99.8, 89.1, 65.2, 41.9, 55.2, 94.8, 84.1, 83.2, 74.0, 75.5, 76.2, 79.1, 80.1, 
              92.1, 74.2, 59.2, 64.0, 75, 78.2, 95.6, 97.8, 89.5, 64.2, 41.8, 57.2, 85, 91.4, 81.8, 74.6, 90]

In [5]:
# given population
grades = [59.1, 65.0, 75.8, 79.2, 95.0, 99.8, 89.1, 65.2, 41.9, 55.2, 94.8, 84.1, 83.2, 74.0, 75.5, 76.2, 79.1, 80.1, 
          92.1, 74.2, 59.2, 64.0, 75, 78.2, 95.6, 97.8, 89.5, 64.2, 41.8, 57.2, 85, 91.4, 81.8, 74.6, 90]


In [6]:
# AVERAGE
random.seed(1)
sample_grade=random.sample(grades,k=20)
sample_mean=np.mean(sample_grade)
sample_mean


75.32500000000002

#### 2. A financial firm has created 50 portfolios. From them, a sample of 13 portfolios was selected, out of which 8 were found to be underperforming. Can you estimate the number of underperforming portfolios?

In [7]:
# PROPORTION 
N=50
n=13
sample_proportion = 8/13
sample_proportion

0.6153846153846154

In [8]:
underperforming_portfolios=sample_proportion*N
round(underperforming_portfolios)

31

<a id="err"></a>
### 2.1.1 Sampling Error

Sampling error is considered as the absolute difference between the sample statistic used to estimate the parameter and the corresponding population parameter. Since the entire population is not considered as the sample, the values of mean, median, quantiles, and so on calculated on sample differ from the actual population values. 

One can reduce the sampling error either by increasing the sample size or determining the optimal sample size using various methods.
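
The effect of the sample size can be illustrated with a small simulation. This is only a sketch: the population (the integers 10 to 99), the number of trials, and the function name `mean_abs_sampling_error` are assumptions made for illustration.

```python
import random
import statistics

random.seed(0)

# hypothetical population: the integers 10..99 (90 values)
population = list(range(10, 100))
pop_mean = statistics.mean(population)

def mean_abs_sampling_error(n, trials=2000):
    """Average absolute sampling error of the mean over repeated samples of size n."""
    errors = []
    for _ in range(trials):
        # draw a sample of size n without replacement
        sample = random.sample(population, k=n)
        errors.append(abs(statistics.mean(sample) - pop_mean))
    return statistics.mean(errors)

errors_by_n = {n: mean_abs_sampling_error(n) for n in (5, 25, 90)}
for n, err in errors_by_n.items():
    print(f"n = {n:2d}: average absolute sampling error = {err:.3f}")
```

When the sample contains the whole population (`n = 90`), the sampling error is zero, since the sample mean and the population mean coincide.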

### Example:

#### 1. Consider the data for the number of ice-creams sold per day. An ice-cream vendor collected this data for 90 days and then a sample is drawn (without replacement) containing ice-creams sold for 25 days. 

data = [21, 93, 62, 76, 73, 20, 56, 95, 41, 36, 38, 13, 80, 88, 34, 18, 40, 11, 
        25, 29, 61, 23, 82, 10, 92, 69, 60, 87, 14, 91, 94, 49, 57, 83, 96, 55, 
        79, 52, 59, 39, 58, 17, 19, 98, 15, 54, 48, 46, 72, 45, 65, 28, 37, 30, 
        68, 75, 16, 33, 31, 99, 22, 51, 27, 67, 85, 47, 44, 77, 64, 97, 84, 42, 
        90, 70, 74, 89, 32, 26, 24, 12, 81, 53, 50, 35, 71, 63, 43, 86, 78, 66]
        
sample = [10, 22, 47, 66, 11, 57, 77, 98, 31, 63, 74, 84, 50, 96, 88, 92, 70, 54, 65, 44, 16, 72, 20, 90, 43]

Compute the sampling error for the mean.

In [9]:
pop = [21, 93, 62, 76, 73, 20, 56, 95, 41, 36, 38, 13, 80, 88, 
       34, 18, 40, 11, 25, 29, 61, 23, 82, 10, 92, 69, 60, 87, 
       14, 91, 94, 49, 57, 83, 96, 55, 79, 52, 59, 39, 58, 17, 
       19, 98, 15, 54, 48, 46, 72, 45, 65, 28, 37, 30, 68, 75, 
       16, 33, 31, 99, 22, 51, 27, 67, 85, 47, 44, 77, 64, 97, 
       84, 42, 90, 70, 74, 89, 32, 26, 24, 12, 81, 53, 50, 35, 
       71, 63, 43, 86, 78, 66]

sample = [10, 22, 47, 66, 11, 57, 77, 98, 31, 63, 74, 84, 50, 96, 
          88, 92, 70, 54, 65, 44, 16, 72, 20, 90, 43]

In [24]:
# signed difference = population mean - sample mean
# (the sampling error is the absolute value of this difference)
# standard error = sigma/sqrt(n)
mean_samp=np.mean(sample)
mean_pop=np.mean(pop)
print(mean_samp)
print(mean_pop)
sampling_error=mean_pop-mean_samp
print(sampling_error)
sigma=np.std(pop)    # population standard deviation (the full population is available here)
standard_error=sigma/np.sqrt(len(sample))
print(standard_error)

57.6
54.5
-3.1000000000000014
5.195831662656775


<a id="int"></a>
## 2.2 Interval Estimation for Mean

This method considers the range of values in which the population parameter is likely to lie. The confidence interval is an interval that describes the range of values in which the parameter lies with a specific probability. It is given by the formula,<br> <p style='text-indent:20em'> `conf_interval = sample statistic ± margin of error`</p>

The uncertainty of an estimate is described by the `confidence level` which is used to calculate the margin of error. 

<a id="large"></a>
### 2.2.1 Large Sample Size

The confidence interval for the population mean with $100(1-\alpha)$% confidence level is given as: $\overline{X} \pm Z_{\frac{\alpha}{2}}\frac{s}{\sqrt{n}}$

Where, <br>
$\overline{X}$: Sample mean<br>
$\alpha$: Level of significance<br>
$s$: Sample Standard deviation <br>
$n$: Sample size
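
The formula above can be wrapped in a small helper. This is a sketch under stated assumptions: the function name `z_confidence_interval` and the numbers in the usage lines are hypothetical, and `stats.norm.isf()` is used to obtain $Z_{\frac{\alpha}{2}}$ from the upper-tail probability.

```python
import numpy as np
from scipy import stats

def z_confidence_interval(x_bar, s, n, confidence=0.95):
    """Large-sample confidence interval: x_bar +/- Z_(alpha/2) * s / sqrt(n)."""
    alpha = 1 - confidence
    # Z-value that leaves alpha/2 in the upper tail
    z = stats.norm.isf(alpha / 2)
    margin_of_error = z * s / np.sqrt(n)
    return x_bar - margin_of_error, x_bar + margin_of_error

# hypothetical summary statistics, used only for illustration
lower, upper = z_confidence_interval(x_bar=50, s=10, n=100, confidence=0.95)
print(lower, upper)
```

A higher confidence level gives a larger $Z_{\frac{\alpha}{2}}$ and therefore a wider interval for the same data.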

In [44]:
alpha =  [0.1, 0.05, 0.02, 0.01] 
for i in range(len(alpha)):
    alpha_by_2 = alpha[i] / 2
alpha_by_2

0.005

In [45]:
# let us find the Z-values for different alpha values

# create an empty dataframe to store the alpha and corresponding Z-value
df_Z = pd.DataFrame()

# create a list of different alpha values
alpha =  [0.1, 0.05, 0.02, 0.01] 

# use for loop to calculate the value for each alpha
for i in range(len(alpha)):
    alpha_by_2 = alpha[i] / 2
    
    # use 'stats.norm.isf()' to find the Z-value corresponding to the upper tail probability 'q'
    # pass the value of 'alpha_by_2' to the parameter 'q'
    # use 'round()' to round-off the value to 4 digits
    Z = np.abs(round(stats.norm.isf(q = alpha_by_2), 4))
    
    # create a dataframe using dictionary to store the alpha and corresponding Z-value
    # set the loop iterator 'i' as the index of the dataframe
    row =  pd.DataFrame({"alpha": alpha[i], "Z_alpha_by_2" : Z}, index = [i])
    
    # add the row to the dataframe 'df_Z' ('DataFrame.append()' was removed in pandas 2.0)
    df_Z = pd.concat([df_Z, row])

# print the final dataframe
df_Z

|   | alpha | Z_alpha_by_2 |
|---|-------|--------------|
| 0 | 0.1   | 1.6449       |
| 1 | 0.05  | 1.96         |
| 2 | 0.02  | 2.3263       |
| 3 | 0.01  | 2.5758       |


To calculate the confidence interval with 95% confidence, use the Z-value corresponding to `alpha = 0.05`. 

## SF and ISF

In [47]:
# 95% alpha = 1-CI = 1-0.95 = 0.05, 
# alpha/2 = 0.025 
stats.norm.isf(0.025)

1.9599639845400545

In [50]:
# 90%
stats.norm.isf(0.05)

1.6448536269514729

In [51]:
# 99%
stats.norm.isf(0.005)

2.575829303548901

In [10]:
0.05/2

0.025

### Example:

#### 1. A random sample of weight (in kg.) for 35 diabetic patients is drawn from the population with a standard deviation of 8 kg. Find the 90% confidence interval for the population mean.

    Weight: [59.1, 65.0, 75.8, 79.2, 95.0, 99.8, 89.1, 65.3, 41.9, 55.2, 94.8, 84.1, 83.2, 74.0, 75.5, 76.2, 79.1, 80.1, 
             92.1, 74.2, 59.2, 64.0, 75, 78.2, 95.6, 97.8, 89.5, 64.2, 41.8, 57.2, 85, 91.4, 81.8, 74.6, 90]

In [55]:
sample=[59.1, 65.0, 75.8, 79.2, 95.0, 99.8, 89.1, 65.3, 41.9, 55.2, 
         94.8, 84.1, 83.2, 74.0, 75.5, 76.2, 79.1, 80.1, 92.1, 74.2, 
         59.2, 64.0, 75, 78.2, 95.6, 97.8, 89.5, 64.2, 41.8, 57.2, 
         85, 91.4, 81.8, 74.6, 90]

In [60]:
sigma=8
xbar=np.mean(sample) 
n=len(sample)


In [59]:
# ci=90%
# alpha=1-ci
# alpha=10%
# alpha/2=5%
z=stats.norm.isf(0.05)
ll=xbar-z*(sigma/np.sqrt(n))
ul=xbar+z*(sigma/np.sqrt(n))
print(ll)
print(ul)

74.46146621975642
78.90996235167215


#### Practice

#### 2. There are 150 apples on a tree. You randomly choose 40 apples and found that the average weight of apples is 182 grams with a standard deviation of 30 grams. Find the 95% confidence interval for the population mean.

In [103]:
s=30
xbar=182
n=40

In [104]:
z=stats.norm.isf(0.025)
ll=xbar-z*(s/np.sqrt(n))
ul=xbar+z*(s/np.sqrt(n))
print(ll)
print(ul)

172.70307451543158
191.29692548456842


In [105]:
# stats.norm.interval(alpha,loc,scale)
stats.norm.interval(.95,xbar,s/np.sqrt(n))

(172.70307451543158, 191.29692548456842)

#### 3. A movie production house needs to estimate the average monthly wage of the technical crew members. Previous data shows that the standard deviation of the wages is 190 dollars. The production team wants the error in the estimate of the average wage not to exceed 54 dollars. The team has decided to use a small subset of wages for the estimation. Find a suitable number of wages to be considered to get the estimate with 90% confidence.

In [3]:
sigma=190
margin_error=54
z=stats.norm.isf(5/100)

In [82]:
# round up so that the margin of error does not exceed the required value
n=np.ceil((z*(sigma/margin_error))**2)
n

34.0

In [4]:
z

1.6448536269514729
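
The sample size follows from rearranging the margin of error $E = Z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}$ into $n = \left(\frac{Z_{\frac{\alpha}{2}}\sigma}{E}\right)^{2}$. The sketch below wraps this in a helper; the function name `required_sample_size` is hypothetical, and the result is rounded up on the assumption that the margin of error must not exceed the target.

```python
import math
from scipy import stats

def required_sample_size(sigma, margin_of_error, confidence=0.90):
    """Smallest n for which Z_(alpha/2) * sigma / sqrt(n) <= margin_of_error."""
    z = stats.norm.isf((1 - confidence) / 2)
    # rounding up keeps the margin of error at or below the target
    return math.ceil((z * sigma / margin_of_error) ** 2)

# values from the wage problem: sigma = 190, margin of error = 54, 90% confidence
n_required = required_sample_size(sigma=190, margin_of_error=54, confidence=0.90)
print(n_required)
```

Rounding down instead would leave the margin of error slightly above the target, because a smaller $n$ makes $\frac{\sigma}{\sqrt{n}}$ larger.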

#### 4. 100 bags of coal were tested and had an average of 35% of ash with a standard deviation of 15%. Calculate the margin of error for a 90% confidence level.

In [68]:
n=100
xbar=35 # not needed for this question
s=0.15  # 15% expressed as a proportion
z=stats.norm.isf(0.05)  # Z-value for a 90% confidence level
margin_error=z*(s/np.sqrt(n))
print(margin_error)

0.02467280440427209


### T Distribution and Z Distribution

In [88]:
# 95% confidence 
z=stats.norm.isf(0.025)
z

1.9599639845400545

In [90]:
z=stats.t.isf(0.025,df=10)
z

2.2281388519649385

In [91]:
z=stats.t.isf(0.025,df=100)
z

1.983971518449634

In [92]:
z=stats.t.isf(0.025,df=1000)
z

1.9623390808264078

In [98]:
z=stats.t.isf(0.025,df=1000000000000000000)
z

1.9599639845400543

##### The t-distribution approaches the Z (standard normal) distribution as the degrees of freedom increase
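
The cells above can be condensed into one loop. This is a small sketch; the chosen degrees of freedom and the dictionary name `gaps` are only illustrative.

```python
from scipy import stats

# critical value of the standard normal distribution for 95% confidence
z_crit = stats.norm.isf(0.025)

# difference between the t critical value and the Z critical value
gaps = {}
for df in (5, 10, 30, 100, 1000):
    t_crit = stats.t.isf(0.025, df=df)
    gaps[df] = t_crit - z_crit
    print(f"df = {df:4d}: t = {t_crit:.4f}, t - z = {gaps[df]:.4f}")
```

The difference stays positive and shrinks as the degrees of freedom grow, which is why the Z-value is often used for large samples.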

<a id="small"></a>
### 2.2.2 Small Sample Size

Let us take a sample of `n` observations from the population such that, $n < 30$. Here the standard deviation of the population is unknown. The confidence interval for the population mean with $100(1-\alpha)$% confidence level is given as: $\overline{X} \pm t_{\frac{\alpha}{2}, n-1}\frac{s}{\sqrt{n}}$

Where, <br>
$\overline{X}$: Sample mean<br>
$\alpha$: Level of significance<br>
$s$: Sample standard deviation<br>
$n-1$: degrees of freedom

The ratio $\frac{s}{\sqrt{n}}$ is the estimate of the standard error of the mean. And $t_{\frac{\alpha}{2}, n-1}\frac{s}{\sqrt{n}}$ is the margin of error for the estimate.

The value of $t_{\frac{\alpha}{2}, n-1}$ for different $\alpha$ values can be obtained using the `stats.t.isf()` from the scipy library.  
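
The small-sample interval can be sketched as a helper in the same way. The function name `t_confidence_interval` and the summary statistics in the usage lines are hypothetical and used only for illustration.

```python
import numpy as np
from scipy import stats

def t_confidence_interval(x_bar, s, n, confidence=0.95):
    """Small-sample confidence interval: x_bar +/- t_(alpha/2, n-1) * s / sqrt(n)."""
    alpha = 1 - confidence
    # t-value with n - 1 degrees of freedom that leaves alpha/2 in the upper tail
    t = stats.t.isf(alpha / 2, df=n - 1)
    margin_of_error = t * s / np.sqrt(n)
    return x_bar - margin_of_error, x_bar + margin_of_error

# hypothetical summary statistics, used only for illustration
lower, upper = t_confidence_interval(x_bar=20, s=4, n=11, confidence=0.95)
print(lower, upper)
```

For the same inputs this interval is wider than the one based on the Z-value, which reflects the extra uncertainty from estimating $\sigma$ with a small sample.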

In [85]:
# let us find the t-values for different alpha values with 10 degrees of freedom

# create an empty dataframe to store the alpha and corresponding t-value
df_t = pd.DataFrame()

# create a list of different alpha values
alpha =  [0.1, 0.05, 0.02, 0.01] 

# use for loop to calculate the t-value for each alpha value
for i in range(len(alpha)):
    alpha_by_2 = alpha[i] / 2
    
    # use 'stats.t.isf()' to find the t-value corresponding to the upper tail probability 'q'
    # pass the value of 'alpha_by_2' to the parameter 'q'
    # pass the 10 degrees of freedom to the parameter 'df' 
    # use 'round()' to round-off the value to 2 digits
    t = np.abs(round(stats.t.isf(q = alpha_by_2, df = 10), 2))

    # create a dataframe using dictionary to store the alpha and corresponding t-value 
    # set the loop iterator 'i' as the index of the dataframe
    row =  pd.DataFrame({"alpha": alpha[i], "t_alpha_by_2": t}, index = [i])

    # add the row to the dataframe 'df_t' ('DataFrame.append()' was removed in pandas 2.0)
    df_t = pd.concat([df_t, row])

# print the final dataframe
df_t

|   | alpha | t_alpha_by_2 |
|---|-------|--------------|
| 0 | 0.1   | 1.81         |
| 1 | 0.05  | 2.23         |
| 2 | 0.02  | 2.76         |
| 3 | 0.01  | 3.17         |


### Example:

#### 1. There are 150 apples on a tree. You randomly choose 17 apples and found that the average weight of apples is 78 grams with a standard deviation of 23 grams. Find the 90% confidence interval for the population mean.

In [117]:
n=17
xbar=78
s=23
t=stats.t.isf(0.05,df=n-1)
ll=xbar-t*(s/np.sqrt(n))
ul=xbar+t*(s/np.sqrt(n))
print(ll)
print(ul)

68.26090326067306
87.73909673932694


In [119]:
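# the confidence level is passed positionally because the keyword 'alpha' was renamed to 'confidence' in newer SciPy versions
stats.t.interval(.90,df=n-1,loc=xbar,scale=s/np.sqrt(n))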

(68.26090326067306, 87.73909673932694)

<a id="prop"></a>
## 2.3 Interval Estimation for Proportion

Consider a population in which each observation is either a success or a failure. The population proportion is denoted by `P`, which is the ratio of the number of successes to the size of the population.

The confidence interval for the population proportion with $100(1-\alpha)$% confidence level is given as: $p \pm Z_{\frac{\alpha}{2}}\sqrt{\frac{p(1 - p)}{n}}$

Where, <br>
$p$: Sample proportion<br>
$\alpha$: Level of significance<br>
$n$: Sample size

The quantity $Z_{\frac{\alpha}{2}}\sqrt{\frac{p(1 - p)}{n}}$ is the margin of error.
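
A minimal sketch of this interval as a helper follows. The function name `proportion_confidence_interval` and the counts in the usage lines are hypothetical and used only for illustration.

```python
import numpy as np
from scipy import stats

def proportion_confidence_interval(successes, n, confidence=0.95):
    """Confidence interval for a proportion: p +/- Z_(alpha/2) * sqrt(p(1-p)/n)."""
    p = successes / n
    z = stats.norm.isf((1 - confidence) / 2)
    margin_of_error = z * np.sqrt(p * (1 - p) / n)
    return p - margin_of_error, p + margin_of_error

# hypothetical counts, used only for illustration: 40 successes in a sample of 100
lower, upper = proportion_confidence_interval(successes=40, n=100, confidence=0.95)
print(round(lower, 4), round(upper, 4))
```

The interval relies on the normal approximation, so it is most reliable when both the number of successes and the number of failures in the sample are reasonably large.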

### Example:

#### 1. A financial firm has created 50 portfolios. From them, a sample of 13 portfolios was selected, out of which 8 were found to be underperforming. Construct a 99% confidence interval to estimate the population proportion.

In [127]:
n=13
p=8/13
z=stats.norm.isf(0.5/100)
ll=(p-(z*(p*(1-p)/n)**0.5))*100
ul=(p+(z*(p*(1-p)/n)**0.5))*100
print(np.floor(ll),'%')
print(np.floor(ul),'%')

26.0 %
96.0 %


In [131]:
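# the confidence level is passed positionally because the keyword 'alpha' was renamed to 'confidence' in newer SciPy versions
stats.norm.interval(.99,loc=p,scale=np.sqrt(p*(1-p)/n))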

(0.26782280814713805, 0.9629464226220927)

We are 99% confident that the population proportion lies between roughly 26% and 96%.

<a id="defn"></a>
# 3. Test of Hypothesis

It is the process of evaluating the validity of the claim made using the sample data obtained from the population. A statistical test is a rule used to decide the acceptance or rejection of the claim.

**Examples of hypothesis:**

        1. One can get 'A' grade if the attendance in the class is more than 75%.
        2. A probiotic drink can improve the immunity of a person. 

<a id="types"></a>
## 3.1 Types of Hypothesis

`Null Hypothesis`: The null hypothesis is the claim suggesting 'no difference'. It is denoted as H<sub>0</sub>.

`Alternative Hypothesis`: It is the hypothesis that is tested against the null hypothesis. The acceptance or rejection of the hypothesis is based on the likelihood of H<sub>0</sub> being true. It is denoted by H<sub>a</sub> or H<sub>1</sub>.



<a id="test_type"></a>
# 4. Types of Test

The hypothesis test is used to validate the claim given by the null hypothesis. The types of tests are based on the nature of the alternative hypothesis. 

<a id="2tailed"></a>
## 4.1 Two Tailed Test

A two tailed test checks whether the value of the population parameter differs from a specific value, i.e. it may be either less than or greater than that value. <br>
If we test the population mean ($\mu$) with a specific value ($\mu_{0}$) the null hypothesis is: $H_{0}: \mu = \mu_{0}$. 

The alternative hypothesis for the two tailed test is given as: $H_{1}: \mu \neq \mu_{0}$

#### Example:

A company that produces tennis balls claimed that the diameter of a tennis ball is 2.625 inches on average. To test the company's claim, a statistical test can be performed considering the hypothesis:

                    

In [None]:
# framing hypothesis
# h0: Average diameter of a tennis ball is 2.625 (mu = 2.625)
# ha: Average diameter of a tennis ball is not 2.625 (mu != 2.625)   # also written as h1

<a id="1tailed"></a>
## 4.2 One Tailed Test

A one tailed test checks whether the value of the population parameter is less than or greater than (but not both) a specific value. <br>
If we test the population mean ($\mu$) against a specific value ($\mu_{0}$) with the null hypothesis $H_{0}: \mu \leq \mu_{0}$ and the alternative hypothesis $H_{1}: \mu > \mu_{0}$, the one tailed test is also known as a `right-tailed test`.

If the null hypothesis is $H_{0}: \mu \geq \mu_{0}$ and the alternative hypothesis is $H_{1}: \mu < \mu_{0}$, the one tailed test is also known as a `left-tailed test`.


### Example:

**1.** The company's annual quality report of machines states that a lathe machine works efficiently at most for 8 months on average after the servicing. The production manager claims that after the special tuxan servicing, the machine works efficiently for more than 8 months. To test the claim of production manager consider the hypothesis:

                    Null Hypothesis: Machine efficiency ≤ 8 months
                    Alternative Hypothesis: Machine efficiency > 8 months

This is the example of a **right-tailed test**. 

**2.** A railway authority claims that all the trains on the Chicago-Seattle route run with a speed of at least 54 mph on average. A customer forum declares that there are various records from passengers claiming that the speed of the train is less than what railway has claimed. In this scenario, a statistical test can be performed to test the claim of customer forum considering the hypothesis:

                    Null Hypothesis: Speed ≥ 54 mph
                    Alternative Hypothesis: Speed < 54 mph

This is the example of a **left-tailed test**. 

<a id="eg"></a>
# 5. Hypothesis Tests with Z Statistic

Let us perform one sample Z test for the population mean. We compare the population mean with a specific value. The sample is assumed to be taken from a population following a normal distribution.

To check the normality of the data, a test for normality is used. The `Shapiro-Wilk Test` is one of the methods used to check the normality. The hypothesis of the test is given as:
<p style='text-indent:25em'> <strong> H<sub>0</sub>:  The data is normally distributed </strong> </p>
<p style='text-indent:25em'> <strong> H<sub>1</sub>:  The data is not normally distributed </strong> </p>

The `shapiro()` from scipy library performs a Shapiro-Wilk normality test. 
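
A short sketch of how `shapiro()` can be used is given below. The two data sets are simulated and hypothetical (one drawn from a normal distribution, one from a skewed exponential distribution), and the 0.05 significance level is an assumption.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)

# hypothetical data sets, simulated only for illustration
samples = {
    "normal": rng.normal(loc=50, scale=5, size=200),
    "skewed": rng.exponential(scale=5, size=200),
}

alpha = 0.05
results = {}
for name, data in samples.items():
    # shapiro() returns the test statistic W and the p-value
    stat, p_value = shapiro(data)
    results[name] = (stat, p_value)
    decision = "reject H0 (not normal)" if p_value < alpha else "fail to reject H0"
    print(f"{name}: W = {stat:.4f}, p-value = {p_value:.4g} -> {decision}")
```

A p-value below the significance level leads to rejecting the null hypothesis of normality; otherwise the data is treated as consistent with a normal distribution.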

The null and alternative hypothesis of Z-test is given as:
<p style='text-indent:25em'> <strong> $H_{0}: \mu = \mu_{0}$ or $\mu \geq \mu_{0}$ or $\mu \leq \mu_{0}$</strong></p>
<p style='text-indent:25em'> <strong> $H_{1}: \mu \neq \mu_{0}$ or $\mu < \mu_{0}$ or $\mu > \mu_{0}$</strong></p>

Consider a normal population with standard deviation $\sigma$. Let us take a sample of size n, 
The test statistic for one sample Z-test is given as:
<p style='text-indent:25em'> <strong> $Z = \frac{\overline{X} -  \mu_{0}}{\frac{\sigma}{\sqrt{n}}}$</strong></p>

Where, <br>
$\overline{X}$: Sample mean<br>
$\mu_{0}$: Specified mean (the value given in the null hypothesis)<br>
$\sigma$: Population standard deviation<br>
$n$: Sample size
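
The test statistic and the p-value for each type of alternative hypothesis can be sketched in one helper. The function name `one_sample_z_test` and its `alternative` argument are hypothetical; the numbers in the usage lines are taken from the car mileage example in this section.

```python
import numpy as np
from scipy import stats

def one_sample_z_test(x_bar, mu_0, sigma, n, alternative="two-sided"):
    """Return (Z statistic, p-value) for a one sample Z-test."""
    z = (x_bar - mu_0) / (sigma / np.sqrt(n))
    if alternative == "two-sided":    # H1: mu != mu_0
        p_value = 2 * stats.norm.sf(abs(z))
    elif alternative == "less":       # H1: mu < mu_0 (left-tailed)
        p_value = stats.norm.cdf(z)
    elif alternative == "greater":    # H1: mu > mu_0 (right-tailed)
        p_value = stats.norm.sf(z)
    else:
        raise ValueError("alternative must be 'two-sided', 'less' or 'greater'")
    return z, p_value

# car mileage example: mu_0 = 25, sigma = 2.5, n = 45, sample mean = 24
z_stat, p_two_sided = one_sample_z_test(24, 25, 2.5, 45)
_, p_left = one_sample_z_test(24, 25, 2.5, 45, alternative="less")
print(z_stat, p_two_sided, p_left)
```

The key design point is that the tail used for the p-value follows the alternative hypothesis: `cdf` for a left-tailed test, `sf` for a right-tailed test, and twice the upper tail of $|Z|$ for a two tailed test.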





### Example:

#### 1. A car manufacturing company claims that the mileage of their new car is 25 kmph with a standard deviation of 2.5 kmph. A random sample of 45 cars was drawn and recorded their mileage as per the standard procedure. From the sample, the mean mileage was seen to be 24 kmph. Is this evidence to claim that the mean mileage is different from 25kmph? (assume the normality of the data) Use α = 0.01.

In [13]:
# practice
mu=25
sigma= 2.5
n=45
xbar=24
sig_lvl=0.01
z=(xbar-mu)/(sigma/np.sqrt(n))
pval= (stats.norm.sf(abs(z)))*2
if pval< sig_lvl:
    print('h0 is rejected \n the new car mileage is different from 25 kmph')
else:
    print('failed to reject h0 \nthe new car mileage is 25 kmph')

h0 is rejected 
 the new car mileage is different from 25 kmph


In [14]:
# Framing hypothesis (two tailed test)
# h0: the mileage of the new car is 25 (mu = 25)
# ha: the mileage of the new car is not 25 (mu != 25)
mu=25            # claimed population mean
sigma=2.5        # population standard deviation
n=45             # sample size
xbar=24          # sample mean
sig_lvl=0.01     # significance level (alpha)
z=(xbar-mu)/(sigma/np.sqrt(n))
# two tailed p-value: twice the upper-tail probability of |z|
pval=(stats.norm.sf(abs(z)))*2
# when z is negative, stats.norm.cdf(z)+stats.norm.sf(abs(z)) gives the same value
print(pval)


0.007290358091535638


In [153]:
# pval=0.0072
# sig_lvl=0.01 (if no significance level is given, 0.05 is commonly used)
# pval<sig_lvl  -> h0 is rejected
# pval>=sig_lvl -> we fail to reject h0
# Null hypothesis is rejected
# i.e. average mileage of the car != 25

In [146]:
# Data should be normal
# Population standard deviation is known
# two tail test
# one sample z test (two tailed)


In [15]:
#practice
# h0: The average calories in a slice bread of the brand 'Alphas' are 82 (mu=82)
# ha: The average calories in a slice bread of the brand 'Alphas' not 82 (mu!=82)
mu=82
sigma=15
n=40
xbar=95
sig_lvl=0.05
z=(xbar-mu)/(sigma/np.sqrt(n))
pval=(stats.norm.sf(abs(z)))*2
if pval<sig_lvl:
    print('h0 is rejected \nThe average calories in a slice bread of the brand Alphas not 82 ')
else:
    print('fail to reject h0 \nThe average calories in a slice bread of the brand Alphas are 82 ')

h0 is rejected 
The average calories in a slice bread of the brand Alphas not 82 


#### 2. The average calories in a slice bread of the brand 'Alphas' are 82 with a standard deviation of 15. An experiment is conducted to test the claim of the dietitians that the calories in a slice of bread are not as per the manufacturer's specification. A sample of 40 slices of bread is taken and the mean calories recorded are 95. Test the claim of dietitians with ⍺ value (significance level) as 0.05. (assume the normality of the data).

In [48]:
#Framing Hypothesis
#h0: average calories of bread = 82 (mu=82)
#ha: average calories of bread != 82 (mu!=82)

In [49]:
# Data should be normal
# Population standard deviation is known
# two tail test
# one sample z test (two tailed)


In [164]:
mu=82
sigma=15
n=40
xbar=95
sig_lvl=0.05
z=(xbar-mu)/(sigma/np.sqrt(n))
print('z= ',z)
pval=(stats.norm.sf(abs(z)))*2                             # stats.norm.cdf(z)+stats.norm.sf(abs(z))   only for two tail
# pval=stats.norm.cdf(z)+stats.norm.sf(abs(z))           # same answer another way    
print('pval= ',pval)


z=  5.4812812776251905
pval=  4.222565249683579e-08


In [162]:
print(pval<sig_lvl)
print('h0 is rejected')
print('average calories of bread != 82')

True
h0 is rejected
average calories of bread != 82 


In [163]:
# pval=4.222565249683579e-08
# sig_lvl=0.05 (if no significance level is given, 0.05 is commonly used)
# pval<sig_lvl  -> h0 is rejected
# pval>=sig_lvl -> we fail to reject h0
# Null hypothesis is rejected
# i.e. average calories of bread != 82

In [16]:
#practice 
#h0: The average IQ of adult population is 100
#ha: The average IQ of adult population not 100
mu=100
sigma=15
n= 75
xbar=105
sig_lvl=0.05
z=(xbar-mu)/(sigma/np.sqrt(n))
pval=(stats.norm.sf(abs(z)))*2   # use the test statistic z, not the sample size n
if pval<sig_lvl:
    print('h0 is rejected \nThe average IQ of adult population not 100')
else:
    print('fail to reject h0 \nThe average IQ of adult population is 100')

h0 is rejected 
The average IQ of adult population not 100


#### 3. The average IQ of the adult population is 100 with an SD of 15. A researcher believes that this value has changed. The researcher decides to test the IQ of 75 random adults. The average IQ of the 75 random adults came out to be 105. Is there enough evidence to believe that the IQ has changed for the population?


In [54]:
#Framing Hypothesis
#h0: The average IQ of adult population is 100 (mu=100)
#ha: The average IQ of adult population <> 100 (mu<>100)

In [55]:
# Data should be normal
# Population standard deviation is known
# two tail test
# one sample z test (two tailed)

In [166]:
mu=100
sigma=15
n=75
xbar=105
sig_lvl=0.05
z=(xbar-mu)/(sigma/np.sqrt(n))
print('z= ',z)
pval=(stats.norm.sf(abs(z)))*2                             # stats.norm.cdf(z)+stats.norm.sf(abs(z))   only for two tail
# pval=stats.norm.cdf(z)+stats.norm.sf(abs(z))           # same answer another way    
print('pval= ',pval)
print(pval<sig_lvl)


z=  2.886751345948129
pval=  0.003892417122778628
True


In [169]:
print('h0 is rejected')
print('The average IQ of adult population <> 100')

h0 is rejected
The average IQ of adult population <> 100


In [17]:
# practice
#h0: the vaccines contain greater than or equal to 3 mg of acid (mu>=3)
#ha: the vaccines contain less than 3 mg of acid (mu<3)
acid_amt = [2.57, 3.06, 3.28 , 3.24, 2.79, 3.40, 3.36, 3.07, 2.46, 3.03, 3.05, 2.94, 3.46, 3.19, 3.09, 2.81, 3.13, 2.88, 
            2.76, 2.75, 3.17, 2.89, 2.54, 3.18, 3.08, 2.60, 3.06, 3.13, 3.11, 3.08, 2.93, 2.90, 3.06, 2.97, 3.24, 2.86, 
            2.87, 3.18, 3, 2.95]
mu=3
sigma=1.2
n=40
sig_lvl=0.05
xbar= np.mean(acid_amt)
z=(xbar-mu)/(sigma/np.sqrt(n))
pval=stats.norm.cdf(z)   # left-tailed test: the p-value is the lower-tail probability
if pval<sig_lvl:
    print('h0 is rejected \nthat the vaccines contain less than 3 mg of acid')
else:
    print('failed to reject h0 \nthat the vaccines contain greater than and equal to 3 mg of acid')

failed to reject h0 
that the vaccines contain greater than and equal to 3 mg of acid


#### 4. The label of a typhoid vaccine in the market states that it contains 3 mg of ascorbic acid, with a standard deviation of 1.2 mg. A research team claims that the vaccines contain less than 3 mg of acid. We collected the data of 40 vaccines by using random sampling from a population and recorded the amount of ascorbic acid. Test the claim of the research team using the sample data with α value (significance level) 0.05.

    acid_amt = [2.57, 3.06, 3.28 , 3.24, 2.79, 3.40, 3.36, 3.07, 2.46, 3.03, 3.05, 2.94, 3.46, 3.19, 3.09, 2.81, 3.13, 2.88, 
                2.76, 2.75, 3.17, 2.89, 2.54, 3.18, 3.08, 2.60, 3.06, 3.13, 3.11, 3.08, 2.93, 2.90, 3.06, 2.97, 3.24, 2.86, 
                2.87, 3.18, 3, 2.95]

In [60]:
#Framing Hypothesis
#h0: vaccines contain greater than or equal to 3 mg of acid (mu>=3)
#ha: vaccines contain less than 3 mg of acid (mu<3)

In [172]:
acid_amt = [2.57, 3.06, 3.28 , 3.24, 2.79, 3.40, 3.36, 3.07, 2.46, 3.03, 3.05, 2.94, 3.46, 3.19, 3.09, 2.81, 3.13, 2.88, 
            2.76, 2.75, 3.17, 2.89, 2.54, 3.18, 3.08, 2.60, 3.06, 3.13, 3.11, 3.08, 2.93, 2.90, 3.06, 2.97, 3.24, 2.86, 
            2.87, 3.18, 3, 2.95]
xbar=np.mean(acid_amt)
print(xbar)
mu=3
sigma=1.2
n=40
z=(xbar-mu)/(sigma/np.sqrt(n))
print('z= ',z)
pval=(stats.norm.cdf(z))
print(pval<sig_lvl)      

3.003
z=  0.015811388300842496
False


In [173]:
print('fail to reject h0 ')
print('acid vaccines contain greater than equal to 3')

fail to reject h0 
acid vaccines contain greater than equal to 3



In [18]:
# practice
#h0: the pvc pipe thickness is greater than or equal to 13 mm (mu>=13)
#ha: the pvc pipe thickness is less than 13 mm (mu<13)
n=900
xbar=12.5
mu=13
sigma=1
sig_lvl=0.05
z=(xbar-mu)/(sigma/np.sqrt(n))
# left-tailed test: the p-value is the lower-tail probability
pval= stats.norm.cdf(z)
if pval<sig_lvl:
    print('h0 is rejected \nthe pvc pipe thickness is less than 13 mm')
else:
    print('fail to reject h0 \nthe pvc pipe thickness is greater than or equal to 13 mm')

h0 is rejected 
the pvc pipe thickness is less than 13 mm

#### 5. A sample of 900 PVC pipes is found to have an average thickness of 12.5 mm. The sample is coming from a normal population. Is there any evidence that the PVC pipe thickness is less than 13 mm? The population standard deviation is 1 mm. Test the hypothesis at 5% level of significance.

In [179]:
#Framing Hypothesis
#h0: pvc pipe thickness is greater than or equal to 13 (mu>=13)
#ha: pvc pipe thickness is less than 13 (mu<13)


In [None]:
# Data should be normal
# Population standard deviation is known
# One tail test(left tail)

In [21]:
n=900
xbar=12.5
mu=13
sigma=1
sig_lvl=0.05
z=(xbar-mu)/(sigma/np.sqrt(n))
# left-tailed test: the p-value is the lower-tail probability
pval= stats.norm.cdf(z)
if pval<sig_lvl:
    print('h0 is rejected \nthe pvc pipe thickness is less than 13 mm')
else:
    print('fail to reject h0 \nthe pvc pipe thickness is greater than or equal to 13 mm')

h0 is rejected 
the pvc pipe thickness is less than 13 mm


<a id="2z"></a>
## 5.2 Two Sample Z Test

Let us perform a two sample Z test for the population mean. We compare the means of two independent populations. The samples are assumed to be taken from populations that follow a normal distribution. Also, the populations are assumed to have equal variances.

The `Shapiro-Wilk Test` is used to check the normality of the data. The assumption of equal variances of the populations is tested using the `Levene's Test`. 
The hypothesis of the Levene's test is given as:
<p style='text-indent:25em'> <strong> H<sub>0</sub>:  The variances are equal</strong> </p>
<p style='text-indent:25em'> <strong> H<sub>1</sub>:  The variances are not equal </strong> </p>

The `levene()` from scipy library performs a Levene's test. 
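
A short sketch of how `levene()` can be used is given below. The three groups are simulated and hypothetical, and the 0.05 significance level is an assumption.

```python
import numpy as np
from scipy.stats import levene

rng = np.random.default_rng(7)

# hypothetical groups, simulated only for illustration
group_a = rng.normal(loc=10, scale=1, size=300)
group_b = rng.normal(loc=10, scale=1, size=300)   # same spread as group_a
group_c = rng.normal(loc=10, scale=4, size=300)   # much larger spread

alpha = 0.05
# levene() returns the test statistic and the p-value
stat_equal, p_equal = levene(group_a, group_b)
stat_unequal, p_unequal = levene(group_a, group_c)

for label, p_value in [("a vs b", p_equal), ("a vs c", p_unequal)]:
    decision = "reject H0 (variances differ)" if p_value < alpha else "fail to reject H0"
    print(f"{label}: p-value = {p_value:.4g} -> {decision}")
```

A p-value below the significance level leads to rejecting the null hypothesis of equal variances.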

The null and alternative hypothesis of two sample Z-test is given as:

<p style='text-indent:25em'> <strong> $H_{0}: \mu_{1} - \mu_{2} = \mu_{0}$ or $\mu_{1} - \mu_{2} \geq \mu_{0}$ or $\mu_{1} -\mu_{2} \leq \mu_{0}$</strong></p>
<p style='text-indent:25em'> <strong> $H_{1}: \mu_{1} - \mu_{2} \neq \mu_{0} $ or $\mu_{1} - \mu_{2} < \mu_{0}$ or $\mu_{1} -\mu_{2} > \mu_{0}$</strong></p>



The test statistic for two sample Z-test is given as:
<p style='text-indent:25em'> <strong> $Z = \frac{(\overline{X_{1}} - \overline{X_{2}})  - \mu_{0}} {\sqrt{\frac{\sigma_{1}^{2}}{n_{1}} + \frac{\sigma_{2}^{2}}{n_{2}}}}$</strong></p>

Where, <br>
$\overline{X_{1}}$, $\overline{X_{2}}$ : Mean of both the samples<br>
$\mu_{0}$: Mean difference given in the null hypothesis<br>
$\sigma_{1}, \sigma_{2}$: Standard deviation of both the populations<br>
$n_{1}, n_{2}$: Size of samples from both the populations
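
The two sample statistic can be sketched as a helper in the same way. The function name `two_sample_z_test` and its arguments are hypothetical; the numbers in the usage lines are taken from the haemoglobin example in this section, and only the two tailed p-value is computed.

```python
import numpy as np
from scipy import stats

def two_sample_z_test(x_bar1, x_bar2, sigma1, sigma2, n1, n2, mu_0=0):
    """Return (Z statistic, two tailed p-value) for a two sample Z-test."""
    # standard error of the difference between the two sample means
    std_error = np.sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    z = ((x_bar1 - x_bar2) - mu_0) / std_error
    # two tailed p-value: twice the upper-tail probability of |z|
    p_value = 2 * stats.norm.sf(abs(z))
    return z, p_value

# haemoglobin example: 160 males (mean 13, SD 4.1) and 180 females (mean 15, SD 3.5)
z_stat, p_value = two_sample_z_test(13, 15, 4.1, 3.5, 160, 180)
print(z_stat, p_value)
```

For a one tailed alternative, the p-value would instead be `stats.norm.cdf(z)` (left-tailed) or `stats.norm.sf(z)` (right-tailed).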




In [36]:
#practice 
#hypothesis
#h0:the population means of concentrations of the elements are the same for men and women
#ha:the population means of concentrations of the elements are not same for men and women
n1=160
n2=180
xbar1=13
xbar2=15
sigma1=4.1
sigma2=3.5
sig_lvl=0.01
z=(xbar1-xbar2)/(np.sqrt(((sigma1**2)/n1)+((sigma2**2)/n2)))
pval=stats.norm.sf(abs(z))*2
if pval<sig_lvl:
    print('h0 is rejected \nthe population means of concentrations of the elements are not same for men and women')
else:
    print('fail to reject h0 \nthe population means of concentrations of the elements are the same for men and women')

h0 is rejected 
the population means of concentrations of the elements are not same for men and women


#### 1. A study was carried out to understand the amount of haemoglobin in blood for males and females. A random sample of 160 males and 180 females has means of 13 g/dl and 15 g/dl, respectively. The two populations have standard deviations of 4.1 g/dl for male donors and 3.5 g/dl for female donors. Can it be said that the population means of concentrations of the elements are the same for men and women? Use α = 0.01. Assume the data is normally distributed.

In [9]:
#h0:mu1=mu2 average concentration of the element for women and men is the same
#h1:mu1!=mu2 average concentration of the element for women and men is not the same
n1=160
n2=180
xbar1=13
xbar2=15
sigma1=4.1
sigma2=3.5
sig_lvl=0.01
# data is normal population SD known
zstat=(((xbar1-xbar2)-(0))/(np.sqrt((sigma1**2/n1)+(sigma2**2/n2))))
print(zstat)


-4.806830552525058


In [12]:
pval=stats.norm.sf(abs(zstat))*2

In [14]:
print(pval<sig_lvl)   
print('h0 is rejected')
print('average concentration of the element for women and men is not the same')

True
h0 is rejected
average concentration of the element for women and men is not the same
