# Faculty Notebook - Day - 02

<a id="lib"></a>
# 1. Import Libraries

**Let us import the required libraries.**

In [7]:
# import 'pandas' 
import pandas as pd 

# import 'numpy' 
import numpy as np

# import subpackage of matplotlib
import matplotlib.pyplot as plt

# import 'seaborn'
import seaborn as sns

# to suppress warnings 
from warnings import filterwarnings
filterwarnings('ignore')

# import 'random' to generate random sample
import random

# import statistics to perform statistical computation  
import statistics

# import 'stats' package from scipy library
from scipy import stats

# import a library to perform Z-test
from statsmodels.stats import weightstats as stests

# to test the normality 
from scipy.stats import shapiro

# import the function to calculate the power of test
from statsmodels.stats import power

In [2]:
# set the plot size using 'rcParams'
# once the plot size is set using 'rcParams', it sets the size of all the forthcoming plots in the file
# pass width and height in inches to 'figure.figsize' 
plt.rcParams['figure.figsize'] = [15,8]


<a id="pt"></a>
## 2.1 Point Estimation

This method uses a single value (a sample statistic) as the estimate of the population parameter. 

Let $X_{1}, X_{2}, X_{3},..., X_{n}$ be a random sample drawn from a population with mean $\mu$ and standard deviation $\sigma$. <br>
The point estimation method estimates the population mean as $\hat{\mu} = \overline{X}$, where $\overline{X}$ is the sample mean, and the population standard deviation as $\hat{\sigma} = s$, where $s$ is the standard deviation of the sample.
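
As a minimal sketch of point estimation, the snippet below computes the two estimates described above. The values in `hypothetical_sample` are made up for illustration and do not come from this notebook.

```python
import numpy as np

# hypothetical sample of 8 observations (illustrative values only)
hypothetical_sample = [12, 15, 11, 14, 13, 16, 12, 15]

# point estimate of the population mean: the sample mean
mu_hat = np.mean(hypothetical_sample)

# point estimate of the population standard deviation: the sample standard deviation
# 'ddof=1' divides by (n - 1) instead of n
sigma_hat = np.std(hypothetical_sample, ddof=1)

print('Estimate of mu:', mu_hat)
print('Estimate of sigma:', round(sigma_hat, 4))
```

Each estimate is a single number, so it says nothing about its own uncertainty. Interval estimation addresses that.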

<a id="err"></a>
### 2.1.1 Sampling Error

Sampling error is the absolute difference between a sample statistic and the population parameter it estimates. Because a sample contains only part of the population, the values of mean, median, quantiles, and so on calculated on the sample differ from the actual population values. 

One can reduce the sampling error either by increasing the sample size or determining the optimal sample size using various methods.
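
The effect of sample size on sampling error can be illustrated with a small simulation. This is only a sketch: the population (integers 10 to 99), the seed, the number of repetitions and the helper `mean_abs_sampling_error` are all assumptions chosen for illustration.

```python
import numpy as np

# hypothetical population: the integers 10 to 99
population = np.arange(10, 100)
pop_mean = population.mean()

# seeded generator so that the sketch is reproducible
rng = np.random.default_rng(0)

def mean_abs_sampling_error(n, repeats=2000):
    # draw 'repeats' samples of size n without replacement and
    # average the absolute sampling error of the sample mean
    errors = [abs(rng.choice(population, size=n, replace=False).mean() - pop_mean)
              for _ in range(repeats)]
    return float(np.mean(errors))

# the average sampling error shrinks as the sample size grows
for n in [5, 25, 80]:
    print(n, round(mean_abs_sampling_error(n), 3))
```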

### Example:

#### 1. Consider the data for the number of ice-creams sold per day. An ice-cream vendor collected this data for 90 days and then a sample is drawn (without replacement) containing ice-creams sold for 25 days. 

data = [21, 93, 62, 76, 73, 20, 56, 95, 41, 36, 38, 13, 80, 88, 34, 18, 40, 11, 
        25, 29, 61, 23, 82, 10, 92, 69, 60, 87, 14, 91, 94, 49, 57, 83, 96, 55, 
        79, 52, 59, 39, 58, 17, 19, 98, 15, 54, 48, 46, 72, 45, 65, 28, 37, 30, 
        68, 75, 16, 33, 31, 99, 22, 51, 27, 67, 85, 47, 44, 77, 64, 97, 84, 42, 
        90, 70, 74, 89, 32, 26, 24, 12, 81, 53, 50, 35, 71, 63, 43, 86, 78, 66]
        
sample = [10, 22, 47, 66, 11, 57, 77, 98, 31, 63, 74, 84, 50, 96, 88, 92, 70, 54, 65, 44, 16, 72, 20, 90, 43]

Compute the sampling error for the mean.

In [13]:
pop = [21, 93, 62, 76, 73, 20, 56, 95, 41, 36, 38, 13, 80, 88, 
       34, 18, 40, 11, 25, 29, 61, 23, 82, 10, 92, 69, 60, 87, 
       14, 91, 94, 49, 57, 83, 96, 55, 79, 52, 59, 39, 58, 17, 
       19, 98, 15, 54, 48, 46, 72, 45, 65, 28, 37, 30, 68, 75, 
       16, 33, 31, 99, 22, 51, 27, 67, 85, 47, 44, 77, 64, 97, 
       84, 42, 90, 70, 74, 89, 32, 26, 24, 12, 81, 53, 50, 35, 
       71, 63, 43, 86, 78, 66]

In [9]:
sample = [10, 22, 47, 66, 11, 57, 77, 98, 31, 63, 74, 84, 50,
          96, 88, 92, 70, 54, 65, 44, 16, 72, 20, 90, 43]

In [12]:
print('Sampling error', abs(np.mean(sample) - np.mean(pop)))  # |sample mean - pop mean|
print('Standard error', np.std(pop)/np.sqrt(len(sample)))  # sigma / root(n)

Sampling error 3.1000000000000014
Standard error 5.195831662656775


<a id="int"></a>
## 2.2 Interval Estimation for Mean

This method considers the range of values in which the population parameter is likely to lie. The confidence interval is an interval that describes the range of values in which the parameter lies with a specific probability. It is given by the formula,<br> <p style='text-indent:20em'> `conf_interval = sample statistic ± margin of error`</p>

The uncertainty of an estimate is described by the `confidence level` which is used to calculate the margin of error. 

<a id="large"></a>
### 2.2.1 Large Sample Size

The confidence interval for the population mean with $100(1-\alpha)$% confidence level is given as: $\overline{X} \pm Z_{\frac{\alpha}{2}}\frac{s}{\sqrt{n}}$

Where, <br>
$\overline{X}$: Sample mean<br>
$\alpha$: Level of significance<br>
$s$: Sample Standard deviation <br>
$n$: Sample size
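
The formula above can be wrapped in a small helper. This is a sketch only: the function name `z_confidence_interval` and the summary statistics passed to it are hypothetical.

```python
import numpy as np
from scipy import stats

def z_confidence_interval(x_bar, s, n, conf_level):
    # level of significance for the chosen confidence level
    alpha = 1 - conf_level
    # Z-value with upper-tail probability alpha/2
    z = stats.norm.isf(alpha / 2)
    # margin of error: Z * s / sqrt(n)
    margin = z * s / np.sqrt(n)
    return x_bar - margin, x_bar + margin

# hypothetical summary statistics: mean 100, standard deviation 15, 36 observations
lower, upper = z_confidence_interval(x_bar=100, s=15, n=36, conf_level=0.95)
print(round(lower, 2), round(upper, 2))
```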

In [8]:
# let us find the Z-values for different alpha values

# create an empty dataframe to store the alpha and corresponding Z-value
df_Z = pd.DataFrame()

# create a list of different alpha values
alpha =  [0.1, 0.05, 0.02, 0.01] 

# use for loop to calculate the value for each alpha
for i in range(len(alpha)):
    alpha_by_2 = alpha[i] / 2
    
    # use 'stats.norm.isf()' to find the Z-value corresponding to the upper tail probability 'q'
    # pass the value of 'alpha_by_2' to the parameter 'q'
    # use 'round()' to round-off the value to 4 digits
    Z = np.abs(round(stats.norm.isf(q = alpha_by_2), 4))
    
    # create a dataframe using dictionary to store the alpha and corresponding Z-value
    # set the loop iterator 'i' as the index of the dataframe
    row =  pd.DataFrame({"alpha": alpha[i], "Z_alpha_by_2" : Z}, index = [i])
    
    # append the row to the dataframe 'df_Z'
    # 'pd.concat()' is used because 'DataFrame.append()' was removed in pandas 2.0
    df_Z = pd.concat([df_Z, row])

# print the final dataframe
df_Z

| | alpha | Z_alpha_by_2 |
|---|---|---|
| 0 | 0.1 | 1.6449 |
| 1 | 0.05 | 1.96 |
| 2 | 0.02 | 2.3263 |
| 3 | 0.01 | 2.5758 |


In [14]:
stats.norm.isf(0.05)

1.6448536269514729

In [None]:
# 95% and 99%

To calculate the confidence interval with 95% confidence, use the Z-value corresponding to `alpha = 0.05`. 

### Example:

#### 1. A random sample of weight (in kg.) for 35 diabetic patients is drawn from the population with a standard deviation of 8 kg. Find the 90% confidence interval for the population mean.

    Weight: [59.1, 65.0, 75.8, 79.2, 95.0, 99.8, 89.1, 65.3, 41.9, 55.2, 94.8, 84.1, 83.2, 74.0, 75.5, 76.2, 79.1, 80.1, 
             92.1, 74.2, 59.2, 64.0, 75, 78.2, 95.6, 97.8, 89.5, 64.2, 41.8, 57.2, 85, 91.4, 81.8, 74.6, 90]

In [15]:
sample=[59.1, 65.0, 75.8, 79.2, 95.0, 99.8, 89.1, 65.3, 41.9, 55.2, 94.8, 84.1, 83.2, 74.0, 75.5, 76.2, 79.1, 80.1, 
         92.1, 74.2, 59.2, 64.0, 75, 78.2, 95.6, 97.8, 89.5, 64.2, 41.8, 57.2, 85, 91.4, 81.8, 74.6, 90]
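n = len(sample)
x_bar = np.mean(sample)
# note: the problem states a population standard deviation of 8 kg, but this cell
# uses the sample standard deviation 's' (ddof=1), so the interval below is based on 's'
s = np.std(sample,ddof=1)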

In [16]:
z  = stats.norm.isf(0.05)

In [17]:
ll = x_bar - (z * (s/n**0.5))
ul = x_bar + (z * (s/n**0.5))
print(ll,ul)

72.51271961336339 80.85870895806518


In [19]:
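# Syntax: stats.norm.interval(confidence level, loc=x_bar, scale=(s/n**0.5))
# the confidence level is passed positionally: its keyword was renamed from 'alpha' to 'confidence' in newer SciPy releases

stats.norm.interval(0.9, loc=x_bar, scale=(s/n**0.5))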

(72.51271961336339, 80.85870895806518)

#### Practice

2. There are 150 apples on a tree. You randomly choose 40 apples and find that their average weight is 182 grams with a standard deviation of 30 grams. Find the 95% confidence interval for the population mean.

#### 3. A movie production house needs to estimate the average monthly wage of the technical crew members. The previous data shows that the standard deviation of the wages is 190 dollars. The production team thinks that the margin of error of the estimated average wage should not exceed 54 dollars. The team has decided to take a small subset of wages for the estimation. Find a suitable number of wages to be considered to get the estimate with 90% confidence.

In [21]:
s=190
me =54
z = stats.norm.isf(0.05)
n = ((z*s)/me)**2
# round up so that the margin of error does not exceed 'me'
print(int(np.ceil(n)))

34


In [23]:
stats.norm.isf(0.025)

1.9599639845400545

In [24]:
stats.t.isf(0.025,df=10)

2.2281388519649385

In [25]:
stats.t.isf(0.025,df=100)

1.983971518449634

In [26]:
stats.t.isf(0.025,df=1000)

1.9623390808264078

In [27]:
stats.t.isf(0.025,df=10000)

1.9602012398906263

In [28]:
stats.t.isf(0.025,df=10000000)

1.9599642217672055

In [29]:
stats.t.isf(0.025,df=1000000000000)

1.9599639845424266

<a id="small"></a>
### 2.2.2 Small Sample Size

Let us take a sample of `n` observations from the population such that, $n < 30$. Here the standard deviation of the population is unknown. The confidence interval for the population mean with $100(1-\alpha)$% confidence level is given as: $\overline{X} \pm t_{\frac{\alpha}{2}, n-1}\frac{s}{\sqrt{n}}$

Where, <br>
$\overline{X}$: Sample mean<br>
$\alpha$: Level of significance<br>
$s$: Sample standard deviation<br>
$n-1$: degrees of freedom

The ratio $\frac{s}{\sqrt{n}}$ is the estimate of the standard error of the mean. And $t_{\frac{\alpha}{2}, n-1}\frac{s}{\sqrt{n}}$ is the margin of error for the estimate.

The value of $t_{\frac{\alpha}{2}, n-1}$ for different $\alpha$ values can be obtained using the `stats.t.isf()` from the scipy library.  
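
As a sketch of how the pieces fit together, the helper below builds the t-based interval directly from raw observations. The name `t_confidence_interval` and the data in `hypothetical_sample` are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def t_confidence_interval(sample, conf_level):
    n = len(sample)
    x_bar = np.mean(sample)
    # sample standard deviation (divides by n - 1)
    s = np.std(sample, ddof=1)
    # t-value with upper-tail probability alpha/2 and n - 1 degrees of freedom
    t = stats.t.isf((1 - conf_level) / 2, df=n - 1)
    # margin of error: t * s / sqrt(n)
    margin = t * s / np.sqrt(n)
    return x_bar - margin, x_bar + margin

# hypothetical small sample (n = 8)
hypothetical_sample = [12, 15, 11, 14, 13, 16, 12, 15]
lower, upper = t_confidence_interval(hypothetical_sample, 0.95)
print(round(lower, 3), round(upper, 3))
```

For the same data and confidence level, the t-based interval is wider than the Z-based one, because $t_{\frac{\alpha}{2}, n-1}$ exceeds $Z_{\frac{\alpha}{2}}$ for any finite number of degrees of freedom.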

In [16]:
# let us find the t-values for different alpha values with 10 degrees of freedom

# create an empty dataframe to store the alpha and corresponding t-value
df_t = pd.DataFrame()

# create a list of different alpha values
alpha =  [0.1, 0.05, 0.02, 0.01] 

# use for loop to calculate the t-value for each alpha value
for i in range(len(alpha)):
    alpha_by_2 = alpha[i] / 2
    
    # use 'stats.t.isf()' to find the t-value corresponding to the upper tail probability 'q'
    # pass the value of 'alpha_by_2' to the parameter 'q'
    # pass the 10 degrees of freedom to the parameter 'df' 
    # use 'round()' to round-off the value to 2 digits
    t = np.abs(round(stats.t.isf(q = alpha_by_2, df = 10), 2))

    # create a dataframe using dictionary to store the alpha and corresponding t-value 
    # set the loop iterator 'i' as the index of the dataframe
    row =  pd.DataFrame({"alpha": alpha[i], "t_alpha_by_2": t}, index = [i])

    # append the row to the dataframe 'df_t'
    # 'pd.concat()' is used because 'DataFrame.append()' was removed in pandas 2.0
    df_t = pd.concat([df_t, row])

# print the final dataframe
df_t

| | alpha | t_alpha_by_2 |
|---|---|---|
| 0 | 0.1 | 1.81 |
| 1 | 0.05 | 2.23 |
| 2 | 0.02 | 2.76 |
| 3 | 0.01 | 3.17 |


### Example:

#### 1. There are 150 apples on a tree. You randomly choose 17 apples and find that their average weight is 78 grams with a standard deviation of 23 grams. Find the 90% confidence interval for the population mean.

In [33]:
n=17
x_bar =78
s=23
t = stats.t.isf(0.05,df=n-1)

In [34]:
ll = x_bar - (t * (s/n**0.5))
ul = x_bar + (t * (s/n**0.5))
print(ll,ul)

68.26090326067306 87.73909673932694


In [35]:
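# the confidence level is passed positionally: its keyword was renamed from 'alpha' to 'confidence' in newer SciPy releases
stats.t.interval(0.9, loc=x_bar, scale=(s/n**0.5), df=n-1)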

(68.26090326067306, 87.73909673932694)

<a id="prop"></a>
## 2.3 Interval Estimation for Proportion

Consider a population in which each observation is either a success or a failure. The population proportion is denoted by `P`, which is the ratio of the number of successes to the size of the population.

The confidence interval for the population proportion with $100(1-\alpha)$% confidence level is given as: $p \pm Z_{\frac{\alpha}{2}}\sqrt{\frac{p(1 - p)}{n}}$

Where, <br>
$p$: Sample proportion<br>
$\alpha$: Level of significance<br>
$n$: Sample size

The quantity $Z_{\frac{\alpha}{2}}\sqrt{\frac{p(1 - p)}{n}}$ is the margin of error.
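
A minimal sketch of this interval as a reusable function is given below. The name `proportion_confidence_interval` and the counts used (120 successes out of 400) are hypothetical.

```python
import numpy as np
from scipy import stats

def proportion_confidence_interval(successes, n, conf_level):
    # sample proportion
    p = successes / n
    # Z-value with upper-tail probability alpha/2
    z = stats.norm.isf((1 - conf_level) / 2)
    # margin of error: Z * sqrt(p(1 - p)/n)
    margin = z * np.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

# hypothetical counts: 120 successes in a sample of 400
lower, upper = proportion_confidence_interval(120, 400, 0.95)
print(round(lower, 4), round(upper, 4))
```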

### Example:

#### 1. A financial firm has created 50 portfolios. From them, a sample of 13 portfolios was selected, out of which 8 were found to be underperforming. Construct a 99% confidence interval to estimate the population proportion.

In [37]:
p_sam = 8/13
n =13
z = stats.norm.isf(0.005)
ll = p_sam - (z * np.sqrt((p_sam*(1-p_sam))/n))
ul = p_sam + (z * np.sqrt((p_sam*(1-p_sam))/n))
print(ll,ul)

0.26782280814713794 0.962946422622093


<a id="defn"></a>
# 3. Test of Hypothesis

It is the process of evaluating the validity of the claim made using the sample data obtained from the population. A statistical test is a rule used to decide the acceptance or rejection of the claim.

**Examples of hypothesis:**

        1. One can get 'A' grade if the attendance in the class is more than 75%.
        2. A probiotic drink can improve the immunity of a person. 

<a id="types"></a>
## 3.1 Types of Hypothesis

`Null Hypothesis`: The null hypothesis is the claim suggesting 'no difference'. It is denoted as H<sub>0</sub>.

`Alternative Hypothesis`: It is the hypothesis that is tested against the null hypothesis. It is supported when the sample data provide sufficient evidence against H<sub>0</sub>; otherwise we fail to reject H<sub>0</sub>. It is denoted by H<sub>a</sub> or H<sub>1</sub>.



<a id="test_type"></a>
# 4. Types of Test

The hypothesis test is used to validate the claim given by the null hypothesis. The types of tests are based on the nature of the alternative hypothesis. 

<a id="2tailed"></a>
## 4.1 Two Tailed Test

A two tailed test considers whether the value of the population parameter differs from (i.e. is either less than or greater than) a specific value. <br>
If we test the population mean ($\mu$) with a specific value ($\mu_{0}$) the null hypothesis is: $H_{0}: \mu = \mu_{0}$. 

The alternative hypothesis for the two tailed test is given as: $H_{1}: \mu \neq \mu_{0}$

#### Example:

A company that produces tennis balls claimed that the diameter of a tennis ball is 2.625 inches on average. To test the company's claim, a statistical test can be performed considering the hypothesis:

                    Null Hypothesis: Mean diameter = 2.625 inches
                    Alternative Hypothesis: Mean diameter ≠ 2.625 inches

<a id="1tailed"></a>
## 4.2 One Tailed Test

A one tailed test considers whether the value of the population parameter is less than or greater than (but not both) a specific value. <br>
If we test the population mean ($\mu$) against a specific value ($\mu_{0}$) with the null hypothesis $H_{0}: \mu \leq \mu_{0}$ and the alternative hypothesis $H_{1}: \mu > \mu_{0}$, the one tailed test is known as a `right-tailed test`.

If the null hypothesis is $H_{0}: \mu \geq \mu_{0}$ and the alternative hypothesis is $H_{1}: \mu < \mu_{0}$, the one tailed test is known as a `left-tailed test`.
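
The type of test determines where the rejection region lies. As a sketch, the hypothetical helper `z_critical_values` below returns the critical Z-value(s) for each type. The tail labels `'two'`, `'right'` and `'left'` are naming choices made for this example.

```python
from scipy import stats

def z_critical_values(alpha, tail):
    if tail == 'two':
        # alpha is split equally between both tails
        z = stats.norm.isf(alpha / 2)
        return (-z, z)
    if tail == 'right':
        # the whole of alpha lies in the upper tail
        return (stats.norm.isf(alpha),)
    if tail == 'left':
        # the whole of alpha lies in the lower tail
        return (-stats.norm.isf(alpha),)
    raise ValueError("tail must be 'two', 'right' or 'left'")

for tail in ['two', 'right', 'left']:
    print(tail, [round(z, 4) for z in z_critical_values(0.05, tail)])
```

A test statistic beyond the critical value in the direction of the alternative hypothesis falls in the rejection region.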


### Example:

**1.** The company's annual quality report of machines states that a lathe machine works efficiently at most for 8 months on average after the servicing. The production manager claims that after the special tuxan servicing, the machine works efficiently for more than 8 months. To test the production manager's claim, consider the hypothesis:

                    Null Hypothesis: Machine efficiency ≤ 8 months
                    Alternative Hypothesis: Machine efficiency > 8 months

This is an example of a **right-tailed test**. 

**2.** A railway authority claims that all the trains on the Chicago-Seattle route run with a speed of at least 54 mph on average. A customer forum declares that there are various records from passengers claiming that the speed of the train is less than what the railway has claimed. In this scenario, a statistical test can be performed to test the customer forum's claim considering the hypothesis:

                    Null Hypothesis: Speed ≥ 54 mph
                    Alternative Hypothesis: Speed < 54 mph

This is an example of a **left-tailed test**. 

<a id="eg"></a>
# 5. Hypothesis Tests with Z Statistic

Let us perform one sample Z test for the population mean. We compare the population mean with a specific value. The sample is assumed to be taken from a population following a normal distribution.

To check the normality of the data, a test for normality is used. The `Shapiro-Wilk Test` is one of the methods used to check the normality. The hypothesis of the test is given as:
<p style='text-indent:25em'> <strong> H<sub>0</sub>:  The data is normally distributed </strong> </p>
<p style='text-indent:25em'> <strong> H<sub>1</sub>:  The data is not normally distributed </strong> </p>

The `shapiro()` from scipy library performs a Shapiro-Wilk normality test. 
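
A minimal usage sketch of `shapiro()` follows. The data are generated from a normal distribution with an arbitrary seed, so `hypothetical_data` and the 0.05 threshold are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import shapiro

# hypothetical data: 60 values generated from a normal distribution
rng = np.random.default_rng(42)
hypothetical_data = rng.normal(loc=50, scale=5, size=60)

# 'shapiro()' returns the test statistic and the p-value
stat, p_value = shapiro(hypothetical_data)
print('Test statistic:', stat)
print('p-value:', p_value)

# compare the p-value with the chosen level of significance
alpha = 0.05
if p_value > alpha:
    print('Fail to reject H0: no evidence against normality')
else:
    print('Reject H0: the data do not appear to be normally distributed')
```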

The null and alternative hypothesis of Z-test is given as:
<p style='text-indent:25em'> <strong> $H_{0}: \mu = \mu_{0}$ or $\mu \geq \mu_{0}$ or $\mu \leq \mu_{0}$</strong></p>
<p style='text-indent:25em'> <strong> $H_{1}: \mu \neq \mu_{0}$ or $\mu < \mu_{0}$ or $\mu > \mu_{0}$</strong></p>

Consider a normal population with standard deviation $\sigma$, from which a sample of size $n$ is taken. 
The test statistic for one sample Z-test is given as:
<p style='text-indent:25em'> <strong> $Z = \frac{\overline{X} -  \mu_{0}}{\frac{\sigma}{\sqrt{n}}}$</strong></p>

Where, <br>
$\overline{X}$: Sample mean<br>
$\mu_{0}$: Specified mean<br>
$\sigma$: Population standard deviation<br>
$n$: Sample size
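
The test statistic and its p-value can be computed together, as in the sketch below. The function name `one_sample_z_test`, its `tail` labels and the input values are hypothetical and chosen only to illustrate the calculation.

```python
import numpy as np
from scipy import stats

def one_sample_z_test(x_bar, mu_0, sigma, n, tail='two'):
    # test statistic: (sample mean - specified mean) / (sigma / sqrt(n))
    z = (x_bar - mu_0) / (sigma / np.sqrt(n))
    if tail == 'two':
        # area in both tails beyond |z|
        p = 2 * stats.norm.sf(abs(z))
    elif tail == 'right':
        # area to the right of z
        p = stats.norm.sf(z)
    elif tail == 'left':
        # area to the left of z
        p = stats.norm.cdf(z)
    else:
        raise ValueError("tail must be 'two', 'right' or 'left'")
    return z, p

# hypothetical values: sample mean 52, specified mean 50, sigma 6, n 36
z_stat, p_value = one_sample_z_test(x_bar=52, mu_0=50, sigma=6, n=36, tail='two')
print(round(z_stat, 4), round(p_value, 4))
```

The null hypothesis is rejected when the p-value is smaller than the chosen level of significance.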





### Example:

#### 1. A car manufacturing company claims that the mileage of their new car is 25 kmph with a standard deviation of 2.5 kmph. A random sample of 45 cars was drawn and their mileage was recorded as per the standard procedure. From the sample, the mean mileage was found to be 24 kmph. Is this evidence that the mean mileage is different from 25 kmph? (assume the normality of the data) Use α = 0.01.

In [None]:
# Framing Hypothesis 
# Ho : mu = 25
# Ha : mu !=25

In [None]:
# Data should be normal
# pop std is known
# two tail test
# one sample z test (two tailed)

In [39]:
mu = 25
x_bar=24
n=45
sigma=2.5

In [40]:
z_stat = (x_bar-mu)/(sigma/n**0.5)
print(z_stat)

-2.6832815729997477


In [42]:
pval = stats.norm.sf(abs(z_stat))*2
print(pval)

0.007290358091535638


In [None]:
# pval = 0.0073
# sig.lvl = 0.01
# pval < sig.lvl
# Null hypothesis is rejected and alternate hypothesis selected
# Average mileage is not equal to 25.

#### 2. The average calories in a slice of bread of the brand 'Alphas' are 82 with a standard deviation of 15. An experiment is conducted to test the claim of the dietitians that the calories in a slice of bread are not as per the manufacturer's specification. A sample of 40 slices of bread is taken and the mean calories recorded are 95. Test the claim of dietitians with ⍺ value (significance level) as 0.05. (assume the normality of the data).

In [None]:
# Framing Hypothesis 
# Ho : mu = 82
# Ha : mu !=82

In [52]:
# Data should be normal
# pop std is known
# two tail test
# one sample z test (two tailed)

In [43]:
mu = 82
x_bar=95
n=40
sigma=15

In [44]:
z_stat = (x_bar-mu)/(sigma/n**0.5)
print(z_stat)

5.4812812776251905


In [45]:
pval = stats.norm.sf(abs(z_stat))*2
print(pval)

4.222565249683579e-08


In [46]:
# pval = 4.22e-08 (approximately 0)
# sig.lvl = 0.05
# pval < sig.lvl
# Null hypothesis is rejected and alternate hypothesis selected
# Avg calories are not 82.

#### 3. A typhoid vaccine in the market is labelled as containing 3 mg of ascorbic acid, with a standard deviation of 1.2 mg. A research team claims that the vaccines contain less than 3 mg of acid. We collected the data of 40 vaccines by random sampling from the population and recorded the amount of ascorbic acid. Test the claim of the research team using the sample data, with ⍺ (significance level) set to 0.05.

    acid_amt = [2.57, 3.06, 3.28 , 3.24, 2.79, 3.40, 3.36, 3.07, 2.46, 3.03, 3.05, 2.94, 3.46, 3.19, 3.09, 2.81, 3.13, 2.88, 
                2.76, 2.75, 3.17, 2.89, 2.54, 3.18, 3.08, 2.60, 3.06, 3.13, 3.11, 3.08, 2.93, 2.90, 3.06, 2.97, 3.24, 2.86, 
                2.87, 3.18, 3, 2.95]

In [47]:
acid_amt = [2.57, 3.06, 3.28 , 3.24, 2.79, 3.40, 3.36, 3.07, 2.46, 3.03, 3.05, 2.94, 3.46, 3.19, 3.09, 2.81, 3.13, 2.88, 
            2.76, 2.75, 3.17, 2.89, 2.54, 3.18, 3.08, 2.60, 3.06, 3.13, 3.11, 3.08, 2.93, 2.90, 3.06, 2.97, 3.24, 2.86, 
            2.87, 3.18, 3, 2.95]

In [49]:
x_bar = np.mean(acid_amt)
s = np.std(acid_amt,ddof=1)
x_bar

3.003

In [None]:
# Ho : mu >= 3
# Ha : mu < 3   (the research team's claim)

In [None]:
# Data is normal
# pop std is known
# one sample z test(left tailed)

In [50]:
mu = 3
x_bar= np.mean(acid_amt)
n=len(acid_amt)
sigma=1.2
z_stat = (x_bar-mu)/(sigma/n**0.5)
print(z_stat)

0.015811388300842496


In [51]:
# left-tailed test: the p-value is the area to the left of z_stat
pval = stats.norm.cdf(z_stat)
print(round(pval, 4))

0.5063


In [None]:
# pval = 0.5063
# sig lvl = 0.05
# pval > sig lvl
# We fail to reject the null hypothesis
# There is not enough evidence that the average acid content is less than 3 mg.

#### 4. A sample of 900 PVC pipes is found to have an average thickness of 12.5 mm. The sample is coming from a normal population. Is there any evidence that the PVC pipe thickness is less than 13 mm? The population standard deviation is 1 mm. Test the hypothesis at 5% level of significance.

In [53]:
# Ho : mu >=13
# Ha : mu <13

In [54]:
# Data is normal
#  pop std is known
# one sample z test(left tailed)

In [55]:
mu = 13
x_bar= 12.5
n=900
sigma=1
z_stat = (x_bar-mu)/(sigma/n**0.5)
print(z_stat)



-15.0


In [57]:
pval = stats.norm.cdf(z_stat)
print(pval)



3.6709661993126986e-51


In [None]:
# pval = 3.67e-51 (approximately 0)
# sig lvl = 0.05
# pval< sig lvl
# Null hypothesis is rejected
# average thickness less than 13mm.

<a id="2z"></a>
## 5.2 Two Sample Z Test

Let us perform a two sample Z test for the population mean. We compare the means of two independent populations. The samples are assumed to be taken from populations that follow a normal distribution. Also, the populations are assumed to have equal variances.

The `Shapiro-Wilk Test` is used to check the normality of the data. The assumption of equal variances of the populations is tested using the `Levene's Test`. 
The hypothesis of the Levene's test is given as:
<p style='text-indent:25em'> <strong> H<sub>0</sub>:  The variances are equal</strong> </p>
<p style='text-indent:25em'> <strong> H<sub>1</sub>:  The variances are not equal </strong> </p>

The `levene()` from scipy library performs a Levene's test. 
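
A minimal usage sketch of `levene()` follows. The two groups are generated with an arbitrary seed and the same spread, so `group_1`, `group_2` and the 0.05 threshold are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

# hypothetical samples from two populations with the same standard deviation
rng = np.random.default_rng(7)
group_1 = rng.normal(loc=13, scale=4, size=50)
group_2 = rng.normal(loc=15, scale=4, size=60)

# 'levene()' returns the test statistic and the p-value
stat, p_value = stats.levene(group_1, group_2)
print('Test statistic:', stat)
print('p-value:', p_value)

# compare the p-value with the chosen level of significance
alpha = 0.05
if p_value > alpha:
    print('Fail to reject H0: no evidence that the variances differ')
else:
    print('Reject H0: the variances are not equal')
```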

The null and alternative hypothesis of two sample Z-test is given as:

<p style='text-indent:25em'> <strong> $H_{0}: \mu_{1} - \mu_{2} = \mu_{0}$ or $\mu_{1} - \mu_{2} \geq \mu_{0}$ or $\mu_{1} -\mu_{2} \leq \mu_{0}$</strong></p>
<p style='text-indent:25em'> <strong> $H_{1}: \mu_{1} - \mu_{2} \neq \mu_{0} $ or $\mu_{1} - \mu_{2} < \mu_{0}$ or $\mu_{1} -\mu_{2} > \mu_{0}$</strong></p>



The test statistic for two sample Z-test is given as:
<p style='text-indent:25em'> <strong> $Z = \frac{(\overline{X_{1}} - \overline{X_{2}})  - \mu_{0}} {\sqrt{\frac{\sigma_{1}^{2}}{n_{1}} + \frac{\sigma_{2}^{2}}{n_{2}}}}$</strong></p>

Where, <br>
$\overline{X_{1}}$, $\overline{X_{2}}$ : Mean of both the samples<br>
$\mu_{0}$: Mean difference given in the null hypothesis<br>
$\sigma_{1}, \sigma_{2}$: Standard deviation of both the populations<br>
$n_{1}, n_{2}$: Size of samples from both the populations
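
The two sample statistic can be sketched as a function in the same way. The name `two_sample_z_test`, its `tail` labels and the input values below are hypothetical.

```python
import numpy as np
from scipy import stats

def two_sample_z_test(x1_bar, x2_bar, sigma1, sigma2, n1, n2, mu_0=0, tail='two'):
    # numerator: observed mean difference minus the difference under H0
    num = (x1_bar - x2_bar) - mu_0
    # denominator: standard error of the difference between the sample means
    den = np.sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    z = num / den
    if tail == 'two':
        p = 2 * stats.norm.sf(abs(z))
    elif tail == 'right':
        p = stats.norm.sf(z)
    elif tail == 'left':
        p = stats.norm.cdf(z)
    else:
        raise ValueError("tail must be 'two', 'right' or 'left'")
    return z, p

# hypothetical values: means 52 and 50, sigmas 6 and 8, sample sizes 36 and 64
z_stat, p_value = two_sample_z_test(52, 50, 6, 8, 36, 64)
print(round(z_stat, 4), round(p_value, 4))
```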




#### 1. A study was carried out to understand the amount of haemoglobin in blood for males and females. Random samples of 160 males and 180 females have means of 13 g/dl and 15 g/dl respectively. The two populations have standard deviations of 4.1 g/dl for male donors and 3.5 g/dl for female donors. Can it be said that the population mean haemoglobin concentrations are the same for men and women? Use α = 0.01. Assume the data is normally distributed.

In [60]:
n1=160
n2 =180
x1_bar = 13
x2_bar = 15
sigma1 = 4.1
sigma2 = 3.5

In [None]:
# Ho : mu1 = mu2  => mu1-mu2=0
# Ha : mu1 != mu2  => mu1-mu2 !=0

In [58]:
# Data is normal
# pop std is known
# two samples
# two sample z test(two tailed)

In [62]:
num = (x1_bar-x2_bar)- 0
den = np.sqrt((sigma1**2/n1)+(sigma2**2/n2))
zstat = num/den
print(zstat)

-4.806830552525058


In [64]:
pval = stats.norm.sf(abs(zstat))*2
print(pval)

1.5334185117556497e-06


In [None]:
# pval < sig lvl
# Null hypothesis is rejected. Alternate is selected
# haemoglobin level male is not equal to haemoglobin level of female.