## Sampling Distribution

Refer to the attrition data.

**A) Use Monthly Income column from the dataset attrition.csv**

In [1]:
#Include all packages that will be needed to make our estimates, plots, graphs, calculations, etc.

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
sns.set(rc={'figure.figsize':(12,8)})                              #setting the size of graph figure to 12:8

att = pd.read_csv("/Users/rachita/Desktop/Python/attrition.csv")   #reading the attrition file into att

**Q1. Consider entire variable as a population and calculate population mean**

In [97]:
population_income = att.MonthlyIncome                        #selecting Monthly Income as Population
print("Population mean is :", np.mean(population_income))    #calculating mean of Population

Population mean is : 6502.931292517007


**Q2. Select sample s of size = 200 from the population and calculate 95% Confidence Interval estimate of the population mean** 

In [103]:
sample_income = att.MonthlyIncome.sample(n=200)              #selecting a random sample from MonthlyIncome of n=200

#calculating the 95% Confidence Interval estimate. According to the Standard Normal Table, the z-score for 95% confidence is 1.96
#note: np.std defaults to ddof=0 (population formula); passing ddof=1 would give the sample standard deviation
#lower bound: a = mean - z * (std / sqrt(n))
a = (np.mean(sample_income) - ((1.96)*(np.std(sample_income)/np.sqrt(200))))
#upper bound: b = mean + z * (std / sqrt(n))
b = (np.mean(sample_income) + ((1.96)*(np.std(sample_income)/np.sqrt(200))))

print("Range of Confidence Interval lies between : ",a," and ",b)
print("Sample mean is : ",np.mean(sample_income),"We can notice that this value falls within the above range of CI.")


Range of Confidence Interval lies between :  5821.918156930799  and  7068.841843069201
Sample mean is :  6445.38 We can notice that this value falls within the above range of CI.
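
The interval calculation above can be wrapped in a small helper so the same formula is not retyped for every sample. The following is a minimal sketch, not the notebook's own code: the function name `z_confidence_interval` is hypothetical, the data is a synthetic stand-in rather than the attrition file, and it uses `ddof=1` (the sample standard deviation), which differs slightly from the `np.std` default used above.

```python
import numpy as np

def z_confidence_interval(sample, z=1.96):
    """Return (lower, upper) bounds of a z-based CI for the mean."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    mean = sample.mean()
    # ddof=1 divides by n - 1, giving the sample standard deviation
    std_err = sample.std(ddof=1) / np.sqrt(n)
    margin = z * std_err
    return mean - margin, mean + margin

# Synthetic, right-skewed stand-in for a 200-row income sample
# (an assumption for illustration; these are not values from attrition.csv)
rng = np.random.default_rng(0)
demo_sample = rng.gamma(shape=2.0, scale=3250.0, size=200)

lower, upper = z_confidence_interval(demo_sample)
print("95% CI lies between :", lower, " and ", upper)
```

With the real data, the call would be `z_confidence_interval(att.MonthlyIncome.sample(n=200))`. The sample mean always sits at the centre of the interval, so it falls inside by construction.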


**Q3. Repeat step 2 1000 times and check how many of those CIs capture the true population mean**

In [105]:
true_count = 0
false_count = 0

#repeating the bootstrap 1000 times, each time calculating fresh values of a & b
#and checking whether the mean of the Q2 sample falls within that range or not.
#note: this resamples from sample_income (not the population), so the Q2 sample mean stands in for the true mean.

s_mean = int(np.mean(sample_income))                                             #mean of the original sample, fixed across iterations

for i in range(1000):
    sample_income_d = np.random.choice(sample_income,200,replace = True)         #bootstrap resample, drawn with replacement

    a = int(np.mean(sample_income_d) - ((1.96)*(np.std(sample_income_d)/np.sqrt(200))))
    b = int(np.mean(sample_income_d) + ((1.96)*(np.std(sample_income_d)/np.sqrt(200))))

    if a <= s_mean < b:     #same test as `s_mean in range(a,b)`
        true_count += 1     #counting the times the mean was captured within the range
    else:
        false_count += 1    #counting the times the mean was not captured within the range

print("The number of times true mean was captured in the interval :",true_count)
print("The number of times true mean was not captured in the interval :",false_count)


The number of times true mean was captured in the interval : 943
The number of times true mean was not captured in the interval : 57


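**Inference :** A 95% CI is expected to capture the true population mean in about 95% of repeated samples, i.e., roughly 1 out of every 20 intervals (5 out of every 100) will not capture it. In the run above, 943 of the 1000 intervals (94.3%) captured the mean, which is close to the nominal 95%.
We can conclude that an interval built this way has about a 95% chance of capturing the true mean of the population.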
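
The coverage idea can also be checked by drawing a fresh sample from the population on every repetition and testing whether each interval contains the population mean. The sketch below is an illustration under stated assumptions: the population is a synthetic, seeded gamma distribution of arbitrary size (not the attrition data), and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical right-skewed population standing in for MonthlyIncome
population = rng.gamma(shape=2.0, scale=3250.0, size=5000)
pop_mean = population.mean()

n, trials, z = 200, 1000, 1.96
captured = 0

for _ in range(trials):
    # Fresh sample from the population each time, without replacement
    sample = rng.choice(population, size=n, replace=False)
    margin = z * sample.std(ddof=1) / np.sqrt(n)
    # Count the interval if it contains the population mean
    if sample.mean() - margin <= pop_mean <= sample.mean() + margin:
        captured += 1

coverage = captured / trials
print("Captured :", captured, " Missed :", trials - captured)
print("Empirical coverage :", coverage)
```

With the real data, `population` would be `att.MonthlyIncome.to_numpy()`. The empirical coverage should land near 0.95, with some run-to-run variation.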

**B) Wide vs. Narrow Confidence Interval**

**Q1. Collect a sample s of size 200 from the population and calculate**
**a) CI = 90%**
**b) CI = 99%**

In [113]:
#extracting a sample of size 200 from the population data
sample_income = att.MonthlyIncome.sample(n=200)

#calculating 90% Confidence Interval Estimates. According to Standard Normal Table z-score of 90% is 1.645
a = (np.mean(sample_income) - ((1.645)*(np.std(sample_income)/np.sqrt(200))))
b = (np.mean(sample_income) + ((1.645)*(np.std(sample_income)/np.sqrt(200))))
print("For 90% CI the range varies between :",a, "and ",b)
print("Difference between the range is :",b-a)

#calculating 99% Confidence Interval Estimates. According to Standard Normal Table z-score of 99% is 2.575
a = (np.mean(sample_income) - ((2.575)*(np.std(sample_income)/np.sqrt(200))))
b = (np.mean(sample_income) + ((2.575)*(np.std(sample_income)/np.sqrt(200))))
print("For 99% CI the range varies between :",a, "and ",b)
print("Difference between the range is :",b-a)
print("Sample mean is :",np.mean(sample_income))


For 90% CI the range varies between : 5983.565276525884 and  7091.504723474116
Difference between the range is : 1107.9394469482322
For 99% CI the range varies between : 5670.378745929575 and  7404.691254070424
Difference between the range is : 1734.3125081408489
Sample mean is : 6537.535


**Note :** Both the 90% CI and the 99% CI capture the population mean (6502.93) in this run.

a) Which one is wider ? Why ?

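The 99% CI is wider than the 90% CI because a higher confidence level requires a larger z-score (2.575 vs. 1.645), which increases the margin of error. Equivalently, alpha_1 is 10%, which is split into 5% in each tail, whereas alpha_2 is just 1%, which is split into 0.5% in each tail of the distribution. In the run above the widths are 1734.31 and 1107.94 respectively.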
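
The z-scores quoted from the Standard Normal Table can be reproduced from the tail split described above. This is a small sketch using only the Python standard library; the helper name `z_critical` is hypothetical.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided z critical value for a given confidence level."""
    alpha = 1.0 - confidence
    # alpha is split equally between the two tails
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

for level in (0.90, 0.95, 0.99):
    print(level, round(z_critical(level), 3))
```

The values grow with the confidence level, which is why the 99% interval is the widest. The 99% value rounds to 2.576, marginally above the table value of 2.575 used in the cell above.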

**Q2. Collect 2 samples from the population: s1 of size 200 and s2 of size 400. Using both samples, calculate a 90% CI estimate of the population mean**

In [120]:
#According to Standard Normal Table z-score of 90% is 1.645

#extracting a sample of size 200 from the population data
sample_income_1 = att.MonthlyIncome.sample(n=200)
a = (np.mean(sample_income_1) - ((1.645)*(np.std(sample_income_1)/np.sqrt(200))))
b = (np.mean(sample_income_1) + ((1.645)*(np.std(sample_income_1)/np.sqrt(200))))
print("For 90% CI the range varies between :",a, "and ",b)
print("Difference between a & b :",b-a)


#extracting a sample of size 400 from the population data
sample_income_2 = att.MonthlyIncome.sample(n=400)
a = (np.mean(sample_income_2) - ((1.645)*(np.std(sample_income_2)/np.sqrt(400))))
b = (np.mean(sample_income_2) + ((1.645)*(np.std(sample_income_2)/np.sqrt(400))))
print("For 90% CI the range varies between :",a, "and ",b)
print("Difference between a & b :",b-a)


For 90% CI the range varies between : 6007.317964217904 and  7086.622035782097
Difference between a & b : 1079.3040715641928
For 90% CI the range varies between : 5916.511605180337 and  6662.603394819663
Difference between a & b : 746.0917896393257


a) Theoretically, which one would you expect to be narrower & why ?

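**Inference :** We would expect the second one (n = 400) to be narrower because the margin of error is proportional to 1/sqrt(n): as the sample size increases, the Confidence Interval becomes more precise and hence narrower. The output above agrees, with a width of 746.09 for n = 400 against 1079.30 for n = 200.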
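
The effect of sample size on width follows directly from the formula width = 2 * z * (std / sqrt(n)). The sketch below holds the standard deviation fixed at a hypothetical value (an assumption; in practice each sample has its own standard deviation) to isolate the effect of n. The function name `ci_width` is also hypothetical.

```python
import math

def ci_width(std_dev, n, z=1.645):
    """Full width of a z-based CI: upper bound minus lower bound."""
    return 2 * z * std_dev / math.sqrt(n)

sigma = 4700.0  # hypothetical standard deviation, for illustration only

w200 = ci_width(sigma, 200)
w400 = ci_width(sigma, 400)
ratio = w200 / w400

print("Width for n=200 :", w200)
print("Width for n=400 :", w400)
print("Ratio :", ratio)
```

Doubling the sample size shrinks the width by a factor of sqrt(2), about 1.414. The ratio observed in the output above, 1079.30 / 746.09, is about 1.45; the small gap comes from the two samples having different standard deviations.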

#### Conclusion : Thus we have studied the different CI estimates and their reasons for being narrow or wide.