### Estimation: Population Mean | Population Proportion | Population Variance

In [45]:
from scipy import stats
from math import sqrt
import numpy as np

# Mean Estimation

### Problem
A survey was taken of U.S. companies that do business with firms in India. One of
the questions on the survey was: Approximately how many years has your company
been trading with firms in India? A random sample of 44 responses to this question
yielded a mean of 10.455 years. Suppose the population standard deviation for this
question is 7.7 years. Using this information, construct a 90% confidence interval for
the mean number of years that a company has been trading in India for the population
of U.S. companies trading with firms in India.

### Solution
Sample Size >= 30
<br />Population Standard Deviation is Known
<br />Use Z-Distribution

In [46]:
sample_mean = 10.455
population_std = 7.7
sample_size = 44
confidence = 0.90

# Standard error of the mean (std of the sampling distribution of the mean) = population std / sqrt(sample size)
sample_mean_std = (population_std/(sample_size ** 0.5))

sample_mean_dist = stats.norm(loc=sample_mean,scale=sample_mean_std)

In [47]:
alpha = 1 - confidence  # significance level; interval() takes the confidence level itself
sample_mean_dist.interval(confidence)

(8.545623189521363, 12.364376810478635)

In [48]:
# Tail area left out on each side: at 90% confidence, 5% in each tail
left_out_area = (1-confidence)/2

# upper cut off point is 95% area
upper_area = 1-left_out_area

# Lower cut off point is 5% area
lower_area = left_out_area

# sample mean value corresponding to 95% area 
upper_bound = sample_mean_dist.ppf(upper_area)
upper_bound = round(upper_bound,4)

# sample mean value corresponding to 5% area
lower_bound = sample_mean_dist.ppf(lower_area)
lower_bound = round(lower_bound,4)

In [49]:
print(f"At {confidence*100}% confidence, Actual population mean is between {lower_bound} and {upper_bound}")

At 90.0% confidence, Actual population mean is between 8.5456 and 12.3644
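
As a cross-check, the same interval can be obtained from the textbook formula, sample mean ± z × σ/√n. The sketch below is a minimal illustration; the helper name `z_interval` and the variables `z_lower` and `z_upper` are introduced here for illustration and are not part of the original notebook.

```python
from math import sqrt
from scipy import stats

def z_interval(sample_mean, population_std, sample_size, confidence):
    """Two-sided z interval for a mean when the population std is known."""
    alpha = 1 - confidence                       # total area left in both tails
    z_critical = stats.norm.ppf(1 - alpha / 2)   # critical value of the standard normal
    margin_of_error = z_critical * population_std / sqrt(sample_size)
    return sample_mean - margin_of_error, sample_mean + margin_of_error

# Values from the India trading survey problem
z_lower, z_upper = z_interval(10.455, 7.7, 44, 0.90)
print(round(z_lower, 4), round(z_upper, 4))
```

The rounded bounds should agree with the 8.5456 and 12.3644 reported above, since `stats.norm(loc, scale).ppf` and the explicit formula describe the same interval.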


### Problem
A study is conducted in a company that employs 800 engineers. A random sample
of 50 engineers reveals that the average sample age is 34.3 years. Historically, the
population standard deviation of the age of the company’s engineers is approximately
8 years. Construct a 98% confidence interval to estimate the average age of all the
engineers in this company.

### Solution
Sample Size >= 30
<br />Population Standard Deviation is Known
<br />Use Z-Distribution
<br/> Finite population, hence use the finite population correction factor sqrt((N-n)/(N-1)), where N = Population Size and n = Sample Size

In [50]:
sample_mean = 34.3
population_std = 8
sample_size = 50
population_size = 800
confidence = 0.98

# Finite population correction factor, applied to the standard error
fcf = sqrt((population_size - sample_size) / (population_size - 1))

# Standard error of the mean = (population std / sqrt(sample size)) * correction factor
sample_mean_std = (population_std/(sample_size ** 0.5)) * fcf

sample_mean_dist = stats.norm(loc=sample_mean,scale=sample_mean_std) 

In [51]:
alpha = 1 - confidence  # significance level; interval() takes the confidence level itself
sample_mean_dist.interval(confidence)

(31.750019349305987, 36.84998065069401)

In [52]:
# Tail area left out on each side: at 98% confidence, 1% in each tail
left_out_area = (1-confidence)/2

# upper cut off point is 99% area
upper_area = 1-left_out_area

# Lower cut off point is 1% area
lower_area = left_out_area

# sample mean value corresponding to 99% area 
upper_bound = sample_mean_dist.ppf(upper_area)
upper_bound = round(upper_bound,4)

# sample mean value corresponding to 1% area
lower_bound = sample_mean_dist.ppf(lower_area)
lower_bound = round(lower_bound,4)

In [53]:
print(f"At {confidence*100}% confidence, Actual population mean is between {lower_bound} and {upper_bound}")

At 98.0% confidence, Actual population mean is between 31.75 and 36.85


#### Without Finite Population Correction Factor

In [54]:
# Standard error of the mean = population std / sqrt(sample size), with no correction factor
sample_mean_std = (population_std/(sample_size ** 0.5))

sample_mean_dist = stats.norm(loc=sample_mean,scale=sample_mean_std)

In [55]:
alpha = 1 - confidence  # significance level; interval() takes the confidence level itself
sample_mean_dist.interval(confidence)

(31.668037828586897, 36.9319621714131)

In [56]:
# Tail area left out on each side: at 98% confidence, 1% in each tail
left_out_area = (1-confidence)/2

# upper cut off point is 99% area
upper_area = 1-left_out_area

# Lower cut off point is 1% area
lower_area = left_out_area

# sample mean value corresponding to 99% area 
upper_bound = sample_mean_dist.ppf(upper_area)
upper_bound = round(upper_bound,4)

# sample mean value corresponding to 1% area
lower_bound = sample_mean_dist.ppf(lower_area)
lower_bound = round(lower_bound,4)

In [57]:
print(f"At {confidence*100}% confidence, Actual population mean is between {lower_bound} and {upper_bound}")

At 98.0% confidence, Actual population mean is between 31.668 and 36.932
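
The effect of the correction factor can be seen by comparing the two margins of error directly. The following is a small sketch using the numbers from the engineers problem; the function name `finite_population_correction` and the variables `fpc`, `margin_plain`, and `margin_corrected` are illustrative names, not part of the original notebook.

```python
from math import sqrt
from scipy import stats

def finite_population_correction(population_size, sample_size):
    """Factor sqrt((N - n) / (N - 1)) applied to the standard error."""
    return sqrt((population_size - sample_size) / (population_size - 1))

# Values from the engineers problem: N = 800, n = 50, sigma = 8, 98% confidence
fpc = finite_population_correction(800, 50)
z_critical = stats.norm.ppf(1 - (1 - 0.98) / 2)

margin_plain = z_critical * 8 / sqrt(50)   # margin of error without correction
margin_corrected = margin_plain * fpc      # margin of error with correction

print(f"correction factor: {fpc:.4f}")
print(f"margin without correction: {margin_plain:.4f}")
print(f"margin with correction: {margin_corrected:.4f}")
```

Because the factor is always below 1 when n > 1, the corrected interval is narrower. Here the sample is 50/800 = 6.25% of the population, so the reduction is modest.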


### Problem
The owner of a large equipment rental company wants to make a
rather quick estimate of the average number of days a piece of
ditchdigging equipment is rented out per person per time. The company
has records of all rentals, but the amount of time required to
conduct an audit of all accounts would be prohibitive. The owner decides to take a random sample of rental invoices. Fourteen different rentals of
ditchdiggers are selected randomly from the files, yielding the following data. She
uses these data to construct a 99% confidence interval to estimate the average number
of days that a ditchdigger is rented and assumes that the number of days per
rental is normally distributed in the population.
<br/>3 1 3 2 5 1 2 1 4 2 1 3 1 1

### Solution
Sample Size < 30
<br />Population Standard Deviation is Unknown
<br />Population is Normally Distributed
<br />Use t-Distribution

In [58]:
sample = [3,1,3,2,5,1,2,1,4,2,1,3,1,1]
sample_size = len(sample) 
confidence = 0.99

In [59]:
sample_mean = np.mean(sample)
sample_mean = round(sample_mean,2)

# Sample standard deviation: ddof=1 makes np.std divide by n-1 instead of n
sample_std = np.std(sample,ddof=1)
sample_std = round(sample_std,2)

print(f"The sample mean is {sample_mean} and the sample standard deviation is {sample_std}.")

The sample mean is 2.14 and the sample standard deviation is 1.29.


In [60]:
# Standard error of the mean = sample std / sqrt(sample size)
sample_mean_std = (sample_std/(sample_size ** 0.5))
df = sample_size - 1
sample_mean_dist = stats.t(loc=sample_mean,scale=sample_mean_std,df=df)

In [61]:
alpha = 1 - confidence  # significance level; interval() takes the confidence level itself
sample_mean_dist.interval(confidence)

(1.101466689862367, 3.178533310137633)

In [62]:
# Tail area left out on each side: at 99% confidence, 0.5% in each tail
left_out_area = (1-confidence)/2

# upper cut off point is 99.5% area
upper_area = 1-left_out_area

# Lower cut off point is 0.5% area
lower_area = left_out_area

# sample mean value corresponding to 99.5% area 
upper_bound = sample_mean_dist.ppf(upper_area)
upper_bound = round(upper_bound,4)

# sample mean value corresponding to 0.5% area
lower_bound = sample_mean_dist.ppf(lower_area)
lower_bound = round(lower_bound,4)

In [63]:
print(f"At {confidence*100}% confidence, Actual population mean is between {lower_bound} and {upper_bound}")

At 99.0% confidence, Actual population mean is between 1.1015 and 3.1785
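
The calculation above rounds the sample mean and standard deviation to two decimals before building the interval, which shifts the bounds slightly. As a hedged alternative, the sketch below computes the same t interval from the unrounded statistics using `stats.sem` and `stats.t.interval`; the variable names (`rental_days`, `t_lower`, `t_upper`, and so on) are chosen here for illustration.

```python
import numpy as np
from scipy import stats

rental_days = [3, 1, 3, 2, 5, 1, 2, 1, 4, 2, 1, 3, 1, 1]
confidence_level = 0.99

mean_days = np.mean(rental_days)
standard_error = stats.sem(rental_days)       # sample std (ddof=1) / sqrt(n)
degrees_of_freedom = len(rental_days) - 1

# First argument is the confidence level, second is the degrees of freedom
t_lower, t_upper = stats.t.interval(
    confidence_level, degrees_of_freedom, loc=mean_days, scale=standard_error
)
print(round(t_lower, 4), round(t_upper, 4))
```

The bounds should land close to the 1.1015 and 3.1785 reported above, differing only in the third decimal place because no intermediate rounding is applied.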


# Proportion Estimation

### Problem
Coopers & Lybrand surveyed 210 chief executives of fast-growing small companies.
Only 51% of these executives had a management succession plan in place. A
spokesperson for Coopers & Lybrand said that many companies do not worry about
management succession unless it is an immediate problem. However, the unexpected
exit of a corporate leader can disrupt and unfocus a company for long enough
to cause it to lose its momentum.
Use the data given to compute a 92% confidence interval to estimate the proportion
of all fast-growing small companies that have a management succession plan.

In [64]:
p = 0.51
n = 210
q = 1-p
confidence = 0.92

In [65]:
if (n * p > 5) and (n*q>5):
    print('We have enough sample size to apply central limit theorem for population proportion estimate')
else:
    print('We do not have enough sample size to apply central limit theorem for population proportion estimate')

We have enough sample size to apply central limit theorem for population proportion estimate


In [66]:
sample_proportion_dist_mean = p
sample_proprtion_dist_std = sqrt((p * q)/n)

sample_proprtion_dist = stats.norm(loc=sample_proportion_dist_mean,scale=sample_proprtion_dist_std) 

In [67]:
alpha = 1 - confidence  # significance level; interval() takes the confidence level itself
sample_proprtion_dist.interval(confidence)

(0.44960767394038487, 0.5703923260596151)

In [68]:
# Tail area left out on each side: at 92% confidence, 4% in each tail
left_out_area = (1-confidence)/2

# upper cut off point is 96% area
upper_area = 1-left_out_area

# Lower cut off point is 4% area
lower_area = left_out_area

# sample proportion value corresponding to 96% area
upper_bound = sample_proprtion_dist.ppf(upper_area)
upper_bound = round(upper_bound,4)

# sample proportion value corresponding to 4% area
lower_bound = sample_proprtion_dist.ppf(lower_area)
lower_bound = round(lower_bound,4)

In [69]:
print(f"At {confidence*100}% confidence, Actual population proportion is between {lower_bound} and {upper_bound}")

At 92.0% confidence, Actual population proportion is between 0.4496 and 0.5704
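
The steps for the proportion estimate (sample size check, standard error, critical value) can be gathered into one helper. This is a minimal sketch under the same normal approximation used above; the function name `proportion_interval` and the variables `p_lower` and `p_upper` are hypothetical, introduced only for this example.

```python
from math import sqrt
from scipy import stats

def proportion_interval(sample_proportion, sample_size, confidence):
    """Two-sided normal-approximation interval for a population proportion."""
    complement = 1 - sample_proportion
    # Rule of thumb used in this notebook: n*p and n*q should both exceed 5
    if sample_size * sample_proportion <= 5 or sample_size * complement <= 5:
        raise ValueError("sample too small for the normal approximation")
    standard_error = sqrt(sample_proportion * complement / sample_size)
    z_critical = stats.norm.ppf(1 - (1 - confidence) / 2)
    margin_of_error = z_critical * standard_error
    return sample_proportion - margin_of_error, sample_proportion + margin_of_error

# Values from the succession plan survey: p = 0.51, n = 210, 92% confidence
p_lower, p_upper = proportion_interval(0.51, 210, 0.92)
print(round(p_lower, 4), round(p_upper, 4))
```

Raising an error when the rule of thumb fails is a design choice for this sketch; the notebook cell above only prints a message.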


# Variance Estimation

### Problem
The U.S. Bureau of Labor Statistics publishes data on the hourly compensation
costs for production workers in manufacturing for various countries. The latest
figures published for Greece show that the average hourly wage for a production worker in manufacturing is 16.10 dollars. Suppose the business council of Greece wants
to know how consistent this figure is. They randomly select 25 production workers
in manufacturing from across the country and determine that the standard deviation
of hourly wages for such workers is $1.12. Use this information to develop a
95% confidence interval to estimate the population variance for the hourly wages
of production workers in manufacturing in Greece. Assume that the hourly wages
for production workers across the country in manufacturing are normally
distributed.

### Solution

As the population is normally distributed, the chi-square distribution can be used to construct a confidence interval for the variance.

In [70]:
sample_std = 1.12
sample_var = sample_std ** 2
n = 25
df = n-1
confidence = 0.95

In [101]:
chi_dist = stats.chi2(df=df)
alpha = 1 - confidence  # significance level; interval() takes the confidence level itself
chi_dist.interval(confidence)

(12.401150217444439, 39.36407702660391)

In [115]:
# Tail area left out on each side: at 95% confidence, 2.5% in each tail
left_out_area = (1-confidence)/2

# upper cut off point is 97.5% area
upper_area = 1-left_out_area

# Lower cut off point is 2.5% area
lower_area = left_out_area

# chi-square value at 97.5% area; the larger value is the divisor for the lower bound of the variance
lower_bound = chi_dist.ppf(upper_area)
lower_bound = round(lower_bound,4)

# chi-square value at 2.5% area; the smaller value is the divisor for the upper bound of the variance
upper_bound = chi_dist.ppf(lower_area)
upper_bound = round(upper_bound,4)

In [116]:
lower_bound_var = (n-1)*sample_var/lower_bound
upper_bound_var = (n-1)*sample_var/upper_bound

In [118]:
print(f"At {confidence*100}% confidence, Actual population variance is between {lower_bound_var} and {upper_bound_var}")

At 95.0% confidence, Actual population variance is between 0.7647983822823334 and 2.427636035222398
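
The variance interval follows the formula (n-1)s²/χ²(upper) ≤ σ² ≤ (n-1)s²/χ²(lower), where the larger chi-square value gives the lower bound. The sketch below wraps that in a helper and also takes square roots to get an interval for the standard deviation, which is an addition not made in the notebook; the names `variance_interval`, `var_lower`, `var_upper`, `std_lower`, and `std_upper` are illustrative.

```python
from math import sqrt
from scipy import stats

def variance_interval(sample_variance, sample_size, confidence):
    """Two-sided chi-square interval for the variance of a normal population."""
    df = sample_size - 1
    alpha = 1 - confidence
    chi2_large = stats.chi2.ppf(1 - alpha / 2, df)  # upper-tail critical value
    chi2_small = stats.chi2.ppf(alpha / 2, df)      # lower-tail critical value
    # Dividing by the larger chi-square value yields the lower bound
    return df * sample_variance / chi2_large, df * sample_variance / chi2_small

# Values from the Greece wage problem: s = 1.12, n = 25, 95% confidence
var_lower, var_upper = variance_interval(1.12 ** 2, 25, 0.95)
std_lower, std_upper = sqrt(var_lower), sqrt(var_upper)

print(f"variance: {var_lower:.4f} to {var_upper:.4f}")
print(f"standard deviation: {std_lower:.4f} to {std_upper:.4f}")
```

Unlike the intervals for the mean and proportion, this interval is not symmetric around the sample variance of 1.2544, because the chi-square distribution is skewed.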
