## Central Limit Theorem, Confidence Intervals, and Hypothesis Testing

In [45]:
import  scipy.stats                     as  stats
import  numpy                           as  np

### Example 1

A sample of 100 diabetic patients was chosen to estimate the length of stay at a local hospital. 
The sample mean was 4.5 days and the population standard deviation was known to be 1.2 days.

* a) Calculate the 95% confidence interval for the population mean.
* b) What is the probability that the population mean is greater than 4.73 days?
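
For reference, when the population standard deviation $\sigma$ is known, the two-sided interval takes the standard form below. The numbers are the ones given in this example, and $z_{\alpha/2} \approx 1.96$ for a 95% confidence level:

$$\overline{X} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} = 4.5 \pm 1.96 \times \frac{1.2}{\sqrt{100}} = 4.5 \pm 0.2352$$

This gives an interval of roughly 4.2648 to 4.7352 days.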

In [52]:
#[A] Method 1 : Manual Calculation
#--------------------------------

Xbar  = 4.5 
sigma = 1.2
n     = 100
se    = sigma / np.sqrt(n)
ci = 0.95

zcrit = np.round(stats.norm.isf((1-ci)/2),2)

Lower_Interval = Xbar - (zcrit * se) #Lower_Interval = Xbar - (1.96 * se)
Upper_Interval = Xbar + (zcrit * se) #Upper_Interval = Xbar + (1.96 * se)

print('95 %s confidence interval for population mean is %1.4f  to %1.4f' % ('%', Lower_Interval , Upper_Interval))

#[B] Method 2 : stats.norm.interval() method
#-------------------------------------------


se  = sigma / np.sqrt(n)
LCI, UCI = stats.norm.interval(ci, loc = Xbar, scale = se) # Give confidence interval 95%, mean and std as arguments to get CI
print('95 %s confidence interval for population mean is %1.4f  to %1.4f' % ('%', LCI , UCI))


95 % confidence interval for population mean is 4.2648  to 4.7352
95 % confidence interval for population mean is 4.2648  to 4.7352


In [35]:
Zbar = (4.73 - Xbar) / se
P = stats.norm.sf(Zbar) # P = 1- stats.norm.cdf(Zbar)
print('Probability that the population mean is greater than 4.73 days %1.4f' % P)

Probability that the population mean is greater than 4.73 days 0.0276


### Example 2

Hindustan Pencils Pvt. Ltd. is an Indian manufacturer of pencils, writing materials and other stationery items, established in 1958 in Mumbai. The Nataraj brand of pencils manufactured by the company is expected to have a mean length of 172 mm, with a standard deviation of 0.02 mm.

To ensure quality, a sample is selected at periodic intervals to determine whether the length is still 172 mm and other dimensions of the pencil meet the quality standards set by the company.

You select a random sample of 100 pencils and the mean is 170 mm. 

Construct a 95% confidence interval for the pencil length.

### Solution

In [44]:
mu = 172
sigma = 0.02
n = 100
xbar = 170
ci = 0.95

se = sigma/np.sqrt(n)
print(se)
LCI, UCI = stats.norm.interval(ci, loc = xbar, scale = se)

print('95%s confidence interval for pencil length is : %.4f to %.4f' %('%', LCI, UCI))


0.002
95% confidence interval for pencil length is : 169.9961 to 170.0039
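
The stated purpose of the sampling is to check whether the length is still 172 mm. One way to read the interval is to check whether that target value lies inside it. The following is a minimal sketch using the numbers from this example; the names `target` and `target_inside` are introduced here only for illustration.

```python
import numpy as np
import scipy.stats as stats

target = 172                      # expected mean length in mm (from the example)
xbar, sigma, n = 170, 0.02, 100   # sample mean, population std dev, sample size
se = sigma / np.sqrt(n)           # standard error of the mean

lower, upper = stats.norm.interval(0.95, loc=xbar, scale=se)
target_inside = lower <= target <= upper

print('95%% CI: %.4f to %.4f' % (lower, upper))
print('Does the interval contain the 172 mm target? :', 'Yes' if target_inside else 'No')
```

Because the interval is so narrow relative to the 2 mm gap, the target falls well outside it, which suggests the process mean has moved away from 172 mm.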


### Example 3

Construct a 99% confidence interval for the following examples given above:
* a. Example 1
* b. Example 2

In [72]:
# a. Example 1 - 99% Confidence interval

ci = 0.99
Xbar  = 4.5 
sigma = 1.2
n     = 100
se    = sigma / np.sqrt(n)

zcrit = np.round(stats.norm.isf((1-ci)/2),2)

Lower_Interval = Xbar - (zcrit * se)
Upper_Interval = Xbar + (zcrit * se)

print('99 %s confidence interval for population mean is %1.4f  to %1.4f' % ('%', Lower_Interval , Upper_Interval))

# b. Example 2 - 99% Confidence interval

mu = 172
sigma = 0.02
n = 100
xbar = 170

se = sigma/np.sqrt(n)

zcrit = np.round(stats.norm.isf((1-ci)/2),2)

Lower_Interval = xbar - (zcrit * se)
Upper_Interval = xbar + (zcrit * se)
LCI, UCI = np.round(stats.norm.interval(ci, loc=xbar, scale=se),4)

print()
print(Lower_Interval, '  ',Upper_Interval )
print(LCI, '  ',UCI)


99 % confidence interval for population mean is 4.1904  to 4.8096

169.99484    170.00516
169.9948    170.0052
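
Raising the confidence level widens the interval, because a larger critical value is multiplied by the same standard error. The sketch below illustrates this with the Example 1 numbers; the dictionary `widths` and the extra 90% level are additions made here for comparison only.

```python
import numpy as np
import scipy.stats as stats

xbar, sigma, n = 4.5, 1.2, 100    # Example 1 values
se = sigma / np.sqrt(n)

widths = {}
for level in (0.90, 0.95, 0.99):
    lower, upper = stats.norm.interval(level, loc=xbar, scale=se)
    widths[level] = upper - lower
    print('%d%% interval: %.4f to %.4f (width %.4f)'
          % (round(level * 100), lower, upper, widths[level]))
```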


## Confidence interval for population mean when standard deviation is unknown

**We observe that the values of t and Z converge for higher degrees of freedom.**
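
The convergence can be checked numerically. This is a small sketch that compares the two-sided 95% critical value of the standard normal distribution with the t critical value for a few degrees of freedom; the particular degrees of freedom are chosen arbitrarily for illustration.

```python
import scipy.stats as stats

confidence = 0.95
tail = (1 - confidence) / 2       # upper-tail area for a two-sided interval

z_crit = stats.norm.isf(tail)
t_crits = {df: stats.t.isf(tail, df) for df in (5, 14, 30, 99, 1000)}

print('z critical value           : %.4f' % z_crit)
for df, t_crit in t_crits.items():
    print('t critical value (df=%4d) : %.4f' % (df, t_crit))
```

The t critical value is always larger than the Z value, and the gap shrinks as the degrees of freedom grow.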

### Example 4

The following table contains the length of stay in minutes of each customer at a Fast Food restaurant.

|      |      |      |      |      |
| ---  | ---  | ---  | ---  | ---  |
| 7.42 | 6.29 | 5.83 | 6.50 | 8.34 |
| 9.51 | 7.10 | 6.80 | 5.90 | 4.89 |
| 6.50 | 5.52 | 7.90 | 8.30 | 9.60 |

* a. *Construct a 95% confidence interval estimate for the population mean length of stay at the Fast Food restaurant, assuming a normal distribution.*
* b. *Interpret the interval constructed at a.*

In [94]:
L = [7.42, 6.29, 5.83, 6.50, 8.34, 9.51, 7.10, \
     6.80, 5.90, 4.89, 6.50, 5.52, 7.90, 8.30, 9.60]
lengthStay = np.array(L)
#sample statistics
xbar        = lengthStay.mean(axis = 0)
S         = np.std(lengthStay,ddof = 1)
# ddof = 1 divides the sum of squared deviations by n - 1, giving the sample standard deviation

n      = 15
SL_2   = 0.025
deg_fr = n - 1
se = S/np.sqrt(n)

tcrit = np.abs(round(stats.t.isf( (SL_2), deg_fr),4))
LCI         = xbar - (tcrit * se)
UCI         = xbar + (tcrit * se)
print('a. 95%s confidence interval for population mean is %1.4f  to %1.4f' % ('%', LCI , UCI))
print()
print('b. You can be 95%s confident that the mean length of stay at a Fast Food restaurant lies between %.4f minutes and %.4f minutes.'%('%', LCI, UCI))

a. 95% confidence interval for population mean is 6.3147  to 7.8720

b. You can be 95% confident that the mean length of stay at a Fast Food restaurant lies between 6.3147 minutes and 7.8720 minutes.


### Another method using scipy.stats

In [95]:
conf_level  = 0.95   # confidence level (not the significance level alpha)
LCI, UCI    = stats.t.interval(conf_level, deg_fr, xbar, se)
print('a. 95 %s confidence interval for population mean is %1.4f  to %1.4f' % ('%', LCI , UCI))
print()
print('b. You can be 95%s confident that the mean length of stay at a Fast Food restaurant lies between %.4f minutes and %.4f minutes.'%('%', LCI, UCI))

a. 95 % confidence interval for population mean is 6.3147  to 7.8720

b. You can be 95% confident that the mean length of stay at a Fast Food restaurant lies between 6.3147 minutes and 7.8720 minutes.
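
The phrase "95% confident" describes the procedure: if many samples were drawn and an interval built from each, about 95% of those intervals would contain the true mean. The simulation below is a rough sketch of that idea. The population parameters (`true_mean`, `true_sd`), the seed and the number of trials are hypothetical choices made for illustration and are not taken from the example.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)    # fixed seed, chosen arbitrarily
true_mean, true_sd = 7.0, 1.5     # hypothetical normal population
n, trials = 15, 2000              # same sample size as Example 4

tcrit = stats.t.isf(0.025, n - 1)

# Draw all samples at once: one row per simulated sample
samples = rng.normal(true_mean, true_sd, size=(trials, n))
means = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n)

# Check, for each sample, whether its interval contains the true mean
covered = (means - tcrit * ses <= true_mean) & (true_mean <= means + tcrit * ses)
coverage = covered.mean()

print('Fraction of intervals containing the true mean: %.3f' % coverage)
```

The printed fraction should land close to 0.95, with some variation from run to run if the seed is changed.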


### Example 5

The time taken, in days, to resolve complaints from 100 customers in a service organization is given below:

 |      |      |      |      |      |      |      |      |      |      |
 | ---  | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
 | 2.50 | 3.26 | 2.79 | 3.74 | 5.60 | 3.24 | 3.65 | 3.91 | 4.35 | 3.35 |
 | 5.67 | 5.38 | 3.54 | 5.10 | 3.66 | 3.01 | 3.96 | 4.98 | 4.56 | 5.00 |
 | 5.03 | 5.29 | 4.91 | 4.63 | 2.94 | 3.82 | 4.76 | 2.24 | 4.25 | 3.45 |
 | 3.14 | 4.64 | 4.56 | 4.61 | 2.68 | 3.61 | 5.46 | 2.83 | 4.84 | 4.31 |
 | 2.98 | 3.90 | 4.45 | 3.62 | 6.15 | 4.04 | 5.19 | 4.63 | 2.78 | 2.95 |
 | 3.65 | 4.49 | 3.52 | 4.07 | 4.16 | 5.56 | 2.69 | 6.69 | 1.26 | 3.14 |
 | 4.71 | 4.80 | 3.41 | 3.18 | 4.64 | 4.23 | 4.36 | 3.94 | 3.81 | 4.26 |
 | 2.92 | 2.87 | 2.08 | 3.09 | 3.60 | 2.93 | 3.85 | 4.66 | 4.70 | 3.61 |
 | 5.59 | 3.39 | 3.13 | 4.14 | 4.23 | 4.25 | 4.12 | 5.95 | 4.76 | 4.96 |
 | 2.27 | 3.77 | 5.25 | 3.05 | 3.20 | 5.22 | 3.84 | 2.24 | 4.75 | 3.07 |


* a. *Construct a 95% confidence interval estimate for the population mean days to resolve customer complaints,
      assuming a normal distribution.*
* b. *Interpret the interval constructed at a.*

**Hint**

* Use the following code to create the NumPy array `resolvedDays` for this problem.

In [108]:
resolved_in_days = [2.50, 3.26, 2.79, 3.74, 5.60, 3.24, 3.65, 3.91, 4.35, 3.35,\
5.67, 5.38, 3.54, 5.10, 3.66, 3.01, 3.96, 4.98, 4.56, 5.00,\
5.03, 5.29, 4.91, 4.63, 2.94, 3.82, 4.76, 2.24, 4.25, 3.45,\
3.14, 4.64, 4.56, 4.61, 2.68, 3.61, 5.46, 2.83, 4.84, 4.31,\
2.98, 3.90, 4.45, 3.62, 6.15, 4.04, 5.19, 4.63, 2.78, 2.95,\
3.65, 4.49, 3.52, 4.07, 4.16, 5.56, 2.69, 6.69, 1.26, 3.14,\
4.71, 4.80, 3.41, 3.18, 4.64, 4.23, 4.36, 3.94, 3.81, 4.26,\
2.92, 2.87, 2.08, 3.09, 3.60, 2.93, 3.85, 4.66, 4.70, 3.61,\
5.59, 3.39, 3.13, 4.14, 4.23, 4.25, 4.12, 5.95, 4.76, 4.96,\
2.27, 3.77, 5.25, 3.05, 3.20, 5.22, 3.84, 2.24, 4.75, 3.07]

resolvedDays = np.array(resolved_in_days)

xbar = resolvedDays.mean()
S = resolvedDays.std(ddof=1)
n = len(resolvedDays)
se = S/np.sqrt(n)
deg_fr = n-1
alpha = 0.05

tcrit = np.abs(round(stats.t.isf( (alpha/2), deg_fr),4))
print(tcrit)
LCI = xbar - (tcrit * se)
UCI = xbar + (tcrit * se)

print('a. 95 %s confidence interval for population mean is %1.4f  to %1.4f' % ('%', LCI , UCI))
print()
conf_level  = 0.95   # confidence level (not the significance level alpha)
LCI, UCI    = stats.t.interval(conf_level, deg_fr, xbar, se)
print('b. 95 %s confidence interval for population mean is %1.4f  to %1.4f' % ('%', LCI , UCI))



1.9842
a. 95 % confidence interval for population mean is 3.8016  to 4.1984

b. 95 % confidence interval for population mean is 3.8016  to 4.1984


### Example 6

A beverage company produces mineral water in 250 ml, 500 ml, 1 litre and 2 litre bottles and in 5 litre, 15 litre and 20 litre jars.
Consider the 2 litre bottles. The company specification requires a mean volume of 2 litres per bottle.
You must adjust the water filling process when the mean volume in the population of bottles differs from 2 litres. Adjusting the process requires shutting down the water filling production line completely, so you do not want to make adjustments unnecessarily.

Assume a sample of 50 water bottles indicates a sample mean, $\overline{X}$, of 2.001 litres and the population standard deviation, $\sigma$, is 15 ml. At the 5% level of significance, test whether the population mean volume differs from 2 litres.

In [35]:
#null hypothesis - Ho = 'mean volume in the population of 2 ltr bottle is 2 ltr'
#alternate hypothesis - Ha = 'mean volume in the population of 2 ltr bottle is not 2 ltr'

mu = 2
sigma = 0.015
n = 50
xbar = 2.001
alpha = 0.05

se = sigma/np.sqrt(n)
zscore = (xbar-mu)/se
print('Z score of the sample mean is : ', zscore)

cu = stats.norm.isf(0.025)
cl = stats.norm.ppf(0.025)

print('\nNon-rejection region for Ho is between ', cu, ' and ', cl)
print('\nCritical region (rejection region for Ho) is ->')
print('>', cu , '\nand\n <',cl)

# Compare the z score (not xbar) with the critical values; reject Ho only when it lies outside [cl, cu]
critical_region = (( zscore > cu ) or ( zscore < cl ))
   
print('\nDoes the sample mean fall into the rejection region ? : ', 'Yes' if critical_region else 'No')
print('Should we reject null hypothesis - Ho? : ', 'Yes' if critical_region else 'No')

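# xbar > mu here, so the two-sided p-value is twice the upper-tail probability;
# in general, 2 * stats.norm.sf(abs(zscore)) covers both cases
pvalue = 2*(1-stats.norm.cdf(xbar, loc=mu, scale=se))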
print('\np-value is : ', pvalue)
print('Is p-value < alpha(Should we reject Ho )? : ','Yes' if(pvalue < alpha) else 'No' )

Z score of the sample mean is :  0.4714045207909798

Non-rejection region for Ho is between  1.9599639845400545  and  -1.9599639845400545

Critical region (rejection region for Ho) is ->
> 1.9599639845400545 
and
 < -1.9599639845400545

Does the sample mean fall into the rejection region ? :  No
Should we reject null hypothesis - Ho? :  No

p-value is :  0.6373518882339742
Is p-value < alpha(Should we reject Ho )? :  No


### Example 7

In a bank, the average time taken for getting a demand draft or bankers cheque is 15 minutes.
From the past experience, you can assume that the population is normally distributed with a population standard deviation of 1.6 minutes. 

You select a sample of 50 requests for demand drafts and the sample mean is 14 minutes.

#### Use the five step approach listed above to determine whether there is evidence at a 5% level of significance that the population mean service time to get the demand draft has changed from the population mean of 15 minutes. 

In [56]:
# H0 = population mean is 15 minutes
# H1 = population mean is NOT 15 minutes

mu = 15
sigma = 1.6
n = 50
xbar = 14
alpha = 0.05

se = sigma/np.sqrt(n)
zscore = (xbar-mu)/se
print('Zscore for the xbar is : ', zscore)

cu = stats.norm.isf(0.025)
cl = stats.norm.ppf(0.025)

print('\nNon-rejection region for H0 lies between : ', cl, ' and ', cu)
rejection_region = not((cu >= zscore ) and (cl <= zscore ))
print('\nShould we reject H0(Based on z-value test)? : ', np.where(rejection_region, 'Yes','No'))

# xbar < mu here, so the two-sided p-value is twice the lower-tail probability
p_value = stats.norm.cdf(xbar, loc=mu, scale=se)* 2
print('\nP-value for xbar is : ', p_value)
reject = (p_value <= alpha)
print('\nShould we reject H0(Based on p-value test)? : ', np.where(reject,'Yes', 'No'))


Zscore for the xbar is :  -4.419417382415922

Non-rejection region for H0 lies between :  -1.9599639845400545  and  1.9599639845400545

Should we reject H0(Based on z-value test)? :  Yes

P-value for xbar is :  9.896734625245587e-06

Should we reject H0(Based on p-value test)? :  Yes
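
Examples 6 and 7 follow the same steps, so they can be wrapped in one helper. The function below is a sketch only: `two_sided_z_test` is a name made up here, and it assumes a known population standard deviation and a two-sided alternative. It uses `stats.norm.sf(abs(z))` so that the p-value is correct whether the sample mean is above or below the hypothesized mean.

```python
import numpy as np
import scipy.stats as stats

def two_sided_z_test(xbar, mu0, sigma, n, alpha=0.05):
    """Return (z score, p-value, reject H0?) for H0: mu = mu0 vs H1: mu != mu0."""
    se = sigma / np.sqrt(n)                  # standard error of the mean
    z = (xbar - mu0) / se                    # test statistic
    p_value = 2 * stats.norm.sf(abs(z))      # two-sided p-value
    return z, p_value, bool(p_value < alpha)

# Example 6: water bottles (volumes in litres)
z6, p6, reject6 = two_sided_z_test(xbar=2.001, mu0=2, sigma=0.015, n=50)
# Example 7: demand draft service time (minutes)
z7, p7, reject7 = two_sided_z_test(xbar=14, mu0=15, sigma=1.6, n=50)

print('Example 6: z = %.4f, p-value = %.4f, reject H0? %s' % (z6, p6, reject6))
print('Example 7: z = %.4f, p-value = %.6f, reject H0? %s' % (z7, p7, reject7))
```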


### Example 8

A manufacturer claims that the mean lifetime of an LED lamp is more than 50000 hours. Assume the actual mean LED lamp lifetime is 49950 hours and the population standard deviation is 120 hours. 

At 5% level of significance, what is the probability of having type II error for a sample size of 30 LED lamps?

* $H_0$: $\mu \geq 50000$ hours
* $H_1$: $\mu < 50000$ hours
* We need to find P(fail to reject $H_0$ | $\mu$ = 49950), that is, the probability that the sample mean is at or above the critical value when the actual mean is 49950 hours

In [109]:
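# Type II error -> failing to reject H0 when H0 is false

mu = 50000     # hypothesized mean under H0
sigma = 120    # population standard deviation
xbar = 49950   # assumed actual mean
n = 30
alpha = 0.05

se = sigma/np.sqrt(n)
# Lower-tail probability of a sample mean of 49950 or less when the true mean is 50000
stats.norm.cdf(xbar, loc = mu, scale = se)


0.011239436683062633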

In [127]:
n         = 30    # sample size
sigma     = 120   # population standard deviation
se        = sigma / np.sqrt(n) # standard error

alpha     = 0.05  # significance level
mu0       = 50000 # hypothesized mean under H0
# Lower critical value: isf(1 - alpha) equals ppf(alpha); H0 is rejected when the sample mean falls below q
q         = int(round(stats.norm.isf(1-alpha, loc = mu0, scale = se),0))
print(q)

49964


* Assume actual mean LED lamp lifetime is 49950 hours 
* We need to find P(sample mean $\geq$ 49964 | actual mean = 49950), the probability of failing to reject $H_0$ when $H_1$ is true

In [125]:
mu1  = 49950 # Actual mean

#p = round(1 - stats.norm.cdf(q, loc = mu1, scale = se),4)
p = round(stats.norm.sf(q, loc=mu1, scale = se),4)

print('At 5 %s level of significance, the probability of having type II error\n\
       for a sample size of 30 LED lamps is %2.4f' %('%',p))
print('At 5 %s level of significance, the POWER OF THE TEST\n \
      for a sample size of 30 LED lamps is %2.4f' %('%',1 - p))


At 5 % level of significance, the probability of having type II error
       for a sample size of 30 LED lamps is 0.2614
At 5 % level of significance, the POWER OF THE TEST
       for a sample size of 30 LED lamps is 0.7386
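
The type II error depends on the sample size: a larger sample shrinks the standard error and makes the test more likely to detect the shift. The sketch below wraps the calculation in a helper and evaluates it for a few sample sizes. The function name `type2_error_lower_tail` and the sample sizes 50 and 100 are assumptions added for illustration. Unlike the cell above, the critical value is not rounded to a whole number, so the figure for n = 30 differs slightly from 0.2614.

```python
import numpy as np
import scipy.stats as stats

def type2_error_lower_tail(mu0, mu1, sigma, n, alpha=0.05):
    """P(fail to reject H0: mu >= mu0) when the actual mean is mu1 < mu0."""
    se = sigma / np.sqrt(n)
    crit = stats.norm.ppf(alpha, loc=mu0, scale=se)   # reject H0 when xbar < crit
    return stats.norm.sf(crit, loc=mu1, scale=se)     # P(xbar >= crit | mu = mu1)

betas = {}
for n in (30, 50, 100):
    betas[n] = type2_error_lower_tail(mu0=50000, mu1=49950, sigma=120, n=n)
    print('n = %3d : type II error = %.4f, power = %.4f' % (n, betas[n], 1 - betas[n]))
```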
