
# AccelerateAI - Data Science Global Bootcamp
### Hypothesis Testing - Assignment Solution

In [85]:
import numpy as np
import scipy.stats as st
import pandas as pd
import math

### Q1 
A company named Outel Semiconductors has developed a new microprocessor. It wants to test how fast one of these new chips can conduct a certain benchmark calculation. Suppose that the time it takes to complete the calculation is normally distributed. After 10 runs, the sample average time to completion is 32.7 nanoseconds, and the sample variance is 16 ns². Can Outel claim that the true average time to completion is 30 nanoseconds at the 95% confidence level?

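Solution: This is a one-sample test on the mean. 
Null Hypothesis:       µ = 30 
Alternate Hypothesis:  µ ≠ 30  (the claim is contradicted if the true mean differs from 30) 

n = 10 and the population variance is unknown, hence we will conduct a one-sample t-test (with dof = 9). 

Sample mean $\bar{x}$ = 32.7 
Sample s.d. s = √16 = 4 (the question gives the sample variance, not the s.d.)

$t = \dfrac{\bar{x} - 30}{s/\sqrt{n}} = \dfrac{32.7 - 30}{4/\sqrt{10}} \approx 2.135$

In [86]:
se = math.sqrt(16 / 10)                            # standard error s/sqrt(n), with s^2 = 16 and n = 10
t_stat = (32.7 - 30) / se
p_val = 2 * st.t.sf(abs(t_stat), df=9)             # two-sided p-value
print(f't = {t_stat:.3f}, p-value = {p_val:.3f}')

t = 2.135, p-value = 0.062

***Since the p-value (about 0.062) is greater than 0.05, we fail to reject the Null hypothesis. The data are consistent with Outel's claim that the true average time to completion is 30 nanoseconds.***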
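As a cross-check, the test can be computed directly from the summary statistics given in the question. The sketch below assumes a two-sided alternative and uses the fact that a sample variance of 16 implies a standard deviation of 4. The helper name `one_sample_t_from_summary` and its parameter names are illustrative and are not part of the original notebook.

```python
import math
from scipy import stats as st

def one_sample_t_from_summary(x_bar, s2, n, mu0):
    """Two-sided one-sample t-test from summary statistics.

    x_bar: sample mean, s2: sample variance, n: sample size,
    mu0: hypothesized population mean (illustrative names).
    """
    se = math.sqrt(s2 / n)                       # standard error of the mean
    t_stat = (x_bar - mu0) / se                  # t statistic
    df = n - 1                                   # degrees of freedom
    p_two_sided = 2 * st.t.sf(abs(t_stat), df)   # area in both tails
    return t_stat, df, p_two_sided

t_stat, df, p_val = one_sample_t_from_summary(x_bar=32.7, s2=16, n=10, mu0=30)
print(f"t = {t_stat:.3f}, df = {df}, two-sided p = {p_val:.3f}")

# Equivalent decision rule: compare |t| with the critical value at the 5% level
t_crit = st.t.ppf(0.975, df)
print(f"critical value = {t_crit:.3f}, reject H0: {abs(t_stat) > t_crit}")
```

Both routes lead to the same decision: the null hypothesis is rejected only when |t| exceeds the critical value, which is the same as the two-sided p-value falling below 0.05.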


***

### Q4
The table below contains data from a survey of 500 randomly selected households. Researchers would like to use the available sample information to
test whether home ownership rates vary by household location. For example, is there a nonzero difference between the proportions of individuals who own their homes (as opposed to those who rent their homes) in households located in the SW and NW sectors of this community? 
Use the sample data to test for a difference in home ownership rates in these two sectors. Use a 5% significance level. Interpret and summarize your results. 

Solution: <br>
Here we are to compare proportion for 2 independent samples. 

Home ownership proportion in NW sector, pa = 89/129 = 0.6899 <br>
Home ownership proportion in SW sector, pb = 106/123 = 0.8618

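Null Hypothesis, H0: pa = pb <br>
Alternate Hypothesis, Ha: pa ≠ pb 

The pooled proportion:<br>
pc = (89+106)/(129+123) = 0.7738 <br>
Z = (pa-pb)/√(pc(1-pc)(1/na+1/nb)) = -3.2597

In [87]:
p_val = 2 * st.norm.cdf(x=-3.2597)   # two-sided test: double the lower-tail probability
print(f'p-value: {p_val:.4f}')

p-value: 0.0011

***Since the p-value is less than 0.05, we reject the Null hypothesis. There is sufficient evidence that home ownership rates differ between the two sectors.***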
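The hand calculation above can be wrapped in a small function so that the pooled proportion, the z statistic and the two-sided p-value come from the raw counts. This is a minimal sketch: the function name `two_proportion_z_test` and its argument names are illustrative, and the counts are the ones quoted in the solution.

```python
import math
from scipy import stats as st

def two_proportion_z_test(x_a, n_a, x_b, n_b):
    """Two-sided z-test for the difference of two independent proportions."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p_c = (x_a + x_b) / (n_a + n_b)                          # pooled proportion under H0
    se = math.sqrt(p_c * (1 - p_c) * (1 / n_a + 1 / n_b))    # standard error under H0
    z = (p_a - p_b) / se
    p_two_sided = 2 * st.norm.sf(abs(z))                     # area in both tails
    return z, p_two_sided

z, p_val = two_proportion_z_test(x_a=89, n_a=129, x_b=106, n_b=123)
print(f"z = {z:.4f}, two-sided p-value = {p_val:.4f}")
```

Because the alternative hypothesis is pa ≠ pb, the p-value counts both tails, which is why it is twice the one-tail normal probability.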

***

### Q5
Twenty people have rated a new beer on a taste scale of 0 to 100. Their ratings are in the file Q5_Beer_Taste.xlsx. Marketing has determined that the beer will be a success if the average taste rating exceeds 76. Using a 5% significance level, is there sufficient evidence to conclude that the beer will be a success? Discuss your result in terms of a p-value. Assume ratings are at least approximately normally distributed.

In [88]:
# Let's read the file 
beer_df = pd.read_excel("Q5_Beer_Taste.xlsx", sheet_name=0)
beer_df.sample(5)

Unnamed: 0,Person,Rating
21,22,76
18,19,77
58,59,100
32,33,48
4,5,94


In [89]:
#Step 1: Check for data quality
beer_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Person  60 non-null     int64
 1   Rating  60 non-null     int64
dtypes: int64(2)
memory usage: 1.1 KB


In [90]:
# Step 2: Find the sample size, mean and standard deviation 
print("N:", beer_df.shape[0])

mean = beer_df.Rating.mean()
sd = beer_df.Rating.std()
print("Mean rating:", mean, " Std. dev rating:", sd)

N: 60
Mean rating: 79.75  Std. dev rating: 14.502629802433788


We will use a one-sample, right-tailed t-test to check whether the mean rating is greater than 76.
- Null Hypothesis: $\mu \le 76$

- Alternate hypothesis: $\mu > 76$

In [91]:
result = st.ttest_1samp(a=beer_df.Rating, popmean=76, alternative='greater')

print(f'The p-value is {result.pvalue}')

The p-value is 0.024894435454756912


***Since p-value is less than 0.05, we reject the Null hypothesis. Hence we have sufficient evidence to conclude that the new beer will be a success.***

***

### Q6
A market research consultant hired by a leading soft drink company wants to determine the proportion of consumers who favor its low-calorie drink over the
leading competitor’s low-calorie drink in a particular urban location. A random sample of 250 consumers from the market under investigation is provided in the
file Q6_Lowcalorie_Drink.xlsx.

 1. Find a 95% confidence interval for the proportion of all consumers in this market who prefer this company’s drink over the competitor’s. What does this confidence interval tell us?
 2. Does the confidence interval in part 1 support the claim made by one of the company’s marketing managers that more than half of the consumers in this urban location favor its drink over the competitor’s? Explain your answer.
 3. Comment on the sample size used in this study. Specifically, is the sample unnecessarily large? Is it too small? Explain your reasoning.

In [92]:
# Let's read the file 
drink_df = pd.read_excel("Q6_LowcalorieDrink.xlsx", sheet_name=0)
drink_df.sample(5)

Unnamed: 0,Consumer,Gender,Age,Preference
3,4,F,Over 60,Competing brand
137,138,F,Between 20 and 40,Competing brand
131,132,M,Over 60,Competing brand
228,229,F,Over 60,Competing brand
46,47,M,Between 20 and 40,Our brand


In [93]:
# Preference of each drink
drink_df["Preference"].value_counts()

Our brand          134
Competing brand    116
Name: Preference, dtype: int64

In [94]:
# Sample proportion and its standard error
n = len(drink_df)
p = len(drink_df[drink_df["Preference"] == "Our brand"]) / n
se = math.sqrt(p*(1-p)/n)
print(f'Sample proportion={p:0.3f}, Standard error={se:.3f}\n')

# 95% CI = p -/+ z_val*se, with z_val = 1.96 (normal approximation)

print(f'95% CI = [ {p - 1.96*se:.3f} to  {p + 1.96*se:.3f} ]')

Sample proportion=0.536, Standard error=0.032

95% CI = [ 0.474 to  0.598 ]


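***1) The 95% confidence interval for the proportion of consumers who prefer our brand is 47.4% to 59.8%. We are 95% confident that the true proportion in this market lies in this range.***

***2) No, the data do not support the claim that more than half of the consumers favor its drink over the competitor’s, because the 95% CI (47.4% to 59.8%) includes values below 50%.***

***3) The sample size is rather small for this purpose. The 95% CI is about 12 percentage points wide (a margin of error of roughly ±6 points), which is too wide to tell whether the true proportion is above or below 50%.***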
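To put the sample-size comment in numbers, the standard margin-of-error formula for a proportion, E = z·√(p(1−p)/n), can be inverted to give the sample size needed for a target margin. The sketch below uses only the Python standard library. The function names and the ±3 percentage point target are illustrative assumptions and do not come from the assignment.

```python
import math
from statistics import NormalDist

def margin_of_error(p_hat, n, confidence=0.95):
    """Half-width of the normal-approximation CI for a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

def sample_size_for_proportion(margin, confidence=0.95, p_guess=0.5):
    """Smallest n whose margin of error is at most `margin`.

    p_guess = 0.5 is the conservative (worst-case) choice.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z**2 * p_guess * (1 - p_guess) / margin**2)

moe_now = margin_of_error(134 / 250, 250)           # survey counts from the notebook
n_needed = sample_size_for_proportion(margin=0.03)  # hypothetical +/- 3 point target

print(f"Current margin of error: {moe_now:.3f}")
print(f"Sample size for a 0.03 margin: {n_needed}")
```

Because the margin of error shrinks with 1/√n, halving it requires roughly four times as many respondents.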

***

### Q7
A large buyer of household batteries wants to decide which of two equally priced brands to purchase. To do this, he takes a random sample of 100 batteries of each brand. The lifetimes, measured in hours, of the batteries are recorded in the file Q7_Battery_life.csv. Before testing for the difference between the mean lifetimes of these two batteries, he must first determine whether the underlying population variances are equal.<br>
 a. Perform a test for equal population variances.  Report a p-value and interpret its meaning.<br>
 b. Based on your conclusion in part a, which test statistic should be used in performing a test for the difference between population means?

In [95]:
# Let's read the file 
brand1 = pd.read_excel("Q7_Battery_life.xlsx", sheet_name="Brand1")
brand2 = pd.read_excel("Q7_Battery_life.xlsx", sheet_name="Brand2")

In [96]:
brand2.head()

Unnamed: 0,Battery,Lifetime
0,1,110.65
1,2,92.24
2,3,96.63
3,4,99.45
4,5,102.55


***To check for equal variance, we will perform the F-test.***

This test is phrased in terms of the ratio of population variances:
- The null hypothesis is that this ratio is 1 (equal variances)
- The alternative is that it is not 1 (unequal variances).

The test statistic of the F-test for equal variances is simply the ratio of the sample variances:

- F = Var(X) / Var(Y), with degrees of freedom df1 = len(X) - 1 and df2 = len(Y) - 1

In [97]:
#define F-test function
def f_test(x, y):
    x = np.array(x)
    y = np.array(y)
    
    f_stat = np.var(x, ddof=1)/np.var(y, ddof=1)   #calculate F test statistic 
    dfx = x.size-1                                 #degrees of freedom numerator 
    dfy = y.size-1                                 #degrees of freedom denominator 
    p = 1 - st.f.cdf(f_stat, dfn=dfx, dfd=dfy)     #upper-tail (one-sided) p-value, P(F >= f_stat) 
    
    return f_stat, p

In [98]:
#perform F-test
_, p_val = f_test(brand1.Lifetime, brand2.Lifetime)

print(f'p-value corresponding to F-Test of equal variance: {p_val:.3f}')

p-value corresponding to F-Test of equal variance: 1.000



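**a) The value reported above, 1 - CDF, is an upper-tail probability. A value of 1.000 means the observed F statistic lies far in the lower tail, i.e. the sample variance of Brand 1 is much smaller than that of Brand 2. For the two-sided alternative stated above, the p-value is 2·min(p, 1 - p), which is at most about 0.001. Since this is below 0.05, we reject the Null hypothesis: the population variances are not equal.** <br>
**b) Because the variances differ, the difference in population means should be tested with the two-sample independent t-test that does not assume equal variances (Welch's t-test, `st.ttest_ind(..., equal_var=False)`).** 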
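Since the hypotheses are two-sided, the F-test p-value should account for both tails. The sketch below shows one way to do this and then runs an unequal-variance (Welch) t-test. The battery file is not reproduced here, so it uses synthetic data generated with a fixed seed. The function name `f_test_two_sided`, the sample parameters and the seed are all illustrative assumptions.

```python
import numpy as np
from scipy import stats as st

def f_test_two_sided(x, y):
    """Two-sided F-test for equality of two population variances."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    f_stat = np.var(x, ddof=1) / np.var(y, ddof=1)   # ratio of sample variances
    dfn, dfd = x.size - 1, y.size - 1                # numerator / denominator dof
    lower = st.f.cdf(f_stat, dfn, dfd)               # P(F <= f_stat)
    upper = st.f.sf(f_stat, dfn, dfd)                # P(F >= f_stat)
    p_two_sided = min(1.0, 2 * min(lower, upper))    # double the smaller tail
    return f_stat, p_two_sided

# Synthetic stand-in data: same mean, different spread (NOT the assignment data)
rng = np.random.default_rng(0)
a = rng.normal(loc=100, scale=5, size=100)
b = rng.normal(loc=100, scale=10, size=100)

f_stat, p_var = f_test_two_sided(a, b)
print(f"F = {f_stat:.3f}, two-sided p-value = {p_var:.4g}")

# Welch's t-test for the means, which does not assume equal variances
welch = st.ttest_ind(a, b, equal_var=False)
print(f"Welch t = {welch.statistic:.3f}, p-value = {welch.pvalue:.3f}")
```

Doubling the smaller tail makes the result independent of which sample is placed in the numerator. The F-test is sensitive to departures from normality, so `scipy.stats.levene` is a commonly used, more robust alternative.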

***