# Python Solution to Tutorial 02

In [3]:
import scipy.stats as stat 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


#url = 'https://raw.githubusercontent.com/APS1040/Tutorials/main/Tutorial02_data.csv'
#df = pd.read_csv(url)


df = pd.read_csv('Tutorial02_data.csv')

net_contents = df['Net Contents (Oz)'].dropna()
df.head()

| Unnamed: 0 | Net Contents (Oz) | Day | No of Clients Satisfied | Total No of Clients |
|---|---|---|---|---|
| 0 | 12.03 | 1 | 68 | 100 |
| 1 | 12.01 | 2 | 77 | 100 |
| 2 | 12.0 | 3 | 96 | 100 |
| 3 | 12.02 | 4 | 80 | 100 |
| 4 | 12.05 | 5 | 43 | 100 |


## 1.	Does the data support a population mean of 12?  (Yes/No) What is the p-value?

$H_0:\mu=12$, $H_1:\mu\neq 12$. We use a two-sided one-sample t-test because the population variance is unknown and the sample is small (n<40). If n were larger than 40, a z-test could be used instead.

In [9]:
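n = net_contents.size
mu0 = 12
x_bar = net_contents.mean()
s = net_contents.std()  # sample standard deviation (pandas uses ddof=1)
t0 = (x_bar-mu0)/(s/np.sqrt(n))
d_f = n-1
# two-sided p-value; abs() keeps it valid when t0 is negative
p_value = 2 * (1 - stat.t.cdf(abs(t0),df=d_f))
print("p-value = ", p_value)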

p-value =  0.35057055304117135


The p-value is larger than any usual significance level (e.g. 0.05), so we fail to reject $H_0$: the data are consistent with a population mean of 12.
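
As a sanity check, the manual calculation can be compared with `scipy.stats.ttest_1samp`. The sketch below is only illustrative: the array `sample` is synthetic (hypothetical values, not the tutorial data), so the printed numbers will not match the output above. With the real data, `net_contents` would take its place.

```python
import numpy as np
import scipy.stats as stat

# Hypothetical stand-in for net_contents; these values are synthetic.
rng = np.random.default_rng(0)
sample = rng.normal(loc=12.01, scale=0.03, size=26)

mu0 = 12
n = sample.size

# Manual one-sample t statistic and two-sided p-value
t0 = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))
p_manual = 2 * stat.t.sf(abs(t0), df=n - 1)

# Same test using SciPy's built-in routine
result = stat.ttest_1samp(sample, popmean=mu0)

print("manual:", t0, p_manual)
print("scipy: ", result.statistic, result.pvalue)
```

Both lines should report the same statistic and p-value, up to floating-point error.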

## 2.	If mu = 12.00, what is the probability that we fail to reject mu = 12.012, given that the values are the average volume of a sample of 100 items analyzed? 

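Since we assume n=100, we can use the z statistic and the formula derived in class for $\beta$, the probability of failing to reject the hypothesized mean when the true mean differs from it by $\delta = 0.012$: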

$\beta = P (Z \leq Z_{\alpha/2}-\frac{\delta \sqrt{n}}{\sigma})-P (Z \leq -Z_{\alpha/2}-\frac{\delta \sqrt{n}}{\sigma})$

Here $Z \sim N(0,1)$, and the sample standard deviation $s$ is used as an estimate of $\sigma$.


In [10]:
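mu0 = 12
mu1 = 12.012
delta = mu1-mu0  # difference between the two means
alpha = 0.05
n = 100
s = net_contents.std()  # sample standard deviation, used as an estimate of sigma
z_alpha2 = stat.norm.ppf(1-alpha/2)  # upper alpha/2 critical value of N(0,1)
# beta = P(Z <= z - delta*sqrt(n)/sigma) - P(Z <= -z - delta*sqrt(n)/sigma)
beta = stat.norm.cdf(z_alpha2- (delta*np.sqrt(n)/s),0,1) \
-stat.norm.cdf(-z_alpha2- (delta*np.sqrt(n)/s),0,1)
print("beta = ",beta)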

beta =  0.00061260813579107
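
To see how $\beta$ depends on the sample size, the same formula can be wrapped in a small function. This is a minimal sketch: the function name `type_ii_error` and the value `sigma_assumed = 0.025` are hypothetical and chosen only for illustration; with the tutorial data, `net_contents.std()` would take the place of `sigma_assumed`.

```python
import numpy as np
import scipy.stats as stat

def type_ii_error(delta, sigma, n, alpha=0.05):
    """Two-sided z-test: probability of failing to reject H0
    when the true mean differs from the hypothesized mean by delta."""
    z = stat.norm.ppf(1 - alpha / 2)
    shift = delta * np.sqrt(n) / sigma
    return stat.norm.cdf(z - shift) - stat.norm.cdf(-z - shift)

sigma_assumed = 0.025  # hypothetical value, not taken from the data
for n in (10, 25, 50, 100):
    print(n, type_ii_error(0.012, sigma_assumed, n))
```

For a fixed $\delta$ and $\sigma$, $\beta$ shrinks as n grows, and with $\delta = 0$ the function returns $1-\alpha$.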


## 3. What is the average number of clients satisfied in the first 13 days and the standard deviation? 

## 4. What is the average number of clients satisfied in the last 13 days and the standard deviation? 

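First we split this column into two sets: (1) the first 13 days (day 13 included) and (2) the days after day 13. Note that `.loc` slicing is label-based and includes the end label, so `df.loc[:12, ...]` returns rows 0 to 12, i.e. 13 rows: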

In [11]:
set1 = df.loc[:12,'No of Clients Satisfied'] 
set2 = df.loc[13:,'No of Clients Satisfied'] 
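
The inclusive behaviour of `.loc` slicing is easy to get wrong, so here is a small sketch on a toy frame. The DataFrame `toy` is hypothetical (26 rows with a default integer index, assumed to mirror the layout of the tutorial data); it compares label-based `.loc` with position-based `.iloc`.

```python
import pandas as pd

# Hypothetical 26-day frame with a default 0..25 index
toy = pd.DataFrame({"Day": range(1, 27)})

first = toy.loc[:12, "Day"]        # labels 0..12 inclusive -> days 1..13
last = toy.loc[13:, "Day"]         # labels 13..25 -> days 14..26
first_iloc = toy["Day"].iloc[:13]  # positions 0..12 (end excluded) -> same rows

print(first.size, last.size, first.equals(first_iloc))
```

With a default integer index, `.loc[:12]` and `.iloc[:13]` select the same 13 rows.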

In [15]:
n1 = set1.size
n2 = set2.size
x_bar1 = set1.mean()
x_bar2 = set2.mean()
s1 = set1.std()
s2 = set2.std()

In [17]:
print("Average number of clients satisfied in the first 13 days:", x_bar1)
print("STD of clients satisfied in the first 13 days:", s1)
print("Average number of clients satisfied in the last 13 days:", x_bar2)
print("STD of clients satisfied in the last 13 days:", s2)

Average number of clients satisfied in the first 13 days: 68.15384615384616
STD of clients satisfied in the first 13 days: 21.384753735025622
Average number of clients satisfied in the last 13 days: 62.38461538461539
STD of clients satisfied in the last 13 days: 20.0064092294547


## 5.	Is there a change in the mean before and after the 13th day? (Yes/No) What is the p-value?

Now we test $H_0:\mu_1=\mu_2$, $H_1:\mu_1\neq \mu_2$. We use a two-sided two-sample t-test that does not assume equal variances, with approximate degrees of freedom.

In [9]:

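# test statistic for two independent samples with unequal variances
t0 = (x_bar1-x_bar2)/np.sqrt((s1**2/n1)+(s2**2/n2))

# Approximate degrees of freedom. The standard Welch-Satterthwaite formula has
# no trailing "- 2"; subtracting 2 lowers d_f, making the test slightly more conservative.
d_f = (((s1**2/n1)+(s2**2/n2))**2)/((s1**2/n1)**2/(n1-1)\
                                    +(s2**2/n2)**2/(n2-1)) -2 
# two-sided p-value; abs() keeps it valid when t0 is negative
p_value =  2 * (1 - stat.t.cdf(abs(t0),df=d_f))

print("p-value = ", p_value)
print("t_score = ", t0)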

p-value =  0.48500148709783897
t_score =  0.7103236761656716


The p-value is large, so we fail to reject $H_0$: there is no significant evidence that the mean changed after the 13th day.
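
As a cross-check, the same comparison can be sketched from the summary statistics printed earlier using `scipy.stats.ttest_ind_from_stats` with `equal_var=False`. This assumes both groups contain 13 observations. SciPy uses the Welch-Satterthwaite degrees of freedom without the extra `- 2`, so its p-value may differ slightly from the one above, while the t statistic should agree.

```python
import scipy.stats as stat

# Summary statistics copied from the printed output above;
# n1 = n2 = 13 is an assumption based on the "first/last 13 days" wording.
mean1, std1, nobs1 = 68.15384615384616, 21.384753735025622, 13
mean2, std2, nobs2 = 62.38461538461539, 20.0064092294547, 13

# Welch's t-test (unequal variances) from summary statistics
res = stat.ttest_ind_from_stats(mean1, std1, nobs1,
                                mean2, std2, nobs2,
                                equal_var=False)

print("t_score = ", res.statistic)
print("p-value = ", res.pvalue)
```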