# Lab | Inferential statistics - T-test & P-value

### Instructions

1. *One-tailed t-test* - In a packing plant, a machine packs cartons with jars. It is supposed that a new machine will pack faster on average than the machine currently used. To test that hypothesis, the times each machine takes to pack ten cartons are recorded. The results are in seconds in the tables in the file `files_for_lab/machine.txt`.
   Assuming the conditions for a t-test are met, does the data provide sufficient evidence that one machine is better than the other?

2. *Matched Pairs Test* - In this challenge we will compare dependent samples of data describing our Pokemon (file `files_for_lab/pokemon.csv`). Our goal is to see whether there is a significant difference between each Pokemon's defense and attack scores. Our hypothesis is that the defense and attack scores are equal. Compare the two columns to see if there is a statistically significant difference between them and comment on your result.


# Inferential statistics - ANOVA

Note: The following lab is divided into 2 sections which represent activities 3 and 4.

## Part 1

In this activity, we will look at another example. Your task is to understand the problem and write down all the steps to set up ANOVA. After the next lesson, we will ask you to solve this problem using Python. Here are the steps that you would need to work on:
    - Null hypothesis
    - Alternate hypothesis
    - Level of significance
    - Test statistic
    - P-value
    - F table

### Context

Suppose you are working as an analyst in a microprocessor chip manufacturing plant. You have been given the task of analyzing a plasma etching process with respect to changing Power (in Watts) of the plasma beam. Data were collected and provided to you to conduct statistical analysis and check whether changing the power of the plasma beam has any effect on the etching rate of the machine. You will conduct ANOVA and check if there is any difference in the mean etching rate for different levels of power. You can find the data in the `anova_lab_data.xlsx` file in the `files_for_lab` folder.

- State the null hypothesis
- State the alternate hypothesis
- What is the significance level
- What are the degrees of freedom of the model, error terms, and total DoF

Data were collected randomly and provided to you in the table as shown: [link to the image - Data](https://education-team-2020.s3-eu-west-1.amazonaws.com/data-analytics/7.05/7.05-lab_data.png)


## Part 2

- In this section, use Python to conduct ANOVA.
- What conclusions can you draw from the experiment and why?


In [1]:
# import libraries
import pandas as pd
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns


# 1. *One-tailed t-test* 
- In a packing plant, a machine packs cartons with jars. It is supposed that a new machine will pack faster on average than the machine currently used. To test that hypothesis, the times each machine takes to pack ten cartons are recorded. The results are in seconds in the tables in the file `files_for_lab/machine.txt`.
   Assuming the conditions for a t-test are met, does the data provide sufficient evidence that one machine is better than the other?

In [2]:
machine = pd.read_csv('files_for_lab/machine.csv', sep=';')
machine

Unnamed: 0,New machine,Old machine
0,42.1,42.7
1,41.0,43.6
2,41.3,43.8
3,41.8,43.3
4,42.4,42.5
5,42.8,43.5
6,43.2,43.1
7,42.3,41.7
8,41.8,44.0
9,42.7,44.1


In [3]:
machine.columns = ['new_machine', 'old_machine']
machine

Unnamed: 0,new_machine,old_machine
0,42.1,42.7
1,41.0,43.6
2,41.3,43.8
3,41.8,43.3
4,42.4,42.5
5,42.8,43.5
6,43.2,43.1
7,42.3,41.7
8,41.8,44.0
9,42.7,44.1


In [4]:
# the claim: the new machine packs faster on average than the current machine (time_mean_new < time_mean_old)
# H0: time_mean_new >= time_mean_old (the new machine is not faster)
# Ha: time_mean_new < time_mean_old (the new machine is faster)
# one-sided test (left tail), confidence level 95%, a = 0.05
from scipy.stats import ttest_ind

t_stat, pvalue = ttest_ind(machine['new_machine'], machine['old_machine'], alternative = 'less')
print('Statistic =  {:.3f}, pvalue = {:.3f}'.format(t_stat, pvalue))

# manual calculation (pooled two-sample t-test, equal variances assumed)
n_new, n_old = len(machine['new_machine']), len(machine['old_machine'])
dof = n_new + n_old - 2
new_mean, old_mean = machine['new_machine'].mean(), machine['old_machine'].mean()
new_std, old_std = machine['new_machine'].std(ddof=1), machine['old_machine'].std(ddof=1)
# pooled standard deviation
sp = np.sqrt( ( (n_new - 1) * new_std**2 + (n_old - 1) * old_std**2 ) / dof )
r = np.sqrt( (1/n_new) + (1/n_old) )
t = ( new_mean - old_mean )/ (sp * r)

print("The mean of new machine is {:.2f}".format(new_mean))
print("The mean of old machine is {:.2f}".format(old_mean))
print("The standard deviation of new machine is {:.2f}".format(new_std))
print("The standard deviation of old machine is {:.2f}".format(old_std))
print("The ratio of the sample standard deviations is {:.2f} ".format(new_std/old_std))
print("The t statistic is: {:.3f}".format(t))
# left-tail critical value and left-tail p-value, matching Ha: time_mean_new < time_mean_old
print('t critical: ', st.t.ppf(0.05, dof).round(3))
print('pvalue: ', st.t.cdf(t, dof).round(3))

# conclusion:
# t = -3.397 < tc = -1.734 : reject H0
# pvalue = 0.002 < a = 0.05 : reject H0 in favor of Ha: the data support the claim that the new machine packs faster on average (time_mean_new < time_mean_old)



Statistic =  -3.397, pvalue = 0.002
The mean of new machine is 42.14
The mean of old machine is 43.23
The standard deviation of new machine is 0.68
The standard deviation of old machine is 0.75
The ratio of the sample standard deviations is 0.91 
The t statistic is: -3.397
t critical:  -1.734
pvalue:  0.002
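
The manual calculation pools the two sample variances, which assumes both machines have the same variance. As a cross-check, here is a minimal sketch using Welch's t-test, which drops that assumption. The data are retyped from the table printed above, and `alternative="less"` encodes the claim that the new machine's mean packing time is lower. Because both samples have ten observations, the Welch statistic should match the pooled one and only the degrees of freedom differ.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Packing times in seconds, copied from the table printed above
machine = pd.DataFrame({
    "new_machine": [42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7],
    "old_machine": [42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1],
})

# Welch's t-test: equal_var=False drops the pooled-variance assumption.
# alternative="less" tests Ha: mean(new_machine) < mean(old_machine).
welch = ttest_ind(machine["new_machine"], machine["old_machine"],
                  equal_var=False, alternative="less")

print("Welch statistic = {:.3f}, pvalue = {:.3f}".format(welch.statistic, welch.pvalue))
```

If the Welch p-value also falls below 0.05, the conclusion does not hinge on the equal-variance assumption.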


# 2. Matched Pairs Test 
- In this challenge we will compare dependent samples of data describing our Pokemon (file `files_for_lab/pokemon.csv`). Our goal is to see whether there is a significant difference between each Pokemon's defense and attack scores. Our hypothesis is that the defense and attack scores are equal. Compare the two columns to see if there is a statistically significant difference between them and comment on your result.

In [6]:
pokemon = pd.read_csv('files_for_lab/pokemon.csv')
pokemon

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,700,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True


In [9]:
# H0: the mean difference between attack and defense scores is zero (the scores are equal on average)
# Ha: the mean difference is NOT zero (two-sided)
# confidence level 95%, significance level a = 0.05

# calculate attack vs defense differences
pokemon['attack_defense_diff'] = pokemon['Attack'] - pokemon['Defense']

# calculate mean and std of differences
n = pokemon.shape[0]
diff_mean, diff_std = pokemon['attack_defense_diff'].mean(), pokemon['attack_defense_diff'].std(ddof=1)

# compute t statistic
t = diff_mean / ( diff_std / np.sqrt(n) )
print("The mean of differences between attack and defense is: {:.3f}".format(diff_mean))
print("The standard deviation of differences between attack and defense is: {:.3f}".format(diff_std))
print("Our t statistics is: {:.3f}".format(t))

# compute the lower t critical value; by symmetry the upper one is -tc
tc = st.t.ppf((0.05/2),df= n - 1)
print('The t critical value is: {:.3f}'.format(tc))

# compute the two-sided pvalue (both tails, because Ha is two-sided)
pvalue = 2 * (1 - st.t.cdf(abs(t), df= n - 1))
print('The p value is: {:.3f}'.format(pvalue))


# cross-check with the built-in paired t-test
st.ttest_rel(pokemon['Attack'],pokemon['Defense'])


# tc_lower = -1.963, tc_upper = 1.963, t statistic = 4.326 > tc_upper = 1.963: reject H0 (the defense and attack scores are equal) in favor of Ha (the defense and attack scores are NOT equal, two-sided)
# the same conclusion follows from pvalue = 0.000 < a = 0.05 : reject H0
# the t statistic and pvalue are consistent with those returned by st.ttest_rel
# on average, attack scores are about 5 points higher than defense scores
# 

The mean of differences between attack and defense is: 5.159
The standard deviation of differences between attack and defense is: 33.732
Our t statistics is: 4.326
The t critical value is: -1.963
The p value is: 0.000


TtestResult(statistic=4.325566393330478, pvalue=1.7140303479358558e-05, df=799)
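
A paired t-test is a one-sample t-test on the per-row differences, so the result can be reproduced from the printed summary values alone. The sketch below assumes `n = 800` (inferred from `df=799` in the `TtestResult`) and uses the rounded mean and standard deviation from the output above, so the last digits may differ slightly from `st.ttest_rel`. It also adds a 95% confidence interval for the mean difference, which is not part of the original solution.

```python
import numpy as np
import scipy.stats as st

# Summary values copied from the printed output above (rounded)
n = 800            # df=799 in the TtestResult implies 800 paired rows
diff_mean = 5.159  # mean of Attack - Defense
diff_std = 33.732  # sample standard deviation of the differences

se = diff_std / np.sqrt(n)   # standard error of the mean difference
t_stat = diff_mean / se      # same statistic as the paired t-test
p_two_sided = 2 * st.t.sf(abs(t_stat), df=n - 1)  # both tails

# 95% confidence interval for the mean difference
t_crit = st.t.ppf(1 - 0.05 / 2, df=n - 1)
ci_low, ci_high = diff_mean - t_crit * se, diff_mean + t_crit * se

print("t = {:.3f}, two-sided p = {:.2e}".format(t_stat, p_two_sided))
print("95% CI for the mean difference: ({:.2f}, {:.2f})".format(ci_low, ci_high))
```

An interval that excludes zero tells the same story as rejecting H0, and it also shows how large the attack/defense gap plausibly is.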

# Inferential statistics - ANOVA

Note: The following lab is divided into 2 sections which represent activities 3 and 4.

## Part 1

In this activity, we will look at another example. Your task is to understand the problem and write down all the steps to set up ANOVA. After the next lesson, we will ask you to solve this problem using Python. Here are the steps that you would need to work on:
    - Null hypothesis
    - Alternate hypothesis
    - Level of significance
    - Test statistic
    - P-value
    - F table

### Context

Suppose you are working as an analyst in a microprocessor chip manufacturing plant. You have been given the task of analyzing a plasma etching process with respect to changing Power (in Watts) of the plasma beam. Data were collected and provided to you to conduct statistical analysis and check whether changing the power of the plasma beam has any effect on the etching rate of the machine. You will conduct ANOVA and check if there is any difference in the mean etching rate for different levels of power. You can find the data in the `anova_lab_data.xlsx` file in the `files_for_lab` folder.

- State the null hypothesis
- State the alternate hypothesis
- What is the significance level
- What are the degrees of freedom of the model, error terms, and total DoF

In [26]:
plasma = pd.read_excel('files_for_lab/anova_lab_data.xlsx')
display(plasma)
plasma_power = plasma.groupby('Power ').agg({'Etching Rate': ['mean', 'count']}).reset_index()
plasma_power
#plasma.columns

Unnamed: 0,Power,Etching Rate
0,160 W,5.43
1,180 W,6.24
2,200 W,8.79
3,160 W,5.71
4,180 W,6.71
5,200 W,9.2
6,160 W,6.22
7,180 W,5.98
8,200 W,7.9
9,160 W,6.01


Unnamed: 0,Power,Etching Rate (mean),Etching Rate (count)
0,160 W,5.792,5
1,180 W,6.238,5
2,200 W,8.318,5


In [36]:
plasma_power['Etching Rate']['mean']

0    5.792
1    6.238
2    8.318
Name: mean, dtype: float64

# Answer:
- State the null hypothesis H0: changing the power of the plasma beam has NO effect on the etching rate by the machine (the mean etching rate is the same at 160 W, 180 W and 200 W).
- State the alternate hypothesis Ha: changing the power of the plasma beam has an effect on the etching rate by the machine (at least one group mean differs).
- What is the significance level: a = 0.05 (confidence level 95%)
- Test statistic: the F statistic of a one-way ANOVA
- Degrees of freedom: model d1 = 3-1 = 2, error d2 = 15-3 = 12
- error terms:
    * S2T: variance of the group means against the global mean (between groups)
    * S2E: variance of the individual values of each group against the group mean (within groups)
    * F = S2T/S2E
- total dof = 15-1 = 14 
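
The degrees of freedom and the rejection threshold listed above can be sketched in a few lines. This assumes 3 power levels with 5 measurements each, as shown in the group counts; the variable names are illustrative and not part of the original solution.

```python
import scipy.stats as st

k = 3         # number of power levels: 160 W, 180 W, 200 W
n_total = 15  # 5 measurements per level, as in the group counts above
alpha = 0.05  # significance level

df_model = k - 1        # between-groups degrees of freedom (d1)
df_error = n_total - k  # within-groups degrees of freedom (d2)
df_total = n_total - 1  # total degrees of freedom

# H0 is rejected when the observed F exceeds this critical value
f_crit = st.f.ppf(1 - alpha, dfn=df_model, dfd=df_error)

print("dof (model, error, total):", df_model, df_error, df_total)
print("F critical: {:.2f}".format(f_crit))
```

Note that the model and error degrees of freedom add up to the total: 2 + 12 = 14.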


## Part 2

- In this section, use Python to conduct ANOVA.
- What conclusions can you draw from the experiment and why?


In [56]:
# variance calculation of groups against global values
S2t = 0
for i, p in enumerate(plasma_power['Power '].unique()):
    ng = plasma_power[plasma_power['Power ']==p]['Etching Rate'].loc[i,'count']
    S2t  += ng * ( ( plasma_power[plasma_power['Power ']==p]['Etching Rate'].loc[i,'mean'] - plasma['Etching Rate'].mean() ) ** 2)
S2t /= ( plasma_power['Power '].nunique() - 1 ) # Number of  K = groups of power

print("The value of S2t is {:.2f}".format(S2t))


# variance calculation of S2E (variance of the individual values of each group against the mean of that group)
S2E = 0
for i,p in enumerate(plasma_power['Power '].unique()):
    ng = plasma_power[plasma_power['Power ']==p]['Etching Rate'].loc[i,'count']
    S2E += ( ng - 1 ) * np.var(plasma[plasma['Power ']==p]['Etching Rate'],ddof=1) # sample variance of the group
S2E /= ( len(plasma) - plasma_power['Power '].nunique() )

print()
print("The value of S2E is {:.2f}".format(S2E))

# compute F
F = S2t / S2E
print("The value of F is {:.2f}".format(F))

# degrees of freedom
d1 = plasma['Power '].nunique() - 1
d2 = len(plasma) - plasma['Power '].nunique()

print("Number of degrees of freedom d1: ",d1)
print("Number of degrees of freedom d2: ",d2)

# probability to get a value above F (upper-tail p-value)
pval = 1 - st.f.cdf(F,dfn=d1, dfd=d2)
print('p value of F: {:.3f}'.format(pval))

Fc = st.f.ppf(1-0.05,dfn=d1, dfd=d2)

print("The critical value which corresponds to an area of 0.05 is: {:.2f}".format(Fc))


print(st.f_oneway(plasma[plasma['Power '] == '160 W']['Etching Rate'],
                  plasma[plasma['Power '] == '180 W']['Etching Rate'],
                  plasma[plasma['Power '] == '200 W']['Etching Rate']))

The value of S2t is 9.09

The value of S2E is 0.25
The value of F is 36.88
Number of degrees of freedom d1:  2
Number of degrees of freedom d2:  12
p value of F: 0.000
The critical value which corresponds to an area of 0.05 is: 3.89
F_onewayResult(statistic=36.87895470100505, pvalue=7.506584272358903e-06)


# conclusion
F = 36.88 > Fc(a=0.05) = 3.89
Pval = 0.000 < a = 0.05
--> reject H0 in favor of Ha: changing the power of the plasma beam has an effect on the etching rate by the machine, at significance level a = 0.05 (confidence level 95%)
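
The F-test says that the group means differ, but not how much of the variation the power setting explains. A minimal sketch of one common effect-size measure, eta squared, is given below. It is an addition to the original solution and uses only the F statistic and degrees of freedom printed above. Since SS_between = F * MSE * d1 and SS_within = MSE * d2, the ratio SS_between / SS_total simplifies to F*d1 / (F*d1 + d2).

```python
import scipy.stats as st

# Values copied from the f_oneway output and the degrees of freedom above
F = 36.879
d1, d2 = 2, 12

# eta squared = SS_between / SS_total, rewritten in terms of F and the dof
eta_sq = (F * d1) / (F * d1 + d2)

# the survival function returns the upper-tail p-value directly,
# avoiding the precision loss of 1 - cdf for very small p-values
pval = st.f.sf(F, dfn=d1, dfd=d2)

print("eta squared = {:.2f}".format(eta_sq))
print("p value = {:.2e}".format(pval))
```

Here eta squared is the share of the total variation in etching rate accounted for by the power level; a value close to 1 indicates a strong effect.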
