# Lab | Inferential statistics - T-test & P-value


In [2]:
#import libraries
import pandas as pd
import numpy as np
import scipy.stats as st 
from sklearn import feature_selection
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# 1. *One tailed t-test* 
- In a packing plant, a machine packs cartons with jars.   
- It is supposed that a new machine will pack faster on the average than the machine currently used.  
- To test that hypothesis, the times it takes each machine to pack ten cartons are recorded.  
- The results, in seconds, are shown in the tables in the file `files_for_lab/machine.txt`.
- Assuming the conditions for a t-test are met, does the data provide sufficient evidence that one machine is better than the other?

In [9]:
# Read the machine timing samples into a dataframe.
# Alternative without the file (values match the table printed below):
# New_Machine = [42.1,41.0,41.3,41.8,42.4,42.8,43.2,42.3,41.8,42.7]
# Old_Machine = [42.7,43.6,43.8,43.3,42.5,43.5,43.1,41.7,44.0,44.1]
# machine_test = pd.DataFrame({'New_machine': New_Machine, 'Old_machine': Old_Machine})
# machine_test
# A raw-string regex separator avoids the invalid-escape and parser-engine warnings raised by sep='\s'.
machine_test = pd.read_csv('./files_for_lab/machine2.txt', sep=r'\s+')
machine_test



Unnamed: 0,New_machine,Old_machine
0,42.1,42.7
1,41.0,43.6
2,41.3,43.8
3,41.8,43.3
4,42.4,42.5
5,42.8,43.5
6,43.2,43.1
7,42.3,41.7
8,41.8,44.0
9,42.7,44.1


In [35]:
# 1. Conduct a t-test to determine whether the new machine packs faster than the old machine.
# H_0: the mean packing time of the new machine equals that of the old machine.
# H_1: the mean packing time of the new machine is lower than that of the old machine.
# Use a small alpha, because a wrong decision could cost the company productive hours.
alpha = 0.03
# Each row pairs one new-machine time with one old-machine time, so st.ttest_rel() is used here.
# Note: the pairing is by row order only; if the samples are independent, st.ttest_ind() is the usual choice.
paired_result = st.ttest_rel(machine_test['New_machine'], machine_test['Old_machine'])

# We only care whether the new machine's mean time is lower, so this is a one-tailed test.
# Halving the two-tailed p-value is valid only when the t statistic has the sign predicted by H_1
# (negative here, because the new machine's sample mean is lower).
one_tailed_p = paired_result[1]/2
print('One-tail P value = ', one_tailed_p)

# The one-tailed p-value is below alpha, so H_0 can be rejected.
old_machine_mean = np.mean(machine_test['Old_machine'])
new_machine_mean = np.mean(machine_test['New_machine'])

# Alternative: one-sample test of the new machine against the old machine's mean, treated as a fixed value.
one_tailed_p_1samp = st.ttest_1samp(machine_test['New_machine'], np.round(old_machine_mean,2))[1]/2
print('One-tail P value 1samp = ', one_tailed_p_1samp)


One-tail P value =  0.006770167825816245
One-tail P value 1samp =  0.0003483188038379669
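
The two machines packed different cartons, so the samples can also be treated as independent rather than paired. As a hedged sketch of that alternative (not the method used in the cell above), Welch's t-test with `alternative='less'` asks directly whether the new machine's mean time is lower. It assumes SciPy 1.6 or later for the `alternative` parameter, and the values are copied from the table printed above.

```python
import numpy as np
import scipy.stats as st

# Packing times in seconds, copied from the machine_test table above.
new_machine = np.array([42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7])
old_machine = np.array([42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1])

alpha = 0.03  # same significance level as in the cell above

# Welch's t-test: independent samples, no equal-variance assumption.
# alternative='less' tests H_1: mean(new_machine) < mean(old_machine).
welch = st.ttest_ind(new_machine, old_machine, equal_var=False, alternative='less')

print('t statistic =', welch.statistic)
print('One-tail P value (Welch) =', welch.pvalue)
print('Reject H_0:', welch.pvalue < alpha)
```

Passing `alternative` lets SciPy compute the one-tailed p-value itself, which avoids halving a two-tailed value by hand and checking the sign of the statistic separately.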


# 2. *Matched Pairs Test*   
- In this challenge we will compare dependent samples of data describing our Pokemon (file `files_for_lab/pokemon.csv`).  
- Our goal is to see whether there is a significant difference between each Pokemon's defense and attack scores.  
- Our hypothesis is that the defense and attack scores are equal.   
- Compare the two columns to see if there is a statistically significant difference between them and comment your result.

In [17]:
# 2. Matched pairs test
# Test whether there is a significant difference between each Pokemon's Attack and Defense scores.
# H_0: the mean Attack and Defense scores are equal.
# H_1: the mean Attack and Defense scores differ (two-tailed).
# Significance level: alpha = 0.05.
pokemane = pd.read_csv('./files_for_lab/pokemon.csv')
pokemane_atk_def = pokemane[['Attack','Defense']]
# Draw a random sample of a quarter of the rows; no random_state is set, so the sample changes on each run.
pokemane_atk_def_sample = pokemane_atk_def.sample(int(len(pokemane)/4))
pokemane_atk_def_sample


Unnamed: 0,Attack,Defense
153,105,65
546,70,120
718,61,65
3,100,123
543,160,110
...,...,...
203,45,50
298,100,60
585,135,130
770,65,65


In [None]:
# Conduct a matched pairs test on the Attack and Defense features
matched_test = st.ttest_rel(pokemane_atk_def_sample['Attack'], pokemane_atk_def_sample['Defense'])
matched_test
# The p-value is well below our alpha of 0.05, so we can reject H_0.
# This suggests that the mean Attack and Defense scores are not equal.
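
A matched pairs t-test is equivalent to a one-sample t-test on the per-row differences against a mean of zero. The sketch below illustrates this using only the nine (Attack, Defense) pairs visible in the printed sample above; it is an illustration of the mechanics, not a rerun of the test on the full sample.

```python
import numpy as np
import scipy.stats as st

# The nine (Attack, Defense) pairs visible in the printed sample above.
# Illustrative subset only: the real sample has many more rows.
attack = np.array([105, 70, 61, 100, 160, 45, 100, 135, 65])
defense = np.array([65, 120, 65, 123, 110, 50, 60, 130, 65])

# Matched pairs test on the two columns.
paired = st.ttest_rel(attack, defense)

# Equivalent formulation: one-sample test of the differences against 0.
differences = attack - defense
one_sample = st.ttest_1samp(differences, 0)

print('ttest_rel   :', paired.statistic, paired.pvalue)
print('ttest_1samp :', one_sample.statistic, one_sample.pvalue)
```

Both calls print the same statistic and p-value, which shows why the test depends only on the within-row differences and therefore requires the two columns to be aligned row by row.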

# OPTIONAL PART | Inferential statistics - ANOVA

Note: The following lab is divided into 2 sections, which represent activities 3 and 4.

## Part 1

In this activity, we will look at another example.  
Your task is to understand the problem and write down all the steps to set up ANOVA.  
After the next lesson, we will ask you to solve this problem using Python.   
Here are the steps that you would need to work on:
- Null hypothesis
- Alternate hypothesis
- Level of significance
- Test statistic
- P-value
- F table

### Context

Suppose you are working as an analyst in a microprocessor chip manufacturing plant.  
You have been given the task of analyzing a plasma etching process with respect to changing Power (in Watts) of the plasma beam.   
Data was collected and provided to you to conduct statistical analysis and check if changing the power of the plasma beam   
has any effect on the etching rate by the machine.   
You will conduct ANOVA and check if there is any difference in the mean etching rate for different levels of power.   
You can find the data in the `anova_lab_data.xlsx` file in the `files_for_lab` folder.  

- State the null hypothesis
- State the alternate hypothesis
- What is the significance level?
- What are the degrees of freedom of the model, the error term, and the total?

Data was collected randomly and provided to you in the table as shown:  
[link to the image - Data](https://education-team-2020.s3-eu-west-1.amazonaws.com/data-analytics/7.05/7.05-lab_data.png)

In [22]:
# Create a dataframe from the anova_lab_data.xlsx file
# pd.read_excel() reads .xlsx files through openpyxl; importing it here fails early if it is not installed.
import openpyxl

anova_data = pd.read_excel("./files_for_lab/anova_lab_data.xlsx")
anova_data.head(10)

Unnamed: 0,Power,Etching Rate
0,160 W,5.43
1,180 W,6.24
2,200 W,8.79
3,160 W,5.71
4,180 W,6.71
5,200 W,9.2
6,160 W,6.22
7,180 W,5.98
8,200 W,7.9
9,160 W,6.01


## STATE HYPOTHESES:
**Null Hypothesis:**  
H_0 = The mean etching rate is the same regardless of plasma beam power.

**Alternate Hypothesis:**  
H_1 = The mean rate of etching varies by plasma power output.



In [58]:
# Test statistics
anova_data['Recorded Rate'] = anova_data.groupby('Power ').cumcount()
anova_data_pivot = anova_data.pivot(index = 'Recorded Rate', columns = 'Power ', values = 'Etching Rate')
anova_data_pivot

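# Degrees of freedom for one-way ANOVA
k = anova_data_pivot.shape[1]                  # number of groups (power levels)
n_total = int(anova_data_pivot.count().sum())  # total number of observations
DoF_model = k - 1        # between groups
DoF_error = n_total - k  # within groups (error term)
DoF_total = n_total - 1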


Recorded Rate,160 W,180 W,200 W
0,5.43,6.24,8.79
1,5.71,6.71,9.2
2,6.22,5.98,7.9
3,6.01,5.66,8.15
4,5.59,6.6,7.55


## Part 2

- In this section, use Python to conduct ANOVA.
- What conclusions can you draw from the experiment and why?

In [59]:
anova_test = st.f_oneway(anova_data_pivot['160 W'], anova_data_pivot['180 W'], anova_data_pivot['200 W'])
print('ANOVA p-value =', anova_test[1])


ANOVA p-value = 7.506584272358903e-06
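
To connect the result with the steps listed in Part 1 (test statistic, p-value, F table), the F statistic can also be computed by hand from the between-group and within-group sums of squares. The following is a minimal sketch using the values from the pivot table above; the alpha of 0.02 is the significance level assumed in this lab, and `st.f.ppf` stands in for an F table lookup.

```python
import numpy as np
import scipy.stats as st

# Etching rates per power level, copied from the pivot table above.
groups = {
    '160 W': np.array([5.43, 5.71, 6.22, 6.01, 5.59]),
    '180 W': np.array([6.24, 6.71, 5.98, 5.66, 6.60]),
    '200 W': np.array([8.79, 9.20, 7.90, 8.15, 7.55]),
}
samples = list(groups.values())
all_values = np.concatenate(samples)

k = len(samples)            # number of groups
n_total = all_values.size   # total number of observations
grand_mean = all_values.mean()

# Sums of squares between groups (model) and within groups (error).
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in samples)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in samples)

df_between = k - 1          # model degrees of freedom
df_within = n_total - k     # error degrees of freedom

# F statistic = mean square between / mean square within.
f_stat = (ss_between / df_between) / (ss_within / df_within)
p_value = st.f.sf(f_stat, df_between, df_within)

alpha = 0.02  # assumed significance level for this lab
f_crit = st.f.ppf(1 - alpha, df_between, df_within)  # F table equivalent

print('F statistic =', f_stat)
print('critical F  =', f_crit)
print('p-value     =', p_value)
print('Reject H_0:', f_stat > f_crit)
```

Comparing `f_stat` with `f_crit` and comparing `p_value` with `alpha` are two views of the same decision, so they always agree.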


## P-VALUE VERSUS SIGNIFICANCE LEVEL

The p-value (about 7.5e-06) is far below our chosen significance level of alpha = 0.02, so we reject H_0: the mean etching rate differs for at least one power level.
