# Lab | Inferential statistics - T-test & P-value

## Instructions

1. One tailed t-test - In a packing plant, a machine packs cartons with jars. It is supposed that a new machine will pack faster on average than the machine currently used. To test that hypothesis, the time it takes each machine to pack ten cartons is recorded. The results, in seconds, are shown in the tables in the file files_for_lab/machine.txt. Assuming the conditions for a t-test are met, does the data provide sufficient evidence that one machine is better than the other?

2. Matched Pairs Test - In this challenge we will compare dependent samples of data describing our Pokemon (file files_for_lab/pokemon.csv). Our goal is to see whether there is a significant difference between each Pokemon's defense and attack scores. Our hypothesis is that the defense and attack scores are equal. Compare the two columns to see if there is a statistically significant difference between them and comment on your result.

## Inferential statistics - ANOVA
Note: The following lab is divided into 2 sections, which correspond to activities 3 and 4.

## Part 1
In this activity, we will look at another example. Your task is to understand the problem and write down all the steps to set up ANOVA. After the next lesson, we will ask you to solve this problem using Python. Here are the steps that you would need to work on:

- Null hypothesis
- Alternate hypothesis
- Level of significance
- Test statistic
- P-value
- F table

### Context
Suppose you are working as an analyst in a microprocessor chip manufacturing plant. You have been given the task of analyzing a plasma etching process with respect to changing the power (in watts) of the plasma beam. Data was collected and provided to you so that you can check statistically whether changing the power of the plasma beam has any effect on the machine's etching rate. You will conduct ANOVA and check if there is any difference in the mean etching rate for different levels of power. You can find the data in the anova_lab_data.xlsx file in the files_for_lab folder.

- State the null hypothesis
- State the alternate hypothesis
- What is the significance level?
- What are the degrees of freedom of the model, the error term, and the total?
- Data was collected randomly and provided to you in the table as shown: link to the image - Data

## Part 2
- In this section, use Python to conduct ANOVA.
- What conclusions can you draw from the experiment and why?

## Imports

In [1]:
import numpy as np
import pandas as pd

import scipy.stats as st

## One Tailed T-Test


H0: mean packing time of the new machine = mean packing time of the old machine

H1: mean packing time of the new machine < mean packing time of the old machine (the new machine is faster)

In [2]:
machine = pd.read_csv("files_for_lab/machine.txt", encoding = "utf-16", sep = "\t", names = ["New", "Old"], header=0)
machine

Unnamed: 0,New,Old
0,42.1,42.7
1,41.0,43.6
2,41.3,43.8
3,41.8,43.3
4,42.4,42.5
5,42.8,43.5
6,43.2,43.1
7,42.3,41.7
8,41.8,44.0
9,42.7,44.1


In [3]:
print("New machine mean:", round(machine["New"].mean(),2))
print("Old machine mean:", round(machine["Old"].mean(),2))

New machine mean: 42.14
Old machine mean: 43.23


In [4]:
# Welch's t-test (unequal variances); the p-value returned here is two-sided
t_statistic, p_value = st.ttest_ind(machine['New'], machine['Old'], equal_var=False)
print("T-statistic is: {:.2f} and p-value is: {:.5f}".format(t_statistic, p_value))

T-statistic is: -3.40 and p-value is: 0.00324


In [12]:
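# ttest_ind returned a two-sided p-value; for the one-tailed test (H1: new < old)
# halve it when the t-statistic is negative
one_tailed_p = p_value / 2 if t_statistic < 0 else 1 - p_value / 2
if one_tailed_p < 0.05:
    print("As p_value is < 0.05 we reject H0: the new machine is faster")
else:
    print("As p_value is >= 0.05 we fail to reject H0: no evidence that the new machine is faster")

As p_value is < 0.05 we reject H0: the new machine is faster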
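The t-statistic reported above can be reproduced by hand, which makes it clearer what `equal_var=False` (Welch's test) does. The following is a minimal sketch using only the standard library; the helper name `welch_t` is an illustrative assumption, and the data are the ten packing times per machine copied from the table above.

```python
from math import sqrt
from statistics import mean, variance

# Packing times in seconds, copied from the machine table above
new = [42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7]
old = [42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1]


def welch_t(a, b):
    """Return Welch's t-statistic and approximate degrees of freedom."""
    # Squared standard error contributed by each sample (sample variance / n)
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


t_stat, dof = welch_t(new, old)
print("t = {:.2f}, df = {:.1f}".format(t_stat, dof))
```

A negative t-statistic means the new machine's mean packing time is lower. With SciPy 1.6 or later, passing `alternative="less"` to `st.ttest_ind` should return the one-tailed p-value directly instead of the two-sided one.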


## Matched Pairs Test

In [7]:
pokemon = pd.read_csv("files_for_lab/pokemon.csv")
print(pokemon[['Defense','Attack']].describe())
pokemon.head()

          Defense      Attack
count  800.000000  800.000000
mean    73.842500   79.001250
std     31.183501   32.457366
min      5.000000    5.000000
25%     50.000000   55.000000
50%     70.000000   75.000000
75%     90.000000  100.000000
max    230.000000  190.000000


Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


In [8]:
pokemon["Attack"].mean(), pokemon["Defense"].mean()

(79.00125, 73.8425)

In [14]:
# Paired t-test: H0 is that the mean difference between Attack and Defense is zero
statistical, pvalue = st.ttest_rel(pokemon["Attack"], pokemon["Defense"])
print("T-statistic is: {:.2f} and p-value is: {:.5f}".format(statistical, pvalue))

if pvalue < 0.05:
    print("p_value < 0.05 | Features are different from each other, so we reject H0")
else:
    print("p_value >= 0.05 | No significant difference between the features, so we fail to reject H0")

T-statistic is: 4.33 and p-value is: 0.00002
p_value < 0.05 | Features are different from each other, so we reject H0
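A matched pairs test is a one-sample t-test on the per-row differences. The sketch below illustrates that with the standard library only; the helper name `paired_t` is an assumption for illustration, and the data are just the five rows shown by `pokemon.head()` above, so the result is not expected to match the full-data statistic of 4.33.

```python
from math import sqrt
from statistics import mean, stdev

# Attack and Defense for the five rows shown by pokemon.head() above
attack = [49, 62, 82, 100, 52]
defense = [49, 63, 83, 123, 43]


def paired_t(x, y):
    """Return the paired t-statistic for two equal-length samples."""
    if len(x) != len(y):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(x, y)]
    # One-sample t-test of the differences against a mean of zero
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


t_paired = paired_t(attack, defense)
print("t = {:.3f} with {} degrees of freedom".format(t_paired, len(attack) - 1))
```

Because each Pokemon contributes both scores, pairing removes the between-Pokemon variation that an independent two-sample test would leave in the error term.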


## ANOVA
### Part 1

In [20]:
etching = pd.read_excel('files_for_lab/anova_lab_data.xlsx', names = ["Power", "Rate"])
etching

Unnamed: 0,Power,Rate
0,160 W,5.43
1,180 W,6.24
2,200 W,8.79
3,160 W,5.71
4,180 W,6.71
5,200 W,9.2
6,160 W,6.22
7,180 W,5.98
8,200 W,7.9
9,160 W,6.01


**Null Hypothesis (H0)**: The mean etching rate is the same for all 3 power levels (160 W, 180 W, 200 W)

**Alternative Hypothesis (H1)**: At least one power level has a different mean etching rate

**Significance Level**: alpha = 5%

Degrees of freedom:

DoF1 (model): Nº of groups (k) - 1 -> DoF1 = 3 - 1 = 2

DoF2 (error): Nº of observations (n) - Nº of groups (k) -> DoF2 = n - k = n - 3

Total DoF: n - 1
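The degrees of freedom for a one-way ANOVA follow directly from the number of groups and observations. Here is a small sketch; the function name `anova_degrees_of_freedom` is hypothetical, and `n_observations=10` is a placeholder equal to the number of rows displayed above, which may not be the size of the full data set.

```python
def anova_degrees_of_freedom(n_observations, n_groups):
    """Return (model, error, total) degrees of freedom for one-way ANOVA."""
    if n_groups < 2 or n_observations <= n_groups:
        raise ValueError("need at least 2 groups and more observations than groups")
    df_model = n_groups - 1
    df_error = n_observations - n_groups
    df_total = n_observations - 1
    return df_model, df_error, df_total


# Three power levels (160 W, 180 W, 200 W); n = 10 is a placeholder equal to the
# number of rows displayed above, not necessarily the size of the full data set
df_model, df_error, df_total = anova_degrees_of_freedom(n_observations=10, n_groups=3)
print(df_model, df_error, df_total)
```

The model and error degrees of freedom always add up to the total, which is a quick sanity check before looking up a critical value in an F table.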

### Part 2

In [26]:
etching["Power"].unique()

array(['160 W', '180 W', '200 W'], dtype=object)

In [30]:
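# One-way ANOVA across the three power levels
groups = [etching.loc[etching['Power'] == power, 'Rate'] for power in ['160 W', '180 W', '200 W']]
f_statistic, pvalue = st.f_oneway(*groups)
# Report the p-value of this test; the F-statistic is stored in f_statistic
print("p-value is: {:.5f}".format(pvalue))

p-value is: 0.00001

Since the p-value is below the 5% significance level, we reject H0: the mean etching rate differs for at least one power level, so changing the power of the plasma beam does affect the etching rate.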
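The F-statistic behind a one-way ANOVA is the ratio of the between-group mean square to the within-group mean square. The following is a minimal standard-library sketch; the helper name `one_way_f` is an illustrative assumption, and the data are only the ten rows displayed in the table above, assumed to be a subset of the spreadsheet, so the value it prints may differ from the full-data result.

```python
from statistics import mean

# Only the ten rows displayed in the table above; assumed to be a subset of the
# spreadsheet, so the F-statistic here may differ from the full-data result
rates = {
    "160 W": [5.43, 5.71, 6.22, 6.01],
    "180 W": [6.24, 6.71, 5.98],
    "200 W": [8.79, 9.20, 7.90],
}


def one_way_f(groups):
    """Return the one-way ANOVA F-statistic for a list of samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Between-group sum of squares: group size times squared distance to grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: squared distance of each value to its group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


f_stat = one_way_f(list(rates.values()))
print("F = {:.2f}".format(f_stat))
```

A large F means the group means are spread out much more than the values within each group, which is what drives the small p-value.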
