In [None]:
import pandas as pd
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns

**One tailed t-test**

In a packing plant, a machine packs cartons with jars. A new machine is expected to pack faster on average than the machine currently in use.

To test that hypothesis, the times it takes each machine to pack ten cartons are recorded.

The results, in seconds, are in the file files_for_lab/machine.txt.

Assuming the conditions for a t-test are met, does the data provide sufficient evidence that one machine is faster than the other?

In [None]:
data = pd.read_csv('/content/drive/MyDrive/Ironhack1/Week_7/Day_2/Morning/machine.txt', encoding='UTF-16', sep='\t')
display(data.head())

data.dtypes

,New machine,Old machine
0,42.1,42.7
1,41.0,43.6
2,41.3,43.8
3,41.8,43.3
4,42.4,42.5


New machine    float64
Old machine    float64
dtype: object

In [None]:
# H0: The mean packing time of the new machine is equal to that of the old machine
# H1: The mean packing time of the new machine is lower than that of the old machine (one-tailed)

display(st.ttest_ind(data['New machine'], data['Old machine'], equal_var=False))

# ttest_ind is two-sided by default. When the statistic is negative (the new machine's mean time is lower),
# the one-tailed p-value is half of the reported two-sided p-value.
# If that p-value is below 0.05, we reject H0 and conclude that the new machine packs faster on average

# As a sanity check, compare the sample means of the two machines
print('New machine:',round(data['New machine'].mean(),1))
print('Old machine:',round(data['Old machine'].mean(),1))

# The two-sided p-value is 0.0032, so the one-tailed p-value is about 0.0016 < 0.05: H0 is rejected

Ttest_indResult(statistic=-3.397230706117603, pvalue=0.0032422494663179747)

New machine: 42.1
Old machine: 43.2
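
The one-tailed p-value can also be requested directly instead of halving the two-sided one. The sketch below assumes SciPy 1.6 or later, where `ttest_ind` accepts an `alternative` argument. It uses only the five rows displayed by `data.head()` above as illustrative input, so its numbers will differ from the full ten-carton result.

```python
import scipy.stats as st

# Illustrative input: only the five rows shown by data.head(), not the full file
new_machine = [42.1, 41.0, 41.3, 41.8, 42.4]
old_machine = [42.7, 43.6, 43.8, 43.3, 42.5]

# Two-sided Welch test, as in the cell above
two_sided = st.ttest_ind(new_machine, old_machine, equal_var=False)

# One-tailed Welch test: alternative='less' tests H1: mean(new) < mean(old)
one_sided = st.ttest_ind(new_machine, old_machine, equal_var=False, alternative='less')

print('two-sided p:', two_sided.pvalue)
print('one-sided p:', one_sided.pvalue)
```

Because the statistic is negative here (the new machine is faster in the sample), the one-sided p-value is exactly half of the two-sided one.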


**Matched Pairs Test**

In this challenge we will compare dependent samples of data describing our Pokemon (file files_for_lab/pokemon.csv). 

Our goal is to see whether there is a significant difference between each Pokemon's defense and attack scores. 

Our null hypothesis is that the mean defense and attack scores are equal.

Compare the two columns to see if there is a statistically significant difference between them and comment on your result.

In [None]:
poke = pd.read_csv('/content/drive/MyDrive/Ironhack1/Week_7/Day_2/Morning/pokemon.csv')

display(poke.head())

poke.dtypes

,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


#              int64
Name          object
Type 1        object
Type 2        object
Total          int64
HP             int64
Attack         int64
Defense        int64
Sp. Atk        int64
Sp. Def        int64
Speed          int64
Generation     int64
Legendary       bool
dtype: object

In [None]:
# H0: The mean Attack and Defense scores are equal (the mean paired difference is zero)
# H1: The mean Attack and Defense scores are not equal

display(st.ttest_rel(poke['Attack'], poke['Defense']))

# If the p-value is below 0.05, we reject H0, meaning the mean Attack and Defense scores differ

# As a sanity check, compare the sample means of the two columns
print('Attack:',round(poke['Attack'].mean(),1)) #--> 79.0
print('Defense:',round(poke['Defense'].mean(),1)) #--> 73.8

# The p-value (1.71e-05) is < 0.05, therefore H0 is rejected:
# there is a statistically significant difference between the attack and defense scores

Ttest_relResult(statistic=4.325566393330478, pvalue=1.7140303479358558e-05)

Attack: 79.0
Defense: 73.8
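
A paired t-test is equivalent to a one-sample t-test on the per-row differences against zero, which is why the samples must be matched row by row. The sketch below illustrates this using only the five Pokemon displayed by `poke.head()` above, so the numbers are not those of the full data set.

```python
import numpy as np
import scipy.stats as st

# Illustrative input: Attack and Defense of the five rows shown by poke.head()
attack = np.array([49, 62, 82, 100, 52])
defense = np.array([49, 63, 83, 123, 43])

# Paired (dependent samples) t-test
paired = st.ttest_rel(attack, defense)

# Same test expressed as a one-sample t-test on the differences, H0: mean difference = 0
diff = attack - defense
one_sample = st.ttest_1samp(diff, 0)

print('paired:    ', paired.statistic, paired.pvalue)
print('one-sample:', one_sample.statistic, one_sample.pvalue)
```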


**Context**

Suppose you are working as an analyst in a microprocessor chip manufacturing plant. You have been given the task of analyzing a plasma etching process with respect to changing the power (in watts) of the plasma beam. Data was collected and provided to you to conduct a statistical analysis and **check if changing the power of the plasma beam has any effect on the etching rate of the machine**. You will **conduct ANOVA and check if there is any difference in the mean etching rate for different levels of power**. You can find the data in the anova_lab_data.xlsx file in the files_for_lab folder.

- State the null hypothesis

- State the alternate hypothesis

- What is the significance level?

- What are the degrees of freedom of the model, the error term, and the total?

In [None]:
data1 = pd.read_excel('/content/drive/MyDrive/Ironhack1/Week_7/Day_2/Morning/anova_lab_data.xlsx', engine='openpyxl')
display(data1.head())

data1.dtypes

,Power,Etching Rate
0,160 W,5.43
1,180 W,6.24
2,200 W,8.79
3,160 W,5.71
4,180 W,6.71


Power            object
Etching Rate    float64
dtype: object

In [None]:
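# H0: The mean etching rate is the same at every power level
# H1: At least one power level has a different mean etching rate

# Number the observations within each power level; this counter becomes the row index of the pivot
# (note that the column name 'Power ' ends with a space)
data1['power_count'] = data1.groupby('Power ').cumcount()

data1_pivot = data1.pivot(index='power_count', columns='Power ', values='Etching Rate')

# Make sure the column labels are plain strings
data1_pivot.columns = [str(x) for x in data1_pivot.columns.values]
data1_pivot.head(10)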


power_count,160 W,180 W,200 W
0,5.43,6.24,8.79
1,5.71,6.71,9.2
2,6.22,5.98,7.9
3,6.01,5.66,8.15
4,5.59,6.6,7.55
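
The pivot is a convenience rather than a requirement: the groups can also be taken straight from the long-format table with `groupby` and unpacked into `st.f_oneway`. The sketch below rebuilds a long-format table from the values displayed above; `rates`, `long_df`, and `groups` are illustrative names, and the row order of the rebuilt table is an assumption that does not affect the result.

```python
import pandas as pd
import scipy.stats as st

# Values copied from the pivoted table displayed above
rates = {
    '160 W': [5.43, 5.71, 6.22, 6.01, 5.59],
    '180 W': [6.24, 6.71, 5.98, 5.66, 6.60],
    '200 W': [8.79, 9.20, 7.90, 8.15, 7.55],
}

# Long format: one row per observation, same column names as the notebook's data1
long_df = pd.DataFrame(
    [(power, rate) for power, values in rates.items() for rate in values],
    columns=['Power ', 'Etching Rate'],
)

# One array of etching rates per power level
groups = [g['Etching Rate'].to_numpy() for _, g in long_df.groupby('Power ')]

# One-way ANOVA across the groups
result = st.f_oneway(*groups)
print(result)
```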


In [None]:
st.f_oneway(data1_pivot['160 W'],data1_pivot['180 W'],data1_pivot['200 W'])
# The p-value (7.5e-06) is far below 0.05, so we reject H0:
# at least one power level has a different mean etching rate

F_onewayResult(statistic=36.87895470100505, pvalue=7.506584272358903e-06)
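
The degrees of freedom asked about in the task can be read off the group sizes, and the F statistic can be rebuilt from them by hand. This is a minimal sketch, assuming the 15 values displayed in the pivoted table above are the complete data set; the variable names are illustrative.

```python
import numpy as np
import scipy.stats as st

# Etching rates per power level, copied from the pivoted table above
groups = [
    np.array([5.43, 5.71, 6.22, 6.01, 5.59]),  # 160 W
    np.array([6.24, 6.71, 5.98, 5.66, 6.60]),  # 180 W
    np.array([8.79, 9.20, 7.90, 8.15, 7.55]),  # 200 W
]

k = len(groups)                        # number of groups (power levels)
n_total = sum(len(g) for g in groups)  # total number of observations

df_model = k - 1        # between-group degrees of freedom
df_error = n_total - k  # within-group (error) degrees of freedom
df_total = n_total - 1  # total degrees of freedom

grand_mean = np.concatenate(groups).mean()

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F is the ratio of the two mean squares; the p-value is the upper tail of the F distribution
f_stat = (ss_between / df_model) / (ss_within / df_error)
p_value = st.f.sf(f_stat, df_model, df_error)

print('DoF (model, error, total):', df_model, df_error, df_total)
print('F =', f_stat, 'p =', p_value)
```

With three power levels and five observations each, this gives 2, 12, and 14 degrees of freedom for the model, error, and total.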

In [None]:
# As a sanity check, compare the mean etching rate at each power level
data1_pivot.mean()

160 W    5.792
180 W    6.238
200 W    8.318
dtype: float64