In [1]:
import pandas as pd
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns', None)

## One-tailed t-test
In a packing plant, a machine packs cartons with jars. 
It is hypothesized that a new machine will pack faster on average than the machine currently used. 
To test that hypothesis, the time each machine takes to pack ten cartons is recorded. 
The results, in seconds, are in the file files_for_lab/machine.txt. 
Assuming the conditions for a t-test are met, do the data provide sufficient evidence that the new machine is faster than the old one?

### H0- The new machine does not pack faster on average than the machine currently used ("Old machine"): mean time (new) >= mean time (old)
### H1- The new machine packs faster on average than the old machine: mean time (new) < mean time (old)

In [2]:
df = pd.read_csv('/Users/aloaskari/Desktop/Ironhack/Bootcamp/Week_7/Day_2/Afternoon/Lab/lab-t-tests-p-values/files_for_lab/machine.txt',
                 encoding="utf-16", sep="\t")

df

Unnamed: 0,New machine,Old machine
0,42.1,42.7
1,41.0,43.6
2,41.3,43.8
3,41.8,43.3
4,42.4,42.5
5,42.8,43.5
6,43.2,43.1
7,42.3,41.7
8,41.8,44.0
9,42.7,44.1


In [3]:
df.columns

Index(['New machine', '    Old machine'], dtype='object')

In [4]:
df1 = df['New machine']

df2 = df['    Old machine']
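The column name `'    Old machine'` carries leading whitespace from the file header, which is easy to mistype later. A minimal sketch of one way to clean it up, using a small hypothetical frame (`toy`) rather than the real file:

```python
import pandas as pd

# Hypothetical two-row frame that mimics the padded header seen above
toy = pd.DataFrame({"New machine": [42.1, 41.0], "    Old machine": [42.7, 43.6]})

# Strip leading/trailing whitespace from every column name
toy.columns = toy.columns.str.strip()

old_times = toy["Old machine"]  # now selectable without the padding
print(list(toy.columns))
```

Applied to `df`, the same `str.strip()` call would let the later cells use `df['Old machine']` directly.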

In [5]:
display(df1.mean())

42.14

In [6]:
display(df2.mean())

43.230000000000004

#### The sample means (42.14 s for the new machine, 43.23 s for the old one) suggest the new machine is faster, but the gap could be due to chance, so we run a t-test to check

In [7]:
st.ttest_1samp(df2, 42.14)

TtestResult(statistic=4.596524549827484, pvalue=0.0012969232972500778, df=9)

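In [8]:
# alternative="greater" already returns the one-tailed p-value (old mean > 42.14), so it is not halved again
print('p value (single tailed): {:.6f}'.format(st.ttest_1samp(df2, 42.14, alternative="greater").pvalue))

p value (single tailed): 0.000648


In [9]:
if st.ttest_1samp(df2, 42.14, alternative="greater").pvalue < 0.05:
    print("Reject the null hypothesis: the new machine packs faster than the old machine")
else:
    print("Fail to reject the null hypothesis: no evidence that the new machine packs faster")

Reject the null hypothesis: the new machine packs faster than the old machine


### We reject the null hypothesis (H0): at the 5% level, the old machine's mean time is significantly greater than 42.14 s, so the new machine packs faster on average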
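The one-sample test above treats the new machine's mean (42.14 s) as a fixed constant and ignores its sampling variability. A two-sample test uses both columns directly. The sketch below uses the twenty values from the table and assumes the two samples are independent with roughly equal variances (the `ttest_ind` default); passing `equal_var=False` would give Welch's test instead.

```python
import scipy.stats as st

# Packing times in seconds, copied from the table above
new_machine = [42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7]
old_machine = [42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1]

# H1: mean(new) < mean(old), i.e. the new machine is faster
result = st.ttest_ind(new_machine, old_machine, alternative="less")

print("t statistic: {:.3f}".format(result.statistic))
print("p value (single tailed): {:.4f}".format(result.pvalue))
```

A negative statistic with a p-value below 0.05 points in the same direction as the sample means: the new machine takes less time per carton.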

## Matched Pairs Test - 
In this challenge we will compare dependent samples of data describing our Pokemon (file files_for_lab/pokemon.csv). Our goal is to see whether there is a significant difference between each Pokemon's defense and attack scores. Our null hypothesis is that the mean defense and attack scores are equal. Compare the two columns to see if there is a statistically significant difference between them and comment on the result.

### H0- Defense is equal to Attack
### H1- Defense is not equal to Attack

In [10]:
pokemon = pd.read_csv("pokemon.csv")

pokemon

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,700,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True


In [11]:
skill = pokemon[['Attack', 'Defense']]

skill

Unnamed: 0,Attack,Defense
0,49,49
1,62,63
2,82,83
3,100,123
4,52,43
...,...,...
795,100,150
796,160,110
797,110,60
798,160,60


In [12]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
skill = skill.assign(Difference=skill['Attack'] - skill['Defense'])

skill.head()


Unnamed: 0,Attack,Defense,Difference
0,49,49,0
1,62,63,-1
2,82,83,-1
3,100,123,-23
4,52,43,9


In [13]:
skill_diff_mean, skill_diff_std = skill['Difference'].mean(), skill['Difference'].std(ddof=1)
skill_diff_mean, skill_diff_std

(5.15875, 33.7323418553516)

In [14]:
t = skill_diff_mean / ( skill_diff_std / np.sqrt(skill.shape[0]) )
print("The mean of the paired differences is: {:.2f}".format(skill_diff_mean))
print("The standard deviation of the paired differences is: {:.2f}".format(skill_diff_std))
print("Our t statistic is: {:.2f}".format(t))

The mean of the paired differences is: 5.16
The standard deviation of the paired differences is: 33.73
Our t statistic is: 4.33
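
For reference, the statistic computed in this cell is the usual paired t statistic. The symbols below are shorthand introduced here: $\bar{d}$ is the mean of the differences (Attack minus Defense), $s_d$ their sample standard deviation (`ddof=1`), and $n$ the number of pairs (`skill.shape[0]`).

$$
t = \frac{\bar{d}}{s_d / \sqrt{n}}, \qquad \text{df} = n - 1, \qquad p_{\text{two-sided}} = 2 \, P\left(T_{n-1} \ge |t|\right)
$$

Under H0 the statistic follows a t distribution with $n - 1$ degrees of freedom, which is why the next cell uses `df = skill.shape[0] - 1`.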


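In [15]:
# Two-sided p-value: twice the upper-tail probability of |t| (the cdf alone is not the p-value)
print("p value (two tailed): {:.2e}".format(2 * st.t.sf(abs(t), df=skill.shape[0] - 1)))

p value (two tailed): 1.71e-05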

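### We reject the null hypothesis (H0): attack scores are on average about 5.16 points higher than defense scores, and the difference is statistically significant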
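
The manual calculation can be cross-checked against `scipy.stats.ttest_rel`, which runs the same paired test in one call. The sketch below uses a handful of illustrative score pairs, not the full Pokemon file, and only aims to show that the two routes agree.

```python
import numpy as np
import scipy.stats as st

# Illustrative paired scores only (a small stand-in for the Attack/Defense columns)
attack = np.array([49, 62, 82, 100, 52, 64, 84, 130])
defense = np.array([49, 63, 83, 123, 43, 58, 78, 111])

# Manual route: t statistic of the mean difference, then a two-sided p-value
diff = attack - defense
n = diff.size
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))
p_manual = 2 * st.t.sf(abs(t_manual), df=n - 1)

# Library route: paired (related-samples) t-test
result = st.ttest_rel(attack, defense)

print(t_manual, p_manual)
print(result.statistic, result.pvalue)
```

On the real data the equivalent call would be `st.ttest_rel(pokemon['Attack'], pokemon['Defense'])`.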

## Inferential statistics - ANOVA

### Part 1

Suppose you are working as an analyst in a microprocessor chip manufacturing plant. You have been given the task of analyzing a plasma etching process with respect to changing the power (in watts) of the plasma beam. Data was collected and provided to you to conduct statistical analysis and check whether changing the power of the plasma beam has any effect on the machine's etching rate. You will conduct ANOVA and check if there is any difference in the mean etching rate for different levels of power. The data is in the file anova_lab_data.xlsx in the files_for_lab folder.

State the null hypothesis.
State the alternate hypothesis.
What is the significance level?
What are the degrees of freedom of the model, the error terms, and the total?

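### H0- The mean etching rate is the same at every power level (160 W, 180 W, 200 W)
### H1- The mean etching rate differs for at least one power level
### Significance level- 0.05 (the same threshold used for the t-test above)
### Degrees of freedom- model: 3 - 1 = 2, error: 15 - 3 = 12, total: 15 - 1 = 14 (3 power levels, 15 observations)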

In [16]:
data = pd.read_excel('anova_lab_data.xlsx')
data

Unnamed: 0,Power,Etching Rate
0,160 W,5.43
1,180 W,6.24
2,200 W,8.79
3,160 W,5.71
4,180 W,6.71
5,200 W,9.2
6,160 W,6.22
7,180 W,5.98
8,200 W,7.9
9,160 W,6.01


In [17]:
data['Power '].value_counts()

160 W    5
180 W    5
200 W    5
Name: Power , dtype: int64

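Note: `data = data.rename(columns={'Etching Rate': 'Etching_Rate'}, inplace=True)` turns `data` into `None`, because `rename` returns `None` when `inplace=True`. Use either the assignment (`data = data.rename(columns={'Etching Rate': 'Etching_Rate'})`) or `inplace=True` on its own, never both. The rename is skipped here, so the cells below keep the original column name 'Etching Rate'.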
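
A minimal sketch of the two working ways to rename a column, on a small hypothetical frame (`toy`) so it runs without the Excel file:

```python
import pandas as pd

# Hypothetical frame with the same column names as the lab data
toy = pd.DataFrame({"Power ": ["160 W", "180 W"], "Etching Rate": [5.43, 6.24]})

# Option 1: rename returns a new DataFrame, which is assigned back
renamed = toy.rename(columns={"Etching Rate": "Etching_Rate"})

# Option 2: modify in place; the call itself returns None, so nothing is assigned
in_place = toy.copy()
returned = in_place.rename(columns={"Etching Rate": "Etching_Rate"}, inplace=True)

print(list(renamed.columns))
print(list(in_place.columns))
print(returned)
```

Both frames end up with the `Etching_Rate` column, while `returned` is `None`, which is what was being assigned to `data`.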

In [18]:
data['Times_Used'] = data.groupby('Power ').cumcount()
data

Unnamed: 0,Power,Etching Rate,Times_Used
0,160 W,5.43,0
1,180 W,6.24,0
2,200 W,8.79,0
3,160 W,5.71,1
4,180 W,6.71,1
5,200 W,9.2,1
6,160 W,6.22,2
7,180 W,5.98,2
8,200 W,7.9,2
9,160 W,6.01,3


In [20]:
data_pivot = data.pivot(index='Times_Used', columns='Power ', values='Etching Rate')
data_pivot

Times_Used,160 W,180 W,200 W
0,5.43,6.24,8.79
1,5.71,6.71,9.2
2,6.22,5.98,7.9
3,6.01,5.66,8.15
4,5.59,6.6,7.55


In [21]:
data_pivot.mean()

Power 
160 W    5.792
180 W    6.238
200 W    8.318
dtype: float64

In [22]:
group_df = data.groupby('Power ')['Etching Rate'].agg(Etching_Rate='mean',Samples='size').reset_index()
group_df

Unnamed: 0,Power,Etching_Rate,Samples
0,160 W,5.792,5
1,180 W,6.238,5
2,200 W,8.318,5
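
The group means differ, but the ANOVA itself still has to be run to test whether the difference is significant. One way to do that is `scipy.stats.f_oneway`, sketched below with the fifteen etching rates copied from the pivot table. The variable names are illustrative, and the 0.05 threshold is an assumption carried over from the t-test section.

```python
import scipy.stats as st

# Etching rates per power level, copied from the pivot table above
rate_160 = [5.43, 5.71, 6.22, 6.01, 5.59]
rate_180 = [6.24, 6.71, 5.98, 5.66, 6.60]
rate_200 = [8.79, 9.20, 7.90, 8.15, 7.55]

# One-way ANOVA: H0 is that all three group means are equal
f_stat, p_value = st.f_oneway(rate_160, rate_180, rate_200)

# Degrees of freedom for k groups and n_total observations
k = 3
n_total = len(rate_160) + len(rate_180) + len(rate_200)
df_model, df_error, df_total = k - 1, n_total - k, n_total - 1

print("F statistic: {:.2f}".format(f_stat))
print("p value: {:.2e}".format(p_value))
print("DoF (model, error, total):", df_model, df_error, df_total)

if p_value < 0.05:
    print("Reject H0: mean etching rate differs for at least one power level")
else:
    print("Fail to reject H0: no evidence that power affects the mean etching rate")
```

A significant result only says that at least one group mean differs. Identifying which power levels differ would need a post-hoc comparison.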
