# Lab | Inferential statistics - T-test & P-value
Instructions
One-tailed t-test - In a packing plant, a machine packs cartons with jars. It is supposed that a new machine will pack faster on average than the machine currently used. To test that hypothesis, the times each machine takes to pack ten cartons are recorded. The results, in seconds, are in the tables in the file files_for_lab/machine.txt. Assuming the conditions for a t-test are met, does the data provide sufficient evidence that one machine is better than the other?

Matched Pairs Test - In this challenge we will compare dependent samples of data describing our Pokemon (file files_for_lab/pokemon.csv). Our goal is to see whether there is a significant difference between each Pokemon's defense and attack scores. Our hypothesis is that the defense and attack scores are equal. Compare the two columns to see if there is a statistically significant difference between them and comment on your result.

Inferential statistics - ANOVA
Note: The following lab is divided into 2 sections which represent activities 3 and 4.

Part 1
In this activity, we will look at another example. Your task is to understand the problem and write down all the steps to set up ANOVA. After the next lesson, we will ask you to solve this problem using Python. Here are the steps that you would need to work on:
- Null hypothesis
- Alternate hypothesis
- Level of significance
- Test statistic
- P-value
- F table

Context
Suppose you are working as an analyst in a microprocessor chip manufacturing plant. You have been given the task of analyzing a plasma etching process with respect to changing Power (in Watts) of the plasma beam. Data was collected and provided to you to conduct statistical analysis and check if changing the power of the plasma beam has any effect on the etching rate by the machine. You will conduct ANOVA and check if there is any difference in the mean etching rate for different levels of power. You can find the data in the anova_lab_data.xlsx file in the files_for_lab folder.

- State the null hypothesis.
- State the alternate hypothesis.
- What is the significance level?
- What are the degrees of freedom of the model, the error terms, and the total?

Data were collected randomly and provided to you in the table as shown: link to the image - Data

Part 2
In this section, use Python to conduct ANOVA.
What conclusions can you draw from the experiment and why?

In [2]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
import pandas as pd

file_path = '/Users/franc/OneDrive/Desktop/Week_7/Day_2/Afternoon/lab-t-tests-p-values/files_for_lab/machine.txt'  # Replace with your actual file path

df = pd.read_csv(file_path, encoding='utf-16', sep='\t', header=0)

print(df)

   New machine      Old machine
0         42.1             42.7
1         41.0             43.6
2         41.3             43.8
3         41.8             43.3
4         42.4             42.5
5         42.8             43.5
6         43.2             43.1
7         42.3             41.7
8         41.8             44.0
9         42.7             44.1


In [4]:
df

Unnamed: 0,New machine,Old machine
0,42.1,42.7
1,41.0,43.6
2,41.3,43.8
3,41.8,43.3
4,42.4,42.5
5,42.8,43.5
6,43.2,43.1
7,42.3,41.7
8,41.8,44.0
9,42.7,44.1


In [5]:
df.columns= df.columns.str.lower()
df.columns= df.columns.str.replace(' ', '', regex= False)
df.columns

Index(['newmachine', 'oldmachine'], dtype='object')

In [6]:
from scipy import stats

# "Faster" means a smaller mean packing time, so the test is lower-tailed:
# H0: mean(new) >= mean(old)    Ha: mean(new) < mean(old)
t_stat, p_value = stats.ttest_ind(df['newmachine'], df['oldmachine'], alternative='less')

new_mean = df['newmachine'].mean()
old_mean = df['oldmachine'].mean()
new_std = df['newmachine'].std(ddof=1)
old_std = df['oldmachine'].std(ddof=1)
n_new = len(df['newmachine'])
n_old = len(df['oldmachine'])

# Pooled standard deviation and standard error of the difference in means
sp = np.sqrt(((n_new - 1) * new_std ** 2 + (n_old - 1) * old_std ** 2) / (n_new + n_old - 2))
sed = sp * np.sqrt(1/n_new + 1/n_old)
t = (new_mean - old_mean) / sed
dof = n_new + n_old - 2  # named dof so the DataFrame df is not overwritten
p_value_manual = stats.t.cdf(t, dof)  # lower-tail probability

print(f"New machine mean time: {new_mean:.2f}")
print(f"Old machine mean time: {old_mean:.2f}")
print(f"t-statistic: {t:.3f}")
print(f"Critical t-value for alpha=0.05: {stats.t.ppf(0.05, dof):.3f}")
print(f"p-value (one-tailed): {p_value_manual:.3f}")

alpha = 0.05
if p_value_manual < alpha:
    print("Reject the null hypothesis: There is evidence that the new machine is faster.")
else:
    print("Fail to reject the null hypothesis: There is not enough evidence that the new machine is faster.")

New machine mean time: 42.14
Old machine mean time: 43.23
t-statistic: -3.397
Critical t-value for alpha=0.05: -1.734
p-value (one-tailed): 0.002
Reject the null hypothesis: There is evidence that the new machine is faster.
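
As a cross-check, the one-sided test can also be run directly with SciPy. The sketch below hard-codes the ten packing times from the table above and assumes that "faster" means a smaller mean time, so the alternative hypothesis is that the new machine's mean is less than the old machine's (`alternative="less"`). It also assumes equal variances (the pooled test), matching the manual calculation.

```python
import numpy as np
from scipy import stats

# Packing times in seconds, copied from the machine.txt table above.
new_machine = np.array([42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7])
old_machine = np.array([42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1])

# H0: mean(new) >= mean(old); Ha: mean(new) < mean(old), i.e. the new machine is faster.
# equal_var=True requests the pooled-variance (Student) t-test.
result = stats.ttest_ind(new_machine, old_machine, equal_var=True, alternative="less")

t_stat = result.statistic
p_one_tailed = result.pvalue

print(f"t-statistic: {t_stat:.3f}")
print(f"p-value (one-tailed, lower): {p_one_tailed:.4f}")
```

A negative t-statistic together with a small lower-tail p-value supports the claim that the new machine packs faster. With `alternative="greater"` the same data would give a p-value close to 1, because that version tests whether the new machine is slower.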


# Matched Pairs Test

In this challenge we will compare dependent samples of data describing our Pokemon (file files_for_lab/pokemon.csv). Our goal is to see whether there is a significant difference between each Pokemon's defense and attack scores. Our hypothesis is that the defense and attack scores are equal. Compare the two columns to see if there is a statistically significant difference between them and comment on your result.

In [7]:
df1= pd.read_csv('pokemon.csv')

In [8]:
df1 = df1.drop(columns=['#'])
df1

In [9]:
df1

Unnamed: 0,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
796,DiancieMega Diancie,Rock,Fairy,700,50,160,110,160,110,110,6,True
797,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
798,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True


In [11]:
import scipy.stats as stats

df1['attack_defense_diff'] = df1['Attack'] - df1['Defense']

t_statistic, p_value = stats.ttest_rel(df1['Attack'], df1['Defense'])

alpha = 0.05

if p_value < alpha:
    print("Reject H0: The defense and attack scores are NOT equal (two-sided)")
else:
    print("Fail to reject H0: There is not enough evidence that the defense and attack scores differ")

print("T-Statistic: {:.3f}".format(t_statistic))
print("P-Value: {:.3f}".format(p_value))

Reject H0: The defense and attack scores are NOT equal (two-sided)
T-Statistic: 4.326
P-Value: 0.000
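
A matched pairs t-test is equivalent to a one-sample t-test on the per-row differences against a mean of zero. The sketch below illustrates this using only the nine Pokemon rows displayed above, not the full dataset, so its numbers will differ from the result on all rows. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Attack and Defense for the nine rows shown in the table above (a small subset).
attack = np.array([49, 62, 82, 100, 52, 100, 160, 110, 160])
defense = np.array([49, 63, 83, 123, 43, 150, 110, 60, 60])

# Paired test on the two columns.
paired = stats.ttest_rel(attack, defense)

# Same test expressed as a one-sample test on the differences.
diff = attack - defense
one_sample = stats.ttest_1samp(diff, popmean=0)

# Manual t-statistic: mean difference divided by its standard error.
manual_t = diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff)))

print(f"ttest_rel:   t = {paired.statistic:.4f}, p = {paired.pvalue:.4f}")
print(f"ttest_1samp: t = {one_sample.statistic:.4f}, p = {one_sample.pvalue:.4f}")
print(f"manual:      t = {manual_t:.4f}")
```

All three t-statistics agree, which is why the sign of the statistic can be read directly: a positive value means Attack is higher than Defense on average.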


# Part 1

In this activity, we will look at another example. Your task is to understand the problem and write down all the steps to set up ANOVA. After the next lesson, we will ask you to solve this problem using Python. Here are the steps that you would need to work on:
- Null hypothesis
- Alternate hypothesis
- Level of significance
- Test statistic
- P-value
- F table

Context

Suppose you are working as an analyst in a microprocessor chip manufacturing plant. You have been given the task of analyzing a plasma etching process with respect to changing Power (in Watts) of the plasma beam. Data was collected and provided to you to conduct statistical analysis and check if changing the power of the plasma beam has any effect on the etching rate by the machine. You will conduct ANOVA and check if there is any difference in the mean etching rate for different levels of power. You can find the data in the anova_lab_data.xlsx file in the files_for_lab folder.

- State the null hypothesis.
- State the alternate hypothesis.
- What is the significance level?
- What are the degrees of freedom of the model, the error terms, and the total?

Data were collected randomly and provided to you in the table as shown: link to the image - Data

In [15]:
df2= pd.read_excel('anova_lab_data.xlsx')

In [16]:
df2

Unnamed: 0,Power,Etching Rate
0,160 W,5.43
1,180 W,6.24
2,200 W,8.79
3,160 W,5.71
4,180 W,6.71
5,200 W,9.2
6,160 W,6.22
7,180 W,5.98
8,200 W,7.9
9,160 W,6.01


In [21]:
new_df = df2.groupby('Power ')['Etching Rate'].agg(['mean', 'count']).reset_index()
print(new_df)

  Power    mean  count
0  160 W  5.792      5
1  180 W  6.238      5
2  200 W  8.318      5


-Null Hypothesis (H0): Changing the power of the plasma beam has no effect on the etching rate by the machine; the mean etching rate is the same at 160 W, 180 W, and 200 W.

-Alternate Hypothesis (Ha): Changing the power of the plasma beam has an effect on the etching rate by the machine; at least one power level has a different mean etching rate.

-Significance Level: α = 0.05 (equivalently, a 95% confidence level)

-Test Statistic: F-test for one-way ANOVA

-Degrees of Freedom:

-Model, Between Groups (d1): 3 - 1 = 2

-Error, Within Groups (d2): 15 - 3 = 12

-Total Degrees of Freedom: 15 - 1 = 14

-Sums of Squares:

-SST (Sum of Squares for Treatments): Variation of the group means around the global mean

-SSE (Sum of Squares for Error): Variation of the individual values within each group around their respective group means
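
The degrees of freedom listed above follow from the number of groups (3 power levels) and the number of observations (15). A minimal sketch, using a hypothetical helper named `anova_degrees_of_freedom`, shows the arithmetic and looks up the critical F value for α = 0.05:

```python
from scipy import stats


def anova_degrees_of_freedom(n_groups, n_total):
    """Return (between, within, total) degrees of freedom for one-way ANOVA."""
    between = n_groups - 1        # model / treatment
    within = n_total - n_groups   # error
    total = n_total - 1
    return between, within, total


# 3 power levels, 5 observations each, as in the grouped table above.
d1, d2, d_total = anova_degrees_of_freedom(n_groups=3, n_total=15)

alpha = 0.05
# Critical value: reject H0 when the F-statistic exceeds this threshold.
f_critical = stats.f.ppf(1 - alpha, dfn=d1, dfd=d2)

print(f"Between: {d1}, Within: {d2}, Total: {d_total}")
print(f"Critical F at alpha={alpha}: {f_critical:.2f}")
```

The between-group and within-group degrees of freedom always add up to the total, which is a quick sanity check on the setup.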

# Part 2 

In this section, use Python to conduct ANOVA. What conclusions can you draw from the experiment and why?

In [26]:
# Mean square between groups (treatment): spread of the group means around the global mean
global_mean = df2['Etching Rate'].mean()
S2t = sum(new_df['count'] * (new_df['mean'] - global_mean) ** 2) / (new_df['Power '].nunique() - 1)

print("S2t: {:.2f}".format(S2t))

# Mean square within groups (error): pooled sample variance of the groups
S2E = sum((ng - 1) * np.var(df2[df2['Power '] == p]['Etching Rate'], ddof=1) for p, ng in zip(new_df['Power '], new_df['count']))
S2E /= (len(df2) - new_df['Power '].nunique())

print("S2E: {:.2f}".format(S2E))

# F-statistic: ratio of between-group to within-group variability
F = S2t / S2E
print("F-statistic: {:.2f}".format(F))

# Degrees of freedom: d1 = groups - 1, d2 = observations - groups
d1 = new_df['Power '].nunique() - 1
d2 = len(df2) - new_df['Power '].nunique()

print("Degrees of freedom (d1):", d1)
print("Degrees of freedom (d2):", d2)

# Upper-tail p-value of the F distribution
pval = 1 - stats.f.cdf(F, dfn=d1, dfd=d2)
print("p-value of F: {:.3f}".format(pval))

# Critical value at alpha = 0.05
Fc = stats.f.ppf(1 - 0.05, dfn=d1, dfd=d2)
print("Critical value (alpha = 0.05): {:.2f}".format(Fc))

# Cross-check with SciPy's built-in one-way ANOVA
result = stats.f_oneway(
    df2[df2['Power '] == '160 W']['Etching Rate'],
    df2[df2['Power '] == '180 W']['Etching Rate'],
    df2[df2['Power '] == '200 W']['Etching Rate']
)

print("One-way ANOVA result:", result)

S2t: 9.09
S2E: 0.25
F-statistic: 36.88
Degrees of freedom (d1): 2
Degrees of freedom (d2): 12
p-value of F: 0.000
Critical value (alpha = 0.05): 3.89
One-way ANOVA result: F_onewayResult(statistic=36.87895470100505, pvalue=7.506584272358903e-06)
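
The manual p-value prints as 0.000 only because it is rounded to three decimals. A small sketch, assuming the F-statistic and degrees of freedom reported above, recovers the unrounded value with the survival function `stats.f.sf`, which computes the upper tail directly rather than as `1 - cdf`:

```python
from scipy import stats

# F-statistic and degrees of freedom taken from the output above.
f_stat = 36.87895470100505
d1, d2 = 2, 12

# Survival function = upper-tail probability P(F > f_stat).
p_sf = stats.f.sf(f_stat, dfn=d1, dfd=d2)

# The 1 - cdf form gives the same tail, but can lose precision for tiny p-values.
p_cdf = 1 - stats.f.cdf(f_stat, dfn=d1, dfd=d2)

print(f"p-value via sf:      {p_sf:.3e}")
print(f"p-value via 1 - cdf: {p_cdf:.3e}")
```

Printing in scientific notation makes it clear that the p-value is small but not literally zero, and it should match the `pvalue` reported by `stats.f_oneway`.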


# ANSWER

The F-statistic (F = 36.88) exceeds the critical value (Fc = 3.89) at a significance level of α = 0.05, and the p-value (p ≈ 7.5e-06, which prints as 0.000 at three decimals) is far below 0.05. Therefore, we reject the null hypothesis (H0). Changing the power of the plasma beam has a statistically significant effect on the mean etching rate at the 95% confidence level: at least one power level differs from the others, and the group means rise from 5.792 at 160 W to 8.318 at 200 W.