# Lab | Inferential statistics - ANOVA

Suppose you are working as an analyst in a microprocessor chip manufacturing plant. You have been given the task of analyzing a plasma etching process with respect to changes in the power (in watts) of the plasma beam. Data was collected and provided to you so that you can check statistically whether changing the power of the plasma beam has any effect on the machine's etching rate. You will conduct an ANOVA to check whether the mean etching rate differs between power levels. You can find the data in the `anova_lab_data.xlsx` file in the `files_for_lab` folder.
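
Stated formally, the one-way ANOVA in this lab tests the hypotheses below. The symbols $\mu_{160}$, $\mu_{180}$ and $\mu_{200}$ are notation introduced here for illustration and denote the mean etching rates at the three power levels that appear in the data:

$$
H_0:\ \mu_{160} = \mu_{180} = \mu_{200}
\qquad \text{versus} \qquad
H_1:\ \text{at least one } \mu_i \text{ differs from the others}
$$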

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [12]:
df = pd.read_excel("./files_for_lab/anova_lab_data.xlsx")
# Normalize column names: trim whitespace, lowercase, and use underscores
df.columns = [col.strip().lower().replace(' ', '_') for col in df.columns]
df

Unnamed: 0,power,etching_rate
0,160 W,5.43
1,180 W,6.24
2,200 W,8.79
3,160 W,5.71
4,180 W,6.71
5,200 W,9.2
6,160 W,6.22
7,180 W,5.98
8,200 W,7.9
9,160 W,6.01


In [14]:
# Pivot to take a quick look at what we have

round(df.pivot(columns="power").describe(),2)

# the formula will be: "etching_rate ~ C(power)"

Unnamed: 0_level_0,etching_rate,etching_rate,etching_rate
power,160 W,180 W,200 W
count,5.0,5.0,5.0
mean,5.79,6.24,8.32
std,0.32,0.43,0.67
min,5.43,5.66,7.55
25%,5.59,5.98,7.9
50%,5.71,6.24,8.15
75%,6.01,6.6,8.79
max,6.22,6.71,9.2
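
As a rough cross-check, the between-group sum of squares that a one-way ANOVA relies on can be approximated from the summary above. This is a minimal sketch, assuming the rounded group means and the count of 5 per group shown in the table. Because the means are rounded to two decimals, the result is only approximate. The names `group_means` and `n_per_group` are illustrative and not part of the lab.

```python
# Rounded group means and group size copied from the describe() output above
group_means = {"160 W": 5.79, "180 W": 6.24, "200 W": 8.32}
n_per_group = 5

# With equal group sizes, the grand mean is the plain average of the group means
grand_mean = sum(group_means.values()) / len(group_means)

# Between-group sum of squares:
# group size times the squared deviations of each group mean from the grand mean
ss_between = n_per_group * sum((m - grand_mean) ** 2 for m in group_means.values())

print(round(grand_mean, 2), round(ss_between, 2))
```

A large between-group sum of squares relative to the spread within each group (the `std` row) is what pushes the F statistic up.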


In [20]:
# ANOVA

import statsmodels.api as sm
from statsmodels.formula.api import ols

model = ols("etching_rate ~ C(power)",df).fit()
sm.stats.anova_lm(model) # p-value < 0.05, so at least one power level has a significantly different mean etching rate

Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
C(power),2.0,18.176653,9.088327,36.878955,8e-06
Residual,12.0,2.95724,0.246437,,
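
To make the table above easier to read, the sketch below recomputes the mean squares, the F statistic and the p-value from the printed sums of squares and degrees of freedom. It is an illustration under stated assumptions, not the way `anova_lm` computes them internally. The closed-form tail probability in the code holds only because the numerator has exactly 2 degrees of freedom. The variable names are illustrative.

```python
# Values copied from the anova_lm output above
ss_between, df_between = 18.176653, 2
ss_residual, df_residual = 2.95724, 12

# Mean square = sum of squares divided by its degrees of freedom
ms_between = ss_between / df_between
ms_residual = ss_residual / df_residual

# F statistic = between-group mean square over residual mean square
f_stat = ms_between / ms_residual

# Upper-tail probability of the F distribution.
# Assumption: this closed form is valid only for 2 numerator degrees of freedom.
p_value = (1 + df_between * f_stat / df_residual) ** (-df_residual / 2)

print(f_stat, p_value)
```

With SciPy available, `scipy.stats.f.sf(f_stat, df_between, df_residual)` is the general way to obtain the same tail probability for any degrees of freedom.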


In [28]:
# 200 W is suspected to be the most effective power level, so compare its etching rate with every level
# Note: ttest_ind is two-sided by default; a positive statistic means the 200 W mean is the higher one
from scipy.stats import ttest_ind

hi_power = df[df['power'] == "200 W"]['etching_rate']

for power in df['power'].unique():
    # the 200 W vs 200 W comparison is trivial (statistic 0, p-value 1)
    lo_power = df[df['power'] == power]['etching_rate']
    print(power, ttest_ind(hi_power, lo_power))

160 W Ttest_indResult(statistic=7.611403634613074, pvalue=6.237977344615716e-05)
180 W Ttest_indResult(statistic=5.827496614588661, pvalue=0.0003926796476049085)
200 W Ttest_indResult(statistic=0.0, pvalue=1.0)
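
Because two pairwise t-tests were run against the 200 W group, it can be worth adjusting the p-values for multiple comparisons. The sketch below applies a Bonferroni correction to the two p-values printed above. The choice of Bonferroni and the 0.05 threshold are assumptions made for illustration, and the names `p_values` and `adjusted` are hypothetical.

```python
# Two-sided p-values copied from the t-test output above (200 W vs each lower power level)
p_values = {"160 W": 6.237977344615716e-05, "180 W": 0.0003926796476049085}
alpha = 0.05

# Bonferroni: multiply each p-value by the number of comparisons, capped at 1
n_comparisons = len(p_values)
adjusted = {level: min(p * n_comparisons, 1.0) for level, p in p_values.items()}

for level, p_adj in adjusted.items():
    # True means the difference from 200 W stays significant after the correction
    print(level, p_adj, p_adj < alpha)
```

Both adjusted p-values stay well below 0.05, so the conclusion that the 200 W group differs from the 160 W and 180 W groups does not depend on the correction.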
