# THURSDAY 9TH OCT, 2025
## ANOVA
- Types of ANOVA

- There are 3 main types of ANOVA, depending on the number of independent variables and interactions involved:

1. One-Way ANOVA (with exactly two groups it gives the same result as the equal-variance two-sample t-test)

✅ What it compares: One independent variable (factor) with 2 or more groups.

✅ Example: Comparing test scores between 3 teaching methods.

✅ Assumption: Groups are independent, data within each group is approximately normally distributed, and the groups have similar variances.

✅ Python function: scipy.stats.f_oneway()
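
The link between one-way ANOVA and the two-sample t-test can be checked by hand. A minimal sketch, assuming made-up scores for two groups (the values and the helper `sum_sq_dev` are illustrative only, not from these notes): with exactly two groups, the F-statistic equals the square of the pooled-variance t-statistic.

```python
from statistics import mean

# Hypothetical scores for two teaching methods (illustrative values only).
group_1 = [78, 85, 80, 83, 79]
group_2 = [90, 88, 82, 86, 91]


def sum_sq_dev(values):
    """Sum of squared deviations from the group's own mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


n1, n2 = len(group_1), len(group_2)
grand_mean = mean(group_1 + group_2)

# One-way ANOVA: between-group mean square / within-group mean square.
ss_between = (n1 * (mean(group_1) - grand_mean) ** 2
              + n2 * (mean(group_2) - grand_mean) ** 2)
ss_within = sum_sq_dev(group_1) + sum_sq_dev(group_2)
df_between, df_within = 2 - 1, n1 + n2 - 2
f_stat = (ss_between / df_between) / (ss_within / df_within)

# Two-sample t-statistic with pooled variance (equal variances assumed).
pooled_var = ss_within / df_within
t_stat = (mean(group_1) - mean(group_2)) / (pooled_var * (1 / n1 + 1 / n2)) ** 0.5

print(f"F = {f_stat:.4f}, t squared = {t_stat ** 2:.4f}")
```

The two printed numbers agree, which is why the two tests lead to the same decision when there are only two groups.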


2. Two-Way ANOVA

✅ What it compares: Two independent variables, possibly with interaction.

✅ Example: Test scores by teaching method and gender (2 factors).

✅ You can also test: Interaction effect — whether the effect of one factor depends on the other.

📍 Usually implemented using statsmodels with a formula:
from statsmodels.formula.api import ols
import statsmodels.api as sm

model = ols('Score ~ C(Method) + C(Gender) + C(Method):C(Gender)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
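
To see what the two-way table is built from, the sums of squares can be computed by hand for a small balanced design. This is only a sketch under stated assumptions: the scores, the two methods and the gender labels are made up, and the formulas below are the textbook ones for equal cell sizes, not a description of what statsmodels does internally.

```python
from statistics import mean

# Hypothetical balanced data: 2 methods x 2 genders, 3 scores per cell.
cells = {
    ("Lecture", "F"): [78, 82, 80],
    ("Lecture", "M"): [75, 79, 77],
    ("Project", "F"): [88, 90, 92],
    ("Project", "M"): [80, 84, 82],
}
methods = sorted({m for m, _ in cells})
genders = sorted({g for _, g in cells})
n = 3  # replicates per cell

all_scores = [s for v in cells.values() for s in v]
grand = mean(all_scores)
method_mean = {m: mean([s for (mm, _), v in cells.items() if mm == m for s in v])
               for m in methods}
gender_mean = {g: mean([s for (_, gg), v in cells.items() if gg == g for s in v])
               for g in genders}
cell_mean = {k: mean(v) for k, v in cells.items()}

# Main effects, interaction, and error sums of squares.
ss_method = len(genders) * n * sum((method_mean[m] - grand) ** 2 for m in methods)
ss_gender = len(methods) * n * sum((gender_mean[g] - grand) ** 2 for g in genders)
ss_inter = n * sum(
    (cell_mean[(m, g)] - method_mean[m] - gender_mean[g] + grand) ** 2
    for m in methods for g in genders
)
ss_error = sum((s - cell_mean[k]) ** 2 for k, v in cells.items() for s in v)
ss_total = sum((s - grand) ** 2 for s in all_scores)

# F-statistic for the interaction term.
df_inter = (len(methods) - 1) * (len(genders) - 1)
df_error = len(methods) * len(genders) * (n - 1)
f_inter = (ss_inter / df_inter) / (ss_error / df_error)

print(ss_method, ss_gender, ss_inter, ss_error, ss_total)
print("F (interaction) =", f_inter)
```

In a balanced design the four pieces add up to the total sum of squares, so each row of the ANOVA table accounts for a separate share of the variation.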
 
 
3. Repeated Measures ANOVA

✅ What it compares: Same subjects measured under different conditions or times.

✅ Example: Blood pressure before, during, and after treatment on the same patients.

✅ Use when: Data is not independent, i.e., repeated measures from the same subjects.

✅ Python: statsmodels (AnovaRM) or pingouin (rm_anova).
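
What makes repeated measures ANOVA different is that the variation between subjects is removed from the error term. A hand-computed sketch, assuming made-up blood pressure readings for four patients (the values and patient labels are hypothetical, and this is plain arithmetic rather than the statsmodels or pingouin API):

```python
from statistics import mean

# Hypothetical readings per patient: [before, during, after].
bp = {
    "p1": [140, 135, 130],
    "p2": [150, 144, 141],
    "p3": [138, 136, 131],
    "p4": [145, 140, 134],
}
n = len(bp)   # number of subjects
k = 3         # number of conditions

all_values = [v for row in bp.values() for v in row]
grand = mean(all_values)
cond_means = [mean(row[j] for row in bp.values()) for j in range(k)]
subj_means = [mean(row) for row in bp.values()]

ss_total = sum((v - grand) ** 2 for v in all_values)
ss_cond = n * sum((m - grand) ** 2 for m in cond_means)   # effect of condition
ss_subj = k * sum((m - grand) ** 2 for m in subj_means)   # between-subject variation
ss_error = ss_total - ss_cond - ss_subj                   # what is left over

df_cond = k - 1
df_error = (k - 1) * (n - 1)
f_cond = (ss_cond / df_cond) / (ss_error / df_error)

print("SS condition:", ss_cond)
print("SS subjects :", ss_subj)
print("SS error    :", ss_error)
print("F (condition) =", f_cond)
```

Because `ss_subj` is subtracted before forming the error term, consistent differences between patients do not hide the effect of the condition.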

In [35]:
!pip install statsmodels



In [36]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [37]:
data = pd.read_csv(r"IT_salaries.csv")
data.head()

Unnamed: 0,S,X,E,M
0,13876,1,1,1
1,11608,1,3,0
2,18701,1,3,1
3,11283,1,2,0
4,11767,1,3,0


In [38]:
# FORMULA FOR OLS
# S, E, M and X are column names in the DataFrame.
# S: the dependent variable (also called the response or outcome variable).
# ~: separates the dependent variable (S) from the independent variables (predictors).
# C(E): treat E as a categorical variable.
# C(M): treat M as a categorical variable.
# X: treat X as a numerical (continuous) variable.
formula = 'S ~ C(E) + C(M) + X'
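
Treating E as categorical means the model does not use the numbers 1, 2, 3 directly. Roughly, each level other than a reference level gets its own 0/1 indicator column. A small sketch of that idea using the E values from the first five rows shown earlier (assuming the lowest level is the reference; the column labels are written in the style statsmodels uses for such terms):

```python
# E values from the first five rows of the data shown earlier.
e_values = [1, 3, 3, 2, 3]

levels = sorted(set(e_values))             # distinct categories
reference, others = levels[0], levels[1:]  # lowest level assumed as reference

# One 0/1 indicator column per non-reference level.
dummies = [
    {f"C(E)[T.{lvl}]": int(e == lvl) for lvl in others}
    for e in e_values
]

for e, row in zip(e_values, dummies):
    print(e, row)
```

A row with E equal to the reference level has zeros in every indicator column, so the other levels are measured relative to it.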

In [39]:
# ols stands for ordinary least squares; ols(formula, data) sets up the model.
# .fit() estimates the model: it finds the coefficients (weights) that best predict S.
lm = ols(formula, data=data).fit()

In [40]:
# model summary
lm.summary()

0,1,2,3
Dep. Variable:,S,R-squared:,0.957
Model:,OLS,Adj. R-squared:,0.953
Method:,Least Squares,F-statistic:,226.8
Date:,"Wed, 15 Oct 2025",Prob (F-statistic):,2.23e-27
Time:,12:09:56,Log-Likelihood:,-381.63
No. Observations:,46,AIC:,773.3
Df Residuals:,41,BIC:,782.4
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,8035.5976,386.689,20.781,0.000,7254.663,8816.532
C(E)[T.2],3144.0352,361.968,8.686,0.000,2413.025,3875.045
C(E)[T.3],2996.2103,411.753,7.277,0.000,2164.659,3827.762
C(M)[T.1],6883.5310,313.919,21.928,0.000,6249.559,7517.503
X,546.1840,30.519,17.896,0.000,484.549,607.819

0,1,2,3
Omnibus:,2.293,Durbin-Watson:,2.237
Prob(Omnibus):,0.318,Jarque-Bera (JB):,1.362
Skew:,-0.077,Prob(JB):,0.506
Kurtosis:,2.171,Cond. No.,33.5
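
The coefficient table can be read as an additive prediction rule: start from the intercept, add the coefficient of whichever E and M level applies, and add the X coefficient times X. A minimal sketch using the printed coefficients (the helper name `predict_s` is hypothetical, and the input is row 0 of the data shown earlier):

```python
# Coefficients copied from the summary table above.
coef = {
    "Intercept": 8035.5976,
    "C(E)[T.2]": 3144.0352,
    "C(E)[T.3]": 2996.2103,
    "C(M)[T.1]": 6883.5310,
    "X": 546.1840,
}


def predict_s(x, e, m):
    """Predicted S for given X, E and M, using the fitted coefficients."""
    return (coef["Intercept"]
            + coef["C(E)[T.2]"] * (e == 2)   # added only when E is 2
            + coef["C(E)[T.3]"] * (e == 3)   # added only when E is 3
            + coef["C(M)[T.1]"] * (m == 1)   # added only when M is 1
            + coef["X"] * x)


# Row 0 of the data: S = 13876, X = 1, E = 1, M = 1.
pred = predict_s(x=1, e=1, m=1)
print("predicted:", round(pred, 2), "residual:", round(13876 - pred, 2))
```

E = 1 and M = 0 have no coefficient of their own because they are the reference levels absorbed into the intercept.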


In [41]:
# lm is the linear model fitted earlier
# typ=2 requests Type II sums of squares: each term is tested after adjusting
# for all other terms that do not contain it (so not for its own interactions).
table = sm.stats.anova_lm(lm, typ=2)
table

Unnamed: 0,sum_sq,df,F,PR(>F)
C(E),91526240.0,2.0,43.351589,7.67245e-11
C(M),507572400.0,1.0,480.825394,2.901444e-24
X,338097900.0,1.0,320.281524,5.546313e-21
Residual,43280720.0,41.0,,
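
Each F value in the table is the term's mean square (sum_sq divided by df) over the residual mean square. A short check that reproduces the F column from the printed numbers, up to the rounding in the displayed sums of squares:

```python
# (sum_sq, df) copied from the ANOVA table above.
anova_rows = {
    "C(E)": (91526240.0, 2.0),
    "C(M)": (507572400.0, 1.0),
    "X": (338097900.0, 1.0),
}
resid_ss, resid_df = 43280720.0, 41.0

ms_resid = resid_ss / resid_df  # residual mean square

f_values = {}
for term, (sum_sq, df) in anova_rows.items():
    f_values[term] = (sum_sq / df) / ms_resid
    print(f"{term}: F = {f_values[term]:.3f}")
```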


# Question 1
Scenario:
 A company ran three online ad campaigns — A, B, and C.
 You have conversion rates (%) from 10 cities for each campaign.
Task:
 Write Python code to check if the mean conversion rates differ significantly among the campaigns using one-way ANOVA.
Sample data:
campaign_A = [12, 15, 14, 10, 13, 15, 11, 14, 13, 16]
campaign_B = [18, 17, 16, 15, 20, 19, 18, 16, 17, 19]
campaign_C = [10, 9, 11, 10, 12, 9, 11, 8, 10, 9]

In [42]:
# Sample data
# The F-statistic is the ratio of between-group variance to within-group variance.
from scipy.stats import f_oneway

campaign_A = [12, 15, 14, 10, 13, 15, 11, 14, 13, 16]
campaign_B = [18, 17, 16, 15, 20, 19, 18, 16, 17, 19]
campaign_C = [10, 9, 11, 10, 12, 9, 11, 8, 10, 9]

f_stats, p_value = f_oneway(campaign_A, campaign_B, campaign_C)
print(f_stats, p_value)
alpha = 0.05

# Reject the null hypothesis (all campaign means are equal) when p < alpha.
if p_value < alpha:
    print("reject the null hypothesis")
else:
    print("fail to reject null hypothesis")

57.973333333333386 1.692627434604139e-10
reject the null hypothesis
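
The F-statistic printed above can be reproduced without scipy, which makes the "between-group over within-group" ratio concrete. A minimal sketch (the helper name `one_way_f` is made up for this note; it returns only the F-statistic, not the p-value):

```python
from statistics import mean


def one_way_f(*groups):
    """One-way ANOVA F-statistic: between-group MS / within-group MS."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


campaign_A = [12, 15, 14, 10, 13, 15, 11, 14, 13, 16]
campaign_B = [18, 17, 16, 15, 20, 19, 18, 16, 17, 19]
campaign_C = [10, 9, 11, 10, 12, 9, 11, 8, 10, 9]

f_manual = one_way_f(campaign_A, campaign_B, campaign_C)
print(f_manual)
```

The result matches the value from f_oneway: the campaign means are far apart compared with the spread inside each campaign.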


# Question 2
Fertilizer Type vs Crop Yield
Scenario:
 Three fertilizers (A, B, and C) are used on crops.
 You record the yield (in kg) from 8 plots for each fertilizer.
Task:
 Perform one-way ANOVA in Python to test if fertilizer type affects crop yield.
fertilizer_A = [25, 27, 26, 30, 29, 28, 30, 27]
fertilizer_B = [32, 35, 34, 33, 36, 34, 35, 32]
fertilizer_C = [22, 20, 24, 23, 25, 21, 22, 23]
Question:
 Which fertilizer(s) seem to produce significantly different yields?


In [43]:
from scipy.stats import f_oneway

fertilizer_A = [25, 27, 26, 30, 29, 28, 30, 27]
fertilizer_B = [32, 35, 34, 33, 36, 34, 35, 32]
fertilizer_C = [22, 20, 24, 23, 25, 21, 22, 23]

f_stats2, p_value2 = f_oneway(fertilizer_A, fertilizer_B, fertilizer_C)
print(f_stats2, p_value2)
alpha = 0.05
alternative_hypothesis = "the mean differs significantly between fertilizers"
null_hypothesis = "the mean does not differ significantly between fertilizers"

if p_value2 < alpha:
    print(f"based on the current evidence {alternative_hypothesis}")
else:
    print(f"based on the current evidence {null_hypothesis}")

96.58758314855855 2.5717391158818948e-11
based on the current evidence the mean differs significantly between fertilizers


In [44]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Combine data and labels
yield_data = fertilizer_A + fertilizer_B + fertilizer_C
fertilizers = ['A']*len(fertilizer_A) + ['B']*len(fertilizer_B) + ['C']*len(fertilizer_C)

df = pd.DataFrame({'fertilizer': fertilizers, 'yield': yield_data})

# Tukey HSD test
tukey = pairwise_tukeyhsd(endog=df['yield'], groups=df['fertilizer'], alpha=0.05)
print(tukey)


Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj  lower    upper  reject
----------------------------------------------------
     A      B    6.125   0.0   4.0601  8.1899   True
     A      C    -5.25   0.0  -7.3149 -3.1851   True
     B      C  -11.375   0.0 -13.4399 -9.3101   True
----------------------------------------------------
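
To read the Tukey table, note that meandiff is the mean of group2 minus the mean of group1. A quick check of that column from the raw fertilizer data (plain Python, no Tukey adjustment is computed here):

```python
from itertools import combinations
from statistics import mean

groups = {
    "A": [25, 27, 26, 30, 29, 28, 30, 27],
    "B": [32, 35, 34, 33, 36, 34, 35, 32],
    "C": [22, 20, 24, 23, 25, 21, 22, 23],
}

# meandiff for every pair, in the same order as the Tukey table.
mean_diffs = {}
for g1, g2 in combinations(groups, 2):
    mean_diffs[(g1, g2)] = mean(groups[g2]) - mean(groups[g1])
    print(g1, g2, mean_diffs[(g1, g2)])
```

All three pairs have reject = True in the table, so every fertilizer differs from the other two, with B giving the highest mean yield and C the lowest.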


# Question 3
Teaching Method vs Student Performance
Scenario:
 Students were taught using three different teaching methods (Lecture, Group Discussion, and Project-based).
 You have their exam scores.
Task:
 Use Python to run ANOVA and check if the teaching method affects scores.
Question:
 Does the teaching method have a statistically significant effect on student performance?


In [45]:
data = {
    'score': [78, 85, 80, 90, 88, 82, 84, 86, 92],
    'method': ['Lecture', 'Lecture', 'Lecture',
               'Discussion', 'Discussion', 'Discussion',
               'Project', 'Project', 'Project']
}


In [46]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
df = pd.DataFrame(data)
df.head(10)

Unnamed: 0,score,method
0,78,Lecture
1,85,Lecture
2,80,Lecture
3,90,Discussion
4,88,Discussion
5,82,Discussion
6,84,Project
7,86,Project
8,92,Project


In [47]:
formula1 = 'score ~ C(method)'

In [48]:
lm2 = ols(formula1, df).fit()
lm2

<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x2e72e33a990>

In [49]:
table1 = sm.stats.anova_lm(lm2,typ=2)
table1

Unnamed: 0,sum_sq,df,F,PR(>F)
C(method),72.666667,2.0,2.286713,0.182729
Residual,95.333333,6.0,,


In [50]:
alpha = 0.05
# .iloc[0] selects the first row (the C(method) term) by position
p_value = table1['PR(>F)'].iloc[0]

if p_value < alpha:
    print("based on the current evidence the teaching method has a statistically significant effect on student performance.")
else:
    print("based on the current evidence the teaching method does not have a statistically significant effect on student performance.")


based on the current evidence the teaching method does not have a statistically significant effect on student performance.




# Question 4
Machine Type vs Production Quality
Scenario:
 Three different machines produce the same product.
 You record quality scores for items from each machine.
Task:
 Run ANOVA in Python to check if the mean quality differs among machines.
Question:
 Which machine produces items with significantly different average quality?


In [51]:
machine_A = [88, 85, 87, 90, 89, 88, 86, 87]
machine_B = [82, 84, 83, 81, 80, 83, 82, 84]
machine_C = [91, 90, 92, 93, 94, 92, 91, 93]

f_stats3, p_value3 = f_oneway(machine_A, machine_B, machine_C)
print(f_stats3, p_value3)
alpha = 0.05
null_hypothesis = "the mean doesn't differ significantly between machines"
alternative_hypothesis = "the mean differs significantly between machines"

if p_value3 < alpha:
    print(f"based on the current evidence {alternative_hypothesis}")
else:
    print(f"based on the current evidence {null_hypothesis}")

88.80626780626768 5.678487937229093e-11
based on the current evidence the mean differs significantly between machines


In [52]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Combine data and labels
scores_data = machine_A + machine_B + machine_C
machines = ['A']*len(machine_A) + ['B']*len(machine_B) + ['C']*len(machine_C)

df = pd.DataFrame({'machine': machines, 'scores': scores_data})

# Tukey HSD test
tukey = pairwise_tukeyhsd(endog=df['scores'], groups=df['machine'], alpha=0.05)
print(tukey)


Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj  lower   upper  reject
---------------------------------------------------
     A      B   -5.125   0.0 -6.9467 -3.3033   True
     A      C      4.5   0.0  2.6783  6.3217   True
     B      C    9.625   0.0  7.8033 11.4467   True
---------------------------------------------------


# Question 5
Diet Plan vs Weight Loss
Scenario:
 Four diet plans (Keto, Paleo, Vegan, and Mediterranean) were followed for 8 weeks.
 The weight loss (in kg) of participants was recorded.
Task:
 Use ANOVA in Python to test if diet type affects weight loss.
Question:
 Is there a significant difference in mean weight loss among the four diets?


In [53]:
data = {
    'diet': ['Keto']*6 + ['Paleo']*6 + ['Vegan']*6 + ['Mediterranean']*6,
    'weight_loss': [6, 5, 7, 8, 6, 7, 5, 6, 4, 5, 6, 5, 3, 4, 2, 4, 3, 2, 5, 6, 7, 6, 8, 7]
}

df3 = pd.DataFrame(data)
df3.head()

Unnamed: 0,diet,weight_loss
0,Keto,6
1,Keto,5
2,Keto,7
3,Keto,8
4,Keto,6


In [54]:
formula3 = 'weight_loss ~ C(diet)'

In [55]:
lm3 = ols(formula3, df3).fit()
lm3

<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x2e72e78f710>

In [56]:
table3 = sm.stats.anova_lm(lm3,typ=2)
table3

Unnamed: 0,sum_sq,df,F,PR(>F)
C(diet),49.125,3.0,18.364486,6e-06
Residual,17.833333,20.0,,


In [57]:
alpha = 0.05
# .iloc[0] selects the first row (the C(diet) term) by position
p_value = table3['PR(>F)'].iloc[0]

if p_value < alpha:
    print("based on the current evidence there is a significant difference in mean weight loss among the four diets.")
else:
    print("based on the current evidence there is no significant difference in mean weight loss among the four diets.")


based on the current evidence there is a significant difference in mean weight loss among the four diets.
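
The ANOVA result says that at least one diet differs, but not which one. Looking at the group means and the degrees of freedom helps to read the table. A small sketch in plain Python using the same data as the cell above:

```python
from collections import defaultdict
from statistics import mean

diets = ['Keto'] * 6 + ['Paleo'] * 6 + ['Vegan'] * 6 + ['Mediterranean'] * 6
weight_loss = [6, 5, 7, 8, 6, 7, 5, 6, 4, 5, 6, 5,
               3, 4, 2, 4, 3, 2, 5, 6, 7, 6, 8, 7]

# Group the observations by diet.
by_diet = defaultdict(list)
for diet, loss in zip(diets, weight_loss):
    by_diet[diet].append(loss)

means = {diet: mean(values) for diet, values in by_diet.items()}
for diet, m in means.items():
    print(f"{diet}: mean weight loss = {m:.2f} kg")

# Degrees of freedom as they appear in the ANOVA table.
df_between = len(by_diet) - 1             # number of groups - 1
df_within = len(weight_loss) - len(by_diet)  # observations - number of groups
print(df_between, df_within)
```

The degrees of freedom match the df column (3 and 20). The Vegan group has the lowest mean, but a post-hoc test such as Tukey HSD would be needed to say which pairs differ significantly.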




# WEDNESDAY 15TH OCTOBER, 2025

In [58]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [59]:
house_prices = pd.read_csv(r"house_prices_dataset.csv")
house_prices.head()

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,year_built,year_renovated,zipcode,lat,long,price
0,3,1,2803,1153,1,0,0,3,9,2003,0,98001,47.660671,-121.785347,835016.23
1,5,3,783,9762,1,0,2,4,4,2019,0,98003,47.681937,-122.151515,326073.09
2,3,3,3412,2842,2,0,3,4,9,1961,1962,98002,47.182798,-121.792089,1025404.09
3,2,3,2222,9020,2,1,2,4,4,2008,1962,98004,47.440995,-121.871224,758764.79
4,4,1,4713,1584,2,0,3,4,5,1987,0,98005,47.701306,-122.416623,1314784.23


In [None]:
# bedrooms, bathrooms and floors are discrete numerical columns
# waterfront, view, condition and grade are treated as categorical columns
house_prices.columns

Index(['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'year_built',
       'year_renovated', 'price'],
      dtype='object')

In [61]:
house_prices = house_prices.drop(columns=["zipcode", "lat", "long"])

In [70]:
house_prices['bedrooms'].value_counts()

bedrooms
4    224
2    206
1    203
5    188
3    179
Name: count, dtype: int64

In [None]:
house_prices['bathrooms'].value_counts()

bathrooms
3    355
2    328
1    317
Name: count, dtype: int64

In [66]:
house_prices['waterfront'].value_counts()

waterfront
0    893
1    107
Name: count, dtype: int64

In [67]:
house_prices['grade'].value_counts()

grade
12    111
1      95
3      91
10     90
8      86
4      85
5      80
9      80
6      76
11     75
7      72
2      59
Name: count, dtype: int64

In [64]:
house_prices['view'].value_counts()


view
1    265
0    259
3    250
2    226
Name: count, dtype: int64

In [63]:
house_prices.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   bedrooms        1000 non-null   int64  
 1   bathrooms       1000 non-null   int64  
 2   sqft_living     1000 non-null   int64  
 3   sqft_lot        1000 non-null   int64  
 4   floors          1000 non-null   int64  
 5   waterfront      1000 non-null   int64  
 6   view            1000 non-null   int64  
 7   condition       1000 non-null   int64  
 8   grade           1000 non-null   int64  
 9   year_built      1000 non-null   int64  
 10  year_renovated  1000 non-null   int64  
 11  price           1000 non-null   float64
dtypes: float64(1), int64(11)
memory usage: 93.9 KB


In [72]:
house_prices.columns

Index(['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'year_built',
       'year_renovated', 'price'],
      dtype='object')

In [78]:
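# typ selects the type of sums of squares, not the number of factors:
#   typ=2 (Type II): each term is tested after adjusting for the other terms that do not contain it
#   typ=3 (Type III): each term is tested after adjusting for every other term, including interactions
# ols(formula, data) sets up the model from a formula and a DataFrame; .fit() estimates its coefficients
# anova_lm takes the fitted model and the type of sums of squares and returns the ANOVA table
# C(...) marks a column as categorical; the other predictors are treated as numerical
formula = 'price ~ bedrooms + floors + bathrooms + sqft_living + year_built + year_renovated + C(waterfront) + C(condition) + C(view) + C(grade)'
lm = ols(formula, house_prices).fit()
table4 = sm.stats.anova_lm(lm, typ=2)
table4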

Unnamed: 0,sum_sq,df,F,PR(>F)
C(waterfront),917460400000.0,1.0,2415.951849,5.068573e-266
C(condition),203589000000.0,4.0,134.027892,3.098076e-91
C(view),18884900000.0,3.0,16.576555,1.645176e-10
C(grade),762397500000.0,11.0,182.511272,9.652991e-228
bedrooms,472111900000.0,1.0,1243.21395,3.570604e-176
floors,30295550.0,1.0,0.079777,0.7776593
bathrooms,61449770000.0,1.0,161.815886,2.09556e-34
sqft_living,101616800000000.0,1.0,267587.977203,0.0
year_built,8065949.0,1.0,0.02124,0.884157
year_renovated,86558700.0,1.0,0.227935,0.6331661
