ANOVA (Analysis of Variance) is a statistical technique for testing whether the means of three or more groups differ significantly. The null hypothesis is that all group means are equal; a small p-value indicates that at least one group mean differs from the others. In statistics and machine learning, it is commonly used for feature selection and for understanding the relationship between categorical independent variables and a continuous dependent variable.
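
For reference, the one-way ANOVA test statistic can be written as follows. The notation is an assumption introduced here, not taken from the text above: $k$ groups, $n_i$ observations in group $i$, $N$ observations in total, group means $\bar{x}_i$, and grand mean $\bar{x}$.

$$
F = \frac{\mathrm{SS}_{\text{between}} / (k - 1)}{\mathrm{SS}_{\text{within}} / (N - k)}, \qquad
\mathrm{SS}_{\text{between}} = \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2, \qquad
\mathrm{SS}_{\text{within}} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2
$$

A large $F$ means the group means are spread out relative to the variation inside each group.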

### example 1

A researcher wants to test whether the average exam scores of students differ based on the type of teaching method: Traditional, Online, or Hybrid.

In [6]:
import pandas as pd
from scipy.stats import f_oneway
data = {
    'Teaching_Method': ['Traditional', 'Traditional', 'Traditional',
                        'Online', 'Online', 'Online',
                        'Hybrid', 'Hybrid', 'Hybrid'],
    'Exam_Score': [75, 78, 72, 88, 84, 82, 80, 85, 79]
}

df = pd.DataFrame(data)
df

  Teaching_Method  Exam_Score
0     Traditional          75
1     Traditional          78
2     Traditional          72
3          Online          88
4          Online          84
5          Online          82
6          Hybrid          80
7          Hybrid          85
8          Hybrid          79


In [10]:
# Split data by groups
traditional_score = df[df["Teaching_Method"] == "Traditional"]["Exam_Score"]
online_score = df[df["Teaching_Method"] == "Online"]["Exam_Score"]
hybrid_score = df[df["Teaching_Method"] == "Hybrid"]["Exam_Score"]

# Perform one-way ANOVA test
f_stat, p_value = f_oneway(traditional_score, online_score, hybrid_score)
print("F-statistic:", f_stat)
print("p-value:", p_value)

# Conclusion at the 0.05 significance level
if p_value < 0.05:
    print("Reject the null hypothesis: At least one group mean is different.")
else:
    print("Fail to reject the null hypothesis: No significant difference between group means was found.")


F-statistic: 7.5697674418604635
p-value: 0.022864803227046895
Reject the null hypothesis: At least one group mean is different.
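
To make the F-statistic less of a black box, the following minimal sketch recomputes it by hand from the exam scores of this example, using only the Python standard library. The helper name `one_way_f` and the `groups` dictionary are illustrative assumptions, not part of SciPy.

```python
# Exam scores per teaching method, copied from Example 1.
groups = {
    "Traditional": [75, 78, 72],
    "Online": [88, 84, 82],
    "Hybrid": [80, 85, 79],
}


def one_way_f(samples):
    """Return the one-way ANOVA F-statistic for a list of samples."""
    n_total = sum(len(s) for s in samples)
    k = len(samples)
    grand_mean = sum(sum(s) for s in samples) / n_total
    means = [sum(s) / len(s) for s in samples]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(s) * (m - grand_mean) ** 2
                     for s, m in zip(samples, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for s, m in zip(samples, means) for x in s)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


f_manual = one_way_f(list(groups.values()))
print(f"F = {f_manual:.4f}")  # F = 7.5698
```

The hand-computed value agrees with the statistic returned by `f_oneway` in the cell above; the p-value additionally requires the F distribution with (2, 6) degrees of freedom, which is left to SciPy.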


### example 2

In machine learning, ANOVA is often used for feature selection. For example, in a dataset with multiple categorical features, ANOVA can help determine which features have a significant impact on a continuous target variable.

Consider a dataset where we want to predict house prices based on categorical variables like "House Style," "Roof Type," and "Exterior Material."

In [52]:
from sklearn.feature_selection import f_regression
import pandas as pd
import numpy as np

# Example dataset
data = {
    'House_Style': ['1Story', '2Story', '1Story', '2Story', '1.5Fin', '1.5Fin', '1Story'],
    'Roof_Type': ['Gable', 'Hip', 'Gable', 'Hip', 'Flat', 'Gable', 'Hip'],
    'Price': [200000, 250000, 210000, 260000, 180000, 190000, 230000]
}
df=pd.DataFrame(data)
df

  House_Style  Roof_Type   Price
0      1Story      Gable  200000
1      2Story        Hip  250000
2      1Story      Gable  210000
3      2Story        Hip  260000
4      1.5Fin       Flat  180000
5      1.5Fin      Gable  190000
6      1Story        Hip  230000


In [54]:
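# One-hot encode the categorical columns.
# drop_first=True drops the first level of each feature (1.5Fin, Flat) as the baseline.
encoded_df = pd.get_dummies(df, drop_first=True)
encoded_df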

    Price  House_Style_1Story  House_Style_2Story  Roof_Type_Gable  Roof_Type_Hip
0  200000                True               False             True          False
1  250000               False                True            False           True
2  210000                True               False             True          False
3  260000               False                True            False           True
4  180000               False               False            False          False
5  190000               False               False             True          False
6  230000                True               False            False           True


In [56]:
# extract features and target variables
X=encoded_df.drop("Price",axis=1)
y=encoded_df['Price']


In [66]:
# Univariate F-test of each dummy column against Price.
# For a binary column this is equivalent to a two-group one-way ANOVA.
f_stat, p_values = f_regression(X, y)

# Output results as a table
results = pd.DataFrame({'Feature': X.columns, 'F-Statistic': f_stat, 'P-Value': p_values})
print(results)

# Print the p-value of each feature individually
print()
print("P-values by feature:")
for feature, p_val in zip(X.columns, p_values):
    print(f"Feature: {feature}, P-Value: {p_val}")


              Feature  F-Statistic   P-Value
0  House_Style_1Story     0.069686  0.802329
1  House_Style_2Story    13.113912  0.015199
2     Roof_Type_Gable     1.928571  0.223598
3       Roof_Type_Hip    23.669951  0.004612

P-values by feature:
Feature: House_Style_1Story, P-Value: 0.8023287915872906
Feature: House_Style_2Story, P-Value: 0.015198540225659942
Feature: Roof_Type_Gable, P-Value: 0.22359775328347115
Feature: Roof_Type_Hip, P-Value: 0.004612244656160908


In [68]:
# Roof_Type_Hip and House_Style_2Story have p-values below the 0.05 significance level,
# so these two dummy features are significantly associated with Price.
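
Because `f_regression` is a regression F-test rather than a classical ANOVA routine, it may help to see why the two agree for a 0/1 dummy column. The sketch below assumes that `f_regression` derives its score from the Pearson correlation r between a column and the target as F = r²/(1 − r²) · (n − 2), and compares that with a two-group one-way ANOVA on the same data. All variable names are illustrative, and only the standard library is used.

```python
# Prices and the Roof_Type_Hip dummy column, copied from Example 2.
price = [200000, 250000, 210000, 260000, 180000, 190000, 230000]
is_hip = [0, 1, 0, 1, 0, 0, 1]
n = len(price)


def mean(values):
    return sum(values) / len(values)


# Regression-style F-statistic from the squared Pearson correlation
# (assumed to match what f_regression computes for a single column).
mx, my = mean(is_hip), mean(price)
sxy = sum((x - mx) * (y - my) for x, y in zip(is_hip, price))
sxx = sum((x - mx) ** 2 for x in is_hip)
syy = sum((y - my) ** 2 for y in price)
r2 = sxy ** 2 / (sxx * syy)
f_reg = r2 / (1 - r2) * (n - 2)

# One-way ANOVA F-statistic for the two groups defined by the dummy.
hip = [p for p, h in zip(price, is_hip) if h == 1]
other = [p for p, h in zip(price, is_hip) if h == 0]
ss_between = sum(len(g) * (mean(g) - my) ** 2 for g in (hip, other))
ss_within = sum((p - mean(g)) ** 2 for g in (hip, other) for p in g)
f_anova = (ss_between / (2 - 1)) / (ss_within / (n - 2))

print(f"regression F = {f_reg:.4f}, ANOVA F = {f_anova:.4f}")
```

Both values come out at about 23.67, the F-statistic reported for `Roof_Type_Hip` in the results table. Note that each dummy is tested on its own, so a feature with several levels is not tested jointly as it would be in a single multi-group ANOVA.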

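### example 3

The ANOVA F-test is not the only filter method for feature selection. When the target is categorical, mutual information is an alternative score that can also capture non-linear dependence between a feature and the target.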

In [89]:
from sklearn.feature_selection import mutual_info_classif
import pandas as pd

# Example dataset
data = {
    'Age': [25, 30, 35, 40, 45, 50, 55, 60],
    'Income': [40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000],
    'Education_Level': ['Bachelor', 'Master', 'PhD', 'High School', 'Master', 'PhD', 'Bachelor', 'High School'],
    'Defaulted': [0, 0, 0, 1, 1, 1, 0, 0]
}
df = pd.DataFrame(data)

# Encode categorical feature
df['Education_Level'] = df['Education_Level'].map({
    'High School': 0, 'Bachelor': 1, 'Master': 2, 'PhD': 3
})

# Features and target
X = df[['Age', 'Income', 'Education_Level']]
y = df['Defaulted']

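# Score each feature by its estimated mutual information with the target.
# The estimator is randomized, so scores can vary between runs unless random_state is set.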
mutual_info = mutual_info_classif(X, y)

# Display results
feature_scores = pd.DataFrame({'Feature': X.columns, 'Mutual Information': mutual_info})
print(feature_scores)


           Feature  Mutual Information
0              Age            0.000000
1           Income            0.030357
2  Education_Level            0.000000


### example 4

In [91]:
# Generate a larger synthetic dataset
from sklearn.datasets import make_classification

# Create dataset
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, 
    n_redundant=2, n_classes=2, random_state=42
)

# Convert to DataFrame
columns = [f'Feature_{i}' for i in range(1, 11)]
df = pd.DataFrame(X, columns=columns)
df['Target'] = y

# Perform Mutual Information Feature Selection
mutual_info = mutual_info_classif(df.iloc[:, :-1], df['Target'])

# Display results
feature_scores = pd.DataFrame({
    'Feature': columns,
    'Mutual Information': mutual_info
}).sort_values(by='Mutual Information', ascending=False)

print(feature_scores)


      Feature  Mutual Information
0   Feature_1            0.194123
1   Feature_2            0.129449
8   Feature_9            0.115916
3   Feature_4            0.104875
2   Feature_3            0.097245
5   Feature_6            0.087287
9  Feature_10            0.017613
4   Feature_5            0.016971
6   Feature_7            0.000000
7   Feature_8            0.000000
