# Why we use these tests in ML
1. In machine learning, feature selection is a key step. Not all features contribute equally to prediction.

2. Statistical tests like Chi-Square and ANOVA help us evaluate which features are most relevant to the target variable.

3. Chi-Square Test → checks independence between a categorical feature and a categorical target.

4. ANOVA Test → checks whether means of a continuous feature differ across multiple classes.

# Chi-Square → categorical (feature ↔ target).

# ANOVA → continuous features ↔ categorical target.

# Chi-Square Test (χ² Test of Independence)
1. It checks whether two categorical variables are independent or related.
   In ML, we use it to test whether a feature and the target variable are dependent
   (the feature is potentially informative) or independent (the feature is unlikely to help).

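2. $\chi^2 = \sum \frac{(O - E)^2}{E}$, summed over every cell of the contingency table.

3. O = Observed frequency (actual counts from data)
4. E = Expected frequency (the count expected if the feature and target were independent)

5. If χ² is large → observed counts differ a lot from expected → evidence that the variables are dependent.
   If χ² is small → observed counts are close to expected → no evidence against independence.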
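
To make the formula concrete, here is a minimal sketch that computes the expected counts and χ² by hand with NumPy. The 2×2 table of counts is hypothetical and chosen only so the arithmetic is easy to follow; the variable names are illustrative.

```python
import numpy as np

# Hypothetical observed counts: rows = feature levels, columns = target classes.
observed = np.array([[20, 30],
                     [30, 20]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
n = observed.sum()

# Expected count under independence: (row total * column total) / grand total.
expected = row_totals * col_totals / n

# Chi-square statistic: sum of (O - E)^2 / E over all cells.
chi2_stat = ((observed - expected) ** 2 / expected).sum()

print(expected)
print(chi2_stat)
```

Every expected count here is 25 (50 × 50 / 100), each cell deviates by 5, and so each cell contributes 25 / 25 = 1 to the statistic.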

📌 Requirements

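Both variables should be categorical.
(For a dataset like Iris, the features are continuous → we discretize them into bins first. The Titanic features used in the code below, `sex`, `pclass` and `embarked`, are already categorical.)

The target variable must be categorical (Iris species are categorical: Setosa, Versicolor, Virginica; Titanic `survived` is 0/1).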
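
As an illustration of the binning step, the sketch below discretizes one continuous Iris feature and tests it against the species label. It assumes scikit-learn's bundled `load_iris` dataset and SciPy's `chi2_contingency`; the choice of three equal-width bins and the bin labels are arbitrary assumptions for the example, not a recommendation.

```python
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df_iris = iris.frame  # four measurements plus an integer 'target' column

# Discretize a continuous feature into three equal-width bins (arbitrary choice).
petal_bins = pd.cut(
    df_iris["petal length (cm)"],
    bins=3,
    labels=["short", "medium", "long"],
)

# Contingency table of observed counts: bin x species.
table = pd.crosstab(petal_bins, df_iris["target"])

stat, p, dof, expected = chi2_contingency(table.to_numpy())
print(table)
print(f"chi2={stat:.2f}, dof={dof}, p={p:.3g}")
```

The result depends on how the bins are chosen, so it is worth trying more than one binning before drawing conclusions.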

Null Hypothesis $H_0$: Feature and target are independent.

Alternative Hypothesis $H_1$: Feature and target are dependent.

6. p-value < 0.05 → Reject $H_0$: the feature is significantly related to the target.

   p-value ≥ 0.05 → Fail to reject $H_0$: the feature may not be useful.
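
The decision rule can be sketched with SciPy's chi-square distribution. The statistic of 4.0 is a hypothetical value, and the degrees of freedom follow the usual (rows − 1) × (columns − 1) rule for a contingency table, which gives 1 for a 2×2 table.

```python
from scipy import stats

chi2_stat = 4.0   # hypothetical test statistic
dof = 1           # (rows - 1) * (columns - 1) for a 2x2 table
alpha = 0.05

# Survival function: P(X >= chi2_stat) under the chi-square distribution.
p_value = stats.chi2.sf(chi2_stat, dof)

# Smallest statistic that would be significant at level alpha.
critical_value = stats.chi2.ppf(1 - alpha, dof)

reject_h0 = p_value < alpha
print(f"p-value = {p_value:.4f}, critical value = {critical_value:.3f}, reject H0: {reject_h0}")
```

Comparing the p-value with 0.05 and comparing the statistic with the critical value are two views of the same test, so they always agree.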

# Code

In [13]:
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.feature_selection import chi2
from sklearn.preprocessing import LabelEncoder

In [14]:
titanic_data = fetch_openml('titanic', version=1, as_frame=True)

In [15]:
df = titanic_data.frame

In [16]:
df

,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,,C,,328.0,
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C,,,
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.2250,,C,,304.0,
1307,3,0,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.2250,,C,,,


In [21]:
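# .copy() gives an independent frame, so the column assignments further down
# do not trigger pandas' SettingWithCopyWarning.
categorical_df = df[['sex', 'pclass', 'embarked', 'survived']].dropna().copy()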

In [22]:
le = LabelEncoder()

In [24]:
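# Encode each categorical column as integer codes. The encoder is refit on
# every column, so `le` only remembers the classes of the last one.
for col in categorical_df.columns:
    categorical_df[col] = le.fit_transform(categorical_df[col])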

In [25]:
y = categorical_df['survived']
x = categorical_df.drop(columns=['survived'])

In [26]:
x.shape

(1307, 3)

In [27]:
y.shape

(1307,)

In [29]:
# chi2 returns one score and one p-value per feature column.
chi_score, p_value = chi2(x, y)

In [32]:
chi2_result = pd.DataFrame({
    'feature': x.columns,
    'chi2_score': chi_score,
    'p-value': p_value,
})

In [33]:
chi2_result

    feature  chi2_score       p-value
0       sex  129.090413  6.479844e-30
1    pclass   67.969990  1.660029e-16
2  embarked   18.147668  2.044196e-05
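
One caveat worth checking: scikit-learn's documentation describes `chi2` as expecting non-negative features such as booleans or frequencies, so with label-encoded columns the scores can depend on which integer each category happens to receive. A textbook test of independence works on the contingency table instead. The sketch below shows that route with `pd.crosstab` and SciPy's `chi2_contingency`; the `toy` frame and its counts are hypothetical stand-ins, not the Titanic data.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data standing in for one categorical feature and the target.
toy = pd.DataFrame({
    "sex": ["female"] * 40 + ["male"] * 60,
    "survived": [1] * 30 + [0] * 10 + [1] * 15 + [0] * 45,
})

# Observed counts for every (feature level, target class) pair.
observed = pd.crosstab(toy["sex"], toy["survived"])

# correction=False disables Yates' continuity correction, so the statistic
# matches the plain sum((O - E)^2 / E) formula.
stat, p, dof, expected = chi2_contingency(observed.to_numpy(), correction=False)

print(observed)
print(f"chi2={stat:.3f}, dof={dof}, p={p:.3g}")
```

Because the test uses category counts rather than category codes, relabeling the categories does not change the result.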


In [49]:
# Is the p-value for 'embarked' below the 0.05 threshold?
# (swap in 'sex' or 'pclass' to check the other features)
chi2_result.loc[chi2_result['feature'] == 'embarked', 'p-value'] < 0.05


2    True
Name: p-value, dtype: bool

In [71]:
# Pair every feature with its p-value; the significance filter is applied in a later cell.
selected_feature = list(zip(chi2_result['feature'], chi2_result['p-value']))

In [72]:
selected_feature

[('sex', 6.479844476671927e-30),
 ('pclass', 1.6600287122657094e-16),
 ('embarked', 2.0441962210904808e-05)]

In [74]:
for feature, p in selected_feature:
    if p < 0.05:
        print(f'Significant feature: {feature}')

Significant feature: sex
Significant feature: pclass
Significant feature: embarked
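
The same filter can be written without an explicit loop. This is a minimal sketch that rebuilds `chi2_result` from the scores and p-values reported in the table above so that it runs on its own; `alpha` and `significant` are illustrative names.

```python
import pandas as pd

# Scores and p-values copied from the chi2_result table reported above.
chi2_result = pd.DataFrame({
    "feature": ["sex", "pclass", "embarked"],
    "chi2_score": [129.090413, 67.96999, 18.147668],
    "p-value": [6.479844e-30, 1.660029e-16, 2.044196e-05],
})

alpha = 0.05

# Boolean mask on the p-value column, then pull out the matching feature names.
significant = chi2_result.loc[chi2_result["p-value"] < alpha, "feature"].tolist()
print(significant)
```

All three features pass the 0.05 threshold here, which matches the loop output.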
