# ✅ Chi-Square Feature Selection (Categorical Data)

This method is **specifically designed for categorical features + categorical target**.

If your data looks like:

* encoded categorical features (0/1)
* target = 0/1

👉 **Chi-square is PERFECT**.

---

## 🧠 Core intuition (very simple)

Chi-square checks:

> “Is this feature actually related to the target
> or are they independent?”

If feature and target are **independent** → ❌ useless
If they are **dependent** → ✅ useful

---

## 🧩 Real-life example

Imagine a spam dataset:

| has_link | spam |
| -------- | ---- |
| 1        | 1    |
| 1        | 1    |
| 1        | 1    |
| 0        | 0    |
| 0        | 0    |

Strong relationship → very useful feature.

But:

| has_emoji | spam |
| --------- | ---- |
| 0         | 1    |
| 0         | 0    |
| 0         | 1    |
| 0         | 0    |

`has_emoji` never changes, so it tells us nothing about `spam`. No relation → useless.

---

## ⚠️ Very important rule (INTERVIEW FAVORITE)

> Chi-square works only with **non-negative values**

scikit-learn's `chi2` treats feature values as counts, and counts cannot be negative.
That’s why we apply it **after one-hot encoding**.

Binary values (0/1) → perfect.
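
To see the rule in action, here is a small sketch with made-up arrays (`X_binary` and `X_signed` are illustrative names, not part of the dataset used later). It assumes scikit-learn is installed and shows that `chi2` accepts a 0/1 column but raises a `ValueError` when the same information is coded as -1/+1.

```python
import numpy as np
from sklearn.feature_selection import chi2

y = np.array([1, 0, 1, 0])

# One-hot style column: only 0/1, accepted as-is
X_binary = np.array([[1], [0], [1], [0]])
scores, p_values = chi2(X_binary, y)
print("binary column scored:", scores, p_values)

# Same information re-coded as -1/+1 (for example after centring): rejected
X_signed = np.array([[1], [-1], [1], [-1]])
try:
    chi2(X_signed, y)
    rejected = False
except ValueError as err:
    rejected = True
    print("chi2 refused the input:", err)
```

So if a scaler has produced negative values, chi-square has to run before that step (or on the unscaled columns).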

---

## 🧠 What chi-square actually measures

It compares:

* observed frequency
  vs
* expected frequency

If difference is large → feature matters.

You don’t need the math; intuition is enough.
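
If you do want to see the arithmetic once, here is a minimal sketch in plain Python. It applies the textbook Pearson statistic, the sum of `(observed - expected)² / expected` over every cell, to the `has_link` / `spam` table from the example above. Library implementations may organise the calculation differently, so treat this as an illustration of the idea and not as what any particular function does internally.

```python
from collections import Counter

has_link = [1, 1, 1, 0, 0]
spam = [1, 1, 1, 0, 0]

n = len(spam)
observed = Counter(zip(has_link, spam))  # count per (feature, target) cell
feature_totals = Counter(has_link)       # row totals
target_totals = Counter(spam)            # column totals

chi_square = 0.0
for f in feature_totals:
    for t in target_totals:
        # Count we would expect in this cell if feature and target were independent
        expected = feature_totals[f] * target_totals[t] / n
        chi_square += (observed[(f, t)] - expected) ** 2 / expected

print(round(chi_square, 3))  # 5.0
```

Every cell is far from its expected count, so the statistic is large for such a tiny table.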

---

## 🔥 Practical example (hands-on)

### Step 1: Dataset

```python
import pandas as pd

data = {
    "city": ["Delhi", "Mumbai", "Delhi", "Delhi", "Mumbai", "Delhi"],
    "gender": ["Male", "Female", "Male", "Male", "Female", "Male"],
    "device": ["Android", "Android", "iPhone", "Android", "Android", "iPhone"],
    "purchased": [1, 0, 1, 1, 0, 0]
}

df = pd.DataFrame(data)
```

---

### Step 2: One-hot encode

```python
# dtype=int forces 0/1 columns; recent pandas versions return True/False by default
df_encoded = pd.get_dummies(df, drop_first=False, dtype=int)

X = df_encoded.drop("purchased", axis=1)
y = df_encoded["purchased"]
```

Now everything is numeric (0/1).

---

### Step 3: Apply Chi-Square

```python
from sklearn.feature_selection import chi2

chi_scores, p_values = chi2(X, y)
```

---

### Step 4: See results clearly

```python
chi_df = pd.DataFrame({
    "feature": X.columns,
    "chi_score": chi_scores,
    "p_value": p_values
}).sort_values("chi_score", ascending=False)

chi_df
```

---

## 🧠 How to interpret output

### Chi-score:

* higher = stronger relationship

### p-value:

* < 0.05 → statistically significant (likely important)
* ≥ 0.05 → not significant (likely useless)

Example (illustrative values, not the output of the code above):

| feature       | chi_score | p_value |
| ------------- | --------- | ------- |
| device_iPhone | 6.21      | 0.01 ✅  |
| city_Mumbai   | 0.02      | 0.88 ❌  |

---

## 🔥 Feature selection using SelectKBest

Instead of manually checking:

```python
from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(score_func=chi2, k=3)

X_selected = selector.fit_transform(X, y)

selected_features = X.columns[selector.get_support()]
selected_features
```

Boom 💥
Top K categorical features selected.
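
To check *why* those K features were picked, the fitted selector can be inspected. This is a small self-contained sketch on the same toy dataset; `report` is just an illustrative name. It uses the `scores_` and `pvalues_` attributes that `SelectKBest` exposes after fitting.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi", "Delhi", "Mumbai", "Delhi"],
    "gender": ["Male", "Female", "Male", "Male", "Female", "Male"],
    "device": ["Android", "Android", "iPhone", "Android", "Android", "iPhone"],
    "purchased": [1, 0, 1, 1, 0, 0],
})
X = pd.get_dummies(df.drop(columns="purchased"), dtype=int)
y = df["purchased"]

selector = SelectKBest(score_func=chi2, k=3).fit(X, y)

# One row per encoded feature: its score, p-value, and whether it survived
report = pd.DataFrame({
    "feature": X.columns,
    "chi_score": selector.scores_,
    "p_value": selector.pvalues_,
    "selected": selector.get_support(),
}).sort_values("chi_score", ascending=False)
print(report)
```

Note that `SelectKBest` always returns K features, whatever their p-values are. When several features tie on score, which one makes the cut is an implementation detail, so don't read meaning into it.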

---

## 🧠 Why Chi-Square is powerful

✅ perfect for categorical data
✅ measures real dependency
✅ fast
✅ interpretable
✅ commonly used in NLP
✅ widely asked in interviews

---

## ❌ When NOT to use Chi-Square

* continuous target (regression)
* negative feature values
* raw text
* scaled data with negatives

---

## 🎯 Interview-ready answer

> “Chi-square feature selection evaluates the statistical dependency between categorical features and a categorical target by comparing observed and expected frequencies.”

🔥 That line = strong impression.

---

## ⚡ Summary (crystal clear)

* categorical features → encode first
* target must be categorical
* values must be non-negative
* higher chi-score = more useful
* p-value confirms significance

---

## 🧠 You now know:

✅ Variance Threshold → removes useless columns
✅ Chi-Square → finds relevant categorical features

This is a **real feature selection pipeline**.

---

In [1]:
```python
import pandas as pd

data = {
    "city": ["Delhi", "Mumbai", "Delhi", "Delhi", "Mumbai", "Delhi"],
    "gender": ["Male", "Female", "Male", "Male", "Female", "Male"],
    "device": ["Android", "Android", "iPhone", "Android", "Android", "iPhone"],
    "purchased": [1, 0, 1, 1, 0, 0]
}

df = pd.DataFrame(data)
```

In [4]:
```python
df_encoded = pd.get_dummies(df, drop_first=False)
X = df_encoded.drop("purchased", axis=1)
y = df["purchased"]
```

In [5]:
```python
from sklearn.feature_selection import chi2

chi_score, p_values = chi2(X, y)
```

In [6]:
```python
chi_sqr = pd.DataFrame({
    "features": X.columns,
    "chi_score": chi_score,
    "p_value": p_values
})
```

In [None]:
```python
# Chi-score: higher = stronger relationship
# p-value:   < 0.05 → important
#            ≥ 0.05 → likely useless
chi_sqr
```

```
         features  chi_score   p_value
0      city_Delhi        1.0  0.317311
1     city_Mumbai        2.0  0.157299
2   gender_Female        2.0  0.157299
3     gender_Male        1.0  0.317311
4  device_Android        0.0  1.0
5   device_iPhone        0.0  1.0
```


## 📊 Your Chi-Square Output

| feature        | chi_score | p_value |
| -------------- | --------- | ------- |
| city_Delhi     | 1.0       | 0.317   |
| city_Mumbai    | 2.0       | 0.157   |
| gender_Female  | 2.0       | 0.157   |
| gender_Male    | 1.0       | 0.317   |
| device_Android | 0.0       | 1.000   |
| device_iPhone  | 0.0       | 1.000   |
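
Where do 1.0 and 2.0 come from? The sketch below recomputes the two city scores by hand in plain Python, in the way scikit-learn's `chi2` does it as far as I understand: for each class of the target, it compares how often the feature is 1 (observed) with the share of 1s that class would get if the feature were spread evenly (expected). The helper name `dummy_chi2` is made up for this illustration.

```python
purchased = [1, 0, 1, 1, 0, 0]
city_delhi = [1, 0, 1, 1, 0, 1]
city_mumbai = [0, 1, 0, 0, 1, 0]


def dummy_chi2(feature, target):
    """Chi-square score of one 0/1 column, counting only the rows where it is 1."""
    n = len(target)
    total_ones = sum(feature)
    score = 0.0
    for cls in sorted(set(target)):
        # How many 1s actually fall in this class
        observed = sum(f for f, t in zip(feature, target) if t == cls)
        # How many 1s this class would get under independence
        class_share = sum(1 for t in target if t == cls) / n
        expected = total_ones * class_share
        score += (observed - expected) ** 2 / expected
    return score


print("city_Delhi :", dummy_chi2(city_delhi, purchased))   # 1.0
print("city_Mumbai:", dummy_chi2(city_mumbai, purchased))  # 2.0
```

Because only the 1s are counted, the two dummy columns of the same variable can receive different scores even though they carry the same information. That is why `city_Delhi` and `city_Mumbai` differ in the table.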

---

# 🧠 RULES TO DECIDE (memorize these)

### ✅ Rule 1: p-value is KING

> If **p-value < 0.05 → keep**
> If **p-value ≥ 0.05 → remove**

Why?

Because the p-value tells you:

> “Is this relationship statistically significant or just random?”
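
Rule 1 is easy to apply in code. This is a minimal self-contained sketch on the same toy dataset; `alpha`, `results`, and `kept` are illustrative names, and 0.05 is just the conventional threshold.

```python
import pandas as pd
from sklearn.feature_selection import chi2

df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi", "Delhi", "Mumbai", "Delhi"],
    "gender": ["Male", "Female", "Male", "Male", "Female", "Male"],
    "device": ["Android", "Android", "iPhone", "Android", "Android", "iPhone"],
    "purchased": [1, 0, 1, 1, 0, 0],
})
X = pd.get_dummies(df.drop(columns="purchased"), dtype=int)
y = df["purchased"]

alpha = 0.05  # significance threshold from Rule 1
scores, p_values = chi2(X, y)

results = pd.DataFrame({
    "feature": X.columns,
    "chi_score": scores,
    "p_value": p_values,
})
results["keep"] = results["p_value"] < alpha

kept = results.loc[results["keep"], "feature"].tolist()
print(results)
print("kept:", kept)
```

On this dataset `kept` comes back empty, which is what the feature-by-feature walk-through below arrives at by hand.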

---

## Apply rule 👇

### ❌ device_Android

* p = 1.0
* observed counts match the expected counts exactly
* in this sample, the target does not depend on it at all

❌ REMOVE

---

### ❌ device_iPhone

* p = 1.0
* zero relationship in this sample

❌ REMOVE

---

### ❌ city_Delhi

* p = 0.317 (> 0.05)

❌ REMOVE

---

### ❌ gender_Male

* p = 0.317 (> 0.05)

❌ REMOVE

---

### ⚠️ city_Mumbai

* p = 0.157 (> 0.05)
* not statistically significant

❌ REMOVE (for strict ML)

---

### ⚠️ gender_Female

* p = 0.157 (> 0.05)

❌ REMOVE

---

## 🔥 FINAL RESULT

👉 **Statistically speaking:**

### ❌ KEEP = NONE

### ❌ REMOVE = ALL

And under the p < 0.05 rule, that is **exactly the right call**.

---

## 😮 Why did everything get removed?

Because your dataset is:

* extremely small (very few samples)
* weak in its relationships
* dominated by randomness

📌 Chi-square is **data-hungry**.

With only 6 rows, statistics cannot prove dependency.

---

## 🧠 THIS IS A HUGE LEARNING MOMENT

> Feature selection does NOT guarantee features will survive.

Sometimes the correct answer is:

> “No feature is statistically significant.”

That’s not failure, that’s honesty.

---

## 🔥 VERY IMPORTANT REALITY

In real datasets (10k+ rows):

* p-values of genuine relationships drop drastically
* true relationships appear
* chi-square becomes powerful

In toy datasets → it almost never reaches significance.
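
The effect of sample size can be shown without any new data. The sketch below simply stacks the same 6 rows 10 times, so the pattern stays identical while the row count grows. Duplicating rows is not how real data behaves; it is only a trick to hold the relationship fixed and change nothing but the number of samples. `chi2_table` and `repeats` are names made up for this illustration.

```python
import pandas as pd
from sklearn.feature_selection import chi2

df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi", "Delhi", "Mumbai", "Delhi"],
    "gender": ["Male", "Female", "Male", "Male", "Female", "Male"],
    "device": ["Android", "Android", "iPhone", "Android", "Android", "iPhone"],
    "purchased": [1, 0, 1, 1, 0, 0],
})
X = pd.get_dummies(df.drop(columns="purchased"), dtype=int)
y = df["purchased"]


def chi2_table(features, target):
    """Return chi-square scores and p-values as a tidy DataFrame."""
    scores, p_values = chi2(features, target)
    return pd.DataFrame({
        "feature": features.columns,
        "chi_score": scores,
        "p_value": p_values,
    })


small = chi2_table(X, y)

# Same pattern, 10x the rows (artificial, for illustration only)
repeats = 10
X_big = pd.concat([X] * repeats, ignore_index=True)
y_big = pd.concat([y] * repeats, ignore_index=True)
big = chi2_table(X_big, y_big)

print(small.merge(big, on="feature", suffixes=("_6_rows", "_60_rows")))
```

The chi-scores grow in proportion to the number of rows, so `city_Mumbai` and `gender_Female` fall below 0.05 at 60 rows, while the device columns stay at p = 1.0 because there was never any pattern to amplify.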

---

## 🧠 How professionals decide in practice

### Option 1: Strict statistical approach

Used in:

* medical ML
* finance
* research

Rule:

```
p_value < 0.05
```

Your result → keep none ✅

---

### Option 2: Practical ML approach (very common)

We combine **three things**:

1. chi-score ranking
2. domain logic
3. model validation

Example:

Even if p = 0.15
but business logic says device matters
→ we may keep it temporarily.

---

## 🔥 So what do we actually keep?

### Practical rule:

> Keep **top K features by chi-score**,
> even if the p-value is slightly high.

Example:

```python
chi_df.sort_values("chi_score", ascending=False).head(2)
```

That would give:

* city_Mumbai
* gender_Female

Then the model decides later.

---

## 🎯 Final professional rule (IMPORTANT)

| Stage                   | What we do        |
| ----------------------- | ----------------- |
| Early feature selection | Loose rules       |
| Final model             | Strict evaluation |
| Cross-validation        | Final judge       |

üëâ Feature selection is **not a one-shot decision**.

---

## 🧠 Interview-perfect explanation

If the interviewer asks:

> “What if all p-values are high?”

You say:

> “It usually indicates a small dataset or weak statistical power. In such cases, I rank features using chi-square scores and validate them using model performance instead of blindly dropping everything.”

🔥 That answer is VERY strong.

---

## ⚡ Crystal-clear summary

* p-value < 0.05 → statistically significant
* your dataset is too small
* chi-square cannot prove dependency
* removing all features is consistent with the p < 0.05 rule
* practically, we keep top-ranked features temporarily
* final decision comes from model performance
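
One way to let model performance make the final call is to put the selection step inside a pipeline and cross-validate the whole thing. The sketch below is only an illustration under loud assumptions: the data is **synthetic** (400 random rows where purchase probability depends on `city` and not on `device`), and the choices of `LogisticRegression`, `k=2`, and 5 folds are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic data (assumption): city drives the target, device is pure noise
rng = np.random.default_rng(0)
n = 400
data = pd.DataFrame({
    "city": rng.choice(["Delhi", "Mumbai"], size=n),
    "device": rng.choice(["Android", "iPhone"], size=n),
})
purchase_prob = np.where(data["city"] == "Delhi", 0.7, 0.3)
target = (rng.random(n) < purchase_prob).astype(int)

# Encoding and selection live inside the pipeline, so each fold
# picks its features from its own training split only
pipe = Pipeline([
    ("encode", OneHotEncoder(handle_unknown="ignore")),
    ("select", SelectKBest(score_func=chi2, k=2)),
    ("model", LogisticRegression()),
])

scores = cross_val_score(pipe, data, target, cv=5)
print("fold accuracies:", scores.round(3))
print("mean accuracy:", scores.mean().round(3))
```

Rerunning this with different values of `k` and comparing the mean accuracy is the "model decides" step in practice.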

---

In [10]:
```python
# Feature selection using SelectKBest
from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(score_func=chi2, k=3)
X_selected = selector.fit_transform(X, y)

selected_features = X.columns[selector.get_support()]
selected_features
```

```
Index(['city_Mumbai', 'gender_Female', 'gender_Male'], dtype='object')
```