# ✅ Model-Based Feature Selection

This is how **many real ML pipelines choose features**.

Instead of asking a statistical test:

> “Is this feature statistically related to the target?”

We ask the model:

> “Did this feature help you reduce loss?”

🔥 Often more practical, because the answer comes from the kind of model you will actually train.

---

## 🧠 Core idea

Train a model →
check which features influenced predictions →
keep the important ones →
drop the rest.

---

## 📌 Two very common model-based methods

### 1️⃣ Logistic Regression (coefficients)

* linear model
* great for interpretation
* very interview-friendly

### 2️⃣ Tree-based models (Random Forest)

* handles non-linearity
* captures interactions between features
* widely used on tabular data

We'll do **both**.

---

# 🔹 PART 1: Logistic Regression Feature Importance

A natural fit here because the target is binary.

---

## 🧠 Intuition

Logistic regression learns weights:

```
log_odds    = w1*x1 + w2*x2 + ... + b
P(class 1)  = 1 / (1 + exp(-log_odds))
```
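To make the formula concrete, here is a minimal sketch of how a weighted sum becomes a probability through the sigmoid function. The weights, bias, and sample values are made up for illustration; they do not come from the dataset used in this guide.

```python
import numpy as np

def predict_proba_by_hand(x, w, b):
    """Return P(class 1) for one sample under a logistic regression model."""
    log_odds = np.dot(w, x) + b              # weighted sum of the features
    return 1.0 / (1.0 + np.exp(-log_odds))   # sigmoid maps log-odds into (0, 1)

# Hypothetical weights, bias, and sample (illustration only).
w = np.array([2.0, 0.0, -1.0])
b = 0.0
x = np.array([1.0, 5.0, 1.0])

p = predict_proba_by_hand(x, w, b)

# The second feature has weight 0, so changing it leaves the probability unchanged.
p_changed = predict_proba_by_hand(np.array([1.0, 99.0, 1.0]), w, b)

print(round(p, 4), round(p_changed, 4))
```

A weight of zero removes the feature from the sum entirely, which is why near-zero coefficients are read as "this feature contributes little".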

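* large |weight| → the feature moves the prediction strongly
* weight ≈ 0 → the feature contributes little

This comparison is only fair when the features share a scale. One-hot columns do; raw numeric columns usually need standardizing first.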

---

## 🔬 Practical example

### Step 1: Prepare data

```python
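import pandas as pd  # needed in Step 3 to build the importance Series
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# df_encoded: the one-hot-encoded DataFrame, with the binary target in "purchased"
X = df_encoded.drop("purchased", axis=1)
y = df_encoded["purchased"]

# Hold out 30% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)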
```

---

### Step 2: Train model

```python
# max_iter=1000 gives the solver extra iterations to converge
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
```

---

### Step 3: Extract feature importance

```python
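# coef_ has shape (1, n_features) for a binary target;
# sort by magnitude while keeping the sign visible
importance = pd.Series(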
    model.coef_[0],
    index=X.columns
).sort_values(key=abs, ascending=False)

importance
```
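Coefficient magnitudes are only comparable when the features are on the same scale. The sketch below illustrates this on synthetic data (an assumption, not the guide's dataset): two equally useful features, one recorded in units 100 times larger. Fitting on raw values makes the large-unit feature look unimportant; standardizing first with `StandardScaler` makes the two magnitudes comparable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustration only): two equally useful features,
# one of them recorded in units 100x larger than the other.
rng = np.random.default_rng(0)
n = 500
small = rng.normal(size=n)
large = rng.normal(size=n) * 100
noise = rng.normal(scale=0.5, size=n)
y_demo = (small + large / 100 + noise > 0).astype(int)
X_demo = pd.DataFrame({"small_scale": small, "large_scale": large})

# Raw fit: the coefficient of the large-unit feature shrinks to compensate.
raw = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
raw_coef = pd.Series(raw.coef_[0], index=X_demo.columns)

# Standardized fit: both features have unit variance, so magnitudes are comparable.
scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled.fit(X_demo, y_demo)
scaled_coef = pd.Series(scaled[-1].coef_[0], index=X_demo.columns)

print(raw_coef.abs().sort_values(ascending=False))
print(scaled_coef.abs().sort_values(ascending=False))
```

For a dataset made only of one-hot columns, as in this guide, every feature is already 0/1, so the scaling step matters less.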

---

## 🧠 How to interpret

| Coefficient    | Meaning                     |
| -------------- | --------------------------- |
| large positive | increases chance of class 1 |
| large negative | decreases chance of class 1 |
| near 0         | little effect on prediction |

For ranking we care about the **absolute value**; the sign only gives the direction.

---

## ✅ Feature selection rule

```python
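# Keep features whose |coefficient| exceeds 0.1
# (an illustrative cutoff, not a standard value; tune it for your data)
selected_features = importance[importance.abs() > 0.1].index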
selected_features
```

These are the features the fitted model leans on most.
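
The thresholding rule above can also be expressed with scikit-learn's `SelectFromModel`, which wraps an estimator and keeps the features whose importance passes a threshold. The sketch below is a minimal example on synthetic data; the column names (`signal_a`, `noise_0`, ...) and the data-generating rule are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Synthetic data (illustration only): two informative columns, four noise columns.
rng = np.random.default_rng(0)
n = 400
cols = ["signal_a", "signal_b", "noise_0", "noise_1", "noise_2", "noise_3"]
X_demo = pd.DataFrame(rng.normal(size=(n, len(cols))), columns=cols)
y_demo = (
    2.0 * X_demo["signal_a"] - 1.5 * X_demo["signal_b"]
    + rng.normal(scale=0.5, size=n) > 0
).astype(int)

# For a linear model, SelectFromModel compares |coefficient| to the threshold.
selector = SelectFromModel(LogisticRegression(max_iter=1000), threshold=0.1)
selector.fit(X_demo, y_demo)

kept = list(X_demo.columns[selector.get_support()])  # names of the selected columns
X_reduced = selector.transform(X_demo)               # array with only those columns

print(kept, X_reduced.shape)
```

With a fixed cutoff, a noise column can occasionally slip through, which is one reason the threshold is worth tuning rather than hard-coding.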

---

## 🔥 THIS is model-based selection

Not a standalone statistical test.
Not a guess.
A ranking learned from the data by a fitted model.

---

# 🔹 PART 2: Random Forest Feature Importance

Often more flexible, because trees pick up non-linear effects and interactions.

---

## 🧠 Intuition

Random Forest:

* builds many decision trees
* records how much each split on a feature reduces impurity
* averages that reduction across all trees

Features that produce large impurity reductions → important.
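
To show what "reducing impurity" means, here is a small hand-rolled sketch of Gini impurity and the impurity decrease of a single split. It is an illustration of the idea, not scikit-learn's internal implementation, and the label lists are made up.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 0 for a pure node, 0.5 for a 50/50 binary node."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    return 1.0 - np.sum(p ** 2)

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    children = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - children

parent = [1, 0, 1, 1, 0, 0]  # hypothetical labels at a node

# A split that separates the classes perfectly removes all impurity.
perfect = impurity_decrease(parent, [1, 1, 1], [0, 0, 0])

# A split that barely changes the class mix removes almost none.
weak = impurity_decrease(parent, [1, 0, 1], [1, 0, 0])

print(perfect, round(weak, 4))
```

A feature whose splits look like the first case accumulates a high importance; one whose splits look like the second stays close to zero.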

---

## 🔬 Code

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
```

---

### Extract importance

```python
rf_importance = pd.Series(
    rf.feature_importances_,
    index=X.columns
).sort_values(ascending=False)

rf_importance
```

---

## 🧠 Interpretation

* higher value → more important (the values sum to 1)
* zero → never used in any split

---

## ✅ Select top features

```python
# 0.05 is an illustrative cutoff; tune it for your data
selected_rf_features = rf_importance[rf_importance > 0.05].index
selected_rf_features
```
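Impurity-based importances are computed from the training data, so it can be worth cross-checking the ranking on held-out rows. One option (a swapped-in technique, not part of the walkthrough above) is scikit-learn's `permutation_importance`, which shuffles one column at a time and measures how much the score drops. The sketch below uses synthetic data with an interaction effect; the column names and the data-generating rule are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (illustration only): the label depends on the *interaction*
# of two columns; the remaining three columns are pure noise.
rng = np.random.default_rng(0)
cols = ["signal_a", "signal_b", "noise_0", "noise_1", "noise_2"]
X_demo = pd.DataFrame(rng.normal(size=(600, len(cols))), columns=cols)
y_demo = (X_demo["signal_a"] * X_demo["signal_b"] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42)
rf_demo.fit(X_tr, y_tr)

# Impurity-based importance (from training) vs. permutation importance (held-out).
impurity_rank = pd.Series(rf_demo.feature_importances_, index=cols)
perm = permutation_importance(rf_demo, X_te, y_te, n_repeats=10, random_state=42)
perm_rank = pd.Series(perm.importances_mean, index=cols)

print(impurity_rank.sort_values(ascending=False))
print(perm_rank.sort_values(ascending=False))
```

When both rankings agree on the top features, a threshold-based selection is on firmer ground.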

---

## 🔥 Why practitioners often rely on this

Because:

✅ handles interactions
✅ non-linear
✅ robust to feature scaling
✅ no distributional assumptions

That's why tree-based models are so common in Kaggle competitions and in industry, especially on tabular data.

---

## 🎯 Final real-world pipeline

```text
Raw data
 ↓
Encode categorical
 ↓
Variance threshold (remove constant features)
 ↓
Chi-square (optional filtering)
 ↓
Model-based selection (FINAL)
 ↓
Train final model
```
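The flow above can be sketched as a single scikit-learn `Pipeline`. This is one possible arrangement under stated assumptions, not a definitive implementation: the data is synthetic, the step names and `k=5` are arbitrary choices, and logistic regression stands in for the final model. Note that `chi2` requires non-negative features, which one-hot columns satisfy.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import (
    SelectFromModel, SelectKBest, VarianceThreshold, chi2,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic raw data (illustration only). "country" is constant on purpose.
rng = np.random.default_rng(42)
n = 300
df_demo = pd.DataFrame({
    "city": rng.choice(["Delhi", "Mumbai"], size=n),
    "device": rng.choice(["Android", "iPhone"], size=n),
    "plan": rng.choice(["free", "basic", "pro"], size=n),
    "country": ["India"] * n,
})
# The label follows "city", with 15% of the rows flipped at random.
is_delhi = (df_demo["city"] == "Delhi").to_numpy()
purchased = np.logical_xor(is_delhi, rng.random(n) < 0.15).astype(int)

pipe = Pipeline([
    ("encode", OneHotEncoder(handle_unknown="ignore")),  # encode categorical
    ("variance", VarianceThreshold(threshold=0.0)),      # drop constant columns
    ("chi2", SelectKBest(chi2, k=5)),                    # optional filter step
    ("model_based", SelectFromModel(                     # model-based selection
        RandomForestClassifier(n_estimators=100, random_state=42)
    )),
    ("clf", LogisticRegression(max_iter=1000)),          # final model
])

# Cross-validation refits every step inside each fold, so selection never sees test rows.
scores = cross_val_score(pipe, df_demo, purchased, cv=5)

pipe.fit(df_demo, purchased)
n_after_variance = int(pipe.named_steps["variance"].get_support().sum())

print(scores.mean().round(3), n_after_variance)
```

Keeping the selection steps inside the pipeline is a design choice: it avoids leaking information from validation rows into the feature ranking.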

This is **production-level thinking**.

---

## 🧠 Interview gold answer

> “I usually use filter methods to remove obvious noise and then rely on model-based feature importance to make final feature selection decisions.”

🔥 That sentence shows you know how filter methods and model-based methods fit together.

---

## ⚡ Super clear summary

| Method              | Purpose                  |
| ------------------- | ------------------------ |
| Variance threshold  | remove constant features |
| Chi-square          | statistical dependency   |
| Logistic regression | linear importance        |
| Random forest       | non-linear importance    |

Final judge = **model performance**.

---

## Linear method

```python
import pandas as pd

data = {
    "city": ["Delhi", "Mumbai", "Delhi", "Delhi", "Mumbai", "Delhi"],
    "gender": ["Male", "Female", "Male", "Male", "Female", "Male"],
    "device": ["Android", "Android", "iPhone", "Android", "Android", "iPhone"],
    "purchased": [1, 0, 1, 1, 0, 0]
}
df = pd.DataFrame(data)
df_encoded = pd.get_dummies(df, drop_first=False)
```

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X = df_encoded.drop("purchased", axis=1)
y = df_encoded["purchased"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
```

Output (estimator parameters):

| Parameter         | Value   |
| ----------------- | ------- |
| penalty           | 'l2'    |
| dual              | False   |
| tol               | 0.0001  |
| C                 | 1.0     |
| fit_intercept     | True    |
| intercept_scaling | 1       |
| class_weight      | None    |
| random_state      | None    |
| solver            | 'lbfgs' |
| max_iter          | 1000    |

```python
importance = pd.Series(
    model.coef_[0],
    index=X.columns
).sort_values(key=abs, ascending=False)

importance
```

```text
city_Mumbai      -0.499885
gender_Female    -0.499885
city_Delhi        0.499687
gender_Male       0.499687
device_Android   -0.210577
device_iPhone     0.210379
dtype: float64
```
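
One thing the output above makes visible: the `city_*` and `gender_*` columns receive identical coefficients. In this toy data every Delhi row is Male and every Mumbai row is Female, so the encoded columns are exact copies of each other and the L2-regularized model splits the weight between them. The check below reuses the same six rows; the `drop_first=True` variant at the end is a suggestion, not part of the original run.

```python
import pandas as pd

data = {
    "city": ["Delhi", "Mumbai", "Delhi", "Delhi", "Mumbai", "Delhi"],
    "gender": ["Male", "Female", "Male", "Male", "Female", "Male"],
    "device": ["Android", "Android", "iPhone", "Android", "Android", "iPhone"],
    "purchased": [1, 0, 1, 1, 0, 0],
}
df = pd.DataFrame(data)
features = pd.get_dummies(df.drop(columns="purchased"), drop_first=False).astype(int)

# Identical columns: the model cannot tell city apart from gender on this data.
city_equals_gender = bool((features["city_Delhi"] == features["gender_Male"]).all())

# Each one-hot pair is also perfectly anti-correlated (Delhi = 1 - Mumbai).
corr_delhi_mumbai = features["city_Delhi"].corr(features["city_Mumbai"])

# drop_first=True keeps one column per binary category and removes that redundancy.
reduced = pd.get_dummies(df.drop(columns="purchased"), drop_first=True)

print(city_equals_gender, round(corr_delhi_mumbai, 3), list(reduced.columns))
```

With only six rows, importances like these say more about the quirks of the sample than about real purchasing behavior, so they are best read as a demonstration of the mechanics.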

## Random forest method

```python
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
```

Output (estimator parameters):

| Parameter                | Value  |
| ------------------------ | ------ |
| n_estimators             | 100    |
| criterion                | 'gini' |
| max_depth                | None   |
| min_samples_split        | 2      |
| min_samples_leaf         | 1      |
| min_weight_fraction_leaf | 0.0    |
| max_features             | 'sqrt' |
| max_leaf_nodes           | None   |
| min_impurity_decrease    | 0.0    |
| bootstrap                | True   |

```python
rf_importance = pd.Series(
    rf_model.feature_importances_,
    index=X.columns
).sort_values(key=abs, ascending=False)

rf_importance
```

```text
city_Mumbai       0.229885
gender_Male       0.229885
city_Delhi        0.218391
gender_Female     0.183908
device_iPhone     0.080460
device_Android    0.057471
dtype: float64
```