
# {1} Feature Extraction


# Target Encoding
### Target Encoding: Beyond the Basic Mean (Simplified Explanation)

**What is Target Encoding?**

* We have categories like “red”, “blue”, “green”.
* Instead of using text, we replace each category with a number.
* The simplest way: replace it with the **average target value** for that category.

**Example (basic mean):**
Imagine we have a small online shop dataset where the target is whether the customer bought the product (`1` for yes, `0` for no):

```
color   bought
red     1
red     1
blue    0
green   0
red     0
blue    1
```

* "red" rows → bought = \[1, 1, 0] → mean = 0.67
* "blue" rows → bought = \[0, 1] → mean = 0.5
* "green" rows → bought = \[0] → mean = 0.0

We could replace each category with these numbers.
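
As a minimal sketch of the basic-mean replacement, the six rows from the example can be encoded with pandas. The names `shop`, `category_means`, and `color_mean_enc` are chosen here for illustration only.

```python
import pandas as pd

# The six-row shop dataset from the example above.
shop = pd.DataFrame({
    "color":  ["red", "red", "blue", "green", "red", "blue"],
    "bought": [1, 1, 0, 0, 0, 1],
})

# Mean of the target for each category.
category_means = shop.groupby("color")["bought"].mean()

# Replace each category label with its category mean.
shop["color_mean_enc"] = shop["color"].map(category_means)
print(shop)
```

Every "red" row receives about 0.67, every "blue" row 0.5, and the single "green" row 0.0.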

**Why the simple way is risky:**

* "green" appears only once → its mean = 0.0 might be due to chance.
* For rare categories, the number is unstable and can mislead the model.

**The advanced way — Smoothing (Bayesian blending):**
We combine:

1. **Category mean** — info from that category.
2. **Overall (global) mean** — average target across *all* data points.

**What is the global mean here?**
In our example:

* All bought values: \[1, 1, 0, 0, 0, 1]
* Sum = 3, Count = 6 → global mean = 0.5

**Formula:**

```
encoded = (n_cat * mean_cat + k * mean_global) / (n_cat + k)
```

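Where `n_cat` is the number of rows in the category, `mean_cat` is that category's mean target, `mean_global` is the global mean, and `k` controls how strongly the encoding is pulled towards the global mean. With `k = 0` you get the plain category mean; as `k` grows, every category moves closer to the global mean, and rare categories move the most.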
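
To make the formula concrete, here is a small sketch that applies it to the shop example. The helper name `smooth` and the choice `k = 2` are assumptions made for illustration, not recommended defaults.

```python
def smooth(n_cat, mean_cat, mean_global, k):
    """Blend a category mean with the global mean, weighted by row count."""
    return (n_cat * mean_cat + k * mean_global) / (n_cat + k)

mean_global = 3 / 6  # three purchases in six rows

# (row count, category mean) for each color in the shop example
stats = {"red": (3, 2 / 3), "blue": (2, 1 / 2), "green": (1, 0.0)}

k = 2  # illustrative value only
encoded = {color: smooth(n, m, mean_global, k) for color, (n, m) in stats.items()}

for color, value in encoded.items():
    print(f"{color}: {value:.3f}")
# red: 0.600
# blue: 0.500
# green: 0.333
```

"green" has only one row, so it moves furthest from its raw mean (0.0 to 0.333), while "red" with three rows moves less (0.667 to 0.600).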

**Real-world analogy:**

* Think of movie ratings. If a movie has only 2 reviews with 5 stars each, you wouldn’t trust it as much as a movie with 1,000 reviews averaging 4.8 stars. You’d mix the small-sample score with the average rating of all movies to avoid overreacting.


**When to use:**

* Many categories (high-cardinality)
* Rare categories that could mislead without smoothing
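* You want to add target information to a feature and can guard against target leakage (for example, by computing the encoding from training data only)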

**One-line takeaway:**

> The overall mean is simply the average target across the entire dataset, used as a safe fallback for categories with few samples.



In [None]:
import pandas as pd

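def target_encode_smooth(train, target, col, k=5):
    """Return the smoothed target encoding of `col`; larger k pulls harder toward the global mean."""
    global_mean = train[target].mean()
    # Per-category mean and row count of the target
    agg = train.groupby(col)[target].agg(['mean', 'count'])
    # Weighted blend of the category mean and the global mean
    smoothed = (agg['count'] * agg['mean'] + k * global_mean) / (agg['count'] + k)
    # Caution: this encodes rows using their own targets, so fit on training data only
    return train[col].map(smoothed)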

# Example
df = pd.DataFrame({
    'color': ['red','blue','red','green','green','blue'],
    'label': [1,0,1,0,0,1]
})
df['color_te'] = target_encode_smooth(df, 'label', 'color', k=3)
print(df)

   color  label  color_te
0    red      1       0.7
1   blue      0       0.5
2    red      1       0.7
3  green      0       0.3
4  green      0       0.3
5   blue      1       0.5
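
The cell above encodes the same rows it learns from. One possible way to reuse the encoding on new data is to split it into a fit step and an apply step, with unseen categories falling back to the global mean. This is a sketch under that assumption; `fit_target_encoding`, `apply_target_encoding`, and the `test` frame are hypothetical names introduced here.

```python
import pandas as pd

def fit_target_encoding(train, target, col, k=5):
    """Learn smoothed category encodings from training rows only."""
    global_mean = train[target].mean()
    agg = train.groupby(col)[target].agg(["mean", "count"])
    mapping = (agg["count"] * agg["mean"] + k * global_mean) / (agg["count"] + k)
    return mapping, global_mean

def apply_target_encoding(frame, col, mapping, global_mean):
    """Encode a column; categories not seen in training get the global mean."""
    return frame[col].map(mapping).fillna(global_mean)

train = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "green", "blue"],
    "label": [1, 0, 1, 0, 0, 1],
})
test = pd.DataFrame({"color": ["red", "purple"]})  # "purple" never appears in train

mapping, global_mean = fit_target_encoding(train, "label", "color", k=3)
test["color_te"] = apply_target_encoding(test, "color", mapping, global_mean)
print(test)
```

Here "red" keeps its training encoding of 0.7, and "purple" receives the global mean of 0.5 because no training rows exist for it.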


# Polynomial Feature Generation
### Polynomial Feature Generation: Capturing Nonlinear Relationships

**Definition:**
Polynomial feature generation creates new features from existing ones by taking powers and interactions, enabling linear models to approximate nonlinear patterns.

**Example:**
From features:

```
X1, X2
```

We can generate:

```
X1², X2², X1 × X2
```

This lets a linear model represent curved trends and feature interactions.

**Why not just use nonlinear models?**

* **Interpretability:** Linear models with polynomial features remain transparent, unlike many nonlinear methods.
* **Small datasets:** Nonlinear models often require more data to perform well.
* **Efficiency:** With few input features and a low degree, a linear model on polynomial features trains quickly and uses little memory.
* **Compliance:** In regulated industries (finance, healthcare), models whose predictions can be explained are often preferred or required.

**Real-world applications:**

* **House prices:** Price might peak at an optimal number of rooms; adding `rooms²` allows a linear model to capture this “sweet spot” curve.
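* **Physics:** Distance = speed × time is exactly the degree-2 interaction term `speed × time`, so a linear model can represent it with a single coefficient.
* **Marketing:** Sales rise with ad spend but level off (diminishing returns). Adding `ad_spend²`, typically with a negative coefficient, lets a linear model fit a curve that rises quickly and then flattens within the observed range. Outside that range a quadratic eventually turns downward, so extrapolate with care.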
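
The marketing case can be sketched with synthetic data. The square-root sales curve below is an assumption chosen only because it shows diminishing returns; it is not real data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data with diminishing returns (illustrative assumption).
ad_spend = np.arange(1, 11, dtype=float)
sales = 10 * np.sqrt(ad_spend)

X_linear = ad_spend.reshape(-1, 1)
X_quadratic = np.column_stack([ad_spend, ad_spend ** 2])  # adds ad_spend^2

linear_fit = LinearRegression().fit(X_linear, sales)
quadratic_fit = LinearRegression().fit(X_quadratic, sales)

r2_linear = linear_fit.score(X_linear, sales)
r2_quadratic = quadratic_fit.score(X_quadratic, sales)
squared_coef = quadratic_fit.coef_[1]

print(f"R^2 linear:    {r2_linear:.4f}")
print(f"R^2 quadratic: {r2_quadratic:.4f}")
print(f"coefficient on ad_spend^2: {squared_coef:.4f}")
```

The quadratic model fits the training points better, and the coefficient on the squared term comes out negative, which is what bends the fitted line into a flattening curve.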

**Risks:**

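* Feature count grows quickly with degree: `n` inputs expanded to degree `d` yield C(n + d, d) − 1 features (excluding the bias term), so 10 inputs at degree 3 already produce 285. With limited data this invites overfitting (“curse of dimensionality”).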
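
A quick way to see this growth is to ask scikit-learn how many columns it would generate. The sketch below assumes 10 input features; the array values do not matter, only the shape.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 10
X = np.zeros((1, n_features))  # placeholder data; only the column count matters

counts = {}
for degree in (1, 2, 3, 4):
    poly = PolynomialFeatures(degree=degree, include_bias=False).fit(X)
    counts[degree] = poly.n_output_features_
    # Closed form for comparison: C(n + d, d) minus the constant term.
    closed_form = comb(n_features + degree, degree) - 1
    print(degree, counts[degree], closed_form)
# 1 10 10
# 2 65 65
# 3 285 285
# 4 1000 1000
```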

**Advanced use:**

* Create only plausible features based on domain expertise.
* Use feature selection methods such as Recursive Feature Elimination (RFE) or LASSO regression to keep only the most useful polynomial terms.

**When to use:**

* You suspect nonlinear relationships.
* You need interpretable models.
* You have limited data but want richer features.

**Key takeaway:**

> Polynomial features give linear models the flexibility to model curves and interactions without abandoning simplicity.


In [None]:
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Sample data
df = pd.DataFrame({
    'X1': [1, 2, 3],
    'X2': [4, 5, 6]
})

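# degree=2 adds squares and pairwise products; include_bias=False omits the constant column
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
poly_features = poly.fit_transform(df)
# Column names for the generated features, e.g. 'X1^2' and 'X1 X2'
feature_names = poly.get_feature_names_out(['X1', 'X2'])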

poly_df = pd.DataFrame(poly_features, columns=feature_names)
print(poly_df)

    X1   X2  X1^2  X1 X2  X2^2
0  1.0  4.0   1.0    4.0  16.0
1  2.0  5.0   4.0   10.0  25.0
2  3.0  6.0   9.0   18.0  36.0
