# Feature Creation & Interaction — Turn Raw Data into Gold

**Objective**: Learn **why** and **how** to create **new, meaningful features** from existing ones, which is often **one of the most effective ways to improve model performance**.

---

## 1. What is Feature Engineering?

**Definition**: Using **domain knowledge** to create new variables that make machine learning algorithms work **better**.

> **"Feature engineering is the art of turning data into information."**

### Why Raw Features Are Not Enough

| Raw Data | Problem | Better Feature |
|---------|--------|----------------|
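| `birth_year` | Meaning shifts with the reference date | `age = 2025 - birth_year` |
| `check_in`, `check_out` | Length of stay is split across two columns | `stay_duration = check_out - check_in` |
| `lat`, `lon` | Raw coordinates do not encode proximity | `distance_to_center` |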
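
The three transformations in the table can be sketched in pandas roughly as follows. The sample rows, the city-center coordinates, and the reference year are illustrative assumptions, not data from this notebook, and the haversine formula is one common choice for `distance_to_center`.

```python
import numpy as np
import pandas as pd

# Hypothetical bookings; all values are made up for illustration.
raw = pd.DataFrame({
    'birth_year': [1990, 2000],
    'check_in': pd.to_datetime(['2025-01-01', '2025-03-10']),
    'check_out': pd.to_datetime(['2025-01-04', '2025-03-11']),
    'lat': [48.8566, 48.9000],
    'lon': [2.3522, 2.3522],
})

# Age relative to an assumed reference year.
REFERENCE_YEAR = 2025
raw['age'] = REFERENCE_YEAR - raw['birth_year']

# Stay duration in whole days.
raw['stay_duration'] = (raw['check_out'] - raw['check_in']).dt.days

# Great-circle (haversine) distance in km to an assumed city center.
CENTER_LAT, CENTER_LON = 48.8566, 2.3522
EARTH_RADIUS_KM = 6371.0

lat1, lon1 = np.radians(raw['lat']), np.radians(raw['lon'])
lat2, lon2 = np.radians(CENTER_LAT), np.radians(CENTER_LON)
a = (np.sin((lat2 - lat1) / 2) ** 2
     + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
raw['distance_to_center'] = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

print(raw[['age', 'stay_duration', 'distance_to_center']])
```

In practice the reference date would usually be the date of the event being predicted rather than a hard-coded year.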

---

## 2. Types of Feature Creation

| Type | Example | Use Case |
|------|--------|----------|
| **Aggregation** | `total_spend` | Sum of purchases |
| **Ratio** | `price_per_sqm` | Value per unit |
| **Binary Flag** | `is_weekend` | Yes/No logic |
| **Time-based** | `days_since_signup` | Recency |
| **Interaction** | `income × education` | Synergy |
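
As a rough sketch of the aggregation and time-based rows in the table, the snippet below builds `total_spend` and `days_since_signup` from a hypothetical purchase log. The table names, column names other than those two, and the snapshot date are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical purchase log; values are made up for illustration.
purchases = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 2],
    'amount': [100, 250, 50, 75, 25],
})

# Aggregation: one row per customer with the sum of purchases.
total_spend = (
    purchases.groupby('customer_id')['amount']
    .sum()
    .rename('total_spend')
    .reset_index()
)

# Time-based: days between signup and an assumed snapshot date.
customers = pd.DataFrame({
    'customer_id': [1, 2],
    'signup_date': pd.to_datetime(['2025-01-01', '2025-02-15']),
})
SNAPSHOT = pd.Timestamp('2025-03-01')
customers['days_since_signup'] = (SNAPSHOT - customers['signup_date']).dt.days

# Join both features onto the customer table.
features = customers.merge(total_spend, on='customer_id', how='left')
print(features)
```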

---

## 3. Real-World Example 1: E-Commerce Order Value Prediction

We have 6 orders:

| Order | Day | Hour | Items | Unit Price | Total |
|-------|-----|------|-------|------------|-------|
| 1 | Mon | 14 | 2 | 600 | 1200 |
| 2 | Sat | 10 | 5 | 700 | 3500 |
| 3 | Sun | 18 | 1 | 800 | 800 |
| 4 | Tue | 9 | 3 | 600 | 1800 |
| 5 | Fri | 20 | 4 | 700 | 2800 |
| 6 | Sat | 11 | 6 | 700 | 4200 |

**Goal**: Predict **high-value orders** (> 3000)

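A model trained on the **raw columns** sees `day` as an arbitrary label and `hour` as a plain number → it has **no notion** of weekends or peak hours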
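
The goal above implies a binary target. A minimal sketch of how that label might be derived from the totals in the table; the column name `is_high_value` and the constant name are assumptions.

```python
import pandas as pd

# Order totals copied from the table above.
orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4, 5, 6],
    'total': [1200, 3500, 800, 1800, 2800, 4200],
})

# 1 when the order total exceeds the threshold stated in the goal.
HIGH_VALUE_THRESHOLD = 3000
orders['is_high_value'] = (orders['total'] > HIGH_VALUE_THRESHOLD).astype(int)

print(orders)
```

Because this label is computed from `total`, any feature that also uses `total` would give the answer away.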

In [1]:
import pandas as pd

df = pd.DataFrame({
    'order_id': [1, 2, 3, 4, 5, 6],
    'day': ['Mon', 'Sat', 'Sun', 'Tue', 'Fri', 'Sat'],
    'hour': [14, 10, 18, 9, 20, 11],
    'items': [2, 5, 1, 3, 4, 6],
    'unit_price': [600, 700, 800, 600, 700, 700],
    'total': [1200, 3500, 800, 1800, 2800, 4200]
})

print("Raw E-Commerce Data:")
df

Raw E-Commerce Data:


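|   | order_id | day | hour | items | unit_price | total |
|---|----------|-----|------|-------|------------|-------|
| 0 | 1 | Mon | 14 | 2 | 600 | 1200 |
| 1 | 2 | Sat | 10 | 5 | 700 | 3500 |
| 2 | 3 | Sun | 18 | 1 | 800 | 800 |
| 3 | 4 | Tue | 9 | 3 | 600 | 1800 |
| 4 | 5 | Fri | 20 | 4 | 700 | 2800 |
| 5 | 6 | Sat | 11 | 6 | 700 | 4200 |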


---

## 4. Feature Creation — Step by Step

### 1. Binary Flag: Is Weekend?

In [2]:
df['is_weekend'] = df['day'].isin(['Sat', 'Sun']).astype(int)

print("New Feature: is_weekend")
df[['day', 'is_weekend']]

New Feature: is_weekend


|   | day | is_weekend |
|---|-----|------------|
| 0 | Mon | 0 |
| 1 | Sat | 1 |
| 2 | Sun | 1 |
| 3 | Tue | 0 |
| 4 | Fri | 0 |
| 5 | Sat | 1 |
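
When orders carry a full timestamp instead of a day label, the same flag can be derived with the `.dt` accessor. This is a sketch under that assumption; the `ordered_at` column and its dates are hypothetical.

```python
import pandas as pd

# Hypothetical order timestamps; dates are made up for illustration.
ts = pd.DataFrame({
    'ordered_at': pd.to_datetime([
        '2025-01-06 14:00',  # Monday
        '2025-01-11 10:00',  # Saturday
        '2025-01-12 18:00',  # Sunday
    ])
})

# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend.
ts['is_weekend'] = (ts['ordered_at'].dt.dayofweek >= 5).astype(int)

# The hour of day comes from the same timestamp.
ts['hour'] = ts['ordered_at'].dt.hour

print(ts)
```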


### 2. Ratio: Price per Item

In [3]:
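# Caution: `total` is the quantity we want to predict, so a feature derived from it leaks the target.
# Here the ratio also just reproduces `unit_price`; it is shown only to illustrate a ratio feature.
df['price_per_item'] = df['total'] / df['items']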

print("New Feature: price_per_item")
df[['items', 'total', 'price_per_item']].round(2)

New Feature: price_per_item


|   | items | total | price_per_item |
|---|-------|-------|----------------|
| 0 | 2 | 1200 | 600.0 |
| 1 | 5 | 3500 | 700.0 |
| 2 | 1 | 800 | 800.0 |
| 3 | 3 | 1800 | 600.0 |
| 4 | 4 | 2800 | 700.0 |
| 5 | 6 | 4200 | 700.0 |
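
One sanity check worth running on a new ratio feature is whether it merely reproduces a column that already exists. The sketch below does that for the order data; the variable name `duplicates_unit_price` is an assumption.

```python
import pandas as pd

# Columns copied from the order table above.
orders = pd.DataFrame({
    'items': [2, 5, 1, 3, 4, 6],
    'unit_price': [600, 700, 800, 600, 700, 700],
    'total': [1200, 3500, 800, 1800, 2800, 4200],
})

price_per_item = orders['total'] / orders['items']

# True when the new ratio carries no information beyond unit_price.
duplicates_unit_price = bool((price_per_item == orders['unit_price']).all())
print(duplicates_unit_price)
```

For this dataset the check comes back `True`, so the ratio adds nothing new for a model that already sees `unit_price`.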


### 3. Time Flag: Is Peak Hour? (6 PM – 9 PM)

In [4]:
# between() includes both endpoints, so hours 18, 19, 20, and 21 are flagged.
df['is_peak_hour'] = df['hour'].between(18, 21).astype(int)

print("New Feature: is_peak_hour")
df[['hour', 'is_peak_hour']]

New Feature: is_peak_hour


|   | hour | is_peak_hour |
|---|------|--------------|
| 0 | 14 | 0 |
| 1 | 10 | 0 |
| 2 | 18 | 1 |
| 3 | 9 | 0 |
| 4 | 20 | 1 |
| 5 | 11 | 0 |
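
Whether hour 21 belongs to the peak window depends on how "6 PM to 9 PM" is read. The sketch below contrasts the default closed interval with a left-inclusive one; the `inclusive='left'` string form assumes pandas 1.3 or newer, and the sample hours are made up.

```python
import pandas as pd

hours = pd.Series([17, 18, 20, 21, 22])

# Default: both bounds included, so hour 21 (9:00-9:59 PM) counts as peak.
peak_closed = hours.between(18, 21).astype(int)

# Left-inclusive: covers 18:00 up to, but not including, 21:00.
peak_half_open = hours.between(18, 21, inclusive='left').astype(int)

print(pd.DataFrame({'hour': hours,
                    'closed': peak_closed,
                    'half_open': peak_half_open}))
```

The six orders in this notebook get the same flags under either reading, since none falls in hour 21.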


### 4. Interaction: Weekend × Peak Hour

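**Why?** Weekend evening orders may be **impulse buys** → higher value. Treat this as a hypothesis to test: in this six-row sample, the only weekend-evening order (order 3) has the lowest total.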

In [5]:
df['weekend_peak'] = df['is_weekend'] * df['is_peak_hour']

print("Interaction Feature: weekend_peak")
df[['is_weekend', 'is_peak_hour', 'weekend_peak', 'total']]

Interaction Feature: weekend_peak


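|   | is_weekend | is_peak_hour | weekend_peak | total |
|---|------------|--------------|--------------|-------|
| 0 | 0 | 0 | 0 | 1200 |
| 1 | 1 | 0 | 0 | 3500 |
| 2 | 1 | 1 | 1 | 800 |
| 3 | 0 | 0 | 0 | 1800 |
| 4 | 0 | 1 | 0 | 2800 |
| 5 | 1 | 0 | 0 | 4200 |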
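
A quick way to see whether an interaction flag separates the target at all is to compare group statistics. A minimal sketch using the values from the output above; the `check` and `summary` names are assumptions.

```python
import pandas as pd

# Values copied from the output above.
check = pd.DataFrame({
    'weekend_peak': [0, 0, 1, 0, 0, 0],
    'total': [1200, 3500, 800, 1800, 2800, 4200],
})

# Number of orders and average total for each value of the flag.
summary = check.groupby('weekend_peak')['total'].agg(['count', 'mean'])
print(summary)
```

With a single weekend-peak order among six rows, the comparison says nothing reliable in either direction; on a real dataset the same check would be run on far more orders.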


---

## 5. Real-World Example 2: Customer Lifetime Value (CLV)

We have 6 customers:

| Customer | Age | Income (Lakhs) | Tenure (Years) | Purchases |
|---------|-----|----------------|----------------|-----------|
| 1 | 25 | 3 | 1 | 2 |
| 2 | 45 | 8 | 5 | 10 |
| 3 | 35 | 6 | 3 | 6 |
| 4 | 50 | 4 | 2 | 4 |
| 5 | 23 | 9 | 1 | 3 |
| 6 | 40 | 5 | 4 | 8 |

**Goal**: Predict **future value**

In [6]:
df_clv = pd.DataFrame({
    'age': [25, 45, 35, 50, 23, 40],
    'income_lakhs': [3, 8, 6, 4, 9, 5],
    'tenure_years': [1, 5, 3, 2, 1, 4],
    'purchases': [2, 10, 6, 4, 3, 8]
})

print("Raw Customer Data:")
df_clv

Raw Customer Data:


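|   | age | income_lakhs | tenure_years | purchases |
|---|-----|--------------|--------------|-----------|
| 0 | 25 | 3 | 1 | 2 |
| 1 | 45 | 8 | 5 | 10 |
| 2 | 35 | 6 | 3 | 6 |
| 3 | 50 | 4 | 2 | 4 |
| 4 | 23 | 9 | 1 | 3 |
| 5 | 40 | 5 | 4 | 8 |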


### New Features

In [7]:
# 1. Purchase Frequency
df_clv['purchase_freq'] = df_clv['purchases'] / df_clv['tenure_years']

# 2. Income per Age
df_clv['income_per_age'] = df_clv['income_lakhs'] / df_clv['age']

# 3. Interaction: High Income + Long Tenure
df_clv['income_tenure'] = df_clv['income_lakhs'] * df_clv['tenure_years']

print("\nAfter Feature Engineering:")
df_clv[['purchase_freq', 'income_per_age', 'income_tenure']].round(3)


After Feature Engineering:


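|   | purchase_freq | income_per_age | income_tenure |
|---|---------------|----------------|---------------|
| 0 | 2.0 | 0.12 | 3 |
| 1 | 2.0 | 0.178 | 40 |
| 2 | 2.0 | 0.171 | 18 |
| 3 | 2.0 | 0.08 | 8 |
| 4 | 3.0 | 0.391 | 9 |
| 5 | 2.0 | 0.125 | 20 |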
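
Ratio features such as `purchase_freq` break when the denominator is zero, for example a customer with less than one full year of tenure recorded as 0. One possible guard, shown on hypothetical rows, is to turn zero denominators into missing values before dividing.

```python
import numpy as np
import pandas as pd

# Hypothetical customers, including two with zero recorded tenure.
new_customers = pd.DataFrame({
    'purchases': [4, 3, 0],
    'tenure_years': [2, 0, 0],
})

# Replace zero tenure with NaN so the ratio is missing instead of inf.
safe_tenure = new_customers['tenure_years'].replace(0, np.nan)
new_customers['purchase_freq'] = new_customers['purchases'] / safe_tenure

print(new_customers)
```

How to fill the resulting missing values (or whether to measure tenure in months instead) is a modeling decision that depends on the data.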


---

## 6. Why Interaction Features Matter

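**Linear model without interaction** fits additive effects only: `b1·income + b2·tenure`  
**With interaction** it can also fit `b3·(income × tenure)`, so the effect of income can depend on tenure → captures **synergy**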

> **Example**: High income + long tenure = **very loyal rich customer**
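
To make the point concrete, the sketch below fits ordinary least squares with and without the product term. The data is synthetic and the target is, by assumption, exactly `income * tenure`; the helper `rmse_of_fit` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)
income = rng.uniform(1, 10, size=200)
tenure = rng.uniform(1, 10, size=200)

# Synthetic target driven purely by the product (an assumption for the demo).
value = income * tenure

def rmse_of_fit(columns, y):
    """Fit ordinary least squares with an intercept and return the RMSE."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sqrt(np.mean((X @ coef - y) ** 2)))

# Additive model: intercept + income + tenure.
rmse_additive = rmse_of_fit([income, tenure], value)

# Same model plus the interaction column.
rmse_interaction = rmse_of_fit([income, tenure, income * tenure], value)

print(rmse_additive, rmse_interaction)
```

The additive fit leaves a sizeable error, while the fit with the interaction column is exact up to floating-point noise. Tree-based models can approximate such products on their own, so the gain from an explicit interaction term is usually largest for linear models.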

---

## 7. Best Practices

1. **Use domain knowledge**: talk to business stakeholders
2. **Avoid data leakage**: don't use the target or any information unavailable at prediction time
3. **Validate with a model**: does performance on held-out data improve?
4. **Keep it interpretable**: `is_weekend` > `feature_7`
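
One way to keep aggregated features free of leakage is to fix a prediction cutoff and build features only from events before it. A minimal sketch on a hypothetical order log; the table, column names, and cutoff date are assumptions.

```python
import pandas as pd

# Hypothetical order log; values are made up for illustration.
log = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2],
    'order_date': pd.to_datetime([
        '2025-01-05', '2025-02-10', '2025-03-20',
        '2025-01-15', '2025-03-25',
    ]),
    'amount': [100, 200, 400, 50, 500],
})

CUTOFF = pd.Timestamp('2025-03-01')  # prediction time (assumed)

# Features: only orders placed before the cutoff.
past = log[log['order_date'] < CUTOFF]
features = past.groupby('customer_id')['amount'].sum().rename('spend_before_cutoff')

# Target: spend on or after the cutoff, which the features must never include.
future = log[log['order_date'] >= CUTOFF]
target = future.groupby('customer_id')['amount'].sum().rename('spend_after_cutoff')

# One row per customer: past-only feature next to the future target.
dataset = pd.concat([features, target], axis=1).reset_index()
print(dataset)
```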

---

## 8. Summary Table

| Feature Type | Example | Can Boost Model? |
|--------------|--------|------------------|
| **Binary Flag** | `is_weekend` | Yes, if validated |
| **Ratio** | `purchase_freq` | Yes, if validated |
| **Interaction** | `income × tenure` | Yes, if validated |
| **Time-based** | `is_peak_hour` | Yes, if validated |

**Key Takeaway**:
> **Good features often matter more than fancy models**  
> **Create features that make business sense**  
> **Always validate impact**
