
# Feature Engineering for Tabular & Time-Series Data  
**Level:** Beginner → Intermediate  **Duration:** ≈ 2 hours  

Feature engineering transforms raw data into informative features that boost model performance.  
This notebook covers both **tabular** and **time-series** techniques:

* Handling missing values  
* Encoding categorical variables  
* Binning & discretization  
* Feature scaling & transformation  
* Feature extraction (datetime parts, polynomial terms)  
* Interaction features  
* Feature-selection methods  
* Time-series specifics (lags, rollings, seasonal features)  
* Automated FE with **tsfresh** & **Featuretools**

> **How to use this notebook**  
> 1. Run the cells in order.  
> 2. Tweak code or plug in your own data.  
> 3. Install extra libraries when prompted.


In [None]:
# --- Setup & Sample Data -----------------------------------
!pip install seaborn tsfresh featuretools --quiet

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
df = sns.load_dataset("titanic")
print("Titanic shape:", df.shape)
df.head()

Titanic shape: (891, 15)


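| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |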


## 1  Handling Missing Values  

Counting missing values per column with `df.isnull().sum()` (the inspection cell later in this section) shows that `age` is missing for 177 of the 891 passengers and `embarked` for 2, while `deck` is missing for the majority of passengers. (The `deck` feature indicates the passenger's deck level on the ship; many entries are empty because not every passenger has a recorded deck.)

**Common strategies**

| Strategy | When to use | Caveats |
|----------|-------------|---------|
| **Drop rows/cols** | few NaNs or column nearly empty | data loss |
| **Impute constant** | categorical “Unknown”, numeric 0 | may hide signal |
| **Statistical impute** | mean/median/mode | assumes missing at random |
| **Model-based impute** | KNN / Iterative | heavier, possible bias |
| **Missing flag** | when “missingness” is informative | add extra column |
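
Two rows of the table, the missing flag and model-based imputation, are not exercised on the Titanic data in this notebook. The sketch below shows one way they might look. It assumes a small made-up frame (`toy` and its values are illustrative, not taken from the dataset) and uses scikit-learn's `KNNImputer` with an arbitrary `n_neighbors=2`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: two ages are missing
toy = pd.DataFrame({
    "age":  [22.0, np.nan, 26.0, 35.0, np.nan, 54.0],
    "fare": [7.25, 71.28, 7.93, 53.10, 8.05, 51.86],
})

# Missing flag: record where age was absent *before* filling it
toy["age_missing"] = toy["age"].isna().astype(int)

# Model-based imputation: each gap is filled with the mean age of the
# 2 nearest rows, where distance is measured on the observed columns (fare here)
knn = KNNImputer(n_neighbors=2)
toy[["age", "fare"]] = knn.fit_transform(toy[["age", "fare"]])

print(toy)
```

Creating the flag before imputing matters: once the gaps are filled, the information about which rows were originally missing is gone.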


**Strategy 1:** Removing missing data

If a column is mostly missing (for example, `deck` is missing for the majority of passengers), it is often prudent to drop the column entirely, since it carries little usable information. Likewise, if only a few rows have missing values, dropping those rows costs little data. We first count the missing values per column, then drop the `deck` column and check how many rows would remain if we also dropped every row that still contains a missing value:



In [None]:
# --- Missing-value inspection ------------------------------
df.isnull().sum()


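| Column | Missing values |
|--------|----------------|
| survived | 0 |
| pclass | 0 |
| sex | 0 |
| age | 177 |
| sibsp | 0 |
| parch | 0 |
| fare | 0 |
| embarked | 2 |
| class | 0 |
| who | 0 |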


In [None]:
# --- Drop 'deck' and view size impact ----------------------
df = df.drop(columns=["deck"])
print("Cols after drop:", df.columns.tolist())
print("Rows after dropping any-NaN rows:",
      len(df.dropna()), "of", len(df))


Cols after drop: ['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'class', 'who', 'adult_male', 'embark_town', 'alive', 'alone']
Rows after dropping any-NaN rows: 712 of 891
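
The decision to drop `deck` was made by eye. With many columns, the same rule ("drop a column if it is mostly missing") can be applied programmatically. A minimal sketch, assuming a made-up frame and a hypothetical cutoff of 50% missing (`toy`, `max_missing`, and the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'deck' is missing in 3 of 4 rows
toy = pd.DataFrame({
    "age":  [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.93, 53.10],
})

# Fraction of missing values per column
missing_frac = toy.isna().mean()

# Keep only columns whose missing fraction is at or below the cutoff
max_missing = 0.5  # hypothetical threshold; tune it for your data
keep_cols = missing_frac[missing_frac <= max_missing].index
reduced = toy[keep_cols]

print(missing_frac.to_dict())
print(reduced.columns.tolist())
```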


The `dropna()` check above shows that only 712 of 891 rows would survive if we dropped every row with a missing value, a loss of 179 passengers (about 20%). That would throw away useful information, so we impute the remaining missing values in `age` and `embarked` instead.

**Strategy 2:** Imputing missing values

- For the numeric `age` feature, a common choice is to fill missing ages with the median age (the median is preferred over the mean when the distribution is skewed or has outliers).
- For the categorical `embarked` feature, we can fill missing entries with the mode (the most common port of embarkation).

We'll perform these imputations with pandas. (Alternatively, one can use scikit-learn's `SimpleImputer`, which we show afterwards.)

In [None]:
# Impute missing Age with median, and missing Embarked with mode
median_age = df['age'].median()
mode_embarked = df['embarked'].mode()[0]  # mode() returns a Series
print("Imputing missing ages with median:", median_age)
print("Imputing missing embarked with mode:", mode_embarked)

df['age'] = df['age'].fillna(median_age)
df['embarked'] = df['embarked'].fillna(mode_embarked)

# Verify no missing values remain in age and embarked
print("Remaining missing values:", df[['age','embarked']].isnull().sum().to_dict())


Imputing missing ages with median: 28.0
Imputing missing embarked with mode: S
Remaining missing values: {'age': 0, 'embarked': 0}


After imputation, the `age` and `embarked` columns have no missing values. Missing ages were filled with the median (28.0 years) and missing ports with the most common value ("S", Southampton).

**Using scikit-learn's Imputer:**
For completeness, let's also demonstrate using scikit-learn's SimpleImputer to fill missing values. This is useful when building machine learning pipelines, so that imputation is combined with modeling steps and can be applied consistently to training and test data.

In [None]:
# --- Pipeline-friendly imputation demo ---------------------
from sklearn.impute import SimpleImputer

sample = sns.load_dataset("titanic").drop(columns=["deck"])
imp_med  = SimpleImputer(strategy="median")
imp_freq = SimpleImputer(strategy="most_frequent")

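# fit_transform returns a 2-D array of shape (n, 1); ravel() flattens it to one column
sample["age"]      = imp_med.fit_transform(sample[["age"]]).ravel()
sample["embarked"] = imp_freq.fit_transform(sample[["embarked"]]).ravel()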
sample[["age","embarked"]].isna().sum()
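
The cell above fits and transforms the same data, so it does not yet show the train/test consistency mentioned earlier. A minimal sketch of that idea, assuming a made-up frame split by position (`toy`, the split point, and the values are illustrative): the imputers learn their fill values from the training rows only and reuse them on the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy data; the last two rows play the role of a test set
toy = pd.DataFrame({
    "age":      [22.0, np.nan, 26.0, 35.0, np.nan, 54.0, np.nan, 27.0],
    "embarked": ["S", "C", np.nan, "S", "Q", "S", "C", np.nan],
})
train, test = toy.iloc[:6], toy.iloc[6:]

# One imputer per column type, bundled so they are fitted and applied together
imputer = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age"]),
    ("cat", SimpleImputer(strategy="most_frequent"), ["embarked"]),
])

train_filled = imputer.fit_transform(train)  # learns median and mode from train only
test_filled  = imputer.transform(test)       # reuses the train statistics

print(test_filled)
```

Here the missing test age is filled with the training median (30.5) and the missing test port with the training mode ("S"), regardless of what the test rows contain.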


## 2  Encoding Categorical Variables  

* **One-Hot** for nominal (sex, embarked)  
* **Ordinal** for ordered (First > Second > Third)  
* Avoid plain label-encoding on nominal features: the integer codes imply an order that does not exist.

We’ll one-hot `sex` & `embarked`, then ordinal-encode `class`.


In [None]:
# --- One-Hot encode ----------------------------------------
dummies = pd.get_dummies(df[["sex","embarked"]], drop_first=False)
df      = pd.concat([df, dummies], axis=1).drop(columns=["sex","embarked"])
df.head(3)


In [None]:
# --- Ordinal encode 'class' --------------------------------
class_map = {"Third":1,"Second":2,"First":3}
# 'class' is a pandas categorical, so cast the mapped result to plain integers
df["class_encoded"] = df["class"].map(class_map).astype(int)
df[["class","class_encoded"]].head()
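
The pandas calls above are convenient for exploration. Inside a modelling pipeline, scikit-learn's encoders are a common alternative because they remember the categories seen during fitting. A minimal sketch, assuming a made-up frame (`toy` is illustrative); note that `OrdinalEncoder` numbers categories from 0, so the codes differ from the 1 to 3 mapping used in `class_map`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data
toy = pd.DataFrame({
    "sex":   ["male", "female", "female"],
    "class": ["Third", "First", "Second"],
})

# One-hot: categories are sorted alphabetically (female, male);
# handle_unknown="ignore" encodes unseen values as all zeros
ohe = OneHotEncoder(handle_unknown="ignore")
sex_ohe = ohe.fit_transform(toy[["sex"]]).toarray()

# Ordinal: pass the order explicitly, lowest to highest
ord_enc = OrdinalEncoder(categories=[["Third", "Second", "First"]])
class_ord = ord_enc.fit_transform(toy[["class"]]).ravel()

# A value never seen during fit becomes a row of zeros instead of an error
unseen = ohe.transform(pd.DataFrame({"sex": ["unknown"]})).toarray()

print(sex_ohe)
print(class_ord)
print(unseen)
```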


## 3  Binning & Discretization  

Why bin? Robust to outliers, capture step-wise effects, simplify models.

* **Domain bins** – age groups  
* **Quantile bins** – fare quartiles  
* **KBinsDiscretizer** – automated (uniform / quantile)

We’ll create `AgeGroup` (Child/Teenager/Adult/Senior) and fare quartiles.


In [None]:
# --- Age domain bins ---------------------------------------
bins   = [0,12,19,59,np.inf]
labels = ["Child","Teenager","Adult","Senior"]
df["AgeGroup"] = pd.cut(df["age"], bins=bins, labels=labels)
df[["age","AgeGroup"]].head()


In [None]:
# --- Fare quartile bins ------------------------------------
df["Fare_bin"] = pd.qcut(df["fare"], q=4, labels=["Q1","Q2","Q3","Q4"])
df["Fare_bin"].value_counts()


In [None]:
# --- KBinsDiscretizer demo ---------------------------------
from sklearn.preprocessing import KBinsDiscretizer
kb = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
# fit_transform returns shape (n, 1); ravel() flattens it to one column of bin indices
df["age_bin3"] = kb.fit_transform(df[["age"]]).ravel()
df[["age","age_bin3"]].head()


## 4  Feature Scaling & Transformation  

Distance- and gradient-based models (k-NN, SVM, neural nets) need features on comparable scales; tree-based models are largely insensitive to scaling.

* **StandardScaler** – mean 0, std 1  
* **MinMaxScaler** – 0 → 1  
* **Log / Box-Cox** – fix skew

We’ll scale `age`, `fare`, `sibsp`, `parch`, and add a log-fare.


In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
num_cols = ["age","fare","sibsp","parch"]
orig     = df[num_cols].copy()

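std = StandardScaler().fit_transform(orig)   # each column: mean 0, std 1
mm  = MinMaxScaler().fit_transform(orig)     # each column: rescaled to [0, 1]

print("Std-scaled sample:\n", pd.DataFrame(std, columns=num_cols).head())
print("Min-max sample:\n",    pd.DataFrame(mm,  columns=num_cols).head())

# Some fares are 0, so add 1 before taking the log: zero fares then map to 0
# instead of an extreme negative value (log10(1e-5) = -5)
df["fare_log10"] = np.log10(df["fare"] + 1)
df[["fare","fare_log10"]].head()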
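
Box-Cox is listed above but not demonstrated. It requires strictly positive input, so for data that contains zeros (as fares can) the sketch below swaps in the closely related Yeo-Johnson transform via scikit-learn's `PowerTransformer`. The `fares` array is a made-up sample, not the full column.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Hypothetical right-skewed sample that includes a zero
fares = np.array([[0.0], [7.25], [7.925], [8.05], [53.1], [71.2833], [512.3292]])

# Yeo-Johnson handles zeros and negatives; Box-Cox would raise an error here.
# By default the output is also standardized to mean 0, std 1.
pt = PowerTransformer(method="yeo-johnson")
fares_t = pt.fit_transform(fares)

print("fitted lambda:", pt.lambdas_[0])
print(fares_t.ravel().round(2))
```

The transformer estimates its power parameter from the data, so, like the scalers, it should be fitted on training data only when used for modelling.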


## 5  Feature Extraction  

### 5.1 Datetime Parts  
Break timestamps into year, month, day of week (`dow`), hour, etc.

### 5.2 Polynomial Features  
`PolynomialFeatures(degree=2)` adds squares & interactions.


In [None]:
# --- Datetime parts demo -----------------------------------
dates = pd.DataFrame({"purchase_date": pd.to_datetime([
    "2021-01-01 14:23", "2021-07-15 09:00",
    "2022-03-05 20:45", "2022-03-06 12:00",
    "2022-12-25 00:00"])})

dates["year"]  = dates["purchase_date"].dt.year
dates["month"] = dates["purchase_date"].dt.month
dates["dow"]   = dates["purchase_date"].dt.day_name()
dates["hour"]  = dates["purchase_date"].dt.hour
dates


In [None]:
# --- PolynomialFeatures demo -------------------------------
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)

af_poly = poly.fit_transform(df[["age","fare"]].fillna(0))
pd.DataFrame(af_poly, columns=poly.get_feature_names_out(["age","fare"])).head()


## 6  Interaction Features  

Domain-driven combos often matter:

* `family_size = sibsp + parch + 1`  
* `is_alone` flag  
* `fare_per_person` ratio  
* Categorical combos like `sex_pclass`


In [None]:
# --- Family size & friends ---------------------------------
df["family_size"] = df["sibsp"] + df["parch"] + 1
df["is_alone"]    = (df["family_size"] == 1).astype(int)
df["fare_per_person"] = df["fare"] / df["family_size"]
df[["sibsp","parch","family_size","is_alone","fare_per_person"]].head()


In [None]:
# --- sex_pclass combo (fresh dataset for demo) -------------
raw = sns.load_dataset("titanic")
raw["sex_pclass"] = raw["sex"] + "_" + raw["pclass"].astype(str)
raw["sex_pclass"].value_counts().head()


## 7  Feature Selection  

* **Filter** – SelectKBest(ANOVA)  
* **Embedded** – tree importances  
* **Wrapper** – RFE

We’ll build a synthetic dataset and try each.


In [None]:
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.ensemble  import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import numpy as np

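# shuffle=False keeps the column order: 0-4 informative, 5-9 redundant, 10-19 noise,
# so the selections printed below can be checked against the known answer
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, n_redundant=5,
                           random_state=42, shuffle=False)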

# Filter
flt = SelectKBest(f_classif, k=5).fit(X, y)
print("Filter selected:", flt.get_support(indices=True))

# Embedded
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
print("Top RF features:", np.argsort(rf.feature_importances_)[::-1][:5])

# Wrapper
rfe = RFE(LogisticRegression(max_iter=1000, solver="liblinear"), n_features_to_select=5)
rfe.fit(X, y)
print("RFE selected:", np.where(rfe.support_)[0])


## 8  Time-Series Feature Engineering  

Key ideas: **lags, rolling stats, diffs, seasonal parts**.


In [None]:
# --- Simulate monthly series -------------------------------
idx   = pd.date_range("2019-01", periods=36, freq="MS")  # month-start dates; the "M" alias is deprecated in pandas 2.2+
t     = np.arange(36)
np.random.seed(0)
vals  = 10 + 0.5*t + 5*np.sin(2*np.pi*t/12) + np.random.normal(0,2,36)
ts    = pd.DataFrame({"Value": vals}, index=idx)

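# Lag & rolling
# Note: roll3 and diff1 include the current Value; shift them by one step
# before using them as predictors of that same row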
ts["lag1"]   = ts["Value"].shift(1)
ts["lag12"]  = ts["Value"].shift(12)
ts["roll3"]  = ts["Value"].rolling(3).mean()
ts["diff1"]  = ts["Value"].diff()
ts.tail()
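
The cell above covers lags, rolling statistics, and differences, but not the seasonal parts. A minimal sketch of two common options, calendar fields and a cyclical sine/cosine encoding of the month, assuming a standalone monthly index (`season` and the column names are illustrative):

```python
import numpy as np
import pandas as pd

# Two years of month-start dates (hypothetical index for illustration)
idx = pd.date_range("2019-01-01", periods=24, freq="MS")
season = pd.DataFrame(index=idx)

# Calendar parts
season["month"]   = season.index.month
season["quarter"] = season.index.quarter

# Cyclical encoding: maps months onto a circle so December sits next to January
angle = 2 * np.pi * (season["month"] - 1) / 12
season["month_sin"] = np.sin(angle)
season["month_cos"] = np.cos(angle)

print(season.head(3))
```

A plain integer month treats December (12) and January (1) as far apart; the sine/cosine pair keeps them adjacent, which tends to suit linear models and neural nets better.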


## 9  Automated Feature Engineering  

### 9.1 tsfresh – exhaustive stats for each time-series  
### 9.2 Featuretools – Deep Feature Synthesis for relational data


In [None]:
# --- tsfresh minimal example -------------------------------
from tsfresh.feature_extraction import extract_features
# Pass only the raw series: the lag/rolling columns contain NaNs, which tsfresh rejects
mini = ts[["Value"]].reset_index().rename(columns={"index":"time"})
mini["id"] = 1
features = extract_features(mini, column_id="id", column_sort="time",
                            default_fc_parameters={"mean":None,"median":None})
features


In [None]:
# --- Featuretools mock-customer demo -----------------------
import featuretools as ft
data = ft.demo.load_mock_customer()
es = ft.EntitySet(id="cust")
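# add_dataframe expects the dataframe as its first positional argument, so pass by keyword
es = es.add_dataframe(dataframe_name="customers",    dataframe=data["customers"],    index="customer_id")
es = es.add_dataframe(dataframe_name="sessions",     dataframe=data["sessions"],     index="session_id",     time_index="session_start")
es = es.add_dataframe(dataframe_name="transactions", dataframe=data["transactions"], index="transaction_id", time_index="transaction_time")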
es = es.add_relationship(parent_dataframe_name="customers",  parent_column_name="customer_id",
                         child_dataframe_name="sessions",    child_column_name="customer_id")
es = es.add_relationship(parent_dataframe_name="sessions",   parent_column_name="session_id",
                         child_dataframe_name="transactions",child_column_name="session_id")
fm, defs = ft.dfs(entityset=es, target_dataframe_name="customers", max_depth=2)
fm.head()
