<a href="https://colab.research.google.com/github/IyadSultan/AI_pediatric_oncology/blob/main/09_Feature_engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature Engineering for Tabular & Time-Series Data  
**Level:** Beginner → Intermediate  **Duration:** ≈ 2 hours  

Feature engineering transforms raw data into informative features that boost model performance.  
This notebook covers both **tabular** and **time-series** techniques:

* Handling missing values  
* Encoding categorical variables  
* Binning & discretization  
* Feature scaling & transformation  
* Feature extraction (datetime parts, polynomial terms)  
* Interaction features  
* Feature-selection methods  
* Time-series specifics (lags, rollings, seasonal features)  
* Automated FE with **tsfresh** & **Featuretools**

> **How to use this notebook**  
> 1. Run the cells in order.  
> 2. Tweak code or plug in your own data.  
> 3. Install extra libraries when prompted.


In [None]:
# --- Setup & Sample Data -----------------------------------
!pip install seaborn tsfresh featuretools --quiet

In [None]:
import numpy as np, pandas as pd, seaborn as sns
df = sns.load_dataset("titanic")
print("Titanic shape:", df.shape)
df.head()

Titanic shape: (891, 15)


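| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | | Southampton | no | True |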


## 1  Handling Missing Values  

The Titanic dataset has missing values in several columns. The missing-value counts (inspected in the code below) show that age is missing for 177 of the 891 passengers, embarked for only 2, and deck for most passengers. (The deck feature indicates passenger deck levels on the ship; many entries are missing because not all passengers have a recorded deck.)

**Common strategies**

| Strategy | When to use | Caveats |
|----------|-------------|---------|
| **Drop rows/cols** | few NaNs or column nearly empty | data loss |
| **Impute constant** | categorical “Unknown”, numeric 0 | may hide signal |
| **Statistical impute** | mean/median/mode | assumes missing at random |
| **Model-based impute** | KNN / Iterative | heavier, possible bias |
| **Missing flag** | when “missingness” is informative | add extra column |
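
The last two rows of the table can be illustrated on a small made-up frame. This is only a sketch: the `toy` data, its values, and the choice of `n_neighbors=2` are assumptions for illustration, not part of the Titanic workflow below. It adds a missing-value flag and then fills the gaps with scikit-learn's `KNNImputer`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: 'age' has gaps, 'fare' is complete
toy = pd.DataFrame({
    "age":  [22.0, np.nan, 26.0, 35.0, np.nan, 54.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86],
})

# Missing flag: record where 'age' was absent *before* filling it
toy["age_missing"] = toy["age"].isna().astype(int)

# Model-based imputation: fill each gap with the mean 'age' of the
# 2 nearest rows, with distance measured on the observed columns
knn = KNNImputer(n_neighbors=2)
toy[["age", "fare"]] = knn.fit_transform(toy[["age", "fare"]])

print(toy)
```

The flag has to be created before imputation; afterwards the information about which values were originally missing is gone.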


**Strategy 1:** Removing missing data
If a column is mostly missing (for example, deck is missing for the majority of passengers), it might be prudent to drop that column entirely, as it may not be very useful. Similarly, if only a few rows have missing data in critical columns, and dropping them doesn't lose too much data, we might drop those rows. Let's first inspect the missing-value counts, then drop the deck column and see how many rows would remain if we dropped every row with a missing value:



In [None]:
# --- Missing-value inspection ------------------------------
df.isnull().sum()


survived,0
pclass,0
sex,0
age,177
sibsp,0
parch,0
fare,0
embarked,2
class,0
who,0


In [None]:
# --- Drop 'deck' and view size impact ----------------------
df = df.drop(columns=["deck"])
print("Cols after drop:", df.columns.tolist())
print("Rows after dropping any-NaN rows:",
      len(df.dropna()), "of", len(df))


Cols after drop: ['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'class', 'who', 'adult_male', 'embark_town', 'alive', 'alone']
Rows after dropping any-NaN rows: 712 of 891


The output shows how many rows would be left after dropping every row with a missing value: 712 of 891, so 179 passengers (about 20%) would be discarded. We run this check to gauge the impact: when many rows would be lost, imputation is usually preferable because it keeps the information in the other columns. Since we want to keep as much data as possible, let's impute the remaining missing values (age and embarked).

**Strategy 2:** Imputing missing values
- For the numeric age feature, a common choice is to fill missing ages with the median age (the median is preferred over the mean when the distribution is skewed or has outliers).
- For the categorical embarked feature, we can fill missing entries with the mode (the most common port of embarkation).

We'll perform these imputations using pandas. (Alternatively, one can use scikit-learn's SimpleImputer; we'll show that as well.)

In [None]:
# Impute missing Age with median, and missing Embarked with mode
median_age = df['age'].median()
mode_embarked = df['embarked'].mode()[0]  # mode() returns a Series
print("Imputing missing ages with median:", median_age)
print("Imputing missing embarked with mode:", mode_embarked)

df['age'] = df['age'].fillna(median_age)
df['embarked'] = df['embarked'].fillna(mode_embarked)

# Verify no missing values remain in age and embarked
print("Remaining missing values:", df[['age','embarked']].isnull().sum().to_dict())


Imputing missing ages with median: 28.0
Imputing missing embarked with mode: S
Remaining missing values: {'age': 0, 'embarked': 0}


After imputation, the age and embarked columns have no missing values. We filled age with the median (28.0 years) and embarked with the most common port ("S", Southampton).

**Using scikit-learn's Imputer:**
For completeness, let's also demonstrate using scikit-learn's SimpleImputer to fill missing values. This is useful when building machine learning pipelines, so that imputation is combined with modeling steps and can be applied consistently to training and test data.

In [None]:
# --- Pipeline-friendly imputation demo ---------------------
from sklearn.impute import SimpleImputer

sample = sns.load_dataset("titanic").drop(columns=["deck"])
imp_med  = SimpleImputer(strategy="median")
imp_freq = SimpleImputer(strategy="most_frequent")

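# fit_transform returns a 2-D array of shape (n, 1); ravel() flattens it to one column
sample["age"]      = imp_med.fit_transform(sample[["age"]]).ravel()
sample["embarked"] = imp_freq.fit_transform(sample[["embarked"]]).ravel()
sample[["age","embarked"]].isna().sum()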
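
To show what "applied consistently to training and test data" can look like, here is a minimal sketch that learns the fill values on a training frame and reuses them on a test frame. The `train` and `test` frames are made-up assumptions; `ColumnTransformer` is one way to bundle the two imputers, not the only one.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical train/test frames with gaps in both columns
train = pd.DataFrame({"age": [20.0, 30.0, np.nan, 40.0],
                      "embarked": ["S", "S", "C", np.nan]})
test = pd.DataFrame({"age": [np.nan, 50.0],
                     "embarked": [np.nan, "Q"]})

imputer = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age"]),
    ("cat", SimpleImputer(strategy="most_frequent"), ["embarked"]),
])

# Learn the fill values (median age, most frequent port) from the training data only
imputer.fit(train)

# Reuse those same fill values on the test data
test_filled = pd.DataFrame(imputer.transform(test), columns=["age", "embarked"])
print(test_filled)
```

The missing test age is filled with the training median and the missing test port with the training mode, so no statistic is ever computed from the test rows.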


## 2  Encoding Categorical Variables  

* **One-Hot** for nominal (sex, embarked)  
* **Ordinal** for ordered (First > Second > Third)  
* Avoid plain label-encoding on nominal features.



Many machine learning algorithms require numeric input and cannot directly handle categorical strings. Encoding categorical features means converting category labels into numerical values. The right technique depends on the nature of the categorical data:

- One-Hot Encoding (Dummy Variables): Create a new binary column for each category value, indicating presence (1) or absence (0) of that category for each observation. This is suitable for nominal categories (no natural order), e.g., embarked (C/Q/S) or sex (male/female). One-hot encoding avoids implying any ordinal relationship.
- Ordinal Encoding (Label Encoding with order): Map each category to an integer value (e.g., 1, 2, 3) based on some order. This is suitable for ordinal categories where the categories have an inherent rank. For example, pclass (passenger class) is 1, 2, 3 for first, second, third class. Here 1st > 2nd > 3rd in terms of luxury, so we could encode 1st=3, 2nd=2, 3rd=1 or similar to preserve that order.
- Label Encoding (arbitrary integers): Assign an arbitrary numeric code to each category (e.g., red=0, green=1, blue=2). This is quick, but not recommended for nominal categories, because the model may interpret 2 > 1 > 0 as implying an order or magnitude. Use it only for ordinal data, or with tree-based models, which split on thresholds and are therefore less sensitive to the arbitrary order of the codes.
- Frequency or Target Encoding (advanced): Replace categories with their frequency or with target variable statistics (like mean target value per category). These are more advanced techniques often used in competitions, but they require caution to avoid overfitting (usually by computing the statistics within a cross-validation scheme).
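
The last bullet can be sketched on a tiny made-up table. The `toy` frame, the column names, and the choice of 4 folds are assumptions for illustration only. The target encoding is computed out-of-fold, which is one common way to limit the overfitting risk mentioned above.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical toy data: a nominal feature and a binary target
toy = pd.DataFrame({
    "port":     ["S", "S", "C", "Q", "S", "C", "S", "Q"],
    "survived": [0,   1,   1,   0,   0,   1,   1,   0],
})

# Frequency encoding: replace each category with its relative frequency
freq = toy["port"].value_counts(normalize=True)
toy["port_freq"] = toy["port"].map(freq)

# Out-of-fold target encoding: each row is encoded with the target mean
# computed on the *other* folds, so a row never sees its own label
global_mean = toy["survived"].mean()
toy["port_target"] = global_mean
kfold = KFold(n_splits=4, shuffle=True, random_state=0)
for fit_idx, enc_idx in kfold.split(toy):
    fold_means = toy.iloc[fit_idx].groupby("port")["survived"].mean()
    # Categories absent from the fitting folds fall back to the global mean
    encoded = toy.iloc[enc_idx]["port"].map(fold_means).fillna(global_mean)
    toy.loc[toy.index[enc_idx], "port_target"] = encoded.values

print(toy)
```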

Let's demonstrate encoding on the Titanic dataset for the sex (binary nominal) and embarked (nominal with 3 values) columns. We will use one-hot encoding for these. Pandas provides a convenient method pd.get_dummies for one-hot encoding.


In [1]:
# One-hot encode 'sex' and 'embarked' features
print("Unique values in 'sex':", df['sex'].unique())
print("Unique values in 'embarked':", df['embarked'].unique())

encoded_df = pd.get_dummies(df[['sex', 'embarked']], drop_first=False)
encoded_df.head()



By default, get_dummies creates a column for each category. We see new columns like sex_female, sex_male, embarked_C, embarked_Q, embarked_S holding 0/1 indicators (displayed as False/True in pandas 2.0 and later, where the default dtype is bool). We could drop one dummy column per feature (using drop_first=True) to avoid redundancy (for example, if we know a passenger is not male, they must be female, so one of the two is redundant). In practice, dropping the first dummy is often done to avoid multicollinearity issues in linear models, but for tree-based models it's not necessary. We'll keep all dummies here for clarity.

We can concatenate these new dummy columns back to our dataframe (or directly integrate this step in a pipeline). Once the dummies are attached, the original sex and embarked columns are redundant, so we drop them to avoid duplication:

In [None]:
# Attach the dummy columns, then remove the original categorical columns
df = pd.concat([df, encoded_df], axis=1)
df = df.drop(columns=['sex', 'embarked'])
df.head(5)


For a category with a natural order we can use ordinal encoding instead. The class column (First, Second, Third) is a good example:

In [None]:
# Ordinal encode the 'class' column (First, Second, Third)
class_mapping = {"Third": 1, "Second": 2, "First": 3}
# 'class' is a pandas categorical, so cast the mapped result to plain integers
df['class_encoded'] = df['class'].map(class_mapping).astype(int)
print("Mapping 'class' -> numeric:", class_mapping)
df[['class', 'class_encoded']].head(5)


Now we have a new column class_encoded where Third=1, Second=2, First=3. This numeric representation implies an order (higher is better class). We still keep the original class string for reference; in modeling we would typically use the numeric version. Important: If using ordinal encoding, ensure that the order you impose makes sense for the problem. If not, it's safer to one-hot encode even ordinal features, because a model can learn an order if it exists, but if you impose a wrong order you might mislead the model.

**Scikit-Learn approach:**
We could also use sklearn.preprocessing.OneHotEncoder or OrdinalEncoder for these tasks, which is beneficial when building pipelines. They offer more control than the pandas helpers, such as handling categories that appear in the test data but were not seen during training.

At this point, our dataset's categorical features are encoded into numeric form, which means we can feed them into models. Next, let's look at transforming continuous features through binning and scaling.

## 3  Binning & Discretization  

Why bin? Robust to outliers, capture step-wise effects, simplify models.

* **Domain bins** – age groups  
* **Quantile bins** – fare quartiles  
* **KBinsDiscretizer** – automated (uniform / quantile)

Binning (discretization) is the process of converting a continuous feature into multiple bins or ranges, effectively turning it into a categorical or ordinal feature. This can be useful for several reasons:

- It can make the model more robust to outliers (since outlier values get put into a bin with a range).
- It can capture non-linear relationships by grouping values. For example, perhaps age has a non-linear effect on an outcome: very young and very old might behave similarly (as groups) compared to middle-aged. Binning can sometimes capture this pattern.
- It reduces the granularity of data, which can help some models (and also reduce overfitting in high-noise data).

However, binning also loses information (exact values are coarsened), so it should be used judiciously. It is more common in some contexts (like scoring systems) or when using certain algorithms that prefer categorical inputs (like some rule-based models).

**Techniques for binning:**
- Fixed-width binning: Divide the range of the feature into equal-width intervals. For example, ages 0-10, 10-20, 20-30, ... etc.
- Quantile binning (equal-frequency): Divide the data such that each bin has (approximately) equal number of observations. For example, quartiles (4 bins each containing 25% of data).
- Domain-specific binning: Define bins based on domain knowledge (e.g., age: infant, child, teen, adult, senior).
- Clustering-based binning: Using methods like k-means to find clusters and then assign bins (less common).

Let's apply binning to the age feature in the Titanic data as an example. We'll create an AgeGroup feature:
- Children: 0-12 years
- Teenagers: 13-19 years
- Adults: 20-59 years
- Seniors: 60+ years

This is a domain-driven choice of bins.


In [None]:
# Define age bins and labels
bins = [0, 12, 19, 59, np.inf]  # np.inf for any age above 59
labels = ['Child', 'Teenager', 'Adult', 'Senior']
df['AgeGroup'] = pd.cut(df['age'], bins=bins, labels=labels)
df[['age', 'AgeGroup']].head(10)


We used pd.cut to bin ages into the specified intervals. The new column AgeGroup is categorical with the labels we provided. We can inspect the distribution of these groups:

In [None]:
df['AgeGroup'].value_counts()


This shows how many passengers fall into each age group. AgeGroup can now be one-hot encoded or ordinal-encoded (there is an implied order: Child < Teenager < Adult < Senior), depending on how we want to use it. Some tree-based libraries (for example LightGBM and CatBoost) can also consume a categorical column directly.

**Equal-frequency binning example:**
Maybe we want exactly 4 bins each containing 25% of the observations (quartiles). We can use pd.qcut for quantile-based cuts:

In [None]:
# Quantile-based binning of Fare into 4 buckets (quartiles)
df['Fare_bin'] = pd.qcut(df['fare'], q=4, labels=['Q1','Q2','Q3','Q4'])
print(df[['fare', 'Fare_bin']].head(10))
print("\nFare_bin distribution:\n", df['Fare_bin'].value_counts())


The output shows each fare assigned a quartile label Q1-Q4. Each bin has (roughly) equal count of observations. This can be useful if the distribution of fare is highly skewed (which it is—most people paid low fare, a few paid very high fares). Binning can spread these out.

**KBinsDiscretizer (sklearn):**
Scikit-learn offers KBinsDiscretizer which can automate binning as part of a pipeline, with options for uniform or quantile strategy, and output as one-hot or ordinal. Here's a quick demonstration using KBinsDiscretizer to bin the age into 3 bins of equal width:

In [None]:
from sklearn.preprocessing import KBinsDiscretizer

X_age = df[['age']]  # need 2D array
kbins = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
age_binned = kbins.fit_transform(X_age)
print("Age -> Binned (uniform width, 3 bins):\n", age_binned[:10].ravel())


The above divides the age range into 3 equal-width bins and assigns the ordinal labels 0, 1, 2 to those bins. The output array shows the bin index for the first 10 ages. (Note: depending on your scikit-learn version, KBinsDiscretizer may emit a warning about changing defaults; that's fine for our small demonstration.)

In summary, binning is a way to simplify a continuous feature. It can be powerful when used appropriately (especially with certain models or when you suspect non-linear step changes in effect), but it also can throw away information, so consider it based on your analysis.

## 4  Feature Scaling & Transformation  

Models like k-NN, SVM, neural nets need comparable scales.

* **StandardScaler** – mean 0, std 1  
* **MinMaxScaler** – 0 → 1  
* **Log / Box-Cox** – fix skew

Feature scaling refers to methods used to normalize the range or distribution of features. Many machine learning algorithms perform better or converge faster during training when features are on similar scales:

- Distance-based models (k-NN, K-Means, SVM with RBF kernel, etc.) are sensitive to the scale of features because they rely on distances.
- Gradient-descent-based models (linear regression, logistic regression, neural networks) converge faster when features are scaled, because no single feature dominates the gradient due to its larger scale.
- Regularized models (like Lasso or Ridge regression) penalize coefficient size, and a coefficient's size depends on its feature's scale, so the penalty is applied evenly only when features are scaled similarly.
- Tree-based models (decision trees, random forests, XGBoost) generally do not require feature scaling, as they split based on thresholds and are scale-invariant for splits. But scaling does not hurt them either.
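
To make the distance argument concrete, here is a minimal sketch on four made-up passengers (the values and the helper `dist` are assumptions for illustration). In raw units the fare difference swamps the age difference; after standardization the two contribute on a comparable scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical passengers: columns are [age, fare]
X = np.array([[22.0,   7.25],
              [60.0,   8.05],    # very different age, similar fare
              [23.0, 263.00],    # similar age, very different fare
              [35.0,  53.10]])

def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Raw units: the fare gap is several times larger than the age gap
raw_age_gap = dist(X[0], X[1])
raw_fare_gap = dist(X[0], X[2])

# Standardized units: both gaps are of similar size
Xs = StandardScaler().fit_transform(X)
std_age_gap = dist(Xs[0], Xs[1])
std_fare_gap = dist(Xs[0], Xs[2])

print(f"raw:    age gap {raw_age_gap:.1f}, fare gap {raw_fare_gap:.1f}")
print(f"scaled: age gap {std_age_gap:.2f}, fare gap {std_fare_gap:.2f}")
```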


We’ll scale `age`, `fare`, `sibsp`, `parch`, and add a log-fare.


In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Take a subset of numeric features for demonstration
numeric_features = ['age', 'fare', 'sibsp', 'parch']  # sibsp = #siblings/spouses, parch = #parents/children
subset = df[numeric_features].copy()
print("Original scales:")
print(subset.describe().loc[['min','max','mean','std']].T)  # summary before scaling

# Standardization (Z-score)
scaler = StandardScaler()
subset_std = pd.DataFrame(scaler.fit_transform(subset), columns=numeric_features)

# Min-Max Scaling
minmax = MinMaxScaler()
subset_mm = pd.DataFrame(minmax.fit_transform(subset), columns=numeric_features)

print("\nAfter Standardization (mean ~0, std ~1):")
print(subset_std.describe().loc[['min','max','mean','std']].T)

print("\nAfter Min-Max Scaling (range 0 to 1):")
print(subset_mm.describe().loc[['min','max','mean','std']].T)


Looking at the output:
- After standardization, each feature has a mean very close to 0 and a standard deviation very close to 1 (pandas reports the sample standard deviation, so the value is slightly above 1). The min and max are not bounded to [0, 1]; they typically lie within a few standard deviations of 0.
- After min-max scaling, each feature's min is exactly 0 and its max is exactly 1, by construction. The mean and std vary by feature.
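
Both scalers apply simple formulas, which can be checked by hand. The sketch below uses a made-up array of ages (an assumption for illustration) and compares a manual computation with scikit-learn's result. It also shows why the standard deviation reported by pandas comes out slightly above 1: `StandardScaler` divides by the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

x = np.array([[22.0], [38.0], [26.0], [35.0], [54.0]])  # hypothetical ages

# Standardization: z = (x - mean) / std, using the population std (ddof=0)
z_manual = (x - x.mean()) / x.std(ddof=0)
z_sklearn = StandardScaler().fit_transform(x)

# Min-max scaling: x' = (x - min) / (max - min)
mm_manual = (x - x.min()) / (x.max() - x.min())
mm_sklearn = MinMaxScaler().fit_transform(x)

print("Standardization matches:", np.allclose(z_manual, z_sklearn))
print("Min-max matches:        ", np.allclose(mm_manual, mm_sklearn))

# The sample std (ddof=1, as in pandas .std() and .describe()) is sqrt(n / (n - 1))
print("Sample std of z-scores: ", z_sklearn.std(ddof=1))
```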

We can also directly compare a few rows before and after scaling to see how individual values change:

In [None]:
print("First 5 rows - original age and fare:")
print(subset[['age','fare']].head())

print("\nFirst 5 rows - after StandardScaler:")
print(subset_std[['age','fare']].head())

print("\nFirst 5 rows - after MinMaxScaler:")
print(subset_mm[['age','fare']].head())


Notice how an age of 22 years (for example) becomes around -0.6 after standardization (meaning 0.6 std below the mean age), and a fare of 7.25 becomes ~0.014 after min-max (very low on the 0-1 scale, since 7.25 is near the minimum fare).

**Log transform example:**
The fare distribution is highly skewed (a few very high fares). We can apply a log10 transform to compress the high end:

In [None]:
# Add 1 before the log so that zero fares map to 0.
# (A tiny constant such as 1e-5 would map them to -5, creating a new outlier.)
df['fare_log10'] = np.log10(df['fare'] + 1)
print("Fare vs log10(Fare) for first 5 entries:")
print(df[['fare','fare_log10']].head())

print("\nFare distribution stats:\n", df['fare'].describe())
print("\nlog10(Fare) distribution stats:\n", df['fare_log10'].describe())


After the log transform, the distribution of fare should be less skewed (the difference between min and max on the log scale is much smaller). A constant is added before taking the log because some fares are 0 and log(0) is undefined.

**When to scale:**

- Before using algorithms like k-NN, SVM, neural networks, or any model that uses gradient descent or distance, it's generally a good idea to scale features.

- Tree-based models (Decision Tree, Random Forest, Gradient Boosting) typically don't need scaling.

- Always apply the same scaling to training and test data (fit on train, apply to test) to avoid data leakage.
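
The "fit on train, apply to test" rule can be sketched as follows. The synthetic data, the split ratio, and the choice of k-NN are assumptions for illustration. The manual version makes the two steps explicit, and the pipeline version does the same thing automatically.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on different scales
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(30, 10, 200),     # age-like
                     rng.normal(50, 40, 200)])    # fare-like
y = (X[:, 0] > 30).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Manual version: the mean and std come from the training split only
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)   # reuses the training mean/std

# Pipeline version: fit() learns the scaling on the training data,
# score() applies that same scaling before predicting
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```

Calling `fit_transform` on the full dataset before splitting would let test-set statistics influence the training features, which is the leakage the rule guards against.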


We have scaled some features for demonstration, but for the next sections we'll often focus on the raw or minimally processed values, as feature engineering steps can be demonstrated without scaling in each case. Just keep in mind scaling is an important step in a modeling pipeline.

## 5  Feature Extraction

Feature extraction is about deriving new features from existing data. We create new representations that might be more informative for the model. This can involve:
- Breaking down a feature into components (e.g., date -> year, month, day).
- Combining features (e.g., area = length * width).
- Mathematical transformations (e.g., polynomial terms)
- Aggregations or statistical summaries (especially in grouped data or time series).

We'll explore a few common scenarios:

### 5.1 Datetime Parts  
Break timestamps into year, month, dow, hour, etc.



In [None]:
# Create a small DataFrame with some dates
date_df = pd.DataFrame({
    'purchase_date': pd.to_datetime([
        "2021-01-01 14:23:00",
        "2021-07-15 09:00:00",
        "2022-03-05 20:45:00",
        "2022-03-06 12:00:00",
        "2022-12-25 00:00:00"
    ])
})
# Extract various components
date_df['year'] = date_df['purchase_date'].dt.year
date_df['month'] = date_df['purchase_date'].dt.month
date_df['day'] = date_df['purchase_date'].dt.day
date_df['day_of_week'] = date_df['purchase_date'].dt.day_name()
date_df['hour'] = date_df['purchase_date'].dt.hour
date_df['weekofyear'] = date_df['purchase_date'].dt.isocalendar().week  # ISO week number
date_df


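In the above, we extracted several features:
- year: 2021 or 2022.
- month: 1-12.
- day: day of the month, 1-31.
- day_of_week: Monday, Tuesday, etc. (This is categorical; it could be encoded as numbers 0-6 or one-hot.)
- hour: 0-23.
- weekofyear: ISO 8601 week number (1-53). ISO weeks can straddle a year boundary: 2021-01-01 is assigned to week 53, because it belongs to the last ISO week of 2020.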

We could further extract a boolean like is_weekend by checking if day_of_week is Saturday/Sunday, or is_holiday if we have a list of holidays.
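
As a minimal sketch of those two flags (the dates and the one-entry `holidays` list are made-up assumptions; a real project would load an actual holiday calendar):

```python
import pandas as pd

dates = pd.DataFrame({"purchase_date": pd.to_datetime([
    "2022-03-04 18:00:00",   # Friday
    "2022-03-05 20:45:00",   # Saturday
    "2022-03-06 12:00:00",   # Sunday
    "2022-03-07 09:00:00",   # Monday
])})

# dt.dayofweek is 0 for Monday ... 6 for Sunday, so 5 and 6 are the weekend
dates["is_weekend"] = (dates["purchase_date"].dt.dayofweek >= 5).astype(int)

# Hypothetical holiday list; normalize() strips the time so dates compare equal
holidays = pd.to_datetime(["2022-03-07"])
dates["is_holiday"] = (
    dates["purchase_date"].dt.normalize().isin(holidays).astype(int)
)

print(dates)
```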

**Cyclical encoding:**
One thing to note: some of these features are cyclical (after December comes January, after Sunday comes Monday, hours wrap around after 23 to 0). If using them in a linear model, it can be beneficial to encode such features using sine/cosine transforms to capture the cyclic nature, e.g., $\text{month}_{\sin} = \sin(2\pi \cdot \text{month}/12)$ and $\text{month}_{\cos} = \cos(2\pi \cdot \text{month}/12)$. This ensures December (12) and January (1) are considered close in the encoded space. For simplicity, we won't apply this to our example data, but it's a useful trick for cyclical features.
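
For readers who want to see the sine/cosine transform in action, here is a minimal sketch on the twelve months (the `months` frame and the helper `encoded_distance` are assumptions for illustration). It checks that December and January end up close together while December and June end up far apart.

```python
import numpy as np
import pandas as pd

months = pd.DataFrame({"month": np.arange(1, 13)})

# Map month 1..12 onto a circle
angle = 2 * np.pi * months["month"] / 12
months["month_sin"] = np.sin(angle)
months["month_cos"] = np.cos(angle)

def encoded_distance(m1, m2):
    """Euclidean distance between two months in (sin, cos) space."""
    cols = ["month_sin", "month_cos"]
    a = months.loc[months["month"] == m1, cols].to_numpy()
    b = months.loc[months["month"] == m2, cols].to_numpy()
    return float(np.linalg.norm(a - b))

print("Dec vs Jan:", round(encoded_distance(12, 1), 3))
print("Jan vs Feb:", round(encoded_distance(1, 2), 3))
print("Dec vs Jun:", round(encoded_distance(12, 6), 3))
```

Both columns are needed: the sine alone (or the cosine alone) maps two different months to the same value.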

The Titanic dataset does not have an obvious datetime feature (like a travel date). But if it did, we could apply similar extraction. However, when we get to the time series section, we'll see more about dealing with time indices.


### 5.2 Polynomial Features  
`PolynomialFeatures(degree=2)` adds squares & interactions.

Polynomial feature generation is a systematic way of creating interaction and power terms from numeric features.  

This allows models like linear regression to fit nonlinear relationships by considering these higher-order terms. However, polynomial expansion increases the number of features rapidly and can lead to overfitting if the degree is high or if features are many. Scikit-learn provides PolynomialFeatures to automate this. Let's demonstrate on a very small example to see what it does:

In [None]:
from sklearn.preprocessing import PolynomialFeatures

# Small example dataset (2 samples, 2 features)
X_example = np.array([[2, 3],
                      [3, 4]])
print("Original X:\n", X_example)

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_example)
print("\nPolynomial features (degree 2) of X:\n", X_poly)
print("Output feature names:", poly.get_feature_names_out())


In the output, `X_poly` has 5 columns derived from the original 2:

* The first two columns are the original x₁, x₂.
* The third column is x₁².
* The fourth column is x₁ × x₂.
* The fifth column is x₂².

The feature names confirm this: e.g., `x0` = x₁, `x1` = x₂, `x0^2` = x₁², `x0 x1` = x₁x₂, `x1^2` = x₂².

We can apply this to a real dataset too. For instance, suppose in Titanic we want to add a polynomial feature for `age` and `fare` to allow a model to capture interactions between age and fare (though it's not obvious there is one, but for demonstration):

In [None]:
# --- PolynomialFeatures demo -------------------------------
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)

af_poly = poly.fit_transform(df[["age","fare"]].fillna(0))
pd.DataFrame(af_poly, columns=poly.get_feature_names_out(["age","fare"])).head()


The resulting DataFrame contains five columns: `age`, `fare`, `age^2`, `age fare` (interaction), and `fare^2`. We could join these back to `df` if needed.

However, one must consider if these new features make sense and add value. For example, fare² might not be very meaningful, but an interaction age × fare might capture that perhaps the effect of fare on survival could depend on age, etc. (This is hypothetical in this context.)

Polynomial features are particularly useful when you have continuous features and you suspect non-linear relationships but want to use a linear model, or when you want to capture interactions between features explicitly.

**Note:** Higher degree polynomials (3, 4, ...) will generate a *lot* of features and can overfit. Use with caution, and consider applying feature selection or regularization to such expansions.
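
To put numbers on "a lot of features": with n input features and degree d, the expansion without the bias term has C(n + d, d) - 1 columns. The sketch below (the helper `n_poly_features` is an assumption for illustration) checks this formula against scikit-learn's own count for a few sizes.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def n_poly_features(n_features, degree):
    """Number of output columns without the bias term: C(n + d, d) - 1."""
    return comb(n_features + degree, degree) - 1

for n_features in (2, 10, 20):
    for degree in (2, 3, 4):
        X = np.zeros((1, n_features))   # contents are irrelevant for counting
        poly = PolynomialFeatures(degree=degree, include_bias=False).fit(X)
        # Confirm the formula against scikit-learn's own count
        assert poly.n_output_features_ == n_poly_features(n_features, degree)
        print(f"{n_features:>2} features, degree {degree}: "
              f"{poly.n_output_features_} columns")
```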

## 6  Interaction Features  

Interaction features are products or combinations of two or more original features. They capture relationships that are not evident when considering features individually. We already saw that polynomial features can create interaction terms automatically (e.g., x₁ × x₂). But often, domain knowledge might suggest specific interactions to create:

* Sums or differences: e.g., `family_size = sibsp + parch + 1` (number of family members traveling, +1 for self).
* Products: e.g., if modeling area from width and height, area = width × height.
* Ratios: e.g., `fare_per_person = fare / (family_size)` could be meaningful (cost per individual).
* Combinations of categorical features: e.g., combining two categorical into one (like `sex_class = sex + "_" + pclass` to capture that perhaps being female in 1st class differs from female in 3rd class).

Let's do a couple of these on the Titanic data:

**Family size interaction:** Titanic has `sibsp` (siblings/spouses aboard) and `parch` (parents/children aboard). A known useful feature is `family_size = sibsp + parch + 1` (the +1 is to count the person themselves). This tells how large the traveling group/family was.


**Summary:**
Domain-driven combos often matter:

* `family_size = sibsp + parch + 1`  
* `is_alone` flag  
* `fare_per_person` ratio  
* Categorical combos like `sex_pclass`


In [None]:
df['family_size'] = df['sibsp'] + df['parch'] + 1
print(df[['sibsp','parch','family_size']].head(10))
print("\nFamily size distribution:\n", df['family_size'].value_counts())


Now we have a new feature family_size. We see common values might be 1 (alone), 2, 3, etc. This feature could be further used to create, say, a boolean is_alone = (family_size == 1) which Titanic analyses often use (since solo travelers had different survival odds than those with family).

In [None]:
df['is_alone'] = (df['family_size'] == 1).astype(int)
print("\n'is_alone' (1 if family_size==1) distribution:\n", df['is_alone'].value_counts())


**Fare per person:**
We can create fare_per_person = fare / family_size. This might normalize the fare by how many people shared that fare (siblings/parents often paid together). It could differentiate those who paid a high fare but for many people versus someone who paid a high fare just for themselves in first class.

In [None]:
# No division-by-zero risk: family_size is always at least 1
df['fare_per_person'] = df['fare'] / df['family_size']
df[['fare', 'family_size', 'fare_per_person']].head(10)


We created a ratio feature fare_per_person. If someone has family_size 1, this is just their fare; if family_size > 1, this reduces the value.

**Categorical interaction example:**
Combine sex and pclass into a single category feature. (Remember, we dropped sex from df after encoding, but let's get it from original again for this concept.)

In [None]:
# Using original Titanic load to get sex and pclass quickly
titanic_raw = sns.load_dataset('titanic')
titanic_raw['sex_pclass'] = titanic_raw['sex'] + "_" + titanic_raw['pclass'].astype(str)
print(titanic_raw[['sex','pclass','sex_pclass']].head(6))
print("\nUnique sex_pclass combos:", titanic_raw['sex_pclass'].unique())


This new feature sex_pclass has values like "male_3", "female_1", etc., representing a passenger's sex and class together. If one were to one-hot encode this, it effectively is capturing an interaction between sex and class. If the effect of class on survival is different for males vs females, this kind of feature might help a model pick that up more easily.

In general, you create interaction features when you have reason to believe two features combined have an important effect that is not simply linear/additive. Many algorithms can capture interactions on their own (e.g., decision trees naturally do, and neural networks can), but for linear models or just to introduce specific hypothesis-driven signals, manually creating interactions can be very useful.

Keep in mind that adding many interaction features increases dimensionality and risk of overfitting, so again, prefer those that make logical sense or test them with feature importance/selection methods.
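
The point about linear models can be illustrated with a deliberately extreme synthetic example (the data-generating rule below is an assumption chosen to make the effect obvious, not something observed in the Titanic data). The label depends only on the sign of the product of two features, so a plain logistic regression performs near chance until the interaction column is added.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data: the label depends only on the sign of x0 * x1,
# so neither feature is informative on its own
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Same data plus a hand-made interaction column x0 * x1
X_inter = np.column_stack([X, X[:, 0] * X[:, 1]])

clf = LogisticRegression(max_iter=1000)
acc_plain = cross_val_score(clf, X, y, cv=5).mean()
acc_inter = cross_val_score(clf, X_inter, y, cv=5).mean()

print(f"Accuracy without interaction: {acc_plain:.2f}")
print(f"Accuracy with interaction:    {acc_inter:.2f}")
```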

## 7  Feature Selection Methods

After creating many features, we often face the question: **which features are actually helpful?** Feature selection is the process of reducing the number of input features to those that are most useful to the model. This can help in:

* Simplifying the model (making it faster and more interpretable).
* Reducing overfitting by removing noisy or irrelevant features.
* Avoiding the curse of dimensionality (which can hurt performance when we have too many features and not enough data).

There are several approaches to feature selection:

* **Filter methods:** Select features based on statistical properties *without* involving any specific model. Examples:
   * Remove features with very low variance (near-constant features).
   * Select features most correlated with the target (for regression) or most associated via chi-square or mutual information (for classification).
   * Use statistical tests (ANOVA F-test, chi-squared test) between each feature and the target, and keep top N features.

* **Wrapper methods:** Use a predictive model to score feature subsets and select the best combination. This includes techniques like:
   * **Forward selection:** start with no feature, add features one by one that improve the model until no improvement.
   * **Backward elimination:** start with all features, remove least useful one by one until performance drops.
   * **Recursive Feature Elimination (RFE):** iteratively train the model and remove the weakest feature(s) until reaching desired number of features.

* **Embedded methods:** Feature selection occurs as part of the model training. For example:
   * **Lasso (L1 regularization)** tends to shrink irrelevant feature coefficients to zero, effectively selecting features.
   * **Tree-based feature importance:** Decision trees or ensemble (Random Forest, XGBoost) naturally compute an importance score for features; one can select top features based on that.
   * **Feature importances from any model:** Train a model and rank features by absolute coefficient or importance.
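
As a minimal sketch of the Lasso bullet, the example below fits an L1-penalized regression on synthetic data in which only the first two of ten features affect the target. The data, the coefficients 3 and -2, and `alpha=0.2` are all assumptions for illustration; in practice `alpha` would be tuned, for example by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical regression data: only features 0 and 1 matter
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=500)

# alpha controls the strength of the L1 penalty (0.2 is an arbitrary choice)
lasso = Lasso(alpha=0.2).fit(X, y)

# Features whose coefficient was not shrunk to exactly zero
selected = np.flatnonzero(lasso.coef_)
print("Coefficients:", np.round(lasso.coef_, 2))
print("Selected feature indices:", selected)
```

The surviving coefficients are also shrunk toward zero, so a common follow-up is to refit an unpenalized model on the selected features only.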

Let's demonstrate one method from each family:

* **Filter** – SelectKBest(ANOVA)  
* **Embedded** – tree importances  
* **Wrapper** – RFE

We’ll build a synthetic dataset and try each.


In [None]:
from sklearn.datasets import make_classification
# Create a synthetic classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, n_redundant=5,
                           n_repeated=0, n_classes=2, random_state=42, shuffle=False)
print("Shape of X:", X.shape)


In the above, we created 20 features where:

* 5 are informative (they actually drive the target).
* 5 are redundant (random linear combinations of the informative ones).
* 10 are pure noise.

We set `shuffle=False` so that the columns keep their generation order: features 0-4 are informative, 5-9 are redundant, and 10-19 are noise. This is just for ease of understanding; in real data the useful features are mixed in with the rest.
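
As a sanity check on that column layout, the sketch below regresses each of the last 15 columns on the first 5 (plus an intercept). It regenerates the same dataset under separate names (`X_chk`, `y_chk`, chosen here only to avoid clobbering `X` and `y`). If the layout is as described, columns 5-9 should be reproduced almost exactly, while the noise columns should not.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same settings as the cell above, stored under separate names.
X_chk, y_chk = make_classification(n_samples=1000, n_features=20, n_informative=5, n_redundant=5,
                                   n_repeated=0, n_classes=2, random_state=42, shuffle=False)

# Least-squares fit of every remaining column on the 5 informative columns + intercept.
A = np.column_stack([X_chk[:, :5], np.ones(len(X_chk))])
coef, *_ = np.linalg.lstsq(A, X_chk[:, 5:], rcond=None)
resid_rms = np.sqrt(np.mean((X_chk[:, 5:] - A @ coef) ** 2, axis=0))

print("Residual RMS, columns 5-9  :", np.round(resid_rms[:5], 6))
print("Residual RMS, columns 10-19:", np.round(resid_rms[5:], 3))
```

Near-zero residuals for columns 5-9 confirm that they carry no information beyond the informative columns, which is why different selection methods may pick either group.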

**Filter method: SelectKBest (ANOVA F-test)**

We'll use `SelectKBest` to pick the top 5 features by a univariate ANOVA F-test (`f_classif`), which scores each numeric feature by how much its mean differs between the classes relative to its within-class variance. This is a filter method: every feature is scored independently of the others.

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)
selected_indices = selector.get_support(indices=True)
print("Indices of selected features (0-based):", selected_indices)
print("ANOVA F-test scores for first 10 features:\n", selector.scores_[:10])


The output lists which feature indices were selected. Because we generated the data with `shuffle=False`, the truly informative features are 0-4, so we hope most of the selected indices fall in that range. Redundant features (5-9) can also score highly, since they are linear combinations of the informative ones. The F-scores indicate how strongly each feature, taken alone, separates the two classes.

**Embedded method: Feature importance from RandomForest**

We can train a RandomForest classifier on all features and then look at its `feature_importances_`. Random forests are generally robust and indicate which features the trees found useful.

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
importances = rf.feature_importances_
important_idx = np.argsort(importances)[::-1]  # indices sorted by importance, descending
print("Feature importances (top 5):")
for idx in important_idx[:5]:
    print(f"Feature {idx}: Importance {importances[idx]:.4f}")


This prints the top 5 features according to the random forest. We can check whether those indices overlap with our earlier selection.

**Wrapper method: Recursive Feature Elimination (RFE)**

We'll do a quick RFE with a logistic regression to select, say, 5 features. This will repeatedly eliminate the least important feature as determined by the model coefficients.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

logreg = LogisticRegression(max_iter=1000, solver='liblinear')
rfe = RFE(estimator=logreg, n_features_to_select=5)
rfe.fit(X, y)
print("Features selected by RFE (LogisticRegression):", np.where(rfe.support_)[0])


RFE selects a subset of features (the positions where `rfe.support_` is True). LogisticRegression with L1 penalty could also be used to do selection by shrinking coefficients to zero, but that would require adjusting the regularization strength.
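
The L1 alternative mentioned above can be sketched with `SelectFromModel`. This is one possible setup, not a tuned recipe: the value `C=0.1` is an arbitrary assumption (smaller `C` means stronger regularization and fewer surviving features), and the data is regenerated under the names `X_l1`, `y_l1` so the snippet stands alone.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_l1, y_l1 = make_classification(n_samples=1000, n_features=20, n_informative=5, n_redundant=5,
                                 n_repeated=0, n_classes=2, random_state=42, shuffle=False)

# Scale first so the L1 penalty treats all coefficients comparably.
X_l1 = StandardScaler().fit_transform(X_l1)

# L1-penalized logistic regression; C=0.1 is an illustrative choice, not a recommendation.
l1_logreg = LogisticRegression(penalty="l1", C=0.1, solver="liblinear")

# Keep features whose absolute coefficient exceeds a tiny threshold (i.e. was not shrunk to zero).
sfm = SelectFromModel(l1_logreg, threshold=1e-5).fit(X_l1, y_l1)
l1_selected = sfm.get_support(indices=True)
print("Features kept by L1 logistic regression:", l1_selected)
```

In practice `C` would be chosen by cross-validation rather than fixed by hand.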

Each method might select slightly different features, especially since some features were redundant combinations of others.

**Which method to use?**

* **Filter methods** are fast and don't overfit to a model, but they consider features individually, ignoring interactions.
* **Wrapper methods** can consider feature combinations and typically yield better subsets, but they are computationally expensive (training many models) and can overfit if not careful (so use cross-validation).
* **Embedded methods** are a good balance: using the model to guide selection with usually one training. Tree importances or Lasso are common choices.

In practice, a combination is often used. For example, you might first filter out features with zero or near-zero variance, then use a tree model to get importances, and perhaps perform RFE on a smaller set.

For our tutorial purposes, understanding that these tools exist and seeing a simple usage is key. In your workflow, after engineering features, you can use these methods to whittle down to a set that makes your model perform best (using validation to ensure you're not just overfitting to training with selection).
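
One way to keep selection from overfitting is to put the selector inside a scikit-learn pipeline, so that it is re-fitted on the training part of every cross-validation fold. The sketch below assumes the same synthetic data (regenerated as `X_cv`, `y_cv`) and an arbitrary choice of 5 folds.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

X_cv, y_cv = make_classification(n_samples=1000, n_features=20, n_informative=5, n_redundant=5,
                                 n_repeated=0, n_classes=2, random_state=42, shuffle=False)

# Selector + model in one estimator: selection only ever sees the training folds.
pipe = make_pipeline(
    SelectKBest(score_func=f_classif, k=5),
    LogisticRegression(max_iter=1000),
)

# Shuffle the folds because the generated rows are ordered (shuffle=False above).
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(pipe, X_cv, y_cv, cv=cv)
print("Accuracy per fold:", cv_scores.round(3))
print("Mean accuracy:", cv_scores.mean().round(3))
```

Selecting features on the full dataset first and cross-validating afterwards would let information from the validation folds leak into the selection step.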


## 8  Time-Series Feature Engineering  

Time series data brings additional challenges and opportunities for feature engineering. A time series is data recorded over time (usually at regular intervals like daily, monthly, etc.). Besides the kinds of features we can extract from each timestamp (as we saw in datetime extraction), there are unique features derived from the sequence of values itself.

When preparing time series data for machine learning (especially for forecasting or supervised learning approaches), common techniques include:

* **Lag features:** Using previous time steps' values as features for the current time. For example, if you want to predict today's value, you might use yesterday's value (lag 1) or last week's value (lag 7 for daily data) as features.
* **Rolling statistics (window features):** Features like moving average, moving standard deviation, or moving sum over a window of past observations. These capture trends or volatility over time.
* **Differences:** The change from previous period (first difference) or percentage change. Differencing can help stationarize a series (remove trend).
* **Datetime components:** As discussed, features like day of week, month, hour, etc., which for time series can capture seasonality or periodic effects.
* **Fourier or seasonal decomposition features:** (Intermediate/Advanced) Representing seasonal patterns via sine/cosine features or using seasonal decomposition to extract trend/season components.

Let's create a simple time series and demonstrate creating some of these features. We'll simulate a monthly time series with a seasonal pattern.

Key ideas: **lags, rolling stats, diffs, seasonal parts**.


In [None]:
# Simulate a time series (monthly data for 3 years)
date_index = pd.date_range(start='2019-01-01', periods=36, freq='MS')  # month-start stamps; the 'M' alias is deprecated in pandas >= 2.2
time = np.arange(36)
# Create a pattern: trend + seasonal + noise
np.random.seed(0)
values = 10 + 0.5*time + 5*np.sin(2*np.pi*time/12) + np.random.normal(0, 2, size=36)
ts_df = pd.DataFrame({'Date': date_index, 'Value': values})
ts_df = ts_df.set_index('Date')
ts_df.head()



We have ts_df with a Date index (monthly from Jan 2019 to Dec 2021) and a Value (some synthetic measurement). This series has a linear upward trend (0.5 per month) plus a yearly seasonality (sinusoidal) plus some noise.

**Lag Features:**
Let's create a new column for a 1-month lag, i.e. the previous month's value used as a feature for predicting the current month. We'll use `shift(1)` for that. We can similarly create a 12-month lag to capture yearly seasonality (with only 3 years of data it leaves the first 12 rows without a value, so it may not be worth it here).

In [None]:
ts_df['Value_lag1'] = ts_df['Value'].shift(1)
ts_df['Value_lag12'] = ts_df['Value'].shift(12)
ts_df[['Value', 'Value_lag1', 'Value_lag12']].head(15)


Notice that for January 2019, `Value_lag1` is NaN (there is no December 2018 in the data), and for January 2020, `Value_lag12` is the value from January 2019, and so on. These NaNs at the start of the series are expected. We can drop those rows or fill them; for forecasting we usually drop them, since the lagged value genuinely did not exist.

**Rolling Window Features:**

Let's compute a 3-month rolling mean and a 3-month rolling standard deviation of the value. We use .rolling(window=3):

In [None]:
ts_df['rolling_mean_3'] = ts_df['Value'].rolling(window=3).mean()
ts_df['rolling_std_3'] = ts_df['Value'].rolling(window=3).std()
ts_df[['Value', 'rolling_mean_3', 'rolling_std_3']].head(10)


For the first two months, the 3-month window isn't full, so the default behavior gives NaN (since it requires 3 points to compute). We could specify min_periods=1 in rolling to have it start earlier, but typically for modeling we'd start using those features only once windows are full, or we fill initial ones in some way if needed.

The rolling mean smooths the series, capturing local trend, and rolling std captures the local volatility.
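
One detail worth checking: a plain `.rolling(window=3)` window includes the current row. If the feature is meant to predict the current value, a common pattern is to shift by one step first so that the window covers only past observations. A minimal sketch on a hypothetical five-point series:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Window ends at the current row: uses the value we may be trying to predict.
roll_incl_current = s.rolling(window=3).mean()

# Shift first: window covers the three previous rows only.
roll_past_only = s.shift(1).rolling(window=3).mean()

comparison = pd.DataFrame({
    "value": s,
    "roll_incl_current": roll_incl_current,
    "roll_past_only": roll_past_only,
})
print(comparison)
```

The shifted version costs one extra NaN row at the start but keeps the feature free of information from the current time step.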

**Differences:**
We can add a feature that is the difference between the current value and previous month (lag1 difference). This can highlight short-term changes.

In [None]:
ts_df['Diff_1'] = ts_df['Value'] - ts_df['Value_lag1']
ts_df[['Value', 'Diff_1']].head(5)


Or simply `ts_df['Diff_1'] = ts_df['Value'].diff(1)`, which does the same. A positive difference means an increase from last month.
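
To see why differencing removes a trend, consider a noise-free linear series (a toy assumption, not our `ts_df`): its first difference is constant, while the percentage change shrinks as the level grows.

```python
import numpy as np
import pandas as pd

t_demo = np.arange(6)
trend_only = pd.Series(10 + 0.5 * t_demo)  # 10.0, 10.5, 11.0, ...

first_diff = trend_only.diff(1)     # absolute change from the previous step
pct_change = trend_only.pct_change()  # relative change from the previous step

print(pd.DataFrame({"value": trend_only, "diff_1": first_diff, "pct_change": pct_change}))
```

Every difference after the first row equals the slope (0.5), so the differenced series no longer trends upward.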

**Datetime components for time series:**

We already have Date index. We might extract month and year as features as well (though in a purely time series forecasting setting, one might incorporate them differently or use seasonal dummies). For demonstration:

In [None]:
ts_df['Month'] = ts_df.index.month
ts_df['Year'] = ts_df.index.year
ts_df[['Value','Month','Year']].head(12)


`Month` will be 1-12 corresponding to Jan-Dec. As mentioned earlier, we could one-hot encode month or use sin/cos encoding to account for the cyclic nature of month in a year.
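
The sin/cos encoding mentioned here can be sketched as follows. The column names `month_sin` and `month_cos` and the helper `encoded_distance` are illustrative choices. The point is that December and January end up close together in the encoded space, which the raw integers 12 and 1 do not convey.

```python
import numpy as np
import pandas as pd

cyc = pd.DataFrame({"Month": np.arange(1, 13)})

# Map months onto a circle: January at angle 0, one full turn per year.
angle = 2 * np.pi * (cyc["Month"] - 1) / 12
cyc["month_sin"] = np.sin(angle)
cyc["month_cos"] = np.cos(angle)

def encoded_distance(m1, m2):
    """Euclidean distance between two months in (sin, cos) space."""
    a = cyc.loc[cyc["Month"] == m1, ["month_sin", "month_cos"]].to_numpy()[0]
    b = cyc.loc[cyc["Month"] == m2, ["month_sin", "month_cos"]].to_numpy()[0]
    return float(np.linalg.norm(a - b))

print("Dec vs Jan:", round(encoded_distance(12, 1), 3))
print("Dec vs Jun:", round(encoded_distance(12, 6), 3))
```

Both columns are needed together: either one alone maps two different months to the same value.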

By now, our `ts_df` contains the original series and several new features derived from it:

* Value_lag1 (previous month value)
* Value_lag12 (value a year ago, capturing annual seasonality)
* rolling_mean_3, rolling_std_3 (short-term trend and volatility)
* Diff_1 (monthly change)
* Month, Year (time index info)

Let's see the tail of the DataFrame to observe these:

In [None]:
ts_df.tail(5)


You will see the values for late 2021, and the features computed. Note how `Value_lag12` for e.g. Dec 2021 equals the value from Dec 2020, etc.

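These features can be fed into a regression model to predict future values. For instance, to predict the Value for Jan 2022 (the first month after our data ends), you'd use the Dec 2021 Value (lag1) and the Jan 2021 Value (lag12), etc. Predicting Feb 2022 additionally requires the Jan 2022 Value, which must first be observed or forecast. Proper setup would ensure training on known history and predicting forward.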

**Caution:** When creating lag/rolling features, be mindful of the context:

* If doing forecasting, ensure you only use past data to predict future (avoid lookahead bias).
* If you split train/test by time, you would compute these features on the training set and ensure they are available for test (sometimes needing to carry the last known values forward).
* Rolling features near the edges have missing data; usually we drop the first few records or fill them in a minimal way if needed.

For a concrete example: If we were predicting monthly values, we might drop the first 12 months after creating a 12-month lag, since we can't use those for modeling (no lag12 available for them).
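
Putting these cautions together, a minimal sketch of a leak-free setup might look like this. It rebuilds a similar synthetic series under the name `demo` (an assumption, so the snippet stands alone), drops the rows without a 12-month lag, and splits by time rather than at random. `LinearRegression` is used only as a placeholder model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Rebuild a series like the one above (trend + yearly seasonality + noise).
rng = np.random.default_rng(0)
idx = pd.date_range("2019-01-01", periods=36, freq="MS")
t_idx = np.arange(36)
demo = pd.DataFrame(
    {"Value": 10 + 0.5 * t_idx + 5 * np.sin(2 * np.pi * t_idx / 12) + rng.normal(0, 2, 36)},
    index=idx,
)

# Lag features only look backwards in time.
demo["lag1"] = demo["Value"].shift(1)
demo["lag12"] = demo["Value"].shift(12)

# Drop the first 12 months, which have no lag12.
model_df = demo.dropna()

# Time-based split: train on 2020, test on 2021 (no shuffling).
split_date = pd.Timestamp("2021-01-01")
train = model_df[model_df.index < split_date]
test = model_df[model_df.index >= split_date]

reg = LinearRegression().fit(train[["lag1", "lag12"]], train["Value"])
pred = reg.predict(test[["lag1", "lag12"]])
print("Train rows:", len(train), "| Test rows:", len(test))
print("Mean absolute error on 2021:", round(float(np.mean(np.abs(pred - test['Value']))), 3))
```

Note that the test rows use actual lagged values, so this evaluates one-step-ahead prediction; forecasting several months ahead would require feeding predictions back in as lags.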

Time series feature engineering is a big topic, but these basic techniques (lags, rollings, time-based features) are the core. Depending on the domain, you might add specific ones (e.g., if forecasting sales, you might include features like "7-day moving average of sales", or "days since last promotion event", etc., which are domain-specific signals).

## 9  Automated Feature Engineering  

Building features manually as we did above is powerful because you can incorporate domain knowledge and intuition. However, it can be time-consuming and you might miss important patterns. **Automated feature engineering tools** aim to generate many potential features automatically, which you can then use or filter.

We will introduce two popular Python libraries:

* **tsfresh:** the name stands for *Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests*. It automatically calculates a huge number of time series characteristics (mean, median, Fourier coefficients, autocorrelation, etc.) for time series data.
* **Featuretools:** a general library for automated feature engineering, particularly useful for relational (multi-table) data or transactional data, using a method called *Deep Feature Synthesis*. It can create features like aggregations and transformations across tables.

These tools can save a lot of effort, though they may produce more features than you need, so typically you'd pair them with feature selection.

## 9.1 tsfresh (Automatic Time Series Features)

If you have time series data (especially multiple time series or a panel of time series grouped by an ID), `tsfresh` can compute a wide array of features for each.

**Scenario:** Suppose we have multiple sensors each producing a time series, and we want summary features of each sensor's series to feed into a classifier. tsfresh would create features for each sensor series such as mean, standard deviation, max, min, number of peaks, etc.

Let's do a simple example. We will create a small dataset of two time series (as if from two entities). For simplicity, we'll use our earlier synthetic series and duplicate it with some variation for a second series.


In [None]:
from tsfresh.feature_extraction import extract_features

# Prepare a small dataset for tsfresh
# Two IDs, each with a time series of 'Value'
ts_df = ts_df.reset_index()  # make Date a column for ease
# Create a second series as a variation (e.g., a phase shift or different noise)
ts_df2 = ts_df.copy()
ts_df2['Value'] = ts_df2['Value'] * 0.8 + np.random.normal(0, 2, len(ts_df2))  # slightly different series
ts_df2['id'] = 2
ts_df['id'] = 1
combined_ts = pd.concat([ts_df[['id','Date','Value']], ts_df2[['id','Date','Value']]])
combined_ts = combined_ts.rename(columns={"Date": "time"})  # 'time' is the column we will pass to tsfresh as column_sort
combined_ts.head()


We now have `combined_ts` DataFrame with columns: id, time, Value. It contains two groups of time series (id=1 and id=2), each with the same timestamps for simplicity but different values.

Now we use `tsfresh.extract_features`. We'll specify the identifiers and time column. By default, tsfresh will compute a comprehensive list of features. For the sake of brevity, we'll limit to a smaller set of features using the `default_fc_parameters` argument. Let's extract a few basic features like mean, standard deviation, minimum, and maximum of the Value series for each id.

In [None]:
# Hand-pick a few feature calculators for demonstration (to avoid huge output).
# tsfresh also ships a ready-made small preset, MinimalFCParameters, in tsfresh.feature_extraction.
fc_parameters = {
    'mean': None,
    'standard_deviation': None,
    'minimum': None,
    'maximum': None,
    # (tsfresh has many more like median, skewness, etc., but we'll limit here)
}
features_df = extract_features(combined_ts, column_id='id', column_sort='time',
                               default_fc_parameters=fc_parameters)
features_df


The resulting `features_df` has one row per `id` (so 2 rows, for id 1 and 2) and columns representing the features we extracted:

* `Value__mean`
* `Value__standard_deviation`
* `Value__minimum`
* `Value__maximum`

The values in those columns are the respective statistics of the Value time series for each id.
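
To make clear what these four columns represent, here is a plain pandas cross-check on a tiny hypothetical long-format table (`long_df`). It assumes tsfresh's `standard_deviation` is the population standard deviation (`ddof=0`), so that is what the sketch computes. The output column names are chosen to mirror tsfresh's naming.

```python
import pandas as pd

# Hypothetical long-format data: two ids, four time steps each.
long_df = pd.DataFrame({
    "id":    [1, 1, 1, 1, 2, 2, 2, 2],
    "time":  [0, 1, 2, 3, 0, 1, 2, 3],
    "Value": [1.0, 2.0, 3.0, 4.0, 2.0, 2.0, 6.0, 6.0],
})

# One row per id, one column per summary statistic.
manual = long_df.groupby("id")["Value"].agg(
    Value__mean="mean",
    Value__standard_deviation=lambda v: v.std(ddof=0),  # population std (assumed to match tsfresh)
    Value__minimum="min",
    Value__maximum="max",
)
print(manual)
```

For a handful of simple statistics a `groupby` like this is enough; tsfresh becomes worthwhile when you want many more calculators without writing each one by hand.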

In our small example:

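* id 1 should have a mean of roughly 19: the baseline 10 plus the average trend contribution 0.5 × 17.5 = 8.75, while the sine term averages to zero over three full yearly cycles and the noise adds only a small offset.
* id 2, being a scaled (× 0.8) noisy version, will have a lower mean of roughly 15, etc.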

If we had not limited the features, tsfresh's default comprehensive set (`ComprehensiveFCParameters`) would generate several hundred features per series (quantiles, Fourier coefficients, time reversal asymmetry, and many specialized metrics). You can then use `tsfresh.select_features` (which, given a target, statistically tests which features are relevant to it) or simply feed them into a model and let it figure out importance.

**When to use tsfresh:**

* If you have time series and want to quickly get a broad set of characteristics.
* Feature extraction for time-series classification problems (e.g., classify a signal as failure vs normal based on sensor readings).
* Note that tsfresh can be computationally heavy if your series or number of features is large, so you might use the `MinimalFCParameters` or otherwise restrict features to manage performance.

## 9.2 Featuretools (Deep Feature Synthesis)

Featuretools is a library for automated feature engineering, especially useful when you have **relational data** (multiple tables) or hierarchical data (like customers, each with multiple transactions, etc.). It can automatically generate features by aggregating and combining information from these tables.

For example, imagine a retail dataset:

* A customers table (one row per customer).
* A transactions table (multiple transactions per customer).

Featuretools can automatically create features like "total spend per customer", "number of transactions in last 30 days", "average transaction amount for customer", etc., by aggregating the transactions table, as well as handle temporal relationships (like at time of prediction, only use past transactions).
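
For intuition, the kind of features described above can be built by hand in pandas. The sketch below uses two tiny made-up tables and a hypothetical prediction time (`cutoff`); only transactions strictly before the cutoff are used, which is the "only use past transactions" rule.

```python
import pandas as pd

# Hypothetical toy tables (not the Featuretools demo data).
customers = pd.DataFrame({"customer_id": [1, 2]})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "transaction_time": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-03-20", "2024-01-15", "2024-03-01"]),
    "amount": [20.0, 30.0, 50.0, 10.0, 40.0],
})

# Prediction time: only history strictly before this moment may be used.
cutoff = pd.Timestamp("2024-03-01")
past = transactions[transactions["transaction_time"] < cutoff]

# Aggregations over each customer's past transactions.
agg = past.groupby("customer_id")["amount"].agg(
    total_spend="sum", avg_amount="mean", n_transactions="count")

# Number of transactions in the 30 days before the cutoff.
recent = past[past["transaction_time"] >= cutoff - pd.Timedelta(days=30)]
n_recent = recent.groupby("customer_id").size().rename("n_transactions_last_30d")

features = (customers.set_index("customer_id")
            .join(agg)
            .join(n_recent)
            .fillna({"n_transactions_last_30d": 0}))
print(features)
```

Writing each aggregation like this is manageable for a few features; the appeal of an automated tool is enumerating many such combinations across several related tables.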

Let's use Featuretools' built-in demo dataset to illustrate. The mock dataset includes customers, their sessions, and transactions.

In [None]:
import featuretools as ft

# Load a demo dataset of customers
data = ft.demo.load_mock_customer()
customers_df = data["customers"]
sessions_df = data["sessions"]
transactions_df = data["transactions"]

print("Customers table columns:", customers_df.columns.tolist())
print("Sessions table columns:", sessions_df.columns.tolist())
print("Transactions table columns:", transactions_df.columns.tolist())
print("Number of customers:", len(customers_df))
print("Sample transactions:\n", transactions_df.head(5))


We have:
- customers_df: with customer info (customer_id, zip_code, join_date, birthday).
- sessions_df: with session info (session_id, customer_id, device, session_start).
- transactions_df: (transaction_id, session_id, transaction_time, product_id, amount).


We need to tell Featuretools how these tables are related (customer -> sessions -> transactions) and then let it generate features for a target table (say we want features at the customer level).

In [None]:
# Create an EntitySet and add the dataframes
es = ft.EntitySet(id="customer_data")
es = es.add_dataframe(dataframe_name="customers", dataframe=customers_df, index="customer_id")
es = es.add_dataframe(dataframe_name="sessions", dataframe=sessions_df, index="session_id", time_index="session_start")
es = es.add_dataframe(dataframe_name="transactions", dataframe=transactions_df, index="transaction_id", time_index="transaction_time")

# Define relationships
es = es.add_relationship(parent_dataframe_name="customers", parent_column_name="customer_id",
                         child_dataframe_name="sessions", child_column_name="customer_id")
es = es.add_relationship(parent_dataframe_name="sessions", parent_column_name="session_id",
                         child_dataframe_name="transactions", child_column_name="session_id")
es


We set up an EntitySet (a structure in Featuretools that holds all tables and their relationships). We specified which column is the index for each table, and which are time indices for temporal data. Then we defined relationships:
- customers -> sessions via customer_id
- sessions -> transactions via session_id

Now we ask Featuretools to do Deep Feature Synthesis (DFS): generate features for the customers table by aggregating or transforming data from related tables.

In [None]:
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="customers", max_depth=2)
print("Generated feature matrix shape:", feature_matrix.shape)
feature_matrix.head(5)


The `feature_matrix` is a new dataframe where each row is a customer (since `target_dataframe_name` was "customers") and the columns are newly generated features. The `max_depth=2` parameter means a feature can stack at most 2 primitives (transformations/aggregations). If you inspect `feature_matrix.head()`, you'll see columns such as:

* Transformations of the customers' own columns, e.g. date parts extracted from `join_date` or `birthday`.
* Aggregations over sessions and transactions, like `COUNT(sessions)` (number of sessions per customer), `SUM(transactions.amount)` (total spending), and `MEAN(transactions.amount)` (average spend).
* Possibly counts of distinct values, such as the number of unique `product_id` values per customer.

The shape printed tells how many features were created. Featuretools might create quite a few features depending on the data. Feature names follow the pattern `PRIMITIVE(dataframe.column)`, and stacked features nest that pattern. Typical examples (the exact set depends on your Featuretools version and its default primitives):

* `NUM_UNIQUE(sessions.device)`: how many distinct devices a customer used
* `COUNT(sessions)`: how many sessions a customer had
* `MONTH(join_date)`: the month extracted from the join date
* `MAX(sessions.SUM(transactions.amount))`: sum of amounts per session, then the maximum across a customer's sessions (largest session spend)
* etc.

We won't list them all here, but you get the idea. Featuretools systematically enumerates combinations: it looks at each relationship and applies aggregations (sum, mean, count, etc.) and also applies transformations to columns (like extracting date parts or doing arithmetic).
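
To illustrate what "stacking" two aggregations means, here is a hand-written pandas version of a depth-2 feature on hypothetical toy tables: first sum the amounts within each session, then take the maximum of those sums per customer. The column names only mimic Featuretools' naming pattern for readability.

```python
import pandas as pd

# Hypothetical toy tables: customer 1 has two sessions, customer 2 has one.
sessions_toy = pd.DataFrame({"session_id": [10, 11, 12], "customer_id": [1, 1, 2]})
transactions_toy = pd.DataFrame({
    "session_id": [10, 10, 11, 12, 12],
    "amount": [5.0, 15.0, 40.0, 7.0, 3.0],
})

# Depth 1: aggregate transactions up to their parent session.
session_sum = (transactions_toy.groupby("session_id")["amount"]
               .sum()
               .rename("SUM(transactions.amount)"))
sessions_feat = sessions_toy.join(session_sum, on="session_id")

# Depth 2: aggregate the session-level feature up to the customer.
stacked = (sessions_feat.groupby("customer_id")["SUM(transactions.amount)"]
           .max()
           .rename("MAX(sessions.SUM(transactions.amount))"))
print(stacked)
```

Each extra level of depth multiplies the number of candidate features, which is why the feature matrix grows quickly as `max_depth` increases.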

**Using Featuretools output:**

The generated features can be used to train a model. You might want to drop some that make no sense or have high cardinality. Featuretools tries to handle a lot automatically, but not every generated feature will be useful. It's often followed by manual review or feature selection.

**When to use Featuretools:**

* You have complex data with multiple tables (like users → logs → transactions) and want to automatically generate features capturing relations.
* You want to try a broad set of derived features (especially aggregations over one-to-many relationships) quickly.
* It excels in scenarios like predicting customer churn or credit risk from transactional history, etc., where you have to aggregate over a user's history.

Keep in mind that Featuretools (and automated feature generation in general) can produce a huge number of features. It's powerful, but be prepared to use feature selection or regularization to handle the output.

## Conclusion

In this notebook, we've covered a comprehensive range of feature engineering techniques for both tabular and time series data:

* How to handle missing values by dropping or imputing, and the importance of doing so.
* Encoding categorical features with appropriate techniques (one-hot for nominal, ordinal encoding for ordered categories).
* Binning continuous features into categories to capture non-linear effects or reduce noise.
* Scaling and normalizing features to ensure comparability and to meet model assumptions.
* Extracting new features from existing ones, such as breaking down dates or forming polynomial and interaction features to enrich the feature set.
* Selecting the most important features using filter, wrapper, and embedded methods, to simplify models and avoid overfitting.
* Engineering time series features like lags and rolling statistics to incorporate temporal patterns into models.
* Utilizing automated feature engineering libraries (tsfresh and Featuretools) to let algorithms discover potentially useful features across time series and relational data.

Feature engineering is as much an art as it is science. It requires understanding both the data and the modeling algorithms. Always keep in mind:

* **Garbage in, garbage out:** No model can make up for completely irrelevant or bad features. Invest time in understanding the data.
* **Domain knowledge:** Use it to create meaningful features that a generic algorithm might not think of.
* **Experimentation:** Try different transformations and combinations, and evaluate via model performance (using validation sets or cross-validation) to see what helps.
* **Simplicity:** More features isn't always better. The right features are better. Use feature selection or judgement to keep the feature set concise and interpretable when possible.