# **Chapter 17: Advanced Feature Engineering Techniques**

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Understand and apply automated feature engineering libraries such as `tsfresh` and `Featuretools`
- Create feature crosses and interaction terms to capture non‑linear relationships
- Generate polynomial features and apply binning/discretization for time‑series data
- Implement target encoding while avoiding leakage in temporal settings
- Use feature hashing for high‑dimensional categorical variables
- Leverage embedding‑based features (e.g., entity embeddings, pre‑trained representations)
- Explore transfer learning and meta‑learning for feature generation
- Apply best practices to ensure robust, generalizable features

---

## **17.1 Automated Feature Engineering**

Automated feature engineering (AutoFE) systematically generates a large number of candidate features from raw data using predefined transformations. For time‑series, libraries like `tsfresh` (Time Series Feature extraction based on scalable hypothesis tests) and `Featuretools` (using Deep Feature Synthesis) can produce hundreds of statistical and relational features. This approach complements manual engineering by exploring a vast space of possibilities that human intuition might miss.

### **17.1.1 tsfresh for Time‑Series Feature Extraction**

`tsfresh` automatically calculates a comprehensive set of time‑series characteristics (e.g., mean, variance, number of peaks, FFT coefficients) for each time series. In the context of the NEPSE dataset, we can treat each stock symbol as a separate time series and extract features over rolling windows.

```python
import pandas as pd
import numpy as np
from tsfresh import extract_features
from tsfresh.feature_extraction import ComprehensiveFCParameters  # or EfficientFCParameters
from tsfresh.utilities.dataframe_functions import impute

# Load NEPSE data (sample)
df = pd.read_csv('nepse_data.csv')

# Ensure proper datetime format and sort
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['Symbol', 'Date'])

# Prepare data for tsfresh in long format: each row is an (id, time, value) triplet
# We'll extract features for each Symbol (id) using the Close price
tsfresh_df = df[['Symbol', 'Date', 'Close']].dropna().copy()
tsfresh_df.columns = ['id', 'time', 'value']   # names are arbitrary; they are passed to tsfresh below

# Extract features (ComprehensiveFCParameters yields several hundred features per series)
extracted_features = extract_features(
    tsfresh_df,
    column_id='id',
    column_sort='time',
    column_value='value',
    default_fc_parameters=ComprehensiveFCParameters(),   # EfficientFCParameters() runs faster
    impute_function=impute,                  # replace NaN/inf in the result
    n_jobs=4                                  # parallel processing
)

print(f"Extracted {extracted_features.shape[1]} features")
print(extracted_features.iloc[:, :5].head())   # show first few features
```

**Explanation:**

- **Data preparation:** `tsfresh` expects long‑format data with one column identifying the series, one for ordering, and one holding the values. The column names are arbitrary; they are passed through `column_id`, `column_sort`, and `column_value`. Here, `id` is the stock symbol, `time` is the trading date, and `value` is the closing price, so features are extracted for each symbol separately.
- **Feature extraction:** The `extract_features` function applies a large collection of feature calculators (mean, variance, number of crossings, etc.) to each time series. The parameter `default_fc_parameters` takes a settings object (or a plain dictionary of calculators), not a string: `EfficientFCParameters()` skips the calculators marked as computationally expensive, while `ComprehensiveFCParameters()` runs all of them. For the NEPSE dataset, the comprehensive set may help capture subtle patterns, but it also raises the risk of overfitting.
- **Imputation:** `impute` replaces `NaN` values (which occur when a feature cannot be computed, e.g., because the series is too short) with the column median, and `+inf`/`-inf` with the column maximum/minimum.
- **Result:** The output is a matrix where rows correspond to each stock symbol (id) and columns are the extracted features. These can be merged back to the original daily data (e.g., by assigning the same feature values to all rows of a given symbol). However, `tsfresh` extracts features over the entire time series, not per rolling window, and a feature computed over the full history includes future data for every row but the last. For prediction we need features that change over time, so we typically apply `tsfresh` over sliding windows (e.g., the last 30 days). This can be done by manually rolling over time or with the `roll_time_series` helper in `tsfresh.utilities.dataframe_functions` (see documentation).
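
The manual sliding-window route can be sketched with plain pandas on toy data. The helper `make_windows`, the `window_id` format, and the toy values are illustrative assumptions, not part of `tsfresh`. Each window receives its own identifier, so any per-id feature extractor can then be run on the windowed frame; a simple `groupby` aggregation stands in for the extractor here.

```python
import pandas as pd

def make_windows(long_df, window=3):
    """Give every (id, end time) pair its own copy of the `window`
    most recent observations up to and including that time."""
    pieces = []
    for sym, grp in long_df.groupby('id'):
        grp = grp.sort_values('time').reset_index(drop=True)
        for end in range(window - 1, len(grp)):
            chunk = grp.iloc[end - window + 1 : end + 1].copy()
            # Hypothetical identifier format: "<symbol>|<window end date>"
            chunk['window_id'] = f"{sym}|{grp.loc[end, 'time'].date()}"
            pieces.append(chunk)
    return pd.concat(pieces, ignore_index=True)

# Toy long-format data (made-up values, two symbols)
toy = pd.DataFrame({
    'id': ['AAA'] * 5 + ['BBB'] * 4,
    'time': list(pd.date_range('2024-01-01', periods=5))
            + list(pd.date_range('2024-01-01', periods=4)),
    'value': [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 9.0, 8.0, 7.0],
})

windows = make_windows(toy, window=3)

# Stand-in for a feature extractor: one row of statistics per window
window_features = windows.groupby('window_id')['value'].agg(['mean', 'std', 'min', 'max'])
print(window_features)
```

Because every window ends at the date encoded in its identifier, each feature row depends only on observations up to that date.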

### **17.1.2 Featuretools for Deep Feature Synthesis**

Featuretools applies **Deep Feature Synthesis (DFS)** to automatically generate features from relational datasets. It can create features by aggregating related data (e.g., mean of previous trades) and by applying transformation primitives (e.g., day of week). For time‑series, we can set up an entity set with a time index and let DFS build features that respect temporal order.

```python
import featuretools as ft

# Prepare data: we need a unique index per row
df['id'] = range(len(df))

# Create an EntitySet
es = ft.EntitySet(id="nepse_data")

# Add the main dataframe (entity) with a time index
es = es.add_dataframe(
    dataframe_name="stocks",
    dataframe=df,
    index="id",
    time_index="Date"                     # when each row becomes known (used together with cutoff times)
)

# Run Deep Feature Synthesis
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="stocks",
    agg_primitives=["mean", "sum", "std", "count", "trend"],
    trans_primitives=["day", "month", "percent_change", "cum_sum"],
    max_depth=2,                           # how many layers of features to create
    verbose=True
)

print(f"Generated {len(feature_defs)} features")
print(feature_matrix.head())
```

**Explanation:**

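- **EntitySet:** This is Featuretools' container for dataframes and the relationships between them. We add our main table (`stocks`) and specify the `time_index` column, which records when each row becomes known. The time index alone does not restrict any calculation; DFS only excludes later data when you also pass `cutoff_time` to `ft.dfs`.
- **DFS parameters:**
  - `agg_primitives` are functions that aggregate child rows for each parent row across a relationship (e.g., the mean closing price of all rows belonging to a symbol). With a single dataframe, as in this example, there is no relationship to aggregate over, so these primitives produce no features until a related table is added.
  - `trans_primitives` are functions applied to columns of a single dataframe (e.g., extracting the day of the month, computing percentage change). Note that `percent_change` and `cum_sum` run down the whole column in row order, so with several symbols in one table they cross symbol boundaries; Featuretools offers `groupby_trans_primitives` for grouped versions.
  - `max_depth` controls the complexity of generated features. Depth 1 features are direct aggregations/transformations; depth 2 features are aggregations of transformations, etc.
- **Result:** `feature_matrix` contains one row per original row together with the newly engineered features. To avoid look‑ahead bias, supply cutoff times so that each row's features are calculated only from data whose time index is at or before its cutoff.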

Featuretools is particularly powerful when you have multiple related tables (e.g., economic indicators, sector data) that you want to join with the stock data. For the NEPSE dataset, you could incorporate sector indices or macroeconomic data to enrich the feature set.

### **17.1.3 Custom Automation**

When you need full control or have domain‑specific transformations, building a custom automation pipeline is often the best approach. The idea is to define a set of transformation functions and systematically apply them over rolling windows.

```python
def generate_rolling_features(df, windows=(5, 10, 20), group_col='Symbol'):
    """
    Generate a set of rolling statistical features for each numeric column.
    Windows are computed per symbol so they never span two different stocks.
    Assumes df is sorted by [group_col, 'Date'].
    """
    df_feat = df.copy()
    numeric_cols = ['Open', 'High', 'Low', 'Close', 'Vol']

    for col in numeric_cols:
        if col not in df.columns:
            continue
        grouped = df.groupby(group_col)[col]
        for w in windows:
            df_feat[f'{col}_mean_{w}'] = grouped.transform(lambda s: s.rolling(w).mean())
            df_feat[f'{col}_std_{w}'] = grouped.transform(lambda s: s.rolling(w).std())
            df_feat[f'{col}_skew_{w}'] = grouped.transform(lambda s: s.rolling(w).skew())
            df_feat[f'{col}_min_{w}'] = grouped.transform(lambda s: s.rolling(w).min())
            df_feat[f'{col}_max_{w}'] = grouped.transform(lambda s: s.rolling(w).max())
            # rate of change over the window, again within each symbol
            df_feat[f'{col}_roc_{w}'] = grouped.pct_change(periods=w)

    return df_feat

# Apply to NEPSE data
df_with_features = generate_rolling_features(df, windows=[5, 10, 20])
print(df_with_features.iloc[:, -10:].head())
```

**Explanation:**

- This function loops over a predefined list of numeric columns and a set of window sizes. For each combination, it computes common statistics (mean, standard deviation, skewness, min, max, rate of change) using pandas’ built‑in rolling methods.
- The result is a DataFrame with many new columns, each named according to the pattern `{column}_{statistic}_{window}`.
- This approach is fully transparent and easily extendable: you can add new statistics, change windows, or include domain‑specific calculations (e.g., autocorrelation, entropy). It also runs efficiently because it leverages vectorized pandas operations.
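
One way to make such a pipeline extendable is a small registry of named statistics, sketched below on toy data. The names `STATS` and `add_registered_features`, the `Symbol`/`Close` column layout, and the toy prices are assumptions for illustration. Adding a statistic, such as the lag‑1 autocorrelation shown here, then means adding one dictionary entry.

```python
import numpy as np
import pandas as pd

# Registry of rolling statistics: each entry maps a name to a function
# that takes a pandas Rolling object and returns a Series
STATS = {
    'mean': lambda r: r.mean(),
    'std': lambda r: r.std(),
    'range': lambda r: r.max() - r.min(),
    # lag-1 autocorrelation inside each window (slower: uses rolling.apply)
    'autocorr1': lambda r: r.apply(lambda x: pd.Series(x).autocorr(lag=1), raw=True),
}

def add_registered_features(df, col, windows, stats=STATS, group_col='Symbol'):
    """Apply every registered statistic over every window, per symbol."""
    out = df.copy()
    for w in windows:
        for name, fn in stats.items():
            out[f'{col}_{name}_{w}'] = (
                df.groupby(group_col)[col].transform(lambda s: fn(s.rolling(w)))
            )
    return out

# Toy data: two symbols, already sorted by symbol and time (made-up prices)
toy = pd.DataFrame({
    'Symbol': ['A'] * 6 + ['B'] * 6,
    'Close': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 10.0, 12.0, 11.0, 13.0, 12.0, 14.0],
})

feats = add_registered_features(toy, 'Close', windows=[3])
print(feats.filter(like='Close_').round(3))
```

The first two rows of each symbol stay `NaN` for a 3‑row window, which is a quick check that no window reaches across symbols.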

**Automated vs. Manual:** Automated tools like `tsfresh` and Featuretools are excellent for exploration and baseline models, but they can generate thousands of features, many of which are redundant or noisy. Manual and semi‑automated pipelines, guided by domain knowledge, often yield more robust and interpretable features. A common workflow is to use automation to generate a wide set of candidates, then apply feature selection (see Chapter 16) to keep only the most relevant.

---

## **17.2 Feature Crosses and Interactions**

Feature crosses (also called interaction terms) are new features formed by combining two or more existing features. They capture non‑linear relationships that a linear model cannot represent on its own. In time‑series forecasting, interactions can reveal important patterns, such as “high volume on a down day” or “late‑month effect combined with momentum.”

### **17.2.1 Arithmetic Combinations**

The simplest interactions are arithmetic operations: multiplication, division, addition, or subtraction of two features. For the NEPSE dataset, we can create:

- **Price × Volume** – reflects the total value traded (turnover), which may indicate institutional interest.
- **Return × Volume** – captures the conviction behind a price move.
- **High‑Low range × Volatility** – amplifies the effect of wide‑ranging days when volatility is already high.

```python
# Example: create interaction features
df['Turnover_Est'] = df['Close'] * df['Vol']          # price * volume
df['Return_Volume'] = df['Daily_Return'] * df['Vol']  # return * volume
df['Range_Volatility'] = (df['High'] - df['Low']) * df['Volatility_20']
```

**Explanation:**

- `Turnover_Est` approximates the total traded value. In the NEPSE CSV, there is a `Turnover` column (actual traded value), so this feature may be redundant, but it demonstrates the concept.
- `Return_Volume` is larger when a significant price change occurs on high volume, which often signals strong momentum or a reversal. A small move on low volume may be noise; a large move on high volume is more meaningful.
- `Range_Volatility` multiplies the daily price range by the 20‑day volatility. This amplifies the range when volatility is already high, helping models distinguish between quiet days with small ranges and volatile days with wide ranges.

### **17.2.2 Polynomial Features**

Polynomial features are a special case of interactions that include powers of individual features and their products. For example, from two features `a` and `b`, polynomial features of degree 2 include `a²`, `b²`, and `a×b`. This allows a linear model to fit quadratic surfaces.

```python
from sklearn.preprocessing import PolynomialFeatures

# Select a few base features
base_features = df[['Daily_Return', 'Vol', 'RSI']].dropna()

# Generate polynomial features up to degree 2 (includes interactions)
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(base_features)

# Get feature names
feature_names = poly.get_feature_names_out(base_features.columns)
poly_df = pd.DataFrame(poly_features, columns=feature_names, index=base_features.index)

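# Concatenate only the new columns; the degree-1 terms already exist in df
new_cols = [c for c in poly_df.columns if c not in df.columns]
df = pd.concat([df, poly_df[new_cols]], axis=1)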
print(poly_df.head())
```

**Explanation:**

- `PolynomialFeatures` creates a new feature matrix consisting of all polynomial combinations of the input features with degree ≤ 2.
- `include_bias=False` avoids adding a constant column (intercept). The resulting columns are named, e.g., `Daily_Return^2`, `Daily_Return Vol`, `Vol^2`, etc.
- **Caution:** Polynomial features can explode in number: with 10 input features and degree 3, `PolynomialFeatures` returns 285 columns (275 of them new). This can quickly lead to overfitting, especially in time‑series with limited data. Use regularization (e.g., Lasso) or feature selection after generating them.
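
The growth can be checked before generating anything. For `n` inputs and degree `d`, the number of monomials of degree at most `d` is the binomial coefficient C(n + d, d), including the constant term. The helper name `n_poly_features` below is an illustrative assumption.

```python
from math import comb

def n_poly_features(n_inputs, degree, include_bias=False):
    """Number of output columns of a full polynomial expansion."""
    total = comb(n_inputs + degree, degree)   # all monomials up to `degree`
    return total if include_bias else total - 1

for n, d in [(3, 2), (10, 2), (10, 3), (20, 3)]:
    print(f"{n} inputs, degree {d}: {n_poly_features(n, d)} columns")
```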

### **17.2.3 Logical Combinations**

Sometimes interactions are not arithmetic but logical: e.g., “is it the end of the month **and** the stock is oversold?” This can be captured by multiplying binary flags or by using decision trees that naturally handle such splits. You can create explicit binary interaction features:

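```python
# Create binary flags
# Last trading day of each calendar month, as observed in the data
# (dt.is_month_end only flags calendar month-ends, which may not be trading days)
month = df['Date'].dt.to_period('M')
df['Is_Month_End'] = (df['Date'] == df.groupby(month)['Date'].transform('max')).astype(int)
df['Is_Oversold'] = (df['RSI'] < 30).astype(int)

# Interaction: end of month AND oversold
df['MonthEnd_Oversold'] = df['Is_Month_End'] * df['Is_Oversold']
```

**Explanation:**

- `Is_Month_End` is 1 on the last trading day of each month present in the data, 0 otherwise. The final month in the file may be incomplete, so its last row is flagged even if the month has not ended.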
- `Is_Oversold` is 1 when RSI < 30, a common oversold signal.
- The product of these two binary variables is 1 only when both conditions are true, creating a feature that captures a specific regime that might be predictive (e.g., end‑of‑month window dressing in oversold stocks).

### **17.2.4 When to Use Interactions**

Interactions are most valuable when domain knowledge suggests that the effect of one feature depends on the value of another. For instance:
- The impact of a price change may depend on the current volatility level.
- The significance of a volume spike may depend on whether the market is trending or range‑bound.
- Seasonal effects (e.g., fiscal quarter) may amplify the predictive power of technical indicators.

Including interactions can improve model accuracy, but they increase dimensionality and the risk of overfitting. Therefore, use them judiciously and always validate with out‑of‑sample data.
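
A minimal sketch of such an out‑of‑sample check, on synthetic data where the target is deliberately driven by the product of two features. The data, the variable names, and the choice of `Ridge` are assumptions for illustration; on real market data the gain from an interaction is usually far smaller and should be measured the same way.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
n = 600
ret = rng.normal(size=n)                          # stand-in for a return feature
vol = rng.normal(size=n)                          # stand-in for a volume feature
y = 0.5 * ret * vol + 0.1 * rng.normal(size=n)    # target depends on the interaction

X_base = np.column_stack([ret, vol])
X_cross = np.column_stack([ret, vol, ret * vol])  # same features plus the cross

# Forward-chaining splits: each fold trains on the past, tests on the future
cv = TimeSeriesSplit(n_splits=5)
score_base = cross_val_score(Ridge(alpha=1.0), X_base, y, cv=cv, scoring='r2').mean()
score_cross = cross_val_score(Ridge(alpha=1.0), X_cross, y, cv=cv, scoring='r2').mean()

print(f"R^2 without interaction: {score_base:.3f}")
print(f"R^2 with interaction:    {score_cross:.3f}")
```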

---

## **17.3 Polynomial Features**

Polynomial features extend the idea of interactions by including powers of individual features. In time‑series, polynomial features can model non‑linear trends – for example, a quadratic trend in prices might capture acceleration or deceleration.

### **17.3.1 Creating Polynomial Features**

As shown in 17.2.2, `PolynomialFeatures` from scikit‑learn is the standard tool. However, for time‑series we often want polynomial features of time itself (e.g., day number) to capture long‑term trends.

```python
# Create a time index (days since start)
df['Day_Num'] = (df['Date'] - df['Date'].min()).dt.days

# Generate polynomial features of time (degree 3)
from sklearn.preprocessing import PolynomialFeatures
poly_time = PolynomialFeatures(degree=3, include_bias=False)
time_poly = poly_time.fit_transform(df[['Day_Num']])   # columns: Day_Num, Day_Num^2, Day_Num^3
df['Day_Num^2'] = time_poly[:, 1]
df['Day_Num^3'] = time_poly[:, 2]   # confirm the order with poly_time.get_feature_names_out()
```

**Explanation:**

- `Day_Num` is a linear time counter. Its square and cube can capture parabolic and cubic trends in the price series. In a linear model, including these terms allows the model to fit a polynomial trend line.
- **Warning:** Polynomial trends can extrapolate wildly outside the training range. Use them only if you have a strong reason to believe the trend follows a polynomial shape, and be cautious when forecasting far into the future.
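
The extrapolation risk is easy to see on a toy series. The sketch below fits a cubic to a saturating curve (`log1p` of the day number, an arbitrary assumption) over 100 days and then evaluates it at day 300; `np.polyfit` stands in for a linear model on polynomial time features.

```python
import numpy as np

t_train = np.arange(100, dtype=float)
y_train = np.log1p(t_train)               # toy trend that flattens over time

# Cubic trend fitted on the training range only
coefs = np.polyfit(t_train, y_train, deg=3)
in_sample_err = np.max(np.abs(np.polyval(coefs, t_train) - y_train))

# Evaluate the same cubic far outside the training range
t_future = 300.0
extrap_err = abs(np.polyval(coefs, t_future) - np.log1p(t_future))

print(f"Largest in-sample error:  {in_sample_err:.3f}")
print(f"Error at day {t_future:.0f}:        {extrap_err:.3f}")
```

Inside the training range the cubic tracks the curve closely; outside it, the cubic term dominates and the fit diverges from the true trend.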

### **17.3.2 Polynomials of Other Features**

You can also create polynomial expansions of any numeric feature, such as `Volume²` or `RSI³`. This can help capture diminishing returns or threshold effects. For example, very high RSI values (above 80) might be more predictive than moderately high values (70–80) – a cubic term can emphasize that difference.

```python
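rsi = df[['RSI']].dropna()               # PolynomialFeatures does not accept NaN
poly_rsi = PolynomialFeatures(degree=3, include_bias=False)
rsi_poly = poly_rsi.fit_transform(rsi)   # columns: RSI, RSI^2, RSI^3
df.loc[rsi.index, 'RSI^2'] = rsi_poly[:, 1]
df.loc[rsi.index, 'RSI^3'] = rsi_poly[:, 2]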
```

### **17.3.3 Limitations and Regularization**

Polynomial features are a form of basis expansion that can lead to highly correlated features and overfitting. Always combine them with regularization (e.g., Ridge or Lasso regression) or use them in tree‑based models that can handle non‑linearities without explicit polynomial terms. In practice, for time‑series, polynomial features of raw prices are less common than using them on derived features like returns or volatility.
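
A minimal sketch of pairing a polynomial expansion with scaling and Ridge regularization in a single scikit‑learn pipeline. The toy quadratic data and the `alpha` value are assumptions; the correlation check uses a shifted, all‑positive input (as with prices or RSI) to show how strongly raw powers correlate.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

x = np.linspace(-1.0, 1.0, 50).reshape(-1, 1)
y = x.ravel() ** 2                       # toy quadratic relationship

# Raw powers of a positive variable are strongly correlated with each other
powers = PolynomialFeatures(degree=3, include_bias=False).fit_transform(x + 2.0)
corr_x_x2 = np.corrcoef(powers[:, 0], powers[:, 1])[0, 1]
print(f"corr(x, x^2) for x in [1, 3]: {corr_x_x2:.3f}")

# Expansion -> scaling -> regularized linear model, fitted as one estimator
model = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    StandardScaler(),                    # puts x, x^2, x^3 on a common scale
    Ridge(alpha=1.0),                    # shrinks the correlated coefficients
)
model.fit(x, y)
r2 = model.score(x, y)                   # in-sample fit, for illustration only
print(f"In-sample R^2: {r2:.4f}")
```

Keeping the steps in one pipeline also means the scaler is refit on each training fold during cross‑validation, which avoids leaking test‑period statistics.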

---

## **17.4 Binning and Discretization**

Binning (or discretization) converts continuous variables into categorical ones by dividing the range into intervals. This can help capture non‑linear relationships, reduce the impact of outliers, and create interpretable features. In time‑series, binning is often applied to volume, returns, or technical indicators to create regime indicators.

### **17.4.1 Equal‑Width Binning**

Divide the feature into `k` bins of equal width.

```python
# Bin the daily return into 5 categories
df['Return_Bin'] = pd.cut(df['Daily_Return'], bins=5, labels=['Very Negative', 'Negative', 'Neutral', 'Positive', 'Very Positive'])
```

**Explanation:**  
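`pd.cut` with an integer `bins` splits the observed range, from the minimum to the maximum return, into five intervals of equal width. The `labels` parameter names them, but the middle bin is centered on the midpoint of that range, not on zero, so "Neutral" is only accurate when returns are roughly symmetric. Because the edges depend on the sample minimum and maximum, compute them on the training period and reuse them for later data. The resulting categorical feature can be one‑hot encoded, or passed as a categorical column to libraries with native categorical support.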
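
A leakage‑aware variant can be sketched as follows: learn the edges on a training slice with `retbins=True`, open the outer edges so later values outside the training range still land in a bin, and reuse the same edges afterwards. The toy returns and the train/test split point are assumptions.

```python
import numpy as np
import pandas as pd

returns = pd.Series([-0.04, -0.01, 0.00, 0.02, 0.06, 0.01, -0.08, 0.10])
train, test = returns.iloc[:5], returns.iloc[5:]   # chronological split

labels = ['Very Negative', 'Negative', 'Neutral', 'Positive', 'Very Positive']

# Learn equal-width edges from the training period only
_, edges = pd.cut(train, bins=5, retbins=True)

# Open the outer edges so unseen extremes fall into the first/last bin
edges[0], edges[-1] = -np.inf, np.inf

train_bins = pd.cut(train, bins=edges, labels=labels)
test_bins = pd.cut(test, bins=edges, labels=labels)
print(test_bins)
```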

### **17.4.2 Quantile Binning (Equal‑Frequency)**

Create bins that contain approximately the same number of observations. This is useful when the distribution is skewed (e.g., volume).

```python
# Bin volume into 4 quartiles
df['Volume_Quartile'] = pd.qcut(df['Vol'], q=4, labels=['Low', 'Medium', 'High', 'Very High'])
```

**Explanation:**  
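`pd.qcut` places the bin edges at quantiles, so each bin holds roughly the same number of rows even when volume is heavily skewed. As written, the quartiles are computed over the whole DataFrame, which pools all symbols and all dates: a thinly traded stock may never reach "Very High", and early rows are ranked against volumes that had not yet occurred. For modeling, compute the quantiles per symbol over a trailing or expanding window, so that `Volume_Quartile` reflects volume relative to that stock's own history.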
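
One possible time‑aware version ranks each day's volume within a trailing window of the same symbol and bins the resulting percentile. The function name `trailing_volume_quartile`, the window length, and the toy volumes are assumptions for illustration.

```python
import pandas as pd

def trailing_volume_quartile(df, window=4, group_col='Symbol', vol_col='Vol'):
    """Quartile of today's volume relative to the last `window` days
    (today included) of the same symbol."""
    pct = df.groupby(group_col)[vol_col].transform(
        # share of the window that is <= the most recent value
        lambda s: s.rolling(window).apply(lambda x: (x <= x[-1]).mean(), raw=True)
    )
    return pd.cut(pct, bins=[0, 0.25, 0.5, 0.75, 1.0],
                  labels=['Low', 'Medium', 'High', 'Very High'])

# Toy data, sorted by symbol and time (made-up volumes)
toy = pd.DataFrame({
    'Symbol': ['A'] * 5 + ['B'] * 4,
    'Vol': [100, 200, 300, 400, 50, 5, 1, 9, 4],
})

q = trailing_volume_quartile(toy, window=4)
print(pd.concat([toy, q.rename('Volume_Quartile')], axis=1))
```

Rows without a full window of history stay missing, and each label depends only on volumes observed up to that day.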

### **17.4.3 Custom Bins Based on Domain Knowledge**

Sometimes domain knowledge suggests specific thresholds. For example, RSI values above 70 are considered overbought, below 30 oversold.

```python
def rsi_zone(rsi):
    if pd.isna(rsi):
        return np.nan      # RSI is undefined during its warm-up period
    if rsi < 30:
        return 'Oversold'
    elif rsi < 50:
        return 'Neutral_Bearish'
    elif rsi < 70:
        return 'Neutral_Bullish'
    else:
        return 'Overbought'

df['RSI_Zone'] = df['RSI'].apply(rsi_zone)
```

**Explanation:**  
This custom binning uses standard technical analysis thresholds. The resulting categorical variable captures the market sentiment regime, which may be more predictive than the raw RSI value because it reflects trader behavior (e.g., mean‑reversion signals when entering overbought/oversold zones).

### **17.4.4 Handling Binned Features in Models**

Binned features are categorical. They can be used as:
- **One‑hot encoded** (creating dummy variables) for linear models.
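- **Ordinal encoded** (if there is a natural order, as with return or volume bins) for tree models.
- Passed as categorical columns to libraries with native categorical support (e.g., LightGBM or CatBoost). scikit‑learn's `RandomForestClassifier` and `RandomForestRegressor` expect numeric input, so encode the bins first.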

```python
# One-hot encoding example
return_dummies = pd.get_dummies(df['Return_Bin'], prefix='Return')
df = pd.concat([df, return_dummies], axis=1)
```

---

## **17.5 Target Encoding**

Target encoding (also called mean encoding or likelihood encoding) replaces a categorical variable with the average target value for that category. For time‑series, this must be done carefully to avoid leakage: the average should be computed only from past data, not including the current or future rows.

### **17.5.1 Basic Target Encoding for Static Categories**

Suppose we have a categorical variable like `Symbol` (stock ticker). We might want to encode it with the historical average return of that stock. But using the global mean would leak future information into training. Instead, we compute the mean using only data up to each point.

```python
def target_encode_time_series(df, cat_col, target_col, date_col='Date', prior=0.0):
    """
    Perform target encoding for a categorical column in a time-series setting.
    Uses an expanding mean over strictly earlier rows to avoid leakage.
    `prior` fills rows with no history; it should be a constant chosen
    without looking at future data (0.0 is a neutral choice for returns).
    """
    df_sorted = df.sort_values(date_col)  # ensure chronological order
    enc = (
        df_sorted.groupby(cat_col)[target_col]
        # shift(1) drops the current row, so each value sees only the past
        .transform(lambda s: s.shift(1).expanding().mean())
    )
    # Return in the caller's row order (assumes df has a unique index)
    return enc.fillna(prior).reindex(df.index)

# Example usage
df['Symbol_Encoded'] = target_encode_time_series(df, 'Symbol', 'Daily_Return')
```

**Explanation:**

- We group by symbol. Within each group, we calculate the expanding mean of the target (here `Daily_Return`) over all previous rows (using `shift(1)` to exclude the current row). This ensures that the encoding uses only information available at prediction time.
- For the earliest rows, where no history exists and the expanding mean is undefined, we fall back to a fixed prior. The prior must not be derived from later rows: the mean of a symbol's first few returns, or the global mean of the full sample, would leak future information into those rows.
- The result is a numeric feature that represents the historical average return for that symbol, which can capture persistent differences in stock behavior (e.g., some stocks consistently outperform).

### **17.5.2 Smoothing and Regularization**

Target encoding can overfit, especially for categories with few samples. Smoothing (adding a prior) helps:  
`encoding = (n * mean + m * global_mean) / (n + m)`, where `n` is the count for the category, `m` is a smoothing parameter.

```python
def smoothed_target_encoding(df, cat_col, target_col, m=10):
    """
    Smoothed target encoding with global prior.
    """
    global_mean = df[target_col].mean()
    cat_stats = df.groupby(cat_col)[target_col].agg(['count', 'mean'])
    
    # Smoothing formula
    cat_stats['encoding'] = (cat_stats['count'] * cat_stats['mean'] + m * global_mean) / (cat_stats['count'] + m)
    
    # Map back to original dataframe
    return df[cat_col].map(cat_stats['encoding'])

# Apply (but note: this uses all data, causing leakage! For time-series we must compute this in a time-aware manner.)
```

For time‑series, the smoothing must also be computed sequentially, which is more complex but follows the same idea: maintain cumulative counts and means per category, and apply the smoothing formula at each step.
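
A minimal sketch of that sequential computation, using cumulative sums and counts per category. The function name `smoothed_target_encode_past_only`, the fixed `prior` (used in place of the global mean so that no future data enters), and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def smoothed_target_encode_past_only(df, cat_col, target_col,
                                     date_col='Date', m=10, prior=0.0):
    """Smoothed target encoding where each row sees only earlier rows
    of its own category: (past_sum + m * prior) / (past_count + m)."""
    ordered = df.sort_values(date_col)
    keys = ordered[cat_col]
    y = ordered[target_col]
    filled = y.fillna(0.0)                  # missing targets add nothing to the sum
    valid = y.notna().astype(int)           # ... and are not counted

    # Cumulative totals minus the current row = totals over strictly earlier rows
    past_sum = filled.groupby(keys).cumsum() - filled
    past_count = valid.groupby(keys).cumsum() - valid

    enc = (past_sum + m * prior) / (past_count + m)
    return enc.reindex(df.index)            # back to the caller's row order

# Toy data in shuffled order (made-up returns)
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2024-01-02', '2024-01-01', '2024-01-01',
                            '2024-01-03', '2024-01-02']),
    'Symbol': ['A', 'A', 'B', 'A', 'B'],
    'Daily_Return': [0.2, 0.1, -0.1, 0.3, 0.1],
})

enc = smoothed_target_encode_past_only(toy, 'Symbol', 'Daily_Return', m=1, prior=0.0)
print(toy.assign(Symbol_Encoded=enc))
```

With no history the encoding equals the prior, and it moves toward the category's own past mean as observations accumulate; a larger `m` slows that move.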

### **17.5.3 When to Use Target Encoding**

Target encoding is powerful for high‑cardinality categorical variables (e.g., stock symbols) where one‑hot encoding would create too many features. With an expanding mean it captures a category's long‑run average; to capture recent “momentum” instead (e.g., a stock that has been performing well lately), replace the expanding window with a rolling one. In either form it is prone to leakage and must be implemented with strict temporal separation. Always use expanding windows or time‑series cross‑validation when evaluating models that use target‑encoded features.

---

## **17.6 Feature Hashing**

Feature hashing (also known as the hashing trick) is a technique to encode categorical variables into a fixed‑size vector by applying a hash function to the category names and using the hash values as indices. It is especially useful for high‑cardinality categorical features where one‑hot encoding would be impractical.

### **17.6.1 Basic Principle**

A hash function `h` maps each category to an integer between 0 and `n-1` (where `n` is the desired number of dimensions). The feature value is then placed in the bin corresponding to that hash. Collisions (different categories mapping to the same bin) are handled by summing the values, which often works well because the signal from many rare categories is aggregated.
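
The principle can be sketched in a few lines of plain Python. This is not how any particular library implements it: MD5 from `hashlib` is used here only because it is stable across runs, and the helper names and ticker strings are illustrative assumptions.

```python
import hashlib
import numpy as np

def hash_bucket(token, n_buckets):
    """Map a string to a bucket index in [0, n_buckets)."""
    digest = hashlib.md5(token.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_buckets

def hash_encode(tokens, n_buckets=8):
    """Fixed-size count vector; colliding tokens add up in the same bucket."""
    vec = np.zeros(n_buckets)
    for tok in tokens:
        vec[hash_bucket(tok, n_buckets)] += 1.0
    return vec

b = hash_bucket('NABIL', 8)
print("bucket for 'NABIL':", b)
print(hash_encode(['NABIL', 'NABIL']))

# 200 hypothetical symbols cannot occupy more than 50 distinct buckets
used_buckets = {hash_bucket(f'SYM{i}', 50) for i in range(200)}
print("distinct buckets used:", len(used_buckets))
```

The last lines illustrate why collisions are unavoidable once the number of categories exceeds the number of buckets.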

### **17.6.2 Applying Feature Hashing to NEPSE Data**

In the NEPSE dataset, we might have a categorical column like `Symbol` (hundreds of stocks). Instead of creating hundreds of dummy variables, we can hash them into, say, 50 dimensions.

```python
from sklearn.feature_extraction import FeatureHasher

# Prepare a DataFrame with one column 'Symbol' (as strings)
symbols = df[['Symbol']].copy()

# Initialize hasher with desired number of features
hasher = FeatureHasher(n_features=50, input_type='string')

# Transform the symbol column (requires list of lists)
symbol_hashed = hasher.transform(symbols['Symbol'].apply(lambda x: [x]))

# Convert to dense array (careful: may be large, but 50 features is manageable)
symbol_hashed_dense = symbol_hashed.toarray()

# Add as new columns
for i in range(50):
    df[f'Symbol_hash_{i}'] = symbol_hashed_dense[:, i]
```

**Explanation:**

- `FeatureHasher` takes an iterable of lists of strings (each list contains the tokens for one sample). Here, each row’s list contains a single element: the stock symbol.
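- `n_features=50` sets the output dimensionality. Each symbol is hashed to one of 50 columns and, because `FeatureHasher` uses a signed hash by default, receives a value of +1 or -1 there. With hundreds of symbols and only 50 columns, collisions are guaranteed, and symbols that share a column and sign become indistinguishable to the model; increase `n_features` if that matters.
- The result is a sparse matrix (efficient storage). We convert to dense for demonstration, but in practice you can keep it sparse and pass it to estimators that accept sparse input. `FeatureHasher` is stateless, so no fitting step is needed.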
- The hashed features are then added as numeric columns to the DataFrame. They can be used directly in models.

### **17.6.3 Advantages and Caveats**

- **Advantages:** Memory efficient, handles new categories (they just hash to some bin), no need to maintain a mapping dictionary.
- **Caveats:** Collisions can cause information loss; interpretability is lost because you cannot trace back which symbol contributed to which bin. Also, hashing is stateless – you must ensure that the same hasher configuration is used during training and inference.

For time‑series, hashed features are static per symbol (unless the symbol changes, which it doesn’t). They can be combined with rolling statistics to create dynamic features that incorporate symbol‑specific history.

---

## **17.7 Embedding‑Based Features**

Embeddings are dense vector representations of categorical variables learned by a neural network. They capture similarities between categories in a low‑dimensional space. While typically associated with deep learning, pre‑trained embeddings or embeddings learned jointly with a model can be used as features for other models (e.g., as input to a gradient boosting machine).

### **17.7.1 Entity Embeddings for Categorical Variables**

Suppose we want to represent each stock symbol by a learned vector. We can train a small neural network that predicts the target (e.g., next‑day return) and includes an embedding layer for `Symbol`. The trained embedding weights then become features.

```python
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense, Concatenate
from tensorflow.keras.models import Model

# Prepare data: we need integer codes for symbols
symbol_codes = {s:i for i,s in enumerate(df['Symbol'].unique())}
df['Symbol_code'] = df['Symbol'].map(symbol_codes)

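# Define input layers
# X_numeric is assumed to be a 2-D float array of numeric features
# (one row per row of df), prepared and scaled elsewhere
symbol_input = Input(shape=(1,), dtype='int32', name='symbol')
numeric_input = Input(shape=(X_numeric.shape[1],), name='numeric')

# Embedding layer for symbols (output dimension = min(50, #symbols/2), at least 1)
embed_dim = max(1, min(50, len(symbol_codes) // 2))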
embedding = Embedding(input_dim=len(symbol_codes), output_dim=embed_dim, name='symbol_embedding')(symbol_input)
embedding_flat = Flatten()(embedding)

# Concatenate with numeric features
concat = Concatenate()([embedding_flat, numeric_input])

# Output layer for regression (e.g., next-day return)
output = Dense(1, activation='linear')(concat)

model = Model(inputs=[symbol_input, numeric_input], outputs=output)
model.compile(optimizer='adam', loss='mse')

# Train (using time-series split, not shown)
# After training, extract embedding weights:
embedding_weights = model.get_layer('symbol_embedding').get_weights()[0]
# Now each symbol has a vector of length embed_dim
```

**Explanation:**

- We first map each symbol to an integer index. The embedding layer will learn a vector for each index.
- The model is trained to predict the target using both numeric features and the symbol embedding. During training, the embedding vectors are updated to minimize prediction error.
- After training, we extract the embedding weights: a matrix of shape `(num_symbols, embed_dim)`. Each row is a dense representation of a symbol, capturing patterns like “symbols that behave similarly have similar vectors.”
- These vectors can then be used as features in any other model (e.g., concatenated to the feature matrix for a Random Forest). Alternatively, the entire model can be used as a feature extractor.
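
Attaching the learned vectors to the feature matrix is a lookup by symbol. In the sketch below a random matrix stands in for trained embedding weights, and the ticker list, `embed_dim`, and column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

symbols = ['NABIL', 'NICA', 'UPPER']                 # illustrative tickers
symbol_codes = {s: i for i, s in enumerate(symbols)}
embed_dim = 2

# Stand-in for model.get_layer('symbol_embedding').get_weights()[0]
rng = np.random.default_rng(0)
embedding_weights = rng.normal(size=(len(symbols), embed_dim))

# One row per symbol, one column per embedding dimension
emb_df = pd.DataFrame(
    embedding_weights,
    index=symbols,                                   # row i belongs to the symbol with code i
    columns=[f'Symbol_emb_{j}' for j in range(embed_dim)],
)

# Join onto the daily rows; every row of a symbol gets that symbol's vector
rows = pd.DataFrame({'Symbol': ['NICA', 'NABIL', 'NICA', 'UPPER']})
rows_with_emb = rows.join(emb_df, on='Symbol')
print(rows_with_emb)
```

The index order of `emb_df` must match the integer codes used during training, otherwise vectors are assigned to the wrong symbols.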

### **17.7.2 Pre‑trained Embeddings**

If we have a large corpus of financial news or social media data, we could pre‑train word embeddings (like Word2Vec) and then aggregate them for each symbol (e.g., average of word vectors for news headlines mentioning that stock). This is a form of transfer learning – using knowledge from another domain to enrich features. While beyond the scope of this chapter, it’s an active area of research in financial forecasting.

---

## **17.8 Transfer Learning for Features**

Transfer learning leverages features learned from one task or dataset to improve performance on another. In time‑series, this could mean:
- Using a model pre‑trained on a large set of stocks (e.g., all NEPSE stocks) to extract features for a specific stock.
- Using representations from a related domain (e.g., economic indicators) as additional features.

### **17.8.1 Example: Pre‑trained Autoencoder for Feature Extraction**

Train an autoencoder on the raw time‑series of many stocks to learn compressed representations. The encoder part can then be used to generate features for any new stock.

```python
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# Assume we have a matrix X of shape (samples, timesteps) – e.g., 60-day windows of normalized returns
input_dim = 60
encoding_dim = 10

input_series = Input(shape=(input_dim,))
encoded = Dense(encoding_dim, activation='relu')(input_series)
decoded = Dense(input_dim, activation='linear')(encoded)

autoencoder = Model(input_series, decoded)
autoencoder.compile(optimizer='adam', loss='mse')

# Train on a large set of windows from many stocks (unsupervised)
autoencoder.fit(X_train, X_train, epochs=50, batch_size=256, validation_split=0.1)

# Extract the encoder part
encoder = Model(input_series, encoded)

# Now for any new 60-day window, we can get a 10-dimensional feature vector
encoded_features = encoder.predict(X_new)
```

**Explanation:**

- The autoencoder learns to reconstruct its input. The bottleneck layer (encoding_dim) forces it to learn a compressed representation that captures the most important patterns.
- After training, the encoder transforms any input window into a low‑dimensional feature vector. These features can then be used as inputs to a prediction model (e.g., to forecast the next day’s return).
- This is unsupervised transfer learning – the encoder learns general structure from many time series without needing labels. It can then be applied to any stock, potentially improving generalization when labeled data is scarce.
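
The input matrix of 60‑day windows can be built with NumPy's `sliding_window_view`. This is a sketch under assumptions: the helper name `make_return_windows`, the per‑window z‑score normalization, and the synthetic returns are illustrative, and windows from several stocks would be stacked with `np.vstack`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_return_windows(returns, window=60):
    """Turn one return series into overlapping windows of shape
    (n_windows, window), each z-scored with its own mean and std."""
    r = np.asarray(returns, dtype=float)
    windows = sliding_window_view(r, window_shape=window)
    mu = windows.mean(axis=1, keepdims=True)
    sd = windows.std(axis=1, keepdims=True)
    return (windows - mu) / np.where(sd == 0, 1.0, sd)   # guard against flat windows

rng = np.random.default_rng(0)
toy_returns = rng.normal(0.0, 0.01, size=100)   # synthetic daily returns for one stock

X = make_return_windows(toy_returns, window=60)
print(X.shape)
```

Normalizing each window with its own statistics keeps the encoder from seeing anything beyond the window itself.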

---

## **17.9 Meta‑Learning Features**

Meta‑learning (learning to learn) involves building models that learn from the performance of other models or from metadata about the learning process. In feature engineering, meta‑learning can be used to automatically select the best transformations or to generate features based on dataset characteristics.

### **17.9.1 Automated Feature Selection as Meta‑Learning**

We can treat the performance of different features across many tasks as metadata to learn which types of features are generally useful. For example, you could run experiments on many stocks and record which features (e.g., lag 1, RSI, volatility) consistently have high importance. This knowledge can then guide feature engineering for new stocks.
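
The idea of recording importances across many stocks can be sketched as follows. This is a minimal illustration under stated assumptions, not a full meta‑learning system: the data is synthetic, the symbols (`STOCK_A`, etc.) and feature names (`lag_1`, `rsi_14`, `volatility_20`) are hypothetical, and a random forest's built‑in importances stand in for whatever importance measure you prefer.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
feature_names = ["lag_1", "rsi_14", "volatility_20"]  # hypothetical feature names

def make_synthetic_stock(n_rows=300):
    """Placeholder data: by construction the target depends mostly on lag_1."""
    X = pd.DataFrame(rng.normal(size=(n_rows, 3)), columns=feature_names)
    y = 0.8 * X["lag_1"] + 0.1 * X["volatility_20"] + rng.normal(scale=0.1, size=n_rows)
    return X, y

# One "task" per stock: fit a model and record each feature's importance
records = []
for symbol in ["STOCK_A", "STOCK_B", "STOCK_C"]:  # hypothetical symbols
    X, y = make_synthetic_stock()
    model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
    for name, importance in zip(feature_names, model.feature_importances_):
        records.append({"symbol": symbol, "feature": name, "importance": importance})

importance_table = pd.DataFrame(records)

# Metadata across tasks: which features are important on average, and how consistently?
summary = (
    importance_table.groupby("feature")["importance"]
    .agg(["mean", "std"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

A high mean with a low standard deviation suggests a feature that is useful across stocks rather than on a single one. With real data, the importances should be measured on held‑out periods, since in‑sample importances can flatter noisy features.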

### **17.9.2 Feature Generation via Meta‑Features**

Meta‑features are features about the dataset itself: e.g., number of samples, mean, variance, skewness, entropy, etc. In a multi‑stock setting, you could compute meta‑features for each stock’s history and use them to predict which stock‑specific features will be most effective. This is an advanced research area but illustrates the breadth of possibilities.
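
As a rough sketch of what computing such meta‑features might look like, the snippet below summarizes each stock's return history with the statistics listed above. The helper `compute_meta_features`, the symbols, and the synthetic returns are all assumptions made for illustration; the entropy here is the Shannon entropy of a simple histogram of returns, which is only one of several reasonable definitions.

```python
import numpy as np
import pandas as pd
from scipy.stats import entropy, skew

def compute_meta_features(returns: pd.Series, n_bins: int = 10) -> dict:
    """Summarize one stock's return history (hypothetical helper)."""
    values = returns.dropna().to_numpy()
    counts, _ = np.histogram(values, bins=n_bins)
    return {
        "n_samples": len(values),
        "mean": float(values.mean()),
        "variance": float(values.var(ddof=1)),
        "skewness": float(skew(values)),
        # scipy normalizes the counts to probabilities; result is in nats
        "entropy": float(entropy(counts)),
    }

# Synthetic placeholder returns for two hypothetical symbols
rng = np.random.default_rng(42)
returns_by_symbol = {
    "STOCK_A": pd.Series(rng.normal(0.0, 0.01, size=500)),
    "STOCK_B": pd.Series(rng.normal(0.0, 0.03, size=250)),
}

# One row of meta-features per stock
meta_features = pd.DataFrame(
    {symbol: compute_meta_features(r) for symbol, r in returns_by_symbol.items()}
).T
print(meta_features)
```

The resulting table has one row per stock and could serve as the input to a second‑level model that predicts, for example, which feature families tend to work for stocks with similar characteristics.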

For most practical applications, meta‑learning is not a daily tool, but understanding the concept helps appreciate how automated systems can improve over time.

---

## **17.10 Feature Engineering Best Practices**

Advanced feature engineering techniques can greatly improve model performance, but they also increase complexity and risk. Adhering to best practices ensures that your efforts lead to robust, generalizable models.

### **17.10.1 Avoid Leakage at All Costs**

Every feature must be computable using only information available at the time of prediction. This means:
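- Lag features must use `shift()` with a positive offset, applied per symbol (e.g., after a `groupby` on the symbol column) so that values never cross from one stock into another.
- Rolling statistics must exclude the current value, for example by applying `shift(1)` before `rolling()`.
- Target encoding must use expanding means over past observations only, not global means.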
- Any transformation involving the target (e.g., scaling using target statistics) must be done in a time‑aware manner.
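
The rules above can be sketched on a toy frame. The column names (`Symbol`, `Date`, `Close`, `Target`) and the values are assumptions for illustration, and the target encoding assumes that each row's target is already known by the time the next row's prediction is made.

```python
import pandas as pd

# Toy data for two hypothetical symbols, sorted by symbol and date
df = pd.DataFrame({
    "Symbol": ["A"] * 5 + ["B"] * 5,
    "Date": list(pd.date_range("2024-01-01", periods=5)) * 2,
    "Close": [10.0, 11.0, 12.0, 13.0, 14.0, 20.0, 19.0, 18.0, 17.0, 16.0],
    "Target": [1, 0, 1, 1, 0, 0, 0, 1, 0, 1],
}).sort_values(["Symbol", "Date"]).reset_index(drop=True)

# Lag feature: yesterday's close, computed within each symbol
df["close_lag_1"] = df.groupby("Symbol")["Close"].shift(1)

# Rolling mean of the previous 3 closes: shift first so today's value is excluded
df["close_roll_mean_3"] = df.groupby("Symbol")["Close"].transform(
    lambda s: s.shift(1).rolling(window=3).mean()
)

# Expanding target encoding: mean of strictly earlier targets for the same symbol
df["symbol_target_enc"] = df.groupby("Symbol")["Target"].transform(
    lambda s: s.shift(1).expanding().mean()
)

print(df)
```

The first rows of each symbol are `NaN` by design: there is no history yet, and filling them with statistics computed over the full series would reintroduce the leakage the shifts are meant to prevent.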

### **17.10.2 Start Simple, Then Iterate**

Begin with basic features (lags, rolling means, simple technical indicators). Evaluate performance. Only add advanced features if they provide a clear improvement on a validation set. Avoid “kitchen sink” engineering that drowns the model in noise.

### **17.10.3 Use Feature Selection**

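After generating a large set of advanced features, apply feature selection (filter, wrapper, embedded) to keep only the most relevant. This reduces overfitting, speeds up training, and improves interpretability. Fit the selector on the training portion only, so that the selection step itself does not leak information from the validation or test period.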
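
A minimal filter‑style sketch using scikit‑learn's `SelectKBest` is shown below. The data is synthetic (only columns 0 and 3 carry signal by construction), and the 80/20 chronological split and `k=2` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic placeholder: 10 candidate features, only columns 0 and 3 matter
rng = np.random.default_rng(0)
n_rows, n_features = 500, 10
X = rng.normal(size=(n_rows, n_features))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=n_rows)

# Chronological split: the first 80% of rows are the training period
split = int(n_rows * 0.8)
X_train, X_test = X[:split], X[split:]
y_train = y[:split]

# Fit the selector on training data only, then reuse it on later data
selector = SelectKBest(score_func=f_regression, k=2)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

selected_columns = selector.get_support(indices=True)
print(selected_columns)
```

`f_regression` only detects linear relationships; for non‑linear dependencies, `mutual_info_regression` or an embedded method such as `SelectFromModel` may be a better fit.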

### **17.10.4 Maintain a Feature Store and Documentation**

As your feature set grows, keep a centralized repository (feature store) with definitions, versioning, and lineage. Document each feature’s rationale, formula, and temporal constraints (e.g., “requires 20 days of history”). This is essential for team collaboration and debugging.

### **17.10.5 Validate with Time‑Series Cross‑Validation**

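Because of temporal dependencies, standard k‑fold cross‑validation can leak information: some folds train on observations that occur after the ones they are tested on. Always use time‑series cross‑validation (e.g., expanding window or rolling window), such as scikit‑learn's `TimeSeriesSplit`, to evaluate models built with advanced features.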
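
An expanding‑window evaluation might look like the sketch below, which uses scikit‑learn's `TimeSeriesSplit` on synthetic data. The ridge model and the data are placeholders; the rows are assumed to be in chronological order.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic placeholder data, assumed to be sorted by time
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = X @ np.array([0.5, -0.2, 0.0, 0.1]) + rng.normal(scale=0.1, size=200)

# Expanding window: each fold trains on all rows before its test block
tscv = TimeSeriesSplit(n_splits=5)
fold_mse = []
for train_idx, test_idx in tscv.split(X):
    # Every training row precedes every test row
    assert train_idx.max() < test_idx.min()
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    fold_mse.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print(np.round(fold_mse, 4))
```

Setting `max_train_size` turns this into a rolling window, and `gap` leaves a buffer between the training and test blocks, which is useful when features or targets span several days.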

### **17.10.6 Monitor Feature Drift**

In production, feature distributions may change over time (e.g., volatility patterns shift). Monitor your engineered features for drift and retrain models when necessary. Tools like `alibi-detect` or custom statistical tests (e.g., a two‑sample Kolmogorov–Smirnov test) can help.
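
One simple custom check is a two‑sample Kolmogorov–Smirnov test comparing a feature's reference (training‑period) distribution with a recent window. The sketch below uses SciPy's `ks_2samp` on synthetic values; the helper `has_drifted`, the sample sizes, and the 0.01 threshold are illustrative assumptions rather than recommended settings.

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic placeholder values for a volatility feature
rng = np.random.default_rng(7)
reference_volatility = rng.normal(loc=0.02, scale=0.005, size=1000)  # training period
recent_shifted = rng.normal(loc=0.035, scale=0.005, size=250)        # regime change

def has_drifted(reference, recent, alpha=0.01):
    """Flag drift when the KS test rejects 'same distribution' at level alpha."""
    statistic, p_value = ks_2samp(reference, recent)
    return bool(p_value < alpha)

print(has_drifted(reference_volatility, recent_shifted))
```

When many features are monitored at once, some will cross the threshold by chance, so it is worth correcting for multiple comparisons or requiring drift to persist across several windows before triggering a retrain.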

### **17.10.7 Balance Complexity and Interpretability**

Advanced features like embeddings or hashed interactions can be opaque. If your application requires explainability (e.g., regulatory compliance), consider using simpler features or supplementing with SHAP/LIME to explain model decisions.

---

## **Chapter Summary**

In this chapter, we explored advanced feature engineering techniques that can unlock deeper patterns in time‑series data. We began with **automated feature engineering** using `tsfresh` and `Featuretools`, which can generate hundreds of candidate features quickly. We then examined **feature crosses and interactions** to capture non‑linear relationships, and **polynomial features** for trend modeling. **Binning and discretization** transform continuous variables into interpretable regimes, while **target encoding** efficiently handles high‑cardinality categoricals. **Feature hashing** offers a memory‑efficient alternative for categorical encoding. We introduced **embedding‑based features** learned via neural networks, and touched on **transfer learning** and **meta‑learning** as avenues for incorporating external knowledge. Finally, we summarized **best practices** to keep advanced feature engineering under control.

With these tools, you can enrich your NEPSE prediction system with features that capture complex market dynamics. However, always remember that more features do not guarantee better models – rigorous validation and thoughtful selection are the keys to success.

In the next chapter, we will move into **Model Development**, starting with **Chapter 18: Machine Learning Fundamentals for Time‑Series**, where we’ll cover the core concepts needed to build predictive models on top of the features we’ve engineered.

---

**End of Chapter 17**