# Chapter 52: Transfer Learning and Pre‑training

## Learning Objectives

By the end of this chapter, you will be able to:

- Understand the core concepts of transfer learning and why it is beneficial when data is limited
- Distinguish between different pre‑training strategies (supervised, self‑supervised, multi‑task)
- Apply fine‑tuning techniques to adapt a pre‑trained model to a new domain or task
- Implement domain adaptation methods to handle shifts in data distribution between source and target
- Explore self‑supervised learning approaches to learn useful representations from unlabeled time‑series data
- Use contrastive learning to build robust features without labels
- Understand the emerging role of foundation models for time‑series forecasting
- Apply few‑shot learning techniques to make predictions with very few labeled examples
- Recognise the limitations and potential pitfalls of transfer learning in financial time‑series
- Implement practical transfer learning workflows using PyTorch for the NEPSE prediction system

---

## Introduction

In the previous chapters, we trained our NEPSE stock prediction models from scratch using historical data from the Nepal Stock Exchange. However, what if we have only a few years of data? Or what if a new stock is listed and we have very little trading history? Training a deep neural network from scratch on such small datasets often leads to overfitting. **Transfer learning** offers a solution: we can leverage knowledge learned from a related task or a larger dataset and adapt it to our target problem.

Transfer learning has revolutionised computer vision and natural language processing. For time‑series forecasting, especially in finance, it is gaining traction. We can pre‑train a model on a large corpus of stocks from other exchanges (e.g., US, Indian) and then fine‑tune it on the smaller NEPSE dataset. The pre‑trained model may have learned general patterns of price movements, volatility, and technical indicators that are useful across markets.

In this chapter, we will explore various transfer learning techniques, from simple fine‑tuning to more advanced methods like domain adaptation, self‑supervised learning, and few‑shot learning. We will use the NEPSE system as a running example, showing how to implement these ideas in practice with deep learning frameworks.

---

## 52.1 Transfer Learning Concepts

**Transfer learning** is a machine learning technique where a model developed for a task is reused as the starting point for a model on a second task. It is especially popular in deep learning because training deep networks from scratch requires massive amounts of data and computational resources.

The key idea is that the first layers of a neural network learn general features (e.g., edges in images, or trend and seasonality in time‑series), while later layers learn task‑specific features. By transferring the general features, we can train a model for the new task with less data and fewer iterations.

### 52.1.1 When to Use Transfer Learning

- **Limited target data**: NEPSE has a relatively short history and few stocks compared to developed markets.
- **Similar source domain**: There exists a related domain with abundant data (e.g., other stock exchanges, or even synthetic data).
- **Computational efficiency**: Pre‑training on a large dataset can be done once, then reused.
- **Improved generalisation**: Transfer learning can reduce overfitting on the small target dataset.

### 52.1.2 Transfer Learning Scenarios

1. **Inductive transfer**: Source and target tasks are different but related. Example: pre‑train on predicting next‑day price movement for US stocks, fine‑tune on NEPSE.
2. **Transductive transfer**: Tasks are the same, but domains differ. Example: train on data from 2010–2015, adapt to 2020–2025 (concept drift).
3. **Unsupervised transfer**: Source has no labels (self‑supervised learning). Example: pre‑train a model to predict masked values in a time‑series, then fine‑tune on classification.

---

## 52.2 Pre‑training Strategies

Pre‑training is the process of training a model on a source task before adapting it to the target task. The choice of pre‑training strategy depends on the availability of labels and the similarity of tasks.

### 52.2.1 Supervised Pre‑training

If we have a large labeled dataset from a related domain (e.g., all stocks on the New York Stock Exchange with daily returns), we can pre‑train a model on that dataset. The model learns to map features to returns, which may be transferable.

**Example: Pre‑training on US stock data**

Suppose we have a dataset of daily features for 3000 US stocks over 20 years. We train an LSTM to predict next‑day return direction.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Assume we have preprocessed US stock data: X_us (samples, sequence_length, features), y_us (binary)
X_us = torch.tensor(us_features, dtype=torch.float32)
y_us = torch.tensor(us_labels, dtype=torch.float32)

dataset = TensorDataset(X_us, y_us)
dataloader = DataLoader(dataset, batch_size=256, shuffle=True)

class LSTMPredictor(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        out, _ = self.lstm(x)
        out = out[:, -1, :]  # last time step
        out = self.fc(out)
        return self.sigmoid(out).squeeze(-1)  # (batch,); squeeze(-1) keeps the batch dimension when batch size is 1

model = LSTMPredictor(input_size=10, hidden_size=64, num_layers=2)
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(10):
    for batch_x, batch_y in dataloader:
        optimizer.zero_grad()
        outputs = model(batch_x)
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")

# Save pre-trained model
torch.save(model.state_dict(), 'pretrained_us_lstm.pth')
```

**Explanation:**  
We train an LSTM classifier on the large US dataset. The model learns to extract temporal patterns that are predictive of returns. These patterns (e.g., momentum, mean reversion) may generalise to other markets.

### 52.2.2 Self‑supervised Pre‑training

Self‑supervised learning (SSL) creates labels from the data itself, without human annotation. For time‑series, common SSL tasks include:

- **Masked reconstruction**: Mask a portion of the input and train the model to predict the missing values.
- **Contrastive learning**: Pull representations of similar samples (e.g., different views of the same time‑series) closer, and push apart representations of different samples.
- **Temporal ordering**: Predict whether two segments are in correct chronological order.
- **Forecasting**: Predict future values from past values. The targets come from the series itself, so no external labels are needed.

SSL can learn powerful representations that capture the underlying structure of the data, which can then be fine‑tuned for downstream tasks with few labels.

**Example: Masked autoencoder for time‑series**

```python
import torch
import torch.nn as nn

class TimeSeriesMAE(nn.Module):
    """Masked Autoencoder for time-series."""
    def __init__(self, input_dim, hidden_dim, mask_ratio=0.3):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.encoder = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, input_dim)

    def forward(self, x):
        # x shape: (batch, seq_len, input_dim)
        # Randomly mask some time steps
        mask = torch.rand(x.shape[0], x.shape[1], 1, device=x.device) > self.mask_ratio  # True = keep, False = masked
        masked_x = x * mask.float()
        # Encode
        enc_out, _ = self.encoder(masked_x)  # (batch, seq_len, hidden_dim)
        # Decode (predict original values)
        recon = self.decoder(enc_out)  # (batch, seq_len, input_dim)
        # Compute loss only on masked positions
        loss_mask = (~mask).float()
        loss = ((recon - x) ** 2 * loss_mask).sum() / (loss_mask.sum() + 1e-8)
        return recon, loss

# Training loop
model = TimeSeriesMAE(input_dim=10, hidden_dim=64)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
for epoch in range(50):
    total_loss = 0
    for batch_x in dataloader_unlabeled:  # unlabeled data
        _, loss = model(batch_x)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {total_loss/len(dataloader_unlabeled):.4f}")
```

**Explanation:**  
The model learns to reconstruct masked portions of the input time‑series. The encoder must capture the temporal dependencies to fill in the missing values. The resulting encoder can then be used as a feature extractor for downstream tasks.
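One way to reuse the encoder for a downstream task is sketched below. It assumes an encoder shaped like the `nn.LSTM` inside `TimeSeriesMAE`; the `EncoderClassifier` class, its `freeze_encoder` flag, and the random input are illustrative names and data, not part of the original workflow.

```python
import torch
import torch.nn as nn

class EncoderClassifier(nn.Module):
    """Hypothetical wrapper: pre-trained LSTM encoder plus a new classification head."""
    def __init__(self, encoder: nn.LSTM, hidden_dim: int, freeze_encoder: bool = True):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_dim, 1)
        if freeze_encoder:
            # Keep the pre-trained weights fixed; only the head is trained
            for p in self.encoder.parameters():
                p.requires_grad = False

    def forward(self, x):
        enc_out, _ = self.encoder(x)                     # (batch, seq_len, hidden_dim)
        return self.head(enc_out[:, -1, :]).squeeze(-1)  # logits, shape (batch,)

# Stand-in for the encoder of a trained TimeSeriesMAE (i.e. model.encoder)
pretrained_encoder = nn.LSTM(10, 64, batch_first=True)
clf = EncoderClassifier(pretrained_encoder, hidden_dim=64)

x = torch.randn(4, 30, 10)  # 4 windows, 30 time steps, 10 features
logits = clf(x)
trainable = [n for n, p in clf.named_parameters() if p.requires_grad]
```

With the encoder frozen, `trainable` contains only the head's weight and bias, so the small labeled dataset is used to fit very few parameters.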

### 52.2.3 Multi‑task Pre‑training

Multi‑task learning trains a model on multiple related tasks simultaneously. For time‑series, these tasks could be predicting next‑day return, volatility, and volume. The shared representation may be more robust.

```python
class MultiTaskLSTM(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.return_head = nn.Linear(hidden_size, 1)
        self.volatility_head = nn.Linear(hidden_size, 1)
        self.volume_head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.lstm(x)
        out = out[:, -1, :]
        ret = torch.sigmoid(self.return_head(out))  # probability of an up move
        volatility = self.volatility_head(out)      # regression
        volume = self.volume_head(out)              # regression
        return ret, volatility, volume
```

Training with combined loss (e.g., BCE for return, MSE for volatility and volume) encourages the LSTM to learn features useful for all tasks.
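The combined loss can be sketched as a weighted sum of the per‑task losses. The function name `multitask_loss`, the weights `w_ret`, `w_vol`, `w_volume`, and the random tensors below are illustrative assumptions; in practice the weights need tuning because the three losses live on different scales.

```python
import torch
import torch.nn.functional as F

def multitask_loss(ret_prob, vol_pred, volume_pred, ret_y, vol_y, volume_y,
                   w_ret=1.0, w_vol=0.5, w_volume=0.5):
    """Weighted sum of BCE (return direction) and MSE (volatility, volume)."""
    loss_ret = F.binary_cross_entropy(ret_prob, ret_y)  # expects probabilities in [0, 1]
    loss_vol = F.mse_loss(vol_pred, vol_y)
    loss_volume = F.mse_loss(volume_pred, volume_y)
    return w_ret * loss_ret + w_vol * loss_vol + w_volume * loss_volume

# Synthetic stand-ins for the three model outputs and their targets
torch.manual_seed(0)
ret_prob = torch.sigmoid(torch.randn(8, 1))
vol_pred, volume_pred = torch.randn(8, 1), torch.randn(8, 1)
ret_y = torch.randint(0, 2, (8, 1)).float()
vol_y, volume_y = torch.randn(8, 1), torch.randn(8, 1)

loss = multitask_loss(ret_prob, vol_pred, volume_pred, ret_y, vol_y, volume_y)
```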

---

## 52.3 Fine‑tuning Techniques

After pre‑training, we adapt the model to the target task (NEPSE prediction) via **fine‑tuning**. The standard approach:

1. Load the pre‑trained model.
2. Replace the final classification/regression layer(s) with new randomly initialised layers suited to the target task.
3. Optionally freeze some of the early layers to preserve general features and prevent overfitting on the small target dataset.
4. Train on the target data with a lower learning rate.

**Example: Fine‑tuning the US pre‑trained LSTM on NEPSE data**

```python
# Load pre-trained model
model = LSTMPredictor(input_size=10, hidden_size=64, num_layers=2)
model.load_state_dict(torch.load('pretrained_us_lstm.pth'))

# Reinitialise the output layer. The target task is still binary classification,
# so the shape is unchanged, but the head is trained from scratch on NEPSE data.
model.fc = nn.Linear(64, 1)

# Optionally freeze early layers
for name, param in model.named_parameters():
    if 'lstm' in name:
        param.requires_grad = False  # freeze LSTM layers

# Only train the new head (and maybe unfreeze later)
optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=0.001)

# Prepare NEPSE data
X_nepse = torch.tensor(nepse_features, dtype=torch.float32)
y_nepse = torch.tensor(nepse_labels, dtype=torch.float32)
dataset = TensorDataset(X_nepse, y_nepse)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# Fine-tune
for epoch in range(10):
    for batch_x, batch_y in dataloader:
        optimizer.zero_grad()
        outputs = model(batch_x)
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")
```

**Explanation:**  
We first freeze the LSTM layers to avoid destroying the useful features learned from US data. Only the new final layer is trained for a few epochs. Then we may unfreeze the later LSTM layers and continue training with a very low learning rate to adapt the features to NEPSE specifics.

### 52.3.2 Progressive Unfreezing

A more sophisticated fine‑tuning strategy is **progressive unfreezing**: start with only the new head trainable, then gradually unfreeze layers from the top down, each time lowering the learning rate.

```python
# Phase 1: train only head
for name, param in model.named_parameters():
    if 'fc' not in name:
        param.requires_grad = False
# train for a few epochs...

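# Phase 2: unfreeze the last LSTM layer
# nn.LSTM names its parameters weight_ih_l{k}, weight_hh_l{k}, bias_ih_l{k}, bias_hh_l{k};
# with num_layers=2 the last layer carries the suffix _l1
for name, param in model.named_parameters():
    if name.startswith('lstm.') and name.endswith('_l1'):
        param.requires_grad = True
# rebuild the optimizer so it includes the newly trainable parameters,
# then continue training with lower LR...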

# Phase 3: unfreeze all
for param in model.parameters():
    param.requires_grad = True
# train with very low LR
```

### 52.3.3 Discriminative Learning Rates

Instead of a single learning rate, use different rates for different layers: lower for early layers (which capture general features) and higher for later layers (task‑specific). This is common in fine‑tuning NLP models.

```python
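# Group parameters by layer (nn.LSTM parameter names end in _l0, _l1, ...)
params = []
for name, param in model.named_parameters():
    if name.startswith('lstm.') and name.endswith('_l0'):
        params.append({'params': [param], 'lr': 1e-5})  # first LSTM layer
    elif name.startswith('lstm.') and name.endswith('_l1'):
        params.append({'params': [param], 'lr': 1e-4})  # second LSTM layer
    elif name.startswith('fc.'):
        params.append({'params': [param], 'lr': 1e-3})  # output head
optimizer = optim.Adam(params)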
```

---

## 52.4 Domain Adaptation

Domain adaptation is a subfield of transfer learning that deals with situations where the source and target domains have different distributions (e.g., US stocks vs. NEPSE stocks). The goal is to adapt the model to the target domain, often without any labels in the target (unsupervised domain adaptation) or with few labels (semi‑supervised).

### 52.4.1 Maximum Mean Discrepancy (MMD)

MMD is a distance between distributions in a reproducing kernel Hilbert space. By minimising MMD between source and target feature representations, we can learn domain‑invariant features.
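Written out for a kernel $k$, the squared MMD between a source distribution $P$ and a target distribution $Q$, together with the biased sample estimate obtained by averaging kernel matrices, is shown below. The symbols $n$, $m$, $x_i$, $y_j$ are notation introduced here for the source and target batches; the RBF kernel is one common choice.

$$
\mathrm{MMD}^2(P, Q) = \mathbb{E}_{x, x' \sim P}\,[k(x, x')] + \mathbb{E}_{y, y' \sim Q}\,[k(y, y')] - 2\,\mathbb{E}_{x \sim P,\, y \sim Q}\,[k(x, y)]
$$

$$
\widehat{\mathrm{MMD}}^2 = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} k(x_i, x_j) + \frac{1}{m^2}\sum_{i=1}^{m}\sum_{j=1}^{m} k(y_i, y_j) - \frac{2}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} k(x_i, y_j),
\qquad k(x, y) = \exp\!\left(-\frac{\lVert x - y \rVert^2}{2\sigma^2}\right)
$$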

```python
import torch

def mmd_loss(x_source, x_target, kernel='rbf', sigma=1.0):
    """Biased estimate of squared MMD between source and target features.

    x_source: (n, d), x_target: (m, d)
    """
    if kernel != 'rbf':
        raise ValueError(f"Unsupported kernel: {kernel}")
    source_source = torch.exp(-torch.cdist(x_source, x_source)**2 / (2 * sigma**2))
    target_target = torch.exp(-torch.cdist(x_target, x_target)**2 / (2 * sigma**2))
    source_target = torch.exp(-torch.cdist(x_source, x_target)**2 / (2 * sigma**2))
    return source_source.mean() + target_target.mean() - 2 * source_target.mean()

# In the training loop, take features from an intermediate layer (e.g., before the classifier).
# model.get_features, loss_task and lambda_mmd are placeholders to be defined for your model.
features_source = model.get_features(source_batch)
features_target = model.get_features(target_batch)
loss_mmd = mmd_loss(features_source, features_target)
total_loss = loss_task + lambda_mmd * loss_mmd
```

**Explanation:**  
By adding MMD loss to the task loss, we encourage the model to learn representations that are similar across domains, thus improving generalisation to the target.

### 52.4.2 Adversarial Domain Adaptation

Inspired by GANs, adversarial domain adaptation uses a domain discriminator to distinguish source from target features, and the feature extractor is trained to fool it. This results in domain‑invariant features.

```python
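import torch
import torch.nn as nn

# Layer sizes and hyperparameters below are illustrative, not tuned values.
class FeatureExtractor(nn.Module):
    """LSTM part: maps a window to a feature vector."""
    def __init__(self, input_size=10, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)

    def forward(self, x):
        out, _ = self.lstm(x)
        return out[:, -1, :]  # features at the last time step

class TaskClassifier(nn.Module):
    """Final layer for prediction (returns logits)."""
    def __init__(self, hidden_size=64):
        super().__init__()
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, features):
        return self.fc(features).squeeze(-1)

class DomainDiscriminator(nn.Module):
    """Binary classifier: source (1) vs target (0), returns logits."""
    def __init__(self, hidden_size=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden_size, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, features):
        return self.net(features)

feature_extractor, task_classifier = FeatureExtractor(), TaskClassifier()
domain_discriminator = DomainDiscriminator()
criterion = nn.BCEWithLogitsLoss()         # task loss on logits
domain_criterion = nn.BCEWithLogitsLoss()  # domain loss on logits
feat_optimizer = torch.optim.Adam(
    list(feature_extractor.parameters()) + list(task_classifier.parameters()), lr=1e-3)
disc_optimizer = torch.optim.Adam(domain_discriminator.parameters(), lr=1e-3)
lambda_adv = 0.1  # weight of the adversarial term

# Training loop (dataloader is assumed to yield source_x, source_y, target_x)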
for source_x, source_y, target_x in dataloader:
    # Extract features
    source_features = feature_extractor(source_x)
    target_features = feature_extractor(target_x)

    # Task loss on source
    task_output = task_classifier(source_features)
    task_loss = criterion(task_output, source_y)

    # Domain loss
    domain_source = domain_discriminator(source_features.detach())
    domain_target = domain_discriminator(target_features.detach())
    domain_loss_source = domain_criterion(domain_source, torch.ones_like(domain_source))
    domain_loss_target = domain_criterion(domain_target, torch.zeros_like(domain_target))
    domain_loss = (domain_loss_source + domain_loss_target) / 2

    # Total loss for discriminator
    disc_loss = domain_loss
    disc_optimizer.zero_grad()
    disc_loss.backward()
    disc_optimizer.step()

    # Adversarial loss for feature extractor (reverse gradient)
    domain_source_adv = domain_discriminator(source_features)
    domain_target_adv = domain_discriminator(target_features)
    adv_loss = -domain_criterion(domain_source_adv, torch.ones_like(domain_source_adv)) \
               -domain_criterion(domain_target_adv, torch.zeros_like(domain_target_adv))
    total_loss = task_loss + lambda_adv * adv_loss
    feat_optimizer.zero_grad()
    total_loss.backward()
    feat_optimizer.step()
```

**Explanation:**  
The feature extractor tries to make features indistinguishable to the domain discriminator, while the discriminator tries to tell them apart. The equilibrium yields domain‑invariant features.
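A common alternative to alternating two optimisers is a gradient reversal layer: an identity function in the forward pass whose backward pass flips the sign of the gradient. The sketch below is a minimal version built on `torch.autograd.Function`; the names `GradReverse`, `grad_reverse`, and the scaling factor `lambd` are chosen here for illustration.

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; multiplies the gradient by -lambd in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # One gradient per forward input: reversed for x, None for the constant lambd
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Small check: the gradient of sum(x) is +1 everywhere, so after reversal it is -lambd
features = torch.ones(3, 4, requires_grad=True)
out = grad_reverse(features, lambd=0.5).sum()
out.backward()
```

Placing `grad_reverse` between the feature extractor and the domain discriminator lets a single backward pass of `task_loss + domain_loss` train the discriminator normally while the feature extractor receives the reversed gradient.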

---

## 52.5 Self‑Supervised Learning for Time‑Series

Self‑supervised learning (SSL) has become a dominant paradigm for learning representations without labels. Several SSL methods have been adapted to time‑series.

### 52.5.1 Contrastive Learning (SimCLR for Time‑Series)

Contrastive learning aims to pull together representations of different views of the same sample (positive pairs) and push apart views of different samples (negative pairs). For time‑series, we can create views by:

- Adding noise
- Cropping and resizing
- Time warping
- Masking

**Example: SimCLR‑style contrastive learning for time‑series**

```python
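import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveTransform:
    """Apply two different random augmentations to the same time-series."""
    def __call__(self, x):
        # x: (batch, seq_len, features)
        # add_noise and time_warp are augmentation helpers assumed to be defined by the user
        aug1 = add_noise(x, scale=0.01)
        aug2 = time_warp(x, warp_factor=0.1)
        return aug1, aug2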

def contrastive_loss(z1, z2, temperature=0.5):
    """NT-Xent loss."""
    batch_size = z1.shape[0]
    z = torch.cat([z1, z2], dim=0)  # 2N
    similarity = F.cosine_similarity(z.unsqueeze(1), z.unsqueeze(0), dim=2) / temperature
    # Mask out self-comparisons
    mask = torch.eye(2*batch_size, device=z.device).bool()
    similarity.masked_fill_(mask, -float('inf'))
    # Positive pairs: (i, i+batch_size) for i=0..N-1
    labels = torch.cat([torch.arange(batch_size, 2*batch_size), torch.arange(0, batch_size)], dim=0).to(z.device)
    loss = F.cross_entropy(similarity, labels)
    return loss

# Training loop
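model = TimeSeriesEncoder()  # placeholder: e.g., LSTM or Transformer returning (batch, hidden_dim)
projector = nn.Sequential(nn.Linear(hidden_dim, 128), nn.ReLU(), nn.Linear(128, 64))
transform = ContrastiveTransform()
optimizer = torch.optim.Adam(list(model.parameters()) + list(projector.parameters()), lr=1e-3)

for batch_x in unlabeled_dataloader:
    x1, x2 = transform(batch_x)  # one call returns the two augmented views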
    h1 = model(x1)
    h2 = model(x2)
    z1 = projector(h1)
    z2 = projector(h2)
    loss = contrastive_loss(z1, z2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

**Explanation:**  
The model learns to produce similar representations for two augmented views of the same time‑series, while distinguishing them from views of other series. After training, the encoder can be used for downstream tasks.
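The example above calls `add_noise` and `time_warp` without defining them. One possible implementation is sketched here, assuming inputs of shape `(batch, seq_len, features)` or `(seq_len, features)` and `warp_factor` below 1; the warping scheme (linear interpolation on a randomly stretched time grid) is a simple choice made for illustration, not the only way to warp a series.

```python
import torch

def add_noise(x, scale=0.01):
    """Add zero-mean Gaussian noise with standard deviation `scale`."""
    return x + scale * torch.randn_like(x)

def time_warp(x, warp_factor=0.1):
    """Resample along time on a randomly perturbed, monotonic grid (linear interpolation)."""
    seq_len = x.shape[-2]
    # Random positive step sizes in [1 - w, 1 + w]; their cumulative sum is a warped clock
    steps = 1.0 + warp_factor * (2 * torch.rand(seq_len - 1, device=x.device) - 1)
    grid = torch.cat([torch.zeros(1, device=x.device), steps.cumsum(0)])
    grid = grid / grid[-1] * (seq_len - 1)           # rescale to [0, seq_len - 1]
    lo = grid.floor().long().clamp(max=seq_len - 2)  # index of the left neighbour
    frac = (grid - lo.float()).unsqueeze(-1)         # interpolation weight, (seq_len, 1)
    return (1 - frac) * x[..., lo, :] + frac * x[..., lo + 1, :]

torch.manual_seed(0)
x = torch.randn(16, 30, 10)  # synthetic batch: 16 windows, 30 steps, 10 features
aug1 = add_noise(x, scale=0.01)
aug2 = time_warp(x, warp_factor=0.1)
```

Both functions preserve the input shape, and `time_warp` keeps the first and last time steps fixed while stretching or compressing what lies between them.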

### 52.5.2 Temporal Contrastive Learning

For time‑series, we can also use temporal proximity: nearby time steps should have similar representations, while distant ones should differ. This is similar to CPC (Contrastive Predictive Coding).

```python
# Example: predict future representations from past
past = encoder(series[:, :t])
future = encoder(series[:, t:])
# Use contrastive loss to match correct future with its past against negatives
```

---

## 52.6 Foundation Models for Time‑Series

Recently, large pre‑trained models for time‑series have emerged, analogous to GPT and BERT in NLP. These **foundation models** are trained on massive collections of time‑series data and can be fine‑tuned for various tasks. Examples include:

- **Chronos** (Amazon): Pre‑trained on a large corpus of time‑series.
- **Lag‑Llama**: A foundation model for forecasting.
- **Moirai**: Another time‑series foundation model.

These models can be used in a zero‑shot or few‑shot manner, potentially outperforming models trained from scratch on small datasets like NEPSE.

### 52.6.1 Using a Pre‑trained Foundation Model

Assuming we have access to a pre‑trained Chronos model (via HuggingFace or similar), we can use it for NEPSE prediction.

```python
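# Chronos ships in its own package (pip install chronos-forecasting),
# not as a model class in `transformers`. Check the API of your installed version.
import torch
from chronos import ChronosPipeline

# Load pre-trained model
pipeline = ChronosPipeline.from_pretrained('amazon/chronos-t5-small', device_map='cpu')

# Prepare NEPSE data as a 1-D tensor of past values for one series
# (nepse_close is a placeholder for your own price history)
context = torch.tensor(nepse_close, dtype=torch.float32)

# Zero-shot: feed the history and draw sample forecasts for the next 5 steps
forecast = pipeline.predict(context, 5)

# Fine-tune on NEPSE (if supported)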

**Note:** Foundation models are still an emerging area; availability and ease of use vary.

---

## 52.7 Multi‑Task Learning

Multi‑task learning (MTL) can be seen as a form of transfer learning where knowledge is shared across tasks. In the NEPSE context, we might jointly predict:

- Next‑day direction (classification)
- Next‑week volatility (regression)
- Trading volume (regression)

The shared layers learn representations useful for all tasks, which can improve performance on each, especially if data for some tasks is limited.

The PyTorch implementation (a shared LSTM with task‑specific heads) is the `MultiTaskLSTM` class shown in Section 52.2.3.

---

## 52.8 Few‑Shot Learning

Few‑shot learning aims to learn from a very small number of labeled examples. This is relevant for NEPSE when a new stock is listed with only a few months of data. Techniques include:

- **Metric‑based methods**: Learn an embedding space where examples of the same class are close. For a new class, compare its few examples to the embeddings of known classes.
- **Prototypical networks**: Compute a prototype (mean embedding) for each class from few shots; classify new points by nearest prototype.
- **Meta‑learning (learning to learn)**: Train on many episodes of few‑shot tasks to learn how to adapt quickly.

**Example: Prototypical Network for few‑shot classification of NEPSE patterns**

```python
class PrototypicalNetwork(nn.Module):
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder

    def forward(self, support, query, n_way, k_shot):
        # support: (n_way * k_shot, seq_len, features), ordered class by class
        # query: (n_query, seq_len, features)
        support_emb = self.encoder(support)
        query_emb = self.encoder(query)

        # Compute prototypes (mean per class)
        prototypes = support_emb.view(n_way, k_shot, -1).mean(dim=1)

        # Compute distances from query to prototypes
        dist = torch.cdist(query_emb, prototypes, p=2)  # (n_query, n_way)
        logits = -dist
        return logits

# Training: sample episodes from base classes
# Fine-tuning: on new stock with few examples
```

**Explanation:**  
The model learns an embedding function that clusters same‑class examples together. When faced with a new class (e.g., a newly listed stock), we compute its prototype from the few labeled examples and classify by proximity.
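Episodic training can be sketched as follows. The `sample_episode` helper, the `LastStepEncoder`, and the synthetic data are assumptions made for illustration; the sampler assumes every class has at least `k_shot + n_query` examples and returns the support set ordered class by class, which is the layout the prototype computation relies on.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_episode(X, y, n_way, k_shot, n_query):
    """Sample one episode; support and query are ordered class by class."""
    classes = torch.unique(y)
    chosen = classes[torch.randperm(len(classes))[:n_way]]
    support, query, query_labels = [], [], []
    for new_label, c in enumerate(chosen):
        idx = torch.nonzero(y == c, as_tuple=True)[0]
        idx = idx[torch.randperm(len(idx))[:k_shot + n_query]]
        support.append(X[idx[:k_shot]])
        query.append(X[idx[k_shot:]])
        query_labels.append(torch.full((n_query,), new_label, dtype=torch.long))
    return torch.cat(support), torch.cat(query), torch.cat(query_labels)

class LastStepEncoder(nn.Module):
    """Hypothetical encoder: GRU whose final hidden state is the embedding."""
    def __init__(self, input_dim, emb_dim):
        super().__init__()
        self.gru = nn.GRU(input_dim, emb_dim, batch_first=True)

    def forward(self, x):
        _, h = self.gru(x)
        return h[-1]

torch.manual_seed(0)
X = torch.randn(60, 20, 10)  # 60 synthetic windows
y = torch.arange(60) % 5     # 5 pattern classes, 12 windows each
encoder = LastStepEncoder(10, 32)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

# One training episode
n_way, k_shot, n_query = 3, 5, 4
support, query, query_y = sample_episode(X, y, n_way, k_shot, n_query)
prototypes = encoder(support).reshape(n_way, k_shot, -1).mean(dim=1)
logits = -torch.cdist(encoder(query), prototypes, p=2)  # closer prototype = higher score
loss = F.cross_entropy(logits, query_y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```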

---

## 52.9 Implementation Considerations

When applying transfer learning to the NEPSE system, consider:

- **Data preprocessing**: Ensure source and target data are preprocessed similarly (same features, scaling, sequence length). If not, you may need to adapt.
- **Model architecture**: The pre‑trained model's input size must match the target. If features differ, you may need to add a projection layer.
- **Overfitting**: Fine‑tuning on a very small dataset can still overfit. Use strong regularisation (dropout, weight decay) and early stopping.
- **Evaluation**: Use time‑series cross‑validation to estimate performance. Compare with training from scratch.
- **Computational cost**: Pre‑training on large datasets may be expensive. Use cloud GPUs or existing pre‑trained models.
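For the evaluation point above, a minimal expanding‑window splitter is sketched below in plain Python. The function name and the `gap` parameter (samples skipped between training and test data to limit leakage from overlapping windows) are assumptions; library implementations such as scikit‑learn's `TimeSeriesSplit` offer similar behaviour.

```python
def expanding_window_splits(n_samples, n_splits, test_size, gap=0):
    """Yield (train_idx, test_idx) ranges where each test fold follows its training data."""
    for i in range(n_splits):
        test_end = n_samples - (n_splits - 1 - i) * test_size
        test_start = test_end - test_size
        train_end = test_start - gap
        if train_end <= 0:
            raise ValueError("Not enough samples for the requested splits")
        yield range(0, train_end), range(test_start, test_end)

splits = list(expanding_window_splits(n_samples=100, n_splits=3, test_size=10, gap=2))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}, test {test_idx[0]}..{test_idx[-1]}")
# train 0..67, test 70..79
# train 0..77, test 80..89
# train 0..87, test 90..99
```

Running the same splits for a model trained from scratch and for the fine‑tuned model gives a like‑for‑like comparison on each fold.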

**Example: Projection layer for mismatched features**

```python
# If pre-trained model expects 10 features but we have 8
class AdaptedModel(nn.Module):
    def __init__(self, pretrained_model):
        super().__init__()
        self.projection = nn.Linear(8, 10)  # map our 8 to 10
        self.pretrained = pretrained_model

    def forward(self, x):
        x = self.projection(x)
        return self.pretrained(x)
```

---

## 52.10 Limitations and Pitfalls

- **Negative transfer**: If source and target are too dissimilar, pre‑training may hurt performance.
- **Catastrophic forgetting**: During fine‑tuning, the model may forget useful source knowledge. Use lower learning rates and possibly replay (mixing some source data into the fine‑tuning batches).
- **Domain shift**: Financial markets differ significantly across countries (regulations, investor behaviour). Simple fine‑tuning may not suffice; domain adaptation may be necessary.
- **Temporal shift**: Even within the same market, the distribution changes over time. Pre‑training on old data and fine‑tuning on recent may still face concept drift.

---

## Chapter Summary

In this chapter, we explored transfer learning and pre‑training techniques and their application to the NEPSE stock prediction system. We covered:

- The core concepts of transfer learning and when it is beneficial.
- Supervised pre‑training on a related large dataset (e.g., US stocks).
- Self‑supervised pre‑training methods like masked autoencoders and contrastive learning to learn from unlabeled data.
- Multi‑task pre‑training to learn shared representations.
- Fine‑tuning strategies, including progressive unfreezing and discriminative learning rates.
- Domain adaptation techniques (MMD, adversarial) to handle distribution shifts.
- Few‑shot learning for scenarios with very limited labeled data.
- Emerging foundation models for time‑series.
- Practical implementation tips and potential pitfalls.

By leveraging transfer learning, we can build more accurate NEPSE prediction models even with limited local data, and adapt more quickly to new stocks or changing market conditions. In the next chapter, we will discuss **Automated Machine Learning** and how to automate the process of model selection, feature engineering, and hyperparameter tuning.

---

**End of Chapter 52**

