In [None]:
# === Environment Setup ===
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, Markdown, Image
import tensorflow as tf
from tensorflow.keras import layers, Model, Input

# --- Configuration ---
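# The 'seaborn-v0_8-*' style names exist only in matplotlib >= 3.6; keep the default style otherwise.
if 'seaborn-v0_8-whitegrid' in plt.style.available:
    plt.style.use('seaborn-v0_8-whitegrid')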
plt.rcParams.update({'font.size': 14, 'figure.figsize': (10, 6), 'figure.dpi': 150})
np.set_printoptions(suppress=True, linewidth=120, precision=4)

# --- Utility Functions ---
def note(msg): display(Markdown(f"<div class='alert alert-info'>📝 {msg}</div>"))
def sec(title): print(f"\n{80*'='}\n| {title.upper()} |\n{80*'='}")

note("Environment initialized for Multi-modal Fusion.")

# Chapter 7.14: Multi-modal Fusion

---

### Table of Contents

1.  [**What is Multi-modal Learning?**](#intro)
2.  [**Economic Applications**](#econ-apps)
3.  [**Fusion Strategies**](#fusion)
    - [Early Fusion (Feature-level)](#early-fusion)
    - [Late Fusion (Decision-level)](#late-fusion)
    - [Intermediate/Hybrid Fusion](#intermediate-fusion)
4.  [**Code Lab: Fusing Tabular and Text Data**](#code-lab)
5.  [**Challenges in Multi-modal Learning**](#challenges)
6.  [**Summary**](#summary)

<a id='intro'></a>
## 1. What is Multi-modal Learning?

Humans perceive the world by integrating information from multiple senses (vision, hearing, touch). Similarly, **multi-modal machine learning** aims to build models that can process and relate information from multiple data types, or **modalities**.

Common modalities include:
- **Tabular Data**: Structured data from databases, spreadsheets (e.g., financial statements).
- **Text**: Unstructured text from news articles, reports, social media.
- **Images**: Visual data from satellites, charts, products.
- **Audio**: Speech from earnings calls, interviews.

The core idea is that different modalities can provide complementary information. For example, the numerical data in a financial report might tell us *what* a company's revenue is, while the text in the report might explain *why* it has changed. The goal of multi-modal fusion is to combine these disparate sources to create a more robust and accurate predictive model than could be achieved using any single modality alone.

<a id='econ-apps'></a>
## 2. Economic Applications

Multi-modal models are becoming increasingly relevant in economics and finance:

- **Credit Scoring**: Combining structured financial data (income, debt) with unstructured text from loan applications.
- **Asset Pricing**: Fusing quantitative market data with sentiment analysis from news articles and company reports (e.g., 10-K filings).
- **Economic Forecasting**: Integrating satellite imagery (e.g., nighttime lights, shipping activity) with traditional macroeconomic time series.
- **Real Estate Valuation**: Combining property features (square footage, bedrooms) with images of the property and text descriptions from listings.

<a id='fusion'></a>
## 3. Fusion Strategies

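The central challenge in multi-modal learning is deciding *how* and *when* to combine the information from different modalities. The three primary strategies, illustrated below, differ in where fusion happens: at the input features (early), at the model outputs (late), or at a hidden layer inside the network (intermediate).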

In [None]:
display(Image(filename='../images/07-Machine-Learning/multimodal_fusion_strategies.png'))

<a id='early-fusion'></a>
### 3.1 Early Fusion (Feature-level)

In this approach, features are extracted from each modality and concatenated together at the input layer. A single, unified model is then trained on this combined feature vector.

- **Pros**: Simple to implement. Can learn correlations between low-level features from different modalities.
- **Cons**: Modalities must be easy to align. The resulting feature vector can be very high-dimensional and sparse. Can be sensitive to missing modalities.
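As a minimal sketch of early fusion, assume each modality has already been reduced to a fixed-length numeric vector per sample. The array names, dimensions, and random values below are hypothetical placeholders, not real data. Under that assumption, the fusion step is a column-wise concatenation, here preceded by per-modality standardization so that no modality dominates purely because of its scale:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 5

# Hypothetical pre-extracted features (illustrative random values, not real data).
tabular_features = rng.normal(size=(n_samples, 10))   # e.g., 10 numeric property features
text_features = rng.normal(size=(n_samples, 300))     # e.g., a 300-dim text embedding per listing

def standardize(x):
    """Scale each column to zero mean and unit variance."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

# Early fusion: one combined feature matrix, to be fed to a single downstream model.
# Row i of both arrays must describe the same sample (the alignment requirement).
fused = np.concatenate(
    [standardize(tabular_features), standardize(text_features)], axis=1
)
print(fused.shape)  # (5, 310)
```

Note how the 300 text columns outnumber the 10 tabular columns in the fused matrix, which is one practical face of the dimensionality concern listed above.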

<a id='late-fusion'></a>
### 3.2 Late Fusion (Decision-level)

In late fusion, a separate model is trained for each modality. The final prediction is made by combining the outputs (e.g., prediction scores or class probabilities) of these individual models, often through averaging, voting, or a simple meta-learner.

- **Pros**: Robust to missing modalities. Allows for modality-specific model architectures. Conceptually simple.
- **Cons**: Cannot learn low-level interactions between modalities, because each model sees only its own modality and the outputs are combined only at the very end.
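A minimal sketch of decision-level fusion for a regression task, assuming two models have already been trained separately and have each produced predictions for the same samples. The prediction values, the `late_fusion` helper, and the weights are hypothetical and chosen only for illustration:

```python
import numpy as np

# Hypothetical house-price predictions (in $1000s) from two independently trained models.
pred_tabular = np.array([310.0, 455.0, 289.0])  # model that sees only tabular features
pred_text = np.array([330.0, 440.0, 301.0])     # model that sees only listing text

def late_fusion(predictions, weights=None):
    """Combine per-modality predictions with a (weighted) average.

    predictions: list of 1-D arrays, one per modality, aligned by sample.
    weights: optional per-modality weights; None gives a simple mean.
    """
    stacked = np.stack(predictions, axis=0)  # shape: (n_modalities, n_samples)
    return np.average(stacked, axis=0, weights=weights)

fused_mean = late_fusion([pred_tabular, pred_text])
fused_weighted = late_fusion([pred_tabular, pred_text], weights=[0.75, 0.25])

print(fused_mean)      # [320.  447.5 295. ]
print(fused_weighted)  # [315.   451.25 292.  ]
```

If one modality is unavailable at prediction time, its array can simply be left out of the list, which is the robustness property noted above. In practice the weights might be chosen on a validation set or replaced by a small meta-learner.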

<a id='intermediate-fusion'></a>
### 3.3 Intermediate (or Hybrid) Fusion

This strategy offers a compromise. Separate neural network branches process each modality initially. The outputs of these branches are then fused at an intermediate layer within the network, allowing for further joint processing before the final prediction is made.

- **Pros**: Balances the benefits of early and late fusion. Can learn complex interactions between modalities at a higher level of abstraction.
- **Cons**: More complex to design and tune. The optimal point of fusion is a key hyperparameter.

<a id='code-lab'></a>
## 4. Code Lab: Fusing Tabular and Text Data

Let's build a simple intermediate fusion model to predict a house price (a regression task) using both structured tabular data (e.g., number of bedrooms, square footage) and unstructured text data (e.g., property description).

This setting is common in economics and finance, where quantitative and qualitative data often describe the same object. The cell below only defines and compiles the architecture; no real dataset is loaded, so treat it as a template to adapt to real-world data.

In [None]:
sec("Building a Multi-modal Fusion Model")

# --- 1. Define Inputs ---
# We define two separate inputs for our model, one for each modality.
input_tabular = Input(shape=(10,), name='tabular_input') # Represents 10 numerical features.
input_text = Input(shape=(100,), dtype='int32', name='text_input')  # A sequence of 100 integer-encoded words (token ids, hence int32).

# --- 2. Tabular Branch ---
# This branch processes the structured, tabular data.
# It consists of a simple Multi-Layer Perceptron (MLP).
tabular_branch = layers.Dense(32, activation='relu')(input_tabular)
tabular_branch = layers.Dense(16, activation='relu')(tabular_branch)
tabular_model = Model(inputs=input_tabular, outputs=tabular_branch)

# --- 3. Text Branch ---
# This branch processes the unstructured text data.
# It uses an Embedding layer to convert the integer-encoded words into dense vectors,
# and an LSTM layer to capture sequential patterns in the text.
text_branch = layers.Embedding(input_dim=10000, output_dim=64)(input_text) # Vocabulary size of 10000.
text_branch = layers.LSTM(32)(text_branch)
text_model = Model(inputs=input_text, outputs=text_branch)

# --- 4. Fusion (Intermediate) ---
# The outputs of the two branches are concatenated to form a single feature vector.
# This is the 'fusion' step.
combined = layers.concatenate([tabular_model.output, text_model.output])

# --- 5. Joint Processing ---
# After fusion, the combined feature vector is passed through a few more Dense layers
# to learn the relationships between the fused features.
z = layers.Dense(32, activation="relu")(combined)
z = layers.Dense(16, activation="relu")(z)
# The final output layer has a single neuron with a linear activation function for our regression task.
output = layers.Dense(1, activation="linear")(z)

# --- 6. Create the Final Model ---
# The final model takes the two inputs and produces a single output.
model = Model(inputs=[input_tabular, input_text], outputs=output)
model.compile(optimizer='adam', loss='mean_squared_error')

note("Multi-modal model created. It accepts two inputs and fuses them.")
model.summary()
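To exercise the architecture end to end, one option is to simulate inputs with the shapes the two `Input` layers expect. The sketch below is an assumption-laden smoke test rather than a realistic dataset: the sample size, coefficients, and noise level are arbitrary, and the random token ids carry no signal, so only the tabular features are related to the target.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_tabular, seq_len, vocab_size = 256, 10, 100, 10000

# Simulated inputs matching the shapes of `tabular_input` and `text_input`.
X_tabular = rng.normal(size=(n_samples, n_tabular)).astype("float32")
X_text = rng.integers(low=1, high=vocab_size, size=(n_samples, seq_len)).astype("int32")

# Simulated target: a linear function of the tabular features plus Gaussian noise.
# The token sequences are pure noise, so this checks that the pipeline runs,
# not that the text branch learns anything useful.
true_coefs = rng.normal(size=n_tabular)
y = (X_tabular @ true_coefs + rng.normal(scale=0.1, size=n_samples)).astype("float32")

print(X_tabular.shape, X_text.shape, y.shape)  # (256, 10) (256, 100) (256,)
```

With `model` from the cell above, a short training run could then look like `model.fit([X_tabular, X_text], y, epochs=2, batch_size=32, validation_split=0.2)`. The order of the arrays in the list must match the order of `inputs` passed to `Model`.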

<a id='challenges'></a>
## 5. Challenges in Multi-modal Learning

Despite its power, multi-modal learning presents several challenges:

- **Alignment**: How to align data from different sources? (e.g., synchronizing audio and video streams).
- **Representation**: How to learn joint representations that capture both modality-specific and shared information?
- **Scalability**: Each added modality brings its own encoder, preprocessing pipeline, and hyperparameters, so model size and training cost grow with the number of modalities.
- **Missing Data**: How should the model behave if one modality is unavailable at prediction time? (This is often addressed with techniques like *modality dropout* during training).
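The idea behind modality dropout can be sketched in a few lines. This is one simple variant under stated assumptions, not a canonical definition: each modality is represented by a per-sample feature vector, an all-zero vector stands in for "missing", and during training each sample loses at most one modality with probability `p_drop`. The function name and parameters are hypothetical.

```python
import numpy as np

def modality_dropout(x_tabular, x_text, p_drop=0.2, rng=None):
    """Zero out one whole modality for a random subset of samples (never both).

    x_tabular, x_text: 2-D arrays of per-sample features, aligned by row.
    p_drop: probability that a given sample loses one of its modalities.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = x_tabular.shape[0]
    drop = rng.random(n) < p_drop          # which samples lose a modality
    which = rng.integers(0, 2, size=n)     # 0 -> drop tabular, 1 -> drop text
    keep_tabular = ~(drop & (which == 0))
    keep_text = ~(drop & (which == 1))
    # Broadcast the per-sample keep masks across the feature dimension.
    return x_tabular * keep_tabular[:, None], x_text * keep_text[:, None]

# Illustrative inputs: all-ones features make the zeroed rows easy to see.
tab = np.ones((6, 4))
txt = np.ones((6, 3))
tab_dropped, txt_dropped = modality_dropout(tab, txt, p_drop=0.5, rng=np.random.default_rng(0))
print(tab_dropped[:, 0], txt_dropped[:, 0])
```

Applied to each training batch, this exposes the joint layers to samples with a missing modality, so that at prediction time a zero vector for an unavailable modality is an input the model has already seen.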

<a id='summary'></a>
## 6. Summary

Multi-modal fusion is a powerful technique that allows models to develop a more holistic understanding of a problem by integrating data from various sources. For economists, this opens up exciting possibilities for combining traditional structured data with the vast amounts of unstructured text, image, and sensor data that are now available. The choice of fusion strategy (early, late, or intermediate) is a critical design decision that depends on the specific characteristics of the data and the problem at hand.