In [None]:
# === Environment Setup ===
import os, sys, math, time, random, json, textwrap, warnings
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
import graphviz
try:
    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras import layers
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    import shap
    TENSORFLOW_AVAILABLE = True
except ImportError:
    TENSORFLOW_AVAILABLE = False
from IPython.display import display, Markdown, Image  # Image is used in Code Lab 3

# --- Configuration ---
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams.update({'font.size': 14, 'figure.figsize': (12, 7), 'figure.dpi': 150})
np.set_printoptions(suppress=True, linewidth=120, precision=4)

# --- Utility Functions ---
def note(msg, **kwargs): display(Markdown(f"<div class='alert alert-block alert-info'>📝 **Note:** {msg}</div>"))
def sec(title): print(f"\n{100*'='}\n| {title.upper()} |\n{100*'='}")

if not TENSORFLOW_AVAILABLE: note("TensorFlow/Keras or SHAP is not installed. Skipping code labs. Run `pip install tensorflow shap`.")
note(f"Environment initialized. TensorFlow/SHAP available: {TENSORFLOW_AVAILABLE}")

# Part 7: Advanced and Frontier Topics
## Chapter 7.19: Deep Learning for Economists

### Introduction: A New Class of Non-Parametric Estimators

**Deep Learning** refers to a class of machine learning models based on **artificial neural networks** with many layers (hence "deep"). For economists, it is useful to think of deep learning not as a mysterious black box, but as a powerful new toolkit for **highly flexible, non-parametric function approximation**. These models have achieved state-of-the-art performance on a vast range of tasks, particularly those involving high-dimensional or unstructured data (like images, text, and geospatial data) where traditional econometric methods may struggle due to restrictive parametric assumptions.

This notebook provides a rigorous, PhD-level introduction to the core concepts of deep learning, focusing on the theoretical underpinnings and the practical considerations for economic applications. We will dissect the optimization algorithms that make training possible, explore the key architectural families (MLPs, CNNs, RNNs), and delve into the frontier of integrating deep learning with causal inference and model interpretability.

## 1. Theoretical Foundations

Before building models, we must understand the theory that justifies their use and the core mechanism by which they learn.

### 1.1 The Universal Approximation Theorem: A Network's Right to Exist

**Intellectual Provenance:** The foundational ideas of the Universal Approximation Theorem were established by George Cybenko in 1989 for sigmoid activations and later generalized by Kurt Hornik, Maxwell Stinchcombe, and Halbert White. It is the seminal theoretical result that legitimizes the entire field of neural networks, answering the fundamental question: "Is a neural network, as a mathematical object, even capable of representing the complex functions we care about?" The theorem's answer is a resounding "yes." It establishes that even a simple network has the **expressive capacity** to model a vast range of functions, giving us the philosophical license to begin our search for a solution within this powerful model class.

Formally, for any continuous function $f: K \to ℝ^m$, where $K$ is a compact (a closed and bounded) subset of $ℝ^n$, and for any desired precision $\epsilon > 0$, there exists a feed-forward network $g(x)$ with a single hidden layer of finite width, such that:
$$\sup_{x \in K} ||f(x) - g(x)||_{\infty} < \epsilon$$

The intuition behind the theorem is that a network constructs a complex function by adding together many simple "building block" functions, with each neuron contributing one elemental block. Using non-polynomial activation functions (like the common **ReLU**), a network can create localized "bump" functions. By summing these bumps, it can approximate any continuous function, much like a Fourier series uses sines and cosines.
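
The building-block intuition can be sketched numerically. The snippet below is a minimal illustration, not a proof: it assumes the target $f(x) = \sin(x)$ on $[0, 2\pi]$ (chosen only for illustration) and sets the weights of a one-hidden-layer ReLU network by hand, rather than by training, so that the network reproduces the piecewise-linear interpolant of $f$. The helper name `relu_interpolant` is hypothetical.

```python
import numpy as np

def relu_interpolant(f, a, b, width):
    """Hand-construct a one-hidden-layer ReLU network with `width` hidden units
    that linearly interpolates f at width + 1 equally spaced knots on [a, b]."""
    knots = np.linspace(a, b, width + 1)
    values = f(knots)
    slopes = np.diff(values) / np.diff(knots)       # slope on each segment
    # Output weights: the first slope, then the change in slope at each interior knot.
    out_w = np.concatenate([[slopes[0]], np.diff(slopes)])
    hidden_b = knots[:-1]                           # one ReLU "kink" per hidden unit

    def g(x):
        hidden = np.maximum(0.0, x[:, None] - hidden_b[None, :])  # ReLU(x - knot)
        return values[0] + hidden @ out_w

    return g

grid = np.linspace(0.0, 2 * np.pi, 2001)
sup_errors = {}
for width in (4, 16, 64):
    g = relu_interpolant(np.sin, 0.0, 2 * np.pi, width)
    sup_errors[width] = np.max(np.abs(np.sin(grid) - g(grid)))
    print(f"width={width:3d}  sup-norm error={sup_errors[width]:.5f}")
```

The sup-norm error on the grid shrinks as the hidden layer widens, which is the one-dimensional version of the statement above: for any $\epsilon$, some finite width suffices.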

#### The Profound Limitations: Why "Deep" Learning is Necessary

An expert's understanding of the UAT is defined by knowing what it *does not* say. These limitations are the very reason that the field is predicated on the concept of **Depth**.

1.  **The Learnability Gap:** The theorem is purely an **existence proof**. It guarantees a solution exists but provides no algorithm for finding the ideal weights. The loss landscapes of neural networks are high-dimensional, non-convex, and filled with saddle points, making optimization extremely challenging. This is addressed by sophisticated optimization algorithms (Section 2).

2.  **The Efficiency Gap:** The shallow network whose existence the UAT guarantees can be pathologically inefficient. For functions with complex, high-frequency details or for high-dimensional input data, the required width of the single hidden layer can grow **exponentially** with the input dimension. This "curse of dimensionality" makes the shallow network approach computationally infeasible.

3.  **The Power of Hierarchy:** This is the most crucial motivation for deep architectures. For hierarchical, compositional functions, deep networks can be **exponentially more parameter-efficient** than shallow ones, and such functions describe the structure of much real-world data (e.g., in images, faces are composed of eyes and noses, which are composed of edges and textures). A deep network's layered structure mirrors this by learning a hierarchy of features in which low-level features are reused by many higher-level ones. This leads to the concept of **inductive bias**, where specific architectures are designed to exploit specific data structures (Section 4).

### 1.2 Backpropagation: The Engine of Learning

**Intellectual Provenance:** While the chain rule is centuries old, its efficient application to neural networks, known as backpropagation, has a rich history. The core ideas were developed independently by several researchers in the 1960s and 70s. However, it was the 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams, "Learning representations by back-propagating errors," that popularized the method and demonstrated its power for training multi-layer networks, cementing it as the foundational algorithm for the field.

Backpropagation is simply a computationally efficient algorithm for applying the **chain rule of calculus** to compute the gradients of the loss function with respect to every weight in the network. These gradients are then used by an optimization algorithm (like SGD or Adam) to update the weights.

Consider a simple 2-layer MLP for a regression task with a single input $x$, one hidden neuron, and one output neuron $\hat{y}$. The loss is $L = (y - \hat{y})^2$. 
- **Forward Pass:**
  1. Hidden layer activation: $h = \sigma(w_1 x + b_1)$
  2. Output layer: $\hat{y} = w_2 h + b_2$
  3. Loss: $L = (y - \hat{y})^2$

- **Backward Pass (Gradient Calculation):** To update weight $w_1$, we need $\frac{\partial L}{\partial w_1}$. Using the chain rule:
$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial w_1}$$

Let's compute each term:
1. $\frac{\partial L}{\partial \hat{y}} = -2(y - \hat{y})$ (The error signal)
2. $\frac{\partial \hat{y}}{\partial h} = w_2$ (How the output depends on the hidden activation)
3. $\frac{\partial h}{\partial w_1} = \sigma'(w_1 x + b_1) \cdot x$ (How the hidden activation depends on the weight, where $\sigma'$ is the derivative of the activation function)

Backpropagation works by first computing the error at the output and then propagating this error signal *backwards* through the network, layer by layer. At each layer, it uses the incoming gradient and the local gradient to compute the gradient for the layer below, efficiently reusing calculations. The core challenge this creates in deep networks is the **vanishing/exploding gradient problem**, where the product of many small/large derivatives can cause the gradient signal to shrink to zero or grow to infinity.
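
The three chain-rule terms can be checked against a finite-difference approximation. This is a minimal sketch that assumes a sigmoid activation for $\sigma$ (the text leaves the activation generic) and arbitrary illustrative parameter values; the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, b1, w2, b2, x, y):
    h = sigmoid(w1 * x + b1)         # hidden activation
    y_hat = w2 * h + b2              # linear output
    return h, y_hat, (y - y_hat) ** 2

def grad_w1(w1, b1, w2, b2, x, y):
    h, y_hat, _ = forward(w1, b1, w2, b2, x, y)
    dL_dyhat = -2.0 * (y - y_hat)    # term 1: the error signal
    dyhat_dh = w2                    # term 2
    dh_dw1 = h * (1.0 - h) * x       # term 3: sigma'(z) * x, with sigma' = h(1 - h)
    return dL_dyhat * dyhat_dh * dh_dw1

# Arbitrary illustrative values (assumptions, not from the text)
w1, b1, w2, b2, x, y = 0.5, -0.1, 1.5, 0.2, 2.0, 1.0
analytic = grad_w1(w1, b1, w2, b2, x, y)

# Central finite difference in w1 as an independent check
eps = 1e-6
loss_plus = forward(w1 + eps, b1, w2, b2, x, y)[2]
loss_minus = forward(w1 - eps, b1, w2, b2, x, y)[2]
numeric = (loss_plus - loss_minus) / (2 * eps)

print(f"analytic: {analytic:.8f}  numeric: {numeric:.8f}")
```

Agreement between the two numbers is the standard "gradient check" used to validate a hand-written backward pass.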

## 2. Optimization: Finding the Minimum in a High-Dimensional Space

The loss landscape of a neural network is a high-dimensional, non-convex function with countless local minima and saddle points. The optimizer's job is to navigate this treacherous terrain to find a set of weights that yields a low loss. The evolution of optimizers is a story of addressing the shortcomings of the most basic algorithm: Stochastic Gradient Descent.

### 2.1 Stochastic Gradient Descent (SGD) and its Challenges

The basic weight update rule for **Stochastic Gradient Descent (SGD)** is:
$$W_{t+1} = W_t - \eta \nabla_{W} L(W_t)$$
where $\eta$ is the learning rate and $\nabla_{W} L(W_t)$ is the gradient of the loss computed on a mini-batch of data.

**Challenges:**
1.  **Ravines and High-Curvature Valleys:** If the loss surface is shaped like a long, narrow ravine, the gradient will be very steep on the sides and very shallow along the bottom. SGD will oscillate wildly across the ravine walls while making very slow progress along the bottom toward the minimum.
2.  **Saddle Points:** In high dimensions, saddle points (where the gradient is zero but it's not a minimum) are far more common than local minima. SGD can get stuck at these points because the gradient is close to zero.
3.  **Learning Rate Choice:** A single learning rate $\eta$ applies to all parameters, which is suboptimal. Some parameters may require large updates, while others need fine-tuning.

### 2.2 Adaptive Optimizers: The Modern Standard

The methods in this section address these issues by maintaining a state that tracks information about past gradients. Momentum uses that state to smooth the update direction, while RMSProp and Adam additionally adapt the learning rate for each parameter individually.

#### Momentum

**Momentum** helps accelerate SGD in the relevant direction and dampens oscillations. It adds a fraction $\beta$ of the previous update vector to the current one, acting like a heavy ball rolling down the loss surface.

$$v_{t+1} = \beta v_t + \eta \nabla_{W} L(W_t)$$
$$W_{t+1} = W_t - v_{t+1}$$

The velocity term $v_t$ accumulates an exponentially decaying sum of past gradients. In a ravine, the gradient components that point across the walls alternate in sign and largely cancel, while the components that point along the bottom add up, accelerating progress.
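
A minimal sketch of the ravine argument, under simplifying assumptions: the loss is a hypothetical two-parameter quadratic with curvature 100 across the ravine and 1 along it, and the gradient is exact (full-batch), so mini-batch noise is left out. Setting `beta = 0` in the momentum update recovers plain gradient descent.

```python
import numpy as np

curvature = np.array([100.0, 1.0])   # steep across the ravine, flat along it

def loss(w):
    return 0.5 * np.sum(curvature * w ** 2)

def grad(w):
    return curvature * w

def run(eta, beta, steps=200):
    w = np.array([1.0, 1.0])
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + eta * grad(w)  # beta = 0 recovers plain gradient descent
        w = w - v
    return loss(w)

loss_gd = run(eta=0.018, beta=0.0)
loss_momentum = run(eta=0.018, beta=0.9)
print(f"plain GD: {loss_gd:.2e}   momentum: {loss_momentum:.2e}")
```

With the same learning rate, the momentum run ends at a much lower loss because it keeps making progress along the flat direction, where plain gradient descent crawls.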

#### RMSProp (Root Mean Square Propagation)

**RMSProp** adapts the learning rate for each parameter by dividing it by a running average of the magnitudes of recent gradients for that parameter. It maintains a moving average of the squared gradients.

$$s_{t+1} = \beta s_t + (1 - \beta) (\nabla_{W} L(W_t))^2$$
$$W_{t+1} = W_t - \frac{\eta}{\sqrt{s_{t+1}} + \epsilon} \nabla_{W} L(W_t)$$

If a gradient is consistently large (as on a ravine wall), $s_{t+1}$ will be large, and the effective learning rate for that parameter will be small, damping oscillations. If the gradient is small (as along the ravine floor), the effective learning rate will be larger, accelerating progress.
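
A single RMSProp step makes the per-parameter scaling concrete. This sketch assumes a hypothetical quadratic ravine (curvature 100 and 1) and $s_0 = 0$; the hyperparameter values are illustrative.

```python
import numpy as np

curvature = np.array([100.0, 1.0])       # steep wall direction, flat floor direction
w = np.array([1.0, 1.0])
eta, beta, eps = 0.01, 0.9, 1e-8

g = curvature * w                         # raw gradient: 100x larger on the wall
s = (1 - beta) * g ** 2                   # first update of the moving average (s_0 = 0)
effective_lr = eta / (np.sqrt(s) + eps)   # one learning rate per parameter
update = effective_lr * g

print("gradient:     ", g)
print("effective lr: ", effective_lr)
print("update:       ", update)
```

Although the raw gradients differ by a factor of 100, the resulting updates are of equal size: the steep coordinate gets a small effective learning rate and the flat coordinate a large one.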

#### Adam: Adaptive Moment Estimation

**Adam** is the de-facto standard optimizer. It combines the ideas of both Momentum and RMSProp. It keeps an exponentially decaying average of past gradients (the "first moment," like momentum) and past squared gradients (the "second moment," like RMSProp).

1.  **First Moment (Mean):** $m_{t+1} = \beta_1 m_t + (1 - \beta_1) \nabla_{W} L(W_t)$
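2.  **Second Moment (Uncentered Variance):** $v_{t+1} = \beta_2 v_t + (1 - \beta_2) (\nabla_{W} L(W_t))^2$

Because $m_0 = v_0 = 0$, both averages are biased toward zero in the early steps. Adam corrects this by rescaling, $\hat{m}_{t+1} = m_{t+1} / (1 - \beta_1^{t+1})$ and $\hat{v}_{t+1} = v_{t+1} / (1 - \beta_2^{t+1})$, and then updates the weights:
$$W_{t+1} = W_t - \frac{\eta}{\sqrt{\hat{v}_{t+1}} + \epsilon} \hat{m}_{t+1}$$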

Adam provides the best of both worlds: the accelerated progress of momentum and the per-parameter adaptive learning rates of RMSProp, making it highly effective and robust across a wide range of problems.
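
The full Adam update fits in a few lines of NumPy. This is a sketch rather than a reference implementation: it uses the standard bias correction (dividing each moment estimate by $1 - \beta^t$ after $t$ updates), a hypothetical quadratic ravine as the loss, and illustrative hyperparameters. The function name `adam_minimize` is an assumption.

```python
import numpy as np

def adam_minimize(grad_fn, w0, eta=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)                     # first-moment estimate
    v = np.zeros_like(w)                     # second-moment estimate
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)         # bias correction: m and v start at zero
        v_hat = v / (1 - beta2 ** t)
        w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w

curvature = np.array([100.0, 1.0])           # hypothetical ravine-shaped quadratic
ravine_grad = lambda w: curvature * w

w_one_step = adam_minimize(ravine_grad, [1.0, 1.0], steps=1)
w_final = adam_minimize(ravine_grad, [1.0, 1.0])
final_loss = 0.5 * np.sum(curvature * w_final ** 2)

print("after 1 step:", w_one_step)
print(f"final loss:   {final_loss:.2e}")
```

After the first step both coordinates have moved by roughly $\eta$ regardless of gradient scale, which shows the bias correction and the per-parameter normalization working together.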

## 3. The Practical Engineering of Deep Learning
If a neural network's architecture is its skeleton and the optimizer is its engine, then the **activation functions**, **initialization schemes**, and **regularization techniques** are the critical engineering components that ensure the system runs smoothly. A poorly chosen component can bring learning to a complete halt.

### 3.1 Weight Initialization & Activation Functions

The goal of modern initialization is to set initial weight variances to prevent signals from systematically shrinking or growing as they pass through the network. This is intimately tied to the choice of activation function.

-   **ReLU (Rectified Linear Unit):** `ReLU(x) = max(0, x)`. Its derivative is a constant 1 for positive inputs, allowing gradients to flow unimpeded. This was a revolutionary breakthrough over older functions like `sigmoid` and `tanh` whose derivatives saturate at 0, killing gradients.
-   **He Initialization:** The standard for ReLU. A ReLU activation kills the negative half of its input distribution, which halves the signal variance. He initialization precisely corrects for this by setting weight variance to `Var(W) = 2 / fan_in`, keeping signal variance stable.
-   **Leaky ReLU & Family:** `LeakyReLU(x) = max(αx, x)` for a small `α` (e.g., 0.01). This solves the "Dying ReLU" problem (where a neuron can get stuck with a zero gradient) by providing a small, non-zero gradient for negative inputs.
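
The variance argument behind He initialization can be checked by pushing random inputs through a stack of ReLU layers. This is a minimal sketch with assumed sizes (width 512, depth 20) and zero biases; the comparison value `1 / fan_in` is included only as a contrast.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, depth, n_samples = 512, 20, 1000
x = rng.standard_normal((n_samples, fan_in))

def final_signal_scale(weight_variance):
    """Mean squared activation after `depth` ReLU layers with Gaussian weights."""
    h = x
    for _ in range(depth):
        W = rng.standard_normal((fan_in, fan_in)) * np.sqrt(weight_variance)
        h = np.maximum(0.0, h @ W)
    return np.mean(h ** 2)

scale_he = final_signal_scale(2.0 / fan_in)      # He initialization
scale_small = final_signal_scale(1.0 / fan_in)   # variance too small by a factor of 2

print(f"He (2/fan_in): {scale_he:.3f}")
print(f"1/fan_in:      {scale_small:.2e}")
```

With `2 / fan_in` the signal scale stays of order one after 20 layers, while with `1 / fan_in` it is roughly halved at every layer and all but disappears.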

### 3.2 Normalization & Regularization

These techniques actively manipulate the activations or the network structure during training to improve stability and prevent overfitting.

#### Batch Normalization

**Batch Normalization (BN)** forcefully re-centers and re-scales the activations for each feature channel within a mini-batch to have a zero mean and unit variance. This stabilizes the input distribution for the next layer, dramatically smoothing the optimization landscape and allowing for faster, more stable training. Its main drawback is its reliance on a large batch size, leading to alternatives like **Layer Normalization** (standard for Transformers).
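
A minimal sketch of the training-mode forward pass for a `(batch, features)` array. It includes the learnable scale `gamma` and shift `beta` that follow the normalization in the standard formulation; the running statistics used at inference time are omitted, and the function name is hypothetical.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then apply a learnable scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
activations = rng.normal(loc=5.0, scale=3.0, size=(256, 4))   # badly scaled inputs
out = batch_norm_train(activations, gamma=np.ones(4), beta=np.zeros(4))

print("mean per feature:", out.mean(axis=0).round(4))
print("std per feature: ", out.std(axis=0).round(4))
```

Because the statistics are computed over the batch dimension, very small batches give noisy estimates, which is the drawback noted above.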

#### Dropout

**Dropout** is a simple but powerful regularization technique. During each training step, a random fraction of neurons' outputs are stochastically set to zero. This prevents neurons from co-adapting and forces the network to learn more distributed, robust representations. It is analogous to training a large ensemble of smaller networks, which is highly effective at reducing overfitting.
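
A minimal sketch of the training-time operation, written as "inverted" dropout: surviving activations are rescaled by `1 / (1 - rate)` so that the expected activation is unchanged and nothing needs to be rescaled at inference. The function name is hypothetical.

```python
import numpy as np

def dropout_train(h, rate, rng):
    """Zero each unit with probability `rate` and rescale the survivors."""
    keep_mask = rng.random(h.shape) >= rate
    return h * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
h = np.ones((1000, 64))
h_dropped = dropout_train(h, rate=0.2, rng=rng)
frac_zero = np.mean(h_dropped == 0.0)

print(f"fraction zeroed: {frac_zero:.3f}")
print(f"mean activation: {h_dropped.mean():.3f}")
```

At inference time the layer is simply the identity; a fresh mask is drawn at every training step, which is where the ensemble analogy comes from.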

## 4. A Zoo of Architectures: Matching Models to Data Structure

Different data types have different underlying structures. The power of deep learning comes from using architectures with **inductive biases** that align with that structure. An inductive bias is a set of assumptions a model makes about the data (e.g., that nearby pixels are related). This allows the model to learn efficiently by exploiting known patterns.

### 4.1 Multilayer Perceptrons (MLPs)
- **Structure:** A series of fully-connected (Dense) layers. Each neuron is connected to every neuron in the previous and next layers.
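- **Inductive Bias:** Minimal. MLPs are universal function approximators but have no built-in assumptions about spatial or sequential structure in the inputs. Every feature is connected to every hidden unit, so the ordering of the input columns carries no meaning to the model.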
- **Use Case:** Best for **tabular data** where features have no inherent spatial or sequential ordering (e.g., standard regression datasets from cross-sectional surveys).

### Code Lab 1: Regression with an MLP
We will build an MLP to predict house prices, a classic tabular data problem.

In [None]:
sec("Predicting House Prices with a Regularized MLP")

if not TENSORFLOW_AVAILABLE:
    note("TensorFlow/Keras not installed. Skipping this code lab.")
else:
    # 1. Load and Preprocess Data
    # Neural networks are sensitive to feature scaling. We use StandardScaler to give
    # each feature a zero mean and unit variance, ensuring no single feature dominates
    # the gradient updates due to its scale.
    (x_train, y_train), (x_test, y_test) = keras.datasets.boston_housing.load_data()
    feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
    scaler = StandardScaler()
    x_train_scaled = scaler.fit_transform(x_train)
    x_test_scaled = scaler.transform(x_test)
    
    # 2. Define the Model Architecture
    # We use a standard MLP with two hidden layers of 64 neurons each.
    # - 'relu' activation is chosen for its efficiency and ability to mitigate vanishing gradients.
    # - Dropout layers are added as a regularization technique. By randomly setting 20% of
    #   neuron outputs to zero during training, we prevent the model from becoming too
    #   reliant on any single neuron, which helps prevent overfitting.
    # - The final layer has a single neuron with no activation (linear) for the regression output.
    model = keras.Sequential([
        keras.Input(shape=(x_train.shape[1],)), # Use explicit Input layer for clarity
        layers.Dense(64, activation='relu'),
        layers.Dropout(0.2),
        layers.Dense(64, activation='relu'),
        layers.Dropout(0.2),
        layers.Dense(1)
    ])
    
    # 3. Compile and Train
    # - 'adam' optimizer is a robust choice that adapts the learning rate for each parameter.
    # - 'mean_squared_error' is the standard loss function for regression problems.
    model.compile(optimizer='adam', loss='mean_squared_error')
    # We train for 150 epochs, using 20% of the training data for validation to monitor performance.
    model.fit(x_train_scaled, y_train, epochs=150, validation_split=0.2, verbose=0)
    note("\nModel training complete.")

    # 4. Evaluate Performance
    # We assess the model's performance on the unseen test set to get an unbiased estimate of its
    # predictive power.
    loss = model.evaluate(x_test_scaled, y_test, verbose=0)
    print(f"\n--- Model Performance on Test Set ---")
    print(f"Mean Squared Error: {loss:.2f}")

### 4.2 Convolutional Neural Networks (CNNs)
- **Structure:** Uses **convolutional layers**. Instead of connecting to all inputs, neurons only connect to a small local patch of the input from the previous layer. This small filter (or kernel) is then slid across the entire input, sharing its weights.
- **Inductive Bias:** 
  1. **Locality:** Assumes that features are locally correlated (pixels near each other matter more).
  2. **Translational Equivariance:** Assumes that if a feature (e.g., a vertical edge) is important in one part of an image, it's important in other parts too. Weight sharing makes the detector for that feature available everywhere.
- **Use Case:** The standard for **spatial or grid-like data**, most famously images. Also used in economics for analyzing satellite imagery (e.g., predicting poverty from nighttime lights) or for geospatial data analysis.
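
Weight sharing and translational equivariance can be seen with a hand-written convolution. This is a sketch with a hand-picked vertical-edge filter and a synthetic image, both assumptions for illustration; a trained `Conv2D` layer learns its filters instead.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Plain 'valid' cross-correlation: slide one shared kernel over the image."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

vertical_edge = np.array([[-1.0, 0.0, 1.0]] * 3)   # 3x3 filter, hand-picked

image = np.zeros((8, 12))
image[:, 4:] = 1.0            # dark-to-bright edge starting at column 4
shifted = np.zeros((8, 12))
shifted[:, 7:] = 1.0          # the same edge moved 3 pixels to the right

resp = conv2d_valid(image, vertical_edge)
resp_shifted = conv2d_valid(shifted, vertical_edge)

print("edge response peaks at column:", np.argmax(resp[0]))
print("after shifting the input:     ", np.argmax(resp_shifted[0]))
```

Shifting the input by three pixels shifts the feature map by three pixels, with no change to the nine filter weights.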

### Code Lab 2: Image Classification with a CNN
We'll build a simple CNN to classify handwritten digits from the MNIST dataset.

In [None]:
sec("Image Classification with a CNN")

if not TENSORFLOW_AVAILABLE:
    note("TensorFlow/Keras not installed. Skipping this code lab.")
else:
    # 1. Load and Preprocess Image Data
    # We normalize pixel values from [0, 255] to [0, 1] for numerical stability.
    # A channel dimension is added because CNNs expect inputs of shape (height, width, channels).
    (x_train_img, y_train_img), (x_test_img, y_test_img) = keras.datasets.mnist.load_data()
    x_train_img = x_train_img.astype("float32") / 255.0
    x_test_img = x_test_img.astype("float32") / 255.0
    x_train_img = np.expand_dims(x_train_img, -1)
    x_test_img = np.expand_dims(x_test_img, -1)

    # 2. Define the CNN Architecture
    # This architecture follows a common pattern: a stack of Conv2D and MaxPooling2D layers.
    # - Conv2D layers act as feature detectors, learning to recognize patterns like edges and curves.
    # - MaxPooling2D layers downsample the feature maps, making the learned features more robust
    #   to variations in object position and reducing the number of parameters.
    # - The Flatten layer converts the final 2D feature maps into a 1D vector to be fed into the
    #   final classification layers.
    # - A Dropout layer is used for regularization.
    # - The final Dense layer has 10 neurons with a 'softmax' activation to output class probabilities.
    cnn_model = keras.Sequential([
        keras.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(10, activation="softmax")
    ])

    # 3. Compile and Train
    # - 'sparse_categorical_crossentropy' is used because our labels are integers (0-9).
    # - We track 'accuracy' as our performance metric.
    cnn_model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    cnn_model.fit(x_train_img, y_train_img, batch_size=128, epochs=5, validation_split=0.1, verbose=1)
    note("CNN model training complete.")

    # 4. Evaluate
    # We evaluate the trained model on the held-out test data to confirm its generalization ability.
    score = cnn_model.evaluate(x_test_img, y_test_img, verbose=0)
    print(f"\n--- CNN Performance on Test Set ---")
    print(f"Test loss: {score[0]:.4f}")
    print(f"Test accuracy: {score[1]:.4f}")

### 4.3 Recurrent Neural Networks (RNNs)
- **Structure:** Connections form a directed cycle, creating an internal state or "memory." A layer's output is fed back into itself as an input for the next time step.
- **Inductive Bias:** **Sequentiality.** Assumes that the order of the data matters and that past information is relevant for predicting the future.
- **Use Case:** The standard for **sequential data**, such as time series (e.g., macroeconomic forecasting) or the time dimension of panel data. **Long Short-Term Memory (LSTM)** networks are a more advanced RNN variant that uses a system of gates to learn long-range dependencies, mitigating the vanishing gradient problem of simple RNNs.
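
The recurrence can be sketched as a plain loop. The snippet assumes the simple (Elman-style) update $h_t = \tanh(W_x x_t + W_h h_{t-1} + b)$ with $h_0 = 0$ and random, untrained weights; all names and sizes are illustrative.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Simple RNN: h_t = tanh(W_x x_t + W_h h_{t-1} + b), starting from h_0 = 0."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:                       # the same weights are reused at every step
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
hidden, features, T = 8, 3, 12
W_x = rng.normal(scale=0.5, size=(hidden, features))
W_h = rng.normal(scale=0.5, size=(hidden, hidden))
b = np.zeros(hidden)

series = rng.standard_normal((T, features))
reordered = series[::-1]                     # same observations, reversed order

states = rnn_forward(series, W_x, W_h, b)
h_last = states[-1]
h_last_reordered = rnn_forward(reordered, W_x, W_h, b)[-1]

print("state trajectory shape:", states.shape)
print("order matters:", not np.allclose(h_last, h_last_reordered))
```

Feeding the same observations in a different order produces a different final state, which is the sequentiality bias in action; an MLP applied to summary statistics of the series would not distinguish the two.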

## 5. Deep Learning and Causal Inference: The New Frontier

While deep learning excels at prediction, the primary goal of most economic research is **causal inference**. A growing body of research explores how to use neural networks as powerful non-parametric estimators within established causal frameworks, combining the flexibility of deep learning with the rigor of economic identification strategies. The crucial insight is that deep learning models are not a replacement for an identification strategy (like IV, RDD, or DiD); they are a tool for implementing the *estimation* part of that strategy, potentially reducing bias from model misspecification.

### 5.1 Estimating Nuisance Functions in Causal ML

In frameworks like **Double/Debiased Machine Learning (DML)**, we need to estimate conditional expectation functions, such as the outcome model $E[Y|X]$ and the treatment model $E[D|X]$. Neural networks can be used to flexibly model these "nuisance functions," especially when the relationships are highly non-linear or involve many covariates $X$. The DML framework then uses cross-fitting and a final-stage orthogonalized regression to isolate the causal effect of $D$ on $Y$, making the final estimate robust to small errors in the nuisance function estimates.
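
The cross-fitting and residual-on-residual logic can be sketched on simulated data. Several assumptions are made loudly here: the data come from a hypothetical partially linear model with a single confounder, and polynomial least squares stands in for the neural networks as the nuisance learner so the example runs with NumPy alone. Swapping `fit_predict` for a network fit leaves the rest unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta_true = 4000, 2.0
X = rng.uniform(-2, 2, size=n)
D = np.sin(X) + 0.5 * X ** 2 + rng.standard_normal(n)                  # treatment depends on X
Y = theta_true * D + 2 * np.sin(X) + X ** 2 + rng.standard_normal(n)   # X also drives Y

def fit_predict(x_train, target_train, x_eval, degree=7):
    """Stand-in nuisance learner: polynomial least squares instead of a neural net."""
    coefs = np.polyfit(x_train, target_train, degree)
    return np.polyval(coefs, x_eval)

# Cross-fitting: nuisance functions are fit on the other folds, residuals formed on the held-out fold
folds = np.array_split(rng.permutation(n), 5)
res_Y, res_D = np.zeros(n), np.zeros(n)
for held_out in folds:
    train = np.setdiff1d(np.arange(n), held_out)
    res_Y[held_out] = Y[held_out] - fit_predict(X[train], Y[train], X[held_out])  # Y - E[Y|X]
    res_D[held_out] = D[held_out] - fit_predict(X[train], D[train], X[held_out])  # D - E[D|X]

# Final stage: regress outcome residuals on treatment residuals
theta_dml = np.sum(res_D * res_Y) / np.sum(res_D ** 2)
theta_naive = np.polyfit(D, Y, 1)[0]         # slope of Y on D, ignoring X

print(f"true effect: {theta_true:.2f}  naive: {theta_naive:.2f}  DML: {theta_dml:.2f}")
```

In this simulation the naive slope absorbs the confounding through $X$, while the orthogonalized estimate lands close to the true coefficient.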

### 5.2 Deep Instrumental Variables (DeepIV)

The DeepIV framework, proposed by Hartford et al. (2017), provides a way to use deep learning for instrumental variable estimation, particularly in settings with high-dimensional instruments or complex non-linear relationships.

The process involves two stages:
1.  **First Stage (Treatment Prediction):** A neural network is trained to predict the endogenous treatment $D$ using the instruments $Z$ and exogenous controls $X$. Crucially, it doesn't predict the value of $D$ directly, but rather the parameters of its conditional distribution, $P(D|X, Z)$. For a continuous treatment, this might be the mixture weights, means, and variances of a Gaussian mixture model.
   $$NN_1: (X, Z) \to \text{parameters of } P(D|X,Z)$$

2.  **Second Stage (Outcome Prediction):** A second neural network takes the controls $X$ and a treatment value $D$ as inputs and represents the structural outcome function. It is not fit to the observed $(X, D, Y)$ triples directly, because $D$ is endogenous. Instead, it is trained so that its prediction, integrated over the first-stage treatment distribution, matches the observed outcome:
   $$NN_2: (X, D) \to Y, \qquad \min_{NN_2} \sum_i \left( y_i - E_{D \sim \hat{P}(D|x_i, z_i)}[NN_2(x_i, D)] \right)^2$$
   In practice the inner expectation is approximated by sampling treatments from the fitted first-stage distribution, so the second stage learns only from the instrument-induced variation in treatment.

This approach allows for flexible, non-parametric estimation of both the first and second stages, capturing complex interactions that linear IV would miss.

## 6. Interpretability: Opening the Black Box

A significant barrier to the adoption of deep learning in economics is the perception of models as uninterpretable "black boxes." This is a critical concern, as understanding *why* a model makes a certain prediction is often as important as the prediction itself. Fortunately, a suite of tools has been developed to provide insights into model behavior.

### 6.1 SHAP (SHapley Additive exPlanations)

**SHAP** is a game-theoretic approach to explaining the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory. The core idea is to treat each feature as a "player" in a game where the "payout" is the model's prediction. The Shapley value is the average marginal contribution of a feature value across all possible coalitions (i.e., all possible subsets of features).

For a given prediction, SHAP assigns each feature an importance value that represents its contribution to pushing the prediction away from the baseline (average) prediction. This provides a powerful, theoretically grounded way to understand feature importance for individual predictions.
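
For a handful of features the Shapley value can be computed exactly by enumerating every coalition, which makes the definition concrete. This sketch is not the `shap` library's algorithm: it assumes a hypothetical toy model and represents "absent" features by a fixed baseline value, whereas SHAP typically averages over a background dataset and uses approximations to stay tractable.

```python
import itertools
import math
import numpy as np

def exact_shapley(predict, x, baseline):
    """Exact Shapley values for one prediction by enumerating all coalitions.
    Features outside the coalition are set to their baseline value."""
    n = len(x)

    def value(coalition):
        z = baseline.copy()
        for j in coalition:
            z[j] = x[j]
        return predict(z)

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = math.factorial(size) * math.factorial(n - size - 1) / math.factorial(n)
            for S in itertools.combinations(others, size):
                phi[i] += weight * (value(S + (i,)) - value(S))
    return phi

# Hypothetical toy model with an interaction between features 1 and 2
predict = lambda z: 3.0 * z[0] + 2.0 * z[1] * z[2] - z[3]
x = np.array([1.0, 2.0, 3.0, 4.0])
baseline = np.zeros(4)

phi = exact_shapley(predict, x, baseline)
print("Shapley values:", phi)
print("sum:", phi.sum(), " prediction minus baseline:", predict(x) - predict(baseline))
```

The values sum to the gap between the prediction and the baseline prediction, and the interaction term is split equally between the two features involved. The enumeration costs $2^n$ model evaluations per feature, which is why approximations are needed for real models.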

### Code Lab 3: Interpreting the MLP Regression Model with SHAP
We will now apply SHAP to the MLP model we trained earlier to understand which housing features are driving its predictions.

In [None]:
sec("Interpreting the MLP with SHAP")

from IPython.display import Image  # needed to display the saved plots

# The SHAP plots were generated offline and saved as static images, a common
# practice to avoid long run times in notebooks. This cell only displays them,
# so it does not depend on TensorFlow or SHAP being installed.
shap_plots = [
    ("SHAP Summary Plot", '../images/07-Machine-Learning/shap_summary_plot.png'),
    ("SHAP Dependence Plot for 'LSTAT'", '../images/07-Machine-Learning/shap_dependence_plot.png'),
]
for title, path in shap_plots:
    display(Markdown(f"\n### {title}"))
    if os.path.exists(path):
        display(Image(filename=path))
    else:
        note(f"Static image not found at `{path}`.")


## 7. Exercises

1.  **Optimizers and Loss Landscapes:** In the discussion of SGD, we mentioned that it struggles in narrow "ravines." Explain intuitively how the two components of the **Adam** optimizer (momentum and RMSProp) work together to solve this problem. Which component dampens the oscillations across the ravine, and which component accelerates progress along the bottom?

2.  **Inductive Bias:** You are tasked with building a model to predict county-level unemployment rates. Your dataset includes 10 years of annual data for each county. Your features are: (a) a satellite image of the county at night, (b) the county's industry composition (e.g., % manufacturing, % services), and (c) the unemployment rates for the previous 9 years. What deep learning architecture (MLP, CNN, or RNN) would be most appropriate for each feature, and why? What is the specific inductive bias of each architecture that makes it suitable for its corresponding data type?

3.  **Interpreting SHAP:** Look at the SHAP summary plot generated in Code Lab 3. Identify the top two most important features. For the most important feature, describe the relationship between its value (high vs. low, indicated by color) and its impact on the predicted house price (SHAP value > 0 means it increases the predicted price, < 0 means it decreases it). Does this relationship align with economic intuition?

4.  **Conceptual: Deep Learning for DML:** Suppose you want to estimate the causal effect of a job training program ($D$) on wages ($Y$), controlling for a rich set of covariates ($X$) including education, experience, and location. Describe the key steps of how you would use neural networks within the Double/Debiased Machine Learning (DML) framework to get a reliable estimate of the average treatment effect. What are the two "nuisance functions" you would need to estimate with your neural networks?