# DML Pipeline - Module Specifications

This document specifies the modular functions we need to build for the Double/Debiased Machine Learning (DML) pipeline.

## The Workflow

The analysis is broken into four main functional components:

1.  **Module 1: `train_f_model`**
    -   **Purpose:** Trains the "Outcome Model" ($\hat{f}$).
    -   **Model:** $Y \sim X$ (Salary ~ Performance)
    -   **Assigned to:** Leo

2.  **Module 2: `train_h_models`**
    -   **Purpose:** Trains the "Treatment Models" ($\hat{h}$). We train *one model for each* contextual/bias factor.
    -   **Model:** $Z_j \sim X$ (e.g., Draft Status ~ Performance, Age ~ Performance, etc.)
    -   **Assigned to:** Macy

3.  **Module 3: `generate_dml_residuals`**
    -   **Purpose:** The main "engine" that implements the K-fold cross-fitting (Section 6 of our doc). It calls the training functions to generate out-of-sample residuals.
    -   **Assigned to:** Tyler + Alberto

4.  **Module 4: `run_final_ols`**
    -   **Purpose:** Takes the out-of-sample residuals and fits the final OLS model on them.
    -   **Model:** $\hat{\epsilon}_Y \sim \hat{\epsilon}_Z$
    -   **Assigned to:** Gary
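
Written out, the four modules fit together roughly as follows. This is a sketch in the notation used above: $\hat{h}_j$ denotes the treatment model for bias factor $Z_j$, and the coefficients $\theta_j$ and error term $u$ are names introduced here for illustration only.

$$
\begin{aligned}
\hat{\epsilon}_Y &= Y - \hat{f}(X) \\
\hat{\epsilon}_{Z_j} &= Z_j - \hat{h}_j(X) \quad \text{for each bias factor } j \\
\hat{\epsilon}_Y &= \sum_j \theta_j \, \hat{\epsilon}_{Z_j} + u
\end{aligned}
$$

Modules 1 and 2 supply $\hat{f}$ and $\hat{h}_j$, Module 3 computes the two sets of residuals out-of-sample, and Module 4 estimates the $\theta_j$.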

## Module 1: Outcome Model ($\hat{f}$)

This function takes training data and returns a *trained* model object that has a `.predict()` method. 

We need to get the whole pipeline working first; tuning or swapping models can come later. So start with a simple model.

```python
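from typing import Any

import pandas as pd
from sklearn.linear_model import LinearRegression


def train_f_model(X_train: pd.DataFrame, y_train: pd.Series) -> Any:
    """
    Trains the outcome model f: Y ~ X (Salary ~ Performance)

    Args:
        X_train: DataFrame of performance features (training fold).
        y_train: Series of the outcome (log_salary) (training fold).

    Returns:
        A trained model object with a .predict() method.
    """
    print(f"Training f_model on {X_train.shape[0]} samples...")

    # Placeholder baseline (an assumption, not a final choice): plain linear
    # regression keeps the pipeline runnable until a better model is swapped in.
    model_f = LinearRegression()
    model_f.fit(X_train, y_train)

    return model_f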
```

## Module 2: Treatment Models ($\hat{h}$)

This function is slightly different from Module 1. It takes the training data and returns a *dictionary* of trained models, one for each column in $Z$ (our bias factors). 

Your task is to train a separate model for each bias factor ($Z_j$) as a function of performance ($X$). The goal is to "clean" the bias factors by finding the part of them that cannot be explained by on-court stats.

```python
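from typing import Any, Dict

import pandas as pd
from sklearn.linear_model import LinearRegression


def train_h_models(X_train: pd.DataFrame, Z_train: pd.DataFrame) -> Dict[str, Any]:
    """
    Trains the treatment models h: Z_j ~ X (Bias Factor ~ Performance)
    for each bias factor j in Z.

    Args:
        X_train: DataFrame of performance features (training fold).
        Z_train: DataFrame of contextual/bias factors (training fold).

    Returns:
        A dictionary where keys are Z column names (e.g., 'Draft_Status')
        and values are the corresponding trained model objects.
    """
    print(f"Training h_models for {list(Z_train.columns)}...")
    models_h = {}

    # Placeholder baseline (an assumption): one linear regression per bias
    # factor, assuming numeric columns. Swap in per-factor models later.
    for col in Z_train.columns:
        models_h[col] = LinearRegression().fit(X_train, Z_train[col])

    return models_h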
```
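
To make "cleaning" concrete, here is a minimal sketch on synthetic data. It assumes a single numeric performance stat and a single numeric bias factor, and it uses plain least squares from NumPy as a stand-in for whatever model ends up in `train_h_models`; all variable names and numbers are made up for illustration. For brevity it residualizes in-sample, whereas the pipeline does this out-of-sample.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins (hypothetical): one performance stat and one bias
# factor that is partly driven by performance.
x = rng.normal(size=n)
z = 0.8 * x + rng.normal(size=n)

# Fit Z ~ X by least squares (intercept + slope), playing the role of h-hat.
design = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(design, z, rcond=None)

# The residual is the part of Z that performance does not explain.
z_resid = z - design @ coef

corr_raw = np.corrcoef(x, z)[0, 1]
corr_resid = np.corrcoef(x, z_resid)[0, 1]
print(f"corr(X, Z)          = {corr_raw:.3f}")
print(f"corr(X, Z residual) = {corr_resid:.3f}")
```

The raw factor is clearly correlated with performance, while the residual is not; that residual is what the final regression uses.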

## Module 3: DML Residual Generation Engine

This is the main function that orchestrates the DML cross-fitting process. It takes the *full* datasets ($X, Y, Z$) and the *training functions* (Modules 1 & 2) as inputs.

This function implements the "cross-fitting" algorithm from Section 6 of our project document. Its sole purpose is to generate and return the out-of-sample residuals.

Its job is to:

1.  Create K folds.
2.  Loop over the folds, treating each one in turn as the "prediction" (out-of-sample) data and the remaining folds as the "training" data.
3.  In each iteration, call `model_f_trainer` and `model_h_trainer` on the training data.
4.  Use the trained models to generate *predictions* on the prediction data.
5.  Calculate and store the residuals ($\hat{\epsilon}_Y$ and $\hat{\epsilon}_Z$) for the prediction data.
6.  After the loop finishes, return the two complete sets of out-of-sample residuals.
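
The fold bookkeeping alone (steps 1, 2, and 6) can be sketched as below. This is only an illustration using NumPy; the row count, seed, and variable names are assumptions, and the real function is free to use a library splitter instead.

```python
import numpy as np

n_rows, k_folds = 23, 5
rng = np.random.default_rng(0)

# Step 1: shuffle the row positions and cut them into K nearly equal folds.
folds = np.array_split(rng.permutation(n_rows), k_folds)

# Counts how many times each row lands in a held-out ("prediction") fold.
times_held_out = np.zeros(n_rows, dtype=int)

# Step 2: each fold takes one turn as the prediction set.
for k, pred_idx in enumerate(folds):
    # Everything outside the current fold is the training data.
    train_idx = np.concatenate([f for i, f in enumerate(folds) if i != k])
    assert np.intersect1d(train_idx, pred_idx).size == 0  # no leakage
    times_held_out[pred_idx] += 1

# Step 6: every row was held out exactly once, so the stored residuals
# cover the full dataset with no gaps and no duplicates.
print(times_held_out.tolist())
```

Because each row is held out exactly once, the residuals collected across folds line up one-to-one with the rows of the original data.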

```python
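from typing import Callable, Tuple

import pandas as pd
from sklearn.model_selection import KFold


def generate_dml_residuals(X: pd.DataFrame,
                           Y: pd.Series,
                           Z: pd.DataFrame,
                           model_f_trainer: Callable,
                           model_h_trainer: Callable,
                           k_folds: int = 5) -> Tuple[pd.Series, pd.DataFrame]:
    """
    Implements the DML cross-fitting algorithm (Section 6 of the doc).

    This function orchestrates the training and prediction across K-folds
    to generate out-of-sample residuals for both the outcome (Y) and
    the treatment/bias factors (Z).

    Args:
        X: Full DataFrame of performance features.
        Y: Full Series of the outcome (log_salary).
        Z: Full DataFrame of contextual/bias factors.
        model_f_trainer: The function to train model f (i.e., train_f_model).
        model_h_trainer: The function to train models h (i.e., train_h_models).
        k_folds: Number of folds for cross-fitting.

    Returns:
        A tuple containing:
        - residuals_Y_oos (pd.Series): The out-of-sample residuals for Y (epsilon_Y).
        - residuals_Z_oos (pd.DataFrame): The out-of-sample residuals for Z (epsilon_Z).
    """
    print("Starting DML Residual Generation...")

    # Containers indexed like the inputs; every row is filled exactly once.
    residuals_Y_oos = pd.Series(index=Y.index, dtype=float, name=Y.name)
    residuals_Z_oos = pd.DataFrame(index=Z.index, columns=Z.columns, dtype=float)

    # Shuffling and the fixed seed are assumptions made for reproducibility.
    kf = KFold(n_splits=k_folds, shuffle=True, random_state=0)
    for train_idx, pred_idx in kf.split(X):
        X_train, X_pred = X.iloc[train_idx], X.iloc[pred_idx]

        # Fit on the training folds only, then predict on the held-out fold.
        model_f = model_f_trainer(X_train, Y.iloc[train_idx])
        models_h = model_h_trainer(X_train, Z.iloc[train_idx])

        y_hat = model_f.predict(X_pred)
        residuals_Y_oos.iloc[pred_idx] = Y.iloc[pred_idx].to_numpy() - y_hat
        for col, model_h in models_h.items():
            residuals_Z_oos.iloc[pred_idx, Z.columns.get_loc(col)] = (
                Z[col].iloc[pred_idx].to_numpy() - model_h.predict(X_pred)
            )

    return residuals_Y_oos, residuals_Z_oos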
```



## Module 4: Final OLS Regression Function

**Note:** This function implements Step 4 from Section 6 of our doc ("Run Final OLS") and follows the "All-at-Once" approach from Section 6.1.

```python
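import pandas as pd
import statsmodels.api as sm
from statsmodels.regression.linear_model import RegressionResultsWrapper


def run_final_ols(residuals_Y: pd.Series,
                  residuals_Z: pd.DataFrame) -> RegressionResultsWrapper:
    """
    Runs the final debiased OLS regression on the out-of-sample residuals.

    This model estimates:  epsilon_Y ~ epsilon_Z

    This corresponds to Section 6.1 of the methodology document.

    Args:
        residuals_Y: The out-of-sample residuals for Y (epsilon_Y).
        residuals_Z: The out-of-sample residuals for Z (epsilon_Z).

    Returns:
        The statsmodels OLS results object for the final regression.
    """
    print("Running final OLS regression on residuals...")

    # Including an intercept is an assumption; it is expected to be near
    # zero because both sides of the regression are residuals.
    exog = sm.add_constant(residuals_Z)
    final_ols_results = sm.OLS(residuals_Y, exog).fit()

    return final_ols_results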
```
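
As a sanity check on the idea behind this final step, the sketch below runs the whole residual-on-residual procedure on synthetic data where the true coefficient is known. It is a NumPy-only illustration under stated assumptions: one performance stat, one bias factor, linear relationships throughout, and a hypothetical helper `fit_predict` standing in for the trained models. It does not call the module functions above.

```python
import numpy as np

rng = np.random.default_rng(42)
n, k_folds, true_theta = 2000, 5, 0.5

# Synthetic data (hypothetical): performance x drives both the bias factor z
# and the outcome y; z has an extra effect of true_theta on y.
x = rng.normal(size=n)
z = 0.8 * x + rng.normal(size=n)
y = 1.5 * x + true_theta * z + rng.normal(size=n)


def fit_predict(x_train, t_train, x_pred):
    """Least-squares fit of t ~ 1 + x on the training rows, predicted on x_pred."""
    design = np.column_stack([np.ones(len(x_train)), x_train])
    coef, *_ = np.linalg.lstsq(design, t_train, rcond=None)
    return coef[0] + coef[1] * x_pred


eps_y = np.empty(n)
eps_z = np.empty(n)
folds = np.array_split(rng.permutation(n), k_folds)
for k, pred in enumerate(folds):
    train = np.concatenate([f for i, f in enumerate(folds) if i != k])
    # Out-of-sample residuals: the models never see the rows they are scored on.
    eps_y[pred] = y[pred] - fit_predict(x[train], y[train], x[pred])
    eps_z[pred] = z[pred] - fit_predict(x[train], z[train], x[pred])

# Final stage: OLS of eps_y on eps_z (with an intercept).
design = np.column_stack([np.ones(n), eps_z])
(intercept, theta_hat), *_ = np.linalg.lstsq(design, eps_y, rcond=None)
print(f"theta_hat = {theta_hat:.3f} (true value {true_theta})")
```

If the estimate lands close to the true value, the residualization and the final regression are wired together correctly; a similar synthetic check could be run against the real modules once they exist.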