Ah, my apologies! You're right, no need for a file when you just want the content. Here is the plan from the notebook, formatted as a series of markdown and code blocks that you can copy directly.

-----

# Week 7 - Action Plan: Enhancements & Visualization

This document outlines the concrete tasks for the three main themes of this week's work.

## Theme 1: Model Modification (Improving Accuracy)

**Goal:** Improve the accuracy of our nuisance models (`f_model: Y~X` and `h_models: Z_j~X`). More accurate models produce cleaner residuals, which leads to a more reliable and precise final OLS estimate (`gamma`).

**Assigned to:** .
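
To make the Goal concrete, here is a minimal sketch of the residual-on-residual step on synthetic data. It assumes plain linear nuisance models fitted in-sample with NumPy (no cross-fitting and no tuning), and every name in it (`partial_out`, `true_gamma`, the simulated `X`, `Z`, `Y`) is hypothetical. It is only meant to show how the final `gamma` is read off the two residuals, not how the project pipeline is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
true_gamma = 0.5  # hypothetical effect of Z on Y, chosen for the simulation

# Synthetic data: X drives both Z and Y, so a naive Y~Z fit is confounded.
X = rng.normal(size=(n, 3))
Z = X @ np.array([1.0, -0.5, 0.2]) + rng.normal(size=n)
Y = X @ np.array([2.0, 1.0, -1.0]) + true_gamma * Z + rng.normal(size=n)


def partial_out(target, controls):
    """Return residuals of target after a least-squares fit on controls (plus intercept)."""
    design = np.column_stack([np.ones(len(controls)), controls])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef


eps_Y = partial_out(Y, X)  # stands in for the f_model residual
eps_Z = partial_out(Z, X)  # stands in for one h_model residual

# Final OLS of eps_Y on eps_Z (no intercept: both residuals are mean-zero).
gamma_hat = float(eps_Z @ eps_Y / (eps_Z @ eps_Z))
# Naive slope of Y on Z, ignoring X entirely.
naive_hat = float(np.cov(Z, Y)[0, 1] / np.var(Z, ddof=1))

print(f"naive slope: {naive_hat:.3f}, residual-on-residual gamma: {gamma_hat:.3f}")
```

In this simulation the naive slope lands far from `true_gamma`, while the residual-on-residual estimate sits close to it. When the nuisance models fit poorly, part of the `X` effect stays in the residuals and the estimate drifts back toward the naive one.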

### Task 1.A: Implement Hyperparameter Tuning

Instead of using a simple `OLS`, we should tune our models. The tuning *must* happen *inside* the cross-fit loop on the *training data for that fold*.

  * **Action (for `src/analysis/train_f_model.py`):**

    1.  Inside the `train_f_model` function, replace the simple `model.fit()` with a `GridSearchCV` or `RandomizedSearchCV`.

    2.  Define a parameter grid for your chosen model (e.g., `RandomForestRegressor`).


```python
# Example for RandomForest
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [5, 10, None],
    'min_samples_leaf': [1, 3]
}
# Use a 3-fold CV for tuning, and use all available cores
grid_search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid=param_grid,
    cv=3, 
    n_jobs=-1,
    scoring='neg_mean_squared_error'
)
# grid_search.fit(X_train, Y_train) # This would be run inside your function

# The .best_estimator_ is already re-fitted on the whole (X_train, Y_train)
# return grid_search.best_estimator_
```

  * **Action (for `src/analysis/train_h_models.py`):**

    1.  Apply the *exact same logic* inside the `for` loop in `train_h_models`. Each `h_model` (for Age, Draft #, etc.) will be individually tuned on each fold. This will be more computationally intensive but much more robust.
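
As a rough illustration of "tuning inside the cross-fit loop", the sketch below runs a fresh `GridSearchCV` on each fold's training rows and records residuals only for the held-out rows. The function name `cross_fit_residuals`, the fold count, and the synthetic data are assumptions for illustration; the real `train_f_model` and `train_h_models` may be structured differently.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold


def cross_fit_residuals(X, y, param_grid, n_splits=3, seed=42):
    """Return out-of-fold residuals y - y_hat, tuning separately on each training fold."""
    residuals = np.empty(len(y), dtype=float)
    outer = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in outer.split(X):
        search = GridSearchCV(
            estimator=RandomForestRegressor(random_state=seed),
            param_grid=param_grid,
            cv=3,
            scoring="neg_mean_squared_error",
        )
        # Tuning sees only this fold's training rows, never the held-out rows.
        search.fit(X[train_idx], y[train_idx])
        # predict() uses the best estimator, refitted on the full training fold.
        residuals[test_idx] = y[test_idx] - search.predict(X[test_idx])
    return residuals


# Tiny synthetic example (hypothetical data, small grid to keep it fast).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.1, size=120)
eps = cross_fit_residuals(X_demo, y_demo, {"n_estimators": [20], "max_depth": [3, None]})
print(eps.shape)
```

The same helper could be called once for `Y` and once per `Z_j`, which is where the extra compute cost comes from: every fold of every target runs its own grid search.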

### Task 1.B: Experiment with More Powerful Models

  * **Action:** Try a more powerful model for tabular data, like `GradientBoostingRegressor` or `XGBoost`.


```python
# Example for GradientBoosting
from sklearn.ensemble import GradientBoostingRegressor

param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.05, 0.1],
    'max_depth': [3, 5]
}
```
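
One way to decide whether a more powerful model is worth it is to compare cross-validated error on the same training data. The sketch below does this for two scikit-learn regressors on synthetic stand-in data; the data, the candidate settings, and the `cv_mse` dictionary are assumptions for illustration, and in the project the comparison would run on each fold's training rows.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in data; in the project this would be one fold's (X_train, Y_train).
rng = np.random.default_rng(1)
X_train = rng.normal(size=(150, 5))
Y_train = np.sin(X_train[:, 0]) + 0.5 * X_train[:, 1] + rng.normal(scale=0.2, size=150)

candidates = {
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "gradient_boosting": GradientBoostingRegressor(n_estimators=100, random_state=42),
}

cv_mse = {}
for name, model in candidates.items():
    # scikit-learn reports negated MSE, so flip the sign to get a plain MSE.
    scores = cross_val_score(model, X_train, Y_train, cv=3, scoring="neg_mean_squared_error")
    cv_mse[name] = float(-scores.mean())

best_name = min(cv_mse, key=cv_mse.get)
print(cv_mse, "-> lowest CV MSE:", best_name)
```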

## Theme 2: Investigating Deterministic Rules

**Goal:** Our `f_model` (Salary ~ Performance) will be very inaccurate for players on deterministic contracts (e.g., rookies), where salary is set by `DRAFT_NUMBER`, not `PIE`. This inaccuracy pollutes our `epsilon_Y` residual. We must "help" the `f_model` by telling it about these rules.

**The Solution:** We will *control* for these deterministic effects by adding them to the `X` (controls) matrix. This moves them from "unexplained" to "explained," giving us a cleaner "unexplained" residual.

  * **Action (for `main.py` -> `preprocess_data` function):**

    1.  Create new feature flags for these deterministic groups.

    2.  Add this logic inside `preprocess_data`, *after* handling NaNs for draft:


```python
# ... inside preprocess_data(df) ...

# 3. Handle missing DRAFT* ... (already done)

# 3.5. Engineer Deterministic Contract Flags
logging.info("Engineering deterministic contract flags...")

# Flag for players on 1st round rookie scale (e.g., drafted in last 4 years)
# Adjust '2021' as needed for your data's "current season"
proc_df['is_rookie_scale'] = (
    (proc_df['DRAFT_YEAR'] >= 2021) & 
    (proc_df['DRAFT_ROUND'] == 1)
).astype(int)

# Flag for undrafted players (using our imputed value)
proc_df['is_undrafted'] = (proc_df['DRAFT_ROUND'] == 3).astype(int)

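# Flag for "Max Contract" players (this is an approximation).
# Find the 95th percentile salary and flag anyone above it.
# Caution: this flag is computed from log_Salary, the outcome itself, so using it
# as a control leaks Y into X. If contract-type data is available, prefer a rule
# based on that instead.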
max_contract_threshold = proc_df['log_Salary'].quantile(0.95)
proc_df['is_max_contract'] = (
    proc_df['log_Salary'] > max_contract_threshold
).astype(int)
```

  * **Action (for `main.py` -> `CONFIG`):**

    1.  Add these new columns to your `X_COLUMNS` list. This tells the `f_model` to use them as controls.


```python
# ... inside CONFIG ...
"X_COLUMNS": [
    'OFF_RATING', 'DEF_RATING', # ..., 'FGM_PG', 'FGA_PG',
    # New deterministic controls:
    'is_rookie_scale',
    'is_undrafted',
    'is_max_contract'
],
```

This change should make the `f_model` noticeably more accurate, as it can now learn "This player is on a rookie-scale contract? Their salary is *this*, regardless of performance." The `epsilon_Y` residual will then represent the "salary unexplained by *both* performance *and* deterministic contract rules," which is a much cleaner target for the final analysis.

## Theme 3: Visualization (The Final Presentation)

**Goal:** Create an interactive, web-based presentation of your findings.

**Tool:** Streamlit is a good fit. Apps are written in pure Python, and it renders Pandas DataFrames and Plotly figures directly.

### Task 3.A: Save Data for the App

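Your Streamlit app needs the final residuals, so the pipeline should save them (together with `PLAYER_NAME`, `log_Salary`, and the raw Z columns used for hover text) to `dml_residuals_for_viz.csv`, the file that Page 3 of the app loads.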
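
A minimal sketch of that save step follows. The helper `build_viz_frame`, its arguments, and the two-row demo data are hypothetical; the column names `residual_Y` and `residual_<Z>` are chosen to match what the Page 3 plotting code reads. The demo writes to a temporary directory, whereas the project would write to its own output folder.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def build_viz_frame(df, residual_Y, residual_Z, keep_columns):
    """Combine identifying columns with residuals, named as presentation.py expects."""
    out = df[keep_columns].copy()
    out["residual_Y"] = np.asarray(residual_Y)
    for z_name, z_resid in residual_Z.items():
        out[f"residual_{z_name}"] = np.asarray(z_resid)
    return out


# Hypothetical two-row example; real inputs would come from the cross-fit step.
demo_df = pd.DataFrame(
    {"PLAYER_NAME": ["Player A", "Player B"], "log_Salary": [15.2, 16.8], "AGE": [22, 31]}
)
df_viz = build_viz_frame(
    demo_df,
    residual_Y=[-0.3, 0.4],
    residual_Z={"AGE": [-1.5, 2.0]},
    keep_columns=["PLAYER_NAME", "log_Salary", "AGE"],
)

# Written to a temp directory here so the sketch has no side effects on the repo.
out_path = Path(tempfile.mkdtemp()) / "dml_residuals_for_viz.csv"
df_viz.to_csv(out_path, index=False)
print(df_viz.columns.tolist())
```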

### Task 3.B: Brainstorm Streamlit App (`presentation.py`)

Create a new file, `presentation.py`. Here is a full-fledged brainstorm for what it could contain.

  * **Page 1: The "Naive" View**

      * **Title:** NBA Salary: Performance vs. Perception

      * **Intro:** "Our project investigates what *really* determines an NBA player's salary. We all know performance matters, but what about 'bias' factors like age, draft pick, or social media fame?"

      * **Viz 1: The Noisy Relationship:**

          * Show a `plotly.express.scatter` plot of `log_Salary` vs. `PIE`.

          * **Make it interesting:** Use `color='AGE'` and `size='Followers'`.

          * **Point:** "This is a *mess*! It's hard to see a clear trend because performance (PIE) is tangled up with other factors. For example, young players (blue dots) have high performance but low pay (rookie contracts)."

  * **Page 2: Our Method (DML Explained Simply)**

      * **Title:** How We Untangle the Mess

      * **Viz 2: A Simple Diagram:**

          * "We use a method called Double Machine Learning (DML) to isolate the *true* relationships."

          * **Box 1:** "Salary ($Y$)" -> "Cleaned Salary ($\epsilon_Y$)" (by removing all `X` Performance effects)

          * **Box 2:** "Age ($Z$)" -> "Cleaned Age ($\epsilon_Z$)" (by removing all `X` Performance effects)

          * **Final Step:** Compare "Cleaned Salary" vs. "Cleaned Age".

  * **Page 3: The "Money" Charts (The Debiased Results)**

      * **Title:** The *True* Effect of Bias

      * **Intro:** "After cleaning our data, here is what we found. These charts show the *true*, *debiased* relationship between our factors and salary, after *all* on-court performance is accounted for."

      * **Viz 3: The Final OLS (The *real* result!)**

          * Load `dml_residuals_for_viz.csv`.

          * Add a `st.selectbox` to let the user choose which Z-factor to view: `['AGE', 'DRAFT_NUMBER', 'Followers', 'COUNTRY_USA']`.

          * Based on the selection, show the scatter plot:


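```python
import pandas as pd
import plotly.express as px
import streamlit as st

# Residuals saved by the analysis pipeline
df_viz = pd.read_csv('dml_residuals_for_viz.csv')

# Let the user pick which Z-factor to view
selected_z = st.selectbox(
    'Choose a factor:',
    ['AGE', 'DRAFT_NUMBER', 'Followers', 'COUNTRY_USA']
)
z_col = f'residual_{selected_z}'

# trendline='ols' requires the statsmodels package to be installed
fig = px.scatter(df_viz, 
                x=z_col, 
                y='residual_Y', 
                trendline='ols',
                hover_data=['PLAYER_NAME', 'log_Salary', selected_z])
fig.update_layout(
    title=f"Debiased Effect of {selected_z} on log(Salary)",
    xaxis_title=f"Residual {selected_z} (Performance-Adjusted)",
    yaxis_title="Residual log(Salary) (Performance-Adjusted)"
)
st.plotly_chart(fig)
```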

  * **How to make it interesting:** The `hover_data` is key. A user can mouse over an outlier dot and see "LeBron James" or "Victor Wembanyama" and see how their real vs. residual values stack up.

  * **Page 4: Conclusions & Full Results**

      * **Title:** Final Coefficients

      * **Viz 4: The Summary Table:**

          * Load and parse `final_dml_regression_summary.txt`.

          * Display the OLS results table in a clean `st.dataframe` or `st.code`.

          * **Interpretation:** Write out in plain English what your `gamma` coefficients mean.

          * **Example:** "Our model finds that for every 1-unit increase in 'Performance-Adjusted Followers', a player's 'Performance-Adjusted log(Salary)' increases by **{gamma_followers}**. This effect is statistically significant (p < 0.05)."
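
The interpretation sentence can be generated from the coefficient and its p-value, so the wording always matches the numbers. The helper below is a hypothetical sketch (`describe_gamma` is not an existing function in the project), and the values passed to it are placeholders, not project results.

```python
def describe_gamma(factor, gamma, p_value, alpha=0.05):
    """Turn one coefficient and its p-value into a plain-English sentence."""
    direction = "increases" if gamma >= 0 else "decreases"
    verdict = "statistically significant" if p_value < alpha else "not statistically significant"
    return (
        f"For every 1-unit increase in performance-adjusted {factor}, "
        f"performance-adjusted log(Salary) {direction} by {abs(gamma):.4f}. "
        f"This effect is {verdict} at the {alpha:.0%} level (p = {p_value:.3f})."
    )


# Placeholder numbers for illustration only, not project results.
sentence = describe_gamma("Followers", 0.0123, 0.031)
print(sentence)
```

In `presentation.py` the returned string could be passed to `st.markdown`, once per Z-factor, using whatever coefficients and p-values the final OLS actually reports.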

This 4-page Streamlit app tells a complete and compelling story, from the "naive" problem to your sophisticated DML solution and the final, interpretable results.