# Module Contract: `configs/project_config.py`

This is the **most important file** for our Week 4 workflow. It is the central "contract" and "single source of truth" that all our modules will read from.

### Why do we need this?

1.  **No "Magic Strings":** We will never hard-code a file path (e.g., `"data/raw/my_data.csv"`) or a constant (e.g., `SEASON = "2024-25"`) inside any of our scripts. This is a bad practice that makes code hard to maintain. Instead, every script will import these variables from this one file.
2.  **Maintainability:** If we need to change a filename or run the *entire* project for a different season (e.g., "2023-24"), we only have to change it **in this one file**.

### How does this help us work in parallel?

* Modules **do not talk to each other directly**. `get_nba_stats.py` will *never* import `get_salary_data.py`.
* Instead, they all import `project_config.py`.
* This file tells each module **what inputs to read** and **where to write its output**. For example, Module 2 knows its job is to create `RAW_SALARY_FILE`, and Module 3 knows its job is to read `RAW_SALARY_FILE`. This "contract" is all they need to know.
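
As a rough illustration, the config file could look something like the sketch below. The variable names `SEASON`, `RAW_STATS_FILE`, `RAW_CONTEXT_FILE`, `RAW_SALARY_FILE`, and `PROCESSED_MERGED_FILE` and the file paths come from this contract; `PROJECT_ROOT`, `RAW_DATA_DIR`, and `PROCESSED_DATA_DIR` are hypothetical helper names, and the sketch assumes every script is launched from the project root.

```python
"""Sketch of configs/project_config.py (layout is an assumption, names come from the contract)."""
from pathlib import Path

# Assumption: scripts are run from the project root, so the working directory is the root.
PROJECT_ROOT = Path.cwd()

# Hypothetical helper names for the two data folders used in this plan.
RAW_DATA_DIR = PROJECT_ROOT / "data" / "raw"
PROCESSED_DATA_DIR = PROJECT_ROOT / "data" / "processed"

# The single place where the target season is defined.
SEASON = "2024-25"

# Output of Modules 1a, 1b, and 2; input of Module 3.
RAW_STATS_FILE = RAW_DATA_DIR / "raw_player_stats.csv"
RAW_CONTEXT_FILE = RAW_DATA_DIR / "raw_player_context.csv"
RAW_SALARY_FILE = RAW_DATA_DIR / "raw_player_salaries.csv"

# Output of Module 3; input of Module 4.
PROCESSED_MERGED_FILE = PROCESSED_DATA_DIR / "merged_player_data_v1.csv"
```

A module would then read its paths with something like `from configs.project_config import SEASON, RAW_STATS_FILE`, assuming `configs` is importable as a package.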

## Module 1a: NBA Performance Stats

* **Owner:** Gary
* **File:** `src/data_collection/get_nba_stats.py`
* **Job:** The goal of this module is to fetch all *performance* statistics for every player in the target season. This dataset will become our **$X$ vector** (the predictors) for our clustering and regression models. You'll primarily use the `nba_api` library, likely the `leaguedashplayerstats` endpoint, to get advanced metrics (like `TS_PCT`, `AST_PCT`, etc.) rather than just basic "per game" stats.

### Inputs (from `project_config.py`):
This script will import variables from our central config file.
* `SEASON`: The season string (e.g., "2024-25") that will be passed directly to the `nba_api` function call.
* `RAW_STATS_FILE`: The full `pathlib.Path` object where the final, raw dataframe will be saved as a CSV file (e.g., `.../data/raw/raw_player_stats.csv`).

### Output (The Contract):
* **File:** `data/raw/raw_player_stats.csv`
* **Description:** This CSV is the "raw" output from the API. It should be one row per player.
* **Required Columns:**
    * `PLAYER_ID`: This is the **most important column** as it's the unique primary key we'll use for the first, easy merge with Module 1b.
    * `PLAYER_NAME`: The player's full name as it appears in the API.
    * `TEAM_ID`: The unique ID for the player's team.
    * **$X$ Vector Stats:** A wide set of advanced stats. We need *at least*:
        * `TS_PCT` (True Shooting %)
        * `AST_PCT` (Assist %)
        * `REB_PCT` (Rebound %)
        * `USG_PCT` (Usage %)
        * `BLK_PCT` (Block %)
        * `STL_PCT` (Steal %)
        * `FGA_2PT_PCT` (Percentage of shots that are 2-pointers)
        * `FGA_3PT_PCT` (Percentage of shots that are 3-pointers)
        * ...and any other rate-based stats we find useful.
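
One way to enforce this column list before saving is a small contract check, sketched below. The function name `missing_columns` and the constant `REQUIRED_STATS_COLUMNS` are hypothetical; the column names are copied from the list above.

```python
from typing import Iterable

# Required columns copied from the Module 1a contract.
REQUIRED_STATS_COLUMNS = (
    "PLAYER_ID", "PLAYER_NAME", "TEAM_ID",
    "TS_PCT", "AST_PCT", "REB_PCT", "USG_PCT",
    "BLK_PCT", "STL_PCT", "FGA_2PT_PCT", "FGA_3PT_PCT",
)


def missing_columns(
    actual: Iterable[str],
    required: Iterable[str] = REQUIRED_STATS_COLUMNS,
) -> list[str]:
    """Return the required column names absent from `actual`, in contract order."""
    present = set(actual)
    return [name for name in required if name not in present]
```

With a pandas dataframe `df`, calling `missing_columns(df.columns)` just before writing the CSV would surface any renamed or absent column early, rather than in Module 3.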

## Module 1b: NBA Contextual Data

* **Owner:** Alberto
* **File:** `src/data_collection/get_context_data.py`
* **Job:** This module's job is to fetch the *non-performance*, demographic, and career-related data for all players. This will become our $Z_{\text{context}}$ vector, which is the key to our final bias analysis (Phase 4). We need to get this data separately because it often comes from a different API endpoint (like `commonplayerinfo`) than the advanced stats (which come from `leaguedashplayerstats`).

### Inputs (from `project_config.py`):
This script will import variables from our central config file.
* `SEASON`: This will be used to get the list of active players for that season. The script will likely need to get a list of all player IDs first, then loop through them to call the `commonplayerinfo` endpoint for each one.
* `RAW_CONTEXT_FILE`: The full `pathlib.Path` object where the final, raw dataframe will be saved as a CSV (e.g., `.../data/raw/raw_player_context.csv`).
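
The "get the IDs, then loop" pattern might be structured roughly as follows. This is only a sketch: `collect_player_context` and `fetch_one` are hypothetical names, `fetch_one` stands in for whatever wrapper ends up calling the `commonplayerinfo` endpoint, and the pause between requests is an assumed courtesy delay, not a documented requirement.

```python
import time
from typing import Callable, Iterable


def collect_player_context(
    player_ids: Iterable[int],
    fetch_one: Callable[[int], dict],
    pause_seconds: float = 0.6,
) -> tuple[list[dict], list[int]]:
    """Call `fetch_one` for every player ID and return (rows, failed_ids)."""
    rows: list[dict] = []
    failed: list[int] = []
    for player_id in player_ids:
        try:
            rows.append(fetch_one(player_id))
        except Exception:
            # Record the failure and keep going so one bad ID does not abort the run.
            failed.append(player_id)
        # Assumed delay between requests; tune or remove as needed.
        time.sleep(pause_seconds)
    return rows, failed
```

Passing the fetcher in as an argument keeps the loop testable without network access, and the returned `failed_ids` list can be retried or logged.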

### Output (The Contract):
* **File:** `data/raw/raw_player_context.csv`
* **Description:** This CSV will be one row per player, containing their "biographical" data.
* **Required Columns:**
    * `PLAYER_ID`: The **primary key**. This is critical as it's how this file will be merged with `raw_player_stats.csv` in Module 3.
    * `PLAYER_NAME`: The player's full name from the API (for cross-referencing).
    * `BIRTHDATE`: The player's date of birth (e.g., "1984-12-30T00:00:00"). We **must** have this so we can calculate their `AGE` during the processing step.
    * `COUNTRY`: The player's home country (for our 'Nationality' bias test).
    * `DRAFT_YEAR`: The year the player was drafted.
    * `DRAFT_ROUND`: The round the player was drafted in.
    * `DRAFT_NUMBER`: The overall draft pick number. (These three draft columns are crucial for our "pedigree" bias test).

## Module 2: Salary Data Scraping

* **Owner:** Tyler and Macy
* **File:** `src/data_collection/get_salary_data.py`
* **Job:** This is one of the most critical and difficult modules. Its only job is to go to an external website (like **Spotrac** or **Basketball-Reference**), scrape the salary table for our target season, and save that raw data. This module is completely separate from the `nba_api` because it's a "messy" data source. The player names *will not* match the API, and the salary figures will be text (e.g., "$15,000,000") that needs cleaning. This module's job is *not* to clean the data, just to get it and save it.

### Inputs (from `project_config.py`):
This script will import variables from our central config file.
* `SEASON`: The season string (e.g., "2024-25") will be used to construct the correct URL to scrape (e.g., `https://spotrac.com/nba/payroll/2024/`).
* `RAW_SALARY_FILE`: The full `pathlib.Path` object where the final, raw dataframe will be saved as a CSV (e.g., `.../data/raw/raw_player_salaries.csv`).
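
Turning the season string into a URL could be done along these lines. The helper names `season_start_year` and `payroll_url` are hypothetical, and the URL pattern is simply the example quoted above; it has not been checked against the live site.

```python
import re


def season_start_year(season: str) -> int:
    """Return the first calendar year of a season string such as "2024-25"."""
    match = re.fullmatch(r"(\d{4})-(\d{2})", season)
    if match is None:
        raise ValueError(f"Unexpected season format: {season!r}")
    start_year = int(match.group(1))
    # Sanity check: the two-digit suffix must be the year after the start year.
    if (start_year + 1) % 100 != int(match.group(2)):
        raise ValueError(f"Season years are not consecutive: {season!r}")
    return start_year


def payroll_url(season: str) -> str:
    """Build the scrape URL, using the pattern from this contract's example (unverified)."""
    return f"https://spotrac.com/nba/payroll/{season_start_year(season)}/"
```

Validating the format up front means a typo in `SEASON` fails loudly before any request is sent.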

### Output (The Contract):
* **File:** `data/raw/raw_player_salaries.csv`
* **Description:** This CSV is the raw, uncleaned table scraped from the website.
* **Required Columns:**
    * `Player_Name`: The player's name exactly as it appears on the website (e.g., "LucMbah a Moute"). This will be our messy key for the merge.
    * `Salary`: The player's salary as a *string* (e.g., "$15,000,000", or "$1,200,000 (cap)") exactly as it appears on the site. We will clean this in Module 3.

## Module 3: Merging & Cleaning

* **Owner:** Leo
* **File:** `src/data_processing/merge_data.py`
* **Job:** This is the most important and complex module of Week 4. It's the "integration" step where all our raw data comes together. Its job is to take the three separate, raw files (stats, context, and salary), combine them into one, and "process" them into a final, clean master file. This module handles all the dirty work: merging on different keys, cleaning messy text data, and calculating new variables.

### Helper Script: `src/data_processing/cleaning_helpers.py`
* This module's logic will be complex, so we'll put our reusable cleaning functions in a separate file.
* It **MUST** contain a function: `standardize_player_name(name: str) -> str`.
* **Explanation:** This function is our "Rosetta Stone." It will take a messy name from any source (e.g., "LucMbah a Moute", "LeBron James.", "Luka Dončić") and convert it to a single, standardized key (e.g., "luc mbah a moute", "lebron james", "luka doncic"). We will apply this function to the name columns from *both* the NBA API data and the salary data before attempting the merge. This is how we solve the "different names" problem.
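
A minimal sketch of such a function is shown below, assuming that accent stripping, lowercasing, and punctuation removal cover most cases. The `MANUAL_OVERRIDES` table is a hypothetical addition: a fused name like "LucMbah a Moute" cannot be split by a general rule without also breaking names like "LeBron", so this sketch handles it with an explicit lookup.

```python
import re
import unicodedata

# Hypothetical override table for names that no general rule can repair.
MANUAL_OVERRIDES = {
    "lucmbah a moute": "luc mbah a moute",
}


def standardize_player_name(name: str) -> str:
    """Convert a raw player name into a lowercase, accent-free merge key."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    lowered = without_accents.lower()
    # Keep letters, digits, and whitespace; drop periods, apostrophes, hyphens, etc.
    cleaned = re.sub(r"[^a-z0-9\s]", "", lowered)
    # Collapse runs of whitespace and trim the ends.
    collapsed = " ".join(cleaned.split())
    return MANUAL_OVERRIDES.get(collapsed, collapsed)
```

Because the same function is applied to both sources, dropping hyphens and apostrophes is harmless as long as both sides lose them identically. Letters that do not decompose under NFKD would need their own entries or an extra rule.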

### Inputs (The Contract):
This script will read three files, using the paths from `project_config.py`:
* `RAW_STATS_FILE` (from Module 1a)
* `RAW_CONTEXT_FILE` (from Module 1b)
* `RAW_SALARY_FILE` (from Module 2)

### Logic:
1.  **Load Stats & Context:** Load `raw_player_stats.csv` and `raw_player_context.csv`.
2.  **First Merge (Easy):** Perform an inner merge on these two dataframes using `PLAYER_ID` as the key. This gives us one unified "NBA API" dataframe.
3.  **Load Salaries:** Load `raw_player_salaries.csv`.
4.  **Standardize Keys:**
    * Apply `standardize_player_name` to the `PLAYER_NAME` column of the NBA API dataframe, creating a new `merge_key` column.
    * Apply `standardize_player_name` to the `Player_Name` column of the salary dataframe, creating its `merge_key` column.
5.  **Second Merge (Hard):** Perform a **left merge**, joining the salary data *onto* our main NBA API dataframe using `merge_key`. We use a left merge so we keep all players, even if they are missing salary data.
6.  **Clean & Process:**
    * **Clean Salary:** Convert the `Salary` column (e.g., "$15,000,000") to a clean numeric type (e.g., `15000000`).
    * **Calculate Age:** Use the `BIRTHDATE` column to calculate the player's `AGE` at the start of the `SEASON`.
    * **Handle NaNs:** Explicitly check for players with `NaN` in the `Salary` column. We will log the names of these "merge failures" to the console to see who we missed.
7.  **Save:** Save the final, processed dataframe.
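
The two merges and the merge-failure log in the steps above can be sketched with pandas as follows. The three tiny dataframes are made-up stand-ins for the raw files, and `simple_key` is a simplified placeholder for `standardize_player_name`.

```python
import pandas as pd

# Made-up frames standing in for the three raw files (illustrative values only).
stats = pd.DataFrame({
    "PLAYER_ID": [1, 2, 3],
    "PLAYER_NAME": ["Player One", "Player Two", "Player Three"],
    "TS_PCT": [0.60, 0.55, 0.58],
})
context = pd.DataFrame({
    "PLAYER_ID": [1, 2, 3],
    "COUNTRY": ["USA", "France", "USA"],
})
salaries = pd.DataFrame({
    "Player_Name": ["player one", "PLAYER  TWO"],
    "Salary": ["$15,000,000", "$1,200,000 (cap)"],
})


def simple_key(name: str) -> str:
    # Placeholder for standardize_player_name: lowercase and collapse whitespace only.
    return " ".join(name.lower().split())


# Step 2: inner merge on the shared primary key; validate guards against duplicate IDs.
nba = stats.merge(context, on="PLAYER_ID", how="inner", validate="one_to_one")

# Step 4: build the same merge_key on both sides.
nba["merge_key"] = nba["PLAYER_NAME"].map(simple_key)
salaries["merge_key"] = salaries["Player_Name"].map(simple_key)

# Step 5: left merge keeps every API player; unmatched players get NaN in Salary.
merged = nba.merge(
    salaries[["merge_key", "Salary"]],
    on="merge_key",
    how="left",
    validate="many_to_one",
)

# Step 6 (Handle NaNs): list the players whose salary did not match.
merge_failures = merged.loc[merged["Salary"].isna(), "PLAYER_NAME"].tolist()
print(f"{len(merge_failures)} merge failure(s): {merge_failures}")
```

In the real files both the stats and context CSVs carry `PLAYER_NAME`, so one copy would need to be dropped before the first merge; otherwise pandas suffixes the two columns as `PLAYER_NAME_x` and `PLAYER_NAME_y`.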

### Output (The Contract):
* **File:** `data/processed/merged_player_data_v1.csv`
* **Description:** This is the **final, golden dataset** for the entire team. It should have one row per player, with all performance ($X$), context ($Z_{\text{context}}$), and salary ($Y$) columns cleaned and in the correct data type. All subsequent modules (EDA, Clustering, Modeling) will read from this *one file*.
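
The salary and age cleaning that this file promises could rely on helpers like the ones sketched here. The names `clean_salary` and `age_on` are hypothetical, and the reference date for "start of the season" is left to the caller because this plan does not fix one.

```python
import re
from datetime import date, datetime


def clean_salary(raw: str) -> int:
    """Convert text such as "$15,000,000" or "$1,200,000 (cap)" to whole dollars."""
    match = re.search(r"\$\s*(\d[\d,]*)", raw)
    if match is None:
        raise ValueError(f"No dollar amount found in {raw!r}")
    return int(match.group(1).replace(",", ""))


def age_on(birthdate: str, reference: date) -> int:
    """Whole years between an ISO birthdate string and a reference date."""
    born = datetime.fromisoformat(birthdate).date()
    # Subtract one year if the birthday has not yet occurred in the reference year.
    had_birthday = (reference.month, reference.day) >= (born.month, born.day)
    return reference.year - born.year - (0 if had_birthday else 1)
```

`clean_salary` expects a string, so in a dataframe it should be applied only to rows where `Salary` is not null; the unmatched rows stay as `NaN` for the merge-failure check.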

## Module 4: EDA & Validation

* **Owners:** -- 
* **Files:** `notebooks/Macy/00_sandbox.ipynb`, `notebooks/Alberto/00_sandbox.ipynb`
* **Job:** This module is our "Quality Assurance" (QA) step. The owners act as the first "customers" or "users" of the data produced by Module 3. The goal is twofold: 1) **Validate** that the merged dataset is complete and correct, and 2) **Explore** the data (EDA) to understand its properties, which will inform our future modeling decisions in Phase 1 (Clustering) and Phase 3 (Salary Modeling).

### Inputs (The Contract):
This module reads *only one file*, the final output from Module 3.
* `PROCESSED_MERGED_FILE` (i.e., `data/processed/merged_player_data_v1.csv`), which is imported from `project_config.py`.

### Tasks:
1.  **Load & Validate:**
    * Load `PROCESSED_MERGED_FILE` into a pandas DataFrame.
    * Immediately run `df.info()` and `df.describe()`.
    * **Critical Validation:** Compare the number of non-null values in the `Salary` column against the total row count. This ratio is our **merge success rate**. (e.g., "We have 540 players in the NBA stats, but only 480 have non-null salaries. Our merge success rate is 88.9%"). This is the most important metric to report back to the team.
    * Check the data types (`dtypes`). Is `Salary` numeric? Is `AGE` numeric? Are all our $X$ stats numeric?
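
The validation checks above might be wrapped in two small helpers, sketched here with pandas. `merge_success_rate` and `non_numeric_columns` are hypothetical names, and the demo dataframe is fabricated to mirror the 480-of-540 example, with one column deliberately stored as text.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def merge_success_rate(df: pd.DataFrame, salary_column: str = "Salary") -> float:
    """Fraction of rows (0 to 1) that have a non-null salary."""
    return float(df[salary_column].notna().mean())


def non_numeric_columns(df: pd.DataFrame, columns: list[str]) -> list[str]:
    """Return the names in `columns` whose dtype is not numeric."""
    return [name for name in columns if not is_numeric_dtype(df[name])]


# Fabricated demo data: 480 rows with a salary, 60 without.
demo = pd.DataFrame({
    "Salary": [1_000_000.0] * 480 + [None] * 60,
    "AGE": [25] * 540,
    "TS_PCT": ["0.58"] * 540,  # deliberately stored as text to trigger the check
})

rate = merge_success_rate(demo)
print(f"Merge success rate: {rate:.1%}")
print("Non-numeric columns:", non_numeric_columns(demo, ["Salary", "AGE", "TS_PCT"]))
```

Any column reported as non-numeric is worth sending back to Module 3 rather than patching inside the notebook.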

2.  **Initial Analysis (EDA):**
    * **Salary Distribution:** Plot a histogram of the `Salary` column. Then, plot a histogram of `log(Salary)`. This is essential for checking our hypothesis that salary is approximately log-normally distributed; if it holds, we **must** use `log(Salary)` as our $Y$ variable in Phase 3.
    * **Predictor Distributions:** Plot histograms for 3-4 key variables from our $X$ vector (e.g., `TS_PCT`, `USG_PCT`) and our $Z_{\text{context}}$ vector (e.g., `AGE`). This helps us spot skewness or outliers *before* we feed them into our clustering algorithm.
    * **Correlation:** Create a correlation heatmap of our main $X$ vector variables. This will show us which stats are highly correlated (e.g., `FGA_2PT_PCT` and `FGA_3PT_PCT` will be negatively correlated) and helps us understand the structure of our data.

3.  **Report:**
    * Post a summary of the findings in the team chat (e.g., "Data loaded. Merge success rate is 90%. `log(Salary)` looks good. `USG_PCT` is a bit skewed, but nothing major. We are clear to proceed to Phase 1.").

### Parallel Work:
* This module can be **started immediately**. The owners can create a *fake, 5-row* `merged_player_data_v1.csv` file with the expected column names. They can write their entire notebook (all the `df.info()`, `df.hist()`, etc. calls) using this fake data. When Module 3 is finally complete, they just re-run their notebook with the *real* data.
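
Creating the fake file could be as simple as the sketch below, which uses only the standard library. `write_dummy_dataset` is a hypothetical helper, the column list is a subset of the contract's names, and every value is obviously fake. The last row's salary is left blank so the null-salary check has something to find.

```python
import csv
from pathlib import Path

# Subset of the contract's column names; extend with the remaining stats as needed.
COLUMNS = [
    "PLAYER_ID", "PLAYER_NAME", "TEAM_ID", "TS_PCT", "USG_PCT",
    "AGE", "COUNTRY", "DRAFT_YEAR", "Salary",
]


def write_dummy_dataset(path: Path, n_rows: int = 5) -> None:
    """Write `n_rows` of clearly fake players so the notebook has something to load."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        for i in range(1, n_rows + 1):
            writer.writerow({
                "PLAYER_ID": i,
                "PLAYER_NAME": f"Fake Player {i}",
                "TEAM_ID": 100 + i,
                "TS_PCT": round(0.50 + 0.02 * i, 2),
                "USG_PCT": round(0.15 + 0.02 * i, 2),
                "AGE": 20 + i,
                "COUNTRY": "Nowhere",
                "DRAFT_YEAR": 2015 + i,
                # Leave the last salary blank to mimic a merge failure.
                "Salary": "" if i == n_rows else 1_000_000 * i,
            })
```

Writing the file to a scratch location (or to `PROCESSED_MERGED_FILE` before the real file exists) lets the notebook be developed against the agreed column names.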

## Discussion: How This All Works Together

This modular structure is the most important part of our Week 4 plan. It's how we'll get a complex, multi-part task done in one week. Here's why this setup is designed for success:

### 1. It Enforces "Decoupling"
* **What it means:** Notice that `get_nba_stats.py` and `get_salary_data.py` **do not depend on each other at all**. They are "decoupled." They don't import each other's code, and one's failure doesn't stop the other from running.
* **Why it's smart:** This allows for **true parallel work**. -- and -- can start their coding the moment they read this plan, without a single check-in. Their only shared dependency is the *filename* they write to, which is defined in our "contract," the `project_config.py` file.

### 2. It Creates a "Consumer" Model
* **What it means:** -- and -- (Module 4) are the "consumers" of the data. They don't need to know *how* -- scraped the data or *how* --'s merge logic works. They only care about the final, promised output file: `merged_player_data_v1.csv`.
* **Why it's smart:** This **also enables parallel work**. -- and -- don't have to wait for Module 3 to be finished. They can **create a fake, 5-row "dummy" CSV** with the exact column names we've specified in this plan. With that dummy file, they can build their entire EDA notebook—all the plots, all the validation checks (`df.info()`, etc.). When Module 3 is finally done, they just re-run their *already-finished* notebook with the *real* data.

### 3. It Makes Testing and Debugging Easier
* **What it means:** Because each part is a separate script, we can run just one part at a time.
* **Why it's smart:** If `main.py` fails, we'll know exactly which module is broken. If the salary scraper (Module 2) breaks because the website changed, we can fix it without touching or worrying about the NBA API modules (Modules 1a and 1b). This isolation is key to managing a complex data pipeline and not wasting time.