# Optimal NBA Team Selection (CST-435)

**Contributors:** Preston Brownlee, with ChatGPT assistance

**Streamlit App:** `streamlit run app/streamlit_app.py`  
**Data Window:** 2018–19 to 2022–23  
**Player Pool:** 100 unique players (Random)

**Executive Summary.**  
I implemented a deep artificial neural network (a multilayer perceptron, MLP) in **NumPy** to classify “optimal” players in a 5-year NBA window using interpretable basketball features. The Streamlit app constructs three lineups (**Overall**, **Offense**, and **Defense**), enforces one player per position (PG/SG/SF/PF/C) in each, and explains the selections with stat-based drivers. This notebook documents the approach, dataset, model, and findings, using screenshots exported from the app.


## Run Instructions

### 1) Install dependencies
`pip install -r requirements.txt`

### 2) Run Streamlit App
`streamlit run app/streamlit_app.py`

## Table of Contents
1. Problem Statement  
2. Dataset Description (Strengths & Weaknesses)  
3. Feature Set & Labeling (What “Optimal” Means)  
4. Model Architecture (Deep NumPy MLP)  
5. Learning Procedure (What Actually Happens)  
6. Team Construction Logic (Positions & Fallbacks)  
7. App Outputs & How to Read Them (Screenshots)  
8. Findings (Overall / Offense / Defense)  
9. Limitations & Future Work  
10. Reproducibility & Deployment  
11. References


## 1) Problem Statement

**Goal.** From a 5-year window of NBA seasons, sample a **100-player** pool and select an **optimal team of 5**—one per position (PG, SG, SF, PF, C)—according to a definition of “optimal” that balances team roles and performance.

**Deliverables.**
- A deep **NumPy MLP** that predicts whether a player is “optimal” (binary).
- A Streamlit app that:
  - Shows **All Candidates by Position** (ranked),
  - Builds a **Predicted Team** (Overall / Offense / Defense),
  - Explains **Why These Picks** (player-level & team-level drivers),
  - Visualizes players on a **half-court diagram**.
- This documentation notebook summarizing the pipeline and results.


## 2) Dataset Description (Strengths & Weaknesses)

**Source.** “NBA Players Dataset”. Columns used include:
- Identity/Context: `player_name`, `team_abbreviation`, `season`, `age`, `player_height`, `player_weight`.
- Performance: `pts`, `reb`, `ast`, `net_rating`, `usg_pct`, `ts_pct`, `ast_pct`, `oreb_pct`, `dreb_pct`.

**Sampling.**  
- Seasons filtered to **2018–19 → 2022–23**.  
- One row per player (most recent season inside the window).  
- Random sample of **100** unique players (fixed seed).
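
The sampling steps above can be sketched in plain Python. This is a minimal illustration, not the app's code: the function name `sample_pool`, the seed value `42`, and the assumption that `season` is stored as strings such as `"2018-19"` (so string comparison matches chronological order) are all hypothetical.

```python
import random

# Seasons inside the 5-year window (string format is an assumption).
WINDOW = {"2018-19", "2019-20", "2020-21", "2021-22", "2022-23"}

def sample_pool(rows, n=100, seed=42):
    """Return up to n unique players, one row each (latest season in the window).

    rows: iterable of dicts with at least 'player_name' and 'season' keys.
    The seed value is a placeholder; the source only says the seed is fixed.
    """
    latest = {}
    for row in rows:
        if row["season"] not in WINDOW:
            continue  # outside the 5-year window
        name = row["player_name"]
        # "YYYY-YY" strings compare chronologically, so ">" means "more recent".
        if name not in latest or row["season"] > latest[name]["season"]:
            latest[name] = row
    # Sort first so the sample depends only on the seed, not on input order.
    players = sorted(latest.values(), key=lambda r: r["player_name"])
    return random.Random(seed).sample(players, min(n, len(players)))
```

Using a dedicated `random.Random(seed)` instance keeps the draw reproducible without touching the global random state.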

**Strengths.**
- Includes **efficiency** (TS%), **usage** (USG%), **impact** (net rating),
  **role** (AST%), **rebounding splits** (OREB%/DREB%).
- Multi-season coverage allows recency selection.

**Weaknesses/Assumptions.**
- Defensive stats are limited (no STL, BLK, or opponent data).  
- `net_rating` mixes offense and defense, so it cannot isolate either side.  
- Per-game stats reflect role and minutes as well as skill; **z-scoring** mitigates scale differences only.  
- If position is absent, the app infers it from height/weight rules.

**Mitigations.**
- Create separate **Offense** and **Defense** composite scores.  
- Use **DREB% + size proxies** for defense.  
- Enforce **positions** when constructing teams.


## 3) Feature Set & Labeling (What “Optimal” Means)

**Inputs (12 features).**  
`age`, `player_height`, `player_weight`, `pts`, `reb`, `ast`,  
`net_rating`, `usg_pct`, `ts_pct`, `ast_pct`, `oreb_pct`, `dreb_pct`.

**Composite Scores** (features are z-scored inside the 100-player pool):

- **Offense:**

  $$\mathrm{off\_score}
  = 0.40\,\mathrm{PTS}
  + 0.25\,\mathrm{AST}
  + 0.20\,\mathrm{TS}\%
  + 0.10\,\mathrm{USG}\%
  + 0.05\,\mathrm{OREB}\%$$

- **Defense (proxies):**

  $$\mathrm{def\_score}
  = 0.55\,\mathrm{DREB}\%
  + 0.25\,\mathrm{height}
  + 0.20\,\mathrm{weight}$$

- **Overall:**

  $$\mathrm{score}
  = 0.5\,\mathrm{off\_score}
  + 0.5\,\mathrm{def\_score}$$


**Labels.**  
- “Optimal” = top **20%** by **Overall** score within the sampled pool (20 of the 100 players) → **label=1**, else 0.  
- The MLP is trained to learn this labeling; Offense/Defense teams in the app are built by ranking on the respective composite.
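
The composite scores and labels above can be sketched in NumPy as follows. Several details are assumptions rather than confirmed behavior of the app: that AST in the offense formula is the per-game `ast` column (not `ast_pct`), that z-scores use the population standard deviation, and that ties at the 20% cutoff are broken by input order. The helper names are hypothetical.

```python
import numpy as np

# Weights copied from the formulas above; column mapping is an assumption.
OFF_W = {"pts": 0.40, "ast": 0.25, "ts_pct": 0.20, "usg_pct": 0.10, "oreb_pct": 0.05}
DEF_W = {"dreb_pct": 0.55, "player_height": 0.25, "player_weight": 0.20}

def zscore(x):
    """Standardize a 1-D array within the pool (population std, ddof=0)."""
    x = np.asarray(x, dtype=float)
    std = x.std()
    return (x - x.mean()) / std if std > 0 else np.zeros_like(x)

def composite_scores(stats):
    """stats: dict mapping column name -> 1-D array over the player pool."""
    z = {k: zscore(v) for k, v in stats.items()}
    off = sum(w * z[k] for k, w in OFF_W.items())
    dfn = sum(w * z[k] for k, w in DEF_W.items())
    return off, dfn, 0.5 * off + 0.5 * dfn

def top_fraction_labels(score, frac=0.20):
    """Label the top `frac` of players by score as 1, everyone else as 0."""
    score = np.asarray(score, dtype=float)
    n_pos = int(round(frac * len(score)))
    order = np.argsort(-score, kind="stable")  # descending score
    labels = np.zeros(len(score), dtype=int)
    labels[order[:n_pos]] = 1
    return labels
```

Because the label is a deterministic function of the same features the network sees, test accuracy measures how well the MLP reproduces this rule, not how well it predicts an external outcome.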


## 4) Model Architecture (Deep NumPy MLP)

**Layer sizes:** [12 → 32 → 16 → 1] (hidden layers use ReLU; output uses Sigmoid).

**Forward propagation.** For layer $l = 1, \dots, L$, with $A^{(0)} = X$ (one column per player):

$$
Z^{(l)} = W^{(l)} A^{(l-1)} + b^{(l)}, \qquad
A^{(l)} =
\begin{cases}
\mathrm{ReLU}(Z^{(l)}) & l < L \\
\sigma(Z^{(l)}) & l = L
\end{cases}
$$
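
A minimal NumPy sketch of this forward pass, assuming inputs are stored column-wise (shape `(n_features, m)`) to match $Z = WA + b$. The He-style initialization and the function names are assumptions; the source does not state how weights are initialized.

```python
import numpy as np

def init_params(sizes, seed=0):
    """Create (W, b) per layer. He-style init is an assumption, not from the source."""
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
        b = np.zeros((n_out, 1))
        params.append((W, b))
    return params

def forward(X, params):
    """Return pre-activations Zs and activations As, where As[0] is X."""
    Zs, As = [], [X]
    last = len(params) - 1
    for l, (W, b) in enumerate(params):
        Z = W @ As[-1] + b                       # b broadcasts across the m columns
        if l == last:
            A = 1.0 / (1.0 + np.exp(-Z))         # sigmoid on the output layer
        else:
            A = np.maximum(0.0, Z)               # ReLU on hidden layers
        Zs.append(Z)
        As.append(A)
    return Zs, As
```

Keeping every `Z` and `A` is deliberate: backpropagation needs them to compute the layer gradients.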

**Loss (Binary Cross-Entropy).**

$$
\mathcal{L} = -\frac{1}{m} \sum_{i=1}^{m} \left( y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i) \right)
$$
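
In code, the loss is usually computed with the predictions clipped away from 0 and 1 so the logarithms stay finite. The sketch below assumes that approach; the clipping constant `eps` and the function name are illustrative choices, not taken from the source.

```python
import numpy as np

def bce_loss(y_hat, y, eps=1e-12):
    """Mean binary cross-entropy over m examples.

    eps is an assumed clipping constant that guards against log(0).
    """
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y, dtype=float)
    return float(-np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat)))
```

A prediction of 0.5 for every player gives a loss of $\log 2 \approx 0.693$, a useful baseline when reading training curves.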

**Backpropagation.**

- Output error: $\delta^{(L)} = \hat{y} - y$ (after BCE+sigmoid simplification).  
- Hidden: $\delta^{(l)} = \left(W^{(l+1)T}\delta^{(l+1)}\right) \odot \mathrm{ReLU}'\!\left(Z^{(l)}\right)$.  
- Gradients: $\nabla_{W^{(l)}} = \frac{1}{m}\,\delta^{(l)} A^{(l-1)T}, \quad \nabla_{b^{(l)}} = \frac{1}{m}\sum \delta^{(l)}$.
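
The simplification in the first bullet follows from the chain rule. For a single example with per-example loss $\ell = -\left(y \log \hat{y} + (1-y)\log(1-\hat{y})\right)$ and $\hat{y} = \sigma(z)$, where $z$ is the output pre-activation $Z^{(L)}$:

$$
\frac{\partial \ell}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} = \frac{\hat{y}-y}{\hat{y}(1-\hat{y})},
\qquad
\frac{\partial \hat{y}}{\partial z} = \hat{y}(1-\hat{y})
$$

$$
\frac{\partial \ell}{\partial z} = \frac{\partial \ell}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} = \hat{y} - y
$$

The $\hat{y}(1-\hat{y})$ factors cancel, which is why $\delta^{(L)}$ contains no explicit sigmoid derivative. The $\frac{1}{m}$ from averaging over examples is carried by the gradient formulas instead.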

**Optimization.** Learning rate = 0.01; epochs = 1,000; prediction threshold = 0.5.
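
Putting the forward pass, the backpropagation formulas, and these hyperparameters together gives a training loop like the sketch below. The source does not name the optimizer, so full-batch gradient descent is an assumption here, as are the He-style initialization, the seed, and all function names.

```python
import numpy as np

def init_params(sizes, seed=0):
    """(W, b) per layer; He-style init is an assumption, not from the source."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in)),
             np.zeros((n_out, 1)))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(X, params):
    """Return pre-activations Zs and activations As (As[0] is X)."""
    Zs, As = [], [X]
    for l, (W, b) in enumerate(params):
        Z = W @ As[-1] + b
        is_output = l == len(params) - 1
        As.append(1.0 / (1.0 + np.exp(-Z)) if is_output else np.maximum(0.0, Z))
        Zs.append(Z)
    return Zs, As

def backward(y, params, Zs, As):
    """Gradients (dW, db) per layer, following the delta formulas above."""
    m = y.shape[1]
    grads = [None] * len(params)
    delta = As[-1] - y                               # output error: y_hat - y
    for l in reversed(range(len(params))):
        dW = delta @ As[l].T / m                     # As[l] is the input to layer l
        db = delta.sum(axis=1, keepdims=True) / m
        grads[l] = (dW, db)
        if l > 0:
            W = params[l][0]
            delta = (W.T @ delta) * (Zs[l - 1] > 0)  # ReLU'(Z) is 1 where Z > 0
    return grads

def train(X, y, sizes=(12, 32, 16, 1), lr=0.01, epochs=1000, seed=0):
    """Full-batch gradient descent (assumed optimizer)."""
    params = init_params(sizes, seed)
    for _ in range(epochs):
        Zs, As = forward(X, params)
        grads = backward(y, params, Zs, As)
        params = [(W - lr * dW, b - lr * db)
                  for (W, b), (dW, db) in zip(params, grads)]
    return params

def predict(X, params, threshold=0.5):
    """Return (probabilities, binary predictions) at the given threshold."""
    proba = forward(X, params)[1][-1]
    return proba, (proba >= threshold).astype(int)
```

Here `X` has shape `(12, m)` and `y` has shape `(1, m)`, so `predict` returns the two arrays that the app exposes as `pred_proba` and `prediction`.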


## 5) Learning Procedure (What Actually Happens)

1. **Standardize** inputs (z-scores).  
2. **Split** the 100 players: 80 train / 20 test (stratified by label).  
3. **Train** MLP for 1,000 epochs on the binary label (optimal vs. not).  
4. **Evaluate** test accuracy (agreement with our label definition).  
5. **Predict** for all 100 players (probabilities & binary predictions).  
6. **Construct teams** (Overall/Offense/Defense) with position enforcement.
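
Step 2 and step 4 can be sketched in NumPy without extra dependencies. The function names and the seed are hypothetical; the source only states that the split is 80/20 and stratified by label.

```python
import numpy as np

def stratified_split(y, test_frac=0.20, seed=0):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))  # shuffle within the class
        n_test = int(round(test_frac * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.sort(train_idx), np.sort(test_idx)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
```

With 20 positives among 100 players, a stratified 80/20 split places 16 positives in the training set and 4 in the test set, so a single misclassified positive moves test recall by 25 percentage points.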


## 6) Team Construction Logic (Positions & Fallbacks)

**Positions:** One of each: PG, SG, SF, PF, C.

**Selection (Overall mode).**
1) Prefer **predicted=1** _within that position_ (highest Overall score).  
2) If none, take the **best within position** by Overall score.  
3) If still none, search **position family** (PG↔SG, SF↔PF, PF↔C).  
4) Last resort: **global best remaining** (noted as fallback).

**Selection (Offense/Defense modes).**  
- Rank by **off_score** or **def_score** respectively.  
- No predicted=1 requirement; still enforce positions and family fallbacks.
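
The selection rules above can be sketched as a greedy pass over the five positions. This is an illustration under assumptions: the PG→C fill order, the search order inside a position family, the dictionary keys, and the note strings are hypothetical and may differ from the app.

```python
POSITIONS = ["PG", "SG", "SF", "PF", "C"]
# Position families from the text: PG<->SG, SF<->PF, PF<->C.
FAMILY = {"PG": ["SG"], "SG": ["PG"], "SF": ["PF"], "PF": ["SF", "C"], "C": ["PF"]}

def build_team(players, score_key="score", require_pred=True):
    """Pick one player per position with the fallbacks described above.

    players: list of dicts with 'player_name', 'position', 'prediction', score_key.
    Returns {position: (player_name, note)}; note is "" when no fallback was used.
    """
    chosen, team = set(), {}

    def best(cands):
        # Highest-scoring candidate that has not been selected yet.
        cands = [p for p in cands if p["player_name"] not in chosen]
        return max(cands, key=lambda p: p[score_key]) if cands else None

    for pos in POSITIONS:
        in_pos = [p for p in players if p["position"] == pos]
        pick, note = None, ""
        if require_pred:                                   # 1) predicted=1 in position
            pick = best([p for p in in_pos if p["prediction"] == 1])
        if pick is None:                                   # 2) best in position
            pick = best(in_pos)
        if pick is None:                                   # 3) position family
            family = [p for p in players if p["position"] in FAMILY[pos]]
            pick, note = best(family), "fallback: position family"
        if pick is None:                                   # 4) global best remaining
            pick, note = best(players), "fallback: global best remaining"
        if pick is not None:
            chosen.add(pick["player_name"])
            team[pos] = (pick["player_name"], note)
    return team
```

Offense and Defense modes would call it as `build_team(players, "off_score", require_pred=False)` or with `"def_score"`. One consequence of the greedy order is that a family fallback for an early slot can take a player who would otherwise have filled a later slot.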

**Interpretation helpers.**
- **prediction (0/1):** MLP’s binary output (trained on Overall label).  
- **pred_proba (0–1):** MLP’s confidence for label=1.  
- **score/off_score/def_score:** ranking metrics for the team mode.


## 7) App Outputs & How to Read Them (Screenshots)

### 7.1 All Candidates by Position (Ranked)
- **What it shows:** Within each position (PG/SG/SF/PF/C), players ranked by the current mode’s score.  
- **Styling:** The **best value per stat in that position is green & bold**.  
- **How to use:** Scan the top names per slot; compare **prediction** and **pred_proba** with the ranking to see how well the MLP agrees with it.

**PG — Overall**
![All Candidates — Overall (PG)](image/OVR_PG.png)

**SG — Overall**
![All Candidates — Overall (SG)](image/OVR_SG.png)

**SF — Overall**
![All Candidates — Overall (SF)](image/OVR_SF.png)

**PF — Overall**
![All Candidates — Overall (PF)](image/OVR_PF.png)

**C — Overall**
![All Candidates — Overall (C)](image/OVR_C.png)

---

### 7.2 Predicted Best Overall Team (One per Position)
- **What it shows:** The selected five (PG→C) with their scores and key stats.  
- **Notes column:** If any cross-position fallback is used, it is explicitly annotated.

**Predicted Team — Overall**
![Predicted Team — Overall](image/OVR_TEAM.png)

**Predicted Team — Offense**
![Predicted Team — Offense](image/OFF_TEAM.png)

**Predicted Team — Defense**
![Predicted Team — Defense](image/DEF_TEAM.png)

---

### 7.3 Why These Picks? (Per-Player Drivers)
- **What it shows:** For each selected player, the top 3 **weighted, standardized** stat drivers under the chosen mode.  
- **Interpretation:** A positive value means the player is **above the pool average** on that stat; the value grows with both the size of that gap and the **weight** the mode assigns to the stat.

![Player Drivers — Overall](image/OVR_PLAYER_DRIVERS.png)
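
The per-player drivers described above amount to ranking the terms `weight × z-score`. The sketch below assumes the Overall weights are the Section 3 weights scaled by 0.5 (which follows from the Overall formula) and that drivers are ranked by signed contribution; the column mapping and function name are hypothetical.

```python
# Effective Overall weights: 0.5 * offense weight or 0.5 * defense weight.
OVERALL_W = {
    "pts": 0.5 * 0.40, "ast": 0.5 * 0.25, "ts_pct": 0.5 * 0.20,
    "usg_pct": 0.5 * 0.10, "oreb_pct": 0.5 * 0.05,
    "dreb_pct": 0.5 * 0.55, "player_height": 0.5 * 0.25, "player_weight": 0.5 * 0.20,
}

def top_drivers(z_row, weights, k=3):
    """Return the k largest (stat, weight * z) terms for one player.

    z_row: dict mapping stat name -> that player's z-score within the pool.
    Ranking by signed value (not absolute value) is an assumption.
    """
    contrib = {stat: w * z_row[stat] for stat, w in weights.items()}
    return sorted(contrib.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

Because the composite is linear, a player's contributions across all stats sum to that player's score for the mode.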

---

### 7.4 Team-Level Drivers
- **What it shows:** Weighted team mean z-scores by stat (under the mode’s weights).  
- **Interpretation:** Shows which stats make **the team as a whole** strong in this mode.

![Team Drivers — Overall](image/OVR_TEAM_DRIVERS.png)
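
A team-level driver can be sketched as the lineup's mean z-score on each stat multiplied by the mode's weight. The array layout (one row per selected player) and the function name are assumptions.

```python
import numpy as np

def team_drivers(Z_team, stat_names, weights):
    """Weighted mean z-score per stat for the selected lineup, largest first.

    Z_team: array of shape (n_players, n_stats) holding the lineup's z-scores.
    weights: dict mapping stat name -> weight under the current mode.
    """
    Z_team = np.asarray(Z_team, dtype=float)
    w = np.array([weights.get(s, 0.0) for s in stat_names])
    contrib = w * Z_team.mean(axis=0)        # weight * team mean z-score
    order = np.argsort(-contrib)             # descending contribution
    return [(stat_names[i], float(contrib[i])) for i in order]
```

Summing the returned values gives the lineup's mean composite score for the mode, so the chart decomposes that average into per-stat parts.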

---

### 7.5 Court Visualization
- **What it shows:** Players placed at PG/SG/SF/PF/C spots on a compact half-court.  
- **Interpretation:** Quick positional visualization of the lineup.

![Court View — Overall lineup](image/Court_View.png)

---


## 8) Findings (Overall / Offense / Defense)

### Overall
- **Pattern observed:** (e.g., balanced creators + efficient finishers + board presence)
- **Drivers (team-level):** (e.g., net_rating, PTS/AST/REB, TS%)  
- **Trade-offs:** (e.g., high-usage guard chosen over slightly more efficient scorer due to assist load)

### Offense
- **Pattern observed:** (e.g., high-usage, high-efficiency backcourt; OREB% bigs)
- **Drivers (team-level):** PTS, AST, TS%, USG%; OREB% adds extra possessions
- **Trade-offs:** (e.g., defense sacrificed for elite shooting & creation)

### Defense
- **Pattern observed:** (e.g., size & glass cleaning; wing with strong DREB% proxy)
- **Drivers (team-level):** DREB% + size proxies
- **Trade-offs:** (e.g., lower off_score tolerated for better rim protection)


## 9) Limitations & Future Work

- **Defensive coverage:** Add STL, BLK, opponent on/off, or DRTG if available.  
- **Position inference:** Replace height/weight heuristics with true position labels or role clustering.  
- **Modeling:** Cross-validation, hyperparameter tuning, regularization (L2/Dropout), LR schedules.  
- **Targets:** Consider per-position labels or multi-objective training (joint offense/defense heads).  
- **Data leakage/quality:** Confirm season alignment, handle outliers, and consider per-position z-scores.
