# WNBA Season Business and Data Understanding

This project analyzes 10 years of WNBA data covering player, team, coach, and game statistics. A basketball season is structured in two phases: a regular season, in which teams compete to maximize wins, and a playoff stage, in which the top teams face off in knockout series for the championship.


**Insights extracted for strategic relevance:**
- Teams with high roster turnover are more likely to change coaches.  
- Players with consistent top-tier performance across multiple metrics are more likely to win individual awards.  
- Defensive metrics and win consistency are strong predictors of regular-season rankings.

## Definition of business goals
The business objectives for predicting the upcoming WNBA season are:

1. **Predict the final regular-season conference rankings of each team**  

2. **Identify teams likely to change coaches**  

3. **Predict winners of individual awards**

**Strategic rationale:**
- Accurate prediction of rankings informs team performance analysis and fan engagement strategies.  
- Anticipating coaching changes allows for studying team stability and potential strategic interventions.  
- Forecasting individual awards highlights key player contributions and supports marketing or talent evaluation simulations.  


## Translation of business goals into data mining goals
Each business goal has been mapped to specific **predictive modeling objectives**, with algorithmic approach, features, and evaluation metrics explicitly defined:

1. **Teams likely to change coaches → Classification**
   - **Model type:** Random Forest, Gradient Boosting, or Logistic Regression.  
   - **Features:** team win/loss records, roster turnover, coach tenure, historical coaching changes, performance trends.  
   - **Evaluation metrics:** accuracy, F1-score, precision, recall.  
   - **Insight-driven refinement:** feature importance analysis to determine which variables most influence coaching changes.  

2. **Predict winners of individual awards → Ranking / Regression**
   - **Model type:** Ranking models (e.g., XGBoost ranking) or regression predicting award points.  
   - **Features:** player performance metrics (points, rebounds, assists, efficiency ratings), consistency across seasons, team success.  
   - **Evaluation metrics:** top-3 ranking accuracy, Spearman correlation with actual awards.  
   - **Insight-driven refinement:** analyze correlations between metrics and award likelihood to improve model interpretability.  

3. **Regular-season conference rankings → Regression / Ranking**
   - **Model type:** Regression or ranking models to predict final standings.  
   - **Features:** team statistics (offensive/defensive efficiency, win streaks, home/away performance), player averages, coaching stability.  
   - **Evaluation metrics:** ranking error, RMSE between predicted and actual standings.  
   - **Insight-driven refinement:** incorporate ensemble methods and cross-validation to increase prediction robustness.  

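The evaluation metrics named above can be made concrete with a few lines of standard-library Python. The following is a minimal sketch, not the project's evaluation code: the function names (`precision_recall_f1`, `rmse`, `spearman_no_ties`, `top_k_hit`) are hypothetical, and the Spearman formula used is the shortcut version that is only valid when there are no tied ranks.

```python
from math import sqrt


def precision_recall_f1(y_true, y_pred):
    """Binary metrics for the coach-change classifier (1 = coach changed)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted standings."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


def spearman_no_ties(actual_ranks, predicted_ranks):
    """Spearman rank correlation; assumes both inputs are ranks without ties."""
    n = len(actual_ranks)
    d2 = sum((a - p) ** 2 for a, p in zip(actual_ranks, predicted_ranks))
    return 1 - 6 * d2 / (n * (n * n - 1))


def top_k_hit(ranked_candidates, winner, k=3):
    """True when the actual award winner is among the top-k predictions."""
    return winner in ranked_candidates[:k]
```

RMSE on ranks penalizes a team placed several positions off more heavily than several teams each placed one position off, whereas Spearman correlation summarizes how well the overall ordering is preserved. Reporting both gives a fuller picture of ranking quality.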

## Datasets Reading and Storing

```python
import sys
import os

# Make the parent directory importable so the local data_scripts package resolves.
sys.path.append('..')

from data_scripts import _store_data as sd
from pathlib import Path

sd.read_and_store_data()
sd.save_data(Path("../data"))
```

## Datasets Specification


#### **Dataset "awards_players.csv"**

This dataset records the individual awards given each season:
- **All-Star Game Most Valuable Player**

    Given to the best-performing player in the annual WNBA All-Star Game, as voted on after the game.

- **Coach of the Year**

    Awarded to the head coach who has demonstrated outstanding leadership, team success, and strategic excellence over the season.

- **Defensive Player of the Year**

    Honors the league’s top defender, recognizing excellence in steals, blocks, on-ball defense, and overall defensive impact.

- **Kim Perrot Sportsmanship Award**

    Named after Houston Comets guard Kim Perrot, who passed away from cancer in 1999. It honors the player who exemplifies fairness, respect, and ethical behavior on and off the court.

- **Most Improved Player**

    Recognizes the player who has shown the greatest improvement from the previous season.

- **Most Valuable Player**

    The league’s highest individual honor, awarded annually to the best player of the regular season, typically based on performance, leadership, and impact.

- **Rookie of the Year**

    Awarded to the top-performing first-year player in the WNBA.

- **Sixth Woman of the Year**

    Awarded to the league’s best player coming off the bench (non-starter) who provides a major impact for her team.

- **WNBA Finals Most Valuable Player**

    Awarded to the most outstanding player of the WNBA Finals, given at the end of the championship series.

- **WNBA All-Decade Team**

    Honored the 10 best players of the WNBA's first decade.

- **WNBA All Decade Team Honorable Mention**

    Recognized additional players who made major contributions to the league during its first 10 years but were not part of the main 10-player team.

The WNBA All-Decade Team and All-Decade Team Honorable Mention were awarded only in year 7. Because these are once-per-decade awards, they need to be predicted only when the target season is year 17.

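The decade-award rule can be expressed as a small filter over the list of awards to predict. This is a sketch under assumptions: the award strings are copied from the list above and may not match the values in `awards_players.csv` exactly, and `awards_to_predict` and `DECADE_AWARD_YEARS` are hypothetical names introduced for the example.

```python
# Award names as written in this section; the exact strings in the CSV
# are an assumption and should be checked against the data.
DECADE_AWARDS = {
    "WNBA All-Decade Team",
    "WNBA All Decade Team Honorable Mention",
}

# Year 7 is when the awards were given; year 17 is the next occurrence
# mentioned in the text.
DECADE_AWARD_YEARS = {7, 17}


def awards_to_predict(all_awards, target_year):
    """Return the awards that need a prediction for target_year, in order."""
    if target_year in DECADE_AWARD_YEARS:
        return list(all_awards)
    return [award for award in all_awards if award not in DECADE_AWARDS]
```

For any target season other than year 17, the two decade awards are simply dropped from the prediction task, which avoids training a model on a single positive season.
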
#### **Dataset "coaches.csv"**

The dataset contains information about every WNBA coach over a sampled 10-year period. For each coach, it records the year and the team they coached, along with the number of victories and losses in both the regular season and postseason. It also includes a variable called stint, which indicates the order of coaches for a team within a given year. If a team had only one coach in that season, the stint is recorded as 0. If multiple coaches were assigned, the stint specifies their sequence, with 1 representing the first coach, 2 the second, and so on. 

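As an illustration of how `stint` can be used, the sketch below orders each team's coaches within a season and derives two candidate signals for the coach-change goal. It assumes the rows are already loaded as dictionaries keyed by the `coaches.csv` column names. The function names, and the two definitions of a "change" (mid-season versus between seasons), are assumptions made for the example rather than the project's labeling rule.

```python
from collections import defaultdict


def coaches_by_team_season(rows):
    """Map (year, tmID) to the list of coachIDs ordered by stint."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["year"], row["tmID"])].append((row["stint"], row["coachID"]))
    return {key: [coach for _, coach in sorted(pairs)]
            for key, pairs in grouped.items()}


def midseason_changes(ordered):
    """Team-seasons with more than one coach, i.e. a change during the season."""
    return {key for key, coaches in ordered.items() if len(coaches) > 1}


def offseason_changes(ordered):
    """Team-seasons whose last coach differs from the next season's first coach."""
    changed = set()
    for (year, team), coaches in ordered.items():
        following = ordered.get((year + 1, team))
        if following is not None and following[0] != coaches[-1]:
            changed.add((year, team))
    return changed
```

Sorting by `stint` works for both conventions described above: a lone coach with stint 0 yields a one-element list, and multiple coaches with stints 1, 2, ... come out in chronological order.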

#### **Dataset "players_teams.csv"**

This dataset contains details about all players over a 10-year span, including each player’s professional statistics by team and year. It also includes a ‘player stint’ column, which indicates the order of teams a player joined within a given year: 0 means the player did not change teams that year, 1 represents the first team they played for, 2 the second team, and so on.

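Because a traded player appears once per stint, season-level features usually require summing those rows first. The sketch below shows one way to do that, assuming rows are dictionaries keyed by the `players_teams.csv` column names. `season_totals` and `COUNTING_STATS` are hypothetical names, and only three counting stats are summed to keep the example short.

```python
from collections import defaultdict

# Columns that can be added across stints; extend as needed.
COUNTING_STATS = ("GP", "minutes", "points")


def season_totals(rows, stats=COUNTING_STATS):
    """Sum counting stats per (playerID, year) and list teams in stint order."""
    totals = defaultdict(lambda: {stat: 0 for stat in stats})
    teams = defaultdict(list)
    for row in sorted(rows, key=lambda r: r["stint"]):
        key = (row["playerID"], row["year"])
        teams[key].append(row["tmID"])
        for stat in stats:
            totals[key][stat] += row[stat]
    return {key: {**totals[key], "teams": teams[key]} for key in totals}
```

Rate statistics such as shooting percentages should be recomputed from the summed makes and attempts rather than averaged across stints, since stints can differ greatly in length.
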
#### **Dataset "players.csv"**

Contains personal information about each player: position, first and last season played, height, weight, colleges attended, and dates of birth and death.

#### **Dataset "series_post.csv"**

Contains information about the playoffs of each year: the round, the teams that played in each series, and the results.
In the first 5 years, every round was best of three (Bo3). From the sixth year onward, the first round and the conference finals remained Bo3, while the final became best of five (Bo5).

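The series formats can be encoded directly, which is handy for sanity-checking the `W` and `L` columns. This is a sketch under two assumptions: the best-of-five final applies from year 6 onward, and the caller already knows whether a row is a final (the `is_final` flag is hypothetical, since the round codes used in the file are not listed here).

```python
def wins_to_clinch(year, is_final):
    """Wins needed to take a series under the formats described above."""
    best_of = 5 if (is_final and year >= 6) else 3
    return best_of // 2 + 1


def series_is_consistent(year, is_final, winner_wins, winner_losses):
    """Check that a recorded series result is possible for its format."""
    needed = wins_to_clinch(year, is_final)
    return winner_wins == needed and 0 <= winner_losses < needed
```

A row that fails this check would point either to a data-entry problem or to a season whose format differs from the description, so it is worth inspecting before modeling.
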
#### **Dataset "teams_post.csv"**

Contains each team's playoff results for each year, given as total postseason wins and losses.

#### **Dataset "teams.csv"**

Contains detailed team-level statistics for WNBA teams over multiple seasons. For each team in a given year, it records identifiers such as the franchise, conference, and division, along with the team’s rank and playoff information, including whether they made the playoffs, their seed, and results in the first round, semifinals, and finals. It also includes team performance statistics for both offense and defense. Additionally, the dataset tracks overall wins, losses, games played, home and away records, and conference records. Finally, it provides supplementary information including total minutes played, attendance, and home arena.

## Database Dictionary

### awards_players.csv
#### Player Awards Dataset

| Column Name | Type | Description |
|-------------|------|-------------|
| playerID | String | Unique identifier for the player |
| award | String | Name or type of award received |
| year | Integer | Year the award was received |
| lgID | String | League identifier (e.g., WNBA) |


### coaches.csv
#### Coaches Performance Dataset

| Column Name | Type | Description |
|-------------|------|-------------|
| coachID | String | Unique identifier for the coach |
| year | Integer | Season year |
| tmID | String | Team identifier |
| lgID | String | League identifier |
| stint | Integer | Order of the coach within the team's season (0 if the team had a single coach that year) |
| won | Integer | Regular season games won |
| lost | Integer | Regular season games lost |
| post_wins | Integer | Postseason games won |
| post_losses | Integer | Postseason games lost |


### players_teams.csv
#### Player Statistics by Team and Season

| Column Name | Type | Description |
|-------------|------|-------------|
| playerID | String | Unique identifier for the player |
| year | Integer | Season year |
| stint | Integer | Order of the teams the player played for within the season (0 if the player did not change teams) |
| tmID | String | Team identifier |
| lgID | String | League identifier |
| GP | Integer | Games Played |
| GS | Integer | Games Started |
| minutes | Integer | Total minutes played |
| points | Integer | Total points scored |
| oRebounds | Integer | Offensive Rebounds |
| dRebounds | Integer | Defensive Rebounds |
| rebounds | Integer | Total Rebounds |
| assists | Integer | Assists |
| steals | Integer | Steals |
| blocks | Integer | Blocks |
| turnovers | Integer | Turnovers |
| PF | Integer | Personal Fouls |
| fgAttempted | Integer | Field Goals Attempted |
| fgMade | Integer | Field Goals Made |
| ftAttempted | Integer | Free Throws Attempted |
| ftMade | Integer | Free Throws Made |
| threeAttempted | Integer | Three-Point Field Goals Attempted |
| threeMade | Integer | Three-Point Field Goals Made |
| dq | Integer | Disqualifications |
| PostGP | Integer | Postseason Games Played |
| PostGS | Integer | Postseason Games Started |
| PostMinutes | Integer | Postseason Minutes Played |
| PostPoints | Integer | Postseason Points Scored |
| PostoRebounds | Integer | Postseason Offensive Rebounds |
| PostdRebounds | Integer | Postseason Defensive Rebounds |
| PostRebounds | Integer | Postseason Total Rebounds |
| PostAssists | Integer | Postseason Assists |
| PostSteals | Integer | Postseason Steals |
| PostBlocks | Integer | Postseason Blocks |
| PostTurnovers | Integer | Postseason Turnovers |
| PostPF | Integer | Postseason Personal Fouls |
| PostfgAttempted | Integer | Postseason Field Goals Attempted |
| PostfgMade | Integer | Postseason Field Goals Made |
| PostftAttempted | Integer | Postseason Free Throws Attempted |
| PostftMade | Integer | Postseason Free Throws Made |
| PostthreeAttempted | Integer | Postseason Three-Point Field Goals Attempted |
| PostthreeMade | Integer | Postseason Three-Point Field Goals Made |
| PostDQ | Integer | Postseason Disqualifications |

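The made/attempted pairs in this table are typically turned into rate features before modeling. The sketch below derives a few of them from a single row, assuming the row is a dictionary keyed by the column names above. `ratio_or_none` and `shooting_splits` are hypothetical helpers, and returning `None` when there are no attempts is a design choice for the example, not something prescribed by the dataset.

```python
def ratio_or_none(numerator, denominator):
    """Divide, returning None instead of failing when the denominator is 0."""
    return numerator / denominator if denominator else None


def shooting_splits(row):
    """Per-game and percentage features for one players_teams row."""
    return {
        "fg_pct": ratio_or_none(row["fgMade"], row["fgAttempted"]),
        "ft_pct": ratio_or_none(row["ftMade"], row["ftAttempted"]),
        "three_pct": ratio_or_none(row["threeMade"], row["threeAttempted"]),
        "points_per_game": ratio_or_none(row["points"], row["GP"]),
        "minutes_per_game": ratio_or_none(row["minutes"], row["GP"]),
    }
```

Keeping `None` for players with zero attempts distinguishes "never shot" from "shot and missed everything", which a value of 0.0 would conflate.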

### players.csv
#### Player Biographical Information

| Column Name | Type | Description |
|-------------|------|-------------|
| bioID | String | Unique biographical identifier for the player |
| pos | String | Position (G=Guard, F=Forward, C=Center) |
| firstseason | Integer | First season played |
| lastseason | Integer | Last season played |
| height | Float | Height in inches |
| weight | Float | Weight in pounds |
| college | String | College or university attended |
| collegeOther | String | Other colleges attended |
| birthDate | Date | Date of birth |
| deathDate | Date | Date of death (if applicable) |


### series_post.csv
#### Postseason Series Results

| Column Name | Type | Description |
|-------------|------|-------------|
| year | Integer | Season year |
| round | String | Playoff round (Finals, Conference Finals, etc.) |
| series | String | Series identifier |
| tmIDWinner | String | Winning team identifier |
| lgIDWinner | String | Winning team's league identifier |
| tmIDLoser | String | Losing team identifier |
| lgIDLoser | String | Losing team's league identifier |
| W | Integer | Games won by winning team in series |
| L | Integer | Games lost by winning team in series |


### teams_post.csv
#### Team Postseason Records

| Column Name | Type | Description |
|-------------|------|-------------|
| year | Integer | Season year |
| tmID | String | Team identifier |
| lgID | String | League identifier |
| W | Integer | Postseason Wins |
| L | Integer | Postseason Losses |


### teams.csv
#### Team Statistics and Information

| Column Name | Type | Description |
|-------------|------|-------------|
| year | Integer | Season year |
| lgID | String | League identifier |
| tmID | String | Team identifier |
| franchID | String | Franchise identifier |
| confID | String | Conference identifier |
| divID | String | Division identifier |
| rank | Integer | Final standing/rank |
| playoff | String | Made playoffs (Y/N) |
| seeded | Integer | Playoff seed number |
| firstRound | String | Advanced past first round (Y/N) |
| semis | String | Advanced to conference semifinals (Y/N) |
| finals | String | Advanced to finals (Y/N) |
| name | String | Team name |
| o_fgm | Integer | Team Field Goals Made |
| o_fga | Integer | Team Field Goals Attempted |
| o_ftm | Integer | Team Free Throws Made |
| o_fta | Integer | Team Free Throws Attempted |
| o_3pm | Integer | Team Three-Pointers Made |
| o_3pa | Integer | Team Three-Pointers Attempted |
| o_oreb | Integer | Team Offensive Rebounds |
| o_dreb | Integer | Team Defensive Rebounds |
| o_reb | Integer | Team Total Rebounds |
| o_asts | Integer | Team Assists |
| o_pf | Integer | Team Personal Fouls |
| o_stl | Integer | Team Steals |
| o_to | Integer | Team Turnovers |
| o_blk | Integer | Team Blocks |
| o_pts | Integer | Team Points Scored |
| d_fgm | Integer | Opponent Field Goals Made |
| d_fga | Integer | Opponent Field Goals Attempted |
| d_ftm | Integer | Opponent Free Throws Made |
| d_fta | Integer | Opponent Free Throws Attempted |
| d_3pm | Integer | Opponent Three-Pointers Made |
| d_3pa | Integer | Opponent Three-Pointers Attempted |
| d_oreb | Integer | Opponent Offensive Rebounds |
| d_dreb | Integer | Opponent Defensive Rebounds |
| d_reb | Integer | Opponent Total Rebounds |
| d_asts | Integer | Opponent Assists |
| d_pf | Integer | Opponent Personal Fouls |
| d_stl | Integer | Opponent Steals |
| d_to | Integer | Opponent Turnovers |
| d_blk | Integer | Opponent Blocks |
| d_pts | Integer | Opponent Points Scored |
| tmORB | Integer | Team Offensive Rebounds (adjusted) |
| tmDRB | Integer | Team Defensive Rebounds (adjusted) |
| tmTRB | Integer | Team Total Rebounds (adjusted) |
| opptmORB | Integer | Opponent Team Offensive Rebounds (adjusted) |
| opptmDRB | Integer | Opponent Team Defensive Rebounds (adjusted) |
| opptmTRB | Integer | Opponent Team Total Rebounds (adjusted) |
| won | Integer | Games Won |
| lost | Integer | Games Lost |
| GP | Integer | Games Played |
| homeW | Integer | Home Wins |
| homeL | Integer | Home Losses |
| awayW | Integer | Away Wins |
| awayL | Integer | Away Losses |
| confW | Integer | Conference Wins |
| confL | Integer | Conference Losses |
| min | Integer | Total Minutes Played by All Team's Players (time played by the team * 5) |
| attend | Integer | Total Attendance |
| arena | String | Arena Name |

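A few team-level features follow directly from the columns above. The sketch below assumes a `teams.csv` row is available as a dictionary keyed by those column names. `team_features` and `record_is_consistent` are hypothetical helpers, and the consistency rules encode an expectation about how the record columns relate, which should be verified against the actual data.

```python
def team_features(row):
    """Derive simple rate features from one teams.csv row."""
    games = row["GP"]
    home_games = row["homeW"] + row["homeL"]
    return {
        "win_pct": row["won"] / games if games else None,
        "point_diff_per_game": (row["o_pts"] - row["d_pts"]) / games if games else None,
        "home_win_pct": row["homeW"] / home_games if home_games else None,
    }


def record_is_consistent(row):
    """Expected relations between overall, home, and away records."""
    return (
        row["won"] + row["lost"] == row["GP"]
        and row["homeW"] + row["awayW"] == row["won"]
        and row["homeL"] + row["awayL"] == row["lost"]
    )
```

Point differential per game combines the offensive (`o_pts`) and defensive (`d_pts`) totals into a single number, which makes it a convenient baseline feature for the ranking goal.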

### Abbreviations Guide

| Abbreviation | Full Term |
|--------------|-----------|
| GP | Games Played |
| GS | Games Started |
| FG | Field Goal |
| FT | Free Throw |
| 3P / Three | Three-Point Field Goal |
| PF | Personal Fouls |
| DQ | Disqualifications |
| ORB / oReb | Offensive Rebounds |
| DRB / dReb | Defensive Rebounds |
| TRB | Total Rebounds |
| AST / ASTS | Assists |
| STL | Steals |
| BLK | Blocks |
| TO | Turnovers |
| PTS | Points |
| Post | Postseason |
| o_ | Team Offensive Stats |
| d_ | Opponent/Defensive Stats |
| tm | Team |
| opptm | Opponent Team |
| lgID | League Identifier |
| tmID | Team Identifier |
| confID | Conference Identifier |
| divID | Division Identifier |
| franchID | Franchise Identifier |
| bioID | Biographical Identifier |
| W | Wins |
| L | Losses |