## **Analyzing "player_teams"**

The dataset contains season-by-season statistics for WNBA players, including both **regular season** and **postseason** performance. The dataset can be analyzed to uncover trends in player performance, efficiency, and contribution to team success.

The analysis will focus on exploring:

- **Player Activity Metrics**: Track games played, games started, and minutes to gauge player involvement.
- **Overall Performance per Game**: Summarize key statistics to evaluate a player’s general performance.
- **Offensive Performance per Game**: Analyze scoring metrics, including points, field goals, free throws, and three-pointers.  
- **Defensive Performance per Game**: Assess defensive contributions, including rebounds, steals, blocks, and personal fouls.    
- **Overall Performance per Minute**: Evaluate efficiency on a per-minute basis, useful for awards like SPOTY. 
- **Correlations between Player and Teammates' Statistics**: Investigate whether stronger teammates contribute to improved individual performance.



---

To analyze a player’s **overall performance**, we use the following formula:

$$
\textbf{Performance Score per Game}
$$
$$ = \left(\frac{PTS}{GP} + \frac{REB}{GP} \times 1.2 + \frac{AST}{GP} \times 1.5 + \frac{STL}{GP} \times 3 + \frac{BLK}{GP} \times 2\right) - \left(\frac{TOV}{GP} \times 2\right) + \left(\frac{FGM}{FGA} \times 10\right)
$$

This metric combines points, rebounds, assists, steals, blocks, turnovers, and shooting efficiency into a single value.
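
As a minimal sketch, the formula can be written as a plain Python function. The function and argument names are illustrative and are not taken from the project's `players_teams_data` module; the zero-division guards are also an assumption about how empty rows should be handled.

```python
def performance_score_per_game(pts, reb, ast, stl, blk, tov, fgm, fga, gp):
    """Performance Score per Game from season totals (illustrative helper)."""
    if gp == 0:
        return 0.0  # assumption: a player with no games gets a score of 0
    # Assumption: no field-goal attempts means 0 shooting efficiency.
    fg_pct = fgm / fga if fga else 0.0
    positive = (pts + 1.2 * reb + 1.5 * ast + 3 * stl + 2 * blk) / gp
    negative = 2 * tov / gp
    return positive - negative + 10 * fg_pct


# Made-up season totals over 34 games (not real data).
score = performance_score_per_game(
    pts=510, reb=170, ast=102, stl=34, blk=17, tov=68, fgm=180, fga=400, gp=34
)
print(round(score, 2))  # 30.0
```

With these made-up totals the counting stats contribute 29.5, turnovers subtract 4.0, and the 45% shooting adds 4.5.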

---

To analyze a player’s **offensive performance**, we use the following formula:

$$
\textbf{Offense Score}
$$
$$ = \frac{PTS}{GP} + 1.5 \times \frac{AST}{GP} + 10 \times FG\% - 2 \times \frac{TOV}{GP}
$$

This focuses on scoring, assists, shooting efficiency, and turnover management.

---

To analyze a player’s **defensive performance**, we use the following formula:

$$
\textbf{Defense Score}
$$
$$ = 1.2 \times \frac{REB}{GP} + 3 \times \frac{STL}{GP} + 2 \times \frac{BLK}{GP} - 1 \times \frac{PF}{GP}
$$

This considers rebounds, steals, blocks, and personal fouls.
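
The offense and defense scores can be sketched in the same way. Function and argument names are illustrative, and the code assumes `gp > 0`. Comparing the three formulas term by term, the overall score equals Offense + Defense + PF/GP, because personal fouls are penalized only in the defense score; the last lines check this on made-up season totals.

```python
def offense_score(pts, ast, tov, fgm, fga, gp):
    """Offense Score from season totals (illustrative helper, assumes gp > 0)."""
    fg_pct = fgm / fga if fga else 0.0  # assumption: no attempts -> 0 efficiency
    return pts / gp + 1.5 * ast / gp + 10 * fg_pct - 2 * tov / gp


def defense_score(reb, stl, blk, pf, gp):
    """Defense Score from season totals (illustrative helper, assumes gp > 0)."""
    return 1.2 * reb / gp + 3 * stl / gp + 2 * blk / gp - pf / gp


# Made-up season totals over 34 games (not real data).
gp, pf = 34, 68
off = offense_score(pts=510, ast=102, tov=68, fgm=180, fga=400, gp=gp)
dfn = defense_score(reb=170, stl=34, blk=17, pf=pf, gp=gp)

# Overall score written out directly from the Performance Score formula.
overall = (510 + 1.2 * 170 + 1.5 * 102 + 3 * 34 + 2 * 17) / gp - 2 * 68 / gp + 10 * 180 / 400

print(off, dfn)  # 20.0 8.0
print(abs(overall - (off + dfn + pf / gp)) < 1e-9)  # True
```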

---

**Abbreviations**

- **PTS** – Points scored  
- **REB** – Total rebounds  
- **AST** – Assists  
- **STL** – Steals  
- **BLK** – Blocks  
- **TOV** – Turnovers  
- **FGM** – Field goals made  
- **FGA** – Field goals attempted  
- **FG%** – Field goal percentage (FGM / FGA)  
- **PF** – Personal fouls  
- **GP** – Games played  

All counting statistics are season totals.

**Explanation**

These formulas provide a consistent **performance per game** metric even when only season-level data is available. A higher **Performance Score per Game** indicates greater overall effectiveness and efficiency on the court.

**Note**:
To calculate a player's overall performance per minute, we replace "Games Played" (GP) with "Minutes Played" in the formulas.
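
A minimal sketch of this substitution makes the denominator a parameter. The names are illustrative, and the sketch assumes the shooting-efficiency term is left unchanged, since it does not depend on GP in the formula.

```python
def performance_score(pts, reb, ast, stl, blk, tov, fgm, fga, denom):
    """Performance score with a configurable denominator (GP or minutes)."""
    if denom == 0:
        return 0.0  # assumption: no games/minutes played gives a score of 0
    fg_pct = fgm / fga if fga else 0.0
    counting = pts + 1.2 * reb + 1.5 * ast + 3 * stl + 2 * blk - 2 * tov
    return counting / denom + 10 * fg_pct


# Made-up season totals (not real data).
totals = dict(pts=510, reb=170, ast=102, stl=34, blk=17, tov=68, fgm=180, fga=400)
per_game = performance_score(**totals, denom=34)      # GP = 34
per_minute = performance_score(**totals, denom=1020)  # minutes = 1020

print(round(per_game, 2), round(per_minute, 2))  # 30.0 5.35
```

Because minutes are much larger than games played, the counting-stat part shrinks and the shooting-efficiency term carries most of the per-minute score.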


### **Introduction to the Dataset**

This section provides a brief analysis of the dataset, highlighting its key metrics and characteristics.

In [3]:
import sys
from pathlib import Path

# Make the parent directory importable so the `data_scripts` package resolves.
sys.path.append('..')

from data_scripts import _store_data as sd
from data_scripts import players_teams_data as ptd

# Load the dataset and summarize each column of the players-teams table.
sd.load_data(Path("../data"))
display(sd.df_info_table(sd.players_teams_df))

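| Column | Non-Null Count | Null Count | Missing % | Dtype | Unique Values |
| :--- | :--- | :--- | :--- | :--- | :--- |
| playerID | 1876 | 0 | 0.0 | object | 555 |
| year | 1876 | 0 | 0.0 | int64 | 10 |
| stint | 1876 | 0 | 0.0 | int64 | 4 |
| tmID | 1876 | 0 | 0.0 | object | 20 |
| lgID | 1876 | 0 | 0.0 | object | 1 |
| GP | 1876 | 0 | 0.0 | int64 | 34 |
| GS | 1876 | 0 | 0.0 | int64 | 35 |
| minutes | 1876 | 0 | 0.0 | int64 | 899 |
| points | 1876 | 0 | 0.0 | int64 | 530 |
| oRebounds | 1876 | 0 | 0.0 | int64 | 111 |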


As shown in this **table**, there are **no null or missing values** in any of the columns, so the dataset needs **no imputation or missing-value correction**.  

The dataset covers a **$10$-year period** and includes records for **$555$ players** playing for **$20$ different teams**.
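
The implementation of `sd.df_info_table` is not shown here. As a rough sketch under that caveat, a comparable per-column summary could be built with pandas as follows; `info_table` and the toy frame are hypothetical and only illustrate the idea.

```python
import pandas as pd


def info_table(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary similar in spirit to the table above (illustrative)."""
    return pd.DataFrame(
        {
            "Non-Null Count": df.notna().sum(),
            "Null Count": df.isna().sum(),
            "Missing %": (df.isna().mean() * 100).round(2),
            "Dtype": df.dtypes.astype(str),
            "Unique Values": df.nunique(),  # NaN is not counted as a value
        }
    )


# Tiny made-up frame with one missing value, not the real dataset.
toy = pd.DataFrame({"playerID": ["a", "b", "a"], "GP": [30, 34, None]})
summary = info_table(toy)
print(summary)
```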


### **Cleaning**

#### Dropping Columns with a Single Unique Value

In [4]:
del sd.players_teams_df['lgID']

Since the `lgID` column contains only one unique value, it carries no information and can be removed without affecting the analysis.
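
The same check can be generalized to any constant column. The helper below is a hypothetical sketch, not part of the project's scripts, and the toy frame stands in for the real data.

```python
import pandas as pd


def drop_constant_columns(df: pd.DataFrame):
    """Return a copy without columns that hold a single distinct value."""
    constant = [col for col in df.columns if df[col].nunique(dropna=False) <= 1]
    return df.drop(columns=constant), constant


# Made-up frame: `lgID` is constant, the other columns vary.
toy = pd.DataFrame(
    {"playerID": ["a", "b", "c"], "lgID": ["WNBA"] * 3, "GP": [30, 34, 12]}
)
cleaned, dropped = drop_constant_columns(toy)
print(dropped)                # ['lgID']
print(list(cleaned.columns))  # ['playerID', 'GP']
```

Unlike `del`, `DataFrame.drop` returns a new frame, so the original is left untouched unless it is reassigned.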

### **EDA**

#### Performance Classification

To classify each player's overall performance, the analysis considers the performances of all players over the 10-year period. The average performance across all players during this time serves as the basis for comparison.

In [5]:
ptd.average_players_perfomance()

The **Violin Plot** is the most valuable chart here: it displays the **full distribution** (density, median, and range) of player performance for each of the 10 seasons, which makes it well suited to comparing spread and density between seasons.

* **Stability:** The shape and spread of the violins are **remarkably stable** across all 10 seasons. This indicates that the **overall talent distribution** and league balance have remained consistent, with no major shift towards a "better" or "worse" league talent pool.
* **Median:** The median performance (the line within the box) sits consistently between **13 and 15**, defining this as the typical performance level for an average player.


**Talent Tiers (Density Analysis)**

With this plot, it is now possible to classify a player's overall performance in relation to other players.

| Performance Tier | Score Range | Density/Rarity | Interpretation |
| :--- | :--- | :--- | :--- |
| **Bad Players** | Below 10 | Highest density near the bottom. | Low-minute players, bench depth, or low-impact contributors. |
| **Average Core** | 10–25 | **Widest section** of the violin. | The majority of the league; consistent rotation players and average starters. |
| **Good Players** | 25–40 | Density narrows significantly. | High-impact starters and team leaders. |
| **Great Players** | Above 40 | Rare; represented by the thin tips. | **Elite, All-Star caliber** performers; only a handful each season. |

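The tier table can be turned into a small lookup, sketched below. The names are illustrative, and assigning a score that falls exactly on a boundary to the higher tier is an assumption, since the table does not say which side the cut points belong to.

```python
from bisect import bisect_right

# Cut points between tiers, read from the overall-performance table above.
OVERALL_BOUNDS = [10, 25, 40]
TIERS = ["Bad", "Average", "Good", "Great"]


def classify(score, bounds=OVERALL_BOUNDS):
    """Map a score to a tier; boundary values go to the higher tier (assumption)."""
    return TIERS[bisect_right(bounds, score)]


print(classify(9.9))   # Bad
print(classify(10))    # Average
print(classify(30.0))  # Good
print(classify(45))    # Great
```

The same helper works with any other list of three cut points passed as `bounds`.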

The **Histogram Plot** aggregates the performance scores of all player-seasons, regardless of the season. It shows the **overall frequency** and **shape** of the performance variable and confirms the **right-skewness** of the data, reinforcing that elite scores are outliers.

The distribution is heavily **right-skewed** (a long tail to the right). The **mode (peak)** is low, around **8–10**. This is typical for sports data: in any given season, most players post a low-to-average score due to limited minutes or roles, while high scores (e.g., above 30 or 40) are statistically **rare events**.

The **Bar Chart** shows the simple **mean** performance score for all players in each season. It highlights this **summary statistic** over time and is best for quickly spotting the subtle **upward trend** in average performance across the years.

There is a **slight, steady upward trend** in the mean performance, increasing from approximately **15.0 (Season 1) to 17.5 (Season 10)**. This suggests players, on average, are becoming marginally **more efficient** or scoring slightly higher over the decade, even though the overall talent *distribution* remains the same.


#### Offensive and Defensive Performance Classification

In [6]:
ptd.off_def_players_perfomance()

The **Violin Plots** provide a comprehensive view of the **full distribution** of player performance across all 10 seasons for both offensive and defensive metrics. These visualizations are excellent for comparing the **spread, concentration, and consistency** of performance distributions over time.

**Talent Tiers: Offensive Performance**

| Performance Tier | Score Range | Density/Rarity | Interpretation |
|:---|:---|:---|:---|
| **Bad Players** | Below 5 | Low density at bottom | Limited offensive role players, minimal scoring/playmaking impact |
| **Average Core** | 5–17 | **Widest section** of the violin | The bulk of the league; solid rotation players and average starters |
| **Good Players** | 17–25 | Density narrows considerably | High-impact offensive contributors and scoring leaders |
| **Great Players** | Above 25 | Rare; thin violin tips | **Elite offensive weapons**; All-Star caliber scorers and playmakers |

**Talent Tiers: Defensive Performance**

| Performance Tier | Score Range | Density/Rarity | Interpretation |
|:---|:---|:---|:---|
| **Bad Players** | Below 2 | Highest density at bottom | Defensive liabilities or players with minimal defensive responsibilities |
| **Average Core** | 2–10 | **Widest section** of the violin | Competent defenders who fulfill their role consistently |
| **Good Players** | 10–14 | Density narrows significantly | Strong defensive anchors and difference-makers |
| **Great Players** | Above 14 | Extremely rare; thin tips | **Elite defensive specialists**; Defensive Player of the Year candidates |

The **offense shows greater differentiation** between elite and average players (wider range, higher median, more pronounced elite tier), while **defense shows more compression** (narrower range, lower median, tighter clustering). This suggests that offensive talent may be more diverse or easier to measure distinctly, whereas defensive contributions tend to be more uniform across the player population. Both metrics demonstrate **league-wide stability** with no significant talent inflation or deflation over the 10-season span.

#### Player–Teammates Performance Correlation

In [7]:
ptd.player_teammates_corr()

Average correlation with teammates (across seasons): -0.2118


#### Player–Teammates Performance Correlation Analysis

This analysis explores how an **individual player's yearly performance** correlates with the **average performance of their teammates** over multiple seasons.  

Each observation in the **histogram** represents one player, and its value reflects how closely that player's performance tends to **move with the rest of their team** across years.

- **Negative correlations** (most common) suggest that when a player performs **better**, their teammates’ performances tend to **drop slightly**.  
  This pattern is consistent with **high-usage or star players** who **dominate team possessions**: when they take over games, teammates’ opportunities **decrease**.  

- **Positive correlations** indicate players whose performance **improves when the entire team performs well**.  
  These are often **system players** or **role players** who **thrive in cohesive, well-functioning teams**.

The overall distribution leans toward **negative values** (the average correlation is about **-0.21**), implying that **individual excellence** often comes at a **small cost** to teammates’ **statistical production**.  
However, a few players exhibit **strong positive correlations**, meaning their **success** is strongly linked with their teams' **collective success**.
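
How `ptd.player_teammates_corr()` computes its values is not shown here. One plausible reading, sketched below as an assumption, is a Pearson correlation computed per player between their yearly score and their teammates' mean yearly score. The helper name and the numbers are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (illustrative)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant series
    return cov / sqrt(var_x * var_y)


# Made-up example: the player improves each year while teammates decline.
player_scores = [18.0, 20.0, 22.0, 24.0]
teammate_means = [16.0, 15.0, 14.0, 13.0]
r = pearson(player_scores, teammate_means)
print(round(r, 4))  # -1.0
```

A player needs at least two seasons, and some variation in both series, for the value to be defined.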


#### Performance per Minute

In [8]:
ptd.perf_per_min()

Most players’ **average performance per season** falls between **$3$ and $7$ points**, corresponding to the **Average** and **Good** tiers. Very few players are in the **Bad (<3)** or **Great (>7)** ranges, making them **clear outliers**. The **densest region** is roughly **$4$–$6$ points**, showing that while most players cluster around the **middle**, there is still some **diversity** each season.  

The **distribution of player performance** is **consistent across all 10 seasons**, with no major **spikes** or **drops** in **median values**. **Outliers** are easy to identify: some players achieve **elite performance levels** (>7 points, up to ~12), while others fall **below 1–2 points**, likely representing **rookies**, **limited minutes**, or **very poor performances**.  

The **colored performance bands** provide context for evaluating **relative player performance** across the league.


#### Total Games Played vs. Games Started

In [9]:
ptd.gs_gp()

Players on average **play in almost every game** each season, but they **start in about half** of them. **Participation** is consistent over the **10-year span**, with slight **peaks** in both **playing** and **starting** in **years 4, 5, and 10**.  

The large gap between **GP** and **GS** highlights that many players are often **substitutes** rather than **regular starters**.


#### Average Minutes Played

In [10]:
ptd.avg_mins()

The **average minutes played** consistently remain **high**, generally fluctuating between approximately **$475$ and $525$ minutes**. The **lowest average** appears in **Year 3**, around **$475$ minutes**, while the **highest** is in **Year 10**, exceeding **$525$ minutes**.  

**Years 4, 5, 7, and 8** also show **high averages**, clustering around the **$500$–$525$ minute** mark. Overall, the data suggests a **stable and sustained high level** of average minutes played, with a **slight upward trend** observed towards the **final year**.


### **EDA Conclusions**

The **data analysis** has produced several findings that feed directly into the subsequent **prediction phase**.  

A key finding is a weak **inverse relationship** between a player's performance and that of their teammates (average correlation of about **-0.21**): having **exceptional teammates** may **obscure a player's individual performance**, whereas a **weaker team environment** often highlights a central **"carry" figure**.  

The resulting **player performance classification** is essential for accurately **forecasting individual award recipients**. Moreover, assessing the **average team performance metrics** will directly support the task of **predicting final team rankings**.  

**Visual analysis** of the plots confirmed the existence of **elite star players** whose **overall performance score** exceeds **$40$** under our **custom formulas**.


In [11]:
sd.save_data(Path("../data"))