# NBA Player Performance Dynamics: Data Exploration

## Introduction & Project Overview

This notebook introduces an approach to NBA analytics based on dynamical systems theory. Traditional basketball analytics typically focus on averages and aggregates, which can miss the game-to-game patterns that shape player value and team fit. By modeling player performance as a dynamical system, we aim to extract insights about consistency, adaptability, and impact that summary statistics do not capture.

### Key Research Questions

1. **Performance Stability**: How can we quantify the game-to-game consistency of player performance beyond simple variance metrics?
2. **Team Styles**: Can we identify distinct playing styles and tactical patterns across NBA teams?
3. **Player-Team Fit**: How do players perform across different team systems, and what determines optimal fit?
4. **Teammate Influence**: How can we map and quantify the network of teammate interactions and influences?
5. **Performance Prediction**: Can dynamical systems metrics better predict future performance than traditional statistics?

### Value Over Traditional Analytics

This approach is intended to offer several advantages over traditional basketball analytics:

- **Deeper Performance Understanding**: Captures the dynamics of performance, not just static averages
- **System Compatibility**: Quantifies how players perform in different team contexts
- **Hidden Value Identification**: Helps reveal undervalued players with favorable stability profiles
- **Roster Construction Insights**: Provides a framework for balancing stability and volatility
- **Predictive Power**: Aims to improve forecasts of future performance by modeling the underlying system
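
To make the difference between static averages and dynamics concrete, the sketch below builds two hypothetical scoring sequences that contain the same values in a different order, so their mean and standard deviation are identical, and compares their lag-1 autocorrelation. The data and the helper name `lag1_autocorr` are illustrative assumptions and are not part of the project's modules.

```python
import numpy as np
import pandas as pd

def lag1_autocorr(series: pd.Series) -> float:
    """Lag-1 autocorrelation: correlation between each game and the next one."""
    return series.autocorr(lag=1)

# Hypothetical points totals: same values, different ordering
streaky = pd.Series([10, 12, 14, 16, 18, 20, 22, 24, 26, 28])      # steady drift
alternating = pd.Series([10, 28, 12, 26, 14, 24, 16, 22, 18, 20])  # game-to-game swings

for name, s in [("streaky", streaky), ("alternating", alternating)]:
    print(f"{name}: mean={s.mean():.1f}, std={s.std():.2f}, lag1={lag1_autocorr(s):.2f}")
```

Both sequences look identical to a mean-and-variance summary, yet the first has strongly positive autocorrelation and the second strongly negative. This ordering information is what the later temporal features try to capture.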

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys
from datetime import datetime

# Add the project root to the path so we can import our modules
sys.path.append('..')

# Import our data processing module
from src.data_processing import load_data, preprocess_data, create_temporal_features
from src.utils import setup_plotting_style

# Set up plotting style
setup_plotting_style()

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

## Data Loading & Schema Exploration

We'll be working with four primary datasets:
1. **Teams**: Information about NBA teams
2. **Players**: Information about NBA players
3. **Games**: Game-level statistics for teams
4. **Player Games**: Individual player performance in each game

Let's load these datasets and explore their structure.

In [None]:
# Load the data
teams, players, games, player_games = load_data()

# Stop early if any dataset failed to load, so later cells do not fail with a confusing error
if teams is None or players is None or games is None or player_games is None:
    raise RuntimeError("Error loading data. Please check the data directory.")

### Teams Dataset

In [None]:
# Examine the teams dataset
print("Teams dataset shape:", teams.shape)
teams.head()

In [None]:
# Data dictionary for teams dataset
teams_dict = {
    'id': 'Unique team identifier',
    'full_name': 'Full team name (e.g., "Los Angeles Lakers")',
    'abbreviation': 'Team abbreviation (e.g., "LAL")',
    'nickname': 'Team nickname (e.g., "Lakers")',
    'city': 'Team city location',
    'state': 'Team state location',
    'year_founded': 'Year the team was founded'
}

# Display data dictionary
pd.DataFrame(list(teams_dict.items()), columns=['Column', 'Description'])

### Players Dataset

In [None]:
# Examine the players dataset
print("Players dataset shape:", players.shape)
players.head()

In [None]:
# Data dictionary for players dataset
players_dict = {
    'id': 'Unique player identifier',
    'full_name': 'Player full name',
    'first_name': 'Player first name',
    'last_name': 'Player last name',
    'is_active': 'Whether the player is currently active',
    'position': 'Player position (e.g., "G", "F", "C")',
    'height': 'Player height in feet-inches',
    'weight': 'Player weight in pounds',
    'birth_date': 'Player birth date',
    'college': 'Player college (if applicable)',
    'country': 'Player country of origin',
    'draft_year': 'Year player was drafted',
    'draft_round': 'Draft round',
    'draft_number': 'Draft pick number'
}

# Display data dictionary
pd.DataFrame(list(players_dict.items()), columns=['Column', 'Description'])

### Games Dataset

In [None]:
# Examine the games dataset
print("Games dataset shape:", games.shape)
games.head()

In [None]:
# Data dictionary for games dataset
games_dict = {
    'Game_ID': 'Unique game identifier',
    'GAME_DATE': 'Date of the game',
    'MATCHUP': 'Teams playing (e.g., "LAL vs. BOS")',
    'WL': 'Win or Loss ("W" or "L")',
    'Team_ID': 'Team identifier',
    'PTS': 'Points scored',
    'FGM': 'Field goals made',
    'FGA': 'Field goals attempted',
    'FG_PCT': 'Field goal percentage',
    'FG3M': '3-point field goals made',
    'FG3A': '3-point field goals attempted',
    'FG3_PCT': '3-point field goal percentage',
    'FTM': 'Free throws made',
    'FTA': 'Free throws attempted',
    'FT_PCT': 'Free throw percentage',
    'OREB': 'Offensive rebounds',
    'DREB': 'Defensive rebounds',
    'REB': 'Total rebounds',
    'AST': 'Assists',
    'STL': 'Steals',
    'BLK': 'Blocks',
    'TOV': 'Turnovers',
    'PF': 'Personal fouls',
    'PLUS_MINUS': 'Plus-minus score'
}

# Display data dictionary
pd.DataFrame(list(games_dict.items()), columns=['Column', 'Description'])

### Player Games Dataset

In [None]:
# Examine the player games dataset
print("Player games dataset shape:", player_games.shape)
player_games.head()

In [None]:
# Data dictionary for player games dataset
player_games_dict = {
    'Game_ID': 'Unique game identifier',
    'GAME_DATE': 'Date of the game',
    'MATCHUP': 'Teams playing (e.g., "LAL vs. BOS")',
    'WL': 'Win or Loss ("W" or "L")',
    'Player_ID': 'Player identifier',
    'MIN': 'Minutes played',
    'PTS': 'Points scored',
    'FGM': 'Field goals made',
    'FGA': 'Field goals attempted',
    'FG_PCT': 'Field goal percentage',
    'FG3M': '3-point field goals made',
    'FG3A': '3-point field goals attempted',
    'FG3_PCT': '3-point field goal percentage',
    'FTM': 'Free throws made',
    'FTA': 'Free throws attempted',
    'FT_PCT': 'Free throw percentage',
    'OREB': 'Offensive rebounds',
    'DREB': 'Defensive rebounds',
    'REB': 'Total rebounds',
    'AST': 'Assists',
    'STL': 'Steals',
    'BLK': 'Blocks',
    'TOV': 'Turnovers',
    'PF': 'Personal fouls',
    'PLUS_MINUS': 'Plus-minus score'
}

# Display data dictionary
pd.DataFrame(list(player_games_dict.items()), columns=['Column', 'Description'])
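
The data dictionaries above are written by hand, so they can drift from the columns that are actually loaded. A minimal sketch of a consistency check follows; the helper name `check_schema` and the toy frame are hypothetical and only illustrate the idea.

```python
import pandas as pd

def check_schema(df: pd.DataFrame, data_dict: dict) -> dict:
    """Compare a DataFrame's columns against a data dictionary's keys."""
    documented = set(data_dict)
    actual = set(df.columns)
    return {
        "missing_from_data": sorted(documented - actual),  # documented but not loaded
        "undocumented": sorted(actual - documented),       # loaded but not documented
    }

# Toy frame standing in for a loaded table (hypothetical values)
toy_teams = pd.DataFrame({"id": [1], "full_name": ["Los Angeles Lakers"], "arena": ["n/a"]})
toy_dict = {"id": "Unique team identifier",
            "full_name": "Full team name",
            "city": "Team city location"}

report = check_schema(toy_teams, toy_dict)
print(report)
```

In this notebook, a call such as `check_schema(player_games, player_games_dict)` would show whether columns used in later cells (for example `PlayerName`) are present in the loaded data.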

### Visualizing Relationships Between Tables

Let's visualize how these datasets are related to each other.
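
Before relying on these relationships for joins, it can help to confirm that every key in a child table has a match in its parent table. The sketch below shows one way to do that with `Series.isin`; the helper name `orphan_keys` and the toy IDs are assumptions for illustration.

```python
import pandas as pd

def orphan_keys(child: pd.DataFrame, child_key: str,
                parent: pd.DataFrame, parent_key: str) -> pd.Series:
    """Return key values in `child` that have no matching row in `parent`."""
    mask = ~child[child_key].isin(parent[parent_key])
    return child.loc[mask, child_key].drop_duplicates()

# Toy tables (hypothetical IDs) mirroring the Players -> Player Games link
toy_players = pd.DataFrame({"id": [101, 102]})
toy_player_games = pd.DataFrame({
    "Player_ID": [101, 101, 102, 999],
    "Game_ID": ["g1", "g2", "g1", "g2"],
})

orphans = orphan_keys(toy_player_games, "Player_ID", toy_players, "id")
print(orphans.tolist())
```

The same call pattern would apply to the real tables, for instance `orphan_keys(player_games, 'Player_ID', players, 'id')`.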

In [None]:
# Create a simple diagram of table relationships
# (requires the graphviz Python package and the Graphviz system binaries)
from graphviz import Digraph

# Create a new graph
dot = Digraph(comment='NBA Data Schema')

# Add nodes for each table
dot.node('Teams', 'Teams\n(Team information)')
dot.node('Players', 'Players\n(Player information)')
dot.node('Games', 'Games\n(Team game statistics)')
dot.node('PlayerGames', 'Player Games\n(Player game statistics)')

# Add edges to show relationships
dot.edge('Teams', 'Games', label='Team_ID')
dot.edge('Players', 'PlayerGames', label='Player_ID')
dot.edge('Games', 'PlayerGames', label='Game_ID')

# Render the graph
dot.render('nba_schema', format='png', cleanup=True)

# Display the image
from IPython.display import Image
Image('nba_schema.png')

## Exploratory Data Analysis

Now that we understand the structure of our data, let's explore it in more detail to gain insights about player and team performance.

### Basic Statistics

In [None]:
# Basic statistics for games dataset
games.describe()

In [None]:
# Basic statistics for player games dataset
player_games.describe()

### Distributions of Key Performance Metrics

In [None]:
# Distribution of points scored by teams
plt.figure(figsize=(10, 6))
sns.histplot(games['PTS'], kde=True)
plt.title('Distribution of Team Points Scored', fontsize=14)
plt.xlabel('Points', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Distribution of points scored by players
plt.figure(figsize=(10, 6))
sns.histplot(player_games['PTS'], kde=True)
plt.title('Distribution of Player Points Scored', fontsize=14)
plt.xlabel('Points', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Distribution of plus/minus for players
plt.figure(figsize=(10, 6))
sns.histplot(player_games['PLUS_MINUS'], kde=True)
plt.title('Distribution of Player Plus/Minus', fontsize=14)
plt.xlabel('Plus/Minus', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.grid(True, alpha=0.3)
plt.axvline(x=0, color='red', linestyle='--')
plt.show()

In [None]:
# Distribution of minutes played
plt.figure(figsize=(10, 6))
sns.histplot(player_games['MIN'], kde=True)
plt.title('Distribution of Minutes Played', fontsize=14)
plt.xlabel('Minutes', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.grid(True, alpha=0.3)
plt.show()

### Correlation Analysis

In [None]:
# Select relevant columns for correlation analysis
game_corr_columns = ['PTS', 'FGM', 'FGA', 'FG3M', 'FG3A', 'FTM', 'FTA', 
                     'OREB', 'DREB', 'AST', 'STL', 'BLK', 'TOV', 'PLUS_MINUS']

# Calculate correlation matrix
game_corr = games[game_corr_columns].corr()

# Plot correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(game_corr, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Team Game Statistics', fontsize=14)
plt.tight_layout()
plt.show()

In [None]:
# Select relevant columns for correlation analysis
player_corr_columns = ['PTS', 'MIN', 'FGM', 'FGA', 'FG3M', 'FG3A', 'FTM', 'FTA', 
                       'OREB', 'DREB', 'AST', 'STL', 'BLK', 'TOV', 'PLUS_MINUS']

# Calculate correlation matrix
player_corr = player_games[player_corr_columns].corr()

# Plot correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(player_corr, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Player Game Statistics', fontsize=14)
plt.tight_layout()
plt.show()

### Time Series Overview

In [None]:
# Convert date strings to datetime objects
games['GAME_DATE'] = pd.to_datetime(games['GAME_DATE'])
player_games['GAME_DATE'] = pd.to_datetime(player_games['GAME_DATE'])

# Sort by date
games_sorted = games.sort_values('GAME_DATE')
player_games_sorted = player_games.sort_values('GAME_DATE')

In [None]:
# Aggregate points by date
daily_points = games_sorted.groupby(games_sorted['GAME_DATE'].dt.date)['PTS'].mean().reset_index()

# Plot time series of average points per game
plt.figure(figsize=(14, 6))
plt.plot(daily_points['GAME_DATE'], daily_points['PTS'], marker='o', alpha=0.7, linestyle='-')
plt.title('Average Points Per Game Over Time', fontsize=14)
plt.xlabel('Date', fontsize=12)
plt.ylabel('Average Points', fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Select a sample player for time series analysis
# Find a player with many games
player_game_counts = player_games['Player_ID'].value_counts()
sample_player_id = player_game_counts.index[0]
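# Look up the name in the players table, since the player games data dictionary lists no PlayerName column
name_match = players.loc[players['id'] == sample_player_id, 'full_name']
sample_player_name = name_match.iloc[0] if not name_match.empty else f"Player {sample_player_id}"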

# Get player's game data
player_games_data = player_games[player_games['Player_ID'] == sample_player_id].copy()
player_games_data = player_games_data.sort_values('GAME_DATE')

# Plot time series of points and plus/minus
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 10), sharex=True)

# Points time series
ax1.plot(player_games_data['GAME_DATE'], player_games_data['PTS'], marker='o', linestyle='-', label='Points')
ax1.set_ylabel('Points', fontsize=12)
ax1.set_title(f"{sample_player_name}: Game-by-Game Performance", fontsize=14)
ax1.grid(True, alpha=0.3)
ax1.axhline(y=player_games_data['PTS'].mean(), color='r', linestyle='--', label='Average')
ax1.legend()

# Plus/Minus time series
ax2.plot(player_games_data['GAME_DATE'], player_games_data['PLUS_MINUS'], marker='o', linestyle='-', color='green', label='Plus/Minus')
ax2.set_xlabel('Game Date', fontsize=12)
ax2.set_ylabel('Plus/Minus', fontsize=12)
ax2.grid(True, alpha=0.3)
ax2.axhline(y=0, color='gray', linestyle='--')
ax2.axhline(y=player_games_data['PLUS_MINUS'].mean(), color='r', linestyle='--', label='Average')
ax2.legend()

plt.tight_layout()
plt.show()

## Feature Engineering

Now let's create derived metrics that will be useful for our analysis.

In [None]:
# Preprocess the data using our module
games_processed, player_games_processed = preprocess_data(teams, players, games, player_games)

# Check the processed data
print("Processed games dataset shape:", games_processed.shape)
print("Processed player games dataset shape:", player_games_processed.shape)

In [None]:
# Examine the new features in the games dataset
new_features = ['PointsPerPossession', 'AssistRatio', 'TurnoverRatio', 'EffectiveFG', 'DefensiveRebound%', 'OffensiveRebound%']
games_processed[new_features].describe()

In [None]:
# Examine the new features in the player games dataset
player_new_features = ['UsageRate', 'EffectiveFG', 'TrueShootingPct', 'PointsPerMinute', 'ReboundsPerMinute', 'AssistsPerMinute']
player_games_processed[player_new_features].describe()
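
The names `EffectiveFG` and `TrueShootingPct` suggest the standard shooting-efficiency definitions. The sketch below computes those standard formulas on toy data; it assumes, without confirmation from `src.data_processing`, that `preprocess_data` follows the same definitions. The function name `shooting_efficiency` and the output column names are hypothetical.

```python
import numpy as np
import pandas as pd

def shooting_efficiency(df: pd.DataFrame) -> pd.DataFrame:
    """Add eFG% and TS% columns using the standard box-score definitions."""
    out = df.copy()
    # eFG% credits a made three-pointer as 1.5 made field goals
    out["eFG"] = (out["FGM"] + 0.5 * out["FG3M"]) / out["FGA"].replace(0, np.nan)
    # TS% folds in free throws through the common 0.44 * FTA estimate
    attempts = 2 * (out["FGA"] + 0.44 * out["FTA"])
    out["TS"] = out["PTS"] / attempts.replace(0, np.nan)
    return out

# Toy rows (hypothetical): one normal game and one game with no attempts
toy = pd.DataFrame({"PTS": [25, 0], "FGM": [9, 0], "FGA": [18, 0],
                    "FG3M": [3, 0], "FTM": [4, 0], "FTA": [5, 0]})
result = shooting_efficiency(toy)
print(result[["eFG", "TS"]])
```

Replacing zero denominators with `NaN` keeps games without shot attempts from producing infinite values, which would otherwise need cleaning up afterwards.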

### Advanced Basketball Metrics

Let's create some additional advanced metrics that aren't included in the preprocessing function.
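
The team metrics in this section rest on the possession estimate `FGA - OREB + TOV + 0.44 * FTA`. An offensive rating divides a team's own points by its possessions, while a defensive rating needs the opponent's points and possessions, which a single team row does not hold directly. One way to obtain opponent totals is sketched below on toy data, assuming each game appears as exactly two rows that share a `Game_ID` (one per team). The function name `add_ratings` and the short column names are hypothetical.

```python
import pandas as pd

def add_ratings(games: pd.DataFrame) -> pd.DataFrame:
    """Attach per-possession offensive and defensive ratings to team-game rows."""
    out = games.copy()
    # Common box-score possession estimate
    out["POSS"] = out["FGA"] - out["OREB"] + out["TOV"] + 0.44 * out["FTA"]
    # Pair each row with the other team in the same game
    opp = out[["Game_ID", "Team_ID", "PTS", "POSS"]].rename(
        columns={"Team_ID": "OPP_ID", "PTS": "OPP_PTS", "POSS": "OPP_POSS"})
    out = out.merge(opp, on="Game_ID")
    out = out[out["Team_ID"] != out["OPP_ID"]].reset_index(drop=True)
    out["OffRtg"] = out["PTS"] / out["POSS"]
    out["DefRtg"] = out["OPP_PTS"] / out["OPP_POSS"]
    out["NetRtg"] = out["OffRtg"] - out["DefRtg"]
    return out

# Toy game (hypothetical box score): two rows, one per team
toy_games = pd.DataFrame({
    "Game_ID": ["g1", "g1"], "Team_ID": [1, 2],
    "PTS": [110, 100], "FGA": [88, 90], "OREB": [10, 12],
    "TOV": [12, 14], "FTA": [25, 20],
})
rated = add_ratings(toy_games)
print(rated[["Team_ID", "OffRtg", "DefRtg", "NetRtg"]])
```

With this pairing, one team's offensive rating equals the other team's defensive rating, so the two net ratings in a game sum to zero. Multiplying by 100 gives the conventional per-100-possessions scale.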

In [None]:
# Create additional advanced metrics for teams
# Possession estimate shared by the rating and pace metrics
possessions = games_processed['FGA'] - games_processed['OREB'] + games_processed['TOV'] + 0.44 * games_processed['FTA']
games_processed['OffensiveRating'] = games_processed['PTS'] / possessions
# Opponent points = PTS - PLUS_MINUS, assuming team-level PLUS_MINUS is the final scoring margin;
# the team's own possessions stand in for the opponent's
games_processed['DefensiveRating'] = (games_processed['PTS'] - games_processed['PLUS_MINUS']) / possessions
games_processed['NetRating'] = games_processed['OffensiveRating'] - games_processed['DefensiveRating']
games_processed['PaceEstimate'] = possessions

# Create additional advanced metrics for players
player_games_processed['GameScore'] = player_games_processed['PTS'] + 0.4 * player_games_processed['FGM'] - 0.7 * player_games_processed['FGA'] - 0.4 * (player_games_processed['FTA'] - player_games_processed['FTM']) + 0.7 * player_games_processed['OREB'] + 0.3 * player_games_processed['DREB'] + player_games_processed['STL'] + 0.7 * player_games_processed['AST'] + 0.7 * player_games_processed['BLK'] - 0.4 * player_games_processed['PF'] - player_games_processed['TOV']
# Plus-minus per 100 minutes played; a simple rate, not the regression-based Box Plus/Minus
player_games_processed['BoxPlusMinus'] = player_games_processed['PLUS_MINUS'] / player_games_processed['MIN'] * 100

# Handle infinite values from division by zero
for df in [games_processed, player_games_processed]:
    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = df[col].replace([np.inf, -np.inf], np.nan)

# Fill NaN values with appropriate replacements
games_processed = games_processed.fillna(0)
player_games_processed = player_games_processed.fillna(0)

In [None]:
# Examine the new advanced metrics
games_processed[['OffensiveRating', 'DefensiveRating', 'NetRating', 'PaceEstimate']].describe()

In [None]:
# Examine the new player advanced metrics
player_games_processed[['GameScore', 'BoxPlusMinus']].describe()

### Handle Edge Cases and Anomalies

In [None]:
# Check for outliers in player minutes
plt.figure(figsize=(10, 6))
sns.boxplot(x=player_games_processed['MIN'])
plt.title('Distribution of Minutes Played (Box Plot)', fontsize=14)
plt.xlabel('Minutes', fontsize=12)
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Handle outliers in minutes played (e.g., overtime games)
# Flag games with unusually high minutes
high_minute_threshold = 48  # Regular game is 48 minutes
high_minute_games = player_games_processed[player_games_processed['MIN'] > high_minute_threshold]
print(f"Number of player-games with more than {high_minute_threshold} minutes: {len(high_minute_games)}")

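# Examine these games (display() is needed because a bare expression inside an if block is not rendered)
from IPython.display import display
if len(high_minute_games) > 0:
    display(high_minute_games[['PlayerName', 'GAME_DATE', 'MIN', 'PTS', 'PLUS_MINUS']].head(10))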

In [None]:
# Check for games with very low minutes but high stats (potential data errors)
low_min_high_pts = player_games_processed[(player_games_processed['MIN'] < 5) & (player_games_processed['PTS'] > 10)]
print(f"Number of player-games with less than 5 minutes but more than 10 points: {len(low_min_high_pts)}")

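# Examine these games (display() is needed because a bare expression inside an if block is not rendered)
from IPython.display import display
if len(low_min_high_pts) > 0:
    display(low_min_high_pts[['PlayerName', 'GAME_DATE', 'MIN', 'PTS', 'PLUS_MINUS']].head(10))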

In [None]:
# Handle potential data errors
# For this analysis, we'll filter out games with very low minutes but high stats
player_games_cleaned = player_games_processed[~((player_games_processed['MIN'] < 5) & (player_games_processed['PTS'] > 10))]
print(f"Removed {len(player_games_processed) - len(player_games_cleaned)} potential data errors")

## Temporal Feature Engineering

Now let's create temporal features that will be the foundation for our dynamical systems analysis.
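
The internals of `create_temporal_features` are not shown in this notebook. As a rough sketch of how features with names like `PTS_MA5`, `PTS_Volatility`, and `PTS_Change` could be derived, the example below computes a per-player rolling mean, rolling standard deviation, and game-to-game difference on toy data. The function name `rolling_features`, the window handling, and the exact definitions are assumptions, not the module's confirmed behavior.

```python
import pandas as pd

def rolling_features(df: pd.DataFrame, col: str = "PTS", window: int = 5) -> pd.DataFrame:
    """Per-player rolling mean, rolling std, and game-to-game change for `col`."""
    out = df.sort_values(["Player_ID", "GAME_DATE"]).copy()
    grouped = out.groupby("Player_ID")[col]
    # transform keeps the original index, so results align row by row
    out[f"{col}_MA{window}"] = grouped.transform(
        lambda s: s.rolling(window, min_periods=1).mean())
    out[f"{col}_Volatility"] = grouped.transform(
        lambda s: s.rolling(window, min_periods=2).std())
    # diff within each player, so one player's last game never feeds another's first
    out[f"{col}_Change"] = grouped.diff()
    return out

# Toy data (hypothetical): six games for player 1, two for player 2
toy = pd.DataFrame({
    "Player_ID": [1] * 6 + [2] * 2,
    "GAME_DATE": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-05",
                                 "2024-01-07", "2024-01-09", "2024-01-11",
                                 "2024-01-02", "2024-01-04"]),
    "PTS": [10, 20, 30, 20, 10, 30, 8, 12],
})
features = rolling_features(toy)
print(features[["Player_ID", "PTS", "PTS_MA5", "PTS_Volatility", "PTS_Change"]])
```

Grouping by `Player_ID` before rolling matters: without it, windows would span the boundary between two players' game logs. The first game of each player has no previous game, so its change value is missing, which is one source of the missing values handled later in this notebook.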

In [None]:
# Create temporal features using our module
player_temporal_df = create_temporal_features(player_games_cleaned)

# Check the temporal features
print("Player temporal dataset shape:", player_temporal_df.shape)

In [None]:
# Examine the temporal features
temporal_features = ['PTS_MA5', 'PTS_Trend', 'PTS_Volatility', 'PTS_Change', 'PLUS_MINUS_Change', 'PTS_Momentum', 'Performance_Momentum']
player_temporal_df[temporal_features].describe()

In [None]:
# Select a sample player for temporal feature visualization
sample_player_id = player_game_counts.index[0]
sample_player_name = player_temporal_df[player_temporal_df['Player_ID'] == sample_player_id]['PlayerName'].iloc[0]

# Get player's temporal data
player_temporal_data = player_temporal_df[player_temporal_df['Player_ID'] == sample_player_id].copy()
player_temporal_data = player_temporal_data.sort_values('GAME_DATE')

# Plot temporal features
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(14, 12), sharex=True)

# Points and moving average
ax1.plot(player_temporal_data['GAME_DATE'], player_temporal_data['PTS'], marker='o', linestyle='-', label='Points')
ax1.plot(player_temporal_data['GAME_DATE'], player_temporal_data['PTS_MA5'], linestyle='-', color='red', label='5-Game Moving Avg')
ax1.set_ylabel('Points', fontsize=12)
ax1.set_title(f"{sample_player_name}: Temporal Features", fontsize=14)
ax1.grid(True, alpha=0.3)
ax1.legend()

# Points trend and volatility
ax2.plot(player_temporal_data['GAME_DATE'], player_temporal_data['PTS_Trend'], marker='o', linestyle='-', color='green', label='Points Trend')
ax2.plot(player_temporal_data['GAME_DATE'], player_temporal_data['PTS_Volatility'], linestyle='-', color='orange', label='Points Volatility')
ax2.set_ylabel('Value', fontsize=12)
ax2.grid(True, alpha=0.3)
ax2.axhline(y=0, color='gray', linestyle='--')
ax2.legend()

# Momentum features
ax3.plot(player_temporal_data['GAME_DATE'], player_temporal_data['PTS_Momentum'], marker='o', linestyle='-', color='purple', label='Points Momentum')
ax3.plot(player_temporal_data['GAME_DATE'], player_temporal_data['Performance_Momentum'], linestyle='-', color='brown', label='Performance Momentum')
ax3.set_xlabel('Game Date', fontsize=12)
ax3.set_ylabel('Momentum', fontsize=12)
ax3.grid(True, alpha=0.3)
ax3.axhline(y=0, color='gray', linestyle='--')
ax3.legend()

plt.tight_layout()
plt.show()

## Data Quality & Preprocessing

Let's address any remaining data quality issues and prepare the final datasets for subsequent notebooks.

In [None]:
# Check for missing values in the temporal dataset
print("Missing values in player temporal dataset:")
print(player_temporal_df.isnull().sum())

In [None]:
# Handle missing values in temporal features
# For this analysis, we'll fill missing values with 0
player_temporal_df = player_temporal_df.fillna(0)

# Verify no missing values remain
print("Missing values after filling:")
print(player_temporal_df.isnull().sum().sum())

In [None]:
# Ensure date formats are consistent
player_temporal_df['GAME_DATE'] = pd.to_datetime(player_temporal_df['GAME_DATE'])

# Sort by player and date
player_temporal_df = player_temporal_df.sort_values(['Player_ID', 'GAME_DATE'])

In [None]:
# Save the processed datasets for use in subsequent notebooks
import os

# Create processed data directory if it doesn't exist
os.makedirs('../data/processed', exist_ok=True)

# Save datasets
games_processed.to_csv('../data/processed/games_processed.csv', index=False)
player_games_cleaned.to_csv('../data/processed/player_games_processed.csv', index=False)
player_temporal_df.to_csv('../data/processed/player_temporal.csv', index=False)

print("Saved processed datasets to ../data/processed/")
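
CSV files do not store column types, so `GAME_DATE` is read back as text unless it is parsed on load. The sketch below demonstrates this round trip on a toy frame written to a temporary directory; the frame contents are hypothetical, and the file name simply mirrors the one used above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for player_temporal_df (hypothetical values)
toy = pd.DataFrame({
    "Player_ID": [1, 1],
    "GAME_DATE": pd.to_datetime(["2024-01-01", "2024-01-03"]),
    "PTS": [10, 20],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "player_temporal.csv"
    toy.to_csv(path, index=False)
    naive = pd.read_csv(path)                               # GAME_DATE comes back as text
    parsed = pd.read_csv(path, parse_dates=["GAME_DATE"])   # GAME_DATE restored as datetime

print(naive["GAME_DATE"].dtype, parsed["GAME_DATE"].dtype)
```

Subsequent notebooks that read the processed files would likely want `parse_dates=['GAME_DATE']` so that date sorting and time-based plots behave as they do here.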

## Conclusion

In this notebook, we've explored the NBA dataset that will be used for our dynamical systems analysis. We've examined the structure of the data, checked for quality issues, performed exploratory analysis, and created derived features that will be the foundation for our subsequent analysis.

Key accomplishments:
1. Loaded and explored four primary datasets: teams, players, games, and player games
2. Visualized the relationships between these datasets
3. Performed exploratory data analysis to understand distributions and correlations
4. Created advanced basketball metrics through feature engineering
5. Developed temporal features for dynamical systems analysis
6. Addressed data quality issues and prepared clean datasets

In the next notebook, we'll apply dynamical systems theory to model player performance stability and extract insights about player consistency and volatility.