# 🏀 NBA Props Model - Enhanced Analysis & Betting Insights

This notebook provides **actionable insights** for NBA player prop betting, focusing on PRA (Points + Rebounds + Assists).

## What This Notebook Does:
1. **Player Profiling**: Identifies consistent vs volatile players
2. **Risk Analysis**: Quantifies betting risk for each player
3. **Value Discovery**: Finds undervalued betting opportunities
4. **Predictions**: Provides PRA projections with confidence intervals
5. **Recommendations**: Generates daily betting sheet with top plays


In [1]:
# Import required libraries
import pandas as pd
import numpy as np
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.patches import Rectangle
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# ML and analysis
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from scipy import stats

# Display settings
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 100)
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print("📊 NBA Props Analysis Notebook Ready!")

📊 NBA Props Analysis Notebook Ready!


## 1. Quick Data Load & Initial Insights

In [2]:
# Load the processed features from previous notebook
data_path = Path('/Users/diyagamah/Documents/nba_props_model/data')
processed_path = data_path / 'processed'

# Check if processed data exists
if (processed_path / 'player_features_2023_24.csv').exists():
    player_features = pd.read_csv(processed_path / 'player_features_2023_24.csv')
    print(f"✅ Loaded {len(player_features)} players with features")
else:
    print("⚠️ No processed features found. Loading raw data...")
    # Load raw data
    season_path = data_path / 'ctg_data_organized' / 'players' / '2023-24' / 'regular_season'
    offensive_df = pd.read_csv(season_path / 'offensive_overview' / 'offensive_overview.csv')
    defense_df = pd.read_csv(season_path / 'defense_rebounding' / 'defense_rebounding.csv')
    
    # Clean percentage columns
    for col in ['Usage', 'AST%', 'TOV%']:
        if col in offensive_df.columns and offensive_df[col].dtype == 'object':
            offensive_df[col] = offensive_df[col].str.replace('%', '').astype(float)
    
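    # Create a rough PRA proxy from minutes, usage, and PSA.
    # NOTE: this is a heuristic index, not points + rebounds + assists; the /500
    # divisor is arbitrary, so the values are not on a per-game PRA scale.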
    player_features = offensive_df[['Player', 'Team', 'MIN', 'Usage', 'PSA']].copy()
    player_features['PRA_estimate'] = (
        player_features['MIN'] * player_features['Usage'] * player_features['PSA'] / 500
    )

# Show top players immediately
print("\n🏆 TOP 10 PLAYERS BY ESTIMATED PRA:")
print("="*60)
top_players = player_features.nlargest(10, 'PRA_estimate')[['Player', 'Team', 'PRA_estimate']]
for idx, row in top_players.iterrows():
    print(f"{row['Player']:20s} ({row['Team']})  →  {row['PRA_estimate']:.1f} PRA")

⚠️ No processed features found. Loading raw data...

🏆 TOP 10 PLAYERS BY ESTIMATED PRA:
Luka Doncic          (DAL)  →  26344.2 PRA
Shai Gilgeous-Alexander (OKC)  →  22717.4 PRA
Nikola Jokic         (DEN)  →  22255.7 PRA
Jalen Brunson        (NYK)  →  22209.2 PRA
Giannis Antetokounmpo (MIL)  →  22024.6 PRA
Anthony Edwards      (MIN)  →  21231.3 PRA
Jayson Tatum         (BOS)  →  20066.2 PRA
LeBron James         (LAL)  →  20009.8 PRA
Kevin Durant         (PHX)  →  19928.6 PRA
De'Aaron Fox         (SAC)  →  19295.1 PRA
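
PRA is defined at the top of this notebook as points plus rebounds plus assists, whereas the fallback formula in this cell produces an index in the tens of thousands. A minimal sketch of a per-game PRA average follows, assuming a hypothetical game-log table with `PTS`, `REB`, and `AST` columns (these column names and the sample rows are illustrative and are not part of the CTG files loaded above):

```python
import pandas as pd

# Hypothetical game logs; column names and values are illustrative only
game_logs = pd.DataFrame({
    'Player': ['Player A', 'Player A', 'Player B', 'Player B'],
    'PTS': [30, 24, 12, 18],
    'REB': [8, 10, 4, 6],
    'AST': [9, 7, 3, 5],
})

# PRA for a single game is points + rebounds + assists
game_logs['PRA'] = game_logs['PTS'] + game_logs['REB'] + game_logs['AST']

# Average over games to get a per-game PRA for each player
pra_per_game = game_logs.groupby('Player')['PRA'].mean()
print(pra_per_game)
```

A per-game figure like this sits on the same scale as a sportsbook PRA line, which makes later comparisons against lines meaningful.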


## 2. Player Volatility & Risk Analysis

In [None]:
# Calculate volatility metrics
def calculate_player_volatility(df):
    """Calculate volatility/risk metrics for each player"""
    
    # Placeholder: volatility is simulated from a normal distribution, not derived
    # from usage or minutes variance. Seeded so the notebook is reproducible.
    rng = np.random.default_rng(42)
    df['Volatility_Score'] = rng.normal(0.5, 0.2, len(df)).clip(0, 1)
    
    # Create risk categories; include_lowest keeps scores clipped to exactly 0
    # in 'Low Risk' instead of turning them into NaN
    df['Risk_Category'] = pd.cut(
        df['Volatility_Score'],
        bins=[0, 0.3, 0.6, 1.0],
        labels=['Low Risk', 'Medium Risk', 'High Risk'],
        include_lowest=True
    )
    
    # Calculate confidence score (inverse of volatility)
    df['Confidence_Score'] = 1 - df['Volatility_Score']
    
    return df

player_features = calculate_player_volatility(player_features)

# Show risk distribution
risk_counts = player_features['Risk_Category'].value_counts()
print("\n📊 PLAYER RISK DISTRIBUTION:")
print("="*40)
for category, count in risk_counts.items():
    pct = (count/len(player_features))*100
    print(f"{category:12s}: {count:3d} players ({pct:.1f}%)")
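
The volatility score in this cell is simulated. Once per-game data is available, one possible replacement is the coefficient of variation (standard deviation divided by mean) of each player's per-game PRA. The sketch below assumes a hypothetical `game_logs` frame with one row per player per game; the player names, the numbers, and the `pra_volatility` helper are made up for illustration.

```python
import pandas as pd

# Hypothetical per-game PRA values for two players with the same average
game_logs = pd.DataFrame({
    'Player': ['Steady'] * 4 + ['Streaky'] * 4,
    'PRA': [30, 31, 29, 30, 10, 50, 20, 40],
})

def pra_volatility(logs):
    """Return per-player mean, sample std, and coefficient of variation of PRA."""
    summary = logs.groupby('Player')['PRA'].agg(['mean', 'std'])
    summary['cv'] = summary['std'] / summary['mean']
    return summary

volatility = pra_volatility(game_logs)
print(volatility.round(3))
```

Both players average the same PRA, but the coefficient of variation separates the consistent player from the boom-or-bust one, which is the distinction the risk categories are meant to capture.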

In [None]:
# Create Risk-Reward Scatter Plot
fig, ax = plt.subplots(figsize=(14, 8))

# Map each risk category to a plot color
colors = {'Low Risk': 'green', 'Medium Risk': 'orange', 'High Risk': 'red'}

# Plot each risk category (iterate over the color map so the legend order is
# stable and an unexpected category value cannot raise a KeyError)
for risk, color in colors.items():
    mask = player_features['Risk_Category'] == risk
    ax.scatter(
        player_features[mask]['Volatility_Score'],
        player_features[mask]['PRA_estimate'],
        c=color,
        label=risk,
        alpha=0.6,
        s=50
    )

# Add quadrant lines
ax.axhline(y=player_features['PRA_estimate'].median(), color='gray', linestyle='--', alpha=0.5)
ax.axvline(x=0.5, color='gray', linestyle='--', alpha=0.5)

# Add quadrant labels
ax.text(0.25, player_features['PRA_estimate'].max()*0.9, 'PREMIUM PLAYS\n(Low Risk, High Reward)', 
        ha='center', fontsize=10, weight='bold', color='darkgreen')
ax.text(0.75, player_features['PRA_estimate'].max()*0.9, 'HIGH RISK/REWARD\n(Boom or Bust)', 
        ha='center', fontsize=10, weight='bold', color='darkorange')
ax.text(0.25, 5, 'SAFE UNDERS\n(Low Risk, Low Reward)', 
        ha='center', fontsize=10, weight='bold', color='blue')
ax.text(0.75, 5, 'AVOID\n(High Risk, Low Reward)', 
        ha='center', fontsize=10, weight='bold', color='darkred')

# Annotate top players
top_5 = player_features.nlargest(5, 'PRA_estimate')
for _, player in top_5.iterrows():
    ax.annotate(
        player['Player'].split()[-1],  # Last name only
        (player['Volatility_Score'], player['PRA_estimate']),
        fontsize=8,
        alpha=0.8
    )

ax.set_xlabel('Volatility Score (Risk)', fontsize=12)
ax.set_ylabel('Estimated PRA', fontsize=12)
ax.set_title('🎯 NBA Player Risk-Reward Analysis', fontsize=16, weight='bold')
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Identify players in each quadrant
median_pra = player_features['PRA_estimate'].median()

premium = player_features[(player_features['Volatility_Score'] < 0.5) & 
                          (player_features['PRA_estimate'] > median_pra)]
high_risk_reward = player_features[(player_features['Volatility_Score'] >= 0.5) & 
                                   (player_features['PRA_estimate'] > median_pra)]

print("\n🎯 PREMIUM BETTING TARGETS (Low Risk, High Reward):")
print("="*60)
for _, p in premium.nlargest(5, 'PRA_estimate').iterrows():
    print(f"{p['Player']:20s} ({p['Team']})  →  {p['PRA_estimate']:.1f} PRA  |  Risk: {p['Volatility_Score']:.2f}")

print("\n⚡ HIGH RISK/REWARD PLAYS (Boom or Bust):")
print("="*60)
for _, p in high_risk_reward.nlargest(5, 'PRA_estimate').iterrows():
    print(f"{p['Player']:20s} ({p['Team']})  →  {p['PRA_estimate']:.1f} PRA  |  Risk: {p['Volatility_Score']:.2f}")

## 3. Player Clustering & Archetypes

In [None]:
# Prepare features for clustering
clustering_features = ['MIN', 'Usage', 'PSA'] if 'PSA' in player_features.columns else ['MIN', 'Usage']
X = player_features[clustering_features].fillna(0)

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Perform clustering
n_clusters = 5
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
player_features['Cluster'] = kmeans.fit_predict(X_scaled)

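# KMeans cluster IDs are arbitrary, so re-number clusters by descending mean PRA
# (0 = highest) before attaching the names below; the names remain rough labels.
cluster_rank = (
    player_features.groupby('Cluster')['PRA_estimate'].mean()
    .rank(ascending=False, method='first').astype(int) - 1
)
player_features['Cluster'] = player_features['Cluster'].map(cluster_rank)

# Define cluster names (ordered from highest to lowest average PRA)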
cluster_names = {
    0: '⭐ Elite Stars',
    1: '🎯 Solid Starters', 
    2: '⚡ High Usage Scorers',
    3: '🛡️ Role Players',
    4: '🪑 Bench Players'
}

# Analyze clusters
print("\n🏀 PLAYER ARCHETYPES DISCOVERED:")
print("="*70)

for cluster_id in range(n_clusters):
    cluster_data = player_features[player_features['Cluster'] == cluster_id]
    cluster_name = cluster_names.get(cluster_id, f'Cluster {cluster_id}')
    
    print(f"\n{cluster_name}")
    print("-"*40)
    print(f"  Players: {len(cluster_data)}")
    print(f"  Avg PRA: {cluster_data['PRA_estimate'].mean():.1f}")
    print(f"  Avg MIN: {cluster_data['MIN'].mean():.1f}")
    if 'Usage' in cluster_data.columns:
        print(f"  Avg Usage: {cluster_data['Usage'].mean():.1f}%")
    
    # Show example players
    examples = cluster_data.nlargest(3, 'PRA_estimate')[['Player', 'Team']]
    print(f"  Examples: {', '.join([f'{p} ({t})' for p, t in zip(examples['Player'], examples['Team'])])}")

# Visualize clusters
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Scatter plot of clusters
for cluster_id in range(n_clusters):
    mask = player_features['Cluster'] == cluster_id
    ax1.scatter(
        player_features[mask]['MIN'],
        player_features[mask]['PRA_estimate'],
        label=cluster_names.get(cluster_id),
        alpha=0.6,
        s=50
    )

ax1.set_xlabel('Minutes Played', fontsize=12)
ax1.set_ylabel('Estimated PRA', fontsize=12)
ax1.set_title('Player Archetypes by Performance', fontsize=14, weight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Box plot of PRA by cluster (labels drop the leading emoji but keep the full name)
cluster_pra = [player_features[player_features['Cluster'] == i]['PRA_estimate'] for i in range(n_clusters)]
box_labels = [' '.join(cluster_names[i].split()[1:]) for i in range(n_clusters)]
bp = ax2.boxplot(cluster_pra, labels=box_labels, patch_artist=True)
for patch, color in zip(bp['boxes'], plt.cm.Set3(np.linspace(0, 1, n_clusters))):
    patch.set_facecolor(color)

ax2.set_ylabel('Estimated PRA', fontsize=12)
ax2.set_title('PRA Distribution by Player Archetype', fontsize=14, weight='bold')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
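
`n_clusters = 5` is fixed by hand in the cell above. One common way to sanity-check that choice is the silhouette score. The sketch below runs it on synthetic, well-separated blobs rather than the player data, and the candidate range of 2 to 6 clusters is an arbitrary assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight blobs, far apart (not the player features)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([c + rng.normal(0, 0.5, size=(50, 2)) for c in centers])

# Score each candidate k; a higher silhouette means tighter, better-separated clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(f"Best k by silhouette: {best_k}")
```

On real player features the peak is usually less clear-cut, so the score is better treated as one input alongside whether the resulting groups are interpretable.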

## 4. PRA Prediction Model & Confidence Intervals

In [None]:
# Build a simple prediction model
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score

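# Prepare features
# NOTE: when PRA_estimate comes from the fallback formula (MIN * Usage * PSA / 500),
# the target is a deterministic function of these same features, so the model
# mostly re-learns that formula and its metrics say little about real PRA.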
feature_cols = ['MIN', 'Usage'] + (['PSA'] if 'PSA' in player_features.columns else [])
X = player_features[feature_cols].fillna(0)
y = player_features['PRA_estimate'].fillna(0)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = RandomForestRegressor(n_estimators=100, random_state=42, max_depth=10)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate metrics
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("\n📈 MODEL PERFORMANCE:")
print("="*40)
print(f"Mean Absolute Error: {mae:.2f} PRA")
print(f"R² Score: {r2:.3f}")
print(f"\nThis means our predictions are typically off by ~{mae:.1f} PRA points")

# Feature importance
feature_importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print("\n🔑 FEATURE IMPORTANCE:")
print("="*40)
for _, row in feature_importance.iterrows():
    print(f"{row['feature']:15s}: {row['importance']*100:.1f}%")

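# Add predictions to dataframe (includes training rows, so these errors are optimistic)
player_features['PRA_predicted'] = model.predict(X)
player_features['Prediction_Error'] = abs(player_features['PRA_predicted'] - player_features['PRA_estimate'])

# Approximate 95% interval from the spread of held-out residuals, assuming roughly
# normal errors (the 1.96 multiplier applies to a standard deviation, not to MAE)
residual_std = np.std(y_test - y_pred, ddof=1)
player_features['PRA_lower_bound'] = player_features['PRA_predicted'] - (1.96 * residual_std)
player_features['PRA_upper_bound'] = player_features['PRA_predicted'] + (1.96 * residual_std)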
player_features['PRA_lower_bound'] = player_features['PRA_lower_bound'].clip(lower=0)
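
A 1.96 multiplier gives roughly 95% coverage when it scales the standard deviation of normally distributed errors. MAE is smaller than the standard deviation (about 0.8 times it for normal errors), so the same multiplier applied to MAE yields a narrower band. A quick simulation on synthetic residuals (not the model's actual errors) illustrates the gap:

```python
import numpy as np

# Synthetic, normally distributed prediction errors (assumed for illustration)
rng = np.random.default_rng(0)
residuals = rng.normal(loc=0.0, scale=5.0, size=100_000)

mae = np.mean(np.abs(residuals))
std = np.std(residuals, ddof=1)

# Fraction of errors that fall inside each candidate band
coverage_mae = np.mean(np.abs(residuals) <= 1.96 * mae)
coverage_std = np.mean(np.abs(residuals) <= 1.96 * std)

print(f"MAE / std ratio:          {mae / std:.3f}")
print(f"Coverage with 1.96 * MAE: {coverage_mae:.3f}")
print(f"Coverage with 1.96 * std: {coverage_std:.3f}")
```

If the real residuals are skewed or heavy-tailed, empirical residual quantiles are a safer basis for the band than either multiplier.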

In [None]:
# Visualize predictions with confidence intervals
top_20 = player_features.nlargest(20, 'PRA_predicted')

fig, ax = plt.subplots(figsize=(14, 8))

# Sort for better visualization
top_20_sorted = top_20.sort_values('PRA_predicted')

# Plot confidence intervals
positions = range(len(top_20_sorted))
ax.barh(positions, top_20_sorted['PRA_predicted'], color='steelblue', alpha=0.7, label='Predicted PRA')

# Add error bars
errors = [
    top_20_sorted['PRA_predicted'] - top_20_sorted['PRA_lower_bound'],
    top_20_sorted['PRA_upper_bound'] - top_20_sorted['PRA_predicted']
]
ax.errorbar(
    top_20_sorted['PRA_predicted'], 
    positions,
    xerr=errors,
    fmt='none',
    color='red',
    alpha=0.5,
    capsize=3,
    label='Approx. 95% interval'
)

# Customize
ax.set_yticks(positions)
ax.set_yticklabels([f"{p} ({t})" for p, t in zip(top_20_sorted['Player'], top_20_sorted['Team'])])
ax.set_xlabel('Predicted PRA', fontsize=12)
ax.set_title('Top 20 Players: PRA Predictions with Confidence Intervals', fontsize=14, weight='bold')
ax.legend()
ax.grid(True, alpha=0.3, axis='x')

# Add value labels
for i, (_, player) in enumerate(top_20_sorted.iterrows()):
    ax.text(player['PRA_predicted'] + 1, i, f"{player['PRA_predicted']:.1f}", 
            va='center', fontsize=9)

plt.tight_layout()
plt.show()

## 5. Generate Betting Recommendations

In [None]:
def generate_betting_sheet(df, min_pra=20):
    """Generate actionable betting recommendations"""
    
    # Filter for relevant players
    betting_candidates = df[df['PRA_predicted'] > min_pra].copy()
    
    # Calculate betting scores
    betting_candidates['Betting_Score'] = (
        betting_candidates['PRA_predicted'] * 0.5 +  # Prediction weight
        betting_candidates['Confidence_Score'] * 30 +  # Confidence weight
        (1 / (betting_candidates['Volatility_Score'] + 0.1)) * 5  # Inverse volatility weight
    )
    
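    # Categorize recommendations (open-ended outer bins so that no score falls
    # outside the range and becomes NaN)
    betting_candidates['Recommendation'] = pd.cut(
        betting_candidates['Betting_Score'],
        bins=[-np.inf, 30, 40, np.inf],
        labels=['❓ Consider', '👍 Good Play', '🔥 Strong Play']
    )
    
    # Create betting lines (simulated placeholders, not real bookmaker lines)
    betting_candidates['Suggested_Line'] = (betting_candidates['PRA_predicted'] - 2).round(0)
    # Hit probability is random noise used only to fill the column; seeded for reproducibility
    rng = np.random.default_rng(42)
    betting_candidates['Hit_Probability'] = rng.uniform(0.45, 0.75, len(betting_candidates))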
    
    return betting_candidates

# Generate recommendations
betting_sheet = generate_betting_sheet(player_features)
betting_sheet = betting_sheet.sort_values('Betting_Score', ascending=False)

print("\n🎰 TODAY'S BETTING RECOMMENDATIONS:")
print("="*80)
print(f"{'Player':<20} {'Team':<5} {'Pred PRA':<10} {'Line':<8} {'Risk':<12} {'Rec':<15}")
print("-"*80)

# Show top 15 recommendations
for _, player in betting_sheet.head(15).iterrows():
    print(f"{player['Player'][:19]:<20} {player['Team']:<5} "
          f"{player['PRA_predicted']:>8.1f} {player['Suggested_Line']:>7.0f}u "
          f"{player['Risk_Category']:<12} {player['Recommendation']}")

# Category breakdown
print("\n📊 RECOMMENDATIONS BREAKDOWN:")
print("="*40)
rec_counts = betting_sheet['Recommendation'].value_counts()
for rec, count in rec_counts.items():
    print(f"{rec}: {count} players")
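
`Hit_Probability` in this cell is drawn at random. If a projection comes with a mean and a standard deviation, one simple alternative is to read the over probability off a normal distribution. The sketch below assumes PRA is approximately normal, which is a modeling assumption, and `prob_over` is a hypothetical helper rather than part of this notebook.

```python
from math import erf, sqrt

def prob_over(line, mean, std):
    """P(X > line) for X ~ Normal(mean, std), using the error function."""
    if std <= 0:
        raise ValueError("std must be positive")
    z = (line - mean) / std
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

# Example: projection of 33.0 PRA with a spread of 6.0 against a 30.5 line
p = prob_over(line=30.5, mean=33.0, std=6.0)
print(f"P(over 30.5) = {p:.3f}")
```

The under probability is `1 - prob_over(...)`, since a half-point line cannot push.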

## 6. Interactive Player Explorer

In [None]:
# Create interactive scatter plot with Plotly
import plotly.express as px

# Prepare data for interactive plot
plot_data = player_features.copy()
plot_data['Player_Info'] = plot_data['Player'] + ' (' + plot_data['Team'] + ')'

# Create interactive scatter plot
fig = px.scatter(
    plot_data,
    x='Volatility_Score',
    y='PRA_predicted',
    color='Risk_Category',
    size='MIN',
    hover_data={
        'Player': True,
        'Team': True,
        'PRA_predicted': ':.1f',
        'Volatility_Score': ':.2f',
        'Confidence_Score': ':.2f',
        'MIN': ':.1f'
    },
    title='Interactive Player Explorer - Hover for Details',
    labels={
        'Volatility_Score': 'Risk Level',
        'PRA_predicted': 'Predicted PRA',
        'MIN': 'Minutes Played'
    },
    color_discrete_map={
        'Low Risk': 'green',
        'Medium Risk': 'orange', 
        'High Risk': 'red'
    },
    height=600
)

# Add quadrant lines
fig.add_hline(y=plot_data['PRA_predicted'].median(), line_dash="dash", line_color="gray", opacity=0.5)
fig.add_vline(x=0.5, line_dash="dash", line_color="gray", opacity=0.5)

# Add quadrant annotations
fig.add_annotation(x=0.25, y=plot_data['PRA_predicted'].max()*0.95,
                  text="PREMIUM PLAYS", showarrow=False,
                  font=dict(size=12, color="darkgreen"))
fig.add_annotation(x=0.75, y=plot_data['PRA_predicted'].max()*0.95,
                  text="HIGH RISK/REWARD", showarrow=False,
                  font=dict(size=12, color="darkorange"))

fig.update_layout(
    template='plotly_white',
    hovermode='closest'
)

fig.show()

print("\n💡 TIP: Hover over any point to see player details!")
print("    - Size represents minutes played")
print("    - Color represents risk category")
print("    - Position shows risk vs reward tradeoff")

## 7. Find Similar Players

In [None]:
def find_similar_players(player_name, df, n_similar=5):
    """Find players with similar playing style and stats"""
    
    if player_name not in df['Player'].values:
        print(f"Player '{player_name}' not found in database")
        return None, None  # two values, so callers can always unpack the result
    
    # Get target player
    target = df[df['Player'] == player_name].iloc[0]
    
    # Calculate similarity based on key features
    feature_cols = ['MIN', 'Usage', 'PRA_predicted']
    if 'PSA' in df.columns:
        feature_cols.append('PSA')
    
    # Standardize features
    scaler = StandardScaler()
    features_scaled = scaler.fit_transform(df[feature_cols].fillna(0))
    target_scaled = scaler.transform(df[df['Player'] == player_name][feature_cols].fillna(0))
    
    # Calculate distances
    from sklearn.metrics.pairwise import euclidean_distances
    distances = euclidean_distances(target_scaled, features_scaled)[0]
    
    # Get similar players (work on a copy so the caller's DataFrame is not modified)
    ranked = df.assign(Similarity_Distance=distances)
    similar = ranked[ranked['Player'] != player_name].nsmallest(n_similar, 'Similarity_Distance')
    
    return target, similar

# Example: Find players similar to a top player
example_player = player_features.nlargest(1, 'PRA_predicted')['Player'].iloc[0]
target, similar = find_similar_players(example_player, player_features)

if similar is not None:
    print(f"\n🔍 PLAYERS SIMILAR TO {example_player}:")
    print("="*60)
    print(f"Target: {target['Player']} ({target['Team']}) - {target['PRA_predicted']:.1f} PRA")
    print("\nSimilar Players:")
    print("-"*60)
    
    for _, player in similar.iterrows():
        print(f"{player['Player']:20s} ({player['Team']})  →  "
              f"{player['PRA_predicted']:.1f} PRA  |  "
              f"Risk: {player['Risk_Category']}  |  "
              f"Similarity: {100 - player['Similarity_Distance']*10:.0f}%")
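
The `100 - distance * 10` similarity printed above turns negative once the standardized distance exceeds 10. A bounded alternative, shown here as a sketch with a hypothetical helper name, maps any non-negative distance into the range (0, 100]:

```python
def similarity_percent(distance):
    """Map a non-negative distance to a score in (0, 100]; zero distance gives 100."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 100.0 / (1.0 + distance)

for d in (0.0, 1.0, 12.0):
    print(f"distance {d:>4.1f} -> {similarity_percent(d):.1f}%")
```

The exact mapping is a presentation choice; ranking by raw distance gives the same ordering of similar players.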

## 8. Export Betting Sheet

In [None]:
# Create final betting sheet for export
export_columns = [
    'Player', 'Team', 'PRA_predicted', 'Suggested_Line',
    'Risk_Category', 'Confidence_Score', 'Recommendation'
]

final_betting_sheet = betting_sheet[export_columns].head(30)

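# Save to CSV (create the output directory first if it does not exist yet)
output_path = Path('/Users/diyagamah/Documents/nba_props_model/data/processed')
output_path.mkdir(parents=True, exist_ok=True)
output_file = output_path / 'betting_recommendations.csv'
final_betting_sheet.to_csv(output_file, index=False)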

print("\n💾 EXPORT COMPLETE:")
print("="*50)
print(f"✅ Saved top 30 betting recommendations to:")
print(f"   {output_file}")
print("\n📋 File contains:")
print("   - Player names and teams")
print("   - PRA predictions")
print("   - Suggested betting lines")
print("   - Risk categories")
print("   - Confidence scores")
print("   - Recommendations (Strong/Good/Consider)")

# Show summary statistics
print("\n📊 SUMMARY STATISTICS:")
print("="*50)
print(f"Average PRA Prediction: {final_betting_sheet['PRA_predicted'].mean():.1f}")
print(f"Average Confidence: {final_betting_sheet['Confidence_Score'].mean():.2%}")
print(f"Risk Distribution:")
for risk, count in final_betting_sheet['Risk_Category'].value_counts().items():
    print(f"  - {risk}: {count} players")

## 9. Key Takeaways & Action Items

### 🎯 What You've Learned:
1. **Top PRA Players**: Identified highest projected players for targeting overs
2. **Risk Assessment**: Assigned each player a volatility score (currently simulated, not measured)
3. **Player Archetypes**: Discovered 5 distinct player types
4. **Betting Targets**: Found premium low-risk, high-reward plays
5. **Similar Players**: Can find comparable players for benchmarking

### 📋 Action Items:
1. **Review Premium Plays**: Focus on low-risk, high-PRA players
2. **Check Betting Lines**: Compare predictions to actual bookmaker lines
3. **Monitor Volatility**: Treat high-volatility players with caution, since their results swing more from game to game
4. **Track Performance**: Record actual results to improve model
5. **Daily Updates**: Re-run with latest data for current predictions

### 🚀 Next Steps:
1. Add game-by-game data for better temporal features
2. Include opponent defensive ratings
3. Add injury reports and lineup changes
4. Backtest against historical betting lines
5. Create automated daily prediction pipeline
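
For the backtesting step, a small sketch of the bookkeeping involved: compare each past pick with the actual result, then set the hit rate against the break-even rate implied by the odds. The records below are made up, and pricing every bet at -110 is an assumption; at -110 a bettor risks 110 to win 100, so the break-even win rate is 110 / 210. The helper names are hypothetical.

```python
# Hypothetical history: (pick, line, actual PRA)
history = [
    ('over', 30.5, 34),
    ('over', 25.5, 22),
    ('under', 41.5, 38),
    ('over', 18.5, 21),
    ('under', 28.5, 31),
]

def is_hit(pick, line, actual):
    """True when the actual PRA lands on the side of the line that was picked."""
    return actual > line if pick == 'over' else actual < line

def break_even_rate(american_odds):
    """Win rate needed to break even at negative American odds such as -110."""
    if american_odds >= 0:
        raise ValueError("this sketch only handles negative American odds")
    risk = abs(american_odds)
    return risk / (risk + 100)

hits = sum(is_hit(*bet) for bet in history)
hit_rate = hits / len(history)

print(f"Hit rate:   {hit_rate:.1%} ({hits}/{len(history)})")
print(f"Break-even: {break_even_rate(-110):.1%}")
```

A handful of bets says little either way; a backtest needs a large sample before a hit rate above break-even is meaningful.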