Data Loading & Cleaning:
Loads the dataset and drops the "Unnamed: 0" column that was likely added during CSV export.

In [None]:
import pandas as pd

df = pd.read_csv('CS109 Final Project')
df = df.drop(columns=['Unnamed: 0'])
df

Unnamed: 0,GAME_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_CITY,PLAYER_ID,PLAYER_NAME,NICKNAME,START_POSITION,COMMENT,MIN,...,Conference_x,W_y,L_y,W/L%_y,Conference_y,W,L,W/L%,Conference,Strong
0,20901229,1.610613e+09,GSW,Golden State,201939,Stephen Curry,Stephen,G,,48.0,...,West,50,32,0.610,West,50,32,0.610,West,True
1,20901214,1.610613e+09,GSW,Golden State,201939,Stephen Curry,Stephen,G,,40.0,...,West,53,29,0.646,West,53,29,0.646,West,True
2,20901200,1.610613e+09,GSW,Golden State,201939,Stephen Curry,Stephen,G,,34.0,...,West,50,32,0.610,West,50,32,0.610,West,True
3,20901194,1.610613e+09,GSW,Golden State,201939,Stephen Curry,Stephen,G,,41.0,...,West,29,53,0.354,West,29,53,0.354,West,False
4,20901163,1.610613e+09,GSW,Golden State,201939,Stephen Curry,Stephen,G,,44.0,...,West,15,67,0.183,West,15,67,0.183,West,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1264,22400126,,,,201939,Stephen Curry,,,,0.0,...,West,21,61,0.256,West,21,61,0.256,West,False
1265,22400116,,,,201939,Stephen Curry,,,,0.0,...,West,21,61,0.256,West,21,61,0.256,West,False
1266,22400101,1.610613e+09,GSW,Golden State,201939,Stephen Curry,Stephen,G,,26.0,...,West,50,32,0.610,West,50,32,0.610,West,True
1267,22400084,1.610613e+09,GSW,Golden State,201939,Stephen Curry,Stephen,G,,27.0,...,West,17,65,0.207,West,17,65,0.207,West,False


Additional Computation: Lists all column names in the dataframe for inspection.

In [None]:
df.columns

Index(['GAME_ID', 'TEAM_ID', 'TEAM_ABBREVIATION', 'TEAM_CITY', 'PLAYER_ID',
       'PLAYER_NAME', 'NICKNAME', 'START_POSITION', 'COMMENT', 'MIN', 'FGM',
       'FGA', 'FG_PCT', 'FG3M', 'FG3A', 'FG3_PCT', 'FTM', 'FTA', 'FT_PCT',
       'OREB', 'DREB', 'REB', 'AST', 'STL', 'BLK', 'TO', 'PF', 'PTS',
       'PLUS_MINUS', 'SEASON', 'GAME_DATE', 'MATCHUP', 'WL', 'Home', 'Opp',
       'Steph', 'Win', 'W_x', 'L_x', 'W/L%_x', 'Conference_x', 'W_y', 'L_y',
       'W/L%_y', 'Conference_y', 'W', 'L', 'W/L%', 'Conference', 'Strong'],
      dtype='object')

Grouping Win Rates by Conditions:
This cell computes the mean win rate grouped by whether Steph played, if the game was at home, and whether the opponent was strong.

In [None]:
temp1 = df.groupby(['Steph', 'Home', 'Strong'])['Win'].mean()
temp1 = temp1.reset_index()
temp1

Unnamed: 0,Steph,Home,Strong,Win
0,False,False,False,0.340206
1,False,False,True,0.16129
2,False,True,False,0.563218
3,False,True,True,0.193548
4,True,False,False,0.648199
5,True,False,True,0.361111
6,True,True,False,0.793956
7,True,True,True,0.532468


##Question 1:

**What is the probability of winning based on whether Steph plays, whether it was a home game, and the strength of the opponent?**

This Plotly histogram compares win rates under different conditions:

Faceted by Steph's presence (True/False)

Grouped by home vs. away

Colored by opponent strength



In [None]:
import plotly.express as px

fig = px.histogram(temp1, facet_col='Steph', x='Home', y='Win', color='Strong', barmode='group')
fig.update_layout(yaxis_title='Win Rate')
fig.show()

Analysis:
The plot shows that Steph Curry's presence significantly improves win rates, especially when:

Playing at home

Playing against weaker opponents

**Without Steph, win rates are low against strong teams (16% away, 19% at home). With Steph, win rates are higher in all four situations, peaking at 79% at home against weaker opponents.**
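
The size of the effect can be read directly off the grouped table. The sketch below copies the eight win rates printed above into a small frame and subtracts the "without Steph" rate from the "with Steph" rate for each (Home, Strong) cell. The names `rates` and `lift` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Win rates copied from the grouped table above (Steph, Home, Strong, Win)
rates = pd.DataFrame({
    'Steph':  [False, False, False, False, True, True, True, True],
    'Home':   [False, False, True, True, False, False, True, True],
    'Strong': [False, True, False, True, False, True, False, True],
    'Win':    [0.340206, 0.161290, 0.563218, 0.193548,
               0.648199, 0.361111, 0.793956, 0.532468],
})

# Split into games with and without Steph, then line the cells up side by side
with_steph = rates[rates['Steph']].drop(columns='Steph')
without_steph = rates[~rates['Steph']].drop(columns='Steph')

lift = with_steph.merge(without_steph, on=['Home', 'Strong'],
                        suffixes=('_with', '_without'))

# Difference in win rate (in proportion points) attributable to Steph playing
lift['lift'] = lift['Win_with'] - lift['Win_without']
print(lift)
```

The difference is positive in all four cells, which is the pattern the plot shows visually.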

In [None]:
# Group by season and whether Steph played
grouped = df.groupby(['SEASON', 'Steph']).agg(
    win_rate=('Win', 'mean'),
    games=('Win', 'count')
).reset_index()

grouped['SEASON'] = grouped['SEASON'].str.split('-').str[0]

# Create the plot
fig = px.line(
    grouped,
    x='SEASON',
    y='win_rate',
    color='Steph',
    markers=True,
    line_dash='Steph',
    title='Warriors Win Rate by Season: With vs Without Steph Curry',
    labels={'win_rate': 'Win Rate', 'SEASON': 'Season', 'Steph': 'Steph Played'},
    # SEASON was shortened to its start year above, so order by the shortened labels
    category_orders={'SEASON': sorted(grouped['SEASON'].unique())}
)

fig.update_traces(line=dict(width=3))
fig.update_layout(
    yaxis=dict(range=[0, 1.0]),
    xaxis=dict(title='Season'),
    template='plotly_white',
    legend=dict(title='Steph Played', x=0.01, y=0.99)
)

fig.show()


Analysis: This view tells the same story season by season. Except in 2009, Steph's rookie season, and 2019, when a major injury caused him to miss ~92% of games, the Warriors' win rate is consistently and substantially higher with him.

Another note: In the 2014 season, he played 80 of 82 games and the Warriors lost both games he missed, which produces the 0% win rate without him.
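
A 0% rate from two games says very little on its own. One way to temper such tiny samples, assuming a uniform Beta(1, 1) prior on the win probability (an assumption made here for illustration, not something the notebook does), is to report the posterior mean instead of the raw proportion. The helper `smoothed_win_rate` is hypothetical.

```python
def smoothed_win_rate(wins, games):
    """Posterior mean of the win probability under a uniform Beta(1, 1) prior."""
    return (wins + 1) / (games + 2)

# 2014 without Steph: 0 wins in 2 games
raw_without = 0 / 2
smoothed_without = smoothed_win_rate(0, 2)

# 2014 with Steph: 67 wins in 80 games, derived from the 0.817 season win
# rate (67 of 82) and the two losses in the games he missed
raw_with = 67 / 80
smoothed_with = smoothed_win_rate(67, 80)

print(raw_without, smoothed_without)
print(raw_with, smoothed_with)
```

The two-game estimate moves a lot under smoothing while the 80-game estimate barely changes, which is why the 2014 "without Steph" point should be read with caution.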

##Question 2:

**How would the Warriors perform if Steph played every game?**

Part 1: Compute win percentage each season

In [None]:
temp2 = df.groupby('SEASON')['Win'].mean()
temp2 = temp2.reset_index()
temp2['SEASON'] = temp2['SEASON'].str.split('-').str[0]
temp2

Unnamed: 0,SEASON,Win
0,2009,0.317073
1,2010,0.439024
2,2011,0.348485
3,2012,0.573171
4,2013,0.621951
5,2014,0.817073
6,2015,0.890244
7,2016,0.817073
8,2017,0.707317
9,2018,0.695122


Part 2: Simulate each season 1000 times, assuming that Steph played every game. Each simulated game is conditioned on whether it was a home or away game and whether the opponent was "strong", using the win rates from the games Steph actually played that season. After simulating each season 1000 times, we take the average win percentage.

In [None]:
import numpy as np

# Placeholder for final results
results = []
win_probs = {}

# Get unique seasons
seasons = df['SEASON'].unique()

# Loop through each season
for season in seasons:
    season_df = df[df['SEASON'] == season]
    # Win rate in games Steph played, keyed by (Home, Strong). Building the dict
    # from the groupby avoids relying on row order or on all four combinations
    # being present in the season.
    win_probs = (
        season_df[season_df['Steph']]
        .groupby(['Home', 'Strong'])['Win']
        .mean()
        .to_dict()
    )
    n_games = len(season_df)

    # Store win count for each trial
    win_counts = []

    for _ in range(1000):
        simulated_wins = 0

        # Loop through each game
        for _, row in season_df.iterrows():

            home_game = row['Home']
            opp_strong = row['Strong']

            # Retrieve probability
            prob = win_probs.get((home_game, opp_strong))

            # Simulate game
            win = np.random.rand() < prob
            simulated_wins += int(win)

        win_counts.append(simulated_wins / n_games)  # Proportion of wins

    # Record mean and std for season
    results.append({
        'season': season,
        'mean_win_pct': np.mean(win_counts),
        'std_win_pct': np.std(win_counts)
    })

# Final result as dataframe
simulated_df = pd.DataFrame(results)

# Clean up
simulated_df['season'] = simulated_df['season'].str.split('-').str[0]
simulated_df.rename(columns={'season': 'SEASON'}, inplace=True)
merged_df = simulated_df.merge(temp2, on='SEASON')
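
Because every simulated game is an independent Bernoulli draw, the simulation can be sanity-checked against a closed form: the expected win percentage is the mean of the per-game probabilities, and its standard deviation is the square root of the sum of p(1 - p), divided by the number of games. The sketch below compares the two on a hypothetical 82-game schedule. The probabilities in `example_probs` are made up for illustration, and both helper functions are assumptions rather than part of the notebook.

```python
import numpy as np

def analytic_win_pct(game_probs):
    """Exact mean and std of the season win percentage for independent games."""
    p = np.asarray(game_probs, dtype=float)
    mean = p.mean()
    std = np.sqrt(np.sum(p * (1 - p))) / len(p)
    return mean, std

def simulated_win_pct(game_probs, n_trials=1000, seed=0):
    """Vectorized Monte Carlo version of the per-game simulation loop."""
    p = np.asarray(game_probs, dtype=float)
    rng = np.random.default_rng(seed)
    # One row per simulated season, one column per game
    wins = rng.random((n_trials, len(p))) < p
    win_pct = wins.mean(axis=1)
    return win_pct.mean(), win_pct.std()

# Hypothetical schedule: made-up win probabilities for four (Home, Strong) cells
example_probs = np.array([0.80] * 25 + [0.55] * 16 + [0.65] * 25 + [0.35] * 16)

exact_mean, exact_std = analytic_win_pct(example_probs)
sim_mean, sim_std = simulated_win_pct(example_probs)
print(exact_mean, exact_std)
print(sim_mean, sim_std)
```

The vectorized draw also avoids the nested Python loops, which matters if the number of trials is increased.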

Part 3: Find the proportion of games Steph did not play each season and merge it with the performance data.

In [None]:
steph_notplay = df.groupby('SEASON')['Steph'].mean().reset_index()
steph_notplay['Not Play'] = 1 - steph_notplay['Steph']
steph_notplay['SEASON'] = steph_notplay['SEASON'].str.split('-').str[0]
merged_df = merged_df.merge(steph_notplay, on='SEASON')
merged_df = merged_df.rename(columns={'Win': 'Actual Win Rate', 'mean_win_pct': 'Win Rate if Steph Always Play', 'Not Play': 'Proportion of games Steph did not play'})
merged_df

Unnamed: 0,SEASON,Win Rate if Steph Always Play,std_win_pct,Actual Win Rate,Steph,Proportion of games Steph did not play
0,2009,0.313878,0.049373,0.317073,0.97561,0.02439
1,2010,0.446012,0.046621,0.439024,0.902439,0.097561
2,2011,0.524318,0.056479,0.348485,0.378788,0.621212
3,2012,0.572549,0.049612,0.573171,0.95122,0.04878
4,2013,0.637585,0.047696,0.621951,0.95122,0.04878
5,2014,0.835927,0.037166,0.817073,0.97561,0.02439
6,2015,0.898451,0.032968,0.890244,0.963415,0.036585
7,2016,0.826061,0.038982,0.817073,0.963415,0.036585
8,2017,0.815732,0.043801,0.707317,0.621951,0.378049
9,2018,0.759732,0.043184,0.695122,0.841463,0.158537


Visualizing Trends: Plots win rate, simulation prediction, and Steph's absence rate over time for contextual comparison.

In [None]:
fig = px.line(merged_df, x='SEASON', y=['Win Rate if Steph Always Play', 'Actual Win Rate', 'Proportion of games Steph did not play'])
fig.update_layout(
    legend=dict(
        x=0.02,
        y=0.98,
        bgcolor='rgba(255,255,255,0.6)',
        bordercolor='black',
        borderwidth=1
    )
)
fig.update_layout(xaxis_title="Season", yaxis_title="Win Rate/Proportion")

**Analysis:**

1. **Prediction vs. Reality Gap Aligns with Steph's Absence**
In most seasons, the simulated win rate is higher than the actual win rate, and the gap is largest in the seasons where Steph missed the most games (2011 and 2017 in the table above).
The more Steph is absent, the more the team underperforms relative to what the model estimates they could achieve if he played; his absence correlates directly with underperformance.

2. 2011–12 Season:
Not Play ≈ 0.62 (i.e., Steph missed more than half the games)
Context: This was the NBA lockout-shortened season, and Steph had severe ankle injuries, limiting him to just 26 of 66 games.
Impact: The team finished 23–43. The actual win rate fell far below the model’s estimate, which assumes a healthy Steph — hence, a large prediction gap.

3. 2017–18 Season
Not Play ≈ 0.38 (missed ~38% of games)
Context: Steph suffered multiple ankle sprains and a Grade 2 MCL sprain. He played only 51 of 82 games.
Impact: The Warriors still **made the playoffs and won the title**, but their regular-season record (58–24) fell short of expectations with a healthy Steph, as seen in the gap between predicted and actual win rates.

4. 2019–20 Season
Not Play ≈ 0.92 (missed ~92% of games)
Context: Steph broke his left hand in the fourth game of the season. He played only 5 games.
Impact: The team finished 15–50, the worst record in the league. The gap between the simulated and actual win rates is relatively large, aligning with his near-total absence.

5. 2021–22 Season
Not Play spikes again (~32%)
Context: Steph missed 18 games due to a foot sprain late in the season.
Impact: Though the team performed well (**eventually winning the championship**), regular season performance again lagged behind model expectations, especially during his absence.
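
The claim in item 1 can be checked against the ten seasons printed in the merged table above. The sketch below copies those rows, computes the gap between the simulated and actual win rates, and measures its Pearson correlation with the proportion of games Steph missed. The column names `simulated`, `actual`, `not_play` and `gap` are shortened for illustration.

```python
import pandas as pd

# Rows copied from the merged table above (seasons 2009 through 2018)
summary = pd.DataFrame({
    'SEASON': ['2009', '2010', '2011', '2012', '2013',
               '2014', '2015', '2016', '2017', '2018'],
    'simulated': [0.313878, 0.446012, 0.524318, 0.572549, 0.637585,
                  0.835927, 0.898451, 0.826061, 0.815732, 0.759732],
    'actual': [0.317073, 0.439024, 0.348485, 0.573171, 0.621951,
               0.817073, 0.890244, 0.817073, 0.707317, 0.695122],
    'not_play': [0.024390, 0.097561, 0.621212, 0.048780, 0.048780,
                 0.024390, 0.036585, 0.036585, 0.378049, 0.158537],
})

# Positive gap: the "Steph always plays" simulation beats what actually happened
summary['gap'] = summary['simulated'] - summary['actual']

corr = summary['gap'].corr(summary['not_play'])
largest_gap_season = summary.loc[summary['gap'].idxmax(), 'SEASON']
print(summary[['SEASON', 'not_play', 'gap']])
print(corr, largest_gap_season)
```

With only ten seasons this is a descriptive check rather than a formal test, but it makes the alignment between absence and the prediction gap explicit.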





##Question 3: Can we predict the Warriors' games based solely on Steph's performance and presence?

**Training Full 18-Variable Model**

Trains a Gradient Boosting Classifier using Steph's full game-level stats (e.g., shooting percentages, minutes played, plus-minus, Steph’s presence) to predict whether the Warriors won a given game. This model uses classifiers introduced in CS109 (Lecture 23) and captures how a wide range of contextual variables — centered around Steph — impact game outcomes.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import f1_score

features = ['PTS', 'FGM', 'FGA', 'FG_PCT', 'FG3M', 'FG3A', 'FG3_PCT', 'FTM', 'FTA', 'FT_PCT', 'REB', 'AST', 'MIN', 'PLUS_MINUS', 'Steph', 'W/L%', 'Strong', 'Home']

scalers = {
    'standard': StandardScaler(),
    'minmax': MinMaxScaler(),
    'robust': RobustScaler()
}

for col in features:
    df[col] = pd.to_numeric(df[col], errors='coerce')
X = df[features].fillna(0)
y = df['Win']


# Set up pipeline with placeholder scaler
pipe = Pipeline([
    ('scaler', StandardScaler()),  # this name is key for GridSearch param reference
    ('model', GradientBoostingClassifier())
])

# Define param grid
param_grid = {
    'scaler': list(scalers.values()),  # try different scalers
    'model__n_estimators': [100, 200],
    'model__learning_rate': [0.05, 0.1, 0.2],
    'model__max_depth': [3, 4, 5]
}

# GridSearchCV setup
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='f1_macro', n_jobs=-1)
grid.fit(X, y)

# Output results (best_score_ is the mean cross-validated macro-F1, since scoring='f1_macro')
print("Best macro-F1 (CV):", grid.best_score_)
print("Best Params:", grid.best_params_)

Best macro-F1 (CV): 0.7949746099208126
Best Params: {'model__learning_rate': 0.05, 'model__max_depth': 3, 'model__n_estimators': 200, 'scaler': StandardScaler()}
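
Since the grid search is scored with `f1_macro`, the reported best score is a macro-averaged F1 rather than an accuracy. The two can differ noticeably when one class is predicted better than the other. The toy labels below are made up purely to show the arithmetic.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels: four wins and two losses, with one loss misclassified as a win
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0]

# Accuracy: 5 of 6 predictions are correct
acc = accuracy_score(y_true, y_pred)

# Macro-F1: unweighted mean of the per-class F1 scores
# class 1: F1 = 8/9, class 0: F1 = 2/3, so macro-F1 = 7/9
macro_f1 = f1_score(y_true, y_pred, average='macro')

print(acc, macro_f1)
```

Macro-F1 weights wins and losses equally, so it penalizes a model that does well only on the more common outcome.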


Plots predicted win rate from the full model against the actual win rate. Helps evaluate the model’s ability to reflect real performance, and highlights years where the team underperformed, often due to Steph’s absence.

In [None]:
best_model_18var = grid.best_estimator_
pred = best_model_18var.fit(X, y).predict(X)
df['18var_Pred'] = pred
temp = df.groupby('SEASON')['18var_Pred'].mean().reset_index()
temp['SEASON'] = temp['SEASON'].str.split('-').str[0]
merged_df = merged_df.merge(temp, on='SEASON')
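
Note that the cell above refits the best pipeline on all of `X` and predicts on the same rows, so the season averages are in-sample and may look better than the model would do on unseen games. A possible alternative, sketched here on synthetic data, is to aggregate out-of-fold predictions from `cross_val_predict`. The data, the season labels, and the `_demo` names are all hypothetical stand-ins; in the notebook one would pass its own `X`, `y`, and `SEASON` column.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's (X, y) and SEASON labels
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
seasons_demo = np.repeat(['2009', '2010', '2011', '2012'], 100)

pipe_demo = Pipeline([
    ('scaler', StandardScaler()),
    ('model', GradientBoostingClassifier(n_estimators=50, random_state=0)),
])

# In-sample: fit and predict on the same rows (what the cell above does)
in_sample = pipe_demo.fit(X_demo, y_demo).predict(X_demo)

# Out-of-fold: each row is predicted by a model that never saw it in training
out_of_fold = cross_val_predict(pipe_demo, X_demo, y_demo, cv=5)

comparison = pd.DataFrame({
    'SEASON': seasons_demo,
    'actual': y_demo,
    'in_sample': in_sample,
    'out_of_fold': out_of_fold,
}).groupby('SEASON').mean()
print(comparison)
```

Comparing the `in_sample` and `out_of_fold` columns against `actual` gives a rough sense of how much of the close fit comes from the model having already seen each game.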

In [None]:
fig = px.line(merged_df, x='SEASON', y=['Actual Win Rate', '18var_Pred'])
fig.update_layout(xaxis_title="Season", yaxis_title="Win Rate")
fig.show()

**Analysis:**
The 18-variable model performs well across most seasons, accurately estimating win rate based on detailed inputs like FG%, minutes, assist totals, Steph’s presence, home court, and opponent strength. When all these signals are consistent, such as in 2015–16 or 2022–23, the model closely mirrors reality.

**Why the gaps in the graph?**


**2011–12 and 2012–13 Seasons:**
Gap: ~2–3% underprediction.
Steph's availability was moderate to high in 2012–13 (played 78 games) but limited in 2011–12 (only 26 of 66 games due to ankle injury).
In 2012–13, the team outperformed expectations thanks to:

1. The emergence of Klay Thompson and strong support from David Lee and Andrew Bogut.
2. A shift toward the fast-paced, three-point-heavy style that would define the Warriors era.

In 2011–12, despite a short season and Steph's absence, the team had moments of overperformance.

Conclusion: The model may not fully capture roster turning points or early-stage chemistry that foreshadowed the Warriors dynasty. These seasons reflect a transitional phase that statistical patterns alone may not fully anticipate.

**2017–18 and 2016–17 Seasons:**
Gap: ~3% underprediction each year
Steph played most games, yet the model consistently predicts a slightly lower win rate than actual.
Reason: The model does not account for Kevin Durant, who joined the Warriors in 2016–17.

1. KD’s presence dramatically increased the team’s offensive and defensive efficiency.

2. Because the model centers Steph and doesn’t include KD explicitly as a feature, it underestimates team strength during this superteam era.

Conclusion: These gaps highlight the effect of missing features — the model can’t account for elite teammates not explicitly coded.


In [None]:
fig = px.line(merged_df, x='SEASON', y=['Actual Win Rate', '18var_Pred'])
fig.update_layout(xaxis_title="Season", yaxis_title="Win Rate")
fig.show()

**2020–21 Season:**
Predicted: ~60%
Actual: 64%
Largest gap in prediction. Steph played 63 of 72 games and led the league in scoring.
The model underestimates the Warriors' actual win rate, likely because:

1. Steph’s individual brilliance is not fully captured by averaged inputs.

2. Supporting cast (Poole, Wiggins, Draymond) outperformed expectations.

3. Late-season momentum and close-game wins aren't emphasized in the model.

Conclusion: This shows a limitation in the model’s ability to capture intangible or nonlinear improvements driven by superstar performance.



**2022–23 Season – Model Overpredicts by ~3%**
Predicted win rate : ~0.57
Actual win rate: 0.53
Steph’s availability: ~56 games played (missed ~25)

📉 Why the model was too optimistic:
1. Steph's stats were strong when he played
The 18-variable model sees his games as high-quality inputs: solid shooting, high minutes, good on/off performance.
It uses these to predict that if the rest of the season had looked like his games, the Warriors would’ve won more often.

2. The team struggled without him — badly
Road record: 11–30 — one of the worst in the league.
The model doesn't fully capture how steep the drop-off was in his absence (or that road struggles were systemic and not player-level).
Even with Draymond, Klay, and Poole active, the team often lost winnable games without Steph’s floor presence and leadership.

3. Chemistry and inconsistency
Off-court issues (e.g., Draymond–Poole incident) and a younger bench rotation introduced volatility that wasn't present in prior years.
The model treats each game as independent, but team dynamics across a season (e.g., lineup instability) created compounding effects not reflected in in-game stats.

It reveals that even when Steph plays a fair number of games, the model can be misled if the team's actual performance is dragged down by contextual instability that isn’t visible in stats: locker room friction, defensive drop-offs, or poorly managed rotations.

In [None]:
features = ['MIN', 'W/L%', 'Strong', 'Home']

scalers = {
    'standard': StandardScaler(),
    'minmax': MinMaxScaler(),
    'robust': RobustScaler()
}

for col in features:
    df[col] = pd.to_numeric(df[col], errors='coerce')
X = df[features].fillna(0)
y = df['Win']


# Set up pipeline with placeholder scaler
pipe = Pipeline([
    ('scaler', StandardScaler()),  # this name is key for GridSearch param reference
    ('model', GradientBoostingClassifier())
])

# Define param grid
param_grid = {
    'scaler': list(scalers.values()),  # try different scalers
    'model__n_estimators': [100, 200],
    'model__learning_rate': [0.05, 0.1, 0.2],
    'model__max_depth': [3, 4, 5]
}

# GridSearchCV setup
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='f1_macro', n_jobs=-1)
grid.fit(X, y)

# Output results (best_score_ is the mean cross-validated macro-F1, since scoring='f1_macro')
print("Best macro-F1 (CV):", grid.best_score_)
print("Best Params:", grid.best_params_)

Best macro-F1 (CV): 0.6284470823498214
Best Params: {'model__learning_rate': 0.05, 'model__max_depth': 3, 'model__n_estimators': 100, 'scaler': StandardScaler()}


In [None]:
best_model_4var = grid.best_estimator_
pred = best_model_4var.fit(X, y).predict(X)
df['4var_Pred'] = pred
temp = df.groupby('SEASON')['4var_Pred'].mean().reset_index()
temp['SEASON'] = temp['SEASON'].str.split('-').str[0]
merged_df = merged_df.merge(temp, on='SEASON')

**The 4-var model**, which only considers Steph's minutes (`MIN`, which is 0 in games he missed and so doubles as a presence indicator), W/L%, Home, and Strong, exhibits slightly more smoothing and misses some of the more complex year-to-year fluctuations.

However, the 4-var model still captures key trends: its predictions are consistently lower in years Steph was absent, and higher when he was present, validating its usefulness for isolating Steph’s macro-level impact.

In [None]:
fig = px.line(merged_df, x='SEASON', y=['Actual Win Rate', '4var_Pred'])
fig.update_layout(xaxis_title="Season", yaxis_title="Win Rate")
fig.show()

The model is deliberately simpler, and while it happens to predict some seasons well, it shows that we cannot predict the Warriors' games from Steph's presence alone. Here are a few notable observations:

1. 2021–22 (Warriors championship year)
Overpredicts by ~10%
This is surprising at first — they won the title!
But regular season performance was solid, not elite, due to Steph’s late-season foot injury and a cautious ramp-up to playoffs.
The model expects domination just because Steph is present, leading to overconfidence.

2. 2022–23 and 2023–24
Overprediction explodes (~18–24%)
Team had severe chemistry issues, poor road performance, and major defensive lapses.
But the model just sees “Steph played” and thinks that’s enough.
The widening gap reflects the cost of not modeling decline, aging, or roster dynamics.


##Conclusion

This project set out to answer a deceptively simple question: How much does Steph Curry actually matter to the Golden State Warriors’ success? Using tools developed in CS109 — from probability and expectation to classification and model comparison — we approached this question not just with highlights and headlines, but with structured data and statistical reasoning.


**Key Findings**
1. Steph’s presence dramatically lifts win probability — especially at home and against weaker teams. This shows up consistently in both grouped stats and model predictions.

2. Given its relative simplicity and its reliance on Steph's stats alone, the 18-variable model predicts season performance well when the roster is stable and Steph plays regularly. Gaps emerge during injury-heavy or chaotic seasons, not because the model is wrong, but because reality deviates from average patterns. Notably:
It underpredicted MVP-caliber years like 2021.
It overpredicted in years like 2023 when team chemistry collapsed despite Steph playing.

3. The 4-variable model, while interpretable, overestimates the team's performance starting in 2018, eventually reaching a 24% gap in 2024. Why? Because it treats Steph's presence as an automatic boost, ignoring decline, roster changes, and context. It suggests that no player, not even Steph, can win alone.