# Modeling: Predicting High-Scoring Games (20+ Points)

## Objective

This notebook demonstrates a reproducible machine learning workflow using an analysis-ready dataset derived 
from NBA play-by-play data.

We build baseline predictive models to classify whether a player will score **20+ points** in a game.
The emphasis is on a correct ML workflow (feature preparation, splitting, baselines, evaluation), not on
maximizing accuracy.

## Dataset

Input data is the processed feature table exported from the EDA pipeline:

- `data/processed/player_game_features.csv`
    
Each row represents a **player-game** observation with engineered metrics such as:

- `final_points` (scoring outcome)
- `rebound_events` (from play-by-play)
- season context (`season_id`, `season_type`)

In [1]:
import pandas as pd 
import numpy as np 

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

## Load Processed Data 

We load the feature table produced during the EDA phase. This notebook does not depend on the raw SQLite file, making it lightweight and reproducible for GitHub.

In [2]:
df = pd.read_csv("../data/processed/player_game_features.csv")
df.head()

Unnamed: 0,game_id,player1_id,final_points,full_name,rebound_events,season_id,season_type,game_date
0,11300001,200757,8,Thabo Sefolosha,2.0,12013,Pre Season,2013-10-05 00:00:00
1,11300001,201142,24,Kevin Durant,8.0,12013,Pre Season,2013-10-05 00:00:00
2,11300001,201586,15,Serge Ibaka,6.0,12013,Pre Season,2013-10-05 00:00:00
3,11300001,201934,4,Hasheem Thabeet,5.0,12013,Pre Season,2013-10-05 00:00:00
4,11300001,202704,9,Reggie Jackson,2.0,12013,Pre Season,2013-10-05 00:00:00


## Quick Validation

We verify that the required columns exist and check basic missingness.

In [3]:
required_cols = ["game_id", "player1_id", "full_name", "final_points", "rebound_events", "season_id", "season_type"]
missing = [c for c in required_cols if c not in df.columns]
if missing:
    raise ValueError(f"Missing required columns: {missing}")

df[required_cols].isna().mean().sort_values(ascending=False)

full_name         0.001403
game_id           0.000000
player1_id        0.000000
final_points      0.000000
rebound_events    0.000000
season_id         0.000000
season_type       0.000000
dtype: float64

#### Validation Result

The dataset shows:

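- No missing values in the columns used for modeling (`final_points`, `rebound_events`, `season_id`, `season_type`)
- Season metadata is present on every row, which suggests the season join left no unmatched rows
- About 0.14% of rows are missing `full_name`, which is not used as a feature

These checks are consistent with the feature engineering and merges having worked as intended.  
We proceed to modeling on that basis.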
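
Null rates do not reveal duplicated rows, which a many-to-many merge would typically produce. One complementary check, sketched here on a small made-up frame (`toy` and its values are hypothetical, not the real data), is to confirm that each (`game_id`, `player1_id`) pair appears only once:

```python
import pandas as pd

# Hypothetical toy frame standing in for the real feature table.
toy = pd.DataFrame({
    "game_id": [11300001, 11300001, 11300001, 11300002],
    "player1_id": [200757, 201142, 201142, 200757],
    "final_points": [8, 24, 24, 12],
})

key = ["game_id", "player1_id"]

# keep=False flags every row that shares its key with another row.
dup_mask = toy.duplicated(subset=key, keep=False)
n_duplicate_rows = int(dup_mask.sum())

print(f"rows sharing a player-game key: {n_duplicate_rows}")
print(toy[dup_mask])
```

On the real table, a non-zero count would be a reason to revisit the merges before modeling.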


## Define Target: 20+ Points

We define a binary classification target:

- `1` if the player scored **20 or more points**
- `0` otherwise

In [5]:
df["target_20plus"] = (df["final_points"] >= 20).astype(int)

# Keep rows with essential season context
df = df.dropna(subset=["season_id", "season_type"])

# Store identifiers and season fields as strings so they are treated as categories
df["player1_id"] = df["player1_id"].astype(str)
df["season_id"] = df["season_id"].astype(str)
df["season_type"] = df["season_type"].astype(str)

df[["final_points", "rebound_events", "target_20plus"]].head()

Unnamed: 0,final_points,rebound_events,target_20plus
0,8,2.0,0
1,24,8.0,1
2,15,6.0,0
3,4,5.0,0
4,9,2.0,0
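
Before modeling, it is worth checking how often the target is positive, since a rare positive class changes how accuracy should be read. A minimal sketch on made-up point totals (the `points` series is hypothetical; the notebook itself would use `df["final_points"]`):

```python
import pandas as pd

# Hypothetical point totals, for illustration only.
points = pd.Series([8, 24, 15, 4, 9, 31, 20, 2, 11, 19])

# Exactly 20 points counts as positive because the comparison is >=.
target = (points >= 20).astype(int)

# Share of each class; the mean of a 0/1 target is the positive rate.
class_share = target.value_counts(normalize=True).sort_index()
positive_rate = target.mean()

print(class_share)
print(f"positive rate: {positive_rate:.2f}")
```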


## Feature Set & Train/Test Split

### Features (simple + realistic)

- `rebound_events` (numeric)
- `season_id` (categorical)
- `season_type` (categorical)

> Note: We intentionally **exclude `player1_id`** so the model cannot simply memorize player identity.
> This should improve generalization to unseen players and better reflects real predictive modeling.

We split with stratification to preserve the class balance in both the training and test sets.

In [8]:
features_num = ["rebound_events"]
features_cat = ["season_id", "season_type"]

X = df[features_num + features_cat]
y = df["target_20plus"]

X_train, X_test, y_train, y_test = train_test_split(
    X,y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

X_train.shape, X_test.shape

((436301, 3), (109076, 3))
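
To see what stratification buys, one can compare the positive rate in each split. The sketch below uses synthetic labels with roughly 15% positives (`X_demo` and `y_demo` are made up for illustration and do not come from the feature table):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly 15% positives, mimicking the imbalance here.
rng = np.random.default_rng(0)
y_demo = (rng.random(10_000) < 0.15).astype(int)
X_demo = rng.normal(size=(10_000, 2))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify set, the positive rate should match closely across splits.
print(f"train positive rate: {y_tr.mean():.4f}")
print(f"test positive rate:  {y_te.mean():.4f}")
```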

## Baseline Model: Logistic Regression

Logistic Regression is a strong, interpretable baseline for binary classification. 
We use one-hot encoding for categorical variables via a pipeline.
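
As a small illustration of the encoding step, the sketch below shows how `OneHotEncoder(handle_unknown="ignore")` treats a category that was not present when the encoder was fitted. The category values are chosen for illustration only:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["Pre Season"], ["Regular Season"]]))

# "Playoffs" was never seen during fit, so its row is encoded as all zeros
# instead of raising an error.
encoded = encoder.transform(
    np.array([["Regular Season"], ["Playoffs"]])
).toarray()

print(encoder.categories_)
print(encoded)
```

This matters at prediction time: a season or season type that appears only in the test set does not break the pipeline, although the model has no learned coefficient for it.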

In [11]:
preprocess = ColumnTransformer(
    transformers=[
        ("num", "passthrough", features_num),
        ("cat", OneHotEncoder(handle_unknown="ignore"), features_cat),
    ]
)

log_reg = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("model", LogisticRegression(max_iter=300)),
    ]
)

log_reg.fit(X_train, y_train)

# Hard class predictions (default 0.5 threshold) and positive-class probabilities
y_pred = log_reg.predict(X_test)
y_prob = log_reg.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))
print("ROC AUC :", roc_auc_score(y_test, y_prob))

              precision    recall  f1-score   support

           0       0.86      0.99      0.92     93110
           1       0.46      0.03      0.06     15966

    accuracy                           0.85    109076
   macro avg       0.66      0.51      0.49    109076
weighted avg       0.80      0.85      0.79    109076

ROC AUC : 0.6961319714768709


#### Model Evaluation

We trained a Logistic Regression classifier to predict whether a player scores **20+ points** in a game using basic contextual and performance features.

##### Key observations

- Dataset is imbalanced (~85% of player-games are under 20 points)
- High overall accuracy (0.85) is misleading due to class imbalance
- Strong performance for the majority class (games under 20 points)
- Weak recall (0.03) for 20+ point games, so the model misses most high-scoring performances
- ROC AUC ≈ 0.70 indicates moderate predictive signal
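
To make the accuracy point concrete, the following sketch computes what a constant "always under 20" rule would score on the test set, using only the class counts printed in the classification report:

```python
# Test-set class counts taken from the classification report above.
n_negative = 93110  # games under 20 points
n_positive = 15966  # games with 20+ points
n_total = n_negative + n_positive

# A rule that always predicts "under 20" gets every negative right
# and every positive wrong.
majority_accuracy = n_negative / n_total
majority_recall_positive = 0 / n_positive

print(f"always-negative accuracy: {majority_accuracy:.4f}")
print(f"always-negative recall (20+): {majority_recall_positive:.2f}")
```

The constant rule reaches roughly the same accuracy as the fitted model, which is why accuracy alone says little here.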

##### Interpretation

Rebounding activity and seasonal context alone provide limited predictive power for scoring output.  
While the model captures some signal, additional features (minutes played, shot attempts, assists, etc.) are needed to meaningfully improve detection of high-scoring games.

##### Next steps

Future improvements may include:
- adding richer box-score features
- using class balancing techniques
- testing non-linear models (Random Forest / Gradient Boosting)
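
As a rough illustration of the class balancing idea, the sketch below compares an unweighted logistic regression with `class_weight="balanced"` on synthetic data. The data are made up (one weakly informative feature, about 15% positives) and the result is not a prediction of what would happen on the real feature table:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: positives are shifted by one unit on a single feature.
rng = np.random.default_rng(42)
y_demo = (rng.random(20_000) < 0.15).astype(int)
X_demo = rng.normal(loc=y_demo, scale=1.0).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

recalls = {}
for label, weight in [("unweighted", None), ("balanced", "balanced")]:
    model = LogisticRegression(max_iter=300, class_weight=weight)
    model.fit(X_tr, y_tr)
    # Recall of the positive (minority) class on the held-out split.
    recalls[label] = recall_score(y_te, model.predict(X_te))

print(recalls)
```

Reweighting typically raises recall for the minority class at the cost of more false positives, so precision should be tracked alongside it.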


## Non-Linear Model: Random Forest

Random Forests can capture non-linear relationships and feature interactions.
This model provides a strong benchmark without heavy tuning.

In [14]:
rf = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("model", RandomForestClassifier(
            n_estimators=300,
            random_state=42,
            n_jobs=-1,  # use all available CPU cores
        )),
    ]
)

rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)
y_prob_rf = rf.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred_rf))
print("ROC AUC:", roc_auc_score(y_test, y_prob_rf))

              precision    recall  f1-score   support

           0       0.85      1.00      0.92     93110
           1       0.49      0.01      0.02     15966

    accuracy                           0.85    109076
   macro avg       0.67      0.50      0.47    109076
weighted avg       0.80      0.85      0.79    109076

ROC AUC: 0.6942692096766202


#### Random Forest Benchmark

We evaluated a Random Forest classifier to capture potential non-linear relationships and feature interactions.

##### Results

- Accuracy: 0.85 (inflated by class imbalance)
- Recall (20+ pts): 0.01 → most high-scoring games missed
- ROC AUC ≈ 0.69 (similar to Logistic Regression)

##### Interpretation

Performance is nearly identical to Logistic Regression.  
This suggests that the current feature set (rebounds + seasonal context) provides limited predictive signal.

Model complexity does not improve results, indicating that **feature engineering is likely more important than model choice** at this stage.

##### Conclusion

Future improvements should prioritize:
- richer box-score features (minutes, shots, assists)
- rolling averages
- contextual game variables

rather than additional model tuning.
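
One way to build the rolling averages mentioned above without leaking the current game's outcome is to shift before rolling. The sketch below uses a made-up frame (`games`, its values, and the column name `pts_prev3_avg` are hypothetical):

```python
import pandas as pd

# Hypothetical per-game points for two players.
games = pd.DataFrame({
    "player1_id": ["A", "A", "A", "A", "B", "B", "B"],
    "game_date": pd.to_datetime([
        "2013-10-05", "2013-10-07", "2013-10-09", "2013-10-11",
        "2013-10-05", "2013-10-08", "2013-10-10",
    ]),
    "final_points": [10, 20, 30, 40, 5, 15, 25],
})

games = games.sort_values(["player1_id", "game_date"])

# shift(1) drops the current game, so each row only sees earlier games.
games["pts_prev3_avg"] = (
    games.groupby("player1_id")["final_points"]
    .transform(lambda s: s.shift(1).rolling(window=3, min_periods=1).mean())
)

print(games)
```

Each player's first game has no history, so its rolling value is missing and would need imputing or dropping before modeling.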


## Confusion Matrix

We examine the confusion matrix to understand the types of errors made by the Random Forest model.

In [15]:
cm = confusion_matrix(y_test, y_pred_rf)
cm

array([[92908,   202],
       [15773,   193]], dtype=int64)

#### Confusion Matrix Analysis


##### Observations

- True Negatives dominate due to class imbalance
- Very few False Positives → model is conservative
- Large number of False Negatives → most 20+ point games are missed
- Only ~1% recall for the positive class
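
The observations above can be checked directly from the four cells of the matrix. The sketch below recomputes the headline metrics from those counts:

```python
# Counts from the confusion matrix above (rows: actual, columns: predicted).
tn, fp = 92908, 202
fn, tp = 15773, 193

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of predicted 20+ games, how many were right
recall = tp / (tp + fn)     # of actual 20+ games, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

The rounded values agree with the Random Forest classification report (precision 0.49, recall 0.01, F1 0.02 for the positive class).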

##### Interpretation

Although overall accuracy is high (≈85%), the model fails to detect most high-scoring performances. This indicates that:

- current features provide weak predictive signal
- class imbalance biases predictions toward the majority class
- accuracy is not an appropriate metric for this task

##### Conclusion

Improvement should focus on:
- richer player features
- class balancing techniques
- threshold tuning

rather than model complexity alone.
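
Threshold tuning can be sketched as follows. The data are synthetic (one weakly informative feature, about 15% positives) and the thresholds are arbitrary illustrative values, so this shows only the mechanics, not results for the real feature table:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
y_demo = (rng.random(20_000) < 0.15).astype(int)
X_demo = rng.normal(loc=y_demo, scale=1.0).reshape(-1, 1)

# Simple holdout: first 16,000 rows to fit, the rest to evaluate.
X_tr, X_te = X_demo[:16_000], X_demo[16_000:]
y_tr, y_te = y_demo[:16_000], y_demo[16_000:]

model = LogisticRegression(max_iter=300).fit(X_tr, y_tr)
prob = model.predict_proba(X_te)[:, 1]

results = {}
for threshold in (0.5, 0.3, 0.15):
    # Predict positive whenever the probability reaches the threshold.
    pred = (prob >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, pred, zero_division=0),
        recall_score(y_te, pred),
    )
    print(f"threshold={threshold:.2f} "
          f"precision={results[threshold][0]:.2f} "
          f"recall={results[threshold][1]:.2f}")
```

Lowering the threshold can only keep or increase recall, usually at the expense of precision. In practice the threshold should be chosen on a validation split rather than on the test set.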


# Modeling Summary

## Objective
Predict whether a player scores **20+ points in a game** using play-by-play derived features:

- Rebound events
- Season context (`season_id`, `season_type`)

Points are used only to define the target:
`target_20plus = 1 if final_points ≥ 20, else 0`

---

## Models Tested
- Logistic Regression (baseline linear model)
- Random Forest (non-linear ensemble)

---

## Results

Both models show similar performance:

- Accuracy ≈ 85%
- ROC-AUC ≈ 0.69–0.70
- Very low recall for 20+ games (0.03 for Logistic Regression, 0.01 for Random Forest)

The Random Forest confusion matrix shows that most high-scoring games are **missed**.

---

## Interpretation

- Dataset is **imbalanced** (only about 15% of player-games reach 20+ points)
- Current features provide **limited predictive signal**
- Accuracy is misleading because both models favor the majority class

Performance is constrained more by **feature quality than algorithm choice**.

---

## Future Improvements

Main opportunities:

- Add rolling stats (averages over the last 5–10 games)
- Include assists, minutes, shot attempts
- Add team/opponent context
- Handle imbalance (class weights / SMOTE)
- Try boosting models (XGBoost / LightGBM)
- Use time-based validation
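
For the time-based validation item, a minimal sketch is to split on a date cutoff so that the model is always evaluated on games played after the ones it was trained on. The frame `toy` and the cutoff date are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical rows; the real table has a game_date column.
toy = pd.DataFrame({
    "game_date": pd.to_datetime([
        "2013-10-05", "2013-11-01", "2014-01-15",
        "2014-03-20", "2014-10-30", "2015-01-10",
    ]),
    "target_20plus": [0, 1, 0, 0, 1, 0],
})

cutoff = pd.Timestamp("2014-07-01")  # assumed cutoff, chosen for illustration

# Train on everything before the cutoff, test on everything from it onward.
train = toy[toy["game_date"] < cutoff]
test = toy[toy["game_date"] >= cutoff]

print(len(train), "training rows before", cutoff.date())
print(len(test), "test rows on or after", cutoff.date())
```

Unlike a random split, this keeps future games out of training, which matters once features such as rolling averages depend on game order.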

---

## Conclusion

The pipeline successfully demonstrates:

- Feature engineering  
- Clean modeling workflow  
- Reproducible experiments  

With richer basketball features, predictive performance is likely to improve, though this remains to be tested.
