In [0]:
# Load the related datasets from S3
df_pitstops = spark.read.csv('s3://columbia-gr5069-main/raw/pit_stops.csv', header=True)
df_results = spark.read.csv('s3://columbia-gr5069-main/raw/results.csv', header=True)
df_drivers = spark.read.csv('s3://columbia-gr5069-main/raw/drivers.csv', header=True)
df_races = spark.read.csv('s3://columbia-gr5069-main/raw/races.csv', header=True)
df_laptimes = spark.read.csv('s3://columbia-gr5069-main/raw/lap_times.csv', header=True)
df_sprint_results = spark.read.csv('s3://columbia-gr5069-main/raw/sprint_results.csv', header=True)

1. [20 pts] Build any model of your choice with tunable hyperparameters

In [0]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Goal: predict whether a driver scores any points in a sprint race (points > 0)
# Load and prepare the data
df = df_sprint_results.select("grid", "positionOrder", "laps", "milliseconds", "points").toPandas()
# Replace the literal string '\N' (used for missing values) with pandas NA, then drop those rows
df = df.replace("\\N", pd.NA)
df = df.dropna()
# Convert all features to numeric
for col in ["grid", "positionOrder", "laps", "milliseconds", "points"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")
df = df.dropna()  # drop rows where conversion failed
df['scored'] = (df['points'] > 0).astype(int)
# grid - starting position
# positionOrder - finishing position
# laps - number of laps in the sprint
# milliseconds - time taken to finish
X = df[['grid', 'positionOrder', 'laps', 'milliseconds']]
y = df['scored']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Model: I choose logistic regression for yes/no type prediction
model = LogisticRegression(C=1.0, max_iter=200)  # with tunable C (regularization strength)
model.fit(X_train, y_train)



2. [20 pts] Create an experiment setup where - for each run - you log:
- the hyperparameters used in the model
- the model itself
- every possible metric from the model you chose
- at least two artifacts (plots, or csv files)

### Q2: MLflow Experiment Setup Summary

For this experiment, I used **Logistic Regression** to predict whether a driver scores any points in a sprint race.

I set up MLflow tracking to log:
- **Hyperparameters:**
  - `C = 1.0`
  - `max_iter = 200`
  - `model_type = LogisticRegression`

- **Metrics:**
  - Accuracy = 0.70
  - F1 Score = 0.00 (the model made no positive predictions, so it never identified the minority class)

- **Artifacts:**
  - `conf_matrix.png`: Confusion matrix plot (30 True Negatives, 13 False Negatives)
  - `roc_curve.png`: ROC Curve with AUC = 0.46

- **Model:**
  - Logged using `mlflow.sklearn.log_model(...)`
  - Appears under `log_reg_model` with full environment info (`.pkl`, `conda.yaml`, etc.)
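
As a quick check on the numbers above, the sketch below recomputes accuracy, precision, recall, and F1 directly from the confusion-matrix counts reported for this run (30 true negatives, 13 false negatives, no positive predictions). The helper name `metrics_from_counts` is my own and is not part of scikit-learn or MLflow; it is only meant to show why accuracy stays near 0.70 while F1 drops to 0.

```python
def metrics_from_counts(tp, fp, tn, fn):
    """Compute basic binary-classification metrics from confusion-matrix counts.

    Undefined ratios (0/0) are reported as 0.0, mirroring zero_division=0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
    }


# Counts taken from the confusion matrix logged in this run
metrics = metrics_from_counts(tp=0, fp=0, tn=30, fn=13)
print({name: round(value, 3) for name, value in metrics.items()})
```

With these counts, accuracy is 30/43 (about 0.698) and the other three metrics are 0. A dictionary like this could be passed to `mlflow.log_metrics(...)` inside the run to record precision and recall next to the two metrics already logged.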

In [0]:
import mlflow
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, roc_curve, auc

# Start MLflow run
with mlflow.start_run():
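    # Read the hyperparameters from the fitted model so the logged values
    # always match what was actually trained
    mlflow.log_param("model_type", type(model).__name__)
    mlflow.log_param("C", model.C)
    mlflow.log_param("max_iter", model.max_iter)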

    # Log model
    mlflow.sklearn.log_model(model, "log_reg_model")

    # Predict
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]

    # Log metrics
    mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
    mlflow.log_metric("f1_score", f1_score(y_test, y_pred, zero_division=0))

    # === Artifact 1: Confusion Matrix ===
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title("Confusion Matrix")
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.savefig("conf_matrix.png")
    mlflow.log_artifact("conf_matrix.png")
    plt.close()

    # === Artifact 2: ROC Curve ===
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    auc_score = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f"AUC = {auc_score:.2f}")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("ROC Curve")
    plt.legend()
    plt.savefig("roc_curve.png")
    mlflow.log_artifact("roc_curve.png")
    plt.close()



3. [20 pts] Track your MLFlow experiment and run at least 10 experiments with different parameters each

### Q3: Running Multiple Experiments

To fulfill the requirement of running at least 10 experiments, I used a `for` loop to test different values of the hyperparameter `C` in Logistic Regression:

`C = [0.01, 0.1, 0.5, 1, 2, 5, 10, 20, 50, 100]`  
(I ran the loop twice while testing, for 20 runs in total.)

For each run, I logged:
- The model type and hyperparameter `C`
- Performance metrics: Accuracy and F1 Score
- The trained model itself using `mlflow.sklearn.log_model(...)`
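
A minimal sketch of how 20 runs could each use a distinct configuration instead of repeating the same ten values. It assumes a second hyperparameter, `class_weight`, which is not varied in the runs reported here, and the helper name `build_param_grid` is hypothetical.

```python
from itertools import product


def build_param_grid(c_values, class_weights):
    """Return one parameter dict per (C, class_weight) combination."""
    return [
        {"C": c, "class_weight": cw, "max_iter": 200}
        for c, cw in product(c_values, class_weights)
    ]


C_values = [0.01, 0.1, 0.5, 1, 2, 5, 10, 20, 50, 100]
# Assumed second dimension for illustration; not part of the reported runs
class_weights = [None, "balanced"]

param_grid = build_param_grid(C_values, class_weights)
print(len(param_grid))  # 10 C values x 2 class weights
print(param_grid[0])
```

Each dictionary could then be unpacked into `LogisticRegression(**params)` and logged in one call with `mlflow.log_params(params)` inside the run.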

In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
import mlflow

C_values = [0.01, 0.1, 0.5, 1, 2, 5, 10, 20, 50, 100]

for c in C_values:
    model = LogisticRegression(C=c, max_iter=200)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, zero_division=0)

    with mlflow.start_run():
        mlflow.log_param("model_type", "LogisticRegression")
        mlflow.log_param("C", c)
        mlflow.log_metric("accuracy", acc)
        mlflow.log_metric("f1_score", f1)
        mlflow.sklearn.log_model(model, "log_reg_model")



4. [20 pts] Select your best model run and explain why

### Q4: Best Model Selection

I compared 22 Logistic Regression runs in MLflow on their ability to correctly identify whether a driver would score sprint points.

**Findings:**
- All models achieved an accuracy of 69.8%
- However, all F1 scores on the test set were 0.00
- This indicates that no model correctly predicted the positive class (scored = 1)

**Conclusion:**
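Since all models had F1 = 0, none of them can be considered successful at identifying drivers who scored points. The shared accuracy of 69.8% is what a model gets by always predicting the negative class: with the 30 true negatives and 13 false negatives from the Q2 confusion matrix, 30/43 ≈ 69.8%. This points to a **class imbalance problem** in the dataset, where most examples are from the negative class (scored = 0).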
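
If a single run still has to be nominated when every run ties on the metrics, an explicit tie-breaking rule is needed. The sketch below is one possible rule, not the one used in this assignment: rank by F1, then accuracy, then prefer the smallest `C` (the most strongly regularized model). The run records are rebuilt from the findings above for the ten distinct `C` values, and `select_best_run` is a hypothetical helper.

```python
# Run records rebuilt from the reported findings: every run scored the same.
C_values = [0.01, 0.1, 0.5, 1, 2, 5, 10, 20, 50, 100]
runs = [{"C": c, "accuracy": 0.698, "f1_score": 0.0} for c in C_values]


def select_best_run(runs):
    """Rank by F1, then accuracy (both descending), then smallest C.

    Preferring the smallest C (strongest regularization) is an assumed
    convention for breaking ties, not something the experiment dictates.
    """
    return min(runs, key=lambda r: (-r["f1_score"], -r["accuracy"], r["C"]))


best = select_best_run(runs)
print(best)
```

Because all metrics are identical here, the rule falls through to the last key and returns the `C = 0.01` run, which says more about the tie-breaker than about model quality.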

**Next Steps:**
To improve future model performance, I would:
- Try `class_weight="balanced"` in Logistic Regression
- Explore more robust classifiers like Random Forest or XGBoost
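
To make the first next step concrete, here is a small sketch of the weighting that scikit-learn documents for `class_weight="balanced"`: `n_samples / (n_classes * count_of_class)`. The label counts are illustrative only; they reuse the 30/13 split from the test-set confusion matrix, and the training split will have different counts. The helper `balanced_class_weights` is my own name, not a library function.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Illustrative labels only: 30 negatives and 13 positives
labels = [0] * 30 + [1] * 13
weights = balanced_class_weights(labels)
print({cls: round(w, 3) for cls, w in weights.items()})
```

With this split, a positive example would be weighted roughly 2.3 times as heavily as a negative one (43/26 versus 43/60), which penalizes the always-negative solution during training.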