# Data Challenge 11 — Evaluating MLR & Fixing Multicollinearity (HVFHV Trips)


**Format:** Instructor Guidance → You Do (Students) → We Share (Reflection)

**Goal:** Build an MLR, evaluate it with a **train–test split**, diagnose multicollinearity with **corr** and **VIF** on the **training set**, fix issues (drop/choose features), and report **test MAE/RMSE** + **coefficient interpretations**.

**Data:** NYC High Volume For-Hire Vehicle (HVFHV) trip data, July 1 to July 15, 2023

[July For Hire Vehicles Data](https://data.cityofnewyork.us/Transportation/2023-High-Volume-FHV-Trip-Data/u253-aew4/about_data)


## Instructor Guidance

**Hint: Use the Lecture Deck, Canvas Reading, and Docs to help you with the code**

Use this guide live; students implement below.

**Docs (quick links):**
- Train/Test Split — scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
- OLS — statsmodels: https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html
- OLS Results (rsquared_adj, pvalues, resid, etc.): https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.RegressionResults.html
- VIF — statsmodels: https://www.statsmodels.org/stable/generated/statsmodels.stats.outliers_influence.variance_inflation_factor.html
- Corr — pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html

### Pseudocode Plan (Evaluation + Multicollinearity)
1) **Load CSV** → preview shape/columns; (optional) filter to **July**.
2) **Pick Y** (`base_passenger_fare`) and **candidate X’s** (e.g., `trip_miles`, `trip_time_minutes`, `tolls`, `tips` if present).
3) **Light prep** → derive `trip_time_minutes` from `trip_time` (seconds) if present; coerce only used cols to numeric; drop NA rows.
4) **Split** → `X_train, X_test, y_train, y_test` (80/20, fixed `random_state`).
5) **Diagnose on TRAIN**:
   - **Correlation matrix** (|r| > 0.7 = red flag).
   - **VIF** for each predictor (roughly 1–5 is usually fine; above 5 is concerning; above 10 is a serious red flag).
6) **Fix** → drop/choose among highly correlated predictors (business logic).
7) **Fit on TRAIN only** → OLS with intercept.
8) **Predict on TEST** → compute **MAE/RMSE** (units of Y).
9) **Interpret** → unit-based coefficient sentences **holding others constant**; note any changes after fixing collinearity.
10) **Report** → table of (features kept, Adj R², MAE, RMSE) + 1-line stakeholder takeaway.


## You Do — Student Section
Work in pairs. Comment your choices briefly. Keep code simple—only coerce the columns you use.

### Step 0 — Setup & Imports

In [19]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

### Step 1 — Load CSV & Preview
- Point to your For Hire Vehicle Data 
- Print **shape** and **columns**.

**Hint: You may need to drop missing values and coerce the columns you use to numeric so they stay numeric (earlier coding assignments may help).**

In [20]:
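pd.set_option('display.float_format', lambda x: f'{x:,.4f}')
df = pd.read_csv('/Users/Marcy_Student/Downloads/FHV_072023 copy.csv')  # local path; point this at your copy
print(df.shape)             # (rows, columns)
print(df.columns.tolist())  # available column names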
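The hint about dropping missing values and coercing to numeric can be sketched as a small helper. This is one possible approach, assuming numeric columns may have been read as text with thousands separators; the function name `prep_numeric` and the toy `raw` frame are hypothetical.

```python
import pandas as pd


def prep_numeric(df, cols):
    """Return df[cols] coerced to numeric, dropping rows with any missing value."""
    out = df[cols].copy()
    for col in cols:
        # Remove thousands separators in case the column was read as text.
        cleaned = out[col].astype(str).str.replace(",", "", regex=False)
        # Values that cannot be parsed become NaN instead of raising an error.
        out[col] = pd.to_numeric(cleaned, errors="coerce")
    return out.dropna()


# Toy frame with a thousands separator, an unparseable value, and a missing value.
raw = pd.DataFrame({
    "trip_miles": ["1.5", "2,000.0", "bad", "3.2"],
    "trip_time": [600, 900, 1200, None],
    "base_passenger_fare": ["12.50", "20.00", "15.00", "9.75"],
})
clean = prep_numeric(raw, ["trip_miles", "trip_time", "base_passenger_fare"])
print(clean.shape)
print(clean.dtypes)
```

Restricting the coercion to the columns you model keeps the step fast on a multi-million-row file and avoids silently altering columns you never use.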



### Step 2 — Choose Target **Y** and Candidate Predictors

- Suggested **Y**: `base_passenger_fare` (USD).
- Start with **distance** and **time**; optionally add **flags** if present.
- Derive `trip_time_minutes` from `trip_time` (seconds) if available.

In [21]:
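df2 = df.copy()

# Strip thousands separators (e.g. "1,234") so numeric columns parse cleanly
df2 = df2.replace({',': ''}, regex=True)
df2['trip_time'] = df2['trip_time'].astype(float)

# trip_time is recorded in seconds; convert to minutes
df2["trip_time_minutes"] = df2["trip_time"] / 60

# Target: base fare in USD
y = df2["base_passenger_fare"]

# Candidate predictors: distance and time, plus tolls/tips when present
candidate_X = ["trip_miles", "trip_time_minutes"]
if "tolls" in df2.columns:
    candidate_X.append("tolls")
if "tips" in df2.columns:
    candidate_X.append("tips")

X = df2[candidate_X]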



### Step 3 — Train–Test Split

- Use a fixed `random_state` for reproducibility.
- **All diagnostics below must be done on TRAIN only.**

In [22]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

### Step 4 — Diagnose Multicollinearity on **TRAIN** — Correlation Matrix
- Flag any |r| > 0.70 as a potential problem.


In [23]:
corr_matrix = X_train.corr()
corr_matrix


| | trip_miles | trip_time_minutes | tolls | tips |
|---|---|---|---|---|
| trip_miles | 1.0 | 0.8112 | 0.515 | 0.3257 |
| trip_time_minutes | 0.8112 | 1.0 | 0.427 | 0.2949 |
| tolls | 0.515 | 0.427 | 1.0 | 0.2167 |
| tips | 0.3257 | 0.2949 | 0.2167 | 1.0 |
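Rather than scanning the matrix by eye, the |r| > 0.70 rule can be applied programmatically. The sketch below re-enters the correlation values reported above and lists each pair over the threshold once; the helper name `flag_pairs` is hypothetical, and in the notebook you would pass `X_train.corr()` instead of the hand-typed matrix.

```python
from itertools import combinations

import pandas as pd

cols = ["trip_miles", "trip_time_minutes", "tolls", "tips"]
# Correlation values copied from the TRAIN correlation matrix above.
corr = pd.DataFrame(
    [
        [1.0, 0.8112, 0.515, 0.3257],
        [0.8112, 1.0, 0.427, 0.2949],
        [0.515, 0.427, 1.0, 0.2167],
        [0.3257, 0.2949, 0.2167, 1.0],
    ],
    index=cols,
    columns=cols,
)


def flag_pairs(corr_matrix, threshold=0.7):
    """List each unordered predictor pair whose |r| exceeds the threshold."""
    rows = []
    for a, b in combinations(corr_matrix.columns, 2):
        r = corr_matrix.loc[a, b]
        if abs(r) > threshold:
            rows.append((a, b, r))
    return pd.DataFrame(rows, columns=["feature_a", "feature_b", "r"])


flagged = flag_pairs(corr)
print(flagged)
```

With these values, only the `trip_miles` / `trip_time_minutes` pair is flagged.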


### Step 5 — Diagnose Multicollinearity on **TRAIN** — VIF
- Rule of thumb: VIF of roughly 1–5 is usually fine; above 5 is concerning; above 10 is a serious red flag.

In [24]:
vif_df = pd.DataFrame()
vif_df["feature"] = X_train.columns
vif_df["VIF"] = [
    variance_inflation_factor(X_train.values, i) 
    for i in range(X_train.shape[1])
]
vif_df


| | feature | VIF |
|---|---|---|
| 0 | trip_miles | 5.4102 |
| 1 | trip_time_minutes | 4.7113 |
| 2 | tolls | 1.4537 |
| 3 | tips | 1.2673 |


### Step 6 — Fix High VIF (if needed)

- If two predictors are highly correlated, **drop/choose** using business logic (e.g., keep the more actionable one).
- Recompute VIF to confirm improvement.

In [25]:

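if "trip_time_minutes" in X_train.columns:
    # The drop is triggered by the pairwise correlation with trip_miles, not by VIF directly
    trip_corr = abs(X_train["trip_miles"].corr(X_train["trip_time_minutes"]))
    if trip_corr > 0.7:
        print("Dropping trip_time_minutes: |r| with trip_miles > 0.7.")
        X_train = X_train.drop(columns=["trip_time_minutes"])
        X_test = X_test.drop(columns=["trip_time_minutes"])


# Recompute VIF on the reduced training set to confirm the improvement
vif_df2 = pd.DataFrame()
vif_df2["feature"] = X_train.columns
vif_df2["VIF"] = [
    variance_inflation_factor(X_train.values, i) 
    for i in range(X_train.shape[1])
]
vif_df2


Dropping trip_time_minutes: |r| with trip_miles > 0.7.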


| | feature | VIF |
|---|---|---|
| 0 | trip_miles | 1.6609 |
| 1 | tolls | 1.4427 |
| 2 | tips | 1.2576 |


### Step 7 — Fit on TRAIN Only, Predict on TEST, Evaluate MAE/RMSE

- Add intercept (`sm.add_constant`).
- Report **MAE/RMSE** in **units of Y**.
- Also capture **Adjusted R²** from the TRAIN fit summary to comment on fit (don’t use it alone for selection).


In [30]:
X_train = X_train.astype(float)
y_train = y_train.astype(float)

X_train_const = sm.add_constant(X_train)
X_test_const  = sm.add_constant(X_test)

model = sm.OLS(y_train, X_train_const).fit()
print(model.summary())


y_pred = model.predict(X_test_const)


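mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE is in the same units as Y (USD); MSE is in squared USD

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)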


                             OLS Regression Results                            
Dep. Variable:     base_passenger_fare   R-squared:                       0.820
Model:                             OLS   Adj. R-squared:                  0.820
Method:                  Least Squares   F-statistic:                 1.012e+07
Date:                 Tue, 18 Nov 2025   Prob (F-statistic):               0.00
Time:                         14:22:02   Log-Likelihood:            -2.3857e+07
No. Observations:              6659672   AIC:                         4.771e+07
Df Residuals:                  6659668   BIC:                         4.771e+07
Df Model:                            3                                         
Covariance Type:             nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          8.0705      0.004   1801.00
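Because Step 7 asks for errors in units of Y, it helps to see how MAE, MSE, and RMSE relate. The toy numbers below are made up for illustration (they are not model output): MSE is in squared dollars, and only its square root, RMSE, is back in dollars.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up fares and predictions in USD (illustration only).
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 30.0])

mae = mean_absolute_error(y_true, y_hat)  # mean of |error|, in USD
mse = mean_squared_error(y_true, y_hat)   # mean of error^2, in squared USD
rmse = np.sqrt(mse)                       # back in USD

print("MAE:", mae)    # 4.25
print("MSE:", mse)    # 29.25
print("RMSE:", rmse)
```

RMSE is never smaller than MAE, and the gap widens when a few predictions miss badly (here, the single $10 miss), so reporting both says something about outliers.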

### Step 8 —  Interpret Coefficients (Plain Language)
Write **unit-based** sentences “**holding others constant**.” Example templates (edit with your β values/units):

- **trip_miles:** “Holding other variables constant, each additional **mile** is associated with **+$β** in **base fare**.”
- **trip_time_minutes:** “Holding others constant, each additional **minute** is associated with **+$β** in **base fare**.”
- **tolls / tips:** interpret as “per $1 change,” holding others constant.

Also note **p-values** and whether they support including each predictor.

**trip_miles:** Holding other predictors constant, each additional mile is associated with +$2.89 in base fare.

**tolls:** Holding other predictors constant, each additional $1 in tolls is associated with +$0.26 in base fare.

**tips:** Holding other predictors constant, each additional $1 in tips is associated with +$0.66 in base fare.
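The interpretation templates can also be generated from the fitted coefficients so the wording stays consistent and associational. In this sketch the coefficient values are copied from the results reported above, while the helper name `coef_sentences` and the `units` mapping are hypothetical; in the notebook, `model.params` (a pandas Series) could be passed in place of `params`.

```python
import pandas as pd


def coef_sentences(params, units, target="base fare"):
    """Build one 'holding other predictors constant' sentence per predictor."""
    sentences = []
    for name, beta in params.items():
        if name == "const":
            continue  # the intercept is not a per-unit effect
        sign = "+" if beta >= 0 else "-"
        sentences.append(
            f"Holding other predictors constant, each additional {units[name]} "
            f"is associated with {sign}${abs(beta):.2f} in {target}."
        )
    return sentences


# Coefficients as reported in this notebook (intercept from the OLS summary).
params = pd.Series({"const": 8.0705, "trip_miles": 2.89, "tolls": 0.26, "tips": 0.66})
units = {"trip_miles": "mile", "tolls": "$1 in tolls", "tips": "$1 in tips"}

sentences = coef_sentences(params, units)
for sentence in sentences:
    print(sentence)
```

The phrase "is associated with" is deliberate: OLS coefficients describe how fares co-vary with each predictor in the data, not what would happen if a predictor were changed.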

## We Share — Reflection & Wrap‑Up

Write **2 short paragraphs** and be specific:

1) **What changes did you make to handle multicollinearity and why?**  
Reference **corr**/**VIF** on TRAIN and any features you dropped or kept (with business rationale). Include **Adjusted R² (TRAIN)** and **TEST MAE/RMSE**.

2) **Stakeholder summary (units, one sentence):**  
Give a plain-English takeaway: e.g., “On unseen July trips, our typical error is about **$X** per fare; each extra mile adds about **$β_mile**, holding other factors constant.”


I diagnosed multicollinearity on the training set by computing the correlation matrix and VIFs for all candidate predictors. trip_miles and trip_time_minutes were highly correlated (r = 0.81, above the 0.7 flag), and trip_miles had a VIF of 5.41 (trip_time_minutes: 4.71), so I dropped trip_time_minutes to avoid redundancy and improve coefficient stability. I kept trip_miles because distance is a more actionable and interpretable feature for fare pricing. After the drop, all remaining VIFs fell below 2 (1.66, 1.44, and 1.26). Refitting the OLS model on the reduced training set gave Adjusted R² (TRAIN) = 0.820, with statistically meaningful coefficients on the remaining predictors.


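On unseen July trips, the model’s typical error was about $5.04 per trip (MAE = 5.03582). The 77.4521 figure appears to be the MSE that Step 7 computes with `mean_squared_error` (in squared dollars) rather than the RMSE; its square root gives an RMSE of about $8.80 per trip. Holding all other factors constant, each additional mile is associated with roughly +$2.89 in base fare, each extra $1 in tolls with about +$0.26, and each extra $1 in tips with about +$0.66.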