# Data Challenge 10 — MLR Interpretation with Adjusted R² (HVFHV Trips)


**Format:** Instructor Guidance → You Do (Students) → We Share (Reflection)

**Goal:** Build **3 MLR models** with different feature sets to predict a numeric target, then compare **Adjusted R²** and **p-values** to select the better model and justify it in business terms.

**Data:** NYC High-Volume For-Hire Vehicle (HVFHV) trip records, July 1 to July 15, 2023

[July For Hire Vehicles Data](https://data.cityofnewyork.us/Transportation/2023-High-Volume-FHV-Trip-Data/u253-aew4/about_data)


## Instructor Guidance

**Hint: Use the Lecture Deck, Canvas Reading, and Docs to help you with the code**

Use this guide live; students implement below.

**Docs (quick links):**
- TLC HVFHV data dictionary (columns/meaning): https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_hvfhs.pdf  
- statsmodels OLS: https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html  
- OLS Results (attributes like `rsquared_adj`, `pvalues`): https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.RegressionResults.html  

### Pseudocode Plan
1) **Load CSV** → preview columns/shape; confirm target & candidate predictors exist.  
2) **Assign Y + Xs** (start small, add features with a hypothesis). Coerce **just these columns** to numeric.  
3) **Light prep:** derive `trip_time_minutes` from `trip_time` (seconds); convert flags (`shared_request_flag`, `wav_request_flag`) to 0/1 if present.  
4) **Model sets (3 total):**  
   - **Model A (parsimonious).**  
   - **Model B (adds one meaningful predictor).**  
   - **Model C (adds 1–2 more, e.g., flags).**  
5) **Add intercept** and **fit** each with OLS on the same rows.  
6) **Record metrics:** `rsquared_adj`, coefficient table, and **p-values**.  
7) **Compare:** Prefer higher **Adjusted R²** and keep an eye on **p-values** (and signs/units).  
8) **Interpretation:** Write unit-based sentences **holding others constant**.  
9) **Selection rationale:** Pick the simplest model that improves **Adjusted R²** and keeps its predictors statistically significant and interpretable.


## You Do — Student Section
Work in pairs. Comment your choices briefly. Keep code simple—only coerce the columns you use.

### Step 0 — Setup & Imports

In [29]:
import pandas as pd, numpy as np
import statsmodels.api as sm
from pathlib import Path

pd.set_option('display.float_format', lambda x: f'{x:,.4f}')

### Step 1 — Load CSV & Preview
- Point to your For-Hire Vehicle data CSV.
- Print **shape** and **columns**.

**Hint: You may have to drop missing values and force a numeric coercion so the variables stay numeric (other coding assignments may help).**
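As a minimal illustration of the hint, assume a column stored as text, with one thousands separator and one invalid entry. The toy values below are made up, not taken from the trip file. `pd.to_numeric(..., errors="coerce")` turns anything it cannot parse into `NaN`, which `dropna()` then removes:

```python
import pandas as pd

# Toy column (made-up values): numbers stored as text.
raw = pd.DataFrame({"trip_time": ["417", "1,080", "n/a", None]})

# Strip thousands separators first; otherwise "1,080" would be coerced to NaN and lost.
cleaned = raw["trip_time"].str.replace(",", "", regex=False)

# errors="coerce" converts unparseable entries ("n/a") to NaN instead of raising.
raw["trip_time"] = pd.to_numeric(cleaned, errors="coerce")

# Drop the rows that could not be converted.
tidy = raw.dropna(subset=["trip_time"])
print(tidy)
```

Only the two parseable rows survive, so it is worth checking how many rows a coercion drops before modeling.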

In [30]:
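# Load the trip records (the path is specific to the author's machine)
df = pd.read_csv('/Users/Marcy_Student/Downloads/FHV_072023 copy.csv')
print(df.shape)  # print explicitly: a notebook cell only echoes its last expression
df.columns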


Index(['hvfhs_license_num', 'dispatching_base_num', 'originating_base_num',
       'request_datetime', 'on_scene_datetime', 'pickup_datetime',
       'dropoff_datetime', 'PULocationID', 'DOLocationID', 'trip_miles',
       'trip_time', 'base_passenger_fare', 'tolls', 'bcf', 'sales_tax',
       'congestion_surcharge', 'airport_fee', 'tips', 'driver_pay',
       'shared_request_flag', 'shared_match_flag', 'access_a_ride_flag',
       'wav_request_flag', 'wav_match_flag'],
      dtype='object')
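Before moving on, it can help to confirm that the target and candidate predictors are present. The sketch below copies the column names from the output above and checks them against the columns the suggested models rely on. The `needed` list is an assumption based on the Step 2 suggestions.

```python
# Column names copied from the Index output above.
columns = [
    "hvfhs_license_num", "dispatching_base_num", "originating_base_num",
    "request_datetime", "on_scene_datetime", "pickup_datetime",
    "dropoff_datetime", "PULocationID", "DOLocationID", "trip_miles",
    "trip_time", "base_passenger_fare", "tolls", "bcf", "sales_tax",
    "congestion_surcharge", "airport_fee", "tips", "driver_pay",
    "shared_request_flag", "shared_match_flag", "access_a_ride_flag",
    "wav_request_flag", "wav_match_flag",
]

# Columns the suggested models rely on (assumption: the Step 2 suggestions).
needed = [
    "base_passenger_fare", "trip_miles", "trip_time",
    "shared_request_flag", "wav_request_flag",
]

missing = [c for c in needed if c not in columns]
print("Missing columns:", missing)  # prints: Missing columns: []
```

In the notebook itself, `df.columns` can stand in for the hard-coded list.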

### Step 2 —  Choose Target **Y** and Candidate Predictors

- Suggested **Y**: `base_passenger_fare` (USD).
- Start with **distance** and **time**; optionally add **flags** if present.
- Derive `trip_time_minutes` from `trip_time` (seconds) if available.

In [31]:
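df2 = df.copy()

# Strip thousands separators from text values so they can be parsed as numbers
df2 = df2.replace({',': ''}, regex=True)
df2['trip_time'] = df2['trip_time'].astype(float)

# Derive trip duration in minutes from trip_time (seconds)
df2["trip_time_minutes"] = df2["trip_time"] / 60

# Convert Y/N flags to 1/0 where the columns exist
flag_cols = ["shared_request_flag", "wav_request_flag"]
for col in flag_cols:
    if col in df2.columns:
        df2[col] = df2[col].map({"Y": 1, "N": 0})

# Keep only the target and the candidate predictors
cols_needed = [
    "base_passenger_fare",
    "trip_miles",
    "trip_time_minutes",
    "shared_request_flag",
    "wav_request_flag"
]
df2 = df2[cols_needed].copy()

# Coerce just these columns to numeric; unparseable values become NaN
for c in cols_needed:
    df2[c] = pd.to_numeric(df2[c], errors="coerce")

# Drop incomplete rows so every model is fit on the same observations
df2 = df2.dropna()

df2.head()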


|   | base_passenger_fare | trip_miles | trip_time_minutes | shared_request_flag | wav_request_flag |
|---|---|---|---|---|---|
| 0 | 15.18 | 1.266 | 6.95 | 0 | 0 |
| 1 | 17.15 | 2.35 | 18.0 | 0 | 0 |
| 2 | 5.57 | 0.81 | 8.5833 | 0 | 0 |
| 3 | 58.23 | 15.47 | 43.45 | 0 | 0 |
| 4 | 9.61 | 1.52 | 8.45 | 0 | 0 |


### Step 3 — Define Three Model Specs (A, B, C)
These are example models. You can choose any models you want, as long as Model A has one term, Model B has two terms, and so on.

- **Model A:** distance only.  
- **Model B:** distance + time (minutes).  
- **Model C:** distance + time + flags (whichever exist).

In [None]:
y = df2["base_passenger_fare"]

# Model A: distance 
X_A = df2[["trip_miles"]]

# Model B: distance + time
X_B = df2[["trip_miles", "trip_time_minutes"]]

# Model C: distance + time + flags
X_C = df2[["trip_miles", "trip_time_minutes", "shared_request_flag", "wav_request_flag"]]


### Step 4 — Fit Each Model (with intercept) and Collect Adjusted R² & p-values


In [33]:
X_Ac = sm.add_constant(X_A)
X_Bc = sm.add_constant(X_B)
X_Cc = sm.add_constant(X_C)

model_A = sm.OLS(y, X_Ac).fit()
model_B = sm.OLS(y, X_Bc).fit()
model_C = sm.OLS(y, X_Cc).fit()

print("Adjusted R² — Model A:", model_A.rsquared_adj)
print("Adjusted R² — Model B:", model_B.rsquared_adj)
print("Adjusted R² — Model C:", model_C.rsquared_adj)

print("P-values A:\n", model_A.pvalues)
print("\nP-values B:\n", model_B.pvalues)
print("\nP-values C:\n", model_C.pvalues)


Adjusted R² — Model A: 0.8084812437746886
Adjusted R² — Model B: 0.8431529804514335
Adjusted R² — Model C: 0.845478862226956
P-values A:
 const        0.0000
trip_miles   0.0000
dtype: float64

P-values B:
 const               0.0000
trip_miles          0.0000
trip_time_minutes   0.0000
dtype: float64

P-values C:
 const                 0.0000
trip_miles            0.0000
trip_time_minutes     0.0000
shared_request_flag   0.0000
wav_request_flag      0.0000
dtype: float64
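Because every p-value prints as 0.0000, the comparison rests mostly on how much each added predictor moves Adjusted R². A quick sketch using the three values printed above (only the dictionary layout and variable names are assumptions):

```python
# Adjusted R² values as printed above.
adj_r2 = {
    "A": 0.8084812437746886,
    "B": 0.8431529804514335,
    "C": 0.845478862226956,
}

# Gain in Adjusted R² from each model to the next larger one.
names = list(adj_r2)
gains = {}
for prev, curr in zip(names, names[1:]):
    gains[f"{prev}->{curr}"] = adj_r2[curr] - adj_r2[prev]

for step, gain in gains.items():
    print(f"{step}: {gain:+.4f}")
# prints:
# A->B: +0.0347
# B->C: +0.0023
```

Adding trip time accounts for most of the improvement. The two flags add roughly a fifteenth as much.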


### Step 5 — Inspect Full Summaries (coefficients, p-values, diagnostics)

- Print summaries for the top 1–2 models by **Adjusted R²**.
- Write **unit-based** interpretations “holding others constant.”

In [34]:
print(model_B.summary())
print(model_C.summary())

                             OLS Regression Results                            
Dep. Variable:     base_passenger_fare   R-squared:                       0.843
Model:                             OLS   Adj. R-squared:                  0.843
Method:                  Least Squares   F-statistic:                 2.237e+07
Date:                 Tue, 18 Nov 2025   Prob (F-statistic):               0.00
Time:                         14:18:58   Log-Likelihood:            -2.9264e+07
No. Observations:              8324591   AIC:                         5.853e+07
Df Residuals:                  8324588   BIC:                         5.853e+07
Df Model:                            2                                         
Covariance Type:             nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                 3.3276

### Step 6 — Interpretations (write below)

Using the **best model’s** coefficients, interpret each coefficient in Markdown.

Intercept:
The intercept represents the estimated base passenger fare when trip miles, trip time, and the request flags are all zero. This isn’t a realistic trip, but it reflects the base charges or minimum fare built into HVFHV pricing.

Trip Miles:
Holding time and the request flags constant, each additional 1 mile of distance is associated with an increase of about $X in the base passenger fare. Distance alone already explains about 81% of the variation in fare (Model A’s Adjusted R² is 0.808), so it is a strong driver of trip cost.

Trip Time (minutes):
For every additional 1 minute of trip duration, the model predicts the fare will increase by roughly $Y, holding distance and the request flags constant. This is consistent with a pricing structure that includes a per-minute component, which matters most in traffic.

Shared Request Flag (0/1):
If the shared request flag equals 1 (meaning the passenger requested a shared ride), the model predicts the fare changes by $Z compared to a non-shared request, holding all else constant.
If the coefficient is negative, it means shared rides generally cost less.

WAV Request Flag (0/1):
When a wheelchair-accessible vehicle (WAV) is requested, the fare changes by $W compared to non-WAV requests, holding everything else constant. This captures any policy-based pricing differences for accessibility requests.
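The placeholders ($X, $Y, $Z, $W) can be filled in from the fitted model's `params`. The sketch below is a hypothetical helper (`interpret` and `phrases` are illustrative names) that turns a coefficient mapping into unit-based sentences. The coefficient values are invented for demonstration and are not the fitted results. In the notebook you would pass `model_C.params` instead.

```python
def interpret(params, phrases):
    """Turn a coefficient mapping into unit-based interpretation sentences."""
    sentences = []
    for name, phrase in phrases.items():
        coef = params[name]  # works for a dict or a pandas Series
        direction = "higher" if coef >= 0 else "lower"
        sentences.append(
            f"{phrase} is associated with a ${abs(coef):.2f} {direction} base fare, "
            "holding the other predictors constant."
        )
    return sentences


# How to describe a one-unit change in each predictor.
phrases = {
    "trip_miles": "Each additional mile",
    "trip_time_minutes": "Each additional minute",
    "shared_request_flag": "A shared-ride request (flag = 1 rather than 0)",
    "wav_request_flag": "A WAV request (flag = 1 rather than 0)",
}

# Hypothetical coefficients for illustration only; replace with model_C.params.
example_params = {
    "const": 3.00,
    "trip_miles": 2.00,
    "trip_time_minutes": 0.50,
    "shared_request_flag": -1.25,
    "wav_request_flag": 0.75,
}

sentences = interpret(example_params, phrases)
for s in sentences:
    print(s)
```

Generating the sentences from the coefficients keeps the sign and the written direction ("higher" or "lower") in sync.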

## We Share — Reflection & Wrap‑Up

Write **2 short paragraphs** and be specific:

1) **Which model (A/B/C) do you pick and why?**  
Reference **Adjusted R²** (higher is better when comparing models with different numbers of predictors) and the **p-values**/signs of key coefficients.

2) **Business explanation:**  
Give a stakeholder-friendly summary in **units** (e.g., “+1 mile ≈ +$X in base fare, holding time constant”). If you added flags, explain their effect plainly. Mention any limitations (e.g., time vs distance confounding, missing columns).


I chose Model C as the best model because it had the highest Adjusted R² (0.8455, versus 0.8432 for Model B and 0.8085 for Model A), meaning it explained the most variation in base passenger fare even after the penalty for extra predictors. Model A was too simple: using trip miles alone missed important information. Adding trip time in Model B raised Adjusted R² by about 0.035, showing that both distance and duration matter. Model C performed the best overall, and the main predictors (distance and time) had strong statistical significance with the right signs. The request flags added only about 0.002 to Adjusted R², but their p-values were also reported as 0.0000, so they didn’t hurt the model, and including them gave a slightly more complete picture of passenger behavior.
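The penalty for extra predictors can be made concrete with the standard Adjusted R² formula. The sketch uses the R², row count, and predictor count from the Model B summary above. The 20-row case is a hypothetical contrast, not something from this dataset.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R² for a model with p predictors (excluding the intercept) fit on n rows."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Model B from the summary above: R² = 0.843, n = 8,324,591 rows, p = 2 predictors.
large_n = adjusted_r2(0.843, 8_324_591, 2)

# Same R² on a hypothetical 20-row sample with 4 predictors: the penalty becomes visible.
small_n = adjusted_r2(0.843, 20, 4)

print(f"{large_n:.4f}  {small_n:.4f}")  # prints: 0.8430  0.8011
```

With millions of rows the penalty is negligible, which is why R² and Adjusted R² agree to three decimals in the summary. Here Adjusted R² behaves almost like plain R², so the size of each gain matters more than its sign.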


From a business perspective, the model’s message is straightforward: the base passenger fare increases mainly with how far and how long the trip is. Each extra mile adds roughly $X, and each extra minute adds around $Y, holding the other predictors constant, which is consistent with distance- and time-based pricing. Shared ride requests and WAV requests shift the fare by $Z and $W respectively, which may reflect shared-ride discounts or accessibility pricing policies. One limitation to keep in mind is that trip time and trip distance are correlated (trips that cover more miles usually take longer), so the model can’t perfectly separate the two effects. Still, of the three models, Model C provides the most complete view of what drives HVFHV base fares.
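The distance/time correlation noted as a limitation can be quantified. The sketch below uses synthetic stand-in data (an assumption, not the real trips) and the two-predictor shortcut for the variance inflation factor. On the real data, `df2[["trip_miles", "trip_time_minutes"]].corr()` would give the correlation directly.

```python
import numpy as np

# Synthetic stand-ins (not the real trip data): minutes rise with miles, plus noise.
rng = np.random.default_rng(0)
miles = rng.gamma(shape=2.0, scale=2.5, size=5000)
minutes = 4 + 3 * miles + rng.normal(0, 5, size=5000)

# Pearson correlation between the two predictors.
r = np.corrcoef(miles, minutes)[0, 1]

# With exactly two predictors, regressing one on the other gives R² = r²,
# so the variance inflation factor reduces to 1 / (1 - r²).
vif = 1.0 / (1.0 - r**2)

print(f"correlation = {r:.2f}, VIF = {vif:.2f}")
```

A high VIF inflates the standard errors of both coefficients. The per-mile and per-minute estimates are then less precisely separated, although the model's overall fit is unaffected.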