# Data Challenge 10 — MLR Interpretation with Adjusted R² (HVFHV Trips)


**Format:** Instructor Guidance → You Do (Students) → We Share (Reflection)

**Goal:** Build **3 MLR models** with different feature sets to predict a numeric target, then compare **Adjusted R²** and **p-values** to select the better model and justify it in business terms.

**Data:** July 1, 2023 - July 15, 2023 For Hire Vehicle Data in NYC

[July For Hire Vehicles Data](https://data.cityofnewyork.us/Transportation/2023-High-Volume-FHV-Trip-Data/u253-aew4/about_data)


## Instructor Guidance

**Hint: Use the Lecture Deck, Canvas Reading, and Docs to help you with the code**

Use this guide live; students implement below.

**Docs (quick links):**
- TLC HVFHV data dictionary (columns/meaning): https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_hvfhs.pdf  
- statsmodels OLS: https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html  
- OLS Results (attributes like `rsquared_adj`, `pvalues`): https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.RegressionResults.html  

### Pseudocode Plan
1) **Load CSV** → preview columns/shape; confirm target & candidate predictors exist.  
2) **Assign Y + Xs** (start small, add features with a hypothesis). Coerce **just these columns** to numeric.  
3) **Light prep:** derive `trip_time_minutes` from `trip_time` (seconds); convert flags (`shared_request_flag`, `wav_request_flag`) to 0/1 if present.  
4) **Model sets (3 total):**  
   - **Model A (parsimonious).**  
   - **Model B (adds one meaningful predictor).**  
   - **Model C (adds 1–2 more, e.g., flags).**  
5) **Add intercept** and **fit** each with OLS on the same rows.  
6) **Record metrics:** `rsquared_adj`, coefficient table, and **p-values**.  
7) **Compare:** Prefer higher **Adjusted R²** and keep an eye on **p-values** (and signs/units).  
8) **Interpretation:** Write unit-based sentences **holding others constant**.  
9) **Selection rationale:** Pick the simplest model that improves **Adjusted R²** and has significant, sensibly signed coefficients.
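
Step 7 leans on Adjusted R², which shrinks plain R² for every added predictor: Adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), with n rows and p predictors (intercept excluded). The sketch below computes both values by hand on synthetic data. The helper name `adjusted_r2` and all numbers are illustrative assumptions, not part of the assignment data.

```python
import numpy as np

def adjusted_r2(y, X):
    """Fit OLS with an intercept via least squares; return (R2, adjusted R2).

    X is an (n, p) array of predictors WITHOUT the intercept column.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(y), -1)
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])        # add the intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    ss_res = float(resid @ resid)                     # unexplained variation
    ss_tot = float(((y - y.mean()) ** 2).sum())       # total variation
    r2 = 1.0 - ss_res / ss_tot
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)    # penalty grows with p
    return r2, adj

# Synthetic, made-up data: fare depends on miles and minutes, not on noise_col.
rng = np.random.default_rng(0)
n = 200
miles = rng.uniform(0.5, 15, n)
minutes = 3 * miles + rng.normal(0, 5, n)
noise_col = rng.normal(0, 1, n)
fare = 3 + 2.2 * miles + 0.5 * minutes + rng.normal(0, 4, n)

r2_b, adj_b = adjusted_r2(fare, np.column_stack([miles, minutes]))
r2_c, adj_c = adjusted_r2(fare, np.column_stack([miles, minutes, noise_col]))
print(f"2 predictors: R2={r2_b:.4f}  adj R2={adj_b:.4f}")
print(f"3 predictors: R2={r2_c:.4f}  adj R2={adj_c:.4f}")
```

Plain R² can never fall when a column is added, so it cannot tell a useful predictor from noise. Adjusted R² can fall, which is why it is the comparison metric across models with different numbers of predictors.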


## You Do — Student Section
Work in pairs. Comment your choices briefly. Keep code simple—only coerce the columns you use.

### Step 0 — Setup & Imports

In [147]:
import pandas as pd, numpy as np
import statsmodels.api as sm
from pathlib import Path
from sklearn.model_selection import train_test_split

pd.set_option('display.float_format', lambda x: f'{x:,.4f}')

### Step 1 — Load CSV & Preview
- Point to your For Hire Vehicle Data 
- Print **shape** and **columns**.

**Hint: You may need to drop missing values and force-coerce columns with `pd.to_numeric(..., errors='coerce')` so the variables stay numeric (other coding assignments may help).**
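
As a rough sketch of the hint, the snippet below reads only the columns a model needs and coerces messy numeric strings to floats. The tiny inline CSV and the name `df_small` are made up for illustration; with the real file you would pass its path to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Tiny made-up CSV standing in for the real export; note the messy numeric strings.
raw = io.StringIO(
    "trip_miles,trip_time,base_passenger_fare,shared_request_flag,tips\n"
    "1.27,417,$12.50,N,2.0\n"
    "2.35,1080,21.75,N,3.28\n"
    "0.81,bad,9.10,Y,0.0\n"
    "15.47,2607,,N,0.0\n"
)

use = ["trip_miles", "trip_time", "base_passenger_fare", "shared_request_flag"]
df_small = pd.read_csv(raw, usecols=use)   # skip columns the models never touch

num_cols = ["trip_miles", "trip_time", "base_passenger_fare"]
for c in num_cols:
    # Strip currency symbols, commas, and whitespace, then coerce;
    # anything that still cannot be parsed becomes NaN instead of raising.
    cleaned = df_small[c].astype(str).str.replace(r"[$,\s]", "", regex=True)
    df_small[c] = pd.to_numeric(cleaned, errors="coerce")

before = len(df_small)
df_small = df_small.dropna(subset=num_cols)   # drop rows missing any model column
print(f"kept {len(df_small)} of {before} rows")
print(df_small.dtypes)
```

Limiting `usecols` matters on a file with millions of rows, since unused text columns take most of the memory.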

In [130]:
df = pd.read_csv("/Users/gabriel/Desktop/marcy/DA2025_Lectures2/Mod6/data/FHV_072023 copy.csv", low_memory=False)
display(df.head())
display(df.info())
display(df.describe())



| | hvfhs_license_num | dispatching_base_num | originating_base_num | request_datetime | on_scene_datetime | pickup_datetime | dropoff_datetime | PULocationID | DOLocationID | trip_miles | ... | sales_tax | congestion_surcharge | airport_fee | tips | driver_pay | shared_request_flag | shared_match_flag | access_a_ride_flag | wav_request_flag | wav_match_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HV0005 | B03406 | | 07/01/2023 05:34:30 PM | | 07/01/2023 05:37:48 PM | 07/01/2023 05:44:45 PM | 158 | 68 | 1.266 | ... | 1.35 | 2.75 | 0.0 | 2.0 | 5.57 | N | N | N | N | False |
| 1 | HV0003 | B03404 | B03404 | 07/01/2023 05:34:30 PM | 07/01/2023 05:36:53 PM | 07/01/2023 05:37:15 PM | 07/01/2023 05:55:15 PM | 162 | 234 | 2.35 | ... | 1.52 | 2.75 | 0.0 | 3.28 | 13.38 | N | N | | N | False |
| 2 | HV0003 | B03404 | B03404 | 07/01/2023 05:34:30 PM | 07/01/2023 05:35:17 PM | 07/01/2023 05:35:52 PM | 07/01/2023 05:44:27 PM | 161 | 163 | 0.81 | ... | 0.49 | 2.75 | 0.0 | 0.0 | 5.95 | N | N | | N | False |
| 3 | HV0003 | B03404 | B03404 | 07/01/2023 05:34:30 PM | 07/01/2023 05:37:39 PM | 07/01/2023 05:39:35 PM | 07/01/2023 06:23:02 PM | 122 | 229 | 15.47 | ... | 5.17 | 2.75 | 0.0 | 0.0 | 54.46 | N | N | | N | True |
| 4 | HV0003 | B03404 | B03404 | 07/01/2023 05:34:30 PM | 07/01/2023 05:36:06 PM | 07/01/2023 05:36:39 PM | 07/01/2023 05:45:06 PM | 67 | 14 | 1.52 | ... | 0.85 | 0.0 | 0.0 | 3.0 | 7.01 | N | N | | N | False |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8324591 entries, 0 to 8324590
Data columns (total 24 columns):
 #   Column                Dtype  
---  ------                -----  
 0   hvfhs_license_num     object 
 1   dispatching_base_num  object 
 2   originating_base_num  object 
 3   request_datetime      object 
 4   on_scene_datetime     object 
 5   pickup_datetime       object 
 6   dropoff_datetime      object 
 7   PULocationID          int64  
 8   DOLocationID          int64  
 9   trip_miles            float64
 10  trip_time             object 
 11  base_passenger_fare   object 
 12  tolls                 float64
 13  bcf                   float64
 14  sales_tax             float64
 15  congestion_surcharge  float64
 16  airport_fee           float64
 17  tips                  float64
 18  driver_pay            object 
 19  shared_request_flag   object 
 20  shared_match_flag     object 
 21  access_a_ride_flag    object 
 22  wav_request_flag      object 
 23  wav_mat

None

| | PULocationID | DOLocationID | trip_miles | tolls | bcf | sales_tax | congestion_surcharge | airport_fee | tips |
|---|---|---|---|---|---|---|---|---|---|
| count | 8324591.0 | 8324591.0 | 8324591.0 | 8324591.0 | 8324591.0 | 8324591.0 | 8324591.0 | 8324591.0 | 8324591.0 |
| mean | 137.8903 | 141.5913 | 5.0758 | 1.0679 | 0.6858 | 2.0046 | 1.0529 | 0.2119 | 1.1174 |
| std | 74.7686 | 77.7345 | 5.9688 | 3.8158 | 0.6304 | 1.6732 | 1.3333 | 0.7015 | 3.1516 |
| min | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 75.0 | 75.0 | 1.61 | 0.0 | 0.32 | 0.95 | 0.0 | 0.0 | 0.0 |
| 50% | 138.0 | 141.0 | 3.04 | 0.0 | 0.49 | 1.52 | 0.0 | 0.0 | 0.0 |
| 75% | 208.0 | 216.0 | 6.31 | 0.0 | 0.82 | 2.48 | 2.75 | 0.0 | 0.0 |
| max | 265.0 | 265.0 | 486.73 | 94.08 | 79.03 | 210.19 | 8.25 | 6.9 | 230.0 |


### Step 2 —  Choose Target **Y** and Candidate Predictors

- Suggested **Y**: `base_passenger_fare` (USD).
- Start with **distance** and **time**; optionally add **flags** if present.
- Derive `trip_time_minutes` from `trip_time` (seconds) if available.

In [131]:
num_cols = ['trip_time', 'base_passenger_fare', 'trip_miles']
for c in num_cols:
    # Strip everything except digits, signs, decimal points, and exponent markers, then coerce to float
    df[c] = pd.to_numeric(
        df[c].astype(str).str.strip().str.replace(r'[^0-9.+\-eE]', '', regex=True),
        errors='coerce'
    )

# Keep rows with non-negative trip time and strictly positive fare and distance, then drop NaN/inf values
df = df[(df['trip_time'] >= 0) & (df['base_passenger_fare'] > 0) & (df['trip_miles'] > 0)]
df = df.replace([np.inf, -np.inf], np.nan).dropna(subset=num_cols)

In [132]:
df['trip_time_minutes'] = df['trip_time'] / 60
df['trip_time_minutes']

0          6.9500
1         18.0000
2          8.5833
3         43.4500
4          8.4500
            ...  
8324586   25.8167
8324587   10.7500
8324588   21.8167
8324589   15.0000
8324590   10.0000
Name: trip_time_minutes, Length: 8323061, dtype: float64

In [133]:
# Keep trips longer than 5 minutes and no longer than the 99.999th percentile of trip duration
df = df[(df['trip_time_minutes'] <= np.percentile(df['trip_time_minutes'], 99.999)) & (df['trip_time_minutes'] > 5)]

# Remove fares above the 99.999th percentile; these extreme values are not representative of typical trips
df = df[df['base_passenger_fare'] <= np.percentile(df['base_passenger_fare'], 99.999)]

# Cap trip_miles at the 99.999th percentile too; a tighter cutoff removed too many of the larger values
df = df[df['trip_miles'] <= np.percentile(df['trip_miles'], 99.999)]
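
The caps in this cell are applied one after another, so each percentile is computed on a frame the previous filter already shrank. One possible alternative, sketched here on made-up data, computes every cutoff on the same frame and applies a single mask. The helper `trim_upper_tail` is a hypothetical name, and this variant can give slightly different cutoffs than the sequential version.

```python
import numpy as np
import pandas as pd

def trim_upper_tail(frame, cols, q=0.99999):
    """Drop rows where any column in `cols` exceeds its q-th quantile.

    All cutoffs come from the incoming frame, so the result does not
    depend on the order in which the columns are listed.
    """
    caps = frame[cols].quantile(q)               # one cutoff per column
    keep = (frame[cols] <= caps).all(axis=1)     # every column must be within its cap
    return frame[keep], caps

rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "trip_miles": rng.exponential(5, 1000),
    "base_passenger_fare": rng.exponential(25, 1000),
})
toy.loc[0, "trip_miles"] = 10_000.0   # plant one absurd outlier

trimmed, caps = trim_upper_tail(toy, ["trip_miles", "base_passenger_fare"], q=0.999)
print(caps)
print(f"rows: {len(toy)} -> {len(trimmed)}")
```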

In [134]:
# Vectorized 0/1 flag: 'N' maps to 0 and anything else (including missing values) maps to 1.
# This matches a row-by-row if/else loop but avoids iterating over millions of rows.
df['shared_request_bool'] = (df['shared_request_flag'] != 'N').astype(int)

In [135]:
modeldata = df[['base_passenger_fare','trip_miles','trip_time_minutes','shared_request_bool']]

y = modeldata['base_passenger_fare']
xone = modeldata['trip_time_minutes']
xtwo = modeldata['trip_miles']
xthree = modeldata['shared_request_bool']

### Step 3 — Define Three Model Specs (A, B, C)
Example models are listed below. You can choose any models you want, as long as Model A has one term, Model B has two terms, and so on.

- **Model A:** distance only.  
- **Model B:** distance + time (minutes).  
- **Model C:** distance + time + flags (whichever exist).

In [148]:
X_one = sm.add_constant(df['trip_time_minutes'])

np.random.seed(10)
Xone_train, Xone_test, yone_train, yone_test = train_test_split(X_one, y, test_size=0.2, random_state=10)

print(f"X Training data shape: {Xone_train.shape}")
print(f"X Testing data shape:  {Xone_test.shape}")
print(f"Y Training data shape: {yone_train.shape}")
print(f"Y Testing data shape:  {yone_test.shape}")

modela = sm.OLS(yone_train, Xone_train).fit()
modela.summary()

X Training data shape: (6347548, 2)
X Testing data shape:  (1586887, 2)
Y Training data shape: (6347548,)
Y Testing data shape:  (1586887,)


| | | | |
|---|---|---|---|
| Dep. Variable: | base_passenger_fare | R-squared: | 0.702 |
| Model: | OLS | Adj. R-squared: | 0.702 |
| Method: | Least Squares | F-statistic: | 14970000.0 |
| Date: | Mon, 10 Nov 2025 | Prob (F-statistic): | 0.0 |
| Time: | 00:20:01 | Log-Likelihood: | -24296000.0 |
| No. Observations: | 6347548 | AIC: | 48590000.0 |
| Df Residuals: | 6347546 | BIC: | 48590000.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -0.9162 | 0.008 | -115.725 | 0.000 | -0.932 | -0.901 |
| trip_time_minutes | 1.2860 | 0.000 | 3868.564 | 0.000 | 1.285 | 1.287 |

| | | | |
|---|---|---|---|
| Omnibus: | 6083095.404 | Durbin-Watson: | 2.0 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 1154101281.519 |
| Skew: | 4.164 | Prob(JB): | 0.0 |
| Kurtosis: | 68.531 | Cond. No. | 42.8 |


In [149]:
X_two = sm.add_constant(df[['trip_time_minutes', 'trip_miles']])

np.random.seed(10)
Xtwo_train, Xtwo_test, ytwo_train, ytwo_test = train_test_split(X_two, y, test_size=0.2, random_state=10)

print(f"X Training data shape: {Xtwo_train.shape}")
print(f"X Testing data shape:  {Xtwo_test.shape}")
print(f"Y Training data shape: {ytwo_train.shape}")
print(f"Y Testing data shape:  {ytwo_test.shape}")

X Training data shape: (6347548, 3)
X Testing data shape:  (1586887, 3)
Y Training data shape: (6347548,)
Y Testing data shape:  (1586887,)


In [150]:
X_three = sm.add_constant(df[['trip_time_minutes', 'trip_miles','shared_request_bool']])

np.random.seed(10)
Xthree_train, Xthree_test, ythree_train, ythree_test = train_test_split(X_three, y, test_size=0.2, random_state=10)

print(f"X Training data shape: {Xthree_train.shape}")
print(f"X Testing data shape:  {Xthree_test.shape}")
print(f"Y Training data shape: {ythree_train.shape}")
print(f"Y Testing data shape:  {ythree_test.shape}")

X Training data shape: (6347548, 4)
X Testing data shape:  (1586887, 4)
Y Training data shape: (6347548,)
Y Testing data shape:  (1586887,)


### Step 4 — Fit Each Model (with intercept) and Collect Adjusted R² & p-values


In [151]:
modela = sm.OLS(yone_train, Xone_train).fit()
# The printed metric is Adjusted R² (rsquared_adj), so label it that way
print('Const P-Value:',modela.pvalues['const'],'\ntrip_time_minutes p-value:',modela.pvalues['trip_time_minutes'], '\nadj_r_squared:',round(modela.rsquared_adj,3))

Const P-Value: 0.0 
trip_time_minutes p-value: 0.0 
adj_r_squared: 0.702


In [152]:
modelb = sm.OLS(ytwo_train, Xtwo_train).fit()
# The printed metric is Adjusted R² (rsquared_adj), so label it that way
print('Const P-Value:',modelb.pvalues['const'],'\ntrip_time_minutes p-value:',modelb.pvalues['trip_time_minutes'],'\ntrip_miles p-value:',modelb.pvalues['trip_miles'], '\nadj_r_squared:',round(modelb.rsquared_adj,3))

Const P-Value: 0.0 
trip_time_minutes p-value: 0.0 
trip_miles p-value: 0.0 
adj_r_squared: 0.84


In [153]:
modelc = sm.OLS(ythree_train, Xthree_train).fit()
# The printed metric is Adjusted R² (rsquared_adj), so label it that way
print('const P-Value:',modelc.pvalues['const'],'\ntrip_time_minutes p-value:',modelc.pvalues['trip_time_minutes'],'\ntrip_miles p-value:',modelc.pvalues['trip_miles'], '\nshared_request_bool p-value:',modelc.pvalues['shared_request_bool'], '\nadj_r_squared:',round(modelc.rsquared_adj,3))

const P-Value: 0.0 
trip_time_minutes p-value: 0.0 
trip_miles p-value: 0.0 
shared_request_bool p-value: 0.0 
adj_r_squared: 0.843


### Step 5 — Inspect Full Summaries (coefficients, p-values, diagnostics)

- Print summaries for the top 1–2 models by **Adjusted R²**.
- Write **unit-based** interpretations “holding others constant.”

In [156]:
modela.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | base_passenger_fare | R-squared: | 0.702 |
| Model: | OLS | Adj. R-squared: | 0.702 |
| Method: | Least Squares | F-statistic: | 14970000.0 |
| Date: | Mon, 10 Nov 2025 | Prob (F-statistic): | 0.0 |
| Time: | 00:22:46 | Log-Likelihood: | -24296000.0 |
| No. Observations: | 6347548 | AIC: | 48590000.0 |
| Df Residuals: | 6347546 | BIC: | 48590000.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -0.9162 | 0.008 | -115.725 | 0.000 | -0.932 | -0.901 |
| trip_time_minutes | 1.2860 | 0.000 | 3868.564 | 0.000 | 1.285 | 1.287 |

| | | | |
|---|---|---|---|
| Omnibus: | 6083095.404 | Durbin-Watson: | 2.0 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 1154101281.519 |
| Skew: | 4.164 | Prob(JB): | 0.0 |
| Kurtosis: | 68.531 | Cond. No. | 42.8 |


In [155]:
modelb.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | base_passenger_fare | R-squared: | 0.84 |
| Model: | OLS | Adj. R-squared: | 0.84 |
| Method: | Least Squares | F-statistic: | 16700000.0 |
| Date: | Mon, 10 Nov 2025 | Prob (F-statistic): | 0.0 |
| Time: | 00:22:45 | Log-Likelihood: | -22318000.0 |
| No. Observations: | 6347548 | AIC: | 44640000.0 |
| Df Residuals: | 6347545 | BIC: | 44640000.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 3.1386 | 0.006 | 518.775 | 0.000 | 3.127 | 3.150 |
| trip_time_minutes | 0.5022 | 0.000 | 1214.183 | 0.000 | 0.501 | 0.503 |
| trip_miles | 2.1659 | 0.001 | 2343.297 | 0.000 | 2.164 | 2.168 |

| | | | |
|---|---|---|---|
| Omnibus: | 6258810.582 | Durbin-Watson: | 2.0 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 1552898725.192 |
| Skew: | 4.288 | Prob(JB): | 0.0 |
| Kurtosis: | 79.144 | Cond. No. | 46.7 |


### Step 6 — Interpretations (write below)

Using the **best model’s** coefficients, interpret each coefficient in a markdown cell.

In [154]:
modelc.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | base_passenger_fare | R-squared: | 0.843 |
| Model: | OLS | Adj. R-squared: | 0.843 |
| Method: | Least Squares | F-statistic: | 11350000.0 |
| Date: | Mon, 10 Nov 2025 | Prob (F-statistic): | 0.0 |
| Time: | 00:22:44 | Log-Likelihood: | -22267000.0 |
| No. Observations: | 6347548 | AIC: | 44530000.0 |
| Df Residuals: | 6347544 | BIC: | 44530000.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 3.2613 | 0.006 | 542.239 | 0.000 | 3.250 | 3.273 |
| trip_time_minutes | 0.5050 | 0.000 | 1230.209 | 0.000 | 0.504 | 0.506 |
| trip_miles | 2.1626 | 0.001 | 2358.223 | 0.000 | 2.161 | 2.164 |
| shared_request_bool | -6.7321 | 0.021 | -318.726 | 0.000 | -6.773 | -6.691 |

| | | | |
|---|---|---|---|
| Omnibus: | 6351432.246 | Durbin-Watson: | 2.0 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 1637172753.035 |
| Skew: | 4.391 | Prob(JB): | 0.0 |
| Kurtosis: | 81.186 | Cond. No. | 164.0 |
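
To make "holding others constant" concrete, the Model C point estimates from the summary above can be plugged into a small function. The example trip (20 minutes, 5 miles) and the helper `predict_fare` are hypothetical choices for illustration; only the four coefficients come from the fitted model.

```python
# Coefficients copied from the Model C summary (point estimates only).
coef = {
    "const": 3.2613,
    "trip_time_minutes": 0.5050,
    "trip_miles": 2.1626,
    "shared_request_bool": -6.7321,
}

def predict_fare(minutes, miles, shared=0):
    """Predicted base fare (USD) from the Model C point estimates."""
    return (coef["const"]
            + coef["trip_time_minutes"] * minutes
            + coef["trip_miles"] * miles
            + coef["shared_request_bool"] * shared)

base = predict_fare(minutes=20, miles=5)   # hypothetical reference trip
print(f"20 min, 5 mi, not shared: ${base:.2f}")
# Change one input at a time; the difference equals that input's coefficient.
print(f"one more mile:   {predict_fare(20, 6) - base:+.2f}")
print(f"one more minute: {predict_fare(21, 5) - base:+.2f}")
print(f"shared request:  {predict_fare(20, 5, shared=1) - base:+.2f}")
```

Each difference reproduces a single coefficient because the other inputs stay fixed, which is exactly what the unit-based sentences should say.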


## We Share — Reflection & Wrap‑Up

Write **2 short paragraphs** and be specific:

1) **Which model (A/B/C) do you pick and why?**  
Reference **Adjusted R²** (higher is better when comparing models with different numbers of predictors) and the **p-values**/signs of key coefficients.

2) **Business explanation:**  
Give a stakeholder-friendly summary in **units** (e.g., “+1 mile ≈ +$X in base fare, holding time constant”). If you added flags, explain their effect plainly. Mention any limitations (e.g., time vs distance confounding, missing columns).

Model C is the best choice because it has the highest Adjusted R² (0.843, versus 0.840 for Model B and 0.702 for Model A), meaning it explains about 84% of the variation in base fares after penalizing for the extra predictor. All coefficients are highly significant (p < 0.001) and have logical signs: fares increase with both trip time and distance and decrease for shared-ride requests. The gain over Model B is small (0.003), but the shared-request coefficient is large in dollar terms, so Model C is both statistically strong and economically sensible.

Each additional minute adds about $0.51 to the base fare, and each extra mile adds about $2.16, holding the other predictors constant. A shared-ride request is associated with a base fare roughly $6.73 lower than an otherwise identical non-shared trip. In other words, fares grow predictably with distance and time, while shared requests receive a clear discount. Some limitations remain: trip time and distance are correlated, so their separate effects are harder to pin down, and the model omits factors such as surge pricing and location-based fees.
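
The time and distance overlap shows up in the fitted models: the `trip_time_minutes` coefficient drops from 1.2860 in Model A to 0.5022 once `trip_miles` enters in Model B. A quick way to gauge that overlap is the correlation between the two predictors and the variance inflation factor (VIF) it implies. The sketch below uses made-up data; on the real frame the same check would start from `df[['trip_miles', 'trip_time_minutes']].corr()`.

```python
import numpy as np

# Synthetic, made-up trips: longer distances take longer, plus traffic noise.
rng = np.random.default_rng(7)
n = 10_000
miles = rng.uniform(0.5, 20, n)
minutes = 6 + 2.5 * miles + rng.normal(0, 6, n)

r = np.corrcoef(miles, minutes)[0, 1]
vif = 1.0 / (1.0 - r ** 2)     # with exactly two predictors, both share this VIF
print(f"corr(miles, minutes) = {r:.3f}")
print(f"VIF = {vif:.2f}  (standard errors inflated by about {np.sqrt(vif):.2f}x)")
```

A high VIF does not bias the coefficients, but it widens their confidence intervals, so the split between "per mile" and "per minute" effects deserves some caution.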