# Data Challenge 10 — MLR Interpretation with Adjusted R² (HVFHV Trips)


**Format:** Instructor Guidance → You Do (Students) → We Share (Reflection)

**Goal:** Build **3 MLR models** with different feature sets to predict a numeric target, then compare **Adjusted R²** and **p-values** to select the better model and justify it in business terms.

**Data:** July 1, 2023 - July 15, 2023 For Hire Vehicle Data in NYC

[July For Hire Vehicles Data](https://data.cityofnewyork.us/Transportation/2023-High-Volume-FHV-Trip-Data/u253-aew4/about_data)


## Instructor Guidance

**Hint: Use the Lecture Deck, Canvas Reading, and Docs to help you with the code**

Use this guide live; students implement below.

**Docs (quick links):**
- TLC HVFHV data dictionary (columns/meaning): https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_hvfhs.pdf  
- statsmodels OLS: https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html  
- OLS Results (attributes like `rsquared_adj`, `pvalues`): https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.RegressionResults.html  

### Pseudocode Plan
1) **Load CSV** → preview columns/shape; confirm target & candidate predictors exist.  
2) **Assign Y + Xs** (start small, add features with a hypothesis). Coerce **just these columns** to numeric.  
3) **Light prep:** derive `trip_time_minutes` from `trip_time` (seconds); convert flags (`shared_request_flag`, `wav_request_flag`) to 0/1 if present.  
4) **Model sets (3 total):**  
   - **Model A (parsimonious).**  
   - **Model B (adds one meaningful predictor).**  
   - **Model C (adds 1–2 more, e.g., flags).  
5) **Add intercept** and **fit** each with OLS on the same rows.  
6) **Record metrics:** `rsquared_adj`, coefficient table, and **p-values**.  
7) **Compare:** Prefer higher **Adjusted R²** and keep an eye on **p-values** (and signs/units).  
8) **Interpretation:** Write unit-based sentences **holding others constant**.  
9) **Selection rationale:** Pick the simplest model that improves **Adjusted R²** and 


## You Do — Student Section
Work in pairs. Comment your choices briefly. Keep code simple—only coerce the columns you use.

### Step 0 — Setup & Imports

In [1]:
import pandas as pd, numpy as np
import statsmodels.api as sm
from pathlib import Path

pd.set_option('display.float_format', lambda x: f'{x:,.4f}')

### Step 1 — Load CSV & Preview
- Point to your For Hire Vehicle Data 
- Print **shape** and **columns**.

**Hint: You may have to drop missing values and do a force coercion to make sure the variables stay numeric (other coding assignments may help)**

In [8]:
df = pd.read_csv("/Users/kabbo/Downloads/FHV_072023.csv", low_memory=False)

In [9]:
df.columns

Index(['hvfhs_license_num', 'dispatching_base_num', 'originating_base_num',
       'request_datetime', 'on_scene_datetime', 'pickup_datetime',
       'dropoff_datetime', 'PULocationID', 'DOLocationID', 'trip_miles',
       'trip_time', 'base_passenger_fare', 'tolls', 'bcf', 'sales_tax',
       'congestion_surcharge', 'airport_fee', 'tips', 'driver_pay',
       'shared_request_flag', 'shared_match_flag', 'access_a_ride_flag',
       'wav_request_flag', 'wav_match_flag'],
      dtype='object')

### Step 2 —  Choose Target **Y** and Candidate Predictors

- Suggested **Y**: `base_passenger_fare` (USD).
- Start with **distance** and **time**; optionally add **flags** if present.
- Derive `trip_time_minutes` from `trip_time` (seconds) if available.

In [3]:
pip install pyarrow

Note: you may need to restart the kernel to use updated packages.


In [12]:
# Remove commas and clean up trip_time
df['trip_time'] = (
    df['trip_time']
    .astype(str)
    .str.replace(',', '', regex=False)             # remove commas
    .str.strip()                                  # remove spaces
    .replace('', np.nan)                          # turn empty strings into NaN
    .str.replace(r'[^0-9.]', '', regex=True)      # keep only numbers
)

# convert to numeric safely
df['trip_time'] = pd.to_numeric(df['trip_time'])

# convert from seconds to minutes
df['trip_time_minutes'] = df['trip_time'] / 60

# check
df[['trip_time', 'trip_time_minutes']].head(10)


Unnamed: 0,trip_time,trip_time_minutes
0,417,6.95
1,1080,18.0
2,515,8.5833
3,2607,43.45
4,507,8.45
5,851,14.1833
6,797,13.2833
7,3151,52.5167
8,200,3.3333
9,1112,18.5333


In [13]:
# convert flags (`shared_request_flag`, `wav_request_flag`) to 0/1 if present.
df['shared_request_flag'] = df['shared_request_flag'].map({'Y': 1, 'N':0})
df['wav_request_flag'] = df['wav_request_flag'].map({'Y': 1, 'N':0})

### Step 3 — Define Three Model Specs (A, B, C)
Example models you can chose any models you want as long as Model A has one term, Model B two terms, etc.

- **Model A:** distance only.  
- **Model B:** distance + time (minutes).  
- **Model C:** distance + time + flags (whichever exist).

In [19]:
import numpy as np

# Convert the 'Y'/'N' flags into 1/0 columns for the model
df['is_shared_request'] = np.where(df['shared_request_flag'] == 'Y', 1, 0)
df['is_wav_request'] = np.where(df['wav_request_flag'] == 'Y', 1, 0)



In [None]:
import statsmodels.api as sm

# Define the target variable (same for all models)
Y = df['base_passenger_fare']

# Model A
# Hypothesis: Fare is just a function of distance.
X_A_features = ['trip_miles']
X_A = df[X_A_features]


# Model B 
# Hypothesis: Adding trip time will significantly improve the model.
X_B_features = ['trip_miles', 'trip_time_minutes']
X_B = df[X_B_features]


# Model C
# Hypothesis: Ride flags (shared, WAV) also explain the fare.
X_C_features = ['trip_miles', 'trip_time_minutes', 'is_shared_request', 'is_wav_request']
X_C = df[X_C_features]


print(f"Model A features: {X_A_features}")
print(f"Model B features: {X_B_features}")
print(f"Model C features: {X_C_features}")

Model A features: ['trip_miles']
Model B features: ['trip_miles', 'trip_time_minutes']
Model C features: ['trip_miles', 'trip_time_minutes', 'is_shared_request', 'is_wav_request']


### Step 4 — Fit Each Model (with intercept) and Collect Adjusted R² & p-values


In [None]:
import statsmodels.api as sm
import pandas as pd


# Add Constant (Intercept)
X_A_const = sm.add_constant(X_A, has_constant='add')
X_B_const = sm.add_constant(X_B, has_constant='add')
X_C_const = sm.add_constant(X_C, has_constant='add')


# Fit OLS Models
# We fit each model on the same Y variable
model_A = sm.OLS(Y, X_A_const).fit()
model_B = sm.OLS(Y, X_B_const).fit()
model_C = sm.OLS(Y, X_C_const).fit()


# Collect Metrics


# We use a dictionary to build a comparison table
metrics_data = {
    'Model A': {
        'Adj. R-squared': model_A.rsquared_adj,
        'Num. Features': len(model_A.pvalues) - 1  # -1 for the intercept
    },
    'Model B': {
        'Adj. R-squared': model_B.rsquared_adj,
        'Num. Features': len(model_B.pvalues) - 1
    },
    'Model C': {
        'Adj. R-squared': model_C.rsquared_adj,
        'Num. Features': len(model_C.pvalues) - 1
    },
}

# Transpose for a clean view (Models as rows)
metrics_df = pd.DataFrame(metrics_data).T 
print(metrics_df.to_markdown(floatfmt=".6f"))

print("\n\nP-Values for Each Model")

print("\nModel A P-Values:")
print(model_A.pvalues) 

print("\nModel B P-Values:")
print(model_B.pvalues)

print("\nModel C P-Values:")
print(model_C.pvalues)

|         |   Adj. R-squared |   Num. Features |
|:--------|-----------------:|----------------:|
| Model A |         0.808383 |        1.000000 |
| Model B |         0.844032 |        2.000000 |
| Model C |         0.844032 |        4.000000 |


P-Values for Each Model

Model A P-Values:
const        0.0000
trip_miles   0.0000
dtype: float64

Model B P-Values:
const               0.0000
trip_miles          0.0000
trip_time_minutes   0.0000
dtype: float64

Model C P-Values:
const               0.0000
trip_miles          0.0000
trip_time_minutes   0.0000
is_shared_request      NaN
is_wav_request         NaN
dtype: float64


In [26]:
# Check the counts of your flag columns
print("Value counts for is_shared_request:")
print(df['is_shared_request'].value_counts())

print("\nValue counts for is_wav_request:")
print(df['is_wav_request'].value_counts())

Value counts for is_shared_request:
is_shared_request
0    8324576
Name: count, dtype: int64

Value counts for is_wav_request:
is_wav_request
0    8324576
Name: count, dtype: int64


### Step 5 — Inspect Full Summaries (coefficients, p-values, diagnostics)

- Print summaries for the top 1–2 models by **Adjusted R²**.
- Write **unit-based** interpretations “holding others constant.”

In [27]:
# Print the summary for Model A
print(model_A.summary())

# Print the summary for Model B
print(model_B.summary())

                             OLS Regression Results                            
Dep. Variable:     base_passenger_fare   R-squared:                       0.808
Model:                             OLS   Adj. R-squared:                  0.808
Method:                  Least Squares   F-statistic:                 3.512e+07
Date:                 Sat, 08 Nov 2025   Prob (F-statistic):               0.00
Time:                         18:16:17   Log-Likelihood:            -3.0050e+07
No. Observations:              8324576   AIC:                         6.010e+07
Df Residuals:                  8324574   BIC:                         6.010e+07
Df Model:                            1                                         
Covariance Type:             nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          8.1081      0.004   1990.68

## Model A Interpretation

* **Adjusted R-squared:** 0.808 (This model explains 80.8% of the variation in fare.)
* **Intercept (`const`):** The model predicts an **$8.11** fare for a 0-mile trip.
* **`trip_miles` ($\beta_1$):** For each one additional mile of `trip_miles`, the `base_passenger_fare` is expected to **increase by $3.09**.

## Model B Interpretation (The Better Model)

* **Adjusted R-squared:** 0.844 (This model explains 84.4% of the variation in fare.)
* **Intercept (`const`):** The model predicts a **$3.34** fare for a trip of 0 miles and 0 minutes.
* **`trip_miles` ($\beta_1$):** Holding `trip_time_minutes` constant, for each one additional mile of `trip_miles`, the `base_passenger_fare` is expected to **increase by $2.19**.
* **`trip_time_minutes` ($\beta_2$):** Holding `trip_miles` constant, for each one additional minute of `trip_time_minutes`, the `base_passenger_fare` is expected to **increase by $0.49** (49 cents).

### Step 6 — Interpretations (write below)

Using the **best model’s** coefficients interpret each coefficient using markdown

Based on the analysis, **Model B is the best model**.

It has the highest valid **Adjusted R-squared (0.844)**, a significant improvement over Model A (0.808). Model C failed to add any predictive power, as its new features had no variance.

Interpretations for Model B's coefficients:

* **const (Intercept): $3.34**
    * The model predicts a starting fare of **$3.34** for a trip of 0 miles and 0 minutes. This is a mathematical anchor for the model and does not have a practical business meaning.

* **`trip_miles`: $2.19**
    * Holding `trip_time_minutes` constant, for each **one additional mile** of `trip_miles`, the `base_passenger_fare` is expected to **increase by $2.19**.

* **`trip_time_minutes`: $0.49**
    * Holding `trip_miles` constant, for each **one additional minute** of `trip_time_minutes`, the `base_passenger_fare` is expected to **increase by $0.49** (49 cents).

## We Share — Reflection & Wrap‑Up

Write **2 short paragraphs** and be specific:

1) **Which model (A/B/C) do you pick and why?**  
Reference **Adjusted R²** (higher is better when comparing models with different numbers of predictors) and the **p-values**/signs of key coefficients.

2) **Business explanation:**  
Give a stakeholder-friendly summary in **units** (e.g., “+1 mile ≈ +$X in base fare, holding time constant”). If you added flags, explain their effect plainly. Mention any limitations (e.g., time vs distance confounding, missing columns).

***1. I select Model B as the final model.*** 

        The selection is based on Adjusted R-squared, which is the correct metric for comparing models with different numbers of predictors. Model A (trip_miles only) yielded an Adjusted R² of 0.808, while Model B (trip_miles + trip_time_minutes) showed a significant improvement to 0.844. This indicates that adding trip_time_minutes explained a substantial new portion of the variance in base_passenger_fare. All coefficients in Model B had p-values of 0.000, confirming both predictors are statistically significant. Model C failed because its added features (is_shared_request, is_wav_request) had no variance in the cleaned data, resulting in NaN p-values and no improvement over Model B.

***2. Business explanation***

         The analysis found that the base_passenger_fare is best predicted by a combination of both distance and time. Model B, which explains 84.4% of the fare, gives us two clear business rules: holding trip time constant, each additional mile driven adds approximately $2.19 to the fare. Similarly, holding distance constant, each additional minute of trip time (for example, time spent in traffic) adds $0.49 to the fare.  I was unable to measure the effect of shared or wheelchair-accessible rides, as no such rides were present in our final clean dataset. A primary limitation is that miles and minutes are highly correlated, but the model is nonetheless a strong predictor of base fare.