# Homework03 - Rui Pinto

In [1]:
# Before running this notebook, start the MLflow UI.
# From the project root: mlflow ui --backend-store-uri file:$(pwd)/03-orchestration/mlruns
# Or, from the 03-orchestration directory: mlflow ui --backend-store-uri file:./mlruns

In [2]:
import pandas as pd
import sys
import subprocess

# Import utility functions that we've defined for model training and MLflow logging
from model_utils import (
    read_dataframe,
    create_X,
    train_linear_model,
    log_model_with_mlflow,
    find_model_size,
)

2025/06/09 08:30:53 INFO mlflow.tracking.fluent: Experiment with name 'nyc-taxi-experiment' does not exist. Creating a new experiment.


Using MLflow server at http://localhost:5000


# Q1. Select the Tool

You can use the same tool you used when completing the module, or choose a different one for your homework.

What's the name of the orchestrator you chose?

- Prefect ✅

# Q2. Version

What's the version of the orchestrator?

In [3]:
# check version of prefect
print("Checking Prefect version...")


def get_prefect_version():
    try:
        result = subprocess.run(
            [sys.executable, "-m", "prefect", "--version"],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()
    except subprocess.CalledProcessError as e:
        print(f"Error checking Prefect version: {e}")
        return None


version = get_prefect_version()
if version:
    print(f"Prefect version: {version}")

Checking Prefect version...
Prefect version: 3.4.5
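
The same check can also be done without spawning a subprocess. The following is a minimal sketch, assuming Prefect was installed as a regular Python distribution named `prefect`; the helper name `installed_version` is hypothetical and not part of `model_utils`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints None instead of raising if Prefect is missing from the environment
print(f"Prefect version: {installed_version('prefect')}")
```

This reads the version from the installed package metadata, so it works even when the `prefect` command is not on the `PATH`.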


# Q3. Creating a pipeline

Let's read the March 2023 Yellow taxi trips data.

How many records did we load?

- 3,003,766
- 3,203,766
- 3,403,766 ✅
- 3,603,766

(Include a print statement in your code)

In [4]:
#!curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-03.parquet > data/yellow_tripdata_2023-03.parquet

In [5]:
def read_parquet_file(file_path):
    try:
        df = pd.read_parquet(file_path)
        print(f"DataFrame shape: {df.shape}")
        return df
    except Exception as e:
        print(f"Error reading parquet file: {e}")
        return None


file_path = "data/yellow_tripdata_2023-03.parquet"

# Load the raw, unfiltered data with the helper defined above.
# Q3 asks for the record count before any data preparation, so we do not
# fall back to read_dataframe here (it filters trips by duration).
raw_df = read_parquet_file(file_path)
if raw_df is not None:
    print(f"\nWe have {raw_df.shape[0]:,} records")
    print("DataFrame loaded successfully.")

DataFrame shape: (3403766, 19)

We have 3,403,766 records
DataFrame loaded successfully.


# Q4. Data preparation

Let's continue with pipeline creation.

We will use the same logic for preparing the data we used previously.

This is what we used (adjusted for the yellow taxi dataset):

```python
def read_dataframe(filename):
    df = pd.read_parquet(filename)

    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df.duration = df.duration.dt.total_seconds() / 60

    df = df[(df.duration >= 1) & (df.duration <= 60)]

    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)
    
    return df
```
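
To see what the duration filter above does, here is a small sketch on a hand-made frame. The three trips and their timestamps are invented for illustration; only the column names and the 1 to 60 minute bounds come from the function above. The `.copy()` call is an addition that avoids pandas' chained-assignment warning when converting the categorical columns.

```python
import pandas as pd

# Three made-up trips: 30 seconds, 15 minutes, and 90 minutes long
toy = pd.DataFrame(
    {
        "tpep_pickup_datetime": pd.to_datetime(
            ["2023-03-01 08:00:00", "2023-03-01 09:00:00", "2023-03-01 10:00:00"]
        ),
        "tpep_dropoff_datetime": pd.to_datetime(
            ["2023-03-01 08:00:30", "2023-03-01 09:15:00", "2023-03-01 11:30:00"]
        ),
        "PULocationID": [161, 43, 237],
        "DOLocationID": [237, 141, 161],
    }
)

# Duration in minutes, computed the same way as in read_dataframe
toy["duration"] = (
    toy.tpep_dropoff_datetime - toy.tpep_pickup_datetime
).dt.total_seconds() / 60

# Keep only trips between 1 and 60 minutes (inclusive on both ends)
kept = toy[(toy.duration >= 1) & (toy.duration <= 60)].copy()

# Location IDs become strings so they are treated as categories, not numbers
categorical = ["PULocationID", "DOLocationID"]
kept[categorical] = kept[categorical].astype(str)

print(toy.duration.tolist())
print(f"Kept {len(kept)} of {len(toy)} trips")
```

Only the 15-minute trip survives: the 30-second trip falls below the lower bound and the 90-minute trip exceeds the upper bound.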

Let's apply to the data we loaded in question 3.

What's the size of the result?

- 2,903,766
- 3,103,766
- 3,316,216 ✅
- 3,503,766

In [6]:
# Process the data using read_dataframe function
df = read_dataframe(2023, 3)
print(f"\nNumber of records after data preparation: {df.shape[0]:,}\n")

Attempting to read from local file: data/yellow_tripdata_2023-03.parquet
DataFrame shape after processing: (3316216, 20)

Number of records after data preparation: 3,316,216



# Q5. Train a model

We will now train a linear regression model using the same code as in homework 1.

- Fit a dict vectorizer.
- Train a linear regression with default parameters.
- Use the pickup and dropoff locations as separate features; don't create a combination feature.

Let's now use it in the pipeline. We will need to create another transformation block, and return both the dict vectorizer and the model.

What's the intercept of the model?

Hint: print the `intercept_` field in the code block

- 21.77
- 24.77 ✅
- 27.77
- 31.77

In [7]:
# fit dict vectorizer and transform the data
X, dv = create_X(df)

# prepare target variable
target = "duration"
y = df[target].values

# Train model using our utility function
lr, rmse = train_linear_model(X, y)

# print the intercept
print(f"Model intercept: {lr.intercept_:.2f}")
print(f"RMSE on training data: {rmse:.2f}")

Model intercept: 24.78
RMSE on training data: 8.16
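
The helpers `create_X` and `train_linear_model` come from `model_utils` and are not shown in this notebook. The following is a minimal sketch of what such helpers might look like, assuming scikit-learn's `DictVectorizer` and `LinearRegression` as described in the task. The `_sketch` function names and the toy records are hypothetical, and the actual implementations in `model_utils` may differ.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def create_X_sketch(records, dv=None):
    """Turn a list of feature dicts into a sparse matrix, fitting a vectorizer if none is given."""
    if dv is None:
        dv = DictVectorizer(sparse=True)
        X = dv.fit_transform(records)
    else:
        X = dv.transform(records)
    return X, dv


def train_linear_model_sketch(X, y):
    """Fit a linear regression with default parameters and return it with the training RMSE."""
    lr = LinearRegression()
    lr.fit(X, y)
    rmse = mean_squared_error(y, lr.predict(X)) ** 0.5
    return lr, rmse


# Toy data: pickup and dropoff IDs stay separate string features (no combined feature)
toy_records = [
    {"PULocationID": "161", "DOLocationID": "237"},
    {"PULocationID": "161", "DOLocationID": "141"},
    {"PULocationID": "43", "DOLocationID": "237"},
    {"PULocationID": "43", "DOLocationID": "141"},
]
toy_durations = [10.0, 12.0, 14.0, 16.0]

X_toy, dv_toy = create_X_sketch(toy_records)
lr_toy, rmse_toy = train_linear_model_sketch(X_toy, toy_durations)

print(f"Feature matrix shape: {X_toy.shape}")
print(f"Features: {sorted(dv_toy.feature_names_)}")
print(f"Intercept: {lr_toy.intercept_:.2f}")
```

Because the location IDs are strings, `DictVectorizer` one-hot encodes them, giving one column per distinct pickup ID and one per distinct dropoff ID.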


# Q6. Register the model

The model is trained, so let's save it with MLflow.

Find the logged model, and find the MLmodel file. What's the size of the model? (`model_size_bytes` field):

- 14,534
- 9,534
- 4,534 ✅
- 1,534

In [8]:
# Register the model with MLflow
run_id, artifact_uri = log_model_with_mlflow(lr, X, y, dv, rmse)

# Print the run_id and artifact_uri
print(f"MLflow run ID: {run_id}")
print(f"Artifact URI: {artifact_uri}")

# Find model size bytes in MLmodel file
model_sizes = find_model_size()



🏃 View run salty-steed-807 at: http://localhost:5000/#/experiments/1/runs/7b0cbdee22e94ce6956e63334c7f1e17
🧪 View experiment at: http://localhost:5000/#/experiments/1
MLflow run ID: 7b0cbdee22e94ce6956e63334c7f1e17
Artifact URI: mlflow-artifacts:/1/7b0cbdee22e94ce6956e63334c7f1e17/artifacts
Found 0 MLmodel files

Q6 Answer: 4,534 bytes
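
The cell above reported `Found 0 MLmodel files`. One possible reason, stated here as an assumption, is that the artifact URI uses the `mlflow-artifacts:` scheme, so the files are stored by the tracking server rather than in the directory that was searched. When the MLmodel files are available on local disk, the field can be read with a sketch like the one below. The function name `read_model_sizes` and the `mlruns` search root are hypothetical; this is not the `find_model_size` helper from `model_utils`.

```python
from pathlib import Path


def read_model_sizes(root):
    """Map each MLmodel file found under `root` to its model_size_bytes value."""
    root = Path(root)
    sizes = {}
    if not root.is_dir():
        return sizes
    for mlmodel_path in root.rglob("MLmodel"):
        # MLmodel is a YAML file; model_size_bytes is a simple top-level "key: value" line
        for line in mlmodel_path.read_text().splitlines():
            key, _, value = line.partition(":")
            if key.strip() == "model_size_bytes":
                sizes[str(mlmodel_path)] = int(value.strip())
                break
    return sizes


# Assumed search root; adjust to wherever the tracking server stores its artifacts
for path, size in read_model_sizes("mlruns").items():
    print(f"{path}: {size:,} bytes")
```

The plain line scan avoids a YAML dependency; a full YAML parser would be the more robust choice if other fields are needed as well.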
