# Homework

The goal of this homework is to create a simple training pipeline and use MLflow to track experiments and register the best model, but with Mage running the pipeline.

We'll use the same NYC taxi dataset: the Yellow taxi data for March 2023.

## Question 1. Select the Tool

You can use the same tool you used when completing the module, or choose a different one for your homework.

What's the name of the orchestrator you chose?

### Prefect

## Question 2. Version

What's the version of the orchestrator?

In [1]:
!prefect version

Version:             3.4.3
API version:         0.8.4
Python version:      3.10.5
Git commit:          1c2ba7a4
Built:               Thu, May 22, 2025 10:00 PM
OS/Arch:             win32/AMD64
Profile:             local
Server type:         ephemeral
Pydantic version:    2.11.5
Server:
  Database:          sqlite
  SQLite version:    3.37.2


## Question 3. Creating a pipeline

Let's read the March 2023 Yellow taxi trips data.

How many records did we load?

In [2]:
import pandas as pd

In [3]:
year = 2023
month = 3

In [5]:
url = f'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_{year}-{month:02d}.parquet'
df = pd.read_parquet(url)
df.shape

(3403766, 19)
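The URL pattern used above can be wrapped in a small helper so that other months are easy to load. This is only a sketch: `build_url` is a hypothetical name, not something the homework defines.

```python
def build_url(year: int, month: int) -> str:
    """Build the download URL for one month of Yellow taxi trip data."""
    # {month:02d} zero-pads the month, so 3 becomes "03".
    return (
        "https://d37ci6vzurychx.cloudfront.net/trip-data/"
        f"yellow_tripdata_{year}-{month:02d}.parquet"
    )


print(build_url(2023, 3))
```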

## Question 4. Data preparation

Let's continue with pipeline creation.

We will use the same logic for preparing the data we used previously.

This is what we used (adjusted for yellow dataset):

In [6]:
def read_dataframe(filename):
    df = pd.read_parquet(filename)

    # Trip duration in minutes
    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df.duration = df.duration.dt.total_seconds() / 60

    # Keep trips between 1 and 60 minutes; copy to avoid SettingWithCopyWarning
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    # Treat location IDs as categorical (string) features
    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)

    return df

Let's apply it to the data we loaded in Question 3.

What's the size of the result?

In [7]:
df = read_dataframe(url)
df.shape

(3316216, 20)
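To see what the preparation step does to individual rows, here is a minimal sketch that repeats the same logic on three made-up trips. The timestamps and location IDs are invented for illustration. The last lines compare the two shapes reported above.

```python
import pandas as pd

# Three made-up trips: 30 seconds, 15 minutes, and 2 hours long.
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2023-03-01 10:00:00", "2023-03-01 11:00:00", "2023-03-01 12:00:00"]
    ),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2023-03-01 10:00:30", "2023-03-01 11:15:00", "2023-03-01 14:00:00"]
    ),
    "PULocationID": [161, 237, 48],
    "DOLocationID": [236, 141, 132],
})

# Same steps as read_dataframe: duration in minutes, then keep 1..60 minutes.
trips["duration"] = (
    trips.tpep_dropoff_datetime - trips.tpep_pickup_datetime
).dt.total_seconds() / 60
kept = trips[(trips.duration >= 1) & (trips.duration <= 60)].copy()

# Location IDs are cast to strings so they are treated as categories later.
categorical = ["PULocationID", "DOLocationID"]
kept[categorical] = kept[categorical].astype(str)

print(kept[["duration", "PULocationID", "DOLocationID"]])

# Rows dropped in the homework run, from the shapes before and after.
loaded, prepared = 3_403_766, 3_316_216
print(f"dropped {loaded - prepared} rows, kept {prepared / loaded:.2%}")
```

Only the 15-minute trip survives the filter. On the real data the same filter drops 87,550 rows, roughly 2.6% of what was loaded, and adds the `duration` column, which is why the column count goes from 19 to 20.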

## Question 5. Train a model

We will now train a linear regression model using the same code as in homework 1.

- Fit a dict vectorizer.
- Train a linear regression with default parameters.
- Use pick up and drop off locations separately, don't create a combination feature.

Let's now use it in the pipeline. We will need to create another transformation block, and return both the dict vectorizer and the model.

What's the intercept of the model?

Hint: print the intercept_ field in the code block

In [8]:
from sklearn.feature_extraction import DictVectorizer

In [11]:
categorical = ['PULocationID', 'DOLocationID']

train_dicts = df[categorical].to_dict(orient='records')

dv = DictVectorizer()
X_train = dv.fit_transform(train_dicts)

In [12]:
from sklearn.linear_model import LinearRegression

In [13]:
target = 'duration'
y_train = df[target].values

In [14]:
lr = LinearRegression()
lr.fit(X_train, y_train)

In [15]:
lr.intercept_

np.float64(24.776359644078624)

## Question 6. Register the model

The model is trained, so let's save it with MLflow.

Find the logged model and its MLmodel file. What's the size of the model (the model_size_bytes field)?

In [16]:
import mlflow

mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("nyc-taxi-homework3")

2025/05/29 09:11:45 INFO mlflow.store.db.utils: Creating initial MLflow database tables...
2025/05/29 09:11:45 INFO mlflow.store.db.utils: Updating database tables
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
INFO  [alembic.runtime.migration] Running upgrade  -> 451aebb31d03, add metric step
INFO  [alembic.runtime.migration] Running upgrade 451aebb31d03 -> 90e64c465722, migrate user column to tags
INFO  [alembic.runtime.migration] Running upgrade 90e64c465722 -> 181f10493468, allow nulls for metric values
INFO  [alembic.runtime.migration] Running upgrade 181f10493468 -> df50e92ffc5e, Add Experiment Tags Table
INFO  [alembic.runtime.migration] Running upgrade df50e92ffc5e -> 7ac759974ad8, Update run tags with larger limit
INFO  [alembic.runtime.migration] Running upgrade 7ac759974ad8 -> 89d4b8295536, create latest metrics table
INFO  [89d4b8295536_create_latest_metrics_table_py] Migration complete!
INFO  

<Experiment: artifact_location=('file:///d:/Universidad/Cursos/MLOps en '
 'Zoomcamp/03-orchestration/Homework3/mlruns/1'), creation_time=1748531513146, experiment_id='1', last_update_time=1748531513146, lifecycle_stage='active', name='nyc-taxi-homework3', tags={}>

In [18]:
with mlflow.start_run():
    mlflow.sklearn.log_model(lr, artifact_path="models_sklearn")




![image.png](attachment:image.png)
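The `model_size_bytes` value can also be read programmatically. MLflow writes an `MLmodel` YAML file next to each logged model, and the field appears there as a top-level `key: value` line. The sketch below assumes the artifacts are stored locally under an `mlruns` directory, as the `artifact_location` above suggests. `find_model_sizes` is a hypothetical helper, and it scans lines instead of using a YAML parser to stay dependency-free.

```python
from pathlib import Path


def find_model_sizes(root: str) -> dict:
    """Map each MLmodel file under `root` to its model_size_bytes value."""
    sizes = {}
    root_path = Path(root)
    if not root_path.is_dir():
        return sizes  # nothing logged here yet
    for mlmodel in root_path.rglob("MLmodel"):
        for line in mlmodel.read_text(encoding="utf-8").splitlines():
            # The field is a top-level "key: value" line in the YAML file.
            if line.startswith("model_size_bytes:"):
                sizes[str(mlmodel)] = int(line.split(":", 1)[1].strip())
                break
    return sizes


# Assumes the default local artifact directory next to the notebook.
for path, size in find_model_sizes("mlruns").items():
    print(f"{path}: {size} bytes")
```

If the run was logged to a different artifact location, pass that directory instead of `mlruns`.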