## Homework

The goal of this homework is to create a simple training pipeline that uses MLflow to track experiments and register the best model, and to run it with Mage.

We'll use [the same NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page), the **Yellow** taxi data for March 2023.

## Question 1. Select the Tool

You can use the same tool you used when completing the module,
or choose a different one for your homework.

What's the name of the orchestrator you chose? 

**Airflow**

## Question 2. Version

What's the version of the orchestrator? 

**3.0.0**
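
The version can be read from the `airflow version` command, or from Python. The snippet below is a minimal sketch of the Python route; `installed_version` is a hypothetical helper name, and it assumes Airflow was installed under its PyPI distribution name `apache-airflow`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        # The distribution is not installed in this environment
        return None


# "apache-airflow" is the PyPI distribution name for Airflow
print(installed_version("apache-airflow"))
```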

## Question 3. Creating a pipeline

Let's read the March 2023 Yellow taxi trips data.

How many records did we load? 

- **3,403,766**

(Include a print statement in your code)

```python
import pandas as pd
# Downloaded the data from this URL: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

# Load the data
df = pd.read_parquet('../data/yellow_tripdata_2023-03.parquet')

# Display the first few rows of the DataFrame
df.head()

# Count the number of rows in the DataFrame
row_count = df.shape[0]
# Print the row count
print(f"Number of rows in the DataFrame: {row_count}")
```

```
Number of rows in the DataFrame: 3403766
```


## Question 4. Data preparation

Let's continue with pipeline creation.

We will use the same logic for preparing the data we used previously. 

This is what we used (adjusted for yellow dataset):

```python
def read_dataframe(filename):
    df = pd.read_parquet(filename)

    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df.duration = df.duration.dt.total_seconds() / 60

    df = df[(df.duration >= 1) & (df.duration <= 60)]

    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)
    
    return df
```

Let's apply it to the data we loaded in question 3.

What's the size of the result? 

- **3,316,216** 

```python
def read_dataframe(filename):
    df = pd.read_parquet(filename)

    # Trip duration in minutes
    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df.duration = df.duration.dt.total_seconds() / 60

    # Keep trips between 1 and 60 minutes (inclusive);
    # .copy() avoids pandas' SettingWithCopyWarning on the assignment below
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    # Treat location IDs as categorical (string) features
    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)

    return df

df = read_dataframe('../data/yellow_tripdata_2023-03.parquet')
# Count the number of rows in the DataFrame
row_count = df.shape[0]
# Print the row count
print(f"Number of rows in the DataFrame: {row_count}")
```

```
Number of rows in the DataFrame: 3316216
```
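
To see what the duration filter keeps and drops, here is a small sketch on made-up trips. The four timestamps are hypothetical and chosen to sit on or just outside the 1 to 60 minute bounds; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical trips lasting 0.5, 1, 60 and 61 minutes
toy = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2023-03-01 08:00:00",
        "2023-03-01 09:00:00",
        "2023-03-01 10:00:00",
        "2023-03-01 11:00:00",
    ]),
    "tpep_dropoff_datetime": pd.to_datetime([
        "2023-03-01 08:00:30",
        "2023-03-01 09:01:00",
        "2023-03-01 11:00:00",
        "2023-03-01 12:01:00",
    ]),
})

# Same duration logic as read_dataframe: timedelta converted to minutes
toy["duration"] = (
    toy.tpep_dropoff_datetime - toy.tpep_pickup_datetime
).dt.total_seconds() / 60

# Both bounds are inclusive, so the 1-minute and 60-minute trips survive
kept = toy[(toy.duration >= 1) & (toy.duration <= 60)]
print(kept.duration.tolist())
```

The same inclusive bounds explain the drop from 3,403,766 to 3,316,216 rows on the real data: trips shorter than one minute or longer than an hour are removed.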


## Question 5. Train a model

We will now train a linear regression model using the same code as in homework 1.

* Fit a dict vectorizer.
* Train a linear regression with default parameters.
* Use pickup and drop-off locations separately; don't create a combination feature.

Let's now use it in the pipeline. We will need to create another transformation block, and return both the dict vectorizer and the model.

What's the intercept of the model? 

Hint: print the `intercept_` field in the code block

- **24.77** (the closest option; the code below prints 24.78)

```python
import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Feature preparation
def prepare_features(df):
    df_copy = df.copy()
    
    # Keep only the required columns (PULocationID and DOLocationID)
    df_features = df_copy[['PULocationID', 'DOLocationID']]

    # Convert dataframe to list of dictionaries with string conversion
    dicts = df_features.astype(str).to_dict(orient='records')

    # Create and fit a DictVectorizer
    dv = DictVectorizer()
    X = dv.fit_transform(dicts)

    # Get the dimensionality (number of columns)
    dimensionality = X.shape[1]

    print(f"Feature matrix shape: {X.shape}")
    print(f"Dimensionality (number of columns): {dimensionality}")

    # Features (X) from previous step - the one-hot encoded pickup and dropoff locations
    # Target variable (y) - duration
    y = df_copy['duration']
    
    return dv, X, y

# Train the model
def train_model(X, y):
    # Train a linear regression model
    lr = LinearRegression()
    lr.fit(X, y)

    # Print the intercept (needed for the question)
    print(f"Model intercept: {lr.intercept_:.2f}")

    # Make predictions on the training data
    y_pred = lr.predict(X)

    # Calculate RMSE on the training data
    rmse = np.sqrt(mean_squared_error(y, y_pred))

    print(f"RMSE on training data: {rmse:.4f}")
    
    return lr

# Load the prepared data
df = read_dataframe('../data/yellow_tripdata_2023-03.parquet')

# Prepare features and target
dv, X, y = prepare_features(df)

# Train the model
model = train_model(X, y)
```

```
Feature matrix shape: (3316216, 518)
Dimensionality (number of columns): 518
Model intercept: 24.78
RMSE on training data: 8.1587
```


## Question 6. Register the model 

The model is trained, so let's save it with MLflow.

Find the logged model and open its MLmodel file. What's the size of the model (the `model_size_bytes` field)?

* **14,534**

```python
import mlflow
import mlflow.sklearn
import os
from mlflow.models.signature import infer_signature

# Set MLflow tracking URI - using file system for local storage
mlflow.set_tracking_uri("./mlruns")

# Start a new MLflow run
with mlflow.start_run() as run:
    # Log model parameters
    mlflow.log_param("model_type", "LinearRegression")
    
    # Log model metrics (RMSE from previous step)
    y_pred = model.predict(X)
    rmse = np.sqrt(mean_squared_error(y, y_pred))
    mlflow.log_metric("rmse", rmse)
    
    # Create model signature
    signature = infer_signature(X, y_pred)
    
    # Artifact path for the regression model inside the run
    mlflow_pyfunc_model_path = "model"
    
    # Log the scikit-learn model and register it in the model registry
    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path=mlflow_pyfunc_model_path,
        signature=signature,
        input_example=X[:5],
        registered_model_name="yellow_taxi_duration_predictor"
    )
    
    # Log the DictVectorizer as a separate artifact
    dict_vectorizer_path = "dict_vectorizer"
    mlflow.sklearn.log_model(
        sk_model=dv,
        artifact_path=dict_vectorizer_path
    )
    
    # Print the run ID for reference
    print(f"MLflow Run ID: {run.info.run_id}")
    print(f"Model saved in run {run.info.run_id}")
    
    # Get the path to the directory that holds the MLmodel file
    artifacts_uri = mlflow.get_artifact_uri()
    model_path = os.path.join(artifacts_uri, mlflow_pyfunc_model_path)
    print(f"Model artifacts saved at: {model_path}")
    print("Check the MLmodel file in this directory to find model_size_bytes")
```

```
Successfully registered model 'yellow_taxi_duration_predictor'.
Created version '1' of model 'yellow_taxi_duration_predictor'.

MLflow Run ID: 149a9b28242743a49b75fea8a617fb21
Model saved in run 149a9b28242743a49b75fea8a617fb21
Model artifacts saved at: /Users/alexandru.huc/Downloads/Development/GitHub/mlops-zoomcamp/03-orchestration/homework/mlruns/0/149a9b28242743a49b75fea8a617fb21/artifacts/model
Check the MLmodel file in this directory to find model_size_bytes
```
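
Instead of opening the file by hand, the field can be read programmatically. This is a rough sketch, not part of the original notebook: `read_model_size_bytes` is a hypothetical helper, and it assumes `model_size_bytes` appears in the MLmodel file as a plain `key: value` line.

```python
from pathlib import Path
from typing import Optional


def read_model_size_bytes(model_dir: str) -> Optional[int]:
    """Return the model_size_bytes value from <model_dir>/MLmodel, or None if absent."""
    mlmodel_file = Path(model_dir) / "MLmodel"
    for line in mlmodel_file.read_text().splitlines():
        # Split "key: value" at the first colon only
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model_size_bytes":
            return int(value.strip())
    # Field not found (assumption: some MLflow versions may not write it)
    return None
```

Calling it with the artifact directory printed above, for example `read_model_size_bytes(model_path)`, should return the number to compare against the answer options.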
