# NYC Taxi Data Training Pipeline with Mage and MLflow

## Overview
Create a training pipeline for the [NYC Yellow taxi dataset (March 2023)](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) using Mage as an orchestration tool and MLflow for experiment tracking and model registration.

## Tasks

### 1. 🐳 Deploy Mage with Docker
**Task**: Launch Mage using Docker Compose following the quick start guidelines.

**Question**: Identify the running Mage version from the UI.

### 2. 📁 Project Setup
**Task**: Initialize a new project named "homework_10".

**Question**: Check the generated `metadata.yaml` file and report the number of lines it contains.

Options:
- 35 lines
- 45 lines
- 55 lines
- 65 lines

### 3. 🔄 Creating a Pipeline
**Task**: Create a data ingestion code block for the March 2023 Yellow taxi trips data.

**Question**: How many records were loaded from the dataset?

Options:
- 3,003,766
- 3,203,766
- 3,403,766
- 3,603,766
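One way to write the ingestion block is sketched below. It assumes the March 2023 file is published as Parquet under the address in `BASE_URL` (an assumption that is not stated in this assignment; the download link on the TLC page is authoritative) and that `pyarrow` is installed so `pd.read_parquet` can read it. The function names are illustrative, and the `try`/`except` fallback exists only so the snippet can also run outside a Mage project.

```python
import pandas as pd

try:
    from mage_ai.data_preparation.decorators import data_loader
except ImportError:
    # Fallback so the sketch also runs outside a Mage project.
    def data_loader(func):
        return func

# Assumed location of the TLC trip files; verify it against the TLC data page.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def build_trip_url(year: int, month: int) -> str:
    """Return the assumed Parquet URL for one month of Yellow taxi trips."""
    return f"{BASE_URL}/yellow_tripdata_{year}-{month:02d}.parquet"


@data_loader
def ingest_trips(*args, **kwargs) -> pd.DataFrame:
    """Download the March 2023 Yellow taxi trips and report the row count."""
    df = pd.read_parquet(build_trip_url(2023, 3))
    print(f"Loaded {len(df):,} records")
    return df
```

The printed row count is the number to compare against the options above.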

### 4. 🔧 Data Preparation
**Task**: Create a transformer code block using the provided data preparation logic for the Yellow taxi dataset.

**Code Template**:
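```python
import pandas as pd

# `df` is the DataFrame returned by the ingestion block.
df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)
df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)

# Trip duration in minutes
df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
df.duration = df.duration.dt.total_seconds() / 60

# Keep trips lasting between 1 and 60 minutes (inclusive);
# .copy() avoids pandas' SettingWithCopyWarning on the assignment below
df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

# Treat location IDs as categorical (string) features
categorical = ['PULocationID', 'DOLocationID']
df[categorical] = df[categorical].astype(str)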
```
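As a rough illustration of how this logic might sit inside a Mage transformer block, the sketch below wraps it in a decorated function. The function name `prepare_trips` is hypothetical, the input is assumed to be the DataFrame returned by the ingestion block, and the `try`/`except` fallback is only there so the snippet runs outside Mage.

```python
import pandas as pd

try:
    from mage_ai.data_preparation.decorators import transformer
except ImportError:
    # Fallback so the sketch also runs outside a Mage project.
    def transformer(func):
        return func


@transformer
def prepare_trips(df: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
    """Compute trip duration in minutes and keep trips of 1 to 60 minutes."""
    df = df.copy()  # leave the upstream block's output untouched

    df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])
    df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])

    duration = df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']
    df['duration'] = duration.dt.total_seconds() / 60

    df = df[(df['duration'] >= 1) & (df['duration'] <= 60)].copy()

    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)
    return df
```

On a toy frame with trips of 30 seconds, 10 minutes, and 90 minutes, only the 10-minute trip survives the filter.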

### 5. 🤖 Train a Model
**Task**: Create a transformation block to train a linear regression model using the following steps:

Steps:
1. Fit a dict vectorizer
2. Train a linear regression model with default parameters
3. Use pickup and dropoff locations as separate features (no combination)

**Requirements**:
- Create a transformation block
- Return both dict vectorizer and model
- Print the `intercept_` field in the code block

**Question**: What is the intercept value of the trained model?

Options:
- 21.77
- 24.77
- 27.77
- 31.77

Note:
- Use the same code structure as in homework 8
- The transformation block should output both vectorizer and model objects
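The steps above could be sketched as follows. This is a minimal version under stated assumptions, not the reference solution: the function name `train_model` is hypothetical, the input is assumed to be the prepared DataFrame with a `duration` column and string-typed location IDs, and the `try`/`except` fallback is only there so the snippet runs outside Mage.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

try:
    from mage_ai.data_preparation.decorators import transformer
except ImportError:
    # Fallback so the sketch also runs outside a Mage project.
    def transformer(func):
        return func


@transformer
def train_model(df: pd.DataFrame, *args, **kwargs):
    """Fit a DictVectorizer and a default LinearRegression on trip duration."""
    # Pickup and dropoff IDs are kept as separate features (no combination).
    categorical = ['PULocationID', 'DOLocationID']
    train_dicts = df[categorical].to_dict(orient='records')

    dv = DictVectorizer()
    X_train = dv.fit_transform(train_dicts)
    y_train = df['duration'].values

    lr = LinearRegression()
    lr.fit(X_train, y_train)

    print(f"Intercept: {lr.intercept_:.2f}")
    return dv, lr
```

Returning `dv, lr` makes both objects available to whichever block runs next in the pipeline.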

### 6. 📝 Register the Model with MLflow
**Task**: Set up MLflow and save the trained model.

1. Stop the running containers, either by pressing Ctrl + C in the terminal where Docker Compose is running or with:
```bash
docker-compose down
```
2. Create `mlflow.dockerfile`:

```dockerfile
FROM python:3.10-slim

RUN pip install mlflow==2.12.1

EXPOSE 5000

CMD [ \
    "mlflow", "server", \
    "--backend-store-uri", "sqlite:///home/mlflow_data/mlflow.db", \
    "--host", "0.0.0.0", \
    "--port", "5000" \
]
```

3. Add MLflow service configuration to `docker-compose.yaml`:

```yaml
mlflow:
 build:
   context: .
   dockerfile: mlflow.dockerfile
 ports:
   - "5000:5000"  # Expose MLflow UI port
 volumes:
   - "${PWD}/mlflow_data:/home/mlflow_data/"  # Mount MLflow data directory
 networks:
   - app-network
```

### MLflow Setup and Model Export

1. **Network Configuration**:
- Ensure the `app-network` network listed for the MLflow service is the same network used by your Mage and Postgres services
- Sharing one network lets the containers reach each other by service name

2. **Dependencies**:
- Verify `mlflow==2.12.1` is in `requirements.txt` of your Mage project
- Add it if starting fresh

3. **Create Data Exporter Block**:

**Task**: Create a new block to:
- Log the linear regression model
- Save and log the dict vectorizer artifact

**MLflow Access**:
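- From inside the Mage container, the MLflow server should be reachable at `http://mlflow:5000`; from the host, the UI is at `http://localhost:5000` (port 5000 is published by the service definition)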
- Create exporter block to interact with MLflow
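A data exporter block along these lines might look like the sketch below. It assumes Mage passes the upstream block's `(dv, lr)` return value as the first argument; the experiment name, artifact paths, and function names are hypothetical. `mlflow` is imported inside the exporter so the pickling helper can be tried without MLflow installed.

```python
import pickle
import tempfile
from pathlib import Path

try:
    from mage_ai.data_preparation.decorators import data_exporter
except ImportError:
    # Fallback so the sketch also runs outside a Mage project.
    def data_exporter(func):
        return func

TRACKING_URI = "http://mlflow:5000"      # service name from docker-compose.yaml
EXPERIMENT_NAME = "nyc-taxi-experiment"  # hypothetical experiment name


def save_vectorizer(dv, directory: str) -> Path:
    """Pickle the dict vectorizer into `directory` and return the file path."""
    path = Path(directory) / "dict_vectorizer.pkl"
    with path.open("wb") as f_out:
        pickle.dump(dv, f_out)
    return path


@data_exporter
def export_to_mlflow(data, *args, **kwargs) -> None:
    """Log the model and the pickled vectorizer in a single MLflow run."""
    import mlflow
    import mlflow.sklearn

    dv, lr = data  # assumed shape of the upstream block's output

    mlflow.set_tracking_uri(TRACKING_URI)
    mlflow.set_experiment(EXPERIMENT_NAME)

    with mlflow.start_run():
        mlflow.sklearn.log_model(lr, artifact_path="model")
        with tempfile.TemporaryDirectory() as tmp_dir:
            dv_path = save_vectorizer(dv, tmp_dir)
            mlflow.log_artifact(str(dv_path), artifact_path="preprocessor")
```

Logging the model and the vectorizer inside the same `start_run` context keeps both under one run in the MLflow UI.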

**Question**: Check the logged MLModel file and report the model size (`model_size_bytes` field):

Options:
- 14,534
- 9,534
- 4,534
- 1,534

Note: It's common practice to combine the model logging and artifact saving in a single code block.
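The `MLmodel` file is plain YAML, so besides opening it in the MLflow UI, the size field can be read programmatically. The sketch below assumes PyYAML is installed and that you already have a local copy of the file; the helper name `read_model_size` and any path you pass to it are hypothetical.

```python
from pathlib import Path

import yaml


def read_model_size(mlmodel_path: str) -> int:
    """Return the `model_size_bytes` field from an MLmodel YAML file."""
    with Path(mlmodel_path).open() as f_in:
        metadata = yaml.safe_load(f_in)
    return metadata["model_size_bytes"]
```

Calling `read_model_size` with the path to the downloaded `MLmodel` file returns the value to compare against the options.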