# MLflow Installation Guide

This guide helps you get started with MLflow, a tool for managing machine learning experiments and models.

# Q1. Installing MLflow

First, you'll need to install the MLflow Python package. I recommend following these steps:

1. Create a new Python environment (preferably using conda) with Python 3.9 or 3.10:
   ```bash
   conda create -n mlflow_env python=3.10
   conda activate mlflow_env
   ```

2. Install the package with `pip install mlflow`.

After installing the package, run `mlflow --version` to verify the installation.

Question: What version of MLflow did you install?
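
The same check can also be made from Python, which is convenient inside a notebook. This is a minimal sketch that relies only on the standard library's `importlib.metadata`; the helper name `installed_version` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the MLflow version of the active environment, or None if MLflow is absent.
print(installed_version("mlflow"))
```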

In [None]:
# mlflow, version 2.17.2

# Q2. Data Preprocessing

You'll work with the NYC Green Taxi dataset to build a trip-duration prediction model.

## Data Preparation:

1. Get the data:
  * Download the Green Taxi data for January through March 2023 in Parquet format
  * Link to data: [NYC Taxi Data Portal](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)

2. Run preprocessing:
  * Locate the `preprocess_data.py` script
  * This script will:
    - Read taxi data from your download folder
    - Transform features using a `DictVectorizer` (fitted on the January data)
    - Create and save processed datasets
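
Step 1 can be scripted instead of clicking through the portal. The sketch below is an illustration under two assumptions that do not come from this guide: that the monthly files are named `green_tripdata_<year>-<month>.parquet`, and that they are served from the `BASE_URL` shown. Check both against the links on the data portal before relying on them. The function names are hypothetical.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumption: download host used by the links on the TLC trip record page.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def green_taxi_files(year=2023, months=(1, 2, 3)):
    """Build the expected parquet file names for the given months."""
    return [f"green_tripdata_{year}-{month:02d}.parquet" for month in months]


def download_green_taxi(dest_dir, year=2023, months=(1, 2, 3)):
    """Download each monthly file into dest_dir, skipping files already present."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    paths = []
    for name in green_taxi_files(year, months):
        target = dest / name
        if not target.exists():
            urlretrieve(f"{BASE_URL}/{name}", target)
        paths.append(target)
    return paths
```

Calling `download_green_taxi("./data")` would then leave three parquet files in `./data`, which is the folder to pass as `--raw_data_path`.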

## Command to Run:

   ``python preprocess_data.py --raw_data_path <YOUR_DOWNLOAD_DATA_FOLDER> --dest_path ./output``

> Note: Before running, make sure you're in the `previous/homework/folder/` directory and replace `<YOUR_DOWNLOAD_DATA_FOLDER>` with the path to the folder where you saved the January to March 2023 data.

After running the preprocessing script, count the files in the output folder (`./output` in the command above). How many are there?

Select one:
* 1
* 3 
* 4
* 7
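
One way to count the files without leaving Python is sketched below. It assumes the script was run with `--dest_path ./output` as in the command above; `count_files` is a hypothetical helper that counts regular files only and ignores subfolders.

```python
from pathlib import Path


def count_files(folder):
    """Count the regular files directly inside `folder` (subfolders are ignored)."""
    return sum(1 for entry in Path(folder).iterdir() if entry.is_file())
```

For example, `count_files("./output")` returns the number to compare against the options.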

In [2]:
# 4

# Q3. Training with MLflow Autologging

In this step, we'll train a `RandomForestRegressor` from scikit-learn using our taxi dataset.

## Training Process

We'll use `train.py`. The script does the following:
* Loads preprocessed datasets
* Trains the Random Forest model
* Calculates RMSE on validation data

## Your Task:

1. Modify `train.py` to enable MLflow autologging
2. Run the script
3. Open MLflow UI to verify experiment tracking

## Important Tips:

1. Remember to wrap your training code with:

       with mlflow.start_run():
           # your training code here

2. Keep default hyperparameters for quick training
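
Put together, the change to `train.py` might look roughly like the sketch below. It is not the contents of the actual script: the function name `run_train` and the small `rmse` helper are hypothetical, and the data arguments stand in for whatever the script loads from the preprocessed files. The MLflow and scikit-learn imports sit inside the function only so that the helper can be used on its own.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def run_train(X_train, y_train, X_val, y_val):
    """Train a default RandomForestRegressor inside an autologged MLflow run."""
    # Local imports: the rmse helper above stays usable without these packages.
    import mlflow
    from sklearn.ensemble import RandomForestRegressor

    # Autologging has to be switched on before fit() is called.
    mlflow.sklearn.autolog()

    with mlflow.start_run():
        model = RandomForestRegressor()  # default hyperparameters
        model.fit(X_train, y_train)
        return rmse(y_val, model.predict(X_val))
```

With autologging enabled, the model's parameters (including `min_samples_split`) are recorded with the run and appear on the run page in the MLflow UI.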

What is the value of the `min_samples_split` parameter?
* 2
* 4
* 8
* 10

> Note: You can find this in MLflow UI after running the experiment

In [4]:
# RMSE: 5.431162180141208
# min_samples_split: 2

# Q4. Setting Up Local MLflow Tracking Server

Next, we'll launch a local tracking server with model registry access, so that we can manage the full model lifecycle.

## Tasks:

1. Start MLflow tracking server locally
2. Configure storage:
  * Backend: SQLite database
  * Artifacts: Create and use `artifacts` folder

> Keep the server running for the next two exercises!

Besides `--backend-store-uri`, which parameter is needed for proper server configuration?

Select one:
* `--default-artifact-root`
* `--serve-artifacts`
* `--artifacts-only`
* `--artifacts-destination`

## Note:
This tracking server will enable us to:
* Track experiments
* Store model artifacts
* Access model registry
* Compare model versions
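
As an illustration of the setup described in the tasks, the sketch below creates the `artifacts` folder and assembles a matching `mlflow server` command. The database file name `mlflow.db` and the helper name `prepare_server_command` are assumptions chosen for this example; only the two flags come from the MLflow CLI.

```python
from pathlib import Path


def prepare_server_command(db_file="mlflow.db", artifacts_dir="artifacts"):
    """Create the artifacts folder and return the argv list for `mlflow server`."""
    artifacts = Path(artifacts_dir)
    artifacts.mkdir(parents=True, exist_ok=True)
    return [
        "mlflow", "server",
        "--backend-store-uri", f"sqlite:///{db_file}",   # SQLite backend store
        "--default-artifact-root", str(artifacts),       # local artifacts folder
    ]
```

Passing the result to `shlex.join` gives a line that can be pasted into a terminal, and passing it to `subprocess.Popen` starts the server from Python. By default the server listens on `http://127.0.0.1:5000`, which is the address to give `mlflow.set_tracking_uri` in the later scripts.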

In [None]:
#  --default-artifact-root URI

# Q5. Hyperparameter Tuning with MLflow

We'll optimize our `RandomForestRegressor` using `hyperopt` to reduce validation error. You'll use the prepared script `hpo.py`.

## Task:

1. Modify `hpo.py`:
  * Add code to log validation RMSE to tracking server
  * Update the `objective` function accordingly
  * Run without additional parameters

2. Check Results:
  * Open MLflow UI
  * Find experiment named `random-forest-hyperopt`
  * Review the runs

## Important Notes:

* Do NOT use autologging
* Log only essential information:
   - Hyperparameters used in optimization
   - Validation RMSE (February 2023 data)
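
A manually logged objective could be sketched as follows. This is not the code in `hpo.py`: `make_objective` and `to_int_params` are hypothetical names, the metric name `rmse` is an assumption, and the integer cast assumes that every tuned hyperparameter is integer-valued, as is typical for a random forest search space.

```python
def to_int_params(params):
    """Cast sampled values to int, since hyperopt's quniform yields floats."""
    return {name: int(value) for name, value in params.items()}


def make_objective(X_train, y_train, X_val, y_val):
    """Build a hyperopt objective that logs params and validation RMSE by hand."""
    # Local imports: the helper above stays usable without these packages.
    import mlflow
    from hyperopt import STATUS_OK
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error

    def objective(params):
        params = to_int_params(params)
        with mlflow.start_run():
            mlflow.log_params(params)  # hyperparameters only, no autologging
            model = RandomForestRegressor(**params)
            model.fit(X_train, y_train)
            rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
            mlflow.log_metric("rmse", rmse)  # validation RMSE
        return {"loss": rmse, "status": STATUS_OK}

    return objective
```

The returned function is what would be handed to hyperopt's `fmin`, after `mlflow.set_experiment("random-forest-hyperopt")` has selected the experiment.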


What was the best validation RMSE achieved?
* 4.817
* 5.335
* 5.818
* 6.336

In [None]:
# 5.3700860692

# Q6. Model Registry Promotion

After successful hyperparameter optimization, we'll move our best model to production by registering it in MLflow's model registry.

## Process Overview:

The script `register_model.py` will:
1. Find top 5 models from previous runs
2. Test them on March 2023 data
3. Save results in experiment `random-forest-best-models`

## Your Task:

Update `register_model.py` to:
* Select model with lowest test RMSE
* Register it in the model registry
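
The two steps could be sketched as below. Several names here are assumptions rather than facts about `register_model.py`: the metric is taken to be logged as `test_rmse`, the model artifact is taken to live under `model` in each run, and the registry name `nyc-taxi-regressor` is invented for the example.

```python
def model_uri_for(run_id, artifact_path="model"):
    """Build the runs:/ URI that points at a model logged inside a run."""
    return f"runs:/{run_id}/{artifact_path}"


def register_best_model(experiment_name="random-forest-best-models",
                        model_name="nyc-taxi-regressor"):
    """Register the run with the lowest test RMSE in the model registry."""
    # Local imports: model_uri_for stays usable without MLflow installed.
    import mlflow
    from mlflow.entities import ViewType
    from mlflow.tracking import MlflowClient

    client = MlflowClient()
    experiment = client.get_experiment_by_name(experiment_name)
    best_run = client.search_runs(
        experiment_ids=[experiment.experiment_id],
        run_view_type=ViewType.ACTIVE_ONLY,
        max_results=1,
        order_by=["metrics.test_rmse ASC"],  # lowest test RMSE first
    )[0]
    return mlflow.register_model(model_uri_for(best_run.info.run_id), name=model_name)
```

Sorting on the server side with `order_by` and `max_results=1` avoids fetching every run just to pick the minimum.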

## Helpful Tips:

1. To find the best model:
   ```python
   from mlflow.tracking import MlflowClient

   client = MlflowClient()
   client.search_runs(...)  # Find the run with the lowest RMSE
   ```

In [None]:
# test_rmse 5.370086069268862