Reproducible Machine Learning Pipeline for Short-Term Rental Price Prediction

This repository contains a Machine Learning (ML) pipeline which is able to predict short-term rental property prices in New York City. It is designed to be retrained easily with new data that comes frequently in bulk, assuming that prices (and thus, the model) vary constantly. The pipeline is composed by the typical steps or components from short/medium-size projects; these are explained in the section Introduction.

The following tools are used:

MLflow for reproduction and management of pipeline processes.
Weights and Biases for artifact and execution tracking.
Hydra for configuration management.
Conda for environment management.
Pandas for data analysis.
Scikit-Learn for data modeling.

The Weights & Biases project which collects the tracked experiments can be found here: nyc_airbnb.

The starter code of the repository was originally forked from a project in the Udacity repository build-ml-pipeline-for-short-term-rental-prices. The instructions of that source project can be found in the file Instructions.md. If you would like to know more about why reproducible ML pipelines matter and how the tools used here interact, you can have a look at my ML pipeline project boilerplate.

The used dataset is a modified AirBnB dataset for New York City, which is in the source forked repository. AirBnB has a data dictionary which explains each feature. A subset of 15 independent variables has has been taken.

Table of contents:

Reproducible Machine Learning Pipeline for Short-Term Rental Price Prediction

Introduction

The following figure summarizes all the steps/components and the artifacts of the pipeline:

The final goal of the pipeline is to produce the optimal inference artifact which is able to regress the price of a rented unit given its features.

The step related to the Exploratory Data Analysis (EDA) in light green is not part of the re-trainable pipeline, i.e., it needs to be called explicitly; additionally, the component contains research notebooks that are crafted manually to gather insights of the dataset. Note that the data processing and modeling are quite simple; the focus of the project lies on the MLOps aspect.

Some components and artifacts have been labeled as remote. Even though they are locally available in this repository, they are fetched from the original repository, showing that MLflow capability.

The file and folder structure reflects the diagram above:

.
├── README.md                       # This file
├── MLproject                       # High-level pipeline project
├── main.py                         # Pipeline main file
├── environment.yml                 # Environment to run the project
├── conda.yml                       # Environment of main pipeline
├── config.yaml                     # Configuration of pipeline
├── Instructions.md                 # Original instructions by Udacity
├── LICENSE.txt
├── images/                         # Markdown assets
│   └── ...
├── cookie-mlflow-step/             # Template for (empty) step generation
│   └── ...
├── components/                     # Remote components
│   ├── README.md
│   ├── conda.yml
│   ├── get_data/                   # Download/get data step/component
│   │   ├── MLproject               # MLflow project file
│   │   ├── conda.yml               # Environment for step
│   │   ├── data/                   # Dataset samples (retrieved with URL)
│   │   │   ├── sample1.csv
│   │   │   └── sample2.csv
│   │   └── run.py                  # Step implementation
│   ├── test_regression_model/      # Evaluation step/components
│   │   ├── MLproject
│   │   ├── conda.yml
│   │   └── run.py
│   └── train_val_test_split/       # Segregation step/component
│       ├── MLproject
│       ├── conda.yml
│       └── run.py
└── src/                            # Local components
    ├── README.md
    ├── basic_cleaning/             # Cleaning step/component
    │   ├── MLproject               # MLflow project file
    │   ├── conda.yml               # Environment for step
    │   └── run.py                  # Step implementation
    ├── data_check/                 # Non-/Deterministic tests step/component
    │   ├── MLproject
    │   ├── conda.yml
    │   ├── conftest.py
    │   └── test_data.py
    ├── eda/                        # Exploratory Data Analysis & Modeling (research env.)
    │   ├── EDA.ipynb               # Basic EDA notebook
    │   ├── MLproject
    │   ├── Modeling.ipynb          # Modeling trials, transferred to train_random_forest
    │   └── conda.yml
    └── train_random_forest/        # Inference pipeline
        ├── MLproject
        ├── conda.yml
        ├── feature_engineering.py
        └── run.py

Each step/component has a dedicated folder with the following files:

run.py: The implementation of the step function/goal.
MLproject: The MLflow project file which executes run.py with the proper arguments.
conda.yaml: The conda environment required by run.py which is set by MLflow.

All the components are controlled by the root-level script main.py; that file executes all the steps in order and with the parameters specified in config.yaml.

How to Use This Project

If you're unfamiliar with the tools used in this project, I recommend you to have a look at my ML pipeline project boilerplate.
Install the dependencies.
Run the pipeline as explained in the dedicated section.
You can modify the pipeline following these guidelines and having into account the original Instructions.md.

Dependencies

In order to set up the main environment from which everything is launched you need to install conda and create a Weights and Biases account; then, the following sets everything up:

# Clone repository
git clone git@github.com:mxagar/ml_pipeline_rental_prices.git
cd ml_pipeline_rental_prices

# Create new environment
conda env create -f environment.yml

# Activate environment
conda activate nyc_airbnb_dev

# Log in via the web interface
wandb login

All step/component dependencies are handled by MLflow using the dedicated conda.yaml environment definition files.

How to Run the Pipeline

There are multiple ways of running this pipeline, for instance:

Local execution or execution from cloned source code of the complete pipeline.
Local execution of selected pipeline steps.
Remote execution of a release.

In the following, some example commands that show how these approaches work are listed:

# Go to the root project level, where main.py is
cd ml_pipeline_rental_prices

# Local execution of the entire pipeline
mlflow run . 

# Step execution: Download or get the data
# Step name can be found in main.py
# This step uses the code in a remote repository
mlflow run . -P steps="download"

# Step execution: Clean / pre-process the raw dataset
# Step name can be found in main.py
# Note that any upstream artifacts need to be available
mlflow run . -P steps="basic_cleaning"

# Step execution: data check + segregation
# Step names can be found in main.py
# Note that any upstream artifacts need to be available
# The step data_split uses the code in a remote repository
mlflow run . -P steps="data_check,data_split"

# Hyperparameter variation with hydra sweeps
# Step execution: Train the random forest model
# Step names can be found in main.py
# Note that any upstream artifacts need to be available
mlflow run . \
-P steps=train_random_forest \
-P hydra_options="modeling.max_tfidf_features=10,15,30 modeling.random_forest.max_features=0.1,0.33,0.5,0.75,1.0 -m"

# Execution of a remote release hosted in my repository
# and use a different dataset/sample from the standard one
# Release: 1.0.1
mlflow run https://github.com/mxagar/ml_pipeline_rental_prices.git \
-v 1.0.1 \
-P hydra_options="etl.sample='sample2.csv'"

Note that any execution generates, among others:

Auxiliary tracking files and folders that we can ignore: wandb, mlruns, artifacts, etc.
Entries in the associated Weights and Biases project.

The entries in the Weights and Biases web interface cover extensive information related to each run of a component or uploaded/downloaded artifacts:

Time, duration and resources.
Logged information: parameters, metrics, images, etc.
Git commit branches.
Relationship between components and artifacts.
Etc.

As an example, here is a graph view of the lineage of the pipeline, which resembles the pipeline flow diagram shown above:

How to Modify the Pipeline

We can generate new steps/components easily by creating a new folder under src/ with the following three files, properly implemented:

MLproject
conda.yaml
run.py

Then, we need to add the step and its parameters to the mlflow pipeline in main.py, as it is done with the other steps.

Instead of creating the folder and the files manually, we can use the tool cookiecutter, installed with the general requirements. For that, we run the following command:

cookiecutter cookie-mlflow-step -o src

That prompts a form we need to fill in via the CLI; for instance, if we'd like to create a new component which would train an XGBoost model, we could write this:

step_name [step_name]: train_xgboost
script_name [run.py]: run.py
job_type [my_step]: train_model
short_description [My step]: Training of an XGBoost model
long_description [An example of a step using MLflow and Weights & Biases]: Training of an XGBoost model
parameters [parameter1,parameter2]: trainval_artifact,val_size,random_seed,stratify_by,xgb_config,max_tfidf_features,output_artifact

All that creates the component folder with the three files, to be implemented.

Issues / Notes

There are some deviations from the original Instructions.md that might be worth considering:

I changed the remote component folder/URL; additionally, we might need to specify the branch with the parameter version in the mlflow.run() call, because mlflow defaults to use master.
I had to change several conda.yaml files to avoid versioning/dependency issues: protobuf, etc.

Possible Improvements

Extend the Exploratory Data Analysis (EDA) and the Feature Engineering (FE). See my Guide on EDA, Data Cleaning and Feature Engineering.
Extend the current data modeling and the hyperparameter tuning.
Try other models than the random forest: create new components/steps for those and compare them.

Interesting Links

Reproducible ML pipeline boilerplate.
Weights & Biases Model Registry.
Machine learning model serving for newbies with MLflow.
This repository doesn't focus on the techniques for data processing and modeling, instead it focuses on the MLOps aspect; if you are interested in the former topics, you can visit my Guide on EDA, Data Cleaning and Feature Engineering.
This project creates an inference pipeline managed with MLflow and tracked with Weights and Biases; however, it is possible to define a production inference pipeline in a more simple way without the exposure to those 3rd party tools. In this blog post I describe how to perform that transformation from research code to production-level code; the associated repository is customer_churn_production.
If you are interested in the automated deployment of production-ready ML pipelines packaged in APIs, check my example repository census_model_deployment_fastapi.
If you are interested in more MLOps-related content, you can visit my notes on the Udacity Machine Learning DevOps Engineering Nanodegree: mlops_udacity.
If you're interested in such short-term renting price modeling, have a look at my Analysis and Modelling of the AirBnB Dataset from the Basque Country.
Weights and Biases tutorials.
Weights and Biases documentation.

Authorship

Mikel Sagardia, 2022.
No guarantees.

If you find this repository helpful and use it, please link to the original source.

Name		Name	Last commit message	Last commit date
Latest commit History 46 Commits
.github/workflows		.github/workflows
components		components
cookie-mlflow-step		cookie-mlflow-step
images		images
src		src
.gitignore		.gitignore
CODEOWNERS		CODEOWNERS
Instructions.md		Instructions.md
LICENSE.txt		LICENSE.txt
MLproject		MLproject
README.md		README.md
conda.yml		conda.yml
config.yaml		config.yaml
environment.yml		environment.yml
main.py		main.py

License

mxagar/ml_pipeline_rental_prices

Folders and files

Latest commit

History

Repository files navigation

Reproducible Machine Learning Pipeline for Short-Term Rental Price Prediction

Introduction

How to Use This Project

Dependencies

How to Run the Pipeline

How to Modify the Pipeline

Issues / Notes

Possible Improvements

Interesting Links

Authorship

About

Topics

Resources

License

Stars

Watchers

Forks

Languages