Listing my MLOps learnings
This repository contains my notes from this Udemy course.
-
85 percent of trained ML models don't reach production, and 55% of companies don't deploy a single model.
-
An ideal ML life cycle
-
Research shows that companies using AI increased their profit margins by 3% to 15%.
-
DevOps applied to Machine Learning is known as MLOps. Model creation must be scalable, collaborative and reproducible; the principles, tools and techniques that make it so are known as MLOps.
-
MLOps process:
-
DevOps applied to Machine Learning is known as MLOps. DevOps applied to Data is known as DataOps.
-
Roles in MLOps
- Challenges addressed by MLOps
-
Data and Artifact versioning
-
Model Tracking: degradation of performance due to data drift.
-
Feature Generation: MLOps allows us to reuse methods.
- Parts of MLOPS
- MLOps Tools
- Some data labelling tools:
- Some Feature Engineering Tools:
- Some Hyperparameter Optimization Tools:
-
FastAPI can be used for serving ML models.
-
Streamlit is useful for POC.
-
MLOps stages:
- Some tools to use
- ML projects can be structured in one of 3 ways.
- Cookiecutter is a tool to structure our ML projects and folders. It should be installed GLOBALLY on the computer (not in a virtual environment).
pip install cookiecutter
cookiecutter https://github.com/khuyentran1401/data-science-template
cookiecutter https://github.com/MuhammedBuyukkinaci/Clean-Data-Science-Project-Template.git
-
Poetry allows us to manage dependencies and versions. Poetry is an alternative to pip.
- Poetry separates main dependencies and sub-dependencies into two separate files, whereas pip stores all dependencies in a single file (requirements.txt).
- Poetry creates readable dependency files.
- Poetry removes all sub-dependencies when removing a library.
- Poetry avoids installing new libraries that conflict with existing libraries.
- Poetry packages a project with a few lines of code.
- All the dependencies of the project are specified in pyproject.toml.
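A minimal pyproject.toml for a Poetry-managed project might look like the sketch below; the project name, authors and version constraints are illustrative, not from the course.

```toml
[tool.poetry]
name = "my-ml-project"
version = "0.1.0"
description = "Example MLOps project"
authors = ["Your Name <you@example.com>"]

[tool.poetry.dependencies]
python = "^3.9"
scikit-learn = "^1.3"

[tool.poetry.group.dev.dependencies]
pytest = "^7.4"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```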
# To install poetry on your machine (for Linux and Mac)
curl -sSL https://install.python-poetry.org | python3 -
# To generate a project
poetry new <project_name>
# To install dependencies
poetry install
# To add a new pypi library
poetry add <library_name>
# To delete a library
poetry remove <library_name>
# To show installed libraries
poetry show
# To show sub dependencies
poetry show --tree
# Link an existing environment (venv, conda, etc.) to poetry
poetry env use /path/to/python
-
Hydra manages configuration files. It makes project management easier.
- Configuration information shouldn't be mixed with main code.
- It is easier to modify things in a configuration file.
- YAML is a common language for a configuration file.
- An example config file and its usage via hydra
- We can modify hydra parameters via CLI without modifying config file.
-
Hydra logging is super useful.
-
To use Hydra, we must add config as an argument to a function.
import hydra
from pipeline2 import pipeline2

@hydra.main(config_name='preprocessing')
def run_training(config):
    match_pipe = pipeline2(config)

if __name__ == '__main__':
    run_training()
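The config referenced above could be a YAML file like the following hypothetical preprocessing.yaml; every key and value here is illustrative, not from the course. Parameters can then be overridden from the CLI without touching the file, e.g. `python train.py model.n_estimators=200`.

```yaml
# preprocessing.yaml -- a hypothetical Hydra config (all keys/values are illustrative)
data:
  raw_path: data/raw/train.csv
  test_size: 0.2
model:
  n_estimators: 100
  max_depth: 5
```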
- Pre-commit plugins automate code review and formatting. To install them, use `pip install pre-commit`. After installing pre-commit, fill out `.pre-commit-config.yaml` and run `pre-commit install` to install the git hooks. Then, the configured checks run before each commit to the local repository; the commit will not complete until the problems are solved. The `--no-verify` flag can be appended to `git commit`; it doesn't force you to correct the mistakes detected by pre-commit.
- Formatter: black
- PEP8 Checker: flake8
- Sort imports: isort
- Check for docstrings: interrogate
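A `.pre-commit-config.yaml` wiring up some of the tools above might look like this sketch; the pinned `rev` versions are illustrative and should be updated to current releases.

```yaml
# .pre-commit-config.yaml -- a sketch (revs are illustrative)
repos:
  - repo: https://github.com/psf/black
    rev: 23.3.0
    hooks:
      - id: black
  - repo: https://github.com/pycqa/flake8
    rev: 6.0.0
    hooks:
      - id: flake8
  - repo: https://github.com/pycqa/isort
    rev: 5.12.0
    hooks:
      - id: isort
```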
- Black and Flake8
# pip install black
black file_name_01.py
# pip install flake8
flake8 temp.py
# pip install isort
isort file_name.py
# pip install interrogate
interrogate -vv file_name.py
-
DVC is used for version control of model training data.
-
pdoc is used to automatically create documentation for projects.
pip install pdoc3
pdoc --http localhost:8080 temp.py
- Makefile creates short and readable commands for configuration tasks. We can use Makefile to automate tasks such as setting up the environment.
-
A solution design is available here.
-
MLOps stages:
- What AutoML does:
- PyCaret is an open-source, low-code ML library. It has been developed in Python and reduces the time needed to create a model to minutes.
- PyCaret incorporates these libraries:
- Pandas Profiling allows us to develop an exhaustive analysis of data.
- An example of PyCaret setup function:
- Tukey-Anscombe Plot && Normal QQ Plot
- Scale-Location Plot && Residuals & Leverage
- MLOps Tracking Server and Model Registry
- MLFlow UI for different runs
- Different Components of MLFlow
- We can log parameters, metrics and models in MLFlow.
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.linear_model import ElasticNet
from urllib.parse import urlparse

alpha = 0.5

def rmse_compute(true, preds):
    return np.sqrt(np.mean((np.array(true) - np.array(preds)) ** 2))

# Placeholders: replace with a real train/test split
X_train, y_train, X_test, y_test = None, None, None, None

with mlflow.start_run():
    lr = ElasticNet(alpha=alpha)
    lr.fit(X_train, y_train)
    y_test_preds = lr.predict(X_test)
    rmse = rmse_compute(y_test, y_test_preds)
    mlflow.log_param('alpha', alpha)
    mlflow.log_metric('rmse', rmse)
    tracking_url_type_store = urlparse(mlflow.get_tracking_uri()).scheme
    if tracking_url_type_store != 'file':
        # A remote tracking server also supports the model registry
        mlflow.sklearn.log_model(lr, 'model', registered_model_name='ElasticNetWineModel')
    else:
        mlflow.sklearn.log_model(lr, 'model')
- We can register models into MLFlow via PyCaret.
#pass log_experiment = True, experiment_name = 'diamond'
s = setup(data, target = 'Precio', transform_target = True, log_experiment = True, experiment_name = 'diamond')
- Shap is a Python library about model interpretability.
- A prediction for a single record
- We can use SHAP with PyCaret.
- We aren't just deploying a model (a pickle file). We are also deploying a pipeline (composed of preprocessing, feature engineering, etc.).
-
There are 2 different ways to deploy a model in a production environment:
- Through API
- Through Applications(mobile/web)
- An API is an intermediary between 2 different applications that communicate with each other. If we want our applications to be available to other developers, creating an API as an intermediate connector is convenient. Developers send HTTP requests to consume this service. We can think of an API as an abstraction of our application. Thanks to the API, users don't need to code or install the dependencies.
- HTTP verbs and Status Codes
-
FastAPI is a framework for creating robust & high-performance APIs for production environments. Compared to Flask, which is a development framework, FastAPI has the following advantages:
- Using asyncio
- Implementing Pydantic for data validation
- FastAPI enforces the schema on input data and detects data types at runtime.
- FastAPI uses Swagger UI to create automatic documentation.
- FastAPI has better security and authentication features.
-
FastAPI documentation UI
-
FastAPI apps are served with the uvicorn library (an ASGI server).
-
A basic usage of FastAPI
import uvicorn
from fastapi import FastAPI

app = FastAPI()

@app.get('/')
def home():
    return {'Hello': 'World'}

# @app.post("/")
# def home_post():
#     return {"Hello": "POST"}

# query parameters
@app.get("/employee")
def employee_by_department(department: str):
    return {"department": department}

# path parameters
@app.get("/employee/{id}")
def employee_by_id(id: int):
    return {"id": id}

if __name__ == '__main__':
    uvicorn.run("hellow_world_fastapi:app")
- Pydantic usage in FastAPI
- PyCaret is able to create a FastAPI app automatically.
-
Gradio is a web-application framework for deploying our ML models. It provides a UI suitable for business users.
-
An example demo for gradio app
- PyCaret has a function to create gradio apps easily.
-
Flask is a web development framework in Python.
- It is easy to use.
- It is flexible.
- Allows testing.
-
An example code snippet for Flask
from flask import Flask

app = Flask(__name__)

@app.route("/")
def home():
    return {"Hello": "World"}

if __name__ == '__main__':
    app.run()
- A ML Project Deployment Pipeline
-
A Dockerfile can be thought of as a recipe to cook a meal.
-
Physical Machine vs Virtualization vs Container Deployment vs Kubernetes
- We can create a docker image via the `create_docker` function of PyCaret.
-
If we are using a paid Azure account, we can register our Docker images on Azure Registry.
-
Azure Blob Storage is similar to Amazon S3. To use Azure Blob Storage, the Azure SDK is needed.
-
There are many more ML job listings than data science job listings.
-
MLOps is the process of automating machine learning using DevOps methodologies.
-
Data drift is a phenomenon in which the data distribution changes between training and inference.
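A crude way to see drift in practice is to compare a feature's distribution at training time against its distribution at serving time. The sketch below uses synthetic data and a simple standardized mean-difference score; the 0.5 threshold is illustrative, and real monitoring tools use more robust statistics (e.g. KS tests, PSI).

```python
import numpy as np

rng = np.random.default_rng(42)
train = rng.normal(loc=0.0, scale=1.0, size=10_000)  # training-time feature values
live = rng.normal(loc=0.8, scale=1.0, size=10_000)   # serving-time values (shifted)

def mean_shift(reference, current):
    """Standardized difference of means: a crude, illustrative drift signal."""
    return abs(current.mean() - reference.mean()) / reference.std()

print(mean_shift(train, live) > 0.5)   # True: the serving data has drifted
print(mean_shift(train, train) == 0)   # True: no drift against itself
```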
-
New Relic, Datadog and Stackdriver are performance monitoring tools.
-
An example Makefile is below. Run `make install`, `make lint` or `make test`.
install:
	pip install --upgrade pip &&\
		pip install -r requirements.txt
lint:
	pylint --disable=R,C hello.py
test:
	python -m pytest -vv --cov=hello test_hello.py
-
A data lake is a place where we can process data without transferring it elsewhere.
-
MLOps is possible after DevOps (Jenkins), data automation (Airflow) and platform automation (AWS SageMaker) are in place.
-
Building reusable ML pipelines is crucial and related to versioning.
-
MLOps is a combination of Data, DevOps, Models and Business in equal parts.
-
The future means more ML on edge devices.
-
The Coral Project is a platform that helps build local (on-device) inferencing that captures the essence of edge deployments: fast, close to the user, and offline.
-
Azure Percept is a Microsoft solution similar to The Coral Project.
-
GitHub Actions workflows can be located under .github/workflows/ as main.yaml.
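A sketch of such a main.yaml is below, reusing the Makefile targets shown earlier in these notes; the action versions and Python version are illustrative.

```yaml
# .github/workflows/main.yaml -- a sketch CI workflow (versions are illustrative)
name: CI
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: make install
      - run: make lint
      - run: make test
```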
-
An example overview of CI/CD
-
Kubernetes is a good way to implement Blue-Green Deployment.
-
Blue-green deployment is a deployment strategy. There are 2 environments: the current one is blue and the new version is green. The app runs in the blue environment. We install the new version in the green environment and carry out tests. If everything goes well, we direct the traffic to the green environment.
-
Canary deployment is a deployment strategy. We progressively move traffic from the old environment to the new environment. If something unexpected happens in the new environment, we roll back to the previous environment; thus, less traffic is affected by mistakes. If everything goes well, the traffic directed to the new environment is progressively increased up to 100%.
- KaizenML is about automating everything about the machine learning process and improving it.
-
Software for training machine learning models could turn into something like the Linux kernel, free and ubiquitous.
-
AutoML is just a technique, like continuous integration (CI); it automates trivial tasks.
-
DevOps + KaizenML = MLOps. KaizenML includes building Feature Stores (a registry of high-quality machine learning inputs) and the ability to monitor data for drift and to register and serve out ML models.
-
Uber's Michelangelo is ML-as-a-Service. Databricks has a feature store solution too.
-
Apple has a machine learning framework called Core ML. It is under Xcode > Open Developer Tool > Create ML.
-
TF Hub is a hub that stores various pretrained ML models.
-
SageMaker Autopilot is AWS's complete solution for AutoML and MLOps.
-
Ludwig is a Python library that builds ML solutions declaratively. We define a YAML file and then run Ludwig programmatically via its API or via the CLI. It is part of the Linux Foundation.
-
FlaML is an AutoML solution. It has a design that accounts for cost-effective hyperparameter optimization.
-
TPOT is an AutoML solution: a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
-
PyCaret is an AutoML tool.
-
auto-sklearn is an AutoML tool.
-
H2O AutoML is an AutoML tool.
-
ELI5 and SHAP are 2 popular open-source model explainability frameworks.
-
Hugo is a tool, written in Go, for building static websites.
-
AWS DeepLens is a hardware solution (a deep learning-enabled video camera).
-
It would be a good practice to have a cli.py to get predictions.
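A minimal cli.py sketch using the standard library's argparse is below; the `predict` function is a hypothetical stand-in for loading a real model, and the flag names are illustrative.

```python
import argparse
import json

def predict(features):
    # Hypothetical stand-in for loading a trained model and predicting;
    # here we simply sum the inputs for illustration.
    return sum(features)

def main(argv=None):
    parser = argparse.ArgumentParser(description="Get a prediction from the command line.")
    parser.add_argument("--features", type=float, nargs="+", required=True)
    args = parser.parse_args(argv)
    prediction = predict(args.features)
    print(json.dumps({"prediction": prediction}))
    return prediction

if __name__ == "__main__":
    main()  # e.g. python cli.py --features 1 2 3
```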
-
GitHub Actions is an alternative to Bitbucket Pipelines and Jenkins. Its cloud-native alternative on AWS is AWS CodeBuild.
-
GitHub Container Registry is an alternative to Amazon ECR (Elastic Container Registry).
-
ONNX is a tool for ML interoperability. It is the product of a collaboration between Facebook and Microsoft.
-
Models from some libraries can be converted to ONNX format:
- An example script to convert a pytorch model to ONNX format
import torch
import torchvision
dummy_tensor = torch.randn(8, 3, 200, 200)
model = torchvision.models.resnet18(pretrained=True)
input_names = [ "input_%d" % i for i in range(12) ]
output_names = [ "output_1" ]
torch.onnx.export(
model,
dummy_tensor,
"resnet18.onnx",
input_names=input_names,
output_names=output_names,
opset_version=7,
verbose=True,
)
- ONNX has a special format called ORT that minimizes the build size of the model.
-
Both a requirements.txt file and a setup.py file can install dependencies for a Python project. However, only a setup.py file can package a project for distribution.
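A minimal setup.py sketch is below; the project name, version and dependency list are illustrative, not from the course.

```python
# setup.py -- a minimal packaging sketch (name/version/deps are illustrative)
from setuptools import setup, find_packages

setup(
    name="my-ml-project",
    version="0.1.0",
    packages=find_packages(),
    install_requires=["scikit-learn"],
)
```

With this file in place, `pip install .` installs the project and `python -m build` (or `python setup.py sdist`) packages it for distribution.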
-
Command-line tool development can be useful when a task needs a quick, scriptable solution.
-
Porting a Python model into a production language like C++ or Java is challenging, and often results in reduced performance compared to the original, trained model.
-
Fairlearn is a Python package to mitigate observed unfairness. It is a library to detect bias across gender, race, religion, etc.
-
InterpretML is a Python package for ML interpretability.