My-MLOps-Notes

Listing my MLOps learnings

This repository contains my notes from this Udemy course.

  1. 85 percent of trained ML models don't reach production, and 55% of companies don't deploy a single model.

  2. An ideal ML life cycle

ml_life_cycle

  1. Research shows that companies using AI increased their profit margins by 3% to 15%.

  2. DevOps applied to Machine Learning is known as MLOps. Model creation must be scalable, collaborative, and reproducible; the principles, tools, and techniques that make models so are known as MLOps.

  3. MLOps process:

mlops_process

  1. DevOps applied to Data is known as DataOps.

  2. Roles in MLOps

roles

  1. Challenges addressed by MLOps
  • Data and artifact versioning

  • Model tracking: degradation of performance due to data drift.

  • Feature generation: MLOps allows feature engineering methods to be reused.

  1. Parts of MLOps

part1

part2

part_all

  1. MLOps Tools

tools

  1. Some data labelling tools:

Feature Engineering

  1. Some Feature Engineering Tools:

Hyperparameter Optimization

  1. Some Hyperparameter Optimization Tools:
  1. FastAPI can be used to serve ML models.

  2. Streamlit is useful for POCs (proofs of concept).

  3. MLOps stages:

ml_ops_stages

  1. Some tools to use

ml_ops_stages

  1. ML projects can be structured in one of 3 ways.

structuring

Cookiecutter

  1. Cookiecutter is a tool to structure our ML projects and folders. It should be installed GLOBALLY on the computer (not in a virtual environment).
pip install cookiecutter

cookiecutter https://github.com/khuyentran1401/data-science-template

cookiecutter

cookiecutter https://github.com/MuhammedBuyukkinaci/Clean-Data-Science-Project-Template.git

Poetry

  1. Poetry allows us to manage dependencies and their versions. Poetry is an alternative to pip.

    • Poetry separates main dependencies and sub-dependencies into two separate files, whereas pip stores all dependencies in a single file (requirements.txt).
    • Poetry creates readable dependency files.
    • Poetry removes all sub-dependencies when removing a library.
    • Poetry avoids installing new libraries that conflict with existing libraries.
    • Poetry packages a project with a few lines of code.
    • All the dependencies of the project are specified in pyproject.toml.
# To install Poetry on your machine (for Linux and Mac)
curl -sSL https://install.python-poetry.org | python3 -


# To generate a project
poetry new <project_name>

# To install dependencies
poetry install

# To add a new pypi library
poetry add <library_name>

# To delete a library
poetry remove <library_name>

# To show installed libraries
poetry show

# To show sub dependencies
poetry show --tree

# Link our existing environment(venv, conda etc) to poetry
poetry env use /path/to/python

Hydra

  1. Hydra manages configuration files. It makes project management easier.

    • Configuration information shouldn't be mixed with the main code.
    • It is easier to modify things in a configuration file.
    • YAML is a common language for configuration files.
    • An example config file and its usage via Hydra are shown below.
    • We can modify Hydra parameters via the CLI without modifying the config file (e.g., python run.py batch_size=64).

    Hydra

    Hydra

    • Hydra logging is super useful.

    • To use Hydra, we must add config as an argument to a function:

import hydra
from pipeline2 import pipeline2

# Assumes a preprocessing.yaml config file next to this script.
@hydra.main(config_path='.', config_name='preprocessing')
def run_training(config):
    # config holds the parsed contents of preprocessing.yaml
    match_pipe = pipeline2(config)

if __name__ == '__main__':
    run_training()

Hydra

Hydra

Pre-commit

  1. Pre-commit plugins automate code review and formatting. To install them, use pip install pre-commit. After installing pre-commit, fill out .pre-commit-config.yaml and run pre-commit install to activate the hooks. From then on, the configured checks run before each commit to the local repository, and the commit will not go through until the problems are solved. --no-verify is a flag that can be appended to git commit; it skips the hooks, so you aren't forced to correct the mistakes detected by pre-commit.

precommit

- Formatter: black
- PEP8 Checker: flake8
- Sort imports: isort
- Check for docstrings: interrogate

precommit

  1. Black and Flake8
# pip install black
black file_name_01.py

# pip install flake8
flake8 temp.py
  1. isort and interrogate
#pip install isort
isort file_name.py

#pip install interrogate
interrogate -vv file_name.py
  1. DVC is used for version control of model training data.

  2. pdoc is used to automatically create documentation for projects.

pip install pdoc3

pdoc --http localhost:8080 temp.py
  1. Makefile creates short and readable commands for configuration tasks. We can use a Makefile to automate tasks such as setting up the environment.

Makefile

  1. A solution design is available here

  2. MLOps stages:

mlops_stages

  1. What AutoML does:

mlops_stages

PyCaret

  1. PyCaret is an open-source, low-code ML library. It has been developed in Python and reduces the time needed to create a model to minutes.

pycaret

  1. PyCaret incorporates these libraries:

pycaret

  1. Pandas Profiling allows us to develop an exhaustive analysis of the data.

pandas_profiling
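A minimal profiling sketch (the data.csv filename is a placeholder; newer versions of the library ship as ydata-profiling):

import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv('data.csv')  # placeholder dataset
profile = ProfileReport(df, title='EDA Report')
profile.to_file('report.html')  # writes an interactive HTML report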

  1. An example of PyCaret setup function:

pycaret
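Since the figure is not reproduced here, a minimal sketch of a setup call (the data DataFrame and its 'Price' target column are hypothetical):

from pycaret.regression import setup, compare_models

# setup() infers column types, splits the data, and builds the preprocessing pipeline.
s = setup(data, target='Price', session_id=123)

# compare_models() cross-validates the library's models and returns the best one.
best_model = compare_models()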

  1. Tukey-Anscombe Plot && Normal QQ Plot

plot1

  1. Scale-Location Plot && Residuals vs Leverage Plot

plot2

  1. MLFlow Tracking Server and Model Registry

plot2

  1. MLFlow UI for different runs

plot2

MLFlow

  1. Different Components of MLFlow

  1. We can log parameters, metrics and models in MLFlow.
import mlflow
import numpy as np
from sklearn.linear_model import ElasticNet
from urllib.parse import urlparse

alpha = 0.5

def rmse_compute(true, preds):
    return np.sqrt(np.mean((np.asarray(true) - np.asarray(preds)) ** 2))

# Placeholders: load your own train/test splits here.
X_train, y_train = None, None
X_test, y_test = None, None

with mlflow.start_run():

    lr = ElasticNet(alpha=alpha)
    lr.fit(X_train, y_train)
    y_test_preds = lr.predict(X_test)
    rmse = rmse_compute(y_test, y_test_preds)
    mlflow.log_param('alpha', alpha)
    mlflow.log_metric('rmse', rmse)

    tracking_url_type_store = urlparse(mlflow.get_tracking_uri()).scheme

    # The model registry requires a database-backed tracking server, not a file store.
    if tracking_url_type_store != 'file':
        mlflow.sklearn.log_model(lr, 'model', registered_model_name='ElasticNetWineModel')
    else:
        mlflow.sklearn.log_model(lr, 'model')
  1. We can register models into MLFlow via PyCaret.
# pass log_experiment=True and experiment_name='diamond' to setup() so runs are logged to MLFlow

s = setup(data, target = 'Precio', transform_target = True, log_experiment = True, experiment_name = 'diamond')

Shap

  1. SHAP is a Python library for model interpretability.
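A minimal standalone SHAP sketch for a tree-based model (the dataset here is synthetic, purely for illustration):

import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

# TreeExplainer computes SHAP values efficiently for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global feature importance with per-feature value distributions.
shap.summary_plot(shap_values, X)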

  1. A prediction for a single record

  1. We can use SHAP with PyCaret.

Deploying the model

  1. We aren't just deploying a model (a pickle file). We are also deploying a pipeline (composed of preprocessing, feature engineering, etc.).

  1. There are 2 different ways to deploy a model in a production environment:

    • Through API
    • Through Applications(mobile/web)

FastAPI

  1. An API is an intermediary between 2 different applications that communicate with each other. If we want our application to be available to other developers, creating an API as an intermediate connector is convenient. Developers send HTTP requests to consume the service. We can think of an API as an abstraction of our application: thanks to the API, users need neither our code nor its dependencies.

  1. HTTP verbs and Status Codes

  1. FastAPI is a framework for creating robust & high-performance APIs for production environments. Compared to Flask, which is a development framework, FastAPI has the following advantages:

    • It uses asyncio for asynchronous request handling.
    • It implements Pydantic for data validation.
    • FastAPI enforces a schema on input data and detects data-type errors at runtime.
    • FastAPI uses Swagger UI to create automatic documentation.
    • FastAPI has better security and authentication features.
  2. FastAPI documentation UI

  1. FastAPI apps are typically served with the uvicorn ASGI server.

  2. A basic usage of FastAPI

import uvicorn
from fastapi import FastAPI

app = FastAPI()

@app.get('/')
def home():
    return {'Hello': 'World'}

#@app.post("/")
#def home_post():
#    return {"Hello": "POST"}

# Query parameters: /employee?department=IT
@app.get("/employee")
def employee_by_department(department: str):
    return {"department": department}

# Path parameters: /employee/42
@app.get("/employee/{id}")
def employee_by_id(id: int):
    return {"id": id}

if __name__ == '__main__':
    uvicorn.run("hellow_world_fastapi:app")

  1. Pydantic usage in FastAPI
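A minimal sketch of request-body validation (the Employee model and its fields are hypothetical); Pydantic rejects ill-typed payloads with a 422 response:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Employee(BaseModel):
    name: str
    department: str
    salary: float

@app.post('/employee')
def create_employee(employee: Employee):
    # FastAPI parses and validates the JSON body into an Employee instance.
    return {'name': employee.name, 'department': employee.department}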

  1. PyCaret is able to create a FastAPI app automatically.

Gradio

  1. Gradio is a web application framework for deploying our ML models. It provides a UI for business users.

  2. An example demo for gradio app
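The demo itself is not reproduced here; below is a minimal sketch of a gradio app (the predict function is a toy placeholder, not a real model):

import gradio as gr

def predict(age: float, salary: float):
    # Placeholder for a real model call, e.g., model.predict(...)
    return 'churn' if salary < 3000 else 'no churn'

demo = gr.Interface(
    fn=predict,
    inputs=[gr.Number(label='age'), gr.Number(label='salary')],
    outputs='text',
)
demo.launch()  # serves a local web UI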

  1. PyCaret has a function to create gradio apps easily.

Flask

  1. Flask is a web development framework in Python.

    • It is easy to use.
    • It is flexible.
    • Allows testing.
  2. An example code snippet for Flask

from flask import Flask

app = Flask(__name__)

@app.route("/")
def home():
    return {"Hello": "World"}

if __name__ == '__main__':
    app.run()

  1. An ML Project Deployment Pipeline

  1. A Dockerfile can be thought of as a recipe to cook a meal.

  2. Physical Machine vs Virtualization vs Container Deployment vs Kubernetes

  1. We can create a Docker image via the create_docker function of PyCaret.

Azure

  1. If we are using a paid Azure account, we can register our Docker images in the Azure Container Registry.

  2. Azure Blob Storage is similar to Amazon S3. To use Azure Blob Storage, the Azure SDK is needed.

Practical MLOps Notes

Introduction to MLOps

  1. There are far more ML engineering job listings than data science job listings.

  2. MLOps is the process of automating machine learning using DevOps methodologies.

  3. Data drift is a phenomenon in which the data distribution changes between training and inference.
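As a toy illustration (not from the book), a two-sample Kolmogorov-Smirnov test can flag a feature whose distribution has shifted:

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=1000)  # training distribution
live_feature = rng.normal(loc=0.5, scale=1.0, size=1000)   # shifted production distribution

statistic, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:
    print('possible data drift detected')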

  4. New Relic, Datadog, and Stackdriver are performance monitoring tools.

  5. An example Makefile is below. Run make install, make lint, or make test.

install:
	pip install --upgrade pip &&\
		pip install -r requirements.txt
lint:
	pylint --disable=R,C hello.py

test:
	python -m pytest -vv --cov=hello test_hello.py
  1. A data lake is a place where we can process data without transferring it elsewhere.

  2. MLOps becomes possible once DevOps (e.g., Jenkins), data automation (e.g., Airflow), and platform automation (e.g., AWS SageMaker) are in place.

  3. Building reusable ML pipelines is crucial and related to versioning.

  4. Locust and loader.io are 2 tools for load testing.
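A minimal locustfile sketch (the /predict endpoint and JSON payload are hypothetical); run it with locust -f locustfile.py --host http://localhost:8000:

from locust import HttpUser, task, between

class PredictionUser(HttpUser):
    wait_time = between(1, 3)  # simulated think time per user

    @task
    def predict(self):
        # Hypothetical prediction endpoint of the deployed model API.
        self.client.post('/predict', json={'age': 35, 'salary': 4000})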

  5. MLOps is an equal combination of data, DevOps, models, and business.

MLOps Foundations

MLOps for Containers and Edge Devices

  1. The future means more ML on edge devices.

  2. The Coral Project is a platform that helps build local (on-device) inferencing that captures the essence of edge deployments: fast, close to the user, and offline.

  3. Azure Percept is a Microsoft solution similar to The Coral Project.

Continuous Delivery for Machine Learning Models

  1. GitHub Actions workflows can be located under .github/workflows/ as main.yaml.

  2. An example overview of CI/CD

  1. Kubernetes is a good way to implement Blue-Green Deployment.

  2. Blue-green deployment is a deployment strategy with 2 environments: the current version runs in the blue environment, and the new version is installed in the green environment. While the app keeps running on blue, we install the new version on green and carry out tests. If everything goes well, we direct the traffic to the green environment.

  3. Canary deployment is a deployment strategy in which we progressively move traffic from the old environment to the new one. If something unexpected happens in the new environment, we roll back to the previous one, so less traffic is affected by mistakes. If everything goes well, the traffic directed to the new environment is progressively increased up to 100%.

AutoML and KaizenML

  1. KaizenML is about automating everything about the machine learning process and improving it.

  1. Software for training machine learning models could turn into something like the Linux kernel, free and ubiquitous.

  2. AutoML is just a technique, like continuous integration (CI); it automates trivial tasks.

  3. DevOps + KaizenML = MLOps. KaizenML includes building feature stores, i.e., a registry of high-quality machine learning inputs, and the ability to monitor data for drift and to register and serve out ML models.

  4. Uber's Michelangelo is an ML-as-a-Service platform. Databricks has a feature store solution too.

  5. Apple has a machine learning framework called Core ML. It is under Xcode > Open Developer Tool > Create ML.

  6. TFHub is a repository of various pretrained ML models.

  7. SageMaker Autopilot is AWS's complete solution for AutoML and MLOps.

  8. Ludwig is a Python library that builds ML solutions declaratively. We define a YAML file and then run Ludwig either programmatically via its API or via the CLI. It is part of the Linux Foundation.

  1. FlaML is an AutoML solution. It has a design that accounts for cost-effective hyperparameter optimization.

  2. TPOT is an AutoML solution: a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

  3. PyCaret is an AutoML tool.

  4. Autosklearn is an AutoML tool.

  5. H2O AutoML is an AutoML tool.

  6. ELI5 and SHAP are 2 popular open source Model Explainability frameworks.

AWS for MLOps

  1. Hugo (written in Go) is a tool for building static websites.

  2. AWS DeepLens is a hardware solution: a deep-learning-enabled video camera from AWS.

  3. It would be a good practice to have a cli.py to get predictions.

  4. GitHub Actions is an alternative to Bitbucket Pipelines and Jenkins. Its cloud-native alternative on AWS is AWS CodeBuild.

  5. GitHub Container Registry is an alternative to Amazon ECR (Elastic Container Registry).

Machine Learning Interoperability

  1. ONNX is a tool for ML interoperability. It is the product of a collaboration between Facebook and Microsoft.

  2. Models from several libraries can be converted to ONNX format:

  1. An example script to convert a PyTorch model to ONNX format:
import torch
import torchvision

# A dummy batch of 8 RGB images (200x200) used to trace the model.
dummy_tensor = torch.randn(8, 3, 200, 200)
model = torchvision.models.resnet18(pretrained=True)

input_names = ["input_%d" % i for i in range(12)]
output_names = ["output_1"]

# Trace the model with the dummy tensor and serialize it to ONNX.
torch.onnx.export(
    model,
    dummy_tensor,
    "resnet18.onnx",
    input_names=input_names,
    output_names=output_names,
    opset_version=7,
    verbose=True,
)
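To sanity-check the exported model, a minimal inference sketch with onnxruntime (assuming the package is installed):

import numpy as np
import onnxruntime as ort

# Load the exported model and run one forward pass on a dummy batch.
session = ort.InferenceSession('resnet18.onnx')
input_name = session.get_inputs()[0].name
dummy = np.random.randn(8, 3, 200, 200).astype(np.float32)
outputs = session.run(None, {input_name: dummy})
print(outputs[0].shape)  # (8, 1000) class scores for the dummy batch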
  1. ONNX has a special format called ORT that minimizes the build size of the model.

Building MLOps Command Line Tools and Microservices

  1. Both a requirements.txt file and a setup.py file can install dependencies for a Python project. However, only a setup.py file can package a project for distribution.
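A minimal setup.py sketch (the project name, dependencies, and entry point are hypothetical):

from setuptools import setup, find_packages

setup(
    name='my-mlops-project',  # hypothetical project name
    version='0.1.0',
    packages=find_packages(),
    install_requires=['scikit-learn', 'pandas'],  # runtime dependencies
    entry_points={
        # exposes a `predict` console command backed by a hypothetical cli module
        'console_scripts': ['predict=my_mlops_project.cli:main'],
    },
)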

  2. Command-line tool development can be useful when there is a concrete problem to solve.

ML Engineering and ML Use cases

  1. Porting a Python model into a production language like C++ or Java is challenging, and it often results in reduced performance compared to the original, trained model.

  2. Fairlearn is a Python package to mitigate observed unfairness. It is a library to detect bias across gender, race, religion, etc.
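A minimal sketch with Fairlearn's MetricFrame (the labels, predictions, and sensitive feature are toy data); it breaks a metric down by group:

from fairlearn.metrics import MetricFrame
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 0]
gender = ['f', 'f', 'f', 'm', 'm', 'm']  # toy sensitive feature

mf = MetricFrame(metrics=accuracy_score,
                 y_true=y_true,
                 y_pred=y_pred,
                 sensitive_features=gender)
print(mf.overall)   # accuracy on the whole sample
print(mf.by_group)  # accuracy per group; large gaps suggest bias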

  3. InterpretML is a Python package for ML interpretability.
