# MLOps

## Stages

1. Scoping *We define the project and check whether the problem actually requires machine learning to solve. We gather requirements, check that the relevant data is available, and verify that the data is unbiased and reflects the real-world use case.*

2. Data Engineering *This stage involves collecting data, establishing baselines, and then cleaning, formatting, labelling, and organizing the data.*

3. Modelling *This is the coding part, where we create the ML model. We train the model on the processed data, define an error measurement, perform error analysis, and track model performance.*

4. Deployment *Here we package the model and deploy it in the cloud or on edge devices as necessary. Packaging options include a model wrapped in an API server exposing REST or gRPC endpoints, a Docker container deployed on cloud infrastructure, a deployment on a serverless cloud platform, or a mobile app for edge-based models.*

5. Monitoring *Once the deployment is done, we rely on a monitoring infrastructure to help us maintain and update the model.*
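
As a rough illustration of the Deployment stage, the sketch below wraps a model in a small HTTP API using only the Python standard library. Everything in it is an assumption made for the example: `MeanThresholdModel` is a hypothetical stand-in for a trained model, and the `/predict` path, the `features` field, and port 8000 are arbitrary choices. A real service would more likely use a dedicated web framework and load a serialized model.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class MeanThresholdModel:
    """Hypothetical stand-in for a trained model (not a real ML model)."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, features):
        # Score is the mean of the features; label is 1 at or above the threshold.
        score = sum(features) / len(features)
        return {"score": score, "label": int(score >= self.threshold)}


MODEL = MeanThresholdModel(threshold=0.5)


def handle_predict(body):
    """Parse a JSON request body and return (http_status, response_dict)."""
    try:
        payload = json.loads(body)
        features = [float(x) for x in payload["features"]]
        if not features:
            raise ValueError("features must not be empty")
    except (ValueError, KeyError, TypeError) as exc:
        return 400, {"error": f"bad request: {exc}"}
    return 200, MODEL.predict(features)


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Only the assumed /predict endpoint is served.
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, result = handle_predict(self.rfile.read(length))
        data = json.dumps(result).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def make_server(host="127.0.0.1", port=8000):
    """Build the server; call .serve_forever() on the result to start it."""
    return HTTPServer((host, port), PredictHandler)
```

Keeping the request logic in `handle_predict`, separate from the HTTP handler, means the prediction path can be tested without opening a socket; `make_server().serve_forever()` would start the actual service.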


## ML Production Infrastructure


1. Data Collection: This step involves collecting data from various sources. ML models require a lot of data to learn, so data collection means consolidating all kinds of raw data related to the problem. For example, image classification might require you to gather all available images or scrape the web for more, and voice recognition may require a large number of audio samples.

2. Data Verification: In this step we check the validity of the data. Is the collected data up to date and reliable? Does it reflect the real world? Is it in a consumable format, and is it structured properly?

3. Feature Extraction: Here we select the features that best help the model predict. The model may not need all of the data to discover patterns; some columns or parts of the data might not be used at all, and some models perform better when a few columns are dropped. We usually rank the features by importance, keep the high-importance ones, and drop those with low or near-zero importance.

4. Configuration: This step involves setting up communication protocols, system integrations, and the way the components in the pipeline talk to each other. The data pipeline must be connected to the database, the ML model must connect to the database with proper access, the model must expose its prediction endpoints in an agreed way, and the model inputs must follow an agreed format. All configuration the system requires needs to be finalized and documented.

5. ML Code: Now we come to the actual coding part. In this stage we develop a base model that can learn from the data and make predictions. There are many ML libraries available across multiple languages, for example TensorFlow, PyTorch, scikit-learn, Keras, and fastai. Once we have a model, we improve its performance by tuning hyperparameters and testing different learning approaches until the model performs better than its previous version.

6. Machine Resource Management: This step involves planning the resources for the ML model. ML models usually require heavy resources in terms of CPU, memory, and storage, and deep learning models typically rely on GPUs or TPUs for computation. Training ML models costs both time and money: slower hardware takes longer, while more powerful hardware is pricier. The larger the model, the more storage you will have to invest in.

7. Analysis Tool: Once the model is ready, how do we know whether it is performing up to the mark? In this stage we decide how to analyze the model. How do we compute loss? Which error measurement should we use? How do we check whether the model is drifting? Are the predictions sensible? Is the model overfitted or underfitted? The libraries used to implement the model usually ship with analysis tools and error measurements.

8. Project Management Tool: Tracking an ML project is very important. It is easy to get lost while dealing with large datasets, features, ML code, and resource management. Fortunately, many project management tools are available to help.

9. Serving Infrastructure: Once the model is developed, tested, and ready to go, we need to deploy it somewhere users can access it. Most models are deployed in the cloud. Public cloud providers such as AWS, GCP, and Azure offer ML-specific features for easy deployment of models. Depending on your budget, you can select the provider best suited to your needs.

10. Monitoring: We need a monitoring system to observe the deployed model and the system it runs on. Collecting model logs, user access logs, and prediction logs helps in maintaining the model. Several monitoring solutions are available, such as Graylog, the Elastic Stack, and Fluentd. Cloud providers usually ship their own monitoring systems as well.
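
To make the Data Verification step (item 2) more concrete, here is a minimal sketch of record-level checks for structure, format, and freshness. The schema, the column names (`user_id`, `amount`, `recorded_at`), and the 30-day freshness window are all hypothetical and chosen only for illustration; real pipelines often use a dedicated validation library instead.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical schema: column name -> expected Python type.
EXPECTED_SCHEMA = {"user_id": int, "amount": float, "recorded_at": datetime}

# Assumed freshness window; records older than this count as stale.
MAX_AGE = timedelta(days=30)


def verify_record(record, now):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for column, expected_type in EXPECTED_SCHEMA.items():
        if column not in record or record[column] is None:
            problems.append(f"missing value for '{column}'")
        elif not isinstance(record[column], expected_type):
            actual = type(record[column]).__name__
            problems.append(
                f"'{column}' should be {expected_type.__name__}, got {actual}"
            )
    # Freshness check; assumes timezone-aware timestamps throughout.
    recorded_at = record.get("recorded_at")
    if isinstance(recorded_at, datetime) and now - recorded_at > MAX_AGE:
        problems.append("record is older than the freshness window")
    return problems


def verify_dataset(records, now=None):
    """Map the index of each bad record to its list of problems."""
    now = now or datetime.now(timezone.utc)
    report = {}
    for index, record in enumerate(records):
        problems = verify_record(record, now)
        if problems:
            report[index] = problems
    return report
```

An empty report means every record passed. Records that appear in the report can be fixed, dropped, or sent back to the collection step before any features are extracted.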