# Notes: Required questions - theory Kirenz


## Data-centric AI

Notes on the video by AI pioneer Andrew Ng: "A Chat with Andrew on MLOps: From Model-centric to Data-centric AI".

<iframe width="560" height="315" src="https://www.youtube.com/embed/06-AZXmwHjo" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>


- **AI-System** = Code (model/algorithm) + Data 

- Data is food for AI  
![image](../../../assets/img/dataIsFood.png)     

- Especially for small data sets, it is very important that the labels are consistent  
![image](../../../assets/img/smallDataAndLabelConsistency.png)  

    - **Noisy dataset**: data that contains a large amount of additional, meaningless information, e.g. corrupted data or any data that a user system cannot understand and interpret.

    - **Noisy labels**: labels that were set incorrectly or inconsistently  


- **Theory: Clean vs. noisy data**  
    You have 500 examples, and 12% of them are noisy (incorrectly or inconsistently labeled).

    The following are about equally effective:

    - Clean up the noise (12% of 500 = 60 examples)
    - Collect another 500 new examples (double the training set)

    With a data-centric view, there is significant room for improvement in problems with fewer than 10,000 examples.
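
The arithmetic behind this comparison can be checked with a short sketch. The variable names are illustrative; the claim that both options are about equally effective comes from the talk and is not something the code demonstrates.

```python
# Numbers taken from the example above; variable names are illustrative.
n_examples = 500
noise_rate = 0.12

# Option 1: relabel only the noisy examples.
n_to_clean = round(n_examples * noise_rate)

# Option 2: double the training set by collecting as many examples again.
n_to_collect = n_examples

print(f"Examples to clean:   {n_to_clean}")
print(f"Examples to collect: {n_to_collect}")
```

Under these numbers, cleaning touches far fewer examples than collecting new ones, which is the motivation for the data-centric view on small data sets.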

```{admonition} Required questions
:class: tip
- **Describe the lifecycle of an ML project**

    ![image](../../../assets/img/LifecycleMlProject.png)  

    - **Collect data**
        Define and collect the data. It's important that the data is labeled consistently. 

         ![image](../../../assets/img/iguanaDetection.png)  

         All three labeling options are fine, but the whole dataset should be labeled in one way. For example, we label all the data as in the first picture.

         How can we make data quality systematic in MLOps?
         - Ask two independent labelers to label a sample of images
         - Measure consistency between labelers to discover where they disagree
         - For classes where the labelers disagree, revise the labeling instructions until the labels become consistent

    - **Train model**
        It's important to analyze the errors after each training run.

        Making it systematic: iteratively improving the data (data-centric view)
        - Train a model
        - Perform error analysis to identify the types of data the algorithm does poorly on (e.g. speech with car noise).
          *Example: speech with car noise in the background. If that's the problem, we should collect more speech data with car noise in the background. Don't just add 5,000 more arbitrary examples; add data specifically with speech and car noise in the background. The model can be improved significantly by selectively adding data.*
        - Either get more of that data via data augmentation, data generation or data collection (change inputs x), or give a more consistent definition for the labels if they were found to be ambiguous (change labels y)

____________________________

- **What is the difference between a model-centric vs data-centric view**  
  
    **Model-centric view**  
    Collect what data you can, and develop a model good enough to deal with the noise in the data.  

    Hold the data fixed and iteratively improve the code/model.  
      
    **Data-centric view**  
    The consistency of the data is paramount. Use tools to improve the data quality; this will allow multiple models to do well.  

    Hold the code fixed and iteratively improve the data.  
____________________________

- **Describe MLOps’ most important task**  
  
    Ensure consistently high-quality data in all phases of the ML project lifecycle.  
    What is good data?
    - Defined consistently (definition of labels y is unambiguous)
    - Covers important cases (good coverage of inputs x)
    - Enough data, for example enough speech data with car noise in the background
    - Has timely feedback from production data (distribution covers data drift and concept drift)
    - Sized appropriately

```
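
The two data-quality loops from the lifecycle answer above (measuring consistency between labelers, and error analysis per type of data) can be sketched in a few lines of plain Python. This is a minimal illustration, not a method from the talk: the function names `disagreement_by_class` and `error_rate_by_slice` and all toy data are hypothetical.

```python
from collections import defaultdict


def disagreement_by_class(labels_a, labels_b):
    """Return, per class, the fraction of samples on which two labelers disagree.

    A sample counts toward every class that either labeler assigned to it.
    """
    totals = defaultdict(int)
    disagreements = defaultdict(int)
    for a, b in zip(labels_a, labels_b):
        for cls in {a, b}:
            totals[cls] += 1
            if a != b:
                disagreements[cls] += 1
    return {cls: disagreements[cls] / totals[cls] for cls in totals}


def error_rate_by_slice(slices, y_true, y_pred):
    """Return the model's error rate per data slice (e.g. 'car noise', 'clean')."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for s, t, p in zip(slices, y_true, y_pred):
        totals[s] += 1
        errors[s] += int(t != p)
    return {s: errors[s] / totals[s] for s in totals}


# Hypothetical toy data: two labelers annotate the same five images.
labeler_a = ["iguana", "iguana", "none", "iguana", "none"]
labeler_b = ["iguana", "none", "none", "iguana", "none"]
print(disagreement_by_class(labeler_a, labeler_b))

# Hypothetical toy data: speech clips tagged with their background condition.
slices = ["car noise", "car noise", "clean", "clean", "clean"]
y_true = ["yes", "no", "yes", "no", "yes"]
y_pred = ["no", "yes", "yes", "no", "yes"]
print(error_rate_by_slice(slices, y_true, y_pred))
```

Classes with a high disagreement rate are candidates for revised labeling instructions, and slices with a high error rate are candidates for targeted data collection or augmentation.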

## Common MLOps related challenges

Notes on the video by Nayur Khan, global head of technical delivery at McKinsey:

<iframe width="560" height="315" src="https://www.youtube.com/embed/M1F0FDJGu0Q" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

```{admonition} Required questions
:class: tip
- **Describe 4 typical challenges when creating machine learning products.**  
    
    Typically, several teams work in the same office or are geographically separated, and many problems result:  

    - *Lack of collaboration*: components and libraries are not shared between teams, or even within the same team.
    - *Inconsistency*: different approaches to solving the same problem, or use of inconsistent data or algorithms.
    - *Duplication*: reinventing the wheel. Teams sitting next to each other try to solve the same problem.
    - *Tech debt*: large codebases that accumulate tech debt, so that nobody knows how to maintain the code.


____________________________

- **Reusability concerns within a codebase: Explain a common way to look at what code is doing in a typical ML project.**  
  
____________________________

- **What kind of problems does the open-source framework Kedro solve and where does Kedro fit in the MLOps ecosystem?**  
  
    Kedro solves the problem of maintainability.   

    Kedro is not a deployment tool. It focuses on how you work while writing standardized, modular, maintainable and reproducible data science code, not on how you would like to run it in production. The responsibility for questions such as "What time will this pipeline run?" and "How will I know if it failed?" is left to tools called orchestrators, like Apache Airflow, Luigi, Dagster and Prefect. Orchestrators do not focus on the process of producing something that could be deployed, which is what Kedro does.


```
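
To make the split between writing a pipeline and orchestrating it more concrete, the following is a minimal plain-Python sketch of the underlying idea: modular steps wired together by named datasets. It deliberately does not use Kedro's API; `Node`, `run_pipeline`, and the step functions are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Node:
    """One reusable step: a pure function plus the names of its inputs and output."""
    func: Callable
    inputs: tuple
    output: str


def run_pipeline(nodes, catalog):
    """Run nodes in the given order, reading and writing named datasets in a dict."""
    data = dict(catalog)  # copy so the caller's catalog is not mutated
    for node in nodes:
        args = [data[name] for name in node.inputs]
        data[node.output] = node.func(*args)
    return data


# Hypothetical steps; each one can be tested and reused independently.
def drop_missing(rows):
    return [r for r in rows if r is not None]


def mean(rows):
    return sum(rows) / len(rows)


pipeline = [
    Node(drop_missing, ("raw",), "clean"),
    Node(mean, ("clean",), "average"),
]

result = run_pipeline(pipeline, {"raw": [1.0, None, 3.0]})
print(result["average"])
```

When and where such a pipeline runs, and what happens when it fails, would be the orchestrator's job and is intentionally absent from the sketch.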



## Components 

Next, you’ll get an overview about some of the primary components of MLOps. “An introduction to MLOps on Google Cloud” by Nate Keating:

<iframe width="560" height="315" src="https://www.youtube.com/embed/6gdrwFMaEZ0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

```{admonition} Required questions
:class: tip
- **Describe the challenges of current ML systems (where are teams today)?**  
    
____________________________

- **What are the components of the ML solution lifecycle?**  
  
____________________________

- **Explain the steps in an automated E2E pipeline.**  


```

## Framework

```{admonition} Required questions
:class: tip
- **Describe the difference between DevOps and MLOps**  
    DevOps is a popular practice in developing and operating large-scale software systems. This practice provides benefits such as shorter development cycles, increased deployment velocity, and dependable releases. To achieve these benefits, you introduce two concepts in the software system development:

    - *Continuous Integration (CI)*
    - *Continuous Delivery (CD)*

    An ML system is a software system, so similar practices apply to help guarantee that you can reliably build and operate ML systems at scale.
    However, ML systems differ from other software systems in the following ways:

    - Team skills: In an ML project, the team usually includes data scientists or ML researchers, who focus on exploratory data analysis, model development, and experimentation. These members might not be experienced software engineers who can build production-class services.
    - Development: ML is experimental in nature. You should try different features, algorithms, modeling techniques, and parameter configurations to find what works best for the problem as quickly as possible. The challenge is tracking what worked and what didn't, and maintaining reproducibility while maximizing code reusability.
    - Testing: Testing an ML system is more involved than testing other software systems. In addition to typical unit and integration tests, you need data validation, trained model quality evaluation, and model validation.
    - Deployment: In ML systems, deployment isn't as simple as deploying an offline-trained ML model as a prediction service. ML systems can require you to deploy a multi-step pipeline to automatically retrain and deploy a model. This pipeline adds complexity and requires you to automate steps that data scientists otherwise perform manually before deployment to train and validate new models.
    - Production: ML models can have reduced performance not only due to suboptimal coding, but also due to constantly evolving data profiles. In other words, models can decay in more ways than conventional software systems, and you need to consider this degradation. Therefore, you need to track summary statistics of your data and monitor the online performance of your model to send notifications or roll back when values deviate from your expectations.

    ML and other software systems are similar in continuous integration of source control, unit testing, integration testing, and continuous delivery of the software module or the package. However, in ML, there are a few notable differences:

    - CI is no longer only about testing and validating code and components, but also testing and validating data, data schemas, and models.
    - CD is no longer about a single software package or a service, but a system (an ML training pipeline) that should automatically deploy another service (model prediction service).
    - CT (continuous training) is a new property, unique to ML systems, that's concerned with automatically retraining and serving the models.
____________________________

- **Name and explain the steps for developing ML models**  

1.	Data extraction: You select and integrate the relevant data from various data sources for the ML task.
2.	Data analysis: You perform exploratory data analysis (EDA) to understand the available data for building the ML model. This process leads to the following:
    - Understanding the data schema and characteristics that are expected by the model.
    - Identifying the data preparation and feature engineering that are needed for the model.
3.	Data preparation: The data is prepared for the ML task. This preparation involves cleaning the data and splitting it into training, validation, and test sets. You also apply data transformations and feature engineering suited to the model that solves the target task. The outputs of this step are the data splits in the prepared format.
4.	Model training: The data scientist implements different algorithms with the prepared data to train various ML models. In addition, you subject the implemented algorithms to hyperparameter tuning to get the best performing ML model. The output of this step is a trained model.
5.	Model evaluation: The model is evaluated on a holdout test set to evaluate the model quality. The output of this step is a set of metrics to assess the quality of the model.
6.	Model validation: The model is confirmed to be adequate for deployment, meaning that its predictive performance is better than a certain baseline.
7.	Model serving: The validated model is deployed to a target environment to serve predictions. This deployment can be one of the following:
    - Microservices with a REST API to serve online predictions.
    - An embedded model to an edge or mobile device.
    - Part of a batch prediction system.
8.	Model monitoring: The model predictive performance is monitored to potentially invoke a new iteration in the ML process.

____________________________

- **Explain the steps in an automated E2E pipeline.**  
tbd.

```
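
Two of the steps listed above lend themselves to a small sketch: model validation (step 6, comparing against a baseline) and model monitoring (step 8, tracking summary statistics of the data). The following is a minimal illustration under stated assumptions, not a prescribed implementation: the function names `passes_validation` and `drifted`, the choice of the mean as the tracked statistic, and the 10% tolerance are all hypothetical.

```python
from statistics import mean


def passes_validation(metric, baseline, min_gain=0.0):
    """Model validation: accept the model only if its metric beats the baseline."""
    return metric > baseline + min_gain


def drifted(reference, live, tolerance=0.1):
    """Model monitoring: flag drift when the live mean of a feature deviates
    from the reference mean by more than `tolerance` (relative deviation)."""
    ref_mean = mean(reference)
    if ref_mean == 0:
        # Relative deviation is undefined; treat any non-zero live mean as drift.
        return mean(live) != 0
    return abs(mean(live) - ref_mean) / abs(ref_mean) > tolerance


# Hypothetical numbers for illustration.
print(passes_validation(metric=0.91, baseline=0.88))
print(drifted(reference=[10, 11, 9, 10], live=[13, 14, 12, 13]))
```

In an automated setup, a failed validation check would stop the deployment, and a drift flag would be one possible trigger for a notification, a rollback, or a new training iteration.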