# Training ML Models From Scratch

> A demonstration of ML pipelines and how to develop them

## About

This notebook is for attendees of my 2024 Intel oneAPI Workshop on reusing and extending deep learning models.

Although this notebook does not provide any deep learning code, it lays the foundation for understanding:

1. What a machine learning pipeline is,
2. How to train a machine learning model,
3. How to evaluate a machine learning model, and
4. How to execute a machine learning model on new data.

## Pipelines

A *data pipeline* can be thought of as a series of steps or actions to manipulate data to adhere to a specified format **without**.

A common data pipeline pattern is *ETL*, or *Extract, Transform, Load*.

*Machine learning pipelines* expand on ETL by including further steps to train, evaluate, refine, and deploy machine learning models.

### Data Pipeline

From IBM:

> A data pipeline is a method in which raw data is ingested from various data sources, transformed and then ported to a data store, such as a data lake or data warehouse, for analysis.

- [IBM, *What is a data pipeline?*](https://www.ibm.com/topics/data-pipeline)

To illustrate the flow of data, see the figure below:

![Data pipeline example image](images/data_pipeline_example.png)

**NOTE:** I intentionally did not add any detail on *how* data flows through the pipeline. At a high level, pipelines can be thought of as implementation-agnostic parts of a larger application.

**NOTE:** The pipeline described can also be thought of as an *Extract, Transform, Load* (*ETL*) pipeline, which we discuss next. Although it is uncommon, a data pipeline, unlike an ETL pipeline, **does not need to** transform data to be considered a data pipeline.

### Extract, Transform, Load (ETL) Pipeline

ETL pipelines are a popular sub-category of data pipelines that follow a rigid set of instructions for manipulating data.

These instructions are:

1. Extract the data from a source data store, repository, or database,
2. Transform the data with set algorithmic instructions or processes to generate a subset of the data, new representations of the data, or entirely new data, and
3. Load the transformed data into a data store, repository, or database.
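The three steps above can be sketched in a few lines of Python. This is only a minimal illustration under my own assumptions: the in-memory CSV string, the column names, and the `extract`, `transform`, and `load` function names are all hypothetical, and a real pipeline would read from and write to actual data stores.

```python
import csv
import io
import json

# Hypothetical source data; a real pipeline would read from a file or database.
RAW_CSV = "name,height_cm\nada,170\ngrace,165\nalan,\n"


def extract(source: str) -> list[dict]:
    """Extract: read rows from the source (an in-memory CSV here)."""
    return list(csv.DictReader(io.StringIO(source)))


def transform(rows: list[dict]) -> list[dict]:
    """Transform: drop incomplete rows and derive a new representation."""
    return [
        {"name": row["name"].title(), "height_m": int(row["height_cm"]) / 100}
        for row in rows
        if row["height_cm"]  # skip rows with a missing height
    ]


def load(rows: list[dict]) -> str:
    """Load: write to the destination store (a JSON string here)."""
    return json.dumps(rows)


stored = load(transform(extract(RAW_CSV)))
print(stored)
```

Keeping each step in its own function means any one of them can be swapped out (for example, a different source) without touching the others.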

From IBM:

> ETL pipelines follow a specific sequence. As the abbreviation implies, they extract data, transform data, and then load and store data in a data repository. Not all data pipelines need to follow this sequence.

- [IBM, *What is a data pipeline?*](https://www.ibm.com/topics/data-pipeline)

To illustrate the flow of data, see the figure below:

![ETL pipeline example image](images/etl_pipeline_example.png)

ETL pipelines can be used to allow for interoperability between two different applications on the same data.

Assume we have two applications, **x** and **y**, and source data **a**:

If application **x** takes **a** as input and outputs new data **b**, then application **y** can take **b** as input and output new data **c**.
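As a rough sketch of this chaining, the two applications can be modelled as plain functions. The bodies of `application_x` and `application_y` (squaring and summing) are placeholders I made up; only the shape of the data flow from **a** to **b** to **c** matters.

```python
def application_x(a: list[int]) -> list[int]:
    """Hypothetical application x: takes a and outputs new data b."""
    return [value * value for value in a]


def application_y(b: list[int]) -> int:
    """Hypothetical application y: takes b and outputs new data c."""
    return sum(b)


a = [1, 2, 3]
b = application_x(a)  # x never needs to know that y exists
c = application_y(b)  # y only depends on the format of b
print(a, b, c)
```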

### Machine Learning Pipelines

![ML pipeline stages image](images/ml_pipeline_stages.png)

- Yibo Wang, Ying Wang, Tingwei Zhang, Yue Yu, Shing-Chi Cheung, Hai Yu, and Zhiliang Zhu. 2023. *Can Machine Learning Pipelines Be Better Configured*? In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2023). Association for Computing Machinery, New York, NY, USA, 463–475. [https://doi.org/10.1145/3611643.3616352](https://doi.org/10.1145/3611643.3616352)

ML pipelines expand on the concepts of data and ETL pipelines by including *feedback loops*, *feature engineering*, and ML-specific stages including *training*, *evaluation*, and *deployment*.

Furthermore, Machine Learning Operations (MLOps), an extension of DevOps practices aimed at machine and deep learning, now takes into consideration the state of the model post-deployment and how to update the model so that it continuously matches the requirements.

While MLOps is an interesting and exciting topic, **it is not covered** in this workshop.

However, please take a look at [Intel's MLOps Professional course](https://www.intel.com/content/www/us/en/developer/certification/mlops.html) for more information.


#### Feedback Loops

Machine learning is not strictly an engineering task, but also a scientific one.

This is to say that when you train a model on a dataset, you may not get the best result the first time.

Different model architectures, implementations, hyper-parameters, and features may result in better or worse models.

Thus, while you need to be a software engineer to build a machine learning pipeline, you also need to be a computer scientist and explore different model configurations to identify which best suits your needs.
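To make the idea of a feedback loop concrete, here is a toy sketch that tries several configurations and keeps the best one. The validation data, the one-parameter threshold "model", and the candidate values are all made up for illustration; real exploration would cover architectures and hyper-parameters of an actual model.

```python
# Toy validation set of (feature, label) pairs; the values are made up.
validation = [(0.1, 0), (0.4, 0), (0.6, 1), (0.8, 1), (0.9, 1)]


def evaluate(threshold: float) -> float:
    """Accuracy of a one-parameter 'model' that predicts 1 when x >= threshold."""
    correct = sum((x >= threshold) == bool(label) for x, label in validation)
    return correct / len(validation)


# Candidate configurations to explore (a tiny grid search).
candidates = [0.2, 0.3, 0.5, 0.7]

best_threshold, best_accuracy = None, -1.0
for threshold in candidates:
    score = evaluate(threshold)
    print(f"threshold={threshold}: accuracy={score:.2f}")
    # Feedback: keep a configuration only if it improves on the best so far.
    if score > best_accuracy:
        best_threshold, best_accuracy = threshold, score

print(f"best threshold: {best_threshold}")
```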

For this workshop, we are less interested in the software engineering aspect, and more interested in the computer science one.

#### Feature Engineering

Feature engineering is the act of running a data source through a *data pipeline* to **extract** the relevant features on which to train an ML model.

Given that we have to extract relevant features, an ETL pipeline is a good first choice when designing a data pipeline for a machine learning model.
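As a small sketch of feature extraction, the snippet below turns raw text into numeric features. The example messages and the three features (`length`, `exclamations`, `upper_ratio`) are hypothetical choices of mine; which features are relevant always depends on the task.

```python
# Hypothetical raw data source: two short text messages.
raw_messages = ["Win a FREE prize now!!!", "Lunch at noon?"]


def extract_features(message: str) -> dict[str, float]:
    """Map one raw message to a dictionary of numeric features."""
    letters = [ch for ch in message if ch.isalpha()]
    return {
        "length": len(message),
        "exclamations": message.count("!"),
        # Fraction of letters that are upper case; max() avoids dividing by zero.
        "upper_ratio": sum(ch.isupper() for ch in letters) / max(len(letters), 1),
    }


features = [extract_features(message) for message in raw_messages]
for row in features:
    print(row)
```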

#### ML Specific Stages

**Training** is the process of taking your engineered data and processing it with an algorithm that continuously updates its underlying weights as the data is passed through it (this is the core of ML and DL).
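As a rough illustration only, the loop below fits a line to toy data with stochastic gradient descent, updating a weight and a bias after every example. The data, learning rate, and epoch count are values I picked for the sketch, and a linear model stands in for whatever algorithm you actually train.

```python
# Toy training data following y = 2x + 1 (made up for illustration).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

weight, bias = 0.0, 0.0
learning_rate = 0.05  # assumed value; small enough to be stable on this data

for epoch in range(1000):
    for x, y in data:
        # Error of the current prediction for this example.
        error = (weight * x + bias) - y
        # Gradient step for squared error: nudge the weights against the error.
        weight -= learning_rate * error * x
        bias -= learning_rate * error

print(f"weight={weight:.3f}, bias={bias:.3f}")
```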

**Evaluation** is the process of taking labelled testing data, passing it into your trained ML model, and computing metrics such as accuracy, precision, and recall.
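For reference, these three metrics can be computed by hand for a binary classifier as sketched below. The labels and predictions are made-up values, with 1 as the positive class.

```python
# Hypothetical ground-truth labels and model predictions (1 = positive class).
labels = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 1]

pairs = list(zip(predictions, labels))
tp = sum(1 for p, y in pairs if p == 1 and y == 1)  # true positives
tn = sum(1 for p, y in pairs if p == 0 and y == 0)  # true negatives
fp = sum(1 for p, y in pairs if p == 1 and y == 0)  # false positives
fn = sum(1 for p, y in pairs if p == 0 and y == 1)  # false negatives

accuracy = (tp + tn) / len(pairs)  # share of all predictions that are correct
precision = tp / (tp + fp)         # share of predicted positives that are correct
recall = tp / (tp + fn)            # share of actual positives that were found

print(f"accuracy={accuracy}, precision={precision}, recall={recall}")
```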

**Deployment** is the process of integrating your model into an application so that users can provide it with completely unseen data.

There is enough academic and professional literature on all three of these stages to fill several volumes, so for conciseness, I will not expand on the intricacies of these here until relevant.
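A very small sketch of this idea: persist the trained parameters, then load them inside an application-facing function that accepts new input. The parameter values, the JSON format, and the `predict` function are assumptions for illustration, not a recommended deployment setup.

```python
import json

# Hypothetical trained parameters; in practice these come from the training stage.
trained = {"weight": 2.0, "bias": 1.0}

# Persist the model so that a separate application can load it later.
serialized = json.dumps(trained)


def predict(serialized_model: str, x: float) -> float:
    """Load the stored parameters and run the model on one unseen input."""
    model = json.loads(serialized_model)
    return model["weight"] * x + model["bias"]


# A user supplies a value the model never saw during training.
print(predict(serialized, 10.0))
```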

## Time To Code!

### Install requirements