# FOUNDATIONS

## Machine Learning Lifecycle

### Problem Formation and Understanding

* Analyze whether machine learning is a viable solution for the given problem
* Identify the inputs and outputs for the model
* Identify the acceptable accuracy and prediction error you would tolerate from the model

### Data Collection and Preparation

* Identify the source of raw data you would need for the development of your machine learning model
* Allocate time and effort to annotating and wrangling the raw data so that it can be used for training.
* Allocate time and effort to labeling data, removing irrelevant features, tossing outliers, transforming data, and imputing missing values.
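
The preparation steps above can be sketched in plain Python. This is a minimal illustration, not a prescribed pipeline: the records, the column names, and the "five times the median" outlier rule are all assumptions made up for the example.

```python
import statistics

# Hypothetical raw records: "id" is an irrelevant feature, None marks a missing value.
raw_rows = [
    {"id": 1, "age": 25, "income": 40000},
    {"id": 2, "age": None, "income": 42000},
    {"id": 3, "age": 31, "income": 39000},
    {"id": 4, "age": 29, "income": 1000000},  # suspicious outlier
    {"id": 5, "age": 35, "income": 45000},
]

def clean(rows, drop=("id",), outlier_column="income", max_ratio=5.0):
    # 1. Remove irrelevant features (builds new dicts, the input is not modified).
    rows = [{k: v for k, v in row.items() if k not in drop} for row in rows]

    # 2. Toss outliers: rows whose value exceeds max_ratio times the median.
    #    Assumes the outlier column itself has no missing values.
    median = statistics.median(row[outlier_column] for row in rows)
    rows = [row for row in rows if row[outlier_column] <= max_ratio * median]

    # 3. Impute missing values with the median of the known values in that column.
    for column in rows[0]:
        known = [row[column] for row in rows if row[column] is not None]
        fill = statistics.median(known)
        for row in rows:
            if row[column] is None:
                row[column] = fill
    return rows

cleaned = clean(raw_rows)
```

In practice the same steps are usually done with Pandas; the plain-Python version only keeps the example free of dependencies.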

### Model Training and Testing

* Allocate roughly 80% of the data to training the model, 10% to validating it, and 10% to testing it.
* Identify the necessary machine learning algorithm appropriate for the problem.
* Allocate time and effort to iterating and experimenting with the algorithm, fine-tuning the model, evaluating the results, and preparing the model for deployment.
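
Reading the 80/10/10 proportions above as a split of the dataset (an assumption, though a common convention), a three-way split might look like the sketch below. The function name `split_dataset` and the fixed seed are illustrative choices.

```python
import random

def split_dataset(rows, train=0.8, validation=0.1, seed=0):
    """Shuffle the rows and cut them into train / validation / test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_validation],
        shuffled[n_train + n_validation:],  # whatever remains is the test set
    )

train_set, validation_set, test_set = split_dataset(range(100))
```

Shuffling before cutting matters when the raw data is ordered (for example by date or by label), since otherwise the three parts would not resemble each other.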

### Model Deployment and Maintenance

## Identification of Pre-built Models

When developing a model for your problem, you have two (2) choices: **create a model from scratch** or **use pre-built models**.

**Use of pre-built models**

Using a pre-built model speeds up the development cycle, provided the model was trained on a benchmark dataset similar to your problem. For example, in image classification problems, a pre-built model can be adapted to a similar problem using transfer learning. This allows you to add your data on top of the pre-trained model, which in turn allows you to train new models that inherit the learnings from the pre-built model.

There are websites and organizations that allow you to acquire pre-built models either for a price or for free under an open license such as Apache 2.0, for example **AWS Marketplace**, **Model Zoo**, and **Hugging Face**.

## Machine Learning Model Training Tools

If no pre-built model exists that solves your problem, you can build your own customized model from scratch. Once you have collected the dataset required for training, make sure you have hardware that can support the computationally intensive processes involved. Generic laptops and desktop computers often cannot handle the load necessary to train your model.

To solve this, you can rent GPU power from cloud services such as **Amazon Web Services** and **Google Colab**.

You can develop your own customized model in Python using the Jupyter Notebook IDE. The following libraries are useful during the development process:

**Pandas and NumPy libraries** – for accessing and modifying tabular data structures and n-dimensional arrays, performing exploratory data analysis, and reading CSV, JSON, and TSV data files.

**Matplotlib and Seaborn libraries** – for the data visualization phase, which requires plotting charts and graphs.

**Scikit-learn, TensorFlow, MXNet, PyTorch,** and **Keras** framework libraries – for the actual training of the model.

# Machine Learning Model Training

## Introductory Concepts

A **machine learning algorithm** or **learning algorithm** studies data for *trends* and *patterns* during the training process.

An **epoch** is a *pass/iteration* the computer makes as it uses the *chosen machine learning algorithm* to study the training data.

A **mathematical model** or **model** is the *mathematical 'storage'* containing the trends and patterns uncovered by the machine learning algorithm. 

A **hyperparameter** is a setting chosen before the training process, such as the *number of epochs* to run over our training data, that controls how training proceeds and therefore how effective the resulting model is.
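
To make these terms concrete, here is a small sketch of a training loop. It assumes a toy dataset that follows y = 3x and uses gradient descent as the learning algorithm; both are illustrative choices, as is the learning rate, which is a second hyperparameter not mentioned above.

```python
# Hyperparameters: chosen before training, not learned from the data.
EPOCHS = 200
LEARNING_RATE = 0.01

# Toy training data following y = 3x (made up for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

def train(xs, ys, epochs, learning_rate):
    weight = 0.0  # the "model": a single number that stores the learned pattern
    for _ in range(epochs):  # one epoch = one pass over the training data
        # Gradient of the mean squared error with respect to the weight.
        gradient = sum(2 * (weight * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        weight -= learning_rate * gradient  # step against the gradient
    return weight

weight = train(xs, ys, EPOCHS, LEARNING_RATE)
```

With too few epochs the weight stops short of 3; with enough epochs it settles very close to it, which is why the epoch count is worth tuning.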

## Model Development (Training/Evaluation)

Usually, **80%** of our data is used for the training process, while the remaining **20%** is used for evaluation.

During the training process, **loss functions** are used to *assess the model's predictive ability* by measuring how far an *estimated value derived* from the model is from the *actual value of the data*. These are used to optimize the training process.
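
As an illustration, mean squared error is one widely used loss function. The sketch below computes it for two hypothetical sets of predictions; the numbers are invented for the example.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared distances between actual and estimated values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

actual = [3.0, 5.0, 7.0]
close_predictions = [3.1, 4.9, 7.2]  # hypothetical output of a good model
far_predictions = [1.0, 8.0, 4.0]    # hypothetical output of a poor model

close_loss = mean_squared_error(actual, close_predictions)
far_loss = mean_squared_error(actual, far_predictions)
```

A lower loss means the estimates sit closer to the actual values, so training aims to drive this number down.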

During the evaluation process, the model is judged on evaluation data it has not seen before using **evaluation metrics**, some of which are derived from the *training process' loss functions*. Below are the standard evaluation metrics:
* Mean Square Error (MSE)
* Accuracy
* F1 Score
* AUC
* R^2
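
A few of the metrics listed above can be written out directly. The sketch below implements accuracy, F1 score, and R^2 from their standard definitions; the labels and predictions are invented, and in practice a library such as Scikit-learn would normally provide these.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def f1_score(actual, predicted, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(a == positive and p == positive for a, p in pairs)
    false_pos = sum(a != positive and p == positive for a, p in pairs)
    false_neg = sum(a == positive and p != positive for a, p in pairs)
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

def r_squared(actual, predicted):
    """1 minus the ratio of residual variation to total variation."""
    mean = sum(actual) / len(actual)
    residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    total = sum((a - mean) ** 2 for a in actual)
    return 1 - residual / total

labels = [1, 0, 1, 1, 0, 1]
guesses = [1, 0, 0, 1, 1, 1]
```

Accuracy and F1 apply to classification, while R^2 (like MSE) applies to regression, so the right metric depends on the kind of problem.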

The results of the evaluation process can then be used to tweak our *hyperparameters* to improve the model's performance.

## Introduction to Learning Algorithms

**Linear Regression** is used to *solve regression problems*, that is, to *predict numeric values*. This is achieved using **linear equations** that establish the *relationship between the independent variables (features) and the dependent variable by fitting a regression line to the data*.
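
For a single feature, the regression line can be fitted with the ordinary least squares formulas. The sketch below assumes one independent variable and toy data lying on y = 2x + 1; the function name `fit_line` is illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predict a numeric value for x = 5
```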

**Logistic Regression** is used for *classification problems*: it *predicts the probability of a binary outcome* based on a set of independent variables.
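
The prediction step can be sketched as a weighted sum of the features passed through the sigmoid function. The weights and bias below are hypothetical values standing in for what training would produce, and the 0.5 threshold is a common default rather than a requirement.

```python
import math

def predict_probability(features, weights, bias):
    """Squash a weighted sum of the features into a probability between 0 and 1."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(features, weights, bias, threshold=0.5):
    """Turn the probability into a binary label."""
    return 1 if predict_probability(features, weights, bias) >= threshold else 0

# Hypothetical learned parameters for a model with two features.
weights = [0.8, -0.4]
bias = -1.0

probability = predict_probability([3.0, 1.0], weights, bias)
label = predict_class([3.0, 1.0], weights, bias)
```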

**Decision trees** are used for *classification and regression problems*. This is achieved by following a process wherein the algorithm **(1)** segregates data based on features, **(2)** uncovers a flow that produces the best results/prediction, and **(3)** removes irrelevant branches. *Hyperparameters* are used to configure the decision tree depth.
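
A small classification tree can be sketched as follows. This is a simplified illustration under stated assumptions: it uses Gini impurity to choose splits, treats `max_depth` as the depth hyperparameter, and stands in for branch removal by refusing any split that does not reduce impurity, rather than pruning after the fact. The data is invented.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 when all labels in the group agree."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature, threshold) that lowers impurity the most, or None."""
    best, best_score = None, gini(labels)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [l for r, l in zip(rows, labels) if r[feature] <= threshold]
            right = [l for r, l in zip(rows, labels) if r[feature] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:  # keep only splits that actually help
                best, best_score = (feature, threshold), score
    return best

def build_tree(rows, labels, max_depth):
    """Recursively segregate the data; leaves hold the majority label."""
    split = best_split(rows, labels) if max_depth > 0 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    feature, threshold = split
    left = [(r, l) for r, l in zip(rows, labels) if r[feature] <= threshold]
    right = [(r, l) for r, l in zip(rows, labels) if r[feature] > threshold]
    return (
        feature,
        threshold,
        build_tree([r for r, _ in left], [l for _, l in left], max_depth - 1),
        build_tree([r for r, _ in right], [l for _, l in right], max_depth - 1),
    )

def predict(tree, row):
    """Follow the branches until a leaf (a plain label) is reached."""
    while isinstance(tree, tuple):
        feature, threshold, left, right = tree
        tree = left if row[feature] <= threshold else right
    return tree

rows = [[2.0, 1.0], [3.0, 1.5], [6.0, 0.5], [7.0, 2.0]]
labels = ["no", "no", "yes", "yes"]
tree = build_tree(rows, labels, max_depth=2)
```

Because internal nodes are tuples, the labels themselves must not be tuples; strings or integers work.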

**Random forests** are *sets of decision trees* wherein each tree is created/instantiated from a different sample of rows, with *each tree making its own prediction*. The individual results are averaged (for regression) or put to a majority vote (for classification) to create the *final result*.
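
The ensemble idea described above can be sketched with very small trees. In this illustration each tree is a one-split "stump" trained on a bootstrap sample (rows drawn with replacement), and the final class is decided by majority vote; for a regression problem the per-tree outputs would be averaged instead. The data, the tree count, and the seed are assumptions for the example.

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def train_stump(rows, labels):
    """Fit a one-split tree: the (feature, threshold) with the fewest training errors."""
    best_errors, best_stump = len(labels) + 1, None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [l for r, l in zip(rows, labels) if r[feature] <= threshold]
            right = [l for r, l in zip(rows, labels) if r[feature] > threshold]
            left_label = majority(left)  # left is never empty: threshold is a row value
            right_label = majority(right) if right else left_label
            errors = sum(l != left_label for l in left) + sum(l != right_label for l in right)
            if errors < best_errors:
                best_errors, best_stump = errors, (feature, threshold, left_label, right_label)
    return best_stump

def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label

def train_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap: each tree sees a different sample of rows, drawn with replacement.
        indices = [rng.randrange(len(rows)) for _ in rows]
        forest.append(train_stump([rows[i] for i in indices], [labels[i] for i in indices]))
    return forest

def predict_forest(forest, row):
    """Each tree makes its own prediction; the most common one wins."""
    return majority([predict_stump(stump, row) for stump in forest])

rows = [[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]]
labels = ["low", "low", "low", "high", "high", "high"]
forest = train_forest(rows, labels)
```

Full implementations typically also give each tree a random subset of features, which this sketch leaves out.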