# Machine Learning Model Training

## Introductory Concepts

A **machine learning algorithm** or **learning algorithm** studies data for *trends* and *patterns* during the training process.

An **epoch** is one complete *pass/iteration* that the computer makes over the training data as it applies the *chosen machine learning algorithm*.

A **mathematical model** or **model** is the *mathematical 'storage'* containing the trends and patterns uncovered by the machine learning algorithm. 

A **hyperparameter** is a setting chosen before the training process that controls how training runs, such as the *number of epochs* to run over our training data. Well-chosen hyperparameters help produce an effective model.
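
To make epochs and hyperparameters concrete, here is a minimal sketch that fits a straight line with gradient descent. The function name `train_linear_model`, the toy data, and the hyperparameter values are assumptions chosen for illustration only.

```python
import numpy as np

def train_linear_model(x, y, n_epochs, learning_rate):
    """Fit y = w * x + b with batch gradient descent (illustrative only)."""
    w, b = 0.0, 0.0
    losses = []
    for epoch in range(n_epochs):  # one epoch = one full pass over the data
        error = (w * x + b) - y
        losses.append(float(np.mean(error ** 2)))
        # Step against the gradient of the mean squared error
        w -= learning_rate * 2 * np.mean(error * x)
        b -= learning_rate * 2 * np.mean(error)
    return w, b, losses

# Toy data that follows y = 2x + 1 exactly
x = np.linspace(0, 1, 50)
y = 2.0 * x + 1.0

# n_epochs and learning_rate are the hyperparameters: we pick them, training does not
w_few, b_few, losses_few = train_linear_model(x, y, n_epochs=10, learning_rate=0.1)
w_many, b_many, losses_many = train_linear_model(x, y, n_epochs=5000, learning_rate=0.1)

print(f"10 epochs:   w={w_few:.3f}, b={b_few:.3f}, loss={losses_few[-1]:.5f}")
print(f"5000 epochs: w={w_many:.3f}, b={b_many:.3f}, loss={losses_many[-1]:.5f}")
```

With too few epochs the learned values are still far from the true slope and intercept, which is why the number of epochs is worth tuning.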

## Model Development (Training/Evaluation)

Usually, **80%** of our data is used for the training process, while the remaining **20%** is used for evaluation.

During the training process, **loss functions** are used to *assess the model's predictive ability* by measuring how far an *estimated value derived* from the model is from the *actual value of the data*. The training process is optimized by minimizing this loss.
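
As a small illustration of a loss function, the sketch below computes the mean squared error by hand. The helper name `mse_loss` and the sample values are made up for this example.

```python
import numpy as np

def mse_loss(actual, predicted):
    """Mean squared error: average squared distance between prediction and truth."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((predicted - actual) ** 2))

actual = [3.0, 5.0, 7.0]
close_guess = [2.5, 5.0, 7.5]  # predictions near the actual values
far_guess = [0.0, 9.0, 2.0]    # predictions far from the actual values

loss_close = mse_loss(actual, close_guess)
loss_far = mse_loss(actual, far_guess)
print(loss_close, loss_far)
```

The closer guess yields the smaller loss, which is the signal the training process uses to improve the model.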

During the evaluation process, the model is judged on evaluation data it has not seen before, using **evaluation metrics** that are often derived from the *training process' loss functions*. Below are standard evaluation metrics:
* Mean Squared Error (MSE)
* Accuracy
* F1 Score
* AUC
* R^2
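
The metrics listed above are all available in scikit-learn's `sklearn.metrics` module. The following is a minimal sketch on made-up labels and predictions, assuming scikit-learn is installed.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_squared_error,
    r2_score,
    roc_auc_score,
)

# Hypothetical classification results
y_true_class = [0, 0, 1, 1]
y_pred_class = [0, 1, 1, 1]          # predicted labels
y_pred_proba = [0.1, 0.6, 0.8, 0.9]  # predicted probability of class 1

accuracy = accuracy_score(y_true_class, y_pred_class)
f1 = f1_score(y_true_class, y_pred_class)
auc = roc_auc_score(y_true_class, y_pred_proba)

# Hypothetical regression results
y_true_reg = [3.0, 5.0, 7.0]
y_pred_reg = [2.5, 5.0, 7.5]

mse = mean_squared_error(y_true_reg, y_pred_reg)
r2 = r2_score(y_true_reg, y_pred_reg)

print(accuracy, f1, auc, mse, r2)
```

Accuracy, F1 Score and AUC apply to classification problems, while MSE and R^2 apply to regression problems.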

The results of the evaluation process can then be used to tweak our *hyperparameters* to improve the model's performance.

## Introduction to Learning Algorithms

*Below are common examples of learning algorithms usually used in the community:*

**Linear Regression algorithms** are used to *solve regression problems*, that is, to *predict numeric values*. This is achieved using **linear equations** that establish the *relationships between independent variables (features) and a dependent variable by fitting a regression line to the data*.

**Logistic Regression algorithms** are used for *classification problems*. They *predict the probability of a binary outcome* based on a set of independent variables.

**Decision tree algorithms** are used for *classification and regression problems*. The algorithm **(1)** segregates data based on features, **(2)** uncovers a flow that produces the best results/predictions, and **(3)** removes irrelevant branches. *Hyperparameters* are used to configure the decision tree depth.
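
As a sketch of how a hyperparameter configures tree depth, the example below trains two scikit-learn decision trees on the bundled iris dataset. The dataset and the `max_depth` values are assumptions picked for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small dataset that ships with scikit-learn (no download needed)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# max_depth is the hyperparameter that limits how deep the tree may grow
shallow_tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_train, y_train)
deeper_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

print("depth 1 test accuracy:", shallow_tree.score(X_test, y_test))
print("depth 3 test accuracy:", deeper_tree.score(X_test, y_test))
```

A tree of depth 1 can only make a single split, so it cannot separate all three iris classes; allowing more depth lets the tree capture more of the structure in the data.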

**Random forest algorithms** build a *set of decision trees* wherein each tree is created/instantiated from a different sample of rows, with *each tree making its own prediction*. All results are averaged (or, for classification, typically combined by majority vote) to create the *final result*.
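
The averaging step can be checked directly. This sketch, using synthetic data and scikit-learn's `RandomForestRegressor`, collects the prediction of every tree in the forest and compares their mean with the forest's own prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data: y is roughly 3 * x plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 1, size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

X_new = np.array([[2.0], [5.0], [8.0]])

# Each fitted tree is available in forest.estimators_ and predicts on its own
per_tree = np.array([tree.predict(X_new) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X_new)

print(manual_average)
print(forest_prediction)
```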

## Logistic Regression Algorithms

*Logistic regression algorithms* are used for classification problems to predict the probability of a binary outcome such as (0, 1), (True, False) or (Yes, No). The *output value* is between 0 and 1, wherein results closer to 1 indicate higher confidence that the sample belongs to the positive class.
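
A minimal sketch of this behaviour with scikit-learn's `LogisticRegression` follows. The "hours studied versus passed" data is invented for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied and whether the student passed (1) or not (0)
hours = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [5.0]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(hours, passed)

new_hours = np.array([[1.0], [4.5]])
# Column 1 of predict_proba holds the probability of the positive class (1)
probabilities = clf.predict_proba(new_hours)[:, 1]
labels = clf.predict(new_hours)

print(probabilities)
print(labels)
```

The probability is turned into a binary label by thresholding, which `predict()` does at 0.5 for this binary case.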

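An algorithm that can be used for this kind of problem is **XGBoost (eXtreme Gradient Boosting)**, which is available in *Amazon SageMaker Studio*. It is by nature a *decision-tree based gradient boosting algorithm* that, when configured with a logistic objective, outputs probabilities for binary classification.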

Using our *processed data*, we create an **estimator** that references the Docker container used for training. We then configure the *hyperparameters* using the **get_hyperparameters()** and **set_hyperparameters()** functions, which have counterparts in nearly all learning algorithm libraries. We can then train our model using the **fit()** function, which takes the **training input data** and the **validation input data** as parameters.

The model can then be hosted on Amazon and made available as a **URL endpoint**. This means that any application can access its output using a simple API request.

The model can then be used to make predictions through **inference**. *Inference* sends input data to the model, which sends back a response showing its prediction and a probability score between 0 and 1.

## Linear Regression Algorithms

*Regression algorithms* are used in regression problems to predict numerical values. Three common examples of regression algorithms are listed below, and the outline of how each is worked through will be shown in subsequent cells:
* Scikit-learn Linear Regression
* RandomForestRegressor
* XGBoost

**Linear Regression**

Using the processed dataset, we will create *2 numpy arrays of X/Y train data* and *2 numpy arrays of X/Y test data* through **random splitting**. This is accomplished by using the **train_test_split(x, y, random_state, shuffle, test_size)** function. *x* and *y* are the positional arguments holding the features and the target values. *random_state* is an integer seed that makes the shuffling reproducible: the same seed always produces the same split. *shuffle* controls whether the data is shuffled before splitting and defaults to True. *test_size* determines the size of the test data, given as a fraction between 0 and 1 (or as an absolute number of rows when an integer is passed).
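
A minimal sketch of the split, using small made-up arrays in place of the processed dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the processed dataset: 10 rows, 2 features
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, random_state=42, shuffle=True, test_size=0.2
)

# 80% of the rows go to training, 20% to testing
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

# Re-running with the same random_state reproduces the same split
x_train_again, _, _, _ = train_test_split(
    x, y, random_state=42, shuffle=True, test_size=0.2
)
```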

A **LinearRegression object** will then be instantiated using the default constructor *LinearRegression()*, and its function **fit(numpy_array, numpy_array)** is used to train our model. The *two numpy array* arguments are the *X/Y train numpy_arrays*. We can then run our predictions using the **predict(numpy_array)** function, passing the *X test numpy_array* as the parameter, and store the output in a variable, which we will name *X_prediction_test*.
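
The fit and predict steps might look like the sketch below. The synthetic data (roughly `y = 4x + 2` with noise) is an assumption standing in for the processed dataset; the variable name *X_prediction_test* follows the text above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: y is roughly 4 * x + 2 with a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))
y = 4.0 * x[:, 0] + 2.0 + rng.normal(0, 0.5, size=100)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

model = LinearRegression()   # default constructor
model.fit(x_train, y_train)  # train on the X/Y train arrays

# Predict target values for the unseen X test array
X_prediction_test = model.predict(x_test)

print(model.coef_, model.intercept_)
print(X_prediction_test[:5])
```

The learned coefficient and intercept should land close to the 4 and 2 used to generate the data.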

To evaluate our model, we create an *instance of a DataFrame* using the constructor **DataFrame(dic)**. The *dictionary* contains two entries whose values are the *Y test numpy_array* and *X_prediction_test*. The *keys* for these values are *'Actual'* and *'Predicted'*. We can then manually compare the two columns with each other, although this will be tedious and even impractical for larger datasets.
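
A minimal sketch of the comparison table, with hard-coded placeholder arrays standing in for the real *Y test numpy_array* and *X_prediction_test*. The extra `Error` column is an addition for illustration, not part of the steps described above.

```python
import numpy as np
import pandas as pd

# Placeholder values; in practice these come from the split and from predict()
y_test = np.array([10.0, 20.0, 30.0])
X_prediction_test = np.array([11.0, 19.5, 30.0])

comparison = pd.DataFrame({"Actual": y_test, "Predicted": X_prediction_test})

# Optional helper column so the differences are visible at a glance
comparison["Error"] = comparison["Predicted"] - comparison["Actual"]

print(comparison)
```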

We can also use the **R^2 metric** to evaluate our model using **round(model.score(numpy_array, numpy_array), 2)**, replacing *model* with the actual name of your model and passing your *X/Y test numpy_arrays*. You can then print the output to check its rating: 1 is the best possible score, 0 means the model does no better than always predicting the mean of the actual values, and the score can be negative for a model that does worse than that.
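
As a sketch of the R^2 check, the example below scores a model on synthetic data (an assumption for illustration) and confirms that `model.score()` matches `r2_score()` from `sklearn.metrics`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a clear linear trend plus noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=(100, 1))
y = 4.0 * x[:, 0] + 2.0 + rng.normal(0, 2.0, size=100)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(x_train, y_train)

# score() returns R^2 for regressors; round to two decimals for readability
r2 = round(model.score(x_test, y_test), 2)

# The same value computed from the predictions directly
r2_from_predictions = round(r2_score(y_test, model.predict(x_test)), 2)

print(r2, r2_from_predictions)
```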