# Machine Learning Theory

This section of my notes will introduce some basic theory that will be required to understand machine learning fundamentals as well as assist with any future projects.

## Table of Contents

 1. [Types of Systems](#Types-of-Systems)
    1. [Supervised / Unsupervised Learning](#Supervised-/-Unsupervised-Learning)
    2. [Batch and Online Learning](#Batch-and-Online-Learning)
    3. [Instance Based and Model Based Learning](#Instance-Based-and-Model-Based-Learning)
 2. [Main Challenges of Machine Learning](#Main-Challenges-of-Machine-Learning)
 3. [Testing and Validating](#Testing-and-Validating)
 4. [Hyperparameter & Model Selection](#Hyperparameter-&-Model-Selection)
 5. [Data Mismatch](#Data-Mismatch)
 6. [Key Steps in Data Science Projects](#Key-Steps-in-Data-Science-Projects)
    1. [Frame the Problem](#Frame-the-Problem)
        1. [Pipelines](#Pipelines)
    2. [Select a Performance Measure](#Select-a-Performance-Measure)
    3. [Check the Assumptions](#Check-the-Assumptions)
 7. [Data Considerations](#Data-Considerations)
 8. [Preparing the Data](#Preparing-the-Data)
    1. [Data Cleaning](#Data-Cleaning)
    2. [Handling Text and Categorical Variables](#Handling-Text-and-Categorical-Variables)
    3. [Feature Scaling](#Feature-Scaling)
    4. [Transformation Pipelines](#Transformation-Pipelines)
 9. [Train and Select the Model](#Train-and-Select-the-Model)
    1. [Better Evaluation Using Cross Validation](#Better-Evaluation-Using-Cross-Validation)
    2. [Saving Models](#Saving-Models)
 10. [Fine Tune Your Model](#Fine-Tune-Your-Model)
    1. [Grid Search](#Grid-Search)
    2. [Randomized Search](#Randomized-Search)
    3. [Ensemble Methods](#Ensemble-Methods)
    4. [Analyze the best Models and their Errors](#Analyze-the-best-Models-and-their-Errors)
 11. [Launch, Monitor and Maintain your System](#Launch,-Monitor-and-Maintain-your-System)

## Types of Systems

### Supervised / Unsupervised Learning

This is the main classification based on the type and amount of supervision the model gets during training.

There are four major categories:

 * Supervised
    * The training set includes the desired solutions.
 * Unsupervised
    * The training set is unlabeled and does not include the desired solutions; the model tries to learn without a teacher.
 * Semisupervised
    * Since labelling is time-consuming and costly, you will often have plenty of unlabeled instances and only a few labeled ones. Semisupervised models can handle this kind of partially labelled data.
 * Reinforcement Learning
    * The system teaches itself, learning from the rewards and penalties it receives for good and bad outcomes.

### Batch and Online Learning

The next criterion used to classify machine learning systems is whether the system can learn incrementally from a stream of incoming data.

 * Batch Learning
    * The system is incapable of learning incrementally and is trained using all of the available data. This is usually done offline as it takes a lot of time.
 * Online Learning
    * The system is trained incrementally by feeding it data instances sequentially, either individually or in small groups called mini-batches. Each of these learning steps is usually fast.
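
The online learning idea above can be sketched with scikit-learn's `partial_fit` method. This is a minimal illustration rather than a recommended setup: the synthetic data, the batch size of 50 and the constant learning rate are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Synthetic stream (assumption for illustration): y = 3x + 2 plus a little noise.
rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(1000, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.1, size=1000)

# A constant learning rate keeps the example short; it is not a tuned choice.
model = SGDRegressor(learning_rate="constant", eta0=0.1, random_state=42)

# Feed the data in mini-batches of 50 instances, as if it arrived over time.
batch_size = 50
for start in range(0, len(X), batch_size):
    X_batch = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]
    model.partial_fit(X_batch, y_batch)  # one incremental learning step

learned_slope = float(model.coef_[0])
learned_intercept = float(np.ravel(model.intercept_)[0])
```

Each call to `partial_fit` only sees one mini-batch, so the full dataset never has to be held in memory at once.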

### Instance Based and Model Based Learning

The next categorization is how the models generalize.

 * Instance Based learning
    * This can be thought of as learning the examples by heart and then generalizing to new cases by comparing them to the learned examples using a similarity measure.
 * Model-Based Learning
    * This is when we use a set of examples to build a model and use this model to make predictions.
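
As a rough illustration of the difference, the sketch below fits a k-nearest-neighbours regressor (instance based) and a linear regression (model based) to the same toy data. The data and the choice of these two estimators are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Toy data (assumption for illustration): y = 2x.
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
X_new = np.array([[10.0]])

# Instance based: the prediction is the average target of the 3 most similar
# training examples (x = 3, 4, 5), so it cannot extrapolate beyond them.
knn = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)
knn_prediction = knn.predict(X_new)[0]

# Model based: a line is fitted to the examples and then used to predict.
linear = LinearRegression().fit(X_train, y_train)
linear_prediction = linear.predict(X_new)[0]
```

Here the instance-based prediction stays at the average of its nearest neighbours, while the fitted line extends the trend to the new point.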

## Main Challenges of Machine Learning

 * Insufficient Quantity of Training Data
 * Nonrepresentative Training Data
 * Poor-Quality Data
 * Irrelevant Features
 * Overfitting the training data
    * This is when the model is too complex relative to the amount and noisiness of the training data.
    * This can be fixed in the following ways:
        * Simplify the model
        * Gather more training data
        * Reduce the noise in the training data
        * Apply a constraint to the model using a hyperparameter (regularization).
 * Underfitting the training data

## Testing and Validating

The only way to know a model will generalize to new cases is to actually try it out on new cases.

The way to do this is to split the data into a training set and a test set: train the model on the training set and use the test set to estimate the generalization error (or out-of-sample error).

If the training error is low but the generalization error is high, the model is overfitting the training data.
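
A minimal sketch of this check, assuming synthetic noisy data and an unconstrained decision tree (chosen only because it overfits readily):

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (assumption for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=300)

# Hold back 20% of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# An unconstrained tree memorises the training set, noise included.
model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_error = mean_squared_error(y_train, model.predict(X_train))
test_error = mean_squared_error(y_test, model.predict(X_test))
```

A training error close to zero next to a clearly larger test error is the overfitting signature described above.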

## Hyperparameter & Model Selection

Evaluating a single model is simple enough with a training set and a test set. The harder case is choosing between two different models.
One option is to train both and compare how well they generalize using the test set.

We may also need to choose a value for a hyperparameter of one of the models. One method is to train several models, each with a different value of the hyperparameter.
Suppose the chosen value produces the lowest generalization error, say 5%, but when the model is put into production the error rate is 15%. This happens because the generalization error was measured
many times on the same test set, so the model and its hyperparameters were adapted to perform well on that particular set.

A solution to this is called holdout validation.

This is when we hold out part of the training set to evaluate several candidate models and select the best one. The new
held-out set is called the validation set (or development set). In essence, you train multiple models with different hyperparameters on the reduced
training set (the training set minus the validation set) and select the one that performs best on the validation set. After this holdout validation process you train the best model on
the full training set (including the validation set) and evaluate it on the test set to get an estimate of the generalization error.

For this approach the validation set must be large enough for the model evaluations to be precise, but not so large that the reduced
training set becomes much smaller than the full training set.
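
The holdout validation steps above can be sketched as follows. The synthetic data, the `Ridge` model and the candidate `alpha` values are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumption for illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=500)

# Split off the test set first, then hold out a validation set from what is left.
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=1
)

# Train one candidate per hyperparameter value on the reduced training set.
candidate_alphas = [0.01, 1.0, 100.0]
val_errors = {}
for alpha in candidate_alphas:
    candidate = Ridge(alpha=alpha).fit(X_train, y_train)
    val_errors[alpha] = mean_squared_error(y_val, candidate.predict(X_val))

# Pick the best candidate, retrain it on the full training set,
# and only then touch the test set, once.
best_alpha = min(val_errors, key=val_errors.get)
final_model = Ridge(alpha=best_alpha).fit(X_train_full, y_train_full)
test_error = mean_squared_error(y_test, final_model.predict(X_test))
```

Because the test set plays no part in choosing `best_alpha`, `test_error` remains an honest estimate of the generalization error.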


## Data Mismatch

In some cases it is easy to get a large set of data for modelling but the data is not representative of the final data that will be used in production.

This can be addressed by ensuring that the training data and, above all, the test data are representative of the data you expect in the production phase of the project.

## Key Steps in Data Science Projects

### Frame the Problem

The first thing to understand is the exact business objective that the machine learning model is meant to serve. This is key because it determines how we frame the problem,
which algorithms we select and which performance measures we use to evaluate the model.

The next question to ask the boss is what the current solution looks like. This gives a reference point for performance as well as insights
on how to solve the problem.

Once we understand the problem, we can classify it along the three groupings described above (type of supervision, batch or online, instance based or model based). If we need batch learning and have a large dataset, we can split the work
across several servers using the *MapReduce* technique, or use an online learning technique instead.

#### Pipelines

A sequence of data processing components is called a data pipeline. Pipelines are common in machine learning because there is a lot of data to manipulate and many transformations to apply.

### Select a Performance Measure

There are many different performance measures available for the various types of models.

### Check the Assumptions

It is always important to list and verify the assumptions that have been made so far about the model. This can catch issues that would otherwise affect the model later on.

An example would be an assumption about the type of output the model should produce (say, a numerical value rather than a category) when that output feeds into another process further down the pipeline.

## Data Considerations

Once we have received the data that will be used in the machine learning model, we want to get a full understanding of the variables. This can be done by plotting histograms and a
scatter matrix and by computing descriptive statistics. These views can help reveal interesting properties of the data. The following may be key concerns:

 * Any outliers that we can see within the data set
 * Whether the data has restrictions applied (for instance, income can't be higher than R50k, so we see a flat line at the top of our income scatter plot).

We should explore only the training set. The model is built from this data, and it is better to overfit here and have the test set flag it than to explore the full dataset and never realise that the model is overfitting
and not applicable to real-world usage.

The other consideration is whether the variables we receive make sense for what we are trying to estimate. For instance, if we are trying to estimate the price of a house and we have been given
the average house price in an area and the total rooms in that area, it may make more sense to turn the total rooms into a "total rooms per house" figure for a more relevant calculation.
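
A small sketch of such a derived variable using pandas. The column names and figures are hypothetical, and the example assumes that the number of households per area is also available, which is what makes the per-house ratio computable.

```python
import pandas as pd

# Hypothetical area-level data; column names and values are made up.
housing = pd.DataFrame({
    "total_rooms": [880, 7099, 1467],
    "households": [126, 1138, 177],
    "average_house_price": [452600, 358500, 352100],
})

# Turn an area-wide total into a per-house figure.
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
```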

## Preparing the Data

It can be very useful to write functions for any transformations that are required, as these can be reused later on and will save time rewriting the transformations.

### Data Cleaning

Most machine learning techniques can't work with missing features, so we need to handle missing values before we start any other kind of analysis. This can be done in one of three ways:

 * Remove the rows that contain missing values
 * Get rid of the whole variable
 * Set the missing values to a certain value (0, mean, median, etc.)

If you use the third option with a calculated value, it is important to save the value calculated on the training set so that you can use it to fill missing values in the test set (that is, reuse the same value to replace the NA rather than calculating a new one).
This value will also be used to replace missing values once the model is in production.

*Scikit-Learn* provides a handy class to take care of missing values: **SimpleImputer** in the **sklearn.impute** module. With a numerical strategy such as the mean or median, the imputer calculates the imputation value for every numerical variable when it is fitted and uses these
values to fill in any missing values it comes across when transforming data.
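
A minimal sketch of this workflow, assuming a median strategy and tiny made-up arrays:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up numerical data with missing values (np.nan).
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [np.nan, 50.0]])
X_test = np.array([[np.nan, 20.0], [4.0, np.nan]])

imputer = SimpleImputer(strategy="median")

# fit_transform learns one median per column from the training set only.
X_train_filled = imputer.fit_transform(X_train)

# transform reuses the stored training medians; nothing is recalculated.
X_test_filled = imputer.transform(X_test)

learned_medians = imputer.statistics_
```

The fitted values are kept in `statistics_`, so the same imputer object can be reused on production data.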

### Handling Text and Categorical Variables

Most machine learning techniques cannot work with text or categorical data in string form and require these variables to be converted into numerical values. **sklearn** provides a class for this called
**OrdinalEncoder** in the **sklearn.preprocessing** module. Once the encoder has been fitted, the encoder instance has a **categories_** attribute holding the list of categories.
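
A short sketch of **OrdinalEncoder** on a made-up ordinal variable. Passing the category order explicitly is a choice made for this example; left to itself the encoder sorts the categories, which would not respect the small/medium/large ordering.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Made-up ordinal variable: one column of sizes.
sizes = np.array([["small"], ["large"], ["medium"], ["small"]])

# Give the meaningful order explicitly (one list per column).
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
encoded = encoder.fit_transform(sizes)

# The categories for each column are stored on the fitted encoder.
learned_categories = list(encoder.categories_[0])
```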

When dealing with these categorical variables we need to be aware of which are ordinal and which are not.

If a variable is not ordinal then we need to split it out into dummy variables, one per category. This is called *one-hot encoding*. Again **sklearn** offers a class that can assist with this, called **OneHotEncoder**, in the **sklearn.preprocessing** module.
By default **OneHotEncoder** outputs a sparse matrix for memory reasons. To convert this to a dense array we can use the **.toarray()** method, and the list of categories is stored in **.categories_**.
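
A minimal sketch of one-hot encoding on a made-up nominal variable:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up nominal variable: colours have no natural order.
colours = np.array([["red"], ["green"], ["blue"], ["green"]])

encoder = OneHotEncoder()
encoded_sparse = encoder.fit_transform(colours)  # sparse matrix by default

# Convert to a dense array for inspection; one column per category.
encoded_dense = encoded_sparse.toarray()
learned_categories = list(encoder.categories_[0])
```

The columns of `encoded_dense` follow the order of `learned_categories`, which the encoder sorts alphabetically here.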

### Feature Scaling

One of the most important transformations you will need to apply is *feature scaling*. With most machine learning models
the algorithm does not work well when the features have very different numerical scales. Scaling the
target values is generally not needed.

There are two common ways to get the attributes onto the same scale: *min-max scaling* and *standardization*.

*Min-max scaling* (normalization) is the simplest: the values are shifted and rescaled so that they range between 0 and 1. This is done by subtracting the minimum value and dividing by the maximum minus the minimum.
**sklearn** provides a transformer for this called **MinMaxScaler**, with a **feature_range** hyperparameter that allows us to select the final range of the values.

*Standardization* is different: it subtracts the mean and divides by the standard deviation, so the resulting values have zero mean and unit variance. It does not change the shape of the distribution, so the values do not become normally distributed unless they already were. Unlike min-max scaling, the values are
not restricted to a range (which can be a problem for some algorithms), but standardization is much less affected by outliers. **sklearn** provides the **StandardScaler** class
in **sklearn.preprocessing** to perform this transformation.

As with all transformations, it is important to fit the scalers on the training set only, not the full dataset,
and then use the fitted scalers to transform the rest of the data.
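
Both scalers can be sketched on a tiny made-up feature. Note how the test values are transformed with the statistics learned from the training set, so they can fall outside the 0 to 1 range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up single feature.
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_test = np.array([[3.0], [6.0]])

# Min-max scaling: (x - min) / (max - min), learned from the training set.
min_max = MinMaxScaler(feature_range=(0, 1))
X_train_minmax = min_max.fit_transform(X_train)
X_test_minmax = min_max.transform(X_test)

# Standardization: (x - mean) / standard deviation, learned from the training set.
standard = StandardScaler()
X_train_std = standard.fit_transform(X_train)
X_test_std = standard.transform(X_test)
```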

### Transformation Pipelines

As there are usually many data transformation steps that need to be performed in the right order, it helps to define them as an explicit sequence.
Fortunately **sklearn** provides a **Pipeline** class to help with such sequences of transformations.

This class accepts a list of name/estimator pairs defining a sequence of steps. All but the last estimator must be
transformers (that is, they must have a **fit_transform()** method). The names can be anything we like as long as they are unique and don't
contain double underscores. When we call the pipeline's **fit()** method, it calls **fit_transform()** sequentially on all of the
transformers, passing the output of each step as the input to the next, and then calls **fit()** on the final estimator.
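
A minimal sketch of such a pipeline. The step names, the toy data and the choice of a linear regression as the final estimator are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with one missing value (assumption for illustration): y = 2x.
X_train = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0]])
y_train = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Name/estimator pairs; every step except the last is a transformer.
pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])

# fit runs fit_transform on each transformer in order, then fits the regressor.
pipeline.fit(X_train, y_train)

# predict pushes new data through the same fitted transformations.
predictions = pipeline.predict(np.array([[3.0]]))
```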

## Train and Select the Model

This is the point at which we finally take our transformed training data and use it to train initial models, which we then
evaluate to see how well they perform before using one in production.

### Better Evaluation Using Cross Validation

As discussed before, we can use a validation set to determine how well a model performs rather than relying on the test set as the
only measure of performance. **sklearn** supplies a convenient function for this: **cross_val_score** in **sklearn.model_selection**. With **cv=10** it splits the training set into 10
distinct subsets called folds, then trains and evaluates the model 10 times, picking a different fold for evaluation each time
and training on the other 9 folds. The result is an array containing the 10 evaluation scores.

An important note is that cross-validation expects a utility function (the higher the score the better) and not a cost function (the lower the score the better),
so if a cost function is used for evaluation we need to negate it, for example by using the **neg_mean_squared_error** scoring option.

The main downside of cross-validation is that it takes considerably longer when we are working with big datasets,
as it trains the model multiple times.
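
A short sketch of 10-fold cross-validation with a negated cost function, assuming synthetic data and a plain linear regression:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (assumption for illustration).
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

# The scores are negative MSE values, so that higher still means better.
scores = cross_val_score(
    LinearRegression(), X, y, scoring="neg_mean_squared_error", cv=10
)

# Flip the sign back before taking the square root to get an RMSE per fold.
rmse_per_fold = np.sqrt(-scores)
```

Looking at the mean and the spread of `rmse_per_fold` gives both an estimate of performance and a sense of how precise that estimate is.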

### Saving Models

Even if the model you have just trained is not the best performer, it should be saved so that you can return to it later.
This can be done with either Python's **pickle** module or the **joblib** library.
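
A minimal sketch of saving and reloading a model with **joblib**. The file name is hypothetical, and a temporary directory is used only so that the example cleans up after itself.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model (assumption for illustration): y = 2x.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "linear_model.pkl")  # hypothetical file name
    joblib.dump(model, model_path)           # save the fitted model to disk
    restored_model = joblib.load(model_path)  # load it back

restored_prediction = restored_model.predict(np.array([[4.0]]))[0]
```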

## Fine Tune Your Model

Once we have a few models that we would like to apply to the data it is time to start fine tuning the models.

### Grid Search

One option is to fiddle with the hyperparameters manually until you find the ideal combination to use in the model, but this can be tedious and you may not explore all of the combinations.

Instead we can use a grid search, which evaluates every combination of the hyperparameter values we supply in order to find the best one.

**sklearn** offers a class that performs the grid search over the hyperparameter values entered, using cross-validation to evaluate each combination: **GridSearchCV** in **sklearn.model_selection**.
The hyperparameters are passed in as a dictionary (the **param_grid** argument).
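
A small sketch of a grid search, assuming synthetic data, a `Ridge` model and an arbitrary grid of values:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data (assumption for illustration).
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, 2.0, -1.0, 0.0]) + rng.normal(scale=0.5, size=200)

# Dictionary of hyperparameter names to candidate values: 3 x 2 = 6 combinations.
param_grid = {"alpha": [0.1, 1.0, 10.0], "fit_intercept": [True, False]}

grid_search = GridSearchCV(
    Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error"
)
grid_search.fit(X, y)

best_params = grid_search.best_params_
best_model = grid_search.best_estimator_  # refitted on all of X, y by default
```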

### Randomized Search

The grid search approach is fine when we are exploring relatively few combinations, but when the hyperparameter search space is very large a randomized search is often preferable.
It uses the same idea as *Grid Search*, but instead of trying every combination it evaluates a fixed number of randomly sampled hyperparameter values and selects the best-fitting model.

This can be done in **SKlearn** using the **RandomizedSearchCV** from **sklearn.model_selection**.
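
A comparable sketch for a randomized search. The log-uniform distribution for `alpha`, the 10 iterations and the synthetic data are assumptions made for the example.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data (assumption for illustration).
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, 2.0, -1.0, 0.0]) + rng.normal(scale=0.5, size=200)

# Instead of a fixed list, give a distribution to sample values from.
param_distributions = {"alpha": loguniform(1e-3, 1e2)}

random_search = RandomizedSearchCV(
    Ridge(),
    param_distributions,
    n_iter=10,  # number of random combinations to try
    cv=5,
    scoring="neg_mean_squared_error",
    random_state=3,
)
random_search.fit(X, y)

best_alpha = random_search.best_params_["alpha"]
```

The `n_iter` argument gives direct control over the computing budget, which is the main practical advantage over an exhaustive grid.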

### Ensemble Methods

Another way to fine-tune is to try combining the models that perform best.
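
One possible way to combine models, sketched here with scikit-learn's **VotingRegressor**, which averages the predictions of its members. The choice of this particular ensemble and of the two member models is an assumption; the notes do not prescribe a specific method.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption for illustration).
rng = np.random.default_rng(11)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

# Combine two different models; the ensemble prediction is their average.
ensemble = VotingRegressor(estimators=[
    ("linear", LinearRegression()),
    ("forest", RandomForestRegressor(n_estimators=20, random_state=0)),
])
ensemble.fit(X, y)

ensemble_predictions = ensemble.predict(X[:5])
member_predictions = [member.predict(X[:5]) for member in ensemble.estimators_]
```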

### Analyze the best Models and their Errors

We will often gain good insights into the problem by inspecting the best models. For example, we can look at the importance score of each attribute to identify variables that could be excluded.
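
A sketch of inspecting importance scores, assuming a random forest (which exposes **feature_importances_**) and synthetic data in which one feature carries most of the signal. The feature names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumption): the third feature has no effect on the target.
rng = np.random.default_rng(5)
X = rng.normal(size=(300, 3))
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)
feature_names = ["strong_signal", "weak_signal", "noise"]  # hypothetical names

forest = RandomForestRegressor(n_estimators=50, random_state=5).fit(X, y)

# Pair each importance score with its feature name, most important first.
ranked = sorted(zip(forest.feature_importances_, feature_names), reverse=True)
```

Features that end up at the bottom of `ranked` are candidates for removal.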

## Launch, Monitor and Maintain your System

Now that we have settled on a model we can get it ready for production. One of the first steps is to save the model for use in production. This can be done using the **joblib** library, and the saved object should include the full preprocessing and
prediction pipeline.