# Machine Learning Basics

## What is Machine Learning?

Machine learning is the field of science that studies algorithms that approximate functions increasingly well as they are given more observations. 

Machine learning algorithms are programs that learn from data and improve with experience, without being explicitly programmed with task-specific rules. Learning tasks may include learning the function that maps the input to the output; learning the hidden structure in unlabeled data; or 'instance-based learning', where a class label is produced for a new instance (row) by comparing it to instances from the training data that are stored in memory.
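Instance-based learning can be illustrated with a nearest-neighbor lookup. The sketch below is a minimal illustration, not a production classifier: the function name, the toy rows, and the labels are all made up for the example, and Euclidean distance is an assumed similarity measure.

```python
import math

def nearest_neighbor_label(train_rows, train_labels, new_row):
    """Return the label of the stored training row closest to new_row."""
    # Compare the new instance to every instance kept in memory.
    distances = [math.dist(row, new_row) for row in train_rows]
    closest = distances.index(min(distances))
    return train_labels[closest]

# Hypothetical training data: two features per row, one label per row.
train_rows = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.5, 4.5)]
train_labels = ["small", "small", "large", "large"]

prediction = nearest_neighbor_label(train_rows, train_labels, (4.8, 5.1))
print(prediction)
```

Nothing is "trained" here in the usual sense: the stored rows are the model, and all the work happens at prediction time.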


## Types of machine learning algorithms

### Supervised learning

It is useful in cases where a label is available for a certain training set, but is missing and needs to be predicted for other instances. It uses the labeled training data to learn the mapping function that turns input variables (X) into the output variable (Y).

Labeled data is data that includes the value of the target variable for each instance.

Types of supervised machine learning algorithms:

- Classification is used to predict the outcome of a given sample when the output variable is in the form of categories.

- Regression is used to predict the outcome of a given sample when the output variable is in the form of real values.

- Ensembling is another type of supervised learning. It combines the predictions of multiple machine learning models that are individually weak to produce a more accurate prediction on a new sample. The results of the individual learners are combined by voting (for classification) or averaging (for regression). The idea is that an ensemble of learners performs better than a single learner. Bagging and boosting are two types of ensembling algorithms.
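The two ways of combining ensemble members, voting and averaging, can be sketched in a few lines. The helper names and the example predictions below are hypothetical; in practice the inputs would come from separately trained models.

```python
from collections import Counter
from statistics import mean

def majority_vote(predicted_labels):
    """Classification: return the label predicted by the most learners."""
    return Counter(predicted_labels).most_common(1)[0][0]

def average_prediction(predicted_values):
    """Regression: return the mean of the learners' numeric predictions."""
    return mean(predicted_values)

# Hypothetical outputs of three classifiers for one sample.
combined_label = majority_vote(["spam", "ham", "spam"])

# Hypothetical outputs of three regressors for one sample.
combined_value = average_prediction([10.0, 12.0, 14.0])

print(combined_label, combined_value)
```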

**What is the difference between bagging and boosting?**

Bagging and boosting are both ensemble methods, meaning they combine many weak predictors to create a strong predictor. One key difference is that bagging builds independent models in parallel, whereas boosting builds models sequentially, at each step emphasizing the observations that were missed in previous steps.
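The difference shows up in how each method prepares the data for the next model. The sketch below only illustrates that step, under some assumptions: bagging is shown as bootstrap resampling, and boosting is shown with an AdaBoost-style weight update, which is one common scheme among several. The function names are hypothetical.

```python
import math
import random

def bootstrap_sample(data, rng):
    """Bagging: draw a same-sized sample with replacement.

    Each model gets its own sample, so models can be trained independently.
    """
    return [rng.choice(data) for _ in data]

def reweight(weights, was_misclassified):
    """Boosting (AdaBoost-style): raise the weight of missed observations.

    Assumes the weighted error is strictly between 0 and 1.
    """
    error = sum(w for w, wrong in zip(weights, was_misclassified) if wrong)
    alpha = 0.5 * math.log((1 - error) / error)
    updated = [
        w * math.exp(alpha if wrong else -alpha)
        for w, wrong in zip(weights, was_misclassified)
    ]
    total = sum(updated)
    return [w / total for w in updated]

rng = random.Random(0)
data = list(range(10))
# Bagging: three independent samples, one per model.
bags = [bootstrap_sample(data, rng) for _ in range(3)]

# Boosting: the previous model got only the last observation wrong,
# so the next model will see that observation with a larger weight.
weights = [0.25, 0.25, 0.25, 0.25]
new_weights = reweight(weights, [False, False, False, True])
print(new_weights)
```

After the update the single missed observation carries half of the total weight, which is what pushes the next model in the sequence to focus on it.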


Some examples of supervised learning algorithms are:

1. Decision Trees

2. Naive Bayes Classification

3. Ordinary Least Squares Regression

4. Logistic Regression

5. Support Vector Machines
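As an illustration of learning a mapping from X to Y, here is a minimal sketch of ordinary least squares for a single input variable, using the closed-form slope and intercept. The function name and the toy data are assumptions made for the example.

```python
from statistics import mean

def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Labeled training data: inputs (X) with known outputs (Y), here y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)

# Predict the output for an input that has no label.
predicted = slope * 5.0 + intercept
print(slope, intercept, predicted)
```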


### Unsupervised learning

It is useful in cases where the challenge is to discover implicit relationships in a given unlabeled dataset. In other words, we only have the input variables (X) and no corresponding output variables. 

Types of unsupervised learning:

- Association is used to discover the probability of the co-occurrence of items in a collection. For example, an association model might be used to discover that if a customer purchases bread, s/he is 80% likely to also purchase eggs.

- Clustering is used to group samples such that objects within the same cluster are more similar to each other than to the objects from another cluster.

- Dimensionality Reduction is used to reduce the number of variables of a data set while ensuring that important information is still conveyed. Dimensionality Reduction can be done using Feature Extraction methods and Feature Selection methods.
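The bread-and-eggs association example above can be made concrete by estimating how often eggs appear in baskets that contain bread. This is a minimal sketch: the function name is hypothetical and the transactions are invented so that the estimate matches the 80% figure in the text.

```python
def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent in basket | antecedent in basket)."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    with_both = [t for t in with_antecedent if consequent in t]
    return len(with_both) / len(with_antecedent)

# Invented purchase data: each set is one customer's basket.
transactions = [
    {"bread", "eggs"},
    {"bread", "eggs", "milk"},
    {"bread", "milk"},
    {"bread", "eggs"},
    {"bread", "eggs", "butter"},
    {"milk"},
]

bread_to_eggs = confidence(transactions, "bread", "eggs")
print(bread_to_eggs)
```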

Some examples of unsupervised learning algorithms are:

1. K-means

2. PCA
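To give a feel for how K-means groups unlabeled samples, here is a minimal one-dimensional sketch. It assumes fixed starting centroids and a fixed number of iterations, which a real implementation would usually replace with randomized initialization and a convergence check. The function name and the data are hypothetical.

```python
def k_means_1d(points, centroids, iterations=10):
    """Alternate between assigning points to centroids and moving centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)), key=lambda i: abs(point - centroids[i])
            )
            clusters[nearest].append(point)
        # Update step: each centroid moves to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Unlabeled data with two visible groups.
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
print(centroids, clusters)
```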


### Reinforcement learning

It falls between these two extremes (supervised and unsupervised learning): the algorithm is given no labeled examples, but it does receive feedback from the outcome of each decision.
Reinforcement algorithms usually learn optimal actions through trial and error.

For example, a robot could use reinforcement learning to learn that walking forward into a wall is bad, but that turning away from a wall and walking is good. Or imagine a video game in which the player needs to move to certain places at certain times to earn points. A reinforcement algorithm playing that game would start by moving randomly but, over time and through trial and error, it would learn where and when to move the in-game character to maximize its point total.
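The trial-and-error idea can be sketched with tabular Q-learning, one common reinforcement learning algorithm (the text does not name a specific one, so this choice is an assumption). The environment is a made-up five-cell corridor in which reaching the right-most cell earns a point; all names, constants, and hyperparameters are illustrative.

```python
import random

N_STATES = 5        # corridor cells 0..4; reaching cell 4 earns the point
ACTIONS = (-1, +1)  # index 0 = step left, index 1 = step right

def step(state, action_index):
    """Move one cell, staying inside the corridor; reward 1 at the goal."""
    next_state = min(max(state + ACTIONS[action_index], 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def choose_action(q_row, rng, epsilon):
    """Explore at random sometimes (and on ties); otherwise act greedily."""
    if rng.random() < epsilon or q_row[0] == q_row[1]:
        return rng.randrange(len(ACTIONS))
    return 0 if q_row[0] > q_row[1] else 1

def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # one value per (state, action)
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            action = choose_action(q[state], rng, epsilon)
            next_state, reward = step(state, action)
            # Learn from the outcome of this decision.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q_table = train()
# Best known action in every non-goal cell.
greedy_policy = ["left" if row[0] > row[1] else "right" for row in q_table[:-1]]
print(greedy_policy)
```

Early episodes are essentially random walks because every action value starts at zero; as rewards propagate back through the table, the greedy choice in each cell settles on moving toward the goal.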


## Online vs offline learning

While online learning has its uses, machine learning is traditionally performed offline using the batch learning method.

In batch learning, data is accumulated over a period of time. The machine learning model is then trained with this accumulated data from time to time in batches. If new data comes in, an entire new batch (including all the old and new data) must be fed into the algorithm to learn from the new data. In batch learning, the machine learning algorithm updates its parameters only after consuming batches of new data.

Batch learning is the direct opposite of online learning because the model is unable to learn incrementally from a stream of live data. Online learning refers to updating a model incrementally as new data arrives.
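The contrast can be sketched with a one-parameter model, y = w * x. In the batch version, new data means refitting on everything accumulated so far; in the online version, each observation nudges the current parameter. This is an illustrative sketch: the function names, the learning rate, and the data stream are assumptions.

```python
def fit_batch(xs, ys):
    """Batch learning: recompute w from the full accumulated dataset."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def update_online(w, x, y, learning_rate=0.05):
    """Online learning: adjust the current w using one new observation."""
    return w + learning_rate * (y - w * x) * x

# Batch: train on the accumulated data.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
w_batch = fit_batch(xs, ys)

# New data arrives: the whole batch (old and new) is fed in again.
xs.append(4.0)
ys.append(8.0)
w_batch = fit_batch(xs, ys)

# Online: observations arrive one at a time and are never stored.
stream = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0, 4.0] * 10]
w_online = 0.0
for x, y in stream:
    w_online = update_online(w_online, x, y)

print(w_batch, w_online)
```

Both approaches end up near the same slope here; the difference is in cost and memory, since the online update never needs to revisit old observations.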
 


### How is data divided?

What is training data and what is it used for?

Training data is a set of examples that will be used to train the machine learning model.
For supervised machine learning, the training data must be labeled: what you are trying to predict must be defined for each example.
For unsupervised machine learning, the training data contains only features and no labeled targets: what you are trying to predict is not defined.

What is a validation set and why use one?

A validation set is a set of data that is used to evaluate a model's performance during training and model selection. After models are trained, they are evaluated on the validation set to select the best model.

It must never be used for training the model directly.

It must also not be used as the test set, because model selection has been biased toward working well on this data, even though the model was not directly trained on it.

What is a test set and why use one?

A test set is a set of data not used during training or validation. The model's performance is evaluated on the test set to predict how well it will generalize to new data.

### Training and validation techniques

**Train-test-split**
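A split holds out part of the data before any training happens. The sketch below extends a plain train-test split to the three sets described earlier (training, validation, and test). The function name, the 60/20/20 proportions, and the fixed seed are assumptions chosen for illustration.

```python
import random

def train_validation_test_split(rows, validation_fraction=0.2,
                                test_fraction=0.2, seed=0):
    """Shuffle the rows, then slice them into three disjoint sets."""
    shuffled = rows[:]  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_validation = round(len(shuffled) * validation_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return train, validation, test

rows = list(range(100))  # stand-in for 100 examples
train, validation, test = train_validation_test_split(rows)
print(len(train), len(validation), len(test))
```

Shuffling before slicing matters when the rows are stored in a meaningful order (for example, sorted by label); for time-ordered data a chronological split is usually more appropriate.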

**Cross Validation**

Cross validation is a technique for training and validating models more reliably. It rotates which data is held out from model training to be used as the validation data.

Several models are trained and evaluated, with every piece of data being held out from one model. The average performance of all the models is then calculated.

It is a more reliable way to validate models but is more computationally costly. For example, 5-fold cross validation requires training and validating 5 models instead of 1.
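The rotation can be sketched as a generator of index splits, with one model trained and scored per fold. The example below is a minimal illustration: the helper name is hypothetical, the folds are taken in order without shuffling, and the "model" is just the mean of the training targets, used only to keep the example short.

```python
from statistics import mean

def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

# Toy targets; a real dataset would also have input features.
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]

fold_errors = []
for train_idx, val_idx in k_fold_indices(len(ys), k=5):
    # "Train" a mean-only model on the training folds.
    prediction = mean(ys[i] for i in train_idx)
    # Score it on the held-out fold (mean absolute error).
    fold_errors.append(mean(abs(ys[i] - prediction) for i in val_idx))

# The cross-validated estimate is the average over all folds.
average_error = mean(fold_errors)
print(fold_errors, average_error)
```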

**What is overfitting?**

Overfitting occurs when a model makes much better predictions on known data (data included in the training set) than on unknown data (data not included in the training set).

How can we combat overfitting?

A few ways of combatting overfitting are:

- simplify the model (often done by changing)

- select a different model

- use more training data

- gather better-quality data

How can we tell if our model is overfitting the data?

If our training error is low and our validation error is high, then our model is most likely overfitting our training data.

How can we tell if our model is underfitting the data?

If our training and validation errors are roughly equal and both very high, then our model is most likely underfitting our training data.
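These two rules of thumb can be written down as a small helper. This is only a sketch: the function name is hypothetical, and the thresholds for what counts as a "high" error or a "large" gap are arbitrary assumptions that depend on the problem and the error metric.

```python
def diagnose_fit(training_error, validation_error,
                 high_error=0.3, gap_tolerance=0.1):
    """Apply the overfitting/underfitting rules of thumb to two error values.

    high_error and gap_tolerance are illustrative thresholds, not standards.
    """
    gap = validation_error - training_error
    if training_error >= high_error and abs(gap) <= gap_tolerance:
        # Both errors are high and close together.
        return "likely underfitting"
    if gap > gap_tolerance:
        # Validation error is much higher than training error.
        return "likely overfitting"
    return "no obvious problem"

print(diagnose_fit(0.02, 0.35))
print(diagnose_fit(0.40, 0.42))
print(diagnose_fit(0.05, 0.07))
```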

Sources:
    
https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6

https://www.kdnuggets.com/2020/09/understanding-bias-variance-trade-off-3-minutes.html

https://medium.com/@ranjitmaity95/7-tactics-to-combat-imbalanced-classes-in-machine-learning-datase-4266029e2861

https://www.kdnuggets.com/2016/08/10-algorithms-machine-learning-engineers.html

https://www.dataquest.io/blog/top-10-machine-learning-algorithms-for-beginners/#:~:text=The%20first%205%20algorithms%20that,are%20examples%20of%20supervised%20learning.

https://www.qwak.com/post/online-vs-offline-machine-learning-whats-the-difference#:~:text=While%20online%20learning%20does%20have,using%20the%20batch%20learning%20method.&text=In%20batch%20learning%2C%20data%20is,time%20to%20time%20in%20batches.
