# MAAI Bootcamp :: ML Fundamentals

## Introduction

Here we recap the fundamental concepts of machine learning and the jargon you should know by heart. This section is based on chapter 1 of the book *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd Edition* (2022) by Aurélien Géron, from which many code snippets, explanations and illustrations were taken. Some changes were made to fit the subject within the MAAI bootcamp's timeframe. The book is a good source for a recap, and it is available in the [HvA Library](https://bib.hva.nl).

By the end of this section, you are expected to be able to:

- Describe what ML is
- Give examples of applications
- Recognise the different types of ML systems

## What is Machine Learning?

According to Stuart Russell and Peter Norvig in the book "Artificial Intelligence: A Modern Approach", Machine Learning is defined as:

> The branch of artificial intelligence that deals with the construction and study of systems that can learn from data

Some important definitions:

- An ML program learns from a set of examples called the *training set*. Each example in it is called a *training example* or *training instance*
- The part of the system that makes predictions is called a *model*. Neural networks and random forests are examples of models
- Metrics are used to evaluate the performance of a model. For example, the performance of a spam classifier can be measured by the fraction of emails it classifies correctly. This metric is called *accuracy*
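
To make the accuracy metric concrete, here is a minimal sketch in plain Python. The labels and predictions are made-up values for six hypothetical emails, and the `accuracy` helper is written for illustration only; it is not taken from the book.

```python
# Hypothetical labels for six emails: 1 = spam, 0 = ham (not spam).
true_labels = [1, 0, 0, 1, 0, 1]
# Hypothetical predictions made by some spam classifier for the same emails.
predicted_labels = [1, 0, 1, 1, 0, 0]


def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


score = accuracy(true_labels, predicted_labels)
print(f"accuracy = {score:.2f}")  # 4 of the 6 predictions are correct
```

Accuracy is easy to read but can be misleading when one class is much rarer than the other, which is often the case for spam or fraud.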

## Why Machine Learning?

Machine learning is great for:

- Problems for which existing solutions require a lot of fine-tuning or long lists of rules (a machine learning model can often simplify code and perform better than the traditional approach)
- Complex problems for which using a traditional approach yields no good solution (the best machine learning techniques can perhaps find a solution)
- Fluctuating environments (a machine learning system can easily be retrained on new data, always keeping it up to date)
- Getting insights about complex problems and large amounts of data

## Examples of Applications

**Discriminative AI**
* Image Classification: Recognizing objects in images (e.g., driver-assistance systems).
* Sentiment Analysis: Determining the sentiment (positive, negative, neutral) in a piece of text.
* Fraud Detection: Detecting fraudulent transactions in financial systems.
* Face Recognition: Identifying or verifying a person based on their facial features.
* Speech Recognition: Converting spoken language into text.
* Disease Diagnosis: Predicting diseases based on medical data, such as X-rays or MRIs.
* Recommendation Systems: Predicting user preferences and recommending products, movies, etc.

**Generative AI**
- Text Generation: Creating human-like text (e.g., GPT generating essays, articles, or code).
- Image/Video Generation: Producing images/videos from textual descriptions.
- Style Transfer: Applying artistic styles to images (e.g., turning a photo into a painting).
- Music Composition: Creating original music compositions based on certain styles or inputs.
- Data Augmentation: Generating additional training data, such as synthetic images for machine learning.
- 3D Model Generation: Creating 3D models from 2D images or sketches.
- Drug Discovery: Generating potential molecular structures for new drugs.
- Deepfake Creation: Producing realistic fake videos by swapping faces or altering voices.


## Types of Machine Learning Systems

ML systems can be classified by:
* How models are supervised during training
* Whether or not models can learn incrementally on the fly (online versus batch learning)
* Whether they work by simply comparing new data points to known data points, or instead by detecting patterns in the training data and building a predictive model, much like scientists do (instance-based versus model-based learning)
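
The difference between instance-based and model-based learning can be sketched on a tiny one-dimensional dataset. Everything below is an illustrative assumption: the data is made up, the instance-based learner is a 1-nearest-neighbour lookup, and the model-based learner is a single threshold placed halfway between the two class means.

```python
# Hypothetical 1-D training set: one feature value per example, label 0 or 1.
train_x = [1.0, 2.0, 3.0, 5.0, 9.0, 11.0]
train_y = [0, 0, 0, 1, 1, 1]


def predict_instance_based(x):
    """1-nearest-neighbour: copy the label of the closest stored example."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def fit_threshold(xs, ys):
    """Model-based: summarise the data as one threshold between the class means.

    Assumes class 1 has the larger mean feature value.
    """
    mean_0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean_1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean_0 + mean_1) / 2


# The model is learned once; afterwards only this number is needed.
threshold = fit_threshold(train_x, train_y)


def predict_model_based(x):
    """Predict class 1 when the feature value exceeds the learned threshold."""
    return 1 if x > threshold else 0


new_x = 4.6
print("instance-based:", predict_instance_based(new_x))  # closest example is 5.0
print("model-based:   ", predict_model_based(new_x))     # compares with the threshold
```

On this data the two approaches disagree for `new_x = 4.6`: the nearest stored example belongs to class 1, while the learned threshold (about 5.17) puts the point in class 0. The instance-based learner must keep all training examples around, whereas the model-based learner keeps only its parameters.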

### How models are supervised during training

#### Supervised learning

* Training data has labels, indicating the desired solution
* Supervised learning tasks include:
    * Classification: Sort input samples into categories, like a spam filter
    * Regression: Predict a target numerical value, like the price of a car, given a set of features, like mileage, age, brand, etc.

![Supervised Learning](support/images/mls3_0105.png)
<br><sub><sup>Figure: A labeled training set for spam classification (an example of supervised learning) - Source: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow</sup></sub>
<br>
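
As a rough illustration of classification with a labeled training set, the sketch below trains a deliberately naive word-count spam filter. The four emails, the `train` and `predict` helpers and the scoring rule are all hypothetical; a real spam filter would use a proper library and far more data.

```python
from collections import Counter

# Hypothetical labeled training set: (email text, label).
training_set = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]


def train(examples):
    """Count how often each word appears under each label."""
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in examples:
        counts[label].update(text.split())
    return counts


def predict(counts, text):
    """Pick the label whose training emails used the words of `text` more often."""
    words = text.split()
    spam_score = sum(counts["spam"][w] for w in words)
    ham_score = sum(counts["ham"][w] for w in words)
    # Ties (including emails with only unseen words) default to "ham".
    return "spam" if spam_score > ham_score else "ham"


model = train(training_set)
print(predict(model, "claim your free prize"))
print(predict(model, "agenda for the team meeting"))
```

The labels in `training_set` are what makes this supervised: the learner is told the desired answer for every training example.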

![Regression](support/images/mls3_0106.png)
<br><sub><sup>Figure: A regression problem: predict a value, given an input feature (there are usually multiple input features, and sometimes multiple output values) - Source: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow</sup></sub>
<br>
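
A regression task with a single input feature can be sketched with an ordinary least-squares line. The mileage and price figures below are invented for illustration, and the closed-form formulas for `slope` and `intercept` are the standard ones for simple linear regression.

```python
# Hypothetical used cars: mileage in 1000 km and price in 1000 EUR.
mileage = [20.0, 40.0, 60.0, 80.0, 100.0]
price = [18.0, 15.5, 13.2, 10.4, 8.1]

n = len(mileage)
mean_x = sum(mileage) / n
mean_y = sum(price) / n

# Ordinary least squares for one feature:
# slope = covariance(x, y) / variance(x), intercept = mean_y - slope * mean_x
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(mileage, price))
s_xx = sum((x - mean_x) ** 2 for x in mileage)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x


def predict_price(km_thousands):
    """Predict the price (in 1000 EUR) of a car from its mileage."""
    return intercept + slope * km_thousands


print(f"price = {intercept:.2f} + ({slope:.4f}) * mileage")
print(f"predicted price at 50,000 km: {predict_price(50.0):.2f}")
```

With more than one feature (age, brand, and so on) the same idea applies, but the fit is usually delegated to a library rather than written by hand.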

#### Unsupervised learning

Here, training data is unlabeled, and different approaches can be used.

![Unsupervised Learning](support/images/mls3_0107.png)
<br><sub><sup>Figure: An unlabeled training set for unsupervised learning - Source: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow</sup></sub>
<br>

##### Clustering algorithm (Sorts data into groups)

This family of algorithms is useful for discovering groups with similar characteristics.

![Clustering](support/images/mls3_0108.png)
<br><sub><sup>Figure: Clustering - Source: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow</sup></sub>
<br>
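
One well-known clustering algorithm is k-means. The sketch below is a bare-bones version in plain Python, assuming made-up 2-D points and hand-picked starting centroids; library implementations add smarter initialisation and convergence checks.

```python
import math

# Hypothetical unlabeled 2-D points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]


def k_means(points, centroids, n_iterations=10):
    """Alternate between assigning points to centroids and moving the centroids."""
    clusters = []
    for _ in range(n_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: math.dist(p, centroids[i])
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster
            else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Starting centroids are picked by hand here; real implementations choose them
# randomly or with a heuristic.
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

No labels are involved at any point: the groups emerge purely from the distances between the points.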

##### Dimensionality Reduction
- A way to simplify data without losing too much information
- Example: merge correlated features into one. For a car, mileage and age can be combined into a single feature representing wear and tear
- This is called *feature extraction*

![Dimensionality Reduction](support/images/mls3_0109.png)
<br><sub><sup>Figure: Example of a t-SNE visualization highlighting semantic clusters - Source: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow</sup></sub>
<br>
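
The car example can be sketched as a two-feature principal component analysis (PCA), which is one possible way to do feature extraction and not necessarily the one intended above. The data is invented, and the name `wear_and_tear` is only an illustrative label. The sketch relies on a known property: for two standardized, positively correlated features, the first principal component points along (1, 1)/sqrt(2).

```python
import math
import statistics

# Hypothetical cars: mileage (in 1000 km) and age (in years), strongly correlated.
mileage = [20.0, 45.0, 60.0, 90.0, 120.0]
age = [1.0, 3.0, 4.0, 6.0, 9.0]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


z_mileage = standardize(mileage)
z_age = standardize(age)

# Correlation between the two features.
r = sum(m * a for m, a in zip(z_mileage, z_age)) / len(mileage)

# The covariance matrix of the standardized features is [[1, r], [r, 1]].
# For r > 0 its leading eigenvector is (1, 1) / sqrt(2) with eigenvalue 1 + r,
# so projecting on it merges both features into one new feature.
wear_and_tear = [(m + a) / math.sqrt(2) for m, a in zip(z_mileage, z_age)]
explained_variance_ratio = (1 + r) / 2

print(f"correlation: {r:.3f}")
print(f"variance kept by the single feature: {explained_variance_ratio:.1%}")
print([round(w, 2) for w in wear_and_tear])
```

Because mileage and age move together so closely in this data, a single combined feature retains nearly all of the variance of the original two.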

##### Anomaly detection

Some examples:
* Find unusual credit card transactions
* Find manufacturing defects

![Anomaly detection](support/images/mls3_0110.png)
<br><sub><sup>Figure: Anomaly detection - Source: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow</sup></sub>
<br>
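
A very simple way to flag unusual transactions is to measure how far each amount lies from the median, scaled by the median absolute deviation (MAD). The amounts and the cut-off of 5 below are hypothetical choices for illustration; production systems typically use richer features and dedicated algorithms.

```python
import statistics

# Hypothetical credit card transaction amounts in EUR.
amounts = [12.5, 9.9, 14.2, 11.0, 13.1, 10.4, 250.0, 12.0]

median = statistics.median(amounts)
# Median absolute deviation: a spread estimate that outliers barely influence.
mad = statistics.median(abs(a - median) for a in amounts)

THRESHOLD = 5.0  # assumed cut-off; tune it for the data at hand


def is_anomaly(amount):
    """Flag amounts that lie many MADs away from the median."""
    if mad == 0:
        return amount != median
    return abs(amount - median) / mad > THRESHOLD


anomalies = [a for a in amounts if is_anomaly(a)]
print("median:", median, "MAD:", mad)
print("anomalies:", anomalies)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value would inflate the latter and could hide itself.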

#### Other approaches

* Semi-supervised learning: Combines a small amount of labeled data with a large amount of unlabeled data to improve learning accuracy.

* Self-supervised learning: Uses the data's inherent structure to generate labels, so the model can learn without externally provided labels (e.g. BERT, RoBERTa).

* Reinforcement learning: Trains an agent to make decisions by rewarding desired actions and penalizing undesired ones in a dynamic environment.
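
The idea of creating labels from the data itself can be sketched with masked-word prediction, the kind of objective used to pretrain BERT-style models. The helper below is a simplification made for illustration: it masks every word in turn, whereas real pretraining masks a random subset of tokens, and the `[MASK]` string is used here only as a placeholder.

```python
def make_masked_examples(sentence, mask_token="[MASK]"):
    """Turn one unlabeled sentence into (masked input, target word) pairs."""
    words = sentence.split()
    examples = []
    for i, target in enumerate(words):
        masked = words[:i] + [mask_token] + words[i + 1:]
        examples.append((" ".join(masked), target))
    return examples


# No human labeling is needed: the hidden word itself serves as the label.
for masked_input, label in make_masked_examples("the cat sat"):
    print(f"input: {masked_input!r:22} label: {label!r}")
```

Each pair can then be used like an ordinary supervised training example, even though nobody labeled the text by hand.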

## Main challenges

* Insufficient quantity of training data
* Nonrepresentative training data
* Poor-quality data
* Irrelevant features
* Overfitting the training data
* Underfitting the training data
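
Overfitting and underfitting can be illustrated by comparing the error on the training set with the error on held-out validation data. The sketch below uses invented data (a straight line plus fixed noise values) and three hypothetical models: a constant that is too simple, a least-squares line, and a 1-nearest-neighbour lookup that memorises the training set.

```python
# Hypothetical data: y = 2x + 1 plus fixed "noise" values, so results are reproducible.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
train_noise = [0.8, -0.9, 0.7, -0.6, 0.9, -0.8, 0.6, -0.7]
train_y = [2 * x + 1 + e for x, e in zip(train_x, train_noise)]

val_x = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5]
val_noise = [-0.3, 0.4, -0.2, 0.3, -0.4, 0.2, -0.3]
val_y = [2 * x + 1 + e for x, e in zip(val_x, val_noise)]


def mse(model, xs, ys):
    """Mean squared error of `model` on the given examples."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Too simple: always predicts the mean training target.
mean_y = sum(train_y) / len(train_y)


def constant_model(x):
    return mean_y


# Reasonable: a least-squares line fitted to the training set.
mean_x = sum(train_x) / len(train_x)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y)) / sum(
    (x - mean_x) ** 2 for x in train_x
)
intercept = mean_y - slope * mean_x


def linear_model(x):
    return intercept + slope * x


# Too flexible: returns the target of the nearest training point, noise included.
def memorising_model(x):
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


results = {}
for name, model in [
    ("constant", constant_model),
    ("linear", linear_model),
    ("memorising", memorising_model),
]:
    results[name] = (mse(model, train_x, train_y), mse(model, val_x, val_y))
    print(f"{name:10} train MSE = {results[name][0]:6.2f}  validation MSE = {results[name][1]:6.2f}")
```

The constant model has a high error on both sets, which is the signature of underfitting. The memorising model reaches zero training error but does clearly worse than the line on the validation set, which is the signature of overfitting.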