<a href="https://colab.research.google.com/github/Adel/FrameworkBenchmarks/blob/master/books/hands-on-machine-learning/homl_ch1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 1

## What is machine learning

> Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed. (Arthur Samuel, 1959)

> A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. (Tom Mitchell, 1997)

* Training set: the examples that the system uses to learn. Each training example is called a training instance (or sample).
* The part that learns and makes predictions is called the model.
* _Accuracy_ is the ratio of correctly classified instances to the total number of instances in classification tasks.
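
As a minimal sketch of the accuracy definition above, the ratio can be computed directly from two lists of labels. The function name and the spam-filter labels below are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical spam-filter output: 1 = spam, 0 = ham.
labels      = [1, 0, 0, 1, 0, 1, 0, 0]
predictions = [1, 0, 1, 1, 0, 0, 0, 0]
print(accuracy(labels, predictions))  # 6 of 8 correct -> 0.75
```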

## Machine learning use cases
### Spam filter

#### Traditional approach
<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*S0XFiqMKT9sMFHqzfNfXZg.png"/>

#### ML approach
<img src="https://miro.medium.com/v2/resize:fit:1348/format:webp/1*Zw8YfFSQS3LKurbkciHSdw.png"/>


### Machine learning can help humans learn
<img src="https://hkalabs.com/wp-content/uploads/2022/11/06_machineLearningCanHelpHumansLearn.png"/>

### Example applications
* Analyzing images of products in a production line to automatically classify them
* Detecting tumors in brain scans
* Automatically classifying news articles
* Automatically flagging offensive comments on discussion forums
* Summarizing long documents automatically
* Creating a chatbot or a personal assistant
* etc.

## Types of machine learning systems

### Training supervision

#### Supervised learning
* The training set contains the desired solutions, called labels.
##### Examples
* Spam classification
* Predicting a target numeric value such as the price of a house

#### Unsupervised learning
* The training data is unlabeled. The system tries to learn without a teacher.

##### Examples
* A clustering algorithm
<img src="https://media.geeksforgeeks.org/wp-content/uploads/merge3cluster.jpg"/>
* Visualization algorithms
  * Output a 2D or 3D representation of the data
  * Help understand how the data is organized
  * Related task: dimensionality reduction, which simplifies the data without losing too much information. One way to do this is to merge several correlated features into one. For example, a car's mileage and its age are usually correlated, so they can be merged into a single feature representing the car's wear and tear. This is called feature extraction.
* Anomaly detection, for example detecting unusual credit card transactions to prevent fraud, catching manufacturing defects, or automatically removing outliers from a dataset before feeding it to another learning algorithm.
  * A very similar task is novelty detection.
* Association rule learning
  * Example: Running an association rule learning algorithm on a supermarket's sales logs might reveal that people who purchase barbecue sauce and potato chips also tend to buy steak. Thus, you may want to place those items close to each other.
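
To make the clustering idea concrete, here is a minimal k-means sketch. K-means is one common clustering algorithm; the notes do not name a specific one, so treat this choice as an assumption. The visitor data and the starting centroids are invented for illustration, and the starting centroids are passed in explicitly so that the run is deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two 2D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def kmeans(points, initial_centroids, n_iters=10):
    """Group 2D points around the given starting centroids (no labels used)."""
    centroids = list(initial_centroids)
    clusters = []
    for _ in range(n_iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster is empty
                centroids[i] = (sum(x for x, _ in cluster) / len(cluster),
                                sum(y for _, y in cluster) / len(cluster))
    return centroids, clusters


# Made-up data: (visits per month, minutes per visit) for blog visitors.
visitors = [(1, 1), (1, 2), (2, 1), (2, 2),
            (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, clusters = kmeans(visitors,
                             initial_centroids=[visitors[0], visitors[-1]])
print(centroids)  # [(1.5, 1.5), (10.5, 10.5)]
```

The algorithm never sees a label: the two groups emerge only from the distances between points. With poorly chosen starting centroids k-means can settle on a worse grouping, which is why practical implementations usually try several initializations.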

#### Semi-supervised learning
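The training data is partially labeled: typically there are plenty of unlabeled instances and only a few labeled ones. For example, photo-hosting services such as Google Photos automatically recognize that the same person shows up in multiple photos (clustering, the unsupervised part). Once you add the person's name to one photo (a label), the service can identify that person in all photos, which is useful for searching photos.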

* Most semi-supervised learning algorithms are combinations of unsupervised and supervised algorithms.

#### Self-supervised learning

Self-supervised learning (SSL) is a machine learning approach where a model learns representations from unlabeled data by generating its own supervision signals. This is typically done through **pretext tasks** that help the model extract meaningful features.

##### Key Approaches:
- **Contrastive Learning**: Distinguishing similar and dissimilar data points (e.g., SimCLR, MoCo).
- **Predictive Learning**: Predicting missing or transformed parts of the data (e.g., BERT, wav2vec2).

##### Benefits:
- Reduces the need for manually labeled data during pretraining.
- Improves generalization for downstream tasks.
- Leverages large-scale unlabeled datasets efficiently.

SSL is widely used in **NLP, computer vision, and speech processing**, making it a powerful tool for representation learning.

##### Example
With a large dataset of unlabeled images, we can randomly mask a small part of each image and then train a model to recover the original image. The masked images are used as inputs and the original images are used as the labels.

* More often than not, such a model is not the final goal, but is fine-tuned for a slightly different task.

Note: Transferring knowledge from one task to another is called _transfer learning_ and it's one of the most important techniques in machine learning today.
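
The masking example above can be sketched as a function that turns one unlabeled image into an (input, label) pair. This is only the data-preparation step, not a trained model. The function name, the patch size, and the tiny 4x4 "image" are hypothetical.

```python
import random


def make_masked_pair(image, patch_size, rng):
    """Return (masked_image, original_image) for self-supervised training.

    `image` is a list of rows of pixel values. A random square patch is
    set to 0 in a copy, and the untouched original serves as the label.
    """
    height, width = len(image), len(image[0])
    top = rng.randrange(height - patch_size + 1)
    left = rng.randrange(width - patch_size + 1)
    masked = [row[:] for row in image]  # copy so the original is preserved
    for r in range(top, top + patch_size):
        for c in range(left, left + patch_size):
            masked[r][c] = 0
    return masked, image


rng = random.Random(0)
# Toy 4x4 "image" with pixel values 1..16 (no zeros, so the mask is visible).
image = [[r * 4 + c + 1 for c in range(4)] for r in range(4)]
model_input, label = make_masked_pair(image, patch_size=2, rng=rng)
for row in model_input:
    print(row)
```

No human annotation is involved: the supervision signal (the original image) is generated from the data itself.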

#### Reinforcement learning
The learning system, called an _agent_ in this context, can observe the environment, select and perform actions, and get _rewards_ in return (or _penalties_ in the form of negative rewards).

##### Examples
* Many robots implement reinforcement learning algorithms to learn how to walk.
* The AlphaGo program is a good example. It learned its winning policy by analyzing millions of games and then playing many games against itself. It beat Ke Jie, the number one ranked Go player in the world at the time.

#### Batch versus Online learning

##### Batch learning
* The system is incapable of learning incrementally. It must be trained using all the available data. This is generally done offline: the system is first trained then deployed into production. This is called _offline learning_.
* The model's performance tends to decay slowly over time, simply because the world continues to evolve while the model remains unchanged. This is called _model rot_ or _model drift_.
* Solution: regularly retrain the model.

##### Online learning (incremental learning)
* The model is trained incrementally by feeding it data instances sequentially, either individually or in small groups called mini-batches.

<img src="https://www.oreilly.com/api/v2/epubs/9781098125967/files/assets/mls3_0114.png"/>


* Can be used to train algorithms on huge datasets that cannot fit in one machine's main memory.

<img src="https://www.oreilly.com/api/v2/epubs/9781098125967/files/assets/mls3_0115.png"/>

###### Learning rate
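* The learning rate controls how fast the system adapts to changing data.
* Low learning rate: the system has more inertia and learns more slowly, but it is also less sensitive to noise in the new data.
* High learning rate: the system rapidly adapts to new data, but it also tends to quickly forget the old data.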
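
The effect of the learning rate can be sketched with a one-parameter online model, `y = w * x`, updated one instance at a time by gradient descent on the squared error. The data stream is invented: the true slope is 2 for the first 50 instances and then switches to 5, to mimic a changing world.

```python
def online_fit(stream, learning_rate):
    """Update the single weight w after every (x, y) instance in the stream."""
    w = 0.0
    for x, y in stream:
        error = w * x - y
        w -= learning_rate * error * x  # gradient step on 0.5 * error**2
    return w


xs = [0.5, 1.0, 1.5]
# 50 instances generated with slope 2, then 10 instances with slope 5.
stream = [(xs[i % 3], 2.0 * xs[i % 3]) for i in range(50)]
stream += [(xs[i % 3], 5.0 * xs[i % 3]) for i in range(50, 60)]

w_fast = online_fit(stream, learning_rate=0.5)
w_slow = online_fit(stream, learning_rate=0.05)
print(f"high learning rate: w = {w_fast:.3f}")
print(f"low learning rate:  w = {w_slow:.3f}")
```

After only 10 instances of the new regime, the high-learning-rate model has almost reached the new slope of 5, while the low-learning-rate model is still somewhere between the old and the new slope.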

##### Challenges
* Dealing with bad data: if bad data is fed to the system, its performance will decline.
* The system needs to be monitored closely, and learning should be switched off (and possibly reverted to a previous state) if a drop in performance is detected.

#### Instance-Based vs Model-Based learning
* This distinction is about how the system generalizes to instances it has never seen before.

##### Instance based learning
* The system learns the examples by heart, then generalizes to new cases by using a similarity measure to compare them to the learned examples (or a subset of them).
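
A minimal sketch of instance-based learning is a k-nearest-neighbors classifier: nothing is fitted, the training instances are simply stored and compared to each new case. Here the similarity measure is Euclidean distance, and the two email features (number of links, number of exclamation marks) and all values are made up.

```python
import math
from collections import Counter


def knn_predict(training_set, query, k=3):
    """Predict the majority label among the k stored instances closest to query."""
    by_distance = sorted(training_set,
                         key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical emails described by (number of links, number of "!").
training_set = [
    ((0, 1), "ham"), ((1, 0), "ham"), ((1, 1), "ham"),
    ((8, 9), "spam"), ((9, 8), "spam"), ((9, 9), "spam"),
]
print(knn_predict(training_set, (8, 8)))  # spam
print(knn_predict(training_set, (0, 0)))  # ham
```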

<img src="https://www.oreilly.com/api/v2/epubs/9781098125967/files/assets/mls3_0116.png"/>

##### Model-based learning
* Build a model and use that model to make predictions.
* This is the typical machine learning workflow.
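
As a sketch of model-based learning, the snippet below fits a straight line by ordinary least squares and then uses the fitted parameters to predict a new case. The GDP and life-satisfaction numbers are invented for illustration, not taken from a real dataset.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up training data: GDP per capita (tens of thousands of USD)
# and life satisfaction score.
gdp = [2.0, 3.0, 4.0, 5.0]
satisfaction = [5.0, 5.5, 6.0, 6.5]

intercept, slope = fit_line(gdp, satisfaction)   # training
prediction = intercept + slope * 3.5             # inference on a new country
print(intercept, slope, prediction)
```

Unlike the instance-based approach, the training data can be discarded once the two parameters are learned.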

<img src="https://www.oreilly.com/api/v2/epubs/9781098125967/files/assets/mls3_0117.png"/>

## Main challenges of Machine learning

### Insufficient quantity of training data
* Even for very simple problems, we typically need thousands of examples, and for complex problems such as image or speech recognition, we may need millions of examples (unless we can reuse parts of an existing model).

* In 2001, Microsoft [researchers](https://dl.acm.org/doi/10.3115/1073012.1073017) showed that very different machine learning algorithms, including fairly simple ones, performed almost identically well on a complex problem of natural language disambiguation once they were given enough data.
<img src="https://www.oreilly.com/api/v2/epubs/9781098125967/files/assets/mls3_0121.png"/>
* [The unreasonable effectiveness of data](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35179.pdf)
* However, small and medium-sized datasets are still very common, and it's not always easy or cheap to get extra training data.

### Nonrepresentative training data
A model trained on a nonrepresentative training set is unlikely to make accurate predictions, especially for cases that are underrepresented in that set, such as the very poor and very rich countries in the figure below:


<img src="https://www.oreilly.com/api/v2/epubs/9781098125967/files/assets/mls3_0122.png"/>

* A sample that is too small will have _sampling noise_ (nonrepresentative data as a result of chance).
#### Sampling bias
* Very large samples can be nonrepresentative if the sampling method is flawed: _sampling bias_.
* For example, during the US presidential election in 1936, the Literary Digest conducted a very large poll, sending mail to about 10 million people. It got 2.4 million answers, but still predicted the result incorrectly because of sampling bias:
  * Most of the mailing addresses it obtained tended to favor wealthier people.
  * Less than 25% of the people who were polled answered. This is a special kind of sampling bias called _nonresponse bias_.
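
Sampling bias can be illustrated with a small simulation. The population below is entirely hypothetical (it does not reproduce the 1936 poll): 30% of voters are wealthy, and wealthy voters support candidate A far more often than the rest.

```python
import random

# Hypothetical population of 10,000 voters (numbers invented for illustration).
# Each voter is (is_wealthy, supports_candidate_a).
population = (
    [(True, True)] * 2100 + [(True, False)] * 900 +
    [(False, True)] * 2450 + [(False, False)] * 4550
)


def support_rate(voters):
    """Fraction of the given voters who support candidate A."""
    return sum(1 for _, supports in voters if supports) / len(voters)


rng = random.Random(0)
true_rate = support_rate(population)

# Representative sample: every voter is equally likely to be polled.
random_sample = rng.sample(population, 1000)

# Biased sample: only wealthy voters are reachable by the pollster.
wealthy_only = [voter for voter in population if voter[0]]
biased_sample = rng.sample(wealthy_only, 1000)

print(f"true support:       {true_rate:.3f}")
print(f"random sample:      {support_rate(random_sample):.3f}")
print(f"wealthy-only sample: {support_rate(biased_sample):.3f}")
```

Both samples have the same size, but only the random one lands close to the true rate. Making the biased sample larger would not help, because the flaw is in how the voters were selected.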

### Poor-quality data
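* If the training data is full of errors, outliers, and noise (e.g., due to poor-quality measurements), the system will have a harder time detecting the underlying patterns, so it is worth spending time cleaning up the data.

For example:
* If some instances are clearly outliers, discard them or fix the errors manually.
* If 5% of customers did not specify their age (a feature), we can:
  * ignore the attribute altogether
  * ignore those instances
  * fill in the missing values (e.g., with the median age)
  * train two models, one with the feature and one without it.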
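
Two of the options above for handling missing ages, dropping the affected instances or filling in the median, can be sketched as follows. The list of ages is made up, and `None` stands for a missing value.

```python
import statistics

# Hypothetical customer ages; None marks a customer who did not specify it.
ages = [34, None, 27, 45, None, 52, 38]

# Option: ignore the instances with a missing age.
known_ages = [age for age in ages if age is not None]

# Option: fill in the missing values with the median of the known ages.
median_age = statistics.median(known_ages)
filled_ages = [median_age if age is None else age for age in ages]

print(known_ages)   # [34, 27, 45, 52, 38]
print(median_age)   # 38
print(filled_ages)  # [34, 38, 27, 45, 38, 52, 38]
```

The median should be computed on the training set only and then reused to fill in missing values in new data.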

### Irrelevant features
* Garbage in, garbage out.
* Feature engineering: Coming up with a good set of features to train on.
* Feature selection: Selecting the most useful features to train on among the existing features.
* Feature extraction: Combining existing features to produce a more useful one.
* Creating new features by gathering new data

### Overfitting the training data

<img src="https://www.oreilly.com/api/v2/epubs/9781098125967/files/assets/mls3_0123.png"/>

* If the sample is too small, which introduces sampling noise, a complex model such as a deep neural network is likely to detect patterns in the noise itself. These patterns will not generalize to new instances.
* For example, if we feed the country name as a feature to the model that predicts life satisfaction from GDP per capita, a complex model might detect that all countries in the training data with a `w` in their name have a life satisfaction score greater than 7. This pattern occurred by chance and would not generalize.

#### Warning
Overfitting happens when the model is too complex relative to the amount and noisiness of the training data.
##### Possible solutions
* Simplify the model by selecting one with fewer parameters (e.g., a linear model rather than a high-degree polynomial), by reducing the number of attributes in the training data, or by constraining the model (this is called _regularization_).
* Gather more training data.
* Reduce the noise in the training data (e.g., fix data errors and remove outliers).
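
One way to constrain a model is to penalize large parameter values. The sketch below uses ridge (L2) regularization on a one-feature linear model. That specific technique is an assumption here, since the notes only say "constraining the model". The penalty strength `alpha` is a hyperparameter, and the data is made up.

```python
def fit_ridge_line(xs, ys, alpha):
    """Fit y = intercept + slope * x while penalizing alpha * slope**2.

    With alpha = 0 this is plain least squares; larger alpha values
    shrink the slope toward 0 (the intercept is not penalized).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / (variance + alpha)
    intercept = mean_y - slope * mean_x
    return intercept, slope


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
for alpha in (0.0, 5.0, 50.0):
    intercept, slope = fit_ridge_line(xs, ys, alpha)
    print(f"alpha={alpha:>4}: slope={slope:.3f}, intercept={intercept:.3f}")
```

A larger `alpha` gives a flatter, simpler line that is less able to chase noise. If `alpha` is set too high, the model becomes too constrained and starts to underfit.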

### Underfitting the training data
* Underfitting is the opposite of overfitting.
* It occurs when the model is too simple to learn the underlying structure of data.

#### Main options to fix the problem
* Select a more powerful model, with more parameters.
* Feed better features to the learning algorithm (feature engineering).
* Reduce the constraints of the model (for example, by reducing the regularization hyperparameter).

## Testing and validating
* Split the data into two sets.
  * _Training set_ and _test_ set.
* Train the model on the training set and test it using the test set.
* The error rate on new cases is called the _generalization error_ (or _out-of-sample error_). Evaluating the model on the test set gives an estimate of this error.
* If the training error is low but the generalization error is high, the model is overfitting the training data.
* It is common to use 80% of the data for training and 20% for testing. However, it depends on the size of the dataset.
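
The 80/20 split described above can be sketched with a simple shuffle. The function name, the default seed, and the toy dataset are illustrative. Fixing the seed keeps the same instances in the test set from one run to the next.

```python
import random


def train_test_split(instances, test_ratio=0.2, seed=42):
    """Shuffle a copy of the data, then hold out test_ratio of it for testing."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


dataset = list(range(100))  # stand-in for 100 training instances
train_set, test_set = train_test_split(dataset)
print(len(train_set), len(test_set))  # 80 20
```

The model is then trained only on `train_set`, and its error on `test_set` serves as the estimate of the generalization error.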

## Hyperparameter tuning and model selection
TODO

## Data mismatch
* Both the validation set and the test set must be as representative as possible of the data you expect to use in production.


TODO

<img src="https://www.oreilly.com/api/v2/epubs/9781098125967/files/assets/mls3_0126.png"/>

