# Reference Book: 
### Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow (By Aurélien Géron)

# What Is Machine Learning?

- Machine Learning is the science (and art) of programming computers so they can learn from data.
- [Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed. —Arthur Samuel, 1959

### Example

- For example, your spam filter is a Machine Learning program that can learn to flag spam given examples of spam emails (e.g., flagged by users) and examples of regular (nonspam, also called “ham”) emails. 

# Why Use Machine Learning?

1. Consider how you would write a spam filter using traditional programming techniques.
<img src="Images/1. ML Traditional.png" width="700" style="margin-left: 40px;">

2. First you would look at what spam typically looks like. You might notice that some words or phrases (such as “4U,” “credit card,” “free,” and “amazing”) tend to come up a lot in the subject. Perhaps you would also notice a few other patterns in the sender’s name, the email’s body, and so on.

3. You would write a detection algorithm for each of the patterns that you noticed, and your program would flag emails as spam if a number of these patterns are detected.

4. You would test your program, and repeat steps 2 and 3 until it is good enough.

Since the problem is not trivial, your program will likely become a long list of complex rules—pretty hard to maintain.

1. In contrast, a spam filter based on Machine Learning techniques automatically learns which words and phrases are good predictors of spam by detecting unusually frequent patterns of words in the spam examples compared to the ham examples. The program is much shorter, easier to maintain, and most likely more accurate.

<img src="Images/2. ML Modern.png" width="700" style="margin-left: 40px;">

2. Moreover, if spammers notice that all their emails containing “4U” are blocked, they might start writing “For U” instead. A spam filter using traditional programming techniques would need to be updated to flag “For U” emails. If spammers keep working around your spam filter, you will need to keep writing new rules forever.

3. In contrast, a spam filter based on Machine Learning techniques automatically notices that “For U” has become unusually frequent in spam flagged by users, and it starts flagging them without your intervention.
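
The idea of detecting words that are unusually frequent in spam compared to ham can be sketched in plain Python. This is only an illustration under assumptions: the toy emails, the names `word_spam_ratios` and `is_spam`, the smoothing constant, and both thresholds are invented here, and a real filter would rely on a proper classifier trained on far more data.

```python
from collections import Counter

# Toy training data (made up for illustration): (subject line, is_spam)
EMAILS = [
    ("amazing free credit card 4U", True),
    ("free prize 4U claim now", True),
    ("amazing offer free entry", True),
    ("meeting notes for monday", False),
    ("lunch on friday", False),
    ("project notes and free time on monday", False),
]


def word_spam_ratios(emails, smoothing=1.0):
    """Return {word: how many times more frequent it is in spam than in ham}."""
    spam_counts, ham_counts = Counter(), Counter()
    for text, spam in emails:
        target = spam_counts if spam else ham_counts
        target.update(text.lower().split())
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    vocabulary = set(spam_counts) | set(ham_counts)
    extra = smoothing * len(vocabulary)
    ratios = {}
    for word in vocabulary:
        # Additive smoothing avoids dividing by zero for words never seen in ham.
        spam_rate = (spam_counts[word] + smoothing) / (spam_total + extra)
        ham_rate = (ham_counts[word] + smoothing) / (ham_total + extra)
        ratios[word] = spam_rate / ham_rate
    return ratios


def is_spam(text, ratios, word_threshold=1.5, min_hits=2):
    """Flag a message if enough of its words are unusually frequent in spam."""
    words = text.lower().split()
    hits = sum(1 for w in words if ratios.get(w, 1.0) >= word_threshold)
    return hits >= min_hits


ratios = word_spam_ratios(EMAILS)
print(is_spam("free credit card 4U", ratios))
print(is_spam("notes for monday meeting", ratios))
```

Nothing in this sketch hard-codes “4U” or “free”. If users start flagging “For U” emails and the ratios are recomputed, the new phrase's words gain weight without anyone writing a new rule.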

- Finally, Machine Learning can help humans learn (Figure 1-4): ML algorithms can be inspected to see what they have learned (although for some algorithms this can be tricky). For instance, once the spam filter has been trained on enough spam, it can easily be inspected to reveal the list of words and combinations of words that it believes are the best predictors of spam. Sometimes this will reveal unsuspected correlations or new trends, and thereby lead to a better understanding of the problem.

- Applying ML techniques to dig into large amounts of data can help discover patterns
that were not immediately apparent. This is called **data mining.**

<img src="Images/3. ML Understanding.png" width="700" style="margin-left: 40px;">


# Types of Machine Learning Systems

1. **Supervised learning**

- In supervised learning, the training data you feed to the algorithm includes the desired solutions, called labels.
- Common algorithms:
    1. k-Nearest Neighbors
    2. Linear Regression
    3. Logistic Regression
    4. Support Vector Machines (SVMs)
    5. Decision Trees and Random Forests
    6. Neural networks

    <img src="Images/4. Supervised Learning.png" width="700" style="margin-left: 40px;">

2. **Unsupervised learning**

- In unsupervised learning, as you might guess, the training data is unlabeled. The system tries to learn without a teacher.
- Common tasks and algorithms:
    1. Clustering
        - K-Means
        - DBSCAN
        - Hierarchical Cluster Analysis (HCA)
    2. Anomaly detection and novelty detection
        - One-class SVM
        - Isolation Forest
    3. Visualization and dimensionality reduction
        - Principal Component Analysis (PCA)
        - Kernel PCA
        - Locally-Linear Embedding (LLE)
        - t-distributed Stochastic Neighbor Embedding (t-SNE)
    4. Association rule learning
        - Apriori
        - Eclat

- Example:
    - Say you have a lot of data about your blog’s visitors. You may want to run a clustering algorithm to try to detect groups of similar visitors (Figure 1-8). At no point do you tell the algorithm which group a visitor belongs to: it finds those connections without your help. For example, it might notice that 40% of your visitors are males who love comic books and generally read your blog in the evening, while 20% are young sci-fi lovers who visit during the weekends, and so on. If you use a hierarchical clustering algorithm, it may also subdivide each group into smaller groups. This may help you target your posts for each group.

    <img src="Images/5. Unsupervised Learning.png" width="700" style="margin-left: 40px;">
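
To make the clustering idea concrete, here is a minimal one-dimensional K-Means sketch in plain Python. The feature (the hour at which each visitor usually reads the blog), the data, and the deterministic initialization are assumptions made for this example. Real implementations handle many features and use more careful initialization.

```python
def k_means_1d(values, k, n_iterations=20):
    """Very small 1-D K-Means: returns (centroids, cluster index of each value)."""
    ordered = sorted(values)
    # Deterministic initialization (an assumption of this sketch): spread the
    # initial centroids evenly across the sorted values.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    assignments = []
    for _ in range(n_iterations):
        # Assignment step: each value joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: abs(v - centroids[c])) for v in values
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(values, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = sum(members) / len(members)
    return centroids, assignments


# Hypothetical feature: hour of day at which each visitor usually reads the blog.
visit_hours = [8, 9, 9, 10, 20, 21, 21, 22, 23]
centroids, assignments = k_means_1d(visit_hours, k=2)
print(centroids)
print(assignments)
```

The algorithm is never told which visitors are "morning readers" or "evening readers". The two groups emerge from the data alone.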

- **Visualization algorithms:**
    - Visualization algorithms are also good examples of unsupervised learning algorithms: you feed them a lot of complex and unlabeled data, and they output a 2D or 3D representation of your data that can easily be plotted (Figure 1-9). These algorithms try to preserve as much structure as they can (e.g., trying to keep separate clusters in the input space from overlapping in the visualization), so you can understand how the data is organized and perhaps identify unsuspected patterns.

    <img src="Images/6. Visualization Algorithm.png" width="700" style="margin-left: 40px;">

- **Dimensionality Reduction:**
    - A related task is dimensionality reduction, in which the goal is to simplify the data without losing too much information. One way to do this is to merge several correlated features into one. For example, a car’s mileage may be very correlated with its age, so the dimensionality reduction algorithm will merge them into one feature that represents the car’s wear and tear. This is called feature extraction.

    - It is often a good idea to try to reduce the dimension of your training data using a dimensionality reduction algorithm before you feed it to another Machine Learning algorithm (such as a supervised learning algorithm). It will run much faster, the data will take up less disk and memory space, and in some cases it may also perform better.

- **Anomaly Detection:**
    - Yet another important unsupervised task is anomaly detection: for example, detecting unusual credit card transactions to prevent fraud, catching manufacturing defects, or automatically removing outliers from a dataset before feeding it to another learning algorithm. The system is shown mostly normal instances during training, so it learns to recognize them; when it sees a new instance, it can tell whether it looks like a normal one or whether it is likely an anomaly. A very similar task is novelty detection: the difference is that novelty detection algorithms expect to see only normal data during training, while anomaly detection algorithms are usually more tolerant and can often perform well even with a small percentage of outliers in the training set.

    <img src="Images/7. Anomaly Detection.png" width="700" style="margin-left: 40px;">
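
As a rough illustration of learning what "normal" looks like and then judging new instances, the sketch below uses a simple z-score threshold. This is a stand-in technique chosen for brevity, not One-class SVM or Isolation Forest. The transaction amounts, the function names, and the threshold of 3 standard deviations are all assumptions.

```python
import statistics


def fit_normal_profile(training_values):
    """Learn what 'normal' looks like: the mean and sample standard deviation."""
    return statistics.mean(training_values), statistics.stdev(training_values)


def is_anomaly(value, profile, z_threshold=3.0):
    """Flag a value lying more than z_threshold standard deviations from the mean."""
    mean, std = profile
    return abs(value - mean) / std > z_threshold


# Hypothetical transaction amounts, assumed to be mostly normal instances.
training_amounts = [20.0, 25.0, 22.0, 30.0, 27.0, 24.0, 26.0, 23.0]
profile = fit_normal_profile(training_amounts)

print(is_anomaly(28.0, profile))   # close to the usual range
print(is_anomaly(500.0, profile))  # far outside the usual range
```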

- **Association Rule:**
    - Finally, another common unsupervised task is association rule learning, in which the goal is to dig into large amounts of data and discover interesting relations between attributes. For example, suppose you own a supermarket. Running an association rule on your sales logs may reveal that people who purchase barbecue sauce and potato chips also tend to buy steak. Thus, you may want to place these items close to each other.
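
The supermarket example can be made concrete with the two quantities usually reported for an association rule: support and confidence. The sketch below only evaluates one given rule on a made-up sales log. Algorithms such as Apriori or Eclat additionally search for the candidate itemsets, which is not shown here. The transactions and function names are hypothetical.

```python
# Hypothetical sales log: each transaction is the set of items bought together.
transactions = [
    {"barbecue sauce", "potato chips", "steak"},
    {"barbecue sauce", "potato chips", "steak", "soda"},
    {"barbecue sauce", "potato chips"},
    {"bread", "milk"},
    {"steak", "soda"},
    {"bread", "potato chips"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the log."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


rule_if = {"barbecue sauce", "potato chips"}
rule_then = {"steak"}
print(support(rule_if | rule_then, transactions))
print(confidence(rule_if, rule_then, transactions))
```

In this toy log, two out of the three baskets containing both barbecue sauce and potato chips also contain steak, so the rule has a confidence of about 0.67.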

3. **Semisupervised learning**

    - Some algorithms can deal with partially labeled training data, usually a lot of unlabeled data and a little bit of labeled data. This is called semisupervised learning.
    - Most semisupervised learning algorithms are combinations of unsupervised and supervised algorithms. For example, deep belief networks (DBNs) are based on unsupervised components called restricted Boltzmann machines (RBMs) stacked on top of one another. RBMs are trained sequentially in an unsupervised manner, and then the whole system is fine-tuned using supervised learning techniques.

    <img src="Images/8. Semi Supervised.png" width="700" style="margin-left: 40px;">

4. **Reinforcement Learning**

    - Reinforcement Learning is a very different beast. The learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards, as in the figure below). It must then learn by itself what is the best strategy, called a policy, to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation.

    <img src="Images/9. Reinforcement Learning.png" width="700" style="margin-left: 40px;">

    - For example, many robots implement Reinforcement Learning algorithms to learn how to walk. DeepMind’s AlphaGo program is also a good example of Reinforcement Learning: it made the headlines in May 2017 when it beat the world champion Ke Jie at the game of Go. It learned its winning policy by analyzing millions of games, and then playing many games against itself. Note that learning was turned off during the games against the champion; AlphaGo was just applying the policy it had learned.


# Understanding How a Utility or Cost Function Works

- For example, suppose you want to know if money makes people happy. You download the **Better Life Index** data from the OECD’s website as well as statistics about **GDP per capita** from the IMF’s website. Then you join the tables and sort by GDP per capita. The table below shows an excerpt of what you get.

**Table 1-1. Does money make people happier?**

| Country        | GDP per capita (USD) | Life satisfaction |
|----------------|----------------------|-------------------|
| Hungary         | 12,240               | 4.9               |
| Korea           | 27,195               | 5.8               |
| France          | 37,675               | 6.5               |
| Australia       | 50,962               | 7.3               |
| United States   | 55,805               | 7.2               |

---

Let’s plot the data for a few random countries:

<img src="Images/10. Simple_Model_1.png" width="700" style="margin-left: 40px;">

There does seem to be a trend here! Although the data is noisy (i.e., partly random), it looks like life satisfaction goes up more or less linearly as the country’s GDP per capita increases.

So you decide to model life satisfaction as a **linear function of GDP per capita**. This step is called **model selection**: you selected a linear model of life satisfaction with just one attribute, GDP per capita.

---

### Equation 1-1. A Simple Linear Model
```
life_satisfaction = θ₀ + θ₁ × GDP_per_capita
```

This model has two parameters, **θ₀** and **θ₁**. By tweaking these parameters, you can make your model represent any linear function, as shown below:


<img src="Images/10. Simple_Model_2.png" width="700" style="margin-left: 40px;">

---

Before you can use your model, you need to define the parameter values **θ₀** and **θ₁**.

How can you know which values will make your model perform best?

To answer this question, you need to specify a **performance measure**:

- A **utility (or fitness) function** measures how good your model is.
- A **cost function** measures how bad your model is.

For linear regression problems, people typically use a **cost function** that measures the distance between the model’s predictions and the training examples. The objective is to **minimize this distance**.
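
One common choice for this distance is the mean squared error. The sketch below evaluates it on the five rows of Table 1-1 for two candidate parameter settings. Both settings, and the names `predict` and `mse_cost`, are picked here only for illustration.

```python
# Five rows from Table 1-1: (GDP per capita in USD, life satisfaction).
DATA = [(12240, 4.9), (27195, 5.8), (37675, 6.5), (50962, 7.3), (55805, 7.2)]


def predict(gdp_per_capita, theta0, theta1):
    """Equation 1-1: life_satisfaction = theta0 + theta1 * GDP_per_capita."""
    return theta0 + theta1 * gdp_per_capita


def mse_cost(theta0, theta1, data=DATA):
    """Mean squared distance between predictions and training examples."""
    return sum((predict(x, theta0, theta1) - y) ** 2 for x, y in data) / len(data)


# Two candidate parameter settings (illustrative values only).
cost_flat = mse_cost(theta0=6.0, theta1=0.0)     # ignores GDP entirely
cost_sloped = mse_cost(theta0=4.5, theta1=5e-5)  # rises with GDP
print(cost_flat, cost_sloped)
```

The sloped line has the lower cost, so under this measure it is the better of the two candidates. Training amounts to searching for the parameters with the lowest cost overall.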

---

This is where the **Linear Regression algorithm** comes in. You feed it your training examples, and it finds the parameters that make the linear model fit the data best. This process is called **training the model**.

In this case, the algorithm finds the optimal parameter values:

- **θ₀ = 4.85**
- **θ₁ = 4.91 × 10⁻⁵**

Now the model fits the training data as closely as possible (for a linear model), as shown below:


<img src="Images/10. Simple_Model_3.png" width="700" style="margin-left: 40px;">
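
For a model with a single feature, the least-squares parameters have a closed form, sketched below in plain Python. Note the assumption: it is fitted only on the five-row excerpt from Table 1-1, so it yields parameters that differ somewhat from the values quoted above, which come from the full dataset. In practice you would more likely use a library such as Scikit-Learn's `LinearRegression`.

```python
def fit_simple_linear_regression(xs, ys):
    """Closed-form least-squares fit of y = theta0 + theta1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations around the means.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = s_xy / s_xx
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1


gdp = [12240, 27195, 37675, 50962, 55805]
satisfaction = [4.9, 5.8, 6.5, 7.3, 7.2]
theta0, theta1 = fit_simple_linear_regression(gdp, satisfaction)
print(f"theta0 = {theta0:.2f}, theta1 = {theta1:.2e}")
```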

---

You are now ready to use the model to make predictions.

For example, suppose you want to know how happy people in **Cyprus** are, but the OECD data does not provide this information. You can use your model to make a prediction:

- GDP per capita of Cyprus = **22,587**
- Predicted life satisfaction:
```
4.85 + 22,587 × 4.91 × 10⁻⁵ ≈ 5.96
```


So, life satisfaction in Cyprus is likely to be **around 5.96**.
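
The same arithmetic as a small Python function, using the parameter values quoted in the text. The function and constant names are chosen here for illustration.

```python
THETA_0 = 4.85     # intercept quoted in the text
THETA_1 = 4.91e-5  # slope quoted in the text


def predict_life_satisfaction(gdp_per_capita):
    """Apply the trained linear model to a new country."""
    return THETA_0 + THETA_1 * gdp_per_capita


cyprus_gdp_per_capita = 22_587
print(round(predict_life_satisfaction(cyprus_gdp_per_capita), 2))  # 5.96
```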

---

## The Unreasonable Effectiveness of Data

In a famous paper published in **2001**, Microsoft researchers **Michele Banko** and **Eric Brill** showed that very different **Machine Learning algorithms**, including fairly simple ones, performed almost identically well on a complex problem of **natural language disambiguation** once they were given enough data (as illustrated in Figure 1-20).

**Figure. The importance of data versus algorithms**

<img src="Images/11. Data Imp.png" width="700" style="margin-left: 40px;">

As the authors put it:

> “These results suggest that we may want to reconsider the tradeoff between spending time and money on algorithm development versus spending it on corpus development.”

The idea that **data matters more than algorithms** for complex problems was further popularized by **Peter Norvig et al.** in a paper titled **“The Unreasonable Effectiveness of Data”**, published in **2009**.

However, it should be noted that **small- and medium-sized datasets** are still very common, and it is not always easy or cheap to obtain extra training data. Therefore, **do not abandon algorithms just yet**—both data and algorithms play an important role in building effective machine learning systems.


## Overfitting the Training Data

Overfitting occurs when a machine learning model performs well on the **training data** but fails to **generalize to new, unseen data**. This is similar to overgeneralizing in real life—for example, assuming all taxi drivers in a country are dishonest after a single bad experience.

A highly complex model (such as a high-degree polynomial) may fit the training data extremely well, yet produce **unreliable predictions**. This often happens when:
- The training data is **noisy**
- The dataset is **too small**
- The model is **too complex** relative to the data

Complex models can mistakenly learn **patterns from noise**. For instance, a model might infer that countries with the letter *“w”* in their name have higher life satisfaction—an accidental pattern that does not generalize (e.g., to Rwanda or Zimbabwe).

### Why Overfitting Happens
- Model complexity is too high for the amount or quality of data

### Common Solutions
- **Simplify the model** (fewer parameters, fewer features, or simpler algorithms)
- **Collect more training data**
- **Reduce noise** (fix errors, remove outliers)

### Regularization
Regularization is the process of **constraining a model** to make it simpler and reduce overfitting.  
For example, in a linear model with parameters **θ₀** (intercept) and **θ₁** (slope):
- Forcing **θ₁ = 0** produces a very simple model (only shifting up/down)
- Allowing **θ₁** to vary but keeping it small balances simplicity and flexibility

The goal is to find the **right balance** between fitting the training data well and keeping the model simple enough to generalize effectively.

<img src="Images/12. Regularization.png" width="700" style="margin-left: 40px;">

- The amount of regularization to apply during learning can be controlled by a hyperparameter. A hyperparameter is a parameter of a learning algorithm (not of the model). As such, it is not affected by the learning algorithm itself; it must be set prior to training and remains constant during training. If you set the regularization hyperparameter to a very large value, you will get an almost flat model (a slope close to zero); the learning algorithm will almost certainly not overfit the training data, but it will be less likely to find a good solution. Tuning hyperparameters is an important part of building a Machine Learning system (you will see a detailed example in the next chapter).
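
The effect of the regularization hyperparameter can be sketched with a one-feature linear model and an ℓ2 penalty on the slope (often called ridge regression). Assumptions in this sketch: the penalty `alpha * theta1**2` applies to the slope only, the data is the five-row excerpt from Table 1-1, and GDP is expressed in units of 10,000 USD so that `alpha` has a convenient scale.

```python
def fit_ridge_1d(xs, ys, alpha):
    """Least squares with an L2 penalty alpha * theta1**2 on the slope only."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = s_xy / (s_xx + alpha)  # alpha = 0 gives ordinary least squares
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1


# GDP per capita in units of 10,000 USD (rescaled from Table 1-1).
gdp = [1.2240, 2.7195, 3.7675, 5.0962, 5.5805]
satisfaction = [4.9, 5.8, 6.5, 7.3, 7.2]

for alpha in (0.0, 10.0, 1000.0):
    theta0, theta1 = fit_ridge_1d(gdp, satisfaction, alpha)
    print(f"alpha={alpha:>6}: theta0={theta0:.2f}, theta1={theta1:.3f}")
```

Here `alpha` is set before fitting and never changed by it, which is what makes it a hyperparameter. As it grows, the slope shrinks toward zero and the model approaches a flat line at the mean life satisfaction.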

## Underfitting the Training Data

Underfitting is the **opposite of overfitting**. It happens when a model is **too simple** to capture the underlying structure of the data. As a result, the model performs poorly even on the **training data**.

For example, a simple linear model for life satisfaction may underfit because real-world relationships are often more complex than a straight line, leading to inaccurate predictions.

### How to Fix Underfitting
- **Use a more powerful model** with more parameters
- **Provide better features** through feature engineering
- **Reduce model constraints**, such as lowering the regularization strength


---

## Pipelines

A **data pipeline** is a sequence of data processing components. Pipelines are very common in machine learning systems because large amounts of data must be processed through multiple transformations.

Pipeline components usually run **asynchronously**. Each component:
- Pulls data from a data store  
- Processes it  
- Writes the output to another data store  

The next component later consumes this output and continues the process.

Each component is **self-contained**, and components communicate only through data stores. This design:
- Makes the system easier to understand (often visualized using a data flow graph)
- Allows different teams to work independently on different components
- Improves robustness, since downstream components can continue using the last available output if one component fails

However, without proper **monitoring**, failures may go unnoticed. This can cause data to become **stale**, leading to a gradual drop in overall system performance.
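
The sketch below mimics two pipeline components that communicate only through a data store. It is a deliberately simplified, hypothetical setup: a Python dict stands in for the data stores, the component names are invented, and the components run one after another rather than asynchronously.

```python
# A dict stands in for the data stores; all names here are hypothetical.
data_store = {}


def ingest(raw_rows):
    """Component 1: drop missing rows and write the rest to the 'clean' store."""
    data_store["clean"] = [row for row in raw_rows if row is not None]


def aggregate():
    """Component 2: read from 'clean' and write a summary to 'summary'."""
    rows = data_store["clean"]
    data_store["summary"] = {"count": len(rows), "mean": sum(rows) / len(rows)}


def run_component(component, *args):
    """Run one component; on failure, report it and leave its last output untouched."""
    try:
        component(*args)
        return True
    except Exception as error:  # a real system would alert monitoring here
        print(f"{component.__name__} failed: {error!r}")
        return False


run_component(ingest, [3.0, None, 5.0])
run_component(aggregate)
print(data_store["summary"])  # computed from fresh data

run_component(ingest, None)   # upstream failure: iterating over None raises TypeError
run_component(aggregate)      # still succeeds, but on the stale 'clean' data
print(data_store["summary"])
```

The second summary is identical to the first because the downstream component silently reused the last available output. That is the robustness benefit, and also the reason failures need monitoring.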


---

## RMSE, MAE, and Distance Measures (Explained Simply)

When we build a machine learning model, we want to know **how far the predictions are from the actual values**.  
Both **RMSE** and **MAE** measure this distance between:

- A vector of **predictions**
- A vector of **true (target) values**

Think of them as different ways to answer the question:
> *“On average, how wrong is my model?”*

---

### Simple Example

Suppose the true values and predictions are:

- **Actual values:** `[10, 20, 30]`
- **Predicted values:** `[12, 18, 40]`

The **errors** (predicted − actual) are: `[2, -2, 10]`


---

## MAE (Mean Absolute Error) — ℓ1 Norm

**What it does:**  
- Takes the **absolute value** of each error
- Adds them up and averages

`|2| + |−2| + |10| = 14`  
`MAE = 14 / 3 ≈ 4.67`
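
The same calculation as a short Python function (the function name is chosen here for illustration):

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between predictions and targets."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


actual = [10, 20, 30]
predicted = [12, 18, 40]
print(round(mean_absolute_error(actual, predicted), 2))  # 4.67
```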


### Intuition
- Treats all errors **equally**
- Easy to understand
- Less affected by very large errors (outliers)

### City Analogy (Manhattan Distance)
Imagine moving in a city where you can only go **left/right or up/down** (like city blocks).  
MAE measures distance the same way—step by step, no shortcuts.

---

## RMSE (Root Mean Squared Error) — ℓ2 Norm

**What it does:**
- Squares each error
- Averages them
- Takes the square root

`2² + (−2)² + 10² = 108`  
`RMSE = √(108 / 3) = √36 = 6`
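
The same calculation as a short Python function (the function name is chosen here for illustration):

```python
import math


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference between predictions and targets."""
    squared_errors = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


print(root_mean_squared_error([10, 20, 30], [12, 18, 40]))  # 6.0
```

The single error of 10 contributes 100 of the 108 squared units, which is why RMSE (6) ends up noticeably larger than the average absolute error on the same data.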


### Intuition
- **Big errors hurt more** because of squaring
- More sensitive to outliers
- Commonly used in practice

### Straight-Line Analogy (Euclidean Distance)
This is the usual **straight-line distance** you’re familiar with in geometry.

---

## General Norms (ℓk Norms)

For a vector `v = [v₀, v₁, ..., vₙ]`, the ℓk norm is:

`∥v∥ₖ = (|v₀|ᵏ + |v₁|ᵏ + ... + |vₙ|ᵏ)^(1/k)`


### Special Cases
- **ℓ1 norm** → sum of absolute values (MAE is this norm divided by the number of values *m*)
- **ℓ2 norm** → square root of the sum of squares (RMSE is this norm divided by √*m*)
- **ℓ0 norm** → counts how many values are non-zero
- **ℓ∞ norm** → takes the **largest absolute error**
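
A small sketch of these norms applied to the error vector `[2, -2, 10]` from the simple example. The function name `lk_norm` is hypothetical, and the ℓ0 and ℓ∞ cases are handled as the conventions listed above rather than through the general formula.

```python
import math


def lk_norm(vector, k):
    """l_k norm of a vector, with the l_0 and l_infinity conventions as special cases."""
    if k == 0:
        return sum(1 for v in vector if v != 0)  # number of non-zero entries
    if k == math.inf:
        return max(abs(v) for v in vector)       # largest absolute value
    return sum(abs(v) ** k for v in vector) ** (1 / k)


errors = [2, -2, 10]
m = len(errors)
print(lk_norm(errors, 1) / m)             # MAE
print(lk_norm(errors, 2) / math.sqrt(m))  # RMSE
print(lk_norm(errors, 0))
print(lk_norm(errors, math.inf))
```

Dividing the ℓ1 norm by *m* and the ℓ2 norm by √*m* recovers the MAE and RMSE values computed earlier.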

---

## Key Insight: RMSE vs MAE

- **Higher norm index ⇒ more focus on large errors**
- RMSE penalizes large mistakes much more than MAE
- MAE is more robust when outliers exist

### When to Use Which
- Use **MAE** when:
  - Outliers exist
  - You want a simple, stable error measure
- Use **RMSE** when:
  - Large errors are rare
  - Data follows a bell-shaped (normal) distribution
  - Big mistakes should be penalized more

---

### In Short
- **MAE**: “How wrong am I on average?”
- **RMSE**: “How wrong am I when large mistakes count extra?”
