# Chapter 1: The Machine Learning Landscape

## 1. Chapter Overview
**Goal:** This chapter provides a high-level overview of Machine Learning (ML), defining what it is, why it is useful, and categorizing the various types of ML systems. It also covers the typical workflow of an ML project and the primary challenges faced by practitioners.

**Key Concepts:**
* Definition of Machine Learning.
* Supervised vs. Unsupervised Learning.
* Batch vs. Online Learning.
* Instance-based vs. Model-based Learning.
* Overfitting and Underfitting.

**Practical Skills:**
* Loading and preparing data using Pandas.
* Training a simple Linear Regression model using Scikit-Learn.
* Comparing Linear Regression with k-Nearest Neighbors.

## 2. Theoretical Explanation

### What is Machine Learning?
Machine Learning is the science of programming computers to learn from data. Instead of explicitly hard-coding rules (e.g., "if email contains 'free', mark as spam"), an ML system learns patterns from examples (training data) to make predictions on new data.
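
The contrast between a hand-coded rule and a learned one can be sketched as follows. The messages and labels are hypothetical toy data, and the bag-of-words Naive Bayes pipeline is an assumption chosen for brevity, not something this chapter prescribes.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Approach 1: a hand-written rule, mirroring the "free" example above
def rule_based_is_spam(text):
    return "free" in text.lower()

# Approach 2: learn from labeled examples
# (hypothetical toy messages; 1 = spam, 0 = not spam)
train_texts = [
    "win a free prize now",
    "claim your cash prize",
    "free cash offer",
    "meeting at noon tomorrow",
    "see you at lunch",
    "project report attached",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Word counts feed a Naive Bayes classifier; no rule is written by hand
learned_filter = make_pipeline(CountVectorizer(), MultinomialNB())
learned_filter.fit(train_texts, train_labels)

new_texts = ["claim your prize", "lunch meeting tomorrow"]
rule_predictions = [rule_based_is_spam(t) for t in new_texts]
learned_predictions = learned_filter.predict(new_texts)

print("Rule-based:", rule_predictions)
print("Learned:   ", list(learned_predictions))
```

The rule misses the first message because it does not contain the word "free". The learned filter's behaviour comes entirely from the examples it was given, so six messages say nothing about real-world accuracy.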

### Types of Machine Learning Systems
ML systems are generally classified by three criteria:

1.  **Human Supervision**:
    * **Supervised Learning:** The training data includes labels (the desired solutions). Examples: regression (predicting a numeric value) and classification (such as a spam filter).
    * **Unsupervised Learning:** The training data is unlabeled. The system tries to learn without a teacher. Examples: Clustering, Dimensionality Reduction.
    * **Semi-supervised Learning:** A mix of a small amount of labeled data and a lot of unlabeled data.
    * **Reinforcement Learning:** An agent observes an environment, selects actions, and gets rewards or penalties in return. It learns the best strategy (policy) to maximize rewards.

2.  **Incremental Learning**:
    * **Batch Learning:** The system cannot learn incrementally; it must be trained on all the available data at once. This takes time and computing resources, so it is typically done offline.
    * **Online Learning:** The system is trained incrementally by feeding it data instances sequentially, either individually or in small groups (mini-batches). This suits systems that receive data as a continuous flow.

3.  **Generalization Approach**:
    * **Instance-based Learning:** The system learns the examples by heart, then generalizes to new cases by comparing them to the learned examples using a similarity measure.
    * **Model-based Learning:** The system builds a model of the examples (like a formula) and uses that model to make predictions.
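
The difference between batch and online learning can be sketched with Scikit-Learn's `partial_fit` method. The synthetic data, the batch size, and the constant learning rate below are assumptions picked for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor

rng = np.random.default_rng(42)

# Synthetic data (an assumption for illustration): y = 2 + 3x + small noise
X = rng.uniform(0.0, 1.0, size=(2000, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(0.0, 0.1, size=2000)

# Batch learning: a single call to fit() sees the whole dataset
batch_model = LinearRegression().fit(X, y)

# Online learning: the model is updated one mini-batch at a time
online_model = SGDRegressor(learning_rate="constant", eta0=0.1, random_state=42)
batch_size = 100
for start in range(0, len(X), batch_size):
    X_batch = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]
    online_model.partial_fit(X_batch, y_batch)

print("Batch : intercept=%.2f slope=%.2f"
      % (batch_model.intercept_, batch_model.coef_[0]))
print("Online: intercept=%.2f slope=%.2f"
      % (online_model.intercept_[0], online_model.coef_[0]))
```

Because the online model never needs the full dataset in memory, each mini-batch could be discarded after its `partial_fit` call. The price is sensitivity to the learning rate, which here was simply set by hand.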

### Main Challenges
* **Insufficient Quantity of Training Data:** ML algorithms generally need a lot of data to work well.
* **Nonrepresentative Training Data:** The training data must be representative of the new cases you want to generalize to (avoiding sampling bias).
* **Poor-Quality Data:** Errors, outliers, and noise make it hard for the system to detect patterns.
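* **Irrelevant Features:** Garbage in, garbage out. The system can only learn if the training data contains enough relevant features and not too many irrelevant ones, so success depends heavily on *Feature Engineering*: selecting the most useful features, combining existing ones, and creating new ones from additional data.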
* **Overfitting:** The model performs well on the training data but generalizes poorly to new data. This happens when the model is too complex relative to the amount and noisiness of the training data.
* **Underfitting:** The model is too simple to learn the underlying structure of the data.
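
Underfitting and overfitting can be made concrete by fitting polynomials of different degrees to the same noisy data. This is a minimal sketch on synthetic data; the quadratic ground truth, the noise level, and the degrees 1, 2, and 15 are all assumptions.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical ground truth: a quadratic curve plus Gaussian noise
    x = rng.uniform(-3.0, 3.0, size=n)
    y = 0.5 * x**2 + x + 2.0 + rng.normal(0.0, 1.0, size=n)
    return x, y

x_train, y_train = make_data(20)    # small training set
x_test, y_test = make_data(200)     # unseen data from the same process

def mse(model, x, y):
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (1, 2, 15):
    # Least-squares polynomial fit of the given degree
    model = Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"degree {degree:2d}: train MSE = {results[degree][0]:.3f}, "
          f"test MSE = {results[degree][1]:.3f}")
```

Training error can only go down as the degree rises, because each model contains the simpler ones as special cases. The line (degree 1) is too simple for curved data, while the degree-15 fit typically lowers training error without a matching improvement on the test data.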


## 3. Code Reproduction

We will reproduce the "Money and Happiness" example. We start with the standard setup found in the book's notebooks.

In [None]:
# Setup
# Python ≥3.6 is required (the cells below use f-strings)
import sys
assert sys.version_info >= (3, 6)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

# Common imports
import numpy as np
import os

# To plot pretty figures directly within Jupyter
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

### Data Loading and Preparation
We define the function `prepare_country_stats` to merge OECD life satisfaction data and IMF GDP data.

In [None]:
import pandas as pd

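def prepare_country_stats(oecd_bli, gdp_per_capita):
    # Keep only the rows whose INEQUALITY value is "TOT"
    oecd_bli = oecd_bli[oecd_bli["INEQUALITY"] == "TOT"]
    # Long format -> one row per country, one column per indicator
    oecd_bli = oecd_bli.pivot(index="Country", columns="Indicator", values="Value")
    # Rename and re-index on copies so the caller's DataFrame is not mutated
    # (the in-place version fails if the function is called twice)
    gdp_per_capita = gdp_per_capita.rename(columns={"2015": "GDP per capita"})
    gdp_per_capita = gdp_per_capita.set_index("Country")
    # Inner join on the country index: only countries in both tables remain
    full_country_stats = pd.merge(left=oecd_bli, right=gdp_per_capita,
                                  left_index=True, right_index=True)
    full_country_stats = full_country_stats.sort_values(by="GDP per capita")
    # Drop seven rows by position (after sorting) and keep the rest
    remove_indices = [0, 1, 6, 8, 33, 34, 35]
    keep_indices = sorted(set(range(36)) - set(remove_indices))
    return full_country_stats[["GDP per capita", "Life satisfaction"]].iloc[keep_indices]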

In [None]:
# Load the data
import urllib.request

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
os.makedirs("datasets/lifesat", exist_ok=True)

for filename in ("oecd_bli_2015.csv", "gdp_per_capita.csv"):
    print("Downloading", filename)
    url = DOWNLOAD_ROOT + "datasets/lifesat/" + filename
    urllib.request.urlretrieve(url, "datasets/lifesat/" + filename)

oecd_bli = pd.read_csv("datasets/lifesat/oecd_bli_2015.csv", thousands=',')
gdp_per_capita = pd.read_csv("datasets/lifesat/gdp_per_capita.csv", thousands=',', delimiter='\t',
                             encoding='latin1', na_values="n/a")

# Prepare the data
country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)
X = np.c_[country_stats["GDP per capita"]]
y = np.c_[country_stats["Life satisfaction"]]

# Visualize the data
country_stats.plot(kind='scatter', x="GDP per capita", y='Life satisfaction')
plt.show()

### Training and Prediction
We will now train a **Linear Regression** model and make a prediction for Cyprus.

In [None]:
import sklearn.linear_model

# Select a linear model
model = sklearn.linear_model.LinearRegression()

# Train the model
model.fit(X, y)

# Make a prediction for Cyprus
X_new = [[22587]]  # Cyprus's GDP per capita
print(f"Prediction for Cyprus (Linear Regression): {model.predict(X_new)[0][0]}")

### Alternative Model: k-Nearest Neighbors
For comparison, we swap in k-Nearest Neighbors regression, an instance-based learning algorithm, and make a prediction for the same input.

In [None]:
import sklearn.neighbors

# Select a k-Nearest Neighbors regression model
model_knn = sklearn.neighbors.KNeighborsRegressor(n_neighbors=3)

# Train the model
model_knn.fit(X, y)

# Make a prediction for Cyprus
print(f"Prediction for Cyprus (k-NN): {model_knn.predict(X_new)[0][0]}")

## 4. Step-by-Step Explanation

### 1. Data Pipeline
**Input:** Two CSV files (OECD Life Satisfaction & IMF GDP).
**Process:** 
1.  **Ingestion:** We download the raw CSVs from the book's repository.
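2.  **Cleaning:** The `prepare_country_stats` function keeps only the OECD rows where `INEQUALITY` is `"TOT"`, pivots them so that each indicator becomes a column, merges the result with the GDP table on the country index, sorts by GDP per capita, and drops seven rows by position. Missing-value markers (`"n/a"`) are handled earlier, when `read_csv` parses the GDP file.
3.  **Visualization:** We plot the data to check visually whether the relationship looks roughly linear.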
**Output:** `X` (GDP) and `y` (Life Satisfaction).
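
The filter, pivot, and merge steps are easier to follow on a few rows. The sketch below uses hypothetical toy tables that only mimic the layout of the two CSV files; the country names and values are made up.

```python
import pandas as pd

# Hypothetical toy rows in the same long format as the OECD file
oecd_toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B", "C"],
    "INEQUALITY": ["TOT", "MN", "TOT", "TOT", "TOT"],
    "Indicator": ["Life satisfaction", "Life satisfaction",
                  "Life satisfaction", "Employment rate", "Life satisfaction"],
    "Value": [6.0, 5.9, 7.0, 70.0, 5.0],
})
# Hypothetical toy rows in the same layout as the GDP file
gdp_toy = pd.DataFrame({
    "Country": ["A", "B", "D"],
    "2015": [50000.0, 30000.0, 10000.0],
})

# 1. Filter: keep only the "TOT" rows
totals = oecd_toy[oecd_toy["INEQUALITY"] == "TOT"]
# 2. Pivot: one row per country, one column per indicator
wide = totals.pivot(index="Country", columns="Indicator", values="Value")
# 3. Merge on the country index; the default inner join drops C and D,
#    which each appear in only one of the two tables
gdp = gdp_toy.rename(columns={"2015": "GDP per capita"}).set_index("Country")
merged = pd.merge(left=wide, right=gdp, left_index=True, right_index=True)
# 4. Sort by GDP and keep the two columns of interest
toy_stats = merged.sort_values(by="GDP per capita")[
    ["GDP per capita", "Life satisfaction"]]
print(toy_stats)
```

Note that the merge silently discards countries missing from either table, which is one way a dataset can shrink without any explicit "drop" step.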

### 2. Model Training (`fit`)
* **Linear Regression:** The algorithm finds the parameters $\theta_0$ and $\theta_1$ that minimize the cost function (MSE). It effectively draws the "best fit" line through the data points.
* **k-NN:** The algorithm memorizes the training instances. It does not create a formula.
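
For a single feature, the MSE being minimized is $\frac{1}{m}\sum_{i=1}^{m}(\theta_0 + \theta_1 x_i - y_i)^2$, and the minimizing parameters can be found by ordinary least squares. The sketch below checks that a direct NumPy least-squares solve agrees with `LinearRegression`. The data is a synthetic stand-in, not the real dataset, and the coefficients used to generate it are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Hypothetical stand-in for the (GDP per capita, life satisfaction) data
X = rng.uniform(10_000, 60_000, size=(30, 1))
y = 5.0 + 5e-5 * X + rng.normal(0.0, 0.2, size=(30, 1))

# Prepend a column of ones so that theta_0 plays the role of the intercept
X_b = np.c_[np.ones((len(X), 1)), X]

# Least-squares solution: the theta that minimizes the MSE
theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
theta_0, theta_1 = theta[0, 0], theta[1, 0]

# The same parameters as learned by Scikit-Learn
model = LinearRegression().fit(X, y)
print("NumPy       :", theta_0, theta_1)
print("Scikit-Learn:", model.intercept_[0], model.coef_[0, 0])
```

Because `y` is a column vector here, as in the notebook cells above, `intercept_` is a one-element array and `coef_` has shape `(1, 1)`.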

### 3. Prediction (`predict`)
When we ask for a prediction for Cyprus (GDP per capita of 22,587 USD):
* **Linear Model:** Plugs the value into the formula: $\text{Life satisfaction} = \theta_0 + \theta_1 \times 22587$.
* **k-NN Model:** Finds the 3 countries whose GDP per capita is closest to 22,587 USD and averages their life satisfaction scores.
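
The k-NN prediction step can be reproduced by hand in a few lines. The eight (GDP, satisfaction) pairs below are hypothetical values, not the real dataset; only the query value 22,587 comes from the text.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical (GDP per capita, life satisfaction) pairs
gdp = np.array([9000, 12000, 20000, 25000, 27000, 37000, 50000, 55000],
               dtype=float)
satisfaction = np.array([5.0, 5.8, 5.7, 6.5, 5.9, 6.5, 7.3, 7.2])

query = 22587.0
k = 3

# By hand: rank instances by distance to the query, keep the k closest,
# and average their target values
distances = np.abs(gdp - query)
nearest = np.argsort(distances)[:k]
manual_prediction = satisfaction[nearest].mean()

# The same computation through Scikit-Learn
knn = KNeighborsRegressor(n_neighbors=k).fit(gdp.reshape(-1, 1), satisfaction)
sklearn_prediction = knn.predict([[query]])[0]

print("Nearest GDP values:", gdp[nearest])
print("Manual prediction :", manual_prediction)
print("Scikit-Learn      :", sklearn_prediction)
```

With a single feature, the default Euclidean distance reduces to an absolute difference, which is why the manual version matches.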

## 5. Chapter Summary

* **Machine Learning** is about building systems that learn from data rather than explicit rules.
* **Workflow:** A typical project involves fetching data, cleaning it, selecting a model, training it, and using it for prediction.
* **Model Selection:** You can choose between model-based approaches (like Linear Regression) which find a mathematical trend, or instance-based approaches (like k-NN) which rely on similarity to existing data.
* **Data Matters:** The quality and quantity of data are often more important than the sophistication of the algorithm. Bad data (outliers, missing values) or bad features lead to bad models.
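* **Evaluation:** Always set aside a **Test Set** and evaluate the model on it to estimate how it performs on unseen data. A test error much higher than the training error is the sign of overfitting; the test set detects the problem rather than preventing it.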
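
Holding out a test set takes one call to Scikit-Learn's `train_test_split`. This is a minimal sketch on synthetic data; the 80/20 split and the linear ground truth are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)

# Synthetic data (an assumption for illustration): y = 4 + 1.5x + noise
X = rng.uniform(0.0, 10.0, size=(100, 1))
y = 4.0 + 1.5 * X[:, 0] + rng.normal(0.0, 1.0, size=100)

# Hold out 20% of the instances; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)

train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

The two errors should be of similar size for a model that generalizes well; a test error far above the training error would point to overfitting.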