# Chapter 2

## Notations
- $m$ is the number of instances in the dataset.
- $\textbf{x}^{(i)}$ is a vector of all the feature values (excluding the label) of the $i^{th}$ instance in the dataset, and $y^{(i)}$ is its label (the desired output value for that instance).
- $\textbf{X}$ is a matrix containing all the feature values (excluding labels) of all instances in the dataset. There is one row per instance, and the $i^{th}$ row is equal to the transpose of $\textbf{x}^{(i)}$, noted $(\textbf{x}^{(i)})^T$.
- $h$ is the system's prediction function, also called a *hypothesis*.
- $\hat{y}^{(i)}=h(\textbf{x}^{(i)})$ is the predicted value for the $i^{th}$ instance.

We use lowercase italic font for scalar values (such as $m$ or $y^{(i)}$) and function names (such as $h$), lowercase bold font for vectors (such as $\textbf{x}^{(i)}$), and uppercase bold font for matrices (such as $\textbf{X}$).

## Measurements
### RMSE (Root Mean Square Error)  
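It corresponds roughly to the standard deviation of the errors the system makes in its predictions (exactly so when the mean error is zero), and it gives a higher weight to large errors.
$$\text{RMSE}(\textbf{X}, h)=\sqrt{\frac{1}{m}\sum_{i=1}^m\left(h(\textbf{x}^{(i)})-y^{(i)}\right)^2}$$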

### MAE (Mean Absolute Error)
$$\text{MAE}(\textbf{X},h)=\frac{1}{m}\sum_{i=1}^m\left|h(\textbf{x}^{(i)})-y^{(i)}\right|$$
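
As a minimal sketch, both measures can be evaluated directly from their formulas with NumPy. The function names `rmse` and `mae` and the arrays `y_true` and `y_pred` are made up for illustration; they stand in for the labels $y^{(i)}$ and the predictions $h(\textbf{x}^{(i)})$.

```python
import numpy as np

def rmse(y_pred, y_true):
    """Root mean square error: square the errors, average, take the root."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return np.sqrt(np.mean(errors ** 2))

def mae(y_pred, y_true):
    """Mean absolute error: average of the absolute errors."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return np.mean(np.abs(errors))

# Made-up values, not taken from any dataset.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 7.0])

print("RMSE:", rmse(y_pred, y_true))
print("MAE: ", mae(y_pred, y_true))
```

Here the errors are -1, 0, 2 and 0, so the MAE is $3/4 = 0.75$ and the RMSE is $\sqrt{5/4} \approx 1.118$. The RMSE is the larger of the two because squaring gives more weight to the single large error.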

### Distance measures
RMSE and MAE are both ways to measure the distance between two vectors: the vector of predictions and the vector of target values. Various distance measures, or *norms*, are possible:
- The $\ell_2$ norm, noted $||\cdot||_2$ (or just $||\cdot||$), is also known as the *Euclidean norm*. Computing the RMSE corresponds to this norm.
- The $\ell_1$ norm, noted $||\cdot||_1$, is sometimes called the *Manhattan norm*. Computing the MAE corresponds to this norm.
- More generally, the $\ell_k$ norm of a vector $\textbf{v}$ containing $n$ elements is defined as $||\textbf{v}||_k=(|v_1|^k+|v_2|^k+...+|v_n|^k)^{\frac{1}{k}}$. $\ell_0$ just gives the number of nonzero elements in the vector, and $\ell_\infty$ gives the maximum absolute value in the vector.
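
The sketch below evaluates the $\ell_k$ definition on a made-up error vector and checks it against NumPy's `np.linalg.norm`. The helper name `lk_norm` and the example values are assumptions chosen for illustration. It also shows how MAE and RMSE relate to the $\ell_1$ and $\ell_2$ norms of the error vector: they are the same quantities rescaled by $m$ and $\sqrt{m}$.

```python
import numpy as np

def lk_norm(v, k):
    """l_k norm of a 1-D vector for a finite k >= 1, straight from the definition."""
    v = np.asarray(v, dtype=float)
    return np.sum(np.abs(v) ** k) ** (1.0 / k)

# Made-up error vector (predictions minus targets) for m = 4 instances.
errors = np.array([3.0, -4.0, 0.0, 0.0])
m = len(errors)

l1 = lk_norm(errors, 1)          # 3 + 4 = 7
l2 = lk_norm(errors, 2)          # sqrt(9 + 16) = 5
l_inf = np.max(np.abs(errors))   # largest absolute value = 4

# NumPy's built-in norm agrees with the hand-written versions.
assert np.isclose(l1, np.linalg.norm(errors, ord=1))
assert np.isclose(l2, np.linalg.norm(errors, ord=2))
assert np.isclose(l_inf, np.linalg.norm(errors, ord=np.inf))

# MAE and RMSE are rescaled l1 and l2 norms of the error vector.
mae = l1 / m             # 7 / 4
rmse = l2 / np.sqrt(m)   # 5 / 2
print(mae, rmse)
```

The higher the norm index, the more the result is dominated by the largest values: here $\ell_1 = 7$, $\ell_2 = 5$ and $\ell_\infty = 4$.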

## Modules and dependencies
  

```
$ pip3 install --upgrade jupyter matplotlib numpy pandas scipy scikit-learn
```

Documentation:

- [Jupyter](https://jupyter.readthedocs.io/en/latest/)
- [Matplotlib](https://matplotlib.org)
- [NumPy](https://docs.scipy.org/doc/numpy/user/index.html#user)
- [Pandas](http://pandas.pydata.org/pandas-docs/stable/)
- [SciPy](http://docs.scipy.org/doc/scipy/reference/)
- [Scikit-Learn](http://scikit-learn.org/stable/documentation.html)

## Creating training and test sets

### Simply shuffle

```python
import numpy as np

def split_train_test(data, test_ratio):
    # Shuffle the row positions, then use the first test_ratio share as the test set.
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

# train_set, test_set = split_train_test(housing, 0.2)
```

This method just picks some instances at random and sets them aside. It works, but it is not perfect: if you run the program again, it will generate a different test set. Over time, you (or your Machine Learning algorithms) will get to see the whole dataset, which is what you want to avoid.

### Fix random number generator's seed

```python
import numpy as np

def split_train_test_with_fixed_seed(data, test_ratio, seed=42):
    # Seeding the generator makes the permutation, and therefore the split, repeatable.
    np.random.seed(seed)
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
```

Another option is to save the test set on the first run and then load it in subsequent runs. However, both of these solutions will break the next time you fetch an updated dataset.
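
The save-and-reload option could look roughly like the sketch below. The function name `load_or_create_split`, the CSV file names, and the toy DataFrame are all assumptions made for illustration; any storage format would do.

```python
import os
import tempfile

import numpy as np
import pandas as pd

def load_or_create_split(data, test_ratio, train_path, test_path, seed=42):
    """Reuse a saved split if both files exist; otherwise create one and save it."""
    if os.path.exists(train_path) and os.path.exists(test_path):
        return pd.read_csv(train_path, index_col=0), pd.read_csv(test_path, index_col=0)
    rng = np.random.default_rng(seed)
    shuffled_indices = rng.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_set = data.iloc[shuffled_indices[:test_set_size]]
    train_set = data.iloc[shuffled_indices[test_set_size:]]
    train_set.to_csv(train_path)
    test_set.to_csv(test_path)
    return train_set, test_set

# Toy data standing in for a real dataset.
data = pd.DataFrame({"feature": np.arange(10), "label": np.arange(10) * 2})

with tempfile.TemporaryDirectory() as tmp_dir:
    train_path = os.path.join(tmp_dir, "train.csv")
    test_path = os.path.join(tmp_dir, "test.csv")
    # First run: the split is created and written to disk.
    first_train, first_test = load_or_create_split(data, 0.2, train_path, test_path)
    # Second run: the saved files are loaded, so the seed no longer matters.
    second_train, second_test = load_or_create_split(data, 0.2, train_path, test_path, seed=0)

print(sorted(first_test.index) == sorted(second_test.index))
```

The saved split stays fixed only as long as the files describe the same dataset, so it does not help once new rows arrive.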

### With identifier

```python
import hashlib

import numpy as np

def test_set_check(identifier, test_ratio, hash):
    # Hash the identifier and keep the instance if the digest's last byte
    # (0 to 255) falls below 256 * test_ratio.
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set]

# Option 1: use the row index as the identifier.
# housing_with_id = housing.reset_index()      # adds an `index` column
# train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")

# Option 2: build an identifier from stable features.
# housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
# train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")
```

Use each instance's identifier to decide whether or not it should go in the test set (assuming instances have a unique and immutable identifier). For example, compute a hash of the identifier, keep only the last byte of the hash, and put the instance in the test set if that value is below 256 times the test ratio.  
However, the housing dataset does not have an identifier column. The simplest solution is to use the row index as the ID. If you do, you need to make sure that new data gets appended to the end of the dataset and that no row ever gets deleted.  
If that is not possible, you can try to use the most stable features to build a unique identifier, for example a district's latitude and longitude.

### Scikit-Learn built-in functions

```python
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
```

### Stratified sampling

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Bucket median income into categories 1 to 5; everything above 5 is merged into 5.
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    # split() yields positional indices, so select rows with iloc.
    strat_train_set = housing.iloc[train_index]
    strat_test_set = housing.iloc[test_index]

# Remove the helper column so the data is back to its original state.
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)
```
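
One way to check that a stratified split did its job is to compare the category proportions in the test set with those in the full dataset. The sketch below does this on synthetic data; the DataFrame `toy` and its category probabilities are invented for illustration and do not come from the housing dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the housing data: 1,000 rows with a skewed category column.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "income_cat": rng.choice([1.0, 2.0, 3.0, 4.0, 5.0], size=1000,
                             p=[0.05, 0.30, 0.35, 0.20, 0.10]),
})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_index, test_index = next(split.split(toy, toy["income_cat"]))
strat_test_set = toy.iloc[test_index]

# Share of each category in the full data and in the stratified test set.
overall = toy["income_cat"].value_counts(normalize=True).sort_index()
stratified = strat_test_set["income_cat"].value_counts(normalize=True).sort_index()
comparison = pd.DataFrame({"overall": overall, "stratified_test": stratified})
print(comparison)
```

The two columns should be nearly identical, whereas a purely random split of the same size can drift further from the overall proportions, especially for rare categories.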