Logs   
- [2023/03/08]   
  Restart this notebook if you change the scratch library

In [4]:
import numpy as np
import matplotlib.pyplot as plt

from typing import TypeVar, List, Tuple


In [5]:
plt.rcParams.update(plt.rcParamsDefault)
plt.rcParams.update({
  'font.size': 16,
  'grid.alpha': 0.25})

## Modeling

- A business model (built from estimates)
- Cookbook recipes (a recipe is a model, constructed by trial and error)
- Win probability in a poker game (a probabilistic model of how the players
  draw cards)

## What is Machine Learning

(Grus, 2019) Machine learning is the creation and use of models that are
learned from data.

Some examples:
- Whether an email message is spam or not
- Whether a credit card transaction is fraudulent
- Which advertisement a shopper is most likely to click on
- Which football team is going to win the Super Bowl

## Overfitting and Underfitting

Overfitting occurs when a machine learning model performs well on the data
it was trained on but generalizes poorly to new data.

Underfitting occurs when a machine learning model does not perform well even
on the training data. When this happens, we typically decide the model isn't
good enough and keep looking for a better one.
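
The two definitions above can be illustrated with a small experiment. This is a minimal sketch, not part of the original notebook: the true relationship `y = 2x + 1`, the noise level, the seed, and the helper names `make_data` and `mse` are all assumptions chosen for illustration. It fits polynomials of degree 0, 1, and 9 to ten noisy points and compares the error on the training points with the error on fresh points.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)   # arbitrary seed, only for reproducibility

def make_data(n):
    """Noisy samples of an assumed true relationship y = 2x + 1."""
    x = np.linspace(-1, 1, n)
    y = 2 * x + 1 + rng.normal(scale=0.3, size=n)
    return x, y

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

x_train, y_train = make_data(10)   # data the models are fit on
x_new, y_new = make_data(50)       # fresh data the models never saw

train_mse, new_mse = {}, {}
for degree in (0, 1, 9):
    model = Polynomial.fit(x_train, y_train, degree)   # least-squares fit
    train_mse[degree] = mse(model, x_train, y_train)
    new_mse[degree] = mse(model, x_new, y_new)
    print(f"degree {degree}: train MSE = {train_mse[degree]:.4f}, "
          f"new-data MSE = {new_mse[degree]:.4f}")
```

The degree 0 fit (a horizontal line) has a large error even on the training points, which is underfitting. The degree 9 fit passes through all ten training points, so its training error is essentially zero, but it will typically do worse than the degree 1 fit on the fresh points, which is overfitting.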

In [6]:
rng = np.random.default_rng(2023)

a = [1, 2, 3, 4, 5]
rng.shuffle(a)
a

[3, 2, 1, 5, 4]

In [7]:
X = TypeVar('X')   # generic type to represent a data point

def split_data(data: List[X], prob: float) -> Tuple[List[X], List[X]]:
  """Split data into fractions [prob, 1 - prob]"""
  data = data[:]                # Make a shallow copy because shuffle modifies the list

  seed = 2023_04_13             # Fixed seed so the split is reproducible
  rng = np.random.default_rng(seed)
  rng.shuffle(data)             # Shuffle the copy in place
  cut = int(len(data) * prob)   # Use prob to find a cutoff
  return data[:cut], data[cut:]


data = [n for n in range(1_000)]
train, test = split_data(data, 0.75)

# The proportions should be correct
print(f"len(train): {len(train)}")
print(f"len(test): {len(test)}")

# And the original data should be preserved (in some order)
print(sorted(train + test))

len(train): 750
len(test): 250
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215

In [8]:
Y = TypeVar('Y')      # generic type to represent output variables

def train_test_split(xs: List[X],
                     ys: List[Y],
                     test_pct: float) -> Tuple[List[X], List[X], List[Y], List[Y]]:
  """Split paired inputs and outputs so that each x stays aligned with its y"""

  # Generate the indices and split them
  idxs = [i for i in range(len(xs))]
  train_idxs, test_idxs = split_data(idxs, 1 - test_pct)

  return ([xs[i] for i in train_idxs],    # x_train
          [xs[i] for i in test_idxs],     # x_test
          [ys[i] for i in train_idxs],    # y_train
          [ys[i] for i in test_idxs])     # y_test

In [12]:
xs = [x for x in range(1000)]       # xs are 0 ... 999
ys = [2 * x for x in xs]            # each y_i is twice x_i
x_train, x_test, y_train, y_test = train_test_split(xs, ys, 0.25)

print(f"len(x_train): {len(x_train)}")
print(f"len(y_train): {len(y_train)}")
print(f"len(x_test): {len(x_test)}")
print(f"len(y_test): {len(y_test)}")

print(all(y == 2 * x for x, y in zip(x_train, y_train)))
print(all(y == 2 * x for x, y in zip(x_test, y_test)))

len(x_train): 750
len(y_train): 750
len(x_test): 250
len(y_test): 250
True
True


A bigger problem arises if you use the train/test split not just to judge a
model but also to *choose* from among many models. In that case, although each
individual model may not be overfit, "choosing the model that performs best on
the test set" is a form of meta-training that makes the test set function as a
second training set. (Of course the model that performed best on the test set
is going to perform well on the test set.)

In such a situation, you should split the data into three parts: a training set
for building models, a *validation* set for choosing among trained models, and a
test set for judging the final model.
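
A three-way split can be sketched in the same style as `split_data`. This is a minimal illustration rather than code from the original notebook: the function name `three_way_split`, its parameters, and the 60/20/20 proportions are assumptions chosen for the example.

```python
import numpy as np
from typing import List, Tuple, TypeVar

X = TypeVar('X')   # generic type to represent a data point

def three_way_split(data: List[X],
                    train_frac: float,
                    val_frac: float,
                    seed: int = 0) -> Tuple[List[X], List[X], List[X]]:
  """Split data into train, validation, and test parts (hypothetical helper)"""
  data = data[:]                              # Copy, because shuffle works in place
  rng = np.random.default_rng(seed)           # Fixed seed for a reproducible split
  rng.shuffle(data)

  n = len(data)
  train_end = round(n * train_frac)           # End of the training part
  val_end = train_end + round(n * val_frac)   # End of the validation part
  return data[:train_end], data[train_end:val_end], data[val_end:]


train, validation, test = three_way_split(list(range(1_000)), 0.6, 0.2)

print(f"len(train): {len(train)}")
print(f"len(validation): {len(validation)}")
print(f"len(test): {len(test)}")
```

Under this scheme, candidate models are fit on `train`, compared on `validation`, and only the single chosen model is evaluated on `test`, so the test set stays untouched until the final judgment.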

## Correctness

## The Bias-Variance Tradeoff

## Feature Extraction and Selection