- **Bias**: LOOCV is an almost unbiased estimate of the true generalization error in many theoretical contexts. K-fold can have a slight bias, but it is typically not large in practice, especially for 5-fold or 10-fold.
- **Variance**: LOOCV often has higher variance, which can overshadow any advantage of lower bias.
- **Practical result**: Because of the higher variance, LOOCV estimates can be less stable and might mislead you about performance on truly unseen data. Having “almost no bias” therefore does not guarantee a better real-world estimate if the variance is large.
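
One standard way to formalize this trade-off is the decomposition below. It is a sketch under assumptions that are not part of the notes above: $\widehat{\mathrm{Err}}$ denotes the cross-validation estimate, $\mathrm{Err}$ denotes the true generalization error (treated as a fixed number), and the expectation is taken over random draws of the dataset.

$$
\mathbb{E}\big[(\widehat{\mathrm{Err}} - \mathrm{Err})^2\big] = \big(\mathbb{E}[\widehat{\mathrm{Err}}] - \mathrm{Err}\big)^2 + \mathrm{Var}\big(\widehat{\mathrm{Err}}\big)
$$

The first term is the squared bias and the second is the variance, so an estimator with near-zero bias can still land further from the truth on average when its variance is large.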


<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/4b/KfoldCV.gif/1597px-KfoldCV.gif" width="750" align="center">


# LOOCV Can Have Higher Variance

**LOOCV** uses nearly the entire dataset for training (n−1 samples out of n) for each fold, so every trained model is extremely similar to the others—only one sample differs.

- **Small test sets (size = 1 per fold)**: The estimate on each fold is based on a single data point’s error, so each “fold error” can fluctuate quite a bit, which contributes to higher variance when you average them all.
- **High correlation among folds**: Any two training sets share n−2 of their n−1 samples, so the learned models are highly correlated. Averaging highly correlated errors reduces variance far less than averaging independent ones.
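
The first point can be illustrated with a small simulation. This is a minimal sketch, not a definitive experiment: the synthetic linear data, the helper name `fold_errors`, and the choice of a least-squares line via `np.polyfit` are all assumptions made for illustration. It compares how much the per-fold errors spread out when each test fold holds one point (LOOCV) versus ten points (10-fold).

```python
import numpy as np

def fold_errors(X, y, k, rng):
    """Per-fold mean squared error of a least-squares line under k-fold CV.

    Hypothetical helper for illustration; k = len(X) gives LOOCV.
    """
    n = len(X)
    # Shuffle the indices, then split them into k nearly equal test folds
    folds = np.array_split(rng.permutation(n), k)
    errors = []
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Fit y = slope * x + intercept on the training samples only
        slope, intercept = np.polyfit(X[train], y[train], deg=1)
        pred = slope * X[test_idx] + intercept
        errors.append(np.mean((y[test_idx] - pred) ** 2))
    return np.array(errors)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 100)              # assumed synthetic inputs
y = 2.0 * X + rng.normal(0.0, 1.0, 100)  # assumed linear signal plus noise

loo_errors = fold_errors(X, y, k=len(X), rng=rng)
kfold_errors = fold_errors(X, y, k=10, rng=rng)

print("LOOCV   mean:", loo_errors.mean(), " per-fold std:", loo_errors.std())
print("10-fold mean:", kfold_errors.mean(), " per-fold std:", kfold_errors.std())
```

Note that the per-fold spread shown here is not the same thing as the variance of the final averaged estimate; measuring that would require repeating the whole procedure on many independently drawn datasets.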




# K-fold (e.g., 5-fold or 10-fold)

- **Larger test sets (n/k)**: Each fold’s test error is an average over more data points (instead of a single one), which helps stabilize the estimate.
- **Slightly smaller training sets (n − n/k)**: Each fold’s model is trained on fewer samples than in LOOCV, but this often does not significantly harm the error estimate in practice.
- **Lower variance**: Because each fold’s test error is averaged over multiple points, the final mean error tends to have lower variance.
- **Commonly used**: 5-fold and 10-fold CV are popular because they balance a reliable estimate of out-of-sample error against reduced computational cost.
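
The sizes above can be worked out for a concrete case. The following is a small sketch that assumes n = 100 (the size of the example dataset in these notes) and a k that divides n evenly; `fold_sizes` is a hypothetical helper name.

```python
n = 100  # assumed dataset size, matching the 100-sample example in these notes

def fold_sizes(n, k):
    """Return (train_size, test_size) per fold, assuming k divides n evenly."""
    test_size = n // k
    return n - test_size, test_size

for k in (5, 10, n):  # k = n corresponds to LOOCV
    train_size, test_size = fold_sizes(n, k)
    print(f"k={k:>3}: train on {train_size}, test on {test_size}")

# k=  5: train on 80, test on 20
# k= 10: train on 90, test on 10
# k=100: train on 99, test on 1
```

As k grows, the training set approaches the full dataset (less bias) while each test fold shrinks toward a single point (noisier per-fold errors) and the number of models to fit increases.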


- **Bias**: k-fold has slightly higher bias than LOOCV, because each model is trained on n − n/k samples rather than n − 1.
- **Variance**: k-fold often has lower variance, which can outweigh the cost of the slightly higher bias.

In [None]:

import numpy as np
np.random.seed(10)
X = np.random.uniform(-100, 100, (100, 1))
y = np.random.rand(100, 1)

indices = np.arange(len(X))
np.random.shuffle(indices)

In [None]:
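batch_size = 5  # test-set size per fold, giving 100 / 5 = 20 folds
for i in range(0, len(X), batch_size):
    # the next batch_size shuffled indices form the test fold
    fold_idx = indices[i:i+batch_size]
    x_test, y_test = X[fold_idx], y[fold_idx]

    # boolean mask that selects every sample outside the test fold
    mask = np.ones(len(X), dtype=bool)
    mask[fold_idx] = False

    x_train, y_train = X[mask], y[mask]
    print(x_train.shape, y_train.shape)
    print(x_test.shape, y_test.shape)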



(95, 1) (95, 1)
(5, 1) (5, 1)
(95, 1) (95, 1)
(5, 1) (5, 1)
(95, 1) (95, 1)
(5, 1) (5, 1)
(95, 1) (95, 1)
(5, 1) (5, 1)
(95, 1) (95, 1)
(5, 1) (5, 1)
(95, 1) (95, 1)
(5, 1) (5, 1)
(95, 1) (95, 1)
(5, 1) (5, 1)
(95, 1) (95, 1)
(5, 1) (5, 1)
(95, 1) (95, 1)
(5, 1) (5, 1)
(95, 1) (95, 1)
(5, 1) (5, 1)
(95, 1) (95, 1)
(5, 1) (5, 1)
(95, 1) (95, 1)
(5, 1) (5, 1)
(95, 1) (95, 1)
(5, 1) (5, 1)
(95, 1) (95, 1)
(5, 1) (5, 1)
(95, 1) (95, 1)
(5, 1) (5, 1)
(95, 1) (95, 1)
(5, 1) (5, 1)
(95, 1) (95, 1)
(5, 1) (5, 1)
(95, 1) (95, 1)
(5, 1) (5, 1)
(95, 1) (95, 1)
(5, 1) (5, 1)
(95, 1) (95, 1)
(5, 1) (5, 1)


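Interview note: half an hour was spent on all kinds of questions about k-fold validation, such as how to do it, why cross validation is used, the bias/variance trade-off, and the effect of making k larger or smaller. The questions were very, very detailed, and a lot of it felt like material from master's courses. Some of it I had honestly forgotten.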

In [None]:
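def k_fold_cv(data, k):
    # data: NumPy array with one sample per row; k: number of folds
    n = data.shape[0]
    # np.random.shuffle works in place and returns None,
    # so take a shuffled copy with np.random.permutation instead
    data = np.random.permutation(data)
    # np.array_split yields exactly k folds even when k does not divide n
    folds = np.array_split(np.arange(n), k)

    for fold_idx in folds:
        mask = np.ones(n, dtype=bool)
        mask[fold_idx] = False

        train_data = data[mask]
        test_data = data[~mask]
        ...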
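
The splitting logic of the function above can also be sketched as a generator of index arrays, which makes it easy to check the defining property of k-fold CV: every sample lands in a test fold exactly once. The name `k_fold_splits`, the `seed` parameter, and the choice of n = 103 are assumptions made for illustration, not part of the original notes.

```python
import numpy as np

def k_fold_splits(n, k, seed=0):
    """Yield (train_idx, test_idx) index arrays for k-fold cross validation.

    Hypothetical helper for illustration; requires 2 <= k <= n.
    """
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    rng = np.random.default_rng(seed)
    # Shuffle once, then cut the shuffled indices into k nearly equal folds
    folds = np.array_split(rng.permutation(n), k)
    for i, test_idx in enumerate(folds):
        # Training indices are all folds except the i-th one
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

n, k = 103, 5  # n deliberately not divisible by k
test_counts = np.zeros(n, dtype=int)
fold_sizes = []
for train_idx, test_idx in k_fold_splits(n, k):
    test_counts[test_idx] += 1
    fold_sizes.append(len(test_idx))

print(fold_sizes)                # [21, 21, 21, 20, 20]
print((test_counts == 1).all())  # True: each sample is tested exactly once
```

Splitting with `np.array_split` spreads the remainder over the first folds, so fold sizes differ by at most one instead of leaving a small leftover fold at the end.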
        
        
        
    