# Lecture 24: Machine learning related packages

In this lecture, we will look at some packages and functions.
Main topic: `KFold`.

Optional: `autograd` and `pytorch`, through some simple examples, and `scitime`, which can be helpful when training neural networks with multiple hidden layers (>= 3) and/or on large datasets.

In [1]:
import numpy as np
from sklearn.model_selection import KFold

In [22]:
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([17, 23, 37, 41])
kf = KFold(n_splits=4) # n_splits is the number of splits
kf.get_n_splits(X)

print(kf)  

KFold(n_splits=4, random_state=None, shuffle=False)


In [23]:
type(kf)

sklearn.model_selection._split.KFold

In [24]:
kf.n_splits

4

In [25]:
kf.split(X) # generator object is iterable

<generator object _BaseKFold.split at 0x00000297B5406DE0>

In [26]:
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    # index the labels with the same fold indices as the features
    y_train, y_test = y[train_index], y[test_index]
    print()
    print(X_train)
    print(y_train,'\n')

TRAIN: [1 2 3] TEST: [0]

[[3 4]
 [5 6]
 [7 8]]
[23 37 41] 

TRAIN: [0 2 3] TEST: [1]

[[1 2]
 [5 6]
 [7 8]]
[17 37 41] 

TRAIN: [0 1 3] TEST: [2]

[[1 2]
 [3 4]
 [7 8]]
[17 23 41] 

TRAIN: [0 1 2] TEST: [3]

[[1 2]
 [3 4]
 [5 6]]
[17 23 37] 
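
By default `KFold` splits the rows in their original order. The following is a minimal sketch, assuming a made-up 10-sample array, of how `shuffle=True` with a fixed `random_state` gives a reproducible random split, and of the fact that every sample lands in a test fold exactly once. The variable names `all_test_indices` and `covered` are chosen for this illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold

# Illustrative data: 10 samples with 2 features each
X_demo = np.arange(20).reshape(10, 2)

# shuffle=True permutes the rows before splitting;
# random_state makes the permutation reproducible
kf_shuffled = KFold(n_splits=5, shuffle=True, random_state=0)

all_test_indices = []
for train_index, test_index in kf_shuffled.split(X_demo):
    # within one fold, training and test indices never overlap
    assert len(np.intersect1d(train_index, test_index)) == 0
    all_test_indices.extend(test_index)

# across all folds, each sample index is used for testing exactly once
covered = sorted(int(i) for i in all_test_indices)
print(covered)
```

Shuffling is usually worth considering when the rows are stored in a meaningful order (for example sorted by label), since contiguous folds would then not be representative.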



## Enumerate iterable

In [9]:
items = ['x1', 'x2', 'x3', 24]  # named `items` so that the built-in `list` is not shadowed

In [11]:
for x in items:
    print(x)

x1
x2
x3
24


In [12]:
for i, x in enumerate(items):
    print(i, x) # i is the index of x in items

0 x1
1 x2
2 x3
3 24
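
As a small aside, `enumerate` also accepts a `start` argument, which is handy when a counter should begin at 1 (for instance when printing fold numbers). A short sketch, with the names `items` and `pairs` chosen for illustration:

```python
items = ['x1', 'x2', 'x3', 24]

# start=1 shifts the counter; the elements themselves are unchanged
pairs = [(i, x) for i, x in enumerate(items, start=1)]
for i, x in pairs:
    print(i, x)
```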


In [18]:
from sklearn.linear_model import LinearRegression

In [27]:
y_oof_pred = np.zeros_like(y) # same shape and dtype as y; y holds integers, so predictions stored here are truncated to integers

for fold_no, (index_tr, index_cv) in enumerate(kf.split(X)):
    print("\n\nFold number:" , fold_no)
    X_tr, y_tr = X[index_tr], y[index_tr]
    # do whatever training using X_tr, y_tr
    reg = LinearRegression()
    reg.fit(X_tr,y_tr)
    print("Training data {0}, labels{1}".format(X_tr, y_tr))
    
    X_cv, y_cv = X[index_cv], y[index_cv]
    print("CV data {0}, labels{1}".format(X_cv, y_cv))
    # predict on X_cv with the model just trained on the other folds;
    # this is the out-of-fold (OOF) prediction
    y_oof_pred[index_cv] = reg.predict(X_cv)
    



Fold number: 0
Training data [[3 4]
 [5 6]
 [7 8]], labels[23 37 41]
CV data [[1 2]], labels[17]


Fold number: 1
Training data [[1 2]
 [5 6]
 [7 8]], labels[17 37 41]
CV data [[3 4]], labels[23]


Fold number: 2
Training data [[1 2]
 [3 4]
 [7 8]], labels[17 23 41]
CV data [[5 6]], labels[37]


Fold number: 3
Training data [[1 2]
 [3 4]
 [5 6]], labels[17 23 37]
CV data [[7 8]], labels[41]


In [28]:
print(y_oof_pred, y)

[15 26 32 45] [17 23 37 41]
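
The OOF predictions printed above are whole numbers because `np.zeros_like(y)` inherits the integer dtype of `y`. The sketch below, which repeats the toy data so that it runs on its own, stores the predictions in a float array instead and compares them with `cross_val_predict` from `sklearn.model_selection`, which performs the same fit-and-predict loop. The names `y_oof_float`, `y_oof_cvp`, and `oof_mse` are introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([17, 23, 37, 41])
kf = KFold(n_splits=4)

# A float array keeps the fractional part of each prediction
y_oof_float = np.zeros(y.shape, dtype=float)
for index_tr, index_cv in kf.split(X):
    reg = LinearRegression().fit(X[index_tr], y[index_tr])
    y_oof_float[index_cv] = reg.predict(X[index_cv])

# cross_val_predict fits one model per fold and collects the held-out predictions
y_oof_cvp = cross_val_predict(LinearRegression(), X, y, cv=kf)

# mean squared error of the out-of-fold predictions
oof_mse = np.mean((y_oof_float - y) ** 2)
print(y_oof_float)
print(y_oof_cvp)
print("OOF mean squared error:", oof_mse)
```

Dropping the fractional parts of `y_oof_float` gives back the integer values shown earlier.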



# Optional packages to learn

## Installation

Neither `autograd` nor `pytorch` is included in a typical Anaconda installation. Here are instructions for installing them on macOS and Windows (the desktops in the MSTB/ALP labs).

### Autograd
On both macOS and Windows, run the following command at the Anaconda prompt (Windows) or on the command line (Linux, or a BSD-based system like macOS):
```
conda install -c conda-forge autograd
```

### PyTorch
On macOS (no CUDA support), run the following command:
```
conda install pytorch torchvision -c pytorch
```
On Windows, to install the CPU-only version:
```
conda install pytorch-cpu torchvision-cpu -c pytorch
```
If you have an NVIDIA GPU on Windows or Linux, you can install the CUDA-enabled version as well:
```
conda install pytorch torchvision cudatoolkit=x.0 -c pytorch
```
Change `x` to `10` or `9` depending on your CUDA toolkit version. On Windows, you can determine the CUDA toolkit version from [your driver's version](https://docs.nvidia.com/deploy/cuda-compatibility/index.html); on Linux, you can run `nvcc --version` if CUDA is installed and PATH is properly configured.

## Autograd

The `autograd` package provides automatic differentiation.

Installation instructions: see the Installation section above.

References: 
* [Autograd: Effortless Gradients in Numpy](https://indico.lal.in2p3.fr/event/2914/contributions/6483/subcontributions/180/attachments/6060/7185/automl-short.pdf)
* [Autograd Tutorial](http://www.cs.toronto.edu/~rgrosse/courses/csc321_2017/tutorials/tut4.pdf)

If you are interested in how `autograd` is implemented, you can refer to ["Autodidact: a pedagogical implementation of Autograd"](https://github.com/mattjj/autodidact).

In [None]:
import autograd.numpy as np
from autograd import grad

In [None]:
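def taylor_sine(x):  # Taylor expansion of the sine function around 0
    ans = x
    current_term = x
    i = 0
    while np.abs(current_term) > 1e-6: # stop once the latest term is below 1e-6 in absolute value
        # each term is obtained from the previous one: multiply by -x^2 / ((2i+3)(2i+2))
        current_term = -current_term * x**2 / ((2*i + 3) * (2*i + 2))
        ans += current_term
        i += 1
    return ans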

grad_sine = grad(taylor_sine)

In [None]:
type(grad_sine)

In [None]:
from math import pi
grad_sine(pi)
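
Since the derivative of sine is cosine, the value returned by `grad_sine(pi)` should be close to `cos(pi) = -1`. One way to sanity-check this without `autograd` is a central finite difference. The sketch below uses only the standard library; `central_difference` is a helper written for this illustration, not part of `autograd`.

```python
import math

def taylor_sine(x):
    """Taylor expansion of sine around 0, truncated once a term drops below 1e-6."""
    ans = x
    current_term = x
    i = 0
    while abs(current_term) > 1e-6:
        current_term = -current_term * x**2 / ((2*i + 3) * (2*i + 2))
        ans += current_term
        i += 1
    return ans

def central_difference(f, x, h=1e-5):
    """Approximate f'(x) by (f(x + h) - f(x - h)) / (2h)."""
    return (f(x + h) - f(x - h)) / (2 * h)

approx_derivative = central_difference(taylor_sine, math.pi)
print("finite difference:", approx_derivative)
print("cos(pi):          ", math.cos(math.pi))
```

Automatic differentiation differs from this check in that it differentiates the operations actually executed in the loop, so it does not depend on a step size `h`.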

## Exercise for yourself to try
Recall that we have tried to compute the gradient of the Beale function by hand (refer to Lab 6 practice).

In [None]:
import autograd.numpy as np
from autograd import value_and_grad
from scipy.optimize import minimize

def beale(x):
    term1 = (1.5 - x[0] + x[0]*x[1])**2 
    term2 = (2.25 - x[0] + x[0]*x[1]**2)**2 
    term3 = (2.625 - x[0] + x[0]*x[1]**3)**2
    return term1+term2+term3

# Build a function that also returns gradients using autograd.
beale_with_grad = value_and_grad(beale)

# Optimize using conjugate gradients.
result = minimize(beale_with_grad, x0=np.array([3.0, 2.0]), jac=True, method='CG')
print("The minimum is achieved at", result.x, "with value", result.fun)

In [None]:
beale_with_grad(np.array([0, 2.0]))
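
For comparison with the `autograd` result, here is a sketch of the gradient worked out by hand with the chain rule, checked against a central finite difference. It uses plain `numpy` only; `beale_grad` and `numeric_grad` are names introduced for this illustration.

```python
import numpy as np

def beale(x):
    t1 = 1.5 - x[0] + x[0] * x[1]
    t2 = 2.25 - x[0] + x[0] * x[1] ** 2
    t3 = 2.625 - x[0] + x[0] * x[1] ** 3
    return t1 ** 2 + t2 ** 2 + t3 ** 2

def beale_grad(x):
    """Hand-derived gradient: differentiate each squared term with the chain rule."""
    t1 = 1.5 - x[0] + x[0] * x[1]
    t2 = 2.25 - x[0] + x[0] * x[1] ** 2
    t3 = 2.625 - x[0] + x[0] * x[1] ** 3
    d_x0 = 2 * t1 * (x[1] - 1) + 2 * t2 * (x[1] ** 2 - 1) + 2 * t3 * (x[1] ** 3 - 1)
    d_x1 = 2 * t1 * x[0] + 2 * t2 * (2 * x[0] * x[1]) + 2 * t3 * (3 * x[0] * x[1] ** 2)
    return np.array([d_x0, d_x1])

def numeric_grad(f, x, h=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    g = np.zeros(len(x))
    for k in range(len(x)):
        e = np.zeros(len(x))
        e[k] = h
        g[k] = (f(x + e) - f(x - e)) / (2 * h)
    return g

point = np.array([0.0, 2.0])
print("value:           ", beale(point))
print("hand gradient:   ", beale_grad(point))
print("numeric gradient:", numeric_grad(beale, point))

# At (3, 0.5) all three terms vanish, so both the value and the gradient are zero
minimizer = np.array([3.0, 0.5])
print(beale(minimizer), beale_grad(minimizer))
```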

## PyTorch
In the following few cells, we replicate what we did in the last two lectures using PyTorch, an open-source platform.

### First let us see a plain and simple numpy neural net

In [None]:
# a minimal example of least-square loss (no activation on the last layer) on a neural net
import numpy as np

# N is the sample size (or current mini-batch size); 
# D_in is input dimension;
# N_H is hidden dimension; 
# D_out is output dimension.
N, D_in, N_H, D_out = 64, 1000, 100, 3

# Create random input and output data
# np.random.seed(666)
X = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights, no bias since X, y are centered at 0
# w1: weights for input layer -> hidden layer
# w2: weights for hidden layer -> output layer
w1 = np.random.randn(D_in, N_H)
w2 = np.random.randn(N_H, D_out)

# step size/learning rate
eta = 1e-6
for m in range(1000):
    # Forward pass: compute predicted y
    z = np.matmul(X,w1)
    a = z*(z>0) # relu
    y_pred = np.matmul(a,w2)

    # Backprop to compute gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = np.matmul(a.T, grad_y_pred)
    grad_a = np.matmul(grad_y_pred, w2.T)
    grad_w1 = np.matmul(X.T, grad_a*(z>0))

    # Update weights
    w1 -= eta * grad_w1
    w2 -= eta * grad_w2
    
    # Compute and print loss (evaluated with the weights before this update,
    # i.e. after m updates)
    loss = np.sum((y_pred - y)**2)
    if m % 100 == 0:
        print("LS loss after", m, 
              "iterations is",loss, 
              "with a training R squared", 1 - loss/(np.sum((y- np.mean(y))**2)))
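
The hand-written backward pass above can be verified numerically. The following sketch, on a much smaller network chosen only for illustration, compares the backprop formula for the gradient with respect to `w1` against central finite differences of the loss. The sizes, the seed, and the helper `loss_fn` are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_in, N_H, D_out = 5, 4, 3, 2  # tiny sizes so the check runs quickly
X = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, N_H))
w2 = rng.standard_normal((N_H, D_out))

def loss_fn(w1, w2):
    """Least-squares loss of the two-layer ReLU network."""
    z = X @ w1
    a = z * (z > 0)  # relu
    return np.sum((a @ w2 - y) ** 2)

# Analytic gradients, same formulas as the training loop above
z = X @ w1
a = z * (z > 0)
grad_y_pred = 2.0 * (a @ w2 - y)
grad_w2 = a.T @ grad_y_pred
grad_w1 = X.T @ ((grad_y_pred @ w2.T) * (z > 0))

# Numerical gradient with respect to w1, one entry at a time
h = 1e-6
num_grad_w1 = np.zeros_like(w1)
for i in range(D_in):
    for j in range(N_H):
        e = np.zeros_like(w1)
        e[i, j] = h
        num_grad_w1[i, j] = (loss_fn(w1 + e, w2) - loss_fn(w1 - e, w2)) / (2 * h)

max_err = np.max(np.abs(num_grad_w1 - grad_w1))
print("largest difference between backprop and finite differences:", max_err)
```

The check can fail if some entry of `z` happens to be extremely close to zero, because ReLU is not differentiable there; with random Gaussian data this is very unlikely.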

### Let us look at the PyTorch version

The following example is slightly adapted from the PyTorch tutorial.

In [None]:
import torch

In [None]:
dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is the sample size (or current mini-batch size); 
# D_in is input dimension;
# N_H is hidden dimension; 
# D_out is output dimension.
N, D_in, N_H, D_out = 64, 1000, 100, 3

# Create random Tensors to hold input and outputs.
# requires_grad defaults to False, which indicates that we do not need to
# compute gradients with respect to these Tensors during the backward pass.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
w1 = torch.randn(D_in, N_H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(N_H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for m in range(1000):
    # Forward pass: compute predicted y using operations on Tensors; these
    # are the same operations as in the numpy forward pass above,
    # but we do not need to keep references to intermediate values since
    # we are not implementing the backward pass by hand.
    # torch.mm() is the matrix multiplication in PyTorch
    # https://pytorch.org/docs/0.4.0/torch.html#torch.mm
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute and print loss using operations on Tensors.
    # Now loss is a 0-dimensional Tensor (a single number);
    # loss.item() gets the scalar value held in the loss.
    loss = (y_pred - y).pow(2).sum()
    if m % 100 == 0:
        print("LS loss after", m, 
              "iterations is",loss.item(), 
              "with a training R squared", 1 - (loss.item())/((y- y.mean()).pow(2).sum().item()))

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call w1.grad and w2.grad will be Tensors holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    loss.backward()

    # Manually update weights using gradient descent. Wrap in torch.no_grad()
    # because weights have requires_grad=True, but we don't need to track this
    # in autograd.
    # An alternative way is to operate on weight.data and weight.grad.data.
    # Recall that tensor.data gives a tensor that shares the storage with
    # tensor, but doesn't track history.
    # You can also use torch.optim.SGD to achieve this.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()
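
The comments above mention `torch.optim.SGD` as an alternative to the manual update. A minimal sketch of that variant follows; the seed and the shorter loop of 200 iterations are choices made for this illustration. `optimizer.step()` performs the update `w -= lr * w.grad`, and `optimizer.zero_grad()` clears the accumulated gradients.

```python
import torch

torch.manual_seed(0)  # fixed seed, only so the run is reproducible

N, D_in, N_H, D_out = 64, 1000, 100, 3
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
w1 = torch.randn(D_in, N_H, requires_grad=True)
w2 = torch.randn(N_H, D_out, requires_grad=True)

# The optimizer is given the list of tensors it should update
optimizer = torch.optim.SGD([w1, w2], lr=1e-6)

losses = []
for m in range(200):
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()
    losses.append(loss.item())

    optimizer.zero_grad()  # clear gradients left over from the previous iteration
    loss.backward()        # compute w1.grad and w2.grad
    optimizer.step()       # gradient descent update on w1 and w2

print("first loss:", losses[0], "last loss:", losses[-1])
```

With plain SGD (no momentum or weight decay) this should follow the same update rule as the manual version, while making it easy to switch to another optimizer in `torch.optim`.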

## Scitime

`scitime` estimates the training time of `scikit-learn` algorithms. Currently supported:
* [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
* [KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)
* [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

You can also try a support vector machine (SVM) based classifier and/or a random forest based classifier for the final project.

To install `scitime`, run
```
conda install -c conda-forge scitime
```
An example of `KMeans` running time is as follows:

In [None]:
from sklearn.cluster import KMeans
import numpy as np
import time

from scitime import Estimator

# example for kmeans clustering
estimator = Estimator(meta_algo='RF', verbose=3)
km = KMeans()

X = np.random.rand(100000,10)
# run the estimation
estimation, lower_bound, upper_bound = estimator.time(km, X)

# compare to the actual training time
start_time = time.time()
km.fit(X)
elapsed_time = time.time() - start_time
print("elapsed time:", elapsed_time)