In [1]:
from fastai.vision.all import *
from fastbook import *
from fastai.vision.widgets import *
import fastbook

# Get Data

In [2]:
data = untar_data(URLs.MNIST_SAMPLE)

Yes! It seems like we get a path object. But how do we get real image data for training and testing?

In [5]:
Path.BASE_PATH = data
data.ls()

(#3) [Path('labels.csv'),Path('train'),Path('valid')]

Wow, this way we can make the paths more cleaner. So two folders...

In [6]:
(data/'train').ls()

(#2) [Path('train/3'),Path('train/7')]

Only threes and sevens. Of course, this is only a *sample* of the MNIST dataset. So guess I know where to find the images now...

In [18]:
threes = [tensor(Image.open(i)) for i in (data/'train'/'3').ls()]
threes[1][10:20, 10:20]

tensor([[  0,   0,   0,   0, 249, 253, 245, 126,   0,   0],
        [  0,  14, 101, 223, 253, 248, 124,   0,   0,   0],
        [166, 239, 253, 253, 253, 187,  30,   0,   0,   0],
        [248, 250, 253, 253, 253, 253, 232, 213, 111,   2],
        [  0,  43,  98,  98, 208, 253, 253, 253, 253, 187],
        [  0,   0,   0,   0,   9,  51, 119, 253, 253, 253],
        [  0,   0,   0,   0,   0,   0,   1, 183, 253, 253],
        [  0,   0,   0,   0,   0,   0,   0, 182, 253, 253],
        [  0,   0,   0,   0,   0,   0,  85, 249, 253, 253],
        [  0,   0,   0,   0,   0,  60, 214, 253, 253, 173]], dtype=torch.uint8)

Hang there! It's not as easy as I think to write this line of code, because there are some gruesome parts. 
- First, you cannot just open the images and left them raw in the list, like `[Image.open(i) for i in (data/'train'/'3').ls()]` as I first did so. Writing code like this have many disadvantages. You can't access their data just by opening them. No matter what you do, you must turn them into tensors eventually. 
- Turning them into tensors is not an easy task. Because a `Tensor` and a `tensor` is very different! Calling the first one gives errors.

And now you get all of threes in the training dataset turned into tensors. Great!

In [19]:
sevens = [tensor(Image.open(i)) for i in (data/'train'/'7').ls()]
valid_threes = [tensor(Image.open(i)) for i in (data/'valid'/'3').ls()]
valid_sevens = [tensor(Image.open(i)) for i in (data/'valid'/'7').ls()]

Now we do this to the other images. But it is hard to miss that the data structure storing the tensors is not a tensor. If we train the model on pure python lists, it will be, well, very slow and discouraging. So what do we do next?

In [22]:
stacked_threes = torch.stack(threes)
stacked_threes.shape

torch.Size([6131, 28, 28])

Ok! We stack the items in the list together to make a huge 3-dimension tensor! We can use the same approach to stack the other lists.

In [23]:
stacked_sevens = torch.stack(sevens)
stacked_vsevens = torch.stack(valid_sevens)
stacked_vthrees = torch.stack(valid_threes)

We haven't done yet! We need to squeeze the picture into a single dimension array of pixels, not a 28x28 sized 2d image.

In [26]:
train3 = stacked_threes.view(-1, 28*28)
train3.shape

torch.Size([6131, 784])

Hopefully that is correct. Now do this to other tensors.

In [27]:
train7 = stacked_sevens.view(-1, 28*28)
valid3 = stacked_vthrees.view(-1, 28*28)
valid7 = stacked_vsevens.view(-1, 28*28)

This is still not over! After squeezing the dimensions, we now need to squeeze the value of the images, because all of the values (representing the pixels) are between 0 and 255, and we need all the values between 0 and 1. So we take the actions below:

In [39]:
train3 = stacked_threes.view(-1, 28*28)/255
train7 = stacked_sevens.view(-1, 28*28)/255
valid3 = stacked_vthrees.view(-1, 28*28)/255
valid7 = stacked_vsevens.view(-1, 28*28)/255

Now we are all done preparing our data for training! We now put the previous code together:

In [40]:
from fastai.vision.all import *
from fastbook import *
from fastai.vision.widgets import *
import fastbook

data = untar_data(URLs.MNIST_SAMPLE)
threes = [tensor(Image.open(i)) for i in (data/'train'/'3').ls()]
sevens = [tensor(Image.open(i)) for i in (data/'train'/'7').ls()]
valid_threes = [tensor(Image.open(i)) for i in (data/'valid'/'3').ls()]
valid_sevens = [tensor(Image.open(i)) for i in (data/'valid'/'7').ls()]
stacked_threes = torch.stack(threes)
stacked_sevens = torch.stack(sevens)
stacked_vsevens = torch.stack(valid_sevens)
stacked_vthrees = torch.stack(valid_threes)
train3 = stacked_threes.view(-1, 28*28)/255
train7 = stacked_sevens.view(-1, 28*28)/255
valid3 = stacked_vthrees.view(-1, 28*28)/255
valid7 = stacked_vsevens.view(-1, 28*28)/255

It's time for us to start building a model.
# Parameters, Predictions and Accuracy
The first step we need to take is to generate some random parameters. In this case we use a linear model in w\*x+b style.

In [33]:
weights = torch.randn(28*28).unsqueeze(1)
bias = torch.randn(1)
weights.shape, bias

(torch.Size([784, 1]), tensor([0.3143]))

We use matrix multiplication to classify the pictures. So we need the weights to change shape and have 784 rows instead of 784 columns.

Now we do a single prediction to a picture and have a look at its output.

In [41]:
p = train3[1]
p@weights+bias

tensor([11.8016])

So what does the result mean? We should define what it means, and now we define that if the value if above zero, it means that the machine predicts the image is a 3. Else, it is a 7. And we calculate the accuracy and the loss based on that. Let's write a function to do the predictions:

In [104]:
def predict(inp, params: tuple):
    w, b = params
    return inp@w+b
params = (weights, bias)
predict(p, params)

tensor([-5.7097])

Perfect! After making the first prediction, we need to calculate the loss. But if we want to make the loss more "countable", the value of the loss need to be from 0 and 1, not positive infinity or negative infinity. That's when the sigmoid function comes in handy. We modify the `predict` function using sigmoid:
> Notice that the function predicts whether a image is a three.

In [88]:
def predict(inp, params: tuple):
    w, b = params
    raw_result = inp@w+b
    return torch.sigmoid(raw_result)
predict(p, (weights, bias))
predict(merged, params).shape

torch.Size([12396, 1])

Great! And now to actually train the model, we need to merge the dataset together and create a set of labels.

In [74]:
labels = tensor([[1]*len(train3)+[0]*len(train7)]).T
merged = torch.concat((train3, train7))
merged.shape, labels.shape

(torch.Size([12396, 784]), torch.Size([12396, 1]))

We discoverd a new `concat` function to merge tensors. Next we create a function that evaluates the accuracy of the model.

In [96]:
def get_accuracy(x, y, params):
    result = predict(x, params)
    raw_acc = (result>0.5).float() == y
    return raw_acc.float().mean()
get_accuracy(merged, labels, params)

tensor(0.6089)

The `get_accuracy` function is the first hard part. We predict the results, and compare the results with 0.5 - if the value is higher than that, it means the machines determines it as a "3", then we get the array of 0s and 1s. After that, we use the notation of `result==y` to get if the machine has predicted right. At last we take the average value of the `raw_acc` variable to get the accuracy.

Put this part of the code together:

In [137]:
labels = tensor([[1]*len(train3)+[0]*len(train7)]).T
merged = torch.concat((train3, train7))

weights = torch.randn(28*28).unsqueeze(1)
bias = torch.randn(1)
params = (weights, bias)
def predict(inp, params: tuple):
    w, b = params
    raw_result = inp@w+b
    return torch.sigmoid(raw_result)
def get_accuracy(x, y, params):
    result = predict(x, params)
    raw_acc = (result>0.5).float() == y
    return raw_acc.float().mean()

# Loss Function and Gradient Descent
How do we let the machine improve itself? It all depends on the loss function we are about to define.

In [98]:
def get_loss(x, y, params):
    res = predict(x, params)
    return torch.where(y==1, 1-res, res).mean()
get_loss(merged, labels, params)

tensor(0.3927)

It seems that we have been doing all right. So now we need to do the gradient descent.

In [117]:
def calc_grad(x, y, params):
    for i in params:
        i.requires_grad_()
    loss = get_loss(x, y, params)
    loss.backward()
calc_grad(merged, labels, params)
print(weights.grad[100:110], bias.grad)
weights.grad = None
bias.grad = None

tensor([[-0.0103],
        [-0.0085],
        [-0.0058],
        [-0.0035],
        [-0.0020],
        [-0.0007],
        [-0.0002],
        [ 0.0000],
        [ 0.0000],
        [ 0.0000]]) tensor([0.0108])


Now we have successfully implemented the `calc_grad` method! Notice that we have to clear the gradients after we retrieved them. 
What's next? We train our model! We feed the model batches of our data and calculate the average gradient, then modify our weights.

In [124]:
dl = DataLoader(list(zip(merged, labels)), batch_size=256, shuffle=True)
vdl = DataLoader(list(zip(torch.concat((valid3, valid7)))), tensor([[1]*len(valid3)+[0]*len(valid7)]).T, batch_size=256, shuffle=True)

After creating our dataloaders, we're able to build a basic training function.

In [150]:
def train_once(x, y, params, lr=1):
    calc_grad(x, y, params)
    weights, bias = params
    weights.data -= weights.grad * lr
    bias.data -= bias.grad * lr
    weights.grad = None
    bias.grad = None
def train_epoch(dl, params, lr=1):
    for xi, yi in dl:
        train_once(xi, yi, params, lr)
        print(round(get_loss(merged, labels, params).item(), 4), end=' ')
train_epoch(dl, params)

0.0383 0.0382 0.0381 0.0381 0.038 0.0379 0.0378 0.0377 0.0377 0.0376 0.0376 0.0375 0.0374 0.0373 0.0372 0.0372 0.0372 0.0371 0.037 0.0369 0.0369 0.0368 0.0368 0.0367 0.0367 0.0366 0.0365 0.0364 0.0364 0.0363 0.0362 0.0362 0.0361 0.0361 0.036 0.036 0.0359 0.0358 0.0357 0.0356 0.0356 0.0355 0.0355 0.0355 0.0354 0.0354 0.0353 0.0353 0.0352 

It is important to use `weights.data` instead of `weights`. Putting it all together:

In [158]:
labels = tensor([[1]*len(train3)+[0]*len(train7)]).T
merged = torch.concat((train3, train7))
dl = DataLoader(list(zip(merged, labels)), batch_size=256, shuffle=True)
vdl = DataLoader(list(zip(torch.concat((valid3, valid7)))), tensor([[1]*len(valid3)+[0]*len(valid7)]).T, batch_size=256, shuffle=True)
weights = torch.randn(28*28).unsqueeze(1)
bias = torch.randn(1)
params = (weights, bias)

def predict(inp, params: tuple):
    w, b = params
    raw_result = inp@w+b
    return torch.sigmoid(raw_result)
def get_accuracy(x, y, params):
    result = predict(x, params)
    raw_acc = (result>0.5).float() == y
    return raw_acc.float().mean()
def get_loss(x, y, params):
    res = predict(x, params)
    return torch.where(y==1, 1-res, res).mean()
def calc_grad(x, y, params):
    for i in params:
        i.requires_grad_()
    loss = get_loss(x, y, params)
    loss.backward()
def train_once(x, y, params, lr=1):
    calc_grad(x, y, params)
    weights, bias = params
    weights.data -= weights.grad * lr
    bias.data -= bias.grad * lr
    weights.grad = None
    bias.grad = None
def train_epoch(dl, params, lr=1):
    for xi, yi in dl:
        train_once(xi, yi, params, lr)

for i in range(10):
    train_epoch(dl, params)
    print(get_accuracy(merged, labels, params).item())

0.9036785960197449
0.9462729692459106
0.9561148881912231
0.9614391922950745
0.9636979699134827
0.9664407968521118
0.9692642688751221
0.9709583520889282
0.9731364846229553
0.9737818837165833


Nearly 98% accuracy!
# Refactoring and Reducing the Code
All of our functions take (x, y, params) as the input parameters. Seeing from the outside, this is very fustrating because it makes all the functions look the same. We should change it. 

In [167]:
def get_accuracy(raw_prediction, y):
    raw_acc = (raw_prediction>0.5).float() == y
    return raw_acc.float().mean()
def get_loss(raw_prediction, y):
    return torch.where(y==1, 1-raw_prediction, raw_prediction).mean()
def calc_grad(x, y, params):
    for i in params:
        i.requires_grad_()
    pred = predict(x, params)
    loss = get_loss(pred, y)
    loss.backward()
for i in range(10):
    train_epoch(dl, params)
    print(get_accuracy(predict(merged, params), labels).item())

0.9748305678367615
0.9758793115615845
0.9770894050598145
0.9784607887268066
0.9795095324516296
0.9798322319984436
0.9803969264030457
0.9810422658920288
0.9816069602966309
0.9823330044746399


Step 2 is to make the model as a parameter, that is we should take the `predict` function as a parameter. Also the weights are related to the model, so the `params` no longer have to be the function parameter. The dataloader and learning rate should also be a global variable.

In [202]:
lr=1.
weights = torch.randn(28*28).unsqueeze(1).requires_grad_()
bias = torch.randn(1).requires_grad_()
params = (weights, bias)

def predict(inp):
    w, b = params
    raw_result = inp@w+b
    return raw_result
def calc_grad(x, y, model):
    pred = torch.sigmoid(model(x))
    loss = get_loss(pred, y)
    loss.backward()
def train_epoch(model):
    for xi, yi in dl:
        calc_grad(xi, yi, model)
        for i in params:
            i.data -= i.grad*lr
            i.grad = None
def validate_model(model):
    pred = torch.sigmoid(model(torch.concat((valid3, valid7))))
    return round(get_accuracy(pred, tensor([[1]*len(valid3)+[0]*len(valid7)]).T).item(), 4)
def train_model(model, epochs):
    for i in range(epochs):
        train_epoch(model)
        print(validate_model(model), end=' ')
#train_model(predict, 10)

We also moved the sigmoid from the model function `predict` to the other functions that also use the model. We added the `validate_model` method to know the accuracy of the model on the validation set.

So why do we make these changes? That's because it is a standardization process. We can greatly reduce the code by replacing most of the code above with those below:

In [201]:
def train_epoch(model):
    for xb,yb in dl:
        calc_grad(xb, yb, model)
        opt.step()
        opt.zero_grad()
linear_model = nn.Linear(28*28,1)
opt = SGD(linear_model.parameters(), 1)
train_model(linear_model, 20)

0.9701 0.9755 0.9764 0.9769 0.9779 0.9789 0.9789 0.9789 0.9799 0.9799 0.9814 0.9814 0.9814 0.9809 0.9809 0.9823 0.9818 0.9818 0.9828 0.9818 

What is this piece of code doing? 
- Initialize our model using `nn.Linear` instead of defining our own function and weights.
- Using a SGD class for modifying the weights and clearing the gradients, replacing the `calc_grad` function.
- `train_model` stays the same, but the function `train_epoch` can be simplified by calling the optimizer's methods of `step()` and `zero_grad()`.
- The others all stay the same.

So the total code will look like:

In [221]:
labels = tensor([[1]*len(train3)+[0]*len(train7)]).T
merged = torch.concat((train3, train7))
dl = DataLoader(list(zip(merged, labels)), batch_size=256, shuffle=True)
vdl = DataLoader(list(zip(torch.concat((valid3, valid7)), tensor([[1]*len(valid3)+[0]*len(valid7)]).T)), batch_size=256, shuffle=True)

def get_accuracy(raw_prediction, y):
    raw_acc = (raw_prediction.sigmoid()>0.5).float() == y
    return raw_acc.float().mean()
def get_loss(raw_prediction, y):
    raw_prediction = raw_prediction.sigmoid()
    return torch.where(y==1, 1-raw_prediction, raw_prediction).mean()
def calc_grad(x, y, model):
    pred = torch.sigmoid(model(x))
    loss = get_loss(pred, y)
    loss.backward()
def validate_model(model):
    pred = model(torch.concat((valid3, valid7)))
    return round(get_accuracy(pred, tensor([[1]*len(valid3)+[0]*len(valid7)]).T).item(), 4)
def train_epoch(model):
    for xb,yb in dl:
        calc_grad(xb, yb, model)
        opt.step()
        opt.zero_grad()
def train_model(model, epochs):
    for i in range(epochs):
        train_epoch(model)
        print(validate_model(model), end=' ')
linear_model = nn.Linear(28*28,1)
opt = SGD(linear_model.parameters(), 1)
train_model(linear_model, 20)

0.9681 0.9701 0.9711 0.973 0.9735 0.9735 0.9735 0.975 0.9755 0.9755 0.9755 0.9755 0.9769 0.9755 0.9769 0.9774 0.9774 0.9779 0.9779 0.9779 

A lot less code than the last part, but we can still do better.
# Using a Leaner

In [223]:
labels = tensor([[1]*len(train3)+[0]*len(train7)]).T
merged = torch.concat((train3, train7))
dl = DataLoader(list(zip(merged, labels)), batch_size=256, shuffle=True)
vdl = DataLoader(list(zip(torch.concat((valid3, valid7)), tensor([[1]*len(valid3)+[0]*len(valid7)]).T)), batch_size=256, shuffle=True)

def get_accuracy(raw_prediction, y):
    raw_acc = (raw_prediction.sigmoid()>0.5).float() == y
    return raw_acc.float().mean()
def get_loss(raw_prediction, y):
    raw_prediction = raw_prediction.sigmoid()
    return torch.where(y==1, 1-raw_prediction, raw_prediction).mean()

dls = DataLoaders(dl, vdl)
learn = Learner(dls, nn.Linear(28*28,1), opt_func=SGD,
                loss_func=get_loss, metrics=get_accuracy)
learn.fit(10, lr=lr)

epoch,train_loss,valid_loss,get_accuracy,time
0,0.060302,0.041964,0.97105,00:00
1,0.040459,0.034997,0.975957,00:00
2,0.032664,0.031845,0.976448,00:00
3,0.028626,0.03032,0.976448,00:00
4,0.025594,0.028421,0.977429,00:00
5,0.023563,0.027011,0.979882,00:00
6,0.02252,0.026397,0.979392,00:00
7,0.021388,0.025955,0.978901,00:00
8,0.020645,0.024926,0.979882,00:00
9,0.019871,0.024914,0.980373,00:00


- `loss_func` and `metrics` get the raw prediction, that is before sigmoid is applied.
# Non-linear and Resnets

In [227]:
model = nn.Sequential(
    nn.Linear(28*28, 30),
    nn.ReLU(),
    nn.Linear(30, 1))
learn = Learner(dls, model, opt_func=SGD,
                loss_func=get_loss, metrics=get_accuracy)
learn.fit(10, lr=lr)

epoch,train_loss,valid_loss,get_accuracy,time
0,0.055009,0.030219,0.975957,00:00
1,0.031645,0.02505,0.978901,00:00
2,0.022799,0.024563,0.979882,00:00
3,0.019182,0.022011,0.980373,00:00
4,0.016903,0.019853,0.982336,00:00
5,0.015788,0.020317,0.981845,00:00
6,0.01455,0.020612,0.981845,00:00
7,0.013672,0.018114,0.983808,00:00
8,0.013649,0.017714,0.983317,00:00
9,0.01293,0.016864,0.98528,00:00


In [231]:
dls = ImageDataLoaders.from_folder(data)
learn = vision_learner(dls, resnet18, pretrained=False,
                    loss_func=F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(1, 0.1)

epoch,train_loss,valid_loss,accuracy,time
0,0.0768,0.012163,0.996075,01:26
