# File I/O

So far we have discussed how to process data and how
to build, train, and test deep learning models.
However, at some point we will hopefully be happy enough
with the learned models that we will want
to save the results for later use in various contexts
(perhaps even to make predictions in deployment).
Additionally, when running a long training process,
the best practice is to periodically save intermediate results (checkpointing)
to ensure that we do not lose several days' worth of computation
if we trip over the power cord of our server.
Thus it is time to learn how to load and store
both individual weight vectors and entire models.
This section addresses both issues.
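
As a rough illustration of the checkpointing idea mentioned above, here is a minimal sketch. The helper names `save_checkpoint` and `load_checkpoint`, the dictionary keys, and the tiny model are assumptions made for this example and are not part of PyTorch. An in-memory buffer stands in for a real file path.

```python
import io
import torch
from torch import nn

# Hypothetical tiny model and optimizer, used only for illustration.
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def save_checkpoint(target, model, optimizer, epoch):
    """Bundle everything needed to resume training into one dictionary."""
    torch.save({
        'epoch': epoch,
        'model_state': model.state_dict(),
        'optimizer_state': optimizer.state_dict(),
    }, target)

def load_checkpoint(source, model, optimizer):
    """Restore model and optimizer in place; return the saved epoch."""
    checkpoint = torch.load(source)
    model.load_state_dict(checkpoint['model_state'])
    optimizer.load_state_dict(checkpoint['optimizer_state'])
    return checkpoint['epoch']

# An in-memory buffer stands in for a file path such as './saves/checkpoint.pt'.
buffer = io.BytesIO()
save_checkpoint(buffer, model, optimizer, epoch=3)

# Rewind the buffer, then restore into freshly created objects.
buffer.seek(0)
restored_model = nn.Linear(4, 2)
restored_optimizer = torch.optim.SGD(restored_model.parameters(), lr=0.1)
start_epoch = load_checkpoint(buffer, restored_model, restored_optimizer) + 1
```

In a real training loop we would probably call the save helper every few epochs and resume from `start_epoch` after an interruption.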


In [1]:
import torch
from torch import nn
from torch.nn import functional as F
import numpy as np

import warnings
warnings.filterwarnings('ignore')

## **Loading and Saving Tensors**

For individual tensors, we can directly
invoke the `load` and `save` functions
to read and write them respectively.
Both functions require that we supply a name,
and `save` requires as input the variable to be saved.
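
The cells in this section write into a `./saves` folder, and `torch.save` fails if the parent directory of the target path is missing. The sketch below shows one way to create the folder first. It uses a temporary directory in place of the project folder, which is an assumption made so that the example runs anywhere.

```python
import os
import tempfile
import torch

# A temporary directory stands in for the project folder in this sketch.
root = tempfile.mkdtemp()
save_dir = os.path.join(root, 'saves')

# exist_ok=True makes the call safe to repeat when re-running the notebook.
os.makedirs(save_dir, exist_ok=True)

path = os.path.join(save_dir, 'x-file')
torch.save(torch.arange(4), path)
restored = torch.load(path)
```

In the notebook itself, `os.makedirs('./saves', exist_ok=True)` before the first `torch.save` call should be enough.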


In [2]:
x = torch.arange(4)
torch.save(x, './saves/x-file')

We can now read the data from the stored file back into memory.


In [3]:
x2 = torch.load('./saves/x-file')
x2

tensor([0, 1, 2, 3])

We can **store a list of tensors and read them back into memory.**


In [4]:
y = torch.zeros(4)
torch.save([x, y],'./saves/x-files')
x2, y2 = torch.load('./saves/x-files')
(x2, y2)

(tensor([0, 1, 2, 3]), tensor([0., 0., 0., 0.]))

We can even **write and read a dictionary that maps
from strings to tensors.**
This is convenient when we want
to read or write all the weights in a model.


In [5]:
mydict = {'x': x, 'y': y}
torch.save(mydict, './saves/mydict')
mydict2 = torch.load('./saves/mydict')
mydict2

{'x': tensor([0, 1, 2, 3]), 'y': tensor([0., 0., 0., 0.])}
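
When a file was written on another machine, two optional arguments of `torch.load` can be useful. The sketch below assumes PyTorch 1.13 or newer, where the `weights_only` argument is available. The buffer again stands in for a file path.

```python
import io
import torch

tensors = {'x': torch.arange(4), 'y': torch.zeros(4)}

# An in-memory buffer stands in for a path such as './saves/mydict'.
buffer = io.BytesIO()
torch.save(tensors, buffer)
buffer.seek(0)

# map_location='cpu' places every tensor on the CPU regardless of where it was saved.
# weights_only=True restricts unpickling to tensors and plain containers.
loaded = torch.load(buffer, map_location='cpu', weights_only=True)
```

Restricting the load to tensors and plain containers is a reasonable default for files from sources we do not fully trust, since ordinary unpickling can execute arbitrary code.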

## **Loading and Saving Model Parameters**

Saving individual weight vectors (or other tensors) is useful,
but it gets very tedious if we want to save
(and later load) an entire model.
After all, we might have hundreds of
parameter groups sprinkled throughout.
For this reason the deep learning framework provides built-in functionality
to load and save entire networks.

> An important detail to note is that this
saves model *parameters* and not the entire model.
For example, if we have a 3-layer MLP,
we need to specify the architecture separately.


The reason for this is that the models themselves can contain arbitrary code,
hence they cannot be serialized as naturally.
Thus, in order to reinstate a model, we need
to generate the architecture in code
and then load the parameters from disk.



In [6]:
# A simple model for now; more elaborate experiments follow later.
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.LazyLinear(256)
        self.output = nn.LazyLinear(10)

    def forward(self, x):
        return self.output(F.relu(self.hidden(x)))

net = MLP()
X = torch.randn(size=(2, 20))
Y = net(X)

In [7]:
net

MLP(
  (hidden): Linear(in_features=20, out_features=256, bias=True)
  (output): Linear(in_features=256, out_features=10, bias=True)
)

In [8]:
[(name, param.shape) for name, param in net.named_parameters()]

[('hidden.weight', torch.Size([256, 20])),
 ('hidden.bias', torch.Size([256])),
 ('output.weight', torch.Size([10, 256])),
 ('output.bias', torch.Size([10]))]

In [9]:
# Key step: save only the parameters (the state dict), not the module object.
torch.save(net.state_dict(), './saves/mlp.params')

> ### 💭
> `torch.save` can write any picklable Python object to a file, because it uses `pickle` under the hood. We store the output of `.state_dict()` because `load_state_dict()` later expects the weights in that **specific format**: a dictionary mapping parameter names to tensors.
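
To make the "specific format" concrete, the sketch below inspects a state dict. The small network is a stand-in for the MLP above, built with `nn.Linear` instead of `nn.LazyLinear` so that the shapes are known without a forward pass.

```python
from collections import OrderedDict
import torch
from torch import nn

# Hypothetical small network with the same layer sizes as the MLP above.
small_net = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
state = small_net.state_dict()

# The state dict is an ordered mapping from parameter names to tensors.
is_ordered_dict = isinstance(state, OrderedDict)
names = list(state.keys())
all_tensors = all(isinstance(value, torch.Tensor) for value in state.values())
shapes = {name: tuple(tensor.shape) for name, tensor in state.items()}
```

The ReLU layer contributes no entries because it has no parameters, which is why the keys skip index 1.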

To recover the model, we instantiate a clone
of the original MLP model.
Instead of randomly initializing the model parameters,
we **read the parameters stored in the file directly**.


In [10]:
clone = MLP()
clone.load_state_dict(torch.load('./saves/mlp.params'))
clone.eval()

MLP(
  (hidden): LazyLinear(in_features=0, out_features=256, bias=True)
  (output): LazyLinear(in_features=0, out_features=10, bias=True)
)

Since both instances have the same model parameters,
the computational result of the same input `X` should be the same.
Let's verify this.


In [11]:
Y_clone = clone(X)
Y_clone == Y

tensor([[True, True, True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True, True, True]])

## Summary

The `save` and `load` functions can be used to perform file I/O for tensor objects.
We can save and load the entire sets of parameters for a network via a parameter dictionary.
Saving the architecture has to be done in code rather than in parameters.

In [12]:
%xmode Minimal

Exception reporting mode: Minimal



## Exercises

1. Even if there is no need to deploy trained models to a different device, what are the practical benefits of storing model parameters?
    - Storing the model parameters allows us to resume training from that *checkpoint* in the future rather than training from scratch.
    - Even without deployment, stored parameters let us reproduce results, compare runs, and recover from a crash or interruption.

2. Assume that we want to reuse only parts of a network to be incorporated into a network having a different architecture. How would you go about using, say the first two layers from a previous network in a new network?
    - This is an interesting exercise. I will tackle it through a few combinations.
  
**<u>We will try the following:</u>**

🔥 Load the parameters of Model 1 into a Model 2 that has an **entirely different** architecture.

🔥 Load the parameters of Model 1 into a Model 2 that shares only some of its layers with Model 1.

🔥 Load the parameters of Model 1 into a Model 2 whose weights have the same shapes but which has no bias terms.

Let's see how that goes.

<img src="../images/different-models.png">

### Test - 1

In [13]:
# Model 1: the network whose parameters we will save
class MODEL_1(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(5, 3),
            nn.ReLU(),
            nn.Linear(3, 4),
            nn.ReLU(),
            nn.Linear(4, 1)
        )

    def forward(self, X):
        return self.net(X)

Model_1_net = MODEL_1()
Model_1_X = torch.randn(size=(2, 5))
Model_1_Y = Model_1_net(Model_1_X)

In [14]:
# Model 2: an entirely different architecture (two linear layers instead of three)
class MODEL_2(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, 4),
            nn.ReLU(),
            nn.Linear(4, 1)
        )

    def forward(self, X):
        return self.net(X)

Model_2_net = MODEL_2()
Model_2_X = torch.randn(size=(2, 3))  # MODEL_2 expects 3 input features
Model_2_Y = Model_2_net(Model_2_X)

In [15]:
# accessing directly
Model_1_net.state_dict()

OrderedDict([('net.0.weight',
              tensor([[ 0.1612,  0.0042, -0.4217, -0.2391,  0.1241],
                      [-0.3142,  0.2413, -0.0544, -0.3145,  0.2827],
                      [ 0.0352,  0.1620, -0.0823, -0.1796, -0.1256]])),
             ('net.0.bias', tensor([ 0.2893,  0.4079, -0.4366])),
             ('net.2.weight',
              tensor([[ 0.1933,  0.3769, -0.1508],
                      [ 0.2927, -0.3656,  0.5275],
                      [-0.0374, -0.1061,  0.1082],
                      [-0.0833,  0.3725, -0.0442]])),
             ('net.2.bias', tensor([ 0.3388, -0.5125,  0.1416,  0.0006])),
             ('net.4.weight', tensor([[ 0.0488, -0.0071, -0.1935,  0.4220]])),
             ('net.4.bias', tensor([0.2392]))])

In [16]:
# accessing the network's dict
Model_1_net.net.state_dict()

OrderedDict([('0.weight',
              tensor([[ 0.1612,  0.0042, -0.4217, -0.2391,  0.1241],
                      [-0.3142,  0.2413, -0.0544, -0.3145,  0.2827],
                      [ 0.0352,  0.1620, -0.0823, -0.1796, -0.1256]])),
             ('0.bias', tensor([ 0.2893,  0.4079, -0.4366])),
             ('2.weight',
              tensor([[ 0.1933,  0.3769, -0.1508],
                      [ 0.2927, -0.3656,  0.5275],
                      [-0.0374, -0.1061,  0.1082],
                      [-0.0833,  0.3725, -0.0442]])),
             ('2.bias', tensor([ 0.3388, -0.5125,  0.1416,  0.0006])),
             ('4.weight', tensor([[ 0.0488, -0.0071, -0.1935,  0.4220]])),
             ('4.bias', tensor([0.2392]))])

In [17]:
# saving model - 1 (in 2 ways)
torch.save(Model_1_net.state_dict(), "./saves/test_1_model_1_direct")
torch.save(Model_1_net.net.state_dict(), "./saves/test_1_model_1_specific")

#### The direct way

In [18]:
model_1_loading = MODEL_1()
model_1_loading.load_state_dict(torch.load('./saves/test_1_model_1_direct'))

<All keys matched successfully>

In [19]:
# AFTER LOADING, ALL WEIGHTS ARE THE SAME
torch.allclose(Model_1_net.net[0].weight, model_1_loading.net[0].weight)

True

#### The "specific" way

In [20]:
model_1_loading = MODEL_1()
model_1_loading.load_state_dict(torch.load('./saves/test_1_model_1_specific'))

RuntimeError: Error(s) in loading state_dict for MODEL_1:
	Missing key(s) in state_dict: "net.0.weight", "net.0.bias", "net.2.weight", "net.2.bias", "net.4.weight", "net.4.bias". 
	Unexpected key(s) in state_dict: "0.weight", "0.bias", "2.weight", "2.bias", "4.weight", "4.bias". 

In [21]:
# THE LOAD ABOVE FAILED, SO THESE ARE STILL THE RANDOMLY INITIALIZED WEIGHTS
torch.allclose(Model_1_net.net[0].weight, model_1_loading.net[0].weight)

False

> The failed load left the model **freshly initialized**, with none of the saved weights copied in, so the comparison returns `False`.

In [22]:
model_1_loading = MODEL_1()
model_1_loading.net.load_state_dict(torch.load('./saves/test_1_model_1_specific'))

<All keys matched successfully>

In [23]:
# AFTER LOADING, ALL WEIGHTS ARE THE SAME
torch.allclose(Model_1_net.net[0].weight, model_1_loading.net[0].weight)

True

> Voilà!
> Calling `.net.load_state_dict` on the submodule works, because the saved keys have no `net.` prefix.
>
> The summary is: **You have to load in the same way you saved**.
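
If the file was saved one way and has to be loaded the other way, the keys can be renamed first. The sketch below is one possible approach. The `Wrapper` class is a hypothetical stand-in for `MODEL_1`, and the state dicts are kept in memory instead of being written to disk.

```python
import torch
from torch import nn

class Wrapper(nn.Module):
    """Hypothetical model that keeps its layers in an attribute called `net`."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(5, 3), nn.ReLU(), nn.Linear(3, 1))

    def forward(self, X):
        return self.net(X)

source = Wrapper()
target = Wrapper()

# Saved from the submodule: keys look like '0.weight' and '2.bias'.
inner_state = source.net.state_dict()

# Add the 'net.' prefix so the keys match what the full model expects.
full_state = {f'net.{key}': value for key, value in inner_state.items()}
result = target.load_state_dict(full_state)

# The opposite direction: strip the prefix from a full-model state dict.
stripped = {key[len('net.'):]: value
            for key, value in source.state_dict().items()
            if key.startswith('net.')}
inner_result = target.net.load_state_dict(stripped)
```

Renaming keys this way keeps a single saved file usable for both the full model and its submodule.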

#### Loading into Model 2

In [24]:
model_2_loading = MODEL_2()
model_2_loading.load_state_dict(torch.load('./saves/test_1_model_1_direct'))

RuntimeError: Error(s) in loading state_dict for MODEL_2:
	Unexpected key(s) in state_dict: "net.4.weight", "net.4.bias". 
	size mismatch for net.0.weight: copying a param with shape torch.Size([3, 5]) from checkpoint, the shape in current model is torch.Size([4, 3]).
	size mismatch for net.0.bias: copying a param with shape torch.Size([3]) from checkpoint, the shape in current model is torch.Size([4]).
	size mismatch for net.2.weight: copying a param with shape torch.Size([4, 3]) from checkpoint, the shape in current model is torch.Size([1, 4]).
	size mismatch for net.2.bias: copying a param with shape torch.Size([4]) from checkpoint, the shape in current model is torch.Size([1]).
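
Loading the whole state dict fails here, but the two layers of Model 2 have the same shapes as the last two linear layers of Model 1, so they can be copied one module at a time. This is a sketch of one way to do that, not the only one. The `nn.Sequential` stand-ins and the `layer_map` dictionary are assumptions made for this example.

```python
import torch
from torch import nn

# Stand-ins for MODEL_1 and the first MODEL_2 of Test 1 (same layer sizes).
model_1 = nn.Sequential(
    nn.Linear(5, 3), nn.ReLU(),
    nn.Linear(3, 4), nn.ReLU(),
    nn.Linear(4, 1),
)
model_2 = nn.Sequential(
    nn.Linear(3, 4), nn.ReLU(),
    nn.Linear(4, 1),
)

# Map each Model 2 layer index to the Model 1 layer with the same shape.
layer_map = {0: 2, 2: 4}
for target_index, source_index in layer_map.items():
    model_2[target_index].load_state_dict(model_1[source_index].state_dict())

# Model 2 should now compute the same function as the tail of Model 1.
hidden = torch.randn(2, 3)
tail_output = model_1[4](torch.relu(model_1[2](hidden)))
same_output = torch.allclose(model_2(hidden), tail_output)
```

Copying per module sidesteps the key names entirely, which helps when the layers sit at different positions in the two networks.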

### Test - 2: Same architecture, but a single layer is different

In [25]:
# The different architecture
class MODEL_2(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(5, 3),
            nn.ReLU(),
            nn.Linear(3, 4),
            nn.ReLU(),
            nn.Linear(4, 2) # from output 1 to 2
        )

    def forward(self, X):
        return self.net(X)

In [26]:
model_2_loading = MODEL_2()
model_2_loading.load_state_dict(torch.load('./saves/test_1_model_1_direct'))

RuntimeError: Error(s) in loading state_dict for MODEL_2:
	size mismatch for net.4.weight: copying a param with shape torch.Size([1, 4]) from checkpoint, the shape in current model is torch.Size([2, 4]).
	size mismatch for net.4.bias: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([2]).
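
When only the last layer differs, one common workaround is to drop the incompatible entries before loading. The sketch below assumes that approach. The `make_net` helper is hypothetical and only keeps the example short. Note that `strict=False` alone would not be enough here, because it does not silence size mismatches.

```python
import torch
from torch import nn

def make_net(out_features):
    """Hypothetical helper: same body, configurable output width."""
    return nn.Sequential(
        nn.Linear(5, 3), nn.ReLU(),
        nn.Linear(3, 4), nn.ReLU(),
        nn.Linear(4, out_features),
    )

pretrained = make_net(out_features=1)
new_model = make_net(out_features=2)

checkpoint = pretrained.state_dict()
own_state = new_model.state_dict()

# Keep only entries whose name exists in the new model with an identical shape.
compatible = {
    name: tensor for name, tensor in checkpoint.items()
    if name in own_state and own_state[name].shape == tensor.shape
}

# strict=False tolerates the keys we dropped and reports them as missing.
result = new_model.load_state_dict(compatible, strict=False)
```

The layers listed in `result.missing_keys` keep their random initialization and would need to be trained.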

### Test - 3: No bias term in the second model, everything else the same

In [27]:
# The different architecture
class MODEL_2(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(5, 3, bias=False),
            nn.ReLU(),
            nn.Linear(3, 4, bias=False),
            nn.ReLU(),
            nn.Linear(4, 1, bias=False)
        )

    def forward(self, X):
        return self.net(X)

In [28]:
model_2_loading = MODEL_2()
model_2_loading.load_state_dict(torch.load('./saves/test_1_model_1_direct'))

RuntimeError: Error(s) in loading state_dict for MODEL_2:
	Unexpected key(s) in state_dict: "net.0.bias", "net.2.bias", "net.4.bias". 

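> With `bias=False`, the bias entries in the checkpoint have no counterpart in the model, so they are reported as unexpected keys.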
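
Extra keys like these can be tolerated by passing `strict=False` to `load_state_dict`, which returns a report instead of raising. A minimal sketch, using small `nn.Sequential` stand-ins for the two models:

```python
import torch
from torch import nn

with_bias = nn.Sequential(nn.Linear(5, 3), nn.ReLU(), nn.Linear(3, 1))
without_bias = nn.Sequential(
    nn.Linear(5, 3, bias=False), nn.ReLU(), nn.Linear(3, 1, bias=False)
)

# strict=False skips keys that have no counterpart and returns them in a report.
report = without_bias.load_state_dict(with_bias.state_dict(), strict=False)

unexpected = list(report.unexpected_keys)
missing = list(report.missing_keys)
weights_match = torch.equal(with_bias[0].weight, without_bias[0].weight)
```

It is worth checking the report after such a load, since a typo in a layer name would be skipped just as silently.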

### Bonus: partially loading the model

<img src="../images/partially-matching-models.png">

In [31]:
# A different architecture, but the middle layer has the same shape as in Model 1
class MODEL_2(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, 3, bias=False),
            nn.ReLU(),
            nn.Linear(3, 4, bias=False),
            nn.ReLU(),
            nn.Linear(4, 2, bias=False)
        )

    def forward(self, X):
        return self.net(X)

In [42]:
model_2_loading = MODEL_2()
model_2_loading.load_state_dict(torch.load('./saves/test_1_model_1_direct'))

RuntimeError: Error(s) in loading state_dict for MODEL_2:
	Unexpected key(s) in state_dict: "net.0.bias", "net.2.bias", "net.4.bias". 
	size mismatch for net.0.weight: copying a param with shape torch.Size([3, 5]) from checkpoint, the shape in current model is torch.Size([3, 2]).
	size mismatch for net.4.weight: copying a param with shape torch.Size([1, 4]) from checkpoint, the shape in current model is torch.Size([2, 4]).

In [43]:
torch.allclose(Model_1_net.net[2].weight, model_2_loading.net[2].weight)

True

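Look at this: it **partially loads** what it can! Even though `load_state_dict` raised an error, the parameter whose name and shape matched (`net.2.weight`) had already been copied into the model. This is a side effect of how the loader works rather than something to rely on.

We can also copy that layer's weights explicitly, like this.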

In [44]:
layer = torch.load('./saves/test_1_model_1_direct')

In [47]:
layer.keys()

odict_keys(['net.0.weight', 'net.0.bias', 'net.2.weight', 'net.2.bias', 'net.4.weight', 'net.4.bias'])

In [50]:
weights_to_use = layer["net.2.weight"]

In [51]:
# Copy the saved weights into the matching layer (index 2) of Model 2
model_2_loading.net[2].weight.data = weights_to_use.clone()

🔥 WOO!!