# 1

## 1.1

We want to train a model that predicts $d$ given $a$, $b$ and $c$, where $d = a \cdot b + c$.
Here $a$, $b$ and $c$ are non-negative integers; $a$ and $c$ have two digits and $b$ has one digit.
This makes $d$ at most a three-digit number, specifically $d \in \{0, 1, \dots, 990\}$, since $99 \cdot 9 + 99 = 990$. The representation of $d$ then becomes
$d_0 d_1 d_2$.
Because we reverse the digits of $d$, a training pair $\{x, y\}$ becomes:

\begin{align}
    x &= [ a_0, a_1, b, c_0, c_1, d_2, d_1 ] \\
    y &= [ a_1, b, c_0, c_1, d_2, d_1, d_0 ]
\end{align}

A concrete example shows that padding with zeros keeps the length constant:

\begin{align*}
    a &= 5, \quad b = 5, \quad c = 33 \\
    a \cdot b + c &= 5 \cdot 5 + 33 = 58
\end{align*}

gives

\begin{align*}
    x &= [0,5,5,3,3,8,5] \\
    y &= [5,5,3,3,8,5,0].
\end{align*}
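
The encoding above can be sketched in a few lines of Python. This is a minimal illustration, not the project's data generator: the function name `encode_example` is hypothetical, and the sketch assumes the layout described in this section (zero-padded digits of $a$ and $c$, digits of $d$ reversed, and $y$ equal to $x$ shifted one position to the left).

```python
def encode_example(a, b, c):
    """Build one (x, y) pair for d = a * b + c using the layout described above."""
    if not (0 <= a <= 99 and 0 <= b <= 9 and 0 <= c <= 99):
        raise ValueError("expected 0 <= a, c <= 99 and 0 <= b <= 9")
    d = a * b + c  # at most 99 * 9 + 99 = 990, so three digits suffice

    a_digits = [a // 10, a % 10]                   # [a_0, a_1], zero-padded
    c_digits = [c // 10, c % 10]                   # [c_0, c_1], zero-padded
    d_reversed = [d % 10, d // 10 % 10, d // 100]  # [d_2, d_1, d_0]

    full = a_digits + [b] + c_digits + d_reversed
    x = full[:-1]  # everything except the last digit d_0
    y = full[1:]   # the same sequence shifted one position to the left
    return x, y


x, y = encode_example(5, 5, 33)
print(x)  # [0, 5, 5, 3, 3, 8, 5]
print(y)  # [5, 5, 3, 3, 8, 5, 0]
```

Every pair has length 7 regardless of the values of $a$, $b$ and $c$, which is what the zero padding is for.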

## 1.2

When the model is optimized, it predicts $d$ given $a$, $b$ and $c$ one digit at a time: at each step the last output of $f_\theta$ is appended to the input. Using the same $a = 5, b = 5, c = 33$ as before:

\begin{align}
    x^{(0)} &= [0, 5, 5, 3, 3],  &[\hat{z}_0^{(0)}, \hat{z}_1^{(0)}, \hat{z}_2^{(0)}, \hat{z}_3^{(0)}, \textcolor{red}{\hat{z}_4^{(0)}}] = f_\theta(x^{(0)})\\
    x^{(1)} &= [0, 5, 5, 3, 3, \textcolor{red}{\hat{z}_4^{(0)}}],  &[\hat{z}_0^{(1)}, \cdots, \textcolor{blue}{\hat{z}_5^{(1)}}] = f_\theta(x^{(1)})  \\
    x^{(2)} &= [0, 5, 5, 3, 3, \textcolor{red}{\hat{z}_4^{(0)}}, \textcolor{blue}{\hat{z}_5^{(1)}}],  &[\hat{z}_0^{(2)}, \cdots, \textcolor{green}{\hat{z}_6^{(2)}}] = f_\theta(x^{(2)}) \\
    x^{(3)} &= [0, 5, 5, 3, 3, \textcolor{red}{\hat{z}_4^{(0)}}, \textcolor{blue}{\hat{z}_5^{(1)}}, \textcolor{green}{\hat{z}_6^{(2)}}]
\end{align}

Since $d$ has at most three digits, three steps suffice:

\begin{align}
\hat{y} = [\textcolor{red}{\hat{z}_4^{(0)}}, \textcolor{blue}{\hat{z}_5^{(1)}}, \textcolor{green}{\hat{z}_6^{(2)}}]
\end{align}



Another possibility:

\begin{align}
    x^{(0)} &= [0, 5, 5, 3, 3],  &[\hat{z}_0^{(0)}, \hat{z}_1^{(0)}, \hat{z}_2^{(0)}, \hat{z}_3^{(0)}, \textcolor{red}{8}] = f_\theta(x^{(0)})\\
    x^{(1)} &= [0, 5, 5, 3, 3, \textcolor{red}{8}],  &[\hat{z}_0^{(1)}, \cdots, \textcolor{blue}{5}] = f_\theta(x^{(1)})  \\
    x^{(2)} &= [0, 5, 5, 3, 3,  \textcolor{red}{8}, \textcolor{blue}{5}],  &[\hat{z}_0^{(2)}, \cdots, \textcolor{green}{0}] = f_\theta(x^{(2)}) \\
    x^{(3)} &= [0, 5, 5, 3, 3, \textcolor{red}{8}, \textcolor{blue}{5}, \textcolor{green}{0}]
\end{align}

\begin{align}
\hat{y} = [\textcolor{red}{8}, \textcolor{blue}{5}, \textcolor{green}{0}]
\end{align}

The digits come out in the reversed order used in the training data, so reading $\hat{y}$ backwards gives $d = 058 = 58$.
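
The prediction loop described in this section can be sketched as follows. This is only an illustration of the mechanics: `generate` and `toy_model` are hypothetical names, and `toy_model` is a stand-in for a trained $f_\theta$ that does no arithmetic at all (it simply returns each token plus one, modulo 10). The only assumption carried over from the text is that $f_\theta$ returns one prediction per input position and that the last one is appended to the input.

```python
def generate(f_theta, prompt, n_new):
    """Autoregressively extend `prompt` by `n_new` tokens and return the new tokens."""
    x = list(prompt)
    for _ in range(n_new):
        z_hat = f_theta(x)   # one prediction per input position
        x.append(z_hat[-1])  # keep only the prediction for the next token
    return x[len(prompt):]


def toy_model(x):
    """Hypothetical stand-in for f_theta: predicts (token + 1) mod 10 at each position."""
    return [(token + 1) % 10 for token in x]


y_hat = generate(toy_model, [0, 5, 5, 3, 3], n_new=3)
print(y_hat)  # [4, 5, 6]
```

With a trained network in place of `toy_model`, the three returned tokens would be the predicted digits of $d$.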

## 1.3

We use cross entropy as the objective function, with $m = 5$ and $y = [4, 3, 2, 1]$. If the objective function $\mathcal{L}(\theta, D) = 0$, then $\hat{Y}$ is the one-hot representation of $y$:
\begin{align}
\hat{Y} = \mathrm{onehot}(y) =
\begin{bmatrix}
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 \\
\end{bmatrix}
\end{align}

In this case $\hat{y} = y = [4, 3, 2, 1]$.
That is, if the objective function $\mathcal{L}(\theta, D) = 0$, then the prediction $\hat{y}$ is the same as the solution $y$.
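
A small pure-Python check of this claim is sketched below. The helper names are hypothetical, and the exact form of the loss is an assumption: the sketch uses the mean negative log-probability of the correct class per column, which may differ from the project's `CrossEntropy` by a constant factor, and it recovers $\hat{y}$ by taking the argmax of each column.

```python
import math


def onehot(y, m):
    """Return an m x n matrix (list of rows) whose column j has a 1 in row y[j]."""
    return [[1.0 if y[j] == i else 0.0 for j in range(len(y))] for i in range(m)]


def cross_entropy(Y_hat, y):
    """Mean negative log-probability that Y_hat assigns to the correct class (assumed form)."""
    n = len(y)
    return sum(-math.log(Y_hat[y[j]][j]) for j in range(n)) / n


def argmax_columns(Y_hat):
    """Index of the largest entry in each column of Y_hat."""
    m, n = len(Y_hat), len(Y_hat[0])
    return [max(range(m), key=lambda i: Y_hat[i][j]) for j in range(n)]


y = [4, 3, 2, 1]
Y_hat = onehot(y, m=5)
loss = cross_entropy(Y_hat, y)
y_hat = argmax_columns(Y_hat)

print(loss == 0)  # True
print(y_hat)      # [4, 3, 2, 1]
```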


## 1.4
Given $d, m, n_{max}, k, p, L$, we find the number of parameters that must be determined by looking at the dimensions of all the matrices involved in the transformer model.

$W_{E},  W_{U} \in \mathbb{R}^{d \times m}$ and $W_{P} \in \mathbb{R}^{d \times n_{max}}$ are created only once per transformer model.

$W_{O},  W_{V}, W_{Q},  W_{K} \in \mathbb{R}^{k \times d}$ and $W_{1},  W_{2} \in \mathbb{R}^{p \times d}$ are created for each of the $L$ layers in the transformer model.

In total that gives $2dm + dn_{max} + L(4kd + 2pd)$ individual parameters that must be determined.
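
Counting directly from the matrix shapes listed above, the total can be computed with a short helper. The function name `count_parameters` and the numbers in the usage line are hypothetical and chosen only for illustration; they are not the project's settings.

```python
def count_parameters(d, m, n_max, k, p, L):
    """Count the entries of all weight matrices, based on the shapes listed above."""
    embedding = 2 * d * m     # W_E and W_U, each d x m
    positional = d * n_max    # W_P, d x n_max
    attention = 4 * k * d     # W_O, W_V, W_Q, W_K, each k x d, per layer
    feed_forward = 2 * p * d  # W_1 and W_2, each p x d, per layer
    return embedding + positional + L * (attention + feed_forward)


# Illustrative values only.
print(count_parameters(d=10, m=2, n_max=9, k=5, p=15, L=2))  # 1130
```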
 

# Sorting problem

In [None]:
from train_test_params import SortParams1

sort_params = SortParams1()

Prepare training and testing data for the sorting problem

In [None]:
from data_generators import get_train_test_sorting

training_data = get_train_test_sorting(
    length=sort_params.r,
    num_ints=sort_params.m,
    samples_per_batch=sort_params.D,
    n_batches_train=sort_params.b_train,
    n_batches_test=sort_params.b_test,
)

x_train = training_data["x_train"]
y_train = training_data["y_train"][:, :, sort_params.r - 1:]
x_test = training_data["x_test"]
y_test = training_data["y_test"]

Let's initialize the network

In [None]:
from train_network import init_neural_network

network = init_neural_network(sort_params)

Train the network using `CrossEntropy` as the loss function (objective function).

In [None]:
from train_network import train_network
from layers_numba import CrossEntropy

loss = CrossEntropy()

train_network(
    network=network,
    x_train=x_train,
    y_train=y_train,
    loss_func=loss,
    alpha=sort_params.alpha,
    n_iter=sort_params.n_iter,
    num_ints=sort_params.m,
    dump_to_pickle_file=False,
)

Alternatively, load a pre-trained network from a pickle dump.

In [None]:
import dill as pickle

with open("nn_dump_exer3.pkl", "rb") as f:
    network = pickle.load(f)

In [None]:
from test_network import test_trained_network

test_trained_network(
    network=network, x_test=x_test, y_test=y_test, num_ints=sort_params.m
)

# Addition problem

In [None]:
from train_test_params import AddParams

add_params = AddParams()

In [None]:
from train_network import init_neural_network

network = init_neural_network(add_params)

In [None]:
from data_generators import get_train_test_addition

# Prepare training and test data for the addition problem.
training_data = get_train_test_addition(
    n_digit=add_params.r,
    samples_per_batch=add_params.D,
    n_batches_train=add_params.b_train,
    n_batches_test=add_params.b_test,
)

x_train = training_data["x_train"]
y_train = training_data["y_train"][:, :, add_params.r*2 - 1:]
x_test = training_data["x_test"]
y_test = training_data["y_test"][:, :, ::-1]    # remember that (c0, c1, c2) is reversed in the training data.
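
The `[::-1]` slice on the last axis flips the digit order of every sum. The same idea on a single sample, in plain Python, is sketched below; `digits_to_int` is a hypothetical helper that is not part of the project code, and the example digits are made up.

```python
def digits_to_int(digits):
    """Interpret a list of digits, most significant first, as an integer."""
    value = 0
    for digit in digits:
        value = 10 * value + digit
    return value


stored = [8, 5, 0]        # digits in reversed order, least significant first
restored = stored[::-1]   # [0, 5, 8], most significant first
print(digits_to_int(restored))  # 58
```

Comparing predictions with labels only makes sense when both use the same digit order, which is why one of them has to be flipped before testing.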

In [None]:
from train_network import train_network
from layers_numba import CrossEntropy

loss = CrossEntropy()

train_network(
    network=network,
    x_train=x_train,
    y_train=y_train,
    loss_func=loss,
    alpha=add_params.alpha,
    n_iter=add_params.n_iter,
    num_ints=add_params.m,
    is_numba_dump=True,
    dump_to_pickle_file=True,
)

In [None]:
import dill as pickle

with open("nn_dump_add.pkl", "rb") as f:
    network = pickle.load(f)

In [None]:
from test_network import test_trained_network

test_trained_network(
    network=network, x_test=x_test, y_test=y_test, num_ints=add_params.m
)