<a href="https://colab.research.google.com/github/JDS289/DNNs/blob/main/renameEventuallyC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Neural Networks - First Assignment 2025
## Ferenc Huszár and Nic Lane
### Due date: 21 February, 2025


# Info


Welcome to the first assignment for the Deep Learning and Neural Networks (DeepNN) module. This first half is marked out of 30, and makes up 30% of the total marks you can earn in this module. The second assignment is going to be a bit more substantial and more open-ended.


There are 3 parts to this assignment, A, B and C, each worth 10% of the total module marks. The first two exercises contain specific subtasks, while the final one is more open-ended and requires you to do a bit of independent reading and experimentation.

How to submit:
* Please run this notebook in [Google Colab](https://colab.research.google.com/), adding your own code and text cells as requested
* Leave relevant output (plots, etc) in the colab - assume we won't be running your code, but we want to check how you solved problems. Please name and submit the `ipynb` file.
* For each of the three sections (A, B and C), please submit a 1 page pdf with:
   * Up to 350 words of text summarising and interpreting your findings.
   * Up to 2 figures (or tables)
   * Mathematical derivations, for which an extra appendix page is allowed. It's fine to include a screenshot of compiled LaTeX or a photo of handwritten notes if that saves you time.
* For Sections A and B we included a writeup checklist to help you remember to include important components.
* You do not have to max out the word count or figure numbers; in fact we prefer short, to-the-point descriptions.
* A correct solution of what we asked for with a solid writeup will earn full marks. Going beyond the brief - great, do it if you're curious - won't earn extra marks.

Part of the goal of this assignment is for you to familiarise yourself with `pytorch`, if you don't know it already. To this end we tried to include quite a bit of explanation, skeleton code, and some hints.

# Imports

In [None]:
from matplotlib import pyplot as plt



# C: Inductive Bias of ReLU networks



In this exercise we ask you to run your own experiment that demonstrates/reproduces some known inductive biases of neural networks.

Examples are:
* **bias towards low-complexity boolean functions:** As first demonstrated by [Valle-Pérez et al, 2018](https://arxiv.org/abs/1805.08522), randomly initialised neural networks (whose inputs and outputs are binary) have a tendency to implement lower-complexity Boolean functions. **Hint:** Instead of using Lempel-Ziv complexity like in the paper, you may find it easier and/or more interesting to choose other measures of complexity, such as those mentioned in [this video](https://www.youtube.com/watch?v=XAhvVn1seUY).
* **bias towards low-rank representations:** Look at, for example, Figures 1 or 2 of ([Huh et al, 2023](https://openreview.net/pdf?id=bCiNWDmlY2)). **Hint:** Training results from Figure 1 may be difficult to produce as you want to make sure you tune the learning rates for all depth/rank combinations. The rank analysis at initialization found in Figure 2 would take significantly less time to run.
* **spline behaviour:** It has been shown ([Williams et al. 2020](https://proceedings.neurips.cc/paper_files/paper/2019/file/1f6419b1cbe79c71410cb320fc094775-Paper.pdf)) that, subject to some assumptions, gradient descent training dynamics in shallow ReLU networks lead to either cubic splines (in the so-called kernel regime) or linear splines (in the so-called adaptive regime). The difference between the two regimes is the scale of the random initialization. **Hint:** It is not expected that you follow any details of the theoretical arguments in the paper linked above. It is fine for your investigation to be purely empirical: can you find an example where a trained ReLU network closely matches a cubic spline? Can you change the behaviour by changing initialization scale or learning rates?

I'd like to see whether a probabilistic classifier tends to learn generalisable probabilistic reasoning, or whether it struggles to generalise beyond its training data. For example, will a neural network end up implementing Bayes' Rule, or will it tend towards "if I see the x from training then output y, but if this, then that... and if it wasn't in training, then give a random output"?

In [5]:
import numpy as np
import torch

In [3]:
from torch.distributions.categorical import Categorical

keyboard = np.array([list("qwertyuiop"), list("asdfghjkl;"), list("zxcvbnm,./")])
flat_keyboard = keyboard.flatten()

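def distance(char1, char2):
    """Euclidean distance between two keys, measured in (row, column) grid units."""
    return np.linalg.norm(np.concatenate(np.where(keyboard==char1)) - np.concatenate(np.where(keyboard==char2)))

def prob_typo(char1, char2):
    """Unnormalised logit (not a probability) for typing char2 when char1 was intended."""
    return 7 / (1+distance(char1, char2))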

In [6]:
# variable names ending in "_i" mean that letters are represented by their position in flat_keyboard, rather than as a char

word_list = np.array(["past", "last", "part", "pest", "lest", "kart"])
word_probs = torch.tensor([0.25, 0.25, 0.25, 0.15, 0.05, 0.05])
char_to_index = {letter: idx for idx, letter in enumerate(flat_keyboard)}
word_list_i = torch.tensor([[char_to_index[char] for char in word] for word in word_list])
intended_words_choices = Categorical(probs=word_probs).sample((10**6,))
intended_words, intended_words_i = word_list[intended_words_choices], word_list_i[intended_words_choices]
char_typo_logits = torch.stack([torch.tensor([prob_typo(intended, typo) for typo in flat_keyboard]) for intended in flat_keyboard])

#sample_i = Categorical(logits=char_typo_logits[intended_words_i]).sample()
#typod_words = ["".join(letters) for letters in flat_keyboard[sample_i]]
#sample = (typod_words, intended_words)

In [8]:
# with open("drive/MyDrive/DNNs1/sample.pkl", "wb") as f: pickle.dump(sample, f)

In [7]:
def encode_words(words):
    def encode(word):
        if word not in encode_words.word_cache:
            encode_words.word_cache[word] = np.concatenate([(flat_keyboard==letter).astype(float) for letter in word])
        return encode_words.word_cache[word]
    return np.array([encode(word) for word in words])
encode_words.word_cache = {}

def sample_to_inputs(typod_words):
    return torch.tensor(encode_words(typod_words), device=DEVICE)

def sample_to_indices(intended_words):  # "indices" are the class labels
    mapping = {word: idx for idx, word in enumerate(word_list)}
    return torch.tensor([mapping[word] for word in intended_words], device=DEVICE)

In [9]:
from torch.nn import Sequential, Linear, ReLU, LogSoftmax, NLLLoss

def middle_layer():
    # 120 = 4 letters x 30 keys, the size of the one-hot word encoding
    layer = Linear(120, 120)
    return layer

def output_layer():
    # one output per word in word_list
    layer = Linear(120, 6)
    return layer

def get_typo_network(num_hidden_layers=10):
    if num_hidden_layers <= 0:
        raise ValueError('Number of hidden layers must be positive')
    blocks = []
    for _ in range(num_hidden_layers):
        blocks.append(middle_layer())
        blocks.append(ReLU())
    blocks.append(output_layer())
    blocks.append(LogSoftmax(dim=1))  # log-probabilities, as expected by NLLLoss
    return Sequential(*blocks)

In [10]:
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(DEVICE)

def gradient_steps_with_split(model, xs_train, ys_train, xs_test, ys_test, lr, iterations, optimizer=torch.optim.Adam):
    """Performs the specified number of full-batch optimization steps (Adam by default) on a Typo model
       using the training set, printing the loss on a separate test set every 100 steps."""
    optimizer = optimizer(model.parameters(), lr=lr)
    loss_fn = NLLLoss()

    for i in range(iterations):
        optimizer.zero_grad()
        loss = loss_fn(model(xs_train.float()), ys_train)
        loss.backward()
        optimizer.step()
        if i%100 == 0:
            with torch.no_grad():  # evaluation only, so no graph is needed for the test loss
                print(loss_fn(model(xs_test.float()), ys_test).item())

cpu


In [None]:
"""
model = get_typo_network()
model.to(DEVICE)
split = len(sample[0])*4 // 5
inputs_train = sample_to_inputs(sample[0][:split])
indices_train = sample_to_indices(sample[1][:split])
inputs_test = sample_to_inputs(sample[0][split:])
indices_test = sample_to_indices(sample[1][split:])
gradient_steps_with_split(model, inputs_train, indices_train, inputs_test, indices_test, 0.0001, 3000)
"""

In [11]:
import pickle
from google.colab import drive
drive.mount("/content/drive")

model = get_typo_network()
model.load_state_dict(torch.load("drive/MyDrive/DNNs1/parameters.pth", weights_only=True, map_location=DEVICE))
model.to(DEVICE);

with open("drive/MyDrive/DNNs1/sample.pkl", "rb") as f:  sample = pickle.load(f)

Mounted at /content/drive


In [14]:
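def prediction(model, input_word):
    """Format the model's probabilities over word_list for one typed word, one "word: xx%" line each."""
    # the network ends in LogSoftmax, so exp turns its output into probabilities
    probs = torch.exp(model(sample_to_inputs([input_word]).float())[0])
    return "\n".join([f"{word}: {round(prob.item()*100, 2)}%" for word, prob in zip(word_list, probs)])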

print(prediction(model, "ldst"))

past: 1.02%
last: 61.17%
part: 0.07%
pest: 2.02%
lest: 35.67%
kart: 0.05%


In [15]:
# Categorical(...).logits are normalised log-probabilities, so exp (not sigmoid) turns them into probabilities
normalized_logits = Categorical(logits=torch.tensor([prob_typo("e", typo) for typo in flat_keyboard])).logits
print("\n".join(map(lambda t: f"{t[0]}: {round(torch.exp(t[1]).item(), 3)}", sorted(zip(flat_keyboard, normalized_logits), key=lambda t:-t[1])[:10])))

e: 0.805
w: 0.024
r: 0.024
d: 0.024
s: 0.013
f: 0.013
q: 0.008
t: 0.008
c: 0.008
a: 0.006


In [16]:
# same as above for an intended "a": exp of the normalised log-probabilities
normalized_logits = Categorical(logits=torch.tensor([prob_typo("a", typo) for typo in flat_keyboard])).logits
print("\n".join(map(lambda t: f"{t[0]}: {round(torch.exp(t[1]).item(), 3)}", sorted(zip(flat_keyboard, normalized_logits), key=lambda t:-t[1])[:13])))

a: 0.827
q: 0.025
s: 0.025
z: 0.025
w: 0.014
x: 0.014
d: 0.008
e: 0.007
c: 0.007
f: 0.004
r: 0.004
v: 0.004
g: 0.003


In [17]:
print(prediction(model, "les["))

past: 0.21%
last: 7.99%
part: 0.02%
pest: 5.62%
lest: 86.16%
kart: 0.0%


Single quotes mark the intended word; double quotes mark what was typed. The prior ratio is P('lest') / P('last') = 0.05 / 0.25 = 0.2, and the two words differ only in their second letter, so the other positions cancel.

Bayes' rule using the actual distributions:

$$
\begin{aligned}
\frac{P(\text{`lest'} \mid \text{``les["})}{P(\text{`last'} \mid \text{``les["})}
&= \frac{P(\text{``les["} \mid \text{`lest'}) ~ P(\text{`lest'}) / P(\text{``les["})}{P(\text{``les["} \mid \text{`last'}) ~ P(\text{`last'}) / P(\text{``les["})} \\[0.1in]
&= 0.2 ~ \frac{P(\text{``les["} \mid \text{`lest'})}{P(\text{``les["} \mid \text{`last'})}
= 0.2 ~ \frac{P(\text{``e"} \mid \text{`e'})}{P(\text{``e"} \mid \text{`a'})}
\approx 0.2 \times \frac{0.805}{0.007} \approx 23
\end{aligned}
$$

Model's result, having never seen this example:

$$
\frac{86.16}{7.99} \approx 10.8
$$

The model gets the direction and the order of magnitude right for an example it had never seen during training, although at roughly 11:1 it is less confident than the Bayes ratio of about 23:1. It also handles something non-trivial well: a typed "r" in second position is enough to overcome the prior of P('last') = 5 P('lest'), but a "d" in second position is not (that is, an "r" makes the posterior probability of 'lest' overtake 'last', whereas a "d" does not). This is because 'e'→"d" is only about 3× more likely than 'a'→"d" (0.024 vs 0.008), which is still less than the prior of 5×, whereas 'e'→"r" is about **6**× more likely than 'a'→"r" (0.024 vs 0.004).


So it seems like it has learned some generalisable notions about probability, rather than just memorising the training answers.
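For comparison with the model's outputs, the exact posterior can be computed directly from the generating distributions. The following is a minimal sketch in plain Python (no torch), re-declaring the keyboard, prior and typo logits from the cells above so it runs on its own. The names `POSITION`, `PRIOR`, `typo_prob` and `bayes_posterior` are introduced here for illustration, and skipping typed characters that are not on the keyboard (such as "[") is an assumption of this sketch, not something the sampling code defines.

```python
import math

# Same layout, prior and typo model as the cells above, re-declared so the sketch is self-contained.
KEYBOARD = ["qwertyuiop", "asdfghjkl;", "zxcvbnm,./"]
POSITION = {ch: (r, c) for r, row in enumerate(KEYBOARD) for c, ch in enumerate(row)}
PRIOR = {"past": 0.25, "last": 0.25, "part": 0.25, "pest": 0.15, "lest": 0.05, "kart": 0.05}


def typo_prob(intended, typed):
    """P(typed | intended): softmax over all keys of the logit 7 / (1 + key distance)."""
    def weight(key):
        return math.exp(7 / (1 + math.dist(POSITION[intended], POSITION[key])))
    return weight(typed) / sum(weight(key) for key in POSITION)


def bayes_posterior(typed_word):
    """Exact posterior over intended words, with each letter corrupted independently.

    Assumption of this sketch: a typed character that is not on the keyboard
    is skipped, i.e. treated as carrying no information about the word.
    """
    joint = {}
    for word, prior in PRIOR.items():
        likelihood = 1.0
        for intended, typed in zip(word, typed_word):
            if typed in POSITION:
                likelihood *= typo_prob(intended, typed)
        joint[word] = prior * likelihood
    total = sum(joint.values())
    return {word: p / total for word, p in joint.items()}


# Print the exact posterior in the same format as prediction(model, ...)
for word, p in bayes_posterior("ldst").items():
    print(f"{word}: {100 * p:.2f}%")
```

Calling `bayes_posterior` on the same strings passed to `prediction(model, ...)` gives a reference distribution to compare against, word by word.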

Another property we want is a notion of "invariance over irrelevant changes": a change to the *final* typed letter should not change the model's preference between 'past' and 'last', given that these words differ only in the first letter and my sample generation corrupted each letter independently.

In [18]:
print(prediction(model, "pas."))
print("\n")
print(prediction(model, "pas;"))

past: 96.59%
last: 1.71%
part: 1.01%
pest: 0.68%
lest: 0.0%
kart: 0.0%


past: 97.18%
last: 1.8%
part: 0.75%
pest: 0.28%
lest: 0.0%
kart: 0.0%


This is quite consistent -- the percentages are very similar, and the "preference ranking" stays identical. It's not perfect though, so the underlying function probably isn't as simple as we'd like.
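Under the independent per-letter typo model, the exact posterior has this invariance exactly. A short derivation, writing $t_1 t_2 t_3 t_4$ for the typed letters and $w_2 w_3 w_4$ for the shared letters 'a', 's', 't' (notation introduced here for illustration):

$$
\frac{P(\text{`past'} \mid t_1 t_2 t_3 t_4)}{P(\text{`last'} \mid t_1 t_2 t_3 t_4)}
= \frac{P(\text{`past'})}{P(\text{`last'})} \cdot \frac{P(t_1 \mid \text{`p'})}{P(t_1 \mid \text{`l'})} \cdot \prod_{i=2}^{4} \frac{P(t_i \mid w_i)}{P(t_i \mid w_i)}
= \frac{P(t_1 \mid \text{`p'})}{P(t_1 \mid \text{`l'})}
$$

The prior ratio is 0.25 / 0.25 = 1 and every factor for positions 2 to 4 cancels, so the exact ratio does not depend on $t_4$ at all. For the two inputs above, the model's ratios are 96.59 / 1.71 ≈ 56.5 and 97.18 / 1.8 ≈ 54.0.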

## Section C Writeup Checklist

This is an exploratory project, but here is an example of what components you want to include:
* a hypothesis: what is the motivation for your work, what do you expect to see or demonstrate.
* brief description of methodology, network/optimization parameters, etc
* a figure or two summarising your findings
* discussion of whether findings support your hypothesis or not, perhaps speculation or explanation about what is going on.

Note: if you run your experiments and they don't produce the expected effect, that is perfectly fine, so long as your experiment made sense. Our existing understanding of inductive biases is far from complete, and the observations above may not be universal for all architectures and training regimes.