<a href="https://colab.research.google.com/github/JDS289/DNNs/blob/main/finalC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#C: Inductive Bias of ReLU networks


I'd like to see whether a probabilistic classifier tends to learn generalisable probabilistic reasoning, or whether it struggles to generalise beyond its training data. For example, will a neural network end up implementing Bayes' rule, or will it tend to just memorise its training distributions?

I decided to make a simple "Typo" dataset: words are drawn from a distribution over a small word list, and as each word is typed, each letter has some probability of being replaced by another letter, based on the keyboard distance between the two. What matters is not whether this toy dataset is realistic, but whether a neural network will learn these distributions, and whether its inductive biases lead it towards a "simple" implementation.
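The generative process described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the word weights are hypothetical (the only prior fact given in this notebook is P('last') = 5 P('lest')), and the typo model is a plain softmax over `7 / (1 + Euclidean key distance)`, which is a stand-in for the notebook's own `get_typo_distribution` and will not reproduce its exact probabilities.

```python
import math
import random

KEYBOARD = ["qwertyuiop", "asdfghjkl;", "zxcvbnm,./"]
# Map each key to its (row, column) position on the 3x10 grid.
POS = {ch: (r, c) for r, row in enumerate(KEYBOARD) for c, ch in enumerate(row)}
KEYS = list(POS)

# Hypothetical prior weights; only the 5:1 ratio of 'last' to 'lest' comes from the notebook.
WORD_WEIGHTS = {"past": 3, "last": 5, "part": 2, "pest": 2, "lest": 1, "kart": 1}

def typo_distribution(intended):
    """Assumed typo model: softmax over 7 / (1 + Euclidean key distance)."""
    r0, c0 = POS[intended]
    logits = [7 / (1 + math.hypot(r - r0, c - c0)) for r, c in POS.values()]
    z = sum(math.exp(l) for l in logits)
    return [math.exp(l) / z for l in logits]

def sample_typed(rng):
    """Draw an intended word from the prior, then corrupt each letter independently."""
    words = list(WORD_WEIGHTS)
    intended = rng.choices(words, weights=list(WORD_WEIGHTS.values()))[0]
    typed = "".join(
        rng.choices(KEYS, weights=typo_distribution(ch))[0] for ch in intended
    )
    return intended, typed

rng = random.Random(0)
samples = [sample_typed(rng) for _ in range(5)]
print(samples)  # (intended, typed) pairs
```

Each letter is corrupted independently of the others, which is the property the invariance test later relies on.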

What I'm testing for, specifically:
* *Generalised probabilistic reasoning:* Can it learn to reason "even though this typo is closer to X, it's more likely to have been Y because Y is a much more common word" (and vice versa) on an unseen typo?
* *Smoothness / Irrelevance-Invariance:* Will its learned behaviour represent an elegant function i.e. will its output stay the same when modifying an irrelevant part of the input?

#Implementation

The inputs (all four-character strings) are one-hot encoded as follows. The keyboard has 30 characters; flattening it to 1D gives each character a position from 0 to 29. The input layer has 120 neurons (4 × 30), and neuron i is set to 1 if *character i//30 of the input string is at keyboard position i%30*, otherwise 0. The network's outputs are passed through a softmax, and the network is trained with log loss (cross-entropy) using Adam.
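A minimal sketch of this encoding is shown here. The function name `encode` is hypothetical, and how the notebook treats characters that are not on the 30-key keyboard (such as "[") is not shown here, so leaving that block all zero is an assumption.

```python
import numpy as np

# Same 3x10 keyboard layout as in the notebook, flattened row by row.
KEYS = "qwertyuiop" + "asdfghjkl;" + "zxcvbnm,./"
KEY_INDEX = {ch: i for i, ch in enumerate(KEYS)}

def encode(word):
    """One-hot encode a four-character string into a 120-dimensional vector.

    Neuron i is 1 when character i // 30 of the word sits at keyboard
    position i % 30. Characters that are not on the keyboard leave their
    30-slot block all zero (an assumption, not the notebook's confirmed behaviour).
    """
    assert len(word) == 4
    vec = np.zeros(4 * len(KEYS), dtype=np.float32)
    for pos, ch in enumerate(word):
        if ch in KEY_INDEX:
            vec[pos * len(KEYS) + KEY_INDEX[ch]] = 1.0
    return vec

x = encode("lest")
print(np.flatnonzero(x))  # [18 32 71 94]
```

For example, "e" sits at keyboard position 2, and as the second character (index 1) it activates neuron 1 × 30 + 2 = 32.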

I chose the number of hidden layers by comparing the loss on a held-out set. I settled on 10, which seemed to work fairly well without overfitting much (deeper networks were too slow for me to train).

#Results

In [8]:
keyboard = np.array([list("qwertyuiop"), list("asdfghjkl;"), list("zxcvbnm,./")])
flat_keyboard = keyboard.flatten()

def prob_typo(char1, char2):  # unscaled logits
    return 7 / (1+distance(char1, char2))

def get_typo_distribution(intended_letter):
    # Score every key, convert the scores to probabilities, and return
    # (key, probability) pairs sorted from most to least likely.
    logits = np.array([prob_typo(intended_letter, char) for char in flat_keyboard])
    probs = torch.sigmoid(normalize_logits(logits))
    return sorted(zip(flat_keyboard, probs), key=lambda t: -t[1])

def format_distribution(intended_letter, length=10):
    top = get_typo_distribution(intended_letter)[:length]
    lines = [f"{char}: {round(prob.item(), 3)}" for char, prob in top]
    return f"Typo distribution from {intended_letter}:\n" + "\n".join(lines)

print(format_distribution("e"))
print("\n")
print(format_distribution("a"))

Typo distribution from e:
e: 0.446
w: 0.024
r: 0.024
d: 0.024
s: 0.013
f: 0.013
q: 0.008
t: 0.008
c: 0.008
a: 0.006


Typo distribution from a:
a: 0.453
q: 0.024
s: 0.024
z: 0.024
w: 0.014
x: 0.014
d: 0.008
e: 0.007
c: 0.007
f: 0.004


In [None]:
print(prediction(model, "les["))

past: 0.21%
last: 7.99%
part: 0.02%
pest: 5.62%
lest: 86.16%
kart: 0.0%




Single quotes mark the intended word; double quotes mark what was typed:

$$
\begin{aligned}
&\text{Bayesianism using the actual distributions:} \\[0.1in]
&\frac{P(`\text{lest'} \mid ``\text{les["})}{P(`\text{last'} \mid ``\text{les["})}
= \frac{P(``\text{les["} \mid `\text{lest'}) \, P(`\text{lest'}) / P(``\text{les["})}{P(``\text{les["} \mid `\text{last'}) \, P(`\text{last'}) / P(``\text{les["})} \\[0.3in]
&= 0.2 \, \frac{P(``\text{les["} \mid `\text{lest'})}{P(``\text{les["} \mid `\text{last'})}
= 0.2 \, \frac{P(``\text{e"} \mid `\text{e'})}{P(``\text{e"} \mid `\text{a'})}
\approx 0.2 \cdot \frac{0.446}{0.007} \approx 12.7 \\[0.5in]
&\text{Model's result, having never seen this example:} \\[0.1in]
&\frac{86.16}{7.99} \approx 10.8
\end{aligned}
$$
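The arithmetic can be checked directly. 'lest' and 'last' agree in the first, third and fourth letters, so those likelihood terms cancel and only the second letter matters. The snippet below is a small sanity check using the rounded numbers printed above; the variable names are illustrative only.

```python
# Numbers copied from the outputs above (rounded to 3 decimals in the tables).
prior_ratio = 0.2      # P('lest') / P('last'), since P('last') = 5 P('lest')
p_e_given_e = 0.446    # P("e" typed | 'e' intended)
p_e_given_a = 0.007    # P("e" typed | 'a' intended)

# Posterior ratio P('lest' | "les[") / P('last' | "les[") under Bayes' rule.
bayes_ratio = prior_ratio * p_e_given_e / p_e_given_a

# The same ratio according to the model's predicted percentages.
model_ratio = 86.16 / 7.99

print(round(bayes_ratio, 1), round(model_ratio, 1))  # 12.7 10.8
print(round(model_ratio / bayes_ratio, 2))           # 0.85
```

Because 0.007 is itself a rounded value, the Bayesian figure of 12.7 carries a few percent of rounding uncertainty.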

This is fairly close (the model's ratio is within about 15% of the Bayesian one), which is notable for an example the model never saw during training. The model also handled something non-trivial well: a typed "r" in second position is enough to overcome the prior of P('last') = 5 P('lest'), but a typed "d" in second position is not. That is, an "r" makes the posterior probability of 'lest' overtake 'last', whereas a "d" does not. This matches Bayes' rule: going by the tables above, 'e'→"d" is only about 3× more likely than 'a'→"d" (0.024 vs 0.008), which is less than the 5× prior, whereas 'e'→"r" is about **6**× more likely than 'a'→"r".
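The threshold argument can be written as a one-line decision rule: 'lest' wins only when the likelihood ratio for the second letter exceeds the 5× prior in favour of 'last'. In this sketch the function name is hypothetical, the "d" ratio is taken from the rounded tables above (0.024 / 0.008, roughly 3), and the "r" ratio is the 6× figure stated in the text, since 'a'→"r" does not appear in the printed top 10.

```python
def posterior_favours_lest(likelihood_ratio, prior_last_over_lest=5.0):
    """Return True when P('lest' | typed) > P('last' | typed).

    likelihood_ratio is P(typed letter | 'e') / P(typed letter | 'a') for the
    second position; the other three positions cancel because the words agree there.
    """
    return likelihood_ratio / prior_last_over_lest > 1.0

ratio_d = 0.024 / 0.008  # from the rounded tables above: about 3
ratio_r = 6.0            # figure stated in the text, not read from the tables

print(posterior_favours_lest(ratio_d))  # False: the 5x prior still wins
print(posterior_favours_lest(ratio_r))  # True: the typo evidence overcomes the prior
```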


So it seems like it has learned some generalisable notions about probability, rather than just memorising the training answers.

Another property we want is invariance to irrelevant changes: a change to the *final* letter should not alter the model's preference for 'past' vs 'last', given that these words differ only in the first letter and my sample generation treats each letter independently.

In [None]:
print(prediction(model, "pas."))
print("\n")
print(prediction(model, "pas;"))

past: 96.59%
last: 1.71%
part: 1.01%
pest: 0.68%
lest: 0.0%
kart: 0.0%


past: 97.18%
last: 1.8%
part: 0.75%
pest: 0.28%
lest: 0.0%
kart: 0.0%


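This is fairly consistent: the preference ranking is identical and the top probabilities barely move (past: 96.59% vs 97.18%; last: 1.71% vs 1.8%), so the past-to-last ratio stays at roughly 55:1. It's not perfect, though: the low-probability words shift more (pest drops from 0.68% to 0.28%), so the underlying function probably isn't as simple as we'd like. Of course, the model could have been trained for much longer.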
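One way to put numbers on this comparison is sketched below, using the percentages printed above. The choice of metrics (ranking equality, total variation distance, and the past-to-last ratio) is mine rather than something the notebook computes.

```python
# Percentages copied from the two outputs above.
pas_dot = {"past": 96.59, "last": 1.71, "part": 1.01, "pest": 0.68, "lest": 0.0, "kart": 0.0}
pas_semi = {"past": 97.18, "last": 1.80, "part": 0.75, "pest": 0.28, "lest": 0.0, "kart": 0.0}

def ranking(dist):
    # sorted() is stable, so tied words (lest, kart) keep their original order.
    return sorted(dist, key=lambda w: -dist[w])

same_ranking = ranking(pas_dot) == ranking(pas_semi)

# Total variation distance between the two outputs, in percentage points.
tv = 0.5 * sum(abs(pas_dot[w] - pas_semi[w]) for w in pas_dot)

# The preference that should be unaffected by the final letter.
ratio_dot = pas_dot["past"] / pas_dot["last"]
ratio_semi = pas_semi["past"] / pas_semi["last"]

print(same_ranking)                               # True
print(round(tv, 2))                               # 0.67
print(round(ratio_dot, 1), round(ratio_semi, 1))  # 56.5 54.0
```

Under exact invariance the two ratios would be equal; here they differ by about 4%.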