# Probabilistic Recurrent Neural Networks

Consider a sequence of words in a text. We have a system whose internal state has dimension L, 
meaning it can store the information of L words. The system goes through the words one by one, 
and we want to store the information of those words in the system.

The objective is for the information about any word in the text to decay as a power law of the distance 
between that word and the current word. This would address the problem of exponential decay of information in RNNs.

The objective will be achieved by first applying a contextual embedding of the words, which
will store the information of the given word and the D words around it. 
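
As a rough illustration of such an embedding, the sketch below averages each word vector with its neighbours. The function name `contextual_embedding`, the choice of D neighbours on each side, the uniform default weights, and the renormalisation at the text boundaries are all assumptions made for this example, not details fixed by the text.

```python
import numpy as np

def contextual_embedding(vectors, D, weights=None):
    """Weighted average of each vector with up to D neighbours on each side.

    vectors: array of shape (T, d), one row per word.
    weights: array of length 2*D + 1 (hypothetical; uniform if omitted).
    """
    vectors = np.asarray(vectors, dtype=float)
    T = len(vectors)
    if weights is None:
        weights = np.ones(2 * D + 1)
    weights = np.asarray(weights, dtype=float)
    out = np.zeros_like(vectors)
    for t in range(T):
        lo, hi = max(0, t - D), min(T, t + D + 1)
        # Clip the weight window where it would run past the text boundaries.
        w = weights[lo - t + D : hi - t + D]
        # Renormalise so the weights actually used sum to 1.
        out[t] = w @ vectors[lo:hi] / w.sum()
    return out

one_hot = np.eye(5)  # a toy "text" of 5 distinct words
embedded = contextual_embedding(one_hot, D=1)
print(embedded.round(2))
```

With one-hot inputs, each row of `embedded` spreads the unit weight of a word over its window, which makes the effect of D easy to inspect.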

Then, the system will store each word it encounters in the memory, which can contain up to L words.
At each step, the system will update the memory by adding the information of the current word
and removing older words from the memory with a probability that depends on
how many steps ago each word was added to the memory:

$$ p = 1 - \left(\frac{n-1}{n}\right)^\alpha $$

where $n$ is the number of words that have passed since the word was added to the memory, and $\alpha$ 
is the exponent of the power law.
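
A minimal sketch of one such update step follows. It assumes that a word enters the memory with age 1 and is first tested for removal at $n = 2$ (the formula gives $p = 1$ at $n = 1$), and that any overflow beyond the capacity is resolved by dropping the oldest entries. The function `update_memory` and its arguments are hypothetical names chosen for this example.

```python
import random

def update_memory(memory, new_vector, alpha, capacity, rng=random):
    """One update step: age stored words, forget some at random, add the new word.

    memory: list of (vector, age) pairs, oldest first.
    """
    kept = []
    for vector, age in memory:
        n = age + 1  # one more word has passed since this entry was added
        p_forget = 1.0 - ((n - 1) / n) ** alpha
        if rng.random() >= p_forget:
            kept.append((vector, n))
    kept.append((new_vector, 1))
    # Assumption: if the memory is still over capacity, drop the oldest entries.
    return kept[-capacity:]

rng = random.Random(0)
memory = []
for word in "the quick brown fox jumps over the lazy dog".split():
    memory = update_memory(memory, word, alpha=1.0, capacity=4, rng=rng)
print(memory)
```

Here plain strings stand in for the word vectors; any object can be stored, since the routine only inspects the ages.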

The probabilistic routine we just discussed can in principle be replaced by a neural network, which
would be able to learn the optimal way to update the memory as a function of the new input and the
current state of the memory. This will be the next step of the project.

The expected value of the amount of information retained about a single word is a fraction of the original information, which can be expressed as:

$$ p = \frac{c}{n^\alpha}$$

by construction, where $ c = \sum_{i=1}^{\infty}\left(1/i\right)^\alpha $.
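
The power-law dependence on $n$ can be checked by multiplying the per-step probabilities of keeping a word, $1 - p$. Assuming the first removal test happens at $n = 2$ (an assumption, since the update rule gives $p = 1$ at $n = 1$), the product telescopes:

$$ \prod_{k=2}^{n} \left(\frac{k-1}{k}\right)^\alpha = \left(\frac{1}{2}\cdot\frac{2}{3}\cdots\frac{n-1}{n}\right)^\alpha = \frac{1}{n^\alpha} $$

so the probability that a word is still in the memory after $n$ steps is proportional to $n^{-\alpha}$.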

On the other hand, the standard deviation $\sigma$ of the distribution depends on the number of neighbours used in the contextual embedding. Determining this relation is important to ensure that information can in fact be reliably stored in the internal state, and we will derive it in the first part of the project.

We can start to understand the permanence of information in the memory by using one-hot vectors as the text, embedding them with the contextual embedding, and seeing how the sum of the scalar products of the memory vectors with a given word vector behaves:

$$ I_i = \sum_{j=0}^{L} v_j \cdot w_i $$

where $w_i$ is the one-hot encoded vector with 1 in position $i$ and 0 elsewhere, and $v_j$ are the vectors in the memory at a given time.

The behaviour of $I_i$ will be a power law with respect to the index $i$.
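
This behaviour can be probed with a small Monte Carlo experiment. The sketch below makes several simplifying assumptions: raw one-hot vectors are stored without the contextual embedding, the capacity is large enough never to bind, and a word is first tested for removal at age 2. Under these assumptions $I_i$ is simply 1 if word $i$ is still in the memory and 0 otherwise, so its average over many runs estimates the survival probability.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, T, runs = 1.0, 50, 5000  # illustrative values, not taken from the text

alive = np.zeros((runs, T), dtype=bool)
for t in range(T):
    # Age of every earlier word once word t arrives (word t-1 has age 2, etc.).
    ages = t - np.arange(t) + 1
    p_forget = 1.0 - ((ages - 1) / ages) ** alpha
    # Each stored word is removed independently with its own probability.
    alive[:, :t] &= rng.random((runs, t)) >= p_forget
    alive[:, t] = True  # the current word enters the memory with age 1

# With one-hot vectors, I_i is 1 when word i is still stored and 0 otherwise.
I_mean = alive.mean(axis=0)
ages_final = T - np.arange(T)
expected = ages_final.astype(float) ** -alpha
max_deviation = np.abs(I_mean - expected).max()
print(max_deviation)
```

Plotting `I_mean` against `ages_final` on log-log axes should give an approximately straight line of slope $-\alpha$ in this simplified setting.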

### Positional Encoding

The information relating to a single word is not given only by its contextual embedding, which in our case is simply a weighted average of the neighbouring words. 

Thus, a positional encoding is added to make sure that information about position is carried not only by the probabilities of forgetting vectors, but also by the vectors themselves.


```python
import random
from collections import defaultdict

import numpy as np

from memory import Memory
from text import Text
from model import Model
import config
```

