# Long-term memories
## Intro
Word embeddings are an example of a differentiable dataset, but they are specific to words. This type of representation is better suited to a nn than a fixed database (such as Wikipedia). Such datasets could be improved by a community (eg using a commit system). The internet isn't yet differentiable, but it would be cool to have a version of it that a nn can easily read from.

What's the difference between this and a standard dataset for neural nets? In general, standard datasets are <input, output> pairs. But these differentiable datasets are <input, transform(input)> pairs. For example, if we are using word embeddings, we have <word, vector>, with the vector being the embedding of the word.

## Word embeddings
Why do word embeddings work? They make a representation better suited for the nn. Two qualities: a) we train them with huge datasets, b) we use distributed representations.

The dataset used to train word embeddings such as GloVe is massive (the largest released GloVe vectors were trained on 840B tokens).

Why doesn't the nn just read the raw text? It's not optimal. Every neural net could learn word embeddings on its own, but it doesn't make sense to start from scratch every time. That's an advantage of electronic systems over humans: they can transfer what they learned.

## Internet
The internet is massive, and it doesn't make sense to me that neural networks aren't taking advantage of it. A human connected to the internet is much more intelligent than when she is disconnected.

The internet seems very smart on its own. For instance, Google can predict with high accuracy the next word in a sentence. If neural nets are good with large datasets, then the internet is a good place for them. How a neural net could learn from it isn't clear. One option could be to treat the internet as a big matrix (as if it were a memory) and allow the nn to read from it (as in memory networks).

The nn could backprop through the internet, and that would mean having a differentiable dataset. The nn could also write to the internet (similar to people asking in forums or posting what they found useful).
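To make the "read from a big matrix" idea a bit more concrete, here is a minimal sketch of a soft read, assuming the memory is a small list of rows and the nn produces a query vector. The names `soft_read`, `memory`, and `query` are made up for illustration, and a real memory network has more moving parts than this. The point is only that the read is a smooth function of both the query and the memory, which is what would let gradients flow through it.

```python
import math

def soft_read(memory, query):
    """Return (weights, read): a softmax-weighted average of the memory rows."""
    # Similarity between the query and each memory row (dot product).
    scores = [sum(q * m for q, m in zip(query, row)) for row in memory]
    # Softmax turns the scores into weights that sum to 1.
    top = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Blend the rows using the weights. No hard choice is made,
    # so the result changes smoothly with the query and the memory.
    dim = len(memory[0])
    read = [sum(w * row[i] for w, row in zip(weights, memory)) for i in range(dim)]
    return weights, read

# Toy memory with three slots (hypothetical content).
memory = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
weights, read = soft_read(memory, query=[4.0, 0.0, 0.0])
```

With this query, most of the weight lands on the first slot, but every slot contributes a little. That is the price of keeping the read differentiable.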

Consider that it isn't necessary to go through the whole internet at every iteration. Instead, we can have a hierarchical structure where we access everything in log(n) steps, with n being the size of the internet. Say storing the internet takes around $10^{18}$ bytes. Then $\log_2(10^{18}) \approx 60$. What does that mean? It means that with 60 binary decisions we can decide which byte of the internet we are going to read (assuming we store the internet in a balanced binary tree).
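A quick check of that arithmetic, as a sketch. The $10^{18}$ figure is the rough estimate from the text, and `address_from_decisions` is a hypothetical helper that treats each decision as "go left" (0) or "go right" (1) in the tree.

```python
import math

INTERNET_BYTES = 10**18  # rough size assumed in the text

# Number of binary decisions needed to single out one byte.
depth = math.ceil(math.log2(INTERNET_BYTES))

def address_from_decisions(decisions):
    """Turn a sequence of binary decisions (0 = left, 1 = right) into a byte index."""
    index = 0
    for bit in decisions:
        index = index * 2 + bit
    return index

# Always going right reaches the last leaf of a full tree of that depth.
last_leaf = address_from_decisions([1] * depth)
print(depth)
```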

## Task
We want a task that requires storing a lot of facts and correctly retrieving a few of them at each time step.

Standard tasks aren't useful because they contain only so much data. What we need is a task where the neural net receives a huge number of different datasets. Remember that the nn doesn't need to remember every word in every article; those will be stored. What the nn needs is "to go to a place that makes it remember"* the memory. (Notice you can repeat the phrase in quotation marks as many times as you want. Eg, with times=2, the sentence is "what the nn needs is to go to a place that makes it remember to go to a place that makes it remember the memory.")

## Individual vs shared knowledge bases
### Humans
It makes sense to have a particular long-term memory for every human. Another possibility could be a huge, central knowledge base. For primitive humans, this could work only for non-fast decisions.

Also, communicating information is so difficult! Talking seems to be one of the most efficient ways of communicating information, and it seems much slower than neurons communicating with each other. Even with this huge latency, we evolved to have these large knowledge bases. It seems primitive humans didn't live alone but lived in communities of 150 people. That community size could have been enough for the amount of information they needed to store (mostly information about their environment). In the modern era, libraries and the internet are examples of huge knowledge bases.

Almost every field related to knowledge works as a knowledge-base that increases over time. Society doesn't expect the young people to rederive every concept. Instead, we learn what was previously discovered and then (ideally) add something on top of it or change something that was incorrect. Language, mathematics, and societies are shared knowledge bases.

### Neural nets
It seems much better to access a shared knowledge base than to learn everything from scratch. Indeed, if we want to build long-term memories for neural networks, then it doesn't seem sensible to have a long-term memory for each individual (as humans have.) 

## More on word embeddings
What we are doing is encoding relational information. It's just encoding how things interact. A vector doesn't have meaning. It's only when things like $vector(woman) - vector(man) + vector(king) \approx vector(queen)$ happen, that we can say something useful. {is there a way of learning absolute meaning?} {what else other than +/- operations are there (ie other transformations)?}
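Here is a toy sketch of that vector arithmetic. The 2-d vectors below are invented by hand so that the relation holds exactly (one axis loosely stands for "royalty", the other for "gender"). Real embeddings are learned, have hundreds of dimensions, and satisfy the relation only approximately. The helper names `cosine` and `analogy` are also just for illustration.

```python
import math

# Hypothetical hand-made vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def analogy(a, b, c, vectors):
    """Find the word whose vector is closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the query words themselves, as is commonly done.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("woman", "man", "king", toy_vectors)
print(result)
```

Notice that no single vector carries meaning here either. Only the differences between vectors do, which is the relational point made above.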

It's similar to knowing "A => B", but without knowing whether A or B are true. Also, it's as if we had y = 2x, or c = a + b, without ever needing to feed values for a, b, or x.

For instance, say we have "Mary is twice as fast as Charles." Formally, we have mary_speed = 2 * charles_speed. If we then ask "How fast is Mary?" we can answer "Twice as fast as Charles."

But what if I ask you "How fast is Mary? (Don't give an answer in terms of Charles.)" And say you know that Charles runs at 8 km/h. Then you would say that Mary runs at 16 km/h. But it continues to be relative.

What is one km? One km is ten blocks. And what's a block? A hundred steps. Now we are operating at a very low level. When I think of these concepts, everything I have is related to images. I have an image to represent the distance of a block. I have the same to represent the distance of a step.

## Primitives
It's interesting that we don't have one hardcoded explanation for a concept from which everything else builds (that is, we don't have an image for the distance of a step and build everything from there). Instead, we have an image for the distance of a step, for a block, and even for a km. And that complements the abstract meaning of 100 steps = 1 block and 10 blocks = 1 km.

The low-level memories seem to be best represented as images. It seems as if humans have good primitives for dealing with images. What are these primitives about?

It seems there are at least two primitives:
* storage: memorization techniques rely on converting symbols (eg sentences, cards) to images.
* understanding: techniques for deep understanding rely on visualizations: plots, interactive simulations, diagrams.

Thus, storage and understanding benefit from using images. It doesn't seem that we need primitives _about images_ to have intelligence. But to get intelligence, we need good manipulation of _some_ data type. We then can map other data types to the one we have good tools to manipulate. 

Storage doesn't seem to be the limiting factor with intelligence. Or is it? It's interesting that AI research doesn't focus on storage. It's not clear that humans can store less information than a computer. It seems a human can't store as much information as the internet has.

We shouldn't mix the information that could be consciously accessed with the information that a brain stores as a whole. Let's consider the conscious memories, which give us a lower bound for all the memories a human can store. I can easily retrieve memories of rides. Most of the things I can retrieve are images. I can retrieve some sequences of numbers. For a given person, I can retrieve their house, face, name, likes. This could be completely wrong, but it seems we are able to consciously retrieve between 100k and 10,000k images in total (assume we use a sensible equivalence between other data types and images, so those 100k-10,000k images also account for other types of memories). We can store 2,000k images of 500 kB each in one TB. So it seems the memory of a pc is in the same order of magnitude as that of a human. [1] Now the unconscious could be doing whatever it pleases and it could have much more information stored. Who knows?
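The back-of-the-envelope numbers above can be checked with a few lines. This sketch assumes decimal units (1 kB = $10^3$ bytes, 1 TB = $10^{12}$ bytes) and takes the 500 kB per image and the 100k to 10,000k range straight from the text. `storage_tb` is a throwaway helper name.

```python
IMAGE_BYTES = 500 * 10**3  # 500 kB per image, as assumed in the text
TERABYTE = 10**12          # decimal terabyte

# How many such images fit in one TB.
images_per_tb = TERABYTE // IMAGE_BYTES

def storage_tb(num_images):
    """Storage needed for num_images images, in TB."""
    return num_images * IMAGE_BYTES / TERABYTE

# Guessed range of consciously retrievable "images" from the text.
low_tb = storage_tb(100_000)
high_tb = storage_tb(10_000_000)
print(images_per_tb, low_tb, high_tb)
```

So the guessed range corresponds to somewhere between a twentieth of a TB and a few TB, which is indeed around what a single pc can hold.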

## Direction of research
Taking what we said above, it seems as if the problems we focus on with neural nets are a little bit off from the problems we should focus on for AGI. With current nns, we train them from scratch to learn about one task. And at the same time we say that they aren't able to generalize. What if the generalization comes from training them on several tasks and using differentiable datasets?

I think we shouldn't use synthetic datasets, because they are misleading. The fact that they are synthetic means they are explained by a few, simple rules. We can't do that with human knowledge. [2]

## Takeaways
A good direction seems as follows. We train a neural network that uses information from the internet and from previously trained neural networks. The task should be general but easy. That is, we use the largest knowledge-base we can use where we obtain some performance that is better than chance. 

## Notes
[1] I think humans are much more efficient at storing data than computers.

[2] Evidence for this is that symbolic systems didn't quite work and people also failed in trying to make huge databases for human knowledge. 

[3] A recursive Hopfield net is a set of Hopfield networks of increasing complexity. We start with a blurred memory from the simplest Hopfield net and refine it using the increasingly complex ones.
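For reference, here is a minimal sketch of the building block only: a single, standard Hopfield network with Hebbian weights that recovers a stored pattern from a corrupted one. It does not implement the recursive scheme from the note, which isn't specified in enough detail here. The patterns and the names `train` and `recall` are made up for illustration.

```python
def train(patterns):
    """Hebbian weights for patterns of +1/-1 values (no self-connections)."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j]
    return w

def recall(w, state, steps=5):
    """Update all units synchronously until the state stops changing."""
    state = list(state)
    for _ in range(steps):
        new_state = []
        for i, row in enumerate(w):
            h = sum(wij * sj for wij, sj in zip(row, state))
            # Keep the current value when the input is exactly zero.
            new_state.append(state[i] if h == 0 else (1 if h > 0 else -1))
        if new_state == state:
            break
        state = new_state
    return state

# Two orthogonal toy patterns (hypothetical "memories").
pattern_a = [1, 1, 1, 1, -1, -1, -1, -1]
pattern_b = [1, -1, 1, -1, 1, -1, 1, -1]
weights = train([pattern_a, pattern_b])

# A "blurred" version of pattern_a: the first unit is flipped.
blurred = [-1, 1, 1, 1, -1, -1, -1, -1]
restored = recall(weights, blurred)
```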