On 2023/01/26, we discussed that it would be more interesting to analyze **conditional probabilities** (as opposed to unconditional probabilities), since language is largely contextual. To this end, we agreed on:

- Run the analysis and comparisons in terms of tokens (as opposed to text).
- Run a **bi-gram** analysis of the words that co-occur most often at the beginning of the documents.
- Use the first token of the bigram to condition the LM and observe the conditional probability of the second token, given the first one: $P(N_{w_2}(K) > 0 \mid w_1)$.
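
On the data side, the bigram step above can be sketched as follows. This is a minimal sketch under assumptions: the function name `bigram_conditionals` and the toy token IDs are illustrative, and the estimate is the plain next-token relative frequency $c(w_1, w_2) / c(w_1)$ over the first `k` tokens, which is a simplification of $P(N_{w_2}(K) > 0 \mid w_1)$ rather than the exact quantity. The LM side (conditioning the model on $w_1$) is not shown.

```python
from collections import Counter


def bigram_conditionals(token_docs, k):
    """Estimate P(w2 | w1) from adjacent token pairs in the first k tokens."""
    pair_counts = Counter()   # c(w1, w2)
    first_counts = Counter()  # c(w1), counted only where w1 starts a pair
    for tokens in token_docs:
        prefix = tokens[:k]
        for w1, w2 in zip(prefix, prefix[1:]):
            pair_counts[(w1, w2)] += 1
            first_counts[w1] += 1
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


# Toy token IDs (illustrative only, not taken from the dataset).
toy_docs = [[1, 2, 3, 1, 2], [1, 3, 2, 1, 2]]
probs = bigram_conditionals(toy_docs, k=4)
print(probs[(1, 2)], probs[(1, 3)])
```

Because `c(w1)` only counts positions where `w1` is followed by another token inside the prefix, the estimates for a fixed `w1` sum to one.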

Comparing the text and the set of tokens in terms of indices mitigates tokenization issues that arise in models like GPT2 or GPT-Neo, where **"data"** and **" data"** are represented differently. Note, however, that our estimates do not account for the probability mass assigned to the character-level token sequence **"d" "a" "t" "a"**. We plan to revisit this issue later if we find a large gap between the model and data distributions.


One pseudo-algorithm for computing the term frequencies is:

```
Pseudocode: Compute the token frequencies of the first 100 tokens across all documents in the provided dataset.
input: Tokenizer (T), docs (D)
output: frequencies<int, int>

frequencies<int, int> = dict()

for d in D:
  tokenized_doc = T.tokenize(d)
  tokenized_doc = tokenized_doc.slice(100)  // get first 100 

  frequencies.update_counts(tokenized_doc)
end

return frequencies
```
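
The pseudocode above translates almost directly to Python. The following is a minimal sketch, assuming the tokenizer is any callable that maps a string to a list of token IDs; `prefix_token_frequencies` and the whitespace-based `toy_tokenize` are hypothetical stand-ins, not the tokenizer used in the rest of this notebook.

```python
from collections import Counter


def prefix_token_frequencies(docs, tokenize, k=100):
    """Count token IDs among the first k tokens of every document."""
    frequencies = Counter()
    for doc in docs:
        # Keep only the first k token IDs, then add them to the running counts.
        frequencies.update(tokenize(doc)[:k])
    return frequencies


# Hypothetical stand-in tokenizer: whitespace split, IDs assigned on first sight.
vocab = {}


def toy_tokenize(text):
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


freqs = prefix_token_frequencies(["a b a", "b c"], toy_tokenize, k=2)
print(freqs)
```

Since `Counter` objects can be added together, per-shard counts could be merged afterwards if the loop ever needs to be parallelized.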

~~Implementation-wise, if we obtain a list of all possible document names, we can parallelize the term counts.~~ However, chances are we cannot use the "_id" field, since it is a private field and we do not have permission to change the index mapping. Let us first test the method on a couple of documents; if it proves fast enough, we will not parallelize it.

In [1]:
K_TOKENS = 10
INDEX = "re_pile"

# Default keyword arguments for Elasticsearch queries
default_kwargs = {
    "index": INDEX,
    "track_total_hits": True,
}

# Load the Elasticsearch client
from notebook_utils import load_elastic_search
es = load_elastic_search("./configs/elastic-search.yml")

In [3]:
# Load tokenizer
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
tokenizer.pad_token = tokenizer.eos_token  # GPT2 has no pad token; EOS is commonly used instead

from functools import partial
tokenize = partial(
    tokenizer.batch_encode_plus,
    max_length=K_TOKENS,
    truncation=True,
    padding="max_length"
)

In [6]:
from notebook_utils import scroll

query = {"match_all": {}}
data = iter(scroll(es, query, size=10, **default_kwargs))
docs = next(data)

Total documents found {'value': 211036967, 'relation': 'eq'}
Processing 10 documents


In [7]:
from notebook_utils import filter_tokens
from frequencies import PositionalFrequencies

In [18]:
counts = PositionalFrequencies()
j = 0
# TODO: ? Create threads to process these results
while docs and j<10:
    results = tokenize(list(map(get_text, docs)))

    input_ids = results["input_ids"]
    att_mask = results["attention_mask"]

    for inpt, attn in zip(input_ids, att_mask):
        tokens = filter_tokens(inpt, attn)
        for i, token in enumerate(tokens):
            counts.add(token, i)
    
    # Default of None ends the loop cleanly once the iterator is exhausted
    docs = next(data, None)
    print(j)
    j += 1

Processing 10 documents
0
Processing 10 documents
1
Processing 10 documents
2
Processing 10 documents
3
Processing 10 documents
4
Processing 10 documents
5
Processing 10 documents
6
Processing 10 documents
7
Processing 10 documents
8
Processing 10 documents
9
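
The loop above relies on two helpers, `filter_tokens` and `PositionalFrequencies`, whose implementations are not shown here. The sketch below is a guess at the behaviour they would need, not the actual code: `drop_padding` and `PositionCounter` are hypothetical names, and the real interfaces (including the argument order of `most_common`) may differ. Padding is removed through the attention mask rather than by token ID because the pad token was set to EOS, so filtering on the ID would also drop genuine EOS tokens.

```python
from collections import Counter, defaultdict


def drop_padding(input_ids, attention_mask):
    """Keep only the token IDs whose attention-mask entry is 1."""
    return [tok for tok, keep in zip(input_ids, attention_mask) if keep]


class PositionCounter:
    """Token counts kept separately for each position (hypothetical stand-in)."""

    def __init__(self):
        self._counts = defaultdict(Counter)

    def add(self, token, position):
        self._counts[position][token] += 1

    def most_common(self, n, position):
        # Assumed argument order: number of entries first, then the position.
        return self._counts[position].most_common(n)


# Toy batch: the last ID of the first row is padding (mask entry 0).
batch_ids = [[198, 286, 50256], [198, 11, 13]]
batch_mask = [[1, 1, 0], [1, 1, 1]]

pc = PositionCounter()
for ids, mask in zip(batch_ids, batch_mask):
    for i, tok in enumerate(drop_padding(ids, mask)):
        pc.add(tok, i)

print(pc.most_common(1, 0))
```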


In [27]:
counts.most_common(5, 8)

[(198, 5), (11, 3), (287, 3), (286, 2), (1944, 2)]

In [21]:
tokenizer.decode([198])

'\n'

In [22]:
tokenizer.decode([286])

' of'

In [31]:
import joblib


In [32]:
joblib.load("unigram_counts.pkl")

AttributeError: Can't get attribute 'default_init' on <module 'frequencies' from '/home/kat/Projects/PhD/constrained-decoding/frequencies.py'>
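
`joblib` builds on `pickle`, and pickle stores functions by module and name rather than by value. A likely cause of the error above (an assumption, since the code that wrote `unigram_counts.pkl` is not shown) is that the pickled object referenced a module-level function `default_init` in `frequencies.py`, for instance as a `defaultdict` factory, and that function no longer exists under that name. The sketch below reproduces the failure mode with the standard `pickle` module and a throwaway module called `fake_frequencies` (a hypothetical name used only for illustration).

```python
import pickle
import sys
import types
from collections import defaultdict

# Throwaway module standing in for `frequencies` (hypothetical).
mod = types.ModuleType("fake_frequencies")
exec("def default_init():\n    return 0\n", mod.__dict__)
sys.modules["fake_frequencies"] = mod

# The factory function is pickled by reference: only its module and name are stored.
payload = pickle.dumps(defaultdict(mod.default_init, {198: 5}))

# Simulate editing the module after the pickle was written.
del mod.default_init

try:
    pickle.loads(payload)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

Under that assumption, restoring a function with the original name in the module, or regenerating the pickle with the current code, would make the file loadable again.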