# Prompt normalisation

In this notebook we develop the functions that process a chunk of the semcore dataset into prompts for the model, with variable batch size and maximum sequence length.

The goal here is to have the ambiguous word surrounded by a contextual window. For now I'll keep the word centred if the window has to be clipped.
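
As a rough illustration of what "centred, then clipped" could mean at the token level, here is a minimal sketch. The name `centred_window` and its parameters are assumptions made for this example and are not defined anywhere else in the notebook. It also assumes the target span itself fits inside `max_len`.

```python
from typing import List, Tuple

def centred_window(tokens: List[int], start: int, length: int,
                   max_len: int) -> Tuple[List[int], int]:
    """Clip `tokens` to at most `max_len` items while keeping the target span
    tokens[start:start + length] as close to the centre as possible.

    Returns the clipped tokens and the target's start index inside the clip.
    Hypothetical helper; assumes length <= max_len.
    """
    if len(tokens) <= max_len:
        return tokens, start  # nothing to clip

    # Ideal left edge: split the leftover room evenly around the target span
    left = start - (max_len - length) // 2
    # Slide the window back inside the sequence if it overshoots either end
    left = max(0, min(left, len(tokens) - max_len))
    return tokens[left:left + max_len], start - left

print(centred_window(list(range(10)), start=5, length=1, max_len=5))
# -> ([3, 4, 5, 6, 7], 2)
```

When the target sits near either end of the sentence the window can no longer be centred, so the sketch slides it to the nearest valid position instead of padding.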

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('Data/Processed/SemCoreChunks/chunk_0.csv')

In [3]:
df.head()

Unnamed: 0,word,sentence,word_loc,wordnet,definition,definitions
0,long,How|long|has|it|been|since|you|reviewed|the|ob...,1,long%3:00:02::,primarily temporal sense; being or indicating ...,desire strongly or persistently|primarily temp...
1,been,How|long|has|it|been|since|you|reviewed|the|ob...,4,be%2:42:03::,"have the quality of being; (copula, used with ...",a light strong brittle grey toxic bivalent met...
2,reviewed,How|long|has|it|been|since|you|reviewed|the|ob...,7,review%2:31:00::,look at again; examine again,a new appraisal or evaluation|an essay or arti...
3,objectives,How|long|has|it|been|since|you|reviewed|the|ob...,9,objective%1:09:00::,the goal intended to be attained (and which is...,the goal intended to be attained (and which is...
4,benefit,How|long|has|it|been|since|you|reviewed|the|ob...,12,benefit%1:21:00::,financial assistance in time of need,financial assistance in time of need|something...


In [4]:
row = df.iloc[0]

In [5]:
type series = pd.core.series.Series

In [None]:
from typing import List
import re

def fix_whitespace(text):
    # Remove spaces before punctuation
    text = re.sub(r'\s+([?.!,;:])', r'\1', text)
    # Ensure a space after punctuation that is directly followed by a non-space character.
    # Commas are left out so numbers like 1,000 stay intact; periods are not, so 3.5 would become "3. 5".
    text = re.sub(r'([?.!;:])(?=[^\s])', r'\1 ', text)
    return text

def all_defs(row: series) -> List[str]:
    return str.split(row.definitions, '|')

def sentence(row: series) -> str:
    return ' ' + fix_whitespace(' '.join(str.split(row.sentence, '|'))) # see below for why we add preceding whitespace

In [7]:
all_defs(row)

['desire strongly or persistently',
 'primarily temporal sense; being or indicating a relatively great or greater than average duration or passage of time or a duration as specified',
 'primarily spatial sense; of relatively great or greater than average spatial extension or extension as specified',
 'of relatively great height; - Sherwood Anderson',
 'good at remembering',
 'holding securities or commodities in expectation of a rise in prices',
 '(of speech sounds or syllables) of relatively long duration',
 'involving substantial risk',
 'planning prudently for the future',
 'having or being more than normal or necessary:',
 'for an extended time or at a distant time',
 'for an extended distance']

In [8]:
sentence(row)

' How long has it been since you reviewed the objectives of your benefit and service program?'

In [32]:
import sys
import os

# we want to import some llama source later
os.getcwd()
project_path = os.path.abspath("LLM")

if project_path not in sys.path:
    sys.path.append(project_path)

from llama.tokenizer import Tokenizer

tok_path = "/home/matt/.llama/checkpoints/Llama3.2-1B-hf-tok/tokenizer.model"
tok = Tokenizer(tok_path)
tok

<llama.tokenizer.Tokenizer at 0x7f5f3be13dd0>

In [10]:
print(f"no pad:      {tok.encode("Paris", bos=False, eos = False)}")
print(f"pre-padded:  {tok.encode(" Paris", bos=False, eos = False)}")
print(f"post-padded: {tok.encode("Paris ", bos=False, eos = False)}")
print(f"with colon:  {tok.encode(" Paris:", bos=False, eos = False)}")



no pad:      [60704]
pre-padded:  [12366]
post-padded: [60704, 220]
with colon:  [12366, 25]


## Whitespace

We see that token IDs depend on the surrounding whitespace, so the first word in a sentence may get a different ID from the same word inside the sentence. Usually sentences **are** preceded by a space, so we might as well prepend one.

However, it looks like we can append a colon safely: ` Paris:` encodes to the same ` Paris` token followed by a separate colon token.

In [11]:
sntc = sentence(row)
defs = all_defs(row)
word = row.word

print(f"word: {word}, sentence: {sntc}, definitions: {defs}")

word: long, sentence:  How long has it been since you reviewed the objectives of your benefit and service program?, definitions: ['desire strongly or persistently', 'primarily temporal sense; being or indicating a relatively great or greater than average duration or passage of time or a duration as specified', 'primarily spatial sense; of relatively great or greater than average spatial extension or extension as specified', 'of relatively great height; - Sherwood Anderson', 'good at remembering', 'holding securities or commodities in expectation of a rise in prices', '(of speech sounds or syllables) of relatively long duration', 'involving substantial risk', 'planning prudently for the future', 'having or being more than normal or necessary:', 'for an extended time or at a distant time', 'for an extended distance']


In [12]:
sntc_toks = tok.encode(sntc, bos=False, eos = False)

print("sentence tokens: " + " ".join([str(i) for i in sntc_toks]))

sentence tokens: 2650 1317 706 433 1027 2533 499 22690 279 26470 315 701 8935 323 2532 2068 30


In [13]:
word_tok = tok.encode(word, bos=False, eos = False)
word_tok_pad = tok.encode( " " + word, bos=False, eos = False)

In [14]:
word_tok[0] in sntc_toks

False

In [15]:
word_tok_pad[0] in sntc_toks

True

## Identifying sub-token indices

We will need to know which indices in the list of sentence tokens correspond to the word we want to disambiguate. There are some edge cases to bear in mind here, which we investigate below.

In [16]:
def word_tok_pad(row: series) -> List[int]:
    word = row.word
    word_tok_pad = tok.encode( " " + word, bos=False, eos = False)
    return word_tok_pad

def sentence_tok(row: series) -> List[int]:
    sntc = sentence(row)
    sntc_toks = tok.encode(sntc, bos=False, eos=False)
    return sntc_toks

In [17]:
df["word_tok_pad"] = df.apply(word_tok_pad, axis = 1)
df["sentence_tok"] = df.apply(sentence_tok, axis = 1)

In [18]:
df["word_tok_len"] = df.apply(lambda s : len(s.word_tok_pad), axis = 1)
df[df['word_tok_len'] != 1].head(3)

Unnamed: 0,word,sentence,word_loc,wordnet,definition,definitions,word_tok_pad,sentence_tok,word_tok_len
27,absenteeism,Do|you|measure|its|relation|to|reduced|absente...,7,absenteeism%1:04:00::,habitual absence from work,habitual absence from work,"[94190, 2191]","[3234, 499, 6767, 1202, 12976, 311, 11293, 941...",2
43,fancier,Is|it|larger|or|fancier|than|you|really|need|?,4,fancy%3:00:00::,not plain; decorative or ornamented,something many people believe that is false|a ...,"[81697, 1291]","[2209, 433, 8294, 477, 81697, 1291, 1109, 499,...",2
67,purchasing agent,Is|your|purchasing agent|offering|too much|fre...,2,purchasing_agent%1:18:00::,an agent who purchases goods or services for a...,an agent who purchases goods or services for a...,"[23395, 8479]","[2209, 701, 23395, 8479, 10209, 2288, 1790, 19...",2


We see that not all the words in our sample are a single token, so we need to be more general when checking that we can locate the word's tokens in the sentence tokens: we want to check that they are present as a contiguous sub-list.

In [19]:
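def contains_contiguous_sublist(sublst, lst):
    # Compare list slices rather than string representations, so that e.g. [17] is not "found" inside [1317]
    n = len(sublst)
    return any(lst[i:i + n] == sublst for i in range(len(lst) - n + 1))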

In [20]:
df["word_locate_necc_condition"] = df.apply(lambda s : contains_contiguous_sublist(s.word_tok_pad, s.sentence_tok), axis = 1)
df[df.word_locate_necc_condition == False]

Unnamed: 0,word,sentence,word_loc,wordnet,definition,definitions,word_tok_pad,sentence_tok,word_tok_len,word_locate_necc_condition


OK, all our rows satisfy the necessary condition. However, to truly locate the word we also need to identify which occurrence it is when the same tokens appear more than once in the sentence; there is no reason this can't happen.

In [21]:
def count_contiguous_sublists(sublst, lst):
    count = 0
    i = 0
    sub_len = len(sublst)

    while i <= len(lst) - sub_len:
        if lst[i:i + sub_len] == sublst:
            count += 1
            i += sub_len
        else:
            i += 1

    return count

In [22]:
df["word_locate_possibilities"] = df.apply(lambda s : count_contiguous_sublists(s.word_tok_pad, s.sentence_tok), axis = 1)
df[df["word_locate_possibilities"] != 1]

Unnamed: 0,word,sentence,word_loc,wordnet,definition,definitions,word_tok_pad,sentence_tok,word_tok_len,word_locate_necc_condition,word_locate_possibilities


Unfortunately this data set doesn't contain an example, so we will have to come up with some test cases.

In [23]:
bad_word_simple = " that"
bad_sentence_simple = " that thing you found, this is not that, that was another thing?"

bad_word_simple_toks     = tok.encode(bad_word_simple, bos = False, eos = False)
bad_sentence_simple_toks = tok.encode(bad_sentence_simple, bos = False, eos = False)

In [24]:
bad_word_simple_toks

[430]

In [25]:
bad_sentence_simple_toks

[430, 3245, 499, 1766, 11, 420, 374, 539, 430, 11, 430, 574, 2500, 3245, 30]

In [26]:
count_contiguous_sublists(bad_word_simple_toks, bad_sentence_simple_toks)

3
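
The count tells us there are three candidates but not where they are. A minimal sketch that lists the start index of every occurrence follows; `sublist_positions` is a name introduced here for illustration, and the token IDs are copied from the cells above. Unlike `count_contiguous_sublists`, it also reports overlapping matches.

```python
from typing import List

def sublist_positions(sublst: List[int], lst: List[int]) -> List[int]:
    """Start indices of every (possibly overlapping) occurrence of sublst in lst."""
    n = len(sublst)
    if n == 0:
        return []  # an empty pattern has no meaningful position
    return [i for i in range(len(lst) - n + 1) if lst[i:i + n] == sublst]

# Token IDs copied from the outputs above
word_toks = [430]
sentence_toks = [430, 3245, 499, 1766, 11, 420, 374, 539, 430, 11, 430, 574, 2500, 3245, 30]

print(sublist_positions(word_toks, sentence_toks))
# -> [0, 8, 10]
```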

https://stackoverflow.com/questions/63413414/is-there-a-way-to-get-the-location-of-the-substring-from-which-a-certain-token-h

In [None]:
i = 0

for t in bad_sentence_simple_toks:
    tok_word = tok.decode([t]) 
    
    pos = bad_sentence_simple[i:].find(tok_word)

    pos += i  

    print(f"({pos}, {pos + len(tok_word)}) {tok_word}")

    i = pos + len(tok_word)



(0, 5)  that
(5, 11)  thing
(11, 15)  you
(15, 21)  found
(21, 22) ,
(22, 27)  this
(27, 30)  is
(30, 34)  not
(34, 39)  that
(39, 40) ,
(40, 45)  that
(45, 49)  was
(49, 57)  another
(57, 63)  thing
(63, 64) ?
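
The same spans can be computed without searching the string, by accumulating the lengths of the decoded pieces. This is a sketch under the assumption that the decoded tokens concatenate back to the original text exactly, which holds for the ASCII example above but may not hold when a multi-byte character is split across tokens. `token_char_spans` and `token_index_at` are hypothetical helper names.

```python
from itertools import accumulate
from typing import List, Tuple

def token_char_spans(pieces: List[str]) -> List[Tuple[int, int]]:
    """(start, end) character span of each decoded token, end exclusive."""
    ends = list(accumulate(len(p) for p in pieces))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))

def token_index_at(spans: List[Tuple[int, int]], char_pos: int) -> int:
    """Index of the token whose span contains the character at char_pos."""
    for idx, (start, end) in enumerate(spans):
        if start <= char_pos < end:
            return idx
    raise ValueError(f"character position {char_pos} is outside the text")

# Decoded pieces as printed above (each word token carries its leading space)
pieces = [" that", " thing", " you", " found", ",", " this", " is", " not",
          " that", ",", " that", " was", " another", " thing", "?"]

spans = token_char_spans(pieces)
print(spans[0], spans[-1])        # -> (0, 5) (63, 64)
print(token_index_at(spans, 41))  # -> 10, the third " that"
```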


In [59]:
def word_start_index(s: series) -> int:
    words = s.sentence.split('|')
    ret = 1
    for i in range(s.word_loc):
        ret += len(words[i]) + 1 # the space
    return ret


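def word_start_tok_index(s: series) -> int:
    # Walk the sentence tokens, accumulating the length of each decoded token,
    # and return the index of the first token that reaches the word's character offset.
    toks = sentence_tok(s)

    i = 0  # number of characters covered so far

    for ret, t in enumerate(toks):
        i += len(tok.decode([t]))

        if i >= s.word_start_index:
            return ret

    # Fail with a clear message instead of indexing past the end of the token list
    raise ValueError("word_start_index lies beyond the end of the tokenised sentence")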

In [60]:
df["word_start_index"] = df.apply(word_start_index, axis = 1)
df["word_start_tok_index"] = df.apply(word_start_tok_index, axis = 1)

In [62]:
df.head(3)

Unnamed: 0,word,sentence,word_loc,wordnet,definition,definitions,word_tok_pad,sentence_tok,word_tok_len,word_locate_necc_condition,word_locate_possibilities,word_start_index,word_start_tok_index
0,long,How|long|has|it|been|since|you|reviewed|the|ob...,1,long%3:00:02::,primarily temporal sense; being or indicating ...,desire strongly or persistently|primarily temp...,[1317],"[2650, 1317, 706, 433, 1027, 2533, 499, 22690,...",1,True,1,5,1
1,been,How|long|has|it|been|since|you|reviewed|the|ob...,4,be%2:42:03::,"have the quality of being; (copula, used with ...",a light strong brittle grey toxic bivalent met...,[1027],"[2650, 1317, 706, 433, 1027, 2533, 499, 22690,...",1,True,1,17,4
2,reviewed,How|long|has|it|been|since|you|reviewed|the|ob...,7,review%2:31:00::,look at again; examine again,a new appraisal or evaluation|an essay or arti...,[22690],"[2650, 1317, 706, 433, 1027, 2533, 499, 22690,...",1,True,1,32,7


In [64]:
df[df.word_loc != df.word_start_tok_index].sample(3)

Unnamed: 0,word,sentence,word_loc,wordnet,definition,definitions,word_tok_pad,sentence_tok,word_tok_len,word_locate_necc_condition,word_locate_possibilities,word_start_index,word_start_tok_index
68,offering,Is|your|purchasing agent|offering|too much|fre...,3,offer%2:40:02::,"make available or accessible, provide or furnish",the verbal act of offering|something offered (...,[10209],"[2209, 701, 23395, 8479, 10209, 2288, 1790, 19...",1,True,1,26,4
80,eating,When|improvements|are|recommended|in|working|c...,14,eating%1:04:00::,the act of consuming food,the act of consuming food|take in solid food|e...,[12459],"[3277, 18637, 527, 11349, 304, 3318, 4787, 482...",1,True,1,91,15
85,productivity,When|improvements|are|recommended|in|working|c...,30,productivity%1:07:00::,the quality of being productive or having the ...,the quality of being productive or having the ...,[26206],"[3277, 18637, 527, 11349, 304, 3318, 4787, 482...",1,True,1,184,33


We see that when the sentence contains words that require multiple tokens, `word_loc` no longer matches the token index, so this index-based method is necessary. It is also robust to repeated instances of the same word!
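
One caveat worth checking, offered as a sketch and not as a replacement: `word_start_index` counts one space per preceding item, whereas `sentence` strips the space in front of punctuation, so the character offset can overshoot by one for each punctuation item that precedes the word. The variant below measures the normalised prefix instead. `word_start_index_normalised` is a hypothetical name, `fix_whitespace` is repeated only so the block runs on its own, and the sketch assumes the target word is not itself a punctuation mark.

```python
import re
from typing import List

def fix_whitespace(text: str) -> str:
    # Same two substitutions as the notebook's helper, repeated for self-containment
    text = re.sub(r'\s+([?.!,;:])', r'\1', text)
    text = re.sub(r'([?.!;:])(?=[^\s])', r'\1 ', text)
    return text

def word_start_index_normalised(words: List[str], word_loc: int) -> int:
    """Character offset of words[word_loc] in ' ' + fix_whitespace(' '.join(words))."""
    if word_loc == 0:
        return 1  # only the leading space precedes the first word
    prefix = ' ' + fix_whitespace(' '.join(words[:word_loc]))
    return len(prefix) + 1  # +1 for the space between the prefix and the word

words = "When|negotiating|with|your|union|,|do|you|make".split('|')
full = ' ' + fix_whitespace(' '.join(words))

idx = word_start_index_normalised(words, 6)
print(idx, repr(full[idx:idx + 2]))  # -> 35 'do'
```

Counting one space per item gives 36 for the same word, which still lands inside the right token here but points one character late.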

## Forming definition prompts

We need to find a way to include the word itself in the prompt containing the definition. For now we'll do the most basic thing and place the word, followed by a colon, in front of each definition.

In [76]:
def prepare_definition_prompts(s: series) -> List[str]:

    encode = lambda  text : tok.encode(text, bos = False, eos = False)

    if len(encode(" " + s.word)) == len(encode(" " + s.word + ":")):
        raise Exception("colon was absorbed - changing token embedding!")


    defs = s.definitions.split('|')

    return [" " + s.word + ": " + d for d in defs]
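
The length comparison above only notices a colon that merges into the word without changing the token count. A slightly stricter check, sketched here as an assumption about what we want and not as the notebook's method, is that the word's tokens survive unchanged as a prefix of the colon version. `colon_keeps_word_tokens` and the toy encoder are hypothetical; the ` Paris` IDs come from the earlier output and the ` etc` IDs are made up for illustration.

```python
from typing import Callable, List

def colon_keeps_word_tokens(word: str, encode: Callable[[str], List[int]]) -> bool:
    """True if encoding ' word:' starts with exactly the tokens of ' word'
    and adds at least one more token for the colon."""
    plain = encode(" " + word)
    with_colon = encode(" " + word + ":")
    return len(with_colon) > len(plain) and with_colon[:len(plain)] == plain

# Toy stand-in for tok.encode(text, bos=False, eos=False)
toy_vocab = {
    " Paris": [12366],
    " Paris:": [12366, 25],  # from the output earlier in the notebook
    " etc": [5099],          # made-up IDs: a colon absorbed into one token
    " etc:": [99999],
}
toy_encode = toy_vocab.__getitem__

print(colon_keeps_word_tokens("Paris", toy_encode))  # -> True
print(colon_keeps_word_tokens("etc", toy_encode))    # -> False
```

With the real tokenizer the second argument would be something like `lambda text: tok.encode(text, bos=False, eos=False)`.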


In [79]:
df["definition_prompts"] = df.apply(prepare_definition_prompts, axis = 1)

In [80]:
df.sample(4)


Unnamed: 0,word,sentence,word_loc,wordnet,definition,definitions,word_tok_pad,sentence_tok,word_tok_len,word_locate_necc_condition,word_locate_possibilities,word_start_index,word_start_tok_index,definition_prompts
21,make,What|effort|do|you|make|to|assess|results|of|y...,4,make%2:41:00::,engage in,a recognizable kind|the act of mixing cards ha...,[1304],"[3639, 5149, 656, 499, 1304, 311, 8720, 3135, ...",1,True,1,20,4,"[ make: a recognizable kind, make: the act of..."
14,improved,Have|you|permitted|it|to|become|a|giveaway|pro...,17,improved%3:00:00::,made more desirable or valuable or profitable;...,to make better|get better|made more desirable ...,[13241],"[12522, 499, 15480, 433, 311, 3719, 264, 61064...",1,True,1,89,17,"[ improved: to make better, improved: get bet..."
23,results,What|effort|do|you|make|to|assess|results|of|y...,7,result%1:11:00::,something that results,a phenomenon that follows and is caused by som...,[3135],"[3639, 5149, 656, 499, 1304, 311, 8720, 3135, ...",1,True,1,35,7,[ results: a phenomenon that follows and is ca...
94,benefits,"When|negotiating|with|your|union|,|do|you|make...",16,benefit%1:21:00::,financial assistance in time of need,financial assistance in time of need|something...,[7720],"[3277, 44725, 449, 701, 11552, 11, 656, 499, 1...",1,True,1,89,16,[ benefits: financial assistance in time of ne...


In [81]:
def check_tokens(s: series) -> bool:
    word_tok_pad = s.word_tok_pad[0]
    word_tok_sentence = s.sentence_tok[s.word_start_tok_index]

    return word_tok_pad == word_tok_sentence

In [82]:
df["tokens_match"] = df.apply(check_tokens, axis = 1)

In [83]:
df[df["tokens_match"] == False]

Unnamed: 0,word,sentence,word_loc,wordnet,definition,definitions,word_tok_pad,sentence_tok,word_tok_len,word_locate_necc_condition,word_locate_possibilities,word_start_index,word_start_tok_index,definition_prompts,tokens_match
