<a href="https://colab.research.google.com/github/Seemab97/Practical-Neural-Networks-and-Deep-Learning-in-Python/blob/main/Tokenization_T5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Doing this to get the data
# Mount Google Drive to this Notebook instance.
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [1]:
!pip install torch transformers

Collecting transformers
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [2]:
!pip install sentencepiece

Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.99


In [3]:
import json
import torch

# Reading Data from file

In [None]:
# 1/4th of data
new_file_path = '/content/drive/MyDrive/Internship/Meetup - Text Data/quarter_context_train.json'

In [None]:
# Read data from the JSON file
with open(new_file_path, 'r') as json_file:
    data = json.load(json_file)

In [None]:
len(data)

2075

In [None]:
data[2074]

[['[00:10] B: e',
  '[00:33] B: could be a waiting room',
  '[00:38] B: 2 chairs',
  '[00:39] B: brown',
  '[00:50] A: hey',
  '[00:50] B: round table in the center with a plant on it',
  "[00:55] A: '/n",
  '[01:02] B: 3 lights on the wall',
  '[01:08] B: the walls are beige',
  '[01:10] A: I have flowers on table',
  '[01:23] A: blue chairs',
  '[01:26] B: you wish to find me or me find you'],
 '[01:31] A: I find you',
 'Use',
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]]

**Understanding the Data:**

The data looks like this:

```
[
[[], "[00:15] B: i'm in the playroom", "Initiate", []],
[["[00:15] B: i'm in the playroom"], "[00:17] B: you have to go west", "Initiate", [0]],
[["[00:15] B: i'm in the playroom", "[00:17] B: you have to go west"], "[00:19] A: What does it look like", "Use", [1, 0]],
[["[00:15] B: i'm in the playroom", "[00:17] B: you have to go west"], "[00:19] A: What does it look like", "Initiate", [0, 0]],
[["[00:15] B: i'm in the playroom", "[00:17] B: you have to go west", "[00:19] A: What does it look like"], "[00:38] A: describe it please", "Move", [0, 1, 0]]
]
```

As you can see, there is an outer list A that contains one list B per data point. Each B in turn contains a list C and a list D, along with two strings. Let's look at a single B.

```
[[], "[00:15] B: i'm in the playroom", "Initiate", []]
```

The elements of this list B are:
- `[]`: Context -> utterances spoken so far (list C)
- `"[00:15] B: i'm in the playroom"`: Input -> current utterance
- `"Initiate"`: Grounding act (at the current stage)
- `[]`: Reference to which element in the context got grounded

The context accumulates all previous utterances, which serve as context for the current utterance in Input.

Let's look at the third list B in A (index 2):
```
[["[00:15] B: i'm in the playroom", "[00:17] B: you have to go west"], "[00:19] A: What does it look like", "Use", [1, 0]]
```
Here's what each element means:
- `["[00:15] B: i'm in the playroom", "[00:17] B: you have to go west"]`: Context -> utterances spoken so far
- `"[00:19] A: What does it look like"`: Input -> current utterance
- `"Use"`: Grounding act
- `[1, 0]`: the 1st element in the context (`"[00:15] B: i'm in the playroom"`) got grounded by this `Use`; the 2nd did not
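
To make the four positions concrete, here is a minimal sketch that unpacks one data point into named fields. The `DataPoint` class and its field names are illustrative assumptions; they are not part of the dataset or of this notebook's code.

```python
from typing import List, NamedTuple


class DataPoint(NamedTuple):
    """Hypothetical wrapper for one list B; the field names are assumptions."""
    context: List[str]    # list C: utterances spoken so far
    utterance: str        # the current input utterance
    act: str              # grounding act of the current utterance
    grounded: List[int]   # 1 where a context utterance got grounded, else 0


raw = [["[00:15] B: i'm in the playroom", "[00:17] B: you have to go west"],
       "[00:19] A: What does it look like", "Use", [1, 0]]

point = DataPoint(*raw)

# Collect the context utterances whose grounding flag is 1
grounded_utterances = [u for u, flag in zip(point.context, point.grounded) if flag == 1]
print(point.act, grounded_utterances)
```

Named fields are only a reading aid here; the rest of the notebook keeps indexing the raw lists with `item[0]`, `item[1]`, `item[2]` and `item[-1]`.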

# Tokenize then Merge

In [4]:
# or as per HuggingFace tutorial
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("t5-small")

Downloading (…)okenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

In [5]:
data = [[[], "[00:15] B: i'm in the playroom", "Initiate", []],
        [["[00:15] B: i'm in the playroom"], "[00:17] B: you have to go west", "Initiate", [0]],
        [["[00:15] B: i'm in the playroom", "[00:17] B: you have to go west"], "[00:19] A: What does it look like", "Use", [1, 0]],
        [["[00:15] B: i'm in the playroom", "[00:17] B: you have to go west"], "[00:19] A: What does it look like", "Initiate", [0, 0]],
        [["[00:15] B: i'm in the playroom", "[00:17] B: you have to go west", "[00:19] A: What does it look like"], "[00:38] A: describe it please", "Move", [0, 1, 0]]]

**Input for the Model:**

We need to prepare input for the model in this format:

      [All_Context [SEP] Input]

But, as we know, computers work with numbers rather than text, so we have to tokenize the data to get it ready.

**Tokenizing context:**

Now, for tokenization:

For each data list B, we tokenize each element of the context individually and create its ```token_type_ids```. They are set to 1 for the contexts that are grounded and to 0 for the rest. Finally, we merge all the individually tokenized elements into one list that represents B as a whole.

We use the T5 tokenizer to tokenize the context. The tokenizers of BERT and some other models generate ```token_type_ids``` automatically when tokenizing, but the T5 tokenizer doesn't, so we have to create them manually.

Here's how we create ```token_type_ids```:
- First we build ```tokens_list```, in which each element holds the tokens of one context in list C of list B. For example, for the context ```["[00:15] B: i'm in the playroom", "[00:17] B: you have to go west"]```:
  - ```"[00:15] B: i'm in the playroom"``` is tokenized separately, and so is ```"[00:17] B: you have to go west"```.
  - ```"[00:15] B: i'm in the playroom"```:
          tensor([784, 1206, 10, 1808, 908, 272, 10, 3, 23, 31, 51, 16, 8, 577, 3082, 1])
  - ```"[00:17] B: you have to go west"```:
          tensor([784, 1206, 10, 2517, 908, 272, 10, 25, 43, 12, 281, 4653, 1])
  - ```tokens_list``` then contains both contexts:
          tokens_list:
          [{'input_ids': tensor([[ 784, 1206,   10, 1808,  908,  272,   10,    3,   23,   31,   51,   16,
            8,  577, 3082,    1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])},
            
            {'input_ids': tensor([[ 784, 1206,   10, 2517,  908,  272,   10,   25,   43,   12,  281, 4653,
            1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}]

**Token_type_ids:**
This tokenization is currently missing ```token_type_ids```. We manually create them.

- We first generate a list ```token_type_id``` of all zeros. It holds one tensor per entry in ```tokens_list```, and each tensor has the same shape as the corresponding ```input_ids```. In other words, for each tokenized context in ```tokens_list``` we want a tensor of all 1s or all 0s that says whether this context was grounded. For the same list B and its contexts as before:
          token_type_id:  [tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])]

- Now we only need to set to 1 the tensor that corresponds to the grounded context. Which context that is comes from the last element of list B, i.e., ```[1, 0]```. We store this last element in ```grounded_context_list```.
         grounded_context_list: [1, 0]

- This list has the same length as the number of contexts in list C of list B, and therefore the same length as ```tokens_list``` and ```token_type_id```. Since all three are aligned by index, wherever ```grounded_context_list``` is 1 we set every element of the tensor at the same index in ```token_type_id``` to 1.
          token_type_id:  [tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])]
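
The steps above can be sketched with plain Python lists standing in for tensors, so the logic is visible without downloading the tokenizer. The helper name `build_token_type_ids` is an assumption made for this sketch; the lengths 16 and 13 are the token counts of the two example contexts shown earlier.

```python
from typing import List


def build_token_type_ids(context_lengths: List[int], grounded_flags: List[int]) -> List[List[int]]:
    """Return one list per context: all 1s if that context is grounded, else all 0s."""
    if len(context_lengths) != len(grounded_flags):
        raise ValueError("need exactly one grounding flag per context")
    return [[1 if flag == 1 else 0] * length
            for length, flag in zip(context_lengths, grounded_flags)]


# Token counts of the two example contexts and their grounding flags
token_type_id = build_token_type_ids([16, 13], [1, 0])
print(token_type_id)
```

The length check is an addition of this sketch: it surfaces data points whose grounding list does not line up with their context.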

**Final Result of Tokenized Context:**

Now that we have the tokenized contexts
          
          tokens_list:
          [{'input_ids': tensor([[ 784, 1206,   10, 1808,  908,  272,   10,    3,   23,   31,   51,   16,
            8,  577, 3082,    1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])},
            
            {'input_ids': tensor([[ 784, 1206,   10, 2517,  908,  272,   10,   25,   43,   12,  281, 4653,
            1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}]
and their corresponding ```token_type_id``` tensors:

          token_type_id:  [tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])]

We need to add each element of ```token_type_id``` to its corresponding tokenized context in ```tokens_list```.

As we can see, ```tokens_list``` is a list of dictionaries. Each dictionary holds the information for one context as key-value pairs for ```input_ids``` and ```attention_mask```. To each of them we add one more key, ```token_type_ids```, whose value comes from ```token_type_id```.

<br>

Each context in list C of list B now has everything a tokenized context needs, i.e., ```input_ids```, ```attention_mask``` and ```token_type_ids```. We append this final product to ```tokenized_contexts```.

          [
            {'input_ids': tensor([[ 784, 1206,   10, 1808,  908,  272,   10,    3,   23,   31,   51,   16,
            8,  577, 3082,    1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'token_type_ids': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])},
            
            {'input_ids': tensor([[ 784, 1206,   10, 2517,  908,  272,   10,   25,   43,   12,  281, 4653,
            1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}
            ]  
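
A minimal sketch of this attach step, with plain Python lists standing in for tensors. The id lists are shortened for readability and are not real tokenizer output.

```python
# One dict per tokenized context (shortened ids, lists instead of tensors)
tokens_list = [
    {"input_ids": [784, 1206, 1], "attention_mask": [1, 1, 1]},
    {"input_ids": [784, 1], "attention_mask": [1, 1]},
]
# One list of type ids per context: the first context is grounded, the second is not
token_type_id = [[1, 1, 1], [0, 0]]

# Pair each context with its type ids and store them under a new key
for tokens, type_ids in zip(tokens_list, token_type_id):
    tokens["token_type_ids"] = type_ids

# Collect the finished contexts of this data point
tokenized_contexts = []
tokenized_contexts.append(tokens_list)
print(tokenized_contexts[0][0])
```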

In [6]:
# input id, attention mask, token type id for each context
# data = [[[], "[00:15] B: i'm in the playroom", "Initiate", []],
#         [["[00:15] B: i'm in the playroom"], "[00:17] B: you have to go west", "Initiate", [0]],
#         [["[00:15] B: i'm in the playroom", "[00:17] B: you have to go west"], "[00:19] A: What does it look like", "Use", [1, 0]]
# ]

def preprocess_function(examples):
    """Tokenize every context utterance separately and attach manual token_type_ids."""
    tokenized_contexts = []
    all_token_type_ids = []  # one list of tensors per example

    for item in examples:
        # Tokenize each utterance in the context (item[0]) on its own
        tokens_list = [tokenizer(context, padding=True, truncation=True, return_tensors="pt") for context in item[0]]
        #print('\n', '\n token list: ', tokens_list)

        # Create one all-zero tensor per context, shaped like that context's `input_ids`
        token_type_id = [torch.zeros_like(tokens['input_ids']) for tokens in tokens_list]
        #print('token_type_id: ', token_type_id)

        # The last element of list B flags which contexts were grounded
        grounded_context_list = item[-1]

        # Wherever grounded_context_list is 1, set the whole tensor at that index to 1
        for i, grounded_context in enumerate(grounded_context_list):
            if grounded_context == 1:
                token_type_id[i].fill_(1)

        #print('token_type_id: ', token_type_id)

        # Add each tensor to its tokenized context under the key 'token_type_ids'
        for tokens, type_ids in zip(tokens_list, token_type_id):
            tokens['token_type_ids'] = type_ids

        # Collect the finished contexts and their token type ids for this example
        tokenized_contexts.append(tokens_list)
        all_token_type_ids.append(token_type_id)
        #print('tokenized context: ', tokenized_contexts)

    return tokenized_contexts, all_token_type_ids

# Tokenize each element in item[0] separately i.e., each context
tokenized_data, token_type_ids = preprocess_function(data)

# Print the tokenized data and token_type_ids
# for i, tokens_list in enumerate(tokenized_data):
#     print("----------------------------------------------------------------------------------------------------------------")
#     print(f"Example {i + 1}:")
#     for j, tokens in enumerate(tokens_list):
#         print(f"  Tokens {j + 1}:", tokenizer.convert_ids_to_tokens(tokens['input_ids'][0]))
#         print(f"  Input IDs {j + 1}:", tokens['input_ids'][0])
#     #print(f"  Token Type IDs:", token_type_ids[i])

#     print()

for i in tokenized_data:
    print(i,'\n')

#print(tokenized_data)


[] 

[{'input_ids': tensor([[ 784, 1206,   10, 1808,  908,  272,   10,    3,   23,   31,   51,   16,
            8,  577, 3082,    1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}] 

[{'input_ids': tensor([[ 784, 1206,   10, 1808,  908,  272,   10,    3,   23,   31,   51,   16,
            8,  577, 3082,    1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'token_type_ids': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}, {'input_ids': tensor([[ 784, 1206,   10, 2517,  908,  272,   10,   25,   43,   12,  281, 4653,
            1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}] 

[{'input_ids': tensor([[ 784, 1206,   10, 1808,  908,  272,   10,    3,   23,   31,   51,   16,
            8,  577, 3082,    1]]), 'attention_mask': tensor([[1,

Now ```tokenized_data``` contains each context tokenized separately. For example in:

        [["[00:15] B: i'm in the playroom", "[00:17] B: you have to go west"], "[00:19] A: What does it look like", "Use", [1, 0]]

```"[00:15] B: i'm in the playroom"``` is tokenized separately

        {'input_ids': tensor([[ 784, 1206,   10, 1808,  908,  272,   10,    3,   23,   31,   51,   16,
            8,  577, 3082,    1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'token_type_ids': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

and ```"[00:17] B: you have to go west"``` separately as

           {'input_ids': tensor([[ 784, 1206,   10, 2517,  908,  272,   10,   25,   43,   12,  281, 4653,
            1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}

They are stored together in the same entry of ```tokenized_data```, as separate list elements.

          [{'input_ids': tensor([[ 784, 1206,   10, 1808,  908,  272,   10,    3,   23,   31,   51,   16,
            8,  577, 3082,    1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'token_type_ids': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])},
            
            {'input_ids': tensor([[ 784, 1206,   10, 2517,  908,  272,   10,   25,   43,   12,  281, 4653,
            1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}]


Now we want all the contexts together. For one data point, i.e., list C of list B, instead of
          
          [context, context]
we want

          [context]

So we merge all the context information for one data point, i.e., list B, into one entry. List C is dissolved: instead of one, two or more separate contexts, we get a single long context with one ```input_ids```, one ```attention_mask``` and one ```token_type_ids```.

        {'input_ids': tensor([[ 784, 1206,   10, 1808,  908,  272,   10,    3,   23,   31,   51,   16,
            8,  577, 3082,    1,  784, 1206,   10, 2517,  908,  272,   10,   25,
           43,   12,  281, 4653,    1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1]]), 'token_type_ids': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0]])}
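
As a rough illustration of the merge, the sketch below concatenates the per-context fields using plain Python lists in place of tensors. The function name `merge_contexts`, the `pad_id` parameter and the shortened id lists are assumptions made for the example.

```python
from typing import Dict, List

Encoded = Dict[str, List[int]]


def merge_contexts(parts: List[Encoded], pad_id: int = 0) -> Encoded:
    """Concatenate the fields of every tokenized context into one entry."""
    keys = ("input_ids", "attention_mask", "token_type_ids")
    if not parts:
        # Empty context: a single padding position that is masked out
        return {"input_ids": [pad_id], "attention_mask": [0], "token_type_ids": [0]}
    return {key: [value for part in parts for value in part[key]] for key in keys}


# Shortened id lists; only the order and the flags matter here
first = {"input_ids": [784, 1206, 1], "attention_mask": [1, 1, 1], "token_type_ids": [1, 1, 1]}
second = {"input_ids": [784, 1], "attention_mask": [1, 1], "token_type_ids": [0, 0]}

merged = merge_contexts([first, second])
print(merged)
```

Because every field is concatenated in the same order, position `k` of `token_type_ids` still describes the token at position `k` of `input_ids` after the merge.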


In [66]:
# merging context

# Initialize a list to store the merged tokens (input IDs and attention masks)
merged_tokens = []

# Iterate through each element in tokenized_data
for example in tokenized_data:

    # Initialize lists to store input IDs and attention masks for each element in the example
    input_ids_list = []
    attention_mask_list = []
    token_type_ids_list = []

    # Iterate through each part in the example
    for part in example:

        # Extract the input IDs and attention masks from the part
        input_ids = part['input_ids']
        attention_mask = part['attention_mask']
        token_type = part['token_type_ids']

        # Append the input IDs and attention masks to their respective lists
        input_ids_list.append(input_ids)
        attention_mask_list.append(attention_mask)
        token_type_ids_list.append(token_type)

    # Check if the input IDs list is empty. If empty, create a single tensor with the shape (1, 1) containing a special token ID.
    if not input_ids_list:
        input_ids_list.append(torch.tensor([[tokenizer.pad_token_id]]))
        attention_mask_list.append(torch.tensor([[0]]))
        token_type_ids_list.append(torch.tensor([[0]]))

    # Concatenate the input IDs and attention masks along the last dimension (dimension 1) to create merged tensors
    merged_input_ids = torch.cat(input_ids_list, dim=1)
    merged_attention_mask = torch.cat(attention_mask_list, dim=1)
    merged_token_type_ids = torch.cat(token_type_ids_list, dim=1)

    # Create a dictionary with the merged input IDs and attention masks
    merged_example = {'input_ids': merged_input_ids, 'attention_mask': merged_attention_mask, 'token_type_ids': merged_token_type_ids}

    # Append the merged_example dictionary to the merged_tokens list
    merged_tokens.append(merged_example)

# Print the merged tokens and attention masks together
for i, example in enumerate(merged_tokens):
    print(f"Example {i + 1} - Merged Tokens and Attention Mask: {example}")
    print()

#print(merged_tokens, type(merged_tokens))


Example 1 - Merged Tokens and Attention Mask: {'input_ids': tensor([[0]]), 'attention_mask': tensor([[0]]), 'token_type_ids': tensor([[0]])}

Example 2 - Merged Tokens and Attention Mask: {'input_ids': tensor([[ 784, 1206,   10, 1808,  908,  272,   10,    3,   23,   31,   51,   16,
            8,  577, 3082,    1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}

Example 3 - Merged Tokens and Attention Mask: {'input_ids': tensor([[ 784, 1206,   10, 1808,  908,  272,   10,    3,   23,   31,   51,   16,
            8,  577, 3082,    1,  784, 1206,   10, 2517,  908,  272,   10,   25,
           43,   12,  281, 4653,    1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1]]), 'token_type_ids': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0]])}

Exam

In [9]:
print(type(merged_tokens))

<class 'list'>


## Merging input and its token_type_ids to merged_tokens list

In [13]:
# input id, attention mask, token type id for each context
# data = [[[], "[00:15] B: i'm in the playroom", "Initiate", []],
#         [["[00:15] B: i'm in the playroom"], "[00:17] B: you have to go west", "Initiate", [0]],
#         [["[00:15] B: i'm in the playroom", "[00:17] B: you have to go west"], "[00:19] A: What does it look like", "Use", [1, 0]]
# ]

def preprocess_function(examples):
    """Tokenize every context utterance separately and attach manual token_type_ids."""
    tokenized_contexts = []
    all_token_type_ids = []  # one list of tensors per example

    for item in examples:
        # Tokenize each utterance in the context (item[0]) on its own
        tokens_list = [tokenizer(context, padding=True, truncation=True, return_tensors="pt") for context in item[0]]
        #print('\n', '\n token list: ', tokens_list)

        # Create one all-zero tensor per context, shaped like that context's `input_ids`
        token_type_id = [torch.zeros_like(tokens['input_ids']) for tokens in tokens_list]
        #print('token_type_id: ', token_type_id)

        # The last element of list B flags which contexts were grounded
        grounded_context_list = item[-1]

        # Wherever grounded_context_list is 1, set the whole tensor at that index to 1
        for i, grounded_context in enumerate(grounded_context_list):
            if grounded_context == 1:
                token_type_id[i].fill_(1)

        #print('token_type_id: ', token_type_id)

        # Add each tensor to its tokenized context under the key 'token_type_ids'
        for tokens, type_ids in zip(tokens_list, token_type_id):
            tokens['token_type_ids'] = type_ids

        # Collect the finished contexts and their token type ids for this example
        tokenized_contexts.append(tokens_list)
        all_token_type_ids.append(token_type_id)
        #print('tokenized context: ', tokenized_contexts)

    return tokenized_contexts, all_token_type_ids

# Tokenize each element in item[0] separately i.e., each context
tokenized_data, token_type_ids = preprocess_function(data)

# Print the tokenized data and token_type_ids
# for i, tokens_list in enumerate(tokenized_data):
#     print("----------------------------------------------------------------------------------------------------------------")
#     print(f"Example {i + 1}:")
#     for j, tokens in enumerate(tokens_list):
#         print(f"  Tokens {j + 1}:", tokenizer.convert_ids_to_tokens(tokens['input_ids'][0]))
#         print(f"  Input IDs {j + 1}:", tokens['input_ids'][0])
#     #print(f"  Token Type IDs:", token_type_ids[i])

#     print()

# for i in tokenized_data:
#     print(i,'\n')

print(tokenized_data)


[[], [{'input_ids': tensor([[ 784, 1206,   10, 1808,  908,  272,   10,    3,   23,   31,   51,   16,
            8,  577, 3082,    1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}], [{'input_ids': tensor([[ 784, 1206,   10, 1808,  908,  272,   10,    3,   23,   31,   51,   16,
            8,  577, 3082,    1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'token_type_ids': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}, {'input_ids': tensor([[ 784, 1206,   10, 2517,  908,  272,   10,   25,   43,   12,  281, 4653,
            1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}], [{'input_ids': tensor([[ 784, 1206,   10, 1808,  908,  272,   10,    3,   23,   31,   51,   16,
            8,  577, 3082,    1]]), 'attention_mask': tensor([[1, 1

Tokenizing input and generating token_type_ids for it

In [14]:


input_tokens_list = []
for item in data:
    tokens = tokenizer(item[1], padding=True, truncation=True, return_tensors="pt")
    input_tokens_list.append(tokens)

# Create one all-ones tensor per input utterance, shaped like its input_ids
input_token_type_ids = [torch.ones_like(tokens['input_ids']) for tokens in input_tokens_list]

# Print the tokenized elements and token_type_ids
for i, input_tokens in enumerate(input_tokens_list):
    print(f"Example {i + 1} - Tokens:", tokenizer.convert_ids_to_tokens(input_tokens['input_ids'][0]))
    print(f"          - Token Type IDs:", input_token_type_ids[i][0],'\n')

print(input_tokens_list)
print(input_token_type_ids)


Example 1 - Tokens: ['▁[', '00', ':', '15', ']', '▁B', ':', '▁', 'i', "'", 'm', '▁in', '▁the', '▁play', 'room', '</s>']
          - Token Type IDs: tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]) 

Example 2 - Tokens: ['▁[', '00', ':', '17', ']', '▁B', ':', '▁you', '▁have', '▁to', '▁go', '▁west', '</s>']
          - Token Type IDs: tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]) 

Example 3 - Tokens: ['▁[', '00', ':', '19', ']', '▁A', ':', '▁What', '▁does', '▁it', '▁look', '▁like', '</s>']
          - Token Type IDs: tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]) 

Example 4 - Tokens: ['▁[', '00', ':', '19', ']', '▁A', ':', '▁What', '▁does', '▁it', '▁look', '▁like', '</s>']
          - Token Type IDs: tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]) 

Example 5 - Tokens: ['▁[', '00', ':', '38', ']', '▁A', ':', '▁describe', '▁it', '▁please', '</s>']
          - Token Type IDs: tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]) 

Example 6 - Tokens: ['▁[', '01', ':', '39', ']', '▁A', ':'

First, we merge all the individual contexts of a data point into one.

Then we append -1 as a placeholder for [SEP] to ```input_ids```, ```attention_mask``` and ```token_type_ids```. The -1 is only a marker; it is not an id from the T5 vocabulary.

Finally, we append the tokens of the input after [SEP].
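
The three steps can be sketched as follows, again with plain lists instead of tensors. `SEP_PLACEHOLDER` and `join_context_and_input` are hypothetical names, and the id lists are shortened for readability.

```python
from typing import Dict, List

Encoded = Dict[str, List[int]]
SEP_PLACEHOLDER = -1  # assumption: mirrors the notebook's stand-in for [SEP]


def join_context_and_input(context: Encoded, utterance: Encoded) -> Encoded:
    """Build context + [SEP] placeholder + input for every field."""
    return {key: context[key] + [SEP_PLACEHOLDER] + utterance[key] for key in context}


# Merged context of one data point (not grounded, so its type ids are 0)
context = {"input_ids": [784, 1206, 1], "attention_mask": [1, 1, 1], "token_type_ids": [0, 0, 0]}
# The input utterance gets token type 1 at every position
utterance = {"input_ids": [784, 1], "attention_mask": [1, 1], "token_type_ids": [1, 1]}

joined = join_context_and_input(context, utterance)
print(joined["token_type_ids"])
```

All three fields receive the placeholder at the same position, so they keep the same length after the join.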

In [15]:
# merging context adding [SEP] and adding input

# Initialize a list to store the merged tokens (input IDs and attention masks)
merged_tokens = []

# Iterate through each element in tokenized_data
for index,example in enumerate(tokenized_data):

    # Initialize lists to store input IDs and attention masks for each element in the example
    input_ids_list = []
    attention_mask_list = []
    token_type_ids_list = []



    # Iterate through each part in the example
    for part in example:

        # Extract the input IDs and attention masks from the part
        input_ids = part['input_ids']
        attention_mask = part['attention_mask']
        token_type = part['token_type_ids']

        # Append the input IDs and attention masks to their respective lists
        input_ids_list.append(input_ids)
        attention_mask_list.append(attention_mask)
        token_type_ids_list.append(token_type)

    # Check if the input IDs list is empty. If empty, create a single tensor with the shape (1, 1) containing a special token ID.
    if not input_ids_list:
        input_ids_list.append(torch.tensor([[tokenizer.pad_token_id]]))
        attention_mask_list.append(torch.tensor([[0]]))
        token_type_ids_list.append(torch.tensor([[0]]))

    # Concatenate the input IDs and attention masks along the last dimension (dimension 1) to create merged tensors
    merged_input_ids = torch.cat(input_ids_list, dim=1)
    merged_attention_mask = torch.cat(attention_mask_list, dim=1)
    merged_token_type_ids = torch.cat(token_type_ids_list, dim=1)


    # Add -1 at the end of each merged tensor to represent [SEP] after which we will add tokenized input
    merged_input_ids = torch.cat([merged_input_ids, torch.tensor([[-1]])], dim=1)
    merged_attention_mask = torch.cat([merged_attention_mask, torch.tensor([[-1]])], dim=1)
    merged_token_type_ids = torch.cat([merged_token_type_ids, torch.tensor([[-1]])], dim=1)

    # Look up the tokenized input that belongs to this example, using the loop variable 'index'
    current_dict = input_tokens_list[index]

    # Pull out its input IDs and attention mask
    input_id = current_dict['input_ids']
    input_attention = current_dict['attention_mask']

    # Append the tokenized input after -1, i.e., [SEP]
    merged_input_ids = torch.cat((merged_input_ids, input_id), dim=1)
    merged_attention_mask = torch.cat((merged_attention_mask, input_attention), dim=1)
    merged_token_type_ids = torch.cat((merged_token_type_ids, input_token_type_ids[index]), dim=1)


    # Create a dictionary with the merged input IDs and attention masks
    merged_example = {'input_ids': merged_input_ids, 'attention_mask': merged_attention_mask, 'token_type_ids': merged_token_type_ids}

    # Append the merged_example dictionary to the merged_tokens list
    merged_tokens.append(merged_example)

# Print the merged tokens and attention masks together
for i, example in enumerate(merged_tokens):
    print(f"Example {i + 1} - Merged Tokens and Attention Mask: {example}")
    print()

#print(merged_tokens, type(merged_tokens))


Example 1 - Merged Tokens and Attention Mask: {'input_ids': tensor([[   0,   -1,  784, 1206,   10, 1808,  908,  272,   10,    3,   23,   31,
           51,   16,    8,  577, 3082,    1]]), 'attention_mask': tensor([[ 0, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1]]), 'token_type_ids': tensor([[ 0, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1]])}

Example 2 - Merged Tokens and Attention Mask: {'input_ids': tensor([[ 784, 1206,   10, 1808,  908,  272,   10,    3,   23,   31,   51,   16,
            8,  577, 3082,    1,   -1,  784, 1206,   10, 2517,  908,  272,   10,
           25,   43,   12,  281, 4653,    1]]), 'attention_mask': tensor([[ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1]]), 'token_type_ids': tensor([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, -1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1]])}

Example 3 - Mer

# Preparing Output

In [20]:
from sklearn.preprocessing import LabelEncoder

# List of actions
acts = ['Use', 'Move', 'Req-Ack', 'Req-Repair', 'Repair', 'Initiate', 'Ack-Req-Ack', 'Repeat', 'Explicit-Ack', 'Continue', 'Repeat-Back', 'Cancel']

# Given data
data = [[[], "[00:15] B: i'm in the playroom", "Initiate", []],
        [["[00:15] B: i'm in the playroom"], "[00:17] B: you have to go west", "Initiate", [0]],
        [["[00:15] B: i'm in the playroom", "[00:17] B: you have to go west"], "[00:19] A: What does it look like", "Use", [1, 0]],
        [["[00:15] B: i'm in the playroom", "[00:17] B: you have to go west"], "[00:19] A: What does it look like", "Initiate", [0, 0]],
        [["[00:15] B: i'm in the playroom", "[00:17] B: you have to go west", "[00:19] A: What does it look like"], "[00:38] A: describe it please", "Move", [0, 1, 0]],
        [["[00:15] B: i'm in the playroom", "[00:17] B: you have to go west", "[00:19] A: What does it look like", "[00:38] A: describe it please", "[00:55] A: I'm in one as well, I see  a small chalkboard in the corner and a small tent to the right"], "[01:39] A: Please describe what you see in your playroom", "Repeat", [0, 0, 1, 1, 0]]]

# Initialize the LabelEncoder
label_encoder = LabelEncoder()
label_encoder.fit(acts)

# Initialize the output list
output = []

# Iterate through each list B in the data
for idx, item in enumerate(data):
    # Get the grounding act (a single string) from the 3rd element of list B
    actions = item[2]
    #print('\nactions: ', actions)
    # Wrap the act in a list, because transform expects a 1-dimensional array, and take the single encoded value
    encoded_actions = label_encoder.transform([actions])[0]
    #print("encoded: ", encoded_actions)
    output.append(encoded_actions)

# Print the encoded actions for each list B
for i, encoded_actions in enumerate(output):
    print(f"\nExample {i + 1} - Encoded Actions: {encoded_actions}")

print(output)


Example 1 - Encoded Actions: 4

Example 2 - Encoded Actions: 4

Example 3 - Encoded Actions: 11

Example 4 - Encoded Actions: 4

Example 5 - Encoded Actions: 5

Example 6 - Encoded Actions: 7
[4, 4, 11, 4, 5, 7]
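
`LabelEncoder` assigns ids by sorting the class names, which is why `Initiate` maps to 4 and `Use` to 11 above. A minimal sketch that reproduces the mapping with the standard library only (the names `act_to_id` and `id_to_act` are assumptions made for this example):

```python
acts = ['Use', 'Move', 'Req-Ack', 'Req-Repair', 'Repair', 'Initiate', 'Ack-Req-Ack',
        'Repeat', 'Explicit-Ack', 'Continue', 'Repeat-Back', 'Cancel']

# Sort the class names, then number them in that order
act_to_id = {act: i for i, act in enumerate(sorted(acts))}
id_to_act = {i: act for act, i in act_to_id.items()}

# The grounding acts of the six example data points
labels = ['Initiate', 'Initiate', 'Use', 'Initiate', 'Move', 'Repeat']
encoded = [act_to_id[label] for label in labels]
print(encoded)  # [4, 4, 11, 4, 5, 7]
```

`id_to_act` turns a predicted id back into its act name, which is the job `label_encoder.inverse_transform` does for the fitted encoder.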
