## Batched Feature Maps

Batched feature maps allow for more flexible manipulation of data in various batch sizes, and help with troubleshooting feature extraction errors that may be inconsistent across batches (e.g. with next-token-predicting LLMs). Below are two examples of the use of batched feature maps:
- The canonical AlexNet example.
- A more complicated GPT example.

First, some setup...

In [1]:
import sys; sys.path += ['..']
# add deepjuice to your path

In [2]:
%load_ext autoreload
%autoreload complete

In [3]:
from deepjuice import * # the juice

In [4]:
from juicyfruits import NSDBenchmark
benchmark = NSDBenchmark() #load brain data benchmark

model_uid = 'torchvision_alexnet_imagenet1k_v1'
model, preprocess = get_deepjuice_model(model_uid)

# here, we'll subset our images to simulate
# a slightly uneven split across batches:
image_subset = benchmark.image_paths[:160]

dataloader = get_data_loader(image_subset, preprocess, batch_size=64)

Initializing DeepJuice Benchmarks (JuicyFruits)
Loading DeepJuice NSDBenchmark: 
  Image Set: shared1000
  Voxel Set: ['EVC', 'OTC']


**The Modified Extraction Procedure**

...works almost exactly the same as before, except there's a new key argument:

**batch_strategy**: 3 main variants
- *join* constructs empty tensors in full dimension, and fills them iteratively with each batch.
- *list* quite literally adds each batch of features to a list, and returns that list.
- *stack* wraps this list in a class that allows for easier manipulation of the underlying nested list.


*join* is the canonical version you're used to and is the most efficient, but it turns out it can also lead to a lot of degenerate behavior if batch sizes are irregular, if transfer between tensor devices is slow, or for a number of other reasons. *list* does what it says on the tin: each new batch of features is dropped into a list, and that list is returned at the end of the function. *stack* is the version that directly uses the BatchedFeatureMaps class, but keep in mind that you can also initialize this class yourself from the output of the *list* version once it's been built.
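To make the difference between *join* and *list* a bit more concrete, here is a minimal sketch in plain PyTorch. This is not the deepjuice implementation; `extract_list`, `extract_join`, and the toy batches are hypothetical names made up for illustration.

```python
import torch

def extract_list(batches):
    """'list'-style: keep each batch of features as its own entry."""
    return [batch.clone() for batch in batches]

def extract_join(batches, total_inputs):
    """'join'-style: preallocate the full tensor, then fill it batch by batch."""
    feature_dims = batches[0].shape[1:]
    joined = torch.empty((total_inputs, *feature_dims), dtype=batches[0].dtype)
    start = 0
    for batch in batches:
        stop = start + batch.shape[0]
        joined[start:stop] = batch  # fails if trailing dims differ across batches
        start = stop
    return joined

# three toy batches of 64, 64, and 32 inputs with 8 features each
toy_batches = [torch.randn(n, 8) for n in (64, 64, 32)]

as_list = extract_list(toy_batches)
as_join = extract_join(toy_batches, total_inputs=160)

print([tuple(b.shape) for b in as_list])  # [(64, 8), (64, 8), (32, 8)]
print(tuple(as_join.shape))               # (160, 8)
```

The join-style function has to assume that every batch shares the same trailing dimensions, whereas the list-style function makes no such assumption. That is what makes the list the easier thing to inspect when something goes wrong.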

Let's have a look at the list version first:

In [5]:
# extract on the GPU (cuda:0) and store the outputs on the CPU:
devices = {'device': 'cuda:0', 'output_device': 'cpu'}

# notice that I'm invoking the get_feature_maps function directly
# this is the function internally called by the FeatureExtractor:
feature_map_list = get_feature_maps(model, dataloader, **devices,
                                   batch_strategy='list') # <- the new argument!

Extracting sample feature_maps with torchinfo (CUDA:0 to CPU)
Keeping 18 / 24 total maps (6 duplicates removed).


Feature Extraction (DataLoader):   0%|          | 0/3 [00:00<?, ?it/s]

So what does feature_map_list look like?

In [6]:
feature_map_list # notice the last of these feature_maps has only 32 inputs

[FeatureMaps Handle
  18 maps; 64 inputs; 262.60 MB 
  0 maps on GPU (0 duplicates),
 FeatureMaps Handle
  18 maps; 64 inputs; 262.60 MB 
  0 maps on GPU (0 duplicates),
 FeatureMaps Handle
  18 maps; 32 inputs; 131.30 MB 
  0 maps on GPU (0 duplicates)]

...in this example, feature_map_list is not a list of dictionaries as you might expect, but a list of what I call "FeatureMaps" handles. These behave almost EXACTLY like a dictionary, but don't pollute IPython with numerical printouts. They also give us access to a bunch of quick stats.
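If the idea of a "dictionary that prints a summary instead of its contents" is unfamiliar, it can be sketched in a few lines. `QuietMaps` below is a hypothetical stand-in, not the actual deepjuice class, and it assumes that every value is indexed by input along its first axis.

```python
class QuietMaps(dict):
    """A dict whose repr is a short summary rather than its full contents."""

    def get_input_size(self):
        # assumption: the first axis of every value indexes the inputs
        if not self:
            return 0
        first_value = next(iter(self.values()))
        return len(first_value)

    def __repr__(self):
        return f"QuietMaps Handle\n  {len(self)} maps; {self.get_input_size()} inputs"

maps = QuietMaps({
    "layer_a": [[0.0] * 4 for _ in range(3)],  # 3 inputs, 4 features
    "layer_b": [[0.0] * 2 for _ in range(3)],  # 3 inputs, 2 features
})

print(repr(maps))         # two-line summary, no numbers dumped
print(list(maps.keys()))  # ['layer_a', 'layer_b']
```

Because the class only overrides `__repr__` and adds one helper, everything else (`items()`, `keys()`, indexing) works exactly as it does for a regular dict.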

In [7]:
batch_one_maps = feature_map_list[0]

# note how these behave exactly like dictionaries:
for uid, feature_map in batch_one_maps.items():
    if 'Linear' in uid: # print linear layers:
        print(uid, [x for x in feature_map.shape])
    
# but also give you cool, quick stats:
print('\n Number of inputs:', batch_one_maps.get_input_size())

Linear-2-15 [64, 4096]
Linear-2-18 [64, 4096]
Linear-2-20 [64, 1000]

 Number of inputs: 64


With this in mind now, you can think of BatchedFeatureMaps as simply a wrapper around the FeatureMap wrappers (don't worry -- this is the most recursive this will get, I think...)

In [8]:
from deepjuice.extraction import BatchedFeatureMaps

batched_maps = BatchedFeatureMaps(feature_map_list)

In [9]:
batched_maps # the initial report

Batch Tensor Maps Handler
  Total Batch Count: 3
  Total Input Count: 160
  # of Unique Feature Maps: 18
  No irregularities found.

In [10]:
# this returns an empty dict in this case:
batched_maps.get_irregular_shapes()

{}
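As a rough picture of what an irregularity check could look like, here is a sketch that compares shapes across batches while ignoring the first (input) axis. The rule and the name `find_irregular_shapes` are assumptions for illustration; the actual check in BatchedFeatureMaps may differ.

```python
def find_irregular_shapes(shapes_by_uid):
    """Return the uids whose per-batch shapes disagree beyond the first axis."""
    irregular = {}
    for uid, shapes in shapes_by_uid.items():
        trailing = {tuple(shape[1:]) for shape in shapes}
        if len(trailing) > 1:
            irregular[uid] = shapes
    return irregular

shapes_by_uid = {
    # only the first (input) axis changes across batches: regular
    "Linear-2-20": [[64, 1000], [64, 1000], [32, 1000]],
    # the batch size sits on the second axis: irregular
    # (these are the shapes reported for GPT2 further below)
    "GPT2Model-S2": [[2, 10, 12, 1024, 64],
                     [2, 11, 12, 1024, 64],
                     [2, 12, 12, 1024, 64]],
}

irregular = find_irregular_shapes(shapes_by_uid)
print(list(irregular))  # ['GPT2Model-S2']
```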

Note, this modification to the "batch_strategy" procedure can be wrapped directly into a FeatureExtractor:

In [11]:
devices = {'device': 'cuda:0', 'output_device': 'cpu'}

extractor = FeatureExtractor(model, dataloader, **devices,
                             batch_strategy='stack',
                             max_memory_limit='16GB')

Extracting sample feature_maps with torchinfo (CUDA:0 to CPU)
FeatureExtractor Handle for AlexNet
  24 feature maps (+6 duplicates); 160 inputs
  Memory required for full extraction: 677.15 MB
  Memory usage limiting device set to: cpu
  Memory usage limit currently set to: 371.294 GB
  1 batch(es) required for current memory limit 
   Batch-001: 24 feature maps; 677.15 MB


In [12]:
for batched_feature_maps in tqdm(extractor, 'Global Progress'):
    feature_maps = batched_feature_maps.join_batches()
    for uid, feature_map in feature_maps.items():
        if 'Linear' in uid: # print linear layers:
            print(uid, [x for x in feature_map.shape])

Global Progress:   0%|          | 0/1 [00:00<?, ?it/s]

Irregularly shaped feature_maps excluded by default.


Joining Batched Feature Maps:   0%|          | 0/3 [00:00<?, ?it/s]

Linear-2-15 [160, 4096]
Linear-2-18 [160, 4096]
Linear-2-20 [160, 1000]
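Conceptually, joining the batches amounts to concatenating each feature map along its first axis and leaving out any map whose remaining dimensions disagree across batches. The sketch below shows this with plain dictionaries of tensors; `join_batch_dicts` is a hypothetical helper, not the library's `join_batches` method.

```python
import torch

def join_batch_dicts(batch_list, skip_irregular=True):
    """Concatenate a list of {uid: tensor} dicts along the first axis."""
    joined = {}
    for uid in batch_list[0]:
        tensors = [batch[uid] for batch in batch_list]
        trailing = {tuple(t.shape[1:]) for t in tensors}
        if len(trailing) > 1:
            if skip_irregular:
                continue  # trailing dims disagree, so there is nothing sane to join
            raise ValueError(f"irregular shapes for {uid}")
        joined[uid] = torch.cat(tensors, dim=0)
    return joined

batch_list = [
    {"fc": torch.zeros(64, 10), "odd": torch.zeros(64, 5)},
    {"fc": torch.zeros(64, 10), "odd": torch.zeros(64, 6)},
    {"fc": torch.zeros(32, 10), "odd": torch.zeros(32, 7)},
]

joined = join_batch_dicts(batch_list)
print({uid: tuple(t.shape) for uid, t in joined.items()})  # {'fc': (160, 10)}
```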


In [13]:
# this will also work if we add a flattening modification

extractor.modify_settings(flatten=True)

for batched_feature_maps in tqdm(extractor, 'Global Progress'):
    feature_maps = batched_feature_maps.join_batches()
    for uid, feature_map in feature_maps.items():
        if 'Conv2d' in uid: # print convolutional layers:
            print(uid, [x for x in feature_map.shape])

Global Progress:   0%|          | 0/1 [00:00<?, ?it/s]

Irregularly shaped feature_maps excluded by default.


Joining Batched Feature Maps:   0%|          | 0/3 [00:00<?, ?it/s]

Conv2d-2-1 [160, 193600]
Conv2d-2-4 [160, 139968]
Conv2d-2-7 [160, 64896]
Conv2d-2-9 [160, 43264]
Conv2d-2-11 [160, 43264]
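The flattened sizes above are just the product of each layer's non-batch dimensions. As a quick check, assuming the first AlexNet convolution outputs 64 channels on a 55 x 55 grid (the standard torchvision shape for 224 x 224 inputs, which is an assumption here since the unflattened shapes aren't printed), flattening everything after the first axis gives 193600 features per input:

```python
import torch

# a stand-in for the first conv output: (inputs, channels, height, width)
conv_output = torch.zeros(4, 64, 55, 55)

# collapse every axis after the first into a single feature axis
flattened = conv_output.flatten(start_dim=1)
print(tuple(flattened.shape))  # (4, 193600)
```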


## Huggingface GPT2 Example

Now, a more complicated example, using real text (captions) data in a generative LLM (Huggingface's GPT2).

First, we load the captions data and our target model.

In [14]:
from example_assist import parse_caption_data

caption_path = 'example_data/social_captions'
caption_data = parse_caption_data(f'{caption_path}.csv')

In [15]:
from transformers import AutoTokenizer, AutoModel
model_uid = 'gpt2' # huggingface implemented

# standard loader without additional configs:
model = AutoModel.from_pretrained(model_uid)
tokenizer = AutoTokenizer.from_pretrained(model_uid)

tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer));

The batched_feature_maps functionality can really come in handy in two particular scenarios.

The 1st scenario is where certain feature_map shapes may change across batches. <br>The 2nd scenario is where we might want the batch size itself to vary from batch to batch.

This example will cover both of these scenarios. Here, we're working with a generative LLM, whose last layer (by default) produces a whole bunch of tensors (including *past_key_values*) that it sandwiches into a tuple.

To get variable batch sizes into our dataloader, we use a Torch class on the backend called BatchSampler. On the frontend, all we need to do is provide a dataframe, and a grouping variable that will assist our BatchSampler in choosing batches that get all inputs within a certain group into the same batch.
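The grouping idea can be sketched without any deepjuice code: pack whole groups into a batch until adding the next group would exceed the batch size, then start a new batch. The helper `group_batches` and its greedy packing rule are assumptions for illustration, not the logic deepjuice actually uses.

```python
def group_batches(group_labels, batch_size):
    """Pack whole groups into batches of at most batch_size items.

    A single group larger than batch_size still ends up in one
    (oversized) batch, since groups are never split.
    """
    groups = {}
    for index, label in enumerate(group_labels):
        groups.setdefault(label, []).append(index)

    batches, current = [], []
    for indices in groups.values():
        # start a new batch if this group would overflow the current one
        if current and len(current) + len(indices) > batch_size:
            batches.append(current)
            current = []
        current = current + indices
    if current:
        batches.append(current)
    return batches

# four groups of 5, 5, 4, and 5 items
labels = ["a"] * 5 + ["b"] * 5 + ["c"] * 4 + ["d"] * 5
batches = group_batches(labels, batch_size=12)
print([len(b) for b in batches])  # [10, 9]
```

A list of index lists like this can be passed to `torch.utils.data.DataLoader` through its `batch_sampler` argument, which is one way to get batches of uneven size out of a standard loader.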

In [16]:
# here is our new dataloader functionality; notice data_key and group_keys:
dataloader = get_data_loader(caption_data, tokenizer, input_modality='text',
                             batch_size=16, data_key='caption', group_keys='video_name')

In [17]:
dataloader # by default, dataloader now comes packaged with additional info

DataLoader with 101 batches.
  Batch sizes range from: 6 to 16
  with an average size of: ~13

In [18]:
# note this loader can now yield samples!
dataloader.get_sample(show_original=True)

Young boys are playing on a pair of drum sets.


{'input_ids': tensor([20917,  6510,   389,  ..., 50257, 50257, 50257]), 'attention_mask': tensor([1, 1, 1,  ..., 0, 0, 0])}

The data used to produce the variable batch sizes is directly available in the dataloader:

In [19]:
dataloader.batch_data.sort_values(by=['group_index', 'caption_index'])

| | video_name | caption_index | caption | group_index | batch_iter | batch_index | batch_size |
|---|---|---|---|---|---|---|---|
| 0 | -YwZOeyAQC8_15.mp4 | 1 | A man playing on the wii which is making the b... | 0 | 0 | 0 | 10 |
| 2 | -YwZOeyAQC8_15.mp4 | 2 | A man in shorts sits in a chair next to a stan... | 0 | 0 | 2 | 10 |
| 4 | -YwZOeyAQC8_15.mp4 | 3 | A man with a baby on his lap playing Wii | 0 | 0 | 4 | 10 |
| 6 | -YwZOeyAQC8_15.mp4 | 4 | father & child enjoying a show | 0 | 0 | 6 | 10 |
| 8 | -YwZOeyAQC8_15.mp4 | 5 | A man sits playing video games on his TV while... | 0 | 0 | 8 | 10 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1382 | yt_R-8XFRghgHAwk_54.mp4 | 1 | An adult helping a small baby to open a wrappe... | 249 | 100 | 1 | 10 |
| 1384 | yt_R-8XFRghgHAwk_54.mp4 | 2 | a a baby sitting on his fathers knee while his... | 249 | 100 | 3 | 10 |
| 1386 | yt_R-8XFRghgHAwk_54.mp4 | 3 | A man helping a baby unwrap a present. | 249 | 100 | 5 | 10 |
| 1388 | yt_R-8XFRghgHAwk_54.mp4 | 4 | A dad helping a baby to unwrap a present | 249 | 100 | 7 | 10 |


(In this example, since GPT2 is REAL big, let's just look at the first 3 batches):

In [20]:
caption_subset = (dataloader.batch_data.sort_values(by=['group_index', 'caption_index'])
                  .query('batch_iter < 3'))[['video_name', 'caption_index', 'caption']]

# define a new dataloader with only these captions:
dataloader = get_data_loader(caption_subset, tokenizer, input_modality='text',
                             batch_size=16, data_key='caption', group_keys='video_name')

In [21]:
clean_and_sweep() # clear the CUDA cache

In [22]:
extractor = FeatureExtractor(model, dataloader, batch_strategy='stack', **devices)

Extracting sample feature_maps with torchinfo (CUDA:0 to CPU)


Extraction Error Report
(These Layers Skipped)
  Embedding-1-2
   --(add_features) IndexError: No dimension equal to input_size.


FeatureExtractor Handle for GPT2Model
  185 feature maps (+61 duplicates); 33 inputs
  Memory required for full extraction: 49.695 GB
  Memory usage limiting device set to: cpu
  Memory usage limit currently set to: 365.151 GB
  1 batch(es) required for current memory limit 
   Batch-001: 185 feature maps; 49.695 GB


In [23]:
for batched_feature_maps in tqdm(extractor, 'Overall Progress'):
    print(batched_feature_maps)

Overall Progress:   0%|          | 0/1 [00:00<?, ?it/s]

Batch Tensor Maps Handler
  Total Batch Count: 3
  Total Input Count: 33
  # of Unique Feature Maps: 185
  # Irregular Shapes: 25


Notice here that when we load GPT2 without additional configurations or keyword arguments, we can get some irregularly shaped feature_maps:

In [24]:
batched_feature_maps.get_irregular_shapes()

{'GPT2Model-S2': [[2, 10, 12, 1024, 64],
  [2, 11, 12, 1024, 64],
  [2, 12, 12, 1024, 64]],
 'GPT2Block-2-1-S2': [[10, 12, 1024, 64],
  [11, 12, 1024, 64],
  [12, 12, 1024, 64]],
 'GPT2Attention-3-2-S2': [[10, 12, 1024, 64],
  [11, 12, 1024, 64],
  [12, 12, 1024, 64]],
 'GPT2Block-2-2-S2': [[10, 12, 1024, 64],
  [11, 12, 1024, 64],
  [12, 12, 1024, 64]],
 'GPT2Attention-3-6-S2': [[10, 12, 1024, 64],
  [11, 12, 1024, 64],
  [12, 12, 1024, 64]],
 'GPT2Block-2-3-S2': [[10, 12, 1024, 64],
  [11, 12, 1024, 64],
  [12, 12, 1024, 64]],
 'GPT2Attention-3-10-S2': [[10, 12, 1024, 64],
  [11, 12, 1024, 64],
  [12, 12, 1024, 64]],
 'GPT2Block-2-4-S2': [[10, 12, 1024, 64],
  [11, 12, 1024, 64],
  [12, 12, 1024, 64]],
 'GPT2Attention-3-14-S2': [[10, 12, 1024, 64],
  [11, 12, 1024, 64],
  [12, 12, 1024, 64]],
 'GPT2Block-2-5-S2': [[10, 12, 1024, 64],
  [11, 12, 1024, 64],
  [12, 12, 1024, 64]],
 'GPT2Attention-3-18-S2': [[10, 12, 1024, 64],
  [11, 12, 1024, 64],
  [12, 12, 1024, 64]],
 'GPT2Block-2-6

This happens because of a recent update in how Huggingface "evaluates" CLM-style LLMs by default. Unless otherwise configured, these models store a cache of tensor values (the *past_key_values*) that facilitates next-token prediction, but is confusingly shaped.

See for example [this section](https://huggingface.co/transformers/v3.5.1/model_doc/gpt2.html#gpt2lmheadmodel) in the Huggingface documentation for more information on this caching procedure and its downstream dimensionality.
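To read the irregular shapes reported above, it helps to put names on the axes. The sketch below assumes the layout described in the linked documentation, `(2, batch_size, num_heads, sequence_length, embed_size_per_head)`, where the leading 2 holds the keys and the values. The helper `describe_cache_shape` is hypothetical.

```python
CACHE_AXES = ("key_or_value", "batch_size", "num_heads",
              "sequence_length", "head_dim")

def describe_cache_shape(shape):
    """Pair each axis of a stacked key/value cache tensor with a name."""
    return dict(zip(CACHE_AXES, shape))

reported_shapes = [[2, 10, 12, 1024, 64],
                   [2, 11, 12, 1024, 64],
                   [2, 12, 12, 1024, 64]]

for shape in reported_shapes:
    named = describe_cache_shape(shape)
    # heads * head_dim recovers the hidden size seen in the regular maps
    hidden_size = named["num_heads"] * named["head_dim"]
    print(named["batch_size"], hidden_size)
# prints:
# 10 768
# 11 768
# 12 768
```

Read this way, the axis that changes from batch to batch is the batch size (10 + 11 + 12 = 33 inputs), but it sits on the second axis rather than the first, which is why these tensors can't be treated like the other feature maps.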

For now, and for most intents and purposes, we don't need or even want this cached information, and can safely exclude it:


In [29]:
from transformers import AutoTokenizer, AutoModel

model_uid = 'gpt2' # huggingface implemented

#! note now this config we'll add:
model_config = {'use_cache': False}

model = AutoModel.from_pretrained(model_uid, **model_config)
tokenizer = AutoTokenizer.from_pretrained(model_uid)

tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer));

Now, when we run our same procedure, you'll notice we get no irregular shapes:

In [44]:
caption_subset = (dataloader.batch_data.sort_values(by=['group_index', 'caption_index'])
                  .query('batch_iter < 3'))[['video_name', 'caption_index', 'caption']]

# define a new dataloader with only these captions:
dataloader = get_data_loader(caption_subset, tokenizer, input_modality='text',
                             batch_size=16, data_key='caption', group_keys='video_name')

In [45]:
clean_and_sweep() # clear the CUDA cache

extractor = FeatureExtractor(model, dataloader, **devices,
                             memory_limit='16GB',
                             batch_strategy='stack')

Extracting sample feature_maps with torchinfo (CUDA:0 to CPU)


Extraction Error Report
(These Layers Skipped)
  Embedding-1-2
   --(add_features) IndexError: No dimension equal to input_size.


FeatureExtractor Handle for GPT2Model
  160 feature maps (+49 duplicates); 33 inputs
  Memory required for full extraction: 42.153 GB
  Memory usage limiting device set to: cpu
  Memory usage limit currently set to: 16.000 GB
  3 batch(es) required for current memory limit 
   Batch-001: 59 feature maps; 14.792 GB 
   Batch-002: 57 feature maps; 15.856 GB 
   Batch-003: 44 feature maps; 11.505 GB


In [46]:
for batched_feature_maps in tqdm(extractor, 'Overall Progress'):
    feature_maps = batched_feature_maps.join_batches()
    for uid, feature_map in feature_maps.items():
        if 'Attention' in uid:
            print(uid, feature_map.shape)

Overall Progress:   0%|          | 0/3 [00:00<?, ?it/s]

Irregularly shaped feature_maps excluded by default.


Joining Batched Feature Maps:   0%|          | 0/3 [00:00<?, ?it/s]

GPT2Attention-3-2 torch.Size([33, 1024, 768])
GPT2Attention-3-6 torch.Size([33, 1024, 768])
GPT2Attention-3-10 torch.Size([33, 1024, 768])
GPT2Attention-3-14 torch.Size([33, 1024, 768])
GPT2Attention-3-18 torch.Size([33, 1024, 768])
Irregularly shaped feature_maps excluded by default.


Joining Batched Feature Maps:   0%|          | 0/3 [00:00<?, ?it/s]

GPT2Attention-3-22 torch.Size([33, 1024, 768])
GPT2Attention-3-26 torch.Size([33, 1024, 768])
GPT2Attention-3-30 torch.Size([33, 1024, 768])
GPT2Attention-3-34 torch.Size([33, 1024, 768])
Irregularly shaped feature_maps excluded by default.


Joining Batched Feature Maps:   0%|          | 0/3 [00:00<?, ?it/s]

GPT2Attention-3-38 torch.Size([33, 1024, 768])
GPT2Attention-3-42 torch.Size([33, 1024, 768])
GPT2Attention-3-46 torch.Size([33, 1024, 768])


As a last step in this walkthrough, we'll show how we can combine batched_feature_maps with our variable batch sampler to save disk space by averaging feature_maps over the multiple captions we have per video.

We do this simply by constructing a function that leverages the batch data in our sampler to compute averages *per group*...

In [42]:
import torch

# store batch_data to directly access
# the groups in our batching function:
batch_data = dataloader.batch_data.copy()

def grouped_average(tensor, batch_iter=None, **kwargs):
    # without a batch_iter we can't look up groups, so return as is
    if batch_iter is None:
        return tensor

    # rows of batch_data that belong to the current batch
    sub_data = batch_data.query('batch_iter==@batch_iter')

    tensor_means = [] # fill with group tensor means
    for group in sub_data.group_index.unique():
        group_data = sub_data.query('group_index==@group')
        group_idx = group_data.batch_index.to_list()

        # convert index to tensor on device
        group_idx = (torch.LongTensor(group_idx)
                     .to(tensor.device))

        # average over the captions in this group
        tensor_mean = tensor[group_idx].mean(dim=0)
        tensor_means += [tensor_mean.unsqueeze(0)]

    # one row per group, in order of first appearance in the batch
    return torch.concat(tensor_means, dim=0)

... then adding it to our extractor!

In [47]:
clean_and_sweep() # clear the CUDA cache

extractor = FeatureExtractor(model, dataloader, **devices,
                             # note the function here:
                             tensor_fn=grouped_average,
                             memory_limit='16GB',
                             batch_strategy='stack')

Extracting sample feature_maps with torchinfo (CUDA:0 to CPU)


Extraction Error Report
(These Layers Skipped)
  Embedding-1-2
   --(add_features) IndexError: No dimension equal to input_size.


FeatureExtractor Handle for GPT2Model
  160 feature maps (+49 duplicates); 33 inputs
  Memory required for full extraction: 42.153 GB
  Memory usage limiting device set to: cpu
  Memory usage limit currently set to: 16.000 GB
  3 batch(es) required for current memory limit 
   Batch-001: 59 feature maps; 14.792 GB 
   Batch-002: 57 feature maps; 15.856 GB 
   Batch-003: 44 feature maps; 11.505 GB


In [48]:
for batched_feature_maps in tqdm(extractor, 'Overall Progress'):
    feature_maps = batched_feature_maps.join_batches()
    for uid, feature_map in feature_maps.items():
        if 'Attention' in uid:
            print(uid, feature_map.shape)

Overall Progress:   0%|          | 0/3 [00:00<?, ?it/s]

Irregularly shaped feature_maps excluded by default.


Joining Batched Feature Maps:   0%|          | 0/3 [00:00<?, ?it/s]

GPT2Attention-3-2 torch.Size([5, 1024, 768])
GPT2Attention-3-6 torch.Size([5, 1024, 768])
GPT2Attention-3-10 torch.Size([5, 1024, 768])
GPT2Attention-3-14 torch.Size([5, 1024, 768])
GPT2Attention-3-18 torch.Size([5, 1024, 768])
Irregularly shaped feature_maps excluded by default.


Joining Batched Feature Maps:   0%|          | 0/3 [00:00<?, ?it/s]

GPT2Attention-3-22 torch.Size([5, 1024, 768])
GPT2Attention-3-26 torch.Size([5, 1024, 768])
GPT2Attention-3-30 torch.Size([5, 1024, 768])
GPT2Attention-3-34 torch.Size([5, 1024, 768])
Irregularly shaped feature_maps excluded by default.


Joining Batched Feature Maps:   0%|          | 0/3 [00:00<?, ?it/s]

GPT2Attention-3-38 torch.Size([5, 1024, 768])
GPT2Attention-3-42 torch.Size([5, 1024, 768])
GPT2Attention-3-46 torch.Size([5, 1024, 768])
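To sanity-check the grouping logic outside of the extractor, it can help to run the same kind of averaging on a tiny tensor where the answer is easy to verify by hand. The function `average_by_group` and the toy data below are hypothetical and independent of the dataloader; the interleaved group ids mimic the alternating batch_index values seen in batch_data.

```python
import torch

def average_by_group(tensor, group_ids):
    """Average the rows of `tensor` that share a group id,
    returning one row per group in order of first appearance."""
    seen = []
    for group in group_ids:
        if group not in seen:
            seen.append(group)

    means = []
    for group in seen:
        rows = torch.tensor([i for i, g in enumerate(group_ids) if g == group])
        means.append(tensor[rows].mean(dim=0, keepdim=True))
    return torch.cat(means, dim=0)

# six inputs with two features each; rows alternate between two groups
features = torch.arange(12, dtype=torch.float32).reshape(6, 2)
group_ids = [0, 1, 0, 1, 0, 1]

averaged = average_by_group(features, group_ids)
print(averaged)
# tensor([[4., 5.],
#         [6., 7.]])
```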


And there you have it; the use of batched_feature_maps and averaging to extract features from GPT2!