# CMPUT 624 - Machine Learning and the Brain (2023)

*Notebook written by Alex Murphy (September 2023)*

This notebook is for the class workshop on **Thursday 28 September 2023**. Now that you have all handed in your project proposals, this workshop will give you a chance to see some examples of loading / visualising neural data. I will focus on EEG & fMRI because these are the modalities your teams have selected for your projects. We will also end with some examples of loading (large) language models (LLMs) via the HuggingFace library to extract text embeddings.

* Section 1: Working with EEG
* Section 2: Working with fMRI
* Section 3: Working with LLMs

# Section 3: Working with LLMs




Make sure you have the `transformers` library installed:
* `pip install transformers` (prefixed with exclamation mark if running in a notebook cell)

In [None]:

!pip install transformers

Collecting transformers
  Downloading transformers-4.33.3-py3-none-any.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers)
  Downloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

## Pipelines

HuggingFace provides a very simple entry point into working with language models via its `pipeline` interface. You can start working with language models very quickly using this functionality, but it can be quite restrictive when you want to do anything more advanced or custom. However, it's a great place to begin. The first time you call these pipelines, relevant models will be downloaded so it might take a little bit of time on the first call of a specific pipeline task.

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Downloading (…)lve/main/config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

[{'label': 'POSITIVE', 'score': 0.9598048329353333}]

`sentiment-analysis` is just one possibility. The website outlines some other possible tasks:

* `feature-extraction` (get the vector representation of a text)
* `fill-mask`
* `ner` (named entity recognition)
* `question-answering`
* `sentiment-analysis`
* `summarization`
* `text-generation`
* `translation`
* `zero-shot-classification`

Let's try `summarization`

In [None]:
classifier = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


Downloading (…)lve/main/config.json:   0%|          | 0.00/1.80k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Let's take the abstract of a paper and see how much **further** a summarisation model can condense it. Here is the text:

*Face cells are neurons that respond more to faces than to non-face objects. They are found in clusters in the inferotemporal cortex, thought to process faces specifically, and, hence, studied using faces almost exclusively. Analyzing neural responses in and around macaque face patches to hundreds of objects, we found graded response profiles for non-face objects that predicted the degree of face selectivity and provided information on face-cell tuning beyond that from actual faces. This relationship between non-face and face responses was not predicted by color and simple shape properties but by information encoded in deep neural networks trained on general objects rather than face classification. These findings contradict the long-standing assumption that face versus non-face selectivity emerges from face-specific features and challenge the practice of focusing on only the most effective stimulus. They provide evidence instead that category-selective neurons are best understood by their tuning directions in a domain-general object space.*

In [None]:
summary = classifier("Face cells are neurons that respond more to faces than to non-face objects. They are found in clusters in the inferotemporal cortex, thought to process faces specifically, and, hence, studied using faces almost exclusively. Analyzing neural responses in and around macaque face patches to hundreds of objects, we found graded response profiles for non-face objects that predicted the degree of face selectivity and provided information on face-cell tuning beyond that from actual faces. This relationship between non-face and face responses was not predicted by color and simple shape properties but by information encoded in deep neural networks trained on general objects rather than face classification. These findings contradict the long-standing assumption that face versus non-face selectivity emerges from face-specific features and challenge the practice of focusing on only the most effective stimulus. They provide evidence instead that category-selective neurons are best understood by their tuning directions in a domain-general object space.")

In [None]:
summary[0]['summary_text'].split('. ')

[' Face cells are neurons that respond more to faces than to non-face objects ',
 'They are found in clusters in the inferotemporal cortex, thought to process faces specifically ',
 'Analyzing neural responses in and around macaque face patches to hundreds of objects, we found graded response profiles for non-faces predicted the degree of face selectivity .']

That is a good summary! It might be slightly different the next time I run this.

You can specify a particular model by passing in `model="model_name"`, so that you always get responses from a model you have chosen rather than from whatever the default happens to be. However, this is not really of direct interest to those of you doing projects with text. We're interested in the embeddings. Let's look at those next. But first, it is worth reading up on the different models available, as some have language-specific embeddings while others are trained to be multilingual. Those working with text stimuli in multiple languages might find a useful comparison with their brain data by examining the difference here.

## Tokenisers

Text comes as a string of letters that we recognise as words. In order for a language model to process text, it needs to be split up into chunks so that individual units can be mapped to numerical embeddings. There are a few ways this can happen:

* word-based
* character-based
* sub-word based (WordPiece / SentencePiece / Byte-Pair Encoding)

With word-based tokenisation, we need a specific embedding for every word in our vocabulary and we cannot meaningfully process any new word outside the vocab used in training. We can get around that by making an embedding for every character (so a nice small look-up table of the 26 letters of the English alphabet plus punctuation). The problem here is that individual characters don't encode meaning by themselves. Character-level models have been used and do surprisingly well at many tasks, but sub-word tokenisation is a happy medium between the two extremes. The details aren't important for this class today, but if you use an LLM in your projects, it is worth diving a bit deeper into the world of tokenisation.
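To make the three options a bit more concrete, here is a minimal sketch in plain Python. The vocabulary `TOY_VOCAB` and the function names are made up for illustration (this is not BERT's real vocabulary), and the sub-word splitter is a simplified greedy longest-match-first procedure in the style of WordPiece, not the exact algorithm any particular library uses.

```python
# Toy illustration only: this vocabulary is hypothetical, not BERT's.
TOY_VOCAB = {"i", "think", "ne", "##uro", "##science", "is", "great", "[UNK]"}


def word_tokenize(text):
    """Word-based: split on whitespace; unseen words cannot be represented."""
    return text.lower().split()


def char_tokenize(text):
    """Character-based: every non-space character is its own token."""
    return [ch for ch in text.lower() if not ch.isspace()]


def subword_tokenize(word, vocab=TOY_VOCAB):
    """Greedy longest-match-first split of a single word (WordPiece-style)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # nothing matched: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces


print(word_tokenize("I think neuroscience is great"))
print(char_tokenize("great"))
print(subword_tokenize("neuroscience"))
```

With this toy vocabulary, a word that is not stored whole can still be represented as long as it can be built from known pieces, while a word with no matching pieces falls back to `[UNK]`.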

Different models are trained with different tokenisation schemes, meaning that if you want to represent units of text in a specific way (word-level, character-level, byte-level) you are typically constrained to use a specific model. In the same vein, if you want to use a specific model, or run a comparison with another study that used a specific model, then you are constrained to use the same tokenisation scheme. This affects how text embeddings are derived and could have implications for your project analyses. It's mostly something to be aware of, as you will certainly come across this concept when dealing with language models.

## Getting text embeddings
* Get input text
* Put it through a tokeniser
* Take one set of information from the tokeniser and pass to the model
* Model outputs -> text embeddings
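The steps above can be sketched with a toy stand-in before we touch a real model. Everything here is hypothetical (the tiny `vocab`, the random `embedding_table`, the 4-dimensional vectors); it only shows the flow from text to ids to vectors. A real model would additionally pass the looked-up vectors through its Transformer layers.

```python
import random

# Hypothetical toy vocabulary and embedding table (not BERT's real ones).
vocab = {"[UNK]": 0, "i": 1, "think": 2, "brains": 3, "are": 4, "great": 5}
EMBED_DIM = 4  # BERT-base uses 768; 4 keeps the printout readable

rng = random.Random(0)  # fixed seed so the toy table is reproducible
embedding_table = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in vocab]


def tokenize(text):
    """Split the text and map each token to its vocabulary index."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in text.lower().split()]


def embed(input_ids):
    """Look up one vector per token id."""
    return [embedding_table[i] for i in input_ids]


input_ids = tokenize("I think brains are great")
embeddings = embed(input_ids)
print(input_ids)
print(len(embeddings), len(embeddings[0]))  # number of tokens, embedding size
```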

Let's get some text embeddings from the `BERT` model, which some class as the first "LLM" (though by later standards it doesn't seem that large), more of a L(?)LM.

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

In [None]:
sentence = "I think neuroscience and machine learning is great!"
tokenized_sentence = tokenizer(sentence)

In [None]:
print(tokenized_sentence.keys())
print(tokenized_sentence['input_ids'])

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
[101, 146, 1341, 24928, 11955, 22274, 1105, 3395, 3776, 1110, 1632, 106, 102]


These indices are lookups into the predefined vocabulary for BERT. BERT uses a subword tokenizer, so it sits in between word-level and character-level and captures many common letter sequences that often do not have meaning by themselves. Let's see what happens if we map these back to words based on the tokenization.

In [None]:
for index in tokenized_sentence['input_ids']:
    print(tokenizer.decode([index]))

[CLS]
I
think
ne
##uro
##science
and
machine
learning
is
great
!
[SEP]


Two things to notice here:
* The addition of [CLS] and [SEP] tokens to the input
* Subword-splitting

BERT was originally trained on two tasks: masked-word prediction and next-sentence prediction (during training, the model had to predict whether the second of two sentences really followed the first). Negative examples were made by randomly replacing the real continuation of a sentence in a long text with a random (but coherent and grammatical) sentence from another part of the text. The [SEP] token marks the boundary between sentences; it is a carry-over from this training process and can be ignored (but beware, the model will return an embedding for it). The [CLS] token is a class token: its final-layer embedding acts as a representation of the whole sentence rather than of an individual word (the word embeddings are also accessible), and a linear layer followed by a softmax on top of it can be used to classify sentences. These extra tokens have the indices 101 and 102, and you will always see them with BERT unless you ask for them not to be added by specifying `add_special_tokens=False`.

In [None]:
sentence = "I think neuroscience and machine learning is great!"
tokenized_sentence = tokenizer(sentence, add_special_tokens=False)
tokenized_sentence['input_ids']

[146, 1341, 24928, 11955, 22274, 1105, 3395, 3776, 1110, 1632, 106]
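If you keep the special tokens, the model returns two extra embeddings (one for [CLS], one for [SEP]) that you usually want to drop before lining tokens up with words. Here is a small hand-rolled sketch of that bookkeeping. The helper names are hypothetical, and the ids 101 and 102 are simply the ones we saw in the output above for this tokenizer; other models use different ids.

```python
CLS_ID, SEP_ID = 101, 102  # ids of [CLS] and [SEP] seen in the output above


def build_input_ids(token_ids, add_special_tokens=True):
    """Mimic by hand how a single sentence is wrapped as [CLS] ... [SEP]."""
    if add_special_tokens:
        return [CLS_ID] + list(token_ids) + [SEP_ID]
    return list(token_ids)


def strip_special_tokens(input_ids):
    """Drop [CLS]/[SEP] so that only the word-piece ids remain."""
    return [i for i in input_ids if i not in (CLS_ID, SEP_ID)]


plain = [146, 1341, 24928, 11955, 22274, 1105, 3395, 3776, 1110, 1632, 106]
wrapped = build_input_ids(plain)
print(wrapped)
print(strip_special_tokens(wrapped) == plain)
```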

The second thing is the demonstration that the word `neuroscience` has been split up into three separate parts:
* ne
* ##uro
* ##science

The double-# prefix signals that a token is the continuation of a longer word that has been split up. Words that are not very common in the corpus the model was trained on are broken into smaller pieces that are in the vocabulary, and this is what lets the model process arbitrary new text. When you go from indices back to text, the tokeniser knows to join the pieces together, so you get your normal sentence back.

In [None]:
tokenizer.decode([101, 146, 1341, 24928, 11955, 22274, 1105, 3395, 3776, 1110, 1632, 106, 102])

'[CLS] I think neuroscience and machine learning is great! [SEP]'
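For your projects you will often need to know *which* pieces belong to which word, for example to line up embeddings with word-level brain data. The sketch below does this by hand using the `##` convention. `group_wordpieces` is a hypothetical helper written for this notebook, not a HuggingFace function, and it assumes BERT-style WordPiece tokens with special tokens already removed.

```python
def group_wordpieces(tokens):
    """Group WordPiece tokens into words.

    Returns a list of (word, indices) pairs, where `indices` are the
    positions of the pieces that make up that word.
    """
    words = []
    for position, token in enumerate(tokens):
        if token.startswith("##") and words:
            # Continuation piece: glue it onto the previous word
            word, indices = words[-1]
            words[-1] = (word + token[2:], indices + [position])
        else:
            words.append((token, [position]))
    return words


tokens = ["I", "think", "ne", "##uro", "##science", "and", "machine",
          "learning", "is", "great", "!"]
for word, indices in group_wordpieces(tokens):
    print(word, indices)
```

The index lists tell you which rows of the model output to combine (for example by averaging) if you want a single vector per word.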

The actual language model (in this case, BERT) has an initial embedding for each of the subword tokens in its vocabulary. The `input_ids` select these from the look-up table, and the inputs are then processed by the model to get final-layer embeddings. Let's load the BERT model and get some text embeddings.

In [None]:
from transformers import BertModel

# Use the same checkpoint as the tokenizer so the token ids index the right vocabulary
model_name = 'bert-base-cased'
model = BertModel.from_pretrained(model_name)

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Beware here that `tokenized_sentence` is a dictionary containing multiple pieces of information. We want the `input_ids`. However, we need to put them in a batch if only using a single sentence (neural networks often accept batches and the requirement is there even if you are only giving the model a single sample). That's why we put `tokenized_sentence['input_ids']` in a `list` before converting to a PyTorch `tensor`. The default model type in HuggingFace is PyTorch, which is why we need to convert the `input_ids` into this format.

In [None]:
import torch

input_ids = torch.tensor([tokenized_sentence['input_ids']])
output = model(input_ids)

print(output.keys())

odict_keys(['last_hidden_state', 'pooler_output'])


What is returned to us is a dictionary with two keys:
* `last_hidden_state`
* `pooler_output`

`pooler_output` is the final layer's representation of the first token (normally the [CLS] token we mentioned earlier), after being put through another linear layer and activation function. This is for classification purposes and likely isn't relevant to you (but it might be, so it's worth explaining). Let's look more at `last_hidden_state`.

In [None]:
output['last_hidden_state'].shape

torch.Size([1, 11, 768])

The first dimension represents the batch size we put in (remember, we only had one sentence). The last dimension (768) is the embedding dimension of BERT: every token in the output has a 768-dimensional vector associated with it. The middle dimension is the number of tokens in our tokenized input.

In [None]:
len(tokenized_sentence['input_ids']) == output['last_hidden_state'].shape[1]

True

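When the special tokens are included, the first element of `output['last_hidden_state']` is the 'raw' value of the [CLS] token, in case you were wondering, and the `pooler_output` value is the same embedding after additional processing through a Dense layer which has been pretrained to be good for classification. Note that here we tokenized with `add_special_tokens=False`, so the first position holds the token `I` rather than [CLS].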

What if you wanted more than just the top layer embeddings?

In that case, when you load the model, you need to add in a `config` data structure and specify this with the `output_hidden_states=True` setting, like so:

In [None]:
from transformers import BertConfig

config = BertConfig.from_pretrained(model_name, output_hidden_states=True)
model = BertModel.from_pretrained(model_name, config=config)
output = model(input_ids)

print(output.keys())

odict_keys(['last_hidden_state', 'pooler_output', 'hidden_states'])


Notice now we have a **third** key in the output, namely, `hidden_states`.

In [None]:
len(output['hidden_states'])

13

BERT-base has 12 Transformer layers. The initial embeddings of the `input_ids` are also included (at index 0), which gives 13 layers' worth of information.

In [None]:
torch.all(output['last_hidden_state'] == output['hidden_states'][12])

tensor(True)

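As you can see here, the top layer of `hidden_states` is the same thing that we get ordinarily in the `last_hidden_state` key of the model outputs. Let's look at `hidden_states[5]` (the output of the 5th Transformer layer, since index 0 holds the initial embeddings) and the embedding at token position 10 (the last of our 11 tokens, `!`, because positions count from 0). We will print the first 25 values of the 768-dimensional embedding.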

In [None]:
output['hidden_states'][5][0][10][:25]

tensor([-0.2270, -0.0707,  0.0318, -0.3179,  0.9179,  0.2117,  0.4920, -0.1177,
        -0.0364, -0.5063,  0.3525,  1.3593, -0.0255,  0.3841, -0.4087,  0.0650,
        -0.2806, -0.8488,  0.0529,  1.3787,  0.5879, -0.3346,  1.0472, -0.2409,
         0.9801], grad_fn=<SliceBackward0>)

That is how you get text embeddings for a sentence. You can average them together, concatenate them, or pick the first/middle/last. The choice is up to the researcher and depends on the task at hand.
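Here is a minimal sketch of those pooling options. To keep it dependency-free, plain nested lists stand in for the model output (with the real thing you could convert one sentence's embeddings with something like `output['last_hidden_state'][0].detach().tolist()`). The function names and the toy 2-dimensional vectors are made up for illustration.

```python
def mean_pool(token_vectors):
    """Average the token vectors dimension by dimension."""
    n = len(token_vectors)
    return [sum(dim_values) / n for dim_values in zip(*token_vectors)]


def concat_pool(token_vectors):
    """Concatenate the token vectors into one long vector."""
    return [value for vector in token_vectors for value in vector]


# Three toy 2-dimensional "token embeddings" (real BERT vectors have 768 dims).
toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

print(mean_pool(toy))    # one vector with the same size as a single token
print(concat_pool(toy))  # one vector of size (number of tokens x dims)
print(toy[0], toy[-1])   # first / last token vectors
```

Averaging gives a fixed-size vector regardless of sentence length, whereas concatenation grows with the number of tokens, which matters if you want to compare sentences of different lengths.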

## Don't have to stop at BERT

And if you want the GPT-2 embeddings for the same sentence, it's as easy as this:

In [None]:
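from transformers import GPT2Tokenizer, GPT2Model, GPT2Config

# Load the pretrained 'gpt2' config and ask for every layer's hidden states
config = GPT2Config.from_pretrained('gpt2', output_hidden_states=True)
model = GPT2Model.from_pretrained('gpt2', config=config)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Tokenise the same sentence, add a batch dimension and run the model
gpt2_tokenized_sentence = tokenizer(sentence)
outputs = model(torch.tensor([gpt2_tokenized_sentence['input_ids']]))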

Downloading model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

In [None]:
outputs.keys()

odict_keys(['last_hidden_state', 'past_key_values', 'hidden_states'])

In [None]:
len(outputs['hidden_states'])

13

In [None]:
gpt2_tokenized_sentence

{'input_ids': [40, 892, 39738, 290, 4572, 4673, 318, 1049, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [None]:
outputs['hidden_states'][4].shape

torch.Size([1, 9, 768])

GPT-2 still keeps a 768-dimensional numerical representation of each `token`, but there are only 9 tokens here (compared with 11 for BERT). Check out the tokenised sentence representation above. Since GPT-2 uses a different tokeniser from BERT, the original sentence is split up differently.

In [None]:
for token in gpt2_tokenized_sentence['input_ids']:
    print(tokenizer.decode(token))

I
 think
 neuroscience
 and
 machine
 learning
 is
 great
!


In this case, it didn't split up any words at all. BERT has a vocabulary of roughly 30,000 tokens, while GPT-2 has 50,257. That extra capacity to model common words means every word in this sentence has its own token (notice that the leading space is part of each token). GPT-2 still falls back to splitting more unusual terms that are not likely to be found abundantly in its training data:

In [None]:
sentence2 = "CMPUT-624 is a great class and best office hours are in the Athabasca building!"
sentence2_tkn = tokenizer(sentence2)
for token in sentence2_tkn['input_ids']:
    print(tokenizer.decode(token))

C
MP
UT
-
6
24
 is
 a
 great
 class
 and
 best
 office
 hours
 are
 in
 the
 Ath
ab
asca
 building
!


* CMPUT-624 -> `[C, MP, UT, -, 6, 24]`
* Athabasca -> `[Ath, ab, asca]`

That's enough to get started with!

## Good luck with your projects!

## What I've Left Out

* Padding
* Dealing with batches
* How to deal with multiple sub-word embeddings for a single input word

Check out the HuggingFace beginner course below for more info on these. Ask me if you get stuck on any particular issue you're facing.
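To give a flavour of what padding and batching involve, here is a hand-rolled sketch. In practice the tokenizer does this for you (for example `tokenizer(list_of_sentences, padding=True, return_tensors="pt")`). The helper `pad_batch` is hypothetical, and `PAD_ID = 0` is an assumption about BERT's [PAD] token; check `tokenizer.pad_token_id` for the model you use.

```python
PAD_ID = 0  # assumption: id of BERT's [PAD] token; check tokenizer.pad_token_id


def pad_batch(batch_ids, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(list(ids) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 146, 1341, 102], [101, 1632, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The attention mask is the piece to remember: if you pad, pass the mask to the model as well, and leave the padded positions out when you average embeddings.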


## Resources

* [HuggingFace website](https://huggingface.co/)
* [HuggingFace course](https://huggingface.co/learn/nlp-course/chapter1/1)
* [Natural Language Processing with Transformers (O'Reilly Book)](https://transformersbook.com/)

I have a copy of the NLP with Transformers book, if you would like to refer to it for your course projects. Speak to me after class if you would like to borrow it.

# End of Section 3