# Introduction to BERT and GPT

In [1]:
!pip install transformers torch -q

# Using BERT

[Wikipedia on BERT](https://en.wikipedia.org/wiki/BERT_(language_model))

BERT is a transformer-based model that uses only the encoder part of the transformer architecture, which makes it specialized for understanding and representing text rather than generating it. It is widely regarded as a foundation model because it is pretrained on large corpora and can be fine-tuned for a wide variety of downstream tasks that depend on context, such as document classification, or used to generate embeddings for semantic representations.

* BERT (Bidirectional Encoder Representations from Transformers) relies only on the transformer encoders, not the decoder.

* The focus is on deep bidirectional processing, capturing context from both directions within text, enabling sophisticated contextual understanding.

### **Applicability: Context, Classification, Embeddings**
* BERT excels at context understanding due to its bidirectional training and attention mechanisms, vastly improving tasks like coreference resolution and polysemy disambiguation.

* It is widely used for document and sentence classification by applying simple classifiers to its context-aware embeddings (often using the [CLS] token representation).

* BERT generates powerful embeddings that capture rich semantic information, making these representations valuable for clustering, retrieval, and downstream language tasks.
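To make the retrieval use case concrete, here is a minimal sketch of ranking sentences by cosine similarity between their embeddings. The 4-dimensional vectors are hypothetical stand-ins chosen for illustration (real `bert-base-uncased` vectors have 768 dimensions), so only the mechanics carry over, not the numbers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; in practice these would come from the model.
embeddings = {
    "The cat sat on the mat.": [0.9, 0.1, 0.0, 0.2],
    "A kitten rested on the rug.": [0.8, 0.2, 0.1, 0.3],
    "Stock prices fell sharply.": [0.0, 0.9, 0.8, 0.1],
}

query = "The cat sat on the mat."

# Rank every other sentence by similarity to the query, most similar first.
ranked = sorted(
    (s for s in embeddings if s != query),
    key=lambda s: cosine_similarity(embeddings[query], embeddings[s]),
    reverse=True,
)
print(ranked[0])
```

With these toy vectors the kitten sentence ranks first, because its vector points in nearly the same direction as the query's.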


## Masked Language Modeling (MLM)

Masked Language Models (MLMs) are a type of machine learning model designed to predict missing or "masked" words in a sentence. These models are trained on large datasets of text where certain words are intentionally hidden during training.
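The hiding step can be sketched in a few lines of plain Python. This is a simplified illustration, not BERT's actual preprocessing: the helper name `mask_tokens` is hypothetical, whitespace splitting stands in for a real tokenizer, and every selected token is simply replaced with `[MASK]` (BERT's published recipe selects about 15% of tokens and sometimes swaps in a random token or leaves the original instead).

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens.

    Returns the masked sequence and a dict mapping each masked
    position to the original token the model must recover.
    """
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels[i] = tok  # training target for this position
        else:
            masked.append(tok)
    return masked, labels

tokens = "the capital of france is paris .".split()
# A higher mask_prob than usual so a short sentence is likely to get a mask.
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

During training, the model sees `masked` as input and is scored on how well it predicts the tokens stored in `labels`.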

In [2]:
from transformers import pipeline

# --- BERT for Masked Language Modeling ---
print("--- BERT Fill-Mask Demo ---")
unmasker = pipeline('fill-mask', model='bert-base-uncased')

--- BERT Fill-Mask Demo ---


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cuda:0


The model predicts words that could fill the [MASK] token.


In [3]:
result = unmasker("The capital of France is [MASK].")
result

[{'score': 0.41678905487060547,
  'token': 3000,
  'token_str': 'paris',
  'sequence': 'the capital of france is paris.'},
 {'score': 0.0714166909456253,
  'token': 22479,
  'token_str': 'lille',
  'sequence': 'the capital of france is lille.'},
 {'score': 0.06339260190725327,
  'token': 10241,
  'token_str': 'lyon',
  'sequence': 'the capital of france is lyon.'},
 {'score': 0.04444749280810356,
  'token': 16766,
  'token_str': 'marseille',
  'sequence': 'the capital of france is marseille.'},
 {'score': 0.03029714524745941,
  'token': 7562,
  'token_str': 'tours',
  'sequence': 'the capital of france is tours.'}]

Let's understand what BERT is doing:

* __Context is Everything__: BERT doesn't just guess a common word. It uses the entire context of the sentence: "The capital of France is...". This context strongly suggests the answer should be a city, specifically the capital city of France. This is the core innovation of transformer models like BERT: self-attention lets the representation of each token draw on every other word in the sequence.

* __It's More Than Memorization__: A key point to emphasize is why the other predictions (lille, lyon, marseille) are so significant. The fact that BERT's next-best guesses are other major French cities shows that it has learned a semantic relationship. It understands the category of word that should be there (a city in France), not just a single memorized fact.

* __Probabilistic Predictions__: The scores demonstrate that BERT operates on probabilities. It's not giving a single "correct" answer but rather a ranked list of what it considers most likely based on the patterns in its vast training data (like Wikipedia and books). The high score for 'paris' indicates that the association between "capital of France" and "paris" is extremely common in the data it was trained on.

* '`score`': This is the probability the model assigns to that specific token being the correct word for the [MASK] position. Notice that 'paris' has a score of ~0.42, which is substantially higher than the next best guess (~0.07).
* '`token`': This is the unique ID for the predicted word in BERT's vocabulary. For example, the word 'paris' is represented by the integer 3000.
* '`token_str`': This is the human-readable string corresponding to the token ID.

* '`sequence`': This shows the complete sentence with the [MASK] token filled in by the predicted word.
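These fields are ordinary Python dictionaries, so they are easy to work with. The sketch below copies the five predictions from the output above (scores rounded to four decimals, `sequence` omitted for brevity) and checks how much probability the top five cover. The variable names are illustrative.

```python
# Top-5 predictions copied from the recorded output (scores rounded).
result = [
    {"score": 0.4168, "token": 3000, "token_str": "paris"},
    {"score": 0.0714, "token": 22479, "token_str": "lille"},
    {"score": 0.0634, "token": 10241, "token_str": "lyon"},
    {"score": 0.0444, "token": 16766, "token_str": "marseille"},
    {"score": 0.0303, "token": 7562, "token_str": "tours"},
]

# The highest-scoring entry is the model's best guess.
best = max(result, key=lambda r: r["score"])

# Scores are probabilities over the whole vocabulary, so the top 5
# do not need to sum to 1.
top5_mass = sum(r["score"] for r in result)

print(f"best guess: {best['token_str']} ({best['score']:.1%})")
print(f"probability covered by the top 5: {top5_mass:.1%}")
print(f"left for the rest of the vocabulary: {1 - top5_mass:.1%}")
```

Roughly a third of the probability mass falls outside the top five, which is a reminder that the list is a ranking of candidates and not a single answer.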

For the next sentence, the model will likely predict nouns such as 'book', 'paper', or 'article', which shows that it has picked up on the grammatical role of the masked position.

In [4]:
result = unmasker("I am reading a [MASK] about machine learning.")
result

[{'score': 0.9153188467025757,
  'token': 2338,
  'token_str': 'book',
  'sequence': 'i am reading a book about machine learning.'},
 {'score': 0.02375485748052597,
  'token': 3259,
  'token_str': 'paper',
  'sequence': 'i am reading a paper about machine learning.'},
 {'score': 0.01992184668779373,
  'token': 2843,
  'token_str': 'lot',
  'sequence': 'i am reading a lot about machine learning.'},
 {'score': 0.0052698408253490925,
  'token': 16432,
  'token_str': 'textbook',
  'sequence': 'i am reading a textbook about machine learning.'},
 {'score': 0.004395367577672005,
  'token': 2466,
  'token_str': 'story',
  'sequence': 'i am reading a story about machine learning.'}]

For a medical sentence, predictions will likely be 'medication', 'drug', 'prescription', etc.

In [5]:
result = unmasker("The doctor prescribed the [MASK] to the patient.")
result


[{'score': 0.16983750462532043,
  'token': 4319,
  'token_str': 'drug',
  'sequence': 'the doctor prescribed the drug to the patient.'},
 {'score': 0.15705649554729462,
  'token': 7709,
  'token_str': 'procedure',
  'sequence': 'the doctor prescribed the procedure to the patient.'},
 {'score': 0.05311083048582077,
  'token': 3949,
  'token_str': 'treatment',
  'sequence': 'the doctor prescribed the treatment to the patient.'},
 {'score': 0.048807937651872635,
  'token': 14667,
  'token_str': 'medication',
  'sequence': 'the doctor prescribed the medication to the patient.'},
 {'score': 0.030730772763490677,
  'token': 5970,
  'token_str': 'surgery',
  'sequence': 'the doctor prescribed the surgery to the patient.'}]

In [6]:
result = unmasker("The judge delivered the [MASK] to the defendant.")
result

[{'score': 0.21729233860969543,
  'token': 14392,
  'token_str': 'verdict',
  'sequence': 'the judge delivered the verdict to the defendant.'},
 {'score': 0.10090971738100052,
  'token': 2553,
  'token_str': 'case',
  'sequence': 'the judge delivered the case to the defendant.'},
 {'score': 0.06320745497941971,
  'token': 8689,
  'token_str': 'judgment',
  'sequence': 'the judge delivered the judgment to the defendant.'},
 {'score': 0.032562173902988434,
  'token': 6251,
  'token_str': 'sentence',
  'sequence': 'the judge delivered the sentence to the defendant.'},
 {'score': 0.03079116903245449,
  'token': 3350,
  'token_str': 'evidence',
  'sequence': 'the judge delivered the evidence to the defendant.'}]

## BERT for a Downstream Task (Sentiment Analysis)
Creating a `sentiment-analysis` pipeline (an alias for `text-classification`) backed by a fine-tuned BERT-family model.

In [7]:
# --- BERT for a Downstream Task: Sentiment Analysis ---
print("--- BERT Sentiment Analysis Demo ---")
from transformers import pipeline

# This pipeline uses a model fine-tuned for sentiment analysis
classifier = pipeline('sentiment-analysis')

# Example 1: Positive Review
result_pos = classifier("I've been waiting for a course like this my whole life. It's amazing!")
print(result_pos)

# Example 2: Negative Review
result_neg = classifier("This movie was so boring. I fell asleep halfway through.")
print(result_neg)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


--- BERT Sentiment Analysis Demo ---


Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.9998832941055298}]
[{'label': 'NEGATIVE', 'score': 0.9997774958610535}]


Notice it downloaded `distilbert-base-uncased-finetuned-sst-2-english`.

* __Fine-Tuning__: This isn't the base __BERT__ model. It's __DistilBERT__, a smaller distilled variant of BERT, which was pre-trained for general language understanding and then fine-tuned on SST-2, a dataset of movie-review sentences labeled positive or negative.

* __Output__: The output is straightforward: a label ('`POSITIVE`' or '`NEGATIVE`') and a score indicating the model's confidence in that label.

# Using GPT

GPT models are based on the decoder-only architecture of the transformer, making them fundamentally different from BERT, which uses only the encoder stack. GPT (Generative Pre-trained Transformer) is designed for text generation: it predicts the next token in a sequence based on all the previous tokens, which is known as autoregressive or causal language modeling.

* GPT exclusively leverages the transformer decoder blocks, processing input in a unidirectional (left-to-right) manner so that each predicted token is conditioned only on previous tokens.

* A causal attention mask hides future context from the model, so it generates output by sequentially predicting the next token.

### **Causal Language Modeling** (or autoregressive text generation)
There are two types of language modeling, __causal__ and __masked__:

* __Causal language modeling__ predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left. This means the model cannot see future tokens. GPT-2 is an example of a causal language model.

* __Masked language modeling__ predicts masked tokens in a sequence, and the model can attend to tokens on both the left and the right. BERT is an example of a masked language model.

The two models used in this notebook sit on opposite sides of that split:

* __BERT__ is an "__Encoder__" (a text understander): Its main goal is to build a rich, deep understanding of a given piece of text. It's bidirectional, meaning it looks at words both to the left and right of a `[MASK]` token to make the best guess. This makes it ideal for tasks like classification, sentiment analysis, and fill-in-the-blank.
* __GPT__ is a "__Decoder__" (a text generator): Its main goal is to generate new text that plausibly follows from a prompt. It is autoregressive, meaning it works one token at a time, from left to right. To predict the next token, it can only look at the tokens that came before it.
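The difference in what each model is allowed to "see" can be pictured as an attention mask. The following sketch builds both kinds of mask with plain Python lists. The function name `attention_mask` is hypothetical, and real implementations apply the mask to attention scores inside the model rather than printing it.

```python
def attention_mask(seq_len, causal):
    """Row i lists which positions token i may attend to (1 = visible, 0 = hidden)."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

tokens = ["the", "cat", "sat", "down"]

for name, causal in [("BERT-style (bidirectional)", False),
                     ("GPT-style (causal)", True)]:
    print(name)
    for tok, row in zip(tokens, attention_mask(len(tokens), causal)):
        print(f"  {tok:>5}: {row}")
```

In the bidirectional mask every row is all ones. In the causal mask row `i` has ones only up to column `i`, so "cat" can see "the" and itself but not "sat" or "down".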



In [8]:
# --- GPT-2 for Text Generation ---

print("--- GPT-2 Text Generation Demo ---")

generator = pipeline('text-generation', model='gpt2')

--- GPT-2 Text Generation Demo ---


Device set to use cuda:0


In [9]:
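prompt = "In a world where AI could write stories, the first tale it told was about"
# Note: as the warning below reports, the pipeline's default `max_new_tokens` (256)
# takes precedence over `max_length=50`, so the output runs well past 50 tokens.
generated_text = generator(prompt, max_length=50, num_return_sequences=1)
print(generated_text[0]['generated_text'])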

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


In a world where AI could write stories, the first tale it told was about a girl in a wheelchair who accidentally got stuck in a car. In reality, the girl had been using her own story to help her find her first job. The story begins with her getting stuck in the car, and it ends with her becoming a school bus driver.

But the problem is, it all looks a lot more like an elaborate story for a different kind of story. That's because it's not. In fact, the story is quite simple.

Advertisement

As the story goes on, it's clear that this girl from a wheelchair is struggling to find a job, to find a job that will pay her. This girl has to learn from her past mistakes, learn how to make a good life for herself, and work hard to get a better job.

The problem is, the story is very simple for the story itself. It's just a story about a girl in a wheelchair who accidentally got stuck in a car.

In fact, the story is rather simple for the story itself. It's just a story about a girl in a wheelcha

## Probabilistic Nature

The model is making a series of probabilistic choices, leading to different creative paths.
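A toy version of one such choice is sketched below. The next-token distribution is hypothetical (it is not real GPT-2 output), and the helper name `sample_next_token` is illustrative. The point is only that sampling from the same probabilities can yield different tokens on different draws, while always taking the most likely token cannot.

```python
import random

# Hypothetical next-token probabilities for some prompt (not real GPT-2 output).
next_token_probs = {"a": 0.40, "the": 0.30, "an": 0.15, "love": 0.10, "Ayn": 0.05}

def sample_next_token(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(42)  # seeded so the sketch is repeatable

# Ten independent draws from the same distribution.
samples = [sample_next_token(next_token_probs, rng) for _ in range(10)]
print(samples)

# Greedy decoding, by contrast, always picks the single most likely token.
greedy = max(next_token_probs, key=next_token_probs.get)
print(greedy)
```

A generated text repeats this draw once per token, so small differences early on compound into entirely different stories.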

In [10]:
# Generate 3 different story beginnings from the same prompt
generated_texts = generator(prompt, max_length=50, num_return_sequences=3)

for i, text in enumerate(generated_texts):
    print(f"--- Option {i+1} ---")
    print(text['generated_text'])
    print("\n")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


--- Option 1 ---
In a world where AI could write stories, the first tale it told was about a man being chased by a bird, and it was the story of a man living under the shade of an ice-melting tree.

I watched the story unfold on TV, and I knew it must be a great story, but I couldn't believe it was actually a story. I didn't know what they thought about the story, and I was always amazed to see how poorly they thought about it. I don't know what they thought about the story, but I knew that they were going to do something amazing with it.

A few years ago, I was sitting in my living room when an editor of the magazine asked me to come in on the show to talk about the next story we were thinking about. I thought, "What about the story you write about when I'm in a car accident?" And he asked me how they were going to get it done. So I told him that's what I wanted to talk about and he said "Well, I'll just tell you a story about a story I wrote as a kid."

And then I realized that I was

## Control the "Creativity" with Temperature

Temperature rescales the model's next-token scores before a token is sampled.

* __Low temperature__ (e.g., `0.5`): The distribution is sharpened, so the model behaves more confidently and conservatively, often choosing the most likely words. The text will be more predictable and less creative.

* __High temperature__ (e.g., `1.5`): The distribution is flattened, increasing the probability of less common words. This can lead to more "creative" or surprising text, but also increases the chance of it being nonsensical.
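The effect is easy to see numerically. The sketch below applies the standard temperature-scaled softmax (divide the logits by the temperature, then normalize) to three hypothetical logits. The values are made up for illustration and are not GPT-2 outputs.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize with softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens

for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

At `0.5` most of the probability concentrates on the top token, while at `1.5` it spreads more evenly across all three, which is why higher temperatures produce more surprising text.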

In [11]:
# More predictable, "boring" output
print("--- Low Temperature (More Predictable) ---")
low_temp_text = generator(prompt, max_length=50, temperature=0.5)
print(low_temp_text[0]['generated_text'])

print("\n" + "="*30 + "\n")

# More creative, "weirder" output
print("--- High Temperature (More Creative) ---")
high_temp_text = generator(prompt, max_length=50, temperature=1.5)
print(high_temp_text[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


--- Low Temperature (More Predictable) ---



In a world where AI could write stories, the first tale it told was about a young man who was sent to the hospital when he was attacked by a swarm of ants. The story was told in a world where the human brain can't read. The story was told in a world where humans can't read.

That's a pretty strong message. But it also means that we may have an even more powerful AI than we thought.

It's a good thing that we have an AI, because it's the only thing we need to keep an eye on.

Advertisement

Advertisement

Advertisement

Advertisement

Advertisement

Advertisement

Advertisement

Advertisement

Advertisement

Advertisement

Advertisement

Advertisement


--- High Temperature (More Creative) ---
In a world where AI could write stories, the first tale it told was about Ayn Rand. When she had her own novel written, it seemed inevitable she'd write one. Today, though, most people think about Ayn and are surprised not to learn why she went on doing so to the same degree that she was writing. 

## Generative Models for a Structured Task (Summarization)
The default model for the `summarization` pipeline is technically an __encoder-decoder__ model, not a GPT-style decoder-only model. It is a hybrid that uses an encoder (like BERT) to "read" and understand the source text and a decoder (like GPT) to "write" the summary.

In [12]:
# --- A Generative Task: Summarization ---
print("--- Text Summarization Demo ---")
from transformers import pipeline

summarizer = pipeline("summarization")

long_text = """
Large language models (LLMs) are a type of artificial intelligence (AI) model
that can generate human-like text. They are trained on massive datasets of text and
code, and they can be used for a variety of tasks, including text generation,
translation, and summarization. BERT, developed by Google, is a powerful encoder
model that learns deep bidirectional representations. In contrast, the GPT series
from OpenAI are decoder-only models, known for their impressive text generation
capabilities. These models are the foundation of many modern NLP applications.
"""

summary = summarizer(long_text, max_length=45, min_length=20, do_sample=False)
summary

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


--- Text Summarization Demo ---


Device set to use cuda:0


[{'summary_text': ' Large language models (LLMs) are a type of artificial intelligence (AI) model that can generate human-like text . They can be used for a variety of tasks, including text generation, translation, and'}]

* __Generative but Constrained__: Unlike the free-form text generation earlier, this task is about generating text that is a condensed representation of the input.

* __Encoder-Decoder__ Architecture: The model first `encodes` the full meaning of the long article and then `decodes` that meaning into a shorter summary.

# Model Bias
From the experiment below, think about:
* What are the real-world consequences if a model with these biases is used to build a resume-sorting tool for a hiring system?
* As AI developers and users, what is our responsibility here? How could we potentially fix this?
  * Data Curation
  * Bias Detection
  * Model Fine-Tuning

In [29]:
# --- A Critical Look at Model Bias ---
print("--- Uncovering Model Bias Demo ---")
from transformers import pipeline

unmasker = pipeline('fill-mask', model='bert-base-uncased')

# The two prompts differ only in the pronoun, which can reveal gender bias
prompt_he = "My manager he is a [MASK]."
prompt_she = "My manager she is a [MASK]."

print(f"Predictions for: '{prompt_he}'")
print([res['token_str'] for res in unmasker(prompt_he)])

print("\n" + "="*30 + "\n")

print(f"Predictions for: '{prompt_she}'")
print([res['token_str'] for res in unmasker(prompt_she)])




--- Uncovering Model Bias Demo ---


Device set to use cuda:0


Predictions for: 'My manager he is a [MASK].'
['genius', 'fool', 'man', 'friend', 'professional']


Predictions for: 'My manager she is a [MASK].'
['genius', 'model', 'woman', 'beauty', 'virgin']
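One very simple form of bias detection is to compare the two prediction lists directly. The sketch below copies the lists from the output above and separates the shared predictions from the pronoun-specific ones. It is only a first look: a single prompt template and a top-5 cutoff are assumptions of this example, and a serious audit would need many templates and a proper metric.

```python
# Predictions copied from the recorded output above.
preds_he = ["genius", "fool", "man", "friend", "professional"]
preds_she = ["genius", "model", "woman", "beauty", "virgin"]

# Words predicted for both prompts, and words unique to each pronoun.
shared = [w for w in preds_he if w in preds_she]
only_he = [w for w in preds_he if w not in preds_she]
only_she = [w for w in preds_she if w not in preds_he]

print("shared:  ", shared)
print("only he: ", only_he)
print("only she:", only_she)
```

Only 'genius' appears in both lists. The words unique to "she" lean toward appearance and sexuality, while 'professional' appears only for "he", even though the prompts differ by a single pronoun.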
