# Welcome to Machine Learning - Week 11
Instructor - Daniel Wiesenfeld

# Foundational Models, LLMs, and APIs

## Foundational Models
A foundational model is a large-scale, pre-trained model that serves as a base for a wide range of downstream tasks. These models are trained on vast amounts of data and are designed to understand and generate human language, as well as other types of data. Foundational models are characterized by their general-purpose capabilities and are fine-tuned or adapted for specific tasks or domains.

### Key Characteristics of Foundational Models:
1) **Pre-trained on Large Datasets:** They are trained on extensive and diverse datasets, which allows them to learn a wide range of patterns and knowledge.
2) **General-purpose:** They can be used as a base for various tasks without needing to be trained from scratch for each new task. This enables zero-shot inference, few-shot inference, and in-context learning.
3) **Scalable:** They typically have a large number of parameters, making them capable of handling complex and nuanced data.
4) **Adaptable:** They can be fine-tuned or adapted to perform specific tasks, often with much less data and computational resources than required for training from scratch.

### Examples of Foundational Models:
* **GPT-3:** A generative language model that can perform tasks such as text completion, translation, and question answering.
* **BERT:** A model designed for understanding the context of words in a sentence, useful for tasks like sentiment analysis and named entity recognition.
* **T5:** A model that converts all NLP tasks into a text-to-text format, making it highly versatile.
* **CLIP:** A model that embeds images and text in a shared space, enabling tasks like zero-shot image classification and image-text retrieval. It is also used as a component in text-to-image generation systems.

### Use Cases:
* **Language Understanding:** Sentiment analysis, named entity recognition, and text classification.
* **Language Generation:** Text completion, summarization, and translation.
* **Multi-modal Tasks:** Combining text and images for tasks like captioning and visual question answering.

### Training Process:
* **Pre-training:** The model is trained on a large corpus of data, typically using self-supervised learning techniques, to capture general patterns and knowledge.
* **Fine-tuning:** The pre-trained model may be further trained on task-specific data using supervised learning techniques to adapt it for specific applications.

### Key Concepts in Self-Supervised Learning for Foundational Models:
#### Self-Supervised Learning:
* **Definition:** A form of unsupervised learning where the model is trained to predict parts of the input data from other parts. This creates a form of supervision without the need for manually labeled data.
* **Examples:** Masked Language Modeling (MLM), Autoregressive Modeling, and Next Sentence Prediction (NSP).

**Pre-training Tasks:**

#### Masked Language Modeling (MLM):
* Used in models like BERT.
* Randomly masks some tokens in the input and trains the model to predict the masked tokens.
* **Example:** For the input "The capital of [MASK] is Paris", the model learns to predict "France".
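
The masking step can be sketched in plain Python. This is a simplified illustration, not BERT's actual preprocessing: the function name `mask_tokens`, the whitespace tokenization, and the `mask_prob` and `seed` parameters are assumptions made for the example, and real MLM pipelines also sometimes substitute random tokens or leave selected tokens unchanged.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels are None where no prediction is needed."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)    # the model must recover the original token here
        else:
            masked.append(tok)
            labels.append(None)   # this position is ignored by the loss
    return masked, labels

tokens = "The capital of France is Paris".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

The labels come from the text itself, which is why no manual annotation is needed.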

#### Autoregressive Modeling:
* Used in models like GPT.
* Trains the model to predict the next token in a sequence given the previous tokens.
* **Example:** For the input "The capital of France is", the model learns to predict "Paris".
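
As a rough sketch of how one sentence yields many next-token training examples, the snippet below pairs every prefix with the token that follows it. The helper name `next_token_pairs` and the whitespace tokenization are assumptions for illustration; real models operate on subword tokens and train on all positions in parallel.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs: each prefix predicts the token after it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = "The capital of France is Paris".split()
for context, target in next_token_pairs(tokens):
    print(" ".join(context), "->", target)
```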

#### Sequence-to-Sequence Modeling:
* Used in models like T5.
* Converts input sequences into output sequences, enabling tasks like translation and summarization.
* **Example:** For the input "Translate English to French: Hello", the model learns to output "Bonjour".
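
The text-to-text idea can be illustrated by formatting different tasks as plain input and target strings. This is only a sketch: `to_text_to_text` is a hypothetical helper, and the article and summary strings are placeholders.

```python
def to_text_to_text(task_prefix, source, target):
    """Express any task as a pair of strings: prefixed input text and target text."""
    return {"input": f"{task_prefix}: {source}", "target": target}

examples = [
    to_text_to_text("Translate English to French", "Hello", "Bonjour"),
    to_text_to_text("Summarize", "<long article text>", "<short summary>"),  # placeholder strings
]
for ex in examples:
    print(ex["input"], "=>", ex["target"])
```

Because every task shares this shape, one model and one training objective can cover all of them.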

### Benefits of Self-Supervised Learning:

* **Utilizes Large Unlabeled Datasets:** Can leverage vast amounts of text data available on the internet without the need for manual labeling.
* **Learns Rich Representations:** Captures complex patterns and structures in the data, which can be fine-tuned for specific tasks.
* **Scalability:** Can be scaled to very large models and datasets, enhancing the model's generalization capabilities.

## How Foundational Models are Used

### Zero-Shot Inference
**Definition:** Zero-shot inference refers to the ability of a model to perform tasks without any explicit training on the specific task or dataset.

**Mechanism:**
* Leverages the model's pre-trained knowledge.
* Utilizes natural language prompts (and/or other modalities) to specify tasks.

**Examples:**
* GPT-3 answering trivia questions without being explicitly trained on trivia datasets.
* CLIP identifying objects in images based on descriptive text.
* Instructing a model to act as a classifier
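
To make the last example concrete, here is a minimal sketch of a zero-shot classification prompt. The wording of the prompt and the helper name `zero_shot_prompt` are assumptions; the resulting string would be sent to whichever instruction-following model is in use.

```python
def zero_shot_prompt(text, labels):
    """Describe the task in natural language only; no solved examples are included."""
    label_list = ", ".join(labels)
    return (
        f"Classify the text into one of these categories: {label_list}.\n"
        f"Text: {text}\n"
        "Category:"
    )

prompt = zero_shot_prompt(
    "I have a problem with my iphone that needs to be resolved asap!",
    ["urgent", "not urgent"],
)
print(prompt)
```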

### Few-Shot Inference
**Definition:** Few-shot inference allows a model to perform tasks by being provided with a small number of examples.

**Mechanism:**
* Utilizes a few examples to understand the task.
* Examples serve as context for the model to generate responses or predictions.

**Examples:**
* GPT-3 writing a story when given a few example sentences.
* BERT performing text classification with a few labeled examples.
* An image model classifying a new type of object after being shown a few images of the object
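
For language models, the "few examples" are typically placed directly in the prompt. The sketch below assembles such a prompt; the `Review:`/`Sentiment:` template and the helper name `few_shot_prompt` are illustrative assumptions, not a required format.

```python
def few_shot_prompt(examples, query):
    """Put a handful of solved examples before the new input so the model can infer the task."""
    blocks = [f"Review: {text}\nSentiment: {label}\n" for text, label in examples]
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n".join(blocks)

examples = [
    ("Great battery life and a sharp screen.", "positive"),
    ("Stopped working after two days.", "negative"),
]
prompt = few_shot_prompt(examples, "Setup was quick and painless.")
print(prompt)
```

The model's parameters are unchanged; swapping the examples is enough to redefine the task.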

### In-Context Learning
**Definition:** In-context learning allows models to learn tasks from context provided in the input, rather than through explicit parameter updates. Both zero-shot and few-shot inference rely on in-context learning.

**Mechanism:**
* The model uses the instructions and any examples provided in the input prompt to infer the task and generate responses accordingly.
* The prompt can also instruct the model on how to behave and what to output or avoid outputting.
* The field of "prompt engineering" focuses on optimizing prompts to yield the desired outputs.

### Fine-Tuning
**Definition:** Fine-tuning is the process of taking a pre-trained model and continuing to train it for a specific task. This makes small changes to the model weights that improve the model's performance on that task.

**Purpose:**
* Leverages the general knowledge already captured by the model during pre-training.
* Requires significantly less data and computational resources compared to training a model from scratch.

**Applications:**
Sentiment analysis, text classification, named entity recognition, translation, image classification, and more.

The two approaches compare as follows.

**In-Context Learning:**
* Does not update model parameters.
* Relies on examples provided at inference time.
* Flexible and quick for new tasks.

**Fine-Tuning:**
* Involves updating model parameters using task-specific training data.
* Typically requires more data and computational resources than in-context learning.
* Provides more tailored and potentially higher performance for specific tasks.
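
The idea of "continuing to train" from existing weights can be shown with a toy model. The sketch below fine-tunes a one-feature linear model `y = w*x + b` by gradient descent; the starting weights, the four data points, and the hyperparameters are all hypothetical values chosen for illustration.

```python
def mse(w, b, data):
    """Mean squared error of y = w*x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def fine_tune(w, b, data, lr=0.05, epochs=200):
    """Continue gradient descent from the given starting weights."""
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

pretrained_w, pretrained_b = 2.0, 0.0   # hypothetical "pre-trained" weights
task_data = [(0.0, 0.5), (1.0, 2.7), (2.0, 4.9), (3.0, 7.1)]   # small task-specific dataset

w, b = fine_tune(pretrained_w, pretrained_b, task_data)
print("error before:", mse(pretrained_w, pretrained_b, task_data))
print("error after: ", mse(w, b, task_data))
```

Unlike in-context learning, the weights returned here differ from the starting ones, and the change persists across later inputs.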

### Fine-Tuning Techniques
**Full Fine-Tuning:** Adjusts all the parameters of the model.
* Pros: High adaptability and potentially better performance.
* Cons: Computationally expensive and risk of overfitting.

**Partial Fine-Tuning:** Adjusts only a subset of the model parameters (e.g., last layer or a few layers).
* Pros: Less computational resources and faster training.
* Cons: Limited adaptability.

**Parameter Efficient Fine-Tuning (PEFT):**
* Techniques like adapter layers, BitFit, and QLoRA (quantization combined with Low-Rank Adaptation).
* Adapter Layers
    - Adding small trainable layers between pre-trained layers.
* BitFit
    - Fine-tuning only the bias terms in the model.
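
To give a feel for the difference in scale between these strategies, the sketch below counts trainable parameters for a small hypothetical model. The parameter names and shapes are invented for illustration, and the strategy labels (`full`, `last_layer`, `bitfit`) are just keys used by this example.

```python
# Hypothetical model: parameter name -> number of values it holds.
params = {
    "layer1.weight": 768 * 768,
    "layer1.bias": 768,
    "layer2.weight": 768 * 768,
    "layer2.bias": 768,
    "classifier.weight": 768 * 3,
    "classifier.bias": 3,
}

def trainable(params, strategy):
    """Count the parameters that would be updated under each strategy."""
    if strategy == "full":
        keep = lambda name: True
    elif strategy == "last_layer":
        keep = lambda name: name.startswith("classifier.")
    elif strategy == "bitfit":
        keep = lambda name: name.endswith(".bias")
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return sum(count for name, count in params.items() if keep(name))

total = sum(params.values())
for strategy in ("full", "last_layer", "bitfit"):
    n = trainable(params, strategy)
    print(f"{strategy:>10}: {n:>9,} trainable ({100 * n / total:.2f}% of {total:,})")
```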

### Retrieval-Augmented Generation (RAG)
**Definition:** Retrieval-Augmented Generation (RAG) enhances model performance by incorporating external information retrieval during inference.

**Mechanism:** 
* Combines a retrieval model (e.g., Dense Passage Retrieval, DPR) with a generation model (e.g., BART, T5, or GPT).
* Retrieves relevant documents from a large corpus to provide additional context for the generation model.

**Applications:**
* Open-domain question answering.
* Fact-checking and information retrieval.

**Advantages:**
* Improves accuracy by leveraging external knowledge.
* Reduces reliance on the model's parameterized knowledge alone.
* Provides a workaround for limited context windows, since only the most relevant passages are passed to the model.
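
The retrieve-then-generate flow can be sketched without any model at all. In the toy version below, a bag-of-words vector stands in for a learned dense encoder, and the "generation" step stops at building the prompt that would be sent to a language model. The function names, the three-sentence corpus, and the prompt layout are all assumptions for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a real system would use a learned dense encoder."""
    return Counter(text.lower().replace("?", "").replace(".", "").split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, corpus, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:k]

def build_prompt(query, corpus, k=1):
    """Prepend the retrieved passages so the generator can ground its answer in them."""
    context = "\n".join(retrieve(query, corpus, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "Paris is the capital of France.",
    "The Nile is a river in Africa.",
    "Python is a programming language.",
]
print(build_prompt("What is the capital of France?", corpus))
```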

### Reinforcement Learning from Human Feedback (RLHF)

**Definition:** A method for aligning LLMs with human values and preferences by incorporating human feedback into the training process. 

**Use:** Used to fine-tune models for specific behaviors, ethical considerations, and instruction-following capabilities.

**RLHF Process:**
* Pre-training: Train a base model on a large dataset using self-supervised learning.
* Supervised fine-tuning: Adapt the model on human-written examples of the desired behavior.
* Reward modeling: Train a reward model to predict human preferences from feedback such as rankings of model outputs.
* Reinforcement learning: Optimize the model to maximize the reward model's output.
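
One common way to train the reward model is a pairwise preference loss: given two responses where humans preferred one, the loss is small when the preferred response receives the higher reward. The sketch below computes that loss for scalar rewards; the function name and the example reward values are assumptions, and a real reward model would produce these scores with a neural network.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written as log(1 + exp(-margin))."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))

print(preference_loss(2.0, -1.0))   # reward model agrees with the human preference
print(preference_loss(-1.0, 2.0))   # reward model disagrees, so the loss is larger
```

Minimizing this loss over many labeled pairs pushes the reward model to rank responses the way human raters do.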

**Applications:**
* Instruction following
* Content moderation
* Ethical AI

### Techniques for Model Adaptation

**Distillation**
* Compressing a large model into a smaller one while retaining performance
* Transfers knowledge from a large "teacher" model to a smaller "student" model.
* Example: DistilBERT
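
A typical distillation objective compares the teacher's and student's output distributions after softening both with a temperature. The sketch below computes that term for a single example; the logit values and the temperature of 2.0 are hypothetical, and in practice this term is usually combined with the ordinary loss on the true labels.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperatures give softer distributions."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) between the temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, 0.5]        # hypothetical teacher logits
good_student = [3.5, 1.2, 0.4]   # close to the teacher
poor_student = [0.5, 4.0, 1.0]   # prefers a different class

print(distillation_loss(teacher, good_student))
print(distillation_loss(teacher, poor_student))
```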

**Quantization**
* Reducing the precision of model weights (e.g., from 32-bit floats to 8-bit integers).
* Benefits: Smaller model size, faster inference.
* Types:
    - Static Quantization: Quantizes weights and activations ahead of time, using a calibration dataset to choose the activation ranges.
    - Dynamic Quantization: Quantizes weights ahead of time and quantizes activations on the fly during inference.
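
The core arithmetic of weight quantization fits in a few lines. The sketch below uses symmetric per-tensor quantization to 8-bit integers, which is one of several possible schemes; the function names and the sample weights are assumptions for illustration.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map the stored integers back to approximate float values."""
    return [v * scale for v in q]

weights = [0.42, -1.30, 0.07, 0.98, -0.55]   # hypothetical float weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)
print(max(abs(w - r) for w, r in zip(weights, restored)))   # worst-case rounding error
```

Each weight is stored as one small integer plus a shared scale, and the rounding error per weight is at most half the scale.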

**Low-Rank Adaptation (LoRA)**
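* Freezes the pre-trained weight matrices and learns a small low-rank update (the product of two narrow matrices) for each adapted layer.
* Reduces the number of trainable parameters and the memory needed for fine-tuning.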
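
As a minimal sketch of the arithmetic, LoRA keeps the pre-trained matrix W fixed and adds a scaled product of two small matrices, so the effective weight is W + (alpha / r) * B A. The pure-Python helpers and the tiny matrices below are illustrative assumptions; real implementations use tensor libraries and much larger shapes.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, a, b, alpha, r):
    """Effective weight W + (alpha / r) * B @ A; W stays frozen, only A and B are trained."""
    delta = matmul(b, a)            # (d_out x r) @ (r x d_in) -> d_out x d_in
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def lora_param_count(d_out, d_in, r):
    """Trainable values in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen pre-trained weight (hypothetical values)
B = [[0.0], [0.0]]             # d_out x r, initialised to zero
A = [[0.5, -0.5]]              # r x d_in

print(lora_weight(W, A, B, alpha=1, r=1) == W)   # with B = 0 the model starts unchanged
print(768 * 768, "frozen values vs", lora_param_count(768, 768, r=8), "trainable values")
```

For a 768 x 768 layer with rank 8, the update needs only a few percent of the values in the full matrix.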

In [1]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

# Load pre-trained model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Encode input text
input_text = "Once upon a time, in a galaxy far, far"
inputs = tokenizer(input_text, return_tensors="pt")

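# Generate text
# max_new_tokens counts only generated tokens (max_length also counts the prompt);
# one new token reproduces the output shown below, raise it for longer continuations.
# Passing the attention mask (via **inputs) and a pad token avoids generation warnings.
outputs = model.generate(**inputs, max_new_tokens=1, pad_token_id=tokenizer.eos_token_id)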
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Generated Text: {generated_text}")



Generated Text: Once upon a time, in a galaxy far, far away


Task names accepted by the `transformers` `pipeline` function (the exact list depends on the installed version):

['audio-classification', 'automatic-speech-recognition', 'conversational', 'feature-extraction', 'fill-mask', 'image-classification', 'image-segmentation', 'ner', 'object-detection', 'question-answering', 'sentiment-analysis', 'summarization', 'table-question-answering', 'text-classification', 'text-generation', 'text2text-generation', 'token-classification', 'translation', 'visual-question-answering', 'vqa', 'zero-shot-classification', 'zero-shot-image-classification', 'translation_XX_to_YY']

In [82]:
from transformers import pipeline

generator = pipeline(task = "text-generation", model="openai-community/gpt2")
generator("I can't believe", do_sample=False)[0]['generated_text']

"I can't believe I'm doing this. I'm so sorry. I'm so sorry. I'm so sorry. I'm so sorry. I'm so sorry. I'm so sorry. I'm so sorry. I'm so sorry. I"

In [79]:
generator("Once upon a time in a galaxy")[0]['generated_text']

'Once upon a time in a galaxy filled with aliens, a woman was found by an android called the Doctor. This woman had a child while he was with the Doctor, and had been living on a farm with friends. She was called her sister so'

In [83]:
from transformers import pipeline

classifier = pipeline('zero-shot-classification', model="roberta-large-mnli")
classifier("I have a problem with my iphone that needs to be resolved asap!",
    candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer"],
)

{'sequence': 'I have a problem with my iphone that needs to be resolved asap!',
 'labels': ['urgent', 'phone', 'tablet', 'computer', 'not urgent'],
 'scores': [0.5578488111495972,
  0.3911593556404114,
  0.033464979380369186,
  0.014779362827539444,
  0.002747497521340847]}

In [84]:
classifier(
    "I have a problem with my iphone that needs to be resolved asap!!",
    candidate_labels=["english", "german"],
)

{'sequence': 'I have a problem with my iphone that needs to be resolved asap!!',
 'labels': ['english', 'german'],
 'scores': [0.7661928534507751, 0.23380716145038605]}

In [85]:
from transformers import pipeline

# Load a sentiment-analysis pipeline, naming the model explicitly
# (this is the checkpoint the pipeline defaulted to in the original run)
sentiment = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

# Analyze sentiment of a text
sentiment("I love teaching machine learning!")

[{'label': 'POSITIVE', 'score': 0.9995545744895935}]

In [88]:
# Name the model explicitly (t5-base is the checkpoint the pipeline defaulted to in the original run)
en_fr_translator = pipeline("translation_en_to_fr", model="t5-base")
en_fr_translator("life is like a box of chocolates")

[{'translation_text': 'La vie est comme une boîte de chocolats'}]

In [97]:
# Note: the "conversational" pipeline and the Conversation class were deprecated and
# later removed from transformers, so these cells need a release that still includes them.
from transformers import pipeline, Conversation

# Name the model explicitly (DialoGPT-medium is the checkpoint the pipeline defaulted to in the original run)
convo = pipeline("conversational", model="microsoft/DialoGPT-medium")
convo([Conversation("life is like a box of chocolates")])

Conversation id: d06e87ec-0419-4539-91f7-103381ccf76e 
user >> life is like a box of chocolates 
bot >> I'm not sure if that's a good or a bad thing. 

In [109]:
convo([Conversation("Could you tell me how to cook rice?")])

Conversation id: db04c6c7-0c18-49f7-9205-ce910bc37834 
user >> Could you tell me how to cook rice? 
bot >> I'm not sure, but I think you can use a rice cooker. 

In [112]:
# Earlier turns are passed back in so the model sees the whole conversation
convo([Conversation(
    "Could I just use a plain old pot?",
    past_user_inputs=["Could you tell me how to cook rice?"],
    generated_responses=["I'm not sure, but I think you can use a rice cooker."],
)])

Conversation id: b112e600-eac9-426f-b636-c16ca3fd7208 
user >> Could you tell me how to cook rice? 
bot >> I'm not sure, but I think you can use a rice cooker. 
user >> Could I just use a plain old pot? 
bot >> I don't think so. I think you'd have to use a rice cooker. 

In [103]:
convo([Conversation("What should I have for breakfast?")])

Conversation id: f805a4e5-d815-44cd-a076-9ee1febb556d 
user >> What should I have for breakfast? 
bot >> A sandwich 

In [77]:
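# https://huggingface.co/alirezamsh/small100
# tokenization_small100.py is not part of transformers; it is assumed here to have been
# downloaded from the model repository above and placed next to this notebook.

from transformers import M2M100ForConditionalGeneration
from tokenization_small100 import SMALL100Tokenizer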

hi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।"
chinese_text = "生活就像一盒巧克力。"
hebrew_text = "החיים כמו קופסה של שוקולדים"
english_text = "life is like a box of chocolates"

model = M2M100ForConditionalGeneration.from_pretrained("alirezamsh/small100")
tokenizer = SMALL100Tokenizer.from_pretrained("alirezamsh/small100")

# translate Hindi to French
tokenizer.tgt_lang = "fr"
encoded_hi = tokenizer(hi_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_hi)
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
# => "La vie est comme une boîte de chocolat."

# translate Chinese to English
tokenizer.tgt_lang = "en"
encoded_zh = tokenizer(chinese_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_zh)
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
# => "Life is like a box of chocolate."

# translate Hebrew to English
tokenizer.tgt_lang = "en"
encoded_he = tokenizer(hebrew_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_he)
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
# => "Life is like a box of chocolate."

# translate English to Yiddish
tokenizer.tgt_lang = "yi"
encoded_en = tokenizer(english_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_en)
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
# => "איך בין געגאנגען צו צוקולד."

['La vie est comme une boîte de chocolat.']
['Life is like a box of chocolate.']
['Life is like a box of chocolate.']
['איך בין געגאנגען צו צוקולד.']


*Source: https://huggingface.co/tasks*

## Current Major Large Language Models:
https://www.techtarget.com/whatis/feature/12-of-the-best-large-language-models