<a href="https://colab.research.google.com/github/JennEYoon/llm-ml/blob/main/hugging/Hugging_Face_NLP_Course.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup

<b>Colab ( https://colab.research.google.com/ ) – simplest way to get started</b>

Colab: File→New notebook in Drive to create a new notebook

Example: In a code cell, type `!ls` to list the contents of the current directory, which might include sample_data.

<ul>Install the Libraries
<li>!pip install transformers to install the light version</li>
<li>!pip install transformers[sentencepiece] to install the full version (contains all the libraries needed for the course)</li>
</ul>


Test:
``` python
import transformers
print(transformers.__version__)

# Expect to see a version >= 4.42.4
```

In [None]:
!pip install transformers[sentencepiece]



In [None]:
import transformers

from IPython import display
from base64 import b64decode

print(transformers.__version__)

4.42.4


# Introduction

## What is NLP

NLP goal: to understand not only single words but also the context of those words

Transformers can also be used in other domains, such as computer vision (images).

Examples:
<ul>
  <li>Classifying Whole Sentences</li>
  <ul>
    <li>Sentiment of a review</li>
    <li>Detecting if an email is spam</li>
    <li>Determining whether a sentence is grammatically correct</li>
    <li>Determining whether two sentences are logically related</li>
  </ul>
  <li>Classifying each word in the sentence</li>
  <ul>
    <li>Identifying the grammatical components of a sentence (noun, verb, adjective)</li>
    <li>Identifying named entities ( person, location, organization )</li>
  </ul>
  <li>Generating text content</li>
  <ul>
    <li>Completing a prompt with auto-generated text</li>
    <li>Filling in the blanks with masked words</li>
  </ul>
  <li>Extracting an answer from a text</li>
  <ul>
    <li>Given a question and a context, extracting the answer to the question based on the information provided in the context</li>
  </ul>
  <li>Generating a new sentence from an input text</li>
  <ul>
    <li>Translating text into another language</li>
    <li>Summarizing text</li>
  </ul>
</ul>

## Locations

| URL | Description |
|-----|-------------|
|[Transformers GitHub Page](https://github.com/huggingface/transformers) | Provides functionality to use and create shared models |
|[Model Hub](https://huggingface.co/models)| Contains thousands of pretrained models that anyone can download and use |

## Transformers

### Pipeline Function

1. The highest-level API in the Transformers library; it groups all the steps needed to go from raw text to usable predictions

Major Steps:
<ol>
  <li>Preprocessing: converts all data into numbers</li>
  <li>Model</li>
  <li>Postprocessing: makes the output human-readable</li>
</ol>

``` python
# An example pipeline
# Based on the first parameter, the pipeline selects a pretrained model that has been fine-tuned for sentiment analysis in English
# Creating the classifier causes the model to be downloaded and cached
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")
```
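
The three major steps listed above can be sketched with a toy stand-in written in plain Python. This is not how the Transformers library implements `pipeline`; the vocabulary, the per-token scores, and the function names (`preprocess`, `model`, `postprocess`, `toy_pipeline`) are all made up for illustration.

``` python
# Toy stand-in for the three pipeline stages; NOT the Transformers implementation.
# The vocabulary and per-token scores below are made up for illustration.
import math

VOCAB = {"<unk>": 0, "i": 1, "love": 2, "hate": 3, "this": 4, "course": 5}
# Hypothetical "model weights": one (negative, positive) score pair per token ID.
TOKEN_SCORES = {0: (0.0, 0.0), 1: (0.0, 0.0), 2: (-2.0, 2.0),
                3: (2.0, -2.0), 4: (0.0, 0.0), 5: (0.0, 0.5)}
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Step 1: convert raw text into numbers (token IDs)."""
    words = text.lower().replace(".", "").replace("!", "").split()
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in words]


def model(token_ids):
    """Step 2: turn token IDs into raw scores (logits), one per label."""
    negative = sum(TOKEN_SCORES[t][0] for t in token_ids)
    positive = sum(TOKEN_SCORES[t][1] for t in token_ids)
    return [negative, positive]


def postprocess(logits):
    """Step 3: softmax the logits and report the best label in readable form."""
    exps = [math.exp(x) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    return postprocess(model(preprocess(text)))


print(toy_pipeline("I love this course!"))
print(toy_pipeline("I hate this."))
```

The real pipeline follows the same shape, with a tokenizer for step 1 and a neural network for step 2.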

### Inference API

&emsp; Models can be tested directly in your browser using the Inference API, which is available on the Hugging Face website.

&emsp; Play with a model directly on its page by entering custom text and watching the model process the input data.

### Examples of Available Pipelines
These pipelines are preconfigured for specific tasks, but a pipeline's behavior can be customized.

<b>I did not run the code yet</b>



In [None]:
# An example pipeline
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

In [None]:
# An example pipeline with a classifier given two sentences (a batch) -- returns two labels
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
classifier(["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"])

In [None]:
# Zero-shot classification -- lets you classify text that has not been labelled; the developer provides the candidate labels
# No need to fine-tune the model on your data to use it (useful for real-time tasks)
from transformers import pipeline
classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

In [None]:
# Text Generation
from transformers import pipeline
generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

In [None]:
# Use a different model, e.g. distilgpt2 (a lighter version of gpt2)
# max_length -- maximum length of the generated text
# num_return_sequences -- number of generated sequences returned
# Text generation involves randomness, so it is normal to get different results on other machines.
from transformers import pipeline
generator = pipeline("text-generation", model="distilgpt2")
generator("In this course, we will teach you how to", max_length=30, num_return_sequences=2)

In [None]:
# Filling in the blank: guessing the masked word in a sentence
# top_k=2 : ask for the two most likely values
# <mask> -- called the mask token. Other models might have a different mask token
from transformers import pipeline
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

In [None]:
# Classify each word in a sentence as a person, organization, location, and more
from transformers import pipeline
classifier = pipeline("ner")
classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")

In [None]:
# Example with grouped_entities=True -- makes the pipeline group together the different words linked to the same entity (e.g. Hugging Face)
from transformers import pipeline
classifier = pipeline("ner", grouped_entities=True)
classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")

In [None]:
# Question answering -- extracts the answer from the provided context and reports where in the context it was found
from transformers import pipeline
question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn.",
)

In [None]:
# Example of translating from one language to another
from transformers import pipeline
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

In [None]:
# Getting short summaries of articles
from transformers import pipeline
summarizer = pipeline("summarization")
summarizer("I am learning natural language processing which is a very interesting field")

## How Transformers Work

1. Transformer models are trained as language models on large amounts of raw text using self-supervised learning and transfer learning.<br>

Self Supervised Learning
<ul>
<li>
Based on a neural network; works by processing unlabeled data and automatically generating the associated labels, without human intervention. This method masks part of the training data and trains the model to identify the hidden data by analyzing the structure and characteristics of the unmasked data. The resulting labeled data is then used for the supervised learning stage.
</li>
<li>
Develops a statistical understanding of the language it has been trained on<br>
</li>
<li>
Usually done with a large amount of data. Good for when you don't have the computing or monetary resources to build an accurate model.
</li>
</ul>
<img src="https://drive.google.com/uc?id=188bbQN1Ntg3nh6zhHBoLrSwo6M2g8vaO"/>
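
The masking idea described above can be sketched in a few lines: the label is taken from the unlabeled text itself, so no human annotation is needed. This is a minimal illustration, not a real pretraining procedure; the `[MASK]` symbol, the helper name `make_masked_example`, and the two sample sentences are assumptions made for the example.

``` python
import random

MASK_TOKEN = "[MASK]"  # assumed mask symbol; real models define their own


def make_masked_example(sentence, rng):
    """Hide one word and return (masked input, label) with no human labelling."""
    words = sentence.split()
    position = rng.randrange(len(words))
    label = words[position]          # the label comes from the data itself
    masked = words.copy()
    masked[position] = MASK_TOKEN
    return " ".join(masked), label


rng = random.Random(0)  # fixed seed so the run is repeatable
corpus = [
    "transformers are trained on large amounts of raw text",
    "the model learns to predict the hidden word",
]
for sentence in corpus:
    masked_input, label = make_masked_example(sentence, rng)
    print(masked_input, "->", label)
```

Each printed pair is one automatically generated training example: the masked sentence is the input and the hidden word is the target.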

<hr>

Transfer Learning
<ul>
<li>
Uses the model from the self-supervised learning stage above, or a model downloaded from the Model Hub (Hugging Face). Using a pretrained language model also requires fewer computing and monetary resources.
</li>
<li>
The act of initializing a model with another model's weights: the weights from the self-supervised neural network are copied to the neural network for the application.
</li>
<li>
The training focuses on the specific task that the neural network should be built for.
</li>
</ul>
<img src="https://drive.google.com/uc?id=1HYcnRzt2nEmA2b5dlkAeysQ2pk8-2MSK" />
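
Initializing a model with another model's weights can be pictured with plain dictionaries. The sketch below is only an illustration: the weight names, the numbers, and the helper `init_from_pretrained` are hypothetical, and it assumes the common setup in which the pretraining head is dropped and a fresh task-specific head is added before training on the specific task.

``` python
import copy
import random

# Hypothetical pretrained weights; in practice these come from a checkpoint.
pretrained = {
    "embeddings": [0.12, -0.40, 0.33],
    "encoder": [0.05, 0.91, -0.27],
    "lm_head": [0.70, -0.10, 0.20],   # head used for the pretraining task
}


def init_from_pretrained(pretrained_weights, num_labels, rng):
    """Copy the shared body and attach a fresh, randomly initialised task head."""
    model = {
        name: copy.deepcopy(values)
        for name, values in pretrained_weights.items()
        if name != "lm_head"          # assumption: the pretraining head is not reused
    }
    model["classifier_head"] = [rng.uniform(-0.1, 0.1) for _ in range(num_labels)]
    return model


rng = random.Random(0)
task_model = init_from_pretrained(pretrained, num_labels=2, rng=rng)
print(sorted(task_model))  # ['classifier_head', 'embeddings', 'encoder']
```

Only the new head starts from random values; everything else starts from what the pretrained model already learned.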

<br>
<br>
To achieve better performance, the model's size and/or the amount of pretraining data should be increased, which can be costly in both money and compute resources.

## Transformer Architecture

Model is composed of two blocks
<ol>
  <li>Encoder -- optimized to acquire understanding from its input: it receives the input and builds a representation of it (its features)</li>
  <li>Decoder -- optimized for generating outputs: the decoder uses the encoder's representation (features) along with the words generated so far (chosen from the output probabilities) to generate a target sequence.</li>
</ol>

<img src="https://drive.google.com/uc?id=126iEZtv0QADavAQVmMLWVRLpAgXEKw0M" /><br>
The "Outputs (shifted right)" input is the current word that was selected from the output probabilities. Given the current word, a new word is selected from the "Output Probabilities".

Each of these parts can be used independently.
<ol>
<li>Encoder-only models -- sentence classification and named entity recognition</li>
<li>Decoder-only models -- for generative tasks such as text generation</li>
<li>Encoder-decoder models -- for generative tasks that require an input, such as translation and summarization</li>
</ol>

<img src="https://drive.google.com/uc?id=1qjXrkixyo2WjkCe1Mk9TI81w5iUYpBkp"></img>

### Attention Layers Introduction

A layer that tells the model to pay specific attention to certain words in the sentence when dealing with the representation of each word.

A word itself has meaning, but meaning is affected by the context ( other words in the sentence )
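
How context changes a word's representation can be sketched with a simplified form of scaled dot-product self-attention. This is a hedged illustration, not the layer used by any particular model: the 2-dimensional word vectors are made up, and the queries, keys, and values are the input vectors themselves, whereas a real attention layer first applies learned projections.

``` python
import math

# Made-up 2-dimensional vectors for three words; real models learn these.
words = ["the", "bank", "river"]
vectors = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]


def softmax(scores):
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Return (weights, outputs) for simplified scaled dot-product attention.

    Simplification: queries, keys and values are the input vectors themselves;
    a real attention layer first applies learned projections to get them.
    """
    dim = len(vectors[0])
    weights, outputs = [], []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted mix of all word vectors (the context).
        outputs.append([sum(w * v[d] for w, v in zip(row, vectors))
                        for d in range(dim)])
    return weights, outputs


weights, outputs = self_attention(vectors)
for word, row in zip(words, weights):
    print(word, [round(w, 2) for w in row])
```

Each row shows how much attention one word pays to every word in the sentence; the row always sums to 1.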

### Transformer Architecture

In the encoder the attention layers can use all the words in a sentence to understand the context.  

#### Attention Layers
The first type of attention layer resides in the decoder and pays attention to all past inputs to the decoder.

The second type of attention layer resides in the encoder and is bi-directional: it can use context from both the left and the right of each word.

Attention mask -- prevents the model from paying attention to some special words, e.g. the special padding word used to make all the inputs the same length.

<img src="https://drive.google.com/uc?id=1zLZcFbO8JkMD8P2l4YJLfhV9A3Hh8cNQ" />
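
The padding case mentioned above can be sketched as follows. This is a minimal illustration in plain Python; the padding ID of 0, the helper name `pad_batch`, and the token IDs are assumptions chosen for the example.

``` python
PAD_ID = 0  # assumed padding ID; each tokenizer defines its own


def pad_batch(batch):
    """Pad every sequence to the longest length and build the attention mask."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        padding = max_len - len(seq)
        input_ids.append(seq + [PAD_ID] * padding)
        # 1 = real token the model may attend to, 0 = padding to ignore
        attention_mask.append([1] * len(seq) + [0] * padding)
    return input_ids, attention_mask


batch = [[101, 7592, 2088, 102], [101, 7592, 102]]  # illustrative token IDs
input_ids, attention_mask = pad_batch(batch)
print(input_ids)       # [[101, 7592, 2088, 102], [101, 7592, 102, 0]]
print(attention_mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```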



#### Terms

|Terms|Definition|
|-----|----------|
|Architecture | Skeleton of the model (definition of each layer and each operation that happens within the model) |
|Checkpoints| Weights to be loaded into a given architecture |
| Model | Umbrella term for architecture or checkpoint |

BERT
<ol>
  <li>BERT is an architecture</li>
  <li>bert-base-cased -- a set of weights trained by Google (a checkpoint)</li>
  <li>"Model" can refer to either: the BERT model or the bert-base-cased model</li>
</ol>





## Encoder Models ( Auto Encoding Models )

Use only the encoder of the Transformer Model.

At each stage, the attention layers can access all the words in the initial sentence, so they have bi-directional attention.

Pretraining -- revolves around corrupting a given sentence (for example, by masking random words) and tasking the model with finding or reconstructing the initial sentence

Examples:
<ol>
  <li>ALBERT</li>
  <li>BERT</li>
  <li>DistilBERT</li>
  <li>Electra</li>
  <li>RoBERTa</li>
</ol>

### How does an Encoder work

1. Feature vector (or feature tensor): one sequence of numbers per input word, holding a representation of the word together with contextualization information (the meaning of the word within the text).

  The dimension of the vector is defined by the architecture of the language model.

  Example: for "to", the output would be a vector holding the numerical representation of "to" plus contextual information about the words around it.

2. Self-attention mechanism -- provides the contextualization information by relating different positions (different words) in a single sequence

### When should one use an encoder
<ul>
  <li>Bi-directional: context from the left and the right</li>
  <li>Good at extracting information</li>
  <li>Sentiment Analysis, Question Answering, Masked Language Modelling</li>
  <li>NLU: Natural Language Understanding -- extracts meaning from text and speech</li>
</ul>



## Decoders ( Auto Regressive Models )

Use masked self-attention -- for a given word, the attention layers can only access the words positioned on one side of it: before it (left context) or after it (right context), never both. Unlike the encoder, this is not bi-directional.

Pretraining usually revolves around predicting the next word in the sentence.

Auto-Regressive Models re-use their past outputs as inputs in the following steps.

<ol>
<li>CTRL</li>
<li>GPT</li>
<li>GPT-2</li>
<li>Transformer XL</li>
</ol>
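
The two ideas above, masked self-attention and re-using past outputs as inputs, can be sketched in plain Python. This is a toy under stated assumptions: the `NEXT_WORD` table is a hypothetical stand-in for a trained decoder and only looks at the last word, whereas a real decoder attends to every earlier word allowed by the mask.

``` python
# Hypothetical next-word table standing in for a trained decoder.
NEXT_WORD = {
    "<start>": "in",
    "in": "this",
    "this": "course",
    "course": "we",
    "we": "learn",
    "learn": "<end>",
}


def causal_mask(length):
    """Row i marks which positions word i may attend to: itself and earlier ones."""
    return [[1 if j <= i else 0 for j in range(length)] for i in range(length)]


def generate(max_new_words=10):
    """Greedy auto-regressive loop: each output is appended to the next input."""
    sequence = ["<start>"]
    for _ in range(max_new_words):
        next_word = NEXT_WORD.get(sequence[-1], "<end>")
        if next_word == "<end>":
            break
        sequence.append(next_word)   # the output becomes part of the input
    return sequence[1:]


print(causal_mask(3))   # [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
print(generate())       # ['in', 'this', 'course', 'we', 'learn']
```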

### When should a decoder be used?
<ol>
<li>Unidirectional: access to their left or right context</li>
<li>Causal Language Modeling -- Generating sequences</li>
<li>NLG ( Natural Language Generation )</li>
</ol>





## Sequence to Sequence Models

Uses both the encoder and decoder

Encoder -- Takes a sentence and encodes it into a feature vector.

Decoder -- Takes the feature vector and the current sequence word (previously generated by the decoder, or perhaps a null string at the start); its output is the next word generated by the decoder.

The encoder and decoder do not share weights (separation of components).
The encoder can focus on the sequence (parsing). The decoder can be specialized for a different language or even images.

What should an encoder/decoder be used for?
<ol>
<li>Translation Language Model (transduction) -- e.g. from English to French</li>
<li>Many-to-many translation</li>
<li>Summarization</li>
</ol>

Examples:
<ol>
  <li>BART</li>
  <li>mBART</li>
  <li>MarianMT</li>
  <li>T5</li>
</ol>

## Bias and Limitations

Pretrained models (models built by someone else) are usually trained on large amounts of data (both accurate and inaccurate). Most models are trained on data from the internet. Even BERT suffers from bias, and it was trained <b>only on</b> English Wikipedia and BookCorpus.

Be aware that a model may generate sexist, racist, or homophobic content. Fine-tuning the model on your data will not make this intrinsic bias disappear.

Human feedback (reinforcement learning) is really helpful with bias

GAN -- Look it up.

PPO - Proximal Policy Optimization<br>
DPO (Direct Preference Optimization) - Supervisor network -- training on whether a model's output is good or bad based on human feedback

Rank on inference

# Scribblings of a madman done while listening to the discussion.

RAG -- Need a human to see how well the LLM is doing. Different from fine-tuning (Reinforcement Learning with Human Feedback). Last piece of the workflow.

RAG -- Runs a different ML step (e.g. PDFs stored in a vector database): it looks through all the documents and finds the paragraph that is relevant. The documents are not a training source.
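
The "find the relevant paragraph" step can be sketched with a toy retriever. This is only an illustration of the idea, not a real RAG stack: the paragraphs are invented, the bag-of-words `embed` function stands in for a learned embedding model, and a plain list stands in for the vector database.

``` python
import math
from collections import Counter

# Illustrative "documents"; in practice these would be chunks from PDFs etc.
paragraphs = [
    "Invoices must be paid within thirty days of receipt.",
    "The encoder builds a representation of the input sentence.",
    "Employees accrue vacation days every month.",
]


def embed(text):
    """Toy embedding: a bag-of-words count vector (real systems use a model)."""
    return Counter(text.lower().replace(".", "").replace("?", "").split())


def cosine(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, docs):
    """Return the paragraph most similar to the question."""
    query = embed(question)
    return max(docs, key=lambda doc: cosine(query, embed(doc)))


question = "How many vacation days do employees get?"
context = retrieve(question, paragraphs)
# The retrieved text is placed in the prompt; the model itself is not retrained.
prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
print(context)
```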

Quantization to reduce the amount of carbon (research it)

GAN (research it)

A tensor has 3 or more dimensions

Neural net -- LSTM (Long Short-Term Memory) -- limited for large documents with a lot of context

Pass in the sentence for the encoder only.  The encoder is used during training

Textual to Contextual for LLM --

NoContext -- Fill in a mask, but in some preliminary material. Combine multiple sentences through a mask.

