<a href="https://colab.research.google.com/github/ResByte/llm-notebooks/blob/main/notebooks/00-LLM-101.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLM 101

## Abstract (generated from Perplexity)

Large Language Models (LLMs) are deep learning models designed to process and understand vast amounts of natural language data. They are built on neural network architectures, particularly the transformer architecture, which allows them to capture complex language patterns and relationships between words or phrases in large-scale text datasets.


Key resources on LLMs:
- For a basic understanding of GPT and LLMs in general: https://jalammar.github.io/illustrated-gpt2/
- Google Colab notebook for training LLMs with BnB: https://colab.research.google.com/drive/1VoYNfYDKcKRQRor98Zbf2-9VQTtGJ24k?usp=sharing#scrollTo=jq0nX33BmfaC



Common NLP tasks:
- **classify whole sentences**: determine whether a sentence is grammatically correct, or whether two sentences are logically related
- **classify each word in a sentence**: identify the grammatical components of a sentence
- **generate text content**: complete a prompt with auto-generated text
- **extract an answer from a text**: given a question and a context, extract the answer to the question from the context
- **generate a new sentence from an input text**: translate a text into another language or summarize it



## Embeddings

Embeddings are data that have been transformed into n-dimensional numeric vectors. They are used to:
- transform multimodal inputs into a representation that deep learning models can consume
- compress information for storage, search, or other machine learning tasks
- create an embedding space for the data that supports transfer learning or generalization to other domains
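
As a minimal sketch of the search use case, the snippet below ranks words by cosine similarity between their embedding vectors. The three-dimensional vectors and the `toy_embeddings` name are made up for illustration; real embeddings come from a trained model and have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
toy_embeddings = {
    "bird": [0.9, 0.1, 0.3],
    "sparrow": [0.8, 0.2, 0.4],
    "invoice": [0.1, 0.9, 0.0],
}

# Rank every other word by its similarity to the query word.
query = toy_embeddings["bird"]
ranked = sorted(
    (word for word in toy_embeddings if word != "bird"),
    key=lambda word: cosine_similarity(query, toy_embeddings[word]),
    reverse=True,
)
print(ranked)  # ['sparrow', 'invoice']
```

Vectors that point in similar directions score close to 1, which is why nearest-neighbour search over embeddings can retrieve related items.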

In [1]:
!pip install -q transformers

In [3]:
import torch
from transformers import BertTokenizer, BertModel

In [4]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [5]:
text = """Hold fast to dreams, for if dreams die, life is a broken-winged bird that cannot fly"""

In [6]:
tokenized_text = tokenizer.tokenize(text)
tokenized_text

['hold',
 'fast',
 'to',
 'dreams',
 ',',
 'for',
 'if',
 'dreams',
 'die',
 ',',
 'life',
 'is',
 'a',
 'broken',
 '-',
 'winged',
 'bird',
 'that',
 'cannot',
 'fly']
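
In the output above the text is lowercased and every word is already in the `bert-base-uncased` vocabulary, so no word is split into sub-word pieces. BERT's tokenizer uses WordPiece, which breaks an unknown word into the longest vocabulary pieces it can find and marks continuation pieces with `##`. The sketch below shows that greedy longest-match idea only; `toy_vocab` is a deliberately tiny, hypothetical vocabulary chosen to force splits, and the function is not the library's implementation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary (the real one has about 30,000 entries).
toy_vocab = {"dream", "##s", "wing", "##ed", "bird", "fly"}

print(wordpiece("dreams", toy_vocab))  # ['dream', '##s']
print(wordpiece("winged", toy_vocab))  # ['wing', '##ed']
print(wordpiece("xyz", toy_vocab))     # ['[UNK]']
```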

## Working with Transformers

Source: https://huggingface.co/learn/nlp-course/chapter1/3?fw=pt

In [None]:
from tqdm.auto import tqdm
from transformers import pipeline

In [None]:
classifier = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [None]:
classifier('India is winning the world cup 2023')

[{'label': 'POSITIVE', 'score': 0.9997451901435852}]

A pipeline runs three steps:
1. the text is pre-processed into a format the model can understand
2. the pre-processed inputs are passed to the model
3. the model's predictions are post-processed into a human-readable form
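
The third step can be sketched in plain Python: a softmax turns the model's raw scores (logits) into probabilities, and the highest one is reported with its label. The logits and the `id2label` mapping below are assumptions for illustration, not values read from an actual model run.

```python
import math

# Assumed label mapping for a binary sentiment model (illustrative).
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

def postprocess(logits):
    """Turn raw logits into the {'label', 'score'} form shown above."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical logits, not taken from a real model run.
result = postprocess([-4.0, 4.3])
print(result)
```

A large gap between the two logits is what produces a score very close to 1, like the one in the classifier output above.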

### Zero-shot Text Classification

Here, a set of candidate labels is passed to the pipeline along with the input sentence, and the model classifies the sentence into one of the labels without any fine-tuning.

In [None]:
classifier = pipeline('zero-shot-classification')
classifier(
    "India is winning the world cup 2023",
    candidate_labels=['education', 'sports', 'politics']
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.

{'sequence': 'India is winning the world cup 2023',
 'labels': ['sports', 'politics', 'education'],
 'scores': [0.986595630645752, 0.008057733997702599, 0.00534663675352931]}

In [None]:
# what happens when sentiment is also added
classifier(
    "India is winning the world cup 2023",
    candidate_labels=['education', 'sports', 'politics', 'positive', 'negative']
)

{'sequence': 'India is winning the world cup 2023',
 'labels': ['sports', 'positive', 'negative', 'politics', 'education'],
 'scores': [0.62184077501297,
  0.36308449506759644,
  0.006626029033213854,
  0.005078704562038183,
  0.0033699285704642534]}
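
The drop in the `sports` score from about 0.99 to about 0.62 is easier to read once you notice that the scores are normalized across the candidate labels. The zero-shot pipeline uses a natural language inference model: each label is turned into a hypothesis sentence (the default template is `"This example is {}."`), and by default the entailment scores are normalized so that they sum to 1. The sketch below rebuilds those hypotheses and checks the sums using the scores printed above; `build_hypotheses` is a hypothetical helper, not part of the library.

```python
def build_hypotheses(labels, template="This example is {}."):
    """One NLI hypothesis per candidate label (illustrative helper)."""
    return [template.format(label) for label in labels]

print(build_hypotheses(["education", "sports", "politics"]))

# Scores copied from the two outputs above.
scores_3 = [0.986595630645752, 0.008057733997702599, 0.00534663675352931]
scores_5 = [
    0.62184077501297,
    0.36308449506759644,
    0.006626029033213854,
    0.005078704562038183,
    0.0033699285704642534,
]

# Both sets sum to (approximately) 1, so adding labels redistributes the mass.
total_3 = sum(scores_3)
total_5 = sum(scores_5)
print(round(total_3, 6), round(total_5, 6))
```

Because the labels compete for the same probability mass, adding `positive` takes score away from `sports` even though the sentence is no less about sports. Passing `multi_label=True` to the classifier call should score each label independently instead.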