# Overview of Large Language Models (LLMs)

### What are LLMs?

Large Language Models (LLMs) are a class of machine learning models designed to understand and generate human-like text. Models such as GPT-4, Llama-3, BERT, and T5 are built with deep learning techniques, primarily the Transformer architecture. Because they are trained on vast amounts of text data, LLMs can handle a range of natural language processing (NLP) tasks, including text completion, translation, and summarization.

### Real-World Applications

- **Text Generation**: Creating coherent and contextually relevant text for chatbots, virtual assistants, and content creation.
- **Machine Translation**: Translating text from one language to another with high accuracy.
- **Text Classification**: Categorizing text into predefined labels, useful in sentiment analysis, spam detection, etc.
- **Question Answering**: Providing answers to user queries based on context.

### Evolution of LLMs

- **Early Models (Pre-2017)**: RNNs and LSTMs dominated the NLP landscape but struggled with long-range dependencies.
- **Transformers (2017 Onwards)**: Vaswani et al. introduced the Transformer architecture in the now-famous paper "Attention Is All You Need", which addressed many limitations of earlier models.
- **Recent Advances**: Large-scale pre-training, fine-tuning, and specialized architectures for specific tasks.

### Challenges and Opportunities

- **Challenges**: High computational cost, energy consumption, biases in training data, interpretability.
- **Opportunities**: Natural-language interfaces to applications, faster completion of labor-intensive tasks.


### Overview: Fine-Tuning a Pretrained Model Using Hugging Face

Let's walk through the usual steps of fine-tuning a model.

In [1]:
# imports
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments


In [2]:
# Load dataset
# We'll use the IMDb dataset from Hugging Face
dataset = load_dataset("imdb")

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [3]:
# Select the predefined train and test splits
train_dataset = dataset['train']
test_dataset = dataset['test']

In [4]:
# Load pretrained tokenizer
# Using a BERT-based model for sequence classification
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [5]:
# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)

train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]
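
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a minimal pure-Python sketch of the idea. It is not the tokenizer's actual implementation: the helper name `pad_or_truncate`, the pad id of 0, the example token ids, and the tiny `max_length` are assumptions chosen for illustration, and special-token handling is ignored.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    # Truncation: drop everything beyond max_length.
    ids = list(token_ids[:max_length])
    # The attention mask marks real tokens with 1 and padding with 0.
    mask = [1] * len(ids)
    # Padding: fill the remainder with pad_id.
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# Hypothetical token ids for a short and a long input.
short_ids, short_mask = pad_or_truncate([101, 2023, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 2023, 3185, 2001, 2307, 102], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Padding every review to one fixed length is simple but spends compute on pad tokens for short reviews. Padding each batch only to its longest example (for instance with `DataCollatorWithPadding`) is a common alternative.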

In [6]:
# Remove unnecessary columns
train_dataset = train_dataset.remove_columns(["text"])
test_dataset = test_dataset.remove_columns(["text"])

In [7]:
# Set format for PyTorch
train_dataset.set_format("torch")
test_dataset.set_format("torch")

In [8]:
# Load pretrained model
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Define training arguments and trainer
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # evaluate after every epoch; newer transformers releases rename this to eval_strategy
    num_train_epochs=2,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_dir="./logs",
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
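
The "newly initialized" warning above is expected: the `bert-base-uncased` checkpoint contains no classification head, so `classifier.weight` and `classifier.bias` start from random values and are learned during fine-tuning. As a rough illustration, assuming the head is a single linear layer from BERT-base's hidden size of 768 to `num_labels=2` outputs, the sketch below counts the new parameters and shows how two logits turn into class probabilities. The logit values are made up.

```python
import math

hidden_size = 768   # hidden size of BERT-base
num_labels = 2      # negative / positive review

# A linear layer has one weight per (input, output) pair plus one bias per output.
new_params = hidden_size * num_labels + num_labels
print(new_params)


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([-1.2, 2.3])  # hypothetical logits for one review
predicted_label = probs.index(max(probs))
print(probs, predicted_label)
```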


In [9]:
# Fine-tune the model
trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.2898        | 0.355888        |
| 2     | 0.1529        | 0.272766        |


TrainOutput(global_step=6250, training_loss=0.25062741455078125, metrics={'train_runtime': 1754.025, 'train_samples_per_second': 28.506, 'train_steps_per_second': 3.563, 'total_flos': 1.3155552768e+16, 'train_loss': 0.25062741455078125, 'epoch': 2.0})
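
The `global_step=6250` in this output follows from the dataset size and batch size. The back-of-the-envelope check below assumes a single device and no gradient accumulation; neither is stated in the notebook, but both are consistent with the reported numbers.

```python
import math

train_examples = 25_000      # size of the IMDb train split
batch_size = 8               # per_device_train_batch_size
epochs = 2                   # num_train_epochs
train_runtime = 1754.025     # seconds, taken from the TrainOutput

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_examples * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

The rounded throughput values match the `train_samples_per_second` and `train_steps_per_second` entries reported by the trainer.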

In [10]:
# Evaluate the model
trainer.evaluate()

{'eval_loss': 0.27276554703712463,
 'eval_runtime': 212.0401,
 'eval_samples_per_second': 117.902,
 'eval_steps_per_second': 14.738,
 'epoch': 2.0}
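
`trainer.evaluate()` reports only the loss here because no metric function was passed to the `Trainer`. One common way to also get accuracy is to supply a `compute_metrics` callable. The sketch below is one possible version, written in plain Python so it runs without extra dependencies; the function body and the toy logits are illustrative assumptions, not part of the original notebook.

```python
def compute_metrics(eval_pred):
    """Compute accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p) == int(y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}


# Toy check with made-up logits for four reviews.
toy_logits = [[2.0, -1.0], [0.1, 0.4], [-0.5, 1.5], [1.2, 0.3]]
toy_labels = [0, 1, 0, 0]
print(compute_metrics((toy_logits, toy_labels)))
```

Passed as `Trainer(..., compute_metrics=compute_metrics)`, the returned value would show up in the evaluation output under the key `eval_accuracy`.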

## Conclusion

In this notebook, we briefly introduced the concept of Large Language Models, their applications, and their evolution. We also walked through the usual steps of fine-tuning a pretrained BERT model for sequence classification on the IMDb dataset with the Hugging Face `Trainer`: loading the data, tokenizing it, training, and evaluating.

In [11]:
# Shut down the kernel
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)

{'status': 'ok', 'restart': False}