# Overview of Large Language Models (LLMs)

### What are LLMs?

Large Language Models (LLMs) are a class of machine learning models designed to understand and generate human-like text. Models such as GPT-4, Llama-3, BERT, and T5 are deep neural networks built primarily on the Transformer architecture. They are trained on vast amounts of text data, which makes them capable of a range of natural language processing (NLP) tasks such as text completion, translation, and summarization.

### Real-World Applications

- **Text Generation**: Creating coherent and contextually relevant text for chatbots, virtual assistants, and content creation.
- **Machine Translation**: Translating text from one language to another with high accuracy.
- **Text Classification**: Categorizing text into predefined labels, useful in sentiment analysis, spam detection, etc.
- **Question Answering**: Providing answers to user queries based on context.

### Evolution of LLMs

- **Early Models (Pre-2017)**: RNNs and LSTMs dominated the NLP landscape but struggled with long-range dependencies.
- **Transformers (2017 Onwards)**: Vaswani et al. introduced the Transformer architecture in the now-famous paper "Attention Is All You Need", which addressed many limitations of earlier models.
- **Recent Advances**: Large-scale pre-training, fine-tuning, and specialized architectures for specific tasks.
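
To make the Transformer bullet above more concrete, here is a minimal sketch of scaled dot-product attention, the core operation described in "Attention Is All You Need". It uses plain Python lists and made-up 2-dimensional vectors; real models use batched tensors, multiple heads, and learned projections, none of which are shown here. The function names are illustrative, not taken from any library.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Single-head attention, softmax(Q K^T / sqrt(d_k)) V, on nested lists."""
    d_k = len(keys[0])
    dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)])
    return outputs

# Toy input: three token positions with made-up 2-dimensional embeddings
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = scaled_dot_product_attention(tokens, tokens, tokens)
print(attended)
```

Every position looks at every other position in one step, which is why long-range dependencies are easier to capture than in an RNN that passes information along token by token.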

### Challenges and Opportunities

- **Challenges**: High computational cost and energy consumption, biases in training data, and limited interpretability.
- **Opportunities**: Interacting with applications through natural language and speeding up labour-intensive tasks.


### Overview: Fine-Tuning a Pretrained Model Using Hugging Face

Let's walk through the usual steps of fine-tuning a model.

In [1]:
# imports
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

In [2]:
# Load dataset
# We'll use the IMDb dataset; cache_dir points to a shared directory, so an existing copy is reused instead of being downloaded from Hugging Face
dataset = load_dataset("stanfordnlp/imdb", cache_dir='/nvme/scratch/edu28/data')

In [5]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [4]:
# Split into train and test sets
train_dataset = dataset['train']
test_dataset = dataset['test']

In [6]:
# Index the row first so only one example is loaded, not the whole column
print(f"Review example: {train_dataset[0]['text']}")
print(f"Label of review: {train_dataset[0]['label']}")

Review example: I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and 

In [5]:
# Load pretrained tokenizer
# Using a BERT-based model for sequence classification
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased", cache_dir='/nvme/scratch/edu28/models')

In [8]:
# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)

train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

In [9]:
train_dataset

Dataset({
    features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 25000
})
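
The new `input_ids` and `attention_mask` columns come from the `padding="max_length"` and `truncation=True` arguments. As a rough illustration of what those two options do to a single sequence, here is a simplified pure-Python sketch. The helper `pad_and_truncate` is hypothetical, the token IDs are illustrative, and the real tokenizer additionally keeps the closing special token when it truncates, which this sketch does not.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Return fixed-length input_ids plus the matching attention_mask."""
    kept = token_ids[:max_length]          # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)         # padding: fill the rest with pad_id
    return {
        "input_ids": kept + [pad_id] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,  # 1 = real token, 0 = padding
    }

# A short sequence is padded; a long one is cut (IDs are illustrative only)
short = pad_and_truncate([101, 2023, 3185, 102], max_length=6)
long = pad_and_truncate(list(range(1, 11)), max_length=6)

print(short)  # {'input_ids': [101, 2023, 3185, 102, 0, 0], 'attention_mask': [1, 1, 1, 1, 0, 0]}
print(long)   # {'input_ids': [1, 2, 3, 4, 5, 6], 'attention_mask': [1, 1, 1, 1, 1, 1]}
```

Because no explicit `max_length` is passed in the notebook, every review is padded or cut to the tokenizer's own maximum, which makes all rows the same length at the cost of extra computation on short reviews.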

In [10]:
# Remove unnecessary columns
train_dataset = train_dataset.remove_columns(["text"])
test_dataset = test_dataset.remove_columns(["text"])

In [11]:
# Set format for PyTorch
train_dataset.set_format("torch")
test_dataset.set_format("torch")

In [6]:
# Load pretrained model
model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-uncased", cache_dir='/nvme/scratch/edu28/models', num_labels=2)

# Define training arguments and trainer
training_args = TrainingArguments(
    output_dir="~/results",
    eval_strategy="epoch",
    num_train_epochs=5,
    per_device_train_batch_size=5,
    per_device_eval_batch_size=5,
    logging_dir="~/logs",
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
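
Before starting a run, it can help to estimate how many steps the settings above imply. The following is a back-of-the-envelope sketch, assuming a single device and no gradient accumulation (both are assumptions, not something the notebook states); the split sizes and batch sizes are taken from the cells above.

```python
import math

num_train_examples = 25_000   # size of the train split
num_eval_examples = 25_000    # size of the test split
per_device_batch_size = 5     # per_device_train_batch_size / per_device_eval_batch_size
num_train_epochs = 5          # num_train_epochs
num_devices = 1               # assumption: one GPU
grad_accum_steps = 1          # assumption: no gradient accumulation

effective_batch = per_device_batch_size * num_devices * grad_accum_steps
steps_per_epoch = math.ceil(num_train_examples / effective_batch)
total_train_steps = steps_per_epoch * num_train_epochs
eval_steps = math.ceil(num_eval_examples / (per_device_batch_size * num_devices))

print(steps_per_epoch, total_train_steps, eval_steps)  # 5000 25000 5000
```

With `eval_strategy="epoch"`, the 5000 evaluation steps are repeated after each of the five epochs, so a batch size of 5 makes for a long run; larger batches reduce the step count if memory allows.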


In [14]:
# Fine-tune the model
trainer.train()


In [23]:
# Evaluating the model
trainer.evaluate()

{'eval_loss': 0.5775209665298462,
 'eval_runtime': 83.3585,
 'eval_samples_per_second': 299.909,
 'eval_steps_per_second': 59.982,
 'epoch': 5.0}
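
The evaluation output above reports only the loss and timing figures, because no metric function was given to the `Trainer`. One possible way to also get accuracy is to pass a `compute_metrics` callable. The sketch below is an assumption about how that could look, not part of the original notebook; it is written in plain Python so it works on nested lists as well as on the arrays the `Trainer` provides, and the toy logits are made up.

```python
def compute_metrics(eval_pred):
    """Accuracy from (logits, labels); the predicted class is the largest logit."""
    logits, labels = eval_pred
    correct = 0
    total = 0
    for row, label in zip(logits, labels):
        predicted = max(range(len(row)), key=lambda i: row[i])  # arg-max over classes
        correct += int(predicted == int(label))
        total += 1
    return {"accuracy": correct / total if total else 0.0}

# Toy check with made-up logits for three reviews (columns: negative, positive)
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 0.2]]
toy_labels = [0, 1, 0]
print(compute_metrics((toy_logits, toy_labels)))  # {'accuracy': 0.6666666666666666}
```

To use it, the function would be passed as `compute_metrics=compute_metrics` when constructing the `Trainer`; the result should then appear in the `trainer.evaluate()` dictionary alongside `eval_loss`.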

In [None]:
# Shut down the kernel to release memory
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)

{'status': 'ok', 'restart': False}
