
# Introduction to Language Model Learning Mechanisms (LLMs)

## 1. What are Language Model Learning Mechanisms?

Language models are machine learning models trained to predict the next word (more precisely, the next token) in a sequence. By capturing patterns and structure in human language, they can be applied to tasks such as text generation, sentiment analysis, and translation. Although "LLM" is usually expanded as "large language model", the "learning mechanism" here refers to the algorithms and architectures that power these models, allowing them to learn from vast amounts of textual data.

In essence, an LLM is like a virtual brain that's been trained to understand and generate human language. When given a sequence of words, an LLM tries to predict the most likely next word based on the patterns it has seen during its training.

## 2. Evolution of LLMs: A brief history

- **Early Days (1950s - 1990s)**: The origins of language modeling can be traced back to early statistical methods. These methods were based on simple word frequencies and n-gram models which used the probability of a word following a sequence of other words.
- **Neural Networks Era (2000s - mid-2010s)**: Neural language models emerged in the early 2000s, and Recurrent Neural Networks (RNNs), particularly the gated Long Short-Term Memory (LSTM) variant, later became popular for language modeling. They could capture longer-range dependencies than n-gram methods and led to significant improvements in tasks like speech recognition and machine translation.
- **Transformers and Attention (2017 - Present)**: The introduction of the Transformer architecture in the 2017 paper "Attention Is All You Need" revolutionized language modeling. Attention mechanisms, which let a model focus on specific parts of the input text, predate the Transformer; its contribution was to build the entire architecture on attention and drop recurrence. Models like GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers) are based on this architecture and have achieved state-of-the-art results on many natural language processing benchmarks.

## 3. Importance and Applications of LLMs

### Importance:
- **Human-like Text Processing**: LLMs can process and generate human-like text, making them invaluable for various applications.
- **Transfer Learning**: Modern LLMs, especially large ones, can be fine-tuned on specific tasks with a small amount of data, leveraging the knowledge they've gained from vast amounts of training data.
- **Multilingual Capabilities**: Many advanced LLMs are trained on data from multiple languages, allowing for cross-lingual tasks and understanding.

### Applications:
- **Text Generation**: From writing essays to generating code, LLMs can produce coherent and contextually relevant text.
- **Sentiment Analysis**: LLMs can determine the sentiment of a piece of text, be it positive, negative, or neutral.
- **Machine Translation**: LLMs can translate text from one language to another, often with high accuracy.
- **Question Answering**: Given a passage and a question, LLMs can extract or generate a relevant answer.
- **Chatbots and Virtual Assistants**: LLMs power many modern chatbots and virtual assistants, making them more conversational and context-aware.
- **Content Recommendation**: By understanding user preferences and content semantics, LLMs can recommend relevant articles, videos, and other content.

In conclusion, Language Model Learning Mechanisms are at the heart of many modern AI applications. Their ability to understand and generate human-like text has opened up numerous possibilities in the world of technology and beyond.


## Simple N-gram Example

In the provided code, we are demonstrating a basic example of generating n-grams, specifically bigrams, from a text.

**Process:**
1. We start with a sample text: "Language models are a subset of machine learning models that are trained to predict the next word in a sequence."
2. The text is tokenized by splitting it into individual words.
3. Bigrams are then generated by pairing each word with its subsequent word.
4. The code finally displays the first 5 bigrams created from the text.

**Bigrams** are pairs of consecutive words in a text, and they are a fundamental concept in natural language processing, particularly for n-gram language models. In such models, the probability of each word depends only on the previous n-1 words (for a bigram model, only the single preceding word). This simplifying assumption makes the computation far more tractable than conditioning on the entire context.

The code snippet provided showcases how to create bigrams from a given piece of text.



In [1]:

# Simple N-gram Example

text = "Language models are a subset of machine learning models that are trained to predict the next word in a sequence."

# Tokenize the text
tokens = text.split()

# Create bigrams
bigrams = [(tokens[i], tokens[i+1]) for i in range(len(tokens)-1)]

bigrams[:5]  # Display the first 5 bigrams


[('Language', 'models'),
 ('models', 'are'),
 ('are', 'a'),
 ('a', 'subset'),
 ('subset', 'of')]
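Bigrams become a language model once they are counted. The following is a minimal sketch, not part of the original notebook, that estimates the conditional probability of a word given the previous word by maximum likelihood. The helper name `bigram_prob`, the lowercasing, and the removal of the final period are assumptions made for illustration.

```python
from collections import Counter

text = "Language models are a subset of machine learning models that are trained to predict the next word in a sequence."

# Assumption: lowercase the text and drop the final period before splitting
tokens = text.lower().rstrip(".").split()

# Count every bigram and every context word (each token that has a successor)
bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])

def bigram_prob(prev_word, word):
    """Maximum-likelihood estimate of P(word | prev_word); 0.0 for unseen contexts."""
    if context_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / context_counts[prev_word]

print(bigram_prob("models", "are"))   # "models" is followed once by "are" and once by "that"
print(bigram_prob("a", "subset"))     # "a" is followed once by "subset" and once by "sequence"
```

With a corpus this small most probabilities are 0, 0.5, or 1; a practical n-gram model would need far more text and some form of smoothing for unseen pairs.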

## Neural Network-based Language Model using Keras

The provided code demonstrates how to build and train a simple language model using Keras, leveraging the LSTM architecture.

**Process:**

1. **Sample Data**: A small corpus of sentences is defined. This corpus will be used to train the language model.

2. **Tokenization**: The `Tokenizer` from Keras is used to split the sentences into individual words and build a vocabulary. `total_words` is the number of unique words plus one, because index 0 is reserved for padding.

3. **Sequence Creation**: For each sentence in the corpus, every prefix of two or more tokens is generated. For instance, for the sentence "Language models are", the prefixes "Language models" and "Language models are" are created (as lists of integer token ids; the tokenizer also lowercases the text).

4. **Padding Sequences**: The sequences are padded to have a uniform length, which is required for training the neural network. This is achieved using Keras's `pad_sequences` function.

5. **Data Preparation**: The sequences are split into input data (`X`) and labels (`y`). The idea is to predict the next word in a sequence given the previous words.

6. **Model Definition**: A simple LSTM-based neural network is defined. The model consists of:
   - An `Embedding` layer to convert word tokens into embeddings.
   - An `LSTM` layer with 50 units.
   - A `Dense` layer with a softmax activation to predict the next word out of the vocabulary.

7. **Model Compilation**: The model is compiled using the Adam optimizer and the sparse categorical crossentropy loss function, suitable for multi-class classification tasks.

8. **Training**: The model is trained on the prepared data for a few epochs.

9. **Prediction**: Finally, the trained model is used to predict the next word for a given input sequence.

This example gives an overview of building a foundational language model using Keras. In real-world scenarios, one would typically use a larger dataset, more complex architectures, and additional preprocessing steps.
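Steps 3 to 5 above can be traced by hand in plain Python. This is a small sketch under stated assumptions: the token ids are made up, and `make_prefixes` and `pre_pad` are hypothetical helpers that mimic the slicing loop and the effect of `pad_sequences(..., padding='pre')` on sequences shorter than the maximum length.

```python
def make_prefixes(token_ids):
    """Return every prefix of length >= 2, like token_list[:i+1] for i >= 1."""
    return [token_ids[:i + 1] for i in range(1, len(token_ids))]

def pre_pad(seq, maxlen, pad_id=0):
    """Left-pad a sequence with pad_id up to maxlen."""
    return [pad_id] * (maxlen - len(seq)) + seq

# Made-up token ids standing in for a four-word sentence
token_ids = [3, 1, 4, 7]

prefixes = make_prefixes(token_ids)
maxlen = max(len(p) for p in prefixes)
padded = [pre_pad(p, maxlen) for p in prefixes]

# The last id of each padded row is the label; everything before it is the input
X = [row[:-1] for row in padded]
y = [row[-1] for row in padded]

print(X)  # [[0, 0, 3], [0, 3, 1], [3, 1, 4]]
print(y)  # [1, 4, 7]
```

Padding on the left keeps the most recent words next to the label position, so the final column of every row is always the word to predict.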


In [3]:
# Let's write a simple example using Keras to demonstrate a Neural Network-based Language Model.
# We'll use a simple LSTM model for this purpose.
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Sample data
corpus = [
    "Language models are a subset of machine learning models",
    "Language models are trained to predict the next word",
    "Machine learning is a fascinating field",
    "Predict the next word using a language model"
]

# Tokenization
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1

# Convert corpus to sequences and prepare input and output data
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

# Pad sequences and create predictors and label
max_sequence_len = max(len(x) for x in input_sequences)
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')
X, y = input_sequences[:, :-1], input_sequences[:, -1]

# Define the LSTM model
model = Sequential()
model.add(Embedding(total_words, 10, input_length=max_sequence_len-1))
model.add(LSTM(50))
model.add(Dense(total_words, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model (for demonstration, we'll just do a few epochs)
model.fit(X, y, epochs=10, verbose=0)

# Predict the next word
input_sequence = "Language models are"
input_sequence = tokenizer.texts_to_sequences([input_sequence])[0]
input_sequence = pad_sequences([input_sequence], maxlen=max_sequence_len-1, padding='pre')
predicted_word_index = np.argmax(model.predict(input_sequence), axis=-1)
predicted_word = tokenizer.index_word[predicted_word_index[0]]
predicted_word



'models'
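The prediction step relies on two pieces of arithmetic: the softmax output layer turns scores into a probability distribution over the vocabulary, and sparse categorical crossentropy is the negative log-probability assigned to the true word index. The sketch below illustrates both in plain Python; the score values are made up, and the function names are illustrative rather than Keras API calls.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(probs, true_index):
    """Loss for one example: negative log-probability of the true class index."""
    return -math.log(probs[true_index])

# Made-up scores over a four-word vocabulary
scores = [0.5, 2.0, 0.1, -1.0]
probs = softmax(scores)

# The predicted word is the index with the highest probability (what np.argmax does)
predicted_index = max(range(len(probs)), key=probs.__getitem__)

loss_if_correct = sparse_categorical_crossentropy(probs, 1)
loss_if_wrong = sparse_categorical_crossentropy(probs, 3)
print(predicted_index, loss_if_correct, loss_if_wrong)
```

The loss is small when the model puts high probability on the true index and grows as that probability shrinks, which is what training minimizes.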

# Quick Training on a Subset of AG News Dataset using DistilBERT

In this section, we demonstrate fine-tuning a pretrained transformer model on a text classification task. Instead of using the entire dataset, which can be time-consuming, we use only a subset of the 'ag_news' dataset. This shortens training considerably, which is useful for demonstration and prototyping.

## Steps:

1. **Load a Subset of the Dataset**: We'll fetch only the first 5% of the 'ag_news' training split (6,000 of its 120,000 examples) from Hugging Face's datasets library.
2. **Initialize Tokenizer and Model**: We'll leverage the DistilBERT tokenizer and model. DistilBERT is a distilled version of BERT, which is faster and smaller but retains most of BERT's capabilities.
3. **Tokenization**: Convert the text data into a format suitable for training.
4. **Dataset Preparation**: Organize the tokenized data into a structure that the trainer can understand.
5. **Model Training**: Define training arguments and train the DistilBERT model on our subset data.

Let's dive into the code!


In [1]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.34.1-py3-none-any.whl (7.7 MB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.18.0-py3-none-any.whl (301 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [2]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.14.6-py3-none-any.whl (493 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: dill, multiprocess, datasets
Successfully installed datasets-2.14.6 dill-0.3.7 multiprocess-0.70.15


In [3]:
!pip install accelerate -U

Collecting accelerate
  Downloading accelerate-0.24.0-py3-none-any.whl (260 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.24.0


In [4]:
# Fetching a Sample Dataset from Hugging Face's datasets library
from datasets import load_dataset
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments

# Load a subset of the 'ag_news' dataset
dataset = load_dataset('ag_news', split='train[:5%]')  # Using only 5% of the training data

# Load the tokenizer and model
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=4)

# Tokenize the dataset with a smaller max_length
train_encodings = tokenizer(dataset['text'], truncation=True, padding=True, max_length=64)

# Prepare the data for training
class AGNewsDataset:
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = AGNewsDataset(train_encodings, dataset['label'])

# Define training arguments and train the model
training_args = TrainingArguments(
    output_dir='./results',
    per_device_train_batch_size=32,  # Increased batch size
    per_device_eval_batch_size=64,   # Used later by trainer.evaluate()
    num_train_epochs=1,
    # No evaluation during training: the Trainer below receives no eval_dataset,
    # so a step-based evaluation schedule would fail once it triggered.
    save_steps=500,
    logging_dir='./logs',
    logging_steps=50,  # Log more frequently for quick runs
    learning_rate=2e-5,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

trainer.train()



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




TrainOutput(global_step=188, training_loss=0.5803269832692248, metrics={'train_runtime': 2171.4105, 'train_samples_per_second': 2.763, 'train_steps_per_second': 0.087, 'total_flos': 99354092544000.0, 'train_loss': 0.5803269832692248, 'epoch': 1.0})
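The `global_step=188` in the training output follows from the subset size and the batch size. The arithmetic below is a quick check, assuming a single device and no gradient accumulation; the variable names are illustrative.

```python
import math

train_split_size = 120_000  # size of the AG News training split
subset_percent = 5          # split='train[:5%]'
batch_size = 32             # per_device_train_batch_size (assuming one device)
num_epochs = 1              # num_train_epochs

subset_size = train_split_size * subset_percent // 100
steps_per_epoch = math.ceil(subset_size / batch_size)  # the last batch is partial
total_steps = steps_per_epoch * num_epochs

print(subset_size, total_steps)  # 6000 188
```

Because the run ends at step 188, a `save_steps` value of 500 is never reached during this one-epoch run.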

In [6]:
# Load a subset of the test data from 'ag_news'
test_dataset = load_dataset('ag_news', split='test[:1%]')  # Using only 1% of the test data for quick evaluation

# Tokenize the test data
test_encodings = tokenizer(test_dataset['text'], truncation=True, padding=True, max_length=64)

# Prepare the test data
test_dataset = AGNewsDataset(test_encodings, test_dataset['label'])

# Evaluate the model on the test data
results = trainer.evaluate(test_dataset)

# Print the evaluation results
print("Evaluation Results:")
for key, value in results.items():
    print(f"{key}: {value:.4f}")

Evaluation Results:
eval_loss: 0.2437
eval_runtime: 20.2098
eval_samples_per_second: 3.7610
eval_steps_per_second: 0.0990
epoch: 1.0000
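The evaluation output above reports only the loss and throughput. For a classification task, accuracy is usually more informative. One possible way to get it, sketched here with made-up logits, is a `compute_metrics` function that receives the pair of logits and labels and returns a dictionary. The function body and the sample arrays are illustrative and were not part of the original run.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Return accuracy given a (logits, labels) pair of NumPy arrays."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # most likely class per example
    return {"accuracy": float((predictions == labels).mean())}

# Made-up logits for 4 examples over the 4 AG News classes
fake_logits = np.array([
    [2.0, 0.1, 0.0, -1.0],
    [0.0, 0.3, 1.5, 0.2],
    [0.1, 0.2, 0.3, 0.9],
    [1.0, 0.9, 0.0, 0.0],
])
fake_labels = np.array([0, 2, 3, 1])

print(compute_metrics((fake_logits, fake_labels)))  # {'accuracy': 0.75}
```

Passing `compute_metrics=compute_metrics` when constructing the `Trainer` would make `trainer.evaluate` include the result alongside `eval_loss`, with the key prefixed as `eval_accuracy`.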


In [9]:
import torch
# Get predictions for 5 samples from the test dataset
sample_test_data = load_dataset('ag_news', split='test[:5]')  # Load 5 samples from the test data

# Tokenize the sample test data
sample_test_encodings = tokenizer(sample_test_data['text'], truncation=True, padding=True, max_length=64, return_tensors="pt")

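# Get model's predictions
model.eval()  # Disable dropout for inference
# Move the inputs to the model's device (the Trainer may have placed the model on a GPU)
sample_test_encodings = {key: val.to(model.device) for key, val in sample_test_encodings.items()}
with torch.no_grad():
    outputs = model(**sample_test_encodings)
    predictions = outputs.logits
    predicted_labels = torch.argmax(predictions, dim=1).cpu().numpy()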

# Print the results
print("Predictions on 5 test samples:\n")
for i, (true_label, pred_label, text) in enumerate(zip(sample_test_data['label'], predicted_labels, sample_test_data['text'])):
    print(f"Sample {i+1}:")
    print(f"Text: {text}")
    print(f"True Label: {true_label}, Predicted Label: {pred_label}\n")


Predictions on 5 test samples:

Sample 1:
Text: Fears for T N pension after talks Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.
True Label: 2, Predicted Label: 2

Sample 2:
Text: The Race is On: Second Private Team Sets Launch Date for Human Spaceflight (SPACE.com) SPACE.com - TORONTO, Canada -- A second\team of rocketeers competing for the  #36;10 million Ansari X Prize, a contest for\privately funded suborbital space flight, has officially announced the first\launch date for its manned rocket.
True Label: 3, Predicted Label: 3

Sample 3:
Text: Ky. Company Wins Grant to Study Peptides (AP) AP - A company founded by a chemistry researcher at the University of Louisville won a grant to develop a method of producing better peptides, which are short chains of amino acids, the building blocks of proteins.
True Label: 3, Predicted Label: 3

Sample 4:
Text: Prediction Unit Helps Forecast Wildfires (AP) AP - It'