**Introduction to LLMs & Transformers**

(i) **LLMs:** Large Language Models (LLMs) are deep learning models trained on vast text corpora to understand and generate human-like language. They excel at tasks like text completion, translation, summarization, and question answering.

(ii) **Transformers:** At the core of most LLMs is the Transformer architecture, introduced in 2017, which uses self-attention mechanisms to process input sequences in parallel rather than sequentially. This allows Transformers to capture complex dependencies and long-range context more efficiently than earlier RNNs or LSTMs. Modern LLMs like Google DeepMind’s Gemini, OpenAI's GPT (Generative Pre-trained Transformer), Google Research's BERT (Bidirectional Encoder Representations from Transformers), and Google Research and Brain Team's T5 (Text-To-Text Transfer Transformer) are scaled-up versions of the Transformer, leveraging massive data and computation. Understanding Transformers is key to grasping how LLMs work and how they can be customized for specific language tasks.
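To make the self-attention idea above concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. The sizes, the random projection matrices, and the helper names `softmax` and `self_attention` are illustrative assumptions, not part of any particular library; real Transformers add multiple heads, masking, and learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k).
    Returns the attended values (seq_len, d_k) and the
    attention weights (seq_len, seq_len).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # similarity of every position with every other
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4               # illustrative sizes (assumed)
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)               # (5, 4) (5, 5)
```

Every row of `weights` is computed from the same matrix product, which is why all positions can be processed in parallel rather than one step at a time as in an RNN.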

**Relationships between LLMs, Transformers, and Generative AI**

Large Language Models (LLMs), Transformers, and Generative AI are deeply interconnected components of modern artificial intelligence.

LLMs are a subset of Generative AI systems specifically designed to understand and produce human language. These models are typically built upon the Transformer architecture, which revolutionized NLP by introducing self-attention mechanisms that enable efficient processing of long text sequences. Transformers form the computational backbone of most state-of-the-art LLMs like Google DeepMind’s Gemini, OpenAI's GPT, Google Research's BERT, and Google Research and Brain Team's T5.

Generative AI, a broader category, includes models that create new content—text, images, code, music—based on learned patterns. LLMs are a key pillar of Generative AI in the textual domain. By using Transformers, LLMs can generate coherent, contextually relevant language outputs, making them capable of creative tasks like storytelling, summarization, and dialogue. In essence, Transformers enable LLMs, and LLMs serve as the language engine of Generative AI applications.

**What are the problems associated with Keras's high-level abstractions, such as the Sequential model or model.fit() pipeline?**

Keras's high-level abstractions like Sequential and model.fit() offer simplicity and speed for prototyping but come with several limitations, especially when building complex or customized models. Here are the main problems:

**1. Limited Architectural Flexibility (Sequential)**

**(i)** Sequential assumes a linear stack of layers.

**(ii)** Difficult or impossible to build models with:

(a) Multiple inputs or outputs.

(b) Skip connections (e.g., in ResNets).

(c) Shared layers across different paths.

(d) Dynamic branching or conditional logic.
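As an illustration of point (b), the sketch below builds a small residual (skip) connection with the Keras Functional API, which a linear Sequential stack cannot express. The layer sizes and the name `residual_model` are assumptions chosen only for the example.

```python
import tensorflow as tf

# A tiny residual block: the input is added back to the output of two Dense layers.
inputs = tf.keras.Input(shape=(16,))
h = tf.keras.layers.Dense(16, activation="relu")(inputs)
h = tf.keras.layers.Dense(16)(h)
skip = tf.keras.layers.Add()([inputs, h])      # skip connection: two paths merge here
outputs = tf.keras.layers.Dense(1)(skip)

residual_model = tf.keras.Model(inputs, outputs)

y = residual_model(tf.zeros((4, 16)))          # batch of 4 dummy samples
print(y.shape)                                 # (4, 1)
```

The `Add` layer takes two tensors, so the graph is no longer a simple chain; the same pattern extends to multiple inputs, multiple outputs, and shared layers.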


**2. Less Control Over Training Logic (model.fit)**

**(i)** model.fit() hides the training loop; changing what happens in a step means overriding train_step() or writing your own loop.

**(ii)** Hard to:

(a) Implement custom loss functions that depend on intermediate layers.

(b) Add per-batch logic (e.g., curriculum learning, adaptive loss scaling).

(c) Handle non-standard data flows (e.g., RL or self-supervised learning).

(d) Manage variable-length sequences without padding (e.g., NLP tasks).


**3. Difficulty with Debugging and Advanced Logging**

**(i)** Less transparent than a custom training loop.

**(ii)** Debugging issues like gradient explosions, NaNs, or unexpected outputs can be tricky.

**(iii)** Custom logging or interaction with intermediate tensors (e.g., attention maps) is limited.


**4. Inflexible for Custom Metrics or Loss Aggregation**

**(i)** model.fit() expects metrics/losses in a specific format.

**(ii)** Aggregating complex multi-output or multi-task losses can be cumbersome.

**(iii)** Limited support for online evaluation or sample-level feedback.


**5. Performance Trade-offs**

**(i)** Some optimizations (e.g., gradient accumulation or custom update rules) are harder to integrate into model.fit(). Mixed precision, by contrast, can be enabled through a global Keras policy.

**(ii)** Custom training loops using tf.GradientTape allow more fine-grained performance tuning.
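The kind of custom loop mentioned in the last point can be sketched as follows. This is a minimal example under stated assumptions: the toy regression data (y = 3x + 1), the variables `w` and `b`, the learning rate, and the clipping threshold are all invented for illustration, and plain gradient descent is applied by hand instead of using a Keras optimizer.

```python
import tensorflow as tf

tf.random.set_seed(0)

# Toy data (assumed for illustration): y = 3x + 1.
x = tf.random.uniform((64, 1))
y = 3.0 * x + 1.0

# Parameters are plain tf.Variable objects rather than Keras layers.
w = tf.Variable(tf.zeros((1, 1)))
b = tf.Variable(tf.zeros((1,)))
learning_rate = 0.3

losses = []
for step in range(300):
    with tf.GradientTape() as tape:                  # records operations for autodiff
        pred = tf.matmul(x, w) + b
        loss = tf.reduce_mean(tf.square(pred - y))
    grads = tape.gradient(loss, [w, b])
    # Per-step logic that fit() does not expose directly: clip by global norm.
    grads, grad_norm = tf.clip_by_global_norm(grads, clip_norm=1.0)
    for var, grad in zip([w, b], grads):
        var.assign_sub(learning_rate * grad)         # manual gradient-descent update
    losses.append(float(loss))

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.6f}")
```

Because every step is ordinary Python, it is straightforward to log `grad_norm`, check for NaNs, or accumulate gradients over several batches before applying them.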


**When to Avoid High-Level Abstractions:**

**(i)** Multi-modal models (e.g., images + text).

**(ii)** Graph neural networks.

**(iii)** Custom training workflows (e.g., contrastive learning, adversarial training).

**(iv)** Reinforcement learning, or hybrid systems that combine decision trees with deep learning.

**Problem Statement:** Design and implement a character-level language model using a simplified Transformer architecture built with low-level TensorFlow (not Keras high-level API), to learn and generate English-like text from a small custom dataset.

**Objective:** This hands-on exercise focuses on building a custom character-level language model using the Transformer architecture. Instead of stacking layers in a top-level Sequential model, participants subclass tf.keras.Model and wire together embedding layers, multi-head attention, layer normalization, and dense layers by hand in call() to better understand the core components of a Transformer. For brevity, the demo cell still trains with compile() and fit() and uses a small Sequential for the feed-forward sub-block.

The model is trained on a small sample corpus and generates text character-by-character. The goal is to help learners grasp the essential concepts behind Transformer models in natural language processing and how these models perform autoregressive generation.
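Autoregressive generation means predicting one character, appending it to the context, and predicting again. The sketch below shows only that loop: the "model" is a hypothetical stand-in built from bigram counts (an assumption for illustration, not a Transformer), and the names `counts`, `next_char`, and `generate` are invented for the example.

```python
from collections import Counter, defaultdict

corpus = "the sun sets in the west. the stars shine in the night."

# Stand-in "model": how often each character follows another (assumption for illustration).
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_char(context):
    """Greedy prediction: the most frequent follower of the last character of `context`."""
    options = counts.get(context[-1])
    if not options:
        return ""
    return options.most_common(1)[0][0]

def generate(seed, length):
    """Generate up to `length` extra characters, one at a time."""
    text = seed
    for _ in range(length):
        ch = next_char(text)
        if not ch:
            break
        text += ch          # the prediction becomes part of the next input
    return text

sample = generate("th", 10)
print(sample)
```

A Transformer-based generator follows the same loop; the difference is that the next-character distribution comes from the network and can depend on the whole context rather than only the last character.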

In [None]:
# 1. Setup
# Installs TensorFlow (a notebook shell command; skip it if TensorFlow is already installed).
!pip install tensorflow
import tensorflow as tf  # Imports TensorFlow.
import numpy as np       # Imports NumPy.

# 2. Sample Dataset
text = """The sun sets in the west. The stars shine in the night. The moon rises with grace."""  # A small text corpus.
text = text.lower()  # Convert to lowercase for uniformity (reduces character variation).

# 3. Tokenization
tokenizer = tf.keras.preprocessing.text.Tokenizer(char_level=True)  # Create a character-level tokenizer.
tokenizer.fit_on_texts([text])  # Learn the character vocabulary from the text.
total_chars = len(tokenizer.word_index) + 1  # Number of unique characters + 1 (index 0 is reserved for padding).

sequences = []  # List to hold the integer-encoded prefix sequences.
for i in range(1, len(text)):  # Create prefix sequences of the text, e.g., "t", "th", "the", ...
    input_seq = text[:i]  # Prefix from the start of the text up to (but not including) position i.
    sequences.append(tokenizer.texts_to_sequences([input_seq])[0])  # texts_to_sequences returns one list per input string; [0] selects the only one.

max_len = max(len(x) for x in sequences)  # Find the maximum sequence length.
sequences = tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=max_len, padding='pre')  # Left-pad all sequences with 0 to the same length (returns a NumPy array).
X, y = sequences[:, :-1], sequences[:, -1]  # Split into input (all but the last character) and target (the last character).
y = tf.keras.utils.to_categorical(y, num_classes=total_chars)  # One-hot encode the target.

# 4. Build a Mini Transformer Block
class MiniTransformer(tf.keras.Model):  # Define a custom Transformer-based model.
    def __init__(self, vocab_size, max_len):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, 64)  # Learned character embeddings.
        self.pos_encoding = tf.keras.layers.Embedding(input_dim=max_len, output_dim=64)  # Learned positional embeddings.
        self.att = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=64)  # Multi-head self-attention.
        self.norm1 = tf.keras.layers.LayerNormalization()  # Layer normalization after attention.
        self.norm2 = tf.keras.layers.LayerNormalization()  # Separate layer normalization after the feed-forward block.
        self.ff = tf.keras.Sequential([  # Position-wise feed-forward network.
            tf.keras.layers.Dense(128, activation='relu'),
            tf.keras.layers.Dense(64)
        ])
        self.out = tf.keras.layers.Dense(vocab_size, activation='softmax')  # Output layer for next-character prediction.

    def call(self, x):
        positions = tf.range(start=0, limit=tf.shape(x)[-1], delta=1)  # Position indices 0..seq_len-1.
        x_embed = self.embedding(x) + self.pos_encoding(positions)  # Character embeddings + positional embeddings.

        # Self-attention block (no padding mask is applied in this simplified version)
        attn_output = self.att(x_embed, x_embed)  # Query and value are the same tensor: self-attention.
        x1 = self.norm1(x_embed + attn_output)  # Residual connection + norm.

        # Feed-forward block
        ff_output = self.ff(x1)  # Apply the feed-forward network.
        x2 = self.norm2(x1 + ff_output)  # Second residual connection + norm.

        return self.out(x2[:, -1, :])  # Next-character probabilities, read from the last position only.


model = MiniTransformer(vocab_size=total_chars, max_len=max_len-1)  # Instantiate the model (inputs have max_len-1 positions).
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])  # Compile with optimizer and loss.
model(X[:1])  # Run one sample through the model so its weights exist before summary().
model.summary()  # Print the model architecture.

# 5. Train Model (quick demo with few epochs)
model.fit(X, y, epochs=10, batch_size=2)                                                                            # Train model for 10 epochs using small batch size

# 6. Text Generation
def generate_text(model, start_str, length=100):  # Greedily generate `length` additional characters.
    for _ in range(length):
        token_list = tokenizer.texts_to_sequences([start_str])[0]  # Convert the current text to an integer sequence.
        token_list = tf.keras.preprocessing.sequence.pad_sequences([token_list], maxlen=max_len-1, padding='pre')  # Pad (or truncate from the left) to the model's input length.
        prediction = model.predict(token_list, verbose=0)  # Shape (1, vocab_size): next-character probabilities.
        predicted_id = int(np.argmax(prediction[0]))  # Index of the most likely next character (argmax over the vocabulary).
        next_char = tokenizer.index_word.get(predicted_id, '')  # Convert the ID back to a character ('' for the padding ID 0).
        start_str += next_char  # Append the character to the string.
    return start_str  # Return the generated text.

# Example
print(generate_text(model, "the ", length=50))  # Generate 50 characters starting with "the ".



Epoch 1/10
41/41 ━━━━━━━━━━━━━━━━━━━━ 6s 9ms/step - accuracy: 0.1535 - loss: 3.1255
Epoch 2/10
41/41 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.4079 - loss: 1.9976
Epoch 3/10
41/41 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step - accuracy: 0.4430 - loss: 1.6803
Epoch 4/10
41/41 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step - accuracy: 0.3736 - loss: 1.6314
Epoch 5/10
41/41 ━━━━━━━━━━━━━━━━━━━━ 1s 8ms/step - accuracy: 0.4303 - loss: 1.5085
Epoch 6/10
41/41 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.5237 - loss: 1.3373
Epoch 7/10
41/41 ━━━━━━━━━━━━━━━━━━━━ 1s 8ms/step - accuracy: 0.4708 - loss: 1.4518
Epoch 8/10
41/41 ━━━━━━━━━━━━━━━━━━━━ 1s 8ms/step - accuracy: 0.4970 - loss: 1.4278
Epoch 9/10
[output truncated]

**Novelty of the execution:** This implementation teaches foundational LLM concepts by assembling a small, working Transformer block by hand from individual TensorFlow layers in a character-level setting, kept deliberately simple for transparency and fast experimentation.
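To inspect the training pairs that the notebook cell builds without running TensorFlow, the same prefix and next-character construction can be sketched with NumPy alone. This is an approximation under stated assumptions: the character IDs here are assigned alphabetically by a hypothetical `char_to_id` mapping, whereas the Keras tokenizer in the cell assigns its own IDs, so only the shapes and the padding layout are meant to match.

```python
import numpy as np

text = ("The sun sets in the west. The stars shine in the night. "
        "The moon rises with grace.").lower()

# Character vocabulary; ID 0 is reserved for padding (assumed to mirror the cell above).
chars = sorted(set(text))
char_to_id = {ch: i + 1 for i, ch in enumerate(chars)}

# Every prefix of the text: "t", "th", "the", ...
prefixes = [[char_to_id[ch] for ch in text[:i]] for i in range(1, len(text))]
max_len = max(len(p) for p in prefixes)

# Left ("pre") padding so that the last column always holds a real character.
padded = np.zeros((len(prefixes), max_len), dtype=np.int64)
for row, p in enumerate(prefixes):
    padded[row, max_len - len(p):] = p

X, y = padded[:, :-1], padded[:, -1]   # inputs: all but the last column; target: last column
print(len(chars), X.shape, y.shape)
```

The first row of `X` is entirely padding and its target is the first character of the text; each later row reveals one more character of context.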

**Applications:**

**(i) Understanding Core LLM Concepts**

Helps learners grasp how large language models (like Gemini) are built from the ground up using attention, embeddings, and residual connections.

**(ii) Text Generation Systems**

Forms the basis for applications like chatbots, story generators, autocomplete engines, or poetry generators—even if in a simplified form.

**(iii) Language Modeling in Low-Resource Settings**

Character-level modeling is useful for languages or domains without robust tokenizers or large datasets.

**(iv) Educational & Research Prototypes**

Excellent for academic courses, workshops, or sandbox experimentation to prototype new ideas before scaling up.

**(v) Error Correction & Typo Prediction**

Since it models sequences at the character level, it can be fine-tuned for spelling correction, auto-correction, or typo-aware input systems.

**(vi) Custom Code Generation or Domain-Specific Languages (DSLs)**

Character-level models can be adapted for specialized domains like DNA sequences, source code, or markup languages.