<a href="https://colab.research.google.com/github/Tanu-N-Prabhu/Python/blob/master/Machine%20Learning%20Advanced%20Topics/Large%20Language%20Models%20(LLMs)%20%26%20Foundation%20Models/llm_demo_explained.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Large Language Models (LLMs)
_Simple Educational Demo_

This notebook gives a **lightweight, quick-to-run** introduction to:

1. **Transformer architectures** (GPT-like text generation)
2. **Parameter-efficient fine-tuning** (LoRA / QLoRA): *concept demo only*
3. **Prompt engineering**
4. **Retrieval-Augmented Generation (RAG)**: mini in-memory demo

All code runs in **Google Colab** or locally with CPU/GPU.

---


In [1]:
!pip install transformers datasets faiss-cpu --quiet


In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import faiss
import numpy as np


### Explanation
The first cell installs the minimal dependencies:
- **transformers**: loads pre-trained language models such as GPT-2.
- **datasets**: a common library for loading text data (installed here, but not used in this notebook).
- **faiss-cpu**: builds the mini vector search index in the RAG example.

The second cell imports:
- **AutoTokenizer** and **AutoModelForCausalLM**: load our GPT-like model and its tokenizer.
- **pipeline**: the high-level API for text generation.
- **faiss** and **numpy**: used for retrieval in the RAG example.


## 1. GPT-like Text Generation

In [3]:
# Load a tiny GPT-2 model for fast inference
model_name = "sshleifer/tiny-gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Create pipeline
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Generate text
prompt = "Artificial Intelligence is"
output = generator(prompt, max_length=30, num_return_sequences=1)
print(output[0]['generated_text'])


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Artificial Intelligence is448 236 courtyardacious 236 factors bravery boilsozyg rented membership incarcerivedaciousGy Wheels skillet Tre 236 rubbing soy soy Television rented LateMost� Boone Medic clearer linedozyg rubbing equate LateMost Wheels equate bravery Redux653� omega membership representations rubbing Medic brutality skillet Pocket deflect membership workshops Pocket bravery equate soy 236 Reduxobl workshops rented incarcer representations Medic� TreProsozyg 236 clearer lined boilsGy boils workshopsshows mutual braveryOutsideozygOutside Tre Television grandchildren Booneshowsacious courtyard Boone Television Dreams Pocket omegapublic Booneozyg Pocket Singapore mutual workshops448 rubbing courtyard incarcer courtyard perhaps 236publicpublic Bend courtyardacious653 rented factors factors representations membershipacious653MiniPros soyMost brutality lined perhaps clearer rubbing Late448Sexual 236 skillet courtyard equate predators workshops predators� brutality predatorsacious c


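### Explanation
We load a **very tiny GPT-2 model** (`sshleifer/tiny-gpt2`) so it runs quickly.
- This is NOT a factual or coherent model, as the output above shows; it is only for understanding *how* text generation works.
- We pass a short prompt and let the model generate a continuation.
- The continuation runs far past 30 tokens because, as the warning in the output states, `max_new_tokens` (=256) takes precedence over `max_length=30`. Passing `max_new_tokens=30` instead would cap the continuation at 30 new tokens.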
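To make "generate a continuation" more concrete, here is a minimal sketch of the autoregressive loop behind text generation: pick a next token from a probability distribution, append it, and repeat. The `next_token_probs` table and the `generate` helper are hypothetical stand-ins invented for this illustration; a real GPT computes these probabilities with a neural network over the whole context, not a lookup on the last token.

```python
import random

# Hypothetical toy "model": for each token, the possible next tokens and
# their probabilities. This lookup table is an assumption for illustration.
next_token_probs = {
    "<start>": {"the": 1.0},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "<end>": 0.3},
    "dog": {"sat": 0.6, "<end>": 0.4},
    "sat": {"<end>": 1.0},
}

def generate(max_new_tokens=10, seed=0):
    """Sample one token at a time until <end> or the token budget is reached."""
    rng = random.Random(seed)
    tokens = ["<start>"]
    for _ in range(max_new_tokens):
        dist = next_token_probs[tokens[-1]]
        choices, weights = zip(*dist.items())
        token = rng.choices(choices, weights=weights, k=1)[0]
        if token == "<end>":
            break
        tokens.append(token)
    return tokens[1:]  # drop the <start> marker

print(" ".join(generate()))
```

The `max_new_tokens` argument here plays the same role as the pipeline parameter of that name: it bounds how many tokens the loop may add.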


## 2. LoRA / QLoRA Concept (no training)


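### Explanation
LoRA and QLoRA make it feasible to fine-tune large models by freezing the pre-trained weights and training only small low-rank adapter matrices; QLoRA additionally keeps the frozen base model in 4-bit quantized form.
Here, we only **simulate** that idea by computing 1% of the model's parameter count. The 1% figure is an arbitrary illustration, not derived from an actual adapter configuration.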


In [4]:
# LoRA / QLoRA Concept
total_params = sum(p.numel() for p in model.parameters())
lora_params = int(total_params * 0.01)  # simulate only 1% trainable params

print(f"Full model parameters: {total_params:,}")
print(f"LoRA trainable parameters (simulated): {lora_params:,}")
print("LoRA idea: train only small adapter matrices instead of the whole model.")

Full model parameters: 102,714
LoRA trainable parameters (simulated): 1,027
LoRA idea: train only small adapter matrices instead of the whole model.
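As a complement to the flat 1% estimate, the sketch below counts what a LoRA adapter would add to a single weight matrix. For a weight of shape `d_out × d_in`, LoRA learns two matrices of shapes `d_out × rank` and `rank × d_in`, so it adds `rank * (d_out + d_in)` trainable parameters. The layer sizes and the rank are hypothetical values chosen for illustration; they are not taken from the tiny model used in this notebook.

```python
def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    The update is B @ A with B of shape (d_out, rank) and A of shape
    (rank, d_in), which gives rank * (d_out + d_in) parameters.
    """
    return rank * (d_out + d_in)

# Hypothetical layer sizes and rank, chosen only for illustration
d_out, d_in, rank = 768, 768, 8

full = d_out * d_in
adapter = lora_param_count(d_out, d_in, rank)

print(f"Full weight matrix: {full:,} parameters")
print(f"LoRA adapter (rank {rank}): {adapter:,} parameters")
print(f"Trainable fraction: {adapter / full:.2%}")
```

The trainable fraction depends on the rank and on the layer shapes, which is why a fixed percentage is only a rough stand-in.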



## 3. Prompt Engineering Example


### Explanation
Prompt engineering means phrasing the input so that it steers the model's output.
We demonstrate with a "translate" instruction, even though our toy GPT model is not trained for translation.
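A minimal sketch of one common prompt-engineering technique, few-shot prompting: the prompt contains an instruction, a few worked examples, and then the new input in the same layout. The helper name `build_few_shot_prompt` and the example sentence pairs are hypothetical, introduced only for this illustration. The tiny model in this notebook would not translate correctly with this prompt either; the sketch only shows how such a prompt is assembled.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and the new input into one prompt."""
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"English: {source}")
        lines.append(f"French: {target}")
        lines.append("")
    # End with the new input and an open "French:" slot for the model to fill
    lines.append(f"English: {query}")
    lines.append("French:")
    return "\n".join(lines)

# Illustrative example pairs (assumptions, not from the notebook)
examples = [
    ("Good morning.", "Bonjour."),
    ("Thank you very much.", "Merci beaucoup."),
]

few_shot_prompt = build_few_shot_prompt(
    "Translate the following English sentences to French.",
    examples,
    "Where is the nearest train station?",
)
print(few_shot_prompt)
```

The resulting string can be passed to `generator(...)` in place of a plain instruction.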


In [5]:
# Prompt Engineering Demo
prompt = """Translate the following English sentence to French:

'Where is the nearest train station?'"""

output = generator(prompt, max_length=50, num_return_sequences=1)
print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Translate the following English sentence to French:

'Where is the nearest train station?' Singapore equate rubbing Wheels Bend deflect predatorsshows Bend linedMost bravery prayingozyg Late omega 236448 mutualacious equate perhaps mutual lined Televisionozyg653 workshops Late representations Reduxpublic Late courtyard membership deflect Medic653ived rentedshowsOutside membershipMost 236shows perhaps PocketGy rubbing courtyardacious representations Medic brutality skilletacious Bend skillet 236 deflectMost perhapspublic deflect Wheels lined perhapsSexualMostOutside soy praying praying Pocketozyg Medic Dreams rented Late workshopsGy grandchildren representations 236 membership Redux Late� 236 representations DreamsSexual membership Late brutality perhaps factors Treoblozyg perhapspublicGy 236 Lateacious Tre Dreams factors boils� factors Pocket workshops Tre bravery Tre soy predators Bend448Gy perhapspublicMini Late soy membership deflectpublic deflect representations brutality BooneSexu

## 4. Mini Retrieval-Augmented Generation (RAG)


### Explanation
In the mini RAG example:
1. We store a few small documents.
2. We turn them into simple numeric vectors (bag-of-words counts).
3. We use **FAISS** to find the document closest to a query.

This demonstrates only the retrieval step of retrieval-augmented generation, on a small scale. In a full pipeline, the retrieved text would be added to the prompt before the model generates an answer.

In [6]:
# Mini Retrieval-Augmented Generation Demo

# Our small 'document database'
docs = [
    "Python is a popular programming language created by Guido van Rossum.",
    "Transformers are neural networks that use self-attention mechanisms.",
    "LoRA reduces the number of trainable parameters in LLM fine-tuning."
]

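# Embed documents (simple: bag-of-words style vector)
# Note: splitting on whitespace keeps punctuation attached to words (e.g. "rossum.")
vocab = list(set(" ".join(docs).lower().split()))
word_to_idx = {w: i for i, w in enumerate(vocab)}

def embed(text):
    # Count how often each vocabulary word occurs; FAISS works on float32 vectors
    vec = np.zeros(len(vocab), dtype="float32")
    for word in text.lower().split():
        if word in word_to_idx:
            vec[word_to_idx[word]] += 1
    return vec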

doc_embeddings = np.stack([embed(d) for d in docs])

# Create FAISS index
index = faiss.IndexFlatL2(len(vocab))
index.add(doc_embeddings)

# Query
query = "Who developed Python?"
query_vec = embed(query).reshape(1, -1)

# Search
D, I = index.search(query_vec, k=1)
print("Query:", query)
print("Retrieved document:", docs[I[0][0]])

Query: Who developed Python?
Retrieved document: Transformers are neural networks that use self-attention mechanisms.
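The retrieved document above is not the one about Python. Two things in the cell explain this. First, splitting on whitespace turns the query into `who`, `developed`, `python?`, and none of these is in the vocabulary (the document contains `python`, without the question mark), so the query vector is all zeros. Second, L2 distance from an all-zero vector simply picks the document with the fewest words. Stripping punctuation alone would not be enough, because L2 distance on raw counts still favors shorter documents.

The sketch below is one possible fix, not the notebook's method. It swaps the FAISS L2 search for a plain-Python cosine similarity so that it runs without extra dependencies; the helper names `tokenize`, `cosine`, and `retrieve` and the prompt template are hypothetical.

```python
import math
import re
from collections import Counter

docs = [
    "Python is a popular programming language created by Guido van Rossum.",
    "Transformers are neural networks that use self-attention mechanisms.",
    "LoRA reduces the number of trainable parameters in LLM fine-tuning.",
]

def tokenize(text):
    # Lowercase and drop punctuation, keeping hyphenated words: "Python?" -> "python"
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def cosine(a, b):
    # a and b are Counters of word counts; a missing word counts as 0
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc_vecs = [Counter(tokenize(d)) for d in docs]

def retrieve(query):
    """Return the document with the highest cosine similarity, and its score."""
    q = Counter(tokenize(query))
    scores = [cosine(q, d) for d in doc_vecs]
    best = max(range(len(docs)), key=scores.__getitem__)
    return docs[best], scores[best]

query = "Who developed Python?"
context, score = retrieve(query)
print("Retrieved document:", context)

# The "augmented" step: place the retrieved text in the prompt (illustrative template)
augmented_prompt = f"Context: {context}\n\nQuestion: {query}\nAnswer:"
print(augmented_prompt)
```

With FAISS, a comparable effect can be had by L2-normalizing the vectors and searching an inner-product index (`faiss.IndexFlatIP`) instead of `IndexFlatL2`.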



---
## Educational Purpose Disclaimer
This is a **basic educational version** of the concepts behind LLMs, LoRA/QLoRA, Prompt Engineering, and RAG.  
- The models used here are **tiny test models** that do not produce accurate or factual answers.  
- This notebook is intended **purely for learning how these systems work**, not for real-world AI deployment.
