# 🧑‍🏫 Types of Foundation Models

---

### 🔹 **LLM (Large Language Models)**

* **What**: Trained on huge text corpora to understand, generate, and reason with natural language.
* **Open-Source Examples**:

  * **LLaMA 3** (Meta)
  * **Mistral** (Mistral AI)
  * **Falcon** (TII, UAE)
  * **RedPajama** (Together)
* **Closed/Paid Examples**:

  * **GPT-4 / GPT-4o** (OpenAI)
  * **Claude 3** (Anthropic)
  * **Gemini** (Google DeepMind)
* **Use Cases**: Chatbots, coding assistants, reasoning, customer support, translation, summarization.

---

### 🔹 **VLM (Vision-Language Models)**

* **What**: Combine images + text for multimodal reasoning.
* **Open-Source Examples**:

  * **LLaVA** (fine-tuned LLaMA + CLIP)
  * **BLIP-2** (Salesforce Research)
  * **Kosmos-2** (Microsoft)
* **Closed/Paid Examples**:

  * **GPT-4o** (OpenAI)
  * **Gemini 1.5 Pro** (Google)
  * **Claude 3.5 Sonnet** (Anthropic; accepts image input)
* **Use Cases**: Image captioning, OCR, visual QA, accessibility apps, diagram/code explanation.

---

### 🔹 **Embedding Models**

* **What**: Convert text, image, or audio into **vectors (dense numerical representations)**.
* **Open-Source Examples**:

  * **Sentence-BERT (SBERT)**
  * **E5 Embeddings** (Microsoft)
  * **InstructorXL** (HKU NLP and collaborators)
* **Closed/Paid Examples**:

  * **OpenAI text-embedding-3-small/large**
  * **Cohere Embed**
* **Use Cases**: Semantic search, clustering, personalization, retrieval-augmented generation (RAG).

---

### 🔹 **Video Captioning Models**

* **What**: Extend VLMs to video by modeling **temporal sequence of frames**.
* **Open-Source Examples**:

  * **Video-LLaMA**
  * **InternVideo**
  * **MERLOT Reserve**
* **Closed/Paid Examples**:

  * **PaLI-X** (Google)
  * **Flamingo** (DeepMind)
  * **Gemini multimodal video**
* **Use Cases**: Automatic subtitles, meeting/video summarization, surveillance analytics, YouTube automation.

---

### 🔹 **Speech-to-Text (ASR)**

* **What**: Convert spoken audio → text.
* **Open-Source Examples**:

  * **Whisper** (OpenAI, also runs locally)
  * **NVIDIA NeMo ASR**
* **Closed/Paid Examples**:

  * **Deepgram API**
  * **AssemblyAI**
  * **Rev.ai**
* **Use Cases**: Meeting transcription, voice assistants, call center analytics.

---

### 🔹 **Text-to-Speech (TTS)**

* **What**: Convert text → natural sounding speech.
* **Open-Source Examples**:

  * **Coqui TTS**
  * **Tacotron 2 / FastSpeech**
* **Closed/Paid Examples**:

  * **ElevenLabs Voice AI**
  * **Azure Speech**
  * **Amazon Polly**
* **Use Cases**: Audiobooks, dubbing, chatbots with voice, accessibility tools.

---

### 🔹 **Diffusion Models (Generative Image/Video)**

* **What**: Use iterative noise → denoising to generate images/videos from text.
* **Open-Source Examples**:

  * **Stable Diffusion (SDXL)**
  * **Kandinsky**
  * **DeepFloyd IF**
* **Closed/Paid Examples**:

  * **DALL·E 3** (OpenAI)
  * **MidJourney**
* **Use Cases**: Digital art, advertising, product mockups, entertainment, synthetic dataset creation.

---

### 🔹 **Audio-Language Models (ALM)**

* **What**: Handle both audio + text (not just recognition, but reasoning).
* **Open-Source Examples**:

  * **AudioLM (Google, limited release)**
  * **Bark TTS + embeddings** (community projects)
* **Closed/Paid Examples**:

  * **GPT-4o (voice mode)**
  * **Sonantic AI** (used in movies)
* **Use Cases**: Conversational AI with voice, singing generation, podcast tools.

---

### 🔹 **RLHF + Instruction-Tuned Variants**

* **What**: Base LLMs fine-tuned on instruction data and/or **human feedback** (RLHF) to align with user intent.
* **Open-Source Examples**:

  * **Alpaca, Vicuna** (LLaMA fine-tuned on instruction/conversation data, without RLHF)
  * **OpenAssistant** (LAION)
* **Closed/Paid Examples**:

  * Already baked into **ChatGPT, Claude, Gemini**.
* **Use Cases**: Safer, instruction-following assistants.

---

### 🔹 **Code Models (Specialized LLMs for Programming)**

* **What**: LLMs fine-tuned on source code datasets.
* **Open-Source Examples**:

  * **Code LLaMA**
  * **StarCoder / StarCoder2** (BigCode, a Hugging Face + ServiceNow project)
* **Closed/Paid Examples**:

  * **GPT-4 (code interpreter / advanced reasoning)**
  * **Claude 3 Sonnet for coding**
* **Use Cases**: Pair programming, code completion, debugging, data science automation.

---

✅ This gives you a **comprehensive map of model families**:

* **Text → LLMs**
* **Image+Text → VLMs**
* **Embeddings → semantic vector space**
* **Video → temporal multimodal models**
* **Speech & Audio → ASR, TTS, ALMs**
* **Generative Art → Diffusion models**
* **Specializations → Code, Instruction-tuned, etc.**

# ⚙️ Useful Parameters in LLMs (Explained Simply)

---

### 🔥 1. `temperature`

* **What it does**: Controls how “creative” or “random” the model is.
* **Range**: `0.0 → 2.0`
* **How to think**:

  * **Low (0.0–0.3)** → Very serious, factual, predictable.
  * **High (0.7–1.2)** → More creative, story-like, less predictable.
* **Example**:

  * Question: *“Tell me a synonym of happy.”*
  * `temperature=0.1` → “joyful” (almost always the same).
  * `temperature=1.0` → “joyful, delighted, ecstatic…” (varies each time).
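
As a rough illustration of why a low temperature makes the output predictable, the sketch below divides a few scores (logits) by the temperature before converting them to probabilities. The words and scores are made up for the example, and `softmax_with_temperature` is a hypothetical helper; real models do the same thing over their whole vocabulary.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Many APIs treat 0 as "always pick the top word" instead of dividing by it.
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                           # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for synonyms of "happy"
words = ["joyful", "delighted", "ecstatic"]
logits = [2.0, 1.5, 1.0]

for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, {w: round(p, 3) for w, p in zip(words, probs)})
```

At `0.1` nearly all of the probability lands on “joyful”; at `1.0` and `2.0` it spreads across the other words, so repeated runs start to differ.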

---

### 🎯 2. `top_p` (a.k.a. **nucleus sampling**)

* **What it does**: Instead of considering every possible word, the model samples only from the *smallest set of likely words* whose probabilities add up to at least `p`.
* **Range**: `0 → 1`
* **How to think**:

  * **Low (0.1–0.3)** → Only choose from top-most likely words → safe & focused.
  * **High (0.9–1.0)** → Allow more variety, riskier choices.
* **Example**:

  * `top_p=0.2` → “joyful” (safe pick).
  * `top_p=0.9` → Could be “joyful, delighted, thrilled, excited…”

---

### 📝 3. `max_tokens`

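* **What it does**: Sets the **maximum length** of the answer, counted in tokens (word pieces), not words.
* **Why important**: Caps cost and latency. The model does not plan around the limit; the answer is simply cut off once the limit is reached.
* **Example**:

  * `max_tokens=10` → “Happy means joyful or delighted.” (anything longer would be cut off mid-sentence)
  * `max_tokens=100` → Room for a longer, detailed paragraph.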

---

### 🔄 4. `presence_penalty`

* **What it does**: Penalizes any word that has **already appeared at least once**, which nudges the model toward new words and topics.
* **Range**: `-2 → +2`
* **How to think**:

  * **High (≥1.0)** → Model talks about *new* things, explores variety.
  * **Low (≤0.0)** → Model stays on the same topic.
* **Example**:

  * Ask: “Tell me about fruits.”
  * `presence_penalty=0` → Talks mostly about apples & bananas.
  * `presence_penalty=1.5` → Talks about apples, bananas, oranges, mangoes…

---

### 🔁 5. `frequency_penalty`

* **What it does**: Penalizes words **in proportion to how often they have already appeared**, which discourages repeating the same words again and again.
* **Range**: `-2 → +2`
* **Example**:

  * Without penalty → “happy happy happy happy happy…”
  * With penalty=1.5 → “happy, joyful, delighted, glad…”
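
The difference between `presence_penalty` and `frequency_penalty` is easier to see in code. The sketch below follows the adjustment OpenAI documents for its API (a flat subtraction once a token has appeared, plus a subtraction that grows with its count); other providers may implement it differently. The helper name and the scores are hypothetical.

```python
from collections import Counter

def apply_penalties(logits, generated_tokens, presence_penalty=0.0, frequency_penalty=0.0):
    """Return a copy of `logits` with repetition penalties subtracted."""
    counts = Counter(generated_tokens)
    adjusted = dict(logits)
    for token, count in counts.items():
        if token in adjusted:
            adjusted[token] -= presence_penalty            # flat, once the token has appeared
            adjusted[token] -= frequency_penalty * count   # grows with every repeat
    return adjusted

logits = {"happy": 3.0, "joyful": 2.5, "glad": 2.0}   # hypothetical scores
history = ["happy", "happy", "happy", "joyful"]        # tokens generated so far

print(apply_penalties(logits, history, presence_penalty=1.0))
print(apply_penalties(logits, history, frequency_penalty=1.0))
```

With only the presence penalty, “happy” and “joyful” lose the same amount even though “happy” was used three times. With only the frequency penalty, “happy” loses three times as much as “joyful”.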

---

### 🛑 6. `stop` sequences

* **What it does**: Tells model **when to stop**.
* **Example**:

  ```python
  stop=["<END>", "\n\n"]
  ```

  If the model generates `<END>` or a blank line, generation stops immediately. With most APIs the stop sequence itself is not included in the returned text.
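
The effect of stop sequences can be imitated after the fact with a small helper. This is only a sketch: real servers check for stop sequences while generating, and `truncate_at_stop` is a hypothetical name.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut `text` at the earliest stop sequence; the sequence itself is dropped."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

raw = "Answer: 42<END> and some trailing text"
print(truncate_at_stop(raw, ["<END>", "\n\n"]))   # prints: Answer: 42
```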

---

### 👨‍🏫 7. `system prompt` (in chat models)

* **What it does**: Sets the **role** or **personality** of the model.
* **Example**:

  * System: “You are a polite teacher.”
  * User: “What is AI?” → Answer is formal, simple.
  * System: “You are a funny stand-up comedian.”
  * User: “What is AI?” → Answer has jokes & humor.

---

### 🔎 8. `top_k` (mostly in open-source tooling, like Ollama, Hugging Face)

* **What it does**: Chooses from the **top k most likely words**.
* **Range**: `1 → 100` (or more)
* **How to think**:

  * **Low (k=1–5)** → Very strict, predictable.
  * **High (k=50–100)** → More random, surprising answers.

# 🔍 Difference Between `top_p` and `top_k`

---

### 🎯 **Top-k Sampling**

* The model **looks at the k most likely next words** and then samples one of them in proportion to its probability (renormalized over those k words).
* **k = 1** → always pick the single most likely word (deterministic).
* **k = 50** → pick randomly from the top 50 likely words.

👉 Example (task: “The sky is …”):

* Top-5 predictions: `blue (0.6), cloudy (0.2), dark (0.1), clear (0.05), red (0.05)`
* If `top_k=2` → sample from “blue” and “cloudy” only (renormalized to 0.75 and 0.25).
* If `top_k=5` → can choose any of the 5 words above.
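
The top-k example above can be sketched in a few lines of plain Python. `top_k_filter` is a hypothetical helper, not a library function; it keeps the k most likely words and rescales their probabilities so they sum to 1 before sampling.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most likely words and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {word: p / total for word, p in top}

probs = {"blue": 0.6, "cloudy": 0.2, "dark": 0.1, "clear": 0.05, "red": 0.05}

candidates = top_k_filter(probs, k=2)
print({w: round(p, 2) for w, p in candidates.items()})   # blue ≈ 0.75, cloudy ≈ 0.25

# Sample one word according to the renormalized probabilities
word = random.choices(list(candidates), weights=list(candidates.values()), k=1)[0]
print(word)
```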

---

### 🎯 **Top-p (Nucleus Sampling)**

* Instead of a fixed number, `top_p` looks at the **smallest set of words whose probabilities add up to p**.
* Example with same predictions:

  * “blue (0.6), cloudy (0.2), dark (0.1), clear (0.05), red (0.05)”
  * If `top_p=0.7` → picks from {blue, cloudy} (because 0.6 + 0.2 ≥ 0.7).
  * If `top_p=0.9` → picks from {blue, cloudy, dark} (0.6 + 0.2 + 0.1 = 0.9).
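
The same distribution can be pushed through a small nucleus filter. This is a minimal sketch with a hypothetical helper name; the `eps` tolerance is an addition of this example so that floating-point rounding in sums such as `0.6 + 0.2 + 0.1` does not pull in an extra word.

```python
def top_p_filter(probs, p, eps=1e-9):
    """Keep the smallest set of most-likely words whose total probability reaches p."""
    kept, cumulative = {}, 0.0
    for word, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[word] = prob
        cumulative += prob
        if cumulative >= p - eps:   # eps guards against floating-point rounding
            break
    # Renormalize so the kept probabilities sum to 1
    return {word: prob / cumulative for word, prob in kept.items()}

probs = {"blue": 0.6, "cloudy": 0.2, "dark": 0.1, "clear": 0.05, "red": 0.05}

print(sorted(top_p_filter(probs, 0.7)))   # ['blue', 'cloudy']
print(sorted(top_p_filter(probs, 0.9)))   # ['blue', 'cloudy', 'dark']
```

Unlike top-k, the number of surviving words changes with the shape of the distribution: a confident model keeps few candidates, an uncertain one keeps many.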

---

## 🔹 LLM (Text): Mistral via Ollama

Task: classify sentiment and explain briefly.

```python
# llm_mistral.py
import ollama

SYSTEM = "You are a concise NLP assistant. Always give a short answer, then one-line reason."
USER = 'Classify sentiment of: "The food was amazing but the service was slow" (Positive/Negative/Mixed)'

resp = ollama.chat(
    model="mistral:latest",
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": USER},
    ],
    options={
        "temperature": 0.3,   # low temperature for a stable classification
        "top_p": 0.9,
        "num_predict": 150,   # Ollama's name for the max-tokens limit
    },
)
print(resp["message"]["content"])
```

Output:

 Mixed. Positive sentiment towards food, negative sentiment towards service.


## 🔹 VLM (Vision-Language Model)

**Definition:**
A Vision-Language Model (VLM) is a type of AI model that can understand both images and text. It connects visual information (images, videos) with language, allowing the model to describe, answer questions about, or reason with visual content.

**Key Features:**

* **Input**: Images + optional text prompt
* **Output**: Text (captions, answers, descriptions)
* **Functionality**: Can do tasks like:

  * Image captioning → describing what’s in an image
  * Visual Question Answering (VQA) → answering questions about an image
  * Image-grounded reasoning → understanding a text prompt in the context of an image

```python
# 📦 Import Libraries
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
from PIL import Image
import torch

# 🔹 Paths (update if needed)
MODEL_PATH = "/kaggle/input/input-dataset/vit-gpt2-image-captioning"  # uploaded model path
IMAGE_PATH = "/kaggle/input/apple-pie/Apple_pie.jpg"                # uploaded image

# 🔹 Device
device = "cuda" if torch.cuda.is_available() else "cpu"

# 🔹 Load model offline
model = VisionEncoderDecoderModel.from_pretrained(MODEL_PATH).to(device)
processor = ViTImageProcessor.from_pretrained(MODEL_PATH)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

# 🔹 Load and preprocess image
image = Image.open(IMAGE_PATH).convert("RGB")

# Note: vit-gpt2-image-captioning is a caption-only model. It takes pixels as
# input and cannot be steered by a text prompt, so no prompt is passed here.
pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device)

# 🔹 Generate caption using sampling (avoids GPT2 beam search error)
with torch.no_grad():  # inference only, no gradients needed
    output_ids = model.generate(
        pixel_values,
        max_length=20,       # maximum caption length
        do_sample=True,      # enable sampling
        top_k=50,            # sample from top 50 tokens
        top_p=0.95,          # nucleus sampling
        temperature=1.0,     # creativity level
    )

# 🔹 Decode caption
caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print("📝 Caption:", caption)
```

Output (sampling is enabled, so the caption varies between runs):

📝 Caption: a large pastry with apples, cream, and nut topping

## **1️⃣ What is an Embedding Model?**

**Definition:**
An **Embedding Model** is a type of AI model that converts **text, images, or other data into numerical vectors** (arrays of numbers). These vectors capture the **semantic meaning** of the input, so similar inputs are close together in the vector space.

* Input → Embedding Model → Vector (array of numbers)
* These vectors can then be used for:

  * Semantic search
  * Recommendation systems
  * Clustering
  * Retrieval-augmented generation (RAG)

---

### **2️⃣ Examples of Embedding Models**

| Type        | Model                                    | Input        | Output | Notes                        |
| ----------- | ---------------------------------------- | ------------ | ------ | ---------------------------- |
| Text        | `sentence-transformers/all-MiniLM-L6-v2` | Text         | Vector | Lightweight, fast            |
| Text        | `text-embedding-3-small`                 | Text         | Vector | OpenAI API                   |
| Image       | `openai/clip-vit-base-patch32`           | Image        | Vector | CLIP embeddings, multi-modal |
| Multi-modal | `openai/clip-vit-large-patch14`          | Text + Image | Vector | Multi-modal similarity       |

---

### **3️⃣ How embeddings work (conceptual)**

* Input sentences:

  1. `"I love pizza"` → `[0.12, -0.45, 0.88, ...]`
  2. `"I like burgers"` → `[0.15, -0.48, 0.85, ...]`
* Cosine similarity between vectors is high → semantically similar sentences.
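
To make the comparison concrete, here is cosine similarity written out in plain Python. The first two vectors are just the three leading values shown above (real embeddings have hundreds of dimensions), and the third vector is a made-up stand-in for an unrelated sentence.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

pizza   = [0.12, -0.45, 0.88]    # "I love pizza" (first 3 dims from the text)
burgers = [0.15, -0.48, 0.85]    # "I like burgers"
taxes   = [-0.70, 0.60, 0.10]    # hypothetical unrelated sentence

print(round(cosine_similarity(pizza, burgers), 3))
print(round(cosine_similarity(pizza, taxes), 3))
```

The first score comes out close to 1, while the second is far lower, which is the signal that semantic search and clustering rely on.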

---

### **4️⃣ Use Cases**

* Semantic search: find documents similar in meaning to a query
* Clustering: group similar texts or images together
* Recommendation engines: suggest items based on similarity
* RAG: find relevant context for LLMs

---

### ✅ Notes:

1. **Embeddings are vectors**, not human-readable text.
2. **Similar meaning → vectors close together**, different meaning → far apart.
3. Lightweight models like `all-MiniLM-L6-v2` are small enough to run on CPU, which makes them a good fit for **offline Kaggle/Colab** notebooks.


```python
from sentence_transformers import SentenceTransformer
import numpy as np
from numpy.linalg import norm

# Load model (offline compatible if uploaded as dataset)
model_name = "sentence-transformers/all-MiniLM-L6-v2"
model = SentenceTransformer(model_name)

# Sample sentences
sentences = [
    "I love machine learning",
    "Artificial intelligence is fascinating",
    "I enjoy eating pizza"
]

# Generate embeddings
embeddings = model.encode(sentences)

# Display embeddings
for i, vec in enumerate(embeddings):
    print(f"Sentence: {sentences[i]}")
    print(f"Embedding vector (first 5 dims): {vec[:5]}\n")

# Cosine similarity between first two sentences
cos_sim = np.dot(embeddings[0], embeddings[1]) / (norm(embeddings[0]) * norm(embeddings[1]))
print("Cosine similarity between sentence 1 and 2:", cos_sim)
```

Output:

```text
Sentence: I love machine learning
Embedding vector (first 5 dims): [-0.04363637 -0.05905436  0.08201238 -0.01076719  0.0611959 ]

Sentence: Artificial intelligence is fascinating
Embedding vector (first 5 dims): [-0.02487095 -0.06024496  0.0552827  -0.00576778  0.0083817 ]

Sentence: I enjoy eating pizza
Embedding vector (first 5 dims): [-0.03766708  0.01072115 -0.0045544   0.11080603 -0.07481083]

Cosine similarity between sentence 1 and 2: 0.6323954
```


## **1️⃣ What is a Text-to-Speech (TTS) Model?**

**Definition:**
A Text-to-Speech (TTS) model is an AI system that converts written text into spoken audio. Essentially, it “reads aloud” text with natural-sounding voices.

* Input → Text → TTS Model → Audio output
* The audio can be MP3, WAV, or any playable format.

---

## **2️⃣ Key Features**

* Multi-language support (English, Hindi, Spanish, etc.)
* Voice selection (male, female, robotic, expressive)
* Adjustable speed, pitch, and volume
* Some models support emotional tone (happy, sad, neutral)

---

## **3️⃣ Examples of TTS Models**

| Model             | Open Source / Paid | Notes                                                       |
| ----------------- | ------------------ | ----------------------------------------------------------- |
| ElevenLabs TTS    | Paid API           | Extremely realistic voices, expressive                      |
| Coqui TTS         | Open Source        | Trainable TTS, offline                                      |
| gTTS (Google TTS) | Open Source / Free | Lightweight, simple usage; calls Google's online TTS service, so it needs internet |
| VITS / Bark       | Open Source        | Neural TTS, expressive voices, multi-language               |

---

## **4️⃣ Use Cases**

* Voice assistants (Alexa, Siri, Google Assistant)
* Audiobooks & podcasts generation
* Accessibility (screen readers for visually impaired)
* Multi-modal AI apps (image-to-speech, chatbot reading responses)

```python
# %pip install gtts ipython

from gtts import gTTS
import IPython.display as ipd

# Text to convert
text = "Hello! My name is Suraj Patra and I am from Rourkela and I like to play cricket."

# Create TTS object
tts = gTTS(text=text, lang='en')  # lang='hi' for Hindi

# Save audio file
audio_path = "tts_output.mp3"
tts.save(audio_path)

# Play audio in notebook
ipd.display(ipd.Audio(audio_path))
```
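
gTTS writes an MP3, while code that reads audio through Python's `wave` module needs an uncompressed WAV file. One way to bridge the two, assuming the `ffmpeg` command-line tool is installed, is sketched below. The helper names and the 16 kHz sample rate are illustrative choices for this example, not something this notebook prescribes.

```python
import shutil
import subprocess

def ffmpeg_wav_command(src, dst, sample_rate=16000):
    """Build an ffmpeg command that converts `src` to mono 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y",            # overwrite dst if it already exists
        "-i", src,                 # input file (e.g. the MP3 from gTTS)
        "-ac", "1",                # 1 audio channel (mono)
        "-ar", str(sample_rate),   # sample rate in Hz
        "-c:a", "pcm_s16le",       # 16-bit little-endian PCM
        dst,
    ]

def convert_to_wav(src, dst, sample_rate=16000):
    """Run the conversion; raises if ffmpeg is missing or the conversion fails."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    subprocess.run(ffmpeg_wav_command(src, dst, sample_rate), check=True)

print(" ".join(ffmpeg_wav_command("tts_output.mp3", "tts_output.wav")))
# convert_to_wav("tts_output.mp3", "tts_output.wav")  # uncomment to run the conversion
```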

## **1️⃣ What is Speech-to-Text (STT)?**

**Definition:**
STT models take **spoken audio** as input and convert it into **written text**.

* Input → Audio (MP3/WAV) → STT Model → Text output

**Use Cases:**

* Transcribing meetings, lectures, podcasts
* Voice assistants (Alexa/Siri) understanding commands
* Accessibility for hearing-impaired users
* Processing voice notes for AI applications

---

## **2️⃣ Popular STT Models**

| Model / Library               | Open Source / Paid | Notes                                           |
| ----------------------------- | ------------------ | ----------------------------------------------- |
| **Whisper (OpenAI)**          | Open Source        | Supports multiple languages, very accurate      |
| **DeepSpeech (Mozilla)**      | Open Source        | Offline-friendly                                |
| **Coqui STT**                 | Open Source        | Fork of DeepSpeech, offline possible            |
| **Vosk**                      | Open Source        | Offline, lightweight; used in the example below |
| **Google Speech-to-Text API** | Paid/Free          | Requires internet                               |
| **Azure Speech Service**      | Paid               | High accuracy, multi-language                   |



```python
import json
import os
import wave

from vosk import Model, KaldiRecognizer

# ----------------------------
# Step 1: Set paths
# ----------------------------
audio_wav = "tts_output.wav"                    # Your converted WAV
model_path = "C:\\vosk-model-small-en-us-0.15"  # Vosk model folder

# ----------------------------
# Step 2: Check audio file
# ----------------------------
if not os.path.isfile(audio_wav):
    raise FileNotFoundError(f"Audio file not found: {audio_wav}")
print("✅ Audio file found:", audio_wav)

# ----------------------------
# Step 3: Load Vosk model
# ----------------------------
model = Model(model_path)
wf = wave.open(audio_wav, "rb")

# Ensure uncompressed 16-bit mono PCM
if wf.getnchannels() != 1 or wf.getsampwidth() != 2 or wf.getcomptype() != "NONE":
    raise ValueError("Audio must be WAV mono PCM")

rec = KaldiRecognizer(model, wf.getframerate())

# ----------------------------
# Step 4: Transcribe audio
# ----------------------------
transcribed_text = ""
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    # AcceptWaveform returns True when a full utterance has been recognized
    if rec.AcceptWaveform(data):
        res = json.loads(rec.Result())
        transcribed_text += res.get("text", "") + " "

# Add the final result for audio still buffered in the recognizer
res = json.loads(rec.FinalResult())
transcribed_text += res.get("text", "")
wf.close()

# ----------------------------
# Step 5: Print transcription
# ----------------------------
print("\n🎤 Transcribed Text:\n", transcribed_text)
```

Output:

```text
✅ Audio file found: tts_output.wav

🎤 Transcribed Text:
 hello my name is sewerage bought through and i am from role gala and i like to play cricket
```

The small Vosk model gets the proper nouns wrong (“Suraj Patra”, “Rourkela”), most likely because they are missing from its vocabulary; the rest of the sentence is transcribed correctly.


## 🔹 What is a Diffusion Model?

A **diffusion model** generates data (like an image, audio, video) by **learning to reverse a noising process**.

1. **Forward Process (Diffusion)**

   * You take an image and gradually add noise (like static on a TV) until it becomes pure noise.
   * This is the *forward process*, a Markov chain in which each step depends only on the previous one.

2. **Reverse Process (Denoising / Generation)**

   * The model learns to start from random noise and step by step remove noise until it reconstructs a meaningful image.
   * Each denoising step is guided by the model’s learned patterns.

👉 That’s why some image generators show a preview that **starts out blurry/noisy and then sharpens**: you are watching the denoising steps.
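
The forward (noising) half can be sketched without any deep-learning library. The example below assumes the linear noise schedule from the original DDPM paper (1000 steps, variances from 1e-4 to 0.02) and uses the closed-form shortcut that jumps straight to step `t`. The function names and the four-value "image" are hypothetical, and the learned reverse process is not shown because it needs a trained network.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, increasing linearly from beta_start to beta_end."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar[t] = product of (1 - beta) up to step t: how much signal survives."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, alpha_bar, rng):
    """Noisy sample at one step: sqrt(a) * x0 + sqrt(1 - a) * gaussian noise."""
    return [
        math.sqrt(alpha_bar) * value + math.sqrt(1.0 - alpha_bar) * rng.gauss(0.0, 1.0)
        for value in x0
    ]

rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
pixels = [1.0, -1.0, 0.5, 0.0]    # a tiny stand-in for an image

for t in (0, 499, 999):
    noisy = add_noise(pixels, alpha_bars[t], rng)
    print(t, round(alpha_bars[t], 4), [round(v, 2) for v in noisy])
```

At step 0 the values are almost unchanged; by the last step the surviving signal fraction is close to zero, so the output is essentially pure noise.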

---

## 🔹 Types of Diffusion Models

1. **DDPM (Denoising Diffusion Probabilistic Models)** → original baseline.
2. **DDIM (Denoising Diffusion Implicit Models)** → faster, fewer steps.
3. **Latent Diffusion (Stable Diffusion)** → runs in compressed latent space → much faster & smaller memory use.
4. **Text-to-Image Diffusion** (Stable Diffusion, Imagen, DALL·E) → conditioned on text embeddings.
5. **Video Diffusion Models** (Runway Gen-2, Pika) → diffusion extended over time.
6. **Audio Diffusion** → noise-to-speech, noise-to-music.

---

## 🔹 Key Parameters in Diffusion Models

* **num\_inference\_steps** → number of denoising steps (more steps usually improve quality up to a point, but are slower).
* **guidance\_scale** → how strongly the text prompt controls generation (low = more creativity, high = more literal).
* **scheduler** → noise removal strategy (DDIM, Euler, Heun, DPM++ etc.).
* **seed** → controls randomness; same seed = reproducible result.
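
In many implementations, including common Stable Diffusion pipelines, `guidance_scale` is the weight in classifier-free guidance: at each step the model predicts noise once without the prompt and once with it, then extrapolates from the first toward the second. A minimal sketch with made-up numbers and a hypothetical function name:

```python
def classifier_free_guidance(noise_uncond, noise_cond, guidance_scale):
    """Blend unconditional and prompt-conditioned noise predictions."""
    return [
        u + guidance_scale * (c - u)
        for u, c in zip(noise_uncond, noise_cond)
    ]

noise_uncond = [0.10, -0.20, 0.05]   # hypothetical prediction with an empty prompt
noise_cond   = [0.30, -0.10, 0.00]   # hypothetical prediction with the text prompt

for scale in (1.0, 7.5):
    guided = classifier_free_guidance(noise_uncond, noise_cond, scale)
    print(scale, [round(v, 3) for v in guided])
```

A scale of 1 simply returns the prompt-conditioned prediction; larger values push further in the direction the prompt suggests, which is why high settings look more literal.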

---

