📚 Foundational Models, Prompt/Context Engineering, RAG Systems
Author: Emmanuel Obute
Connect: LinkedIn
Level: Beginner to Intermediate
Duration: 1-2 hours

🧠 1. Foundational Models
✅ Definition:
Foundational models are large pretrained AI models (e.g., GPT, LLaMA, Mistral) trained on massive, diverse datasets. They are general-purpose and can be adapted to many downstream tasks (e.g., chat, summarization, coding) without training a separate model for each task.

🔍 Characteristics:
Trained using self-supervised learning

Can be fine-tuned or used as-is (zero-shot/few-shot)

Examples: GPT-4, LLaMA, Mistral, Gemma, Claude, PaLM, Phi

📌 Why Important:
Serve as the backbone for applications like chatbots, RAG systems, and agent frameworks.

Reduce need for task-specific models.

📚 Common Models:
| Model     | Developer  | Size        | Notes                        |
| --------- | ---------- | ----------- | ---------------------------- |
| GPT-4     | OpenAI     | Undisclosed | Closed, API access only      |
| LLaMA 2/3 | Meta       | 7B–70B      | Open-weight, strong base     |
| Mistral   | Mistral AI | 7B          | Open, fast, high quality     |
| Phi-3     | Microsoft  | 3.8B–14B    | Lightweight, efficient       |
| Gemma     | Google     | 2B–7B       | Open-weight, tuned for speed |



When to Use What?
| Scenario                                             | Recommended Model    |
| ---------------------------------------------------- | -------------------- |
| Good quality with open weights                       | **Mistral 7B**       |
| Research & fine-tuning base models                   | **LLaMA 2 13B or 70B** |
| Local fast chat with smaller footprint               | **Gemma or Phi-3**   |
| Lightweight and speedy local inference               | **Phi-3**            |


In [None]:
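# Example: Using GPT-3.5 Turbo with OpenAI (openai>=1.0 client interface)

import os

from openai import OpenAI

# Read the key from the environment rather than hard-coding it.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the theory of relativity?"}
    ],
    temperature=0.3  # lower values give more focused, repeatable output
)

print(response.choices[0].message.content)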

In [None]:
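# Example: Using Mistral via Ollama (locally)
# Requires a running Ollama server with the model pulled: ollama pull mistral

from langchain_community.llms import Ollama

llm = Ollama(model="mistral")

# invoke() replaces the deprecated direct call llm("...")
print(llm.invoke("Explain Newton's First Law of Motion."))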


💬 2. Prompt / Context Engineering
✅ Definition:
Prompt (or context) engineering is the practice of designing the model's input, including instructions, background context, and examples, to guide an LLM toward accurate or desired outputs without fine-tuning the model itself.

🧰 Techniques:
| Technique        | Description                                                            |
| ---------------- | ---------------------------------------------------------------------- |
| Zero-shot        | Ask directly without examples                                          |
| Few-shot         | Include 1–5 examples to guide behavior                                 |
| Chain-of-thought | Include reasoning steps in prompt                                      |
| Role prompting   | Give model an identity or goal (e.g., “You are a helpful assistant…”)  |
| Formatting       | Use consistent structure (lists, markdown, etc.)                       |

📦 Prompt Components:
Instructions: What to do

Context: Background info

Examples: Show expected behavior

Input: The actual query/data
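
As a minimal sketch of how the four components above might be combined, the helper below joins them into one prompt string. The function name `build_prompt`, the `Instructions:` / `Context:` / `Input:` / `Output:` labels, and the sample sentiment task are illustrative assumptions, not a required format.

```python
from typing import List, Tuple


def build_prompt(
    instructions: str,
    context: str,
    examples: List[Tuple[str, str]],
    user_input: str,
) -> str:
    """Join the four prompt components into one string, in a fixed order."""
    parts = [f"Instructions: {instructions}"]
    if context:
        parts.append(f"Context: {context}")
    # Each example is an (input, expected output) pair shown to the model.
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}\nOutput: {example_output}")
    # The real query goes last, with the output left open for the model.
    parts.append(f"Input: {user_input}\nOutput:")
    return "\n\n".join(parts)


prompt = build_prompt(
    instructions="Classify the sentiment as positive or negative.",
    context="Reviews come from an online bookstore.",
    examples=[("I loved this book.", "positive"), ("A waste of money.", "negative")],
    user_input="The plot was gripping from start to finish.",
)
print(prompt)
```

Passing an empty `examples` list gives a zero-shot prompt; adding pairs turns the same template into a few-shot prompt.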

⚠️ Challenges:
LLMs are sensitive to wording

Token limits constrain long inputs

Output is non-deterministic at higher temperatures
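
To illustrate why higher temperatures make output less predictable, here is a sketch of the commonly described temperature-scaled softmax. The logit values are made up, and real providers may apply temperature differently in detail, so treat this as an illustration of the idea only.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert raw scores to probabilities, scaled by a sampling temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]

for t in (0.3, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature almost all probability sits on the top-scoring token, so sampling rarely picks anything else. At a high temperature the distribution flattens and repeated runs diverge more often.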

In [None]:
# Example 1: Zero-shot Prompt

zero_shot_prompt = "Summarize this: Artificial Intelligence is transforming industries..."


# Example 2: Few-shot Prompt
# The last answer is left blank so the model completes it (expected: Ottawa).

few_shot_prompt = """
Q: What is the capital of France?
A: Paris
Q: What is the capital of Japan?
A: Tokyo
Q: What is the capital of Canada?
A:"""


In [None]:
# Example with OpenAI (openai>=1.0 client interface)

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": """You are an expert summarizer.
        Summarize the following text in 2 bullet points:
        'The Great Wall of China is over 13,000 miles long and was built over centuries to protect Chinese territories from invasions.'"""
    }]
)

print(response.choices[0].message.content)

🔍 3. RAG Systems (Retrieval-Augmented Generation)
✅ Definition:
RAG combines LLMs with external knowledge (retrieved in real time) to improve the factual accuracy and relevance of answers.

🔁 Process Flow:
User Query

Retriever fetches relevant documents from a vector database (e.g. Chroma, FAISS)

Generator (LLM) uses retrieved context to answer
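
The three steps above can be sketched end to end in plain Python. This toy version swaps embedding similarity for simple word overlap so that it runs without a vector database, and the `generate` argument is a placeholder for a real LLM call. All function names here are illustrative assumptions.

```python
import re
from typing import Callable, List


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: List[str], top_k: int = 2) -> List[str]:
    """Toy retriever: rank documents by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_rag_prompt(query: str, context_docs: List[str]) -> str:
    """Place the retrieved text ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


def answer(query: str, documents: List[str], generate: Callable[[str], str]) -> str:
    """Run the three steps: take the query, retrieve, then generate."""
    context_docs = retrieve(query, documents)
    return generate(build_rag_prompt(query, context_docs))


documents = [
    "FastAPI is a Python web framework for building APIs.",
    "Chroma and FAISS store embeddings for similarity search.",
    "The Great Wall of China was built over centuries.",
]

# Placeholder generator that just echoes the prompt; swap in a real LLM call.
print(answer("Which tools store embeddings?", documents, generate=lambda p: p))
```

In a real system, `retrieve` would query a vector store and `generate` would call a model such as the ones shown in section 1.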

🔧 Key Components:
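| Component       | Role                                               |
| --------------- | -------------------------------------------------- |
| Vector Store    | Stores embeddings for retrieval (Chroma, FAISS)    |
| Embedding model | Turns text into vectors (e.g., all-MiniLM-L6-v2)   |
| Retriever       | Finds top-K relevant chunks                        |
| LLM             | Generates answers using the context                |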
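
To make the vector store and retriever roles concrete, here is a small in-memory sketch that ranks stored chunks by cosine similarity. `InMemoryVectorStore` is a hypothetical stand-in, not the Chroma or FAISS API, and the 3-dimensional embeddings are made up. A real embedding model such as all-MiniLM-L6-v2 would produce much longer vectors.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector store such as Chroma or FAISS."""

    def __init__(self) -> None:
        self._items: List[Tuple[List[float], str]] = []

    def add(self, embedding: List[float], text: str) -> None:
        self._items.append((embedding, text))

    def search(self, query_embedding: List[float], top_k: int = 2) -> List[Tuple[float, str]]:
        """Return the top_k (score, text) pairs, highest similarity first."""
        scored = [
            (cosine_similarity(query_embedding, emb), text)
            for emb, text in self._items
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


store = InMemoryVectorStore()
# Made-up 3-dimensional embeddings; a real model would produce these vectors.
store.add([0.9, 0.1, 0.0], "Chunk about vector databases")
store.add([0.0, 0.8, 0.2], "Chunk about web frameworks")
store.add([0.1, 0.0, 0.9], "Chunk about history")

for score, text in store.search([1.0, 0.2, 0.0], top_k=2):
    print(round(score, 3), text)
```

This version scans every stored vector on each search, which is fine for a handful of chunks. Dedicated vector stores use indexes to keep search fast on large collections.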

🧠 Why Use RAG?
Reduces hallucination by grounding answers in retrieved text

Adds real-time or proprietary knowledge

Updates what the model can draw on without fine-tuning

📌 Tools:
LangChain or LlamaIndex for orchestration

OpenAI, Ollama, Hugging Face models

FastAPI to serve RAG as an API

📝 FastAPI – Notes
📌 Overview:
FastAPI is a modern, fast (high-performance) web framework for building APIs with Python 3.7+ based on standard Python type hints.

Created by Sebastián Ramírez

Built on top of Starlette (for web handling) and Pydantic (for data validation)

Designed for building RESTful APIs quickly and efficiently

🚀 Key Features:
Fast: One of the fastest Python frameworks thanks to ASGI and async support.

Type Hints: Uses Python type hints for request data validation, editor autocompletion, and documentation.

Automatic Docs: Generates an OpenAPI schema and serves interactive documentation at:

/docs (Swagger UI)

/redoc (ReDoc)

Async Support: Supports async/await natively for non-blocking I/O operations.

Data Validation: Powered by Pydantic for strong validation and serialization.

Dependency Injection: Clean way to manage dependencies in routes.

In [None]:
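from typing import Optional

from fastapi import FastAPI

app = FastAPI()

@app.get("/")
def read_root():
    return {"message": "Hello, FastAPI!"}

@app.get("/items/{item_id}")
def read_item(item_id: int, q: Optional[str] = None):
    # item_id comes from the path; q is an optional query parameter.
    return {"item_id": item_id, "q": q}


# Run with: uvicorn main:app --reload  (assumes this file is saved as main.py)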