# Lesson 2: From Scratch to Scalable

>This notebook is based on the open-source project [wow-rag](https://github.com/datawhalechina/wow-rag) by Datawhale China.  
>I’ve adapted and annotated parts of it for personal learning and experimentation.


###  Integration Options for LLMs and Embeddings in LlamaIndex

LlamaIndex provides several flexible ways to integrate LLMs and embedding models into your RAG pipeline.  
While we’ll start with OpenAI for simplicity, it's helpful to understand the broader ecosystem and future possibilities.

#### Available Integration Options:

1. **Use official LlamaIndex client packages**  
   For providers like ZhipuAI or Yi-34B, LlamaIndex offers pre-built wrappers that make integration seamless.

2. **Use OpenAI-compatible APIs**  *(our choice for now)*  
   This includes OpenAI's official endpoints and any API that mimics OpenAI's interface (e.g., OpenRouter, Moonshot).  
   Since we are using OpenAI directly from abroad, the setup is straightforward and reliable.

3. **Use custom model classes**  
   LlamaIndex allows advanced users to implement their own `LLM` or `Embedding` classes, offering full control over how the models behave.

4. **Use local models via Ollama** *(exploration planned)*  
   Ollama allows running open-source models (like LLaMA, Mistral, Gemma) on your own machine with an OpenAI-compatible interface.  
   While we won’t use this in Lesson 2, we plan to explore its potential for offline, privacy-focused, or cost-efficient setups.

---

 In this notebook, we’ll proceed with **Option 2 (OpenAI)** to build a clean, minimal RAG prototype using official APIs.  
Later, we may experiment with **Option 4 (Ollama)** to compare local model performance and flexibility.


## 1. Introduction to LlamaIndex

###  What is LlamaIndex (formerly GPT Index)?

**LlamaIndex** is a Python library designed to **connect your documents to language models** (like GPT) in an efficient, flexible, and scalable way.

It simplifies and automates key steps in a **RAG (Retrieval-Augmented Generation)** pipeline, such as:

- Document ingestion
- Text chunking and indexing
- Embedding and storage
- Query retrieval
- Response generation

---

###  Why Use LlamaIndex in RAG?

RAG systems require multiple components to work together:
- Chunking
- Embedding
- Indexing
- Retrieving
- Feeding context to an LLM

**LlamaIndex wraps all of this into an easy-to-use interface**, so you can build a complete RAG pipeline with minimal boilerplate code.
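To make the steps above concrete, here is a toy, dependency-free sketch of what such a pipeline does by hand. It is not how LlamaIndex implements anything: the bag-of-words `embed` function is a stand-in for a real embedding model, and every function name here is made up for illustration.

```python
import math
import re
from collections import Counter

def chunk_text(text):
    """Chunking: split the text into sentences."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(chunks):
    """Indexing: keep each chunk next to its vector."""
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(index, query, top_k=1):
    """Retrieving: rank chunks by similarity to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

def build_prompt(query, context_chunks):
    """Feeding context to an LLM: assemble the final prompt string."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

document = (
    "LlamaIndex connects documents to language models. "
    "FAISS is a library for fast vector similarity search. "
    "Ollama runs open-source models on a local machine."
)
question = "Which library does vector similarity search?"
index = build_index(chunk_text(document))
top = retrieve(index, question)
prompt = build_prompt(question, top)
print(prompt)
```

The prompt printed at the end is what would be sent to the LLM. A framework replaces each of these hand-written functions with a tested component.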

---

###  Core Features of LlamaIndex

| Feature               | Purpose |
|------------------------|---------|
| **DocumentLoader**     | Load text from PDFs, websites, files, etc. |
| **TextSplitter**       | Automatically chunk text by sentence or tokens |
| **VectorStoreIndex**   | Store and search document embeddings |
| **QueryEngine**        | Combine retrieved context + LLM to answer questions |
| **Storage/Callbacks**  | Persist indexes, log metrics, integrate with LangChain |

---

### 🔍 Do I *Need* LlamaIndex?

| Situation                         | Recommendation |
|----------------------------------|----------------|
| Just learning or building from scratch | ❌ Not required — write your own code (like you did) |
| Want fast prototyping / scale up      | ✅ Helpful — simplifies multi-step RAG pipelines |
| Want to integrate with external data (PDFs, SQL, etc.) | ✅ Strongly recommended |

---

###  RAG Toolkits: LlamaIndex vs LangChain vs Others

When building a RAG pipeline, you can either:
- **Write everything from scratch** (manual chunking, embeddings, FAISS, prompt assembly)
- OR use **frameworks** that abstract and manage these steps

---

###  Common RAG Frameworks

| Tool/Library    | Description |
|-----------------|-------------|
| **LlamaIndex**  | Simplifies connecting documents to LLMs (chunking, embedding, querying) |
| **LangChain**   | A modular framework for building LLM-powered applications with chains, tools, agents |
| **Haystack**    | Enterprise-grade RAG toolkit with Elasticsearch, vector DBs, and pipelines |
| **Ragas**       | Focused on **evaluating** RAG systems (not building) |
| **PrivateGPT / GPTCache** | For private local inference and response caching |

---

###  LlamaIndex vs LangChain: Key Differences

| Feature              | **LlamaIndex**                         | **LangChain**                                 |
|----------------------|----------------------------------------|-----------------------------------------------|
| Goal                 | Simple RAG from documents              | General LLM application framework             |
| Abstraction Level    | High-level (document → answer)         | Mid-level (build your own chains, tools)      |
| Focus                | RAG & retrieval                        | Agents, tools, chains, prompts, retrieval     |
| Setup                | Easier for beginners                   | More flexible, but steeper learning curve     |
| Integration          | Built-in vector store support, OpenAI  | Works with tools, APIs, DBs, vector stores    |
| Use Case             | Document Q&A, chat with files          | Complex workflows, agents, multi-step tasks   |

---

###  When to Use Which?

| You want to...                                  | Recommended |
|--------------------------------------------------|-------------|
| Build simple or academic RAG from documents      | **LlamaIndex** |
| Connect LLMs to databases, tools, APIs           | **LangChain** |
| Evaluate the quality of a RAG system             | **Ragas** |
| Build enterprise-grade, full-stack search        | **Haystack** |
| Run LLMs privately without cloud                 | **PrivateGPT** |

---

## 2. Use OpenAI-compatible APIs

### 2.1 Install package 

####  LlamaIndex Packages Overview

This table explains what each package does. The ZhipuAI plugins are listed for reference only; they are not needed when using **OpenAI's embedding and chat models**, so the install cell below skips them.

| Package Name                              | Purpose / Description                                                | 
|-------------------------------------------|----------------------------------------------------------------------|
| `llama-index-core`                        | Core functionality of LlamaIndex (chunking, indexing, querying)     | 
| `llama-index-embeddings-zhipuai`          | Embedding model plugin for **ZhipuAI**                               | 
| `llama-index-llms-zhipuai`                | Chat model plugin for **ZhipuAI**                                    | 
| `llama-index-embeddings-openai`           | Embedding plugin for **OpenAI** models (e.g., `text-embedding-3`)   | 
| `llama-index-llms-openai`                 | LLM plugin for **OpenAI** chat models (e.g., `gpt-3.5`, `gpt-4`)     | 
| `llama-index-readers-file`                | Loader for reading local files (`.txt`, `.md`, `.csv`, etc.)         | 
| `llama-index-vector-stores-faiss`         | FAISS vector index integration for semantic search                   | 
| `llamaindex-py-client`                    | Client for accessing **LlamaCloud** API (hosted RAG-as-a-Service)    | 


In [58]:
# %pip install llama-index-core
# %pip install llama-index-embeddings-openai
# %pip install llama-index-llms-openai
# %pip install llama-index-readers-file
# %pip install llama-index-vector-stores-faiss
# %pip install llamaindex-py-client

### 2.2 API Configuration and Model Setup

The setup is the same as in the first lesson.

In [1]:
import os
from dotenv import load_dotenv

# Load env
load_dotenv()
api_key = os.getenv('API_KEY')

base_url = "https://api.openai.com/v1"  # We use OpenAI's models here
chat_model = "gpt-4.1-nano-2025-04-14"   # A cheaper model, to keep costs down
emb_model = "text-embedding-3-small"
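A small sanity check can catch a missing key or a mistyped URL before the first API call. The sketch below uses only the standard library; `validate_config` is a hypothetical helper, not part of LlamaIndex or the OpenAI SDK, and the key shown is a placeholder.

```python
from urllib.parse import urlparse

def validate_config(api_key, base_url):
    """Return a list of problems; an empty list means the config looks sane."""
    problems = []
    if not api_key:
        problems.append("API key is missing (is API_KEY set in .env?)")
    parsed = urlparse(base_url)
    if parsed.scheme not in ("http", "https"):
        problems.append(f"unexpected URL scheme: {parsed.scheme!r}")
    if not parsed.netloc:
        problems.append("URL has no host")
    return problems

# A well-formed config produces no problems
print(validate_config("sk-example", "https://api.openai.com/v1"))

# A missing key and a mistyped scheme are both reported
print(validate_config(None, "htps://api.openai.com/v1"))
```

This only checks the shape of the values. It cannot tell whether the key is actually valid, which still requires a real request.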


### 2.3 Model Config

In [3]:
from llama_index.llms.openai import OpenAI
llm = OpenAI(
    api_key = api_key,
    model = chat_model,
)

### 2.4 Model Test 

#### 💬 Why Do We Test `stream_complete()` and `complete()` Separately?

In LlamaIndex, both `llm.complete()` and `llm.stream_complete()` are used to generate text from a prompt — but they behave differently:

---
####  `llm.complete(prompt)`
- **Returns the full response** as a single object after the whole generation is complete.

- Easier for:

  - Quick one-off generation

  - Logging, string formatting

  - Unit tests or offline batch generation



In [None]:
response = llm.complete("Who are you?")
print(response)

Hello! I am ChatGPT, an AI language model developed by OpenAI. I'm here to help answer your questions, provide information, and assist with a variety of topics. How can I assist you today?


---

####  `llm.stream_complete(prompt)`

- **Returns a generator** that yields the response **incrementally**, token by token or chunk by chunk.
- Useful for:
  - Real-time streaming display
  - Responsive chat UI
  - Reducing latency in long outputs

In [5]:
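response = llm.stream_complete("Who are you?")
for chunk in response:
    # Printing the chunk shows the cumulative text so far, so each line of output repeats the previous one.
    # To print only the newly generated piece, use chunk.delta instead.
    print('\n')
    print(chunk, end="", flush=True)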





Hello

Hello!

Hello! I

Hello! I am

Hello! I am Chat

Hello! I am ChatGPT

Hello! I am ChatGPT,

Hello! I am ChatGPT, an

Hello! I am ChatGPT, an AI

Hello! I am ChatGPT, an AI language

Hello! I am ChatGPT, an AI language model

Hello! I am ChatGPT, an AI language model developed

Hello! I am ChatGPT, an AI language model developed by

Hello! I am ChatGPT, an AI language model developed by Open

Hello! I am ChatGPT, an AI language model developed by OpenAI

Hello! I am ChatGPT, an AI language model developed by OpenAI.

Hello! I am ChatGPT, an AI language model developed by OpenAI. I'm

Hello! I am ChatGPT, an AI language model developed by OpenAI. I'm here

Hello! I am ChatGPT, an AI language model developed by OpenAI. I'm here to

Hello! I am ChatGPT, an AI language model developed by OpenAI. I'm here to help

Hello! I am ChatGPT, an AI language model developed by OpenAI. I'm here to help answer

Hello! I am ChatGPT, an AI language model developed by OpenAI. I'm here to help a
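The repeated lines above appear because each streamed chunk carries the cumulative text as well as the newest piece. As far as I can tell, LlamaIndex exposes these as `text` and `delta` on each chunk. The sketch below mimics that shape with a stand-in class so it runs without an API call; `FakeChunk` and `fake_stream` are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class FakeChunk:
    """Stand-in for a streamed response chunk (hypothetical, for illustration only)."""
    text: str   # everything generated so far
    delta: str  # only the newly generated piece

def fake_stream(pieces):
    """Yield chunks the way a streaming LLM call would, without calling any API."""
    so_far = ""
    for piece in pieces:
        so_far += piece
        yield FakeChunk(text=so_far, delta=piece)

pieces = ["Hello", "!", " I", " am", " Chat", "GPT"]

# Printing only the delta produces one continuous line of text
for chunk in fake_stream(pieces):
    print(chunk.delta, end="", flush=True)
print()

# Joining all deltas gives the same string as the last cumulative text
streamed = "".join(chunk.delta for chunk in fake_stream(pieces))
final = list(fake_stream(pieces))[-1].text
```

With a real `llm.stream_complete(...)` call, the equivalent would be printing `chunk.delta` inside the loop.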

### 2.5 Embedding model config

In [49]:
from llama_index.embeddings.openai import OpenAIEmbedding
embedding = OpenAIEmbedding(
    api_key = api_key,
    model = emb_model,
)

### 2.6 Test Embedding model 

In [50]:
emb = embedding.get_text_embedding("Hellooooo~")
len(emb), type(emb)

(1536, list)

Both the chat model and the embedding model are working as expected.

## 3. Running Locally with Ollama


###  What is Ollama?

**Ollama** is a lightweight tool that lets you **run open-source LLMs locally** on your own machine (Mac, Windows, or Linux).  
It wraps models like `llama2`, `mistral`, `gemma`, `qwen`, etc., behind a simple HTTP API, and it also exposes an OpenAI-compatible endpoint.

---

####  Key Features of Ollama

- Run **chat models locally** without internet or API keys
- Use **GPU (if available)** or CPU fallback
- Supports multiple models: `llama2`, `mistral`, `gemma`, `qwen`, and more
- Comes with a **RESTful API** (plus an OpenAI-compatible endpoint) for easy integration
- Works with frameworks like **LlamaIndex**, **LangChain**, or even **manual RAG pipelines**



### 3.1 Package installation

In [1]:
# %pip install llama-index-embeddings-ollama
# %pip install llama-index-llms-ollama

### 3.2 Talking to Local Models via RESTful API

In this section, we will first use the `requests` library to send prompts to a locally running model (e.g., Qwen2 via Ollama) through a RESTful API.

---

###  What is a RESTful API?

A **RESTful API** is a common way for programs to communicate over HTTP using standard methods like:

- `GET` → retrieve data
- `POST` → send data (e.g., a user prompt)
- `PUT`, `DELETE`, etc.
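As a rough sketch of what a `POST` request consists of, the snippet below builds one with the standard library and inspects it without sending anything. The URL and model name are the ones used in the cells of this section; `build_chat_request` is a hypothetical helper.

```python
import json
from urllib.request import Request

def build_chat_request(url, model, prompt):
    """Build (but do not send) a JSON POST request for a chat endpoint."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("http://127.0.0.1:11434/api/chat", "qwen2:7b", "Who are you?")

# The three parts of the request: method + URL, headers, and JSON body
print(req.get_method(), req.full_url)
print(req.header_items())
print(json.loads(req.data))
```

The `requests` library used below does the same thing in one call: `requests.post(url, json=payload)` serializes the payload and sets the content type.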

In this case, the **local LLM** (e.g., Ollama) runs a small web server at `http://127.0.0.1:11434`, and the chat endpoint lives at `/api/chat`.



In [10]:
import json
import requests
BASE_URL = "http://127.0.0.1:11434/api/chat"


payload = {
  "model": "qwen2:7b",
  "messages": [
    {
      "role": "user",
      "content": "Please write an article of about 1,000 words discussing the employment prospects of AI majors."
    }
  ]
}
response = requests.post(BASE_URL, json=payload)
print(response.text[:1000])

{"model":"qwen2:7b","created_at":"2025-07-18T15:37:18.2499381Z","message":{"role":"assistant","content":"The"},"done":false}
{"model":"qwen2:7b","created_at":"2025-07-18T15:37:18.2741113Z","message":{"role":"assistant","content":" rapid"},"done":false}
{"model":"qwen2:7b","created_at":"2025-07-18T15:37:18.298073Z","message":{"role":"assistant","content":" development"},"done":false}
{"model":"qwen2:7b","created_at":"2025-07-18T15:37:18.3235007Z","message":{"role":"assistant","content":" and"},"done":false}
{"model":"qwen2:7b","created_at":"2025-07-18T15:37:18.3491176Z","message":{"role":"assistant","content":" adoption"},"done":false}
{"model":"qwen2:7b","created_at":"2025-07-18T15:37:18.3742086Z","message":{"role":"assistant","content":" of"},"done":false}
{"model":"qwen2:7b","created_at":"2025-07-18T15:37:18.4011529Z","message":{"role":"assistant","content":" artificial"},"done":false}
{"model":"qwen2:7b","created_at":"2025-07-18T15:37:18.4287062Z","message":{"role":"assistant","cont
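The output above is newline-delimited JSON: one object per line, each carrying a small piece of the reply. A minimal sketch of turning that back into plain text follows, using three lines copied from the output plus the truncated last line. `join_stream_text` is a hypothetical helper name.

```python
import json

# Lines copied from the output above; the last one is cut off, as in the printout
raw = (
    '{"model":"qwen2:7b","created_at":"2025-07-18T15:37:18.2499381Z","message":{"role":"assistant","content":"The"},"done":false}\n'
    '{"model":"qwen2:7b","created_at":"2025-07-18T15:37:18.2741113Z","message":{"role":"assistant","content":" rapid"},"done":false}\n'
    '{"model":"qwen2:7b","created_at":"2025-07-18T15:37:18.298073Z","message":{"role":"assistant","content":" development"},"done":false}\n'
    '{"model":"qwen2:7b","created_at":"2025-07-18T15:37:18.4287062Z","message":{"role":"assistant","cont'
)

def join_stream_text(ndjson_text):
    """Concatenate the 'content' pieces from newline-delimited JSON chat output."""
    parts = []
    for line in ndjson_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            break  # stop at a truncated line instead of crashing
        parts.append(obj.get("message", {}).get("content", ""))
    return "".join(parts)

print(join_stream_text(raw))
```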

In [8]:
payload = {
  "model": "qwen2:7b",
  "messages": [
    {
      "role": "user",
      "content": "Please write an article of about 1,000 words discussing the employment prospects of AI majors."
    }
  ],
  "stream": True
}
response = requests.post(BASE_URL, json=payload, stream=True)  # stream=True tells requests not to download the whole body up front
# Check the response status code
if response.status_code == 200:
    # Ollama streams one JSON object per line, so iterate line by line.
    # Fixed-size chunks from iter_content() can split or merge JSON objects and break json.loads().
    for line in response.iter_lines():
        if line:
            rtn = json.loads(line.decode('utf-8'))
            print(rtn.get("message", {}).get("content", ""), end="")
else:
    print(f"Error: {response.status_code}")

# Close the response
response.close()

As technology continues to advance at an unprecedented pace, the field of artificial intelligence (AI) has rapidly grown and expanded its influence across various industries. The demand for professionals with expertise in this field is on the rise, presenting a promising outlook for graduates specializing in AI.

To understand the employment prospects for AI majors, let's first examine what skills these professionals bring to the table:

1. **Data Analysis**: AI professionals possess strong analytical abilities and are adept at processing large datasets to uncover patterns and insights that can inform decision-making processes.
2. **Machine Learning**: They have a deep understanding of machine learning algorithms, enabling them to create models capable of automating routine tasks and making predictions based on historical data.
3. **Programming Skills**: Proficiency in programming languages such as Python, R, and Java is essential for AI professionals, allowing them to develop custom s
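One detail worth spelling out about streamed responses: bytes arrive in arbitrary pieces that need not line up with JSON object boundaries, so a parser has to buffer until it sees a full line. The sketch below shows that buffering on a hand-made byte stream; `iter_json_lines` is a hypothetical helper, and the two sample objects are simplified versions of what the server sends.

```python
import json

def iter_json_lines(byte_chunks):
    """Reassemble newline-delimited JSON objects from arbitrarily split byte chunks."""
    buffer = b""
    for chunk in byte_chunks:
        buffer += chunk
        # Emit every complete line currently in the buffer
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():
                yield json.loads(line.decode("utf-8"))
    # Handle a final line that has no trailing newline
    if buffer.strip():
        yield json.loads(buffer.decode("utf-8"))

stream = (
    b'{"message":{"content":"As"},"done":false}\n'
    b'{"message":{"content":" technology"},"done":false}\n'
)

# Cut the stream into 16-byte pieces so that objects are split mid-way
chunks = [stream[i:i + 16] for i in range(0, len(stream), 16)]
text = "".join(obj["message"]["content"] for obj in iter_json_lines(chunks))
print(text)
```

In practice `response.iter_lines()` from `requests` does this buffering for you.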

### 3.3 Local Chat Config: Using LlamaIndex's `Ollama` Wrapper

In the previous section, we used the `requests` library to interact with the local model via its RESTful API.  
That confirmed that the **Ollama server is up and running** and capable of generating text responses.

Now that we know the local setup works, let's switch to a **higher-level interface**:  
LlamaIndex's `Ollama` wrapper, which provides a cleaner and more structured way to interact with local models using the same LLM abstraction used for OpenAI or other providers.

This wrapper:
- Removes the need for manual JSON construction and parsing
- Supports both full and streaming generation
- Integrates seamlessly with the rest of the LlamaIndex ecosystem (retrievers, query engines, etc.)

Let’s see how we can use it to generate responses using the same local model.

In [5]:
from llama_index.llms.ollama import Ollama
llm = Ollama(base_url="http://127.0.0.1:11434", model="qwen2:7b")

### 3.4 Local Chat test

In [11]:
response = llm.complete("Who are you?")
print(response)

I am Qwen, an AI developed by Alibaba Cloud. I'm designed to assist users in generating various types of text, such as articles, poems, and code snippets. My primary function is to provide assistance and support for a wide range of tasks and inquiries. How can I help you today?


### 3.5 Local Embedding model test

In [45]:
from llama_index.embeddings.ollama import OllamaEmbedding
ollama_embedding = OllamaEmbedding(base_url="http://127.0.0.1:11434", model_name="qwen2:7b")

In [46]:
emb = ollama_embedding.get_text_embedding("你好呀呀")
len(emb), type(emb)


(3584, list)

###  3.6 Embedding Quality Test: Comparing Similarity Scores

According to Datawhale, embeddings from Ollama can produce **unusual similarity scores**, which may hurt retrieval accuracy in RAG.

To test this, we compare **cosine similarity** between:

- A pair of **similar sentences**
- A pair of **unrelated sentences**

Using embeddings generated from different providers:
- Ollama (local)
- OpenAI (baseline)

---

#### 3.6.1 Test Sentences

In [53]:
text1 = "Hi! What a great day"
text2 = "Helloooooo"               # semantically similar
text3 = "I'm gay"           # semantically different

#### 3.6.2 Cosine Similarity Function

In [54]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def cos_sim(a, b):
    return cosine_similarity([a], [b])[0][0]
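For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A dependency-free sketch of the same computation is below; `cos_sim_plain` is a made-up name, and the short vectors are toy inputs rather than real embeddings.

```python
import math

def cos_sim_plain(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same = cos_sim_plain([3.0, 4.0], [3.0, 4.0])         # same direction
orthogonal = cos_sim_plain([1.0, 0.0], [0.0, 1.0])   # unrelated directions
opposite = cos_sim_plain([3.0, 4.0], [-3.0, -4.0])   # opposite direction
print(same, orthogonal, opposite)
```

Scores near 1 mean the vectors point the same way, scores near 0 mean they are unrelated, and negative scores mean they point in opposite directions.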


#### 3.6.3 Get OpenAI Embeddings

In [None]:
openai_emb1 = embedding.get_text_embedding(text1)
openai_emb2 = embedding.get_text_embedding(text2)
openai_emb3 = embedding.get_text_embedding(text3)

#### 3.6.4 Get Ollama Embeddings

In [56]:
ollama_emb1 = ollama_embedding.get_text_embedding(text1)
ollama_emb2 = ollama_embedding.get_text_embedding(text2)
ollama_emb3 = ollama_embedding.get_text_embedding(text3)

#### 3.6.5 Result

In [None]:
print("OpenAI Similarity (text1 vs text2):", cos_sim(openai_emb1, openai_emb2))
print("OpenAI Similarity (text1 vs text3):", cos_sim(openai_emb1, openai_emb3))

print("Ollama Similarity (text1 vs text2):", cos_sim(ollama_emb1, ollama_emb2))
print("Ollama Similarity (text1 vs text3):", cos_sim(ollama_emb1, ollama_emb3))

OpenAI Similarity (text1 vs text2): 0.4999185730251407
OpenAI Similarity (text1 vs text3): 0.20050592158419314
Ollama Similarity (text1 vs text2): 0.6257088231473826
Ollama Similarity (text1 vs text3): 0.591558301991839
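What matters for retrieval is how far apart the similar and unrelated pairs score. The short sketch below computes that gap from the four numbers printed above; `separation` is a hypothetical helper name.

```python
# Scores copied from the output above
scores = {
    "OpenAI": {"similar": 0.4999185730251407, "unrelated": 0.20050592158419314},
    "Ollama (qwen2:7b)": {"similar": 0.6257088231473826, "unrelated": 0.591558301991839},
}

def separation(pair_scores):
    """Gap between the similar-pair score and the unrelated-pair score."""
    return pair_scores["similar"] - pair_scores["unrelated"]

for provider, pair_scores in scores.items():
    print(f"{provider}: gap = {separation(pair_scores):.3f}")
```

The OpenAI embeddings separate the two pairs by about 0.30, while the Ollama embeddings separate them by only about 0.03. That matches the concern raised at the start of this section, though three sentences are far too small a sample to settle it. One plausible explanation, which is an assumption and not tested here, is that `qwen2:7b` is a chat model and not a dedicated embedding model.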
