## Retrieval Augmented Generation (RAG)

Retrieval-Augmented Generation<br/>
Difficulty: ★★☆☆☆<br/>
[Link to the article](https://github.com/Sakuard/bootcamp_ai/blob/main/doc/LLMxNLPxRAG.md)


### Basic Principles of RAG
![Basic RAG Structure](./src/rag/basic_rag_structure.png)
### The two paths
- Data embedding (blue path)
- Data retrieval (yellow path)

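#### Data embedding
1. Run the data through an ***Embedding*** model
2. Store the ***Embedding*** results in a ***vector database*** to build a **personal knowledge base**
#### Data retrieval
1. The user asks a question (Query)
2. Embed the Query
3. Pass the Query embedding to the ***vector database***, which compares it against the stored vectors and returns the most similar data
4. Combine the Query and the returned data into a single prompt and pass it to the LLM to generate a response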
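
The two paths above can be sketched end to end in plain Python. This is a minimal illustration, not the notebook's implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, `vector_store` is an ordinary list standing in for a vector database, and the two sample texts are made up for the example.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Data embedding path: embed every document and keep (document, vector) pairs.
knowledge_base_texts = [
    "llamas are related to camels",
    "python is a programming language",
]
vector_store = [(doc, toy_embed(doc)) for doc in knowledge_base_texts]


def retrieve(query: str) -> str:
    """Retrieval steps 2-3: embed the query and return the most similar document."""
    query_vec = toy_embed(query)
    best_doc, _ = max(vector_store, key=lambda pair: cosine_similarity(query_vec, pair[1]))
    return best_doc


def build_prompt(query: str, context: str) -> str:
    """Retrieval step 4: merge the retrieved data and the query into one prompt."""
    return f"Using this data: {context}. Respond to this prompt: {query}"


question = "what are llamas related to"
prompt = build_prompt(question, retrieve(question))
print(prompt)
```

A real embedding model produces dense vectors that capture meaning rather than word overlap, but the flow of embed, compare, and merge into a prompt is the same.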


In [1]:
!pip install ollama chromadb

Collecting ollama
  Obtaining dependency information for ollama from https://files.pythonhosted.org/packages/2f/25/c3442864bd77621809a208a483b0857f8d6444b7a67906b58b9dcddd1574/ollama-0.3.1-py3-none-any.whl.metadata
  Downloading ollama-0.3.1-py3-none-any.whl.metadata (3.8 kB)
Collecting chromadb
  Obtaining dependency information for chromadb from https://files.pythonhosted.org/packages/80/4c/ee62b19a8daeed51e3c88c84b7da6047a74b786e598be3592b67a286d419/chromadb-0.5.5-py3-none-any.whl.metadata
  Downloading chromadb-0.5.5-py3-none-any.whl.metadata (6.8 kB)
Collecting build>=1.0.3 (from chromadb)
  Obtaining dependency information for build>=1.0.3 from https://files.pythonhosted.org/packages/e2/03/f3c8ba0a6b6e30d7d18c40faab90807c9bb5e9a1e3b2fe2008af624a9c97/build-1.2.1-py3-none-any.whl.metadata
  Using cached build-1.2.1-py3-none-any.whl.metadata (4.3 kB)
Collecting pydantic>=1.9 (from chromadb)
  Obtaining dependency information for pydantic>=1.9 from https://files.pythonhosted.org/pack

`documents` is our mock data.<br/>The embedding of each document is stored in chromadb.<br/>The embedding model used here is mxbai-embed-large.

In [2]:
import ollama
import chromadb

documents = [
  "Llamas are members of the camelid family meaning they're pretty closely related to vicuñas and camels",
  "Llamas were first domesticated and used as pack animals 4,000 to 5,000 years ago in the Peruvian highlands",
  "Llamas can grow as much as 6 feet tall though the average llama between 5 feet 6 inches and 5 feet 9 inches tall",
  "Llamas weigh between 280 and 450 pounds and can carry 25 to 30 percent of their body weight",
  "Llamas are vegetarians and have very efficient digestive systems",
  "Llamas live to be about 20 years old, though some only live for 15 years and others live to be 30 years old",
]

client = chromadb.Client()

# Reuse the "docs" collection if it already exists; otherwise create it
collection = client.get_or_create_collection(name="docs")

# Collect the IDs already stored so that re-running this cell does not add duplicates
existing_docs = collection.get()
existing_ids = set(existing_docs['ids'])

# Embed each document and store it in the vector database
for i, d in enumerate(documents):
    if str(i) in existing_ids:
        print(f"ID {i} already exists, skipping.")
        continue

    response = ollama.embeddings(model="mxbai-embed-large", prompt=d)
    embedding = response["embedding"]
    collection.add(
        ids=[str(i)],
        embeddings=[embedding],
        documents=[d]
    )
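
The skip-if-present logic in the cell above could also be wrapped in a helper that collects the missing documents and stores them with a single `add()` call, since `add()` takes parallel lists. The sketch below is one possible arrangement, not the notebook's method: `add_missing_documents`, `InMemoryCollection`, and `fake_embed` are hypothetical names introduced here, and the stand-in collection only mimics the `get()` and `add()` calls so the example runs without chromadb or an Ollama server.

```python
class InMemoryCollection:
    """Hypothetical stand-in exposing only get() and add(), so the sketch runs offline."""

    def __init__(self):
        self.ids, self.embeddings, self.documents = [], [], []

    def get(self):
        return {"ids": list(self.ids), "documents": list(self.documents)}

    def add(self, ids, embeddings, documents):
        self.ids += ids
        self.embeddings += embeddings
        self.documents += documents


def add_missing_documents(collection, documents, embed_fn):
    """Embed and store only the documents whose index-based ID is not stored yet."""
    existing_ids = set(collection.get()["ids"])
    pending = [(str(i), d) for i, d in enumerate(documents) if str(i) not in existing_ids]
    if not pending:
        return []
    new_ids = [doc_id for doc_id, _ in pending]
    new_docs = [doc for _, doc in pending]
    # One add() call with parallel lists instead of one call per document.
    collection.add(ids=new_ids, embeddings=[embed_fn(d) for d in new_docs], documents=new_docs)
    return new_ids


# Placeholder embedding (text length only), NOT a real model.
fake_embed = lambda text: [float(len(text))]
store = InMemoryCollection()
print(add_missing_documents(store, ["a", "bb"], fake_embed))         # first run: nothing stored yet
print(add_missing_documents(store, ["a", "bb", "ccc"], fake_embed))  # second run: only the new document
```

With the real objects, the same helper could be called as `add_missing_documents(collection, documents, lambda d: ollama.embeddings(model="mxbai-embed-large", prompt=d)["embedding"])`.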

The user asks a question (Query).<br/>Embed the Query and pass the result to chromadb for comparison.

In [3]:
Query = "What animals are llamas related to?"

# Embed the query with the same model used for the documents
response = ollama.embeddings(
  prompt=Query,
  model="mxbai-embed-large"
)
results = collection.query(
  query_embeddings=[response["embedding"]],
  n_results=1
)
# Results are nested per query: take the top match for the first (and only) query
data = results['documents'][0][0]
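
The double index `[0][0]` reflects how the query result is nested: one inner list per query embedding, then one entry per returned match. The sketch below unpacks a result of that shape. It assumes the result also carries `ids` and `distances` keys laid out the same way (chromadb's default, as far as I know) and that matches arrive nearest first. `top_matches` is a hypothetical helper, and `example_results` is hand-written with made-up values rather than real query output.

```python
def top_matches(results: dict, query_index: int = 0) -> list:
    """Return (id, document, distance) tuples for one query, in the order received."""
    return list(zip(
        results["ids"][query_index],
        results["documents"][query_index],
        results["distances"][query_index],
    ))


# Hand-written stand-in shaped like a query result; the distances are made up.
example_results = {
    "ids": [["0", "3"]],
    "documents": [["doc about camelids", "doc about weight"]],
    "distances": [[0.21, 0.87]],
}

for doc_id, doc, dist in top_matches(example_results):
    print(doc_id, dist, doc)
```

Looking at the distances alongside the documents can help decide whether the top match is close enough to be worth passing to the LLM.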

Combine the chromadb match {data} and the user's question {Query} into a single prompt<br/>and pass it to the LLM (llama2) to generate a response.

In [4]:
ollama.pull(model="llama2")
# Generate a response grounded in the retrieved document
output = ollama.generate(
  model="llama2",
  prompt=f"Using this data: {data}. Respond to this prompt: {Query}"
)

print(output['response'])


Llamas are members of the camelid family, which means they are closely related to other animals such as:

1. Vicuñas: Vicuñas are small, wild relatives of llamas and alpacas. They are found in the Andean highlands and are known for their soft, woolly coats.
2. Camels: As mentioned earlier, llamas are part of the camelid family, which means they are closely related to camels. Camels are large, even-toed ungulates that are native to Africa and Asia.
3. Alpacas: Alpacas are domesticated mammals that are similar to llamas but have a different coat type. They are also members of the camelid family and are found in South America.

In summary, llamas are related to vicuñas, camels, and alpacas through their shared membership in the camelid family.
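
The walkthrough retrieves a single document (`n_results=1`). If more than one match were requested, the prompt would need to carry several documents. The sketch below shows one possible way to do that, keeping the wording of the original prompt. `build_multi_doc_prompt` is a hypothetical helper, and the numbered-list layout is an assumption, not something the notebook prescribes.

```python
from typing import Sequence


def build_multi_doc_prompt(query: str, retrieved_docs: Sequence[str]) -> str:
    """Number each retrieved document and place the question after the context."""
    context = "\n".join(f"{i}. {doc}" for i, doc in enumerate(retrieved_docs, start=1))
    return f"Using this data:\n{context}\nRespond to this prompt: {query}"


sample_docs = [
    "Llamas are members of the camelid family",
    "Llamas are vegetarians",
]
print(build_multi_doc_prompt("What animals are llamas related to?", sample_docs))
```

In the notebook's terms, `retrieved_docs` would be `results['documents'][0]` and the returned string would be passed as `prompt` to `ollama.generate`.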
