rag-api


Work Environment

This project was developed with Python 3.13, so install that exact version, or at least verify that the code is compatible with the version you are using.

The project also uses tinyllama, so make sure Ollama is running, then pull the model by running

ollama pull tinyllama

Static Knowledge

Before RAG

>>> What is Kubernetes ?
Kubernetes is an online marketplace for used and new kitchen appliances, electronics, and homeware. It provides users with a wide range of products that are tested by independent experts before being
listed on the site. The platform also offers a convenient checkout process with various payment options and shipping methods. In short, Kubernetes is an online marketplace that specializes in used and
new kitchen appliances, electronics, and homeware.

This is the answer provided by our local LLM.

Embedding Generation

The k8s.txt file acts as our knowledge base. This could be documentation about a tool, scan results, etc. In our case it is a definition of Kubernetes. embed.py transforms the document into vector embeddings and stores them in /db.
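A minimal sketch of what embed.py might do (the collection name "docs" and the chunk size are assumptions, not the repo's actual code):

```python
# Hypothetical sketch of embed.py: chunk the knowledge base and store
# the chunks in a persistent ChromaDB collection under ./db.

def chunk_text(text: str, size: int = 500) -> list[str]:
    # Split the document into fixed-size chunks for embedding.
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed_file(path: str = "k8s.txt", db_dir: str = "db") -> int:
    import chromadb  # third-party; imported lazily
    client = chromadb.PersistentClient(path=db_dir)
    collection = client.get_or_create_collection("docs")
    with open(path, encoding="utf-8") as f:
        chunks = chunk_text(f.read())
    # Chroma embeds the documents with its default embedding function.
    collection.add(documents=chunks, ids=[f"chunk-{i}" for i in range(len(chunks))])
    return len(chunks)
```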

Building API

app.py will act as our FastAPI app.

We initialize our ChromaDB client, query the previously stored context, then pass it to tinyllama.
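A rough sketch of that retrieval-then-generation flow (function and collection names are assumptions; the actual app.py wires this into FastAPI routes):

```python
# Sketch of the retrieval + generation flow behind the API (names are assumptions).
import os

def build_prompt(question: str, context: str) -> str:
    # Ask the model to answer strictly from the retrieved context.
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

def query_rag(question: str) -> str:
    import chromadb, ollama  # third-party; imported lazily
    client = chromadb.PersistentClient(path="db")
    collection = client.get_collection("docs")
    # Retrieve the most relevant chunks for the question.
    results = collection.query(query_texts=[question], n_results=2)
    context = "\n".join(results["documents"][0])
    model = os.getenv("MODEL_NAME", "tinyllama")
    reply = ollama.generate(model=model, prompt=build_prompt(question, context))
    return reply["response"]
```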

Finally we run the app by using

uvicorn app:app --reload

We can verify that the app is running by visiting http://localhost:8000

After RAG

>>> What is Kubernetes ?
Kubernetes is an open-source platform for automatically managing containerized applications on a cloud or distributed infrastructure, including load balancing, self-healing, and rolling out updates. It was originally developed by Google and has since been maintained by the Cloud Native Computing Foundation (CNCF).

This is the answer provided by our local LLM.

We can see that, thanks to the retrieved context, the model no longer mistakes which Kubernetes we are talking about.

Dynamic Knowledge

Now we will try to make our knowledge base dynamic. So we can add new information to our context.

To do so, we define a new endpoint /add that takes the new information as a parameter.
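One way such an endpoint could be implemented (a sketch; the id scheme and response shape are assumptions):

```python
# Sketch of a /add endpoint that appends new text to the knowledge base.

def make_id(existing_count: int) -> str:
    # Give each added document a unique, predictable id.
    return f"doc-{existing_count}"

def add_text(text: str) -> dict:
    import chromadb  # third-party; imported lazily
    client = chromadb.PersistentClient(path="db")
    collection = client.get_or_create_collection("docs")
    # Store the new text so later /query calls can retrieve it.
    collection.add(documents=[text], ids=[make_id(collection.count())])
    return {"status": "added", "text": text}
```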

Before New Information

>>> How old is Mohamed Hakim ?
The given text mentions a young and modern-looking person with dark hair, who is in his mid-twenties to early thirties. Therefore, according to the context provided, the age of Mohamed Haiman could be around 25 to 30 years old.

After New Information

"Mohamed Hakim is 21 years old." is the text passed to the /add endpoint.

>>> How old is Mohamed Hakim ?
The given question requires a straightforward and clear answer that could be answered with confidence. As per the given context, Mohamed Hakim is approximately 21 years old.

Containerization

The app runs perfectly on our machine, but to make sure anyone can run it without problems, we have to dockerize the app.

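A minimal Dockerfile for this app might look like the following (a sketch under assumptions; the base image, file layout, and requirements file may differ from the repo's actual Dockerfile):

```dockerfile
# Hypothetical Dockerfile sketch (not the repo's exact file).
FROM python:3.13-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```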

After creating our Dockerfile, we build the image

docker build -t rag-app .

Then run our container

docker run --name "my-rag-app" -e MODEL_NAME="tinyllama" -e OLLAMA_HOST="http://host.docker.internal:11434" -p 8000:8000 rag-app

Docker Hub

Now let's send our project image to the world

First let's tag our image

docker tag rag-app hakim/rag-app

Now let's push it

docker push hakim/rag-app

Anyone can now run this command to get our image

docker pull hakim/rag-app

Kubernetes

alt text

We will try to implement this architecture locally, and to do so we will use Minikube.

Minikube is a tool that lets you run a single-node Kubernetes cluster on your local machine. It's perfect for learning Kubernetes, developing applications, and testing your deployments without needing a full-blown cloud environment.

So let's install necessary tools

  • Minikube runs a local Kubernetes cluster on your computer for learning and testing.
winget install Kubernetes.minikube
  • kubectl is the command-line tool for managing Kubernetes clusters.
winget install -e --id Kubernetes.kubectl

After installing these tools, we load our image into the Minikube cluster:

minikube image load rag-app

Kubernetes Deployment

A Deployment is a blueprint that tells Kubernetes how to run your app. Kubernetes will make sure your app always matches this blueprint - if a container crashes, Kubernetes automatically starts a new one.

Our blueprint is defined in kubernetes/deployment.yml
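For reference, a minimal deployment.yml could look roughly like this (labels, replica count, and image settings are assumptions, not the repo's exact manifest):

```yaml
# Hypothetical sketch of kubernetes/deployment.yml.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rag-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: rag-app
  template:
    metadata:
      labels:
        app: rag-app
    spec:
      containers:
        - name: rag-app
          image: rag-app
          imagePullPolicy: Never   # use the image loaded into Minikube
          ports:
            - containerPort: 8000
          env:
            - name: MODEL_NAME
              value: "tinyllama"
```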

Now let's apply it

kubectl apply -f kubernetes/deployment.yml

We can check that the deployment was applied correctly by running

kubectl get deployment

Kubernetes Service

At this point, our RAG API is running inside a pod, but we still need a way to access it.

We will use a service to provide an endpoint for our pod.

A Service provides multiple functionalities:

  • Stable IP address
  • DNS name rag-app-service that always works
  • Load Balancing across pods (if multiple instances are running)
  • Automatic Routing to Healthy pods.

Our service is defined in kubernetes/service.yml
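A minimal service.yml along these lines (the NodePort type and ports are assumptions; the name rag-app-service matches the one used later):

```yaml
# Hypothetical sketch of kubernetes/service.yml.
apiVersion: v1
kind: Service
metadata:
  name: rag-app-service
spec:
  type: NodePort          # reachable from outside the cluster via minikube
  selector:
    app: rag-app          # must match the Deployment's pod labels
  ports:
    - port: 8000
      targetPort: 8000
```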

Let's apply it

kubectl apply -f kubernetes/service.yml

Let's verify

kubectl get services

Test App

First, let's get the URL of our app

minikube service rag-app-service --url

Making a request to the URL returns the expected answer, so our app is set up properly.

Self Healing

One of the strongest points of Kubernetes is self-healing: whenever a pod goes down, another one comes up, so our architecture is always maintained.

Let's delete a running pod and see what happens:

kubectl delete pod <pod-name>

Another pod is instantly started in its place.


CI/CD Pipeline


Tests

We will implement semantic tests in our project, so before hosting the app we can make sure everything is correct. For that we will use two libraries: httpx and pytest.

Tests are defined inside app/test_app.py

Problem

How can we test the response of the LLM ?

We know that after calling the /query endpoint, the model has to answer according to our knowledge base. But the answer is neither deterministic nor consistent. So how can we test that endpoint?

We could check whether certain words, such as open-source or containerized, exist in the response, but that check might still fail.

Solution: Mock LLM

Instead of making the LLM generate a response, we introduce an environment variable MOCK_MODE. When it is True, the endpoint returns the retrieved context as the response, which makes the endpoint's output consistent.
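A sketch of how such a switch could gate the endpoint (the helper name and response shape are assumptions, not the repo's exact code):

```python
# Sketch: when MOCK_MODE is "True", skip generation and return the
# retrieved context so tests get a deterministic response.
import os

def respond(question: str, context: str, generate) -> dict:
    if os.getenv("MOCK_MODE", "False").lower() == "true":
        # Deterministic: tests can assert on the exact context string.
        return {"context": context}
    return {"answer": generate(question, context)}
```

A test can then set MOCK_MODE and assert on the exact payload instead of matching fuzzy LLM output.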

About

RAG (Retrieval-Augmented Generation) API using FastAPI, Chroma DB and Ollama (Local LLM)
