While working on this project I am using Python 3.13, so either install that exact version or verify that this code is compatible with whatever version you are using.
I am also using tinyllama, so make sure Ollama is running, and pull the model by running
ollama pull tinyllama
>>> What is Kubernetes ?
Kubernetes is an online marketplace for used and new kitchen appliances, electronics, and homeware. It provides users with a wide range of products that are tested by independent experts before being
listed on the site. The platform also offers a convenient checkout process with various payment options and shipping methods. In short, Kubernetes is an online marketplace that specializes in used and
new kitchen appliances, electronics, and homeware.This is the answer provided by our local LLM.
The k8s.txt file acts as our knowledge base; this can be documentation about a tool, scan results, etc.
In our case it is a definition of Kubernetes.
embed.py will transform our document into vector embeddings and store them in /db.
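Before embedding, long documents are usually split into overlapping chunks so each embedding stays focused. The helper below is a sketch of that step, not necessarily the exact code in embed.py; the function name and default sizes are assumptions.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split a document into overlapping character chunks before embedding.

    Overlap keeps sentences that straddle a chunk boundary retrievable
    from both neighboring chunks.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # step forward, keeping `overlap` chars shared
    return chunks
```

Each chunk would then be added to the Chroma collection, which computes the embeddings and persists them under /db.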
app.py will act as our FastAPI app.
We initialize our ChromaDB client, query the previously stored context, and pass it to tinyllama.
Finally, we run the app by using
uvicorn app:app --reload
We can verify that the app is running by visiting http://localhost:8000
>>> What is Kubernetes ?
Kubernetes is an open-source platform for automatically managing containerized applications on a cloud or distributed infrastructure, including load balancing, self-healing, and rolling out updates. It was originally developed by Google and has since been maintained by the Cloud Native Computing Foundation (CNCF).
This is the answer provided by our local LLM.
We can see that, thanks to our context, the model no longer mistakes which Kubernetes we are talking about.
Now we will make our knowledge base dynamic, so we can add new information to our context. To do so, we define a new endpoint /add that takes the new information as a parameter.
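The logic behind /add can be sketched as follows. This is an illustration of the flow, not the repo's actual handler: the function name is hypothetical, and the in-memory list stands in for the real Chroma collection that /query reads from.

```python
# Stand-in for the persistent Chroma collection used by /query.
knowledge_base: list[str] = []

def add_fact(text: str) -> dict:
    """What an /add handler might do: store the new text and confirm success.

    In the real app this would be something like
    collection.add(documents=[text], ids=[...]) so the new fact is
    embedded and retrievable on the next /query call.
    """
    knowledge_base.append(text)
    return {"status": "added", "count": len(knowledge_base)}
```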
>>> How old is Mohamed Hakim ?
The given text mentions a young and modern-looking person with dark hair, who is in his mid-twenties to early thirties. Therefore, according to the context provided, the age of Mohamed Haiman could be around 25 to 30 years old.
Mohamed Hakim is 21 years old.
This will be the text passed to /add endpoint.
>>> How old is Mohamed Hakim ?
The given question requires a straightforward and clear answer that could be answered with confidence. As per the given context, Mohamed Hakim is approximately 21 years old.
The app is running perfectly on our machine, but to make sure anyone can run it without problems, we will dockerize the app.
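A Dockerfile along these lines would work for this app; the base image, file layout, and requirements.txt are assumptions, not necessarily what the repo uses.

```dockerfile
FROM python:3.13-slim

WORKDIR /app

# Install dependencies first so this layer is cached between builds
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```

Note that Ollama stays outside the container; the app reaches it through the OLLAMA_HOST environment variable set at run time.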
After creating our Dockerfile, we now have to build our image:
docker build -t rag-app .
Then run our container:
docker run --name "my-rag-app" -e MODEL_NAME="tinyllama" -e OLLAMA_HOST="http://host.docker.internal:11434" -p 8000:8000 rag-app
Now let's send our project image to the world.
First let's tag our image:
docker tag rag-app hakim/rag-app
Now let's push it:
docker push hakim/rag-app
Anyone can now run this command to get our image:
docker pull hakim/rag-app
We will try to implement this architecture locally, and to do so we will use Minikube.
Minikube is a tool that lets you run a single-node Kubernetes cluster on your local machine. It's perfect for learning Kubernetes, developing applications, and testing your deployments without needing a full-blown cloud environment.
So let's install the necessary tools.
Minikube runs a local Kubernetes cluster on your computer for learning and testing.
winget install Kubernetes.minikube
kubectl is the command-line tool for managing Kubernetes clusters.
winget install -e --id Kubernetes.kubectl
After installing our tools, we load our image into the Kubernetes environment (e.g. with minikube image load rag-app).
A Deployment is a blueprint that tells Kubernetes how to run your app. Kubernetes will make sure your app always matches this blueprint - if a container crashes, Kubernetes automatically starts a new one.
Our blueprint is defined in deployment.yml.
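A deployment.yml matching the description above could look like this; the metadata names, labels, and image tag are assumptions, not necessarily the repo's exact manifest.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rag-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: rag-app
  template:
    metadata:
      labels:
        app: rag-app
    spec:
      containers:
        - name: rag-app
          image: rag-app:latest
          imagePullPolicy: Never   # use the image loaded into Minikube, don't pull
          ports:
            - containerPort: 8000
```

The `app: rag-app` label is what the Service will later use to find the pods.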
Now let's apply it:
kubectl apply -f kubernetes/deployment.yml
We can check whether the deployment was applied correctly by running
kubectl get deployment
At this point, our RAG API is running inside a pod. We have to access it.
We will use a Service to provide an endpoint for our pod.
A Service provides multiple functionalities:
- A stable IP address
- A DNS name (rag-app-service) that always works
- Load balancing across pods (if multiple instances are running)
- Automatic routing to healthy pods
Our service is defined in kubernetes/service.yml
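A service.yml consistent with the name used below (rag-app-service) might look like this; the NodePort type and the selector label are assumptions.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: rag-app-service
spec:
  type: NodePort            # lets `minikube service` expose a reachable URL
  selector:
    app: rag-app            # must match the label on the Deployment's pods
  ports:
    - port: 8000
      targetPort: 8000
```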
Let's apply it:
kubectl apply -f kubernetes/service.yml
Let's verify:
kubectl get services
Now let's get the URL of our app:
minikube service rag-app-service --url
Making a request to that URL returns the expected answer, so our app is set up properly.
One of the strongest points of Kubernetes is self-healing: whenever a pod goes down, another one comes up, so our architecture is always maintained.
We will delete a running pod (kubectl delete pod <pod-name>) and see what happens: another one is instantly up.
We will implement semantic tests in our project, so before hosting the app we can make sure everything is correct. For that we will use two libraries: httpx and pytest.
Tests are defined inside app/test_app.py
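The shape of such a test can be sketched as follows. This is illustrative: the function names and the JSON field "answer" are assumptions, and in the real test_app.py the payload would come from an httpx request rather than a hard-coded dict.

```python
def extract_answer(payload: dict) -> str:
    """Pull the answer text out of the response body (field name is assumed)."""
    return payload.get("answer", "")

def test_query_mentions_keywords():
    # In the real test, this payload would come from something like
    # httpx.get("http://localhost:8000/query", params={"q": "What is Kubernetes ?"}).json()
    payload = {"answer": "Kubernetes is an open-source platform for containerized apps."}
    answer = extract_answer(payload)
    # Keyword checks like this work, but can fail on an unlucky generation
    assert "open-source" in answer
```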
How can we test the response of the LLM ?
We know that after calling the /query endpoint, the model has to answer according to our knowledge base. But the answer is neither deterministic nor consistent, so how can we test that endpoint?
We can check whether certain words, such as open-source or containerized, exist in the response. But that might still fail.
Instead of making the LLM generate a response, we will introduce an environment variable MOCK_MODE: when it is True, the endpoint will return the generated embeddings/context as the response. This makes the endpoint's response consistent.
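The branching can be sketched like this. The function names are illustrative and retrieve_context stands in for the real ChromaDB lookup; only the MOCK_MODE variable name comes from the text above.

```python
import os

def retrieve_context(question: str) -> str:
    # Placeholder for the real ChromaDB query
    return "retrieved context for: " + question

def call_llm(question: str, context: str) -> str:
    # Placeholder for the real Ollama/tinyllama call
    raise NotImplementedError("stands in for the real LLM call")

def answer_query(question: str) -> str:
    """Sketch of /query branching on MOCK_MODE.

    When MOCK_MODE is true we skip the LLM entirely and return the
    retrieved context verbatim, so tests see a deterministic response.
    """
    context = retrieve_context(question)
    if os.getenv("MOCK_MODE", "").lower() == "true":
        return context  # deterministic: no LLM involved
    return call_llm(question, context)  # non-deterministic path
```

In CI, the test suite would simply run with MOCK_MODE=True in the environment.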




