In [1]:
!curl -fsSL https://ollama.com/install.sh | sh
!nohup ollama serve > output.log 2>&1 &
!ollama pull phi4

>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
######################################################################## 100.0%
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
[?2026h[?25l[1Gpulling manifest ⠙ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠙ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠹ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠼ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠼ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠴ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠧ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠧ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠇ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest [K
pulling fd7b6731c33c:   0% ▕▏ 7.3 MB/9.1 GB                

In [2]:
!pip install ollama faiss-cpu sentence-transformers numpy

Collecting ollama
  Downloading ollama-0.4.8-py3-none-any.whl.metadata (4.7 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.11.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.8 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.11.0->sentence-tran

# Optimizations Implemented in the Agent
## Document Chunking

Large documents are split into ~100-word chunks to improve embedding relevance and avoid passing long irrelevant sections.
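
The chunking step can be illustrated in isolation. The sketch below assumes a standalone helper named `chunk_words` (a hypothetical name, chosen for illustration) that splits on whitespace and groups words into fixed-size windows, which is the same idea the agent applies with a 100-word limit.

```python
from typing import List


def chunk_words(text: str, max_words: int = 100) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    # Step through the word list in windows of max_words.
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


# A 250-word sample yields two full chunks and one partial chunk.
sample = " ".join(f"w{i}" for i in range(250))
chunks = chunk_words(sample)
print([len(c.split()) for c in chunks])  # [100, 100, 50]
```

Fixed-size word windows are simple but can cut a sentence in half at a chunk boundary; overlapping windows or sentence-aware splitting are common alternatives if that becomes a problem.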

## Top-k Retrieval

Instead of retrieving only the single best match, the agent retrieves the three most relevant chunks (k=3) to give the model more context.
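
To make the retrieval step concrete, here is a minimal pure-Python sketch of exact top-k search by squared L2 distance. It is not the FAISS implementation: `squared_l2` and `top_k` are hypothetical helpers, and the toy 2-dimensional vectors stand in for real 384-dimensional embeddings. A flat L2 index performs the same kind of exhaustive comparison, only much faster.

```python
import heapq
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query: Sequence[float], vectors: List[Sequence[float]], k: int = 3) -> List[Tuple[int, float]]:
    """Return (index, squared distance) pairs for the k nearest vectors, closest first."""
    scored = ((i, squared_l2(query, v)) for i, v in enumerate(vectors))
    # nsmallest returns at most k items, so a small collection never causes an error.
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Toy "embeddings" (assumed values for illustration only).
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
nearest = top_k([0.9, 0.1], vectors, k=3)
print([index for index, _ in nearest])  # [1, 0, 2]
```

Note that this sketch simply returns fewer than k results when the collection is small, so callers do not need to filter out placeholder indices.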

## Context Summarization

The retrieved context is truncated to its first few sentences (three by default), which shrinks the prompt and lowers token usage. This is simple extractive truncation rather than model-generated summarization.
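
One way to picture this step is as keeping only the leading sentences of the joined chunks. The sketch below is a variant, not the agent's exact code: `first_sentences` is a hypothetical helper, and it splits after `.`, `?`, or `!` with a regular expression so that questions and exclamations are also treated as sentence boundaries.

```python
import re
from typing import List


def first_sentences(chunks: List[str], max_sentences: int = 3) -> str:
    """Join the chunks and keep only the first max_sentences sentences."""
    text = " ".join(chunk.strip() for chunk in chunks if chunk.strip())
    if not text:
        return ""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return " ".join(sentences[:max_sentences])


chunks = [
    "Paris is the capital of France. It has the Eiffel Tower.",
    "Python is a language. It is used in AI.",
]
print(first_sentences(chunks))
# Paris is the capital of France. It has the Eiffel Tower. Python is a language.
```

Because this keeps the first sentences in retrieval order, anything relevant that appears later in the retrieved chunks is dropped; abbreviations such as "e.g." will also be treated as sentence ends.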

## Prompt Refinement

The prompt is restructured to be concise and direct, with redundant instructions removed.
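
As a rough illustration of the prompt layout, the sketch below builds the four-line prompt in a standalone function. The name `build_prompt` is an assumption for this example; the wording of the template follows the agent's prompt.

```python
def build_prompt(context: str, question: str) -> str:
    """Assemble a short instruction, the context, the question, and an answer cue."""
    return (
        "You are a helpful assistant. Answer the question based on the following context.\n"
        f"Context: {context.strip()}\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )


prompt = build_prompt(
    "The capital of France is Paris.",
    "What is the capital of France?",
)
print(prompt)
```

Keeping the template in one function makes it easier to compare prompt variants when measuring token usage.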

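## Performance Logging

Token and timing metrics are still logged for evaluation, but they now reflect the leaner prompt and response. Token counts are approximated by whitespace-separated word counts, not by the model's tokenizer.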
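
The bookkeeping behind these metrics can be sketched without any model call. `QueryMetrics` and `approx_tokens` below are hypothetical names, the reply string is a stand-in for a model response, and the token count is a whitespace word count, which only approximates real tokenizer output.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List


def approx_tokens(text: str) -> int:
    """Approximate a token count by counting whitespace-separated words."""
    return len(text.split())


@dataclass
class QueryMetrics:
    total_tokens: int = 0
    query_times: List[float] = field(default_factory=list)

    def record(self, prompt: str, response: str, elapsed: float) -> int:
        """Store one query's approximate token usage and elapsed time."""
        used = approx_tokens(prompt) + approx_tokens(response)
        self.total_tokens += used
        self.query_times.append(elapsed)
        return used

    def summary(self) -> Dict[str, float]:
        n = len(self.query_times)
        return {
            "total_token_usage": self.total_tokens,
            "average_query_time": sum(self.query_times) / n if n else 0.0,
            "number_of_queries": n,
        }


metrics = QueryMetrics()
start = time.perf_counter()
reply = "The capital of France is Paris."  # stand-in for a model response
used = metrics.record("Question: What is the capital of France?", reply, time.perf_counter() - start)
print(used, metrics.summary())
```

If exact counts matter, the counts reported by the model server or a real tokenizer would be a better source than word counts.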



In [4]:
import ollama
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
import time
import logging
from typing import List, Tuple

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

class OptimizedAgent:
    def __init__(self):
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
        self.dimension = 384
        self.index = faiss.IndexFlatL2(self.dimension)
        self.documents = []
        self.document_embeddings = []
        self.token_usage = 0
        self.query_times = []
        self.k = 3  # retrieve top-k chunks only

    def embed_text(self, text: str) -> np.ndarray:
        return self.embedding_model.encode([text])[0]

    def add_document(self, document: str):
        start_time = time.time()
        chunks = self.chunk_document(document)
        for chunk in chunks:
            embedding = self.embed_text(chunk)
            self.index.add(np.array([embedding], dtype=np.float32))
            self.documents.append(chunk)
            self.document_embeddings.append(embedding)
        logger.info(f"Added document with {len(chunks)} chunks. Time: {time.time() - start_time:.4f}s")

    def chunk_document(self, document: str, max_words: int = 100) -> List[str]:
        words = document.split()
        return [" ".join(words[i:i+max_words]) for i in range(0, len(words), max_words)]

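    def summarize_context(self, context: List[str], max_sentences: int = 3) -> str:
        # Extractive truncation: keep only the first max_sentences sentences.
        text = " ".join(context)
        # Strip fragments and drop empty ones to avoid doubled spaces and periods.
        sentences = [s.strip() for s in text.split('.') if s.strip()]
        if not sentences:
            return ""
        return ". ".join(sentences[:max_sentences]) + "."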

    def query(self, question: str) -> Tuple[str, int]:
        start_time = time.time()

        question_embedding = self.embed_text(question)
        distances, indices = self.index.search(np.array([question_embedding], dtype=np.float32), self.k)

        # FAISS pads missing neighbours with -1 when the index holds fewer than k vectors.
        context = [self.documents[idx] for idx in indices[0] if 0 <= idx < len(self.documents)]
        summarized_context = self.summarize_context(context)

        prompt = (
            f"You are a helpful assistant. Answer the question based on the following context.\n"
            f"Context: {summarized_context}\n"
            f"Question: {question}\n"
            f"Answer:"
        )

        response = ollama.chat(
            model='phi4',
            messages=[{'role': 'user', 'content': prompt}]
        )

        # Whitespace-separated word counts: a rough proxy, not the model's true token counts.
        prompt_tokens = len(prompt.split())
        response_tokens = len(response['message']['content'].split())
        total_tokens = prompt_tokens + response_tokens
        self.token_usage += total_tokens

        query_time = time.time() - start_time
        self.query_times.append(query_time)

        logger.info(f"Query processed. Time: {query_time:.4f}s, Tokens: {total_tokens}")

        return response['message']['content'], total_tokens

    def get_performance_metrics(self) -> dict:
        return {
            'total_token_usage': self.token_usage,
            'average_query_time': np.mean(self.query_times) if self.query_times else 0,
            'number_of_queries': len(self.query_times)
        }

# Example usage
if __name__ == "__main__":
    agent = OptimizedAgent()

    documents = [
        "The capital of France is Paris. It is known for the Eiffel Tower and fine cuisine.",
        "Python is a high-level programming language often used in AI and data science.",
        "The sun, a massive ball of gas, is a star at the center of our solar system."
    ]

    for doc in documents:
        agent.add_document(doc)

    queries = [
        "What is the capital of France?",
        "What is Python used for?",
        "Is the sun a star or a planet?"
    ]

    for query in queries:
        answer, tokens = agent.query(query)
        print(f"Query: {query}\nAnswer: {answer}\nTokens used: {tokens}\n")

    metrics = agent.get_performance_metrics()
    print("Performance Metrics:")
    print(f"Total Token Usage: {metrics['total_token_usage']}")
    print(f"Average Query Time: {metrics['average_query_time']:.4f}s")
    print(f"Number of Queries: {metrics['number_of_queries']}")



Query: What is the capital of France?
Answer: The capital of France is Paris.
Tokens used: 57

Query: What is Python used for?
Answer: Python is a high-level programming language that is commonly used in fields such as artificial intelligence (AI) and data science. Its versatility, readability, and extensive library support make it an ideal choice for tasks like machine learning, data analysis, automation, web development, and scientific computing.

In AI and data science, Python is particularly popular due to its powerful libraries like TensorFlow, PyTorch, NumPy, Pandas, and Scikit-learn. These tools facilitate complex computations, statistical modeling, and data manipulation with relative ease, allowing developers and researchers to build sophisticated models and analyze large datasets effectively.

Overall, Python's simplicity and broad applicability make it a preferred language for various applications in technology and research.
Tokens used: 160

Query: Is the sun a star or a pla