# Technique 0: Setup & Baseline RAG

## 🎯 Welcome to Advanced RAG Techniques!

This notebook establishes the **foundation** for all advanced techniques you'll learn.

### What You'll Do:
1. Set up your environment
2. Load the MSME dataset
3. Build a basic RAG system
4. Establish baseline metrics
5. Test with sample queries

### Why This Matters:
Every advanced technique will be compared against this baseline. Understanding where we start helps you appreciate the improvements!

**Difficulty:** ⭐☆☆☆☆

## 📋 Prerequisites

Before starting, ensure you have:
- ✅ Python 3.8+
- ✅ OpenAI API key
- ✅ Installed dependencies (see requirements.txt)
- ✅ `msme.csv` in this directory
- ✅ `.env` file with your API key

## Step 1: Import Libraries

We'll use our custom `utils_openai.py` module along with LangChain components.

In [4]:
# Import utilities
from utils_openai import (
    setup_openai_api,
    load_msme_data,
    create_embeddings,
    create_llm,
    create_vectorstore,
    get_baseline_prompt,
    print_retrieval_results,
    count_tokens_approximate
)

# LangChain components
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

print("✅ All imports successful!")

✅ All imports successful!


## Step 2: Set Up the OpenAI API

Load your API key from the `.env` file.
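The `setup_openai_api` helper lives in `utils_openai` and is not reproduced in this notebook. As a rough idea of what such a loader might do, here is a minimal standard-library sketch. The function name `load_api_key`, the variable name `OPENAI_API_KEY`, and the simple `KEY=VALUE` parsing are assumptions for illustration, not the course's actual implementation.

```python
import os
from pathlib import Path


def load_api_key(env_path=".env", var_name="OPENAI_API_KEY"):
    """Return the API key from the environment, falling back to a .env file.

    Hypothetical helper: the real setup_openai_api may behave differently.
    """
    # A variable that is already exported takes precedence over the file.
    key = os.environ.get(var_name)
    if key:
        return key

    path = Path(env_path)
    if path.is_file():
        for raw_line in path.read_text(encoding="utf-8").splitlines():
            line = raw_line.strip()
            # Skip blanks, comments, and lines that are not KEY=VALUE.
            if not line or line.startswith("#") or "=" not in line:
                continue
            name, _, value = line.partition("=")
            if name.strip() == var_name:
                value = value.strip().strip("'\"")  # drop optional quotes
                if value:
                    return value

    raise RuntimeError(f"{var_name} not found in the environment or in {env_path}")
```

Failing loudly when the key is missing tends to be easier to debug than a later authentication error from the API client.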

In [5]:
# Load API key
api_key = setup_openai_api()
print("✅ API key loaded successfully!")

✅ API key loaded successfully!


## Step 3: Load MSME Dataset

Our knowledge base contains 14 documents about MSMEs in Nigeria:
- Business registration procedures
- Financing options and policies
- Government support programs
- Industry-specific guidance

In [6]:
# Load the MSME data
documents, metadatas, ids = load_msme_data("msme.csv")

print(f"\nDataset Overview:")
print(f"- Total documents: {len(documents)}")
print(f"- Sample title: {metadatas[0]['doc_title']}")
print(f"- Average doc length: {sum(len(d) for d in documents) // len(documents)} characters")

[OK] Loaded 14 documents from msme.csv

Dataset Overview:
- Total documents: 14
- Sample title: **Introduction 1. Definition and Importance of MSMEs**    - Definitions according to Nigerian policies (e.g., SMEDAN criteria).    - Role of MSMEs in Nigeria's economy (employment, GDP contribution, innovation).
- Average doc length: 17416 characters
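`load_msme_data` returns three parallel lists (`documents`, `metadatas`, `ids`), and each metadata dict carries a `doc_title` key, as the output above shows. The helper itself is not shown, so the following is only a sketch of how such a loader could be written with the standard `csv` module. The function name `load_csv_documents`, the column names `doc_title` and `content`, and the `doc_N` ID format are assumptions.

```python
import csv


def load_csv_documents(csv_path, title_column="doc_title", text_column="content"):
    """Read a CSV into parallel lists: documents, metadatas, ids.

    Hypothetical stand-in for load_msme_data; column names are assumed.
    """
    documents, metadatas, ids = [], [], []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row_number, row in enumerate(csv.DictReader(handle)):
            text = (row.get(text_column) or "").strip()
            if not text:
                continue  # skip rows with no body text
            documents.append(text)
            metadatas.append({"doc_title": (row.get(title_column) or "").strip()})
            ids.append(f"doc_{row_number}")  # ID derived from the CSV row position
    return documents, metadatas, ids
```

Keeping the three lists aligned by index is what lets the vector store attach the right title and ID to each embedded document.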


## Step 4: Initialize Models

We'll use:
- **Embeddings:** `text-embedding-3-small` (OpenAI)
- **LLM:** `gpt-4o-mini` at temperature 0 (for repeatable answers)

In [7]:
# Create embeddings model
embeddings = create_embeddings(api_key)

# Create chat model
llm = create_llm(api_key, temperature=0)

print("\n✅ Models initialized!")

[OK] Initialized embeddings: text-embedding-3-small
[OK] Initialized LLM: gpt-4o-mini (temp=0)

✅ Models initialized!


## Step 5: Create Vector Store

We'll use ChromaDB to store document embeddings for fast similarity search.

In [8]:
# Create vector store
vectorstore = create_vectorstore(
    documents=documents,
    metadatas=metadatas,
    ids=ids,
    embeddings=embeddings,
    collection_name="msme_baseline",
    persist_directory="./chroma_db_baseline"
)

print("✅ Vector store created and persisted!")

[OK] Created vector store: msme_baseline (14 docs)
✅ Vector store created and persisted!


## Step 6: Create Retriever

The retriever will find the most relevant documents for a given query.
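To make "most relevant" concrete, the sketch below ranks toy vectors by cosine similarity and keeps the top `k`. It is a plain-Python illustration of the idea, not ChromaDB's implementation: the distance metric Chroma actually uses depends on how the collection is configured, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, doc_vectors, k=5):
    """Return (doc_index, score) pairs for the k most similar document vectors."""
    scored = [(cosine_similarity(query_vector, vec), idx)
              for idx, vec in enumerate(doc_vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


# Toy 2-D "embeddings"; real embedding vectors have hundreds of dimensions or more.
toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for idx, score in top_k([1.0, 0.0], toy_docs, k=2):
    print(idx, round(score, 3))
```

With only 14 documents in the store and `k=5`, more than a third of the knowledge base is returned for every query, which is worth keeping in mind when reading the retrieval results.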

In [9]:
# Create retriever (retrieve top 5 documents)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Test retrieval
test_query = "Procedure and legal requirements for setting up a business in Nigeria"
retrieved_docs = retriever.invoke(test_query)

print(f"Retrieved {len(retrieved_docs)} documents for query:")
print(f"'{test_query}'")
print_retrieval_results(retrieved_docs, max_docs=2, max_chars=150)

Retrieved 5 documents for query:
'Procedure and legal requirements for setting up a business in Nigeria'

Retrieved 5 documents:

Document 1:
Title: **Industry Specific** How to register an MSME in the following industries   1. Agriculture**    - **Crop Production**: Largest contributor within agriculture; includes food and cash crops.    - **Livestock**: Includes poultry, cattle, sheep, goats, and aquaculture.    - **Forestry**: Activities involving logging and forest product extraction.    - **Fishing**: Fish farming (aquaculture) and capture fisheries.
Content: How To Register Your Farm As A Business - Agriculture is a vital sector in Nigeria, providing employment opportunities and contributing significantly ...
--------------------------------------------------------------------------------

Document 2:
Title: **Starting an MSME in Nigeria** 2. **Business Registration and Legal Requirements** - CAC registration process.    - Taxpayer Identification Number (TIN) and VAT compliance.


In [10]:
print_retrieval_results(retrieved_docs, max_docs=5, max_chars=5000)


Retrieved 5 documents:

Document 1:
Title: **Industry Specific** How to register an MSME in the following industries   1. Agriculture**    - **Crop Production**: Largest contributor within agriculture; includes food and cash crops.    - **Livestock**: Includes poultry, cattle, sheep, goats, and aquaculture.    - **Forestry**: Activities involving logging and forest product extraction.    - **Fishing**: Fish farming (aquaculture) and capture fisheries.
Content: How To Register Your Farm As A Business - Agriculture is a vital sector in Nigeria, providing employment opportunities and contributing significantly to the nation’s economy. If you’re a farmer looking to formalize your agricultural venture, registering your farm as a business is a crucial step. This not only enhances the credibility of your enterprise but also opens doors to various benefits and opportunities. How To Register Your Farm As A Business - Step 1: Business Idea and Name - Step 2: Conduct a Name Availability Sear

## Step 7: Build Baseline RAG Chain

This chain uses the **modern LCEL** (LangChain Expression Language) pattern rather than the deprecated `RetrievalQA` chain.

### The Pipeline:
1. Query comes in
2. Retriever finds relevant docs
3. Prompt combines docs + query
4. LLM generates answer
5. Output parser extracts text
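The five steps can be pictured as ordinary function composition. The sketch below is a simplified plain-Python stand-in for the pipeline, not the LCEL API: every function here (`make_rag_pipeline`, `toy_retrieve`, `toy_prompt`, `toy_generate`, `toy_parse`) and the two-line corpus are hypothetical, chosen so the example runs without an API key.

```python
def make_rag_pipeline(retrieve, build_prompt, generate, parse):
    """Compose the four stages into a single question -> answer function."""
    def run(question):                          # 1. query comes in
        docs = retrieve(question)               # 2. retriever finds relevant docs
        prompt = build_prompt(docs, question)   # 3. prompt combines docs + query
        raw_output = generate(prompt)           # 4. LLM generates an answer
        return parse(raw_output)                # 5. output parser extracts text
    return run


# Toy stand-ins (all hypothetical) so the sketch is runnable offline.
CORPUS = [
    "CAC registration is required to set up a business.",
    "Microfinance banks offer loans to small businesses.",
]


def toy_retrieve(question, k=1):
    # Rank by shared lowercase words; a real retriever compares embeddings.
    words = set(question.lower().split())
    ranked = sorted(CORPUS,
                    key=lambda doc: len(words & set(doc.lower().split())),
                    reverse=True)
    return ranked[:k]


def toy_prompt(docs, question):
    return "Context:\n" + "\n".join(docs) + f"\n\nQuestion: {question}"


def toy_generate(prompt):
    # Pretend LLM: echoes the first context line inside a response dict.
    return {"content": "  Based on the context: " + prompt.splitlines()[1] + "  "}


def toy_parse(raw_output):
    return raw_output["content"].strip()


pipeline = make_rag_pipeline(toy_retrieve, toy_prompt, toy_generate, toy_parse)
print(pipeline("How do I set up a business?"))
```

In the real chain, the `|` operator plays the role of `make_rag_pipeline`, and each stage is a LangChain runnable rather than a bare function.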

In [11]:
# Get prompt template
prompt = get_baseline_prompt()

# Build the RAG chain (Modern LCEL pattern)
baseline_rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print("✅ Baseline RAG chain created!")
print("\nChain structure:")
print("  Query → Retriever → Prompt → LLM → Answer")

✅ Baseline RAG chain created!

Chain structure:
  Query → Retriever → Prompt → LLM → Answer


## Step 8: Test with Sample Queries

Let's test our baseline RAG system with various queries about MSMEs.

In [12]:
# Test Query 1
question1 = "Explain the procedure and legal requirements for setting up a business in Nigeria"

print(f"Question: {question1}\n")
answer1 = baseline_rag_chain.invoke(question1)
print(f"Answer: {answer1}")
print(f"\n{'='*80}\n")

Question: Explain the procedure and legal requirements for setting up a business in Nigeria

Answer: To set up a business in Nigeria, the first step is to register with the Corporate Affairs Commission (CAC), which involves choosing a unique business name and structure (sole proprietorship, partnership, or limited liability company). Following this, you must obtain a Tax Identification Number (TIN) from the Federal Inland Revenue Service (FIRS) for tax purposes. Additionally, businesses may need to register with relevant regulatory bodies depending on their industry, secure necessary licenses and permits, and open a corporate bank account. It is advisable to consult legal professionals to ensure compliance with all regulatory requirements and to facilitate a smooth registration process (Sources: CAC registration process, Tax Registration, and Licensing requirements).




In [13]:
# Test Query 2
question2 = "What are the financing options for small businesses in Nigeria?"

print(f"Question: {question2}\n")
answer2 = baseline_rag_chain.invoke(question2)
print(f"Answer: {answer2}")
print(f"\n{'='*80}\n")

Question: What are the financing options for small businesses in Nigeria?

Answer: Small businesses in Nigeria have access to various financing options, including traditional sources such as commercial banks and microfinance institutions, which offer loans tailored to the needs of small and medium enterprises (SMEs). The Development Bank of Nigeria provides wholesale term funding and risk-sharing facilities to participating financial institutions for on-lending to MSMEs, with flexible repayment terms (up to 10 years) and competitive interest rates. Additionally, alternative funding sources like angel investors, crowdfunding platforms, and government initiatives such as the AGSMEIS (Agri-Business/Small and Medium Enterprises Investment Scheme) and NIRSAL Microfinance Bank offer specialized financial solutions to support the growth of small businesses (Sources: Development Bank of Nigeria, AGSMEIS, NIRSAL MFB).




In [14]:
# Test Query 3
question3 = "What is the Development Bank of Nigeria loan repayment plan?"

print(f"Question: {question3}\n")
answer3 = baseline_rag_chain.invoke(question3)
print(f"Answer: {answer3}")
print(f"\n{'='*80}\n")

Question: What is the Development Bank of Nigeria loan repayment plan?

Answer: The Development Bank of Nigeria (DBN) offers a flexible loan repayment plan with a tenure of up to 10 years, which includes a moratorium period of up to 18 months. This structure is designed to accommodate the financial needs of Micro, Small, and Medium Enterprises (MSMEs) involved in productive enterprises. The interest rates for DBN loans are market-conforming and financially sustainable, ensuring that they are accessible to eligible borrowers through Participating Financial Institutions (PFIs) such as commercial banks and microfinance banks. For more details, you can refer to the official DBN website or the Development Bank of Nigeria's loan information page.




## Step 9: Establish Baseline Metrics

These metrics will be our comparison point for all advanced techniques.

We'll measure:
- Number of documents retrieved
- Total tokens in context
- Approximate cost
- Answer quality (subjective)

In [15]:
# Calculate baseline metrics for question 1
retrieved_for_q1 = retriever.invoke(question1)
total_context = "\n\n".join([doc.page_content for doc in retrieved_for_q1])
token_count = count_tokens_approximate(total_context)

print("📊 BASELINE METRICS")
print("="*80)
print(f"Query: '{question1}'")
print(f"\nRetrieval:")
print(f"  - Documents retrieved: {len(retrieved_for_q1)}")
print(f"  - Total context tokens: ~{token_count}")
print(f"  - Average tokens per doc: ~{token_count // len(retrieved_for_q1)}")
print("="*80)

📊 BASELINE METRICS
Query: 'Explain the procedure and legal requirements for setting up a business in Nigeria'

Retrieval:
  - Documents retrieved: 5
  - Total context tokens: ~19607
  - Average tokens per doc: ~3921
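The metrics cell reports token counts but does not convert them into the approximate cost listed among the metrics. A minimal sketch of that conversion follows. The price used is a placeholder assumption, not a quoted rate, so check your provider's current pricing. The 4-characters-per-token rule in `approximate_tokens` is a common heuristic and may differ from what `count_tokens_approximate` does; both function names below are hypothetical.

```python
def approximate_tokens(text, chars_per_token=4):
    """Rough token estimate from character count (heuristic, not a real tokenizer)."""
    return len(text) // chars_per_token


def estimate_input_cost(token_count, usd_per_million_tokens):
    """Cost in USD of sending token_count input tokens at the given rate."""
    return token_count / 1_000_000 * usd_per_million_tokens


# Token count taken from the baseline metrics output for question 1.
context_tokens = 19607

# Placeholder price (assumption): replace with the current input price for your model.
PLACEHOLDER_USD_PER_MILLION = 0.15

cost = estimate_input_cost(context_tokens, PLACEHOLDER_USD_PER_MILLION)
print(f"Approximate context cost per query: ${cost:.5f}")
```

This covers input context only. Output tokens and the prompt template add to the total, so treat the figure as a lower bound for comparing techniques rather than a bill.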


## ✅ Summary

**What you built:**
- ✅ Loaded MSME dataset (14 documents)
- ✅ Created embeddings with `text-embedding-3-small`
- ✅ Built vector store with ChromaDB
- ✅ Implemented modern RAG chain with LCEL
- ✅ Established baseline metrics

**Current System:**
- Retrieves top 5 documents based on semantic similarity
- Uses roughly 17,500-20,400 tokens of context per query (as measured on the queries in this notebook)
- Works well for straightforward questions

**Limitations (what we'll improve):**
- ❌ Misses exact keyword matches
- ❌ Can't handle vague queries well
- ❌ Retrieves too much irrelevant context
- ❌ Fixed chunk size may break meaning
- ❌ No way to rerank or refine results

**Next:** Technique 1 - BM25 Hybrid Search will address keyword matching!

## 💪 Exercise: Explore the Baseline

**Task:**
1. Try 3 more queries of your own about Nigerian MSMEs
2. For each query:
   - Note how many retrieved docs seem relevant
   - Rate the answer quality (1-10)
   - Calculate approximate token usage
3. Identify one query where the system struggles

**Example queries to try:**
- "What are the tax benefits for small businesses?"
- "How long does business registration take?"
- "What is SMEDAN and what do they do?"
- "Can I get a loan for my tech startup?"

**Expected Outcome:**
You should find at least one query where:
- The system retrieves irrelevant documents, OR
- Misses documents with exact keyword matches, OR
- The answer is incomplete/vague

**Time:** 10 minutes
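One way to keep the per-query notes from the task comparable is a small record type. The sketch below is only a suggestion: the `QueryFinding` class and `weakest` helper are hypothetical, and the example values are placeholders, not measurements from this notebook.

```python
from dataclasses import dataclass


@dataclass
class QueryFinding:
    """One row of exercise notes for a single query."""
    query: str
    relevant_docs: int    # how many retrieved docs looked relevant
    retrieved_docs: int   # how many docs the retriever returned (k)
    answer_rating: int    # your subjective 1-10 score
    context_tokens: int   # approximate tokens sent as context

    @property
    def precision(self):
        """Share of retrieved docs judged relevant."""
        if self.retrieved_docs == 0:
            return 0.0
        return self.relevant_docs / self.retrieved_docs


def weakest(findings):
    """Pick the query with the lowest rating, breaking ties by lowest precision."""
    return min(findings, key=lambda f: (f.answer_rating, f.precision))


# Placeholder values for illustration only; replace them with your own observations.
notes = [
    QueryFinding("What is SMEDAN and what do they do?", 3, 5, 8, 18000),
    QueryFinding("Can I get a loan for my tech startup?", 1, 5, 5, 20000),
]
struggling = weakest(notes)
print(f"Struggling query: {struggling.query} (precision {struggling.precision:.0%})")
```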

**Document your findings in the cell below:**

In [20]:
# Your Exercise Code Here

# Query 1:
my_query_1 = "How much does it cost to register as LLC?"  # Add your query
answer_1 = baseline_rag_chain.invoke(my_query_1)

# Query 2:
# my_query_2 = "What is the process to register in Nigeria as a Limited Liability Company?"  # Add your query
# answer_2 = baseline_rag_chain.invoke(my_query_2)

# # Query 3:
# my_query_3 = "Does CAC registration cover for registration as LLC"  # Add your query
# answer_3 = baseline_rag_chain.invoke(my_query_3)


# answer_1 was already computed above, so print it without re-invoking the chain
print(f"Question: {my_query_1}\n")
print(f"Answer: {answer_1}")
print(f"\n{'='*80}\n")

# Document your findings:
# print("My Findings:")
# print("- Struggling query: [describe which query struggled]")
# print("- Why it struggled: [explain the issue]")
# print("- Potential solution: [which technique might help?]")

Question: How much does it cost to register as LLC?

Answer: To register a Limited Liability Company (LLC) in Nigeria, the cost typically includes an application and processing fee of ₦100,000, along with a full registration fee of ₦250,000, bringing the total to ₦350,000. This fee structure is outlined by the Corporate Affairs Commission (CAC) and is essential for formalizing your business entity. Additionally, it is important to consider other potential costs such as obtaining a Tax Identification Number (TIN) and any industry-specific licenses that may be required. For more detailed information, you can refer to the CAC guidelines and the Companies and Allied Matters Act (CAMA) 2020.




In [19]:
# Calculate baseline metrics for my_query_1 (variable names are reused from Step 9)
retrieved_for_q1 = retriever.invoke(my_query_1)
total_context = "\n\n".join([doc.page_content for doc in retrieved_for_q1])
token_count = count_tokens_approximate(total_context)

print("📊 BASELINE METRICS")
print("="*80)
print(f"Query: '{my_query_1}'")
print(f"\nRetrieval:")
print(f"  - Documents retrieved: {len(retrieved_for_q1)}")
print(f"  - Total context tokens: ~{token_count}")
print(f"  - Average tokens per doc: ~{token_count // len(retrieved_for_q1)}")
print("="*80)

📊 BASELINE METRICS
Query: 'How much does it cost to register as LLC?'

Retrieval:
  - Documents retrieved: 5
  - Total context tokens: ~20391
  - Average tokens per doc: ~4078


In [27]:
# Re-create the retriever (top 5 documents) to inspect retrieval for Query 1
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Retrieve with the same query used in the exercise cell
retrieved_docs = retriever.invoke(my_query_1)

print(f"Retrieved {len(retrieved_docs)} documents for query:")
print(f"'{my_query_1}'")
print_retrieval_results(retrieved_docs, max_docs=5, max_chars=150)

Retrieved 5 documents for query:
'How much does it cost to register as LLC?'

Retrieved 5 documents:

Document 1:
Title: **Industry Specific** ### **3. Construction**    - Building and civil engineering activities, including roads, bridges, and real estate development
Content: In addition to the CAC registration. COREN and CORBON registration is needed. Coren Requirement for Engineering firm The Council is empowered to licen...
--------------------------------------------------------------------------------

Document 2:
Title: **Starting an MSME in Nigeria** 2. **Business Registration and Legal Requirements** - CAC registration process.    - Taxpayer Identification Number (TIN) and VAT compliance.
Content: BUSINESS SETUP IN NIGERIA – PROCEDURE & LEGAL REQUIREMENTS There are various legal and regulatory requirements that must be met for a business set up ...
--------------------------------------------------------------------------------

Document 3:
Title: ** Industry Specific** **6.

In [21]:
# Query 2:
my_query_2 = "What is the process to register in Nigeria as a Limited Liability Company?"  # Add your query
answer_2 = baseline_rag_chain.invoke(my_query_2)

# Print the answer computed above (no second chain call needed)
print(f"Question: {my_query_2}\n")
print(f"Answer: {answer_2}")
print(f"\n{'='*80}\n")


Question: What is the process to register in Nigeria as a Limited Liability Company?

Answer: To register as a Limited Liability Company (LLC) in Nigeria, you must first choose a unique business name and conduct a name availability search through the Corporate Affairs Commission (CAC). Next, prepare the necessary documents, including the Memorandum and Articles of Association, and complete the registration forms. After submitting these documents along with the required fees, the CAC will review your application and, upon approval, issue a Certificate of Incorporation, officially recognizing your LLC. Additionally, you will need to obtain a Tax Identification Number (TIN) from the Federal Inland Revenue Service (FIRS) for tax purposes. For detailed guidance, refer to the CAC's official resources and the Companies and Allied Matters Act (CAMA) 2020. 

Sources: [CAC Registration Process](https://www.cac.gov.ng/wp-content/uploads/2020/12/CAMA-NOTE-BOOK-FULL-VERSION.pdf), [Business Registra

In [22]:
# Calculate baseline metrics for my_query_2 (variable names are reused from Step 9)
retrieved_for_q1 = retriever.invoke(my_query_2)
total_context = "\n\n".join([doc.page_content for doc in retrieved_for_q1])
token_count = count_tokens_approximate(total_context)

print("📊 BASELINE METRICS")
print("="*80)
print(f"Query: '{my_query_2}'")
print(f"\nRetrieval:")
print(f"  - Documents retrieved: {len(retrieved_for_q1)}")
print(f"  - Total context tokens: ~{token_count}")
print(f"  - Average tokens per doc: ~{token_count // len(retrieved_for_q1)}")
print("="*80)

📊 BASELINE METRICS
Query: 'What is the process to register in Nigeria as a Limited Liability Company?'

Retrieval:
  - Documents retrieved: 5
  - Total context tokens: ~17584
  - Average tokens per doc: ~3516


In [23]:
# Query 3:
my_query_3 = "Does CAC registration cover for registration as LLC"  # Add your query
answer_3 = baseline_rag_chain.invoke(my_query_3)

# Print the answer computed above (no second chain call needed)
print(f"Question: {my_query_3}\n")
print(f"Answer: {answer_3}")
print(f"\n{'='*80}\n")

Question: Does CAC registration cover for registration as LLC

Answer: Yes, the Corporate Affairs Commission (CAC) registration does cover the process for registering a business as a Limited Liability Company (LLC) in Nigeria. To register as an LLC, applicants must submit completed registration forms, a Memorandum and Articles of Association, and identification documents for directors and shareholders, along with the payment of registration fees. This process is essential for formalizing the business and ensuring compliance with the Companies and Allied Matters Act (CAMA) 2020, which governs business operations in Nigeria (source: **Starting an MSME in Nigeria**).




In [24]:
# Calculate baseline metrics for my_query_3 (variable names are reused from Step 9)
retrieved_for_q1 = retriever.invoke(my_query_3)
total_context = "\n\n".join([doc.page_content for doc in retrieved_for_q1])
token_count = count_tokens_approximate(total_context)

print("📊 BASELINE METRICS")
print("="*80)
print(f"Query: '{my_query_3}'")
print(f"\nRetrieval:")
print(f"  - Documents retrieved: {len(retrieved_for_q1)}")
print(f"  - Total context tokens: ~{token_count}")
print(f"  - Average tokens per doc: ~{token_count // len(retrieved_for_q1)}")
print("="*80)

📊 BASELINE METRICS
Query: 'Does CAC registration cover for registration as LLC'

Retrieval:
  - Documents retrieved: 5
  - Total context tokens: ~20391
  - Average tokens per doc: ~4078


**Next Steps:**
- ➡️ **Technique 1:** Contextual Compression Retrieval