I have developed a working Retrieval-Augmented Generation (RAG) question-answering bot for **Walmart (a business)**, using Google's Gemini API and a vector database (Pinecone).

The **data** was collected from **Walmart's Wikipedia page** and saved locally as a text file.

In [1]:
# install required libraries
!pip install datasets transformers
!pip install langchain
!pip install langchain_community
!pip install sentence-transformers
!pip install accelerate
!pip install pinecone-client
!pip install wikipedia-api





In [2]:
# import the required libraries
from tqdm.notebook import tqdm
import pandas as pd
from typing import Optional, List, Tuple
from datasets import Dataset
import matplotlib.pyplot as plt
from langchain.docstore.document import Document as LangchainDocument
from langchain.text_splitter import RecursiveCharacterTextSplitter
import numpy as np
import pinecone


**Get Walmart's Wikipedia Data:**

In [3]:
# The code below imports `wikipediaapi`, which is provided by the wikipedia-api package
!pip install wikipedia-api




In [4]:
import wikipediaapi

# Set a valid User-Agent string
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"

# Initialize the Wikipedia API object with the User-Agent
wiki_walmart = wikipediaapi.Wikipedia(
    language='en',
    user_agent=user_agent
)

# Fetch the Walmart page
page = wiki_walmart.page("Walmart")

# Extract the content of the page
walmart_text = page.text

# Print a snippet of the Walmart text to verify
print(walmart_text[:1000])  # Print the first 1000 characters


Walmart Inc. (  ; formerly Wal-Mart Stores, Inc.) is an American multinational retail corporation that operates a chain of hypermarkets (also called supercenters), discount department stores, and grocery stores in the United States and 23 other countries. It is headquartered in Bentonville, Arkansas. The company was founded by brothers Sam and James "Bud" Walton in nearby Rogers, Arkansas, in 1962 and incorporated under Delaware General Corporation Law on October 31, 1969. It also owns and operates Sam's Club retail warehouses.
As of October 31, 2022, Walmart has 10,586 stores and clubs in 24 countries, operating under 46 different names. The company operates under the name Walmart in the United States and Canada, as Walmart de México y Centroamérica in Mexico and Central America, and as Flipkart Wholesale in India. It has wholly owned operations in Chile and a majority stake in Massmart in South Africa. Since August 2018, Walmart held only a minority stake in Walmart Brasil, which was

In [5]:
# Define the file path where the text will be saved
file_path = "/content/walmart_wikipedia_page.txt"

# Write the Walmart text to the file
with open(file_path, "w", encoding="utf-8") as f:
    f.write(walmart_text)

print(f"Walmart Wikipedia page saved to {file_path}")


Walmart Wikipedia page saved to /content/walmart_wikipedia_page.txt


In [6]:
# Load the Walmart Wikipedia page content from the file saved above
with open(file_path, "r", encoding="utf-8") as file:
    data = file.read()

print(data[:500])  # Preview the first 500 characters


Walmart Inc. (  ; formerly Wal-Mart Stores, Inc.) is an American multinational retail corporation that operates a chain of hypermarkets (also called supercenters), discount department stores, and grocery stores in the United States and 23 other countries. It is headquartered in Bentonville, Arkansas. The company was founded by brothers Sam and James "Bud" Walton in nearby Rogers, Arkansas, in 1962 and incorporated under Delaware General Corporation Law on October 31, 1969. It also owns and opera


**Create a RAW Knowledge Base:**

In [7]:
RAW_KNOWLEDGE_BASE = LangchainDocument(page_content=data)


In [8]:
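# Note: RecursiveCharacterTextSplitter treats separators as literal strings unless
# is_separator_regex=True is passed, so the regex-style entries below only match literally.
MARKDOWN_SEPARATORS = [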
    "\n#{1,6} ",    # Markdown headings (e.g., #, ##, ###, etc.)
    "```\n",        # Code block delimiters in Markdown
    "\n\\*\\*\\*+\n", # Triple asterisks (e.g., ***)
    "\n---+\n",     # Horizontal rules in Markdown (---)
    "\n___+\n",     # Another type of horizontal rule (___)
    "\n\n",         # Blank lines
    "\n",           # Newline characters
    " ",            # Spaces
    "",             # Default separator when all else fails
]

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # Maximum number of characters in a chunk
    chunk_overlap=100,  # Overlap between chunks
    add_start_index=True,  # Include chunk's start index in metadata
    strip_whitespace=True,  # Remove leading/trailing whitespace
    separators=MARKDOWN_SEPARATORS,
)

docs_processed = text_splitter.split_documents([RAW_KNOWLEDGE_BASE])
print(f"Processed {len(docs_processed)} chunks.")


Processed 160 chunks.
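
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window character chunker. It is a simplified illustration, not LangChain's actual algorithm: `RecursiveCharacterTextSplitter` also tries to break on the separators listed above, which this sketch ignores. The function name `chunk_text` and the toy input are assumptions made for the example.

```python
from typing import Dict, List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> List[Dict[str, object]]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        # Keep the start offset, similar in spirit to add_start_index=True
        chunks.append({"start_index": start, "text": piece})
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghij" * 25  # 250 characters of toy text
chunks = chunk_text(sample, chunk_size=100, chunk_overlap=20)
for c in chunks:
    print(c["start_index"], len(c["text"]))
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the tail of one chunk is repeated at the head of the next. That repetition is what helps a sentence split across a boundary remain retrievable from at least one chunk.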


**Embedding the Data:**

In [9]:
!pip install sentence-transformers




In [10]:
from sentence_transformers import SentenceTransformer

# Load the embedding model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Embed every chunk in one batched call (one vector per chunk, no hardcoded chunk count)
embedded_data = embedding_model.encode([doc.page_content for doc in docs_processed])

# Check the shape of the embedding
print(f"Shape of the embedding: {np.array(embedded_data).shape}")


Shape of the embedding: (160, 384)
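
Retrieval later works by comparing the embedding of a question against these 384-dimensional chunk embeddings. The sketch below shows the idea with cosine similarity on toy 3-dimensional vectors. The notebook does not state which metric the Pinecone index was created with, so cosine similarity is an assumption here, and the vectors and ids are made up for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy stand-ins for a query embedding and three chunk embeddings
query = [1.0, 0.0, 1.0]
chunk_vectors = {
    "vec_0": [1.0, 0.0, 1.0],  # same direction as the query
    "vec_1": [0.0, 1.0, 0.0],  # orthogonal to the query
    "vec_2": [1.0, 1.0, 0.0],  # partially aligned with the query
}

# Rank chunk ids from most to least similar to the query
ranked = sorted(
    chunk_vectors,
    key=lambda key: cosine_similarity(query, chunk_vectors[key]),
    reverse=True,
)
print(ranked)
```

A vector database performs the same kind of ranking, but over an index structure that avoids comparing the query against every stored vector.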


**Store the Embeddings in Pinecone:**

In [21]:
from pinecone import Pinecone


# Initialize Pinecone
pc = Pinecone(api_key="YOUR PINECONE API KEY")
index = pc.Index("walmartqa")  # access a specific index


In [12]:
# Prepare the upsert data (the embeddings, along with metadata)
upsert_data = []
for i, embedding in enumerate(embedded_data):
    upsert_data.append({
        "id": f"vec_{i}",
        "values": embedding.tolist(),  # convert the numpy vector to a plain list of floats
        "metadata": {"text": docs_processed[i].page_content}
    })

# Upsert the data into the Pinecone index
index.upsert(vectors=upsert_data)
print("Data has been upserted to Pinecone.")


Data has been upserted to Pinecone.
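
A single upsert call was enough for 160 vectors, but a larger knowledge base is usually sent in smaller requests. The following is a minimal sketch of a batching helper; the name `batched`, the batch size of 50, and the toy vectors are assumptions for illustration, and the appropriate batch size depends on vector dimension and metadata size.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Toy stand-in for upsert_data: 160 records with dummy values
toy_vectors = [{"id": f"vec_{i}", "values": [0.0, 0.0, 0.0]} for i in range(160)]
batch_sizes = [len(batch) for batch in batched(toy_vectors, batch_size=50)]
print(batch_sizes)

# With the notebook's objects, the upsert step could then look like:
# for batch in batched(upsert_data, batch_size=50):
#     index.upsert(vectors=batch)
```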


**Load the LLM:**

In [13]:
!pip install google-generativeai langchain-google-genai streamlit




I used **Google's Gemini API** as the LLM because it is free to use for testing and does not require any payment or billing details. The OpenAI API could be used in the same way to access the GPT models.

In [14]:
import os
import google.generativeai as genai
from IPython.display import Markdown

# Set your API key
os.environ['GOOGLE_API_KEY'] = "YOUR GOOGLE GEMINI API KEY"

# Configure the API with the key
genai.configure(api_key=os.environ['GOOGLE_API_KEY'])

# Initialize the Gemini model (gemini-pro)
model = genai.GenerativeModel('gemini-pro')

# Test that the model is loaded by making a sample request to generate content
response = model.generate_content("List 5 planets each with an interesting fact")

# Display the response as Markdown
Markdown(response.text)


1. **Mercury**: The smallest planet in our solar system, Mercury is also the closest to the Sun. Its surface temperature can reach up to 450 degrees Celsius during the day, but drops to -180 degrees Celsius at night.
2. **Venus**: Venus is often referred to as Earth's "twin" due to its similar size and mass. However, Venus' atmosphere is extremely thick and composed mostly of carbon dioxide, creating a runaway greenhouse effect that makes it the hottest planet in our solar system.
3. **Earth**: Our home planet, Earth is the only known planet in the universe that is capable of supporting life. It has a unique combination of atmospheric composition, temperature range, and water availability that make it habitable.
4. **Mars**: Known as the "Red Planet" due to the iron oxide on its surface, Mars has a thin atmosphere and a surface that is mostly covered with craters and volcanoes. It is also home to two polar ice caps and evidence suggests that it may once have had liquid water on its surface.
5. **Jupiter**: The largest planet in our solar system, Jupiter is a gas giant composed mostly of hydrogen and helium. It has a swirling atmosphere with colorful bands and storms, and is surrounded by a vast system of moons, including the iconic Great Red Spot.

**Create a prompt template for the LLM:**

In [15]:
# Define the prompt template
prompt_template = """
You are a helpful assistant that answers business-related questions based on the information provided from the Walmart knowledge base.

Context:
{}
---
Question: {}
Answer:
"""

In [16]:
# Function to query the Pinecone index for the most relevant context
def get_relevant_context(user_input):
    # Generate the embedding for the user input
    vectorized_input = embedding_model.encode(user_input)

    # Pinecone expects a plain list of floats, so convert the numpy array if needed
    if isinstance(vectorized_input, np.ndarray):
        vectorized_input = vectorized_input.tolist()

    # Query the Pinecone index for the best matching context
    query_result = index.query(
        vector=vectorized_input,
        top_k=5,  # Get the top 5 most relevant documents
        include_metadata=True
    )

    context = ""
    if query_result['matches']:
        for match in query_result['matches']:
            context += match['metadata']['text'] + "\n" # Build context from multiple matches

    if context:
        return context.strip()
    else:
        return "No relevant context found."
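
Note that a nearest-neighbour query normally returns up to `top_k` matches no matter how weakly they relate to the question, so the fallback string above is only reached when the query returns no matches at all. One possible refinement, sketched below, is to drop matches whose similarity score falls under a threshold. The helper name `build_context`, the `min_score` value of 0.3, and the toy matches (plain dicts mimicking the shape of `query_result['matches']`) are all assumptions; a real threshold would have to be tuned against the index's metric and data.

```python
from typing import Any, Dict, List

FALLBACK = "No relevant context found."


def build_context(matches: List[Dict[str, Any]], min_score: float = 0.3) -> str:
    """Join the text of matches scoring at least min_score, or return the fallback."""
    texts = [m["metadata"]["text"] for m in matches if m["score"] >= min_score]
    return "\n".join(texts) if texts else FALLBACK


# Toy matches: one clearly relevant, one weak
toy_matches = [
    {"id": "vec_0", "score": 0.82,
     "metadata": {"text": "Walmart is headquartered in Bentonville, Arkansas."}},
    {"id": "vec_7", "score": 0.12,
     "metadata": {"text": "An unrelated chunk."}},
]

in_scope = build_context(toy_matches)
out_of_scope = build_context([m for m in toy_matches if m["score"] < 0.3])
print(in_scope)
print(out_of_scope)
```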



In [17]:
# Function to get the answer from the Google Gemini API
def get_answer_from_gemini(context, question):
    # Prepare the prompt by filling in the context and question
    prompt = prompt_template.format(context, question)

    # Call the Gemini API, reusing the model object initialized earlier
    response = model.generate_content(prompt)

    return response.text.strip()
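
The prompt is filled positionally, so the order of the arguments to `format` must match the order of the `{}` placeholders. As an optional variant, the sketch below uses named fields so that the context and question cannot be swapped by accident. `NAMED_PROMPT_TEMPLATE` and `build_prompt` are hypothetical names introduced for this example; the wording of the template is copied from the one defined earlier.

```python
NAMED_PROMPT_TEMPLATE = """
You are a helpful assistant that answers business-related questions based on the information provided from the Walmart knowledge base.

Context:
{context}
---
Question: {question}
Answer:
"""


def build_prompt(context: str, question: str) -> str:
    """Fill the template by field name rather than by position."""
    return NAMED_PROMPT_TEMPLATE.format(context=context, question=question)


prompt = build_prompt(
    context="Walmart was founded in 1962.",
    question="When was Walmart founded?",
)
print(prompt)
```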

**Generate responses according to the user's input:**

In [19]:
# Interactive chat with the user
print("Welcome to the Walmart Business QA Bot!")
print("Type 'Exit' to quit.")

while True:
    user_input = input("User: ")

    if user_input.strip().lower() == "exit":
        print("Goodbye!")
        break

    # Get the relevant context from Pinecone
    context = get_relevant_context(user_input)

    # Get the answer from Google Gemini
    answer = get_answer_from_gemini(context, user_input)

    print("AI response:", answer)

Welcome to the Walmart Business QA Bot!
Type 'Exit' to quit.
User: What is Walmart?
AI response: Walmart is an American multinational retail corporation that operates a chain of hypermarkets (also called supercenters), discount department stores, and grocery stores in the United States and 23 other countries.
User: Who founded Walmart?
AI response: Sam and James "Bud" Walton
User: In which year Walmart was founded?
AI response: 1962
User: Where is Walmart headquartered?
AI response: Bentonville, Arkansas
User: What are Walmart's core business ideas?
AI response: Walmart's core business ideas include:

* Selling a wide variety of general merchandise at low prices
* Catering to different demographic groups with tailored merchandising strategies
* Providing customer service through designated greeters
User: What is the business model of Walmart?
AI response: Walmart's business model is based on selling a wide variety of general merchandise at low prices.
User: What is Sam's club and how d

**Inference:**

The interaction above shows that the bot retrieves information from the embedded Walmart knowledge base and answers the Walmart-related questions accurately.

For questions outside the knowledge base, such as 'What is Artificial Intelligence?', the bot correctly states that no relevant context is found.