# Setting Up the Work Environment

In [1]:
# Install required libraries
!pip install --upgrade google-generativeai                 # Older Google Generative AI (Gemini) SDK, imported as google.generativeai
!pip install -q -U google-genai                            # Newer Google Gen AI SDK, imported with "from google import genai"
!pip install langchain-community                           # Community-supported LangChain tools
!pip install docarray                                      # Used for storing and searching documents in memory
!pip install -U langchain-google-genai                     # LangChain integration for Google Generative AI (Gemini)

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m40.5/40.5 kB[0m [31m1.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m226.1/226.1 kB[0m [31m5.6 MB/s[0m eta [36m0:00:00[0m
Collecting langchain-community
  Downloading langchain_community-0.3.27-py3-none-any.whl.metadata (2.9 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.10.1-py3-none-any.whl.metadata (3.4 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.1-py3-none-any.whl.metadata (9.4 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading marshmallow-3.26.1-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->lan

In [36]:
import os
import langchain                                      # Needed later to toggle langchain.debug
from google import genai                              # Newer Google Gen AI SDK (provides genai.Client)
import google.generativeai as ggenai                  # Older SDK, used here for configure() and list_models()
from google.colab import userdata                     # For securely retrieving secrets in Colab
from IPython.display import display, Markdown         # For displaying markdown in Colab notebooks

# LangChain integrations with Gemini
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_google_genai import GoogleGenerativeAIEmbeddings

# LangChain utilities
from langchain.indexes import VectorstoreIndexCreator                # Helps create vector index from loaders
from langchain_community.vectorstores import DocArrayInMemorySearch  # In-memory vector store
from langchain.chains import RetrievalQA                             # LangChain's retrieval-based QA chain
from langchain_community.document_loaders import CSVLoader           # Loads documents from CSV
from langchain.evaluation.qa import QAGenerateChain, QAEvalChain     # Example generation and grading chains

In [2]:
# Get the Gemini API key stored securely in Colab
key = userdata.get('genai_api')

# Instantiate a GenAI client with the key
client = genai.Client(api_key=key)

# Configure the generative AI with the same key for further calls
ggenai.configure(api_key=key)

List the available models.

In [77]:
models = ggenai.list_models()
# You can uncomment below to print model names
# for model in models:
#     print(model.name)

In [58]:
# Initialize the Gemini LLM (Chat Model)
llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",     # Choose your preferred Gemini model (e.g., flash, pro, etc.)
    temperature=0.0,              # Set to 0 for deterministic output
    google_api_key=key
)

In [8]:
# Set Up Embeddings (For Similarity Search)
embedding = GoogleGenerativeAIEmbeddings(
    model="models/embedding-001",  # You can choose different embedding models
    google_api_key=key
)

# Creating Our Q&A Application

- We begin by creating the index with VectorstoreIndexCreator. We then build the RetrievalQA chain by specifying the language model, the chain type, and the retriever, and by setting verbose=True to display detailed execution output.

In [55]:
# Load a CSV File and Create a Vector Index
file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file)
data = loader.load()

# Use LangChain to create an in-memory vector index using Gemini embeddings
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embedding
).from_loaders([loader])
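As a rough illustration of what an in-memory vector index does at query time, the sketch below ranks a few hand-made vectors by cosine similarity to a query vector. The vectors, the product labels, and the helper names (cosine_similarity, top_k) are hypothetical; real Gemini embeddings have many more dimensions, and this is not DocArrayInMemorySearch's actual implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the names of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name)
              for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]

# Toy 3-dimensional "embeddings" (made up for this example)
toy_vectors = {
    "pullover set": [0.9, 0.1, 0.0],
    "down jacket": [0.1, 0.9, 0.2],
    "dog mat": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]

nearest = top_k(query, toy_vectors, k=2)
print(nearest)
```

The retriever hands the top-ranked chunks to the language model, so a question can only be answered well if the relevant chunk ranks near the top.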

In [59]:
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=index.vectorstore.as_retriever(),
    verbose=True,
    chain_type_kwargs = {
        "document_separator": "<<<<>>>>>"
    }
)
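Conceptually, the "stuff" chain type joins all retrieved chunks into one context string, using the configured document_separator, and sends that string to the model together with the question. The sketch below mimics this with plain Python. The function name stuff_documents and the prompt wording are assumptions made for illustration, not LangChain's actual prompt template.

```python
def stuff_documents(chunks, question, separator="\n\n"):
    """Join retrieved chunks into a single prompt (illustrative template only)."""
    context = separator.join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"

# Hypothetical retrieved chunks
retrieved = [
    "name: Cozy Core 3-in-1 Jacket",
    "name: Women's Soft Fleece Vest",
]

prompt = stuff_documents(
    retrieved,
    "Which products are jackets?",
    separator="<<<<>>>>>",  # same separator as in chain_type_kwargs above
)
print(prompt)
```

A distinctive separator such as "<<<<>>>>>" makes chunk boundaries easy to spot when reading the chain's debug output.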

- With the application configured, the next step is to choose a set of data points on which to evaluate it. These data points are the basis for assessing the application's accuracy and reliability.

# Manual Testing and Debugging Procedures

We begin with the simplest debugging approach: picking data points that we think make good test cases. By reading a few documents from the dataset, we can write example questions by hand together with their ground truth answers. These curated examples are used later to evaluate the system's accuracy.

In [13]:
data[10]

Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 10}, page_content=": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditional Features\n- Relaxed fit top with raglan sleeves and rounded hem.\n- Pull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg.\n\nImported.")

In [16]:
data[11]

Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 11}, page_content=': 11\nname: Ultra-Lofty 850 Stretch Down Hooded Jacket\ndescription: This technical stretch down jacket from our DownTek collection is sure to keep you warm and comfortable with its full-stretch construction providing exceptional range of motion. With a slightly fitted style that falls at the hip and best with a midweight layer, this jacket is suitable for light activity up to 20° and moderate activity up to -30°. The soft and durable 100% polyester shell offers complete windproof protection and is insulated with warm, lofty goose down. Other features include welded baffles for a no-stitch construction and excellent stretch, an adjustable hood, an interior media port and mesh stash pocket and a hem drawcord. Machine wash and dry. Imported.')

In [17]:
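# Note: the backslash line continuation keeps the next line's leading
# indentation inside the string, which is why the queries in the outputs
# below contain a run of extra spaces.
examples = [
    {
        "query": "Do the Cozy Comfort Pullover Set\
        have side pockets?",
        "answer": "Yes"
    },
    {
        "query": "What collection is the Ultra-Lofty \
        850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection"
    }
]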
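The hand-written queries pick up a run of spaces from the indented continuation line, as the later outputs show. One way to tidy such strings before using them, sketched here with a hypothetical helper named normalize_query, is to collapse every run of whitespace into a single space.

```python
def normalize_query(text):
    """Collapse every run of whitespace into a single space."""
    return " ".join(text.split())

raw = "Do the Cozy Comfort Pullover Set        have side pockets?"
clean = normalize_query(raw)
print(clean)
```

Whether the extra spaces affect retrieval quality is not established here; normalizing simply removes one variable when debugging a failed lookup.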

# Automated Evaluation Using LLMs

- We use the apply_and_parse method so that an output parser is applied to the model's response. This gives us a structured dictionary containing the query and the answer rather than a single unstructured string.

In [21]:
# Create the QA generation chain using Gemini
example_gen_chain = QAGenerateChain.from_llm(llm)

In [23]:
new_examples = example_gen_chain.apply_and_parse([{"doc": t} for t in data[:5]])
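To make the parsing step concrete, the sketch below turns a raw text reply into a dictionary with query and answer keys using a regular expression. The "QUESTION: ... ANSWER: ..." layout, the pattern, and the helper name parse_qa are assumptions made for illustration; the parser that QAGenerateChain actually uses may differ in detail.

```python
import re

# Assumed reply layout: "QUESTION: <text>" followed by "ANSWER: <text>"
QA_PATTERN = re.compile(r"QUESTION: (.*?)\n+ANSWER: (.*)", re.DOTALL)

def parse_qa(text):
    """Extract a {'query': ..., 'answer': ...} dict from a raw model reply."""
    match = QA_PATTERN.search(text)
    if match is None:
        raise ValueError("Reply does not follow the QUESTION/ANSWER layout")
    return {
        "query": match.group(1).strip(),
        "answer": match.group(2).strip(),
    }

# Hypothetical raw reply from the model
raw_reply = (
    "QUESTION: What material is the shell made of?\n"
    "ANSWER: 100% polyester."
)
parsed = parse_qa(raw_reply)
print(parsed)
```

Raising an error on a malformed reply, rather than returning a partial dictionary, keeps bad test cases out of the evaluation set.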



- Each returned item holds a query and its answer, nested under the qa_pairs key. Next, we look at the source document from which this question-and-answer pair was generated.

In [24]:
new_examples[0]

{'qa_pairs': {'query': 'According to the product description, what specific features contribute to the "broken-in feel" of the Women\'s Campside Oxfords from the moment they are first worn?',
  'answer': 'The "broken-in feel" of the Women\'s Campside Oxfords is attributed to their super-soft canvas material and thick cushioning.'}}

In [25]:
data[0]

Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 0}, page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.")

- We have just generated several question-answer pairs automatically, which saves time and lets us grow the set of test cases with little manual effort. We now add these generated examples to the ones we wrote by hand.

In [26]:
examples += new_examples

In [28]:
qa.invoke(examples[0]["query"])



> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'Do the Cozy Comfort Pullover Set        have side pockets?',
 'result': "I'm sorry, I don't have any information about the Cozy Comfort Pullover Set.  I only have information about the Cozy Core 3-in-1 Jacket and a Women's Soft Fleece Vest."}

# Reviewing Internal Operations

To observe the internal operations of LangChain, we can enable debugging by setting langchain.debug = True. Upon rerunning the previous example, the system will output detailed execution logs, providing greater visibility into each step of the chain's processing.

In [30]:
langchain.debug = True

In [31]:
qa.invoke(examples[0]["query"])

[32;1m[1;3m[chain/start][0m [1m[chain:RetrievalQA] Entering Chain run with input:
[0m{
  "query": "Do the Cozy Comfort Pullover Set        have side pockets?"
}
[32;1m[1;3m[chain/start][0m [1m[chain:RetrievalQA > chain:StuffDocumentsChain] Entering Chain run with input:
[0m[inputs]
[32;1m[1;3m[chain/start][0m [1m[chain:RetrievalQA > chain:StuffDocumentsChain > chain:LLMChain] Entering Chain run with input:
[0m{
  "question": "Do the Cozy Comfort Pullover Set        have side pockets?",
  "context": "Additional Features: Fully adjustable hood. Two hand pockets, one chest pocket. Interior zippered stow pocket, drop mesh pocket<<<<>>>>>: 975\nname: Cozy Core 3-in-1 Jacket<<<<>>>>>Berber fleece-lined pockets keep hands warm. One pocket has a hidden security pocket tucked inside. Imported.<<<<>>>>>: 520\nname: Women's  Soft Fleece Vest"
}
[32;1m[1;3m[llm/start][0m [1m[chain:RetrievalQA > chain:StuffDocumentsChain > chain:LLMChain > llm:ChatGoogleGenerativeAI] Entering LLM

{'query': 'Do the Cozy Comfort Pullover Set        have side pockets?',
 'result': "I'm sorry, I don't have any information about the Cozy Comfort Pullover Set.  I only have information about the Cozy Core 3-in-1 Jacket and a Women's Soft Fleece Vest."}
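The debug log makes it possible to check what the retriever actually passed to the model. As a small sketch, the context string from the log above can be split on the document separator to see whether any retrieved chunk mentions the product in question. The variable names are illustrative only.

```python
SEPARATOR = "<<<<>>>>>"

# Context string copied from the debug log above
context = (
    "Additional Features: Fully adjustable hood. Two hand pockets, one chest pocket. "
    "Interior zippered stow pocket, drop mesh pocket<<<<>>>>>: 975\n"
    "name: Cozy Core 3-in-1 Jacket<<<<>>>>>Berber fleece-lined pockets keep hands warm. "
    "One pocket has a hidden security pocket tucked inside. Imported.<<<<>>>>>: 520\n"
    "name: Women's  Soft Fleece Vest"
)

chunks = context.split(SEPARATOR)
product = "Cozy Comfort Pullover Set"
mentions = [product in chunk for chunk in chunks]

for number, chunk in enumerate(chunks):
    print(f"--- chunk {number} ---")
    print(chunk)
print("Any chunk mentions the product:", any(mentions))
```

None of the retrieved chunks names the Cozy Comfort Pullover Set, so the failure here lies in the retrieval step rather than in the model's reading of the context.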

In [32]:
# Turn off the debug mode
langchain.debug = False
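Because langchain.debug is a global flag, it is easy to leave it switched on by accident. One possible pattern, sketched below, is a small context manager that sets an attribute and restores its previous value afterwards, even if the wrapped code raises. The helper name temporarily is hypothetical, and a SimpleNamespace stands in for the langchain module so that the example runs on its own.

```python
from contextlib import contextmanager
from types import SimpleNamespace

@contextmanager
def temporarily(obj, attr, value):
    """Set obj.attr to value inside the with-block, then restore the old value."""
    previous = getattr(obj, attr)
    setattr(obj, attr, value)
    try:
        yield obj
    finally:
        setattr(obj, attr, previous)

# Stand-in for the langchain module (assumption made for this example)
settings = SimpleNamespace(debug=False)

with temporarily(settings, "debug", True):
    inside = settings.debug   # flag is on while the chain would run
after = settings.debug        # flag is restored afterwards

print(inside, after)
```

In the notebook, the same helper could be applied to the langchain module itself, for example `with temporarily(langchain, "debug", True): qa.invoke(...)`.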

# LLM-Assisted Evaluation

In [61]:
# Preprocess examples to ensure 'query' is a top-level key
processed_examples = []

for example in examples:
    if 'qa_pairs' in example:
        processed_examples.append({
            'query': example['qa_pairs']['query'],
            'answer': example['qa_pairs']['answer']
        })
    else:
        processed_examples.append(example)

# Now all examples have a consistent structure
for example in processed_examples:
    print(example)

# qa is the RetrievalQA chain built earlier; apply() runs it on every example
predictions = qa.apply(processed_examples)
print(predictions)


{'query': 'Do the Cozy Comfort Pullover Set        have side pockets?', 'answer': 'Yes'}
{'query': 'What collection is the Ultra-Lofty         850 Stretch Down Hooded Jacket from?', 'answer': 'The DownTek collection'}
{'query': 'According to the product description, what specific features contribute to the "broken-in feel" of the Women\'s Campside Oxfords from the moment they are first worn?', 'answer': 'The "broken-in feel" of the Women\'s Campside Oxfords is attributed to their super-soft canvas material and thick cushioning.'}
{'query': 'According to the product description, what are two key benefits of the Recycled Waterhog Dog Mat, and what specific features contribute to each benefit?', 'answer': "Two key benefits are floor protection and environmental sustainability.  Floor protection is achieved through the mat's ultra-durable construction, thick and thin fibers for scraping dirt and absorbing water, and quick-drying properties that resist mildew and rotting.  Environmental sus

- With the examples prepared, we can move on to evaluation. QAEvalChain, imported at the top of the notebook, grades the answers: we instantiate it with a language model and then call its evaluate method with the reference examples and the corresponding predictions. The graded outputs indicate whether each predicted answer agrees with its reference answer.

In [71]:
eval_chain = QAEvalChain.from_llm(llm)
graded_outputs = eval_chain.evaluate(processed_examples, predictions)
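To see why a language model is useful as a grader, compare it with plain string matching. In the sketch below, the reference and predicted answers for the DownTek question mean the same thing, yet an exact comparison rejects the prediction; a token-overlap score does better but would still miss paraphrases. Both helper functions (exact_match, token_recall) are hypothetical baselines and are not part of LangChain.

```python
import string

def normalize(text):
    """Lowercase the text and strip ASCII punctuation."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table)

def exact_match(reference, prediction):
    """True only if both answers are identical after normalization."""
    return normalize(reference) == normalize(prediction)

def token_recall(reference, prediction):
    """Fraction of reference tokens that also appear in the prediction."""
    ref_tokens = set(normalize(reference).split())
    pred_tokens = set(normalize(prediction).split())
    if not ref_tokens:
        return 0.0
    return len(ref_tokens & pred_tokens) / len(ref_tokens)

reference = "The DownTek collection"
prediction = "It is from the DownTek collection."

is_exact = exact_match(reference, prediction)
recall = token_recall(reference, prediction)
print(is_exact, recall)
```

An LLM grader judges meaning rather than surface form, at the cost of extra model calls and some grading noise.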

To see how each example was evaluated, we loop over the examples and print four fields. First comes the question, which was either written by hand (the first two examples) or generated by a language model. Next comes the reference answer, which for the generated examples was produced by a language model with full access to the source document.

Then comes the predicted answer, generated by the language model inside the QA chain, which retrieves documents through embeddings and a vector store before answering.

Finally, we print the grade, assigned by a language model prompted to judge whether the predicted answer is correct. Reviewing the question, reference answer, prediction, and grade side by side shows how the application performs on each example.

In [72]:
for i, eg in enumerate(processed_examples):
    print(f"Example {i}:")
    print("Question: " + predictions[i]['query'])
    print("Real Answer: " + predictions[i]['answer'])
    print("Predicted Answer: " + predictions[i]['result'])
    print("Predicted Grade: " + graded_outputs[i]['results'])
    print()

Example 0:
Question: Do the Cozy Comfort Pullover Set        have side pockets?
Real Answer: Yes
Predicted Answer: I'm sorry, but the context provided does not contain information about the "Cozy Comfort Pullover Set". Therefore, I cannot answer your question.
Predicted Grade: INCORRECT

Example 1:
Question: What collection is the Ultra-Lofty         850 Stretch Down Hooded Jacket from?
Real Answer: The DownTek collection
Predicted Answer: It is from the DownTek collection.
Predicted Grade: CORRECT

Example 2:
Question: According to the product description, what specific features contribute to the "broken-in feel" of the Women's Campside Oxfords from the moment they are first worn?
Real Answer: The "broken-in feel" of the Women's Campside Oxfords is attributed to their super-soft canvas material and thick cushioning.
Predicted Answer: According to the product description, the "broken-in feel" of the Women's Campside Oxfords from the moment they are first worn is due to the soft canvas 

In [76]:
graded_outputs[4]

{'results': 'CORRECT'}

In [74]:
graded_outputs[5]

{'results': 'INCORRECT'}
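Rather than inspecting grades one by one, the graded outputs can be tallied into an overall score. The sketch below assumes each item is a dictionary with a results key holding CORRECT or INCORRECT, as in the outputs above. The helper name summarize_grades is hypothetical, and the sample list is illustrative only; it is not the notebook's actual set of results.

```python
from collections import Counter

def summarize_grades(graded):
    """Count each grade label and compute the share graded CORRECT."""
    counts = Counter(item["results"].strip().upper() for item in graded)
    total = sum(counts.values())
    accuracy = counts["CORRECT"] / total if total else 0.0
    return counts, accuracy

# Illustrative sample, not the notebook's real results
sample_grades = [
    {"results": "INCORRECT"},
    {"results": "CORRECT"},
    {"results": "CORRECT"},
]

counts, accuracy = summarize_grades(sample_grades)
print(dict(counts), round(accuracy, 2))
```

In the notebook, the call would be summarize_grades(graded_outputs). If the grader returns anything other than the two bare labels, the labels would need cleaning before counting.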