# 06 — GenAI & RAG Experiments for SunnyBest

In this notebook, I prototype a simple **question-answering assistant** for the SunnyBest
Retail Sales Forecasting System.

The goal is to move from *static analysis* (charts, tables, models) to an **interactive assistant**
that can:

- Answer questions about stockouts, promotions, and forecasts.
- Use existing analysis as **context** (Retrieval-Augmented Generation style).
- Generate human-friendly explanations of model results.

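This is an offline RAG-style prototype: retrieval uses TF–IDF vectors with cosine similarity, and the answer step is a rule-based placeholder rather than an LLM call.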
Later, the core logic can be moved into:

- `src/genai/rag_index.py`
- `src/genai/rag_qa.py`
- `src/genai/explain_forecast.py`

and wired to a real LLM (e.g. OpenAI or Anthropic) for production use.


In [1]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

import textwrap

In [4]:
# Load merged data (same as other notebooks)
df = pd.read_csv("../data/processed/sunnybest_merged_df.csv", parse_dates=["date"], low_memory=False)
df.head()

Unnamed: 0,date,store_id,product_id,units_sold,price,regular_price,discount_pct,promo_flag,promo_type,revenue,...,is_weekend,is_holiday,is_payday,season,temperature_c,rainfall_mm,weather_condition,promo_type_promo,discount_pct_promo,promo_flag_promo
0,2021-01-01,1,1001,0,445838.0,445838,0,0,,0.0,...,False,True,False,Dry,30.6,3.7,Rainy,,,
1,2021-01-01,1,1002,2,500410.0,500410,0,0,,1000820.0,...,False,True,False,Dry,30.6,3.7,Rainy,,,
2,2021-01-01,1,1003,2,399365.0,399365,0,0,,798730.0,...,False,True,False,Dry,30.6,3.7,Rainy,,,
3,2021-01-01,1,1004,4,305796.0,305796,0,0,,1223184.0,...,False,True,False,Dry,30.6,3.7,Rainy,,,
4,2021-01-01,1,1005,5,462752.0,462752,0,0,,2313760.0,...,False,True,False,Dry,30.6,3.7,Rainy,,,


Knowledge base representing key insights from the other notebooks

In [5]:
knowledge_docs = []

def add_doc(title, text):
    knowledge_docs.append(
        {
            "title": title,
            "text": text
        }
    )

add_doc(
    "Stockout model summary",
    """
    I built a classification model to predict stockout_occurred for SunnyBest stores.
    Features included units_sold, price, regular_price, discount_pct, promo_flag,
    store_size, store_type, category, brand, calendar features (month, day, is_weekend,
    is_holiday, is_payday, season) and weather (temperature_c, rainfall_mm).
    The model helps identify which store-product combinations are at high risk of
    stockouts so procurement can take action.
    """
)

add_doc(
    "Business impact of stockout model",
    """
    The stockout prediction model achieves around XX percent accuracy and highlights
    that promotions, price changes, and seasonality are strong drivers of stockout risk.
    Large stores and high-demand categories such as Mobile Phones and Accessories
    are more likely to stock out. This supports better inventory allocation and
    proactive restocking across SunnyBest stores in Edo State.
    """
)

add_doc(
    "Promotion uplift analysis",
    """
    Promotion uplift analysis compares revenue on promo days versus non-promo days,
    and then uses a two-model Random Forest approach to estimate incremental revenue.
    Mobile Phones and Accessories show the highest uplift, meaning promotions work
    very well in these categories. Some categories like Refrigerators and Air
    Conditioners show smaller uplift, suggesting that deep discounts may not be
    necessary to drive sales.
    """
)

add_doc(
    "Promotion uplift by city",
    """
    Uplift by city shows that some cities, such as Benin and Ekpoma, respond very
    strongly to promotions, while other locations show weaker responses. SunnyBest
    should prioritise promotion budget and marketing campaigns in high-ROI markets,
    and reduce spend where uplift is low.
    """
)

add_doc(
    "Forecasting model summary",
    """
    The revenue forecasting system uses a baseline time series model and an XGBoost
    regression model that incorporates store, product, promotion, calendar, and
    weather signals. The best-performing model is saved as xgb_revenue_forecast.pkl
    under models/, and can be used to generate forward-looking revenue forecasts for
    different store-product combinations to support quarterly planning.
    """
)

len(knowledge_docs)


5

In [6]:
# Build TF-IDF Index for Retrieval 

# Prepare corpus
corpus = [doc["text"] for doc in knowledge_docs]

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(corpus)

doc_vectors.shape


(5, 137)
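
Once the index is fitted, it could be persisted so that a module such as `src/genai/rag_index.py` does not refit on every call. The sketch below is one possible approach using `pickle` from the standard library; the function names `save_index` and `load_index`, the file name `rag_index.pkl`, and the two-document demo corpus are assumptions for illustration and are not part of the project.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer


def save_index(path, vectorizer, doc_vectors, docs):
    """Pickle the fitted vectorizer, the TF-IDF matrix, and the source docs together."""
    payload = {"vectorizer": vectorizer, "doc_vectors": doc_vectors, "docs": docs}
    with open(path, "wb") as f:
        pickle.dump(payload, f)


def load_index(path):
    """Load an index written by save_index. Only unpickle files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Tiny stand-in corpus (hypothetical) so the sketch runs on its own
demo_docs = [
    {"title": "Promotions", "text": "Promotions lift revenue for mobile phones."},
    {"title": "Stockouts", "text": "Large stores are more likely to stock out."},
]
demo_vectorizer = TfidfVectorizer(stop_words="english")
demo_vectors = demo_vectorizer.fit_transform([d["text"] for d in demo_docs])

# Round-trip through a temporary file
index_path = Path(tempfile.mkdtemp()) / "rag_index.pkl"
save_index(index_path, demo_vectorizer, demo_vectors, demo_docs)
restored = load_index(index_path)
```

Keeping the documents in the same file as the vectors avoids the two drifting apart. A pickled scikit-learn object is generally only safe to load with the same library version that wrote it, so the version would need to be pinned alongside the artefact.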

In [7]:
def retrieve_context(query, top_k=3):
    """Return top_k most similar knowledge docs for a given query."""
    q_vec = vectorizer.transform([query])
    sims = cosine_similarity(q_vec, doc_vectors).flatten()
    top_idx = sims.argsort()[::-1][:top_k]
    
    results = []
    for idx in top_idx:
        results.append(
            {
                "title": knowledge_docs[idx]["title"],
                "text": knowledge_docs[idx]["text"],
                "similarity": float(sims[idx]),
            }
        )
    return results
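
`retrieve_context` always returns `top_k` documents, even when the query shares no vocabulary with the corpus. In that case the query vector is all zeros and every similarity is 0.0. A small post-filter could drop such matches before they reach the prompt. This is a minimal sketch: the helper name `filter_by_similarity` and the cutoff of 0.05 are assumptions, and the cutoff would need tuning on real questions.

```python
def filter_by_similarity(results, min_score=0.05):
    """Keep only retrieved docs whose similarity clears min_score (hypothetical cutoff)."""
    return [r for r in results if r["similarity"] >= min_score]


# Hand-written results in the same shape that retrieve_context returns
demo_results = [
    {"title": "Promotion uplift analysis", "text": "...", "similarity": 0.246},
    {"title": "Forecasting model summary", "text": "...", "similarity": 0.0},
]
kept = filter_by_similarity(demo_results)
print([r["title"] for r in kept])
```

An empty result list is then a usable signal that the assistant should say it has no relevant context instead of answering from an unrelated document.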


In [8]:
def print_context(results):
    for i, r in enumerate(results, 1):
        print(f"\n=== Context {i}: {r['title']} (score={r['similarity']:.3f}) ===\n")
        print(textwrap.fill(r["text"].strip(), width=100))


In [9]:
query = "Which categories benefit most from promotions?"
ctx = retrieve_context(query, top_k=2)
print_context(ctx)



=== Context 1: Promotion uplift analysis (score=0.246) ===

Promotion uplift analysis compares revenue on promo days versus non-promo days,     and then uses a
two-model Random Forest approach to estimate incremental revenue.     Mobile Phones and Accessories
show the highest uplift, meaning promotions work     very well in these categories. Some categories
like Refrigerators and Air     Conditioners show smaller uplift, suggesting that deep discounts may
not be     necessary to drive sales.

=== Context 2: Business impact of stockout model (score=0.179) ===

The stockout prediction model achieves around XX percent accuracy and highlights     that
promotions, price changes, and seasonality are strong drivers of stockout risk.     Large stores and
high-demand categories such as Mobile Phones and Accessories     are more likely to stock out. This
supports better inventory allocation and     proactive restocking across SunnyBest stores in Edo
State.
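
The wrapped output above contains runs of spaces (for example after "non-promo days,"). The knowledge texts are indented triple-quoted strings, and `textwrap.fill` turns each newline into a space without collapsing the indentation that follows it. One possible fix is to normalise whitespace before wrapping; the helper name `normalise_whitespace` is an assumption.

```python
import textwrap


def normalise_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return " ".join(text.split())


# Same indentation pattern as the knowledge_docs entries
raw = """
    Promotion uplift analysis compares revenue on promo days versus non-promo days,
    and then uses a two-model Random Forest approach to estimate incremental revenue.
    """
wrapped = textwrap.fill(normalise_whitespace(raw), width=100)
print(wrapped)
```

The same helper could be applied once inside `add_doc`, so that both the printed context and the prompt text are clean.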


In [10]:
def build_prompt(query, contexts):
    context_text = "\n\n".join(
        [f"[Context {i+1}] {c['text'].strip()}" for i, c in enumerate(contexts)]
    )
    
    prompt = f"""
You are a data assistant helping analyse SunnyBest Telecommunications' retail sales,
stockouts, promotions and forecasts.

Use the context below to answer the question. If the context does not fully answer it,
make a reasonable, conservative inference but do not invent unrealistic claims.

CONTEXT:
{context_text}

QUESTION:
{query}

ANSWER in clear, concise English, focusing on business implications.
"""
    return prompt.strip()


In [11]:
query = "Which product categories should SunnyBest prioritise for promotions?"
ctx = retrieve_context(query, top_k=3)
print(build_prompt(query, ctx))


You are a data assistant helping analyse SunnyBest Telecommunications' retail sales,
stockouts, promotions and forecasts.

Use the context below to answer the question. If the context does not fully answer it,
make a reasonable, conservative inference but do not invent unrealistic claims.

CONTEXT:
[Context 1] Uplift by city shows that some cities, such as Benin and Ekpoma, respond very
    strongly to promotions, while other locations show weaker responses. SunnyBest
    should prioritise promotion budget and marketing campaigns in high-ROI markets,
    and reduce spend where uplift is low.

[Context 2] The stockout prediction model achieves around XX percent accuracy and highlights
    that promotions, price changes, and seasonality are strong drivers of stockout risk.
    Large stores and high-demand categories such as Mobile Phones and Accessories
    are more likely to stock out. This supports better inventory allocation and
    proactive restocking across SunnyBest stores in Edo 
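
The prompt string is the only thing an LLM provider needs to receive, so the answer step can be written against any text-in, text-out callable and the provider chosen later. The sketch below assumes a hypothetical `answer_with_llm` helper and uses a stub, `fake_llm`, in place of a real client; no real provider API is shown.

```python
from typing import Callable


def answer_with_llm(prompt: str, llm_call: Callable[[str], str]) -> str:
    """Send a finished prompt to any text-in, text-out callable and trim the reply."""
    reply = llm_call(prompt)
    return reply.strip()


def fake_llm(prompt: str) -> str:
    """Stand-in for a real provider client; it only reports what it received."""
    return f"  (stub) received a prompt of {len(prompt)} characters  "


demo_prompt = "CONTEXT:\n[Context 1] ...\n\nQUESTION:\nWhich categories benefit most?"
print(answer_with_llm(demo_prompt, fake_llm))
```

Passing the client in as an argument keeps the notebook runnable offline and makes the answer step easy to test without network access.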

In [13]:
def simple_rule_based_answer(query, contexts):
    """
    Very naive: returns a canned or trimmed summary based on the most similar context.
    This is a placeholder until a real LLM is connected. `query` is currently unused;
    only the title of the top-ranked context decides the answer.
    """
    top = contexts[0]
    title = top["title"].lower()

    if "promotion uplift analysis" in title:
        return (
            "Promotions work best for high-demand categories such as Mobile Phones and "
            "Accessories, where uplift is strongest. Categories like Refrigerators and "
            "Air Conditioners show smaller uplift, so SunnyBest should be more selective "
            "with discounts there."
        )
    elif "stockout model summary" in title:
        return (
            "Stockouts are driven mainly by high demand combined with promotions, "
            "aggressive pricing, and seasonal patterns. Large stores and popular "
            "categories such as Mobile Phones and Accessories are more likely to "
            "experience stockouts, so they require closer inventory monitoring."
        )
    else:
        # default: return the first 60 words of the top context
        return "From the analysis: " + " ".join(top["text"].split()[:60]) + "..."


In [14]:
query = "Where should SunnyBest focus its promotion budget?"
ctx = retrieve_context(query, top_k=3)
print("QUESTION:", query)
print()
print("ANSWER (rule-based placeholder):")
print(textwrap.fill(simple_rule_based_answer(query, ctx), width=100))


QUESTION: Where should SunnyBest focus its promotion budget?

ANSWER (rule-based placeholder):
From the analysis: Uplift by city shows that some cities, such as Benin and Ekpoma, respond very
strongly to promotions, while other locations show weaker responses. SunnyBest should prioritise
promotion budget and marketing campaigns in high-ROI markets, and reduce spend where uplift is
low....


## 2. Example Q&A Scenarios

Below are example business questions that the SunnyBest GenAI assistant should handle:

1. *Which categories benefit most from discounts?*
2. *Which cities are most responsive to promotions?*
3. *What are the main drivers of stockouts?*
4. *How does the forecasting model support quarterly planning?*
5. *Which stores face the highest stockout risk?*

The cell below runs the first four questions through the retrieval helper and generates a draft answer for each; the fifth is not included in this run.


In [15]:
questions = [
    "Which categories benefit most from discounts?",
    "Which cities are most responsive to promotions?",
    "What are the main drivers of stockouts?",
    "How does the forecasting model support quarterly planning?",
]

for q in questions:
    ctx = retrieve_context(q, top_k=3)
    ans = simple_rule_based_answer(q, ctx)
    print("=" * 120)
    print("QUESTION:", q)
    print("\nANSWER (rule-based placeholder):")
    print(textwrap.fill(ans, width=100))
    print()


QUESTION: Which categories benefit most from discounts?

ANSWER (rule-based placeholder):
Promotions work best for high-demand categories such as Mobile Phones and Accessories, where uplift
is strongest. Categories like Refrigerators and Air Conditioners show smaller uplift, so SunnyBest
should be more selective with discounts there.

QUESTION: Which cities are most responsive to promotions?

ANSWER (rule-based placeholder):
From the analysis: Uplift by city shows that some cities, such as Benin and Ekpoma, respond very
strongly to promotions, while other locations show weaker responses. SunnyBest should prioritise
promotion budget and marketing campaigns in high-ROI markets, and reduce spend where uplift is
low....

QUESTION: What are the main drivers of stockouts?

ANSWER (rule-based placeholder):
From the analysis: The stockout prediction model achieves around XX percent accuracy and highlights
that promotions, price changes, and seasonality are strong drivers of stockout risk. Larg

## 3. Mapping this prototype to `src/genai/`

This notebook is a playground. In the production version of the Retail Sales Forecasting System,
the logic should be moved into the following modules:

- `src/genai/rag_index.py`
  - Functions to:
    - load project knowledge files (e.g. Markdown summaries, model reports),
    - build the TF–IDF (or embedding) index,
    - save / load the index artefacts.

- `src/genai/rag_qa.py`
  - Functions to:
    - accept a natural language question,
    - retrieve the most relevant context from the index,
    - build a prompt for an LLM,
    - call a real LLM API and return the answer.

- `src/genai/explain_forecast.py`
  - Helper functions to:
    - take a specific forecast row (e.g. for a store-product-month),
    - generate a small textual summary of drivers (price, promo, seasonality),
    - pass that into the RAG / LLM pipeline to produce a business-friendly explanation.
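
As a rough illustration of the `explain_forecast.py` idea, a forecast record could be turned into a short driver sentence before it is handed to the RAG / LLM pipeline. This is a sketch under assumptions: the function name `summarise_forecast_row` and the `forecast_revenue` field are hypothetical, the revenue figure is made up, and the other field names are borrowed from the merged dataframe loaded earlier.

```python
def summarise_forecast_row(row):
    """Turn one forecast record (a plain dict) into a short driver summary."""
    parts = [
        f"Store {row['store_id']}, product {row['product_id']}: "
        f"forecast revenue {row['forecast_revenue']:,.0f}."
    ]
    # Promotion status is read from promo_flag; discount_pct is only used when it is set
    if row.get("promo_flag"):
        parts.append(f"A promotion is active at {row['discount_pct']}% off.")
    else:
        parts.append("No promotion is active.")
    if row.get("season"):
        parts.append(f"Season: {row['season']}.")
    return " ".join(parts)


demo_row = {
    "store_id": 1,
    "product_id": 1001,
    "forecast_revenue": 1250000.0,  # made-up value for illustration
    "promo_flag": 1,
    "discount_pct": 10,
    "season": "Dry",
}
summary = summarise_forecast_row(demo_row)
print(summary)
```

The resulting sentence can be appended to the retrieved context so the LLM explains a specific forecast instead of the model in general.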

In a later step, the FastAPI app (`src/api/app.py`) can expose an endpoint such as
`/ask_sunnybest_assistant` that uses these GenAI utilities to provide interactive Q&A
for stakeholders.
