<a href="https://colab.research.google.com/github/peeyushsinghal/GenAI_Hands_On/blob/main/Gen_AI_in_Industry.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧪 Generative AI in Industry — Maps + FMCG with DSPy & Agentic Pipelines


**Audience:** Beginners / Practitioners who want practical exposure to Generative AI + Agentic AI  
**Focus:** **DSPy** and industry use cases in Maps and FMCG

> Tip: Run cells top-to-bottom. Sections are mostly independent (Section 4 reuses `reviews_text` from Section 3), so you can skip sections and install only what you need.


## 0) Environment Setup

This workshop uses:
- Python 3.10+
- `dspy` (or `dspy-ai`) for programmable, optimizable LLM pipelines
- `pandas` for data handling
- `rapidfuzz` for string similarity
- An LLM provider (Gemini or OpenAI or compatible).

> If you don't have Internet in your environment, skip installs and read through the code; it will still serve as a template.


In [None]:
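# Install dependencies. Comment these lines out if your environment has no Internet access.
!pip install --quiet dspy-ai rapidfuzz pandas python-dotenv
# For the Google Gemini API:
!pip install --quiet google-generativeai
# Optional, only needed for section 2.3:
# !pip install --quiet sentence-transformers scikit-learn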

In [None]:
import google.generativeai as genai

In [None]:
import os
from pathlib import Path

DATA_DIR = Path('data')
DATA_DIR.mkdir(exist_ok=True)

print("Setup complete. Data directory:", DATA_DIR.resolve())

Setup complete. Data directory: /content/data



## 1) Configure Model / API Keys

You can use **Google Gemini**, **OpenAI**, or any other provider supported by DSPy.
Set the relevant environment variable(s), then initialize the DSPy model.



In [None]:
import os
import dspy

# Read the API key from the environment; never hard-code secrets in a notebook.
GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY", "")
GEMINI_MODEL = "gemini/gemini-2.5-flash" # gemini-2.0-flash
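
One way to keep keys out of the notebook is to read them from the environment and only prompt interactively as a fallback. The helper below is a minimal sketch of that idea; `resolve_api_key` and its parameters are hypothetical names introduced here, not part of DSPy or any provider SDK.

```python
import os
from getpass import getpass


def resolve_api_key(env_var: str = "GEMINI_API_KEY", prompt: bool = True) -> str:
    """Return the API key from the environment, optionally prompting for it.

    Hypothetical helper for this workshop; not part of DSPy.
    """
    key = os.environ.get(env_var, "").strip()
    if not key and prompt:
        # getpass hides the typed value so it does not show up in cell output.
        key = getpass(f"Enter {env_var}: ").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it or pass prompt=True.")
    return key
```

A call such as `GEMINI_API_KEY = resolve_api_key()` could then stand in for a literal key, and the same helper works for other providers by passing a different `env_var`.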


### Initialize DSPy with your chosen language model

Let's set up a simple LLM in DSPy. **Fill in the code cell below to:**
- Import dspy
- Set up a language model (you can use a placeholder for the API key)
- Configure dspy to use this LLM


In [None]:
# If DSPy is installed, this will work. Otherwise, treat as reference code.
try:
    import dspy
    # Initialize a Gemini-based LM for DSPy (e.g., Gemini-2.5-flash)
    llm = dspy.LM(
        model= GEMINI_MODEL,
        api_key=GEMINI_API_KEY
    )
    dspy.settings.configure(lm=llm)
    print("DSPy initialized with model:", GEMINI_MODEL)
except Exception as e:
    print("DSPy not available or failed to initialize:", e)


DSPy initialized with model: gemini/gemini-2.5-flash


## Try Calling the LLM

Write a code cell to call the LLM with a simple prompt, e.g., 'Say this is a test!'.

Note! If this does not work, most likely something is wrong with the setup of your LLM.

In [None]:
llm("Say: this is a test!", temperature=0.7)  # => ['This is a test!']

['This is a test!']

---
## 2) Maps Mini-Project: Street Name Normalization & Matching

**Problem:** Real-world street names vary (`"MG Road"`, `"M.G. Rd"`, `"Mahatma Gandhi Road"`, `"Nagar Road"`, `"Ahmednagar Road"`).  
**Goal:** Normalize variants to a canonical (full) form and match duplicates.

We'll combine:
- **LLM-based normalization** (expand abbreviations, fix casing, remove punctuation)
- **String similarity** via `rapidfuzz` for robust matching


In [53]:
import pandas as pd
from rapidfuzz import fuzz, process

# Sample data with variants
df_streets = pd.DataFrame({
    "raw_street": [
        "MG Road", "M.G. Rd", "Mahatma Gandhi Rd",
        "Indira Gandhi Road", "Indira Gandhi Marg",
        "Green Road", "Green Park",
        "Nehru Marg", "J L Nehru Marg", "JLN Marg",
        "Ring Rd", "Outer Ring Road", "Outer Rng Rd",
        "Nagar Rd", "Ahmednagar Road"
    ]
})

df_streets.to_csv("data/streets_raw.csv", index=False)
df_streets.head()

|   | raw_street         |
| - | ------------------ |
| 0 | MG Road            |
| 1 | M.G. Rd            |
| 2 | Mahatma Gandhi Rd  |
| 3 | Indira Gandhi Road |
| 4 | Indira Gandhi Marg |


## 2.1) Fuzzy Matching

In [54]:
from rapidfuzz import process, fuzz

list_streets = df_streets['raw_street'].to_list()
for s in list_streets:
    matches = process.extract(s, list_streets, scorer=fuzz.token_sort_ratio, limit=3)
    print(f"\nTop matches for: {s}")
    print(matches)


Top matches for: MG Road
[('MG Road', 100.0, 0), ('M.G. Rd', 71.42857142857143, 1), ('Green Road', 70.58823529411764, 5)]

Top matches for: M.G. Rd
[('M.G. Rd', 100.0, 1), ('MG Road', 71.42857142857143, 0), ('Green Road', 47.05882352941176, 5)]

Top matches for: Mahatma Gandhi Rd
[('Mahatma Gandhi Rd', 100.0, 2), ('Indira Gandhi Road', 62.857142857142854, 3), ('Indira Gandhi Marg', 51.42857142857142, 4)]

Top matches for: Indira Gandhi Road
[('Indira Gandhi Road', 100.0, 3), ('Indira Gandhi Marg', 83.33333333333334, 4), ('Mahatma Gandhi Rd', 62.857142857142854, 2)]

Top matches for: Indira Gandhi Marg
[('Indira Gandhi Marg', 100.0, 4), ('Indira Gandhi Road', 83.33333333333334, 3), ('Mahatma Gandhi Rd', 51.42857142857142, 2)]

Top matches for: Green Road
[('Green Road', 100.0, 5), ('MG Road', 70.58823529411764, 0), ('Green Park', 70.0, 6)]

Top matches for: Green Park
[('Green Park', 100.0, 6), ('Green Road', 70.0, 5), ('Indira Gandhi Marg', 35.71428571428571, 4)]

Top matches for: Neh

In [81]:
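def fuzzy_group(names, threshold=85):
    """Greedily group names whose token-sort similarity is >= threshold.

    Groups can overlap: a name already placed in an earlier group may still
    appear in the group of a later, unvisited seed.
    """
    clusters = []
    visited = set()
    for name in names:
        if name in visited:
            continue
        # limit=None scores every candidate (the default limit is 5).
        matches = process.extract(name, names, scorer=fuzz.token_sort_ratio, limit=None)
        group = [m[0] for m in matches if m[1] >= threshold]
        clusters.append(group)
        visited.update(group)
    return clusters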

clusters = fuzzy_group(list_streets, threshold=60)

print("Fuzzy Clusters (pre-GenAI):")
for i, group in enumerate(clusters):
    print(f"Cluster {i}: {group}")

Fuzzy Clusters (pre-GenAI):
Cluster 0: ['MG Road', 'M.G. Rd', 'Green Road']
Cluster 1: ['Mahatma Gandhi Rd', 'Indira Gandhi Road']
Cluster 2: ['Indira Gandhi Marg', 'Indira Gandhi Road']
Cluster 3: ['Green Park', 'Green Road']
Cluster 4: ['Nehru Marg', 'J L Nehru Marg']
Cluster 5: ['JLN Marg', 'J L Nehru Marg']
Cluster 6: ['Ring Rd', 'Outer Rng Rd']
Cluster 7: ['Outer Ring Road', 'Outer Rng Rd']
Cluster 8: ['Nagar Rd', 'Ahmednagar Road']


**Exercise:** Change `threshold` and observe how the clusters change.
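
To make the exercise concrete, the sketch below sweeps a few thresholds and counts the resulting groups. It deliberately swaps in a different grouping technique from `fuzzy_group`: a union-find (connected components) pass, so every name lands in exactly one group. `similarity` and `connected_groups` are hypothetical helper names, names are assumed to be unique, and the `difflib` fallback is only a rough stand-in that produces different scores from `rapidfuzz`.

```python
from itertools import combinations

try:
    from rapidfuzz import fuzz

    def similarity(a: str, b: str) -> float:
        return fuzz.token_sort_ratio(a, b)
except ImportError:
    # Assumption: difflib is a rough substitute; its scores differ from rapidfuzz.
    from difflib import SequenceMatcher

    def similarity(a: str, b: str) -> float:
        return 100.0 * SequenceMatcher(None, a, b).ratio()


def connected_groups(names, threshold):
    """Group names so that any pair scoring >= threshold ends up together."""
    parent = {n: n for n in names}  # assumes names are unique

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())


streets = [
    "MG Road", "M.G. Rd", "Mahatma Gandhi Rd",
    "Indira Gandhi Road", "Indira Gandhi Marg",
    "Green Road", "Green Park",
    "Nehru Marg", "J L Nehru Marg", "JLN Marg",
    "Ring Rd", "Outer Ring Road", "Outer Rng Rd",
    "Nagar Rd", "Ahmednagar Road",
]

for t in (50, 60, 70, 85, 100):
    groups = connected_groups(streets, t)
    print(f"threshold={t}: {len(groups)} groups")
```

Because merging is transitive here, low thresholds tend to chain unrelated roads into one large group, while high thresholds leave most names as singletons. Neither extreme is a usable deduplication.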

**Discussion:** When to trust LLM normalization vs rules; human-in-the-loop QA for map data.

## Where Fuzzy Matching Fails, GenAI Helps

Fuzzy matching is fast and cheap, but:

❌ Groups different roads (Green Park vs Green Road)

❌ Misses true matches below threshold (JL Nehru Marg vs Jawaharlal Nehru Marg)

GenAI:

✅ Canonicalizes names → “Mahatma Gandhi Road” ≠ “Indira Gandhi Road”

✅ Recognizes context → “Green Park” ≠ “Green Road”

✅ Fills gaps → expands abbreviations correctly

Visual:

Left: “Green Park ↔ Green Road (❌ Fuzzy Match)”

Right: “Green Park ≠ Green Road (✅ GenAI Canonicalization)”

## 2.2) LLM Normalizer (DSPy)
We'll create a simple DSPy `Signature` and a `dspy.Predict` module that map a raw street name to its canonical, normalized form.

In [57]:

normalizer_spec = """
Given an Indian street name variant, return a clean, canonical, expanded form:
- Expand common abbreviations (e.g., 'Rd' → 'Road', 'St' → 'Saint' when it's a person's name; else 'Street' if context suggests)
- Remove unnecessary punctuation
- Use Title Case
- Prefer full names (e.g., 'MG' → 'Mahatma Gandhi' when unambiguous)
Return only the normalized name, no extra text.
"""

try:
    import dspy

    class NormalizeStreet(dspy.Signature):
        raw_name = dspy.InputField()
        normalized = dspy.OutputField(desc="normalized, canonical street name")

    normalize = dspy.Predict(NormalizeStreet)

    def llm_normalize(name: str) -> str:
        r = normalize(raw_name=f"{name}, Guidelines:{normalizer_spec}")
        return r.normalized.strip()

except Exception as e:
    print("DSPy not available; falling back to a rule-based normalizer:", e)
    import re

    # Keys are matched after lowercasing and stripping punctuation,
    # so "M.G. Rd." has already become "mg rd" by the time they apply.
    ABBR = {
        r"\brd\b": "Road",
        r"\bst\b": "Street",
        r"\bmg\b": "Mahatma Gandhi",
        r"\bjl\b": "Jawaharlal",
        r"\bmarg\b": "Marg",
        r"\brng\b": "Ring",
    }
    def rule_normalize(text: str) -> str:
        t = text.lower()
        # Strip punctuation first; otherwise "m.g." never matches r"\bmg\b".
        t = re.sub(r"[.’']", "", t)
        for pat, rep in ABBR.items():
            t = re.sub(pat, rep.lower(), t)
        t = re.sub(r"\s+", " ", t).strip()
        return t.title()

    def llm_normalize(name: str) -> str:
        return rule_normalize(name)



In [72]:
df = pd.read_csv("data/streets_raw.csv")
df["normalized"] = df["raw_street"].apply(llm_normalize)
df

|   | raw_street         | normalized            |
| - | ------------------ | --------------------- |
| 0 | MG Road            | Mahatma Gandhi Road   |
| 1 | M.G. Rd            | Mahatma Gandhi Road   |
| 2 | Mahatma Gandhi Rd  | Mahatma Gandhi Road   |
| 3 | Indira Gandhi Road | Indira Gandhi Road    |
| 4 | Indira Gandhi Marg | Indira Gandhi Road    |
| 5 | Green Road         | Green Road            |
| 6 | Green Park         | Green Park            |
| 7 | Nehru Marg         | Nehru Marg            |
| 8 | J L Nehru Marg     | Jawaharlal Nehru Road |
| 9 | JLN Marg           | Jawaharlal Nehru Marg |


In [73]:
df_canonical = df.copy()
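
LLM output still needs a check before it is treated as canonical. In the output above, two rows come back with `Marg` rewritten as `Road`, which may or may not be acceptable for a map dataset. The sketch below flags such rows for human review. It is only an illustration: `ROAD_TYPES`, `road_type` and `needs_review` are hypothetical names, the type map is hand-written, and the rows are copied from the ten rows displayed above.

```python
import re

# (raw, normalized) pairs copied from the ten rows shown in the output above.
rows = [
    ("MG Road", "Mahatma Gandhi Road"),
    ("M.G. Rd", "Mahatma Gandhi Road"),
    ("Mahatma Gandhi Rd", "Mahatma Gandhi Road"),
    ("Indira Gandhi Road", "Indira Gandhi Road"),
    ("Indira Gandhi Marg", "Indira Gandhi Road"),
    ("Green Road", "Green Road"),
    ("Green Park", "Green Park"),
    ("Nehru Marg", "Nehru Marg"),
    ("J L Nehru Marg", "Jawaharlal Nehru Road"),
    ("JLN Marg", "Jawaharlal Nehru Marg"),
]

# Assumption: a small, hand-written map from the last token to a road type.
ROAD_TYPES = {
    "road": "road", "rd": "road",
    "street": "street", "st": "street",
    "marg": "marg", "park": "park",
}


def road_type(name: str):
    """Return the road type implied by the last alphabetic token, if known."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return ROAD_TYPES.get(tokens[-1]) if tokens else None


def needs_review(raw: str, normalized: str) -> bool:
    """Flag rows where normalization changed the road type (e.g. Marg -> Road)."""
    return road_type(raw) != road_type(normalized)


flagged = [(raw, norm) for raw, norm in rows if needs_review(raw, norm)]
for raw, norm in flagged:
    print(f"REVIEW: {raw!r} -> {norm!r}")
```

On these rows the check flags `Indira Gandhi Marg` and `J L Nehru Marg`. Abbreviation expansions such as `Rd` to `Road` pass because both map to the same type.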

## 2.3) Deep Learning Method (optional)

In [60]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import itertools

In [74]:
# Encode with embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(list_streets)


### Pairwise Similarity

In [63]:

# Compute pairwise cosine similarity
sim_matrix = cosine_similarity(embeddings)

# List all pairs, most similar first
pairs = []
for i, j in itertools.combinations(range(len(list_streets)), 2):
    pairs.append((list_streets[i], list_streets[j], sim_matrix[i, j]))

pairs = sorted(pairs, key=lambda x: x[2], reverse=True)  # append [:10] to keep only the top 10
for a, b, score in pairs:
    print(f"{a} ↔ {b} : {score:.2f}")

Nehru Marg ↔ J L Nehru Marg : 0.94
Indira Gandhi Marg ↔ Nehru Marg : 0.76
Indira Gandhi Marg ↔ J L Nehru Marg : 0.75
Indira Gandhi Road ↔ Indira Gandhi Marg : 0.74
Green Road ↔ Green Park : 0.73
Mahatma Gandhi Rd ↔ Indira Gandhi Road : 0.68
Mahatma Gandhi Rd ↔ Indira Gandhi Marg : 0.67
Indira Gandhi Road ↔ Ahmednagar Road : 0.65
Mahatma Gandhi Rd ↔ J L Nehru Marg : 0.63
Green Road ↔ Outer Ring Road : 0.59
Mahatma Gandhi Rd ↔ Nehru Marg : 0.59
MG Road ↔ Green Road : 0.58
Indira Gandhi Road ↔ Nehru Marg : 0.57
Indira Gandhi Road ↔ J L Nehru Marg : 0.54
Nagar Rd ↔ Ahmednagar Road : 0.52
MG Road ↔ Outer Ring Road : 0.52
Ring Rd ↔ Nagar Rd : 0.52
Ring Rd ↔ Outer Ring Road : 0.51
M.G. Rd ↔ Outer Rng Rd : 0.50
Indira Gandhi Road ↔ Green Road : 0.50
Indira Gandhi Road ↔ Outer Ring Road : 0.48
Outer Ring Road ↔ Ahmednagar Road : 0.47
M.G. Rd ↔ Ring Rd : 0.47
Ring Rd ↔ Outer Rng Rd : 0.46
Green Road ↔ Ahmednagar Road : 0.46
Indira Gandhi Road ↔ Nagar Rd : 0.46
Mahatma Gandhi Rd ↔ Ahmednagar Road

In [71]:
from sklearn.cluster import AgglomerativeClustering

# Stop merging once the average cosine distance between clusters reaches 0.5.
clustering = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=.5,
    linkage="average",
    metric="cosine"
)
labels = clustering.fit_predict(embeddings)

clusters = {}
for street, label in zip(list_streets, labels):
    clusters.setdefault(label, []).append(street)

print("Agglomerative Clusters:")
for cid, group in clusters.items():
    print(f"Cluster {cid}: {group}")


Agglomerative Clusters:
Cluster 1: ['MG Road', 'Outer Ring Road']
Cluster 0: ['M.G. Rd', 'Outer Rng Rd']
Cluster 4: ['Mahatma Gandhi Rd', 'Indira Gandhi Road', 'Indira Gandhi Marg', 'Nehru Marg', 'J L Nehru Marg']
Cluster 2: ['Green Road', 'Green Park']
Cluster 5: ['JLN Marg']
Cluster 6: ['Ring Rd']
Cluster 3: ['Nagar Rd', 'Ahmednagar Road']
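
The `distance_threshold=.5` setting is expressed in cosine distance, which is 1 minus cosine similarity. With average linkage, two clusters are merged only while their average pairwise distance stays below that threshold. The toy sketch below illustrates the conversion on 2-D vectors; `cosine_sim` and `cosine_distance` are hypothetical helper names and the vectors are made up, not real sentence embeddings.

```python
import math


def cosine_sim(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Cosine distance as used by a 'cosine' metric: 1 - similarity."""
    return 1.0 - cosine_sim(u, v)


THRESHOLD = 0.5  # same value as distance_threshold in the cell above

a = [1.0, 0.0]
b = [1.0, 1.0]  # 45 degrees away from a
c = [0.0, 1.0]  # orthogonal to a

for label, vec in (("b", b), ("c", c)):
    d = cosine_distance(a, vec)
    verdict = "merge candidate" if d < THRESHOLD else "kept apart"
    print(f"a vs {label}: distance={d:.2f} -> {verdict}")
```

Read this way, a threshold of 0.5 means that pairs with a similarity above roughly 0.5 in the pairwise listing are candidates for ending up in the same cluster.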


Learning point:

* Clustering returns groups directly instead of a long list of scored pairs, which is easier to review on larger datasets.

* It still makes semantic mistakes (Mahatma Gandhi vs Indira Gandhi, Green Park vs Green Road).

| Approach                   | Pros                                         | Cons                                                                        |
| -------------------------- | -------------------------------------------- | --------------------------------------------------------------------------- |
| Pairwise Cosine Similarity | Simple, intuitive                            | O(n²) comparisons → slow on large lists, threshold sensitive                |
| Agglomerative Clustering   | Returns groups directly, one threshold       | Also needs all pairwise distances; still mixes semantically different roads |
| Generative AI (DSPy)       | Canonical full forms, context-aware          | Can hallucinate, requires validation                                        |



---
## 3) FMCG Mini-Project: Reviews → Insights → Actions

**Goal:** Generate synthetic reviews for a new product, summarize themes, extract insights, and recommend actions.


In [82]:

product = "SunBurst Orange Juice"
aspects = ["taste", "price", "packaging", "availability", "healthiness"]

try:
    import dspy

    class ReviewSynth(dspy.Signature):
        product = dspy.InputField()
        aspects = dspy.InputField()
        reviews = dspy.OutputField(desc="10 diverse, short customer reviews")

    synth = dspy.Predict(ReviewSynth)
    reviews_text = synth(product=product, aspects=aspects).reviews
except Exception:
    # Fallback: sample static reviews
    reviews_text = """
1) Great taste but a bit pricey.
2) Love the no-sugar claim; feels healthy.
3) Packaging leaks if kept sideways.
4) Hard to find at my local store.
5) Kids enjoy it; refreshing and pulpy.
6) Price is okay during discounts.
7) Wish there was a smaller pack size.
8) Tastes natural, not too sweet.
9) Outer packaging is attractive.
10) Delivery took long; store was out of stock.
"""

print(reviews_text)


1. SunBurst Orange Juice has an incredibly fresh and vibrant taste, just like squeezing oranges yourself!
2. For the quality, the price of SunBurst is surprisingly good. It's become my daily morning drink.
3. The new packaging is a bit flimsy; the cap doesn't seal well and I've had a few spills. Disappointing.
4. I love that SunBurst is always available at my local supermarket. Never have trouble finding it.
5. It's great to know this is 100% pure orange juice with no added sugars. A healthy and delicious choice!
6. Honestly, the taste was a bit too sweet and artificial for me. I prefer a more natural, less processed flavor.
7. While the taste is decent, I find SunBurst to be a bit too expensive compared to other brands on the shelf.
8. The carton design is really attractive and easy to store in the fridge. Plus, it pours without dripping!
9. It's so frustrating when SunBurst is out of stock! It happens too often at my usual store.
10. Refreshing and genuinely tastes like real oranges.

### 3.1) Summarize & Extract Insights


In [83]:

try:
    import dspy

    class SummarizeReviews(dspy.Signature):
        reviews = dspy.InputField()
        summary = dspy.OutputField(desc="pros, cons, notable quotes")

    class ExtractInsights(dspy.Signature):
        summary = dspy.InputField()
        insights = dspy.OutputField(desc="3-5 crisp insights with evidence")

    summarize = dspy.ChainOfThought(SummarizeReviews)
    extract = dspy.Predict(ExtractInsights)

    summary = summarize(reviews=reviews_text).summary
    insights = extract(summary=summary).insights

    print("SUMMARY:\n", summary)
    print("\nINSIGHTS:\n", insights)

except Exception:
    print("DSPy not available; here is a template prompt you can run with your LLM:")
    print("""
Summarize the following reviews into pros, cons, and notable quotes. Then provide 3-5 crisp insights:
""")
    print(reviews_text)


SUMMARY:
 **Pros:**
*   **Taste & Health:** Many reviewers rave about the "incredibly fresh and vibrant taste, just like squeezing oranges yourself!" and describe it as refreshing and genuinely tasting like real oranges. It's highly valued for being "100% pure orange juice with no added sugars," making it a healthy and delicious choice.
*   **Packaging Design:** The carton design is praised for being "really attractive and easy to store in the fridge," with the added benefit that "it pours without dripping!"
*   **Value & Availability (Mixed):** Some find the price "surprisingly good" for the quality and appreciate its consistent availability at their local supermarket.

**Cons:**
*   **Packaging Functionality:** A significant concern is the "new packaging [being] a bit flimsy; the cap doesn't seal well and I've had a few spills."
*   **Taste (Subjective):** While most love the taste, a few found it "a bit too sweet and artificial," preferring a more natural, less processed flavor.
*  


---
## 4) Agentic AI with DSPy: Compose a Pipeline

We'll build a 3-stage pipeline:
1. **Summarizer** – condense reviews/sales text
2. **Insight Generator** – extract trends/causes
3. **Recommender** – propose next actions (pricing, packaging, distribution, marketing)

You'll see: how **modules** wrap LLM calls, how to **swap models**, and how to **optimize prompts**.


In [85]:

try:
    import dspy

    class Summarizer(dspy.Module):
        def __init__(self):
            super().__init__()
            class Sig(dspy.Signature):
                text = dspy.InputField()
                summary = dspy.OutputField()
            self.step = dspy.ChainOfThought(Sig)
        def forward(self, text):
            return self.step(text=text).summary

    class InsightGen(dspy.Module):
        def __init__(self):
            super().__init__()
            class Sig(dspy.Signature):
                summary = dspy.InputField()
                insights = dspy.OutputField()
            self.step = dspy.Predict(Sig)
        def forward(self, summary):
            return self.step(summary=summary).insights

    class Recommender(dspy.Module):
        def __init__(self):
            super().__init__()
            class Sig(dspy.Signature):
                insights = dspy.InputField()
                actions = dspy.OutputField()
            self.step = dspy.Predict(Sig)
        def forward(self, insights):
            return self.step(insights=insights).actions

    class FMCGPipeline(dspy.Module):
        def __init__(self):
            super().__init__()
            self.summarizer = Summarizer()
            self.insightgen = InsightGen()
            self.recommender = Recommender()

        def forward(self, text):
            summary = self.summarizer(text=text)
            insights = self.insightgen(summary=summary)
            actions = self.recommender(insights=insights)
            return dict(summary=summary, insights=insights, actions=actions)

    pipeline = FMCGPipeline()

    sample_text = reviews_text
    result = pipeline(text=sample_text)
    print("SUMMARY:\n", result["summary"])
    print("\nINSIGHTS:\n", result["insights"])
    print("\nACTIONS:\n", result["actions"])

except Exception as e:
    print("DSPy not available or the pipeline failed:", e)
    print("Logical flow you can implement with any LLM:")
    print("1) Summarize -> 2) Extract Insights -> 3) Recommend Actions")


SUMMARY:
 SunBurst Orange Juice receives mixed reviews, with many praising its fresh, vibrant, and refreshing taste, noting it's like real oranges, 100% pure, and has no added sugars, making it a healthy and delicious choice. However, some find the taste too sweet and artificial. Opinions on price are divided, with some finding it surprisingly good for the quality, while others consider it too expensive. Availability is also a point of contention; some find it consistently available, while others frequently encounter out-of-stock issues. The packaging draws both positive comments for its attractive, easy-to-store design and drip-free pouring, but also negative feedback regarding flimsy caps and spills.

INSIGHTS:
 SunBurst Orange Juice exhibits a strong core appeal based on its fresh, pure, and healthy taste, which resonates with a significant portion of consumers. However, a notable segment perceives the taste as overly sweet and artificial, suggesting a potential for taste profile re
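
The same three-stage flow can be expressed without DSPy, which makes the "swap models" idea easy to see: the model is just a callable passed in. This is a minimal sketch under stated assumptions. The `LLM` type alias, `run_pipeline`, the prompt wording and the `echo_llm` stand-in are all hypothetical and exist only for a dry run without network access.

```python
from typing import Callable, Dict

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out


def run_pipeline(text: str, llm: LLM) -> Dict[str, str]:
    """Summarize -> extract insights -> recommend actions, using any callable LLM."""
    summary = llm(f"Summarize these customer reviews:\n{text}")
    insights = llm(f"Extract 3-5 insights from this summary:\n{summary}")
    actions = llm(f"Recommend next actions for these insights:\n{insights}")
    return {"summary": summary, "insights": insights, "actions": actions}


def echo_llm(prompt: str) -> str:
    """Stand-in model for dry runs: returns the first line of the prompt."""
    return prompt.splitlines()[0]


result = run_pipeline("Great taste but a bit pricey.", echo_llm)
for stage, output in result.items():
    print(f"{stage}: {output}")
```

Swapping models then means passing a different callable. In the DSPy version, the corresponding change is configuring a different `dspy.LM`.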


### 4.1) (Optional) DSPy Optimization

DSPy supports **teleprompter**-style optimization (called *optimizers* in recent DSPy releases) given labeled examples.  
Below is a minimal sketch; fill `train_data` with (input, target) pairs.


In [86]:

try:
    import dspy

    # Minimal demo dataset (toy). Replace with real (input, target) pairs.
    train_data = [
        dict(text="Pricey but delicious. Hard to find locally.", target_actions="Run local availability campaign; limited-time discount"),
        dict(text="Leaky packaging. Love the no sugar.", target_actions="Improve cap seal; emphasize health benefit in ads"),
    ]

    class ActionsTeacher(dspy.Signature):
        text = dspy.InputField()
        actions = dspy.OutputField()

    # A tiny trainer that pretends "actions" is the supervised target.
    class TinyTrainer(dspy.Module):
        def __init__(self):
            super().__init__()
            self.pipeline = FMCGPipeline()
        def forward(self, text):
            out = self.pipeline(text=text)
            return out["actions"]

    # In real usage, use dspy.teleprompt.BootstrapFewShot or similar.
    # Here we simply run the pipeline on training data as illustration.
    trainer = TinyTrainer()
    for ex in train_data:
        _ = trainer(text=ex["text"])
    print("Optimization sketch complete (replace with DSPy teleprompters in real training).")

except Exception as e:
    print("Skipping optimization sketch due to:", e)


Optimization sketch complete (replace with DSPy teleprompters in real training).
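
For reference, here is a rough sketch of what the `BootstrapFewShot` step mentioned in the code comment might look like. It has not been run against a live model here. `token_overlap`, `actions_metric`, `compile_with_bootstrap` and the 0.3 cut-off are hypothetical choices made for illustration, and it assumes the program returns the actions as a plain string, as `TinyTrainer` does.

```python
def token_overlap(target: str, predicted: str) -> float:
    """Fraction of target tokens that also appear in the prediction (toy metric)."""
    target_tokens = set(target.lower().replace(";", " ").split())
    predicted_tokens = set(predicted.lower().replace(";", " ").split())
    if not target_tokens:
        return 0.0
    return len(target_tokens & predicted_tokens) / len(target_tokens)


def actions_metric(example, prediction, trace=None) -> bool:
    # Assumption: the program returns the actions as a plain string;
    # the 0.3 cut-off is arbitrary and should be tuned on real data.
    return token_overlap(example.target_actions, str(prediction)) >= 0.3


def compile_with_bootstrap(program, train_data):
    """Sketch only: requires dspy to be installed and an LM to be configured."""
    import dspy
    from dspy.teleprompt import BootstrapFewShot

    trainset = [
        dspy.Example(text=ex["text"], target_actions=ex["target_actions"]).with_inputs("text")
        for ex in train_data
    ]
    optimizer = BootstrapFewShot(metric=actions_metric, max_bootstrapped_demos=2)
    return optimizer.compile(program, trainset=trainset)
```

With a configured model, `compile_with_bootstrap(trainer, train_data)` would return an optimized copy of the program. Two toy examples are far too few for a meaningful result, and a word-overlap metric is a weak proxy for action quality.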


In [87]:
train_data

[{'text': 'Pricey but delicious. Hard to find locally.',
  'target_actions': 'Run local availability campaign; limited-time discount'},
 {'text': 'Leaky packaging. Love the no sugar.',
  'target_actions': 'Improve cap seal; emphasize health benefit in ads'}]

In [89]:
trainer

pipeline.summarizer.step.predict = Predict(StringSignature(text -> reasoning, summary
    instructions='Given the fields `text`, produce the fields `summary`.'
    text = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Text:', 'desc': '${text}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    summary = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'output', 'prefix': 'Summary:', 'desc': '${summary}'})
))
pipeline.insightgen.step = Predict(Sig(summary -> insights
    instructions='Given the fields `summary`, produce the fields `insights`.'
    summary = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Summary:', 'desc': '${summary}'})
    insights = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'o

In [90]:
trainer("the juice is not nice")

'1. Conduct a detailed sensory analysis to pinpoint specific undesirable taste notes or qualities (e.g., too sour, bitter, artificial, bland, off-flavor).\n2. Review the quality and sourcing of all ingredients to ensure freshness and adherence to standards.\n3. Examine the production process for any steps that might introduce off-flavors, degrade quality, or affect consistency.\n4. Adjust the product formulation based on sensory analysis and ingredient review to improve taste balance and overall quality.\n5. Implement enhanced quality control checks throughout the production and distribution chain to ensure consistent taste and quality.\n6. Gather more specific consumer feedback through surveys or focus groups to understand the exact nature of the undesirable perception.'