<a href="https://www.kaggle.com/code/owosnow/enhancing-banana-quality-inspection-with-gemini?scriptVersionId=235099777" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Turbo‑charging Banana‑Quality Inspection with Classical ML **and** Google Gemini GenAI

**Goal:**  
Transform a traditional Decision Tree classifier into an end‑to‑end **question‑answering quality assistant** that:
* Predicts *Good* vs *Bad* bananas from physicochemical features,  
* Automatically extracts the tree’s decision rules,  
* Rewrites those rules into plain‑English “facts”,  
* Indexes them in a vector store and  
* Answers natural‑language questions with **retrieval‑augmented generation (RAG)**, returning JSON answers *with citations*.

**Dataset:**  
An open‑source **`banana_quality.csv`** dataset from Kaggle ([link](https://www.kaggle.com/datasets/l3llff/banana)) is used in this project. It contains standardized measurements such as *Size, Weight, Sweetness, Softness, HarvestTime, Ripeness,* and *Acidity*, plus a binary *Quality* label.

**High‑level Flow:**

1. **Train & evaluate** a shallow Decision Tree.  
2. **Export** its rules and let **Gemini 2.0 Flash** rewrite them into QA‑friendly statements.  
3. **Embed** the facts (Gemini *text‑embedding‑004*) and store them in **ChromaDB**.  
4. **Query** questions → embedding → vector search → top‑k facts → Gemini Flash prompt → **JSON answer + source ids**.  
5. **Fallback** if no fact is relevant as Gemini politely returns `"Not covered …"`.

The result is an explainable, self‑contained QA system that ties classical ML to modern Gen‑AI.

*Note: Rerun the notebook from the beginning again if InternalError occurs.*

## Decision Tree Classification Model

In this section, we will be building a binary classification model to predict banana quality (Good vs. Bad) based on physicochemical features. The decision tree algorithm is selected for its:

1. **Interpretability** - Rules can be easily extracted and understood by humans
2. **Feature importance insights** - Reveals which factors most strongly influence quality
3. **Minimal preprocessing requirements** - Works well with our feature set

Then, the model will be trained on banana properties including size, weight, sweetness, acidity, ripeness, softness and harvest time. Ultimately, the decision rules that determine quality classification will be visualized afterwards.

In [1]:
# Import essential libraries
import numpy as np
import pandas as pd
import os

# List available dataset files
print("Available input files:")
for dirname, _, filenames in os.walk("/kaggle/input"):
    for filename in filenames:
        print(f"- {os.path.join(dirname, filename)}")

Available input files:
- /kaggle/input/banana-quality/banana_quality.csv


In [None]:
# Import necessary libraries 
import os 
import pandas as pd 
import numpy as np 
import tensorflow as tf 
from sklearn.model_selection import train_test_split 
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [3]:
class_df = pd.read_csv("/kaggle/input/banana-quality/banana_quality.csv")
print(class_df.head())

       Size    Weight  Sweetness  Softness  HarvestTime  Ripeness   Acidity  \
0 -1.924968  0.468078   3.077832 -1.472177     0.294799  2.435570  0.271290   
1 -2.409751  0.486870   0.346921 -2.495099    -0.892213  2.067549  0.307325   
2 -0.357607  1.483176   1.568452 -2.645145    -0.647267  3.090643  1.427322   
3 -0.868524  1.566201   1.889605 -1.273761    -1.006278  1.873001  0.477862   
4  0.651825  1.319199  -0.022459 -1.209709    -1.430692  1.078345  2.812442   

  Quality  
0    Good  
1    Good  
2    Good  
3    Good  
4    Good  


In [4]:
# Separate features and target variables
class_X = class_df.drop(columns='Quality')
class_y = class_df['Quality']
class_num_cols = list(class_X.select_dtypes(include=[np.number]).columns.values)
class_cat_cols = list(class_X.select_dtypes(exclude=[np.number]).columns.values)

# Create train and test Data
class_X_train, class_X_test, class_y_train, class_y_test = train_test_split(
    class_X,class_y,test_size=0.3
)

In [5]:
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report

# Define the decision tree format
class_dt = DecisionTreeClassifier(max_depth = 3, min_samples_leaf = 2)
class_dt.fit(class_X_train, class_y_train)

# Predict on the test data and evaluate the model
class_y_pred = class_dt.predict(class_X_test)

# Print the classification results summary
print(classification_report(class_y_pred, class_y_test))

              precision    recall  f1-score   support

         Bad       0.87      0.78      0.82      1327
        Good       0.76      0.85      0.80      1073

    accuracy                           0.81      2400
   macro avg       0.81      0.81      0.81      2400
weighted avg       0.82      0.81      0.81      2400



In [None]:
# Plot the decision tree based on the results summary
fig = plt.figure(figsize=(25,20))
_ = plot_tree(
    class_dt,
    feature_names = list(class_X_train.columns),
    class_names = ['Bad', 'Good'],
    filled=True,
    proportion = True
)

In [None]:
# Export decision tree results into rules
from sklearn.tree import export_text

# Convert the fitted tree into human‑readable rules
tree_rules = export_text(
    class_dt, # This is the trained tree
    feature_names = list(class_X.columns),
    spacing = 2,
)

# Show the first ~40 lines to evaluate the pattern
print(tree_rules.splitlines()[:40])

## Generative‑AI Augmentation

The classic Decision‑Tree above predicts **Good** vs **Bad** bananas from physicochemical features. In the upcoming cells, we'll enhance this model with Google Gemini‐powered GenAI to create an intelligent quality assistant that can answer natural language questions.

This section demonstrates these key generative AI capabilities:
* **Document understanding** – Process and interpret our decision‑tree rules  
* **Few‑shot prompting** – Guide the model with minimal examples for accurate outputs  
* **Embeddings** – Convert expert banana‑quality facts into high-dimensional vectors  
* **Vector store / search** – Store and query those vectors efficiently  
* **Retrieval‑Augmented Generation (RAG)** – Answer questions by combining retrieved facts with Gemini generation  
* **Structured JSON output** – Force the model to reply in a machine‑readable format

In [None]:
# Install Gen AI SDK
!pip -q uninstall -y jupyterlab # Remove unused packages from Kaggle's base image that conflict
!pip -q install "google-genai==1.7.0" chromadb sentence-transformers

In [9]:
# Import Gen AI and vector‑db toolchain
from google import genai
from google.genai import types
from google.api_core import retry

# Retry Gemini calls that get 429 / 503
is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})
genai.models.Models.generate_content = retry.Retry(predicate=is_retriable)(
    genai.models.Models.generate_content)

In [10]:
# Setup API key
from kaggle_secrets import UserSecretsClient

GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")

In [11]:
# Use Gemini to rewrite rules into plain facts
from textwrap import dedent
import json

# Create a few‑shot instruction block with two examples
few_shot = dedent("""
You are given decision‑tree rules in the format:
RULE: <conditions> => class: <Good|Bad>

Rewrite EACH rule into ONE short fact for a QA officer. Return a JSON array of strings only with no other words.

EXAMPLE
RULE: Sweetness <= 0.05 AND Ripeness <= 0.71 => class: Bad
FACT: "Very low sweetness and unripe bananas are often BAD quality."

JSON OUTPUT:
["Very low sweetness and unripe bananas are often BAD quality."]
""")

# Append the raw tree text
prompt = few_shot + "\n\nRULES:\n" + tree_rules

# Call Gemini Flash
client = genai.Client(api_key=GOOGLE_API_KEY)

facts_resp = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=prompt,
    config=types.GenerateContentConfig(
        response_mime_type = "application/json",
        temperature = 0.2, # Set temperature low for consistency
        max_output_tokens = 512
    )
)

generated_facts_json = facts_resp.text
print("Raw JSON string (truncated):", generated_facts_json[:200], "...")

Raw JSON string (truncated): [
"Very low sweetness, early harvest time, and unripe bananas are often BAD quality.",
"Very low sweetness, early harvest time, and ripe bananas are often BAD quality.",
"Very low sweetness, late harv ...


In [12]:
# Load the JSON into a list
try:
    banana_kb_auto = json.loads(generated_facts_json)
except json.JSONDecodeError as e:
    print("Gemini returned invalid JSON:", e)
    banana_kb_auto = []

print("Auto‑generated facts:", len(banana_kb_auto))
print("\nSample facts:\n", "\n".join(banana_kb_auto[:100]))

Auto‑generated facts: 8

Sample facts:
 Very low sweetness, early harvest time, and unripe bananas are often BAD quality.
Very low sweetness, early harvest time, and ripe bananas are often BAD quality.
Very low sweetness, late harvest time, and very unripe bananas are often BAD quality.
Very low sweetness, late harvest time, and ripe bananas are often GOOD quality.
High sweetness and low softness with light weight bananas are often GOOD quality.
High sweetness and low softness with heavy weight bananas are often GOOD quality.
High sweetness, high softness, and small bananas are often BAD quality.
High sweetness, high softness, and large bananas are often GOOD quality.


In [13]:
# Create a Knowledge Base (KB) from the decision tree rules.
banana_kb = banana_kb_auto
print("banana_kb set with", len(banana_kb), "facts.")

banana_kb set with 8 facts.


In [14]:
# Import vector-db toolchain
from sentence_transformers import SentenceTransformer
import chromadb, json, os

In [15]:
# Embeddings capability
resp = client.models.embed_content(
    model = "models/text-embedding-004",
    contents = banana_kb, # List of strings
    config = types.EmbedContentConfig(task_type="retrieval_document")
)
kb_emb = [e.values for e in resp.embeddings] # List of float vectors

# Vector DB capability
client_vdb = chromadb.Client()
col = client_vdb.create_collection("banana_kb")

col.add(
    documents = banana_kb,
    embeddings = kb_emb,
    ids = [f"kb{i}" for i in range(len(banana_kb))]
)

print("Vector dim:", len(kb_emb[0]), "| Collection size:", col.count())

Vector dim: 768 | Collection size: 8


In [16]:
import json, textwrap

def rag_answer(question: str, k: int = 3) -> dict:
    """
    Define a RAG pipeline that returns JSON with explicit source ids as defined in the banana_kb_auto list.
    """
    # Embed the question
    q_vec = client.models.embed_content(
        model = "models/text-embedding-004",
        contents = question,
        config = types.EmbedContentConfig(task_type="retrieval_query")
    ).embeddings[0].values

    # Similarity search
    hits = col.query(query_embeddings=[q_vec], n_results=k)
    docs = hits["documents"][0] # list[str]
    ids = hits["ids"][0] # list[str] like ["kb0", "kb1", ...]

    # Craft prompt with explicit ids
    context_lines = "\n".join(f"[{_id}] {txt}" for _id, txt in zip(ids, docs))
    prompt = textwrap.dedent(f"""
        You are an expert food‑quality assistant.
        Context lines (each prefixed with its id):
        {context_lines}

        Question: {question}

        Respond in valid JSON only with keys accordingly:
        "answer": <string>,
        "sources": <list of ids you actually used, e.g. ["kb0","kb1", ...]>

        If the context DOES NOT answer the question, respond in valid JSON like this:
        "answer":"Not covered by current rules. QA officer should do further testing."
        "sources": []
    """)

    # Call Gemini Flash in JSON mode
    resp = client.models.generate_content(
        model = "gemini-2.0-flash",
        contents = [prompt],
        config = types.GenerateContentConfig(response_mime_type="application/json")
    )

    return json.loads(resp.text)

## Gen AI Demo Queries & Results

This section demonstrates the banana quality RAG system in action. Multiple questions that quality inspection personnel might ask and analyze the system's responses will be run to test the output.

Each query produces:
- A natural language **answer** synthesized from the relevant knowledge base facts
- **Source citations** showing which specific facts were used to generate the answer
- All output in structured **JSON format** for potential integration with other systems

Therefore, the system can handle questions about feature relationships, quality prediction factors, ripeness indicators and provide recommendations for quality improvement.

In [17]:
sample_qs = [
    "Why might a banana that is very soft yet still acidic be considered bad?",
    "Which feature combination most strongly predicts good quality bananas?",
    "Give a one‑sentence tip for farmers to improve ripeness index.",
    "How does the weight-to-size ratio affect banana quality assessment?",
    "What ripeness level is considered optimal for consumer acceptance?",
    "Can you explain the relationship between sweetness and banana quality?",
    "What are the warning signs of dehydration in bananas?",
    "How can farmers prevent excessive softness in bananas?",
    "What makes a banana more likely to be classified as 'Good' in quality inspection?",
    "Is there a correlation between harvest time and acidity levels in bananas?"
]

for q in sample_qs:
    print(f"\nQUESTION: {q}")
    print(json.dumps(rag_answer(q), indent=2))


QUESTION: Why might a banana that is very soft yet still acidic be considered bad?
{
  "answer": "High sweetness, high softness, and small bananas are often BAD quality.",
  "sources": [
    "kb6"
  ]
}

QUESTION: Which feature combination most strongly predicts good quality bananas?
{
  "answer": "High sweetness and low softness, combined with either light or heavy weight, are strong indicators of good quality bananas.",
  "sources": [
    "kb4",
    "kb5"
  ]
}

QUESTION: Give a one‑sentence tip for farmers to improve ripeness index.
{
  "answer": "Not covered by current rules. QA officer should do further testing.",
  "sources": []
}

QUESTION: How does the weight-to-size ratio affect banana quality assessment?
{
  "answer": "High sweetness and low softness in bananas are considered good quality, regardless of whether they are light or heavy. Both light and heavy bananas with these characteristics are often of good quality.",
  "sources": [
    "kb4",
    "kb5"
  ]
}

QUESTION: Wha

## Conclusion & Future Directions

**Accomplishments**
* Built a **fully interpretable Decision‑Tree** that classifies banana quality with ≈ 81 % test accuracy.  
* **Auto‑mined multiple decision rules** and transformed them using *document understanding + few‑shot prompting* into plain‑language quality “facts".  
* Indexed those facts in a **vector database** and wired up a **Gemini RAG pipeline** that answers inspection questions in **structured JSON**, citing the exact rules (`kb0, kb1, kb2, kb3, etc.`) used.  
* Demonstrated 6 distinct Gen‑AI capabilities.

**Strengths**
* **Grounded answers** – model refuses to hallucinate; if a topic isn’t in the rules it returns “Not covered…”.  
* **Explainability** – every answer links back to concrete rule ids.  
* **Machine‑readable output** – easy to plug into dashboards or downstream quality controls.

**Limitations**
* **Rule coverage** – captures only top‑level paths of a depth‑3 tree; deeper trees or multiple models could widen coverage.  
* **Domain breadth** – facts limited to numeric features; future work could add image cues (e.g., peel colour) or weight‑to‑size ratio sensor data.  
* **Evaluation** – answer quality is eyeballed; an LLM‑based rubric scorer would automate QA.  
* **Cold‑start questions** – if no rule matches, the system responds “Not covered”; a fallback to Google Search grounding or a live model function call would help.

**Future Enhancements**
1. **Function‑calling wrapper** – expose `predict_banana()` so Gemini can run live predictions when retrieval is weak.  
2. **Light‑weight agent** – loop: classify → retrieve rule(s) → generate explanation → auto‑grade; deploy as a Streamlit or Gradio app.  
3. **Continuous evaluation** – schedule a timely Gemini‑based test set run to detect drift.

Thank you for your time.