In [16]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session


# 🧠 LogLens: A GenAI-Powered Root Cause Analysis Assistant

**Use Case:** Modern distributed systems like Apache Spark, Kafka, and Airflow often produce cryptic, hard-to-debug errors. Engineers spend hours searching logs, reading docs, or browsing forums to find root causes.
LogLens simplifies this process by acting as a GenAI-powered Root Cause Analysis (RCA) assistant: it accepts raw error logs and returns short, plain-language explanations and actionable fixes within seconds, cross-checked against web results.

**LogLens** is an AI chatbot that helps engineers debug system errors from Spark, Kafka, and Airflow. It uses Google's Gemini LLM with ChromaDB vector search and web search via SerpAPI to offer short, useful explanations and suggested fixes.

---
## 🛠️ GenAI Capabilities Used
LogLens demonstrates 4 core GenAI capabilities:

| Capability                        | Description                                                        |
|-----------------------------------|--------------------------------------------------------------------|
| ✅ Retrieval-Augmented Generation | Uses ChromaDB to fetch similar past logs                           |
| ✅ LLM Reasoning                  | Gemini 2.0 Flash summarizes root causes in plain English           |
| ✅ Web Search + Summarization     | Fetches top community answers via SerpAPI, summarizes with Gemini  |
| ✅ Confidence + Audit Agent       | Compares LLM vs Web fix, gives confidence score and feedback       |

## 🚀 Objective

Build an intelligent assistant that:
- Accepts natural language error inputs
- Retrieves relevant logs using ChromaDB
- Uses Gemini LLM to generate RCA (Root Cause Analysis)
- Audits the LLM output using live web fixes
- Returns the most confident, simplified fix to the user

---

## 🛠 Tech Stack

| Tool             | Purpose                                      |
|------------------|----------------------------------------------|
| Python           | Core programming language                    |
| Gemini 2.0 Flash | LLM for RCA and fix generation               |
| ChromaDB         | Vector store for semantic log similarity     |
| SerpAPI          | Web search (Stack Overflow, GitHub results)  |
| Kaggle Notebook  | Runtime and development platform             |

---

## 📦 How It Works

```mermaid
flowchart TD
    A[User Inputs Error Log] --> B[Retrieve Similar Logs from ChromaDB]
    B --> C[Gemini Agent: Generate RCA]
    C --> D[Web Agent: Search StackOverflow/GitHub]
    D --> E[Gemini: Summarize External Solutions]
    C --> F[Audit Agent: Compare Gemini vs Web Fix]
    E --> F
    F --> G[Return Final Fix with Confidence Score]
    G --> H[Suggest Follow-Up Questions & Learning Resources]
```

---

## 🧱 Steps in the Code

1. **Log Simulation & ChromaDB**: Generates fake Spark, Airflow, and Kafka logs and stores embeddings in ChromaDB.
2. **Gemini RCA Agent**: Takes the user error and retrieves similar logs. Gemini then explains the issue and gives 2–3 fixes.
3. **Web Search Agent**: SerpAPI fetches relevant snippets. Gemini summarizes them.
4. **Audit Agent**: Compares Gemini’s RCA vs the Web fix and gives a verdict + confidence score.
5. **Chatbot Loop**: Continues accepting user errors like a support assistant.
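
The steps above can be read as one pipeline. The following is a minimal sketch of that flow, assuming each agent is passed in as a plain callable. `run_pipeline` and the stub lambdas are hypothetical names used only for illustration; they are not functions defined in this notebook.

```python
from typing import Callable, Dict, List


def run_pipeline(
    error_text: str,
    retrieve: Callable[[str], List[str]],
    rca: Callable[[str, List[str]], str],
    web: Callable[[str], str],
    audit: Callable[[str, str, str], str],
) -> Dict[str, str]:
    """Chain the stages: retrieval -> RCA -> web summary -> audit."""
    similar_logs = retrieve(error_text)               # vector lookup
    rca_text = rca(error_text, similar_logs)          # LLM explanation
    web_text = web(error_text)                        # summarized web results
    verdict = audit(error_text, rca_text, web_text)   # comparison of both
    return {"rca": rca_text, "web": web_text, "verdict": verdict}


# Stub agents stand in for ChromaDB, Gemini, and SerpAPI so the flow runs offline.
result = run_pipeline(
    "java.lang.OutOfMemoryError: Java heap space",
    retrieve=lambda q: ["Spark | OutOfMemoryError: Java heap space"],
    rca=lambda q, logs: f"RCA based on {len(logs)} similar log(s)",
    web=lambda q: "Web summary",
    audit=lambda q, a, b: f"Compared '{a}' with '{b}'",
)
print(result["verdict"])
```

Passing the agents in as arguments keeps the control flow testable without API keys; the real notebook calls its agent functions directly.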

---

## 🧪 Running the Project

1. Upload this notebook to Kaggle or run locally.
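2. Add your API keys (Google + SerpAPI). The code reads them from Kaggle Secrets named `google_api_key` and `serpapi_key`; for a local run, replace the `UserSecretsClient` calls with environment-variable lookups.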
3. Run all cells.
4. Enter your error and interact with the assistant.
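
For a local run, the keys from step 2 could be read from environment variables. This is a minimal sketch under stated assumptions: the variable names `GOOGLE_API_KEY` and `SERPAPI_KEY` and the helper `load_api_key` are hypothetical and do not appear in the notebook code.

```python
import os
from typing import Mapping


def load_api_key(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the named key from the environment, or fail with a clear message."""
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing API key: set the {name} environment variable")
    return value


# Demonstration with a fake environment; real keys would come from os.environ.
fake_env = {"GOOGLE_API_KEY": "demo-google-key", "SERPAPI_KEY": "demo-serpapi-key"}
google_key = load_api_key("GOOGLE_API_KEY", fake_env)
serp_key = load_api_key("SERPAPI_KEY", fake_env)
print(google_key, serp_key)
```

Failing early with a named variable makes a missing key easier to spot than an authentication error from the API later on.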

---

## 📚 Example Query

**Input:**
```
org.apache.spark.shuffle.FetchFailedException: Failed to connect to host
```

**Output:**
```
Spark couldn't fetch data between workers. It might be due to memory issues or removed executors. Try increasing memory, tuning dynamic allocation, or reviewing logs.
Confidence: 0.85
```
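
The assistant returns the confidence score as part of free text, as in the output above. If the score is needed as a number, one option is to parse it out. The sketch below assumes the reply contains the word "Confidence" followed shortly by a value between 0 and 1; `extract_confidence` is a hypothetical helper, not part of the notebook.

```python
import re
from typing import Optional

# "confidence", up to 20 non-digit characters (colon, asterisks, spaces), then a number
_CONFIDENCE = re.compile(r"confidence[^0-9]{0,20}([01](?:\.\d+)?)", re.IGNORECASE)


def extract_confidence(text: str) -> Optional[float]:
    """Return the first confidence value in [0, 1] found in text, else None."""
    match = _CONFIDENCE.search(text)
    if match is None:
        return None
    score = float(match.group(1))
    return score if 0.0 <= score <= 1.0 else None


plain = extract_confidence("Try increasing memory.\nConfidence: 0.85")
bold = extract_confidence("**Confidence:** 0.8")
missing = extract_confidence("No score was given.")
print(plain, bold, missing)
```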

---

## ✅ Future Additions

- Streamlit UI
- Live log ingestion via REST API
- Chat history with memory


In [17]:
# Install necessary packages
!pip uninstall -y google google-cloud-aiplatform google-genai -q
!pip install -q google-generativeai chromadb serpapi

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [18]:
import os
import json
import uuid
import random
import requests
from datetime import datetime, timedelta
from kaggle_secrets import UserSecretsClient
import chromadb
import google.generativeai as genai

# Get API keys from Kaggle secrets
user_secrets = UserSecretsClient()
genai.configure(api_key=user_secrets.get_secret("google_api_key"))
serpapi_key = user_secrets.get_secret("serpapi_key")

# Load Gemini model
model = genai.GenerativeModel("gemini-2.0-flash")


**Generate Fake Logs + Store in ChromaDB**

In [19]:
# Sample log errors for simulation
error_catalog = {
    "Spark": [
        ("OutOfMemoryError", "Job aborted: java.lang.OutOfMemoryError: Java heap space", "Increase executor memory"),
        ("NotSerializableException", "Task not serializable: java.io.NotSerializableException", "Make UDF class Serializable"),
        ("ClassNotFoundException", "Caused by: java.lang.ClassNotFoundException", "Add missing JAR dependency"),
        ("NullPointerException", "Exception: java.lang.NullPointerException", "Add null checks in code"),
        ("AnalysisException", "cannot resolve column in input schema", "Fix column name or select statement"),
        ("DiskFullError", "Executor lost: No space left on device", "Clean up disk or use bigger disk"),
        ("StageRetryLimit", "Stage failed 4 times, aborting job", "Fix data skew or memory issues"),
        ("FetchFailedException", "org.apache.spark.shuffle.FetchFailedException", "Investigate shuffle configuration"),
        ("FileNotFoundException", "Input path does not exist: s3://...", "Verify file path in job config"),
        ("SparkSubmitError", "spark-submit failed with exit code 1", "Validate spark-submit command and configs")
    ],
    "Airflow": [
        ("BrokenDAG", "Broken DAG: No module named 'plugin'", "Ensure plugin exists in airflow/plugins"),
        ("TriggerRuleError", "Invalid trigger rule: ALL_WRONG", "Use valid trigger like all_success"),
        ("FileNotFoundError", "FileNotFoundError: 'data.csv' not found", "Check file path or upstream task"),
        ("TaskSkipped", "Task skipped due to dependency", "Ensure upstream tasks are healthy"),
        ("NoneTypeError", "'NoneType' object has no attribute 'write'", "Initialize object before usage"),
        ("TaskTimeout", "Task timed out after 300s", "Increase timeout or optimize task"),
        ("ImportError", "ImportError: cannot import airflow.providers...", "Check module install or DAG syntax"),
        ("InvalidCron", "Invalid cron expression: */99 * * * *", "Fix cron syntax"),
        ("SQLAlchemyError", "sqlalchemy.exc.OperationalError", "Check DB connectivity and credentials"),
        ("DeadlockError", "Scheduler deadlock: no heartbeat from workers", "Scale out workers or debug DAGs")
    ],
    "Kafka": [
        ("ConsumerLag", "[WARN] Consumer lag high: 50000", "Scale up consumer instances"),
        ("SSLHandshakeError", "SSL handshake failed with broker", "Check SSL cert and config"),
        ("TopicNotFound", "No such topic 'user_events'", "Create topic before consuming"),
        ("KafkaTimeout", "Timeout expired while committing offsets", "Check broker latency or partition load"),
        ("StuckConsumer", "Consumer stuck for 10+ mins", "Restart or debug consumer group"),
        ("OffsetOutOfRange", "OffsetOutOfRangeException", "Reset offset to earliest/latest"),
        ("LeaderNotAvailable", "No leader for partition 0", "Restart broker or check cluster state"),
        ("BufferExhaustedException", "Buffer full, producer failed", "Increase buffer or reduce message rate"),
        ("UnknownTopicOrPartition", "Unknown topic or partition", "Check spelling and Kafka setup"),
        ("RebalanceInProgress", "RebalanceInProgressException", "Wait or tune rebalance configs")
    ]
}

# Generate logs
def generate_logs():
    logs = []
    now = datetime.utcnow()
    for system, errors in error_catalog.items():
        for idx, (etype, msg, fix) in enumerate(errors):
            logs.append({
                "log_id": f"{system.lower()}-{idx:03}",
                "component": system,
                "timestamp": (now - timedelta(minutes=random.randint(1, 5000))).isoformat() + "Z",
                "error_type": etype,
                "content": msg,
                "expected_fix": fix,
                "is_resolved": False
            })
    return logs

logs = generate_logs()

# Store in ChromaDB
chroma_client = chromadb.PersistentClient(path="/kaggle/working/chroma_db")
collection = chroma_client.get_or_create_collection("logs")

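# Use each log's deterministic log_id as the document ID and upsert in one batch,
# so re-running this cell does not duplicate entries in the persistent collection.
collection.upsert(
    ids=[log["log_id"] for log in logs],
    documents=[
        f"{log['component']} | {log['error_type']}: {log['content']}"
        for log in logs
    ],
    metadatas=[
        {
            "log_id": log["log_id"],
            "component": log["component"],
            "error_type": log["error_type"],
            "expected_fix": log["expected_fix"],
        }
        for log in logs
    ],
)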
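
To give a feel for what similarity search over the stored logs does, here is a toy illustration of nearest-neighbour retrieval using word counts and cosine similarity. It is a sketch only: ChromaDB's actual embedding model and distance metric differ, and `embed`, `cosine`, and `top_k` are hypothetical helpers.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (real stores use learned vectors)."""
    return Counter(text.lower().replace(":", " ").replace("|", " ").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, docs: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Score every document against the query and keep the k best."""
    q = embed(query)
    scored = [(cosine(q, embed(d)), d) for d in docs]
    return sorted(scored, reverse=True)[:k]


docs = [
    "Spark | OutOfMemoryError: Job aborted: java.lang.OutOfMemoryError: Java heap space",
    "Kafka | TopicNotFound: No such topic 'user_events'",
    "Airflow | TaskTimeout: Task timed out after 300s",
]
best = top_k("java heap space error in Spark", docs, k=1)
print(best[0][1])
```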


**Similar Log Retrieval + RCA Agent (Text Response)**

In [20]:
def retrieve_similar_logs(query_text, top_k=3):
    results = collection.query(query_texts=[query_text], n_results=top_k)
    return results["documents"][0], results["metadatas"][0]

def gemini_rca_summary(log_entry):
    docs, metas = retrieve_similar_logs(log_entry["content"])
    context = "\n".join(
        f"- Log: {doc}\n  Fix: {meta['expected_fix']}"
        for doc, meta in zip(docs, metas)
    )

    prompt = f"""
You're a helpful assistant for debugging system logs.

Here are similar logs:
{context}

Now, for the following issue:
"{log_entry['content']}"

Give a very short and simple explanation:
1. What the issue is (in 1–2 lines)
2. What caused it (in simple words)
3. What can fix it (2–3 quick suggestions)
4. End with a confidence score between 0 and 1

Be brief. Use everyday language. No jargon. No code. No JSON. Just clear plain text.
"""

    return model.generate_content(prompt).text.strip()
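
`collection.query` returns one inner list per query text, which is why the code above indexes `[0]`. The sketch below shows how the prompt context is assembled from such a result, using a hand-written stand-in dictionary so it runs without ChromaDB. `build_context` is a hypothetical helper that mirrors the logic in `gemini_rca_summary`.

```python
from typing import Any, Dict, List


def build_context(results: Dict[str, List[List[Any]]]) -> str:
    """Format the matches for the first query as 'Log / Fix' bullet lines."""
    docs = results["documents"][0]
    metas = results["metadatas"][0]
    return "\n".join(
        f"- Log: {doc}\n  Fix: {meta.get('expected_fix', 'unknown')}"
        for doc, meta in zip(docs, metas)
    )


# Hand-written stand-in for a query result with one query and two matches.
fake_results = {
    "documents": [[
        "Spark | OutOfMemoryError: Job aborted: java.lang.OutOfMemoryError: Java heap space",
        "Spark | StageRetryLimit: Stage failed 4 times, aborting job",
    ]],
    "metadatas": [[
        {"expected_fix": "Increase executor memory"},
        {"expected_fix": "Fix data skew or memory issues"},
    ]],
}
context = build_context(fake_results)
print(context)
```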


**Web Search + Summarizer Agent**

In [21]:
def search_and_summarize_web(log_text):
    params = {
        "engine": "google",
        "q": f"{log_text} site:stackoverflow.com OR site:github.com",
        "api_key": serpapi_key,
        "num": "5"
    }
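    # Query SerpAPI; the timeout keeps the chat loop from hanging on a slow response
    res = requests.get("https://serpapi.com/search", params=params, timeout=30).json()
    # Keep the first three non-empty snippets from the organic results
    snippets = [r["snippet"] for r in res.get("organic_results", []) if r.get("snippet")][:3]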
    
    context = "\n".join(snippets)
    
    if not context:
        return "No relevant info found online."

    prompt = f"""
You are a GenAI assistant. A user reported this error:

{log_text}

Given these web snippets:

{context}

Summarize the most effective fix or strategy for this issue.
"""
    return model.generate_content(prompt).text.strip()
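
The snippet extraction can be exercised offline with a hand-written response. This is a minimal sketch, assuming the search response is a dictionary with an `organic_results` list and, on failure, an `error` key; `extract_snippets` is a hypothetical helper rather than part of the notebook.

```python
from typing import Any, Dict, List


def extract_snippets(response: Dict[str, Any], limit: int = 3) -> List[str]:
    """Return up to `limit` non-empty snippets from a search response."""
    if "error" in response:  # assumption: failures are reported under an "error" key
        return []
    results = response.get("organic_results", [])
    snippets = [r["snippet"].strip() for r in results if r.get("snippet")]
    return snippets[:limit]


# Hand-written stand-in for a search response; no network call is made.
fake_response = {
    "organic_results": [
        {"title": "A", "snippet": "Increase spark.executor.memory."},
        {"title": "B"},  # result without a snippet is skipped
        {"title": "C", "snippet": " Check for data skew. "},
    ]
}
snippets = extract_snippets(fake_response)
print(snippets)
```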


**Compare LLM vs Web – Audit Agent**

In [22]:
def audit_agent(log_text, gemini_fix, web_fix):
    prompt = f"""
You are an audit agent comparing two solutions:

Log: {log_text}

--- Gemini's RCA ---
{gemini_fix}

--- External Web Fix ---
{web_fix}

Evaluate which is more accurate, what's missing, and give a confidence score (0–1).
Return plain text (no JSON).
"""
    return model.generate_content(prompt).text.strip()
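
The audit above relies entirely on the LLM's judgement. As a cheap complementary signal, the word overlap between the two fixes can be measured. The sketch below uses Jaccard similarity over keywords; it is an illustrative addition, not the notebook's audit method, and `keywords` and `jaccard` are hypothetical helpers.

```python
import re
from typing import Set


def keywords(text: str) -> Set[str]:
    """Lowercase alphabetic words longer than three letters."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3}


def jaccard(a: str, b: str) -> float:
    """Shared keywords divided by all keywords; 0.0 when both texts are empty."""
    ka, kb = keywords(a), keywords(b)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


agree = jaccard("Increase executor memory", "Increase memory for the executor")
differ = jaccard("Restart the broker", "Increase executor memory")
print(agree, differ)
```

A low overlap does not mean the fixes disagree, since the same advice can be phrased with different words, so the value is best treated as a hint alongside the audit text.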


In [30]:
def classify_input(text):
    prompt = f"""
You're a classifier inside a log analysis chatbot.

Your task is to classify the user's message strictly as either:
- "error" — only if the user provides a specific technical error, log message, or code exception.
- "chat" — if the message is vague, casual, a greeting, or doesn't contain technical details.

Examples:
- "Hi there" → chat
- "How are you?" → chat
- "I'm getting a NullPointerException in Spark" → error
- "Facing some error in a project" → chat
- "TimeoutError while consuming from Kafka" → error

Only return: "error" or "chat".

Input: "{text}"
Classification:
"""
    response = model.generate_content(prompt)
    # Normalize the reply: drop surrounding whitespace, quotes, and trailing periods
    return response.text.strip().strip("\"'.").lower()


In [31]:
def run_rca_chat():
    print("👋 Welcome to LogLens AI Assistant!")
    print("🔎 I can help debug errors from Spark, Kafka, Airflow, etc.")

    while True:
        user_input = input("\n💬 Enter your error or message (type 'exit' to quit):\n> ").strip()

        if user_input.lower() in ["exit", "quit"]:
            print("👋 Goodbye! Hope your logs stay clean.")
            break

        # Step 1: Classify input
        intent = classify_input(user_input)

        if intent == "chat":
            # Let Gemini respond to general small talk
            prompt = f"""
You are a friendly assistant in a log debugging chatbot.

The user said:
"{user_input}"

Respond in a natural, conversational way.
"""
            response = model.generate_content(prompt)
            print(f"\n🤖 Gemini says: {response.text.strip()}")
            continue

        elif intent == "error":
            # Step 2: Validate if it's a meaningful error
            error_keywords = ["exception", "error", "failed", "traceback", "stack", "timeout", "crash", "null", "not found", "missing"]
            if len(user_input.split()) < 5 or not any(keyword in user_input.lower() for keyword in error_keywords):
                clarification_prompt = f"""
The user said:

"{user_input}"

This is too vague to diagnose. Politely ask them to provide the full error message, code snippet, or logs.
"""
                reply = model.generate_content(clarification_prompt)
                print(f"\n🤖 Gemini says: {reply.text.strip()}")
                continue

            # Step 3: Format log
            log = {
                "log_id": "user-log",
                "component": "Unknown",
                "error_type": "UserInput",
                "content": user_input,
                "is_resolved": False,
                "expected_fix": None
            }

            # Step 4: RCA via Gemini + ChromaDB
            print("\n🤖 Analyzing with Gemini + ChromaDB...")
            gemini_response = gemini_rca_summary(log)
            print("\n🧠 Gemini RCA Suggestion:")
            print(gemini_response)

            # Step 5: Web Search & Summarization
            print("\n🔍 Searching the web for similar solutions...")
            web_fix = search_and_summarize_web(user_input)
            print("\n🌐 Web-Based Fix Summary:")
            print(web_fix)

            # Step 6: Audit
            print("\n📊 Comparing both responses...")
            audit_result = audit_agent(user_input, gemini_response, web_fix)
            print("\n📢 Final Verdict:")
            print(audit_result)

            print("\n💡 You can now enter a follow-up error or ask another question.")

        else:
            print("\n🤖 Hmm, I couldn’t classify your message. Could you rephrase or paste the actual error?")
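
The validation in Step 2 can be factored into a standalone function to see which inputs it accepts. This sketch mirrors the check in `run_rca_chat` (at least five words and at least one keyword); `looks_like_error` is a hypothetical name.

```python
ERROR_KEYWORDS = [
    "exception", "error", "failed", "traceback", "stack",
    "timeout", "crash", "null", "not found", "missing",
]


def looks_like_error(text: str, min_words: int = 5) -> bool:
    """True when the text is long enough and mentions an error keyword."""
    lowered = text.lower()
    has_keyword = any(keyword in lowered for keyword in ERROR_KEYWORDS)
    return len(text.split()) >= min_words and has_keyword


samples = [
    "I have an oumm error in spark",
    "TimeoutError while consuming from Kafka",
    "java.lang.OutOfMemoryError: Java heap space",
    "hi there",
]
for sample in samples:
    print(looks_like_error(sample), "|", sample)
```

Under this rule a short one-line exception such as the third sample counts as too vague because it has only four words; lowering `min_words` would be one way to accept it.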


In [32]:
run_rca_chat()

👋 Welcome to LogLens AI Assistant!
🔎 I can help debug errors from Spark, Kafka, Airflow, etc.



💬 Enter your error or message (type 'exit' to quit):
>  hi



🤖 Gemini says: Hi there! How can I help you debug some logs today? Let me know what you're working on and what kind of issues you're seeing. I'm ready to dive in!



💬 Enter your error or message (type 'exit' to quit):
>  how are you doing



🤖 Gemini says: I'm doing well, thanks for asking! Just here and ready to help debug some logs. What kind of log issues are you wrestling with today?



💬 Enter your error or message (type 'exit' to quit):
>  yeah i have an issue with one of my project



🤖 Gemini says: Okay, I'm here to help! Tell me about the issue you're having with your project. I'm ready to listen. The more details you can give me, the better I can understand what's going on. For example:

*   **What is the project?** (e.g., a web app, a script, a game)
*   **What is the expected behavior?**
*   **What is actually happening?**
*   **What have you already tried?**
*   **Do you have any logs or error messages you can share?**

Don't worry if you don't have all the answers right now, just tell me what you know! Let's figure this out together.



💬 Enter your error or message (type 'exit' to quit):
>  I have an oumm error in spark



🤖 Analyzing with Gemini + ChromaDB...

🧠 Gemini RCA Suggestion:
Okay, here's a breakdown of your Spark OOM error:

1.  **What:** Spark ran out of memory and crashed.
2.  **Why:** Your data might be too big to fit in memory, or a task is trying to hold too much at once.
3.  **Fix:** Give Spark more memory, reduce the size of your data, or optimize your code to use less memory.

**Confidence:** 0.8

🔍 Searching the web for similar solutions...

🌐 Web-Based Fix Summary:
Based on the provided snippets, here's a summary of potential issues and implied strategies:

*   **Out of Memory (OOM) errors:** These occur when Spark executors cannot handle large RDDs/Dataframes/Datasets persisted in storage. The solution would involve optimizing data handling and storage to reduce memory pressure.

*   **Performance Degradation due to Shuffle Spill Files:** Spark creates many shuffle spill files per map task (as many as the number of reducers), leading to performance issues. Addressing this might inv


💬 Enter your error or message (type 'exit' to quit):
>  Thank you



🤖 Gemini says: You're very welcome! I'm glad I could help. Is there anything else I can assist you with regarding your logs today? Just let me know!



💬 Enter your error or message (type 'exit' to quit):
>  Exit


👋 Goodbye! Hope your logs stay clean.
