## Smart Waterways: AIS RAG Chatbot with Fallback

This notebook implements a **Retrieval-Augmented Generation (RAG) chatbot** for querying **Automatic Identification System (AIS) vessel tracking data**. It integrates structured vessel records from a CSV file with a large language model (LLM), enabling natural language queries about vessel positions, attributes, and movements.

### Key Features
- **Data Ingestion & Embeddings**  
  Loads AIS vessel records from a CSV file and embeds them using `BAAI/bge-small-en`. The embeddings are stored in, and retrieved from, a persistent **Chroma vector database**.

- **Retrieval-Augmented Chat**  
  Uses LangChain’s `SelfQueryRetriever` and `ConversationalRetrievalChain` to interpret user queries, retrieve the most relevant vessel records, and generate context-aware answers.

- **Fallback Knowledge**  
  If no relevant AIS records are found, or the model's answer contains an uncertainty phrase (for example, "I don't know"), a **fallback chain** answers from the LLM's general maritime knowledge instead.

- **Conversation Memory**  
  Maintains chat history using `ConversationBufferMemory`, allowing for multi-turn conversations with context.

- **Interactive Gradio UI**  
  A user-friendly interface built with **Gradio**, featuring:
  - A chat window for natural queries  
  - Support for both **Enter key** and **Submit button**  
  - Clear button to reset the conversation  
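
To make the ingestion step more concrete, here is a minimal standard-library sketch of how a CSV row can become one retrievable document: a text body of `column: value` lines plus a metadata dictionary. It only approximates what a CSV loader does. The function name `rows_to_documents`, the file name `sample.csv`, and the two sample rows are hypothetical and invented for illustration; they are not taken from the AIS dataset.

```python
import csv
import io

# Hypothetical two-row sample; the values are made up for illustration only.
SAMPLE_CSV = """MMSI,BaseDateTime,LAT,LON,SOG,VesselName,VesselType
111111111,2024-01-01T00:00:00,30.10,-90.20,7.5,EXAMPLE ONE,Cargo
222222222,2024-01-01T00:01:00,29.95,-90.05,0.0,EXAMPLE TWO,Tanker
"""


def rows_to_documents(csv_text, source="sample.csv"):
    """Turn each CSV row into a (page_content, metadata) pair."""
    documents = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader):
        # One "column: value" line per field forms the text that gets embedded.
        page_content = "\n".join(f"{key}: {value}" for key, value in row.items())
        # The same fields are kept as metadata so a retriever can filter on them.
        metadata = dict(row, source=source, row=row_number)
        documents.append((page_content, metadata))
    return documents


docs = rows_to_documents(SAMPLE_CSV)
print(docs[0][0])   # text body of the first row
print(docs[0][1])   # metadata of the first row
```

Note that every value read this way is a string, so numeric fields such as `LAT` or `SOG` would need explicit conversion before numeric comparisons on them are meaningful.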

### Usage
Run the notebook to launch a **shareable Gradio app**, where you can type vessel-related queries (e.g., positions, speed, cargo type) and get answers sourced either from the AIS dataset or from general maritime expertise when data is missing.
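
The choice between the two answer sources can be sketched as a small pure function. This is an illustrative reduction of the check performed in the chat handler, assuming the same list of trigger phrases; the name `needs_fallback` is hypothetical and does not appear in the notebook.

```python
# Phrases that signal the retrieval-based answer was not useful.
TRIGGER_PHRASES = (
    "i don't know",
    "i am not sure",
    "no relevant information",
    "not provided in the context",
    "context does not provide",
    "no information found",
)


def needs_fallback(answer, retrieved_docs):
    """Return True when the general-knowledge chain should answer instead."""
    no_docs = len(retrieved_docs) == 0
    uncertain = any(phrase in answer.lower() for phrase in TRIGGER_PHRASES)
    return no_docs or uncertain


print(needs_fallback("I don't know.", ["some record"]))
print(needs_fallback("The vessel is moored.", ["some record"]))
```

Because the check is a plain substring match, an answer that uses a typographic apostrophe ("I don’t know") or a different wording would not trigger the fallback; the phrase list may need extending for a given model.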

---

In [26]:
import os
import warnings
import gradio as gr

from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains import ConversationalRetrievalChain, LLMChain
from langchain.memory import ConversationBufferMemory
from langchain.prompts import PromptTemplate

warnings.filterwarnings('ignore')

# ----------------------------
# Setup: Embeddings + CSV Loader
# ----------------------------

embedding = HuggingFaceEmbeddings(
    model_name="BAAI/bge-small-en",
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True},
)

# Every column is embedded as document text and also stored as metadata,
# so the self-query retriever can filter on any of them.
all_columns = ['MMSI', 'BaseDateTime', 'LAT', 'LON', 'SOG', 'COG', 'Heading',
       'VesselName', 'IMO', 'CallSign', 'VesselType', 'Status', 'Length',
       'Width', 'Draft', 'Cargo', 'TransceiverClass']
meta_columns = list(all_columns)

loader = CSVLoader(
    file_path="AIS_sampleData2.csv",
    metadata_columns=meta_columns,
    content_columns=all_columns,
)
documents = loader.load()

doc_db = Chroma(
    persist_directory="AIS_sampleData2_db_V2",
    embedding_function=embedding,
)

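# Populate the vector store only when it is empty (i.e., on the first run),
# so that re-running the notebook does not insert duplicate embeddings.
if not doc_db.get(limit=1)["ids"]:
    doc_db.add_documents(documents)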

# ----------------------------
# Setup: LLM + Retriever
# ----------------------------

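# Read the Groq API key from the environment instead of hard-coding it;
# a key committed in a notebook should be treated as leaked and revoked.
if not os.environ.get("GROQ_API_KEY"):
    raise RuntimeError("Set the GROQ_API_KEY environment variable before running this notebook.")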

llm = ChatOpenAI(
    model="llama-3.3-70b-versatile",
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
    temperature=0.3,
)

metadata_field_info = [
    AttributeInfo(name="MMSI", description="Unique Maritime Mobile Service Identity number of the vessel", type="string"),
    AttributeInfo(name="BaseDateTime", description="UTC date and time of the report and the vessel record", type="string"),
    AttributeInfo(name="LAT", description="Latitude of the vessel's position", type="float"),
    AttributeInfo(name="LON", description="Longitude of the vessel's position", type="float"),
    AttributeInfo(name="SOG", description="Speed Over Ground in knots", type="float"),
    AttributeInfo(name="COG", description="Course Over Ground in degrees", type="float"),
    AttributeInfo(name="Heading", description="Vessel's true heading in degrees (0-359 or 511 for not available)", type="integer"),
    AttributeInfo(name="VesselName", description="Name of the vessel", type="string"),
    AttributeInfo(name="IMO", description="International Maritime Organization number", type="string"),
    AttributeInfo(name="CallSign", description="Vessel's radio call sign", type="string"),
    AttributeInfo(name="VesselType", description="Type of vessel (e.g., Cargo, Tanker, Passenger, Fishing)", type="string"),
    AttributeInfo(name="Status", description="Navigational status of the vessel (e.g., Underway, Anchored, Moored)", type="string"),
    AttributeInfo(name="Length", description="Length of the vessel in meters", type="integer"),
    AttributeInfo(name="Width", description="Width of the vessel in meters", type="integer"),
    AttributeInfo(name="Draft", description="Current static draft of the vessel in meters", type="float"),
    AttributeInfo(name="Cargo", description="Type of cargo carried by the vessel", type="string"),
    AttributeInfo(name="TransceiverClass", description="AIS transceiver class (A or B)", type="string"),
    # Add other metadata fields if 'source' or 'row' are important for querying
    AttributeInfo(name="source", description="The source CSV file name", type="string"),
    AttributeInfo(name="row", description="Row number in the original CSV file", type="integer"),
]

document_content_description = "AIS data for vessel records"

retriever = SelfQueryRetriever.from_llm(
    llm,
    doc_db,
    document_content_description,
    metadata_field_info
)

memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True,
    output_key="answer"
)

conversational_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory,
    return_source_documents=True,
    output_key="answer"
)

# ----------------------------
# Setup: Fallback Chain
# ----------------------------

fallback_prompt = PromptTemplate.from_template(
    "You are a maritime expert. Answer the question based on your general maritime knowledge.\n\nQuestion: {question}\n\nAnswer:"
)

fallback_chain = LLMChain(llm=llm, prompt=fallback_prompt)

# ----------------------------
# Chat Handler with Fallback
# ----------------------------

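def chat_interface(user_input, history):
    """Answer from the AIS database, falling back to general maritime knowledge."""
    history = history or []  # guard in case the UI state was reset to None
    response = conversational_chain.invoke({"question": user_input})
    answer = response["answer"].strip()

    # Re-run retrieval on the raw user input (not the history-condensed question)
    # to check whether any AIS records match. This costs one extra retriever call;
    # the alternative below reuses the documents the chain already retrieved.
    # no_docs = len(response.get("source_documents", [])) == 0
    fresh_docs = retriever.get_relevant_documents(user_input)
    no_docs = len(fresh_docs) == 0

    # Phrases that indicate the retrieval-based answer was not useful.
    trigger_phrases = [
        "i don't know",
        "i am not sure",
        "no relevant information",
        "not provided in the context",
        "context does not provide",
        "no information found"
    ]
    poor_response = any(phrase in answer.lower() for phrase in trigger_phrases)

    if no_docs or poor_response:
        answer = fallback_chain.run({"question": user_input})
        answer += "\n\n_(Answered using general maritime knowledge)_"
    else:
        answer += "\n\n_(Answered using AIS database)_"

    history.append((user_input, answer))
    return history, history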

# ----------------------------
# Gradio UI
# ----------------------------

with gr.Blocks(title="AIS RAG Chatbot") as demo:
    # Top title centered
    gr.Markdown(
        "<h1 style='text-align: center;'>Smart Waterways: A Retrieval-Augmented AI Framework for Vessel Tracking and Decision Support</h1>"
    )

    # gr.Markdown("### 🚢 AIS Vessel Query Assistant")

    chatbot = gr.Chatbot(label="AIS Vessel Query Assistant")
    message = gr.Textbox(placeholder="Type your question here...", show_label=False)
    submit_btn = gr.Button("Submit")
    state = gr.State([])

    def respond(user_input, chat_history):
        return chat_interface(user_input, chat_history)

    # Handle submit via Enter key
    message.submit(respond, [message, state], [chatbot, state])
    # Handle submit via button click
    submit_btn.click(respond, [message, state], [chatbot, state])
    
    #Clear textbox after submit (both Enter and Button)
    message.submit(lambda *args: "", None, message)
    submit_btn.click(lambda *args: "", None, message)

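    # Clearing the UI must also clear the LangChain memory; otherwise the chain
    # keeps using the previous conversation as context.
    clear_btn = gr.ClearButton([message, chatbot, state])
    clear_btn.click(lambda: memory.clear(), None, None)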

    # Footer info
    gr.Markdown("""
    <div style='text-align: center; font-size:18px;'>
        <hr>
        <strong>Smart Rivers 2025</strong><br><br>
        Presented by<br>
        <strong>Zaidur Rahman</strong> & <strong>Dr. Heather Nachtmann</strong><br>
        <img src="https://brand.uark.edu/_resources/images/UA_Logo.png" 
             alt="University of Arkansas Logo" 
             width="120" 
             style="display:block; margin-left:auto; margin-right:auto;">
    </div>
    """)

demo.launch(share=True)

* Running on local URL:  http://127.0.0.1:7882



* Running on public URL: https://37e532ae76d2f79c8e.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


