# AI Cyoda configurations Q&A with RAG Langchain

Welcome to this Jupyter notebook! It guides you through building an AI-powered question-answering system with the LangChain library. The system uses Retrieval-Augmented Generation (RAG): relevant documents are retrieved from a vector store and passed to an OpenAI chat model (`gpt-3.5-turbo-16k` in this notebook), so that its answers are grounded in that context.

The primary purpose of this notebook is to generate Cyoda mapping configurations and resources. It does so by retrieving instructions and examples from the official Cyoda repository. By following along, you'll learn how to combine LangChain and RAG into a configuration-generation tool for Cyoda.

## What will we cover?

In this notebook, we will go through the following steps:

1. **Setting up the environment**: We will install necessary libraries and load environment variables.

2. **Initializing the AI model**: We will initialize the ChatOpenAI model with the appropriate parameters.

3. **Loading instructions and entities**: We will load instructions and entities from the official repository using the GitLoader.

4. **Splitting documents and creating a vectorstore**: We will split the loaded documents into chunks and create a vectorstore using the Chroma library.

5. **Defining prompts for contextualizing and answering questions**: We will define prompts that the AI model will use to contextualize and answer questions.

6. **Creating a retrieval chain**: We will create a retrieval chain that combines the history-aware retriever and the question-answer chain.

7. **Running the chatbot**: Finally, we will run the chatbot and see it in action!

## Let's get started!

Please follow along with the code cells and comments to understand each step of the process. If you have any questions or run into any issues, feel free to ask for help. Happy coding!

# Install requirements

In [None]:
pip install -r ../requirements.txt

# Load environment variables

In [1]:
from dotenv import load_dotenv
import os

load_dotenv()
OPENAI_API_KEY = os.environ['OPENAI_API_KEY']
WORK_DIR = os.environ['WORK_DIR']
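
Indexing `os.environ` directly raises a bare `KeyError` when a variable is missing. A minimal sketch of a friendlier check is shown below; `require_env` is a hypothetical helper introduced here for illustration, not part of the notebook or of python-dotenv.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message.

    `require_env` is a hypothetical helper, not part of python-dotenv.
    """
    value = os.environ.get(name)
    if not value:
        # Covers both "not set" and "set to an empty string"
        raise RuntimeError(
            f"Environment variable {name!r} is not set; add it to your .env file."
        )
    return value
```

It could then replace the direct lookups, for example `WORK_DIR = require_env("WORK_DIR")`.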

In [None]:
## For Google Colab (optional)
# This cell is optional and can be skipped
#from google.colab import userdata
#OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
#WORK_DIR = userdata.get('WORK_DIR')

## Handle unsupported version of sqlite3 (optional)

In [None]:
pip install pysqlite3-binary==0.5.2.post3

In [None]:
import sys
__import__('pysqlite3')
sys.modules['sqlite3'] = sys.modules["pysqlite3"]

# Initialize ChatOpenAI

In [2]:
from langchain.chains import create_history_aware_retriever, create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_community.document_loaders import GitLoader
from langchain_community.vectorstores import Chroma
from langchain_core.messages import HumanMessage
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [3]:
llm = ChatOpenAI(temperature=0, max_tokens=6000, model="gpt-3.5-turbo-16k", openai_api_key=OPENAI_API_KEY)

# Load instructions and entities from the official Cyoda repository

In [4]:
loader = GitLoader(
    clone_url="https://github.com/Cyoda-platform/cyoda-ai",
    repo_path=WORK_DIR,
    branch="cyoda-ai-configurations-3.0.x",
    file_filter=lambda file_path: file_path.startswith(
        f"{WORK_DIR}/data/config-generation/mappings/"
    ),
)
docs = loader.load()
print(f"Number of documents loaded: {len(docs)}")

Number of documents loaded: 5
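
The `file_filter` above relies on a plain string prefix. Assuming the loader builds each path by joining `repo_path` with the file's relative path, a `WORK_DIR` that ends with a slash would produce a prefix containing `//` that matches nothing. The sketch below shows a more tolerant predicate based on normalized paths; `make_prefix_filter` and the example paths under `/tmp/work` are hypothetical.

```python
import os

def make_prefix_filter(repo_path: str, subdir: str):
    """Build a predicate that accepts only files located under repo_path/subdir."""
    # normpath collapses duplicate or trailing separators before comparing
    prefix = os.path.normpath(os.path.join(repo_path, subdir)) + os.sep

    def file_filter(file_path: str) -> bool:
        return os.path.normpath(file_path).startswith(prefix)

    return file_filter

# Hypothetical working directory with a trailing slash
mappings_only = make_prefix_filter("/tmp/work/", "data/config-generation/mappings")
print(mappings_only("/tmp/work/data/config-generation/mappings/mapping_instruction_0.txt"))
print(mappings_only("/tmp/work/data/entities/tender_entity/entity.json"))
```

The returned function could be passed as `file_filter=` in place of the lambda.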


# Split documents and create vectorstore

In [5]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY))
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
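
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window sketch. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraph and line breaks); it only illustrates how overlap makes consecutive chunks share text. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, which is the role the 200-character overlap plays in the cell above.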

In [6]:
count = vectorstore._collection.count()
print(count)

33
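
With 33 chunks in the store and `k` set to 10, each query retrieves roughly a third of the indexed text. The sketch below illustrates the general idea of top-`k` retrieval with toy two-dimensional vectors. Cosine similarity is used purely for illustration; the metric Chroma actually applies depends on how the collection is configured, and the names `cosine`, `top_k`, and `toy_index` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, indexed, k):
    """Return the texts of the k entries most similar to query_vec."""
    ranked = sorted(indexed, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy (text, embedding) pairs standing in for real chunk embeddings
toy_index = [
    ("mapping instruction", [1.0, 0.0]),
    ("script example", [0.0, 1.0]),
    ("mapping example", [0.9, 0.1]),
]
print(top_k([1.0, 0.0], toy_index, k=2))
```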


# Define prompts for contextualizing and answering questions

In [7]:
contextualize_q_system_prompt = """Given a chat history and the latest user question \
which might reference context in the chat history, formulate a standalone question \
which can be understood without the chat history. Do NOT answer the question, \
just reformulate it if needed and otherwise return it as is."""
contextualize_q_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", contextualize_q_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)


In [8]:
history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_q_prompt
)
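
Conceptually, the history-aware retriever branches on whether any chat history exists: with an empty history the input goes to the retriever unchanged, otherwise the LLM first rewrites it into a standalone query. The sketch below mimics that control flow with stub callables; `history_aware_retrieve`, `fake_rewrite`, and `fake_search` are hypothetical stand-ins for the LLM and the vector store, not LangChain APIs.

```python
def history_aware_retrieve(question, chat_history, rewrite_question, search):
    """Rewrite the question only when there is history, then run the search."""
    if not chat_history:
        query = question
    else:
        query = rewrite_question(chat_history, question)
    return search(query)

def fake_rewrite(history, question):
    # Stand-in for the LLM call driven by contextualize_q_prompt
    return f"{question} (about: {history[-1]})"

def fake_search(query):
    # Stand-in for the vector store lookup
    return [f"doc matching '{query}'"]

question_text = "Produce a script for this mapping."
print(history_aware_retrieve(question_text, [], fake_rewrite, fake_search))
print(history_aware_retrieve(question_text, ["tender mapping"], fake_rewrite, fake_search))
```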

# Answer question

In [9]:
qa_system_prompt = """You are a mapping tool. You should do your best to answer the question.
Use the following pieces of retrieved context to answer the question. \

{context}"""
qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", qa_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)
question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)


# Create retrieval chain

In [10]:
rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)
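
The chain returns a dictionary that carries the retrieved documents under `context` and the model's reply under `answer`, the two keys this notebook reads later. A rough sketch of that data flow, using stub functions rather than the real retriever and LLM (all names below are hypothetical):

```python
def run_retrieval_chain(inputs, retrieve, answer_with_docs):
    """Retrieve documents, then answer with them; return inputs plus both results."""
    docs = retrieve(inputs)
    answer = answer_with_docs({**inputs, "context": docs})
    return {**inputs, "context": docs, "answer": answer}

def stub_retrieve(inputs):
    # Stand-in for the history-aware retriever
    return ["instruction chunk", "example chunk"]

def stub_answer(payload):
    # Stand-in for the question-answer chain
    return f"answer built from {len(payload['context'])} chunks"

result = run_retrieval_chain(
    {"input": "Produce a mapping.", "chat_history": []},
    stub_retrieve,
    stub_answer,
)
print(sorted(result))
print(result["answer"])
```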

In [11]:
# Function to read file content
def read_file_to_string(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

# Define question

In [12]:
INPUT = read_file_to_string(f"{WORK_DIR}/data/entities/tender_entity/resources/data_source_inputs/tender_input_1.json")
ENTITY = "net.cyoda.saas.model.TenderEntity"
RETURN_STRING = "Return only DataMappingConfigDto json."
question = f"Produce a mapping from this input to this target entity. Input: {INPUT}. Entity: {ENTITY}. {RETURN_STRING} Work only with JSON attributes that are simple and not arrays. Do NOT use any attributes that are not present in the schema. {RETURN_STRING}"

# Initialize chat history

In [13]:
chat_history = {}

In [14]:
# Function to add a question/answer pair to the chat history
from langchain_core.messages import AIMessage

def add_to_chat_history(id, question, message):
    # Wrap a plain-string answer so it is not treated as another human turn
    if isinstance(message, str):
        message = AIMessage(content=message)
    chat_history.setdefault(id, []).extend([HumanMessage(content=question), message])

In [15]:
# Function to clear chat history
def clear_chat_history(id):
    if id in chat_history:
        del chat_history[id]

In [16]:
import uuid

# Generate a unique ID for the chat session
id = uuid.uuid1()

In [17]:
# First question and AI response
ai_msg_1 = rag_chain.invoke({"input": question, "chat_history": chat_history.get(id, [])})
add_to_chat_history(id, question, ai_msg_1["answer"])

In [18]:
print(ai_msg_1["answer"])

{
    "@bean": "com.cyoda.plugins.mapping.core.dtos.DataMappingConfigDto",
    "id": "c784c270-f0fe-11ee-9561-ee157423307a",
    "name": "tender",
    "lastUpdated": 1712069164720,
    "dataType": "JSON",
    "description": "",
    "entityMappings": [
        {
            "id": {
                "id": "c77e59d0-f0fe-11ee-9561-ee157423307a"
            },
            "name": "tender",
            "entityClass": "net.cyoda.saas.model.TenderEntity",
            "entityRelationConfigs": [
                {
                    "srcRelativeRootPath": "root:/"
                }
            ],
            "columns": [
                {
                    "srcColumnPath": "date",
                    "dstCyodaColumnPath": "date",
                    "dstCyodaColumnPathType": "java.lang.String",
                    "dstCollectionElementSetModes": [],
                    "transformer": {
                        "type": "COMPOSITE",
                        "children": []
                    }
   

In [19]:
# Second question and AI response
second_question = "Produce a script for this mapping. Return only script json object which contains body and inputSrcPaths inside script attribute."
ai_msg_2 = rag_chain.invoke({"input": second_question, "chat_history": chat_history.get(id, [])})
add_to_chat_history(id, second_question, ai_msg_2["answer"])
print(ai_msg_2["answer"])

{
    "body": "var notices = [];\nvar Notice = Java.type('net.cyoda.saas.model.Notice');\n\n// Add notices from input\nfor (var i = 0; i < input.notices.length; i++) {\n    var notice = new Notice();\n    notice.setId(input.notices[i].id != null ? input.notices[i].id : 0);\n    notice.setDate(input.notices[i].date != null ? input.notices[i].date : \"00-00-00\");\n    notice.setType(input.notices[i].type != null ? input.notices[i].type : \"Unknown type\");\n    notices.push(notice);\n}\nentity.setNotices(notices);\n",
    "inputSrcPaths": [
        "notices/*/id",
        "notices/*/date",
        "notices/*/type"
    ]
}
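
Because the model returns plain text, it can help to parse and check the reply before using it. The sketch below assumes a hypothetical helper, `parse_script_answer`, that accepts the two keys either at the top level (as in the output above) or nested under a `script` attribute (as the question requested).

```python
import json

def parse_script_answer(answer_text: str) -> dict:
    """Parse the model's reply and return the object holding body and inputSrcPaths."""
    data = json.loads(answer_text)  # raises json.JSONDecodeError on malformed output
    script = data.get("script", data) if isinstance(data, dict) else None
    if not isinstance(script, dict):
        raise ValueError("Expected a JSON object describing the script")
    missing = {"body", "inputSrcPaths"} - script.keys()
    if missing:
        raise ValueError(f"Script is missing keys: {sorted(missing)}")
    return script

# Shortened, made-up reply used only to exercise the helper
sample = '{"body": "entity.setNotices([]);", "inputSrcPaths": ["notices/*/id"]}'
print(parse_script_answer(sample)["inputSrcPaths"])
```

In the notebook it would be called as `parse_script_answer(ai_msg_2["answer"])`.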


In [20]:
print(chat_history)

{UUID('6809e334-fbc1-11ee-a4da-f01898ec22af'): [HumanMessage(content='Produce a mapping from this input to this target entity. Input: {\n  "id": "1",\n  "date": "2019-07-16",\n  "deadline_date": "2019-07-25",\n  "deadline_length_days": "9",\n  "title": "Sustitucin de duchas de los baos del pasillo C y D de la Residencia Juvenil Baltasar Gracian",\n  "category": "constructions",\n  "sid": "3996914",\n  "src_url": "https",\n  "src_final_url": "https",\n  "awarded_value": "20252.00",\n  "awarded_currency": "EUR",\n  "purchaser": {\n    "id": "1",\n    "sid": null,\n    "name": null\n  },\n  "type": {\n    "id": "minor-contract",\n    "name": "Minor contract",\n    "slug": "minor-contract"\n  },\n  "notices": [\n    {\n      "id": null,\n      "sid": null,\n      "date": "2019-08-30",\n      "type": {},\n      "src_id": null,\n      "src_url": null,\n      "data": {\n        "date": "2019-08-30",\n        "type": "Anuncio de Adjudicacin"\n      },\n      "sections": []\n    },\n    {\n    

In [21]:
clear_chat_history(id)

In [22]:
for document in ai_msg_1["context"]:
    print(document)
    print()

page_content='Instruction how to produce a mapping for a target entity.\nEntity\nHere is an example target entity:\n{\n  "tender_entity": {\n    "name": "string",\n    "types": [\n      "string"\n    ],\n    "contactUser": "string",\n    "systemAccount": true,\n    "date": "string",\n    "deadlineDate": "string",\n    "deadlineLengthDays": 0,\n    "category": "string",\n    "awardedValue": 0.0,\n    "purchaser": "string",\n    "notices": [\n      {\n        "name": "string",\n        "id": "string",\n        "sid": "string",\n        "date": "string",\n        "type": "string",\n        "srcId": "string",\n        "srcUrl": "string",\n        "data": "string"\n      }\n    ]\n  }\n}' metadata={'file_name': 'mapping_instruction_0.txt', 'file_path': 'data/config-generation/mappings/mapping_instruction_0.txt', 'file_type': '.txt', 'source': 'data/config-generation/mappings/mapping_instruction_0.txt'}

page_content='Example output need to be modified for the custom input and entity:\n```\n