# RAG chatbot for data analysis
This chatbot is meant to help data scientists and analysts who need to work with unstructured information. It was written on a laptop without a GPU, so the models and the database were chosen to be as lightweight as possible, which limits the bot's performance. If you have a more powerful computer, feel free to swap in more capable models.
Please verify the answers before relying on them.
The interface of this chatbot is a Jupyter notebook, which makes it easier for data scientists to use it in their own environments.

In [1]:
from rerankers import Reranker
import os
import pinecone
from pinecone import Pinecone, ServerlessSpec
from rag_utils import (
    create_database,
    generate_embeddings,
    store_embeddings,
    clean_text,
    create_corpus,
    store_knowledgebase,
    chunk_document,
    query_pinecone,
    generate_response,
    clear_db,
)

## Hyperparameters
Before you can start your data journey, you need to decide on a few hyperparameters. If you want to adjust them separately for every question, feel free to do so.

- `API_key` (str): your Pinecone API key.
- `model_name` (str): the model used by both the retriever and the generator. To keep the bot as lightweight as possible, both parts share the same T5 model. This matters for embedding: if you want to use a different model type, you might need to change the embedding logic.
- `db_name` (str): the name you wish to give your database instance.
- `top_k` (int, optional): the number of top results to return from the Pinecone query.
- `chunk_size` (int, optional): the size of each chunk for processing.
- `overlap` (int, optional): the overlap between consecutive chunks.
- `temperature` (float, optional): the sampling temperature, which controls how creative the generated answers are.
- `max_new_tokens` (int, optional): the maximum number of new tokens to generate.
- `bm25_weight` (float, optional): the weight for BM25 scores.
- `semantic_search_weight` (float, optional): the weight for semantic search scores.
- `ranker`: the reranker. For more options, see the documentation of the Python `rerankers` library.
- `directory` (str): path to the folder that contains the text files you want to query.
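
To illustrate how `chunk_size` and `overlap` interact, here is a minimal sketch of sliding-window chunking. It assumes that both values are counted in whitespace-separated words; the unit that `chunk_document` in `rag_utils` actually uses is not stated here, and `chunk_words` is a hypothetical helper, not part of this project.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of `chunk_size` words; neighbours share `overlap` words.

    Assumption: sizes are measured in words, which may differ from rag_utils.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
```

A larger `overlap` reduces the chance that a sentence is cut off at a chunk boundary, at the cost of storing more (partly redundant) chunks.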

In [None]:
API_key = "enter your API key here"
model_name = "google/flan-t5-large"  # or a smaller model, e.g. t5-base
db_name = "pine-new"
top_k = 10
temperature = 0.4
chunk_size = 200
overlap = 20
max_new_tokens = 300
bm25_weight = 0.7
semantic_search_weight = 0.3
ranker = Reranker('mixedbread-ai/mxbai-rerank-large-v1', model_type='cross-encoder')
directory = "name or path of your directory"

### Step 1: Create your knowledgebase

In [3]:
# create a text corpus out of your txt files
corpus = create_corpus(directory)
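
The implementation of `create_corpus` lives in `rag_utils` and is not shown in this notebook. As a rough idea of what building a corpus from a folder of `.txt` files can look like, here is a minimal sketch. The helper name `load_text_files` and the returned mapping of file name to content are assumptions made for illustration; the real function may return a different structure.

```python
from pathlib import Path


def load_text_files(directory: str) -> dict[str, str]:
    """Return {file name: file content} for every .txt file in `directory`.

    Hypothetical stand-in for create_corpus; the real return type may differ.
    """
    folder = Path(directory)
    if not folder.is_dir():
        raise NotADirectoryError(f"{directory!r} is not a directory")

    corpus = {}
    # Sort the paths so the corpus order is reproducible between runs.
    for path in sorted(folder.glob("*.txt")):
        # errors="replace" keeps loading even if a file has stray bytes.
        corpus[path.name] = path.read_text(encoding="utf-8", errors="replace")
    return corpus
```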

In [4]:
# Create a Pinecone client instance
os.environ["PINECONE_API_KEY"] = API_key
pc = pinecone.Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

In [5]:
# create the database (a Pinecone index) if it does not exist yet
index_name = db_name
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        # must match the embedding size of your model:
        # 1024 for flan-t5-large, 768 for t5-base, 512 for t5-small
        dimension=1024,
        metric="cosine",  # replace with the metric that suits your model
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )

index = pc.Index(index_name)

In [None]:
# store your embedded corpus in your database
store_knowledgebase(model_name, corpus, index)
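
What `store_knowledgebase` does internally is not shown here. Storing embeddings in Pinecone generally means turning each chunk into a record with an `id`, its vector `values`, and optional `metadata`, and then upserting the records in batches. The sketch below illustrates that step with two hypothetical helpers, `build_records` and `batched`; the ID scheme, the `text` metadata field, and the batch size of 100 are assumptions, not the project's actual choices.

```python
from itertools import islice
from typing import Iterable, Iterator


def build_records(chunks: list[str], embeddings: list[list[float]],
                  prefix: str = "chunk") -> list[dict]:
    """Pair every chunk with its embedding in the dict format Pinecone accepts."""
    if len(chunks) != len(embeddings):
        raise ValueError("chunks and embeddings must have the same length")
    return [
        {"id": f"{prefix}-{i}", "values": list(vec), "metadata": {"text": chunk}}
        for i, (chunk, vec) in enumerate(zip(chunks, embeddings))
    ]


def batched(records: Iterable[dict], batch_size: int = 100) -> Iterator[list[dict]]:
    """Yield successive lists of at most `batch_size` records."""
    iterator = iter(records)
    while batch := list(islice(iterator, batch_size)):
        yield batch


# Usage with an existing Pinecone index (not executed here):
# for batch in batched(build_records(chunks, embeddings)):
#     index.upsert(vectors=batch)
```

Keeping the chunk text in the metadata means a query result can be passed to the generator without a second lookup.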

### Step 2: Ask your questions

In [None]:
# example question
query = "Enter your query here"
prompt = "Form new, full sentences out of the information."
response = generate_response(
    query,
    prompt,
    model_name,
    index,
    corpus,
    top_k,
    chunk_size,
    overlap,
    temperature,
    max_new_tokens,
    bm25_weight,
    semantic_search_weight,
    ranker,
)
print(response)
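
The `bm25_weight` and `semantic_search_weight` arguments suggest that keyword (BM25) and semantic scores are blended before the reranker sees the candidates. One common way to do this is a weighted sum of normalized scores, sketched below. The min-max normalization and the helper names `min_max` and `hybrid_rank` are assumptions; the logic inside `generate_response` may differ.

```python
def min_max(scores: list[float]) -> list[float]:
    """Scale scores to [0, 1]; if all scores are equal, return zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_rank(bm25_scores: list[float], semantic_scores: list[float],
                bm25_weight: float = 0.7,
                semantic_search_weight: float = 0.3) -> list[int]:
    """Return chunk indices ordered from best to worst combined score."""
    if len(bm25_scores) != len(semantic_scores):
        raise ValueError("both score lists must have the same length")
    if not bm25_scores:
        return []

    # Normalize first, because BM25 and cosine scores live on different scales.
    bm25_norm = min_max(bm25_scores)
    semantic_norm = min_max(semantic_scores)
    combined = [
        bm25_weight * b + semantic_search_weight * s
        for b, s in zip(bm25_norm, semantic_norm)
    ]
    return sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)


print(hybrid_rank([2.0, 0.0, 1.0], [0.1, 0.9, 0.5]))
```

With the weights used in this notebook (0.7 for BM25, 0.3 for semantic search), exact keyword matches dominate the ranking; swapping the weights favors chunks that are similar in meaning even when the wording differs.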