## A Walkthrough for Document Vectorization
Throughout this walkthrough, we will use OpenAI embeddings.

First, import all the required libraries. <br> 
You can install them by running the following command in a terminal.<br><br>
``pip install -r requirements.txt``

In [12]:
import chromadb
import openai
# import langchain
from chromadb.utils import embedding_functions
from langchain.document_loaders import TextLoader
import os
from typing import List
import pandas as pd



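Now, we'll introduce three functions for extracting text from files.<br><br>
<b>read_docx_as_single_page():</b> This function reads DOCX files.<br>
<b>read_pdf_as_single_page():</b> This function reads PDF files.<br>
<b>read_csv_row_formatted():</b> This function reads CSV files and returns the rows as a JSON string.<br><br>
Each function takes the path of the file as a parameter and returns its content as a single string.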

In [81]:
from PyPDF2 import PdfReader
import docx

def read_docx_as_single_page(docx_path: str) -> str:
    doc = docx.Document(docx_path)
    text = ""
    for paragraph in doc.paragraphs:
        text += paragraph.text + "\n"
    return text


def read_pdf_as_single_page(pdf_path: str) -> str:
    with open(pdf_path, "rb") as f:
        pdf_reader = PdfReader(f)
        text = ""
        for page in pdf_reader.pages:
            # Extract text from the current page and append it to the string.
            # Fall back to an empty string if no text is returned.
            text += page.extract_text() or ""
        return text


def read_csv_row_formatted(csv_path: str) -> str:
    # Read the CSV file and return its rows as a JSON array of records.
    test_df = pd.read_csv(csv_path, sep=',')
    output = test_df.to_json(indent=1, orient='records')
    return output
    

This section of the code processes the files and builds a list of documents with matching metadata. We will assume all files are in the <b>"procom_data"</b> folder.

In [82]:
folder_name = 'procom_data'
document = []
metadata = []

for file in os.listdir(folder_name):
    file_path = os.path.join(folder_name, file)
    # If the file is a PDF then process it with "read_pdf_as_single_page".
    # Elif the file is a DOCX then process it with "read_docx_as_single_page".
    # Elif the file is a TXT then process it with "TextLoader".
    # Elif the file is a CSV then process it with "read_csv_row_formatted".
    # Else skip the file so that document and metadata stay aligned.
    if file.endswith(".pdf"):
        text = read_pdf_as_single_page(file_path)
    elif file.endswith(".docx") or file.endswith(".doc"):
        # Note: python-docx targets .docx; a legacy .doc file is likely to fail here.
        text = read_docx_as_single_page(file_path)
    elif file.endswith(".txt"):
        # TextLoader returns a list of Document objects; keep only their text.
        loader = TextLoader(file_path)
        text = "\n".join(doc.page_content for doc in loader.load())
    elif file.endswith(".csv"):
        # Reads the CSV file and returns its rows in JSON format.
        text = read_csv_row_formatted(file_path)
    else:
        continue
    # Finally append the text string to the document list,
    # and the file name (without extension) to the metadata list.
    document.append(text)
    metadata.append({"competition": os.path.splitext(file)[0]})

# print(document)
# print(metadata)
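
The loop above can also be expressed with a lookup table from file extension to reader function. The following is only a minimal sketch of that idea, using the standard library alone: `load_folder` and `read_plain_text` are hypothetical names that do not appear in the notebook, and `read_plain_text` merely stands in for the real readers defined earlier.

```python
import os
from typing import Callable, Dict, List, Tuple


def read_plain_text(path: str) -> str:
    """Hypothetical stand-in reader; the notebook's PDF/DOCX/CSV readers fit the same shape."""
    with open(path, encoding="utf-8") as f:
        return f.read()


def load_folder(
    folder: str, readers: Dict[str, Callable[[str], str]]
) -> Tuple[List[str], List[dict]]:
    """Return parallel lists of document texts and metadata for supported files."""
    documents: List[str] = []
    metadata: List[dict] = []
    for name in sorted(os.listdir(folder)):
        stem, ext = os.path.splitext(name)
        reader = readers.get(ext.lower())
        if reader is None:
            # Skip unsupported files so both lists keep the same length.
            continue
        documents.append(reader(os.path.join(folder, name)))
        metadata.append({"competition": stem})
    return documents, metadata


# Possible usage (assumes the folder exists):
# docs, meta = load_folder("procom_data", {".txt": read_plain_text})
```

Because a text and its metadata entry are appended together, an unsupported file can never shift one list out of step with the other.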

After loading the documents, we will set up the Chroma database. Because Chroma is a vector database, we must first define the embedding function that turns text into vectors.

Chroma DB supports various embedding functions. We can use either the default embedding function provided by Chroma DB or the OpenAI embedding function. For this project, we will use the OpenAI embedding function.
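
To make the role of an embedding function more concrete, here is a toy sketch of embedding-based retrieval. It is not how Chroma or OpenAI compute embeddings: `toy_embed` is a hypothetical word-count stand-in for a real model, and cosine similarity is just one common way to compare vectors. The point it illustrates is that documents and queries must go through the same embedding function for their vectors to be comparable.

```python
import math
from collections import Counter
from typing import Dict, List


def toy_embed(text: str) -> Dict[str, float]:
    """Hypothetical stand-in for an embedding model: a sparse word-count vector."""
    return {word: float(n) for word, n in Counter(text.lower().split()).items()}


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query: str, documents: List[str]) -> int:
    """Return the index of the document whose vector is closest to the query."""
    query_vec = toy_embed(query)
    scores = [cosine_similarity(query_vec, toy_embed(doc)) for doc in documents]
    return max(range(len(documents)), key=scores.__getitem__)


docs = ["game dev rules and engines", "robo soccer field size"]
best = most_similar("game dev rules", docs)
```

A real embedding model produces dense vectors that capture meaning rather than exact word overlap, but the retrieval step follows the same pattern: embed the query, then rank the stored vectors by similarity.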

In [85]:
chroma_client = chromadb.PersistentClient("./")

In [6]:
# To use the default embedding, you will
# need to install "sentence_transformers" package 
# first.
 
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")



In [5]:
# Setting OpenAI variables
OPEN_AI_API_KEY = "OPENAI API KEY"
embedding_model = "text-embedding-ada-002"
openai_client = openai.OpenAI(api_key=OPEN_AI_API_KEY)

# This is OpenAI embedding function
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=OPEN_AI_API_KEY,
    model_name=embedding_model,
)


Now we will create a Chroma collection and add the documents to it.

In [84]:
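# Use the same OpenAI embedding model for storage that is used for queries,
# so that stored vectors and query vectors have matching dimensions.
collection = chroma_client.get_or_create_collection(
    name="Procom_Competitions", embedding_function=openai_ef
)

# Add each document with its metadata and a unique string id.
for i, text in enumerate(document):
    collection.add(documents=[text], metadatas=[metadata[i]], ids=[str(i)])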
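
Since `collection.add` accepts lists for `ids`, `documents`, and `metadatas`, the per-document loop could also be replaced by a single batched call. The sketch below only prepares the arguments; `build_batch` is a hypothetical helper, and the two sample documents are illustrative placeholders rather than data from the project.

```python
from typing import Dict, List


def build_batch(documents: List[str], metadata: List[dict]) -> Dict[str, list]:
    """Build keyword arguments for one batched add call (hypothetical helper)."""
    if len(documents) != len(metadata):
        raise ValueError("documents and metadata must have the same length")
    return {
        # Ids must be unique strings; the list position is used here.
        "ids": [str(i) for i in range(len(documents))],
        "documents": list(documents),
        "metadatas": list(metadata),
    }


batch = build_batch(
    ["rules for game dev", "rules for ctf"],
    [{"competition": "Game Dev"}, {"competition": "CTF"}],
)
# With a real collection this could be passed as: collection.add(**batch)
```

The length check catches a mismatch between texts and metadata before anything is written to the database.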

In [11]:
# Testing: check how many documents are stored in the collection.
collection.count()
# t = collection.query(query_texts = "events")
# t['documents'][0]

24

In [69]:
#TESTING

# file = "Procom'24 Plan Day1.csv"
# xlsx_path = f"./{folder_name}/" + file
# s = pd.read_csv(xlsx_path, index_col = 0)
# s = s.to_json(orient='records')
# # formatted_data = []
# # for row in s:
# #     formatted_row = []
# #     for key, value in row.items():
# #         # If the value is a list, enclose it within square brackets
# #         if isinstance(value, list):
# #             value = "[" + ", ".join(map(str, value)) + "]"
# #         formatted_row.append(value)
# #     formatted_data.append(formatted_row)

# # # Print or return the formatted data
# # # for row in formatted_data:
# # #     print(row)
# # formatted_data

In [53]:
#TESTING
# csv_file = "procom_data/Procom'24 Plan Day1.csv"
# test_df = pd.read_csv(csv_file, sep=',')

In [74]:
# TESTING
# json_output = "procom_data/test.json"
# output = test_df.to_json(indent = 1, orient = 'records')
# output

'[\n {\n  "Competitions":"Competitive Programming ",\n  "Capacity(Teams)":70.0,\n  "Lab\\/Room Allocated Day 1 ":"Lab 1, Lab 2, Lab 4, Lab 5",\n  "Lab\\/Room Allocated Day 2":"Lab 1 , Lab 2, Lab 4 "\n },\n {\n  "Competitions":"Speed Debugging ",\n  "Capacity(Teams)":35.0,\n  "Lab\\/Room Allocated Day 1 ":"E1, E2, E3 ",\n  "Lab\\/Room Allocated Day 2":"Lab 10, Lab 12"\n },\n {\n  "Competitions":"Chatcraft",\n  "Capacity(Teams)":25.0,\n  "Lab\\/Room Allocated Day 1 ":"Lab 8 ",\n  "Lab\\/Room Allocated Day 2":"Lab 8"\n },\n {\n  "Competitions":"database design",\n  "Capacity(Teams)":35.0,\n  "Lab\\/Room Allocated Day 1 ":"Lab 10, Lab 12",\n  "Lab\\/Room Allocated Day 2":"Lab 5, Lab 6"\n },\n {\n  "Competitions":"Code in the dark",\n  "Capacity(Teams)":35.0,\n  "Lab\\/Room Allocated Day 1 ":"Lab 5, Lab 8",\n  "Lab\\/Room Allocated Day 2":"Lab 5, Lab 8"\n },\n {\n  "Competitions":"Game developement ",\n  "Capacity(Teams)":25.0,\n  "Lab\\/Room Allocated Day 1 ":"Outside Auditorium",\n  "Lab\

In [9]:
def get_embedding(text: str, model: str = "text-embedding-ada-002") -> List[float]:
    text = text.replace("\n", " ")
    return openai_client.embeddings.create(input=[text], model=model).data[0].embedding


def ask_procom_chatbot(user_query: str) -> str:
    # Generate an embedded query from user query.
    embedded_query = get_embedding(user_query, model=embedding_model)

    # Retrieve the data most relevant to the user query.
    all_relevant_info = collection.query(
        query_embeddings=[embedded_query],
        n_results=1,
    )

    # Separating and concatenating the retrieved data.
    query_set = all_relevant_info["documents"][0]
    query_respond = "\n".join(str(item) for item in query_set)

    # Send all the retrieved info to the chatbot.
    chatbot_response = ask_openai(user_query, query_respond)
    return chatbot_response


def ask_openai(user_query: str, query_respond: str) -> str:
    system_prompt = f"""
        You are an informative chatbot specifically made for 
        an event called PROCOM. Your task is to answer user 
        questions based solely on the provided information. You can not answer the 
        user query from your knowledge. If the user's query is
        within the scope of the provided information, you 
        will provide an answer. However, if the query is not 
        relevant to the information provided, you will politely inform the user
        that it's out of your knowledge.
        Below is the available information:
        There are a total of 19 competitions listed below in Procom:
        AI Showdown, App Dev, Blockchain Blitz, Chatcraft, Code In The Dark, 
        Code Sprint, Competitive Programming, CTF, Database Design, Game Dev,
        Hackathon, LFR Rules, Psuedo War, ROBO SOCCER Rules, ROBO SUMO Rules, 
        ROBO WAR(Light Weight) Rules, Speed Debugging, UIUX, and Web Dev.
        INFORMATION  
        ####
        {query_respond}
        ####
        Your response must be easy for user to understand.
    """

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_query},
    ]

    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo", messages=messages, temperature=0
    )

    response_message = response.choices[0].message.content
    return response_message
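
The indexing in `all_relevant_info["documents"][0]` follows from the shape of a Chroma query result, which holds one inner list per query. The sketch below mimics that shape with a hand-written dictionary so the flow from retrieved text to chat messages can be followed without calling any API. `build_context`, `build_messages`, and `mock_result` are hypothetical names, and the shortened prompt is a simplified stand-in for the full system prompt above.

```python
from typing import Dict, List


def build_context(query_result: Dict[str, list]) -> str:
    """Join the documents retrieved for the first (and only) query."""
    first_query_docs = query_result["documents"][0]
    return "\n".join(str(doc) for doc in first_query_docs)


def build_messages(user_query: str, context: str) -> List[Dict[str, str]]:
    """Wrap the retrieved context in delimiters and pair it with the user query."""
    system_prompt = (
        "Answer using only the information between the #### markers.\n"
        "####\n" + context + "\n####"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_query},
    ]


# Hand-written example that only mimics the shape of a query result.
mock_result = {
    "ids": [["3"]],
    "documents": [["Game Dev: a team can have a maximum of 5 members."]],
}
messages = build_messages("How to make team?", build_context(mock_result))
```

Keeping the prompt construction in a pure function like this makes it possible to inspect exactly what would be sent to the model before making a request.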

In [13]:
# user_input = "How many events are there in Procom? can you name them?"
# user_input = "Tell me about game dev?"
user_input = "What are the rules of game dev? and how to make team?"

procom_chatbot_response = ask_procom_chatbot(user_input)
print(procom_chatbot_response)

In the Game Dev competition, the rules include using only four allowed engines (Unity, Godot, Unreal, Pygame), bringing your own laptops, submitting a Github link for your engine, no exact replica games allowed, no inappropriate aesthetics, and addressing technical issues before the event starts. 

To make a team, you can have a maximum of 5 members. Each team needs to ensure creativity, follow the given theme (2D Roguelike Pixelated), and enhance a half-cooked game provided. Teams will be judged based on their modifications, creativity, and overall game quality in two rounds: The Creative Forge and The Arena of the Legend.
