<a href="https://colab.research.google.com/github/Anze-/datathon2k25/blob/alberto/feature_engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **OrderFox Hackathon: Domain-Specific Chatbot Design** 🚀  

## **📌 Overview**  
In this hackathon, your goal is to design and implement a domain-specific chatbot using **Retrieval-Augmented Generation (RAG)**. You will be provided with a **dataset of HTML-crawled documents** in JSON format, and your task is to build a system that effectively retrieves relevant information and generates accurate responses.  

## **🔹 What You Need to Do**  
1. **Set up your own environment** – Use a GitHub repository, a copy of this Colab notebook, or a Kaggle notebook.
2. **Load and Process Documents** – Extract text from the json files provided to you.
3. **Implement Document Retrieval** – Try different retrieval approaches, including:  
   - Keyword-based search  
   - Vector embeddings (e.g. Bag-of-Words, Word2Vec, embedding models on Hugging Face, embedding models provided through the OpenAI API) + vector database
   - Graph / Tree extraction + graph database
   - Hybrid methods   
4. **Perform Response Generation** – Retrieve relevant documents and generate responses using a Language Model:
  - Feel free to explore prompt engineering.

5. **Optimize and Evaluate Your System** – Compare performance based on relevance, grounding, fluency, efficiency, and cost.  
6. **Deliver a Well-Structured Solution** – Organize your code as a modular project repository.  
 - Code repository or notebook
 - Make sure to include your knowledge database
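
As a reference point for the keyword-based option in step 3, here is a minimal sketch of term-count scoring in plain Python. The function names (`tokenize`, `keyword_search`) and the sample texts are illustrative assumptions and not part of the starter kit; a real system would more likely rely on BM25 or a search engine such as Elasticsearch.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def keyword_search(query, documents, top_k=3):
    """Rank documents by how often the query terms occur in them."""
    query_terms = set(tokenize(query))
    scored = []
    for idx, doc in enumerate(documents):
        counts = Counter(tokenize(doc))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((score, idx))
    # Highest score first; ties are broken by original document order
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(idx, score) for score, idx in scored[:top_k]]

# Made-up sample texts, used only to exercise the function
sample_docs = [
    "CNC milling and turning services for aluminium parts.",
    "Injection moulding of plastic housings.",
    "Precision CNC machining, milling, and surface finishing.",
]
results = keyword_search("CNC milling", sample_docs)
```

`results` holds `(document_index, score)` pairs, which makes it easy to compare this baseline against the embedding-based approaches on the same queries.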

## **📦 What Is Provided?**  
✅ **Dataset**: JSON files containing extracted HTML content, available on a shared Google Drive and Kaggle.  
✅ **Baseline Notebook**: A starter kit with useful tools and guidance.  
✅ **Evaluation Metrics**: A structured evaluation framework to assess performance.  

## **🖥️ What Computing Resources Can You Use?**  
You are welcome to develop your solution on your preferred platform, either:   
- Locally (on your own machine) 💻  
- Using **cloud platforms** such as **Google Colab** or **Kaggle** (a free account is sufficient).  

## **🛠️ Tools You May Consider**  
(*These are recommendations to help you get started. You are free to use alternative tools—just document your choices clearly!*)  
- **Database**: FAISS, ChromaDB, SQLite, Elasticsearch, Neo4j, etc.  
- **Embedding Models**: Hugging Face Sentence-Transformers, OpenAI Embeddings  
- **LLM for Generation**: OpenAI: gpt-4o-mini
- **Others**: LangChain, GraphRAG, etc.

## **📌 Final Delivery**  
Your final submission should include:  
✅ A well-documented **GitHub repository or notebook**  
✅ A clear **README** explaining your approach  
✅ Structured **retrieval and generation modules**  

### **🔥 Bonus Points For**  
✨ Innovative retrieval techniques  
✨ Well-organized, modular code  
✨ Creative visualizations or user interfaces  


# 1. Set up working environment

In [1]:
!pip install openai

# Database options
!pip install chromadb # if you use chromadb as your vector database

# Others
!pip install langchain-community # if you use langchain for orchestration
!pip install transformers # if you use huggingface for vector embedding



In [2]:
# Enable the GPU if needed; it can speed up vector embedding when you compute the vectors locally (rather than through an API)

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Using device: cuda


In [3]:

import os
import json
import chromadb
import openai
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA

# Set OpenAI API Key
os.environ["OPENAI_API_KEY"] = ""
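
Rather than pasting the key into the notebook, it can be read from an environment variable that was set beforehand. The sketch below is one possible way to fail early with a clear message when the key is missing; the helper name `get_openai_api_key` is a hypothetical addition, not part of the starter kit.

```python
import os

def get_openai_api_key(env=None):
    """Return the API key from the environment, or raise a clear error."""
    # `env` defaults to the process environment; a dict can be passed for testing
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running the notebook."
        )
    return key
```

Calling `get_openai_api_key()` at the top of the notebook surfaces a missing key immediately instead of at the first API request.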


# 2. Knowledge Base Preparation

## 2.1 Load documents

Once you have been granted access to the shared folder, it will appear under "Shared drives" in your Google Drive. You can then mount your drive as shown below and access the data from "/content/drive/Shared drives/Datathon/Data/hackathon_data/". Enjoy the ride! :)

In [4]:
# Load the Drive and mount
# from google.colab import drive
# drive.mount('/content/drive/')

Load the JSON files.

In [5]:
#folder_path = "/content/drive/Shared drives/Datathon/Data/hackathon_data/"# Google drive path of the dataset
folder_path = "data/hackathon_data"
files_in_folder = os.listdir(folder_path)

len(files_in_folder)

13144

In [6]:
def load_documents(json_file):
    """Load a JSON file and return its content, or an empty dict if parsing fails."""
    with open(json_file, 'r', encoding="utf-8") as f:
        try:
            return json.load(f)
        except json.JSONDecodeError:
            print(f"Error reading {json_file}, it may not be a valid JSON file.")
    # Return an empty dict (not a list) because callers index the result by key
    return {}



In [7]:
docs= []

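from urllib.parse import urlparse

def document_should_skip(url):
    """Return True if the URL path ends with a non-HTML file extension."""
    extensions = (".js", ".css", ".pdf", ".csv", ".doc", ".docx")
    # Check only the end of the URL path: a substring test would also
    # match ".js" inside ".json" or ".jsp"
    return urlparse(url).path.lower().endswith(extensions)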


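# Only the first 100 files are loaded here to keep the experiment small
for filename in files_in_folder[:100]:
    if filename.endswith('.json'):
        file_path = os.path.join(folder_path, filename)
        doc = load_documents(file_path)
        if not doc:
            continue  # skip files that could not be parsed
        for page_url, page in doc.get('text_by_page_url', {}).items():
            if not document_should_skip(page_url):
                docs.append(page)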


In [8]:
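# Average document length in characters, for reference when picking a chunk size
avg = sum(len(doc) for doc in docs) / max(len(docs), 1)
# Chunk size in characters used by the next cell (fixed, not derived from avg)
doc_avg = 500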

In [9]:
chunked_docs = []

for doc in docs:
    if len(doc) > doc_avg:
        for i in range(0, len(doc), doc_avg):
            chunked_docs.append(doc[i : i + doc_avg])
    else:
        chunked_docs.append(doc)
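
The fixed-size split above can cut a sentence or entity in half at a chunk boundary. A common mitigation is to let consecutive chunks overlap by a few characters. The following is a minimal sketch of that idea; the function name `chunk_with_overlap` and the default `overlap=50` are assumptions chosen for illustration, not values used in this notebook.

```python
def chunk_with_overlap(text, chunk_size=500, overlap=50):
    """Split text into character windows that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

# Small example: windows of 4 characters that overlap by 1
example = chunk_with_overlap("abcdefghij", chunk_size=4, overlap=1)
```

Overlap increases the number of chunks, and therefore embedding cost, so the value is worth tuning against retrieval quality.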

In [10]:
len(chunked_docs)

110914

In [11]:
from gliner import GLiNER

torch.cuda.empty_cache()

# Initialize GLiNER with the small v2.1 model
model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")

# Move the model to the device selected earlier (GPU if available)
model.to(device)

# Labels for entity prediction
# Most GLiNER models should work best when entity types are in lower case or title case
labels = ["Technology", "Service", "Material", "Product", "Industry", "Region"]

# Alternative: predict entities one document at a time
# for chunked_doc in docs[:1024]:
#     entities_to_chunked_docu = model.predict_entities(chunked_doc, labels, threshold=0.6)

# Run batched entity prediction over all chunks and keep the result
entities = model.run(chunked_docs, labels, threshold=0.6, batch_size=32)

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


OutOfMemoryError: CUDA out of memory. Tried to allocate 1.12 GiB. GPU 0 has a total capacity of 5.92 GiB of which 876.69 MiB is free. Process 11392 has 4.76 MiB memory in use. Including non-PyTorch memory, this process has 4.09 GiB memory in use. Of the allocated memory 2.91 GiB is allocated by PyTorch, and 1.10 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
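
One common way to lower peak GPU memory is to hand the model fewer texts at a time, usually combined with a smaller batch size. The sketch below shows only the slicing logic and may or may not be enough to avoid the error above on a given GPU. The helper names (`iter_slices`, `predict_in_slices`) and `fake_predict` are hypothetical stand-ins so the example runs without a model.

```python
def iter_slices(items, slice_size):
    """Yield consecutive slices of `items` with at most `slice_size` elements."""
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    for start in range(0, len(items), slice_size):
        yield items[start : start + slice_size]

def predict_in_slices(predict_fn, texts, slice_size=256):
    """Call `predict_fn` on one slice at a time and concatenate the results."""
    results = []
    for batch in iter_slices(texts, slice_size):
        # predict_fn is assumed to return one result per input text
        results.extend(predict_fn(batch))
    return results

# Stand-in for the model call so the sketch runs without a GPU
def fake_predict(batch):
    return [len(text) for text in batch]

lengths = predict_in_slices(fake_predict, ["a", "bb", "ccc", "dddd", "eeeee"], slice_size=2)
```

With the objects from this notebook, `predict_fn` could be something like `lambda batch: model.run(batch, labels, threshold=0.6, batch_size=8)`, assuming `model.run` returns one result per text; the slice and batch sizes are guesses to tune for the available memory.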

In [21]:
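import spacy

spacy.prefer_gpu()
# Load only the NER component; the other pipeline components are not needed here
nlp = spacy.load("en_core_web_sm", enable=["ner"])
processed_docs = nlp.pipe(chunked_docs, n_process=1)

# Keep only (text, label) pairs so the full Doc objects are not held in memory
doc_ents = []
for doc in processed_docs:
    doc_ents.append([(ent.text, ent.label_) for ent in doc.ents])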

KeyboardInterrupt: 
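
Once entities have been extracted, a quick frequency count helps judge whether they are useful as retrieval features. The sketch below assumes each document's entities are stored as `(text, label)` pairs, for example built from `ent.text` and `ent.label_`; the helper name `count_entities` and the sample data are made up for illustration.

```python
from collections import Counter

def count_entities(doc_entities):
    """Count (text, label) pairs across all documents, ignoring case in the text."""
    counts = Counter()
    for ents in doc_entities:
        for text, label in ents:
            counts[(text.lower(), label)] += 1
    return counts

# Hypothetical sample using spaCy-style labels (GPE, ORG)
sample = [
    [("Germany", "GPE"), ("Siemens", "ORG")],
    [("germany", "GPE")],
]
top = count_entities(sample).most_common(1)
```

Running this on a small subset of `chunked_docs` first gives a feel for entity quality before committing to a full pass over all chunks.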