# Implementation of a RAG System for Data Science Knowledge Management


<h1> Domain Understanding</h1>
Kaggle is a highly popular platform in the data science community. As of 2021, it had over 8 million registered users.

Kaggle has gained popularity by running competitions that range from fun brain exercises to commercial contests that award monetary prizes and rank participants. These competitions often involve real-world problems and offer substantial prizes, attracting data scientists from around the world to participate and learn.

Moreover, Kaggle is not just a competition platform. It's a comprehensive data science ecosystem that includes public datasets, collaborative notebooks, and a robust discussion forum. This makes it a go-to resource for data scientists to learn new skills, collaborate on projects, and stay updated with the latest trends in the field.

<h1> Problem Description </h1>

The goal of this use case is to apply Large Language Models (LLMs) in a Retrieval-Augmented Generation (RAG) architecture to help users in an enterprise setting understand and use the Kaggle platform and the Python programming language.

Tasks:

1. Kaggle Solution Summarization: The LLM will summarize Kaggle solution write-ups, providing concise and understandable summaries of complex data science solutions.

2. Kaggle Competition Concept Explanation: The LLM will explain or teach concepts from Kaggle competition solution write-ups, aiding in the understanding and learning of advanced data science techniques and methodologies.

3. Kaggle Platform Query Resolution: The LLM will answer common questions about the Kaggle platform, assisting users in navigating and utilizing the platform effectively.


<h1> Project Process </h1>


1. **Data Acquisition:**

 The data is scraped from Kaggle’s documentation website, “How to use Kaggle”. This source covers Kaggle competitions, datasets, notebooks, discussions, and other platform documentation. The scraped text serves as the knowledge source for the large language model.

2. **Data Preprocessing:**

  The data is then split into smaller chunks, and each chunk is converted into a vector and stored in a vector database for efficient querying. ChromaDB serves as the vector database: it stores the vectors and handles queries against them. The embedding function is HuggingFace InstructEmbeddings, an embedding model that converts the chunks into vectors.

3. **Query for Relevant Data & Craft Response:**
  The user query is used to retrieve the most relevant chunks from the vector database. A prompt template combines the question with the retrieved chunks, and the Gemma LLM generates the response from that prompt.

  The Gemma 7B Instruct model is used for this task. Because of its size, the model is accessed through the Hugging Face Inference API rather than loaded locally.
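
The three steps above can be sketched end to end in plain Python. This is a minimal illustration under stated assumptions, not the notebook's implementation: the word-count `embed` function stands in for HuggingFace InstructEmbeddings, the in-memory list stands in for ChromaDB, and the helper names (`chunk_text`, `retrieve`, `build_prompt`), the template wording, and the sample documents are all hypothetical. The final prompt is what would be sent to the LLM.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character windows (hypothetical helper)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def embed(text):
    """Toy embedding: lower-cased word counts. A real pipeline would call an embedding model here."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query (stand-in for a vector database query)."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine_similarity(query_vec, embed(c)), reverse=True)
    return ranked[:k]


# Assumed template wording; the notebook's actual prompt may differ
PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
)


def build_prompt(question, retrieved_chunks):
    """Combine the question and the retrieved chunks into a single prompt string."""
    context = "\n---\n".join(retrieved_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Illustrative documents, already short enough to act as single chunks
documents = [
    "Kaggle competitions let participants submit predictions and climb a leaderboard.",
    "Kaggle datasets can be published publicly or kept private.",
    "Kaggle notebooks provide a hosted environment for running code.",
]
question = "How do I publish datasets?"
top_chunks = retrieve(question, documents, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

Swapping `embed` for a real embedding model and `retrieve` for a vector database query leaves the overall flow unchanged: chunk, embed, retrieve, then assemble the prompt.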

# Installing Libraries

In [None]:
# Install the required libraries
!pip install sentence-transformers==2.2.2  # Install a specific version of sentence-transformers that is compatible with the setup
!pip install unstructured  # Install the unstructured package, which provides functionalities related to handling unstructured data, such as text processing and NLP
!pip install langchain  # Install the langchain package for language processing
!pip install chromadb  # Install the Chroma database for efficient querying
!pip install InstructorEmbedding  # Install the InstructorEmbedding package for converting chunks into vectors



# Importing Libraries

In [None]:
# Standard Python libraries for operating system and file operations
import os
import shutil

# Python library for making HTTP requests
import requests

# Python library for parsing HTML and XML documents
from bs4 import BeautifulSoup

# Standard Python library for regular expressions
import re

# INSTRUCTOR model class from the InstructorEmbedding package (instruction-based text embeddings)
from InstructorEmbedding import INSTRUCTOR

# Module that provides embeddings using Hugging Face's models
from langchain_community.embeddings import HuggingFaceInstructEmbeddings

# Module that loads documents from a directory
from langchain.document_loaders import DirectoryLoader

# Module that splits text into smaller chunks recursively
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Module that provides a template for chat prompts
from langchain.prompts import ChatPromptTemplate

# Module that provides a vector database for efficient querying
from langchain.vectorstores import Chroma




# 1. Data Acquisition - Web Scraping


In [None]:
# Function to remove literal escape sequences and links from text
def special_character_removal(text):
    """
    Removes literal escape sequences and links from the input text.

    Args:
        text (str): The input text to be cleaned.

    Returns:
        str: The cleaned text with literal unicode and newline escape
            sequences and URLs removed.
    """

    # Remove literal escape sequences and links
    cleaned_text = re.sub(r'\\u[0-9a-fA-F]{4}', '', text)  # Remove literal \uXXXX escape sequences
    cleaned_text = re.sub(r'\\n', '', cleaned_text)  # Remove literal "\n" sequences (backslash + n), not real newlines
    cleaned_text = re.sub(r'https?://\S+', '', cleaned_text)  # Remove URLs

    # Extract sentences using regex
    sentences_final = re.findall(r'([^.!?]+(?:[.!?]+|$))', cleaned_text)

    # Recombines the extracted sentences into a single string, separated by spaces.
    combined_text_final = ' '.join(sentences_final)

    # Returns the combined text with all extracted sentences.
    return combined_text_final
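
The pattern `r'\\n'` used in the function matches the two-character sequence backslash followed by `n`, as found in escaped text, and leaves real newline characters untouched. The short sketch below illustrates the difference on made-up sample strings. The final `\s+` normalization is an assumption added for illustration and is not part of the function above.

```python
import re

# Literal backslash + "n" (two characters), as found in escaped text
escaped = r"first line\nsecond line"
# A real newline character (one character)
actual = "first line\nsecond line"

pattern = r'\\n'  # same pattern as in special_character_removal

# The literal sequence is removed, which glues the neighbouring words together
removed_from_escaped = re.sub(pattern, '', escaped)
# The real newline does not match the pattern and is left in place
removed_from_actual = re.sub(pattern, '', actual)

print(removed_from_escaped)
print(repr(removed_from_actual))

# Hypothetical extra step: collapse real whitespace (including newlines) to single spaces
collapsed = re.sub(r'\s+', ' ', actual).strip()
print(collapsed)
```

If the scraped text contains real newlines that should be flattened, a whitespace-collapsing step like the last one is one possible addition.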


In [None]:
# List of endpoints from the Kaggle documentation web page
documentation_endpoints = ["competitions", "datasets", "notebooks", "api"]

documentation_url = "https://www.kaggle.com/docs/"
OUTPUT_DIRECTORY = "/content/DE_CW/"

# Create the output directory if it does not already exist
os.makedirs(OUTPUT_DIRECTORY, exist_ok=True)

# Iterate through each endpoint of the documentation
for endpoint in documentation_endpoints:

    # Complete URL for the documentation page
    endpoint_url = f"{documentation_url}{endpoint}"
    # HTTP GET request for the documentation page
    response = requests.get(endpoint_url, timeout=30)

    if response.status_code != 200:  # Skip pages that fail to load
        print(f"{endpoint} could not be fetched (status {response.status_code})")
        continue

    # Parse the binary response body as HTML using BeautifulSoup
    soup = BeautifulSoup(response.content, "html.parser")

    # Find the main content component of the page
    page_components = soup.find(class_="kaggle-component")
    if page_components is None:  # Guard against a changed page layout
        print(f"{endpoint} has no 'kaggle-component' element; skipped")
        continue

    # Get HTML content as a string
    html_content_str = page_components.prettify()

    # Clean the HTML content by removing escape sequences and links
    cleaned_content = special_character_removal(html_content_str)

    # Save the cleaned content to a text file (the with block closes the file)
    with open(f'{OUTPUT_DIRECTORY}{endpoint}.txt', 'w', encoding='utf-8') as file:
        file.write(cleaned_content)
    print(f"{endpoint} saved to {OUTPUT_DIRECTORY}")


competitions saved to /content/DE_CW/
datasets saved to /content/DE_CW/
notebooks saved to /content/DE_CW/
api saved to /content/DE_CW/
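
The scraping loop above saves the output of `prettify()`, which still contains HTML tags. If plain text is preferred as the knowledge source, the markup can be stripped before saving. The sketch below uses only the standard library so that it runs offline. The class name `TextExtractor`, the helper `html_to_text`, and the sample HTML are hypothetical, and skipping `<script>` and `<style>` contents is an assumption about what counts as noise.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style> tags."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped tag
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return " ".join(self._parts)


def html_to_text(html):
    """Return the visible text of an HTML fragment as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up fragment shaped like a documentation page component
sample_html = (
    '<div class="kaggle-component"><h2>Datasets</h2>'
    '<p>Find <b>open</b> data.</p><script>var x = 1;</script></div>'
)
plain_text = html_to_text(sample_html)
print(plain_text)
```

With BeautifulSoup already in use, calling `get_text()` on the located element is an alternative way to obtain text without markup.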
