# Implementation of a RAG System for Data Science Knowledge Management


<h1> Domain Understanding</h1>
Kaggle is a highly popular platform in the data science community. As of 2021, it had over 8 million registered users.

Kaggle has gained popularity by running competitions that range from fun brain exercises to commercial contests that award monetary prizes and rank participants. These competitions often involve real-world problems and offer substantial prizes, attracting data scientists from around the world to participate and learn.

Moreover, Kaggle is not just a competition platform. It's a comprehensive data science ecosystem that includes public datasets, collaborative notebooks, and a robust discussion forum. This makes it a go-to resource for data scientists to learn new skills, collaborate on projects, and stay updated with the latest trends in the field.

<h1> Problem Description </h1>

The primary objective of this use case is to utilize the capabilities of Large Language Models (LLMs), under the Retrieval-Augmented Generation (RAG) architecture, to enhance the understanding and utilization of the Kaggle platform and Python programming language within an enterprise setting.

Tasks:

1. Kaggle Solution Summarization: The LLM will summarize Kaggle solution write-ups, providing concise and understandable summaries of complex data science solutions.

2. Kaggle Competition Concept Explanation: The LLM will explain or teach concepts from Kaggle competition solution write-ups, aiding in the understanding and learning of advanced data science techniques and methodologies.

3. Kaggle Platform Query Resolution: The LLM will answer common questions about the Kaggle platform, assisting users in navigating and utilizing the platform effectively.


<h1> Project Process </h1>


1. **Data Acquisition:**

 The data is first scraped from Kaggle’s documentation website, “How to use Kaggle”. This source covers Kaggle competitions, datasets, notebooks, and the API, and serves as the knowledge source for the large language model.

2. **Data Preprocessing:**

  The data is then broken down into smaller chunks and converted into a vector database for efficient querying. ChromaDB is used as the vector database to store the vectors and query them. The embedding function is HuggingFaceInstructEmbeddings, an embedding model that converts the chunks into vectors.

3. **Query for Relevant Data & Craft Response:**
  Here, the user query and context will be used to generate a response using the Gemma LLM. A prompt is used to combine the question and the retrieved documents from the vector database.

  The Gemma 7B Instruct model will be employed for this task. Given its substantial size, it will be accessed through the Hugging Face Inference API rather than loaded locally.
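
The retrieve-then-prompt step described above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's implementation: the bag-of-words `embed` function is a stand-in for the Instructor embeddings, the in-memory list is a stand-in for ChromaDB, and the `PROMPT_TEMPLATE` wording and sample chunks are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, embed(chunk)),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical prompt wording; the real prompt may differ
PROMPT_TEMPLATE = """Answer the question using only the context below.

Context:
{context}

Question: {question}
"""


def build_prompt(question, chunks, k=2):
    """Combine the question with the retrieved chunks into one prompt string."""
    context = "\n---\n".join(retrieve(question, chunks, k))
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Invented sample chunks, for illustration only
sample_chunks = [
    "Kaggle competitions rank participants on a leaderboard.",
    "Datasets can be uploaded in CSV or JSON format.",
    "Notebooks let you run code in the browser.",
]
prompt = build_prompt("How do competitions rank participants?", sample_chunks, k=1)
print(prompt)
```

The resulting string is what would be sent to the LLM. In the full pipeline, the similarity search is delegated to the vector database and the template to a prompt class.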

# Installing Libraries

In [None]:
# Install the required libraries
!pip install sentence-transformers==2.2.2  # Install a specific version of sentence-transformers that is compatible with the setup
!pip install unstructured  # Install the unstructured package, which provides functionalities related to handling unstructured data, such as text processing and NLP
!pip install langchain  # Install the langchain package for language processing
!pip install chromadb  # Install the Chroma database for efficient querying
!pip install InstructorEmbedding  # Install the InstructorEmbedding package for converting chunks into vectors


Collecting sentence-transformers==2.2.2
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.6.0->sentence-transformers==2.2.2)
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.6.0->sentence-transformers==2.2.2)
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_

# Importing Libraries

In [None]:
# Standard Python libraries for operating system and file operations
import os
import shutil

# Python library for making HTTP requests
import requests

# Python library for parsing HTML and XML documents
from bs4 import BeautifulSoup

# Standard Python library for regular expressions
import re

# Package that provides the INSTRUCTOR embedding model class
from InstructorEmbedding import INSTRUCTOR

# Module that provides embeddings using Hugging Face's models
from langchain_community.embeddings import HuggingFaceInstructEmbeddings

# Module that loads documents from a directory
from langchain.document_loaders import DirectoryLoader

# Module that splits text into smaller chunks recursively
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Module that provides a template for chat prompts
from langchain.prompts import ChatPromptTemplate

# Module that provides a vector database for efficient querying
from langchain.vectorstores import Chroma




# 1. Data Acquisition - Web Scraping


In [None]:
# Function to remove escape sequences and links from text
def special_character_removal(text):
    """
    Removes literal unicode and newline escape sequences and links from the input text.

    Args:
        text (str): The input text to be cleaned.

    Returns:
        str: The cleaned text with escape sequences and links removed.
    """

    # Remove escape sequences and links
    cleaned_text = re.sub(r'\\u[0-9a-fA-F]{4}', '', text)  # Remove unicode escape sequences
    cleaned_text = re.sub(r'\\n', '', cleaned_text)  # Remove newline escape sequences
    cleaned_text = re.sub(r'https?://\S+', '', cleaned_text)  # Remove URLs

    # Extract sentences using regex
    sentences_final = re.findall(r'([^.!?]+(?:[.!?]+|$))', cleaned_text)

    # Recombines the extracted sentences into a single string, separated by spaces.
    combined_text_final = ' '.join(sentences_final)

    # Returns the combined text with all extracted sentences.
    return combined_text_final
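
To make the effect of each regular expression concrete, the sketch below applies the same four patterns step by step to a short sample string. The sample text is invented for illustration. Its backslashes are literal characters, as they would be in text scraped from an embedded JSON payload.

```python
import re

# Hypothetical sample; \u2019 and \n are literal backslash sequences here (raw string)
sample = r"Kaggle\u2019s docs live at https://www.kaggle.com/docs today.\nRead them!"

step1 = re.sub(r'\\u[0-9a-fA-F]{4}', '', sample)  # drops the literal \u2019
step2 = re.sub(r'\\n', '', step1)                  # drops the literal \n
step3 = re.sub(r'https?://\S+', '', step2)         # drops the URL up to the next whitespace

# Split into sentences ending in ., ! or ? (or at the end of the string)
sentences = re.findall(r'([^.!?]+(?:[.!?]+|$))', step3)

for label, value in [("unicode", step1), ("newline", step2), ("url", step3)]:
    print(f"{label:8} -> {value}")
print(sentences)
```

Note two side effects: deleting the escape sequence also deletes the character it encoded (the apostrophe in this sample), and removing the URL leaves a double space behind.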


In [None]:
# List of endpoints from the Kaggle documentation web page
documentation_endpoints = ["competitions", "datasets", "notebooks", "api"]

documentation_url = "https://www.kaggle.com/docs/"
OUTPUT_DIRECTORY = "/content/DE_CW/"

# Create the output directory if it does not exist yet
os.makedirs(OUTPUT_DIRECTORY, exist_ok=True)

# Iterate through each endpoint of the documentation
for endpoint in documentation_endpoints:

    # Complete URL for the documentation page
    endpoint_url = f"{documentation_url}{endpoint}"
    # HTTP GET request for the documentation page
    response = requests.get(endpoint_url, timeout=30)

    if response.status_code != 200:  # Skip pages that could not be fetched
        print(f"{endpoint} skipped (HTTP status {response.status_code})")
        continue

    # Parse the binary response body as HTML using BeautifulSoup
    soup = BeautifulSoup(response.content, "html.parser")

    # Find the main content component of the page
    page_components = soup.find(class_="kaggle-component")
    if page_components is None:  # Skip pages without the expected component
        print(f"{endpoint} skipped (no 'kaggle-component' element found)")
        continue

    # Get HTML content as a string
    html_content_str = page_components.prettify()

    # Clean the HTML content by removing escape sequences and links
    cleaned_content = special_character_removal(html_content_str)

    # Save the cleaned content to a text file (the with block closes the file)
    with open(f"{OUTPUT_DIRECTORY}{endpoint}.txt", "w", encoding="utf-8") as file:
        file.write(cleaned_content)
    print(f"{endpoint} saved to {OUTPUT_DIRECTORY}")


competitions saved to /content/DE_CW/
datasets saved to /content/DE_CW/
notebooks saved to /content/DE_CW/
api saved to /content/DE_CW/
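
The saved text keeps markup fragments such as `h3 id=types-of-datasets`, as the loaded document shown later illustrates. A likely cause is that the page stores its HTML inside a JSON string with `\u003c`-style escapes, which the unicode-escape regex deletes along with the angle brackets they encode. The sketch below shows one possible alternative, offered as an assumption rather than the notebook's method: decode the escapes first, then keep only the text nodes. The `TextExtractor` class, the `html_fragment_to_text` helper, and the sample fragment are all hypothetical.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())


def html_fragment_to_text(escaped_fragment):
    """Decode \\uXXXX escapes, then strip HTML tags from the result.

    Assumes the fragment is the body of a valid JSON string literal
    (no unescaped double quotes or stray backslashes).
    """
    html = json.loads(f'"{escaped_fragment}"')
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.parts)


# Invented fragment in the escaped style described above
sample_fragment = (
    r"\u003ch3 id=\u0022types\u0022\u003eTypes of Datasets\u003c/h3\u003e"
    r"\u003cp\u003eKaggle supports many formats.\u003c/p\u003e"
)
print(html_fragment_to_text(sample_fragment))
```

With this approach the tags are removed as whole units, so attribute residue does not end up in the chunks that are later embedded.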


# 2. Data Preparation
The first part of the process involves loading the data. The load_txt_documents(OUTPUT_DIRECTORY) function is used for this purpose. It utilizes the DirectoryLoader() class to load all .txt files from the specified directory (OUTPUT_DIRECTORY).

The glob parameter is set to "*.txt", which means it will look for all files with the .txt extension. Once the documents are loaded, the function prints the number of documents found and returns a list of these documents.
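
The effect of the `"*.txt"` pattern can be checked with `pathlib`, assuming the loader applies the pattern the same way `Path.glob` does when recursion is not requested. The file names below are placeholders created in a temporary directory.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)

    # Placeholder files: two .txt files, one non-matching file, one nested .txt file
    for name in ["competitions.txt", "datasets.txt", "notes.md"]:
        (folder / name).write_text("placeholder", encoding="utf-8")
    (folder / "nested").mkdir()
    (folder / "nested" / "api.txt").write_text("placeholder", encoding="utf-8")

    # "*.txt" matches only .txt files directly inside the folder
    matched = sorted(path.name for path in folder.glob("*.txt"))
    print(matched)
```

Under that assumption, files with other extensions and files in subdirectories are not picked up, so all scraped `.txt` files should sit directly in the output directory.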

In [None]:
# Function to load .txt data from a folder
def load_txt_documents(OUTPUT_DIRECTORY):
    """
    Load .txt documents from the specified directory.

    Args:
        OUTPUT_DIRECTORY (str): The path to the directory containing the .txt documents.

    Returns:
        list: A list of loaded .txt documents.
    """
    # Use DirectoryLoader from langchain package to load documents
    loader = DirectoryLoader(OUTPUT_DIRECTORY, glob="*.txt")
    # Load the documents
    documents = loader.load()
    # Print the number of documents found
    print(f"{len(documents)} document(s) found.")
    # Return the loaded documents
    return documents

In [None]:
# Load .txt data from folder
documents = load_txt_documents(OUTPUT_DIRECTORY)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


4 document(s) found.


In [None]:
documents[:1]

[Document(page_content='<script class="kaggle-component" nonce="FfUlfah2rDrp3nOitcrFdQ=="> var Kaggle=window. Kaggle||{};Kaggle. State=Kaggle. State||[];Kaggle. State. push({"title":"Datasets","subtitle":"Explore, analyze, and share quality data","imageUrl":"","mimeType":"text/html","pageContent":"hr width=100% align=left! --Types of Datasets--h3 id=types-of-datasetsTypes of Datasets/h3pKaggle supports a variety of dataset publication formats, but we strongly encourage dataset publishers to share their data in an accessible, non-proprietary format if possible. Not only are open, accessible data formats better supported on the platform, they are also easier to work with for more people regardless of their tools. /ppThis page describes the file formats that we recommend using when sharing data on Kaggle Datasets. Plus, learn why and how to make less well-supported file types as accessible as possible to the data science community. /p    h4 id=supported-file-typesSupported File Types/h4  

After the documents are loaded, they are then split into smaller, more manageable chunks. This is done using the split_text_to_chunks(documents) function, which utilizes the RecursiveCharacterTextSplitter() class.

The parameters for the text splitter are set as follows:
1. chunk_size=1000: This specifies the maximum size of each chunk. In this case, each chunk will contain up to 1000 characters.

2. chunk_overlap=50: This is the number of characters that will overlap between adjacent chunks. This overlap can help ensure that no important information is lost at the boundaries between chunks.

3. length_function=len: This specifies the function used to calculate the length of the text. In this case, the built-in len function is used, which returns the number of characters in the text.

4. add_start_index=True: This indicates that the start index of each chunk (relative to the original document) should be included in the metadata for each chunk.

The split_text_to_chunks(documents) function splits each document into chunks and then prints the number of chunks created. It then returns a list of these chunks.
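
To illustrate how `chunk_size`, `chunk_overlap`, and the start index relate, here is a simplified fixed-window splitter. It is only a sketch of the sliding-window idea: RecursiveCharacterTextSplitter additionally tries to break on separators such as paragraphs and sentences, so its chunk boundaries will generally differ. The `split_with_overlap` function and the dictionary layout are hypothetical.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=50):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):  # this window already reached the end
            break
    return chunks


# 2500 characters of filler text
text = "".join(str(i % 10) for i in range(2500))
demo_chunks = split_with_overlap(text)

for chunk in demo_chunks:
    print(chunk["start_index"], len(chunk["text"]))
```

With these settings, consecutive chunks start 950 characters apart, and the last 50 characters of one chunk are repeated at the start of the next.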

In [None]:
# Split documents into chunks of bounded size
def split_text_to_chunks(documents):

    """
    Split documents into chunks based on recursive character text splitter.

    Args:
        documents (list): List of documents to be split into chunks.

    Returns:
        list: List of chunks containing the split content of the documents.
    """

    # Initialize text splitter
    text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1000,
                                                   chunk_overlap = 50,
                                                   length_function = len,
                                                   add_start_index = True)

    # Split documents into chunks using the initialized text splitter
    chunks = text_splitter.split_documents(documents)

    # Print the number of documents and chunks for verification
    print(f"Split {len(documents)} documents into {len(chunks)} chunks.")

    # Select a fixed chunk (the 11th) for inspection
    document = chunks[10]

    # Print the content and metadata mentioning the source and the start index of the selected chunk for verification
    print(document.page_content)
    print(document.metadata)

    # Return the list of chunks for further processing
    return chunks

In [None]:
# Split the loaded documents into chunks
chunks = split_text_to_chunks(documents)

Split 4 documents into 156 chunks.
--Searching for Datasets--h3 id=searching-for-datasetsSearching for Datasets/h3pDatasets is not just a simple data repository. Each dataset is a community where you can discuss data, discover public code and techniques, and create your own projects in Notebooks. You can find many different interesting datasets of all shapes and sizes if you take the time to look around and find them! /ppThe latest and greatest from Datasets is surfaced on Kaggle in several different places. /p    h4 id=newsfeedNewsfeed/h4    pWhen youre logged into your Kaggle account, the a href= homepage/a provides a live newsfeed of what people are doing on the platform. New Datasets uploaded by people you follow and hot Datasets with lots of activity will show up here. By browsing down the page you can check out all the latest updates from your fellow Kagglers. /p    pYou can tweak your news feed to your liking by following other Kagglers. To follow someone, go to their profile pa

 Based on the parameters set for the text splitter, the 4 documents are split into 156 chunks. This means that the original documents have been broken down into 156 smaller pieces of text, each containing up to 1000 characters, ready for further processing or analysis.