# Building RAG Chatbots for Technical Documentation

## Table of contents

- [Introduction](#introduction)
- [Environment Setup](#environment-setup)
- [Load and split the document](#load-and-split-the-document)
- [Generate and store the embeddings](#generate-and-store-the-embeddings)

## Introduction

This project implements retrieval-augmented generation (RAG) with `LangChain` to build a chatbot that
answers questions about technical documentation. The document chosen for this assignment is the European Union Medical Device Regulation, Regulation (EU) 2017/745 (EU MDR).
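
The retrieve-then-generate idea behind RAG can be sketched in plain Python. This is a toy illustration, not LangChain code: the helper names `retrieve` and `build_prompt` are hypothetical, word overlap stands in for vector search, and the sample sentences are invented rather than quoted from the regulation.

```python
# Toy illustration of the retrieve-then-generate flow; not LangChain code.
# retrieve() and build_prompt() are hypothetical helper names.

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by shared words with the question (a stand-in for vector search)."""
    q_words = set(question.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Put the retrieved chunks ahead of the question so the model answers from them."""
    joined = "\n\n".join(context)
    return f"Answer using only this context:\n\n{joined}\n\nQuestion: {question}"


# Invented sample sentences, used only to exercise the functions
chunks = [
    "Manufacturers shall draw up technical documentation for each device.",
    "Notified bodies assess the conformity of devices.",
    "The weather was mild that year.",
]
question = "Who must draw up technical documentation?"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In a real pipeline the prompt would then be sent to a language model; that step is omitted here.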

## Environment Setup

Install the packages the notebook depends on:

In [1]:
# Install LangChain, its Chroma and Hugging Face integrations, and the HTML loader dependencies
%pip install -qU langchain langchain-community langchain-chroma langchain-text-splitters unstructured sentence_transformers langchain-huggingface

Note: you may need to restart the kernel to use updated packages.
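
After restarting the kernel, it can help to confirm that the packages are visible to the interpreter. The snippet below is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the install command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


packages = [
    "langchain",
    "langchain-community",
    "langchain-chroma",
    "langchain-text-splitters",
    "unstructured",
    "sentence-transformers",
    "langchain-huggingface",
]
for name, found in installed_versions(packages).items():
    print(f"{name}: {found or 'not installed'}")
```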


## Load and split the document

In [2]:
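from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import UnstructuredHTMLLoader

# Path to the HTML document to index
file_path = "document.html"

loader = UnstructuredHTMLLoader(file_path)
# Chunks of up to 1000 characters with 200 characters of overlap;
# add_start_index records each chunk's character offset in its metadata
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
docs = loader.load_and_split(text_splitter)

# Inspect the second chunk, then print the total number of chunks
print(f"{docs[1].metadata}\n")
print(docs[1].page_content)
print(len(docs))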



{'source': 'document.html', 'start_index': 763}

After consulting the Committee of the Regions,

Acting in accordance with the ordinary legislative procedure (2),

Whereas:

(1) Council Directive 90/385/EEC ( 3 ) and Council Directive 93/42/EEC ( 4 ) constitute the Union regulatory framework for medical devices, other than in vitro diagnostic medical devices. However, a fundamental revision of those Directives is needed to establish a robust, transparent, predictable and sustainable regulatory framework for medical devices which ensures a high level of safety and health whilst supporting innovation.
853
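
To make the roles of `chunk_size`, `chunk_overlap`, and `start_index` concrete, here is a simplified sliding-window chunker. It is only a sketch and not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter breaks on separators such as paragraph and line breaks, which is why its offsets (for example the 763 above) do not fall on a fixed stride. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[dict]:
    """Cut text into fixed-size windows in which neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(25))  # 25 characters of dummy text
for chunk in sliding_chunks(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.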


## Generate and store the embeddings

In [3]:
# Generate and store the embeddings
from langchain_chroma import Chroma
from langchain_huggingface.embeddings import HuggingFaceEmbeddings

# Embed each chunk with a sentence-transformers model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

# Index the chunks in an in-memory Chroma collection (no persist_directory is set)
vectorstore = Chroma.from_documents(documents=docs, embedding=embeddings)
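
Once chunks are stored as vectors, retrieval amounts to finding the stored vectors closest to the query vector. The sketch below shows that idea with cosine similarity over hand-written three-dimensional vectors. It is an assumption-laden toy: the vectors and labels are made up, real embeddings from `all-mpnet-base-v2` have far more dimensions, and Chroma's actual index and distance metric may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the texts of the k stored items most similar to the query vector."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up (text, vector) pairs standing in for embedded chunks
store = [
    ("chunk about clinical evaluation", [0.9, 0.1, 0.0]),
    ("chunk about notified bodies", [0.1, 0.9, 0.1]),
    ("chunk about labelling", [0.0, 0.2, 0.9]),
]
query = [0.8, 0.2, 0.1]
print(top_k(query, store, k=2))
```

With the real `vectorstore`, the equivalent lookup is `vectorstore.similarity_search(query, k=...)`, which embeds the query text and returns the closest chunks.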



