## **Title: Integration of Retrieval Augmented Generation (RAG) with Open Source LLM and LangChain for Autism Intervention Research.**

### **Objective: This assignment develops a Retrieval Augmented Generation (RAG) system using LangChain and an open-source LLM with fewer than 7B parameters. The RAG system is built on a vector database of the provided publications. It is expected to retrieve, summarize, and generate relevant research findings on autism, therapy, and intervention in response to a user query.**

In [43]:
# Written by: Ashrey, IIT Madras

### **Dependencies**

In [1]:
!pip -q install langchain;
!pip -q install tiktoken;
!pip -q install chromadb;
!pip -q install pypdf;
!pip -q install sentence-transformers==2.2.2;
!pip -q install InstructorEmbedding;
!pip -q install faiss-cpu;

### **Libraries**

In [2]:
import os

In [3]:
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders import DirectoryLoader

In [4]:
from InstructorEmbedding import INSTRUCTOR
from langchain.embeddings import HuggingFaceInstructEmbeddings


### **Load the data from the papers**

The research papers can be downloaded from the link: [Autism/Therapy Research Papers](https://drive.google.com/drive/folders/1mfPteojPewXLWgXMthS15D7S-525b5CA?usp=drive_link)


**NOTE: Make sure to keep the papers (PDFs) in the "papers" directory.**

In [5]:
# Load every PDF in the "papers" directory; PyPDFLoader yields one Document per page
loader = DirectoryLoader("papers/", glob="./*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()



In [6]:
# documents

In [7]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1000, chunk_overlap = 200)
texts = text_splitter.split_documents(documents)

In [8]:
texts[0]

Document(page_content='1\nSCiENtifiC  REPORtS  | (2018) 8:17008 | DOI:10.1038/s41598-018-35215-8www.nature.com/scientificreportsAtypical postural control can be \ndetected via computer vision \nanalysis in toddlers with autism \nspectrum disorder\nGeraldine Dawson   1, Kathleen Campbell2, Jordan Hashemi1,3, Steven J. Lippman n  4, \nValerie Smith4, Kimberly Carpente r1, Helen Egger5, Steven Espinosa3, Saritha Vermeer1, \nJeffrey\xa0 Baker6 & Guillermo Sapiro3,7\nEvidence\xa0 suggests\xa0 that\xa0differences\xa0 in\xa0motor\xa0function\xa0 are\xa0an\xa0early\xa0feature\xa0of\xa0autism\xa0spectrum\xa0 disorder\xa0\n(ASD).\xa0One\xa0aspect\xa0of\xa0motor\xa0ability\xa0that\xa0develops\xa0 during\xa0childhood\xa0 is\xa0postural\xa0 control,\xa0reflected\xa0 in\xa0the\xa0\nability to maintain a steady head and body position without excessive sway. Observational studies have \ndocumented\xa0 differences\xa0 in\xa0postural\xa0 control\xa0in\xa0older\xa0children\xa0with\xa0ASD.\xa0The\xa0prese
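
To make the `chunk_size` and `chunk_overlap` settings above concrete, here is a minimal sketch of fixed-size chunking with overlap on a plain string. It is only an illustration: `RecursiveCharacterTextSplitter` also prefers to break on separators such as paragraph and line breaks, which the hypothetical `sliding_chunks` helper below ignores.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split `text` into windows of `chunk_size` characters that share
    `chunk_overlap` characters with the previous window (illustrative only)."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Dummy input (assumption): 2,500 characters of repeating digits
sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])
```

Each window starts 800 characters after the previous one, so consecutive chunks share 200 characters. The overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.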

### **Getting the embeddings from the documents/research papers**

In [9]:
import pickle
import faiss
from langchain.vectorstores import FAISS

In [10]:
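def store_embeddings(docs, embeddings, sotre_name, path):
    # Build a FAISS index from the documents and pickle it under `path`.
    os.makedirs(path, exist_ok=True)  # create the target directory if it is missing
    vectorStore = FAISS.from_documents(docs, embeddings)

    with open(f"{path}/faiss_{sotre_name}.pkl", "wb") as f:
        pickle.dump(vectorStore, f)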

In [11]:
def load_embeddings(sotre_name, path):
    with open(f"{path}/faiss_{sotre_name}.pkl", "rb") as f:
        VectorStore = pickle.load(f)
    return VectorStore
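
The two helpers above persist the index with `pickle`. The round trip can be sketched with a stand-in object. The plain dictionary, the helper names `save_store` and `load_store`, and the temporary directory are assumptions for illustration only; whether a real `FAISS` vector store pickles cleanly may depend on the installed LangChain and faiss versions.

```python
import os
import pickle
import tempfile


def save_store(store, store_name, path):
    """Pickle `store` to `{path}/faiss_{store_name}.pkl`."""
    os.makedirs(path, exist_ok=True)  # make sure the target directory exists
    with open(os.path.join(path, f"faiss_{store_name}.pkl"), "wb") as f:
        pickle.dump(store, f)


def load_store(store_name, path):
    """Load the object previously written by `save_store`."""
    with open(os.path.join(path, f"faiss_{store_name}.pkl"), "rb") as f:
        return pickle.load(f)


# Stand-in for a vector store (assumption): a dict of toy vectors
toy_store = {"doc-0": [0.1, 0.2]}

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "embedding_store")
    save_store(toy_store, "instructEmbeddings", target)
    restored = load_store("instructEmbeddings", target)

print(restored)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.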

In [12]:
from langchain.embeddings import HuggingFaceInstructEmbeddings
instructor_embeddings = HuggingFaceInstructEmbeddings(model_name = "hkunlp/instructor-xl",
                                                      model_kwargs = {"device": "cuda"})

load INSTRUCTOR_Transformer
max_seq_length  512


In [13]:
Embedding_store_path = "embedding_store"

In [14]:
# store_embeddings(texts,
#                  instructor_embeddings,
#                  sotre_name='instructEmbeddings',
#                  path=Embedding_store_path)

In [15]:
# db_instructEmbedd = load_embeddings(sotre_name='instructEmbeddings',
#                                     path=Embedding_store_path)

In [16]:
db_instructEmbedd = FAISS.from_documents(texts, instructor_embeddings)

In [17]:
retriever = db_instructEmbedd.as_retriever(search_kwargs = {"k": 5})

In [18]:
retriever.search_type

'similarity'
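
With `search_type` set to `'similarity'` and `k = 5`, the retriever returns the five chunks whose embeddings are closest to the query embedding. The ranking idea can be sketched in plain Python as below. The toy 2-D vectors and the `top_k` helper are assumptions, and ranking by Euclidean distance is also an assumption about how the index is configured; the real FAISS index uses its own data structures rather than this brute-force loop.

```python
import math


def top_k(query_vec, doc_vecs, k=5):
    """Return the ids of the k vectors closest to `query_vec` (brute force)."""
    scored = [
        (math.dist(query_vec, vec), doc_id)  # Euclidean distance to the query
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort()  # smallest distance first
    return [doc_id for _, doc_id in scored[:k]]


# Toy 2-D vectors standing in for real embeddings (assumption)
toy_index = {
    "chunk-a": (0.9, 0.1),
    "chunk-b": (0.1, 0.9),
    "chunk-c": (0.8, 0.3),
    "chunk-d": (0.5, 0.5),
}
print(top_k((1.0, 0.0), toy_index, k=2))
```

If the index holds fewer than `k` vectors, the helper simply returns all of them in ranked order.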

### **Queries and their summary responses**

In [19]:
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Summary {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

In [20]:
from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from langchain_community.document_transformers import EmbeddingsRedundantFilter
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.retrievers import ContextualCompressionRetriever

embeddings = HuggingFaceBgeEmbeddings()

splitter = CharacterTextSplitter(chunk_size = 550, chunk_overlap = 0, separator = ". ")

redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)

relevant_filter = EmbeddingsFilter(embeddings = embeddings, similarity_threshold = 0.80)

pipeline_compressor = DocumentCompressorPipeline(
    transformers = [splitter, redundant_filter, relevant_filter]
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor = pipeline_compressor, base_retriever = retriever
)
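
The pipeline above applies three stages in order: it re-splits the retrieved chunks into smaller pieces, drops near-duplicate pieces, and then keeps only the pieces that are similar enough to the query. The last two stages can be sketched on toy vectors as follows. The helpers `drop_redundant` and `keep_relevant`, the toy embeddings, and the redundancy threshold of 0.95 are assumptions for illustration; only the relevance threshold of 0.80 comes from the cell above, and the LangChain classes may differ in detail.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def drop_redundant(items, threshold=0.95):
    """Keep an item only if it is not too similar to one already kept.
    The 0.95 threshold is an assumption made for this sketch."""
    kept = []
    for text, vec in items:
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((text, vec))
    return kept


def keep_relevant(items, query_vec, threshold=0.80):
    """Keep items whose similarity to the query reaches the threshold."""
    return [(t, v) for t, v in items if cosine(v, query_vec) >= threshold]


# Toy pieces with made-up 2-D embeddings (assumption)
pieces = [
    ("eye contact as an early marker", (0.9, 0.1)),
    ("eye contact as an early sign", (0.91, 0.1)),   # near-duplicate of the first
    ("details of the funding sources", (0.1, 0.9)),  # off-topic for the query
]
query = (1.0, 0.0)
survivors = keep_relevant(drop_redundant(pieces), query)
print([text for text, _ in survivors])
```

Running the redundancy filter before the relevance filter means each group of near-identical pieces contributes at most one entry to the final summaries.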


**The code cells below collect the summaries for all the questions in one place. A separate cell for each query is provided further below.**

In [60]:
questions = [
        "What are the variety of Multimodal and Multi-modular AI Approaches to Streamline Autism Diagnosis in Young Children?",
        "What is Autism Spectrum Disorder, how it is caused?",
        "What is the cure of Autism Spectrum Disorder?",
        "What are Stereotypical and maladaptive behaviors in Autism Spectrum, how are these detected and managed?",
        "How relevant is eye contact and how it can be used to detect Autism?",
        "How can cross country trials help in development of Machine learning based Multimodal solutions?",
        "How early infants cry can help in the early detection of Autism?",
        "What are various methods to detect Atypical Pattern of Facial expression in Children?",
        "What kind of facial expressions can be used to detect Autism Disorder in children?",
        "What are methods to detect Autism from home videos?",
        "What is Still-Face Paradigm in Early Screening for High-Risk Autism Spectrum Disorder?",
        "What is West Syndrome?",
        "What is the utility of Behavior and interaction imaging at 9 months of age predict autism/intellectual disability in high-risk infants with West syndrome?"
        ]

In [61]:
def export_output_to_text_file(question, compressed_docs):
    with open("output.txt", "a") as file:
        file.write("Query: " + question + "\n\n")
        for i, doc in enumerate(compressed_docs):
            file.write("Summary {}:\n".format(i + 1))
            file.write(doc.page_content.strip() + "\n\n")

In [62]:
for question in questions:
    compressed_docs = compression_retriever.get_relevant_documents(question)
    export_output_to_text_file(question, compressed_docs)
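
Because `export_output_to_text_file` opens `output.txt` in append mode, re-running the loop above adds a second copy of every query to the file. One possible way to avoid this is to reset the file before exporting. The sketch below shows the idea with a temporary file, a placeholder summary, and the hypothetical helpers `reset_file` and `append_entry`; it does not call the notebook's real retriever.

```python
import os
import tempfile


def reset_file(path):
    """Truncate `path` (or create it) so later appends start from empty."""
    with open(path, "w"):
        pass


def append_entry(path, question, summaries):
    """Append one query and its summaries, mirroring the export format above."""
    with open(path, "a") as file:
        file.write("Query: " + question + "\n\n")
        for i, summary in enumerate(summaries):
            file.write("Summary {}:\n".format(i + 1))
            file.write(summary.strip() + "\n\n")


with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "output.txt")
    for _ in range(2):  # simulate running the export cell twice
        reset_file(out_path)
        # The summary text is a placeholder, not real retriever output
        append_entry(out_path, "What is West Syndrome?", ["placeholder summary text"])
    with open(out_path) as file:
        content = file.read()

print(content.count("Query: "))
```

In the notebook itself, the reset would be called once before the `for question in questions:` loop, not once per question.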



### **Below are all 13 questions and their relevant summary responses**

The questions were retrieved from the following provided link: [Questions](https://docs.google.com/spreadsheets/d/1m1ZrxKAJF3KOSRn9QUz2a63Ll0nBR0KjMPLTe1rp4cM/edit?usp=sharing)

In [59]:
# question = "What are the variety of Multimodal and Multi-modular AI Approaches to Streamline Autism Diagnosis in Young Children?"
# compressed_docs = compression_retriever.get_relevant_documents(question)
# # print("Query: ", question)
# # print("\n")
# # print(pretty_print_docs(compressed_docs))
# export_output_to_text_file(question, compressed_docs)

In [None]:
# question = "What is Autism Spectrum Disorder, how it is caused?"
# compressed_docs = compression_retriever.get_relevant_documents(question)
# # print("Query: ", question)
# # print("\n")
# # print(pretty_print_docs(compressed_docs))
# export_output_to_text_file(question, compressed_docs)

In [None]:
# question = "What is the cure of Autism Spectrum Disorder?"
# compressed_docs = compression_retriever.get_relevant_documents(question)
# # print("Query: ", question)
# # print("\n")
# # print(pretty_print_docs(compressed_docs))
# export_output_to_text_file(question, compressed_docs)

In [None]:
# question = "What are Stereotypical and maladaptive behaviors in Autism Spectrum, how are these detected and managed?"
# compressed_docs = compression_retriever.get_relevant_documents(question)
# # print("Query: ", question)
# # print("\n")
# # print(pretty_print_docs(compressed_docs))
# export_output_to_text_file(question, compressed_docs)

In [None]:
# question = "How relevant is eye contact and how it can be used to detect Autism?"
# compressed_docs = compression_retriever.get_relevant_documents(question)
# # print("Query: ", question)
# # print("\n")
# # print(pretty_print_docs(compressed_docs))
# export_output_to_text_file(question, compressed_docs)

In [None]:
# question = "How can cross country trials help in development of Machine learning based Multimodal solutions?"
# compressed_docs = compression_retriever.get_relevant_documents(question)
# # print("Query: ", question)
# # print("\n")
# # print(pretty_print_docs(compressed_docs))
# export_output_to_text_file(question, compressed_docs)

In [None]:
# question = "How early infants cry can help in the early detection of Autism?"
# compressed_docs = compression_retriever.get_relevant_documents(question)
# # print("Query: ", question)
# # print("\n")
# # print(pretty_print_docs(compressed_docs))
# export_output_to_text_file(question, compressed_docs)

In [None]:
# question = "What are various methods to detect Atypical Pattern of Facial expression in Children?"
# compressed_docs = compression_retriever.get_relevant_documents(question)
# # print("Query: ", question)
# # print("\n")
# # print(pretty_print_docs(compressed_docs))
# export_output_to_text_file(question, compressed_docs)

In [None]:
# question = "What kind of facial expressions can be used to detect Autism Disorder in children?"
# compressed_docs = compression_retriever.get_relevant_documents(question)
# # print("Query: ", question)
# # print("\n")
# # print(pretty_print_docs(compressed_docs))
# export_output_to_text_file(question, compressed_docs)

In [None]:
# question = "What are methods to detect Autism from home videos?"
# compressed_docs = compression_retriever.get_relevant_documents(question)
# # print("Query: ", question)
# # print("\n")
# # print(pretty_print_docs(compressed_docs))
# export_output_to_text_file(question, compressed_docs)

In [None]:
# question = "What is Still-Face Paradigm in Early Screening for High-Risk Autism Spectrum Disorder?"
# compressed_docs = compression_retriever.get_relevant_documents(question)
# # print("Query: ", question)
# # print("\n")
# # print(pretty_print_docs(compressed_docs))
# export_output_to_text_file(question, compressed_docs)

In [None]:
# question = "What is West Syndrome?"
# compressed_docs = compression_retriever.get_relevant_documents(question)
# # print("Query: ", question)
# # print("\n")
# # print(pretty_print_docs(compressed_docs))
# export_output_to_text_file(question, compressed_docs)

In [None]:
# question = "What is the utility of Behavior and interaction imaging at 9 months of age predict autism/intellectual disability in high-risk infants with West syndrome?"
# compressed_docs = compression_retriever.get_relevant_documents(question)
# # print("Query: ", question)
# # print("\n")
# # print(pretty_print_docs(compressed_docs))
# export_output_to_text_file(question, compressed_docs)

In [63]:
!pip freeze > requirements.txt