# CogniAble
## Integration of Retrieval-Augmented Generation (RAG) with an Open-Source LLM and LangChain for Autism Intervention Research
> - By - Nandan Hemanth
> - Github repo: - https://github.com/NandanHemanth/Multimex-ingestion 
> - Resume - https://drive.google.com/file/d/1IB6X_G7mPwvzv1M8I0QiV099_02bylI2/view?usp=sharing 
---


In [1]:
!nvidia-smi

Fri Apr 12 21:22:46 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 551.86                 Driver Version: 551.86         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 3060 ...  WDDM  |   00000000:01:00.0 Off |                  N/A |
| N/A   51C    P8              9W /   95W |     405MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## 1. Installing all the necessary libraries for this task :

In [2]:
%pip install langchain
%pip install torch
%pip install sentence_transformers
%pip install huggingface-hub
%pip install pypdf
%pip install tiktoken
%pip -q install accelerate
%pip install ollama
%pip install chromadb
%pip -q install git+https://github.com/huggingface/transformers

Note: you may need to restart the kernel to use updated packages.
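
After restarting the kernel, it can help to confirm that every package is importable before going further. The following is a minimal sketch using only the standard library; the list of import names is my assumption about what the install cell above provides (for example, `huggingface-hub` is imported as `huggingface_hub`), and `missing_modules` is a hypothetical helper, not part of any of these libraries.

```python
from importlib.util import find_spec

# Import names (not pip names) assumed to correspond to the install cell above.
REQUIRED_MODULES = [
    "langchain", "torch", "sentence_transformers", "huggingface_hub",
    "pypdf", "tiktoken", "accelerate", "ollama", "chromadb", "transformers",
]


def missing_modules(names):
    """Return the top-level module names in `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
print("Missing modules:", missing if missing else "none")
```

An empty result only shows that the modules can be located; it does not check versions.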


## 2. Importing all the necessary Libraries & Modules :

In [3]:
# LLM and embeddings served locally through Ollama
from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
# Vector store
from langchain_community.vectorstores import Chroma
# PDF text extraction (the install cell provides `pypdf`, not `PyPDF2`)
from pypdf import PdfReader
# Prompting, output parsing, and text splitting
from langchain_core.output_parsers import StrOutputParser
from langchain.prompts import PromptTemplate
from langchain.text_splitter import CharacterTextSplitter
from operator import itemgetter
import os

## 3. Defining the LLM for the Task - "Mistral - 7B: latest" :
- Ollama is an open-source tool for running LLMs locally; LangChain's `Ollama` wrapper integrates such **open-source LLMs** into this project
- Download and load *"Mistral-7B: latest"* (`mistral:latest`) and define `MODEL`
- For creating vector embeddings, I used **"OllamaEmbeddings"**, as it is open source and generates embeddings with the same locally served model

In [4]:
# Loading the model - "Mistral 7B" and testing it
MODEL = "mistral:latest"
model = Ollama(model=MODEL)
embeddings = OllamaEmbeddings(model=MODEL)

model.invoke("Tell me a tech joke")

' Why don\'t programmers like nature?\n\nBecause it has too many bugs.\n\nOr, why did the Java developer quit his job?\n\nTo work on a greener project: Python. (This one is a play on words as "greener" can mean both "more environmentally friendly" and "less experienced," since "Python" is often used to describe someone who\'s new to a subject.)'

## 4. Loading all the PDFs from the directory - "./research_papers/" : [TASK 1]
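- This snippet extracts **raw text** from all 15 PDFs in the directory; tables and figures are not kept as structured content
- The raw text consists of **698,508 characters** (the value of `len(text)`, which counts characters rather than tokens) and will be converted to vector embeddings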

In [5]:
# Directory containing PDF files
pdf_dir = "./research_papers/"

# Initialize an empty string to store text from all PDF files
text = ""

# Iterate over each file in the directory
for filename in os.listdir(pdf_dir):
    if filename.endswith(".pdf"):
        # Construct the full path to the PDF file
        pdf_path = os.path.join(pdf_dir, filename)
        
        # Open the PDF file
        with open(pdf_path, "rb") as file:
            # Create a PdfReader object
            pdf_reader = PdfReader(file)
            
            # Iterate over each page in the PDF
            for page in pdf_reader.pages:
                # Extract text from the page and append it to the 'text' variable
                text += page.extract_text()

# Output the length of the concatenated text
print(len(text))

Multiple definitions in dictionary at byte 0x1cc6b for key /MediaBox
Multiple definitions in dictionary at byte 0x1ce61 for key /MediaBox
Multiple definitions in dictionary at byte 0x1d014 for key /MediaBox
Multiple definitions in dictionary at byte 0x1d1ae for key /MediaBox
Multiple definitions in dictionary at byte 0x1d33d for key /MediaBox
Multiple definitions in dictionary at byte 0x1d4af for key /MediaBox
Multiple definitions in dictionary at byte 0x1d699 for key /MediaBox
Multiple definitions in dictionary at byte 0x1d85b for key /MediaBox
Multiple definitions in dictionary at byte 0x1db05 for key /MediaBox


698508
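
One detail worth noting about the loading loop: `os.listdir` returns entries in arbitrary order, and `endswith(".pdf")` is case-sensitive, so the order of the concatenated text can vary between runs and a file named `paper.PDF` would be skipped. A hedged sketch of a deterministic alternative follows; `list_pdfs` is a hypothetical helper introduced here for illustration, not something the notebook uses.

```python
from pathlib import Path


def list_pdfs(directory):
    """Return PDF paths in `directory`, sorted by name.

    The extension is matched case-insensitively, so 'a.pdf' and 'b.PDF'
    are both included.
    """
    candidates = (
        path for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
    return sorted(candidates, key=lambda path: path.name.lower())
```

Each returned path could then be opened with `open(path, "rb")` exactly as in the loop above.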


## 5. Data Pre-processing & Cleaning :
- This snippet prepares the extracted text for embedding by splitting it on line breaks (`separator = "\n"`); it does not strip special characters
- The raw text is broken down into *chunks* with **chunk_size = 1000, chunk_overlap = 30** (both measured in characters, since `length_function=len`), ready to be converted to vector embeddings

In [6]:
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=30,
    length_function=len,
)
chunks = text_splitter.split_text(text)
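
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified, standard-library sketch of greedy splitting with overlap. It is an illustration under my own assumptions, not LangChain's actual implementation: `split_with_overlap` is a hypothetical function, and details such as whitespace stripping or the handling of oversized lines may differ from `CharacterTextSplitter`.

```python
def split_with_overlap(text, separator="\n", chunk_size=1000, chunk_overlap=30):
    """Greedily pack separator-delimited pieces into chunks.

    Each chunk is at most `chunk_size` characters, except that a single
    piece longer than `chunk_size` is kept whole. Up to `chunk_overlap`
    characters of trailing pieces are repeated at the start of the next chunk.
    """
    pieces = [piece for piece in text.split(separator) if piece]
    chunks, current = [], []

    def joined_len(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            # The next piece does not fit: emit the current chunk.
            chunks.append(separator.join(current))
            # Drop leading pieces until the remainder fits the overlap
            # budget and leaves room for the incoming piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks
```

With four 4-character lines, `chunk_size=9`, and `chunk_overlap=4`, each chunk holds two lines and the last line of one chunk is repeated as the first line of the next.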

## 6. Vector Embeddings & Vector Database Creation: [TASK 2]
- For vector database creation, I've used **ChromaDB**, as it is scalable and robust when it comes to **vector embeddings & retrieval**
- Retrieval uses *similarity search*: documents are returned from the database ranked by their similarity score to the query

In [7]:
vectorstore = Chroma.from_texts(
    texts=chunks,
    embedding=embeddings,
)
retriever = vectorstore.as_retriever()
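
The idea behind similarity search can be sketched without any vector database: embed the query, score it against every stored vector, and keep the best `k`. The example below uses cosine similarity as one common choice; this is an assumption for illustration, since the metric the store actually applies depends on its configuration. `cosine_similarity` and `top_k` are hypothetical helpers, and `k=4` reflects what I believe is the retriever's default rather than something stated in this notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return (index, score) pairs for the k most similar vectors, best first."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

A production store replaces the exhaustive scan with an approximate index, but the ranking principle is the same.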

## 7. Define a Prompt Template for Question & Answers :
- Using `PromptTemplate`, a class in LangChain, we can define templates that generate the prompt each time the chain is invoked
- This provides a **blueprint to answer questions in a systematic way**
- `StrOutputParser()` converts the model's output into a plain string

In [8]:
template = """
Answer the question based on the context below. If you can't 
answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""

prompt = PromptTemplate.from_template(template)

parser = StrOutputParser()
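
The template has two slots, `{context}` and `{question}`. A retriever returns document objects rather than plain text, so one common approach is to join their text before filling the template. The sketch below shows that step with the standard library only; `Doc`, `format_docs`, and `build_prompt` are hypothetical stand-ins introduced for illustration, not LangChain classes or the method used in this notebook.

```python
from dataclasses import dataclass

TEMPLATE = """
Answer the question based on the context below. If you can't
answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document with a `page_content` field."""
    page_content: str


def format_docs(docs):
    """Join the text of the retrieved documents, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


def build_prompt(question, docs):
    """Fill the template with the joined context and the question."""
    return TEMPLATE.format(context=format_docs(docs), question=question)
```

Calling `build_prompt("What is West Syndrome?", [Doc("passage one"), Doc("passage two")])` yields a prompt whose context section contains the two passages as plain text.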

## 8. Define the Chain to Invoke LLM to generate Answers:
- This snippet defines a chain that retrieves context for the question, fills the prompt template, calls the model, and parses the output into a string

In [9]:
chain = (
    {
        "context": itemgetter("question") | retriever,
        "question": itemgetter("question"),
    }
    | prompt
    | model
    | parser
)
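
Conceptually, the `|` operator feeds the output of one step into the next. The following is a toy sketch of that left-to-right composition using plain functions; every `fake_*` function is a hypothetical stub standing in for the real retriever, prompt, model, and parser, and `pipe` is my own helper rather than a LangChain API.

```python
from functools import reduce


def pipe(*steps):
    """Compose callables left to right: pipe(f, g)(x) == g(f(x))."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)


def fake_retrieve(inputs):
    # Stub: attach placeholder context to the incoming question.
    question = inputs["question"]
    return {"context": f"[passages for: {question}]", "question": question}


def fake_prompt(values):
    # Stub: build the prompt text from context and question.
    return f"Context: {values['context']}\nQuestion: {values['question']}"


def fake_model(prompt_text):
    # Stub: a real model would generate an answer from the prompt.
    return "ANSWER<" + prompt_text + ">"


def fake_parser(output):
    # Stub: ensure the final result is a plain string.
    return str(output)


toy_chain = pipe(fake_retrieve, fake_prompt, fake_model, fake_parser)
print(toy_chain({"question": "What is West Syndrome?"}))
```

The real chain differs in that its first step is a dictionary whose two entries are both computed from the same input.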

## 9. Query Formulation & Retrieval : [TASK 3]
- This snippet holds 13 questions; each one is passed through the chain, which retrieves context from the vector database and prompts the LLM
- The generated answer to each question is displayed

In [10]:
questions = [
    "What are the variety of Multimodal and Multi-modular AI Approaches to Streamline Autism Diagnosis in Young Children?",
    "What is Autism Spectrum Disorder, how it is caused?",
    "What is the cure of Autism Spectrum Disorder?",
    "What are Stereotypical and maladaptive behaviors in Autism Spectrum, how are these detected and managed?",
    "How relevant is eye contact and how it can be used to detect Autism?",
    "How can cross country trials help in development of Machine learning based Multimodal solutions?",
    "How early infants cry can help in the early detection of Autism?",
    "What are various methods to detect  Atypical Pattern of Facial expression in Children?",
    "What kind of facial expressions can be used to detect Autism Disorder in children?",
    "What are methods to detect Autism from home videos?",
    "What is Still-Face Paradigm in Early Screening for High-Risk Autism Spectrum Disorder?",
    "What is West Syndrome?",
    "What is the utility of Behavior and interaction imaging at 9 months of age predict autism/intellectual disability in high-risk infants with West syndrome?"
]

for question in questions:
    print(f"Question: {question}")
    print(f"Answer: {chain.invoke({'question': question})}")
    print()

Question: What are the variety of Multimodal and Multi-modular AI Approaches to Streamline Autism Diagnosis in Young Children?
Answer:  Based on the provided context, there have been several studies exploring the use of artificial intelligence (AI) for shortening the behavioral diagnosis of autism in young children. These approaches can be categorized into multimodal and multi-modular AI methods as follows:

1. Multimodal Approaches:
   a. Use of computer vision tools to analyze autism-related behaviors in infants (Hashemi et al., 2014; Hashemi et al., 2018).
   b. Analysis of symmetry in early autism spectrum disorders using lying tasks (Esposito et al., 2009).

2. Multi-modular Approaches:
   a. Use of the Autism Diagnostic Observation Schedule (ADOS) alongside other diagnostic tools (Lord et al., 1999).
   b. Evaluation of a novel mobile-health screening tool for toddlers and preschoolers at risk for autism spectrum disorder using various developmental and behavioral measures (Kanne

## 10. Summarization and Response Generation : [TASK 4]
- Re-runs the chain on 5 of the 13 queries and concatenates the generated answers
- Summarizes these 5 answers with the LLM


In [14]:
summary = ""
summary_questions = [
    "What are the variety of Multimodal and Multi-modular AI Approaches to Streamline Autism Diagnosis in Young Children?",
    "What is Autism Spectrum Disorder, how it is caused?",
    "What are Stereotypical and maladaptive behaviors in Autism Spectrum, how are these detected and managed?",
    "What kind of facial expressions can be used to detect Autism Disorder in children?",
    "What are methods to detect Autism from home videos?"
]

# Generate an answer for each selected question and join them, one per line
for question in summary_questions:
    summary += chain.invoke({'question': question}) + "\n"

# Pass the concatenated answers to the model to produce the summary
# (no explicit summarization instruction is added to the text)
summary_result = model.invoke(summary)
print(summary_result)


 Based on the context provided in the text, there have been several studies exploring the use of artificial intelligence (AI) for autism diagnosis in young children using various approaches. These approaches include:

1. Computer vision tools: Analyzing facial expressions and body movements using machine learning algorithms.
2. Speech analysis: Assessing speech and language development using standardized screening questionnaires and analyzing speech-like vocalizations.
3. Motor impairment: Analyzing early motor and socio-communicative behaviors from home videos.
4. Multi-modal analysis: Combining data from multiple sources, such as parent reports, observation, and medical records.
5. Mobile health screening tools: Evaluating a novel mobile-health screening tool for early detection of autism in toddlers and preschoolers.
6. Machine learning algorithms: Exploring symmetry in early autism spectrum disorders and studying the co-occurrence of motor problems and autistic symptoms using machi
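
Because the concatenated answers are sent to the model as-is, the summary relies on the model inferring what is wanted. One possible refinement, sketched here under my own assumptions, is to wrap the answers in an explicit instruction and cap the length of the input. `build_summary_prompt`, its wording, and the `max_chars` limit are all hypothetical and not part of the notebook.

```python
def build_summary_prompt(answers, max_chars=8000):
    """Number the answers and place them under an explicit summarization instruction.

    The numbered text is truncated to `max_chars` characters to keep the
    prompt within a manageable size.
    """
    instruction = (
        "Summarize the following answers in one short paragraph. "
        "Use only the information they contain."
    )
    numbered = "\n\n".join(
        f"Answer {index}: {answer.strip()}"
        for index, answer in enumerate(answers, start=1)
    )
    return f"{instruction}\n\n{numbered[:max_chars]}"
```

The resulting string could be passed to `model.invoke(...)` in place of the raw concatenation.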

# CogniAble - Assignment 2 Submission
- Github : https://github.com/NandanHemanth/
- Resume : https://drive.google.com/file/d/1IB6X_G7mPwvzv1M8I0QiV099_02bylI2/view?usp=sharing 