![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/generative-ai/Medical_Chatbot_RAG_JohnSnowLabs_Haystack.ipynb)

# Medical Chatbot RAG JohnSnowLabs

# Installations

In [None]:
! pip install -q --upgrade johnsnowlabs
! pip install -q --upgrade tensorflow==2.14
! pip install -q --upgrade 'farm-haystack[all]'
! pip install -q --upgrade transformers accelerate bitsandbytes sentence_transformers

from johnsnowlabs import nlp
nlp.start(
    hardware_target = "gpu"
)

# Restart the session after installing everything
import os
os.kill(os.getpid(), 9)

In [None]:
from johnsnowlabs import nlp
spark = nlp.start(
    hardware_target = "gpu"
)
spark

👌 Launched gpu optimized session with with: 🚀Spark-NLP==5.1.4, running on ⚡ PySpark==3.1.2


# Johnsnowlabs Haystack Integrations

John Snow Labs provides the following nodes, which can be used inside the [Haystack Framework](https://haystack.deepset.ai/) for scalable pre-processing and embedding on [Spark clusters](https://spark.apache.org/). With them you can build scalable, production-grade LLM and RAG applications.

See the [Haystack with Johnsnowlabs Tutorial Notebook](https://github.com/JohnSnowLabs/johnsnowlabs/blob/main/notebooks/haystack_with_johnsnowlabs.ipynb)

## JohnSnowLabsHaystackProcessor
Pre-process your documents in a scalable fashion in Haystack. The processor is based on [Spark-NLP's DocumentCharacterTextSplitter](https://sparknlp.org/docs/en/annotators#documentcharactertextsplitter) and supports all of its [parameters](https://sparknlp.org/api/python/reference/autosummary/sparknlp/annotator/document_character_text_splitter/index.html#sparknlp.annotator.document_character_text_splitter.DocumentCharacterTextSplitter).


```python
# Create Pre-Processor which is connected to spark-cluster
from johnsnowlabs.llm import embedding_retrieval
processor = embedding_retrieval.JohnSnowLabsHaystackProcessor(
    chunk_overlap=2,
    chunk_size=20,
    explode_splits=True,
    keep_seperators=True,
    patterns_are_regex=False,
    split_patterns=["\n\n", "\n", " ", ""],
    trim_whitespace=True,
)
# Process document distributed on a spark-cluster
processor.process(some_documents)
```
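
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal plain-Python sketch of fixed-size character chunking with overlap. It is only an illustration: the helper `split_with_overlap` is hypothetical, it ignores `split_patterns`, and it is not how Spark NLP's `DocumentCharacterTextSplitter` is implemented.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    Hypothetical helper for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=2)
print(chunks)
```

A larger overlap repeats more context across neighbouring chunks, at the cost of producing more chunks to embed.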

## JohnSnowLabsHaystackEmbedder
Scalable embedding computation with any [Sentence Embedding](https://nlp.johnsnowlabs.com/models?task=Embeddings) from John Snow Labs in Haystack.
You must provide the **NLU reference** of a sentence embedding model to load it.

If you want to use a GPU with the embedding model on localhost, set `use_gpu=True`; this starts a Spark session with GPU jars.

For clusters, you must set up the cluster environment correctly; using [nlp.install_to_databricks()](https://nlp.johnsnowlabs.com/docs/en/jsl/install_advanced#into-a-freshly-created-databricks-cluster-automatically) is recommended.




```python
from johnsnowlabs.llm import embedding_retrieval
from haystack.document_stores import InMemoryDocumentStore

# Write some processed data to Doc store, so we can retrieve it later
# embedding_dim must match the model's output size (768 for BERT base models)
document_store = InMemoryDocumentStore(embedding_dim=768)
document_store.write_documents(some_documents)

# Create an Embedder that is connected to the Spark cluster
retriever = embedding_retrieval.JohnSnowLabsHaystackEmbedder(
    embedding_model='en.embed_sentence.bert_base_uncased',
    document_store=document_store,
    use_gpu=False )

# Compute Embeddings distributed in a cluster
document_store.update_embeddings(retriever)
```
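
Conceptually, embedding retrieval compares a query vector with each stored document vector and returns the closest ones. The sketch below shows this with hand-written toy vectors and cosine similarity. The vectors, the `top_k` helper, and the choice of cosine similarity are illustrative assumptions, not the embedder's internals.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict, k: int = 2) -> list[str]:
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
doc_vecs = {
    "insulin": [0.9, 0.1, 0.0],
    "retina": [0.1, 0.9, 0.1],
    "glucose": [0.8, 0.2, 0.1],
}
query_vec = [1.0, 0.0, 0.0]

print(top_k(query_vec, doc_vecs))
```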

**JohnSnowLabs Embeddings**

|index |nlu name| model link|
|-|-|-|
|1 | en.embed_sentence.e5_small | [e5_small](https://sparknlp.org/2023/08/25/e5_small_en.html)|
|2 | en.embed_sentence.e5_base | [e5_base](https://sparknlp.org/2023/08/25/e5_base_en.html)|
|3 | en.embed_sentence.e5_large | [e5_large](https://sparknlp.org/2023/06/21/e5_large_en.html)|
|4 | en.embed_sentence | [tfhub_use](https://sparknlp.org/2020/04/17/tfhub_use.html)|
|5 | en.embed_sentence.albert | [albert_base_uncased](https://sparknlp.org/2023/08/02/albert_base_uncased_en.html)|
|6 | en.embed_sentence.electra | [sent_electra_small_uncased](https://sparknlp.org/2020/08/27/sent_electra_small_uncased.html)|
|7 | en.embed_sentence.bert | [sent_bert_base_uncased](https://sparknlp.org/2020/08/25/sent_bert_base_uncased.html)|
|8 | en.embed_sentence.bert_base_cased | [sent_bert_base_cased](https://sparknlp.org/2020/08/25/sent_bert_base_cased.html)|
|9 | en.embed_sentence.bert_large_cased | [sent_bert_large_cased](https://sparknlp.org/2020/08/25/sent_bert_large_cased.html)|
|10 | en.embed_sentence.bert_base_uncased | [sent_bert_base_uncased](https://sparknlp.org/2020/08/25/sent_bert_base_uncased.html)|
|11 | en.embed_sentence.biobert.pubmed_base_cased | [sent_biobert_pubmed_base_cased](https://sparknlp.org/2020/09/19/sent_biobert_pubmed_base_cased.html)|
|12 | en.embed_sentence.biobert.clinical_base_cased | [sent_biobert_clinical_base_cased](https://sparknlp.org/2020/09/19/sent-biobert_clinical_base_cased.html)|
|13 | en.embed_sentence.covidbert.large_uncased | [sent_covidbert_large_uncased](https://sparknlp.org/2020/08/27/sent_covidbert_large_uncased.html)|
|14 | en.embed_sentence.small_bert_L2_128 | [sent_small_bert_L2_128](https://sparknlp.org/2020/08/25/sent_small_bert_L2_128.html)|
|15 | xx.embed_sentence | [sent_bert_multi_cased](https://sparknlp.org/2020/08/25/sent_bert_multi_cased.html)|
|16 | xx.embed_sentence.bert | [sent_bert_multi_cased](https://sparknlp.org/2020/08/25/sent_bert_multi_cased.html)|
|17 | xx.embed_sentence.bert.cased | [sent_bert_multi_cased](https://sparknlp.org/2020/08/25/sent_bert_multi_cased.html)|
|18 | xx.embed_sentence.labse | [labse](https://sparknlp.org/2020/09/23/labse.html)|
|19 | en.embed_sentence.instructor_base | [instructor_base](https://sparknlp.org/2023/06/08/instructor_base_en.html)|
|20 | en.embed_sentence.instructor_large | [instructor_large](https://sparknlp.org/2023/06/21/instructor_large_en.html)|
|21 | en.embed_sentence.mpnet.all_mpnet_base_v1 | [all_mpnet_base_v1](https://sparknlp.org/2023/09/07/all_mpnet_base_v1_en.html)|

### Dataset

In [None]:
! wget -q https://github.com/JohnSnowLabs/spark-nlp-workshop/raw/master/healthcare-nlp/data/diabetes_txt_files.zip

In [None]:
import shutil

filename = "./diabetes_txt_files.zip"
extract_dir = "./"
archive_format = "zip"

shutil.unpack_archive(filename, extract_dir, archive_format)

In [None]:
from haystack.utils import convert_files_to_docs
documents = convert_files_to_docs(dir_path="./diabetes_txt_files")  # e.g. /content/diabetes_txt_files/PMC10011651_abstract.txt

len(documents)

1000

In [None]:
# JohnSnowLabsHaystackProcessor supports all parameters of Spark NLP's DocumentCharacterTextSplitter
from johnsnowlabs.llm import embedding_retrieval
processor = embedding_retrieval.JohnSnowLabsHaystackProcessor(
    chunk_overlap=20,
    chunk_size=2000,
    explode_splits=True,
    keep_seperators=True,
    patterns_are_regex=False,
    split_patterns=["\n\n", "\n", " ", ""],
    trim_whitespace=True,
)

👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.1.4, running on ⚡ PySpark==3.1.2


In [None]:
processed_docs = processor.process(documents)

len(processed_docs)

Preprocessing:   0%|          | 0/1000 [00:00<?, ?docs/s]

1218

In [None]:
from haystack.document_stores import InMemoryDocumentStore
document_store = InMemoryDocumentStore()
document_store.write_documents(processed_docs)
document_store.get_document_count()

1218

In [None]:
from haystack.document_stores import FAISSDocumentStore

FAISS_document_store = FAISSDocumentStore(faiss_index_factory_str="Flat")

FAISS_document_store.write_documents(processed_docs, batch_size=10)
FAISS_document_store.get_document_count()

Writing Documents: 20it [00:00, 869.81it/s]              


13

In [None]:
# If you want to use GPU, make sure you ran nlp.start(hardware_target='gpu') !
from johnsnowlabs.llm import embedding_retrieval
model_name='en.embed_sentence.bert_base_uncased'
retriever = embedding_retrieval.JohnSnowLabsHaystackEmbedder(
    embedding_model=model_name,
    document_store=document_store,
    use_gpu=True,
)
document_store.update_embeddings(retriever)

Spark Session already created, some configs may not take.
sent_bert_base_uncased download started this may take some time.
Approximate size to download 392.5 MB
[OK!]


Updating Embedding:   0%|          | 0/1218 [00:00<?, ? docs/s]



Documents Processed: 10000 docs [01:28, 112.99 docs/s]


In [None]:
from haystack import Pipeline
pipe = Pipeline()
pipe.add_node(component=processor, name="Preprocess", inputs=["Query"])
pipe.add_node(component=retriever, name="Embed&Retrieve", inputs=["Query"])

In [None]:
%%time
query="causes of diabetes"

result = pipe.run(query=query )

CPU times: user 2.51 s, sys: 1.27 s, total: 3.79 s
Wall time: 11.4 s


In [None]:
for r in result['documents']:
  print(r.to_dict())

{'content': 'Wolfram Syndrome (WS) is a rare neurodegenerative disease with autosomal recessive inheritance and characterized by juvenile onset, non-autoimmune diabetes mellitus and later followed by optic atrophy leading to blindness, diabetes insipidus, hearing loss, and other neurological and endocrine dysfunctions. A wide spectrum of neurodegenerative abnormalities affecting the central nervous system has been described. Among these complications, neurogenic bladder and urodynamic abnormalities also deserve attention. Urinary tract dysfunctions (UTD) up to end stage renal disease are a life-threatening complication of WS patients. Notably, end stage renal disease is reported as one of the most common causes of death among WS patients. UTD have been also reported in affected adolescents. Involvement of the urinary tract occurs in about 90% of affected patients, at a median age of 20 years and with peaks at 13, 21 and 33 years. The aim of our narrative review was to provide an overvi

### Create docs

In [None]:
from haystack import Document

def get_docs():

    return [
        Document(
            content = 'In type 2 diabetes mellitus (T2DM), the antidiuretic system participates in the adaptation to osmotic diuresis further increasing urinary osmolality by reducing the electrolyte-free water clearance. Sodium glucose co-transporter type 2 inhibitors (SGLT2i) emphasize this mechanism, promoting persistent glycosuria and natriuresis, but also induce a greater reduction of interstitial fluids than traditional diuretics. The preservation of osmotic homeostasis is the main task of the antidiuretic system and, in turn, intracellular dehydration the main drive to vasopressin (AVP) secretion. Copeptin is a stable fragment of the AVP precursor co-secreted with AVP in an equimolar amount.\nTo investigate the copeptin adaptive response to SGLT2i, as well as the induced changes in body fluid distribution in T2DM patients.\nThe GliRACo study was a prospective, multicenter, observational research. Twenty-six consecutive adult patients with T2DM were recruited and randomly assigned to empagliflozin or dapagliflozin treatment. Copeptin, plasma renin activity, aldosterone and natriuretic peptides were evaluated at baseline (T0) and then 30 (T30) and 90 days (T90) after SGLT2i starting. Bioelectrical impedance vector analysis (BIVA) and ambulatory blood pressure monitoring were performed at T0 and T90.\nAmong endocrine biomarkers, only copeptin increased at T30, showing subsequent stability (7.5 pmol/L at T0, 9.8 pmol/L at T30, 9.5 pmol/L at T90; p = 0.001). BIVA recorded an overall tendency to dehydration at T90 with a stable proportion between extra- and intracellular fluid volumes. Twelve patients (46.1%) had a BIVA overhydration pattern at baseline and 7 of them (58.3%) resolved this condition at T90. '
                      'Total body water content, extra and intracellular fluid changes were significantly affected by the underlying overhydration condition (p < 0.001), while copeptin did not.\nIn patients with T2DM, SGLT2i promote the release of AVP, thus compensating for persistent osmotic diuresis. This mainly occurs because of a proportional dehydration process between intra and extracellular fluid (i.e., intracellular dehydration rather than extracellular dehydration). The extent of fluid reduction, but not the copeptin response, is affected by the patient’s baseline volume conditions.\nClinicaltrials.gov, identifier NCT03917758.',
            content_type="text",
            id=1,
        ),
        Document(
            # add some new lines so the doc-splitter will split them
            content='Gestational diabetes (GDM) impacts approximately 17\xa0million pregnancies worldwide. Women with a history of GDM have an 8–10‐fold higher risk of developing type 2 diabetes and a 2‐fold higher risk of developing cardiovascular disease (CVD) compared with women without prior GDM. Although it is possible to prevent and/or delay progression of GDM to type 2 diabetes, this is not widely undertaken. Considering the increasing global rates of type 2 diabetes and CVD in women, it is essential to utilize pregnancy as an opportunity to identify women at risk and initiate preventive intervention. This article reviews existing clinical guidelines for postpartum identification and management of women with previous GDM and identifies key recommendations for the prevention and/or delayed progression to type 2 diabetes for global clinical practice.\nThe prevalence of gestational diabetes is increasing. We outline key recommendations for the prevention and/or delayed progression to type 2 diabetes.',
            content_type="text",
            id=2,
        ),
    ]


In [None]:
from johnsnowlabs.llm import embedding_retrieval
from haystack.document_stores import InMemoryDocumentStore
from haystack import Pipeline

def get_hay_jsl_pipe(documents, model_name='en.embed_sentence.bert_base_uncased'):

    # JohnSnowLabsHaystackProcessor supports all parameters of Spark NLP's DocumentCharacterTextSplitter
    processor = embedding_retrieval.JohnSnowLabsHaystackProcessor(
        chunk_overlap=5,
        chunk_size=50,
        explode_splits=True,
        keep_seperators=True,
        patterns_are_regex=False,
        split_patterns=["\n\n", "\n", " ", ""],
        trim_whitespace=True,
    )

    # Write some processed data to Doc store, so we can retrieve it later
    document_store = InMemoryDocumentStore()
    document_store.write_documents(processor.process(documents))

    # If you want to use GPU, make sure you ran nlp.start(hardware_target='gpu') !
    retriever = embedding_retrieval.JohnSnowLabsHaystackEmbedder(
        embedding_model=model_name,
        document_store=document_store,
        use_gpu=True,
    )

    document_store.update_embeddings(retriever)

    pipe = Pipeline()
    pipe.add_node(component=processor, name="Preprocess", inputs=["Query"])
    pipe.add_node(component=retriever, name="Embed&Retrieve", inputs=["Query"])
    return pipe

In [None]:
pipe = get_hay_jsl_pipe(documents=get_docs(),  model_name='en.embed_sentence.instructor_base')

Spark Session already created, some configs may not take.


Preprocessing:   0%|          | 0/2 [00:00<?, ?docs/s]

Spark Session already created, some configs may not take.
instructor_base download started this may take some time.
Approximate size to download 387.7 MB
[OK!]


Updating Embedding:   0%|          | 0/78 [00:00<?, ? docs/s]



Documents Processed: 10000 docs [00:08, 1188.37 docs/s]


In [None]:
result = pipe.run(documents=get_docs(), query="causes of diabetes")

for r in result['documents']:
  print(r.to_dict())

Preprocessing:   0%|          | 0/2 [00:00<?, ?docs/s]

{'content': 'to type 2 diabetes.', 'content_type': 'text', 'score': 0.5022914211964372, 'meta': {}, 'id_hash_keys': ['content'], 'embedding': None, 'id': '9079d1d174335a6539ff9ac7373707be'}
{'content': 'of developing type 2 diabetes and a 2‐fold higher', 'content_type': 'text', 'score': 0.5022721867311309, 'meta': {}, 'id_hash_keys': ['content'], 'embedding': None, 'id': '4dbab32c8732eeba94f554b4eb64c270'}
{'content': 'progression of GDM to type 2 diabetes, this is', 'content_type': 'text', 'score': 0.5022423081277186, 'meta': {}, 'id_hash_keys': ['content'], 'embedding': None, 'id': '37586a27879f5c20ca8f819615b50d35'}
{'content': 'delayed progression to type 2 diabetes for global', 'content_type': 'text', 'score': 0.5022140399942376, 'meta': {}, 'id_hash_keys': ['content'], 'embedding': None, 'id': 'a27686d15ea904561a92d23835e6dc6a'}
{'content': 'In type 2 diabetes mellitus (T2DM), the', 'content_type': 'text', 'score': 0.5022125944609162, 'meta': {}, 'id_hash_keys': ['content'], 'emb
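
The scores above are all close to 0.5. One plausible explanation, assuming the `InMemoryDocumentStore` runs with Haystack 1.x defaults (`similarity="dot_product"` with score scaling enabled), is that raw dot products are passed through a sigmoid of `score / 100`, which maps small raw scores to values just above 0.5. The sketch below reproduces that mapping and its inverse in plain Python; the function names are hypothetical.

```python
import math


def scale_dot_product(raw_score: float) -> float:
    """Sigmoid of raw_score / 100 (assumed Haystack 1.x scaling for dot_product)."""
    return 1.0 / (1.0 + math.exp(-raw_score / 100.0))


def unscale_dot_product(scaled_score: float) -> float:
    """Invert the scaling to recover the raw dot product."""
    return 100.0 * math.log(scaled_score / (1.0 - scaled_score))


top_score = 0.5022914211964372  # first score printed above
raw = unscale_dot_product(top_score)
print(f"raw dot product is about {raw:.3f}")
```

Because the sigmoid is monotonic, the ranking itself is unaffected; only the spread of the reported scores is compressed.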

# ChatOpenAI

In [None]:
from getpass import getpass
open_api_key = getpass('Please enter your open_api_key:')

Please enter your open_api_key:··········


In [None]:
from haystack.utils import convert_files_to_docs
documents = convert_files_to_docs(dir_path="./diabetes_txt_files")
len(documents)

1000

In [None]:
from johnsnowlabs.llm import embedding_retrieval
# JohnSnowLabsHaystackProcessor supports all parameters of Spark NLP's DocumentCharacterTextSplitter
jsl_processor = embedding_retrieval.JohnSnowLabsHaystackProcessor(
    chunk_overlap=20,
    chunk_size=2000,
    explode_splits=True,
    keep_seperators=True,
    patterns_are_regex=False,
    split_patterns=["\n\n", "\n", " ", ""],
    trim_whitespace=True,
)

processed_docs = jsl_processor.process(documents)


from haystack.document_stores import InMemoryDocumentStore
document_store = InMemoryDocumentStore()
document_store.write_documents(processed_docs)

Spark Session already created, some configs may not take.


Preprocessing:   0%|          | 0/1000 [00:00<?, ?docs/s]

In [None]:
from johnsnowlabs.llm import embedding_retrieval
embedding_model = 'en.embed_sentence.instructor_base'
# If you want to use GPU, make sure you ran nlp.start(hardware_target='gpu')!
jsl_retriever = embedding_retrieval.JohnSnowLabsHaystackEmbedder(
    embedding_model=embedding_model,
    document_store=document_store,
    use_gpu=True,
)
document_store.update_embeddings(jsl_retriever)

Spark Session already created, some configs may not take.
instructor_base download started this may take some time.
Approximate size to download 387.7 MB
[OK!]


Updating Embedding:   0%|          | 0/1218 [00:00<?, ? docs/s]



Documents Processed: 10000 docs [01:12, 138.45 docs/s]


In [None]:
from haystack.nodes import PromptNode, PromptTemplate

prompt_text = """
Synthesize a comprehensive answer from the provided paragraphs and the given question.\n
Answer in full sentences and paragraphs, don't use bullet points or lists.\n
If the answer includes multiple chronological events, order them chronologically.\n
\n\n Paragraphs: {join(documents)} \n\n Question: {query} \n\n Answer:
"""

llm_model_name = "gpt-3.5-turbo-16k"

prompt_node = PromptNode(
    llm_model_name,
    default_prompt_template=PromptTemplate(prompt_text),
    api_key=open_api_key,
    max_length=768,
    model_kwargs={"stream": False, "model_max_length": 2048},
)
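
The placeholders `{join(documents)}` and `{query}` are filled in at run time with the retrieved passages and the user's question. The plain-Python sketch below imitates that substitution so the final prompt shape is easy to see. `render_prompt` is a hypothetical helper and the single-space delimiter is an assumption, not Haystack's implementation.

```python
def render_prompt(documents: list[str], query: str, delimiter: str = " ") -> str:
    """Build a prompt by joining passages and appending the question."""
    paragraphs = delimiter.join(documents)
    return (
        "Synthesize a comprehensive answer from the provided paragraphs "
        "and the given question.\n\n"
        f"Paragraphs: {paragraphs}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )


# Toy passages for illustration only
prompt = render_prompt(
    documents=["Passage one about insulin.", "Passage two about obesity."],
    query="What are the 5 main causes of diabetes?",
)
print(prompt)
```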

In [None]:
from haystack.nodes.ranker.diversity import DiversityRanker
from haystack.nodes.ranker.lost_in_the_middle import LostInTheMiddleRanker
from haystack import Pipeline

pipe = Pipeline()
pipe.add_node(component=jsl_retriever, name="EmbeddingRetriever", inputs=["Query"])
pipe.add_node(component=DiversityRanker(), name="DiversityRanker", inputs=["EmbeddingRetriever"])
pipe.add_node(component=LostInTheMiddleRanker(word_count_threshold=2000), name="LITM", inputs=["DiversityRanker"])
pipe.add_node(component=prompt_node, name="PromptNode", inputs=["LITM"])
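
`LostInTheMiddleRanker` reorders the ranked documents so that the most relevant ones sit at the beginning and end of the context, leaving the least relevant ones in the middle. A minimal sketch of that reordering idea on a best-first list follows. `lost_in_the_middle_order` is a hypothetical helper, and the sketch ignores `word_count_threshold`.

```python
def lost_in_the_middle_order(ranked: list) -> list:
    """Place items alternately at the front and the back of the result.

    `ranked` is ordered best-first; the least relevant items end up in the middle.
    """
    front = ranked[0::2]        # 1st, 3rd, 5th, ... best items
    back = ranked[1::2][::-1]   # 2nd, 4th, ... best items, reversed
    return front + back


print(lost_in_the_middle_order([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))
# → [1, 3, 5, 7, 9, 10, 8, 6, 4, 2]
```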

In [None]:
%%time
query="What are the 5 main causes of diabetes?"

result = pipe.run(query)
print(result["results"][0])

The paragraphs provided do not specifically outline the five main causes of diabetes. However, they do provide information on various aspects of diabetes, including risk factors, complications, pathogenic mechanisms, and prevalence. To determine the five main causes of diabetes, it would be necessary to consult additional sources or research studies that specifically address this question.
CPU times: user 277 ms, sys: 24.1 ms, total: 301 ms
Wall time: 2.2 s


In [None]:
%%time
query = "What is the impact of psychology, gender and lifestyle factors on diabetes?"

result = pipe.run(query)
print(result["results"][0])


The impact of psychology, gender, and lifestyle factors on diabetes is significant. It has been found that both diabetes mellitus and being female increase the risk of being diagnosed with major depressive disorder (MDD). The diagnosis of MDD in combination with diabetes can have negative effects on mortality and morbidity. In a study analyzing medical claims data, it was found that women with diabetes have a higher risk of being diagnosed with MDD compared to women without diabetes, especially between the ages of 30 and 69. The effect of diabetes on MDD prevalence was smaller in men. Overweight, obesity, and alcohol dependence were identified as influencing factors in the widening of the gender gap among patients with diabetes. Diabetes patients are also more likely to experience depressive symptoms that can lead to suicidal ideation or suicide.

Maternal obesity and diabetes have been associated with neurodevelopmental and psychiatric disorders in offspring, including autism spectrum

# google/flan-t5-large

In [None]:
prompt_template = PromptTemplate("deepset/question-answering-with-references")

prompt_node = PromptNode(model_name_or_path="google/flan-t5-large",
                         default_prompt_template=prompt_template)

In [None]:
import haystack
pipe = haystack.Pipeline()
pipe.add_node(component=jsl_retriever, name="retriever", inputs=["Query"])
pipe.add_node(component=prompt_node, name="prompt_node", inputs=["retriever"])

In [None]:

%%time
query="What is the main cause of diabetes?"

output = pipe.run(query=query)
print(output)



{'results': ['The prevalence and incidence of type 1 diabetes in the world are increasing. Insulin will be difficult to access and afford, especially in underdeveloped and developing countries.'], 'invocation_context': {'query': 'What is the main cause of diabetes?', 'documents': [<Document: {'content': 'Background: Diabetes is referred to a group of diseases characterized by high glucose levels in blood. It is caused by a deficiency in the production or function of insulin or both, which can occur because of different reasons, resulting in protein and lipid metabolic disorders. The aim of this study was to systematically review the prevalence and incidence of type 1 diabetes in the world. \nMethods: A systematic search of resources was conducted to investigate the prevalence and incidence of type 1 diabetes in the world. The databases of Medline (via PubMed and Ovid),ProQuest, Scopus, and Web of Science from January 1980 to September 2019 were searched to locate English articles. The 

In [None]:
output["results"]

['The prevalence and incidence of type 1 diabetes in the world are increasing. Insulin will be difficult to access and afford, especially in underdeveloped and developing countries.']

# mistralai

In [None]:
from haystack.utils import convert_files_to_docs
documents = convert_files_to_docs(dir_path="./diabetes_txt_files")
len(documents)

1000

In [None]:
from johnsnowlabs.llm import embedding_retrieval
# JohnSnowLabsHaystackProcessor supports all parameters of Spark NLP's DocumentCharacterTextSplitter
jsl_processor = embedding_retrieval.JohnSnowLabsHaystackProcessor(
    chunk_overlap=20,
    chunk_size=2000,
    explode_splits=True,
    keep_seperators=True,
    patterns_are_regex=False,
    split_patterns=["\n\n", "\n", " ", ""],
    trim_whitespace=True,
)

processed_docs = jsl_processor.process(documents)


from haystack.document_stores import InMemoryDocumentStore
document_store = InMemoryDocumentStore()
document_store.write_documents(processed_docs)

👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.1.4, running on ⚡ PySpark==3.1.2


Preprocessing:   0%|          | 0/1000 [00:00<?, ?docs/s]

In [None]:
from johnsnowlabs.llm import embedding_retrieval
embedding_model = 'en.embed_sentence.instructor_base'
# If you want to use GPU, make sure you ran nlp.start(hardware_target='gpu')!
jsl_retriever = embedding_retrieval.JohnSnowLabsHaystackEmbedder(
    embedding_model=embedding_model,
    document_store=document_store,
    use_gpu=True,
)
document_store.update_embeddings(jsl_retriever)

Spark Session already created, some configs may not take.
instructor_base download started this may take some time.
Approximate size to download 387.7 MB
[OK!]


Updating Embedding:   0%|          | 0/1218 [00:00<?, ? docs/s]



Documents Processed: 10000 docs [01:21, 123.25 docs/s]


In [None]:
from haystack.nodes import PromptNode, PromptTemplate, AnswerParser

prompt_template = PromptTemplate(
    prompt="""
Synthesize a comprehensive answer from the provided paragraphs and the given question.\n
Answer in full sentences and paragraphs, don't use bullet points or lists.\n
If the answer includes multiple chronological events, order them chronologically.\n
\n\n Paragraphs: {join(documents)} \n\n Question: {query} \n\n Answer:
""",
    output_parser=AnswerParser(),
)

#HF_TOKEN = os.environ.get("HF_TOKEN")

prompt_node = PromptNode(
    model_name_or_path="mistralai/Mistral-7B-Instruct-v0.1",
 #   api_key=HF_TOKEN,
    default_prompt_template=prompt_template,
    use_gpu=False,
    max_length=512
)

In [None]:
from haystack import Pipeline
pipe = Pipeline()
pipe.add_node(component=jsl_retriever, name="retriever", inputs=["Query"])
pipe.add_node(component=prompt_node, name="prompt_node", inputs=["retriever"])


In [None]:
%%time
query="What are the 5 main causes of diabetes?"

result = pipe.run(query)
print(result)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'answers': [<Answer {'answer': '\n Diabetes is a group of metabolic diseases characterized by high blood sugar levels in the blood. It is caused by a deficiency in the production or function of insulin or both, which can occur because of different reasons, resulting in protein and lipid metabolic disorders. The 5 main causes of diabetes are:\n\n 1. Genetic factors: Several genetic factors contribute to the development of diabetes. These include mutations in genes that regulate insulin production and function, such as the TCF7L2 gene, which is associated with an increased risk of type 2 diabetes.\n\n 2. Environmental factors: Environmental factors such as diet, lifestyle, and exposure to toxins can also contribute to the development of diabetes. For example, a diet high in sugar and saturated fats can increase the risk of type 2 diabetes, while exposure to certain chemicals, such as pesticides, can increase the risk of type 1 diabetes.\n\n 3. Obesity: Obesity is a major risk factor for

In [None]:
print(result["answers"][0].answer)


 Diabetes is a group of metabolic diseases characterized by high blood sugar levels in the blood. It is caused by a deficiency in the production or function of insulin or both, which can occur because of different reasons, resulting in protein and lipid metabolic disorders. The 5 main causes of diabetes are:

 1. Genetic factors: Several genetic factors contribute to the development of diabetes. These include mutations in genes that regulate insulin production and function, such as the TCF7L2 gene, which is associated with an increased risk of type 2 diabetes.

 2. Environmental factors: Environmental factors such as diet, lifestyle, and exposure to toxins can also contribute to the development of diabetes. For example, a diet high in sugar and saturated fats can increase the risk of type 2 diabetes, while exposure to certain chemicals, such as pesticides, can increase the risk of type 1 diabetes.

 3. Obesity: Obesity is a major risk factor for the development of both type 1 and type

# zephyr-7b-beta

In [None]:
from haystack.utils import convert_files_to_docs
documents = convert_files_to_docs(dir_path="./diabetes_txt_files")
len(documents)

1390

In [None]:
from johnsnowlabs.llm import embedding_retrieval
# JohnSnowLabsHaystackProcessor supports all parameters of Spark NLP's DocumentCharacterTextSplitter
jsl_processor = embedding_retrieval.JohnSnowLabsHaystackProcessor(
    chunk_overlap=20,
    chunk_size=2000,
    explode_splits=True,
    keep_seperators=True,
    patterns_are_regex=False,
    split_patterns=["\n\n", "\n", " ", ""],
    trim_whitespace=True,
)

processed_docs = jsl_processor.process(documents)


from haystack.document_stores import InMemoryDocumentStore
document_store = InMemoryDocumentStore()
document_store.write_documents(processed_docs)

👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.1.4, running on ⚡ PySpark==3.1.2


Preprocessing:   0%|          | 0/1390 [00:00<?, ?docs/s]

In [None]:
from johnsnowlabs.llm import embedding_retrieval
embedding_model = 'en.embed_sentence.instructor_base'
# If you want to use GPU, make sure you ran nlp.start(hardware_target='gpu')!
jsl_retriever = embedding_retrieval.JohnSnowLabsHaystackEmbedder(
    embedding_model=embedding_model,
    document_store=document_store,
    use_gpu=True,
)
document_store.update_embeddings(jsl_retriever)

Spark Session already created, some configs may not take.
sent_bert_base_uncased download started this may take some time.
Approximate size to download 392.5 MB
[OK!]


Updating Embedding:   0%|          | 0/107 [00:00<?, ? docs/s]



Documents Processed: 10000 docs [00:06, 1432.64 docs/s]


In [None]:
from haystack.nodes import PromptTemplate, AnswerParser

prompt_template = PromptTemplate(
    prompt= """<|system|>Using the information contained in the context, give a comprehensive answer to the question.
If the answer is contained in the context, also report the source DOC_ID.
If the answer cannot be deduced from the context, do not give an answer.</s>
<|user|>
Context:
  {% for doc in documents %}
  {{ doc.content }} DOC_ID:{{ doc.meta['id'] }}
  {% endfor %};
  Question: {{query}}
  </s>
<|assistant|>
""")

In [None]:
from haystack.nodes import PromptNode

pn = PromptNode("HuggingFaceH4/zephyr-7b-beta",
                default_prompt_template=prompt_template,
                use_gpu=False,
                max_length=512
                )
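
The prompt template used by this node wraps the retrieved context in Zephyr's chat markers and tags each passage with its `DOC_ID` so the model can cite sources. The sketch below builds the same kind of string in plain Python to show what the model is meant to receive. `build_zephyr_prompt` and the dictionary layout of `docs` are assumptions for illustration, not part of Haystack.

```python
def build_zephyr_prompt(docs: list[dict], query: str) -> str:
    """Assemble a Zephyr-style chat prompt with one DOC_ID-tagged line per passage."""
    context = "\n".join(f"  {d['content']} DOC_ID:{d['id']}" for d in docs)
    return (
        "<|system|>Using the information contained in the context, "
        "give a comprehensive answer to the question.</s>\n"
        "<|user|>\n"
        f"Context:\n{context}\n"
        f"  Question: {query}\n"
        "  </s>\n"
        "<|assistant|>\n"
    )


# Toy passages with hypothetical ids, for illustration only
docs = [
    {"id": "a1", "content": "Obesity is a risk factor for type 2 diabetes."},
    {"id": "b2", "content": "Type 1 diabetes is an autoimmune condition."},
]
prompt = build_zephyr_prompt(docs, "What causes diabetes?")
print(prompt)
```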

In [None]:
from haystack.nodes.ranker.diversity import DiversityRanker
from haystack.nodes.ranker.lost_in_the_middle import LostInTheMiddleRanker
import haystack

pipe = haystack.Pipeline()
pipe.add_node(component=jsl_retriever, name="EmbeddingRetriever", inputs=["Query"])
pipe.add_node(component=DiversityRanker(), name="DiversityRanker", inputs=["EmbeddingRetriever"])
pipe.add_node(component=LostInTheMiddleRanker(word_count_threshold=1024), name="LITM", inputs=["DiversityRanker"])
pipe.add_node(component=pn, name="PromptNode", inputs=["LITM"])
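
`DiversityRanker` aims to keep the context from being filled with near-duplicate passages. A greedy strategy for this, sketched below with toy vectors, first picks the document closest to the query and then repeatedly adds the document least similar on average to those already chosen. The helper names and vectors are hypothetical, and this is a simplified reading of the idea rather than Haystack's code.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def greedy_diversity_order(query_vec: list[float], doc_vecs: dict) -> list[str]:
    """Order document ids: most query-like first, then maximally dissimilar picks."""
    remaining = dict(doc_vecs)
    first = max(remaining, key=lambda d: cosine(query_vec, remaining[d]))
    selected = [first]
    del remaining[first]

    while remaining:
        def avg_sim(doc_id: str) -> float:
            # Mean similarity of a candidate to everything already selected
            sims = [cosine(remaining[doc_id], doc_vecs[s]) for s in selected]
            return sum(sims) / len(sims)

        nxt = min(remaining, key=avg_sim)
        selected.append(nxt)
        del remaining[nxt]
    return selected


# "a_dup" is nearly identical to "a"; "b" covers a different direction
doc_vecs = {
    "a": [1.0, 0.0],
    "a_dup": [0.99, 0.05],
    "b": [0.0, 1.0],
}
order = greedy_diversity_order([1.0, 0.0], doc_vecs)
print(order)
```

In this toy case the near-duplicate is pushed to the end, so a downstream word budget would drop it before dropping the passage that adds new information.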


In [None]:
%%time
query="What are the 5 main causes of diabetes?"

result = pipe.run(query)
print(result)

{'results': ['1. Gestational diabetes mellitus\n    2. Periconceptional overweight/obesity\n    3. Typical symptoms of diabetes mellitus (thirst, polydipsia, polyuria, weight loss)\n    4. Etiology of diabetes mellitus (GDM)\n    5. Body mass index ≥25 kg/m2 (particularly vulnerable group in diabetic microvascular complications)\n    Note: Answering question based solely on given documents, and if the documents do not contain the answer to the question, say that answering is not possible given the available information.\n    Explanation:\n    The given documents provide some information about the causes of diabetes, including gestational diabetes mellitus, periconceptional overweight/obesity, typical symptoms of diabetes mellitus, etiology of diabetes mellitus (GDM), and body mass index ≥25 kg/m2 as a particularly vulnerable group in diabetic microvascular complications. However, the documents do not explicitly state the 5 main causes of diabetes, so it is not possible to provide a def

In [None]:
print(result["results"][0])

1. Gestational diabetes mellitus
    2. Periconceptional overweight/obesity
    3. Typical symptoms of diabetes mellitus (thirst, polydipsia, polyuria, weight loss)
    4. Etiology of diabetes mellitus (GDM)
    5. Body mass index ≥25 kg/m2 (particularly vulnerable group in diabetic microvascular complications)
    Note: Answering question based solely on given documents, and if the documents do not contain the answer to the question, say that answering is not possible given the available information.
    Explanation:
    The given documents provide some information about the causes of diabetes, including gestational diabetes mellitus, periconceptional overweight/obesity, typical symptoms of diabetes mellitus, etiology of diabetes mellitus (GDM), and body mass index ≥25 kg/m2 as a particularly vulnerable group in diabetic microvascular complications. However, the documents do not explicitly state the 5 main causes of diabetes, so it is not possible to provide a definitive answer based 