# 📚 Lab3: KG from unstructured data

**Goal** of the session: extract and normalize concepts from natural language text, and generate a RDF knowledge graph with those concepts.

**Tasks**:
1. **Extract** disease mentions from an english text about climate change (Named Entity Recognition, aka. NER)
2. **Normalize** the extracted diseases by mapping to standard concepts in a relevant ontology (Semantic similarity search, aka. vector databases)
3. Create a **RDF knowledge graph** for the extracted diseases annotations, using the [Text Annotation Ontology](https://vemonet.github.io/tao). The KG should contain:
    * The extracted diseases text annotations (start/end index, value in the text)
    * Linked to their top match ontology concept URI
    * Linked to the sentence they can be found in (create 1 URI per sentence in the text)

In [1]:
annotate_text = """As world leaders gather on Monday for the Climate Action Summit at the UN General Assembly, it is sadly clear that the prospect of rising global temperatures and sea levels has failed to generate a sufficient sense of urgency around climate change. What might spur leaders to action, if it were better understood, is the enormous threat that climate change already poses to human health.\nClimate change exacerbates chronic and contagious disease, worsens food and water shortages, increases the risk of pandemics, and aggravates mass displacement. The broad environmental effects of climate change have long been discussed as long-term risks; what\u2019s clear now is that the health effects are worse than anticipated\u2014and that they\u2019re already being felt.\nCLIMATE ILLS
The dangerous health effects of climate change begin with the emissions that cause it. Black carbon, methane, and nitrogen oxides are powerful drivers of global warming, and, along with other air pollutants such as carbon monoxide and ozone, they are responsible for over seven million deaths each year, about one in eight worldwide. The problem extends beyond cities with famously poor air quality, such as New Delhi, Beijing, and S\u00e3o Paulo. Ninety percent of the world\u2019s urban dwellers breathe air containing unsafe pollution levels, according to the World Health Organization (WHO).\nThe dangers start at the beginning of life. Toxic pollutants cross the placenta, increasing the risk of preterm birth and low birth weight, which can cause lifelong damage to multiple organ systems. Children breathe more rapidly than adults do, so they absorb more pollutants at a time when their developing organs are more vulnerable. As a result, air pollution causes an estimated 600,000 deaths each year in children under five, mostly from pneumonia. There is also emerging evidence that air pollution compromises children\u2019s cognitive development and can increase their risk of behavioral disorders.
In adults, pollution contributes to a wide range of respiratory and circulatory diseases, and may accelerate cognitive decline in seniors. Most air-pollution-related deaths are due to heart attacks and strokes, but ambient air pollution also accounts for a significant number of pneumonia, asthma, emphysema, and lung cancer deaths.\nIn addition to air pollution, emissions are responsible for rising global temperatures. These in turn lead to increased humidity and cause more frequent and intense heat waves that worsen hypertension and mental health problems, and can limit the effectiveness of certain medications. When a person\u2019s body temperature rises to 104 degrees Fahrenheit or above, systematic organ failure occurs. Heat waves this summer killed 1,435 people in France alone, the only country to have published statistics on heat-related deaths. As many of the world\u2019s major population centers grow hotter and more humid, more people will die from simply overheating.
Climate change also compounds the threat of communicable diseases. Increased rainfall and higher temperatures favor vector-borne diseases\u2014those caused by parasites, viruses, and bacteria transmitted by mosquitoes, ticks, flies, and fleas. Cold-blooded insects generally prefer warmer temperatures, which not only extend their breeding seasons but accelerate their geographical expansion.
As man-made climate change has taken hold over the last four decades, dozens of new infectious diseases have emerged or begun to threaten new regions, including Zika and Ebola.
The mosquito is already the deadliest animal in the world, causing more than half a million deaths each year\u2014438,000 of them from malaria. Warmer temperatures make it easier to transmit malaria at higher altitudes, and may cause it to spread farther into African highlands.
Another virus likely to spread as a result of climate change is dengue, which currently infects 96 million people each year and kills 90,000 of them. Dengue virus is transmitted by two species of mosquito\u2014Aedes aegypti and Aedes albopictus\u2014that are unusually tough and also transmit yellow fever, Zika, West Nile, and other viruses. More than half the world\u2019s population lives in areas where an Aedes species is already present\u2014and that proportion is likely to grow. A. aegypti in particular thrives not only in warm and moist environments but in drought-prone ones, too. Europe, North America, and high-elevation areas in the tropics may soon have to contend with dengue as well as other emerging diseases.\nThe steady swarm of Aedes into new regions points to the single biggest threat of all: pandemic disease. As man-made climate change has taken hold over the last four decades, dozens of new infectious diseases have emerged or begun to threaten new regions, including Zika and Ebola. Cholera is also becoming more difficult to control: warm, brackish waters and rising sea levels help spread the disease, which infects about four million people each year and kills about 100,000 of them. Bubonic plague, spread by rats and fleas, is predicted to increase with warmer springs and wetter summers. Anthrax, whose spores are released by thawing permafrost, could spread farther as a result of stronger winds.\nAnd those are just the direct health effects of climate change. Rising sea levels and increased ocean acidification will reduce fishing and aquaculture, aggravating malnutrition and food insecurity. Contamination of aquifers will exacerbate water shortages. Droughts, which already kill and displace more people than any other type of weather catastrophe, are predicted to grow longer and more frequent. The World Bank estimates that by 2050, there could be one billion climate refugees from sub-Saharan Africa, South Asia, and Latin America.\nA HEALTHY RESPONSE\nExtreme weather also disrupts public health infrastructure and services. That is why it is imperative that countries around the world invest in adapting health-care systems to the environmental changes already underway and likely to follow. At the first high-level meeting on Universal Health Coverage (UHC) at the United Nations, the WHO will call on world leaders to invest not only in safe water, hygiene, and sanitation services but also in universal access to health services for chronic disease, child health, and antenatal and palliative care. Almost every disease caused or aggravated by climate change can be prevented or treated if addressed early. Unfortunately, the WHO projects that by 2030, 42 percent of the world\u2019s population will either not have access to health services or not be able to afford them.
Universal health coverage includes screening and accurate and timely diagnostics. It also includes surveillance and rapid response to emerging global health threats.
The world has tended to fight such threats one disease at a time\u2014whether it is smallpox, polio, TB, HIV/AIDS, or malaria\u2014and to adopt a firefighting approach when a deadly pandemic such as Ebola emerges.
"""


In [2]:
import sys
!{sys.executable} -m pip install requests tqdm spacy textblob oxrdflib fastembed qdrant-client langchain langchain-community
!{sys.executable} -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
[2K     [38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m33.0 MB/s[0m eta [36m0:00:00[0mm eta [36m0:00:01[0m[36m0:00:01[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


## ☄️ Named Entity Recognition with Spacy

We will use `spacy` NER to extract and categorize entities from an english text about climate change.



In [3]:
import spacy
import requests

# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load("en_core_web_sm")

# Get annotations for the text document
doc = nlp(annotate_text)

print(f"🔎 Found {len(doc.ents)} entities")

# Display extracted entities
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)



  from .autonotebook import tqdm as notebook_tqdm


🔎 Found 61 entities
Monday DATE 27 33
the Climate Action Summit ORG 38 63
the UN General Assembly ORG 67 90
over seven million CARDINAL 1031 1049
each year DATE 1057 1066
about one CARDINAL 1068 1077
eight CARDINAL 1081 1086
New Delhi GPE 1172 1181
Beijing GPE 1183 1190
São Paulo PERSON 1196 1205
Ninety percent PERCENT 1207 1221
the World Health Organization (WHO ORG 1313 1347
an estimated 600,000 CARDINAL 1720 1740
each year DATE 1748 1757
under five DATE 1770 1780
emphysema PERSON 2252 2261
104 degrees QUANTITY 2614 2625
Fahrenheit WORK_OF_ART 2626 2636
this summer DATE 2691 2702
1,435 CARDINAL 2710 2715
France GPE 2726 2732
mosquitoes GPE 3134 3144
the last four decades DATE 3367 3388
dozens CARDINAL 3390 3396
Zika and Ebola WORK_OF_ART 3481 3495
more than half a million CARDINAL 3564 3588
year—438,000 PRODUCT 3601 3613
African NORP 3752 3759
96 million CARDINAL 3867 3877
each year DATE 3885 3894
90,000 CARDINAL 3905 3911
two CARDINAL 3952 3955
Aedes PERSON 3994 3999
West Nile LOC 4

In [4]:
spacy.displacy.render(doc, jupyter=True, style="ent")
# from spacy import displacy
# displacy.render(next(doc.sents), style='dep', jupyter=True)

## 🧫 Train spacy to recognize diseases

Spacy is not able to recognize diseases. So we will train it for this purpose. 

We will use an existing [dataset where diseases have been annotated](https://raw.githubusercontent.com/MaastrichtU-IDS/prodigy-drug-indication-annotation/master/relation/dailymed_disease3.jsonl) in english sentences, containing ~500 annotations for diseases, sample:

```json
{
    "text":"    Iritis, iridocyclitis.",
    "spans": [
        {"start":4,"end":10,"token_start":1,"token_end":1,"label":"DISEASE"},
        {"start":12,"end":25,"token_start":3,"token_end":3,"label":"DISEASE"}
    ]
}
```

First generate the training data:


In [5]:
import json

resp = requests.get("https://raw.githubusercontent.com/MaastrichtU-IDS/prodigy-drug-indication-annotation/master/relation/dailymed_disease3.jsonl")
training_data = []

# Iterate over the lines of the file, and extract the annotations
for line in resp.text.split("\n"):
    try:
        annotations = json.loads(line)
    except Exception as _e:
        continue
    text = annotations['text']
    entity_mentions= []
    # Iterate text annotations, to only extract the one labelled as DISEASE entities
    for a in annotations['spans']:
        if a['label'] == 'DISEASE':
            entity_mention = (a['start'], a['end'], 'DISEASE')
        entity_mentions.append(entity_mention)

    if len(entity_mentions) > 0:
        training_data.append( (text,  {'entities': entity_mentions}))

print(f"⛏️ Extracted {len(training_data)} diseases for training, and formatted them as spaCy NER training data.")
print(training_data[0])

⛏️ Extracted 555 diseases for training, and formatted them as spaCy NER training data.
('Adverse events consisted of bacterial pneumonia in one case, staphylococcus bacteremia in one case and two cases of post-extubation stridor.', {'entities': [(28, 47, 'DISEASE')]})


Train the spacy `nlp` model:

In [6]:
import random
from tqdm import tqdm
from spacy.util import minibatch, compounding
from spacy.training import Example

# Training takes ~20s per iteration
training_iterations = 3

# Fine-tune blank english model
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
nlp.add_pipe('sentencizer')

ner.add_label("DISEASE")
finetuned_pipes = ["ner", "trf_wordpiecer", "trf_tok2vec"]
unaffected_pipes = [pipe for pipe in nlp.pipe_names if pipe not in finetuned_pipes]
with nlp.disable_pipes(*unaffected_pipes) :
    optimizer = nlp.begin_training()
    sizes = compounding(1.0, 4.0, 1.001)
    for itn in tqdm(range(training_iterations), "Training NER model"):
        # Shuffle examples before training
        random.shuffle(training_data)
        # Batch up the examples using spaCy's minibatch
        batches = minibatch(training_data, size=sizes)
        losses = {}
        for batch in batches:
            for text, annotations in batch:
                doc = nlp.make_doc(text)
                example = Example.from_dict(doc, annotations)
                nlp.update([example], drop=0.5, sgd=optimizer, losses=losses)
# Save model
# nlp.to_disk("../data/ner_model_disease")

Training NER model: 100%|██████████| 3/3 [00:47<00:00, 15.83s/it]


We can now rerun the `nlp` pipeline on the text to extract potential diseases

In [7]:
doc = nlp(annotate_text)

spacy.displacy.render(doc,jupyter=True, style = "ent")

# print("text | type | sentence")
# print("\n".join([f"{ent.text} | {ent.label_} | {ent.sent.text}" for ent in doc.ents]))

> You can also [improve `spacy` NER performances by using LLMs](https://spacy.io/usage/large-language-models#usage)

## 🔗 Link natural language descriptions to standard concepts URIs

Now that spacy extracted potential diseases for us we need to map those labels to standard concepts in an ontology.

Many solutions:
1. Use a search API, such as the one from BioPortal: [bioportal.bioontology.org/search](https://data.bioontology.org/search?q=Community-acquired%20bacterial%20pneumonia&ontologies=&include_properties=false&include_views=false&includeObsolete=false&require_definition=false&exact_match=false&apikey=8b5b7825-538d-40e0-9e9e-5ab9274a9aeb). But this would be too easy and it might not fit your use-case all the time
2. Build a semantic search index for your ontology 😎

We will go with option 2. To do this we will use a generic embedding model for english, and a similarity search engine (aka. vectorstore or vector database)

We will use the [Human Disease Ontology (DOID)](https://bioportal.bioontology.org/ontologies/DOID), because it is small enough to be quickly processed (18k classes, 23MB). But other ontologies such as NCIT are more complete

### 1. Get diseases labels and IDs from the ontology

In [16]:
from rdflib import Graph

doid_g = Graph(store="Oxigraph")
doid_g.parse("http://purl.obolibrary.org/obo/doid.owl", format="xml")
len(doid_g)

266481

In [9]:
# Recursively get all subClassesOf disease (doid:4)
query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX doid: <http://purl.obolibrary.org/obo/DOID_>

SELECT DISTINCT ?diseaseUri ?label WHERE {
    ?diseaseUri rdfs:subClassOf+ doid:4 ;
        rdfs:label ?label .
}
ORDER BY str(?diseaseUri)
"""
qres = doid_g.query(query)

print(f"🧫 Found {len(qres)} diseases")
# Print the 3 first results
for row in list(qres)[:3]:
    print(f"{row.label} {row.diseaseUri}")


🧫 Found 11500 diseases
angiosarcoma http://purl.obolibrary.org/obo/DOID_0001816
pterygium http://purl.obolibrary.org/obo/DOID_0002116
disease of metabolism http://purl.obolibrary.org/obo/DOID_0014667


### 2. Generate embeddings for each disease label and load them to the vector database

Now generate embeddings (vectors of floats from 0 to 1 representing a concept) for each disease using an open source generic embedding model, and store those embeddings in a vector database

A good way to quickly find embeddings model is to check the [Massive Text Embedding Benchmark (MTEB) leaderboard](https://huggingface.co/spaces/mteb/leaderboard) on HuggingFace

Here we will use open source libraries built by Qdrant 🇩🇪
* The BGE FlagEmbedding model through the `fastembed` library (optimized to run on CPU).
* The Qdrant vectorstore (easy to deploy, and to use, good abstraction without losing control)

> Many other dedicated vector databases are available: Milvius, Chroma, pinecone, etc
> 
> There are also extensions to store embeddings and perform similarity search in PostgreSQL: pgVector or LanternDB. And it is available in JSON-based databases: MongoDB, redis
> 
> Basically most modern systems now use embeddings models and similarity search for their search indexes.


In [17]:
from qdrant_client import QdrantClient
from qdrant_client.http.models import (
    Distance,
    PointStruct,
    VectorParams,
)
from fastembed.embedding import FlagEmbedding

res_list = list(qres)
# You can reduce the amount of diseases for the demo (it takes ~2min30s on 16 CPUs for the 11k diseases)
# res_list = list(qres)[:5000]

# Initialize embedding model and vector database
embedding_model = FlagEmbedding(model_name="BAAI/bge-small-en-v1.5", max_length=512)
embedding_size = 384

collection = "diseases"
vectordb = QdrantClient(
    location=":memory:",
)
vectordb.recreate_collection(
    collection_name=collection,
    vectors_config=VectorParams(size=embedding_size, distance=Distance.COSINE),
)

# Generate embeddings for the list of disease labels
labels_list = [row.label for row in res_list]
embeddings = list(embedding_model.embed(labels_list))
print(embeddings[0])

# Create the objects to insert in the collection with the embeddings, label and URI
points = [
    PointStruct(
        id=i,
        vector=embedding,
        payload={
            "uri": res_list[i].diseaseUri,
            "label": labels_list[i],
            "type": "disease",
        },
    )
    for i, embedding in enumerate(embeddings)
]
vectordb.upsert(collection_name=collection, points=points)

collection_info = vectordb.get_collection(collection_name=collection)
print(list(collection_info)[2])

Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 16792.14it/s]


[ 0.01414234 -0.013885    0.02664707 -0.00595295  0.02500878 -0.05168652
  0.05693207  0.03443561  0.03690271  0.01074176 -0.02259133 -0.02173014
  0.00696225 -0.0204661  -0.01870032  0.04507202 -0.05871626  0.02881367
 -0.06865767  0.04924264  0.0039005  -0.02974666 -0.04282469 -0.05110366
  0.00613714  0.02783325 -0.01617181 -0.00422632 -0.04081199 -0.10839949
 -0.0334617   0.01359641 -0.00690251 -0.0043321  -0.02209448 -0.02759799
 -0.05973151  0.06467617 -0.05012893  0.04891388  0.02927759  0.03886029
 -0.03951165  0.03423601 -0.0767827  -0.02751978 -0.02291217 -0.00599798
  0.09871525  0.03441411 -0.03673831 -0.06545779  0.00328765  0.0021383
 -0.00782257 -0.02303196  0.03252333 -0.01079618  0.05677114  0.01385299
  0.04179419 -0.00321197 -0.13654813  0.09156476 -0.04054898 -0.04316433
 -0.01886327 -0.01941874  0.03138032  0.03272972  0.00983377 -0.0179648
  0.00446826  0.05778679 -0.03451386 -0.04024907  0.04950196 -0.0144544
  0.03468213  0.04343939 -0.02649575  0.05254826 -0.01

Let's check if it works:

In [18]:
search = "headache"

def search_disease(search_query):
    query_embeddings = list(embedding_model.embed([search_query]))
    return vectordb.search(
        collection_name=collection,
        query_vector=query_embeddings[0],
        limit=5,
    )

hits = search_disease(search)
for hit in hits:
    print(f"{hit.payload['label']} | {hit.payload['uri']} [{hit.score}]")

migraine | http://purl.obolibrary.org/obo/DOID_6364 [0.8831740617752075]
migraine with aura | http://purl.obolibrary.org/obo/DOID_10024 [0.7594956159591675]
facial neuralgia | http://purl.obolibrary.org/obo/DOID_13865 [0.7179465293884277]
migraine without aura | http://purl.obolibrary.org/obo/DOID_12783 [0.7174647450447083]
familial hemiplegic migraine | http://purl.obolibrary.org/obo/DOID_0060178 [0.7143872976303101]


Finally let's run it on the entities extracted by spacy, and populate a RDF KG using the [Text Annotation Ontology](https://vemonet.github.io/tao).

We use `tao:parts_of` to link each annotation to the annotated text, and `dcterms:isPartOf` to link each annotation to the sentence they are in.

In [20]:
from rdflib import URIRef, Literal, RDFS, Namespace, RDF, XSD, DCTERMS

score_threshold = 0.8

BASE_NS = Namespace("https://w3id.org/myannotations/")
TAO = Namespace("http://pubannotation.org/ontology/tao.owl#")

g = Graph(store="Oxigraph")
g.bind("", BASE_NS)
g.bind("tao", TAO)
g.bind("doid", Namespace("http://purl.obolibrary.org/obo/DOID_"))

annotated_text_uri = BASE_NS["document0"]
g.add((annotated_text_uri, RDF.type, TAO.text_document))
g.add((annotated_text_uri, RDFS.label, Literal(annotate_text)))

sentence_uris = {}
for i, ent in enumerate(doc.ents):
    hits = search_disease(ent.text)
    # Only consider the first match
    if len(hits) > 0 and hits[0].score > score_threshold:
        print(f"{ent.text} > {hits[0].payload['label']} ({hit.payload['uri']}) [{hits[0].score}]")
        # Add the text annotation to the RDF graph
        annot_uri = BASE_NS[f"match{i}"]
        g.add((annot_uri, RDF.type, TAO.text_span))
        g.add((annot_uri, TAO.has_value, Literal(ent.text)))
        g.add((annot_uri, TAO.begins_at, Literal(ent.start_char, datatype=XSD.integer)))
        g.add((annot_uri, TAO.ends_at, Literal(ent.end_char, datatype=XSD.integer)))
        g.add((annot_uri, TAO.parts_of, annotated_text_uri))

        # Create or retrieve the URI for the sentence
        sentence_text = ent.sent.text
        if sentence_text not in sentence_uris:
            sentence_uri = BASE_NS[f"sentence{len(sentence_uris)}"]
            sentence_uris[sentence_text] = sentence_uri
            g.add((sentence_uri, RDF.type, TAO.text_span))
            g.add((sentence_uri, TAO.has_value, Literal(sentence_text)))
            g.add((sentence_uri, TAO.parts_of, annotated_text_uri))
        # We use a different predicate to link the annotation to the sentence
        g.add((annot_uri, DCTERMS.isPartOf, sentence_uris[sentence_text]))

        # Add link to the matching concept
        concept_uri = URIRef(hits[0].payload['uri'])
        g.add((annot_uri, TAO.denotes, concept_uri))
        g.add((concept_uri, RDFS.label, Literal(hits[0].payload['label'])))

g.serialize("diseases_annotations_kg.ttl", format="ttl")
# print(g.serialize(format="ttl"))

pneumonia > pneumonia (http://purl.obolibrary.org/obo/DOID_0060178) [1.0000001192092896]
respiratory > respiratory failure (http://purl.obolibrary.org/obo/DOID_0060178) [0.8170928955078125]
cognitive decline > subjective cognitive decline (http://purl.obolibrary.org/obo/DOID_0060178) [0.9172177314758301]
strokes > cerebral infarction (http://purl.obolibrary.org/obo/DOID_0060178) [0.816611111164093]
pneumonia > pneumonia (http://purl.obolibrary.org/obo/DOID_0060178) [1.0000001192092896]
asthma > asthma (http://purl.obolibrary.org/obo/DOID_0060178) [1.0]
emphysema > pulmonary emphysema (http://purl.obolibrary.org/obo/DOID_0060178) [0.934086799621582]
lung cancer > lung cancer (http://purl.obolibrary.org/obo/DOID_0060178) [1.0]
hypertension > hypertension (http://purl.obolibrary.org/obo/DOID_0060178) [1.0]
infectious > viral infectious disease (http://purl.obolibrary.org/obo/DOID_0060178) [0.8203649520874023]
malaria > malaria (http://purl.obolibrary.org/obo/DOID_0060178) [1.0000001192092

<Graph identifier=N8c6fb618fe6a4854b35205c6bb32b928 (<class 'rdflib.graph.Graph'>)>

## 😐 Sentiment analysis

In [13]:
from textblob import TextBlob

# Sentiment Analysis
blob = TextBlob(annotate_text)
if blob.sentiment.polarity > 0:
    print(f"👍 {blob.sentiment}")
else:
    print(f"👎 {blob.sentiment}")


👍 Sentiment(polarity=0.08600756239645128, subjectivity=0.4497620049703383)


## 🤗 Use LLMs

Use the [HuggingFace Inference API](https://huggingface.co/docs/api-inference/faq) to **query open source LLMs for free**, but with a limited context size. Just create an account on HuggingFace and get an access token.

> Put the access token in the `.hf_token` file in the same folder as the notebook

Query the Mixtral 8x7B model through HuggingFace HTTP API endpoint:

In [21]:
import os
import requests

question = "The world has tended to fight such threats one disease at a time\u2014whether it is smallpox, polio, TB, HIV/AIDS, or malaria\u2014and to adopt a firefighting approach when a deadly pandemic such as Ebola emerges."

with open(".hf_token", "r") as file:
    HF_TOKEN = file.read().strip()

API_URL = "https://api-inference.huggingface.co/models/mistralai/Mixtral-8x7B-Instruct-v0.1"
headers = {"Authorization": f"Bearer {HF_TOKEN}"}

def query_llm(input: str):
    payload={
        "inputs": f"[INST] From the text below, extract the diseases mentioned. Return the results as a JSON list of the extracted diseases label in the text. \nTEXT: {input}?\n[/INST]",
        "parameters": {
            "return_full_text": False
        }
    }
    return requests.post(API_URL, headers=headers, json=payload).json()

query_llm(question)

KeyboardInterrupt: 

Or use it with LangChain:

In [22]:
from langchain_community.llms import HuggingFaceHub
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

os.environ["HUGGINGFACEHUB_API_TOKEN"] = HF_TOKEN

template = """From the text below, extract the diseases mentioned.
Return the results as a JSON list of the extracted diseases label in the text

Text: {question}"""

prompt = PromptTemplate.from_template(template)
llm = HuggingFaceHub(
    repo_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
    # model_kwargs={"temperature": 0.5, "max_length": 64},
)
llm_chain = LLMChain(prompt=prompt, llm=llm)

# print(llm_chain.run(annotate_text))
for chunk in llm_chain.stream(question):
    print(chunk["text"], end="", flush=True)

From the text below, extract the diseases mentioned.
Return the results as a JSON list of the extracted diseases label in the text

Text: The world has tended to fight such threats one disease at a time—whether it is smallpox, polio, TB, HIV/AIDS, or malaria—and to adopt a firefighting approach when a deadly pandemic such as Ebola emerges.

{
  "diseases": [
    "smallpox",
    "polio",
    "TB",
    "HIV/AIDS",
    "malaria",
    "Ebola"
  ]
}