### Unstructured Data

In [1]:
!head -100 ../data/unstructured/unstructured_gene_disease.txt

Gene::1 and Disease::MESH:D005909 show an improper regulation interaction linked to the disease, suggesting a pathogenic mechanism.
Gene::10 and Disease::MESH:C562839 are involved in an interaction related to causal mutations, suggesting a genetic link to the disease.
Gene::10 and Disease::MESH:D001172 are linked through an interaction where polymorphisms alter risk, suggesting genetic variations influence disease susceptibility.
Gene::10 and Disease::MESH:D001932 are linked through an interaction where polymorphisms alter risk, suggesting genetic variations influence disease susceptibility.
Gene::10 plays a role in disease pathogenesis of Disease::MESH:D003110
Gene::10 and Disease::MESH:D004409 are linked through an interaction where polymorphisms alter risk, suggesting genetic variations influence disease susceptibility.
Gene::10 and Disease::MESH:D006331 are linked through an interaction where polymorphisms alter risk, suggesting genetic variations influence disease susceptibi

In [2]:
from llama_index import Document
from llama_index import SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_files=["../data/unstructured/unstructured_gene_disease.txt"]
).load_data()

# Re-join the lines with blank lines so each record becomes its own paragraph.
document = Document(text="\n\n".join(documents[0].text.split("\n")))

print(f'document type: {type(documents)}')
print(f'total # docs: {len(documents)}')
print(f'doc type: {type(documents[0])}')
print(f'doc sample: {documents[0]}')

document type: <class 'list'>
total # docs: 1
doc type: <class 'llama_index.schema.Document'>
doc sample: Doc ID: c74c71d7-a842-492b-901c-8d7879d954c1
Text: Gene::1 and Disease::MESH:D005909 show an improper regulation
interaction linked to the disease, suggesting a pathogenic mechanism.
Gene::10 and Disease::MESH:C562839 are involved in an interaction
related to causal mutations, suggesting a genetic link to the disease.
Gene::10 and Disease::MESH:D001172 are linked through an interaction
where pol...
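
The cell above rebuilds the `Document` with every newline doubled. The likely intent is to put each record in its own paragraph before chunking, though the notebook does not say so. The sketch below shows the effect on two lines copied from the file. The names `raw_text`, `joined` and `paragraphs` are illustrative only.

```python
# Two lines copied from the head of unstructured_gene_disease.txt shown above.
raw_text = (
    "Gene::1 and Disease::MESH:D005909 show an improper regulation interaction "
    "linked to the disease, suggesting a pathogenic mechanism.\n"
    "Gene::10 plays a role in disease pathogenesis of Disease::MESH:D003110"
)

# Same transformation as in the cell above: split on newlines, join with blank lines.
joined = "\n\n".join(raw_text.split("\n"))

# Splitting and re-joining is equivalent to a plain string replace.
same_as_replace = joined == raw_text.replace("\n", "\n\n")

# Each original line is now a separate blank-line-delimited paragraph.
paragraphs = joined.split("\n\n")
print(len(paragraphs), same_as_replace)
```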


In [3]:
import torch

from llama_index import ServiceContext
from llama_index import VectorStoreIndex
from llama_index.llms import HuggingFaceLLM
from llama_index.prompts import PromptTemplate

# Wrap llama-index's internal default prompts in the instruction template
# taken from https://huggingface.co/Writer/camel-5b-hf
query_wrapper_prompt = PromptTemplate(
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{query_str}\n\n### Response:"
)

llm = HuggingFaceLLM(
    context_window=2048,
    max_new_tokens=256,
    # Greedy decoding (sampling parameters such as temperature apply only when do_sample=True).
    generate_kwargs={"do_sample": False},
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="Writer/camel-5b-hf",
    model_name="Writer/camel-5b-hf",
    device_map="auto",
    tokenizer_kwargs={"max_length": 2048},
    # model_kwargs={"torch_dtype": torch.float16}
)

service_context = ServiceContext.from_defaults(
    chunk_size=512,
    llm=llm,
    embed_model="local:BAAI/bge-small-en-v1.5"
)
index = VectorStoreIndex.from_documents([document],
                                        service_context=service_context)
query_engine = index.as_query_engine()

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [4]:
response = query_engine.query(
    "Describe the role of Gene::10 in different diseases."
)
print(str(response))

Token indices sequence length is longer than the specified maximum sequence length for this model (1132 > 512). Running this sequence through the model will result in indexing errors
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Gene::10 is associated with various diseases, including:
1. Disease::MESH:D004194 - Gene::10 is involved in the pathogenesis of this disease.
2. Disease::MESH:D005234 - Gene::10 is involved in the pathogenesis of this disease.
3. Disease::MESH:D006930 - Gene::10 is involved in the pathogenesis of this disease.
4. Disease::MESH:D006976 - Gene::10 is involved in the pathogenesis of this disease.
5. Disease::MESH:D007511 - Gene::10 is involved in the pathogenesis of this disease.
6. Disease::MESH:D009202 - Gene::10 is involved in the pathogenesis of this disease.
7. Disease::MESH:D009205 - Gene::10 is involved in the pathogenesis of this disease.
8. Disease::MESH:D014947 - Gene::10 is involved in the pathogenesis of this disease.
9. Disease::MESH:D014947 - Gene::10 is involved in the pathogenesis of this disease.
10. Disease::MESH:D014947 - Gene::
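
The answer above lists Disease::MESH:D014947 three times, so it is worth comparing against a deterministic pass over the source. The following is a minimal sketch, assuming each line mentions exactly one gene and one disease and that disease IDs are one capital letter followed by digits, as in the lines shown earlier. `diseases_by_gene` and both regular expressions are hypothetical helpers, not part of the notebook.

```python
import re
from collections import defaultdict

# Sample lines copied from the head of unstructured_gene_disease.txt shown earlier.
sample_lines = [
    "Gene::1 and Disease::MESH:D005909 show an improper regulation interaction "
    "linked to the disease, suggesting a pathogenic mechanism.",
    "Gene::10 and Disease::MESH:C562839 are involved in an interaction related "
    "to causal mutations, suggesting a genetic link to the disease.",
    "Gene::10 plays a role in disease pathogenesis of Disease::MESH:D003110",
]

# Assumed ID shapes, inferred from the sample lines only.
GENE_RE = re.compile(r"Gene::\d+")
DISEASE_RE = re.compile(r"Disease::MESH:[A-Z]\d+")


def diseases_by_gene(lines):
    """Map each gene ID to the set of disease IDs mentioned on the same line."""
    mapping = defaultdict(set)
    for line in lines:
        gene = GENE_RE.search(line)
        disease = DISEASE_RE.search(line)
        if gene and disease:
            mapping[gene.group()].add(disease.group())
    return mapping


gene_to_diseases = diseases_by_gene(sample_lines)
print(sorted(gene_to_diseases["Gene::10"]))
```

Run over the full file, the resulting set for `Gene::10` can be compared with the IDs the model returned.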


In [5]:
response = query_engine.query(
    "What is the top gene that causes disease? Describe the relationships with the disease."
)
print(str(response))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The top gene that causes disease is Gene::10534. It is linked to Disease::MESH:D009362 and Disease::MESH:D009369, which are both related to the disease.


In [6]:
response = query_engine.query(
    "List top 5 genes that is associated with Disease::MESH:D014947? Briefly describe how they are associated."
)
print(str(response))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


1. Gene::4597 - It is associated with Disease::MESH:D014947 due to its role in disease pathogenesis.
2. Gene::4598 - It is associated with Disease::MESH:D014947 due to its role in disease pathogenesis.
3. Gene::4599 - It is associated with Disease::MESH:D014947 due to its role in disease pathogenesis.
4. Gene::4597 - It is associated with Disease::MESH:D014947 due to its role in disease pathogenesis.
5. Gene::4598 - It is associated with Disease::MESH:D014947 due to its role in disease pathogenesis.


### Semi-structured Data

In [7]:
!head -100 ../data/semi_structured/semi_structured_gene_disease.txt

interaction-entities: Gene::1 and Disease::MESH:D005909
interaction-type: improper regulation linked to disease

interaction-entities: Gene::10 and Disease::MESH:C562839
interaction-type: causal mutations

interaction-entities: Gene::10 and Disease::MESH:D001172
interaction-type: polymorphisms alter risk

interaction-entities: Gene::10 and Disease::MESH:D001932
interaction-type: polymorphisms alter risk

interaction-entities: Gene::10 and Disease::MESH:D003110
interaction-type: role in pathogenesis

interaction-entities: Gene::10 and Disease::MESH:D004409
interaction-type: polymorphisms alter risk

interaction-entities: Gene::10 and Disease::MESH:D006331
interaction-type: polymorphisms alter risk

interaction-entities: Gene::10 and Disease::MESH:D010190
interaction-type: polymorphisms alter risk

interaction-entities: Gene::10 and Disease::MESH:D015179
interaction-type: role in pathogenesis

interaction-entities: Gene::10 and Disease::MESH:D015212
interactio

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
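
The warning above most likely appears because the `!head` shell command forks the kernel process after the tokenizer has already been used. As the message itself suggests, one option is to set `TOKENIZERS_PARALLELISM` explicitly. This sketch sets it to `"false"`, which assumes that giving up tokenizer parallelism is acceptable here.

```python
import os

# Should run before `tokenizers` is first used, e.g. in the first cell of the notebook.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```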


In [8]:
documents = SimpleDirectoryReader(
    input_files=["../data/semi_structured/semi_structured_gene_disease.txt"]
).load_data()

# Re-join the lines with blank lines so each record becomes its own paragraph.
document = Document(text="\n\n".join(documents[0].text.split("\n")))

print(f'document type: {type(documents)}')
print(f'total # docs: {len(documents)}')
print(f'doc type: {type(documents[0])}')
print(f'doc sample: {documents[0]}')

document type: <class 'list'>
total # docs: 1
doc type: <class 'llama_index.schema.Document'>
doc sample: Doc ID: 14d008d5-f947-4ea7-b850-147a5353245a
Text: interaction-entities: Gene::1 and Disease::MESH:D005909
interaction-type: improper regulation linked to disease  interaction-
entities: Gene::10 and Disease::MESH:C562839 interaction-type: causal
mutations  interaction-entities: Gene::10 and Disease::MESH:D001172
interaction-type: polymorphisms alter risk  interaction-entities:
Gene::10 and Disea...


In [9]:
index = VectorStoreIndex.from_documents([document],
                                        service_context=service_context)
query_engine = index.as_query_engine()

In [10]:
response = query_engine.query(
    "Describe the role of Gene::10 in different diseases."
)
print(str(response))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Gene::10 is associated with various diseases, including:
1. Alzheimer's disease
2. Parkinson's disease
3. Type 2 diabetes
4. Heart disease
5. Migraine
6. Sleep apnea
7. Crohn's disease
8. Rheumatoid arthritis
9. Osteoarthritis
10. Psoriasis
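
The disease names in the answer above do not appear in the excerpt of the source file, which identifies diseases only by MeSH ID. For reference, the records can be parsed deterministically. This is a minimal sketch, assuming every record is a blank-line-separated block with exactly the two keys shown above. `parse_records` is a hypothetical helper.

```python
# Records copied from the head of semi_structured_gene_disease.txt shown above.
sample_text = """\
interaction-entities: Gene::1 and Disease::MESH:D005909
interaction-type: improper regulation linked to disease

interaction-entities: Gene::10 and Disease::MESH:C562839
interaction-type: causal mutations

interaction-entities: Gene::10 and Disease::MESH:D001172
interaction-type: polymorphisms alter risk
"""


def parse_records(text):
    """Parse blank-line-separated 'key: value' blocks into (gene, disease, type) tuples."""
    records = []
    for block in text.strip().split("\n\n"):
        # Split each line once on ": " so the "::" inside the IDs is left alone.
        fields = dict(line.split(": ", 1) for line in block.splitlines())
        gene, disease = fields["interaction-entities"].split(" and ")
        records.append((gene, disease, fields["interaction-type"]))
    return records


records = parse_records(sample_text)
gene10 = [(disease, kind) for gene, disease, kind in records if gene == "Gene::10"]
print(gene10)
```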


In [11]:
response = query_engine.query(
    "What is the top gene that causes disease? Describe the relationships with the disease."
)
print(str(response))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The top gene that causes disease is Gene::1958. It is related to Disease::MESH:D005234, which is linked to Disease::MESH:D005354, which is linked to Disease::MESH:D005909, which is linked to Disease::MESH:D006528, which is linked to Disease::MESH:D006976, which is linked to Disease::MESH:D007029, which is linked to Disease::MESH:D007239, which is linked to Disease::MESH:D007938, which is linked to Disease::MESH:D007951, which is linked to Disease::MESH:D008107, which is linked to Disease::MESH:D009336, which is linked to Disease::MESH:D009369, which is linked to Disease::MESH:D050030, which is linked to Disease::MESH:D053447, which is linked to Disease::MESH:D050030, which is linked to Disease::MESH:D050030, which is linked to Disease::MESH:D050030, which is linked to Disease::MESH:D0500


In [12]:
response = query_engine.query(
    "List top 5 genes that is associated with Disease::MESH:D014947? Briefly describe how they are associated."
)
print(str(response))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


1. Gene::494332 - Improper regulation linked to disease
2. Gene::494333 - Improper regulation linked to disease
3. Gene::51497 - Improper regulation linked to disease
4. Gene::51497 and Disease::MESH:D002289 - Improper regulation linked to disease
5. Gene::51497 and Disease::MESH:D002292 - Improper regulation linked to disease


### Structured Data

In [13]:
!head -100 ../data/structured/structured_gene_disease.csv

,node_a,node_b,relation_entities,relation_type,source,relation_name
1694421,Gene::1,Disease::MESH:D005909,Gene:Disease,L,GNBR,improper regulation linked to disease
1694422,Gene::10,Disease::MESH:C562839,Gene:Disease,U,GNBR,causal mutations
1694423,Gene::10,Disease::MESH:D001172,Gene:Disease,Y,GNBR,polymorphisms alter risk
1694424,Gene::10,Disease::MESH:D001932,Gene:Disease,Y,GNBR,polymorphisms alter risk
1694425,Gene::10,Disease::MESH:D003110,Gene:Disease,J,GNBR,role in pathogenesis
1694426,Gene::10,Disease::MESH:D004409,Gene:Disease,Y,GNBR,polymorphisms alter risk
1694427,Gene::10,Disease::MESH:D006331,Gene:Disease,Y,GNBR,polymorphisms alter risk
1694428,Gene::10,Disease::MESH:D010190,Gene:Disease,Y,GNBR,polymorphisms alter risk
1694429,Gene::10,Disease::MESH:D015179,Gene:Disease,J,GNBR,role in pathogenesis
1694430,Gene::10,Disease::MESH:D015212,Gene:Disease,Y,GNBR,polymorphisms alter risk
1694431,Gene::10,Disease::MESH:D064420,Gene:Disease,J,GNBR,role in pathogenesis
1694

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [14]:
documents = SimpleDirectoryReader(
    input_files=["../data/structured/structured_gene_disease.csv"]
).load_data()

# Re-join the lines with blank lines so each record becomes its own paragraph.
document = Document(text="\n\n".join(documents[0].text.split("\n")))

print(f'document type: {type(documents)}')
print(f'total # docs: {len(documents)}')
print(f'doc type: {type(documents[0])}')
print(f'doc sample: {documents[0]}')

document type: <class 'list'>
total # docs: 1
doc type: <class 'llama_index.schema.Document'>
doc sample: Doc ID: 741786c6-b192-4904-be63-82a18c2481c3
Text: 1694421, Gene::1, Disease::MESH:D005909, Gene:Disease, L, GNBR,
improper regulation linked to disease 1694422, Gene::10,
Disease::MESH:C562839, Gene:Disease, U, GNBR, causal mutations
1694423, Gene::10, Disease::MESH:D001172, Gene:Disease, Y, GNBR,
polymorphisms alter risk 1694424, Gene::10, Disease::MESH:D001932,
Gene:Disease, Y, GNBR, polymorp...


In [15]:
index = VectorStoreIndex.from_documents([document],
                                        service_context=service_context)
query_engine = index.as_query_engine()

In [16]:
response = query_engine.query(
    "Describe the role of Gene::10 in different diseases."
)
print(str(response))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Gene::10 is associated with various diseases, including D003920 (MESH:D003920), which is related to Alzheimer's disease. It may play a role in the pathogenesis of Alzheimer's disease by influencing the production of amyloid beta protein, which is a key component of the brain's plaques.


In [17]:
response = query_engine.query(
    "What is the top gene that causes disease? Describe the relationships with the disease."
)
print(str(response))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The top gene that causes disease is Gene:Disease, L, GNBR, which is linked to the disease by the relationship "improper regulation linked to disease".
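
The answer above returns values from the `relation_entities`, `relation_type` and `source` columns instead of a gene ID. With tabular data, the question can be answered directly by counting. The sketch below uses only the standard library and assumes that "top" means the gene linked to the most distinct diseases. That reading is an assumption, since the query does not define it.

```python
import csv
import io
from collections import Counter

# Header and first rows copied from the head of structured_gene_disease.csv shown above.
sample_csv = """\
,node_a,node_b,relation_entities,relation_type,source,relation_name
1694421,Gene::1,Disease::MESH:D005909,Gene:Disease,L,GNBR,improper regulation linked to disease
1694422,Gene::10,Disease::MESH:C562839,Gene:Disease,U,GNBR,causal mutations
1694423,Gene::10,Disease::MESH:D001172,Gene:Disease,Y,GNBR,polymorphisms alter risk
1694424,Gene::10,Disease::MESH:D001932,Gene:Disease,Y,GNBR,polymorphisms alter risk
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Deduplicate (gene, disease) pairs so repeated relations count once per disease.
pairs = {(row["node_a"], row["node_b"]) for row in rows}
disease_counts = Counter(gene for gene, _ in pairs)

top_gene, n_diseases = disease_counts.most_common(1)[0]
print(top_gene, n_diseases)
```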


In [18]:
response = query_engine.query(
    "List top 5 genes that is associated with Disease::MESH:D014947? Briefly describe how they are associated."
)
print(str(response))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


1. Gene:Disease, L, GNBR, promotes progression
2. Gene:Disease, L, GNBR, promotes progression
3. Gene:Disease, L, GNBR, promotes progression
4. Gene:Disease, L, GNBR, promotes progression
5. Gene:Disease, L, GNBR, promotes progression
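
The five items in the answer above are identical and contain no gene ID. A direct filter on the CSV gives a reference answer. This is a sketch under stated assumptions: genes are ranked by the number of rows linking them to the disease, and `genes_for_disease` is a hypothetical helper. The sample rows shown earlier do not include Disease::MESH:D014947, so the example queries a disease that is present.

```python
import csv
import io
from collections import Counter

# Header and first rows copied from the head of structured_gene_disease.csv shown above.
sample_csv = """\
,node_a,node_b,relation_entities,relation_type,source,relation_name
1694421,Gene::1,Disease::MESH:D005909,Gene:Disease,L,GNBR,improper regulation linked to disease
1694422,Gene::10,Disease::MESH:C562839,Gene:Disease,U,GNBR,causal mutations
1694423,Gene::10,Disease::MESH:D001172,Gene:Disease,Y,GNBR,polymorphisms alter risk
"""


def genes_for_disease(rows, disease_id, limit=5):
    """Return up to `limit` (gene, relation names) pairs for one disease, most rows first."""
    counts = Counter()
    relations = {}
    for row in rows:
        if row["node_b"] == disease_id:
            counts[row["node_a"]] += 1
            relations.setdefault(row["node_a"], set()).add(row["relation_name"])
    return [(gene, sorted(relations[gene])) for gene, _ in counts.most_common(limit)]


rows = list(csv.DictReader(io.StringIO(sample_csv)))
result = genes_for_disease(rows, "Disease::MESH:D001172")
print(result)
```

On the full file, calling `genes_for_disease(rows, "Disease::MESH:D014947")` would give the list to compare with the model's answer.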
