#### The prompt we used to obtain output from three different GPT models. It provides additional context on how the CSVs were generated and what their columns refer to.

In [4]:
GPT_PROMPT = '''Identify and extract named entities for plant genes and their associations with traits from the input text. Focus exclusively on plant genes and traits, excluding entities from other species. Map identified traits to Trait Ontology (TO) Terms using the provided TO knowledge base. Identify the gene-trait relationship describe it in one word which can be a verb; and provide evidence along with the experimental method used to establish this relationship. If the gene-trait relationship is not mentioned explicitly in the text, no output is needed.

# Steps

1. **Identification**: 
   - Extract plant gene names and associated traits from the input text.
   
2. **Trait Mapping**:
   - Use the provided TO knowledge base to map each identified trait to the most specific TO Term using the 'name', 'definition', 'comment', and 'synonym' fields.

3. **Relationship Classification**:
   - Analyze the input text to determine the type of gene-trait relationship and describe it in on word which should be a verb
   - Extract a brief piece of text or inference as evidence for the relationship.
   - Identify the experimental method used, such as QTL, GWAS, gene knock-out, gene silencing, sequence analysis or gene overexpression.

4. **Data Compilation**:
   - Organize extracted entities and relevant information according to the output format.

# Output Format

```json
[
    {
        "gene": "<name of the gene>",
        "species": "<specify the latin species name the gene belongs to>",
        "trait_name": "<exact name of trait from the TO knowledge base>",
        "trait_id": "<corresponding TO id>",
        "relation_type": "<verb describing the gene-trait relation>",
        "evidence": "<brief evidence sentence from the input text supporting the relationship>",
        "method": "<experimental method that established the relationship, e.g., QTL, GWAS, gene knock-out, gene silencing, sequence analysis, gene overexpression>"
    }
]
```
'''
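
The output format above implies that each model reply can be loaded as a JSON list of records. A minimal sketch of that step follows, assuming the reply may or may not be wrapped in a json code fence. The helper name `parse_gpt_output` and the constant `EXPECTED_KEYS` are hypothetical and are not part of the original pipeline.

```python
import json

# Field names copied from the "Output Format" section of GPT_PROMPT.
EXPECTED_KEYS = {"gene", "species", "trait_name", "trait_id",
                 "relation_type", "evidence", "method"}

FENCE = "`" * 3  # a markdown code fence, built this way to avoid nesting fences


def parse_gpt_output(raw_text):
    """Return the records in a model reply that carry every expected key."""
    text = raw_text.strip()
    # Assumption: the reply may be wrapped in a json code fence, as in the prompt.
    if text.startswith(FENCE):
        text = text.strip("`")
        if text.startswith("json"):
            text = text[len("json"):]
    records = json.loads(text)
    if not isinstance(records, list):
        return []
    # Keep only dictionaries that contain all seven fields.
    return [r for r in records if isinstance(r, dict) and EXPECTED_KEYS <= r.keys()]
```

Records that lack a field are dropped rather than repaired, which keeps the sketch short. A real pipeline might log them instead.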

## Storing Trait Ontology Details in the Vector DB

In [None]:
# !pip install chromadb

import chromadb

In [2]:
# Extract the 'id' and 'name' values from one block of text
def extract_id_name(block):
    lines = block.splitlines()
    id_value = ""
    name_value = ""
    for line in lines:
        if line.startswith("id:"):
            # Split on the first colon only, so the rest of the line is kept intact
            id_value = line.split(":", 1)[1].strip()
        elif line.startswith("name:"):
            # Spaces in the name are replaced with underscores
            name_value = line.split(":", 1)[1].strip().replace(" ", "_")
    return id_value, name_value

# Read the entire content of the input file
input_file = 'trait_ontology_details.txt'
with open(input_file, 'r', encoding='utf-8') as file:
    content = file.read()

# Split the content by double newlines and ensure last block is captured
blocks = content.strip().split("\n\n")

trait_ids = {}

for block in blocks:
    if block.strip():
        term_id, term_name = extract_id_name(block)
        # Skip blocks without an 'id:' line so they are not stored under an empty key
        if term_id:
            trait_ids[term_id] = block
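
To make the shape of the resulting dictionary concrete, here is a small self-contained sketch of the same block-splitting idea applied to an inline sample. The sample text and its ids are made up for illustration and are not real Trait Ontology entries. The function name `parse_blocks` is hypothetical, and the sketch assumes the input file uses blank lines between terms.

```python
def parse_blocks(content):
    """Map each term id to its full text block, skipping blocks with no 'id:' line."""
    terms = {}
    for block in content.strip().split("\n\n"):
        for line in block.splitlines():
            if line.startswith("id:"):
                # Split on the first colon only, since ids such as "TO:..." contain one.
                term_id = line.split(":", 1)[1].strip()
                terms[term_id] = block
                break
    return terms


# Hypothetical sample, not real Trait Ontology content.
sample = """format-version: 1.2

id: TO:X000001
name: example trait one
def: "A made-up definition."

id: TO:X000002
name: example trait two
synonym: "second example"
"""

terms = parse_blocks(sample)
print(list(terms))  # ['TO:X000001', 'TO:X000002']
```

The header block has no `id:` line, so it is left out. Each value is the whole block, which is what gets embedded later.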

## Using a freely available embedding model from Hugging Face

In [None]:
import chromadb.utils.embedding_functions as embedding_functions

# For details, see https://docs.trychroma.com/integrations/hugging-face

huggingface_ef = embedding_functions.HuggingFaceEmbeddingFunction(
    api_key="YOUR_API_KEY",
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

In [None]:
client = chromadb.PersistentClient(path='.')

In [None]:
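# get_or_create_collection avoids an error when this cell is re-run against an existing store
collection_trait_details = client.get_or_create_collection(
    name="trait_details",
    embedding_function=huggingface_ef,
    metadata={"hnsw:space": "cosine"}  # use cosine distance for the HNSW index
)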

In [None]:
collection_trait_details.add(
    documents=list(trait_ids.values()),
    ids=list(trait_ids.keys())
)
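
Once the terms are stored, a trait phrase can be matched against them with the collection's `query` method. The helper below is a minimal sketch, not part of the original notebook. The name `top_trait_matches` and the default of three results are assumptions.

```python
def top_trait_matches(collection, trait_text, n_results=3):
    """Return (trait_id, distance) pairs for the stored terms closest to trait_text."""
    result = collection.query(query_texts=[trait_text], n_results=n_results)
    # query() returns one inner list per query text; a single text was sent.
    return list(zip(result["ids"][0], result["distances"][0]))
```

With the collection created above, a call such as `top_trait_matches(collection_trait_details, "plant height")` would return the nearest term ids. Because the collection uses cosine distance, smaller values indicate closer matches.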

In [8]:
# Add your code from here onwards --- 