# Tutorial: StarkQA-PrimeKG Loader

In this tutorial, we explain how to load the StarkQA-PrimeKG dataset, a question-answering benchmark built on the PrimeKG knowledge graph.

Background information about the StarkQA-PrimeKG dataset can be found in the following resources:
- https://github.com/snap-stanford/stark
- https://stark.stanford.edu/
- https://huggingface.co/datasets/snap-stanford/stark

We first import the necessary libraries.

In [1]:
# Import necessary libraries
import sys
sys.path.append('../../..')
from aiagents4pharma.talk2knowledgegraphs.datasets.starkqa_primekg import StarkQAPrimeKG






### Load StarkQA-PrimeKG

The `StarkQAPrimeKG` class downloads the data from the HuggingFace Hub if it is not available locally.

Otherwise, the data is loaded from the local directory given by `local_dir`.
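The local-first behaviour described above can be sketched in a few lines. This is only an illustration of the pattern, not the actual implementation of `StarkQAPrimeKG`: the function `load_local_or_download` and its `download` callback are hypothetical names, and in practice the callback would wrap a HuggingFace Hub download.

```python
import tempfile
from pathlib import Path
from typing import Callable


def load_local_or_download(local_file: Path, download: Callable[[Path], None]) -> str:
    """Return the file's text, calling `download` first only if the file is missing.

    `download` is a hypothetical callback that writes the file to `local_file`.
    """
    if not local_file.exists():
        # Create the parent directory so the downloader can write into it
        local_file.parent.mkdir(parents=True, exist_ok=True)
        download(local_file)
    return local_file.read_text(encoding="utf-8")


def fake_download(path: Path) -> None:
    """Stand-in for a real download: writes a tiny placeholder CSV."""
    print(f"Downloading to {path.name} ...")
    path.write_text("id,query,answer_ids\n", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "qa" / "stark_qa.csv"
    load_local_or_download(target, fake_download)  # file missing: downloads
    load_local_or_download(target, fake_download)  # file present: reads locally
```

The second call does not trigger the download, which mirrors the "already exists" message printed by the loader when the data is found locally.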

In [2]:
# Define starkqa primekg data by providing a local directory where the data is stored
starkqa_data = StarkQAPrimeKG(local_dir="../../../../data/starkqa_primekg_test/")

To load the StarkQA dataframe, its split indices, and the node information, we call `load_data()` and then use the getter methods as follows.

In [3]:
# Invoke a method to load the data
starkqa_data.load_data()

# Get the StarkQAPrimeKG data, which are the QA pairs, split indices, and the node information
starkqa_df = starkqa_data.get_starkqa()
starkqa_split_indices = starkqa_data.get_starkqa_split_indicies()
starkqa_node_info = starkqa_data.get_starkqa_node_info()

Loading StarkQAPrimeKG dataset...
../../../../data/starkqa_primekg_test/qa/prime/stark_qa/stark_qa.csv already exists. Loading the data from the local directory.


In [4]:
starkqa_data.starkqa

| | id | query | answer_ids |
|---|---|---|---|
| 0 | 0 | Could you identify any skin diseases associate... | [95886] |
| 1 | 1 | What drugs target the CYP3A4 enzyme and are us... | [15450] |
| 2 | 2 | What is the name of the condition characterize... | [98851, 98853] |
| 3 | 3 | What drugs are used to treat epithelioid sarco... | [15698] |
| 4 | 4 | Can you supply a compilation of genes and prot... | [7161, 22045] |
| ... | ... | ... | ... |
| 11199 | 11199 | Which gene or protein is not expressed in fema... | [2414] |
| 11200 | 11200 | Could you identify a biological pathway in whi... | [128199] |
| 11201 | 11201 | Is there an interaction between genes or prote... | [127611, 62903] |
| 11202 | 11202 | Which pharmacological agents that stimulate os... | [20180] |


### Check StarkQA-PrimeKG Dataframes


StarkQA dataframes contain the following columns:
- `id`: Unique identifier for each question and answer pair
- `query`: The synthesized question from the StarkQA dataset
- `answer_ids`: The list of identifiers of the nodes that answer the question (a question can have more than one answer)

In [5]:
# Check a sample of the starkqa primekg dataframe
starkqa_df.head()

| | id | query | answer_ids |
|---|---|---|---|
| 0 | 0 | Could you identify any skin diseases associate... | [95886] |
| 1 | 1 | What drugs target the CYP3A4 enzyme and are us... | [15450] |
| 2 | 2 | What is the name of the condition characterize... | [98851, 98853] |
| 3 | 3 | What drugs are used to treat epithelioid sarco... | [15698] |
| 4 | 4 | Can you supply a compilation of genes and prot... | [7161, 22045] |
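Since each `answer_ids` entry holds one or more identifiers, it can help to normalise the values before using them. The sketch below assumes that, depending on how the CSV was read, a value may arrive either as a Python list or as its string form such as `"[98851, 98853]"`; the helper `parse_answer_ids` is a hypothetical name and not part of the loader.

```python
import ast
from typing import List, Union


def parse_answer_ids(value: Union[str, List[int]]) -> List[int]:
    """Return answer IDs as a list of ints, given a list or its string form."""
    if isinstance(value, str):
        # literal_eval parses "[98851, 98853]" without executing arbitrary code
        value = ast.literal_eval(value)
    return [int(v) for v in value]


# Values copied from the sample rows shown above
print(parse_answer_ids("[98851, 98853]"))  # [98851, 98853]
print(parse_answer_ids([95886]))           # [95886]
```

On the dataframe itself, the same helper could be applied with `starkqa_df["answer_ids"].apply(parse_answer_ids)`, assuming the column needs this conversion at all.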


The current StarkQA-PrimeKG contains 11,204 question-answer pairs.

In [6]:
# Check dimensions of the starkqa primekg dataframe
starkqa_df.shape

(11204, 3)

### Check StarkQA-PrimeKG Node Information

StarkQA provides additional information for each PrimeKG node as a dictionary.

This allows us to further enrich the features of the knowledge graph nodes.

In [7]:
# Check the node information of PrimeKG
starkqa_node_info[0]

{'id': 9796,
 'type': 'gene/protein',
 'name': 'PHYHIP',
 'source': 'NCBI',
 'details': {'query': 'PHYHIP',
  '_id': '9796',
  '_score': 17.934021,
  'alias': ['DYRK1AP3', 'PAHX-AP', 'PAHXAP1'],
  'genomic_pos': {'chr': '8',
   'end': 22232101,
   'ensemblgene': 'ENSG00000168490',
   'start': 22219703,
   'strand': -1},
  'name': 'phytanoyl-CoA 2-hydroxylase interacting protein',
  'summary': 'Enables protein tyrosine kinase binding activity. Involved in protein localization. Located in cytoplasm. [provided by Alliance of Genome Resources, Apr 2022]'}}
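As one possible way to use this dictionary for enriching node features, the sketch below flattens a node entry into a single descriptive string (for example, as input to a text encoder). The function `node_to_text` is a hypothetical helper, and the code assumes that the fields under `details` are optional and may differ between node types.

```python
from typing import Any, Dict


def node_to_text(node: Dict[str, Any]) -> str:
    """Flatten a node-info dictionary into one descriptive string."""
    details = node.get("details") or {}
    parts = [f"{node['name']} ({node['type']}, source: {node['source']})"]
    # The fields below are treated as optional: not every node is assumed to have them
    if details.get("name"):
        parts.append(details["name"])
    alias = details.get("alias")
    if alias:
        # Assumption: a single alias might be stored as a plain string
        if isinstance(alias, str):
            alias = [alias]
        parts.append("aliases: " + ", ".join(alias))
    if details.get("summary"):
        parts.append(details["summary"])
    return ". ".join(parts)


# A trimmed copy of the node shown above
node = {
    "id": 9796,
    "type": "gene/protein",
    "name": "PHYHIP",
    "source": "NCBI",
    "details": {
        "alias": ["DYRK1AP3", "PAHX-AP", "PAHXAP1"],
        "name": "phytanoyl-CoA 2-hydroxylase interacting protein",
    },
}
print(node_to_text(node))
```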

### Check StarkQA-PrimeKG Splits

The StarkQA-PrimeKG splits contain train, validation, and test indices for benchmarking QA models, along with an additional, smaller `test-0.1` split.

In [8]:
# Check the split indices of the starkqa primekg dataframe
starkqa_split_indices.keys()

dict_keys(['train', 'val', 'test', 'test-0.1'])

Finally, we can check the number of samples in each split as follows.

In [9]:
# Check the number of samples in each split of the starkqa primekg dataframe
for split, idx in starkqa_split_indices.items():
    print(f"{split}: {len(idx)}")

train: 6162
val: 2241
test: 2801
test-0.1: 280
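As a quick sanity check on these numbers, the sketch below computes the share of each main split using the counts printed above and the total of 11,204 records reported earlier. The helper `split_fractions` is a hypothetical name; `test-0.1` is left out on the assumption that it is not part of the train/val/test partition.

```python
from typing import Dict


def split_fractions(sizes: Dict[str, int], total: int) -> Dict[str, float]:
    """Return each split's share of the dataset, rounded to three decimals."""
    return {name: round(count / total, 3) for name, count in sizes.items()}


# Counts copied from the output above
sizes = {"train": 6162, "val": 2241, "test": 2801}
total = 11204

# The three main splits together account for every record
print(sum(sizes.values()) == total)   # True
print(split_fractions(sizes, total))  # {'train': 0.55, 'val': 0.2, 'test': 0.25}
```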
