First, let's look at our use case of linking a model to one or more datasets:

In [1]:
# Load the one-model-to-many-datasets eval example and take a look
import json
from pprint import pprint
with open("./eval_datasets/one_to_many_datasets.json","r") as f:
    one_to_many_datasets=json.load(f)
print('Model Info:')
pprint(one_to_many_datasets['models'])
print('Datasets Info:')
for data in one_to_many_datasets['datasets']:
    print('\n','Dataset Name:',data['info']['name'],'\n','Dataset ID:',data['info']['id'])

Model Info:
[{'info': {'description': 'Model',
           'header': {'description': 'Susceptible Infection Recovered model '
                                     'for modeling the covid-19 pandemic in '
                                     'the U.S',
                      'model_version': '0.1',
                      'name': 'SIR Model for COVID-19 in the U.S',
                      'schema': 'https://raw.githubusercontent.com/DARPA-ASKEM/Model-Representations/petrinet_v0.5/petrinet/petrinet_schema.json'},
           'id': '75ea93af-ffdc-449d-8667-4b2e378d3df7',
           'metadata': {'annotations': {}},
           'model': {'states': [{'grounding': {'identifiers': {'ido': '0000514'},
                                               'modifiers': {}},
                                 'id': 'susceptible_population',
                                 'name': 'susceptible_population',
                                 'units': None},
                                {'grounding': {'identifiers

Now let's create documents representing the dataset and model objects, embed them using OpenAI embeddings, and store them in our vector store:

For now, each document contains only the object's name and description, but other fields from the objects could be included as well.
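As a rough illustration of what such a document could look like, here is a minimal sketch. The helper `build_document_text` and its `extra` parameter are hypothetical and are not part of the `embed` module; the sketch only shows how a name, a description, and optional additional fields might be flattened into one string before embedding.

```python
def build_document_text(name, description, extra=None):
    """Flatten an object's fields into a single string to embed.

    Hypothetical helper for illustration; the real document_embed
    function may build its documents differently.
    """
    parts = [f"name: {name}", f"description: {description}"]
    # Any additional fields are appended as "key: value" lines.
    for key, value in (extra or {}).items():
        parts.append(f"{key}: {value}")
    return "\n".join(parts)


doc_text = build_document_text(
    "SIR Model for COVID-19 in the U.S",
    "Susceptible Infection Recovered model for modeling the covid-19 pandemic in the U.S",
)
print(doc_text)
```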

In [2]:
from embed import document_embed
vs=document_embed(one_to_many_datasets['datasets']+one_to_many_datasets['models'],db_dir="./demo/datasets")

Now let's get the dataset matches for our example model as a ranked list. We use cosine similarity between the embedded vectors to surface the top k most relevant datasets:
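To make the ranking step concrete, here is a minimal sketch of top-k retrieval by cosine similarity, written in plain Python on toy two-dimensional vectors. The function names and the toy vectors are assumptions for illustration; the actual `find_dataset_semantic_matching` function queries the vector store and may be implemented differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_by_cosine(query_vec, doc_vecs, k=5):
    """Return the ids of the k documents most similar to query_vec.

    doc_vecs maps a document id to its embedding vector.
    """
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    # Highest similarity first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]


# Toy embeddings (made up for illustration, not real OpenAI embeddings)
toy_docs = {"a": [0.0, 1.0], "b": [1.0, 0.1], "c": [1.0, 1.0]}
toy_ranking = top_k_by_cosine([1.0, 0.0], toy_docs, k=2)
print(toy_ranking)
```

In this toy example, document "b" points almost in the same direction as the query, so it ranks first, followed by "c".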

In [3]:
from find import find_dataset_semantic_matching
pred_ranking=find_dataset_semantic_matching(one_to_many_datasets['models'][0]['info']['id'],db_dir="./demo/datasets")
print('Datasets Ranked by Semantic Relevance to Model:')
pred_ranking

Datasets Ranked by Semantic Relevance to Model:


['0c582c5a-d70c-41e3-924f-b401c6e62188',
 '0c582c5a-d70c-41e3-924f-b401c6e62188',
 '0c582c5a-d70c-41e3-924f-b401c6e62188',
 '35b1abf5-8e6c-47f7-b922-8e8141a8cdc4',
 '35b1abf5-8e6c-47f7-b922-8e8141a8cdc4']

We can compare this to the hand-annotated ground truth:

In [4]:
one_to_many_datasets['ground_truth']

[{'model_id': '75ea93af-ffdc-449d-8667-4b2e378d3df7',
  'ranking_lists': ['0c582c5a-d70c-41e3-924f-b401c6e62188',
   'de0e7a0c-9f39-4b3e-baea-9b25ee84d7cc',
   'a65e3c96-e188-42a2-8f58-d5f59a6f9282',
   '35b1abf5-8e6c-47f7-b922-8e8141a8cdc4',
   '3d50093c-f548-4b7c-a8bd-76648713a2da']}]
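One way to put a number on this comparison is recall at k: the fraction of ground-truth datasets that appear among the top k predictions. The notebook does not say which metric it uses, so the sketch below is an assumption; the helper names are hypothetical, and the IDs are copied from the two outputs above.

```python
def recall_at_k(predicted, relevant, k):
    """Fraction of relevant ids found in the first k predictions."""
    hits = set(predicted[:k]) & set(relevant)
    return len(hits) / len(relevant)


def count_duplicates(predicted):
    """Number of entries that repeat an id already in the list."""
    return len(predicted) - len(set(predicted))


pred_ranking = [
    "0c582c5a-d70c-41e3-924f-b401c6e62188",
    "0c582c5a-d70c-41e3-924f-b401c6e62188",
    "0c582c5a-d70c-41e3-924f-b401c6e62188",
    "35b1abf5-8e6c-47f7-b922-8e8141a8cdc4",
    "35b1abf5-8e6c-47f7-b922-8e8141a8cdc4",
]
ground_truth = [
    "0c582c5a-d70c-41e3-924f-b401c6e62188",
    "de0e7a0c-9f39-4b3e-baea-9b25ee84d7cc",
    "a65e3c96-e188-42a2-8f58-d5f59a6f9282",
    "35b1abf5-8e6c-47f7-b922-8e8141a8cdc4",
    "3d50093c-f548-4b7c-a8bd-76648713a2da",
]

print(recall_at_k(pred_ranking, ground_truth, k=5))  # 0.4
print(count_duplicates(pred_ranking))                # 3
```

Only two distinct dataset IDs appear in the predicted list, so the repeated entries take up slots that other relevant datasets could have filled.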

Now let's take a look at the one-model-to-one-dataset use case:

In [5]:
with open("./eval_datasets/one_to_one_features.json","r") as f:
    one_to_one_features=json.load(f)
print('Model Info:')
pprint(one_to_one_features['models'])
print('Datasets Info:')
print('\n','Dataset Name:',one_to_one_features['datasets'][0]['info']['name'],'\n','Dataset ID:',one_to_one_features['datasets'][0]['info']['id'])
for feature in one_to_one_features['datasets'][0]['info']['columns']:
    print('\n','Dataset Feature Name:',feature['name'],'\n','Dataset Feature Description:',feature['description'])

Model Info:
[{'info': {'description': 'Suspectable, Infected, Recovered Epidemology model',
           'header': {'description': 'Model',
                      'model_version': '0.1',
                      'name': 'Model',
                      'schema': 'https://raw.githubusercontent.com/DARPA-ASKEM/Model-Representations/petrinet_v0.5/petrinet/petrinet_schema.json'},
           'id': 'de5e51cc-8b57-45ec-b3af-67556797ad4c',
           'metadata': {'annotations': {}},
           'model': {'states': [{'grounding': {'identifiers': {'ido': '0000514'},
                                               'modifiers': {}},
                                 'id': 'susceptible_population',
                                 'name': 'susceptible_population',
                                 'units': None},
                                {'grounding': {'identifiers': {'ido': '0000511'},
                                               'modifiers': {}},
                                 'id': 'infected_popu

Now let's create documents representing the dataset features and model features, embed them using OpenAI embeddings, and store them in our vector store:

In [6]:
from embed import short_feature_embed
vs=short_feature_embed(one_to_one_features['datasets']+one_to_one_features['models'],db_dir="./demo/one_to_one_features")

Feature county on model United States COVID-19 Community Levels by County was embedded
Feature county_fips on model United States COVID-19 Community Levels by County was embedded
Feature state on model United States COVID-19 Community Levels by County was embedded
Feature county_population on model United States COVID-19 Community Levels by County was embedded
Feature health_service_area_number on model United States COVID-19 Community Levels by County was embedded
Feature health_service_area on model United States COVID-19 Community Levels by County was embedded
Feature health_service_area_population on model United States COVID-19 Community Levels by County was embedded
Feature covid_inpatient_bed_utilization on model United States COVID-19 Community Levels by County was embedded
Feature covid_hospital_admissions_per_100k on model United States COVID-19 Community Levels by County was embedded
Feature covid_cases_per_100k on model United States COVID-19 Community Levels by County was 

Now let's get a ranked list of dataset feature matches for each model feature in our example model.

Example LLM query used to rank features:
"""Here are a list of epidemiology dataset features with their names and descriptions.
        Please rank in order from most similar to least similar, the dataset features to an epidemiology 
        feature with the name {model_feature_name}.
        
        Output your ranking list in the following format:
            1. [dataset_feature_name]
            2. [dataset_feature_name]
            3. [dataset_feature_name]
            4. [dataset_feature_name]
            5. [dataset_feature_name]

        Dataset features list :
            {dataset_features_list}"""
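A minimal sketch of how a prompt like this might be filled in and how the numbered reply could be parsed back into a list. The template below is an abbreviated copy of the prompt above; `build_prompt` and `parse_ranking` are hypothetical helpers, not functions from the `find` module, and the example reply is made up to show the parsing only.

```python
import re

PROMPT_TEMPLATE = """Here are a list of epidemiology dataset features with their names and descriptions.
Please rank in order from most similar to least similar, the dataset features to an epidemiology
feature with the name {model_feature_name}.

Output your ranking list in the following format:
    1. [dataset_feature_name]
    2. [dataset_feature_name]

Dataset features list :
    {dataset_features_list}"""


def build_prompt(model_feature_name, dataset_features):
    """Fill the template from a list of {'name', 'description'} dicts."""
    features_text = "\n    ".join(
        f"{feature['name']}: {feature['description']}" for feature in dataset_features
    )
    return PROMPT_TEMPLATE.format(
        model_feature_name=model_feature_name,
        dataset_features_list=features_text,
    )


def parse_ranking(response_text):
    """Extract feature names from lines such as '1. [county_population]'."""
    ranking = []
    for line in response_text.splitlines():
        # Optional square brackets around the name; other lines are ignored.
        match = re.match(r"\s*\d+\.\s*\[?([^\]]+?)\]?\s*$", line)
        if match:
            ranking.append(match.group(1))
    return ranking


example_prompt = build_prompt(
    "susceptible_population",
    [{"name": "county_population", "description": "Population of the county"}],
)
# Made-up reply text, used only to exercise the parser
example_reply = "Here is the ranking:\n1. [county_population]\n2. covid_cases_per_100k"
print(parse_ranking(example_reply))
```

Parsing defensively matters here because an LLM reply may include extra sentences or drop the brackets.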

In [8]:
from find import find_dataset_features_basic_llm_query_1
pred_ranking=find_dataset_features_basic_llm_query_1(one_to_one_features['models'][0]['info']['id'],db_dir="./demo/one_to_one_features")
print('Dataset Feature Relevance Ranked by Semantic Relevance to Model Feature for Each Feature:')
pred_ranking

Dataset Feature Relevance Ranked by Semantic Relevance to Model Feature for Each Feature:


{'susceptible_population': [{'dataset_id': '41a32771-7a35-4da5-98a6-d420172108e8',
   'feature_name': 'county_population'},
  {'dataset_id': '41a32771-7a35-4da5-98a6-d420172108e8',
   'feature_name': 'health_service_area_population'},
  {'dataset_id': '41a32771-7a35-4da5-98a6-d420172108e8',
   'feature_name': 'covid_cases_per_100k'},
  {'dataset_id': '41a32771-7a35-4da5-98a6-d420172108e8',
   'feature_name': 'covid_hospital_admissions_per_100k'},
  {'dataset_id': '41a32771-7a35-4da5-98a6-d420172108e8',
   'feature_name': 'covid_inpatient_bed_utilization'},
  {'dataset_id': '41a32771-7a35-4da5-98a6-d420172108e8',
   'feature_name': 'covid-19_community_level'},
  {'dataset_id': '41a32771-7a35-4da5-98a6-d420172108e8',
   'feature_name': 'health_service_area'},
  {'dataset_id': '41a32771-7a35-4da5-98a6-d420172108e8',
   'feature_name': 'health_service_area_number'},
  {'dataset_id': '41a32771-7a35-4da5-98a6-d420172108e8',
   'feature_name': 'county'},
  {'dataset_id': '41a32771-7a35-4da5-9

We can compare this to the hand-annotated ground truth:

In [9]:
one_to_one_features['ground_truth']

[{'model_id': 'de5e51cc-8b57-45ec-b3af-67556797ad4c',
  'ranking_lists': {'susceptible_population': [{'dataset_id': '41a32771-7a35-4da5-98a6-d420172108e8',
     'feature_name': 'county_population'}],
   'infected_population': [{'dataset_id': '41a32771-7a35-4da5-98a6-d420172108e8',
     'feature_name': 'covid_cases_per_100k'}],
   'immune_population': [{'dataset_id': '41a32771-7a35-4da5-98a6-d420172108e8',
     'feature_name': 'county_population'}]}}]
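For feature-level rankings, one possible score is the reciprocal rank of the first correct dataset feature in each predicted list. This metric is an assumption, since the notebook does not name one, and `reciprocal_rank` is a hypothetical helper. The example uses only the first three predicted entries for `susceptible_population` shown above.

```python
def reciprocal_rank(predicted, relevant):
    """1 / rank of the first predicted entry found in relevant, else 0.0.

    Entries are dicts with 'dataset_id' and 'feature_name' keys.
    """
    relevant_keys = {(r["dataset_id"], r["feature_name"]) for r in relevant}
    for rank, entry in enumerate(predicted, start=1):
        if (entry["dataset_id"], entry["feature_name"]) in relevant_keys:
            return 1.0 / rank
    return 0.0


dataset_id = "41a32771-7a35-4da5-98a6-d420172108e8"
# Excerpt of the predicted list for susceptible_population
predicted_excerpt = [
    {"dataset_id": dataset_id, "feature_name": "county_population"},
    {"dataset_id": dataset_id, "feature_name": "health_service_area_population"},
    {"dataset_id": dataset_id, "feature_name": "covid_cases_per_100k"},
]
truth = [{"dataset_id": dataset_id, "feature_name": "county_population"}]

print(reciprocal_rank(predicted_excerpt, truth))  # 1.0
```

Averaging this value over all model features would give a mean reciprocal rank for the whole model.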

We can do the same when there is one model and many datasets: we first retrieve the top k datasets, then rank the top k features within those datasets for each model feature.
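The two-stage idea can be sketched on toy data as follows. Everything here is an assumption for illustration: the function `hierarchical_match`, the toy vectors, and the use of cosine similarity in both stages. The real `find_dataset_features_hierarchical` function works against the two vector stores and may rank features differently.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def hierarchical_match(model_vec, model_feature_vecs, datasets, k_datasets=1, k_features=2):
    """Shortlist datasets first, then rank features only inside the shortlist."""
    # Stage 1: keep the k_datasets datasets closest to the model embedding
    shortlist = sorted(datasets, key=lambda d: cosine(model_vec, d["vec"]),
                       reverse=True)[:k_datasets]
    candidates = [(d["id"], name, vec)
                  for d in shortlist for name, vec in d["features"].items()]
    # Stage 2: rank the shortlisted features for each model feature
    result = {}
    for feature_name, feature_vec in model_feature_vecs.items():
        ranked = sorted(candidates, key=lambda c: cosine(feature_vec, c[2]), reverse=True)
        result[feature_name] = [{"dataset_id": ds_id, "feature_name": name}
                                for ds_id, name, _ in ranked[:k_features]]
    return result


# Toy data (made up; not real embeddings)
toy_datasets = [
    {"id": "ds_a", "vec": [1.0, 0.0],
     "features": {"population": [0.9, 0.1], "cases": [0.2, 0.8]}},
    {"id": "ds_b", "vec": [0.0, 1.0],
     "features": {"rainfall": [0.1, 0.9]}},
]
toy_result = hierarchical_match([1.0, 0.1], {"susceptible_population": [1.0, 0.0]}, toy_datasets)
print(toy_result)
```

Restricting stage 2 to the shortlisted datasets keeps the number of feature comparisons small when there are many datasets.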

In [10]:
with open("./eval_datasets/one_to_many_features.json","r") as f:
    one_to_many_features=json.load(f)

In [None]:
feat_vs=short_feature_embed(one_to_many_features['datasets']+one_to_many_features['models'],db_dir="./demo/one_to_many/features")
doc_vs=document_embed(one_to_many_features['datasets']+one_to_many_features['models'],db_dir="./demo/one_to_many/datasets")

Feature county on model United States COVID-19 Community Levels by County was embedded
Feature county_fips on model United States COVID-19 Community Levels by County was embedded
Feature state on model United States COVID-19 Community Levels by County was embedded
Feature county_population on model United States COVID-19 Community Levels by County was embedded
Feature health_service_area_number on model United States COVID-19 Community Levels by County was embedded
Feature health_service_area on model United States COVID-19 Community Levels by County was embedded
Feature health_service_area_population on model United States COVID-19 Community Levels by County was embedded
Feature covid_inpatient_bed_utilization on model United States COVID-19 Community Levels by County was embedded
Feature covid_hospital_admissions_per_100k on model United States COVID-19 Community Levels by County was embedded
Feature covid_cases_per_100k on model United States COVID-19 Community Levels by County was 

In [None]:
from find import find_dataset_features_hierarchical
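# Look up matches for the first model in the eval file
model_id=one_to_many_features['models'][0]['info']['id']
pred_ranking=find_dataset_features_hierarchical(model_id,
                                                dataset_db_dir="./demo/one_to_many/datasets",
                                                features_db_dir="./demo/one_to_many/features")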
print('Dataset Features Ranked by Semantic Relevance to Model Features:')
pred_ranking

We can compare this to the hand-annotated ground truth:

In [None]:
# Assumes this eval file has a 'ground_truth' key like the other eval files
one_to_many_features['ground_truth']