# TDS Demo Preparation

This notebook includes functionality used to prepare for June 2023 demos.

In [1]:
import requests
import io
import json
from copy import deepcopy
import pandas as pd
import numpy as np
tds_url = 'http://localhost:8001'
mit_url = 'http://100.26.10.46'

## Forecast Hub Data Prep
This mirrors the code [used by CIEMSS](https://github.com/ciemss/pyciemss/blob/main/notebook/april_ensemble/covid_data/Covid_hosp_data.ipynb) to prepare data for the April Ensemble Challenge.

The prepared data includes incident cases, incident hospitalizations, and cumulative deaths for the United States.

In [2]:
url = 'https://media.githubusercontent.com/media/reichlab/covid19-forecast-hub/master/data-truth/truth-Incident%20Cases.csv'
raw_cases = pd.read_csv(url)
raw_cases.rename(columns={'value': 'cases'},inplace=True)
raw_cases_us = raw_cases[raw_cases['location_name']=='United States']

url = 'https://media.githubusercontent.com/media/reichlab/covid19-forecast-hub/master/data-truth/truth-Incident%20Hospitalizations.csv'
raw_hosp = pd.read_csv(url)
raw_hosp.rename(columns={'value': 'hospitalizations'},inplace=True)
raw_hosp_us = raw_hosp[raw_hosp['location_name']=='United States']

url = 'https://media.githubusercontent.com/media/reichlab/covid19-forecast-hub/master/data-truth/truth-Cumulative%20Deaths.csv'
raw_deaths = pd.read_csv(url)
raw_deaths.rename(columns={'value': 'deaths'},inplace=True)
raw_deaths_us = raw_deaths[raw_deaths['location_name']=='United States']
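
The three loads above share one pattern: read a truth file, rename `value`, and keep only the national rows. A minimal sketch of that pattern as a single helper, assuming the column layout shown in this notebook. The name `load_us_truth` is hypothetical, and reading `location` as a string is an assumption based on that column mixing `US` with numeric codes:

```python
import io

import pandas as pd


def load_us_truth(source, value_name, location_name="United States"):
    """Read a truth CSV, rename `value`, and keep rows for one location.

    `source` may be a URL, a local path, or a file-like object.
    """
    # Assumption: location codes mix 'US' with numeric codes, so read them as text.
    raw = pd.read_csv(source, dtype={"location": str})
    raw = raw.rename(columns={"value": value_name})
    return raw[raw["location_name"] == location_name].reset_index(drop=True)


# Tiny in-memory stand-in for a truth file (illustrative values only).
sample_csv = io.StringIO(
    "date,location,location_name,value\n"
    "2020-01-22,US,United States,1\n"
    "2020-01-22,01,Alabama,0\n"
)
sample_us = load_us_truth(sample_csv, "cases")
```

With the real files, the same helper would be called once per URL, for example `load_us_truth(url, 'hospitalizations')`.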


In [3]:
merged_df = pd.merge(raw_cases_us, raw_hosp_us, on=['date','location','location_name'], how='outer')
merged_df = pd.merge(merged_df, raw_deaths_us, on=['date','location','location_name'], how='outer')
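
An outer merge keeps every date that appears in any of the series and fills the gaps with NaN, which is why the earliest rows can have case counts but no hospitalization value. A small sketch with made-up numbers (the frames below are illustrative, not Forecast Hub data). Passing `validate="one_to_one"` is an optional safeguard that makes pandas raise if a key is duplicated on either side:

```python
import pandas as pd

keys = ["date", "location", "location_name"]

cases = pd.DataFrame({
    "date": ["2020-01-22", "2020-01-23"],
    "location": ["US", "US"],
    "location_name": ["United States", "United States"],
    "cases": [1.0, 0.0],
})
hosp = pd.DataFrame({
    "date": ["2020-01-23", "2020-01-24"],
    "location": ["US", "US"],
    "location_name": ["United States", "United States"],
    "hospitalizations": [5.0, 7.0],
})

# Dates present in only one frame survive the merge with NaN in the other column.
merged = pd.merge(cases, hosp, on=keys, how="outer", validate="one_to_one")
merged = merged.sort_values("date").reset_index(drop=True)
```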

In [4]:
merged_df.to_csv('forecast_hub_demo_data.csv', index=False)

## Dataset Profiling

Start by describing the dataset's metadata; the column definitions are added after profiling.

In [5]:
dataset = {
  "username": "Adam Smith",
  "name": "COVID-19 Forecast Hub Ground Truth Data",
  "description": "COVID-19 case incidents, hospitalization incidents and cumulative deaths provided by COVID-19 Forecast Hub.",
  "file_names": [
    "forecast_hub_demo_data.csv"
  ],
  "source": "https://github.com/reichlab/covid19-forecast-hub/blob/master/data-truth/README.md",
  }

Next, we'll profile the dataset via MIT's extraction service.

This currently requires sending an OpenAI API key. We'll read one from a local file. To reproduce this step, add your key to a file named openai_key alongside this notebook, or fill it in below.

In [6]:
# Use a context manager so the file handle is closed after reading.
with open('openai_key', 'r') as f:
    openai_key = f.read().strip()
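
If the key file is missing, the read above fails with `FileNotFoundError`. A minimal sketch of a loader that falls back to an environment variable, assuming that is acceptable for your setup. The function name `load_openai_key` and the variable name `OPENAI_API_KEY` are illustrative assumptions, not something the MIT service requires:

```python
import os
from pathlib import Path


def load_openai_key(path="openai_key", env_var="OPENAI_API_KEY"):
    """Return the key stored in `path` if that file exists, else from `env_var`."""
    key_file = Path(path)
    if key_file.is_file():
        return key_file.read_text().strip()
    # Fall back to the environment; an unset variable is treated as empty.
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"No key found in file {path!r} or variable {env_var}")
    return key
```

Calling `openai_key = load_openai_key()` would then stand in for the direct file read.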

The MIT extraction service works best with just the top few lines of a dataset; three or four lines seems to be the sweet spot.

In [7]:
df = merged_df

In [8]:
buffer = io.StringIO()
df.head(3).to_csv(buffer, index=False)
file_sample = buffer.getvalue()

In [9]:
print(file_sample)

date,location,location_name,cases,hospitalizations,deaths
2020-01-22,US,United States,1.0,,1.0
2020-01-23,US,United States,0.0,,1.0
2020-01-24,US,United States,1.0,,1.0



Now let's fetch the README for the demo data so that we can provide it as context to the MIT service.

In [10]:
doc = requests.get('https://raw.githubusercontent.com/reichlab/covid19-forecast-hub/master/data-truth/README.md').text

Send the sample to two MIT endpoints for extraction. Each call usually takes 30 to 90 seconds.

In [11]:
resp = requests.post(
    url=f"{mit_url}/annotation/link_dataset_col_to_dkg",
    params={"csv_str": file_sample, "doc": doc, "gpt_key": openai_key},
)
mit_groundings = resp.json()
mit_groundings

{'date': {'col_name': 'date',
  'concept': 'Date',
  'unit': 'YYYY-MM-DD',
  'description': 'The date of the data entry.',
  'dkg_groundings': [['apollosv:00000429', 'date'],
   ['oboinowl:date', 'date'],
   ['dc:date', 'Date'],
   ['geonames:2130188', 'Hakodate'],
   ['oboinowl:hasDate', 'has_date'],
   ['idocovid19:0001277', 'COVID-19 incidence', 'class'],
   ['ido:0000480', 'infection incidence', 'class'],
   ['idocovid19:0001283', 'SARS-CoV-2 incidence', 'class'],
   ['hp:0001402', 'Hepatocellular carcinoma', 'class'],
   ['oae:0000178', 'AE incidence rate', 'class'],
   ['orphanet.ordo:409966', 'point prevalence', 'class'],
   ['obcs:0000064', 'period prevalence', 'class'],
   ['cemo:weighted_prevalence', 'weighted prevalence', 'class'],
   ['idocovid19:0001272', 'COVID-19 prevalence', 'class'],
   ['ido:0000486', 'infection prevalence', 'class']]},
 'location': {'col_name': 'location',
  'concept': 'Location Code',
  'unit': 'Text',
  'description': 'The location code of the plac

In [12]:
resp = requests.post(
    url=f"{mit_url}/annotation/upload_file_extract/?gpt_key={openai_key}",
    files={"file": file_sample},
)
mit_annotations = {a['name']: a for a in resp.json()}
mit_annotations

{'date': {'type': 'variable',
  'name': 'date',
  'id': 'v0',
  'text_annotations': [' Date of the data entry'],
  'dkg_annotations': [['apollosv:00000429', 'date'],
   ['oboinowl:date', 'date']],
  'title': 'd30f8211b9f1611304ab63023c2db124__file',
  'data_annotations': []},
 'location': {'type': 'variable',
  'name': 'location',
  'id': 'v1',
  'text_annotations': [' Country of the data entry'],
  'dkg_annotations': [['so:0000199', 'translocation'],
   ['so:0001885', 'TFBS_translocation']],
  'title': 'd30f8211b9f1611304ab63023c2db124__file',
  'data_annotations': [[[7, 'us-counties-2023.csv', 1, 'county'],
    'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties-2023.csv']]},
 'location_name': {'type': 'variable',
  'name': 'location_name',
  'id': 'v2',
  'text_annotations': [' Name of the country of the data entry'],
  'dkg_annotations': [],
  'title': 'd30f8211b9f1611304ab63023c2db124__file',
  'data_annotations': [[[4, 'OxCGRT_nat_latest.csv', 2, 'RegionNa
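
Both extraction calls above can run for a while, and `requests` waits indefinitely unless a timeout is given. A hedged sketch of a small wrapper that sets a timeout and surfaces HTTP errors before decoding the body. The helper name `post_for_json` and the 180-second default are assumptions chosen to sit comfortably above the durations observed here:

```python
import requests


def post_for_json(url, timeout_s=180, **kwargs):
    """POST to `url` and return the decoded JSON body.

    `timeout_s` is handed to requests as its connect/read timeout.
    Extra keyword arguments (params, files, ...) pass through to requests.post.
    """
    resp = requests.post(url, timeout=timeout_s, **kwargs)
    # Raise requests.HTTPError on 4xx/5xx instead of trying to parse an error page.
    resp.raise_for_status()
    return resp.json()


# Hypothetical usage with the names defined earlier in this notebook:
# mit_groundings = post_for_json(
#     f"{mit_url}/annotation/link_dataset_col_to_dkg",
#     params={"csv_str": file_sample, "doc": doc, "gpt_key": openai_key},
# )
```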

In [13]:
columns = []
for c in df.columns:
    annotations = mit_annotations.get(c, {}).get("text_annotations", [])
    # Keep only list entries holding at least an identifier and a label. This skips the
    # single empty strings that are sometimes returned and drops extra items that are
    # sometimes included (usually the string 'class').
    raw_groundings = mit_groundings.get(c, {}).get('dkg_groundings', [])
    groundings = {g[0]: g[1] for g in raw_groundings
                  if isinstance(g, list) and len(g) >= 2}
    col = {
      "name": c,
      "data_type": "float",
      "annotations": annotations,
      "metadata": {},
      "grounding": {
          "identifiers": groundings,
      },
    }
    columns.append(col)

dataset['columns'] = columns
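
The grounding cleanup inside the loop is easier to check in isolation. A minimal sketch, using entries shaped like the `dkg_groundings` output shown earlier plus a stray empty string; the helper name `clean_groundings` is hypothetical:

```python
def clean_groundings(raw):
    """Map identifier -> label from a raw `dkg_groundings` list."""
    cleaned = {}
    for item in raw or []:
        # Skip stray entries such as '' and anything too short to hold a pair.
        if not isinstance(item, (list, tuple)) or len(item) < 2:
            continue
        # Only the first two items are used; extras such as 'class' are ignored.
        cleaned[item[0]] = item[1]
    return cleaned


example = [
    ["apollosv:00000429", "date"],
    ["idocovid19:0001277", "COVID-19 incidence", "class"],
    "",
]
example_groundings = clean_groundings(example)
```

A column with no groundings at all (for instance `clean_groundings(None)`) yields an empty dictionary rather than an error.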

The dataset is now ready to be sent to the data service.

In [14]:
with open('forecast_hub_demo_data.json', 'w') as f:
    json.dump(dataset, f)