# 

# Gen3 Schema Development Framework 

This notebook provides a framework for developing a gen3 data schema using an ***existing data schema as a template***. For this project we used the Kidsfirst data. Working from the following resources: 

* [uc-cdis/kf-dictionary](https://github.com/uc-cdis/kf-dictionary/tree/develop)
* [AustralianBioCommons/gen3schemadev](https://github.com/AustralianBioCommons/gen3schemadev)
* [Thyroid dictionary graph structure](https://drive.google.com/file/d/1DsSTsl_jExKRcDp-OWfsMcSMb_firDKK/view?usp=sharing)

## Set up the environment

In [1]:
# Set working directory
!cd /Users/gsam0138/Documents/Repositories/Gen3-data-dictionary-dev

In [2]:
# Clone kf-dictionary repository, will be using this as template: 
!git clone https://github.com/uc-cdis/kf-dictionary.git

fatal: destination path 'kf-dictionary' already exists and is not an empty directory.


In [3]:
# Clone Australian BioCommons schema repository, this is where we will save thyroid schema:
!git clone https://github.com/AustralianBioCommons/gen3schemadev.git

fatal: destination path 'gen3schemadev' already exists and is not an empty directory.


In [21]:
# Copy the kf yamls across to the AustralianBioCommons/gen3schemadev codebase: 
!mkdir gen3schemadev/schema/kidsfirst
!cp -r kf-dictionary/gdcdictionary/schemas/* ./gen3schemadev/schema/kidsfirst

mkdir: gen3schemadev/schema/kidsfirst: File exists


In [22]:
# Setup virtual environment
!python3 -m venv .venv
!source .venv/bin/activate
!pip3 install -r ./gen3schemadev/requirements.txt


zsh:1: /Users/gsam0138/Documents/Repositories/Gen3-data-dictionary-dev/.venv/bin/pip3: bad interpreter: /Users/gsam0138/Documents/Gen3-data-dictionary-dev/.venv/bin/python3: no such file or directory
Collecting argparse (from -r ./gen3schemadev/requirements.txt (line 11))
  Using cached argparse-1.4.0-py2.py3-none-any.whl.metadata (2.8 kB)
Using cached argparse-1.4.0-py2.py3-none-any.whl (23 kB)
Installing collected packages: argparse
Successfully installed argparse-1.4.0


In [26]:
# Import necessary class and create ConfigBundle instance
import gen3schemadev
import os
from gen3schemadev.schemabundle import ConfigBundle

bundle = ConfigBundle("/Users/gsam0138/Documents/Repositories/Gen3-data-dictionary-dev/gen3schemadev/schema/kidsfirst")
os.makedirs("/Users/gsam0138/Documents/Repositories/Gen3-data-dictionary-dev/gen3schemadev/schema/thyroid", exist_ok=True)

ModuleNotFoundError: No module named 'gen3schemadev.schemabundle'

## Set up the template schema

In [7]:
# Dump contents of kidsfirst bundle into thyroid
bundle.dump("/Users/gsam0138/Documents/Repositories/Gen3-data-dictionary-dev/gen3schemadev/schema/thyroid")

NameError: name 'bundle' is not defined

In [74]:
# Check the properties of the participants.yaml file
bundle.objects['participant.yaml'].properties.keys()

dict_keys(['$ref_ubiq', 'days_to_lost_to_followup', 'disease_type', 'index_date', 'lost_to_followup', 'primary_site', 'is_proband', 'external_id', 'family_id', 'consent_type', 'projects', 'families'])

In [75]:
# Define properties to delete from the participants.yaml file quickly
deletion_list={}
deletion_list={
    "participant.yaml": ["days_to_lost_to_followup", "is_proband", "family_id", "families", "days_to_lost_to_followup", "lost_to_followup"]
}

In [81]:
# Summarise all properties from all bundle objects and output to excel
import pandas as pd

data = []
for obj in bundle.objects:
    for attrib in bundle.objects[obj].get_properties():
        data.append({"object":obj,"property":attrib})
dataframe = pd.DataFrame(data)

dataframe.to_excel("objects.xlsx", index=False)

## Export dataframe to Excel (for now)

**For now just using Excel, as having issues with getting Google Sheets to work with API. No time to fix it but once its fixed, follow instructions below.** 

This requires authorisation to access the specified google sheets document in your drive. To set this up: 

1. Go to the [Google Developers Console](https://console.cloud.google.com/welcome/new)
2. Select your project or create a new one
3. Navigate to `IAM & Admin` > `Service accounts`
4. Click `Create Service Account`
5. Fill in the required information and click `Create`
6. Assign the necessary roles (e.g., "Editor" or "Owner") to the service account
7. Click `Continue` and then `Done`
8. Click on the newly created service account
9. Navigate to the `Keys` tab and click `Add Key` > `JSON`

The JSON file will be downloaded to your computer. From here, you can provide the downloaded json file as your credentials file in the code block below. 


In [77]:
## THIS IS CURRENTLY BROKEN, DON'T RUN ##

# Authenticate and export data to google sheets
#!pip install gspread oauth2client
import gspread
from google.oauth2.service_account import Credentials

scope = ['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive']
client = gspread.authorize(creds)

try:
    # Load credentials from the credentials.json file
    creds = Credentials.from_service_account_file('/Users/gsam0138/Documents/gen3/gen3-421404-66ae97a6e8cc.json', scopes=scope)

    # Authorize with gspread
    client = gspread.authorize(creds)

    # Create a Google Sheets file
    sheet_name = 'Thyroid-gen3-test'
    sheet = client.create(sheet_name)
    
    # Or open an existing sheet
    # sheet = client.open(sheet_name) 

    print(f"Sheet '{sheet_name}' has been created!")

except Exception as e:
    print(f"An error occurred: {e}")

# Convert DataFrame to Google Sheets format
worksheet = sheet.get_worksheet(0)  # Get the first worksheet
worksheet.update([dataframe.columns.values.tolist()] + dataframe.values.tolist())


An error occurred: [Errno 2] No such file or directory: '/Users/gsam0138/Documents/gen3/gen3-421404-66ae97a6e8cc.json'


{'spreadsheetId': '1Gn9g3iwzGmit1rvHsCQRnjdyIv8DIQpSLwWm5bij7ow',
 'updatedRange': 'Sheet1!A1:B385',
 'updatedRows': 385,
 'updatedColumns': 2,
 'updatedCells': 770}

### Pulling data schema from google sheets
- pulls design from the following google sheet template [link](https://docs.google.com/spreadsheets/d/1zjDBDvXgb0ydswFBwy47r2c8V1TFnpUj1jcG0xsY7ZI/edit?usp=sharing)

In [None]:
# Pulling data schema from google sheets
# Note this is loads a gdoc template 
!rm -R schema_out
!python3 sheet2yaml-CLI.py --google-id '1zjDBDvXgb0ydswFBwy47r2c8V1TFnpUj1jcG0xsY7ZI' --objects-gid 0 --links-gid 270346573 --properties-gid 613332252 --enums-gid 1807456496

In [None]:
# Run if  umccr-dict does not exist
!cd "$(pwd)/../" && git clone https://github.com/AustralianBioCommons/umccr-dictionary.git


In [None]:
# Moving schema_out to umccr-dictionary
!mkdir -p ../umccr-dictionary/dictionary/schema_dev/gdcdictionary/schemas
!cp schema_out/* ../umccr-dictionary/dictionary/schema_dev/gdcdictionary/schemas/
!ls -lsha ../umccr-dictionary/dictionary/schema_dev/gdcdictionary/schemas/

In [None]:
# loading containers
!cd ../umccr-dictionary && make down
!cd ../umccr-dictionary && make pull
!cd ../umccr-dictionary && make up
!cd ../umccr-dictionary && make ps


In [None]:
# compiling and bundling into json
!cd ../umccr-dictionary && make compile program=schema_dev


In [None]:
# Running Validation
!cd ../umccr-dictionary && make validate program=schema_dev

In [None]:
# Visualising data dictionary
!open http://localhost:8080/#schema/schema_dev.json