<a href="https://polly.elucidata.io/manage/workspaces?action=open_polly_notebook&source=github&path=path_place_holder&kernel=elucidata/Python 3.10&machine=medium" target="_parent"><img src="https://elucidatainc.github.io/PublicAssets/open_polly.svg" alt="Open in Polly"/></a>


# Ingestion Case 1

## Task: Upload selected datasets from an existing OmixAtlas to a new OmixAtlas

Assumptions:
1. No custom curation is required
2. The schema of the new OA differs from that of the existing OA
3. Indexing is done only for dataset-level and sample-level metadata

Steps:
#### Create OA
1. Create OA on Test Polly
#### Schema management
2. Create the schema in a csv file
3. Prepare the schema payload for dataset-level and sample-level metadata only
4. Insert the schema for dataset-level and sample-level metadata
5. Verify that the schema is properly inserted
#### Preparing files
6. Fetch the template that the dataset-level metadata should follow
7. Prepare the dataset-level metadata
8. Verify that the dataset-level metadata follows the template
9. Prepare the gct files to be uploaded
#### Ingestion
10. Ingestion
11. Verify indexing

In [1]:
!sudo pip3 install https://elucidatainc.github.io/PublicAssets/builds/polly-python/polly_python-0.1.5-py3-none-any.whl

Collecting polly-python==0.1.5
  Downloading https://elucidatainc.github.io/PublicAssets/builds/polly-python/polly_python-0.1.5-py3-none-any.whl (52 kB)
Collecting python-magic==0.4.24
  Downloading python_magic-0.4.24-py2.py3-none-any.whl (12 kB)
Collecting rst2txt
  Downloading rst2txt-1.1.0-py2.py3-none-any.whl (12 kB)
Collecting postpy2==0.0.6
  Downloading postpy2-0.0.6-py3-none-any.whl (17 kB)
Collecting Cerberus==1.3.2
  Downloading Cerberus-1.3.2.tar.gz (52 kB)
Collecting pytest
  Downloading pytest-7.1.2-py3-none-any.whl (297 kB)
Collecting cmapPy
  Downloading cmapPy-4.0.1-py2.py3-none-any.whl (150 kB)
Collecting joblib
  Downloading joblib-1.1.0-py2.py3-none-any.whl (306 kB)


In [1]:
! polly files copy -y -s polly://dataset_schema.csv -d ./ 
! polly files copy -y -s polly://sample_schema.csv -d ./ 

[32m[1mA new version of Polly CLI is available. To update, execute the command npm update -g @elucidatainc/pollycli[22m[39m
[32m[1mRefreshing session...[22m[39m
[32m[1mSession refreshed![22m[39m
polly://dataset_schema.csv
./
progress [████████████████████████████████████████] 100% | 1.198 KB/1.198 KB | ETA: 0s | time elapsed: 0s
Success: Download complete
A new version of Polly CLI is available. To update, execute the command npm update -g @elucidatainc/pollycli
polly://sample_schema.csv
./
progress [████████████████████████████████████████] 100% | 831 Bytes/831 Bytes | ETA: 0s | time elapsed: 0s
Success: Download complete


### Importing dependencies and authenticating into Polly

In [31]:
import os
from polly.omixatlas import OmixAtlas
from polly.workspaces import Workspaces

from polly.auth import Polly
#AUTH_KEY = 'KEY' # prod polly
AUTH_KEY= "KEY" # test polly
Polly.auth(AUTH_KEY, env = "testpolly")

omixatlas = OmixAtlas()
workspaces = Workspaces()

### generating schema payload

This block of code generates the schema payload from a schema that has been prepared in a csv file.

**Input:** csv file containing the schema 

**Output:** payload to be ingested to the OmixAtlas

In [4]:
import pandas as pd

def get_fields(schema_file):
    schema_df = pd.read_csv(schema_file)
    # Expected columns of the schema csv and the type each one is cast to
    col_types = {
        "field_name": str,
        "original_name": str,
        "type": str,
        "is_keyword": bool,
        "is_array": bool,
        "is_filter": bool,
        "is_column": bool,
        "filter_size": int,
        "display_name": str,
        "description": str,
    }
    schema_df = schema_df.astype(col_types)
    # Normalize field names to lowercase with underscores instead of spaces
    schema_df["field_name"] = schema_df["field_name"].apply(lambda x: x.lower().strip().replace(" ", "_"))
    schema_df["type"] = schema_df["type"].str.lower().str.strip()
    # astype(str) turns empty descriptions into the string "nan"; replace only
    # whole-value matches so that words containing "nan" are left untouched
    schema_df["description"] = schema_df["description"].replace("nan", "NA")
    
    object_rows = schema_df[schema_df["type"] == "object"]
    object_rows_array = object_rows[object_rows["is_array"] == 1]
    if len(object_rows_array.index) > 0:
        raise ValueError("Fields with type 'object' cannot be arrays")
    object_rows_keyword = object_rows[object_rows["is_keyword"] == 1]
    if len(object_rows_keyword.index) > 0:
        raise ValueError("Fields with type 'object' cannot be keywords")
    
    field_list = schema_df.to_dict(orient="records")
    fields = {}
    for field in field_list:
        field_name = field.pop("field_name")
        fields[field_name] = field
    
    return fields

def generate_schema_payload(schema_type, schema_file):
    # Relies on the global variable repo_id, which must be set (as a string) before calling
    return {
        "data": {
            "type": "schemas",
            "id": repo_id,
            "attributes": {
                "repo_id": repo_id,
                "schema_type": schema_type,
                "schema": {
                    "all": {
                        "all": get_fields(schema_file)
                    }
                }
            }
        }
    }
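
For reference, the schema csv consumed by `get_fields` needs one row per field and the columns listed in `col_types`. The following is a minimal sketch of such a file, written with the standard `csv` module; the two rows mirror fields from the dataset-level payload in this notebook, and `SCHEMA_COLUMNS` and `write_schema_csv` are hypothetical names introduced only for illustration.

```python
import csv
import io

# Column names taken from the col_types mapping used by get_fields
SCHEMA_COLUMNS = [
    "field_name", "original_name", "type", "is_keyword", "is_array",
    "is_filter", "is_column", "filter_size", "display_name", "description",
]

# Two illustrative rows, one per schema field
rows = [
    {"field_name": "dataset_source", "original_name": "dataset_source",
     "type": "text", "is_keyword": True, "is_array": False,
     "is_filter": False, "is_column": False, "filter_size": 1,
     "display_name": "Source",
     "description": "Source from where the data was fetched"},
    {"field_name": "dataset_id", "original_name": "dataset_id",
     "type": "text", "is_keyword": True, "is_array": False,
     "is_filter": False, "is_column": True, "filter_size": 1,
     "display_name": "Dataset ID",
     "description": "Unique ID associated with every dataset"},
]

def write_schema_csv(rows, handle):
    """Write schema rows to an open text handle with the expected header."""
    writer = csv.DictWriter(handle, fieldnames=SCHEMA_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)

# Write to an in-memory buffer; pass an open file instead to create a csv on disk
buffer = io.StringIO()
write_schema_csv(rows, buffer)
lines = buffer.getvalue().splitlines()
header = lines[0].split(",")
print(lines[0])
```

Writing to a real file (for example `open("dataset_schema.csv", "w", newline="")`) instead of the buffer would produce a csv with the column layout that `get_fields` reads.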

### generating schema for dataset level metadata (files)

Calling the functions above to generate the schema payload. 

Here, repo_id must be given as a string. The file is the input csv file that contains the schema.

In [65]:
print("Creating dataset-level metadata schema...")
repo_id = "1657110718820"
file = "dataset_schema.csv"
payload_d = generate_schema_payload("files", file)
payload_d

Creating dataset-level metadata schema...


{'data': {'type': 'schemas',
  'id': '1657110718820',
  'attributes': {'repo_id': '1657110718820',
   'schema_type': 'files',
   'schema': {'all': {'all': {'dataset_source': {'original_name': 'dataset_source',
       'type': 'text',
       'is_keyword': True,
       'is_array': False,
       'is_filter': False,
       'is_column': False,
       'filter_size': 1,
       'display_name': 'Source',
       'description': 'Source from where the data was fetched'},
      'dataset_id': {'original_name': 'dataset_id',
       'type': 'text',
       'is_keyword': True,
       'is_array': False,
       'is_filter': False,
       'is_column': True,
       'filter_size': 1,
       'display_name': 'Dataset ID',
       'description': 'Unique ID assocaited with every dataset'},
      'description': {'original_name': 'description',
       'type': 'text',
       'is_keyword': False,
       'is_array': False,
       'is_filter': False,
       'is_column': False,
       'filter_size': 1,
       'display_na

### adding dataset level schema to OA

The payload above can be passed to the polly-python function insert_schema to add the schema to the new OmixAtlas.

In [67]:
omixatlas.insert_schema(1657110718820, payload_d)

'{"data": {"type": "schemas", "id": "1657110718820", "attributes": {"repo_id": "1657110718820", "schema_type": "files", "schema": {"all": {"all": {"dataset_source": {"type": "text", "is_keyword": true, "is_array": false, "is_filter": false, "is_column": false, "is_curated": false, "is_ontology": false, "filter_size": 1, "original_name": "dataset_source", "display_name": "Source", "description": "Source from where the data was fetched"}, "dataset_id": {"type": "text", "is_keyword": true, "is_array": false, "is_filter": false, "is_column": true, "is_curated": false, "is_ontology": false, "filter_size": 1, "original_name": "dataset_id", "display_name": "Dataset ID", "description": "Unique ID assocaited with every dataset"}, "description": {"type": "text", "is_keyword": false, "is_array": false, "is_filter": false, "is_column": false, "is_curated": false, "is_ontology": false, "filter_size": 1, "original_name": "description", "display_name": "Title", "description": "Description of the data

### generating schema for sample level metadata (gct_metadata)

Just like the dataset-level schema, the sample-level schema should also be added to the OmixAtlas.

In [19]:
print("Creating sample-level metadata schema...")
repo_id = "1657110718820" # must be given as a string (repo_name is not accepted here)
file = "sample_schema.csv"
payload_s = generate_schema_payload("gct_metadata", file)
payload_s

Creating sample-level metadata schema...


{'data': {'type': 'schemas',
  'id': '1657110718820',
  'attributes': {'repo_id': '1657110718820',
   'schema_type': 'gct_metadata',
   'schema': {'all': {'all': {'sample_id': {'original_name': 'sample_id',
       'type': 'text',
       'is_keyword': True,
       'is_array': False,
       'is_filter': False,
       'is_column': True,
       'filter_size': 1,
       'display_name': 'Sample ID',
       'description': 'Unique ID associated with every sample'},
      'dataset_id': {'original_name': 'dataset_id',
       'type': 'text',
       'is_keyword': True,
       'is_array': True,
       'is_filter': False,
       'is_column': True,
       'filter_size': 1,
       'display_name': 'Dataset ID',
       'description': 'Unique ID of the dataset to which the sample belongs'},
      'timestamp_': {'original_name': 'timestamp_',
       'type': 'text',
       'is_keyword': False,
       'is_array': False,
       'is_filter': False,
       'is_column': False,
       'filter_size': 1,
       'd

### adding sample level schema to OA

The payload above can be passed to the polly-python function insert_schema to add the schema to the new OmixAtlas.

In [20]:
omixatlas.insert_schema(1657110718820, payload_s)

'{"data": {"type": "schemas", "id": "1657110718820", "attributes": {"repo_id": "1657110718820", "schema_type": "gct_metadata", "schema": {"all": {"all": {"sample_id": {"type": "text", "is_keyword": true, "is_array": false, "is_filter": false, "is_column": true, "is_curated": false, "is_ontology": false, "filter_size": 1, "original_name": "sample_id", "display_name": "Sample ID", "description": "Unique ID associated with every sample"}, "dataset_id": {"type": "text", "is_keyword": true, "is_array": true, "is_filter": false, "is_column": true, "is_curated": false, "is_ontology": false, "filter_size": 1, "original_name": "dataset_id", "display_name": "Dataset ID", "description": "Unique ID of the dataset to which the sample belongs"}, "timestamp_": {"type": "text", "is_keyword": true, "is_array": false, "is_filter": false, "is_column": false, "is_curated": false, "is_ontology": false, "filter_size": 2000, "original_name": "kw_timestamp_", "display_name": "Timestamp", "description": "Unix 

### validating if the schema is appropriately added

In [68]:
schema = omixatlas.get_schema(1657110718820)
schema.dataset

| | Source | Datatype | Field Name | Field Description | Field Type | Is Curated | Is Array |
|---|---|---|---|---|---|---|---|
| 0 | all | all | curated_organism | Orgnism from which the samples were derived | text | False | True |
| 1 | all | all | curated_tissue | Tissue from which the samples were derivved | text | False | True |
| 2 | all | all | total_num_samples | Total number of samples in a dataset | integer | False | False |
| 3 | all | all | dataset_source | Source from where the data was fetched | text | False | False |
| 4 | all | all | dataset_id | Unique ID assocaited with every dataset | text | False | False |
| 5 | all | all | data_type | The type of biomolecular data captured (eg - E... | text | False | False |
| 6 | all | all | description | Description of the dataset | text | False | False |
| 7 | all | all | curated_cell_line | Cell lines from which the samples were derived... | text | False | True |
| 8 | all | all | curated_disease | Disease associated with the dataset | text | False | True |
| 9 | all | all | curated_drug | Drugs administered in the samples belonging to... | text | False | True |


In [22]:
schema.sample

| | Source | Datatype | Field Name | Field Description | Field Type | Is Curated | Is Array |
|---|---|---|---|---|---|---|---|
| 0 | all | all | src_repo | Name of the repository this data entity origin... | text | False | False |
| 1 | all | all | id_key | Name of the key that was used for creation of ... | text | False | False |
| 2 | all | all | src_uri | Unique URI derived from source data file's S3 ... | text | False | False |
| 3 | all | all | sample_id | Unique ID associated with every sample | text | False | False |
| 4 | all | all | dataset_id | Unique ID of the dataset to which the sample b... | text | False | True |
| 5 | all | all | curated_cell_line | Cell line from which the sample was derived | text | False | False |
| 6 | all | all | curated_gene | Gene of interest in the sample | text | False | True |
| 7 | all | all | curated_drug | Drug admistered in the sample | text | False | True |
| 8 | all | all | curated_disease | Disease associated with the sample | text | False | True |
| 9 | all | all | src_dataset_id | Dataset ID of the file this data entity origin... | text | False | False |


## Preparing dataset level metadata files (.json)

### fetching template for the same

To prepare the dataset-level metadata, the user should ensure that the keys of the json files they prepare adhere to the dataset-level schema.

To facilitate this, users can call the function dataset_metadata_template to generate the template. The fields in the template are mandatory in the dataset-level metadata.

In [69]:
template = omixatlas.dataset_metadata_template(1657110718820)

In [70]:
template

{'curated_organism': [],
 'curated_tissue': [],
 'total_num_samples': 'integer',
 'dataset_source': 'text',
 'dataset_id': 'text',
 'data_type': 'text',
 'description': 'text',
 'curated_cell_line': [],
 'curated_disease': [],
 'curated_drug': [],
 'curated_cell_type': [],
 '__index__': {'file_metadata': True,
  'col_metadata': True,
  'row_metadata': False,
  'data_required': False}}

### from source OmixAtlas, fetching the dataset level metadata for the datasets to be ingested in new OmixAtlas

Because some of the dataset-level metadata is the same in the source OA and the destination OA, we can query the source OA and reuse that information to prepare the dataset-level metadata for the destination OA.

In [26]:
# Named dataset_ids rather than list, which would shadow the Python built-in
dataset_ids = ['PXD002408_Biological_Replicate_1', 'PXD002408_Biological_Replicate_2']
query = f"SELECT * from valo_onco.datasets WHERE dataset_id IN {tuple(dataset_ids)}"
metadata = omixatlas.query_metadata(query)
metadata

Query execution succeeded (time taken: 1.14 seconds, data scanned: 0.007 MB)
Fetched 2 rows


Unnamed: 0,curated_organism,dosage,src_uri,total_num_samples,year,ingestion_approved,description,curated_cell_line,data_table_name,data_table_version,platform,exposure_time,timestamp_,perturbation_type,file_type,publication,curated_cell_type,key,summary,perturbation_modality,src_repo,drug_smiles,package,file_location,author,dataset_id,curated_disease,curated_drug,curated_gene,abstract,version,curated_strain,pubchem_id,bucket,curated_tissue,dataset_source,data_type,overall_design,is_current,region
0,[Homo sapiens],"[30 μM, 50 μM]",polly:data://valo_onco_data_lake/data/PRIDE/Pr...,7,,Approved,Quantitative K-GG site analysis,[HCT 116],valo_onco__pxd002408_biological_replicate_1,0,LTQ Orbitrap Velos,"[2 hours, 6 hours]",1656064257394,[Chemical],gct,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...,[colorectal cancer cell line],valo_onco_data_lake/data/PRIDE/Proteomics/PXD0...,,[Drug],valo_onco,[],valo_onco_data_lake/data,https://discover-prod-datalake-v1.s3-us-west-2...,,PXD002408_Biological_Replicate_1,[Colonic Neoplasms],"[dimethyl sulfoxide, growth hormone]","[USP1, USP7]",,0,[None],"[679, 46931953]",discover-prod-datalake-v1,[colorectum],PRIDE,Proteomics,,True,us-west-2
1,[Homo sapiens],"[30 μM, 50 μM]",polly:data://valo_onco_data_lake/data/PRIDE/Pr...,3,,Approved,Quantitative K-GG site analysis,[HCT 116],valo_onco__pxd002408_biological_replicate_2,0,LTQ Orbitrap Velos,"[2 hours, 6 hours]",1656064386505,[Chemical],gct,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...,[colorectal cancer cell line],valo_onco_data_lake/data/PRIDE/Proteomics/PXD0...,,[Drug],valo_onco,[],valo_onco_data_lake/data,https://discover-prod-datalake-v1.s3-us-west-2...,,PXD002408_Biological_Replicate_2,[Colonic Neoplasms],"[dimethyl sulfoxide, growth hormone]","[USP1, USP7]",,0,[None],"[679, 46931953]",discover-prod-datalake-v1,[colorectum],PRIDE,Proteomics,,True,us-west-2
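
Formatting a Python tuple directly into the query works for two or more dataset IDs, but a one-element tuple renders with a trailing comma (`('x',)`), which the query engine would likely reject. The following is a minimal sketch of a safer way to build the `IN` list; `build_in_clause` is a hypothetical helper, not part of polly-python.

```python
def build_in_clause(values):
    """Return a SQL IN list such as ('a', 'b') with each value single-quoted."""
    if not values:
        raise ValueError("At least one dataset ID is required")
    # Double any single quote inside a value so it stays a valid SQL string literal
    quoted = ", ".join("'" + str(value).replace("'", "''") + "'" for value in values)
    return f"({quoted})"

# Works for a single ID as well as for several
dataset_ids = ["PXD002408_Biological_Replicate_1"]
query = f"SELECT * FROM valo_onco.datasets WHERE dataset_id IN {build_in_clause(dataset_ids)}"
print(query)
```

The resulting string can be passed to `omixatlas.query_metadata(query)` in the same way as the query above.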


Before running the next cell, create two folders, 'data_final' and 'metadata_final', in the working directory.
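
The two folders can also be created from within the notebook. A minimal sketch, assuming the notebook's current working directory is where the folders should live; `ensure_folders` is a hypothetical helper name.

```python
import os

def ensure_folders(base_dir=".", names=("data_final", "metadata_final")):
    """Create the given folders under base_dir if they do not exist yet."""
    paths = [os.path.join(base_dir, name) for name in names]
    for path in paths:
        # exist_ok=True makes the call safe to re-run
        os.makedirs(path, exist_ok=True)
    return paths

created = ensure_folders()
print(created)
```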

### converting the dataset level metadata to .json file

In [28]:
# Write one json file per dataset, named after its dataset_id
for i in metadata.index:
    metadata.loc[i].to_json(f"metadata_final/{metadata['dataset_id'][i]}.json")
filenames = os.listdir('metadata_final')

### adding the `__index__` key OR any custom curated fields

Once the dataset-level metadata files are created, the `__index__` key needs to be added to each .json file, as per the template, prior to ingestion.

If there are any custom curation fields, then those can also be added in a similar way.

In [None]:
import json

for i in filenames:
    entry = { "__index__": {
        'file_metadata': 'true',
        'col_metadata': 'true',
        'row_metadata': 'false',
        'data_required': 'false',
    }}
    # Load the existing metadata and merge it on top of the __index__ entry
    with open('metadata_final/'+i) as file:
        json_met = json.load(file)
    entry.update(json_met)
    with open('metadata_final/'+i, "w") as file:
        json.dump(entry, file)
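
Custom fields can be merged into the files with the same pattern. The following is a minimal sketch, demonstrated on a temporary folder rather than on `metadata_final`; `add_fields_to_json` and the field name `custom_field` are hypothetical and only illustrate the approach. As in the cell above, keys already present in a file take precedence over the added ones.

```python
import json
import os
import tempfile

def add_fields_to_json(folder, extra_fields):
    """Merge extra_fields into every .json file in folder; existing keys win."""
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".json"):
            continue
        path = os.path.join(folder, name)
        with open(path) as handle:
            record = json.load(handle)
        # Keys from the file override keys of the same name in extra_fields
        merged = {**extra_fields, **record}
        with open(path, "w") as handle:
            json.dump(merged, handle)

# Demo on a temporary folder holding one made-up metadata file
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "example.json")
    with open(demo_path, "w") as handle:
        json.dump({"dataset_id": "example"}, handle)
    add_fields_to_json(demo_dir, {"custom_field": ["value"]})
    with open(demo_path) as handle:
        result = json.load(handle)

print(result)
```

Any field added this way would also need to exist in the dataset-level schema of the destination OmixAtlas to be indexed.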

### Validating if the keys of json file matches the template

Prior to ingestion, the user should ensure that the keys in the dataset-level metadata match those of the template generated above.

In [78]:
# Iterating over a dict yields its keys, so set() collects them directly
template_keys = set(template)

In [79]:
template_keys

{'__index__',
 'curated_cell_line',
 'curated_cell_type',
 'curated_disease',
 'curated_drug',
 'curated_organism',
 'curated_tissue',
 'data_type',
 'dataset_id',
 'dataset_source',
 'description',
 'total_num_samples'}

In [80]:
import json
# Use a context manager so the file is closed after reading
with open('metadata_final/PXD002408_Biological_Replicate_1.json') as f:
    data = json.load(f)

keys_in_json = set(data)

In [81]:
keys_in_json

{'__index__',
 'abstract',
 'author',
 'bucket',
 'curated_cell_line',
 'curated_cell_type',
 'curated_disease',
 'curated_drug',
 'curated_gene',
 'curated_organism',
 'curated_strain',
 'curated_tissue',
 'data_table_name',
 'data_table_version',
 'data_type',
 'dataset_id',
 'dataset_source',
 'description',
 'dosage',
 'drug_smiles',
 'exposure_time',
 'file_location',
 'file_type',
 'ingestion_approved',
 'is_current',
 'key',
 'overall_design',
 'package',
 'perturbation_modality',
 'perturbation_type',
 'platform',
 'pubchem_id',
 'publication',
 'region',
 'src_repo',
 'src_uri',
 'summary',
 'timestamp_',
 'total_num_samples',
 'version',
 'year'}

In [82]:
intersect = keys_in_json.intersection(template_keys)
intersect

{'__index__',
 'curated_cell_line',
 'curated_cell_type',
 'curated_disease',
 'curated_drug',
 'curated_organism',
 'curated_tissue',
 'data_type',
 'dataset_id',
 'dataset_source',
 'description',
 'total_num_samples'}

### validating that all template keys are present in the json file

In [83]:
template_keys.difference(intersect)

set()
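
An empty set means that no template key is missing from the file. The cells above check a single file; the following is a minimal sketch that repeats the check for every json file in a folder. `missing_template_keys` is a hypothetical helper, and the demo uses a temporary folder with made-up files instead of `metadata_final`.

```python
import json
import os
import tempfile

def missing_template_keys(folder, template_keys):
    """Return {file name: sorted template keys absent from that file}."""
    report = {}
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(folder, name)) as handle:
            keys = set(json.load(handle))
        missing = set(template_keys) - keys
        if missing:
            report[name] = sorted(missing)
    return report

# Demo: one complete file and one file that lacks the description key
with tempfile.TemporaryDirectory() as demo_dir:
    samples = {
        "complete.json": {"dataset_id": "a", "description": "x", "__index__": {}},
        "incomplete.json": {"dataset_id": "b", "__index__": {}},
    }
    for name, record in samples.items():
        with open(os.path.join(demo_dir, name), "w") as handle:
            json.dump(record, handle)
    report = missing_template_keys(
        demo_dir, {"dataset_id", "description", "__index__"}
    )

print(report)
```

In the notebook, `missing_template_keys('metadata_final', template_keys)` should return an empty dictionary before ingestion.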

### Downloading corresponding gct files

As per the assumptions, no custom curation is required, so the gct files need no modification. We therefore download the files from the source OA and upload them as they are to the destination OA.

In [29]:
# The folder data_final must already exist in the working directory
for i in metadata['dataset_id']:
    print (str(i))
    data = omixatlas.download_data("valo_onco",str(i))
    file_name = f"{str(i)}.gct"
    url = data.get('data').get('attributes').get('download_url')
    status = os.system(f"wget -O 'data_final/{file_name}' '{url}'")
    if status == 0:
        print("Downloaded data successfully")
    else:
        raise Exception("Download not successful")

PXD002408_Biological_Replicate_1


--2022-07-07 06:15:20--  https://discover-prod-datalake-v1.s3.amazonaws.com/valo_onco_data_lake/data/PRIDE/Proteomics/PXD002408_Biological_Replicate_1.gct?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIAVRYB5UBIDT4GJDG3%2F20220707%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20220707T061520Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEKb%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLXdlc3QtMiJHMEUCIGiIc25y6NdyTsFxBO6k35UH7tv5dN7QcGg2tcP7iRt0AiEAhvwTkZEpIMMqYbo6fCCGawCrPTNUfKAkiYsxQogguk4qqgIIz%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARADGgwzODE3MTkyNTcxNjgiDG%2B8EXHMxFxPQXdB4Sr%2BAQ7iJQ2aPKz%2F4V3bH76HODP0midBGdvkbkEnEeNVCUs6%2Bhm9Se%2BWttxQBwqImF%2F3FE1ubSTiWLz6u6nhLloZ%2F%2FC%2FeORRYJk4ObNFH0JH8A2NMxunnDaGvgn8FUbUiuVB%2BvRDTBvvFSq42qn7tX7k6l%2Bb5kM5pxq%2F%2F3hI%2FwzJsMdgpOB6TFm3VmEqMBYa3uErh4VbwDetXVL0mzB8dpagKBbtTZvGEZNs14lpn5q8nqAuyBClDdOFx3a8kRyTUbZVESHLK13g%2FQpiRTwFbMhGkrfTaLrnGZyiAsZeEW1UXH0m1BZaLTBhlsgqMgXHsqd%2FrTDWr3iYaBXwFERB17bKMNjzmZYGOpoBq5mm%2Fn

Downloaded data successfully
PXD002408_Biological_Replicate_2
Downloaded data successfully


--2022-07-07 06:15:22--  https://discover-prod-datalake-v1.s3.amazonaws.com/valo_onco_data_lake/data/PRIDE/Proteomics/PXD002408_Biological_Replicate_2.gct?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIAVRYB5UBIDT4GJDG3%2F20220707%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20220707T061522Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEKb%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLXdlc3QtMiJHMEUCIGiIc25y6NdyTsFxBO6k35UH7tv5dN7QcGg2tcP7iRt0AiEAhvwTkZEpIMMqYbo6fCCGawCrPTNUfKAkiYsxQogguk4qqgIIz%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARADGgwzODE3MTkyNTcxNjgiDG%2B8EXHMxFxPQXdB4Sr%2BAQ7iJQ2aPKz%2F4V3bH76HODP0midBGdvkbkEnEeNVCUs6%2Bhm9Se%2BWttxQBwqImF%2F3FE1ubSTiWLz6u6nhLloZ%2F%2FC%2FeORRYJk4ObNFH0JH8A2NMxunnDaGvgn8FUbUiuVB%2BvRDTBvvFSq42qn7tX7k6l%2Bb5kM5pxq%2F%2F3hI%2FwzJsMdgpOB6TFm3VmEqMBYa3uErh4VbwDetXVL0mzB8dpagKBbtTZvGEZNs14lpn5q8nqAuyBClDdOFx3a8kRyTUbZVESHLK13g%2FQpiRTwFbMhGkrfTaLrnGZyiAsZeEW1UXH0m1BZaLTBhlsgqMgXHsqd%2FrTDWr3iYaBXwFERB17bKMNjzmZYGOpoBq5mm%2Fn

## Ingesting the data

To ingest the data, the user can use the following function:

**add_datasets(repo_id (int/str), source_folder_path (dict), destination_folder (str) (optional), priority (str) (optional))**

**Input:**

repo_id: The ID of the repository into which the data is ingested.

source_folder_path: A dictionary with the keys "data" and "metadata". The value for "data" is the folder containing the data files (gct, h5ad, vcf, mmcif etc.), and the value for "metadata" is the folder containing the json files of dataset-level metadata.

destination_folder (optional): The folder within S3 to which the data gets pushed.

priority (optional): The priority with which the ingestion is done. Default is 'medium'.

**Output:** 

Status of file upload for each dataset in a dataframe

In [84]:
repo_id = "1657110718820"
source_folder_path_data = "/import/data_final"
source_folder_metadata = "/import/metadata_final"
destination_folder = "220707-1426"
priority = "high"
source_folder_path = {"data":source_folder_path_data, "metadata":source_folder_metadata}
print(source_folder_path)
omixatlas.add_datasets(repo_id, source_folder_path, destination_folder, priority)

{'data': '/import/data_final', 'metadata': '/import/metadata_final'}
                              File Name        Message
0                combined_metadata.json  File Uploaded
1  PXD002408_Biological_Replicate_1.gct  File Uploaded
2  PXD002408_Biological_Replicate_2.gct  File Uploaded
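
Before calling add_datasets, it can help to confirm that each data file has a metadata file of the same name. The following is a minimal sketch, assuming (as in this notebook) that data files are named `<dataset_id>.gct` and metadata files `<dataset_id>.json`; `unpaired_files` is a hypothetical helper, not part of polly-python, and the demo uses a temporary folder with made-up file names.

```python
import os
import tempfile

def unpaired_files(data_dir, metadata_dir):
    """Return (data stems without metadata, metadata stems without data)."""
    data_stems = {os.path.splitext(name)[0] for name in os.listdir(data_dir)}
    meta_stems = {
        os.path.splitext(name)[0]
        for name in os.listdir(metadata_dir)
        if name.endswith(".json")
    }
    return sorted(data_stems - meta_stems), sorted(meta_stems - data_stems)

# Demo: two data files but only one matching metadata file
with tempfile.TemporaryDirectory() as root:
    data_dir = os.path.join(root, "data_final")
    metadata_dir = os.path.join(root, "metadata_final")
    os.makedirs(data_dir)
    os.makedirs(metadata_dir)
    for path in (
        os.path.join(data_dir, "dataset_a.gct"),
        os.path.join(data_dir, "dataset_b.gct"),
        os.path.join(metadata_dir, "dataset_a.json"),
    ):
        open(path, "w").close()  # create an empty placeholder file
    missing_metadata, missing_data = unpaired_files(data_dir, metadata_dir)

print(missing_metadata, missing_data)
```

Both lists should be empty for the folders passed in `source_folder_path`.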
