<a href="https://polly.elucidata.io/manage/workspaces?action=open_polly_notebook&source=github&path=path_place_holder&kernel=elucidata/Python 3.10&machine=medium" target="_parent"><img src="https://elucidatainc.github.io/PublicAssets/open_polly.svg" alt="Open in Polly"/></a>


# Ingestion Case 1

## Task: Upload selected datasets from an existing OmixAtlas (source) to a new OmixAtlas (destination)

### Assumptions:
1. No custom curation is required
2. The dataset level schema of the destination OA differs from that of the source OA
3. The sample level schema of the destination OA is the same as that of the source OA
4. Indexing is done only for dataset and sample level metadata

### Steps:
#### Create OA
1. Create OA on Test Polly

#### Schema management
2. Create schema in csv file
3. Prepare the schema payload for dataset and sample level metadata
4. Insert schema for dataset and sample level metadata
5. Verify if schema is properly inserted

#### Preparing files
6. Fetch the template which dataset level metadata should follow
7. Prepare dataset level metadata
8. Verify if dataset level metadata is as per the template 
9. Prepare gct to be uploaded

#### Ingestion
10. Ingestion
11. Verify indexing

In [2]:
!sudo pip3 install polly-python

Collecting polly-python
  Downloading polly_python-0.1.5-py3-none-any.whl (53 kB)
Collecting joblib
  Downloading joblib-1.1.0-py2.py3-none-any.whl (306 kB)
Collecting certifi==2021.10.8
  Downloading certifi-2021.10.8-py2.py3-none-any.whl (149 kB)
Collecting pytz==2021.1
  Downloading pytz-2021.1-py2.py3-none-any.whl (510 kB)
Collecting Cerberus==1.3.2
  Downloading Cerberus-1.3.2.tar.gz (52 kB)
Collecting requests==2.25.1
  Downloading requests-2.25.1-py2.py3-none-any.whl (61 kB)
Collecting sqlparse
  Downloading sqlparse-0.4.2-py3-none-

#### Importing csv files containing schema for dataset level metadata

In [6]:
! polly files copy -y -s polly://dataset_schema.csv -d ./

A new version of Polly CLI is available. To update, execute the command npm update -g @elucidatainc/pollycli
polly://dataset_schema.csv
./
progress [████████████████████████████████████████] 100% | 1.198 KB/1.198 KB | ETA: 0s | time elapsed: 0s
Success: Download complete


### Importing dependencies and authenticating into Polly

In [3]:
import json
import os

import pandas as pd
from polly.omixatlas import OmixAtlas
from polly.workspaces import Workspaces

from polly.auth import Polly
# Paste your own Polly API key here; never leave a real key in a shared notebook
AUTH_KEY = "<your Polly API key>" # test polly
Polly.auth(AUTH_KEY, env = "testpolly")

omixatlas = OmixAtlas()
workspaces = Workspaces()

## Creating a new OmixAtlas

In [8]:
omixatlas.create("ingestion testing OA", "This OA is for testing the performance of ingestion functions")

 OmixAtlas 1657777478388 Created  


| | Repository Id | Repository Name | Display Name | Description |
|---|---|---|---|---|
| 0 | 1657777478388 | ingestion_testing_oa | ingestion testing OA | This OA is for testing the performance of inge... |


After creating the OmixAtlas, please ask the admin to map the resource to the organization.

### Generating schema payload

**The block of code below generates the schema payload from a schema prepared in a csv file.**

**Input:** csv file containing the schema 

**Output:** payload to be ingested to the OmixAtlas

In [9]:
import pandas as pd

def get_fields(schema_file):
    """Read the schema csv and return a dict of field definitions keyed by field name."""
    schema_df = pd.read_csv(schema_file)
    col_types = {
        "field_name": str,
        "original_name": str,
        "type": str,
        "is_keyword": bool,
        "is_array": bool,
        "is_filter": bool,
        "is_column": bool,
        "filter_size": int,
        "display_name": str,
        "description": str,
    }
    schema_df = schema_df.astype(col_types)
    schema_df["field_name"] = schema_df["field_name"].apply(lambda x: x.lower().strip().replace(" ", "_"))
    schema_df["type"] = schema_df["type"].str.lower().str.strip()
    # Empty descriptions become the string "nan" after astype(str); replace whole-cell matches only
    schema_df["description"] = schema_df["description"].replace("nan", "NA")
    
    object_rows = schema_df[schema_df["type"] == "object"]
    object_rows_array = object_rows[object_rows["is_array"] == 1]
    if len(object_rows_array.index) > 0:
        raise ValueError("Fields with type 'object' cannot be arrays")
    object_rows_keyword = object_rows[object_rows["is_keyword"] == 1]
    if len(object_rows_keyword.index) > 0:
        raise ValueError("Fields with type 'object' cannot be keywords")
    
    field_list = schema_df.to_dict(orient="records")
    fields = {}
    for field in field_list:
        field_name = field.pop("field_name")
        fields[field_name] = field
    
    return fields

def generate_schema_payload(schema_type, schema_file):
    # repo_id is read from the notebook's global scope, so it must be set before this is called
    return {
        "data": {
            "type": "schemas",
            "id": repo_id,
            "attributes": {
                "repo_id": repo_id,
                "schema_type": schema_type,
                "schema": {
                    "all": {
                        "all": get_fields(schema_file)
                    }
                }
            }
        }
    }
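
To make the expected input concrete, here is a minimal sketch of how one csv row could map to one entry of the payload's field dictionary. The csv layout is an assumption inferred from the column names in col_types (the actual dataset_schema.csv is not shown here), the row values are taken from the dataset_source field in the output further below, and row_to_field is a hypothetical helper that uses the standard csv module instead of pandas.

```python
import csv
import io

# Assumed layout: one row per field, columns named as in col_types above
sample_csv = """field_name,original_name,type,is_keyword,is_array,is_filter,is_column,filter_size,display_name,description
dataset_source,dataset_source,text,True,False,False,False,1,Source,Source from where the data was fetched
"""

BOOL_COLUMNS = ("is_keyword", "is_array", "is_filter", "is_column")


def row_to_field(row):
    """Convert one csv row into a (field_name, field_definition) pair."""
    field = dict(row)
    # Normalise the field name the same way get_fields does
    name = field.pop("field_name").lower().strip().replace(" ", "_")
    for col in BOOL_COLUMNS:
        field[col] = field[col].strip().lower() == "true"
    field["filter_size"] = int(field["filter_size"])
    field["type"] = field["type"].lower().strip()
    return name, field


fields = dict(row_to_field(r) for r in csv.DictReader(io.StringIO(sample_csv)))
print(fields)
```

The resulting dictionary has the same shape as the value stored under schema → all → all in the payload.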

### Generating schema for dataset level metadata (files)

Calling the functions above to generate the schema payload.

Here, the repo_id must be passed as a string. The file is the input csv file which contains the schema.

In [10]:
print("Creating dataset-level metadata schema...")
repo_id = "1657777478388"
file = "dataset_schema.csv"
payload_d = generate_schema_payload("files", file)
payload_d

Creating dataset-level metadata schema...


{'data': {'type': 'schemas',
  'id': '1657777478388',
  'attributes': {'repo_id': '1657777478388',
   'schema_type': 'files',
   'schema': {'all': {'all': {'dataset_source': {'original_name': 'dataset_source',
       'type': 'text',
       'is_keyword': True,
       'is_array': False,
       'is_filter': False,
       'is_column': False,
       'filter_size': 1,
       'display_name': 'Source',
       'description': 'Source from where the data was fetched'},
      'dataset_id': {'original_name': 'dataset_id',
       'type': 'text',
       'is_keyword': True,
       'is_array': False,
       'is_filter': False,
       'is_column': True,
       'filter_size': 1,
       'display_name': 'Dataset ID',
       'description': 'Unique ID assocaited with every dataset'},
      'description': {'original_name': 'description',
       'type': 'text',
       'is_keyword': False,
       'is_array': False,
       'is_filter': False,
       'is_column': False,
       'filter_size': 1,
       'display_na

### Adding dataset level schema to destination OA

The payload above can be passed to the polly-python function insert_schema to add the schema to a new OmixAtlas.

In [11]:
omixatlas.insert_schema(1657777478388, payload_d)

'{"data": {"type": "schemas", "id": "1657777478388", "attributes": {"repo_id": "1657777478388", "schema_type": "files", "schema": {"all": {"all": {"dataset_source": {"type": "text", "is_keyword": true, "is_array": false, "is_filter": false, "is_column": false, "is_curated": false, "is_ontology": false, "filter_size": 1, "original_name": "dataset_source", "display_name": "Source", "description": "Source from where the data was fetched"}, "dataset_id": {"type": "text", "is_keyword": true, "is_array": false, "is_filter": false, "is_column": true, "is_curated": false, "is_ontology": false, "filter_size": 1, "original_name": "dataset_id", "display_name": "Dataset ID", "description": "Unique ID assocaited with every dataset"}, "description": {"type": "text", "is_keyword": false, "is_array": false, "is_filter": false, "is_column": false, "is_curated": false, "is_ontology": false, "filter_size": 1, "original_name": "description", "display_name": "Title", "description": "Description of the data

### Generating schema for sample level metadata (gct_metadata)

Just like the dataset level schema, the sample level schema should also be added to the OmixAtlas.

**In this case, because the sample level schema of the destination OA is assumed to be the same as that of the source OA, we get the schema dictionary from the source OA and then insert it into the destination OA.**

In [18]:
#source OA = 1657110718820
schema = omixatlas.get_schema("1657110718820", ["sample"], return_type = "dict")

In [19]:
payload_s = schema.sample

In [20]:
payload_s

{'data': {'type': 'schemas',
  'id': '1657110718820',
  'attributes': {'repo_id': '1657110718820',
   'schema_type': 'gct_metadata',
   'schema': {'all': {'all': {'growth_protocol_ch1': {'is_array': False,
       'is_ontology': False,
       'is_keyword': False,
       'original_name': 'growth_protocol_ch1',
       'description': 'NA',
       'type': 'text',
       'is_filter': False,
       'is_column': False,
       'is_curated': False,
       'filter_size': 1,
       'display_name': 'Growth protocol ch1'},
      'src_uri': {'is_array': False,
       'is_ontology': False,
       'is_keyword': True,
       'original_name': 'kw_src_uri',
       'description': "Unique URI derived from source data file's S3 location",
       'type': 'text',
       'is_filter': False,
       'is_column': False,
       'is_curated': False,
       'filter_size': 2000,
       'display_name': 'Source URI'},
      'sample_id': {'is_array': False,
       'is_ontology': False,
       'is_keyword': True,
       '

### Adding sample level schema to OA

The payload above can be passed to the polly-python function insert_schema to add the schema to a new OmixAtlas.

Note: Even though the payload we fetched contains the "id" of the source OA, when insert_schema is called with that payload, the "id" is replaced with that of the destination OA.
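
If you would rather not rely on that replacement, the IDs can be rewritten locally before the payload is inserted. The following is a minimal sketch, assuming the payload has the shape printed above; retarget_payload is a hypothetical helper and not part of polly-python.

```python
import copy


def retarget_payload(payload, dest_repo_id):
    """Return a copy of a schema payload that points at the destination repository."""
    new_payload = copy.deepcopy(payload)  # leave the fetched payload untouched
    new_payload["data"]["id"] = str(dest_repo_id)
    new_payload["data"]["attributes"]["repo_id"] = str(dest_repo_id)
    return new_payload


# Trimmed-down stand-in for the payload fetched from the source OA
source_payload = {
    "data": {
        "type": "schemas",
        "id": "1657110718820",
        "attributes": {
            "repo_id": "1657110718820",
            "schema_type": "gct_metadata",
            "schema": {"all": {"all": {}}},
        },
    }
}

dest_payload = retarget_payload(source_payload, 1657777478388)
print(dest_payload["data"]["id"], dest_payload["data"]["attributes"]["repo_id"])
```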

In [21]:
#destination OA: 1657777478388
omixatlas.insert_schema(1657777478388, payload_s)

'{"data": {"type": "schemas", "id": "1657777478388", "attributes": {"repo_id": "1657777478388", "schema_type": "gct_metadata", "schema": {"all": {"all": {"growth_protocol_ch1": {"type": "text", "is_keyword": false, "is_array": false, "is_filter": false, "is_column": false, "is_curated": false, "is_ontology": false, "filter_size": 1, "original_name": "growth_protocol_ch1", "display_name": "Growth protocol ch1", "description": "NA"}, "src_uri": {"type": "text", "is_keyword": true, "is_array": false, "is_filter": false, "is_column": false, "is_curated": false, "is_ontology": false, "filter_size": 2000, "original_name": "kw_src_uri", "display_name": "Source URI", "description": "Unique URI derived from source data file\'s S3 location"}, "sample_id": {"type": "text", "is_keyword": true, "is_array": false, "is_filter": false, "is_column": true, "is_curated": false, "is_ontology": false, "filter_size": 1, "original_name": "kw_column", "display_name": "Sample ID", "description": "Unique ID ass

### Validating if the schema is appropriately added

Note: The schema contains additional fields (beyond the ones in the csv file), which get added automatically when a schema is inserted into a new OA.

In [22]:
schema = omixatlas.get_schema(1657777478388)
schema.dataset

| | Source | Datatype | Field Name | Field Description | Field Type | Is Curated | Is Array |
|---|---|---|---|---|---|---|---|
| 0 | all | all | curated_organism | Orgnism from which the samples were derived | text | False | True |
| 1 | all | all | src_uri | Unique URI derived from data file's S3 location | text | False | False |
| 2 | all | all | total_num_samples | Total number of samples in a dataset | integer | False | False |
| 3 | all | all | description | Description of the dataset | text | False | False |
| 4 | all | all | curated_cell_line | Cell lines from which the samples were derived... | text | False | True |
| 5 | all | all | data_table_name | Name of the data table associated with data file | text | False | False |
| 6 | all | all | data_table_version | Current version of the data table associated w... | integer | False | False |
| 7 | all | all | timestamp_ | Unix timestamp denoting time of creation for t... | text | False | False |
| 8 | all | all | file_type | Data file's type | text | False | False |
| 9 | all | all | curated_cell_type | Types of cell present in the dataset | text | False | True |


In [23]:
schema.sample

| | Source | Datatype | Field Name | Field Description | Field Type | Is Curated | Is Array |
|---|---|---|---|---|---|---|---|
| 0 | all | all | growth_protocol_ch1 | | text | False | False |
| 1 | all | all | src_uri | Unique URI derived from source data file's S3 ... | text | False | False |
| 2 | all | all | sample_id | Unique ID associated with every sample | text | False | False |
| 3 | all | all | curated_gene_modified | Gene modified through genetic modification | text | False | True |
| 4 | all | all | dose_ch1 | | text | False | False |
| 5 | all | all | curated_cohort_name | Name of the cohort to which the sample belongs | text | False | False |
| 6 | all | all | curated_control | Signifies whether the given sample is a contro... | integer | False | False |
| 7 | all | all | src_dataset_id | Dataset ID of the file this data entity origin... | text | False | False |
| 8 | all | all | extract_protocol_ch1 | | text | False | False |
| 9 | all | all | characteristics_ch2 | | text | False | False |
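
Rather than comparing the tables by eye, the check can be expressed as a set difference. This is a minimal sketch: missing_fields is a hypothetical helper, and the two lists are small illustrative stand-ins. In the notebook, the inserted names would presumably come from the "Field Name" column of schema.dataset, and the expected names from the field_name column of the schema csv.

```python
def missing_fields(expected, inserted):
    """Return the expected field names that are absent from the inserted schema."""
    return sorted(set(expected) - set(inserted))


# Illustrative stand-ins for the csv fields and the fields reported by get_schema
csv_fields = ["dataset_source", "dataset_id", "description"]
inserted_fields = [
    "curated_organism",
    "src_uri",
    "total_num_samples",
    "description",
    "dataset_id",
    "dataset_source",
]

print(missing_fields(csv_fields, inserted_fields))
```

An empty list means every field from the csv file is present; the auto-added fields do not affect the result because only the csv fields are checked.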


## Preparing dataset level metadata files (.json)

### Fetching the template

To prepare the dataset level metadata, users should ensure that the keys of the json files they prepare adhere to the dataset level schema.

To facilitate this, users can call the function dataset_metadata_template to generate the template. The keys in the template are the mandatory fields for the dataset level metadata.

In [24]:
template = omixatlas.dataset_metadata_template(1657777478388)

In [25]:
template

{'curated_organism': [],
 'kw_src_uri': 'text',
 'total_num_samples': 'integer',
 'description': 'text',
 'curated_cell_line': [],
 'data_table_name': 'text',
 'data_table_version': 'integer',
 'kw_timestamp_': 'text',
 'kw_file_type': 'text',
 'curated_cell_type': [],
 'kw_key': 'text',
 'kw_src_repo': 'text',
 'kw_package': 'text',
 'kw_file_location': 'text',
 'dataset_id': 'text',
 'curated_disease': [],
 'curated_drug': [],
 'version': 'integer',
 'kw_bucket': 'text',
 'curated_tissue': [],
 'dataset_source': 'text',
 'data_type': 'text',
 'is_current': 'text',
 'kw_region': 'text',
 '__index__': {'file_metadata': True,
  'col_metadata': True,
  'row_metadata': False,
  'data_required': False}}

### From source OmixAtlas, fetching the dataset level metadata for the datasets to be ingested in destination OmixAtlas

Because some of the dataset level metadata remains the same between the source OA and the destination OA, we can query the source OA and use the result to prepare the dataset level metadata for the destination OA.

In [26]:
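dataset_ids = ['PXD002408_Biological_Replicate_1', 'PXD002408_Biological_Replicate_2']
# Build the IN (...) list explicitly; tuple() would leave a trailing comma for a single ID
id_list = ", ".join(f"'{d}'" for d in dataset_ids)
query = f"SELECT * from 1657110718820.datasets WHERE dataset_id IN ({id_list})"
metadata = omixatlas.query_metadata(query)
metadata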

Query execution succeeded (time taken: 1.83 seconds, data scanned: 0.002 MB)
Fetched 2 rows


| | curated_organism | src_uri | total_num_samples | description | curated_cell_line | data_table_name | data_table_version | timestamp_ | file_type | curated_cell_type | key | src_repo | package | file_location | dataset_id | curated_disease | curated_drug | version | bucket | curated_tissue | dataset_source | data_type | is_current | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [Homo sapiens] | polly:data://valo_onco/data/220707-1146/PXD002... | 7 | Quantitative K-GG site analysis | [HCT 116] | valo_onco__pxd002408_biological_replicate_1 | 0 | 1657177092182 | gct | [] | valo_onco/data/220707-1146/PXD002408_Biologica... | valo_onco | valo_onco/data | https://discover-test-datalake-v1.s3-ap-southe... | PXD002408_Biological_Replicate_1 | [Colonic Neoplasms] | [dimethyl sulfoxide, growth hormone] | 0 | discover-test-datalake-v1 | [colorectum] | PRIDE | Proteomics | True | ap-southeast-1 |
| 1 | [Homo sapiens] | polly:data://valo_onco/data/220707-1146/PXD002... | 3 | Quantitative K-GG site analysis | [HCT 116] | valo_onco__pxd002408_biological_replicate_2 | 0 | 1657177198137 | gct | [] | valo_onco/data/220707-1146/PXD002408_Biologica... | valo_onco | valo_onco/data | https://discover-test-datalake-v1.s3-ap-southe... | PXD002408_Biological_Replicate_2 | [Colonic Neoplasms] | [dimethyl sulfoxide, growth hormone] | 0 | discover-test-datalake-v1 | [colorectum] | PRIDE | Proteomics | True | ap-southeast-1 |


**Before running the next cell, create two folders, 'data_final' and 'metadata_final', in your working directory.**
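
The two folders can also be created from within the notebook. A minimal sketch using the standard library; exist_ok=True makes the cell safe to re-run.

```python
import os

# Create the folders used for the gct files and the dataset level metadata
for folder in ("data_final", "metadata_final"):
    os.makedirs(folder, exist_ok=True)
```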

### Converting the dataset level metadata to .json files

In [27]:
# Write one json file per dataset, named after its dataset_id
for i in metadata.index:
    metadata.loc[i].to_json(f"metadata_final/{metadata['dataset_id'][i]}.json")
filenames = os.listdir('metadata_final')

### Adding the _ _index_ _ or any custom curated fields

Once the dataset level metadata files are created, _ _index_ _ needs to be added to each .json file, as per the template, prior to ingestion.

**If there are any custom curation fields, those can be added in a similar way.**

In [28]:
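import json

for i in filenames:
    # __index__ block to be added to every dataset level metadata file
    entry = { "__index__": {
        'file_metadata': 'true',  
        'col_metadata': 'true',  
        'row_metadata': 'false',  
        'data_required': 'false',
    }}
    # Read the existing metadata and close the file before rewriting it
    with open('metadata_final/'+i) as file:
        json_met = json.load(file)
    entry.update(json_met)
    with open('metadata_final/'+i, "w") as file:
        json.dump(entry, file)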
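
The same pattern extends to custom curation fields. Below is a minimal sketch, run against a throwaway file so that it does not touch metadata_final; add_fields is a hypothetical helper, and curated_study_phase is an invented field name used purely for illustration. Unlike the cell above, the extra fields here overwrite any existing key with the same name.

```python
import json
import os
import tempfile


def add_fields(json_path, extra_fields):
    """Merge extra key/value pairs into a metadata json file and return the result."""
    with open(json_path) as fh:
        record = json.load(fh)
    record.update(extra_fields)  # extra fields win on a name clash
    with open(json_path, "w") as fh:
        json.dump(record, fh)
    return record


# Throwaway metadata file for the demonstration
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo_dataset.json")
with open(demo_path, "w") as fh:
    json.dump({"dataset_id": "demo_dataset"}, fh)

updated = add_fields(demo_path, {"curated_study_phase": "pilot"})
print(updated)
```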

### Validating that the keys of the json file match the template

Prior to ingestion, the user should ensure that the keys in the dataset level metadata match those of the template generated above.

In [29]:
# Iterating over a dict yields its keys, so set() collects all template keys
template_keys = set(template)

In [30]:
template_keys

{'__index__',
 'curated_cell_line',
 'curated_cell_type',
 'curated_disease',
 'curated_drug',
 'curated_organism',
 'curated_tissue',
 'data_table_name',
 'data_table_version',
 'data_type',
 'dataset_id',
 'dataset_source',
 'description',
 'is_current',
 'kw_bucket',
 'kw_file_location',
 'kw_file_type',
 'kw_key',
 'kw_package',
 'kw_region',
 'kw_src_repo',
 'kw_src_uri',
 'kw_timestamp_',
 'total_num_samples',
 'version'}

In [31]:
import json
with open('metadata_final/PXD002408_Biological_Replicate_1.json') as f:
    data = json.load(f)

# Collect the top-level keys of the metadata file
keys_in_json = set(data)

In [32]:
keys_in_json

{'__index__',
 'bucket',
 'curated_cell_line',
 'curated_cell_type',
 'curated_disease',
 'curated_drug',
 'curated_organism',
 'curated_tissue',
 'data_table_name',
 'data_table_version',
 'data_type',
 'dataset_id',
 'dataset_source',
 'description',
 'file_location',
 'file_type',
 'is_current',
 'key',
 'package',
 'region',
 'src_repo',
 'src_uri',
 'timestamp_',
 'total_num_samples',
 'version'}

In [33]:
intersect = keys_in_json.intersection(template_keys)
intersect

{'__index__',
 'curated_cell_line',
 'curated_cell_type',
 'curated_disease',
 'curated_drug',
 'curated_organism',
 'curated_tissue',
 'data_table_name',
 'data_table_version',
 'data_type',
 'dataset_id',
 'dataset_source',
 'description',
 'is_current',
 'total_num_samples',
 'version'}

### Checking the template keys missing from the json file

In this case, the template keys missing from the json file are only the fields which are automatically added to the dataset level metadata file during ingestion, so their absence will not have any impact on the ingestion job.

In [34]:
template_keys.difference(intersect)

{'kw_bucket',
 'kw_file_location',
 'kw_file_type',
 'kw_key',
 'kw_package',
 'kw_region',
 'kw_src_repo',
 'kw_src_uri',
 'kw_timestamp_'}
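
The check above can be wrapped in a small helper that separates keys needing attention from keys that can be ignored. This is a minimal sketch, assuming (as in this notebook) that the kw_-prefixed template keys are the ones added automatically during ingestion; split_missing_keys is a hypothetical helper and the key sets are shortened for illustration.

```python
def split_missing_keys(template_keys, json_keys):
    """Split template keys absent from the json into (must_fix, auto_added)."""
    missing = set(template_keys) - set(json_keys)
    # Assumption: kw_-prefixed keys are filled in automatically at ingestion time
    auto_added = {key for key in missing if key.startswith("kw_")}
    return sorted(missing - auto_added), sorted(auto_added)


# Shortened illustrative key sets
demo_template_keys = {"__index__", "dataset_id", "kw_bucket", "kw_region", "version"}
demo_json_keys = {"__index__", "dataset_id", "bucket", "region"}

must_fix, auto_added = split_missing_keys(demo_template_keys, demo_json_keys)
print(must_fix, auto_added)
```

Any name in the first list is a template key that should be added to the json file before ingestion.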

### Downloading corresponding gct files

As per the assumptions, no custom curation is required, so the gct files don't need any modification. We therefore download the files from the source OA and upload them as they are to the destination OA.

In [35]:
# Requires a folder named data_final in the working directory
for i in metadata['dataset_id']:
    print(str(i))
    data = omixatlas.download_data("1657110718820",str(i))
    file_name = f"{str(i)}.gct"
    url = data.get('data').get('attributes').get('download_url')
    status = os.system(f"wget -O 'data_final/{file_name}' '{url}'")
    if status == 0:
        print("Downloaded data successfully")
    else:
        raise Exception("Download not successful")

PXD002408_Biological_Replicate_1


--2022-07-14 05:55:56--  https://discover-test-datalake-v1.s3.amazonaws.com/valo_onco/data/220713-13235/PXD002408_Biological_Replicate_1.gct?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIARBT2ZRFJPGHAPSHS%2F20220714%2Fap-southeast-1%2Fs3%2Faws4_request&X-Amz-Date=20220714T055556Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEE4aDmFwLXNvdXRoZWFzdC0xIkcwRQIhAIykPPH7JG%2F89ouW88shMZbvGa8WB9kE1JGH%2BliqD%2BJ7AiBXl11EdL%2BAyMenUIfixOdgKHoHGYlvjOz5ppbWZ1Y5jCqqAgiH%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F8BEAIaDDA3MjE5ODIyODMwNiIMRtW5amK%2B7yo9VavJKv4BXIcqTtAYsBH3xqRYpq9aAFxAFbJIsKM5POxkP7xZ0lRsEVaK4OKptp13k%2ByVl3HHiQkClcqWMsa7aklShXm8sdIcQwVE8fpn%2FgDNr3hDp7ixvudbkpWW9hcJeNBoods7kQscO5AIVVx1SmtJO6NlVlQVrTQRXCSoWZoaw68ryO1x6xkdj2P4zA1CCql1VNp0vN%2Blm3hw026iaRO3Ci2iMHUHuOnu8d%2BONKUXPzOJiyyLgkJtasT4Kz4GbUeYlbdJvIbFLfQ6su4yTTH2Jh1UhzEnhqFGfsleEmmRGDORyAW2K%2Be8C%2Fwc8Ds%2F3CmF4vdE0MMIvkhn27x9ml4DRscw4tq%2BlgY6mgEBPdRpVjjlWsOsIGyYiYJJ2lIsiS8nxVqvOmReMl%2BYMTW2sh6

Downloaded data successfully
PXD002408_Biological_Replicate_2


--2022-07-14 05:56:00--  https://discover-test-datalake-v1.s3.amazonaws.com/valo_onco/data/220713-13235/PXD002408_Biological_Replicate_2.gct?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIARBT2ZRFJNVEYMIJP%2F20220714%2Fap-southeast-1%2Fs3%2Faws4_request&X-Amz-Date=20220714T055600Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEE4aDmFwLXNvdXRoZWFzdC0xIkYwRAIgbnnq5u7o9FYFhPSvYPSrNFfJnv0viog094wmY42HYKgCIA%2FDdH6DiaMSbluMlMydMfgEtOG%2B5U2OYUamUY0IOX0nKqoCCIf%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEQAhoMMDcyMTk4MjI4MzA2IgxcAuKF6rQFREZ5o8cq%2FgE2j9JuiCLgIaoJUaA1UU5oZkVVtcHi5phvHnkdBmfMpZ336%2BVoJIVv2KApYq9JrSoA1W8C%2FlwbF4KNICZcLoGQjB1CpxxuAn8qc9Y9g6q0SsJP2qNR3%2FezqhrlGuugEoBWof4UgaWR23BcD0EwyusRuw9JTiumzIByJhtLozPEsBrzOIeeAgnYXOIqD4BobQdJI1ljka%2Bwf0T7QihNhlvkqz3q5B5Ge1sf3yu6lF3rrW4Nitkmi0YUTe%2FR57OP8Qq4gSPvgQSbVKHCdZTT8%2Fixo2C6859fAr4%2Fsal%2FuM8Si8tnPaX%2Fkm6r%2Fd0YhRNdjm8JEze9HJwnIqLwgZGHCjDu376WBjqbAf%2FqZf4j2YHaw9hOUJYI4iZoXCBTzh9zLOxiY9jUI4tmvcDj1nS

Downloaded data successfully



    50K .......... .......... .......... .......... .......... 68%  306K 0s
   100K .......... .......... .......... .......... .....     100%  280K=0.5s

2022-07-14 05:56:02 (297 KB/s) - 'data_final/PXD002408_Biological_Replicate_2.gct' saved [149121/149121]
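
Where wget is not available, the download step can be done with Python's standard library instead. The following is a minimal sketch that swaps wget for urllib.request.urlretrieve; download_file is a hypothetical helper, and the demonstration uses a local file:// URL so that it runs offline. In the notebook, the URL would be the download_url returned by download_data.

```python
import os
import tempfile
import urllib.request
from pathlib import Path


def download_file(url, destination):
    """Save the resource at `url` to `destination` and return the destination path."""
    os.makedirs(os.path.dirname(destination) or ".", exist_ok=True)
    urllib.request.urlretrieve(url, destination)
    return destination


# Offline demonstration: "download" a small local file through a file:// URL
work_dir = tempfile.mkdtemp()
source = Path(work_dir) / "source.gct"
source.write_text("#1.3\n")

saved = download_file(source.as_uri(), os.path.join(work_dir, "data_final", "copy.gct"))
print(Path(saved).read_text())
```

urlretrieve raises an exception on failure, so no separate status check is needed.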



## Ingesting the data

In order to ingest the data, users can call the following function:

**add_datasets(repo_id (int/str), source_folder_path (dict), destination_folder (str) (optional), priority (str) (optional))**

**Input:** 

repo_id: The repository ID into which the data should be ingested

source_folder_path: A dictionary with the keys "data" and "metadata". The value for "data" should be the folder containing the data files (gct, h5ad, vcf, mmcif etc), and the value for "metadata" should be the folder containing the dataset level metadata json files.

destination_folder (optional): The folder within S3 where the data gets pushed

priority (optional): The priority with which the ingestion is done. Default is 'medium'

**Output:** 

Status of file upload for each dataset in a dataframe
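
Before calling add_datasets, it can help to confirm that every data file has a matching metadata file. This is a minimal sketch, assuming (as in this notebook) that a data file and its json share the same base name; unmatched_files is a hypothetical helper, and the demonstration uses temporary folders with empty placeholder files.

```python
import os
import tempfile


def unmatched_files(data_dir, metadata_dir):
    """Return (data files without metadata, metadata files without data) by base name."""
    data_stems = {os.path.splitext(name)[0] for name in os.listdir(data_dir)}
    meta_stems = {
        os.path.splitext(name)[0]
        for name in os.listdir(metadata_dir)
        if name.endswith(".json")
    }
    return sorted(data_stems - meta_stems), sorted(meta_stems - data_stems)


# Temporary folders standing in for data_final and metadata_final
data_dir = tempfile.mkdtemp()
metadata_dir = tempfile.mkdtemp()
for name in ("A.gct", "B.gct"):
    open(os.path.join(data_dir, name), "w").close()
open(os.path.join(metadata_dir, "A.json"), "w").close()

no_metadata, no_data = unmatched_files(data_dir, metadata_dir)
print(no_metadata, no_data)
```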

In [37]:
repo_id = "1657777478388"
source_folder_path_data = "/import/data_final"
source_folder_metadata = "/import/metadata_final"
destination_folder = "220714-1126"
priority = "high"
source_folder_path = {"data":source_folder_path_data, "metadata":source_folder_metadata}
print(source_folder_path)
omixatlas.add_datasets(repo_id, source_folder_path, destination_folder, priority)

{'data': '/import/data_final', 'metadata': '/import/metadata_final'}
                              File Name        Message
0                combined_metadata.json  File Uploaded
1  PXD002408_Biological_Replicate_1.gct  File Uploaded
2  PXD002408_Biological_Replicate_2.gct  File Uploaded


## Deletion of dataset from the OmixAtlas

**delete_datasets(repo_id: int/str, dataset_ids: list)**

This function is used to delete datasets from an OmixAtlas

**Input:**

repo_id: (int/str) This is the repository ID from which dataset should be deleted

dataset_ids: (list) dataset_ids that users want to delete

**Output:**

Status of file deletion for each dataset in a dataframe

In [5]:
repo_id = "1657777478388"
dataset_ids = ["PXD002408_Biological_Replicate_1"]
omixatlas.delete_datasets(repo_id, dataset_ids)

                         Dataset Id                                                                     Message
0  PXD002408_Biological_Replicate_1  Request Accepted. Dataset Will be deleted in the next version of OmixAtlas
