# DSS-AZUL-INDEXING

### This Jupyter notebook runs a few test indexing operations using the `DataExtractor` and `FileIndexer` classes.

The first cell imports our modules and sets up:
* An Elasticsearch client
* A dummy event payload
* The `bundle_uuid` and `bundle_version` parsed from that payload

In [1]:
from elasticsearch import Elasticsearch
from chalicelib.utils import DataExtractor
from chalicelib.indexer import FileIndexer, AssayOrientedIndexer
import json
from pprint import pprint

# Create an ElasticSearch client
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
# Sample event payload
payload = { "query": { "query": { "match_all":{}} }, "subscription_id": "ba50df7b-5a97-4e87-b9ce-c0935a817f0b", "transaction_id": "ff6b7fa3-dc79-4a79-a313-296801de76b9", "match": { "bundle_version": "2018-01-03T173141.039382Z", "bundle_uuid": "d8a4576b-66f5-4b10-aecf-f6c4025b9997" } }
bundle_uuid = payload['match']['bundle_uuid']
bundle_version = payload['match']['bundle_version']
# Create 
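
The cell above reads `bundle_uuid` and `bundle_version` straight out of the payload, which raises a bare `KeyError` if the event is malformed. A minimal sketch of a more defensive version follows. The helper name `parse_bundle_id` and the UUID check are assumptions for illustration and are not part of `chalicelib`.

```python
import re

# Loose pattern for the canonical 8-4-4-4-12 hexadecimal UUID layout.
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)


def parse_bundle_id(event):
    """Return (bundle_uuid, bundle_version) from a notification payload.

    Hypothetical helper: raises ValueError with a readable message
    instead of letting a KeyError escape.
    """
    match = event.get("match")
    if not isinstance(match, dict):
        raise ValueError("payload has no 'match' object")
    try:
        uuid_ = match["bundle_uuid"]
        version = match["bundle_version"]
    except KeyError as exc:
        raise ValueError(f"payload is missing {exc.args[0]!r}") from exc
    if not UUID_RE.match(str(uuid_).lower()):
        raise ValueError(f"not a UUID: {uuid_!r}")
    return uuid_, version


# Same identifiers as the dummy payload used in this notebook.
sample = {
    "match": {
        "bundle_uuid": "d8a4576b-66f5-4b10-aecf-f6c4025b9997",
        "bundle_version": "2018-01-03T173141.039382Z",
    }
}
print(parse_bundle_id(sample))
```

Failing early here keeps a bad event from reaching the extractor or the indexer.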

Next, we create an instance of `DataExtractor` and use it to fetch the contents of the bundle referenced by `payload`. We pull from the AWS replica.

In [2]:
# Create DataExtractor instance pointing to HCA Staging
extractor = DataExtractor("https://dss.staging.data.humancellatlas.org/v1")
# Use dummy payload and get the metadata_files and the data_files
metadata_files, data_files = extractor.extract_bundle(payload, "aws")
# Print each dictionary
print("\n#####################################################")
print("#                    PRINTING METADATA               #")
print("#####################################################")
pprint(metadata_files, indent=4)
print("\n#####################################################")
print("#                  PRINTING DATA FILES               #")
print("#####################################################")
pprint(data_files, indent=4)


#####################################################
#                    PRINTING METADATA               #
#####################################################
{   'assay.json': {   'content': {   'assay_id': 'assay_1',
                                     'core': {   'schema_url': 'https://raw.githubusercontent.com/HumanCellAtlas/metadata-schema/4.6.1/json_schema/assay.json',
                                                 'schema_version': '4.6.1',
                                                 'type': 'assay'},
                                     'rna': {   'end_bias': 'five_prime_end',
                                                'library_construction': 'smart-seq2',
                                                'strand': 'both'},
                                     'seq': {   'instrument_model': 'HiSeq '
                                                                    '2500',
                                                'instrument_platform': 'Illumina',
      

Next, we pass these dictionaries to the `FileIndexer` class to create a file-oriented index entry in the Elasticsearch instance running on `localhost:9200`. But first, we load the index settings and the mapping configuration from their JSON files.

In [3]:
# Define helper method to open files
def open_and_return_json(file_path):
    """
    Opens and returns the contents of the json file given in file_path
    :param file_path: Path of a json file to be opened
    :return: Returns an obj with the contents of the json file
    """
    with open(file_path, 'r') as file_:
        loaded_file = json.load(file_)
    return loaded_file

# Get the index's settings
index_settings = open_and_return_json('chalicelib/settings.json')
# Get the index overall config
index_mapping_config = open_and_return_json('chalicelib/config.json')

file_indexer = FileIndexer(metadata_files,
                           data_files,
                           es,
                           "file_index_v4",
                           "doc",
                           index_settings=index_settings,
                           index_mapping_config=index_mapping_config)

assay_indexer = AssayOrientedIndexer(metadata_files,
                                     data_files,
                                     es,
                                     "assay_index_v4",
                                     "doc",
                                     index_settings=index_settings,
                                     index_mapping_config=index_mapping_config)

file_indexer.index(bundle_uuid, bundle_version)
#assay_indexer.index(bundle_uuid, bundle_version)
print("INDEXING DONE")
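
The debug output of this cell suggests that the indexer walks each metadata file, looks up a configured index name for certain keys (`assay_id` becomes `assayId`, `cell_handling` becomes `scCellHandling`), and appends the leaf value to a `defaultdict(list)`. The sketch below illustrates that idea only. The mapping `FIELD_NAMES` and the function `collect_fields` are hypothetical and are not the actual `FileIndexer` implementation or the format of `chalicelib/config.json`.

```python
from collections import defaultdict

# Hypothetical mapping: metadata key -> field name in the index document.
FIELD_NAMES = {"assay_id": "assayId", "cell_handling": "scCellHandling"}


def collect_fields(node, field_names, collected=None):
    """Recursively gather configured leaf values into a defaultdict(list)."""
    if collected is None:
        collected = defaultdict(list)
    if isinstance(node, dict):
        for key, value in node.items():
            if key in field_names and not isinstance(value, (dict, list)):
                # Leaf value for a configured key: store it under its index name.
                collected[field_names[key]].append(value)
            else:
                # Otherwise keep descending into nested structures.
                collect_fields(value, field_names, collected)
    elif isinstance(node, list):
        for item in node:
            collect_fields(item, field_names, collected)
    return collected


# Trimmed-down stand-in for the 'content' of assay.json shown earlier.
assay_content = {
    "assay_id": "assay_1",
    "single_cell": {"cell_handling": "FACS"},
    "rna": {"library_construction": "smart-seq2"},
}
fields = collect_fields(assay_content, FIELD_NAMES)
print(dict(fields))  # {'assayId': ['assay_1'], 'scCellHandling': ['FACS']}
```

Using a list per field means that a key appearing in several places of a bundle contributes every value rather than overwriting earlier ones.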

Content: assay_id ; Name: assayId 

("THE LEAF _file is: {'single_cell': {'cell_handling': 'FACS'}, 'core': "
 "{'type': 'assay', 'schema_url': "
 "'https://raw.githubusercontent.com/HumanCellAtlas/metadata-schema/4.6.1/json_schema/assay.json', "
 "'schema_version': '4.6.1'}, 'rna': {'end_bias': 'five_prime_end', 'strand': "
 "'both', 'library_construction': 'smart-seq2'}, 'assay_id': 'assay_1', 'seq': "
 "{'instrument_platform': 'Illumina', 'molecule': 'polyA RNA', 'paired_ends': "
 "True, 'lanes': [{'number': 1, 'r2': 'R2.fastq.gz', 'r1': 'R1.fastq.gz'}], "
 "'instrument_model': 'HiSeq 2500'}}\n"
 ' ')
the file content is: assay_1

JUST APPENDED ENTRY, dictionary right now:

defaultdict(<class 'list'>, {'assayId': ['assay_1']})
Content: cell_handling ; Name: scCellHandling 

"THE LEAF _file is: {'cell_handling': 'FACS'}\n "
the file content is: FACS

JUST APPENDED ENTRY, dictionary right now:

defaultdict(<class 'list'>,
            {'assayId': ['assay_1'],
             'scCellHandli