<h2> Jupyter notebook to search the ENCODE database </h2>

Import the ConnectToEncodeAPI class (+ requests and json modules)

In [1]:
from connectEncodeAPI import ConnectToEncodeAPI 

<h3> Create the connection object </h3>

In [2]:
database = ConnectToEncodeAPI()

Check the connection to the database

In [3]:
database.check_database_connection()
print("Status code of connection is %s" % database.response_status_code)

Status code of connection is 200
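
A status code of 200 means the request succeeded. As a minimal sketch of how the code could be interpreted before running any search, the helpers below (`connection_ok` and `describe_status` are hypothetical names, not part of ConnectToEncodeAPI) use only the standard library:

```python
from http import HTTPStatus


def connection_ok(status_code):
    """Return True for any 2xx (success) HTTP status code."""
    return 200 <= status_code < 300


def describe_status(status_code):
    """Return the code with its standard reason phrase, e.g. '200 OK'."""
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        # Not a status code known to the standard library
        phrase = "Unknown status"
    return "%s %s" % (status_code, phrase)


print(describe_status(200))
```

In this notebook the value to pass in would be `database.response_status_code`.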


<h3> Perform a custom search </h3>

This notebook covers six types of queries against the ENCODE API:
<ol>
    <li> General key word search </li>
    <li> Search for a biosample by ID </li>
    <li> Search for a file based on an md5 code </li>
    <li> Search for an experiment based on an experiment accession number </li>
    <li> Search for all fastq files belonging to one experiment </li>
    <li> Search for all replicates of one experiment </li>
</ol>
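
Each of these queries ends up as a URL sent to the ENCODE server. The sketch below shows how such a search URL could be assembled. It is an illustration under assumptions, not the internals of ConnectToEncodeAPI: `BASE_URL`, `build_search_url` and the `searchTerm` parameter are assumed here, while the replicate query mirrors the `'@id'` echoed in the replicate search output further down.

```python
from urllib.parse import urlencode

# Assumed host of the ENCODE portal; the wrapper class may use a different one
BASE_URL = "https://www.encodeproject.org"


def build_search_url(params):
    """Build a /search/ URL that asks the server for a JSON response."""
    query = dict(params)
    query["format"] = "json"
    return BASE_URL + "/search/?" + urlencode(query)


# General key word search (the parameter name 'searchTerm' is an assumption)
keyword_url = build_search_url({"searchTerm": "Liver cancer"})

# All replicates of one experiment
replicate_url = build_search_url(
    {"type": "Replicate", "experiment.accession": "ENCSR000AKS"}
)

print(keyword_url)
print(replicate_url)
```

`urlencode` takes care of escaping, so the space in the key word is encoded as `+` in the query string.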

<b> Perform search on a general key word </b>

In [28]:
search_request='Liver cancer'

In [29]:
database.general_search_on_key_word(key_word=search_request)

In [30]:
print("Number of results for key word '%s': %s" % (search_request, database.response_search.json()['total']))

Number of results for key word 'Liver cancer': 9961


To fix: the API request returns only the first 25 results (indices 0 to 24).
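
Because only one page of results is returned, an index above 24 raises an error even though the total is much larger. A minimal sketch of a guarded lookup follows; `pick_result` is a hypothetical helper and the sample dictionary is made up for illustration. The ENCODE REST API also documents a `limit` query parameter (for example `limit=all`); whether ConnectToEncodeAPI exposes it is not shown in this notebook.

```python
def pick_result(search_json, index):
    """Return one entry of '@graph', with a clear error if the index is out of range."""
    graph = search_json.get('@graph', [])
    if not 0 <= index < len(graph):
        raise IndexError(
            "index %d is outside the %d results returned (total reported: %s)"
            % (index, len(graph), search_json.get('total'))
        )
    return graph[index]


# Made-up response with the same shape as the search output: 3 returned, 10 in total
sample = {'total': 10, '@graph': [{'@id': '/a/'}, {'@id': '/b/'}, {'@id': '/c/'}]}
print(pick_result(sample, 1))
```

In this notebook the first argument would be `database.response_search.json()`.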

In [31]:
study_number = 11  # Python indexing starts at 0; the maximum index here is 24

In [32]:
database.response_search.json()['@graph'][study_number]

{'@id': '/publications/c1e30870-7ad5-4796-8a6d-5652c0947122/',
 '@type': ['Publication', 'Item'],
 'abstract': 'Serum gamma-glutamyl transferase (GGT) activity is a marker of liver disease which is also prospectively associated with the risk of all-cause mortality, cardiovascular disease, type 2 diabetes and cancers. We have discovered novel loci affecting GGT in a genome-wide association study (rs1497406 in an intergenic region of chromosome 1, P = 3.9 x 10(-8); rs944002 in C14orf73 on chromosome 14, P = 4.7 x 10(-13); rs340005 in RORA on chromosome 15, P = 2.4 x 10(-8)), and a highly significant heterogeneity between adult and adolescent results at the GGT1 locus on chromosome 22 (maximum P(HET) = 5.6 x 10(-12) at rs6519520). Pathway analysis of significant and suggestive single-nucleotide polymorphism associations showed significant overlap between genes affecting GGT and those affecting common metabolic and inflammatory diseases, and identified the hepatic nuclear factor (HNF) fami

<b> Perform search on biosample ID </b>

In [34]:
biosample_ID = "ENCBS000AAA"

In [35]:
database.search_for_biosample(accession_number=biosample_ID)

In [36]:
database.biosample_response.json()

{'accession': 'ENCBS000AAA',
 'url': 'http://www.atcc.org/Products/All/HTB-22.aspx',
 'aliases': ['richard-myers:MCF7-003'],
 'schema_version': '26',
 'status': 'released',
 'lab': '/labs/richard-myers/',
 'award': '/awards/U54HG004576/',
 'date_created': '2013-12-12T05:50:02.101495+00:00',
 'submitted_by': '/users/df9f3c8e-b819-4885-8f16-08f6ef0001e8/',
 'notes': '(PMID: 4357757)',
 'documents': ['/documents/984071d4-9149-476a-b353-93592c6f48f3/'],
 'references': [],
 'source': '/sources/atcc/',
 'product_id': 'HTB-22',
 'biosample_ontology': '/biosample-types/cell_line_EFO_0001203/',
 'genetic_modifications': [],
 'alternate_accessions': [],
 'description': 'mammary gland, adenocarcinoma',
 'treatments': [],
 'dbxrefs': ['UCSC-ENCODE-cv:MCF-7'],
 'donor': '/human-donors/ENCDO000AAE/',
 'organism': '/organisms/human/',
 'passage_number': 5,
 'internal_tags': [],
 'culture_harvest_date': '2012-04-10',
 'culture_start_date': '2012-03-16',
 '@id': '/biosamples/ENCBS000AAA/',
 '@type': ['

<b> Perform search for file on md5 code </b>

In [37]:
md5 = "7b9f8ccd15fea0bda867e042db2b6f5a"

In [38]:
database.search_file_on_md5(md5_code=md5)


In [40]:
database.md5_search.json()

{'@context': '/terms/',
 '@graph': [{'@id': '/files/ENCFF000BXK/',
   '@type': ['File', 'Item'],
   'accession': 'ENCFF000BXK',
   'assay_term_name': 'ChIP-seq',
   'assembly': 'hg19',
   'award': {'project': 'ENCODE'},
   'biological_replicates': [1],
   'biosample_ontology': {'organ_slims': ['bodily fluid', 'blood'],
    'term_name': 'K562'},
   'dataset': '/experiments/ENCSR000AKS/',
   'date_created': '2010-11-16T00:00:00.000000+00:00',
   'derived_from': ['/files/ENCFF000BXX/'],
   'file_format': 'bam',
   'file_size': 964104660,
   'file_type': 'bam',
   'href': '/files/ENCFF000BXK/@@download/ENCFF000BXK.bam',
   'lab': {'title': 'Bradley Bernstein, Broad'},
   'mapped_read_length': 36,
   'mapped_run_type': 'single-ended',
   'origin_batches': ['/biosamples/ENCBS639AAA/'],
   'output_category': 'alignment',
   'output_type': 'alignments',
   'quality_metrics': [],
   'read_length_units': 'nt',
   'simple_biosample_summary': '',
   'status': 'archived',
   'target': {'label': 'H3
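
The full record is long, so it can help to pull out a few fields. The sketch below assumes the response shape shown above (a `'@graph'` list of file records); `summarise_file` is a hypothetical helper, and the field names are the ones visible in the output.

```python
def summarise_file(file_record):
    """Keep a few key fields of a file record and add the size in MiB."""
    size_bytes = file_record.get('file_size')
    return {
        'accession': file_record.get('accession'),
        'file_format': file_record.get('file_format'),
        'status': file_record.get('status'),
        'dataset': file_record.get('dataset'),
        # 1 MiB = 1024 * 1024 bytes; None when the size is missing
        'size_mib': None if size_bytes is None else round(size_bytes / 2**20, 2),
    }


# Small made-up record with the same field names as the output above
example = {'accession': 'ENCFF000BXK', 'file_format': 'bam',
           'status': 'archived', 'file_size': 1048576}
print(summarise_file(example))
```

In this notebook the records would come from `database.md5_search.json()['@graph']`.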

<b> Perform search on experiment accession number </b>

In [44]:
code_for_experiment_search = "ENCSR000AKS"

In [45]:
database.search_on_experiment(experiment_code=code_for_experiment_search)

In [46]:
database.response_experiment.json()

{'@context': '/terms/',
 '@graph': [{'@id': '/files/ENCFF534KXV/',
   '@type': ['File', 'Item'],
   'accession': 'ENCFF534KXV',
   'aliases': ['encode-processing-pipeline:39f88113-ace9-4d9e-806b-33f0171bf328-encode-processing-caper_out_v04_05-chip-39f88113-ace9-4d9e-806b-33f0171bf328-call-overlap-shard-5-glob-155eada107f68a2195912a39f5dee4bc-rep2_vs_rep3.overlap.bfilt.narrowPeak.bb'],
   'alternate_accessions': [],
   'analyses': ['/analyses/ENCAN211AVB/'],
   'analysis_step_version': '/analysis-step-versions/histone-chip-seq-replicated-overlap-file-format-conversion-step-v-1-0/',
   'assay_term_name': 'ChIP-seq',
   'assay_title': 'Histone ChIP-seq',
   'assembly': 'GRCh38',
   'audit': {'INTERNAL_ACTION': [{'path': '/analyses/ENCAN211AVB/',
      'level_name': 'INTERNAL_ACTION',
      'level': 30,
      'name': 'audit_item_status',
      'detail': 'Released analysis {ENCAN211AVB|/analyses/ENCAN211AVB/} has in progress subobject quality standard {encode4-histone-chip|/quality-standard

<b> Search all replicates for one experiment </b>

In [47]:
code_for_replicate_search="ENCSR000AKS"

In [48]:
database.search_all_replicates_for_experiment(experiment_code=code_for_replicate_search)


In [49]:
database.response_replicates.json()

{'@context': '/terms/',
 '@graph': [{'@id': '/replicates/cf59360f-f46d-4cd4-91d6-f02343e3a660/',
   '@type': ['Replicate', 'Item'],
   'aliases': ['bradley-bernstein:Rep_DNA_Lib 8051']},
  {'@id': '/replicates/9b51f350-7070-48ae-9534-266b449885f8/',
   '@type': ['Replicate', 'Item'],
   'aliases': ['bradley-bernstein:Rep_DNA_Lib 8049']},
  {'@id': '/replicates/cb79831f-7786-4bcf-b881-f42740d82727/',
   '@type': ['Replicate', 'Item'],
   'aliases': []},
  {'@id': '/replicates/8adabf24-52c4-482c-a3a0-3a47e4b59af8/',
   '@type': ['Replicate', 'Item'],
   'aliases': []}],
 '@id': '/search/?type=Replicate&experiment.accession=ENCSR000AKS&format=json',
 '@type': ['Search'],
 'clear_filters': '/search/?type=Replicate',
 'columns': {'@id': {'title': 'ID'}, 'aliases': {'title': 'Aliases'}},
 'facet_groups': [],
 'facets': [{'field': 'type',
   'title': 'Data Type',
   'terms': [{'key': 'Replicate', 'doc_count': 4}],
   'total': 4,
   'type': 'terms',
   'appended': False,
   'open_on_load': Fal
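
Most of this response is search metadata (facets, columns); the replicates themselves sit in `'@graph'`. A minimal sketch for listing them, assuming the response shape shown above (`replicate_ids` is a hypothetical helper, and the sample data is copied from the first two entries of the output):

```python
def replicate_ids(replicates_json):
    """Return the '@id' of every replicate in the '@graph' list."""
    return [item['@id'] for item in replicates_json.get('@graph', [])]


# First two entries of the output above
sample_replicates = {
    '@graph': [
        {'@id': '/replicates/cf59360f-f46d-4cd4-91d6-f02343e3a660/',
         '@type': ['Replicate', 'Item'],
         'aliases': ['bradley-bernstein:Rep_DNA_Lib 8051']},
        {'@id': '/replicates/9b51f350-7070-48ae-9534-266b449885f8/',
         '@type': ['Replicate', 'Item'],
         'aliases': ['bradley-bernstein:Rep_DNA_Lib 8049']},
    ]
}
for replicate_id in replicate_ids(sample_replicates):
    print(replicate_id)
```

In this notebook the argument would be `database.response_replicates.json()`.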

<b> Search the fastq files belonging to an experiment </b>

In [50]:
code_to_search_fastq_files = "ENCSR000AKS"

In [51]:
database.search_fastq_files_experiment(experiment_code=code_to_search_fastq_files)


In [52]:
database.response_fastq_search.json()

{'@context': '/terms/',
 '@graph': [],
 '@id': '/search/?type=File&dataset=/experiments/ENCSR000AKS+/&file_format=fastq&format=json&frame=object',
 '@type': ['Search'],
 'clear_filters': '/search/?type=File',
 'columns': {'@id': {'title': 'ID'},
  'title': {'title': 'Title'},
  'accession': {'title': 'Accession'},
  'dataset': {'title': 'Dataset'},
  'assembly': {'title': 'Genome assembly'},
  'technical_replicates': {'title': 'Technical replicates'},
  'biological_replicates': {'title': 'Biological replicates'},
  'file_format': {'title': 'File Format'},
  'file_type': {'title': 'File type'},
  'file_format_type': {'title': 'File format type'},
  'file_size': {'title': 'File size'},
  'assay_term_name': {'title': 'Assay term name'},
  'biosample_ontology.term_name': {'title': 'Biosample name'},
  'biosample_ontology.organ_slims': {'title': 'Organ'},
  'simple_biosample_summary': {'title': 'Simple biosample summary'},
  'origin_batches': {'title': 'Batch'},
  'target.label': {'title': 
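
Note that `'@graph'` is empty in this response, so no fastq files were found. The `'@id'` echoed by the server contains `ENCSR000AKS+/`, and `+` is how a space is encoded in a query string, so a stray space after the accession may be the cause. That is an inference from the output, since the internals of ConnectToEncodeAPI are not shown here. The sketch below uses two hypothetical helpers to build the dataset path without whitespace and to check whether a search returned anything:

```python
def dataset_path(experiment_code):
    """Build the dataset path for an experiment, dropping surrounding whitespace."""
    return "/experiments/%s/" % experiment_code.strip()


def has_results(search_json):
    """Return True when the '@graph' list holds at least one entry."""
    return bool(search_json.get('@graph'))


print(dataset_path("ENCSR000AKS "))
print(has_results({'@graph': []}))
```

In this notebook `has_results(database.response_fastq_search.json())` would flag the empty result before any further processing.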