<a href="https://polly.elucidata.io/manage/workspaces?action=open_polly_notebook&amp;source=github&amp;path=ElucidataInc%2Fpolly-python%2Fblob%2Fmain%2Fontology_recommendation_disease_tissue.ipynb&amp;kernel=elucidata%2FPython+3&amp;machine=medium" target="_parent"><img alt="Open in Polly" src="https://elucidatainc.github.io/PublicAssets/open_polly.svg"/></a>


# Ontology recommendations for disease and tissue using polly-python

Term expansion for disease and tissue is being added to polly-python, and this notebook is a playground for testing it. Users can call a function, 'recommend', on the disease and tissue columns of the metadata.

Usage of the 'recommend' function - 

recommend(field_name, search_term, key)

field_name -> One of disease, tissue, curated_disease or curated_tissue. The examples below use disease and tissue with the V1 API, and curated_disease and curated_tissue with the V2 API.

search_term -> Disease or tissue name for which recommendations are required.

key -> Either 'match' or 'related'.

    match - Only the terms that contain an exact match of the keyword are returned.
        
    related - The expanded list contains the matched terms plus the synonyms and hypernyms of the keyword as per the MeSH ontology.
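
To make the difference between the two keys concrete, here is a minimal sketch that selects terms from an autocomplete-style response. The response layout follows the `process_result` helper defined later in this notebook; the term values and the name `select_terms` are made up for illustration.

```python
# Hypothetical autocomplete response; the term lists are invented for illustration.
mock_response = {
    "data": {"aggregations": {
        "auto_complete": {"buckets": [{"key": "lung"}, {"key": "lung epithelium"}]},
        "related_terms": {"buckets": [{"key": "bronchus"}, {"key": "lung"}]},
    }}
}

def select_terms(response, key):
    """Return matched terms for 'match', or matched plus related terms otherwise."""
    aggs = response["data"]["aggregations"]
    matched = [b["key"] for b in aggs["auto_complete"]["buckets"]]
    if key == "match":
        return matched
    related = [b["key"] for b in aggs["related_terms"]["buckets"]]
    # dict.fromkeys removes duplicates while keeping first-seen order
    return list(dict.fromkeys(matched + related))

print(select_terms(mock_response, "match"))    # ['lung', 'lung epithelium']
print(select_terms(mock_response, "related"))  # ['lung', 'lung epithelium', 'bronchus']
```

With 'related', a query therefore matches every dataset that 'match' would find, plus those tagged with the neighbouring ontology terms.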

## For users querying V1 infrastructure

For 'match' query in disease - 

query = """SELECT * FROM geo.datasets
            WHERE disease IN recommend('disease', 'obesity', 'match')"""
            
For 'related' query in tissue - 

query = """SELECT * FROM geo.datasets
            WHERE tissue IN recommend('tissue', 'lung', 'related')"""  
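
On V1, the `recommend(...)` call is rewritten into a plain SQL `IN` list before the query is sent. The following is a rough sketch of that rewrite, assuming the expanded terms are already known; `expand_in_clause` and the term list are hypothetical, and the pattern mirrors the one used by the utility functions further down.

```python
import re

# Pattern for recommend('<field>', '<term>', '<key>')
RECOMMEND_RE = r"recommend\s*\(\s*'(.*?)',\s*'(.*?)',\s*'(.*?)'\)"

def expand_in_clause(sql_query, terms):
    """Replace the first recommend(...) call with a parenthesised list of quoted terms."""
    in_list = "(" + ", ".join(f"'{t}'" for t in terms) + ")"
    # A function replacement avoids backslash interpretation in the inserted text
    return re.sub(RECOMMEND_RE, lambda _m: in_list, sql_query, count=1)

query = "SELECT * FROM geo.datasets WHERE tissue IN recommend('tissue', 'lung', 'related')"
# The term list below is invented; real terms come from the ontology service
print(expand_in_clause(query, ["lung", "bronchus", "alveolus"]))
# SELECT * FROM geo.datasets WHERE tissue IN ('lung', 'bronchus', 'alveolus')
```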

## For users querying V2 infrastructure

For 'match' query in disease - 

query = """SELECT * FROM geo.datasets WHERE CONTAINS(curated_disease, recommend('curated_disease', 'breast neoplasms', 'match'))"""

For 'related' query in tissue - 

query = """SELECT * FROM geo.datasets WHERE CONTAINS(curated_tissue, recommend('curated_tissue', 'liver', 'related'))"""
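
On V2, the curated columns are queried with `CONTAINS`, so a `recommend(...)` call expands into a group of `CONTAINS` clauses joined by `OR`. A minimal sketch of that rewrite is shown here, assuming the expanded terms are already available; `expand_contains` and the term list are hypothetical names and values used only for illustration.

```python
import re

# Pattern for CONTAINS(<column>, recommend('<field>', '<term>', '<key>'))
CONTAINS_RE = (
    r"(?i)CONTAINS\s*\(\s*(curated_disease|curated_tissue)\s*,"
    r"\s*recommend\s*\(\s*'.*?',\s*'.*?',\s*'.*?'\s*\)\s*\)"
)

def expand_contains(sql_query, terms):
    """Replace the first CONTAINS(col, recommend(...)) with OR-joined CONTAINS clauses."""
    def build(match):
        column = match.group(1)
        return "(" + " OR ".join(f"CONTAINS({column}, '{t}')" for t in terms) + ")"
    return re.sub(CONTAINS_RE, build, sql_query, count=1)

query = ("SELECT * FROM geo.datasets WHERE "
         "CONTAINS(curated_tissue, recommend('curated_tissue', 'liver', 'related'))")
# The term list below is invented; real terms come from the ontology service
print(expand_contains(query, ["liver", "hepatocyte"]))
```

Wrapping the clauses in parentheses keeps the `OR` group from interfering with any `AND` conditions elsewhere in the `WHERE` clause.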


In [2]:
# please do not modify
from IPython.display import display_html
def restartkernel() :
    display_html("<script>Jupyter.notebook.kernel.restart()</script>",raw=True)

# Install polly-python

In [1]:
pip install polly-python --user  #Restart kernel after the cell executes.

Collecting polly-python
  Downloading https://files.pythonhosted.org/packages/43/6b/1d9ae3d7941d65e8dd24becca436f52d788eb7383000179fb289297bd849/polly_python-0.0.10-py3-none-any.whl
Collecting urllib3==1.26.6 (from polly-python)
  Downloading https://files.pythonhosted.org/packages/5f/64/43575537846896abac0b15c3e5ac678d787a4021e906703f1766bfb8ea11/urllib3-1.26.6-py2.py3-none-any.whl (138kB)
Collecting python-magic==0.4.24 (from polly-python)
  Downloading https://files.pythonhosted.org/packages/d3/99/c89223c6547df268596899334ee77b3051f606077317023617b1c43162fb/python_magic-0.4.24-py2.py3-none-any.whl
Collecting boto3>=1.17.73 (from polly-python)
  Downloading https://files.pythonhosted.org/packages/9b/5e/12e50157795274dad90bbecbfe6c3283ecbdb9096462df0cc5cb30dca0ab/boto3-1.21.45-py3-none-any.whl (132kB)
Collecting cmapPy (from poll

Collecting wrapt<2,>=1.10 (from Deprecated->polly-python)
  Downloading https://files.pythonhosted.org/packages/ba/8c/3d3dff02ae905157ba417b801f4a7aa4e6fedbc43882e9c765b7aae438ac/wrapt-1.14.0-cp36-cp36m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (74kB)
Collecting cached-property; python_version < "3.8" (from h5py>=2.6.0->cmapPy->polly-python)
  Downloading https://files.pythonhosted.org/packages/48/19/f2090f7dad41e225c7f2326e4cfe6fff49e57dedb5b53636c9551f86b069/cached_property-1.5.2-py2.py3-none-any.whl
Building wheels for collected packages: retrying
Failed to build retrying
ERROR: awscli 1.18.178 has requirement botocore==1.19.18, but you'll have botocore 1.20.112 which is incompatible.
ERROR: awscli 1.18.178 has requirement s3transfer<0.4.0,>=0.3.0, but you'll have s3transfer 0.5.2 which is incompatible.
ERROR: boto3 1.21.45 has requireme

In [None]:
restartkernel() #Pause for a few seconds before the kernel is refreshed

# Import Dependencies

In [1]:
import os
from polly.auth import Polly
from polly.omixatlas import OmixAtlas

# Auth With Token on Polly

In [2]:
POLLY_REFRESH_TOKEN = os.environ['POLLY_REFRESH_TOKEN']
client = OmixAtlas(POLLY_REFRESH_TOKEN)

# Utility functions (execute the following code before using the recommend function)

In [3]:
import re

# Raw string so that \s and \( reach the regex engine unchanged
REGEX_STRING = r"recommend\s*\(\s*'(.*?)',\s*'(.*?)',\s*'(.*?)'\)"

def get_recommendation_list(sql_query):
    # re.findall already returns an empty list when nothing matches
    return re.findall(REGEX_STRING, sql_query)


def get_index(sql_query):
    # Extract the repository name that precedes ".datasets" (e.g. "geo")
    index = re.search(r"[a-z_]+\.datasets", sql_query)
    if index:
        index = index.group().split('.')[0]
    return index

def process_result(result, keyword):
    related_terms = [bucket["key"] for bucket in result["data"]["aggregations"]["related_terms"]["buckets"] ]
    auto_complete = [bucket["key"] for bucket in result["data"]["aggregations"]["auto_complete"]["buckets"] ]
    if keyword == 'match':
        return auto_complete
    else:
        return list(set(related_terms)|set(auto_complete))
    
def get_expanded_result(recommendation_list, index):
    expanded_result = []
    dis_tis_mappings = {
        "curated_disease": "disease",
        "disease": "disease",
        "curated_tissue" : "tissue",
        "tissue" : "tissue"
    }
    for recommendations in recommendation_list:
        dis_tis = dis_tis_mappings[recommendations[0]]
        term = recommendations[1]
        keyword = recommendations[2]
        
        repo_id = library_client.get_index_id(index)
        result = library_client.autocomplete(repo_id, term, field = dis_tis)
        result = process_result(result, keyword)
        expanded_result.append(result)
    return expanded_result

def edit_query(sql_query, expanded_result):
    edited_query = sql_query
    for result in expanded_result:
        result_string = "(" + ', '.join(f"'{w}'" for w in result) + ")"
        # Function replacement: backslashes in terms are not read as escapes
        edited_query = re.sub(REGEX_STRING, lambda _m: result_string, edited_query, count=1)
    return edited_query

def edit_v2_query(sql_query, expanded_result, recommendation_list):
    CONTAINS_REGEX_STRING = r"(?i)CONTAINS\s*\(\s*(curated_disease|curated_tissue)\s*,\s*recommend\s*\(\s*'(.*?)',\s*'(.*?)',\s*'(.*?)'\s*\)\s*\)"
    edited_query = sql_query
    for expanded_terms, recommend_args in zip(expanded_result, recommendation_list):
        key = recommend_args[0]
        clauses = []
        for term in expanded_terms:
            # Title-case disease terms to match values such as 'Breast Neoplasms'
            if key == 'curated_disease':
                term = term.title()
            clauses.append("CONTAINS(" + key + ", '" + term + "')")
        result_string = '(' + ' OR '.join(clauses) + ')'
        # Function replacement: backslashes in terms are not read as escapes
        edited_query = re.sub(CONTAINS_REGEX_STRING, lambda _m: result_string, edited_query, count=1)

    return edited_query

def recommend(sql_query, api_version='v2'):
    recommendation_list = get_recommendation_list(sql_query)
    if (len(recommendation_list) == 0):
        return sql_query
    index = get_index(sql_query)
    if not index:
        return sql_query
    expanded_result = get_expanded_result(recommendation_list, index)
    edited_query = ''
    if (api_version == 'v1'):
        edited_query = edit_query(sql_query, expanded_result)
    else:
        edited_query = edit_v2_query(sql_query, expanded_result, recommendation_list)
    return edited_query
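
As a quick illustration of the parsing step, the sketch below shows what the two regular expressions extract from a query that contains more than one `recommend(...)` call. It is self-contained and does not call Polly; the names `RECOMMEND_RE` and `INDEX_RE` are used only in this example.

```python
import re

RECOMMEND_RE = r"recommend\s*\(\s*'(.*?)',\s*'(.*?)',\s*'(.*?)'\)"
INDEX_RE = r"([a-z_]+)\.datasets"

query = """SELECT * FROM geo.datasets
           WHERE disease IN recommend('disease', 'nephritis', 'related')
           AND tissue IN recommend('tissue', 'kidney', 'match')"""

# One (field_name, search_term, key) tuple per recommend(...) call, in query order
calls = re.findall(RECOMMEND_RE, query)
# The repository name is whatever precedes ".datasets"
index_match = re.search(INDEX_RE, query)
index = index_match.group(1) if index_match else None

print(calls)  # [('disease', 'nephritis', 'related'), ('tissue', 'kidney', 'match')]
print(index)  # geo
```

Because the tuples come back in query order, each expansion can be substituted for the first remaining `recommend(...)` call, one at a time.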

In [None]:
from requests import Session
import json
import ssl
import logging
import os
import platform
import tempfile
from pathlib import Path
from typing import Union, Dict
from collections import namedtuple
import pandas as pd
import requests
from retrying import retry

from polly import constants as const

from polly import helpers
from polly.auth import Polly
from polly.constants import DATA_TYPES
from polly.errors import (
    QueryFailedException,
    UnfinishedQueryException,
    InvalidParameterException,
    error_handler,
    is_unfinished_query_error,
    paramException,
    wrongParamException,
    apiErrorException,
    invalidApiResponseException,
)
from deprecated import deprecated
from polly.index_schema_level_conversion_const import indexes_schema_level_map

QUERY_API_V1 = "v1"
QUERY_API_V2 = "v2"

class PollySession(Session):
    def __init__(self, REFRESH_TOKEN):
        Session.__init__(self)
        self.headers = {
            "Content-Type": "application/vnd.api+json",
            "Cookie" : f"refreshToken={REFRESH_TOKEN}",
            "User-Agent" : "polly-python/"
        }


class OmixAtlas_autoComplete:
    polly_api_url= {"dev":"dev","test":"test","prod":""}
    repo_index_id = {'gdx_files': 1646370413059, 'auron_data_lake_files': 1613041985263, 'hpa_files': 1619514006627, 'teddy_files': 1615464367934, 'valo_onco_files': 1649184014614, 'gdc_files': 1623221686703, 'geo_files': 9, 'rcsb_structures_files': 1639402777465, 'auron_single_cell_atlas_files': 1649415790935, 'enterprise_atlas_files': 1638441282192, 'cptac_files': 1609924165364, 'liveromix_atlas_files': 1615965444377, 'cbioportal_files': 1623986995264, 'depmap_files': 1612338998334, 'pcd_files': 1622113130397, 'lincs_files': 32, 'metabolomics_files': 23, 'immport_files': 1621422280385, 'transcriptomics__cyt_files': 1612862450692, 'ukbiobank_files': 1638762466067, 'exelixis_files': 1649172174566, 'bbio_files': 1647250357385, 'sc_data_lake_files': 17, 'tcga_files': 15, 'etx_files': 1641883311001, 'ngj_files': 1618578954468, 'gtex_files': 14, 'gnomad_files': 1628836648493, 'geo_raw_counts_files': 1647341066415}
    
    def __init__(self, token: str, polly_env) -> None:
        self.session = PollySession(token)
        self.base_url = "https://v2.api.{}polly.elucidata.io/v1/omixatlases".format(self.polly_api_url[polly_env])
        self.omixatlas_base_url = f"https://v2.api.{self.polly_api_url[polly_env]}polly.elucidata.io"

    def get_all_omixatlas(self):
        url = self.base_url
        params = {"summarize": "true"}
        response = self.session.get(url,params=params)
        return response.json()

    def omixatlas_summary(self, key: str):
        url = f"{self.base_url}/{key}"
        params = {"summarize": "true"}
        response = self.session.get(url,params=params)
        error_handler(response)
        return response.json()

    def autocomplete(self,atlas_id, keyword =None, field=None, put_synonyms = None):
        url = f"{self.base_url}/{atlas_id}/autocomplete"
        payload = {"data":{"attributes":{},"type":"omixatlases"}}
        if keyword is not None:
            payload["data"]["attributes"].update({"keyword":keyword})
        if field is not None:
            payload["data"]["attributes"].update({"field":field})
        if put_synonyms is not None:
            payload["data"]["attributes"].update({"put_synonyms":put_synonyms})
        response = self.session.post(url,json=payload)
        message = response.json().get('message', None)
        if message is not None:
            print(message)
        return self.__process_query_response(response.json())


    def __process_query_response(self, response: dict):
        response.pop("took", None)
        response.pop("timed_out", None)
        response.pop("_shards", None)
        processed_response = None
        try:
            hits = response.get('hits').get('hits')
            if hits:
                processed_response = pd.DataFrame(hits)
            else:
                response.pop('hits', None)
                processed_response = response
        except AttributeError:
            processed_response = response
        return processed_response 

    def get_index_id(self, index):
        index = index + "_files"
        return self.repo_index_id[index]
    
    def query_metadata(
        self,
        query: str,
        experimental_features=None,
        query_api_version=QUERY_API_V2,
        page_size=None,  # Note: do not increase page size more than 999
    ):
        max_page_size = 999
        if page_size is not None and page_size > max_page_size:
            raise ValueError(
                f"The maximum permitted value for page_size is {max_page_size}"
            )
        elif page_size is None and query_api_version != QUERY_API_V2:
            page_size = 500

        query = recommend(query, query_api_version)
        print(query)
        queries_url = f"{self.base_url}/queries"
        queries_payload = {
            "data": {
                "type": "queries",
                "attributes": {"query": query, "query_api_version": query_api_version},
            }
        }
        if experimental_features is not None:
            queries_payload.update({"experimental_features": experimental_features})

        response = self.session.post(queries_url, json=queries_payload)
        error_handler(response)

        query_data = response.json().get("data")
        query_id = query_data.get("id")
        return self._process_query_to_completion(query_id, query_api_version, page_size)

    @retry(
        retry_on_exception=is_unfinished_query_error,
        wait_exponential_multiplier=500,  # Exponential back-off starting 500ms
        wait_exponential_max=10000,  # After 10s, retry every 10s
        stop_max_delay=300000,  # Stop retrying after 300s (5m)
    )
    def _process_query_to_completion(
        self, query_id: str, query_api_version: str, page_size: Union[int, None]
    ):
        queries_url = f"{self.base_url}/queries/{query_id}"
        response = self.session.get(queries_url)
        error_handler(response)

        query_data = response.json().get("data")
        query_status = query_data.get("attributes", {}).get("status")
        if query_status == "succeeded":
            return self._handle_query_success(query_data, query_api_version, page_size)
        elif query_status == "failed":
            self._handle_query_failure(query_data)
        else:
            raise UnfinishedQueryException(query_id)

    def _handle_query_failure(self, query_data: dict):
        fail_msg = query_data.get("attributes").get("failure_reason")
        raise QueryFailedException(fail_msg)

    def _handle_query_success(
        self, query_data: dict, query_api_version: str, page_size: Union[int, None]
    ) -> pd.DataFrame:
        query_id = query_data.get("id")

        details = []
        time_taken_in_ms = query_data.get("attributes").get("exec_time_ms")
        if isinstance(time_taken_in_ms, int):
            details.append("time taken: {:.2f} seconds".format(time_taken_in_ms / 1000))
        data_scanned_in_bytes = query_data.get("attributes").get("data_scanned_bytes")
        if isinstance(data_scanned_in_bytes, int):
            details.append(
                "data scanned: {:.3f} MB".format(data_scanned_in_bytes / (1024**2))
            )

        if details:
            detail_str = ", ".join(details)
            print("Query execution succeeded " f"({detail_str})")
        else:
            print("Query execution succeeded")

        if query_api_version != QUERY_API_V2 or page_size is not None:
            return self._fetch_results_as_pages(query_id, page_size)
        else:
            return self._fetch_results_as_file(query_id)

    def _fetch_results_as_pages(self, query_id, page_size):
        first_page_url = (
            f"{self.base_url}/queries/{query_id}" f"/results?page[size]={page_size}"
        )
        response = self.session.get(first_page_url)
        error_handler(response)
        result_data = response.json()
        rows = [row_data.get("attributes") for row_data in result_data.get("data")]

        all_rows = rows

        message = "Fetched {} rows"
        print(message.format(len(all_rows)), end="\r")

        while (
            result_data.get("links") is not None
            and result_data.get("links").get("next") is not None
            and result_data.get("links").get("next") != "null"
        ):
            next_page_url = self.omixatlas_base_url + result_data.get("links").get("next")
            response = self.session.get(next_page_url)
            error_handler(response)
            result_data = response.json()
            if result_data.get("data"):
                rows = [
                    row_data.get("attributes") for row_data in result_data.get("data")
                ]
            else:
                rows = []
            all_rows.extend(rows)
            print(message.format(len(all_rows)), end="\r")

        # Blank line resets console line start position
        print()

        return pd.DataFrame(all_rows)

    def _fetch_results_as_file(self, query_id):
        results_file_req_url = (
            f"{self.base_url}/queries/{query_id}/results?action=download"
        )
        response = self.session.get(results_file_req_url)
        error_handler(response)
        result_data = response.json()

        results_file_download_url = result_data.get("data", {}).get("download_url")
        if (
            results_file_download_url is None
            or results_file_download_url == "Not available"
        ):
            # The user is probably executing SHOW TABLES or DESCRIBE query
            return self._fetch_results_as_pages(query_id, 100)

        def _local_temp_file_path(filename):
            temp_dir = Path(
                "/tmp" if platform.system() == "Darwin" else tempfile.gettempdir()
            ).absolute()

            temp_file_path = os.path.join(temp_dir, filename)
            if Path(temp_file_path).exists():
                os.remove(temp_file_path)

            return temp_file_path

        def _download_file_stream(download_url, _local_file_path):
            with requests.get(download_url, stream=True, headers={}) as r:
                r.raise_for_status()
                with open(_local_file_path, "wb") as f:
                    for chunk in r.iter_content(chunk_size=8192):
                        f.write(chunk)

        local_file_path = _local_temp_file_path(f"{query_id}.csv")
        _download_file_stream(results_file_download_url, local_file_path)

        data_df = pd.read_csv(local_file_path)
        print(f"Fetched {len(data_df.index)} rows")

        return data_df


library_client = OmixAtlas_autoComplete(POLLY_REFRESH_TOKEN, polly_env = "prod")
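
The paged fetch in `_fetch_results_as_pages` follows the `links.next` value of each response until it is missing, empty or the string "null". The sketch below reproduces that loop against an in-memory stand-in for the HTTP calls; `FAKE_PAGES`, `fetch_all_rows` and the dataset IDs are hypothetical and exist only for this example.

```python
# Hypothetical paged responses keyed by URL; stands in for session.get(url).json()
FAKE_PAGES = {
    "/results?page=1": {
        "data": [{"attributes": {"dataset_id": "A"}}],
        "links": {"next": "/results?page=2"},
    },
    "/results?page=2": {
        "data": [{"attributes": {"dataset_id": "B"}}],
        "links": {"next": None},
    },
}

def fetch_all_rows(first_url, get_page):
    """Collect the 'attributes' of every row across all pages."""
    rows = []
    url = first_url
    while url and url != "null":
        page = get_page(url)
        # A page without data contributes no rows
        rows.extend(item["attributes"] for item in page.get("data") or [])
        url = (page.get("links") or {}).get("next")
    return rows

rows = fetch_all_rows("/results?page=1", FAKE_PAGES.__getitem__)
print(rows)  # [{'dataset_id': 'A'}, {'dataset_id': 'B'}]
```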

# Queries for V1 storage infrastructure

## Previous query on V1 infrastructure
Before this feature was implemented, users queried for a given tissue and disease as shown below. 

As the output shows, this query fetches only 5 datasets for the given disease and tissue combination.

In [18]:
sql_query = """SELECT * FROM geo.datasets WHERE disease = 'nephritis' AND tissue = 'kidney' LIMIT 2000"""
result = library_client.query_metadata(query=sql_query, query_api_version="v1")
result

SELECT * FROM geo.datasets WHERE disease = 'nephritis' AND tissue = 'kidney' LIMIT 2000
Query execution succeeded
Fetched 5 rows


Unnamed: 0,file_type,tissue,disease,dataset_id,organism,dataset_source,platform,description,kw_data_type,kw_cell_type,...,kw_location,kw_timestamp,kw_smiles,publication_name,year,operation,is_public,data_repository,publication,drug
0,gct,[kidney],[Nephritis],GSE37402_GPL6246,[Mus musculus],GEO,Microarray,PGE2 promotes recovery from established Nephro...,Transcriptomics,[None],...,https://discover-prod-datalake-v1.s3-us-west-2...,1647615752910,,,,,,,,
1,gct,"[smooth muscle, serum, kidney, hypophysis]","[Renal Insufficiency, Nephritis, Cataract, Glo...",GSE69536_GPL10787,[Mus musculus],GEO,Microarray,Gene expression profile in kidney with experim...,Transcriptomics,[epithelial cell],...,https://discover-prod-datalake-v1.s3-us-west-2...,1647619035476,[Nc1ncnc2n(cnc12)[C@@H]1O[C@@H]2COP(O)(=O)O[C@...,,,,,,,
2,gct,[kidney],"[Proteinuria, Lupus Erythematosus, Systemic, N...",GSE10144_GPL6362,[Mus musculus],GEO,Microarray,Spironolactone effect on renal RNA expression ...,Transcriptomics,[None],...,https://discover-prod-datalake-v1.s3-us-west-2...,1647607811106,,,,,,,,
3,,[kidney],"[Nephritis, Megalocytic interstitial nephritis...",GSE75693_GPL570,[Homo sapiens],GEO,Microarray,Mining the Human Urine Proteome for Monitoring...,Transcriptomics,,...,https://discover-prod-datalake-v1.s3-us-west-2...,1647619587985,,27165815.0,2018.0,"{'is_normalized': 'true', 'batch_corrected_var...",True,geo,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi...,[None]
4,gct,"[kidney, serum]","[DNA Repair-Deficiency Disorders, Nephritis, P...",GSE107364_GPL10787,[Mus musculus],GEO,Microarray,NEMO is distinct from IKK2 by regulating non-c...,Transcriptomics,"[lymphocyte, T cell, regulatory T cell, T-help...",...,https://discover-prod-datalake-v1.s3-us-west-2...,1647608599458,,,,,,,,


## New queries after implementation of ontology recommendations
Users can now query as shown below. 

As per the output, the query fetches 138 datasets for the given disease and tissue combination.

In [19]:
sql_query = """SELECT * FROM geo.datasets WHERE disease IN recommend('disease', 'nephritis', 'related') 
                AND tissue IN recommend('tissue', 'kidney', 'related') LIMIT 2000"""
result = library_client.query_metadata(query=sql_query, query_api_version="v1")
result

SELECT * FROM geo.datasets WHERE disease IN ('nephritis, interstitial', 'nephritis', 'pyelonephritis', 'lupus nephritis', 'glomerulosclerosis, focal segmental', 'glomerulonephritis, iga', 'nephritis, hereditary', 'kidney tubular necrosis, acute, nephritis, interstitial', 'granulomatosis with polyangiitis', 'megalocytic interstitial nephritis', 'anti-glomerular basement membrane disease', 'glomerulonephritis, membranoproliferative', 'glomerulonephritis, membranous', 'nephrosis, lipoid', 'glomerulonephritis') 
                AND tissue IN ('renal tubule', 'renal artery', 'renal distal tubule', 'inner medullary collecting duct', 'henles loop', 'renal outer medulla', 'renal papilla', 'renal proximal tubule', 'connecting tubule', 'renal proximal tubule epithelium', 'urine', 'collecting duct', 'cortical collecting duct', 'renal glomerular capsule', 'renal medulla', 'renal parenchyma', 'renal corpuscle', 'nephron', 'tubulointerstitium', 'podocyte', 'kidney', 'renal glomerulus', 'mesangium', 

Unnamed: 0,publication_name,tissue,dataset_source,description,organism,year,disease,operation,platform,dataset_id,...,kw_timestamp,author,abstract,type,file_type,source_process,manually_curated,processing,drug,data_matrix_available
0,28242240,[kidney],GEO,NorUrsodeoxycholic Acid Ameliorates Cholemic N...,[Mus musculus],2018,"[Nephritis, Interstitial, Cholestasis]","{'is_normalized': 'true', 'batch_corrected_var...",Microarray,GSE84584_GPL16570,...,1647620075138,,,,,,,,,
1,27942582,"[blood plasma, renal proximal tubule]",GEO,Transcriptional effects of Pentraxin-2 in fibr...,[Mus musculus],2018,"[Nephritis, Hereditary, Renal Insufficiency, C...","{'is_normalized': 'true', 'batch_corrected_var...",Microarray,GSE85409_GPL6887,...,1647620156614,,,,,,,,,
2,27760209,[kidney],GEO,Kidney gene expression profiles from a lupus n...,[Mus musculus],2018,"[Proteinuria, Lupus Nephritis]","{'is_normalized': 'true', 'batch_corrected_var...",Microarray,GSE86423_GPL11180,...,1647620245266,,,,,,,,,
3,25840911,[kidney],GEO,Macrophage epoxygenase determines a pro-fibrot...,[Rattus norvegicus],2018,[Glomerulonephritis],"{'is_normalized': 'true', 'batch_corrected_var...",RNASeq,GSE65715_GPL18694,...,1647623785117,,,,,,,,,
4,30514835,[kidney],GEO,Human iPSC derived glomeruli facilitate accura...,[Homo sapiens],2019,"[Glomerulonephritis, Proteinuria]","{'is_normalized': 'true', 'batch_corrected_var...",RNASeq,GSE99583_GPL18573,...,1648049528744,"Belinda,,Phipson",Podocytes are the highly specialised cells wit...,Expression profiling by high throughput sequen...,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
133,28646076,[kidney],GEO,Transcriptomic and proteomic profiling reveal ...,[Homo sapiens],2018,"[Segmental glomerulosclerosis, Glomerulonephri...","{'is_normalized': 'true', 'batch_corrected_var...",Microarray,GSE93798_GPL22945,...,1647620645302,,,,,,,,,
134,none,[renal cortex],GEO,Global gene expression profiling on renal scar...,[Rattus norvegicus],2018,"[Vesico-Ureteral Reflux, Pyelonephritis]","{'is_normalized': 'true', 'batch_corrected_var...",Microarray,GSE7087_GPL890,...,1647619239080,,,,,,,,,
135,,"[renal glomerulus, skin, kidney]",GEO,Gene Expression Profiling of Glomeruli from a ...,[Mus musculus],,"[Diffuse Cerebral Sclerosis of Schilder, Glome...",,Microarray,GSE18358_GPL1261,...,1647613538812,,,,h5ad,connector,,,,
136,30776024,[renal glomerulus],GEO,Expression data from podocyte injured glomerulus,[Mus musculus],2018,"[Glomerulonephritis, Kidney Failure, Chronic]","{'is_normalized': 'true', 'batch_corrected_var...",Microarray,GSE112116_GPL16570,...,1647609380875,,,,,,,,,


## Other query examples on V1 infrastructure

In [5]:
sql_query = "SELECT * FROM geo.datasets WHERE disease IN recommend('disease', 'nephritis', 'match') LIMIT 2000"
result = library_client.query_metadata(query=sql_query, query_api_version="v1")
result

SELECT * FROM geo.datasets WHERE disease IN ('lupus nephritis', 'nephritis, hereditary', 'nephritis, interstitial', 'pyelonephritis', 'glomerulonephritis', 'glomerulonephritis, iga', 'glomerulonephritis, membranoproliferative', 'glomerulonephritis, membranous', 'kidney tubular necrosis, acute, nephritis, interstitial', 'megalocytic interstitial nephritis', 'nephritis') LIMIT 2000
Query execution succeeded
Fetched 162 rows


Unnamed: 0,publication_name,tissue,dataset_source,description,organism,year,disease,operation,platform,dataset_id,...,kw_timestamp,author,abstract,type,file_type,source_process,manually_curated,processing,drug,data_matrix_available
0,28242240,[kidney],GEO,NorUrsodeoxycholic Acid Ameliorates Cholemic N...,[Mus musculus],2018,"[Nephritis, Interstitial, Cholestasis]","{'is_normalized': 'true', 'batch_corrected_var...",Microarray,GSE84584_GPL16570,...,1647620075138,,,,,,,,,
1,27942582,"[blood plasma, renal proximal tubule]",GEO,Transcriptional effects of Pentraxin-2 in fibr...,[Mus musculus],2018,"[Nephritis, Hereditary, Renal Insufficiency, C...","{'is_normalized': 'true', 'batch_corrected_var...",Microarray,GSE85409_GPL6887,...,1647620156614,,,,,,,,,
2,27760209,[kidney],GEO,Kidney gene expression profiles from a lupus n...,[Mus musculus],2018,"[Proteinuria, Lupus Nephritis]","{'is_normalized': 'true', 'batch_corrected_var...",Microarray,GSE86423_GPL11180,...,1647620245266,,,,,,,,,
3,27760209,[blood],GEO,Whole blood gene expression profiles from a lu...,[Mus musculus],2018,"[Proteinuria, Lupus Nephritis]","{'is_normalized': 'true', 'batch_corrected_var...",Microarray,GSE86424_GPL11180,...,1647620357142,,,,,,,,,
4,25840911,[kidney],GEO,Macrophage epoxygenase determines a pro-fibrot...,[Rattus norvegicus],2018,[Glomerulonephritis],"{'is_normalized': 'true', 'batch_corrected_var...",RNASeq,GSE65715_GPL18694,...,1647623785117,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
157,28646076,[kidney],GEO,Transcriptomic and proteomic profiling reveal ...,[Homo sapiens],2018,"[Segmental glomerulosclerosis, Glomerulonephri...","{'is_normalized': 'true', 'batch_corrected_var...",Microarray,GSE93798_GPL22945,...,1647620645302,,,,,,,,,
158,none,[None],GEO,Molecular Evidence of Chronic Cellular Rejecti...,[Homo sapiens],2018,[Glomerulonephritis],"{'is_normalized': 'true', 'batch_corrected_var...",Microarray,GSE93659_GPL6244,...,1647620748217,,,,,,,,,
159,none,[renal cortex],GEO,Global gene expression profiling on renal scar...,[Rattus norvegicus],2018,"[Vesico-Ureteral Reflux, Pyelonephritis]","{'is_normalized': 'true', 'batch_corrected_var...",Microarray,GSE7087_GPL890,...,1647619239080,,,,,,,,,
160,30776024,[renal glomerulus],GEO,Expression data from podocyte injured glomerulus,[Mus musculus],2018,"[Glomerulonephritis, Kidney Failure, Chronic]","{'is_normalized': 'true', 'batch_corrected_var...",Microarray,GSE112116_GPL16570,...,1647609380875,,,,,,,,,


In [7]:
sql_query = "SELECT * FROM geo.datasets WHERE tissue IN recommend('tissue', 'lung', 'related') LIMIT 2000"
result = library_client.query_metadata(query=sql_query, query_api_version="v1")
result

SELECT * FROM geo.datasets WHERE tissue IN ('endothelium', 'olfactory receptor neuron', 'bronchial epithelium', 'respiratory epithelium', 'tracheal epithelium', 'respiratory system', 'lung endothelium', 'bronchiole', 'bud', 'nasal polyp', 'bronchus', 'lung epithelium', 'alveolus', 'pleura', 'pleural cavity', 'lung bud', 'alveolar epithelium', 'bronchial mucosa', 'nasal mucosa', 'lung', 'olfactory epithelium', 'epithelial lining fluid', 'bronchial smooth muscle', 'pleural fluid', 'sputum') LIMIT 2000
Query execution succeeded
Fetched 2000 rows


Unnamed: 0,tissue,dataset_source,description,organism,year,disease,operation,platform,dataset_id,is_public,...,publication_name,manually_curated,file_type,source_process,geo_summary,pubmed_id,kw_smiles,drug,processing,data_matrix_available
0,"[lung, skin]",GEO,Layered ontogeny and in situ perinatal priming...,[Mus musculus],2019,"[Hypersensitivity, Helminthiasis]","{'is_normalized': 'true', 'batch_corrected_var...",RNASeq,GSE126924_GPL21103,true,...,,,,,,,,,,
1,[lung],GEO,TGFβ-induced fibroblast activation requires pe...,[Homo sapiens],2019,[Normal],"{'is_normalized': 'true', 'batch_corrected_var...",RNASeq,GSE136534_GPL20301,true,...,,,,,,,,,,
2,[lung],GEO,Unconventional ST2- and CD127-negative lung IL...,[Mus musculus],2019,"[Asthma, Alternariosis]","{'is_normalized': 'true', 'batch_corrected_var...",RNASeq,GSE136156_GPL17021,true,...,,,,,,,,,,
3,[lung],GEO,IL-33 blockade impacts mediators of persistenc...,[Mus musculus],2019,[Bronchial Diseases],"{'is_normalized': 'true', 'batch_corrected_var...",RNASeq,GSE137324_GPL13112,true,...,31562870,,,,,,,,,
4,[lung],GEO,Impact of transcriptional mutagenesis on p53 t...,[Homo sapiens],2020,[Normal],"{'is_normalized': 'true', 'batch_corrected_var...",RNASeq,GSE138853_GPL18573,true,...,32782319,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,[lung],GEO,Effect of BET bromodomain inhibitor and/or mTO...,[Homo sapiens],2020,[Small Cell Lung Carcinoma],,RNAseq,GSE155923_GPL20301,,...,,,,,,,,,connector,False
1996,"[polyp, nasal mucosa]",GEO,Transcriptome Analysis Identifies Doublesex an...,[Homo sapiens],2020,"[Nasal Polyps, Asthma]",,RNAseq,GSE158277_GPL16791,,...,,,,,,,,,connector,False
1997,[lung],GEO,Severe COVID-19 Is Characterized by an Impaire...,[Homo sapiens],2021,[COVID-19],,RNAseq,GSE178824_GPL18573,,...,,,,,,,,,connector,False
1998,"[lung, peripheral blood]",GEO,A high OXPHOS CD8 T cell subset is predictive ...,[Homo sapiens],2020,"[Lung Neoplasms, Melanoma]",,RNAseq,GSE152590_GPL22790,,...,,,,,,,,,connector,False


# SQL Queries for V2 storage infrastructure

## Previous query on V2 infrastructure
Before this feature was implemented, users queried for a given tissue and disease as shown below. 

As the output shows, this query fetches 929 datasets for the given disease and tissue combination.

In [23]:
sql_query = """SELECT dataset_id, curated_disease, curated_tissue FROM geo.datasets WHERE 
        CONTAINS(curated_disease,'Breast Neoplasms') AND 
        CONTAINS(curated_tissue,'breast')""" 
result = library_client.query_metadata(query=sql_query, query_api_version="v2")
result

SELECT dataset_id, curated_disease, curated_tissue FROM geo.datasets WHERE 
        CONTAINS(curated_disease,'Breast Neoplasms') AND 
        CONTAINS(curated_tissue,'breast')
Query execution succeeded (time taken: 12.35 seconds, data scanned: 0.996 MB)
Fetched 929 rows


Unnamed: 0,dataset_id,curated_disease,curated_tissue
0,GSE47108_GPL6244,[Breast Neoplasms],[breast]
1,GSE5224_GPL570,"[Sialadenitis, Mastitis, Bacterial Infections,...","[mammary gland, forebrain, breast]"
2,GSE62766_GPL13607,[Breast Neoplasms],[breast]
3,GSE69296_GPL10558,"[Breast Neoplasms, Ovarian Neoplasms]",[breast]
4,GSE71283_GPL13497,"[Breast Neoplasms, Fanconi Anemia]",[breast]
...,...,...,...
924,GSE28556_GPL6885,"[Neuroblastoma, Melanoma, Breast Neoplasms, Gl...","[breast, muscle, liver, heart]"
925,GSE36565_GPL10558,"[Neoplasms, Breast Neoplasms]",[breast]
926,GSE6324_GPL96,"[Neoplasms, Blister, Unilateral Breast Neoplas...","[bone, osteoclast, breast]"
927,GSE19536_GPL6480,[Breast Neoplasms],"[lymph node, breast]"


## New queries after implementation of ontology recommendations
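With ontology recommendations, users can now query as shown below.

The output shows that this query fetches 1441 datasets for the same disease and tissue combination, up from 929 with the exact-term query, because `recommend` with the 'related' key expands each search term into its matched and related ontology terms.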

In [22]:
sql_query = """SELECT dataset_id, curated_disease, curated_tissue FROM geo.datasets WHERE 
        CONTAINS(curated_disease, recommend('curated_disease', 'breast neoplasms', 'related')) AND 
        CONTAINS(curated_tissue, recommend('curated_tissue', 'breast', 'related'))""" 
result = library_client.query_metadata(query=sql_query, query_api_version="v2")
result

SELECT dataset_id, curated_disease, curated_tissue FROM geo.datasets WHERE 
        (CONTAINS(curated_disease, 'Breast Neoplasms, Male') OR CONTAINS(curated_disease, 'Unilateral Breast Neoplasms') OR CONTAINS(curated_disease, 'Breast Neoplasms') OR CONTAINS(curated_disease, ' Breast Neoplasms') OR CONTAINS(curated_disease, 'Carcinoma, Ductal, Breast') OR CONTAINS(curated_disease, 'Inflammatory Breast Neoplasms') OR CONTAINS(curated_disease, 'Breast Neoplasms ') OR CONTAINS(curated_disease, 'Triple Negative Breast Neoplasms')) AND 
        (CONTAINS(curated_tissue, 'milk fat') OR CONTAINS(curated_tissue, 'epithelium') OR CONTAINS(curated_tissue, 'mammary gland') OR CONTAINS(curated_tissue, 'breast epithelium') OR CONTAINS(curated_tissue, 'breast') OR CONTAINS(curated_tissue, 'nipple') OR CONTAINS(curated_tissue, 'milk') OR CONTAINS(curated_tissue, 'mammary duct') OR CONTAINS(curated_tissue, 'thorax') OR CONTAINS(curated_tissue, 'colostrum') OR CONTAINS(curated_tissue, 'mammary epitheliu

Unnamed: 0,dataset_id,curated_disease,curated_tissue
0,GSE122630_GPL17303,[Breast Neoplasms],[breast]
1,GSE76487_GPL11154,"[Breast Neoplasms, Neoplasm Metastasis, Myoton...",[breast]
2,GSE6596_GPL96,[Breast Neoplasms],[mammary gland]
3,GSE72644_GPL6480,[Breast Neoplasms],"[epithelium, mammary duct]"
4,GSE74539_GPL10558,[Breast Neoplasms],[breast]
...,...,...,...
1436,GSE89206_GPL17303,[Breast Neoplasms],[mammary gland]
1437,GSE63025_GPL6246,[Breast Neoplasms],[mammary gland]
1438,GSE6324_GPL96,"[Neoplasms, Blister, Unilateral Breast Neoplas...","[bone, osteoclast, breast]"
1439,GSE76360_GPL6947,"[Triple Negative Breast Neoplasms, Neoplasm In...","[breast, node]"
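
To see which datasets the expanded query adds over the exact-term query, the two result sets can be compared on `dataset_id`. The sketch below is illustrative only: `newly_recovered` is a hypothetical helper (not part of polly-python), the IDs are placeholders, and it assumes each query result was kept in its own variable rather than overwritten in `result`.

```python
def newly_recovered(baseline_ids, expanded_ids):
    """Return the sorted dataset IDs found by the expanded query but not by the baseline."""
    return sorted(set(expanded_ids) - set(baseline_ids))


# Placeholder IDs for illustration; in the notebook these would be the
# `dataset_id` columns of the two query results.
baseline_ids = ["dataset_a", "dataset_b"]
expanded_ids = ["dataset_a", "dataset_b", "dataset_c"]
print(newly_recovered(baseline_ids, expanded_ids))
```

With real results, the call would look like `newly_recovered(baseline["dataset_id"], expanded["dataset_id"])`, where `baseline` and `expanded` are assumed names for the two DataFrames.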


## Other query examples on V2 infrastructure

In [24]:
sql_query = """SELECT * FROM geo.datasets WHERE 
                CONTAINS(curated_disease, recommend('curated_disease', 'hepatitis', 'match'))"""
result = library_client.query_metadata(query=sql_query, query_api_version="v2")
result

SELECT * FROM geo.datasets WHERE 
                (CONTAINS(curated_disease, 'Hepatitis B, Chronic') OR CONTAINS(curated_disease, 'Hepatitis C, Chronic') OR CONTAINS(curated_disease, 'Hepatitis B') OR CONTAINS(curated_disease, 'Hepatitis A') OR CONTAINS(curated_disease, 'Hepatitis C') OR CONTAINS(curated_disease, 'Hepatitis D') OR CONTAINS(curated_disease, 'Hepatitis E') OR CONTAINS(curated_disease, 'Hepatitis, Autoimmune') OR CONTAINS(curated_disease, 'Hepatitis, Alcoholic') OR CONTAINS(curated_disease, 'Hepatitis, Chronic') OR CONTAINS(curated_disease, 'Hepatitis, Viral, Human') OR CONTAINS(curated_disease, 'Hepatitis') OR CONTAINS(curated_disease, 'Hepatitis B ') OR CONTAINS(curated_disease, 'Hepatitis C ') OR CONTAINS(curated_disease, 'Schistosomiasis Japonica, Hepatitis B'))
Query execution succeeded (time taken: 13.39 seconds, data scanned: 25.490 MB)
Fetched 345 rows


Unnamed: 0,curated_organism,src_uri,total_num_samples,year,description,curated_cell_line,data_table_name,data_table_version,platform,timestamp_,...,abstract,version,curated_strain,bucket,curated_tissue,dataset_source,data_type,overall_design,is_current,region
0,[Homo sapiens],polly:data://GEO_data_lake/data/Microarray/GSE...,6.0,2018.0,Gene expression analysis of Neoechinulin B (Ne...,[Huh-7.5.1],geo__gse63026_gpl10558,0.0,Microarray,1650501838066,...,,0,[None],discover-prod-datalake-v1,[None],GEO,Transcriptomics,Total RNA obtained from NeoB-treated and un-tr...,True,us-west-2
1,[Homo sapiens],polly:data://GEO_data_lake/data/Microarray/GSE...,,2007.0,Expression data of HCV-associated advance dise...,[None],,,Microarray,1650502936804,...,,0,[None],discover-prod-datalake-v1,[liver],GEO,Transcriptomics,Liver biopsy samples were collected from patie...,True,us-west-2
2,[Homo sapiens],polly:data://GEO_data_lake/data/Microarray/GSE...,6.0,2018.0,Role of caveolin-1 in hepatocellular carcinoma...,"[Hep-G2, HLE, Huh-7]",geo__gse99131_gpl16686,0.0,Microarray,1650504311882,...,,0,[None],discover-prod-datalake-v1,[liver],GEO,Transcriptomics,The aim of this study is to identify an import...,True,us-west-2
3,[Homo sapiens],polly:data://GEO_data_lake/data/RNASeq/GSE1042...,10.0,2019.0,miRNA sequencing of serum exosomes from hepato...,[None],geo__gse104251_gpl11154,0.0,RNASeq,1650504180673,...,Exosomal microRNAs have recently been studied ...,0,[None],discover-prod-datalake-v1,[liver],GEO,Transcriptomics,Deep sequencing was performed to screen differ...,True,us-west-2
4,[Homo sapiens],polly:data://GEO_data_lake/data/RNASeq/GSE1415...,91.0,2020.0,Large-scale screening of circulating microRNAs...,[None],geo__gse141522_gpl16791,0.0,RNASeq,1650504818401,...,Human immunodeficiency virus type 1 (HIV-1)-in...,0,[None],discover-prod-datalake-v1,[blood plasma],GEO,Transcriptomics,In total 97 plasma derived small RNA samples w...,True,us-west-2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
340,[Homo sapiens],polly:data://GEO_data_lake/data/RNASeq/GSE1029...,4.0,2018.0,The hepatitis C viral protein NS5A stabilizes ...,[None],geo__gse102910_gpl11154,0.0,RNASeq,1650504107282,...,,0,[None],discover-prod-datalake-v1,[None],GEO,Transcriptomics,Calculate mRNA decay rate by examining RNA-seq...,True,us-west-2
341,[Homo sapiens],polly:data://GEO_data_lake/data/RNASeq/GSE1050...,12.0,2018.0,Gene expression analysis of human liver progen...,[None],geo__gse105019_gpl16791,0.0,RNASeq,1650504182004,...,,0,[None],discover-prod-datalake-v1,[None],GEO,Transcriptomics,Human liver progenitor-like cells were derived...,True,us-west-2
342,[Homo sapiens],polly:data://GEO_data_lake/data/Microarray/GSE...,6.0,2018.0,Global re-wiring of p53 transcription regulati...,[Hep-G2],geo__gse64875_gpl13607,0.0,Microarray,1650501995675,...,,0,[None],discover-prod-datalake-v1,[None],GEO,Transcriptomics,HepG2 cells were transduced with recombinant H...,True,us-west-2
343,[Homo sapiens],polly:data://GEO_data_lake/data/Microarray/GSE...,48.0,2018.0,Targeting innate immunity for antiviral therap...,[THP-1],geo__gse74047_gpl16699,0.0,Microarray,1650502716795,...,,0,[None],discover-prod-datalake-v1,[None],GEO,Transcriptomics,THP-1 cells were differentiated in 40nM PMA fo...,True,us-west-2
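
The expanded query echoed above is consistent with the 'match' key: every generated term contains the keyword 'hepatitis', regardless of case. A minimal sketch of that filtering step follows, assuming a case-insensitive substring test over a known vocabulary. `match_terms` and the sample vocabulary are hypothetical; the actual matching logic inside `recommend` is not shown in this notebook.

```python
def match_terms(keyword, vocabulary):
    """Return the vocabulary terms that contain the keyword, ignoring case."""
    needle = keyword.strip().lower()
    return [term for term in vocabulary if needle in term.lower()]


# Sample vocabulary for illustration. The first three terms appear in the
# echoed query above; "Liver Cirrhosis" is an invented non-matching entry.
vocabulary = [
    "Hepatitis B",
    "Hepatitis, Autoimmune",
    "Schistosomiasis Japonica, Hepatitis B",
    "Liver Cirrhosis",
]
print(match_terms("hepatitis", vocabulary))
```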


In [25]:
sql_query = """SELECT * FROM geo.datasets WHERE 
            CONTAINS(curated_tissue, recommend('curated_tissue', 'liver', 'related'))"""
result = library_client.query_metadata(query=sql_query, query_api_version="v2")
result

SELECT * FROM geo.datasets WHERE 
            (CONTAINS(curated_tissue, 'liver bud') OR CONTAINS(curated_tissue, 'bud') OR CONTAINS(curated_tissue, 'hepatocyte') OR CONTAINS(curated_tissue, 'bile') OR CONTAINS(curated_tissue, 'liver') OR CONTAINS(curated_tissue, 'bile duct'))
Query execution succeeded (time taken: 14.74 seconds, data scanned: 31.864 MB)
Fetched 4213 rows


Unnamed: 0,curated_organism,src_uri,total_num_samples,year,description,curated_cell_line,data_table_name,data_table_version,platform,timestamp_,...,abstract,version,curated_strain,bucket,curated_tissue,dataset_source,data_type,overall_design,is_current,region
0,[Mus musculus],polly:data://GEO_data_lake/data/Microarray/GSE...,12.0,2018.0,Specific Genomic and Transcriptomic Aberration...,[None],geo__gse61422_gpl6246,0.0,Microarray,1650502117178,...,,0,[FVB/N],discover-prod-datalake-v1,[liver],GEO,Transcriptomics,To explore the mechanisms of the accelerated H...,True,us-west-2
1,[Mus musculus],polly:data://GEO_data_lake/data/Microarray/GSE...,15.0,2015.0,Transcriptional regulation of hepatic target g...,[None],geo__gse68867_gpl6887,0.0,Microarray,1650502269876,...,,0,[C57BL/6],discover-prod-datalake-v1,[liver],GEO,Transcriptomics,Mice received daily intraperitoneal injection ...,True,us-west-2
2,[Mus musculus],polly:data://GEO_data_lake/data/Microarray/GSE...,28.0,2018.0,"Characterization of RA839, a non-covalent smal...",[None],geo__gse71695_gpl1261,0.0,Microarray,1650502556292,...,,0,[None],discover-prod-datalake-v1,"[bone marrow, liver]",GEO,Transcriptomics,Gene expression profile of bone marrow derived...,True,us-west-2
3,[Mus musculus],polly:data://GEO_data_lake/data/Microarray/GSE...,21.0,2016.0,JNK1 Ablation in Mice Confers Long-term Metabo...,[None],geo__gse73759_gpl7202,0.0,Microarray,1650502694124,...,,0,[None],discover-prod-datalake-v1,"[liver, adipose tissue, skin]",GEO,Transcriptomics,"RNA was collected from liver, skin and epididy...",True,us-west-2
4,[Mus musculus],polly:data://GEO_data_lake/data/Microarray/GSE...,12.0,2015.0,Fucoidan alleviates high-fat diet-induced dysl...,[None],geo__gse76374_gpl15043,0.0,Microarray,1650502175008,...,,0,[L],discover-prod-datalake-v1,"[white adipose tissue, liver, aorta]",GEO,Transcriptomics,Fucoidan were ingested by fed a high-fat diet ...,True,us-west-2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4208,[Mus musculus],polly:data://GEO_data_lake/data/RNASeq/GSE7048...,84.0,2018.0,Noncanonical Genomic Imprinting Effects In Off...,[None],geo__gse70484_gpl13112,0.0,RNASeq,1650506703326,...,,0,[C57BL/6J],discover-prod-datalake-v1,"[dorsal raphe nucleus, arcuate nucleus, liver,...",GEO,Transcriptomics,Examination of allele-specific gene expression...,True,us-west-2
4209,[Homo sapiens],polly:data://GEO_data_lake/data/RNASeq/GSE7126...,4.0,2019.0,Genome-wide maps of PBX3-binding sites and Qua...,[SMMC-7721],geo__gse71262_gpl11154,0.0,RNASeq,1650506597326,...,This SuperSeries is composed of the SubSeries ...,0,[None],discover-prod-datalake-v1,[liver],GEO,Transcriptomics,Refer to individual Series,True,us-west-2
4210,[Mus musculus],polly:data://GEO_data_lake/data/RNASeq/GSE8929...,14.0,2018.0,Role of liquid sugar in regulating the hepatic...,[None],geo__gse89296_gpl17021,0.0,RNASeq,1650507324191,...,,0,[C57BL/6N],discover-prod-datalake-v1,"[adipose tissue, liver]",GEO,Transcriptomics,"Hepatic mRNA profiles of chow, HFWD, and HFWD+...",True,us-west-2
4211,[Mus musculus],polly:data://GEO_data_lake/data/RNASeq/GSE9858...,5.0,2019.0,Gene expression profiling of continuous growth...,[None],geo__gse98584_gpl13112,0.0,RNASeq,1650507831483,...,PolyA-selected RNA isolated from livers of adu...,0,[None],discover-prod-datalake-v1,[liver],GEO,Transcriptomics,Liver RNA was isolated from 8 week old male mi...,True,us-west-2
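
The echoed queries show that each `recommend(...)` call is rewritten into a parenthesised group of `CONTAINS` clauses joined by `OR`. The sketch below reproduces that string layout, assuming the list of expanded terms is already available. `expand_contains` is a hypothetical helper, not a polly-python function, and the quote escaping is an added precaution rather than documented behaviour.

```python
def expand_contains(column, terms):
    """Build "(CONTAINS(col, 't1') OR CONTAINS(col, 't2') ...)" from a list of terms."""
    if not terms:
        raise ValueError("terms must not be empty")
    clauses = []
    for term in terms:
        escaped = term.replace("'", "''")  # double any single quote inside the literal
        clauses.append(f"CONTAINS({column}, '{escaped}')")
    return "(" + " OR ".join(clauses) + ")"


# Terms taken from the expansion of 'liver' echoed above.
print(expand_contains("curated_tissue", ["liver bud", "hepatocyte", "liver", "bile duct"]))
```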


In [17]:
sql_query = """SELECT dataset_id, curated_disease, curated_tissue FROM geo.datasets 
WHERE (CONTAINS(curated_disease, recommend('curated_disease', 'breast neoplasms', 'related')) OR
CONTAINS(curated_disease, recommend('curated_disease', 'pancreatic neoplasms', 'related'))) AND
(CONTAINS(curated_tissue, recommend('curated_tissue', 'breast', 'related')) OR 
CONTAINS(curated_tissue, recommend('curated_tissue', 'pancreas', 'related')))"""
result = library_client.query_metadata(query=sql_query, query_api_version="v2")
result

SELECT dataset_id, curated_disease, curated_tissue FROM geo.datasets 
WHERE ((CONTAINS(curated_disease, 'Breast Neoplasms, Male') OR CONTAINS(curated_disease, 'Unilateral Breast Neoplasms') OR CONTAINS(curated_disease, 'Breast Neoplasms') OR CONTAINS(curated_disease, ' Breast Neoplasms') OR CONTAINS(curated_disease, 'Carcinoma, Ductal, Breast') OR CONTAINS(curated_disease, 'Inflammatory Breast Neoplasms') OR CONTAINS(curated_disease, 'Breast Neoplasms ') OR CONTAINS(curated_disease, 'Triple Negative Breast Neoplasms')) OR
(CONTAINS(curated_disease, 'Vipoma') OR CONTAINS(curated_disease, 'Pancreatic Neoplasms ') OR CONTAINS(curated_disease, 'Adenoma, Islet Cell') OR CONTAINS(curated_disease, 'Carcinoma, Pancreatic Ductal') OR CONTAINS(curated_disease, 'Pancreatic Neoplasms') OR CONTAINS(curated_disease, 'Insulinoma')))AND 
((CONTAINS(curated_tissue, 'milk fat') OR CONTAINS(curated_tissue, 'epithelium') OR CONTAINS(curated_tissue, 'mammary gland') OR CONTAINS(curated_tissue, 'breast epit

Unnamed: 0,dataset_id,curated_disease,curated_tissue
0,GSE23720_GPL9128,"[Colonic Neoplasms, Prostatic Neoplasms, Esoph...",[breast]
1,GSE56614_GPL16791,"[Leukemia, Myeloid, Acute, Breast Neoplasms, C...",[breast]
2,GSE61375_GPL11154,[Breast Neoplasms],[breast]
3,GSE75473_GPL15520,"[Muscular Dystrophies, Colorectal Neoplasms, O...","[skeletal muscle, pancreas]"
4,GSE55947_GPL13915,"[Neoplasms, Breast Neoplasms]",[breast]
...,...,...,...
1586,GSE26539_GPL7504,[Breast Neoplasms],[breast]
1587,GSE28472_GPL13224,"[Melanoma, Hematologic Neoplasms, Breast Neopl...","[breast, kidney, blood, colon, uterine cervix,..."
1588,GSE34651_GPL2877,"[Pancreatic Neoplasms, Pancreatic adenoma]",[pancreas]
1589,GSE4025_GPL96,[Breast Neoplasms],[breast]
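
Queries that combine several diseases and tissues, like the one above, can be assembled programmatically instead of typed by hand. The following is a sketch under stated assumptions: `recommend_clause` and `build_query` are hypothetical helpers that only build the SQL string, and search terms are assumed not to contain single quotes.

```python
def recommend_clause(column, terms, key="related"):
    """OR together one CONTAINS(column, recommend(...)) clause per search term."""
    if not terms:
        raise ValueError("terms must not be empty")
    if key not in ("match", "related"):
        raise ValueError("key must be 'match' or 'related'")
    parts = [
        f"CONTAINS({column}, recommend('{column}', '{term}', '{key}'))"
        for term in terms
    ]
    return "(" + " OR ".join(parts) + ")"


def build_query(diseases, tissues, key="related"):
    """Assemble a V2 query string combining disease and tissue recommendations."""
    return (
        "SELECT dataset_id, curated_disease, curated_tissue FROM geo.datasets WHERE "
        + recommend_clause("curated_disease", diseases, key)
        + " AND "
        + recommend_clause("curated_tissue", tissues, key)
    )


sql_query = build_query(
    ["breast neoplasms", "pancreatic neoplasms"],
    ["breast", "pancreas"],
)
print(sql_query)
```

The resulting string can then be passed to `library_client.query_metadata(query=sql_query, query_api_version="v2")`, as in the cells above.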
