**Preface**  
The Genomic Data Commons (GDC) provides an API that makes it convenient to compile specific data. Here, we use it to create datasets of gene expression data from healthy patients and patients with a specific cancer (e.g., dataset X: healthy/breast cancer, dataset Y: healthy/leukemia). We relied on the GDC's user guide to navigate and use the API: https://docs.gdc.cancer.gov/API/PDF/API_UG.pdf. Because we are working with an API, we also had to get acquainted with the requests library: https://www.w3schools.com/python/module_requests.asp. Lastly, we had to learn or refresh certain pandas DataFrame manipulations: https://pandas.pydata.org/docs/reference/frame.html.




In [2]:
import requests
import json
import pandas as pd
import os
from io import StringIO

def retrieveFiles(sampleType, numSamples):#For more information regarding search and retrieval of files, refer to 1.2 of the GDC user guide
    #Here, we have a filter of filters. Only files that satisfy every child filter will be returned. 
    filters={
        "op":"and",#The main filter; every child filter in content must be satisfied for the file to be returned.
        "content":[
            {"op":"in",#Child filter 1: Only files from the TCGA-BRCA project
                "content":{
                    "field":"cases.project.project_id",
                    "value":["TCGA-BRCA"]}
            },
            {"op":"in",#Child filter 2: Only gene expression files
                "content":{
                    "field":"data_type",
                    "value":["Gene Expression Quantification"]}
            },
            {"op":"in",#Child filter 3: Only gene expression data derived from the STAR - Counts method
                "content":{
                    "field":"analysis.workflow_type",
                    "value":["STAR - Counts"]}
            },
            {"op":"in",#Child filter 4: Only files of the specified type (healthy or cancerous)
                "content":{
                    "field":"cases.samples.sample_type",
                    "value":[sampleType]}
            }
        ]
    }

    parameters={
        "filters":json.dumps(filters),#We use the filters to gather files, which we put into json format
        "fields":"file_id,file_name,cases.samples.sample_type",#The metadata we're interested in from the files we retrieved
        "format":"JSON",#The API will return the data in json format
        "size":numSamples}

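    request=requests.get("https://api.gdc.cancer.gov/files", params=parameters, timeout=60)#Make a request to the API to gather the data we specified; the timeout keeps the call from hanging indefinitely
    request.raise_for_status()#Stop with a clear error if the API did not return a successful response
    return request.json()["data"]["hits"]#Convert the json returned from the API into a dictionary, which we index to return a list of the files we collected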

def extractData(fileID, label):
    request=requests.get("https://api.gdc.cancer.gov/data/"+fileID)#Make a request to download the file's data
    data=pd.read_csv(StringIO(request.text), sep="\t",comment="#")#Make a pandas DataFrame from the downloaded data. StringIO wraps the returned string in a file-like object that read_csv can read
    
    if("gene_id" in data.columns and "unstranded" in data.columns):
        data=data[["gene_id", "unstranded"]].copy()#Keep only the gene_id and unstranded columns; copy so the assignments below act on an independent DataFrame
        data.columns=["gene_id", "expression"]#The unstranded counts are what we use as the gene expression level
    else:
        raise ValueError("gene_id and/or unstranded columns could not be found in file "+fileID)#Raise instead of exit() so the notebook reports which file failed

    data["sample_id"]=fileID
    data["label"]=label
    return data

cancerFiles=retrieveFiles("Primary Tumor", 100)
normalFiles=retrieveFiles("Solid Tissue Normal", 100)

dataFrames=[]#Where we'll store each dataframe derived from each file
for file in cancerFiles:
    data=extractData(file["file_id"], label=1)
    dataFrames.append(data)

for file in normalFiles:
    data=extractData(file["file_id"], label=0)
    dataFrames.append(data)

dataset=pd.concat(dataFrames)#Concatenate each dataframe into one to get the dataset
dataset=dataset[["sample_id", "gene_id","expression", "label"]]
dataset.to_csv("geneExpressionDataBreastCancer.tsv", sep="\t", index=False)

print("Done")


ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
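
The cell above makes one request per file, so a single dropped connection like the one shown aborts the whole run. One way to make the downloads more tolerant is to retry a failed call after a short, growing pause. The following is a minimal sketch of that idea, not part of the original notebook: `withRetries`, its parameters, and `flakyDownload` are hypothetical names introduced for illustration, and the demonstration uses a stand-in function rather than the real API.

```python
import time

def withRetries(action, attempts=3, baseDelay=1.0, retryOn=(Exception,), sleep=time.sleep):
    """Call action(); on a listed exception, wait and try again, doubling the wait each time."""
    for attempt in range(attempts):
        try:
            return action()
        except retryOn:
            if attempt == attempts - 1:
                raise  # Out of attempts: let the last error propagate
            sleep(baseDelay * 2 ** attempt)  # Waits baseDelay, 2*baseDelay, 4*baseDelay, ...

# Demonstration with a stand-in for a flaky download (no network involved)
calls = {"count": 0}

def flakyDownload():
    calls["count"] += 1
    if calls["count"] < 3:  # Fail on the first two calls, succeed on the third
        raise ConnectionError("Connection aborted.")
    return "file contents"

waits = []  # Record the requested pauses instead of actually sleeping
result = withRetries(flakyDownload, attempts=5, baseDelay=0.5,
                     retryOn=(ConnectionError,), sleep=waits.append)
print(result, waits)
```

In the notebook, the same helper could presumably wrap each download, for example `withRetries(lambda: requests.get(url, timeout=60), retryOn=(requests.exceptions.ConnectionError,))`. Note that the exception raised by requests is `requests.exceptions.ConnectionError`, which is not a subclass of Python's built-in `ConnectionError`.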

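After using the API to create a dataset, we'll have to reshape it into a wide format (one row per sample, one column per gene) so that it can be used to train and evaluate the models. The reshape will also result in a much smaller file, because the sample ID and label are no longer repeated for every gene.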
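
To make the long-to-wide reshape concrete, here is a small sketch on a made-up table with two samples and two genes. The sample IDs, gene names, and values are invented for illustration; only the column names match the dataset built above.

```python
import pandas as pd

# Long format: one row per (sample, gene) pair
longData = pd.DataFrame({
    "sample_id": ["s1", "s1", "s2", "s2"],
    "gene_id": ["geneA", "geneB", "geneA", "geneB"],
    "expression": [10, 0, 7, 3],
    "label": [1, 1, 0, 0],
})

# Wide format: one row per sample, one column per gene
wideData = longData.pivot(index="sample_id", columns="gene_id", values="expression")

# Each sample has a single label, so reduce to one row per sample and join it back
labels = longData[["sample_id", "label"]].drop_duplicates().set_index("sample_id")
wideData = wideData.join(labels)

print(wideData)
```

The four long rows become two wide rows with columns `geneA`, `geneB`, and `label`.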




In [12]:
import pandas as pd

data=pd.read_csv("geneExpressionDataBrain.tsv", sep="\t")
junk=data["gene_id"].str.startswith("N_")#Make a boolean series that is True for every junk row. These are STAR summary rows (e.g. N_unmapped), not actual genes.
data=data[~junk]#Keep everything that is not junk.

reshapedData=data.pivot(index="sample_id", columns="gene_id", values="expression")
labels=data[["sample_id", "label"]].drop_duplicates().set_index("sample_id")#The pivot function doesn't work on multiple columns at a time (e.g. gene_id and labels). Thus, we have to add the labels back.
dataset=reshapedData.join(labels)
dataset.to_csv("geneExpressionDataBrain.csv")

print("Done")


Done
