## GSE22058-GPL10457 part 2 (human hepatocellular carcinoma, download of raw sample data)
This script downloads and pre-processes **GSE22058-GPL10457** raw sample data from the GEO database.
<br>
<br>
The **GSE22058-GPL10457** data set consists of 192 samples:

* 96 positive samples,
* and 96 negative samples (adjacent to the positive).
<br>

**For detailed information please refer to:** https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE22058
<br>

**Related publication:**
<br>
Burchard J, Zhang C, Liu AM, Poon RT et al. microRNA-122 as a regulator of mitochondrial metabolic gene network in hepatocellular carcinoma. Mol Syst Biol 2010 Aug 24;6:402. (DOI: 10.1038/msb.2010.58)

**Before you start:**
* This code is written in **Python 3**.
* **Required libraries**: *urllib*, *pandas*, *os*

The code is written by @MelaniaNowicka, Free University of Berlin (contact: melania.nowicka@gmail.com).

**Import necessary libraries**

In [1]:
# import required libraries
from urllib import request
import pandas as pd
import os

**Download of separate samples from** https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE22058

In [2]:
path = os.getcwd()  # path to files

# get GSM ids and ids to create urls, add GSM accession ids at 'acc=' and numerical ids at 'id='
url = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?view=data&acc=&id=&db=GeoDb_blob46"  # general url
url_ids = pd.read_csv(os.path.join(path, "GSE22058_url_ids.csv"), sep=";")  # read file containing the ids
acc_list = list(url_ids.GSM_id)  # GSM accession ids
ids_list = list(url_ids.id)  # numerical ids

try:
    os.mkdir(os.path.join(path, os.path.normpath("html")))
except OSError:
    print("Creation of the directory %s failed" % path)
else:
    print("Successfully created the directory %s " % path)

# create urls, write them to files and download single sample files
with open(os.path.join(path, "GSE22058_urls.txt"), 'a+') as output:
    for i in range(0, len(acc_list)):
        print("Downloading sample: ", i+1)
        temp_url = url.replace("acc=", "acc="+acc_list[i])
        temp_url = temp_url.replace("id=", "id="+str(ids_list[i]))
        output.write(temp_url+"\n")
        request.urlretrieve(temp_url, os.path.join(path, "html/GSE22058_"+acc_list[i]))  # download sample file

Creation of the directory C:\Users\melan\PycharmProjects\RAccoon\Cancer data studies\GEO_microarray_data failed
Downloading sample:  1
Downloading sample:  2
Downloading sample:  3
Downloading sample:  4
Downloading sample:  5
Downloading sample:  6
Downloading sample:  7
Downloading sample:  8
Downloading sample:  9
Downloading sample:  10
Downloading sample:  11
Downloading sample:  12
Downloading sample:  13
Downloading sample:  14
Downloading sample:  15
Downloading sample:  16
Downloading sample:  17
Downloading sample:  18
Downloading sample:  19
Downloading sample:  20
Downloading sample:  21
Downloading sample:  22
Downloading sample:  23
Downloading sample:  24
Downloading sample:  25
Downloading sample:  26
Downloading sample:  27
Downloading sample:  28
Downloading sample:  29
Downloading sample:  30
Downloading sample:  31
Downloading sample:  32
Downloading sample:  33
Downloading sample:  34
Downloading sample:  35
Downloading sample:  36
Downloading sample:  37
...
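
The per-sample URLs above are built by string replacement on a template. As a minimal sketch of an alternative, the query string can be assembled from its parameters with `urllib.parse.urlencode`, and the output directory can be created idempotently so that a rerun does not report a failure. The helper names `build_sample_url` and `ensure_dir` and the numeric id `12345` are hypothetical and used only for illustration; they are not part of the original notebook.

```python
import os
from urllib.parse import urlencode

# Base address of the GEO accession viewer (same host and script as in the notebook).
BASE_URL = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"


def build_sample_url(gsm_id, numeric_id, db="GeoDb_blob46"):
    """Assemble the per-sample data URL from its query parameters (hypothetical helper)."""
    query = urlencode({"view": "data", "acc": gsm_id, "id": numeric_id, "db": db})
    return BASE_URL + "?" + query


def ensure_dir(directory):
    """Create the directory if it is missing; do nothing if it already exists."""
    os.makedirs(directory, exist_ok=True)
    return directory


# 12345 is a placeholder, not a real GEO numerical id.
example_url = build_sample_url("GSM548041", 12345)
print(example_url)
```

Building the URL from a dictionary keeps each parameter explicit, and `exist_ok=True` makes the directory step safe to repeat.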

**Pre-process sample files to create two data sets: normalized and non-normalized data set**

In [3]:
# process sample files
header = ["ID_REF", "RAW_VALUE", "VALUE", "PVALUE"]  # create data header (original GEO header)
raw_values = pd.DataFrame.from_dict(data={})  # create empty data frame for raw values
#normalized_values = pd.DataFrame.from_dict(data={})  # create empty data frame for normalized values
web_mir_ids = []  # list of miRNA ids from web page
i = 0  # sample counter

# read and pre-process samples
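for filename in sorted(os.listdir(os.path.join(path, "html"))):  # iterate over files in a fixed order
    if filename.endswith(".csv"):  # skip per-sample .csv files written by a previous run
        continue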
    print("Processing file: ", filename)

    # strip and separate file lines
    with open(os.path.join(path, "html", filename)) as f:
        lines = [line.rstrip() for line in f]  # strip lines
    sample = lines[24:244]  # get sample-related lines (rest is html code)

    sample_split = [line.split("\t") for line in sample]  # split sample data by tab
    sample_split_df = pd.DataFrame(sample_split, columns=header)  # and create data frame

    if i == 0:  # if this is the first sample get the miRNA ids as template
        web_mir_ids = [row[0] for row in sample_split]
        raw_values["miR_IDS"] = web_mir_ids
        #print("miRNA IDs added!")
    else:  # if not just get the miRNA ids from the new file
        temp_ids = [row[0] for row in sample_split]
        if temp_ids != web_mir_ids:  # and compare with the template
            print("IDs do not match for sample:" + str(i) + "!")  # the ids must be identical for all the files

    # create raw value data sets
    temp_raw_values = [row[1] for row in sample_split]  # get the raw sample values
    #temp_normalized_values = [row[2] for row in sample_split]  # get the normalized sample values
    # name the column after the GSM accession taken from the file name,
    # because the order returned by os.listdir need not match acc_list
    raw_values[filename.replace("GSE22058_", "", 1)] = temp_raw_values
    #normalized_values[acc_list[i]] = temp_normalized_values  # same for the normalized values
    i += 1  # count samples

    # save sample file
    filename_temp = filename + ".csv"
    sample_split_df.to_csv(path_or_buf=os.path.join(path, "html/", filename_temp), sep=';', index=False)

Processing file:  GSE22058_GSM548041
Processing file:  GSE22058_GSM548042
Processing file:  GSE22058_GSM548043
Processing file:  GSE22058_GSM548044
Processing file:  GSE22058_GSM548045
Processing file:  GSE22058_GSM548046
Processing file:  GSE22058_GSM548047
Processing file:  GSE22058_GSM548048
Processing file:  GSE22058_GSM548049
Processing file:  GSE22058_GSM548050
Processing file:  GSE22058_GSM548051
Processing file:  GSE22058_GSM548052
Processing file:  GSE22058_GSM548053
Processing file:  GSE22058_GSM548054
Processing file:  GSE22058_GSM548055
Processing file:  GSE22058_GSM548056
Processing file:  GSE22058_GSM548057
Processing file:  GSE22058_GSM548058
Processing file:  GSE22058_GSM548059
Processing file:  GSE22058_GSM548060
Processing file:  GSE22058_GSM548061
Processing file:  GSE22058_GSM548062
Processing file:  GSE22058_GSM548063
Processing file:  GSE22058_GSM548064
Processing file:  GSE22058_GSM548065
Processing file:  GSE22058_GSM548066
Processing file:  GSE22058_GSM548067
...
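
The cell above takes the data rows from a fixed line range of each downloaded page. A minimal sketch of a less position-dependent variant is shown below: it keeps every line that splits into four tab-separated fields and starts with a numeric ID. It assumes that platform IDs are purely numeric (the notebook later converts them with `int`). The helper names `extract_data_rows` and `accession_from_filename` and the toy page content are hypothetical.

```python
def extract_data_rows(lines, n_fields=4):
    """Return tab-separated rows that look like sample data (hypothetical helper).

    A line is kept when it has exactly n_fields fields and a numeric first field,
    which skips HTML markup and the ID_REF header line.
    """
    rows = []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) == n_fields and fields[0].isdigit():
            rows.append(fields)
    return rows


def accession_from_filename(filename, prefix="GSE22058_"):
    """Recover the GSM accession from a file name such as 'GSE22058_GSM548041'."""
    name = filename[:-len(".csv")] if filename.endswith(".csv") else filename
    return name[len(prefix):] if name.startswith(prefix) else name


# Toy page with made-up values, standing in for a downloaded sample file.
page = "<html>\n<pre>\nID_REF\tRAW_VALUE\tVALUE\tPVALUE\n1\t10.5\t0.2\t0.01\n2\t7.0\t-0.1\t0.30\n</pre>\n</html>\n"
rows = extract_data_rows(page.splitlines())
print(rows)
print(accession_from_filename("GSE22058_GSM548041"))
```

Deriving the accession from the file name avoids relying on the directory listing order matching the order of the accession list.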

**Compare downloaded samples with series matrix**

In [4]:
# read sample order and annotation
sample_order_and_annot = pd.read_csv(path+"/GSE22058_sample_info.csv", sep=";")

# create dictionaries translating from gsm id to int id and gsm id to annotation
gsm_to_id = dict(zip(sample_order_and_annot.original_ids, sample_order_and_annot.new_ids))
gsm_to_annot = dict(zip(sample_order_and_annot.original_ids, sample_order_and_annot.annotation))

# compare miRNA IDs from GPL platform and sample web page
# web_mir_ids = [int(i) for i in web_mir_ids]
# if list(mirna_ids.ID) == web_mir_ids:
#     print("miRNA ids are complete and identical.")
# else:
#    print("miRNA ids are incomplete, in wrong order or not identical!")
#    print(set(list(mirna_ids.ID)) - set(web_mir_ids))

# compare platform IDs from series matrix (file created in R using GEOquery) with the sample web page
mirna_ids_from_sm = pd.read_csv(path+"/mir_ids_from_series_matrix.csv", sep=";")
web_mir_ids = [int(i) for i in web_mir_ids]  # convert from string to int
if list(mirna_ids_from_sm.platform_ID) == web_mir_ids:
    print("miRNA ids are complete and identical.")
else:  # if not complete or not identical show which ones differ
    print("miRNA ids are incomplete, in wrong order or not identical!")
    print(set(list(mirna_ids_from_sm.platform_ID)) - set(web_mir_ids))

miRNA ids are complete and identical.
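
The comparison above reports only the IDs that are in the series matrix but missing from the web page. As a hedged sketch, a small helper can report differences in both directions and distinguish a pure ordering difference from missing IDs. The function name `compare_id_lists` and the toy ID lists are hypothetical.

```python
def compare_id_lists(reference, observed):
    """Summarize how two ID lists differ (hypothetical helper)."""
    ref_set, obs_set = set(reference), set(observed)
    return {
        "identical": list(reference) == list(observed),  # same IDs in the same order
        "same_members": ref_set == obs_set,              # same IDs, order ignored
        "missing": sorted(ref_set - obs_set),            # in reference only
        "unexpected": sorted(obs_set - ref_set),         # in observed only
    }


# Toy examples: one reordered list and one list with a replaced ID.
reordered = compare_id_lists([1, 2, 3], [1, 3, 2])
replaced = compare_id_lists([1, 2, 3], [1, 2, 4])
print(reordered)
print(replaced)
```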


**Create raw-value data set**

In [5]:
# create raw values dataset with the order "ID", "Annots", "mir1", "mir2", etc.
header = ["ID", "Annots"] + list(mirna_ids_from_sm.miR_ID)
raw_dataset = pd.DataFrame(columns=header)
list_of_sample_dicts = []
for sample in list(raw_values.columns[1:]):  # iterate over samples
    sample_dict = dict(zip(header, [gsm_to_id[sample], gsm_to_annot[sample]]+list(raw_values[sample])))
    list_of_sample_dicts.append(sample_dict)

# create data set and sort by ID
raw_dataset = pd.DataFrame(list_of_sample_dicts)
raw_dataset = raw_dataset.sort_values("ID", axis=0)

# normalized data were pre-processed only for comparison with the series matrix
# create normalized values dataset with the order "ID", "Annots", "mir1", "mir2", etc.
#normalized_dataset_web = pd.DataFrame(columns=header)  # create empty data frame
#list_of_sample_dicts = []
#for sample in list(normalized_values.columns):  # iterate over samples
#    values = [str(i).rstrip('0') for i in list(normalized_values[sample])]
#    sample_dict = dict(zip(header, [gsm_to_id[sample], gsm_to_annot[sample]]+values))
#    list_of_sample_dicts.append(sample_dict)

# create data set and sort by ID
#normalized_dataset_web = pd.DataFrame(list_of_sample_dicts)
#normalized_dataset_web = normalized_dataset_web.sort_values("ID", axis=0)

# raw_dataset.to_csv(path_or_buf=os.path.join(path, "samples/")+"GSE22058_non_norm_non_filter.csv", sep=';',
# index=False)

# normalized_dataset_web.to_csv(path_or_buf=os.path.join(path, "samples/")+"GSE22058_norm_non_filter.csv", sep=';',
# index=False)
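
The cell above builds the samples-by-miRNA table one dictionary per sample. A minimal sketch of an equivalent route is to transpose the miRNA-by-sample frame and then add the `ID` and `Annots` columns. All values, sample names, and miRNA names below are toy placeholders, not data from GSE22058.

```python
import pandas as pd

# Toy stand-ins for the notebook's objects (hypothetical values).
raw_values = pd.DataFrame({
    "miR_IDS": ["1", "2", "3"],
    "GSM_B": ["5.0", "6.0", "7.0"],
    "GSM_A": ["1.0", "2.0", "3.0"],
})
mir_names = ["hsa-miR-a", "hsa-miR-b*", "mmu-miR-c"]
gsm_to_id = {"GSM_A": 1, "GSM_B": 2}
gsm_to_annot = {"GSM_A": 1, "GSM_B": 0}

# Transpose so that each row is a sample and each column a miRNA.
wide = raw_values.drop(columns="miR_IDS").T
wide.columns = mir_names

# Add annotation and integer ID as leading columns, looked up by GSM accession.
wide.insert(0, "Annots", [gsm_to_annot[gsm] for gsm in wide.index])
wide.insert(0, "ID", [gsm_to_id[gsm] for gsm in wide.index])

# Sort by the integer ID, as in the notebook.
wide = wide.sort_values("ID").reset_index(drop=True)
print(wide)
```

Unlike a dictionary keyed by miRNA name, the transpose keeps every column even if two miRNA names happen to be identical.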

**Remove non-human and star (\*) miRNAs**

In [6]:
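# filter * and non-human miRNAs from the header
print(len(header))  # header length before filtering
header = [i for i in header if "*" not in i]
print(len(header))  # after removing *-miRNAs
header = [i for i in header if "hsa" in i]
print(len(header))  # after keeping human (hsa) miRNAs only; "ID" and "Annots" are dropped here
header = ["ID", "Annots"] + header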

# filter data sets using the filtered header
raw_dataset_filtered = raw_dataset[header]
#normalized_dataset_web_filtered = normalized_dataset_web[header]

222
212
210
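
The two filters above can also be expressed as a single pass that leaves the fixed columns untouched. The following is a minimal sketch using the same substring tests as the notebook (`"hsa"` present, `"*"` absent). The function name `keep_human_non_star` and the example names are hypothetical.

```python
def keep_human_non_star(names, fixed=("ID", "Annots")):
    """Keep the fixed columns plus human, non-star miRNA names (hypothetical helper)."""
    kept = [n for n in names if n not in fixed and "hsa" in n and "*" not in n]
    return list(fixed) + kept


# Toy header with placeholder miRNA names.
toy_header = ["ID", "Annots", "hsa-miR-1", "hsa-miR-1*", "mmu-miR-2", "hsa-let-7a"]
filtered = keep_human_non_star(toy_header)
print(filtered)
```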


**Save raw data set to .csv**

In [7]:
raw_dataset_filtered.to_csv(path_or_buf=path+"/GSE22058_non_norm_formatted.csv", sep=';',
                            index=False)
# normalized_dataset_web_filtered.to_csv(path_or_buf=os.path.join(path, "samples/")+"GSE22058_norm_formatted.csv", sep=';', index=False)
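
After saving, it can be useful to confirm that the file reads back with the same separator and shape. The sketch below performs such a round trip in memory with `io.StringIO` instead of a file on disk. The table contents are toy values and not part of the data set.

```python
import io

import pandas as pd

# Toy table shaped like the formatted data set (hypothetical values).
table = pd.DataFrame({"ID": [1, 2], "Annots": [1, 0], "hsa-miR-a": [10.5, 7.0]})

# Write with the same options as the notebook: semicolon separator, no index column.
buffer = io.StringIO()
table.to_csv(buffer, sep=";", index=False)

# Read back with the matching separator and compare the basic structure.
buffer.seek(0)
restored = pd.read_csv(buffer, sep=";")
print(restored.shape)
print(list(restored.columns))
```

Reading the file with the default comma separator would yield a single column, so passing `sep=";"` on both sides matters.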