In [1]:
from genefab import get_datasets, GLDS

Requesting a dataset directly:

In [2]:
glds = GLDS("GLDS-4")
glds

name: GLDS-4
assays: ['a_GLDS-4_microarray_metadata-txt']
factors: ['Spaceflight', 'Cosmic Radiation']

In [3]:
assay = glds.assays[0]
assay

name: a_GLDS-4_microarray_metadata-txt
samples: ['Mmus_C57-6T_TMS_FLT_Rep1', 'Mmus_C57-6T_TMS_FLT_Rep2', 'Mmus_C57-6T_TMS_FLT_Rep3', 'Mmus_C57-6T_TMS_FLT_Rep4', 'Mmus_C57-6T_TMS_GC_Rep1', 'Mmus_C57-6T_TMS_GC_Rep2', 'Mmus_C57-6T_TMS_GC_Rep3', 'Mmus_C57-6T_TMS_GC_Rep4']
factor values: {'Factor Value:  Spaceflight': {'Space Flight', 'Ground Control'}}

In [4]:
assay.has_arrays, assay.has_normalized_data, assay.has_processed_data

(True, True, True)

In [5]:
assay.samples

['Mmus_C57-6T_TMS_FLT_Rep1',
 'Mmus_C57-6T_TMS_FLT_Rep2',
 'Mmus_C57-6T_TMS_FLT_Rep3',
 'Mmus_C57-6T_TMS_FLT_Rep4',
 'Mmus_C57-6T_TMS_GC_Rep1',
 'Mmus_C57-6T_TMS_GC_Rep2',
 'Mmus_C57-6T_TMS_GC_Rep3',
 'Mmus_C57-6T_TMS_GC_Rep4']

`Assay.metadata` is dataframe-like; by default, it is indexed by *"Sample Name"*.  
Column indexing can only be done with the double square bracket syntax, because one human-readable field name can map to multiple fields, and double square bracket indexing is expected to return a DataFrame (not a Series):

In [6]:
assay.metadata[["Protocol REF"]]

Unnamed: 0_level_0,a100014protocolref,a100009protocolref,a100004protocolref,a100019protocolref,a100022protocolref,a100017protocolref
a100000samplename,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Mmus_C57-6T_TMS_FLT_Rep1,nucleic acid hybridization,labeling,RNA extraction,normalization data transformation,GeneLab data processing protocol,scan protocol
Mmus_C57-6T_TMS_FLT_Rep2,nucleic acid hybridization,labeling,RNA extraction,normalization data transformation,GeneLab data processing protocol,scan protocol
Mmus_C57-6T_TMS_FLT_Rep3,nucleic acid hybridization,labeling,RNA extraction,normalization data transformation,GeneLab data processing protocol,scan protocol
Mmus_C57-6T_TMS_FLT_Rep4,nucleic acid hybridization,labeling,RNA extraction,normalization data transformation,GeneLab data processing protocol,scan protocol
Mmus_C57-6T_TMS_GC_Rep1,nucleic acid hybridization,labeling,RNA extraction,normalization data transformation,GeneLab data processing protocol,scan protocol
Mmus_C57-6T_TMS_GC_Rep2,nucleic acid hybridization,labeling,RNA extraction,normalization data transformation,GeneLab data processing protocol,scan protocol
Mmus_C57-6T_TMS_GC_Rep3,nucleic acid hybridization,labeling,RNA extraction,normalization data transformation,GeneLab data processing protocol,scan protocol
Mmus_C57-6T_TMS_GC_Rep4,nucleic acid hybridization,labeling,RNA extraction,normalization data transformation,GeneLab data processing protocol,scan protocol
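
As a rough illustration of why double square brackets always give a DataFrame, here is a minimal pandas sketch. The `FIELD_MAP` dictionary and the `select_fields` helper are hypothetical and only mimic the behaviour described above; they are not genefab's implementation. The internal column names are borrowed from the output above, cut down to two samples.

```python
import pandas as pd

# Hypothetical mapping (an assumption for illustration):
# one human-readable title -> one or more internal column names.
FIELD_MAP = {
    "Protocol REF": ["a100004protocolref", "a100009protocolref"],
    "Label": ["a100011label"],
}

internal = pd.DataFrame(
    {
        "a100004protocolref": ["RNA extraction", "RNA extraction"],
        "a100009protocolref": ["labeling", "labeling"],
        "a100011label": ["biotin", "biotin"],
    },
    index=pd.Index(
        ["Mmus_C57-6T_TMS_FLT_Rep1", "Mmus_C57-6T_TMS_FLT_Rep2"],
        name="a100000samplename",
    ),
)

def select_fields(frame, titles):
    """Return all internal columns that belong to any of the requested titles."""
    columns = [col for title in titles for col in FIELD_MAP[title]]
    # Selecting with a list always yields a DataFrame, even for one column
    return frame[columns]

protocols = select_fields(internal, ["Protocol REF"])  # two internal columns
labels = select_fields(internal, ["Label"])            # one column, still a DataFrame
```

Because a single title may expand to several columns, returning a Series would only work some of the time; always returning a DataFrame keeps the result type stable.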


For column indexing, human-readable field names are interpreted as case-insensitive regex patterns, for flexibility when dealing with differing titles in the API (for example, starting with "Comment:  " or not, as well as upper/lowercase differences).  
To keep the matching predictable, a pattern has to match the whole field name, from start to end:

In [7]:
assay.metadata[[".*plots.*", ".*images.*"]]

Unnamed: 0_level_0,a100025commentrawdataplots,a100024commentrawdataimages,a100029commentnormalizeddataplots
a100000samplename,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Mmus_C57-6T_TMS_FLT_Rep1,GLDS-4_array_Mmus_C57-6T_TMS_FLT_GC_All_Sample...,GLDS-4_array_Mmus_C57-6T_TMS_FLT_Rep1_GSM45859...,GLDS-4_array_Mmus_C57-6T_TMS_FLT_GC_All_Sample...
Mmus_C57-6T_TMS_FLT_Rep2,GLDS-4_array_Mmus_C57-6T_TMS_FLT_GC_All_Sample...,GLDS-4_array_Mmus_C57-6T_TMS_FLT_Rep2_GSM45859...,GLDS-4_array_Mmus_C57-6T_TMS_FLT_GC_All_Sample...
Mmus_C57-6T_TMS_FLT_Rep3,GLDS-4_array_Mmus_C57-6T_TMS_FLT_GC_All_Sample...,GLDS-4_array_Mmus_C57-6T_TMS_FLT_Rep3_GSM45859...,GLDS-4_array_Mmus_C57-6T_TMS_FLT_GC_All_Sample...
Mmus_C57-6T_TMS_FLT_Rep4,GLDS-4_array_Mmus_C57-6T_TMS_FLT_GC_All_Sample...,GLDS-4_array_Mmus_C57-6T_TMS_FLT_Rep4_GSM45859...,GLDS-4_array_Mmus_C57-6T_TMS_FLT_GC_All_Sample...
Mmus_C57-6T_TMS_GC_Rep1,GLDS-4_array_Mmus_C57-6T_TMS_FLT_GC_All_Sample...,GLDS-4_array_Mmus_C57-6T_TMS_GC_Rep1_GSM458598...,GLDS-4_array_Mmus_C57-6T_TMS_FLT_GC_All_Sample...
Mmus_C57-6T_TMS_GC_Rep2,GLDS-4_array_Mmus_C57-6T_TMS_FLT_GC_All_Sample...,GLDS-4_array_Mmus_C57-6T_TMS_GC_Rep2_GSM458599...,GLDS-4_array_Mmus_C57-6T_TMS_FLT_GC_All_Sample...
Mmus_C57-6T_TMS_GC_Rep3,GLDS-4_array_Mmus_C57-6T_TMS_FLT_GC_All_Sample...,GLDS-4_array_Mmus_C57-6T_TMS_GC_Rep3_GSM458600...,GLDS-4_array_Mmus_C57-6T_TMS_FLT_GC_All_Sample...
Mmus_C57-6T_TMS_GC_Rep4,GLDS-4_array_Mmus_C57-6T_TMS_FLT_GC_All_Sample...,GLDS-4_array_Mmus_C57-6T_TMS_GC_Rep4_GSM458601...,GLDS-4_array_Mmus_C57-6T_TMS_FLT_GC_All_Sample...
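
The matching rule can be sketched with the standard `re` module. This is only an assumption about how such matching might be done (`re.fullmatch` with `re.IGNORECASE`); the `match_fields` helper is hypothetical and the field list is a subset of the titles shown in this notebook.

```python
import re

FIELDS = [
    "Sample Name",
    "Protocol REF",
    "Comment:  Raw Data Images",
    "Comment:  Raw Data Plots",
    "Comment:  Normalized Data Plots",
]

def match_fields(pattern, fields=FIELDS):
    """Return the titles that the pattern matches completely, ignoring case."""
    compiled = re.compile(pattern, flags=re.IGNORECASE)
    return [field for field in fields if compiled.fullmatch(field)]

plots = match_fields(".*plots.*")    # wildcards on both sides cover the whole title
partial = match_fields("plots")      # no wildcards, so nothing matches in full
exact = match_fields("protocol ref") # case differences are ignored
```

Under this rule a bare `"plots"` selects nothing, which is why the cell above wraps the keywords in `.*`.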


In [8]:
assay.metadata

index: ['Mmus_C57-6T_TMS_FLT_Rep1', 'Mmus_C57-6T_TMS_FLT_Rep2', 'Mmus_C57-6T_TMS_FLT_Rep3', 'Mmus_C57-6T_TMS_FLT_Rep4', 'Mmus_C57-6T_TMS_GC_Rep1', 'Mmus_C57-6T_TMS_GC_Rep2', 'Mmus_C57-6T_TMS_GC_Rep3', 'Mmus_C57-6T_TMS_GC_Rep4']
fields: ['Factor Value:  Spaceflight', 'Protocol REF', 'Material Type', 'Extract Name', 'Labeled Extract Name', 'Label', 'Hybridization Assay Name', 'Array Design REF', 'Array Data File', 'Normalization Name', 'Derived Array Data File', 'Comment:  GeneLab Processed Raw Data Files', 'Comment:  Raw Data Images', 'Comment:  Raw Data Plots', 'Comment:  Supplemental Materials', 'Comment:  Normalized Data Files', 'Comment:  Normalized Annotated Data Files', 'Comment:  Normalized Data Plots']
factor values: {'Factor Value:  Spaceflight': {'Space Flight', 'Ground Control'}}

`Assay.factors` gets you factors in the assay:

In [9]:
assay.factors

Factor,Factor Value: Spaceflight
Sample Name,Unnamed: 1_level_1
Mmus_C57-6T_TMS_FLT_Rep1,Space Flight
Mmus_C57-6T_TMS_FLT_Rep2,Space Flight
Mmus_C57-6T_TMS_FLT_Rep3,Space Flight
Mmus_C57-6T_TMS_FLT_Rep4,Space Flight
Mmus_C57-6T_TMS_GC_Rep1,Ground Control
Mmus_C57-6T_TMS_GC_Rep2,Ground Control
Mmus_C57-6T_TMS_GC_Rep3,Ground Control
Mmus_C57-6T_TMS_GC_Rep4,Ground Control


Processed data can be requested:  
*(processed == normalized and annotated)*

In [10]:
assay.processed_data[:5]
# alias: assay.normalized_annotated_data[:5]

Sample Name,Mmus_C57-6T_TMS_FLT_Rep1,Mmus_C57-6T_TMS_FLT_Rep2,Mmus_C57-6T_TMS_FLT_Rep3,Mmus_C57-6T_TMS_FLT_Rep4,Mmus_C57-6T_TMS_GC_Rep1,Mmus_C57-6T_TMS_GC_Rep2,Mmus_C57-6T_TMS_GC_Rep3,Mmus_C57-6T_TMS_GC_Rep4
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
NM_001355712,8.942076,9.468133,9.212215,9.414233,9.506787,9.364989,9.379174,9.451228
NM_001310442,7.993204,8.430513,8.214404,8.559117,8.521084,8.46005,8.294821,8.329526
NM_001204371,3.892463,4.113051,4.236066,3.673601,4.171318,4.231676,3.689275,3.571904
NM_009826,8.329473,8.507683,8.32748,8.451816,8.301892,8.345079,8.109227,8.109115
NM_001195732,3.533509,3.292036,3.958079,3.566297,3.561111,3.504118,3.541423,3.492351


So can normalized (non-annotated) data:

In [11]:
assay.normalized_data[:5]

Sample Name,Mmus_C57-6T_TMS_FLT_Rep1,Mmus_C57-6T_TMS_FLT_Rep2,Mmus_C57-6T_TMS_FLT_Rep3,Mmus_C57-6T_TMS_FLT_Rep4,Mmus_C57-6T_TMS_GC_Rep1,Mmus_C57-6T_TMS_GC_Rep2,Mmus_C57-6T_TMS_GC_Rep3,Mmus_C57-6T_TMS_GC_Rep4
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
10338001,11.710316,11.665027,11.922767,11.531424,11.561032,11.915841,11.381127,11.457108
10338002,6.272419,6.437356,6.010005,6.530253,6.487918,6.439028,6.399789,6.41453
10338003,10.310764,10.369588,10.635912,10.140038,10.223866,10.679233,10.020653,10.107153
10338004,9.229938,9.265878,9.476431,9.128688,9.292231,9.508814,8.969674,9.111594
10338005,1.892674,2.10161,2.015423,2.07719,2.065804,2.108446,2.053311,2.077626


Datasets can be constructed so that the assays are indexed by a different field.  
Again, `index_by` is interpreted as a case-insensitive regex pattern that has to match the field name from start to end:

In [12]:
glds = GLDS("GLDS-4", index_by="hybridization assay name")
assay = glds.assays[0]
assay.processed_data[:5]

Hybridization Assay Name,GSM458594,GSM458595,GSM458596,GSM458597,GSM458598,GSM458599,GSM458600,GSM458601
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
NM_001355712,8.942076,9.468133,9.212215,9.414233,9.506787,9.364989,9.379174,9.451228
NM_001310442,7.993204,8.430513,8.214404,8.559117,8.521084,8.46005,8.294821,8.329526
NM_001204371,3.892463,4.113051,4.236066,3.673601,4.171318,4.231676,3.689275,3.571904
NM_009826,8.329473,8.507683,8.32748,8.451816,8.301892,8.345079,8.109227,8.109115
NM_001195732,3.533509,3.292036,3.958079,3.566297,3.561111,3.504118,3.541423,3.492351


Note that "Sample Name" is now among the available fields, while "Hybridization Assay Name" is not (it is no longer a metadata column; it is now the index):

In [13]:
assay.metadata

index: ['GSM458594', 'GSM458595', 'GSM458596', 'GSM458597', 'GSM458598', 'GSM458599', 'GSM458600', 'GSM458601']
fields: ['Sample Name', 'Factor Value:  Spaceflight', 'Protocol REF', 'Material Type', 'Extract Name', 'Labeled Extract Name', 'Label', 'Array Design REF', 'Array Data File', 'Normalization Name', 'Derived Array Data File', 'Comment:  GeneLab Processed Raw Data Files', 'Comment:  Raw Data Images', 'Comment:  Raw Data Plots', 'Comment:  Supplemental Materials', 'Comment:  Normalized Data Files', 'Comment:  Normalized Annotated Data Files', 'Comment:  Normalized Data Plots']
factor values: {'Factor Value:  Spaceflight': {'Space Flight', 'Ground Control'}}

In [14]:
assay.metadata.index

Index(['GSM458594', 'GSM458595', 'GSM458596', 'GSM458597', 'GSM458598',
       'GSM458599', 'GSM458600', 'GSM458601'],
      dtype='object', name='a100015hybridizationassayname')

In [15]:
assay.metadata.columns

['Sample Name',
 'Factor Value:  Spaceflight',
 'Protocol REF',
 'Material Type',
 'Extract Name',
 'Labeled Extract Name',
 'Label',
 'Array Design REF',
 'Array Data File',
 'Normalization Name',
 'Derived Array Data File',
 'Comment:  GeneLab Processed Raw Data Files',
 'Comment:  Raw Data Images',
 'Comment:  Raw Data Plots',
 'Comment:  Supplemental Materials',
 'Comment:  Normalized Data Files',
 'Comment:  Normalized Annotated Data Files',
 'Comment:  Normalized Data Plots']
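
The index swap seen in the cells above behaves much like `DataFrame.set_index` in plain pandas. The sketch below is an analogy under that assumption, not a description of what `index_by` does internally; the small table reuses two samples from this notebook.

```python
import pandas as pd

metadata = pd.DataFrame({
    "Sample Name": ["Mmus_C57-6T_TMS_FLT_Rep1", "Mmus_C57-6T_TMS_FLT_Rep2"],
    "Hybridization Assay Name": ["GSM458594", "GSM458595"],
    "Label": ["biotin", "biotin"],
})

# Default view: indexed by sample name, hybridization name is a regular column
by_sample = metadata.set_index("Sample Name")

# Alternative view: the hybridization name becomes the index and leaves the
# columns, while the sample name stays behind as an ordinary column
by_hyb = metadata.set_index("Hybridization Assay Name")
```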

Once downloaded, normalized/processed data is not redownloaded and can be accessed quickly.  
However, if need be, it can be redownloaded by issuing:

In [16]:
_ = assay.get_normalized_data(force_redownload=True)
_ = assay.get_processed_data(force_redownload=True)
# alias: assay.get_normalized_annotated_data(force_redownload=True)

*Note that `assay.get_processed_data()` with no arguments defaults to `force_redownload=False` and is semantically equivalent to just using `assay.processed_data`, and so on.*
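
The caching behaviour described here can be pictured with a small stand-in class. `CachedTable`, its `fetch` callable, and the `downloads` counter are all hypothetical names made up for this sketch; only the `force_redownload` parameter name comes from the cells above.

```python
class CachedTable:
    """Hypothetical stand-in for an object that downloads a table once and reuses it."""

    def __init__(self, fetch):
        self._fetch = fetch   # callable that performs the (expensive) download
        self._cache = None
        self.downloads = 0    # counts how often the download actually ran

    def get(self, force_redownload=False):
        # Download only when nothing is cached yet or a refresh is requested
        if self._cache is None or force_redownload:
            self._cache = self._fetch()
            self.downloads += 1
        return self._cache

    @property
    def data(self):
        # Attribute access is shorthand for get() with default arguments
        return self.get()

table = CachedTable(lambda: {"NM_009826": 8.329473})
first = table.data                        # triggers the first download
second = table.data                       # served from the cache
third = table.get(force_redownload=True)  # downloads again
```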

Now we basically have our annotation and our data for limma, etc.:

In [17]:
assay.factors.T

Hybridization Assay Name,GSM458594,GSM458595,GSM458596,GSM458597,GSM458598,GSM458599,GSM458600,GSM458601
Factor,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Factor Value: Spaceflight,Space Flight,Space Flight,Space Flight,Space Flight,Ground Control,Ground Control,Ground Control,Ground Control


In [18]:
assay.processed_data[:5]

Hybridization Assay Name,GSM458594,GSM458595,GSM458596,GSM458597,GSM458598,GSM458599,GSM458600,GSM458601
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
NM_001355712,8.942076,9.468133,9.212215,9.414233,9.506787,9.364989,9.379174,9.451228
NM_001310442,7.993204,8.430513,8.214404,8.559117,8.521084,8.46005,8.294821,8.329526
NM_001204371,3.892463,4.113051,4.236066,3.673601,4.171318,4.231676,3.689275,3.571904
NM_009826,8.329473,8.507683,8.32748,8.451816,8.301892,8.345079,8.109227,8.109115
NM_001195732,3.533509,3.292036,3.958079,3.566297,3.561111,3.504118,3.541423,3.492351
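
To show how the two tables fit together, here is a small pandas sketch that pairs the factor annotation with the expression values and computes per-group means. It is not limma and does no statistics; it only checks that samples line up and takes a plain difference of means. The numbers are the first two rows of the output above.

```python
import pandas as pd

samples = ["GSM458594", "GSM458595", "GSM458596", "GSM458597",
           "GSM458598", "GSM458599", "GSM458600", "GSM458601"]

# Annotation: one factor value per sample (what assay.factors carries)
factors = pd.Series(
    ["Space Flight"] * 4 + ["Ground Control"] * 4,
    index=samples, name="Spaceflight",
)

# Data: first two rows of the processed data shown above
expression = pd.DataFrame(
    [[8.942076, 9.468133, 9.212215, 9.414233, 9.506787, 9.364989, 9.379174, 9.451228],
     [7.993204, 8.430513, 8.214404, 8.559117, 8.521084, 8.46005, 8.294821, 8.329526]],
    index=pd.Index(["NM_001355712", "NM_001310442"], name="ID"),
    columns=samples,
)

# Annotation and data must describe the same samples in the same order
assert list(expression.columns) == list(factors.index)

# Transpose so samples are rows, group them by factor value, average, transpose back
group_means = expression.T.groupby(factors).mean().T
difference = group_means["Space Flight"] - group_means["Ground Control"]
```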


----
Some indexing shenanigans (some of it could be useful, some of it I can axe if we'll never need it)

We can even do Pandas-like indexing:

In [19]:
space_samples = assay.metadata[assay.factors=="Space Flight"].index
assay.processed_data[space_samples][:5]

Hybridization Assay Name,GSM458594,GSM458595,GSM458596,GSM458597
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
NM_001355712,8.942076,9.468133,9.212215,9.414233
NM_001310442,7.993204,8.430513,8.214404,8.559117
NM_001204371,3.892463,4.113051,4.236066,3.673601
NM_009826,8.329473,8.507683,8.32748,8.451816
NM_001195732,3.533509,3.292036,3.958079,3.566297
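
For comparison, the same selection in plain pandas might look like the sketch below. One difference worth noting: in plain pandas, indexing a DataFrame with a boolean DataFrame masks values instead of dropping rows, so the one-column comparison is reduced to a Series first. The tables are cut-down stand-ins built from values in this notebook, and the column name `"Spaceflight"` is an assumption.

```python
import pandas as pd

index = pd.Index(["GSM458594", "GSM458595", "GSM458598"],
                 name="Hybridization Assay Name")

metadata = pd.DataFrame(
    {"Sample Name": ["Mmus_C57-6T_TMS_FLT_Rep1",
                     "Mmus_C57-6T_TMS_FLT_Rep2",
                     "Mmus_C57-6T_TMS_GC_Rep1"]},
    index=index,
)
factors = pd.DataFrame(
    {"Spaceflight": ["Space Flight", "Space Flight", "Ground Control"]},
    index=index,
)
expression = pd.DataFrame(
    [[8.942076, 9.468133, 9.506787]],
    index=pd.Index(["NM_001355712"], name="ID"),
    columns=index,
)

# Reduce the one-column boolean DataFrame to a boolean Series to filter rows
mask = (factors == "Space Flight")["Spaceflight"]
space_samples = metadata[mask].index

# Use the resulting index labels to pick the matching data columns
subset = expression[space_samples]
```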


And more complex indexing, with `.loc`:  
*(the order of columns is unpredictable so far; because some human-readable field names map to multiple metadata columns, it's quite hard to do this in a predictable way)*

In [20]:
assay.metadata.loc[
    assay.factors=="Space Flight",
    ["Sample Name", "Label", "Labeled Extract Name"]
]

Unnamed: 0_level_0,a100011label,a100010labeledextractname,a100000samplename
a100015hybridizationassayname,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
GSM458594,biotin,GSM458594,Mmus_C57-6T_TMS_FLT_Rep1
GSM458595,biotin,GSM458595,Mmus_C57-6T_TMS_FLT_Rep2
GSM458596,biotin,GSM458596,Mmus_C57-6T_TMS_FLT_Rep3
GSM458597,biotin,GSM458597,Mmus_C57-6T_TMS_FLT_Rep4


Of course, we can use iterrows on whatever subset we get, because it's a dataframe:

In [21]:
# Sample names of the spaceflight samples (a one-column DataFrame)
flight_samples = assay.metadata.loc[
    assay.factors=="Space Flight",
    ["Sample Name"]
]

for name, row in flight_samples.iterrows():
    print(name, row.iloc[0], sep=": ")

GSM458594: Mmus_C57-6T_TMS_FLT_Rep1
GSM458595: Mmus_C57-6T_TMS_FLT_Rep2
GSM458596: Mmus_C57-6T_TMS_FLT_Rep3
GSM458597: Mmus_C57-6T_TMS_FLT_Rep4


The above is not very useful, because indexing like this **has** to produce a DataFrame whose columns carry the internal field names. However, we can also iterate over the rows of the entire metadata and look fields up by their human-readable names:

In [22]:
for name, row in assay.metadata.iterrows():
    print(name, row["Sample Name"], sep=": ")

GSM458594: Mmus_C57-6T_TMS_FLT_Rep1
GSM458595: Mmus_C57-6T_TMS_FLT_Rep2
GSM458596: Mmus_C57-6T_TMS_FLT_Rep3
GSM458597: Mmus_C57-6T_TMS_FLT_Rep4
GSM458598: Mmus_C57-6T_TMS_GC_Rep1
GSM458599: Mmus_C57-6T_TMS_GC_Rep2
GSM458600: Mmus_C57-6T_TMS_GC_Rep3
GSM458601: Mmus_C57-6T_TMS_GC_Rep4


In the future, I may reimplement indexing by a boolean condition (such as `assay.metadata[assay.factors=="Ground Control"]`) so that it produces a metadata-like object, and not just a DataFrame, so it can also be iterrow-ed gracefully... Assuming this is something we might need.
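
One way this could look is sketched below. `MetadataLike` is a hypothetical wrapper written for this illustration and is not genefab's class; it also sidesteps the hard part (one title mapping to several internal columns) by assuming a one-to-one mapping.

```python
import pandas as pd

class MetadataLike:
    """Hypothetical wrapper: keeps internal column names, exposes readable titles."""

    def __init__(self, frame, titles):
        self.frame = frame    # DataFrame with internal column names
        self.titles = titles  # internal name -> human-readable title (assumed 1:1)

    def __getitem__(self, mask):
        # Boolean row filtering returns another MetadataLike, not a bare DataFrame
        return MetadataLike(self.frame[mask], self.titles)

    def iterrows(self):
        # Rename each row's labels so fields can be read by human-readable title
        for name, row in self.frame.iterrows():
            yield name, row.rename(self.titles)

frame = pd.DataFrame(
    {"a100000samplename": ["Mmus_C57-6T_TMS_FLT_Rep1", "Mmus_C57-6T_TMS_GC_Rep1"]},
    index=pd.Index(["GSM458594", "GSM458598"], name="a100015hybridizationassayname"),
)
is_flight = pd.Series([True, False], index=frame.index)

meta = MetadataLike(frame, {"a100000samplename": "Sample Name"})
flight_rows = [(name, row["Sample Name"]) for name, row in meta[is_flight].iterrows()]
```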