Load the microarray dataset from the Parquet file

In [1]:
import pandas as pd
X = pd.read_parquet("data/X.parquet")
print(X.shape)

(637, 60607)


Load the phenotype data using GEOparse

In [2]:
import GEOparse

gse = GEOparse.get_GEO("GSE199633", destdir="data", silent=True)

pheno = gse.phenotype_data.copy()
print(pheno.shape)

(637, 51)
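
Both tables have 637 rows, but equal row counts do not guarantee that the samples appear in the same order. The following is a minimal sketch of an alignment check on toy data. It assumes both tables are indexed by GEO sample accession (the index of `X` is not shown in this notebook, so that is an assumption), and `align_samples` is a hypothetical helper, not part of pandas or GEOparse.

```python
import pandas as pd

def align_samples(X, pheno):
    """Restrict both tables to their shared sample IDs, in one common order."""
    shared = X.index.intersection(pheno.index)
    return X.loc[shared], pheno.loc[shared]

# Toy stand-ins for X and pheno (made-up IDs and values, for illustration only)
X_toy = pd.DataFrame({"gene_a": [1.0, 2.0, 3.0]},
                     index=["GSM1", "GSM2", "GSM3"])
pheno_toy = pd.DataFrame({"title": ["s3", "s1", "s2"]},
                         index=["GSM3", "GSM1", "GSM2"])

X_aligned, pheno_aligned = align_samples(X_toy, pheno_toy)

# After alignment, row i of X_aligned and row i of pheno_aligned are the same sample
print((X_aligned.index == pheno_aligned.index).all())
```

If the real `X` uses a different identifier (for example the sample title), the same idea applies after setting the matching column as the index.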


Phenotype column names

In [5]:
column_names = list(pheno.columns)
print(column_names)

['title', 'geo_accession', 'status', 'submission_date', 'last_update_date', 'type', 'channel_count', 'source_name_ch1', 'organism_ch1', 'taxid_ch1', 'characteristics_ch1.0.tissue', 'characteristics_ch1.1.subject', 'characteristics_ch1.2.type of surgery', 'characteristics_ch1.3.histology', 'characteristics_ch1.4.er status', 'characteristics_ch1.5.pr status', 'characteristics_ch1.6.her2 status', 'characteristics_ch1.7.pt', 'characteristics_ch1.8.pn', 'characteristics_ch1.9.ptnm', 'characteristics_ch1.10.radiotherapy', 'characteristics_ch1.11.type of radiotherapy', 'characteristics_ch1.12.age at diagnosis', 'characteristics_ch1.13.age at diagnosis (dichotomized)', 'characteristics_ch1.14.race', 'characteristics_ch1.15.chemotherapy', 'characteristics_ch1.16.locoregional recurrence', 'characteristics_ch1.17.death', 'characteristics_ch1.18.time to locoregional recurrence from date of diagnosis (years)', 'characteristics_ch1.19.time to death or censor from date of diagnosis (years)', 'charact

The model is going to use ER (estrogen receptor) status, so first process the column to make sure it is usable.

In [6]:
er_column = "characteristics_ch1.4.er status"
# Normalise case and trim whitespace on both sides before counting values
raw_data = pheno[er_column].astype(str).str.lower().str.strip()
print(raw_data.value_counts().head(20))

characteristics_ch1.4.er status
positive    481
negative    151
--            5
Name: count, dtype: int64


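The values are consistent, so no further cleaning is needed. The labels are mapped to numbers (positive to 1, negative to 0) to make modelling easier later. The 5 samples marked "--" have no entry in the mapping, so they become NaN.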

In [None]:
y = raw_data.map({"positive": 1, "negative": 0})
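
Because `Series.map` with a dict returns NaN for any value missing from the dict, the samples without a known ER status need to be handled before modelling. One option, sketched here on toy data, is to drop them from both the labels and the feature matrix. This assumes `X` and `y` share the same sample index, and the `*_toy` and `*_clean` names are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for raw_data and X (made-up IDs and values)
raw_toy = pd.Series(["positive", "negative", "--", "positive"],
                    index=["GSM1", "GSM2", "GSM3", "GSM4"])
X_toy = pd.DataFrame({"gene_a": [0.1, 0.2, 0.3, 0.4]}, index=raw_toy.index)

# Values not in the dict ("--") become NaN
y_toy = raw_toy.map({"positive": 1, "negative": 0})

# Keep only labelled samples, then cast the float labels to integers
y_clean = y_toy[y_toy.notna()].astype(int)

# Select the same samples from the feature matrix
X_clean = X_toy.loc[y_clean.index]

print(y_clean.tolist())   # [1, 0, 1]
print(X_clean.shape)      # (3, 1)
```

Dropping is only one choice. Whether the unlabelled samples should instead be kept for another purpose depends on the modelling plan.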