## [Intertumoral Heterogeneity within Medulloblastoma Subgroups](https://www.cell.com/cancer-cell/fulltext/S1535-6108(17)30201-5)

[GSE85218](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85218) is a SuperSeries containing [GSE85212](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85212) (methylation data) and [GSE85217](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85217) (RNA data).

GSE85218's SOFT file has the information values for both GSE85212 and GSE85217, but the sample values for only GSE85217.
We do not read the information values from this SOFT file because we read them from the paper's supplementary table 2, which has richer information for both GSE85212 and GSE85217.
(However, this table uses blue and red to represent 0 and 1, so we had to fix it manually and save it as `mmc2.tsv` before reading it.)
We also do not read GSE85217's sample values from this SOFT file because we read them from GSE85217 itself.
Overall, we read nothing from GSE85218.

GSE85212's SOFT file has the information values but is missing the sample values, which are in its supplementary file.
So we read GSE85212's sample values from its supplementary file.

GSE85217's SOFT file has both the information values and the sample values.
However, these sample values use different sample names from GSE85212's sample values.
GSE85217 also has a supplementary file with the sample values, and these use the same sample names as GSE85212's sample values.
For consistency and to match the sample names, we read GSE85217's sample values from its supplementary file.
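
Matching sample names matter because the two matrices can then be aligned by column label. The following is a minimal sketch of that alignment on toy data, not the notebook's actual processing: the sample names (`MB_1`, ...), gene names, and values are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical toy matrices standing in for the methylation and RNA tables.
# Rows are genes and columns are samples; all names and values are made up.
ch3_toy = pd.DataFrame(
    {"MB_1": [0.1, 0.9], "MB_2": [0.2, 0.8], "MB_3": [0.3, 0.7]},
    index=["GeneA", "GeneB"],
)
rna_toy = pd.DataFrame(
    {"MB_2": [5.0, 6.0], "MB_3": [7.0, 8.0], "MB_4": [9.0, 10.0]},
    index=["GeneA", "GeneC"],
)

# Samples present in both matrices.
shared_samples = ch3_toy.columns.intersection(rna_toy.columns)

# Restrict both matrices to the shared samples, in the same column order.
ch3_shared = ch3_toy.loc[:, shared_samples]
rna_shared = rna_toy.loc[:, shared_samples]

print(sorted(shared_samples))
```

If the two files used different naming schemes, the intersection would be empty and a name mapping would be needed first, which is why the supplementary files are the more convenient source here.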

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import pandas as pd

import kraft

In [None]:
SETTING = kraft.json.read("setting.json")

In [None]:
data_directory_path = SETTING["data_directory_path"]

overwrite = False

In [None]:
# Read the manually fixed supplementary table 2 and transpose it to feature x sample.
feature_x_sample = pd.read_csv("mmc2.tsv", sep="\t", index_col=0).T

feature_x_sample.index.name = "Feature"

feature_x_sample

In [None]:
ch3_gene_x_sample = pd.read_csv(
    kraft.internet.download(
        "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE85nnn/GSE85212/suppl/GSE85212_Methylation_763samples_SubtypeStudy_TaylorLab_beta_values.txt.gz",
        data_directory_path,
        overwrite=overwrite,
    ),
    sep="\t",
)

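# Rebuild the index from its current labels (the labels themselves are unchanged).
ch3_gene_x_sample.index = [label for label in ch3_gene_x_sample.index.to_numpy()]

# Drop rows whose label is missing.
ch3_gene_x_sample = ch3_gene_x_sample.loc[~ch3_gene_x_sample.index.isna()]

print(ch3_gene_x_sample.shape)

# Collapse rows that share a label by taking the per-sample median.
ch3_gene_x_sample = ch3_gene_x_sample.groupby(level=0).median()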

print(ch3_gene_x_sample.shape)

ch3_gene_x_sample.index.name = "Gene"

ch3_gene_x_sample

In [None]:
rna_gene_x_sample = pd.read_csv(
    kraft.internet.download(
        "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE85nnn/GSE85217/suppl/GSE85217_M_exp_763_MB_SubtypeStudy_TaylorLab.txt.gz",
        data_directory_path,
        overwrite=overwrite,
    ),
    sep="\t",
    index_col=0,
).iloc[:, 4:]  # Skip the first 4 columns after the index column.

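# Rebuild the index from its current labels (the labels themselves are unchanged).
rna_gene_x_sample.index = [label for label in rna_gene_x_sample.index.to_numpy()]

# Drop rows whose label is missing.
rna_gene_x_sample = rna_gene_x_sample.loc[~rna_gene_x_sample.index.isna()]

print(rna_gene_x_sample.shape)

# Collapse rows that share a label by taking the per-sample median.
rna_gene_x_sample = rna_gene_x_sample.groupby(level=0).median()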

print(rna_gene_x_sample.shape)

rna_gene_x_sample.index.name = "Gene"

rna_gene_x_sample