# Gold standard curation: Preprocessing and single-step regression

In this stage of gold standard curation, we will do the data preprocessing, selection, and single-step regression for the 153 traits in our question set. This file shows the reference steps using the trait "Breast Cancer" as an example. The workflow consists of the following steps:

1. Preprocess all the cohorts related to this trait. Each cohort should be converted to a tabular form and saved to a csv file, with columns being genetic factors, the trait, and age, gender if available;
2. If there exists at least one cohort with age or gender information, conduct regression analysis with genetic features together with age or gender as the regressors.


# 1. Basic setup

In [2]:
import os
import sys

sys.path.append('..')
from utils import *

# Set your preferred name
USER = "Jiayi"
# Set the data and output directories   
DATA_ROOT = '/Users/legion/Desktop/Courses/IS389/data'   
OUTPUT_ROOT = '/Users/legion/Desktop/Courses/IS389/output'
TRAIT = 'Adrenocortical Cancer'

OUTPUT_DIR = os.path.join(OUTPUT_ROOT, USER, '-'.join(TRAIT.split()))
JSON_PATH = os.path.join(OUTPUT_DIR, "cohort_info.json")
if not os.path.exists(OUTPUT_DIR):
    os.makedirs(OUTPUT_DIR, exist_ok=True)

# Gene symbol normalization may take 1-2 minutes. You may set it to False for debugging.
NORMALIZE_GENE = True

utils.py has been loaded


In [5]:
# This cell is only for use on Google Colab. Skip it if you run your code in other environments

"""import os
from google.colab import drive

drive.mount('/content/drive', force_remount=True)
proj_dir = '/content/drive/MyDrive/AI4Science_Public'
os.chdir(proj_dir)"""

"import os\nfrom google.colab import drive\n\ndrive.mount('/content/drive', force_remount=True)\nproj_dir = '/content/drive/MyDrive/AI4Science_Public'\nos.chdir(proj_dir)"

# 2. Data preprocessing and selection

## 2.1. The TCGA Xena dataset

In TCGA Xena, there is either zero or one cohort related to the trait. We search the names of subdirectories to see if any matches the trait. If a match is found, we directly obtain the file paths.

In [6]:
dataset = 'TCGA'
dataset_dir = os.path.join(DATA_ROOT, dataset)
os.listdir(dataset_dir)[:10]

['TCGA_Adrenocortical_Cancer_(ACC)',
 'TCGA_Breast_Cancer_(BRCA)',
 'TCGA_Kidney_Papillary_Cell_Carcinoma_(KIRP)']

If no match is found, jump directly to GEO in Part 2.2

In [7]:
trait_subdir = "TCGA_Adrenocortical_Cancer_(ACC)"
cohort = 'Xena'
# All the cancer traits in Xena are binary
trait_type = 'binary'
# Once a relevant cohort is found in Xena, we can generally assume the gene and clinical data are available
is_available = True

clinical_data_file = os.path.join(dataset_dir, trait_subdir, 'TCGA.ACC.sampleMap_ACC_clinicalMatrix')
genetic_data_file = os.path.join(dataset_dir, trait_subdir, 'TCGA.ACC.sampleMap_HiSeqV2_PANCAN.gz')

In [8]:
import pandas as pd

clinical_data = pd.read_csv(clinical_data_file, sep='\t', index_col=0)
genetic_data = pd.read_csv(genetic_data_file, compression='gzip', sep='\t', index_col=0)
age_col = gender_col = None

In [9]:
def check_rows_and_columns(dataframe, display=False):
    """
    Get the lists of row names and column names of a dataset, and optionally observe them.
    :param dataframe:
    :param display:
    :return:
    """
    dataframe_rows = dataframe.index.tolist()
    if display:
        print(f"The dataset has {len(dataframe_rows)} rows, such as {dataframe_rows[:20]}")
    dataframe_cols = dataframe.columns.tolist()
    if display:
        print(f"\nThe dataset has {len(dataframe_cols)} columns, such as {dataframe_cols[:20]}")
    return dataframe_rows, dataframe_cols

In [10]:
_, clinical_data_cols = check_rows_and_columns(clinical_data)
clinical_data_cols[:10]

['_INTEGRATION',
 '_PATIENT',
 '_cohort',
 '_primary_disease',
 '_primary_site',
 'additional_pharmaceutical_therapy',
 'additional_radiation_therapy',
 'age_at_initial_pathologic_diagnosis',
 'atypical_mitotic_figures',
 'bcr_followup_barcode']

Read all the column names in the clinical dataset, to find the columns that record information about age or gender.
Reference prompt:

In [11]:
f'''
Below is a list of column names from a biomedical dataset. Please examine it and identify the columns that are likely to contain information about patients' age. Additionally, please do the same for columns that may hold data on patients' gender. Please provide your answer by strictly following this format, without redundant words:
candidate_age_cols = [col_name1, col_name2, ...]
candidate_gender_cols = [col_name1, col_name2, ...]
If no columns match a criterion, please provide an empty list.

Column names:
{clinical_data_cols}
'''

"\nBelow is a list of column names from a biomedical dataset. Please examine it and identify the columns that are likely to contain information about patients' age. Additionally, please do the same for columns that may hold data on patients' gender. Please provide your answer by strictly following this format, without redundant words:\ncandidate_age_cols = [col_name1, col_name2, ...]\ncandidate_gender_cols = [col_name1, col_name2, ...]\nIf no columns match a criterion, please provide an empty list.\n\nColumn names:\n['_INTEGRATION', '_PATIENT', '_cohort', '_primary_disease', '_primary_site', 'additional_pharmaceutical_therapy', 'additional_radiation_therapy', 'age_at_initial_pathologic_diagnosis', 'atypical_mitotic_figures', 'bcr_followup_barcode', 'bcr_patient_barcode', 'bcr_sample_barcode', 'clinical_M', 'ct_scan_findings', 'cytoplasm_presence_less_than_equal_25_percent', 'days_to_birth', 'days_to_collection', 'days_to_death', 'days_to_initial_pathologic_diagnosis', 'days_to_last_fol

In [12]:
candidate_age_cols = [ 'age_at_initial_pathologic_diagnosis',
                      'days_to_birth', 'year_of_initial_pathologic_diagnosis']
candidate_gender_cols = [ 'gender']

Choose a single column from the candidate columns that record age and gender information respectively.
If no column meets the requirement, keep 'age_col' or 'gender_col' to None

In [13]:
def preview_df(df, n=5):
    return df.head(n).to_dict(orient='list')

In [14]:
preview_df(clinical_data[candidate_age_cols])

{'age_at_initial_pathologic_diagnosis': [58, 44, 23, 23, 30],
 'days_to_birth': [-21496, -16090, -8624, -8451, -11171],
 'year_of_initial_pathologic_diagnosis': [2000, 2004, 2008, 2000, 2000]}

In [15]:
age_col = 'age_at_initial_pathologic_diagnosis'

In [16]:
preview_df(clinical_data[candidate_gender_cols])

{'gender': ['MALE', 'FEMALE', 'FEMALE', 'FEMALE', 'MALE']}

In [17]:
gender_col = 'gender'

In [18]:
def xena_select_clinical_features(clinical_df, trait, age_col=None, gender_col=None):
    feature_list = []
    trait_data = clinical_df.index.to_series().apply(xena_convert_trait).rename(trait)
    feature_list.append(trait_data)
    if age_col:
        age_data = clinical_df[age_col].apply(xena_convert_age).rename("Age")
        feature_list.append(age_data)
    if gender_col:
        gender_data = clinical_df[gender_col].apply(xena_convert_gender).rename("Gender")
        feature_list.append(gender_data)
    selected_clinical_df = pd.concat(feature_list, axis=1)
    return selected_clinical_df

In [19]:
def xena_convert_trait(row_index: str):
    """
    Convert the trait information from Sample IDs to labels depending on the last two digits.
    Tumor types range from 01 - 09, normal types from 10 - 19.
    :param row_index: the index value of a row
    :return: the converted value
    """
    last_two_digits = int(row_index[-2:])

    if 1 <= last_two_digits <= 9:
        return 1
    elif 10 <= last_two_digits <= 19:
        return 0
    else:
        return -1

In [20]:
def xena_convert_age(cell: str):
    """Convert the cell content about age to a numerical value using regular expression
    """
    match = re.search(r'\d+', str(cell))
    if match:
        return int(match.group())
    else:
        return None

In [21]:
def xena_convert_gender(cell: str):
    """Convert the cell content about gender to a binary value
    """
    if isinstance(cell, str):
        cell = cell.lower()

    if cell == "female":
        return 0
    elif cell == "male":
        return 1
    else:
        return None

In [22]:
import re
selected_clinical_data = xena_select_clinical_features(clinical_data, TRAIT, age_col=age_col, gender_col=gender_col)

In [23]:
def normalize_gene_symbols_in_index(gene_df):
    """Normalize the human gene symbols at the index of a dataframe, and replace the index with its normalized version.
    Remove the rows where the index failed to be normalized."""
    normalized_gene_list = normalize_gene_symbols(gene_df.index.tolist())
    assert len(normalized_gene_list) == len(gene_df.index)
    gene_df.index = normalized_gene_list
    gene_df = gene_df[gene_df.index.notnull()]
    return gene_df

In [24]:
def normalize_gene_symbols(gene_symbols, batch_size=1000):
    """Normalize human gene symbols in batches using the 'mygenes' library"""
    mg = mygene.MyGeneInfo()
    normalized_genes = {}

    # Process in batches
    for i in range(0, len(gene_symbols), batch_size):
        batch = gene_symbols[i:i + batch_size]
        results = mg.querymany(batch, scopes='symbol', fields='symbol', species='human')

        # Update the normalized_genes dictionary with results from this batch
        for gene in results:
            normalized_genes[gene['query']] = gene.get('symbol', None)

    # Return the normalized symbols in the same order as the input
    return [normalized_genes.get(symbol) for symbol in gene_symbols]


In [25]:
import mygene

if NORMALIZE_GENE:
    genetic_data = normalize_gene_symbols_in_index(genetic_data)

12 input query terms found dup hits:	[('GTF2IP1', 2), ('RBMY1A3P', 3), ('RPL31P11', 2), ('HERC2P2', 3), ('WASH3P', 3), ('NUDT9P1', 2), ('
154 input query terms found no hit:	['C16orf13', 'C16orf11', 'LOC100272146', 'LOC339240', 'NACAP1', 'LOC441204', 'KLRA1', 'FAM183A', 'FA
10 input query terms found dup hits:	[('SUGT1P1', 2), ('PTPRVP', 2), ('SNORA62', 3), ('IFITM4P', 7), ('HLA-DRB6', 2), ('FUNDC2P2', 2), ('
190 input query terms found no hit:	['NARFL', 'NFKBIL2', 'LOC150197', 'TMEM84', 'LOC162632', 'PPPDE1', 'PPPDE2', 'C1orf38', 'C1orf31', '
11 input query terms found dup hits:	[('PIP5K1P1', 2), ('HBD', 2), ('PPP1R2P1', 9), ('HSD17B7P2', 2), ('RPSAP9', 2), ('SNORD68', 2), ('SN
149 input query terms found no hit:	['FAM153C', 'C9orf167', 'CLK2P', 'CCDC76', 'CCDC75', 'CCDC72', 'HIST3H2BB', 'PRAC', 'LOC285780', 'LO
15 input query terms found dup hits:	[('SNORD58C', 2), ('UOX', 2), ('UBE2Q2P1', 3), ('PPP4R1L', 2), ('SNORD63', 3), ('ESPNP', 2), ('HBBP1
158 input query terms found no hit:	[

15 input query terms found dup hits:	[('FAM66D', 3), ('FAM66A', 2), ('THSD1P1', 2), ('EEF1DP3', 2), ('PGM5P2', 2), ('UBE2MP1', 2), ('HAR1
169 input query terms found no hit:	['LOC284551', 'LOC285548', 'LOC728410', 'LOC541473', 'DULLARD', 'KIAA0368', 'EFTUD1', 'TWISTNB', 'SF
13 input query terms found dup hits:	[('S100A7L2', 2), ('POM121L8P', 2), ('MEG8', 2), ('KIR3DX1', 5), ('RFPL1S', 2), ('SNORD91B', 2), ('C
165 input query terms found no hit:	['TMEM188', 'PDZD3', 'FAM102B', 'FAM102A', 'SMCR7L', 'G6PC', 'OSTCL', 'LOC653544', 'LOC653545', 'USP
16 input query terms found dup hits:	[('PCNAP1', 2), ('SNORA63', 6), ('SERHL', 2), ('CEACAM22P', 2), ('SNORA16A', 2), ('FAM41AY1', 2), ('
147 input query terms found no hit:	['LRRC37A4', 'LOC100131726', 'CPSF3L', 'COL4A3BP', 'PAR1', 'LOC92973', 'MICALCL', 'SMCR7', 'HIST4H4'
15 input query terms found dup hits:	[('MBL1P', 2), ('SDHAP3', 2), ('PSORS1C3', 8), ('MYADML', 2), ('POM121L10P', 2), ('HLA-J', 9), ('HLA
153 input query terms found no hit:	[

In [26]:
genetic_data

Unnamed: 0,TCGA-OR-A5LC-01,TCGA-OR-A5JJ-01,TCGA-OR-A5K3-01,TCGA-PK-A5HA-01,TCGA-OR-A5LN-01,TCGA-OR-A5JA-01,TCGA-OR-A5K0-01,TCGA-OR-A5JY-01,TCGA-OR-A5J9-01,TCGA-OR-A5K4-01,...,TCGA-OR-A5JT-01,TCGA-OR-A5KW-01,TCGA-OR-A5J8-01,TCGA-OR-A5JQ-01,TCGA-OR-A5JV-01,TCGA-OR-A5KX-01,TCGA-OR-A5L5-01,TCGA-P6-A5OG-01,TCGA-OR-A5LR-01,TCGA-OR-A5LT-01
ARHGEF10L,-3.610292,-1.217192,-1.786692,-1.329092,-0.944392,-2.431192,-0.927692,-2.263992,-0.773992,-1.902792,...,-0.863492,-0.979992,-0.217092,-0.166492,-1.806992,-2.295592,-0.664092,-1.707392,-1.648492,0.110608
HIF3A,-0.811626,-1.097126,-0.336626,-3.119026,0.294474,-0.000626,-2.902726,-0.038026,-1.558426,6.704374,...,6.097974,-0.775426,1.577074,2.414674,0.083474,-1.224126,4.864274,-0.846526,5.636474,-3.161926
RNF17,-0.531035,-0.531035,-0.531035,-0.531035,-0.531035,-0.531035,-0.531035,-0.531035,-0.531035,0.110765,...,-0.531035,-0.531035,-0.531035,-0.047535,-0.531035,-0.531035,-0.531035,-0.531035,0.346165,0.356765
RNF10,0.562928,0.398728,1.649328,0.525628,0.649828,0.737528,1.218828,0.940428,1.004428,1.155528,...,1.065328,0.998528,0.649828,0.458028,0.587428,1.318828,1.042828,-0.344672,1.626528,0.328428
RNF11,-0.735278,-0.698278,-0.527978,-0.639778,-0.482678,-0.101578,-0.736178,0.551822,0.407322,-0.363978,...,-0.211078,-0.070978,-0.112878,-0.053578,0.081922,-0.197478,-0.107178,-0.931178,-0.188578,0.143722
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GNGT1,-1.281390,-0.749090,-1.281390,-1.281390,-1.281390,-1.281390,0.779110,2.326910,-1.281390,-1.281390,...,-1.281390,-1.281390,-1.281390,-1.281390,-1.281390,-0.717790,-1.281390,-0.380790,-1.281390,-1.281390
TULP3,0.180823,0.425623,-0.333277,-0.350477,0.257623,0.743623,0.424323,0.311723,0.120023,0.279723,...,0.116023,0.111123,1.111123,0.369123,0.040023,-0.731777,-0.038177,-0.328377,0.329723,0.118423
BCL6B,0.684673,0.075973,-0.968527,0.968073,1.061373,-0.677927,-2.529227,-0.588927,1.577873,-0.527027,...,-0.030227,-0.508927,0.993173,0.244273,0.462973,0.735173,0.129673,3.400573,0.408673,-2.482227
GSTK1,1.219405,1.561405,2.331205,1.034305,1.308005,0.183105,0.528505,1.042305,1.334905,-3.908095,...,0.780705,0.853105,0.093105,1.335505,1.161405,-1.075095,0.658005,0.240205,0.236505,-0.423795


In [27]:
merged_data = selected_clinical_data.join(genetic_data.T).dropna()
merged_data.head()

Unnamed: 0_level_0,Adrenocortical Cancer,Age,Gender,ARHGEF10L,HIF3A,RNF17,RNF10,RNF11,RNF13,GTF2IP1,...,SLC7A10,PLA2G2C,TULP2,NPY5R,GNGT2,GNGT1,TULP3,BCL6B,GSTK1,SELP
sampleID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
TCGA-OR-A5J1-01,1,58,1,-0.641092,-0.325826,-0.531035,1.266428,0.355422,0.03719,0.243706,...,-1.520186,-0.086682,-0.182978,-0.615817,-0.281533,3.02111,-0.927577,-1.006227,1.119905,-2.185533
TCGA-OR-A5J2-01,1,44,0,-1.864792,2.766674,0.321165,1.000728,0.836122,0.35439,-0.436694,...,-0.318586,1.056018,0.393822,2.366583,-0.955033,-1.28139,1.020723,1.226373,1.164005,0.265067
TCGA-OR-A5J3-01,1,23,0,-0.723192,-0.362926,-0.531035,0.639828,-0.199578,-0.48331,0.143606,...,-0.574486,-0.086682,-0.748878,-0.113317,-3.803333,-0.61009,0.397623,-0.675227,1.196005,-3.161633
TCGA-OR-A5J5-01,1,30,1,-1.576792,-2.086226,2.463765,1.382228,-1.115678,-1.23621,0.615806,...,-0.279486,-0.086682,0.078622,1.095983,-0.908533,-1.28139,0.661823,0.458273,0.839605,-5.525533
TCGA-OR-A5J6-01,1,29,0,-2.311992,5.225974,-0.531035,0.967928,-0.393778,-0.38231,-0.060194,...,-2.090786,1.607218,2.481122,-0.946617,-0.570533,-1.28139,-0.425177,0.938573,0.495005,-1.733333


In [28]:
merged_data

Unnamed: 0_level_0,Adrenocortical Cancer,Age,Gender,ARHGEF10L,HIF3A,RNF17,RNF10,RNF11,RNF13,GTF2IP1,...,SLC7A10,PLA2G2C,TULP2,NPY5R,GNGT2,GNGT1,TULP3,BCL6B,GSTK1,SELP
sampleID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
TCGA-OR-A5J1-01,1,58,1,-0.641092,-0.325826,-0.531035,1.266428,0.355422,0.03719,0.243706,...,-1.520186,-0.086682,-0.182978,-0.615817,-0.281533,3.02111,-0.927577,-1.006227,1.119905,-2.185533
TCGA-OR-A5J2-01,1,44,0,-1.864792,2.766674,0.321165,1.000728,0.836122,0.35439,-0.436694,...,-0.318586,1.056018,0.393822,2.366583,-0.955033,-1.28139,1.020723,1.226373,1.164005,0.265067
TCGA-OR-A5J3-01,1,23,0,-0.723192,-0.362926,-0.531035,0.639828,-0.199578,-0.48331,0.143606,...,-0.574486,-0.086682,-0.748878,-0.113317,-3.803333,-0.61009,0.397623,-0.675227,1.196005,-3.161633
TCGA-OR-A5J5-01,1,30,1,-1.576792,-2.086226,2.463765,1.382228,-1.115678,-1.23621,0.615806,...,-0.279486,-0.086682,0.078622,1.095983,-0.908533,-1.28139,0.661823,0.458273,0.839605,-5.525533
TCGA-OR-A5J6-01,1,29,0,-2.311992,5.225974,-0.531035,0.967928,-0.393778,-0.38231,-0.060194,...,-2.090786,1.607218,2.481122,-0.946617,-0.570533,-1.28139,-0.425177,0.938573,0.495005,-1.733333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
TCGA-PA-A5YG-01,1,51,1,0.189908,1.377574,-0.531035,0.798828,0.174722,0.92979,-0.043394,...,-1.542586,0.857918,0.195722,1.314483,0.393367,-0.73319,-0.104177,0.280873,2.095805,1.767267
TCGA-PK-A5H8-01,1,42,1,-3.279892,3.689074,-0.531035,1.198428,-0.118078,-0.39651,0.470906,...,-1.390686,-0.086682,-0.048778,-1.587117,-1.890433,-1.28139,-0.098077,-0.338827,1.313105,-3.278333
TCGA-PK-A5H9-01,1,27,0,0.724908,4.032174,-0.099235,0.344828,0.183822,0.16789,1.447606,...,-1.052486,-0.086682,-0.748878,1.641683,-1.610433,-1.28139,-0.707577,1.106373,0.321005,0.311967
TCGA-PK-A5HA-01,1,63,1,-1.329092,-3.119026,-0.531035,0.525628,-0.639778,-0.04121,0.112606,...,-2.090786,-0.086682,-0.115378,-1.587117,-0.018633,-1.28139,-0.350477,0.968073,1.034305,-0.030433


In [29]:
def judge_and_remove_biased_features(df, trait, trait_type):
    assert trait_type in ["binary", "continuous"], f"The trait must be either a binary or a continuous variable!"
    if trait_type == "binary":
        trait_biased = judge_binary_variable_biased(df, trait)
    else:
        trait_biased = judge_continuous_variable_biased(df, trait)
    if trait_biased:
        print(f"The distribution of the feature \'{trait}\' in this dataset is severely biased.\n")
    else:
        print(f"The distribution of the feature \'{trait}\' in this dataset is fine.\n")
    if "Age" in df.columns:
        age_biased = judge_continuous_variable_biased(df, 'Age')
        if age_biased:
            print(f"The distribution of the feature \'Age\' in this dataset is severely biased.\n")
            df = df.drop(columns='Age')
        else:
            print(f"The distribution of the feature \'Age\' in this dataset is fine.\n")
    if "Gender" in df.columns:
        gender_biased = judge_binary_variable_biased(df, 'Gender')
        if gender_biased:
            print(f"The distribution of the feature \'Gender\' in this dataset is severely biased.\n")
            df = df.drop(columns='Gender')
        else:
            print(f"The distribution of the feature \'Gender\' in this dataset is fine.\n")

    return trait_biased, df

In [30]:
def judge_binary_variable_biased(dataframe, col_name, min_proportion=0.1, min_num=5):
    """
    Check if the distribution of a binary variable in the dataset is too biased to be usable for analysis
    :param dataframe:
    :param col_name:
    :param min_proportion:
    :param min_num:
    :return:
    """
    label_counter = dataframe[col_name].value_counts()
    total_samples = len(dataframe)
    rare_label_num = label_counter.min()
    rare_label = label_counter.idxmin()
    rare_label_proportion = rare_label_num / total_samples

    print(
        f"For the feature \'{col_name}\', the least common label is '{rare_label}' with {rare_label_num} occurrences. This represents {rare_label_proportion:.2%} of the dataset.")

    biased = (len(label_counter) < 2) or ((rare_label_proportion < min_proportion) and (rare_label_num < min_num))
    return bool(biased)


In [31]:
def judge_continuous_variable_biased(dataframe, col_name):
    """Check if the distribution of a continuous variable in the dataset is too biased to be usable for analysis.
    As a starting point, we consider it biased if all values are the same. For the next step, maybe ask GPT to judge
    based on quartile statistics combined with its common sense knowledge about this feature.
    """
    quartiles = dataframe[col_name].quantile([0.25, 0.5, 0.75])
    min_value = dataframe[col_name].min()
    max_value = dataframe[col_name].max()

    # Printing quartile information
    print(f"Quartiles for '{col_name}':")
    print(f"  25%: {quartiles[0.25]}")
    print(f"  50% (Median): {quartiles[0.5]}")
    print(f"  75%: {quartiles[0.75]}")
    print(f"Min: {min_value}")
    print(f"Max: {max_value}")

    biased = min_value == max_value

    return bool(biased)

In [32]:
print(f"The merged dataset contains {len(merged_data)} samples.")
is_trait_biased, merge_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

The merged dataset contains 79 samples.
For the feature 'Adrenocortical Cancer', the least common label is '1' with 79 occurrences. This represents 100.00% of the dataset.
The distribution of the feature 'Adrenocortical Cancer' in this dataset is severely biased.

Quartiles for 'Age':
  25%: 35.0
  50% (Median): 49.0
  75%: 59.5
Min: 14
Max: 77
The distribution of the feature 'Age' in this dataset is fine.

For the feature 'Gender', the least common label is '1' with 31 occurrences. This represents 39.24% of the dataset.
The distribution of the feature 'Gender' in this dataset is fine.



True

In [33]:
merged_data.head()
if not is_trait_biased:
    merge_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)

In [34]:
from typing import Callable, Optional, List, Tuple, Union, Any

In [35]:
def save_cohort_info(cohort: str, info_path: str, is_available: bool, is_biased: Optional[bool] = None,
                     df: Optional[pd.DataFrame] = None, note: str = '') -> None:
    """
    Add or update information about the usability and quality of a dataset for statistical analysis.

    Parameters:
    cohort (str): A unique identifier for the dataset.
    info_path (str): File path to the JSON file where records are stored.
    is_available (bool): Indicates whether both the genetic data and trait data are available in the dataset, and can be
     preprocessed into a dataframe.
    is_biased (bool, optional): Indicates whether the dataset is too biased to be usable.
        Required if `is_available` is True.
    df (pandas.DataFrame, optional): The preprocessed dataset. Required if `is_available` is True.
    note (str, optional): Additional notes about the dataset.

    Returns:
    None: The function does not return a value but updates or creates a record in the specified JSON file.
    """
    if is_available:
        assert (df is not None) and (is_biased is not None), "'df' and 'is_biased' should be provided if this cohort " \
                                                             "is relevant."
    is_usable = is_available and (not is_biased)
    new_record = {"is_usable": is_usable,
                  "is_available": is_available,
                  "is_biased": is_biased if is_available else None,
                  "has_age": "Age" in df.columns if is_available else None,
                  "has_gender": "Gender" in df.columns if is_available else None,
                  "sample_size": len(df) if is_available else None,
                  "note": note}
    
    if not os.path.exists(info_path):
        with open(info_path, 'w') as file:
            json.dump({}, file)
        print(f"A new JSON file was created at: {info_path}")

    with open(info_path, "r") as file:
        records = json.load(file)
    records[cohort] = new_record

    temp_path = info_path + ".tmp"
    try:
        with open(temp_path, 'w') as file:
            json.dump(records, file)
        os.replace(temp_path, info_path)

    except Exception as e:
        print(f"An error occurred: {e}")
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise

In [34]:
import json

save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data)

## 2.2. The GEO dataset

In GEO, there may be one or multiple cohorts for a trait. Each cohort is identified by an accession number. We iterate over all accession numbers in the corresponding subdirectory, preprocess the cohort data, and save them to csv files.

In [3]:
dataset = 'GEO'
trait_subdir = "Adrenocortical-Cancer"

trait_path = os.path.join(DATA_ROOT, dataset, trait_subdir)
os.listdir(trait_path)

['GSE108088',
 'GSE108089',
 'GSE143383',
 'GSE169253',
 'GSE19750',
 'GSE19776',
 'GSE19856',
 'GSE21660',
 'GSE32206',
 'GSE33371',
 'GSE35066',
 'GSE36353',
 'GSE49276',
 'GSE49277',
 'GSE49278',
 'GSE49280',
 'GSE52296',
 'GSE67766',
 'GSE68606',
 'GSE68950',
 'GSE75415',
 'GSE76019',
 'GSE90713']

Repeat the below steps for all the accession numbers

In [4]:
def get_relevant_filepaths(cohort_dir):
    """Find the file paths of a SOFT file and a matrix file from the given data directory of a cohort.
    If there are multiple SOFT files or matrix files, simply choose the first one. May be replaced by better
    strategies later.
    """
    files = os.listdir(cohort_dir)
    soft_files = [f for f in files if 'soft' in f.lower()]
    matrix_files = [f for f in files if 'matrix' in f.lower()]
    assert len(soft_files) > 0 and len(matrix_files) > 0
    soft_file_path = os.path.join(cohort_dir, soft_files[0])
    matrix_file_path = os.path.join(cohort_dir, matrix_files[0])

    return soft_file_path, matrix_file_path

In [4]:
# Finished
cohort = accession_num = "GSE33371"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Beta-catenin status effects in human adrenocortical carcinomas (33), adenomas (22), and normal adrenal cortex (10)"
!Series_summary	"We scored adrenocortical carcinomas and adenomas for abnormal beta-catenin staining, and sequenced the beta-catenin gene in some samples. We compared adrenocortincal carcinomas with and without abnormal beta-catenin staining and found many significant expression differences and significant results from enrichment testing. A similar comparison in the adenomas gave relatively few differences, and they did not correlate to differences found for the carcinomas.  Abnormal beta-catenin staining was associated with mitotic rate and poorer patient survival in the carcinomas.  In a second independent data set (given in a supplement) we again found beta-catenin associated with poor survival.  The array data given is the same as GEO series GSE10927, with additional characteristics about beta-catenin, and new patient followup data.  The analysis shown 

Unnamed: 0,!Sample_geo_accession,GSM825367,GSM825368,GSM825369,GSM825370,GSM825371,GSM825372,GSM825373,GSM825374,GSM825375,...,GSM825422,GSM825423,GSM825424,GSM825425,GSM825426,GSM825427,GSM825428,GSM825429,GSM825430,GSM825431
0,!Sample_characteristics_ch1,age: 71,age: 58,age: 71,age: 44,age: 32,age: 28,age: 55,age: 78,age: 41,...,age: 19,age: 61,age: 33,age: 31,age: 60,age: 51,age: 71,age: 25,age: 47,age: 49
1,!Sample_characteristics_ch1,Sex: F,Sex: F,Sex: M,Sex: M,Sex: F,Sex: M,Sex: M,Sex: M,Sex: M,...,Sex: M,Sex: F,Sex: F,Sex: F,Sex: F,Sex: F,Sex: M,Sex: F,Sex: F,Sex: F
2,!Sample_characteristics_ch1,side of body: Left,side of body: Right,side of body: Right,side of body: Left,side of body: Right,side of body: Left,side of body: Left,side of body: Left,side of body: unknown,...,side of body: Left,side of body: Right,side of body: Left,side of body: not applicable,side of body: Right,side of body: Right,side of body: Right,side of body: Right,side of body: Right,side of body: Left
3,!Sample_characteristics_ch1,clinical characteristics: Adrenalectomy for me...,clinical characteristics: Adrenalectomy for AC...,clinical characteristics: Adrenalectomy for re...,clinical characteristics: Adrenalectomy for AC...,clinical characteristics: Adrenalectomy for ACA,clinical characteristics: Adrenalectomy for ph...,clinical characteristics: Adrenalectomy for mu...,clinical characteristics: Adrenalectomy for re...,clinical characteristics: Adrenalectomy for re...,...,clinical characteristics: Left adrenal mass,clinical characteristics: Cushing Syndrome,clinical characteristics: Cortisol secreting t...,clinical characteristics: History of Adrenocor...,clinical characteristics: Cushing Syndrome,clinical characteristics: Right adrenal mass; ...,clinical characteristics: Right adrenal tumor;...,clinical characteristics: History of adrenal m...,clinical characteristics: Cushing Syndrome; Ri...,clinical characteristics: Elevated serum aldos...
4,!Sample_characteristics_ch1,tumor diameter (cm): not applicable,tumor diameter (cm): not applicable,tumor diameter (cm): not applicable,tumor diameter (cm): not applicable,tumor diameter (cm): not applicable,tumor diameter (cm): not applicable,tumor diameter (cm): not applicable,tumor diameter (cm): not applicable,tumor diameter (cm): not applicable,...,tumor diameter (cm): 10.5,tumor diameter (cm): 9.8,tumor diameter (cm): 13,tumor diameter (cm): unknown,tumor diameter (cm): 18.5,tumor diameter (cm): 15.5,tumor diameter (cm): 22,tumor diameter (cm): 26,tumor diameter (cm): unknown,tumor diameter (cm): 8.7
5,!Sample_characteristics_ch1,tumor weight (gm): not applicable,tumor weight (gm): not applicable,tumor weight (gm): not applicable,tumor weight (gm): not applicable,tumor weight (gm): not applicable,tumor weight (gm): not applicable,tumor weight (gm): not applicable,tumor weight (gm): not applicable,tumor weight (gm): not applicable,...,tumor weight (gm): unknown,tumor weight (gm): 253,tumor weight (gm): 660,tumor weight (gm): unknown,tumor weight (gm): 1805,tumor weight (gm): 1305,tumor weight (gm): 2440,tumor weight (gm): 2300,tumor weight (gm): unknown,tumor weight (gm): 397
6,!Sample_characteristics_ch1,weiss score of tumor: not applicable,weiss score of tumor: not applicable,weiss score of tumor: not applicable,weiss score of tumor: not applicable,weiss score of tumor: not applicable,weiss score of tumor: not applicable,weiss score of tumor: not applicable,weiss score of tumor: not applicable,weiss score of tumor: not applicable,...,weiss score of tumor: High,weiss score of tumor: High,weiss score of tumor: High,weiss score of tumor: Low,weiss score of tumor: High,weiss score of tumor: High,weiss score of tumor: High,weiss score of tumor: High,weiss score of tumor: High,weiss score of tumor: High
7,!Sample_characteristics_ch1,mitotic rate of tumor: not applicable,mitotic rate of tumor: not applicable,mitotic rate of tumor: not applicable,mitotic rate of tumor: not applicable,mitotic rate of tumor: not applicable,mitotic rate of tumor: not applicable,mitotic rate of tumor: not applicable,mitotic rate of tumor: not applicable,mitotic rate of tumor: not applicable,...,mitotic rate of tumor: 27,mitotic rate of tumor: 27,mitotic rate of tumor: 70,mitotic rate of tumor: 16,mitotic rate of tumor: 40,mitotic rate of tumor: 23,mitotic rate of tumor: 22,mitotic rate of tumor: 34,mitotic rate of tumor: 37,mitotic rate of tumor: 28
8,!Sample_characteristics_ch1,tumor stage: not applicable,tumor stage: not applicable,tumor stage: not applicable,tumor stage: not applicable,tumor stage: not applicable,tumor stage: not applicable,tumor stage: not applicable,tumor stage: not applicable,tumor stage: not applicable,...,tumor stage: 2,tumor stage: 2,tumor stage: 3,tumor stage: 4,tumor stage: 2,tumor stage: 3,tumor stage: 4,tumor stage: 4,tumor stage: 1,tumor stage: 2
9,!Sample_characteristics_ch1,years to last followup: unknown,years to last followup: unknown,years to last followup: unknown,years to last followup: unknown,years to last followup: unknown,years to last followup: unknown,years to last followup: unknown,years to last followup: unknown,years to last followup: unknown,...,years to last followup: 0.619178082,years to last followup: 0.942465753,years to last followup: unknown,years to last followup: unknown,years to last followup: 0.364383562,years to last followup: 2.232876712,years to last followup: 0.624657534,years to last followup: 0.898630137,years to last followup: 7.81,years to last followup: unknown


In [5]:
tumor_stage_row = clinical_data.iloc[6]
tumor_stage_row.unique()

array(['!Sample_characteristics_ch1',
       'weiss score of tumor: not applicable',
       'weiss score of tumor: High', 'weiss score of tumor: Low'],
      dtype=object)

In [6]:
tumor_stage_counts = tumor_stage_row.value_counts()
tumor_stage_counts


6
weiss score of tumor: not applicable    32
weiss score of tumor: High              20
weiss score of tumor: Low               13
!Sample_characteristics_ch1              1
Name: count, dtype: int64

In [7]:
genetic_data = get_genetic_data(matrix_file)
genetic_data

Unnamed: 0_level_0,GSM825367,GSM825368,GSM825369,GSM825370,GSM825371,GSM825372,GSM825373,GSM825374,GSM825375,GSM825376,...,GSM825422,GSM825423,GSM825424,GSM825425,GSM825426,GSM825427,GSM825428,GSM825429,GSM825430,GSM825431
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1007_s_at,3.426349,3.507046,3.503791,3.483730,3.593175,3.388279,3.323458,3.274850,3.367542,3.458184,...,3.629919,3.613207,3.721563,3.629715,3.336460,3.820530,4.179609,3.808751,3.198107,4.115178
1053_at,2.707570,2.615950,2.390935,2.568202,2.609594,2.567026,2.592177,2.635484,2.485721,2.726727,...,2.745855,2.640481,2.729165,2.755875,2.434569,2.678518,2.472756,2.545307,2.694605,3.105851
117_at,2.725912,2.958564,2.517196,2.913284,2.880814,2.670246,2.710963,2.809560,2.525045,2.656098,...,2.322219,2.328380,2.334454,2.361728,2.260071,3.437909,2.158362,2.457882,2.444045,2.264818
121_at,2.986772,3.145196,3.185542,3.296665,3.298416,3.145196,3.084219,3.203577,3.289589,3.124178,...,2.883093,3.007321,2.942008,2.873321,2.885361,2.829947,2.801404,3.012837,3.133219,2.977266
1255_g_at,2.214844,2.155336,2.178977,2.198657,2.004321,2.167317,2.079181,2.053078,2.176091,2.232996,...,2.117271,2.278754,2.190332,2.158362,2.167317,2.127105,1.973128,2.220108,2.315970,2.107210
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AFFX-ThrX-5_at,2.060698,1.977724,2.008600,2.060698,2.086360,2.093422,2.152288,2.089905,1.954243,2.079181,...,2.071882,2.075547,2.049218,2.167317,2.033424,2.071882,2.021189,2.075547,1.924279,2.045323
AFFX-ThrX-M_at,1.949390,2.012837,2.130334,2.187521,2.082785,2.056905,2.017033,2.117271,2.110590,2.089905,...,2.206826,2.049218,2.117271,2.152288,2.149219,2.123852,2.086360,2.096910,2.178977,2.037426
AFFX-TrpnX-3_at,2.139879,2.041393,2.086360,2.060698,2.025306,2.012837,1.982271,2.089905,2.008600,2.041393,...,2.008600,2.041393,2.082785,2.075547,2.041393,2.025306,2.049218,2.060698,2.017033,2.025306
AFFX-TrpnX-5_at,2.060698,2.107210,1.934498,2.056905,2.110590,2.146128,2.107210,2.103804,2.079181,2.041393,...,2.146128,2.041393,1.982271,2.021189,2.049218,2.008600,2.195900,2.045323,2.060698,2.107210


In [8]:
is_gene_availabe = True
trait_row = 6
age_row = 0
gender_row = 1

trait_type = 'binary'

In [9]:
# Verify and use the functions generated by GPT

# 这个函数将组织类型（tissue type）转换为有关癫痫存在与否的二进制值。
# 它是基于特定的假设，即如果组织类型是“胰腺导管腺癌”（Pancreatic Ductal Adenocarcinoma），则认为癫痫存在（返回1）；否则，认为癫痫不存在（返回0）。
def convert_trait(tissue_type):
    """
    Convert tissue type to epilepsy presence (binary).
    Assuming epilepsy presence for 'Hippocampus' tissue.
    """
    if tissue_type == 'weiss score of tumor: High':
        return 1  # Epilepsy present
    else:
        return 0  # Epilepsy not present


# 这个函数的目的是将年龄的字符串表示转换为一个连续的数值型表示。如果年龄未知（例如，标记为'n.a.'），则返回None。
# 函数尝试从传入的字符串中提取出一个整数作为年龄值。如果字符串的格式不符合预期，导致提取失败，同样返回None。
def convert_age(age_string):
    """
    Convert age string to a continuous numerical value.
    Unknown values are converted to None.
    """
    if age_string.lower() == 'n.a.':
        return None
    try:
        # Extract age as an integer from the string
        age = int(age_string.split(': ')[1])
        return age
    except (ValueError, IndexError):
        # In case of any format error or unexpected string structure
        return None


# 这个函数将性别的字符串表示转换为二进制值，其中“female”对应1，“male”对应0。如果性别未知或字符串不符合预期格式，则返回None。
# It sometimes maps 'female' to 0, and sometimes 1. Does it matter?
def convert_gender(gender_string):
    """
    Convert gender string to a binary value.
    'female' is represented as 1, 'male' as 0.
    Unknown values are converted to None.
    """
    if (gender_string.lower() == 'sex: female' or gender_string.lower() == 'sex: f'):
        return 1
    elif (gender_string.lower() == 'sex: male' or gender_string.lower() == 'sex: m') :  # changeed 
        return 0
    else:
        return None

In [10]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()

  clinical_df = clinical_df.applymap(convert_fn)
  clinical_df = clinical_df.applymap(convert_fn)
  clinical_df = clinical_df.applymap(convert_fn)


Unnamed: 0,GSM825367,GSM825368,GSM825369,GSM825370,GSM825371,GSM825372,GSM825373,GSM825374,GSM825375,GSM825376,...,GSM825422,GSM825423,GSM825424,GSM825425,GSM825426,GSM825427,GSM825428,GSM825429,GSM825430,GSM825431
Adrenocortical Cancer,0,0,0,0,0,0,0,0,0,0,...,1,1,1,0,1,1,1,1,1,1
Age,71,58,71,44,32,28,55,78,41,58,...,19,61,33,31,60,51,71,25,47,49
Gender,1,1,0,0,1,0,0,0,0,0,...,0,1,1,1,1,1,0,1,1,1


In [11]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

gene_annotation



{'ID': ['1007_s_at', '1053_at', '117_at', '121_at', '1255_g_at'], 'GB_ACC': ['U48705', 'M87338', 'X51757', 'X69699', 'L36861'], 'SPOT_ID': [nan, nan, nan, nan, nan], 'Species Scientific Name': ['Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens'], 'Annotation Date': ['Oct 6, 2014', 'Oct 6, 2014', 'Oct 6, 2014', 'Oct 6, 2014', 'Oct 6, 2014'], 'Sequence Type': ['Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence'], 'Sequence Source': ['Affymetrix Proprietary Database', 'GenBank', 'Affymetrix Proprietary Database', 'GenBank', 'Affymetrix Proprietary Database'], 'Target Description': ['U48705 /FEATURE=mRNA /DEFINITION=HSU48705 Human receptor tyrosine kinase DDR gene, complete cds', 'M87338 /FEATURE= /DEFINITION=HUMA1SBU Human replication factor C, 40-kDa subunit (A1) mRNA, complete cds', "X51757 /FEATURE=cds /DEFINITION=HSP70B Human heat-shock protein HSP70B' gene", 'X69699 /FEATURE= /DEFINITION=HSPAX8A H.sapiens

Unnamed: 0,ID,GB_ACC,SPOT_ID,Species Scientific Name,Annotation Date,Sequence Type,Sequence Source,Target Description,Representative Public ID,Gene Title,Gene Symbol,ENTREZ_GENE_ID,RefSeq Transcript ID,Gene Ontology Biological Process,Gene Ontology Cellular Component,Gene Ontology Molecular Function
0,1007_s_at,U48705,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,Affymetrix Proprietary Database,U48705 /FEATURE=mRNA /DEFINITION=HSU48705 Huma...,U48705,discoidin domain receptor tyrosine kinase 1 //...,DDR1 /// MIR4640,780 /// 100616237,NM_001202521 /// NM_001202522 /// NM_001202523...,0001558 // regulation of cell growth // inferr...,0005576 // extracellular region // inferred fr...,0000166 // nucleotide binding // inferred from...
1,1053_at,M87338,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,GenBank,M87338 /FEATURE= /DEFINITION=HUMA1SBU Human re...,M87338,"replication factor C (activator 1) 2, 40kDa",RFC2,5982,NM_001278791 /// NM_001278792 /// NM_001278793...,0000278 // mitotic cell cycle // traceable aut...,0005634 // nucleus // inferred from electronic...,0000166 // nucleotide binding // inferred from...
2,117_at,X51757,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,Affymetrix Proprietary Database,X51757 /FEATURE=cds /DEFINITION=HSP70B Human h...,X51757,heat shock 70kDa protein 6 (HSP70B'),HSPA6,3310,NM_002155,0000902 // cell morphogenesis // inferred from...,0005737 // cytoplasm // inferred from direct a...,0000166 // nucleotide binding // inferred from...
3,121_at,X69699,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,GenBank,X69699 /FEATURE= /DEFINITION=HSPAX8A H.sapiens...,X69699,paired box 8,PAX8,7849,NM_003466 /// NM_013951 /// NM_013952 /// NM_0...,0001655 // urogenital system development // in...,0005634 // nucleus // inferred from direct ass...,0000979 // RNA polymerase II core promoter seq...
4,1255_g_at,L36861,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,Affymetrix Proprietary Database,L36861 /FEATURE=expanded_cds /DEFINITION=HUMGC...,L36861,guanylate cyclase activator 1A (retina),GUCA1A,2978,NM_000409 /// XM_006715073,0007165 // signal transduction // non-traceabl...,0001750 // photoreceptor outer segment // infe...,0005509 // calcium ion binding // inferred fro...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3608610,AFFX-ThrX-5_at,2.045322979,,,,,,,,,,,,,,
3608611,AFFX-ThrX-M_at,2.037426498,,,,,,,,,,,,,,
3608612,AFFX-TrpnX-3_at,2.025305865,,,,,,,,,,,,,,
3608613,AFFX-TrpnX-5_at,2.10720997,,,,,,,,,,,,,,


In [12]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'Gene Symbol'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)

In [13]:
genetic_data

Unnamed: 0_level_0,GSM825367,GSM825368,GSM825369,GSM825370,GSM825371,GSM825372,GSM825373,GSM825374,GSM825375,GSM825376,...,GSM825422,GSM825423,GSM825424,GSM825425,GSM825426,GSM825427,GSM825428,GSM825429,GSM825430,GSM825431
Gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ABCB4,4.124145,3.991625,4.013174,4.125416,4.197391,4.258757,4.234467,4.217115,4.161817,4.077186,...,4.035990,4.203848,3.621592,4.276737,3.664736,3.584218,4.253338,3.346744,3.927165,3.870755
ABCC6P1,3.019947,2.647383,3.001734,3.035830,3.135769,2.778151,2.825426,2.909021,2.846955,2.879096,...,2.804821,3.066699,2.849419,3.006894,2.369216,2.510545,2.975432,2.702431,2.469822,3.063333
ABCC6P2,3.019947,2.647383,3.001734,3.035830,3.135769,2.778151,2.825426,2.909021,2.846955,2.879096,...,2.804821,3.066699,2.849419,3.006894,2.369216,2.510545,2.975432,2.702431,2.469822,3.063333
ABCD1P2,1.913814,2.100371,1.732394,2.143015,2.041393,1.903090,2.143015,2.049218,1.968483,1.869232,...,1.995635,1.913814,2.075547,2.110590,2.103804,1.944483,2.176091,1.991226,2.133539,1.838849
AC078883.4,2.025306,2.008600,2.222716,2.283301,2.139879,2.037426,2.158362,2.267172,1.968483,2.075547,...,1.698970,2.000000,2.176091,1.963788,2.238046,2.238046,2.143015,2.045323,2.146128,2.139879
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
abParts,2.164353,2.235528,1.875061,2.117271,2.033424,1.982271,2.008600,2.303196,2.082785,2.064458,...,2.025306,2.086360,2.056905,2.071882,1.954243,2.143015,2.033424,2.079181,2.645422,1.982271
alpha,2.201397,2.071882,2.190332,2.217484,2.217484,2.296665,2.152288,2.257679,2.146128,2.217484,...,2.149219,2.127105,2.139879,2.190332,2.250420,2.198657,2.195900,2.123852,2.363612,2.206826
av27s1,2.071882,2.264818,2.181844,2.033424,2.220108,2.214844,2.089905,2.071882,2.113943,2.155336,...,2.139879,2.012837,2.230449,2.240549,2.127105,2.113943,2.209515,1.812913,2.041393,2.089905
hsa-let-7a-3,2.146748,2.133586,2.044953,2.289606,1.982977,2.155803,2.258677,2.145195,2.027077,1.972587,...,2.293249,2.185691,2.087756,2.199196,2.367256,2.628767,2.489699,2.148333,2.150298,2.277354


In [14]:
genetic_data = normalize_gene_symbols_in_index(genetic_data)

In [15]:
genetic_data

Unnamed: 0,GSM825367,GSM825368,GSM825369,GSM825370,GSM825371,GSM825372,GSM825373,GSM825374,GSM825375,GSM825376,...,GSM825422,GSM825423,GSM825424,GSM825425,GSM825426,GSM825427,GSM825428,GSM825429,GSM825430,GSM825431
A1BG,2.344392,2.274158,2.409933,2.220108,2.382017,2.230449,2.225309,2.235528,2.453318,2.220108,...,2.770852,2.190332,2.492760,2.729165,2.176091,2.465383,2.100371,2.195900,2.841985,2.120574
A1BG-AS1,2.225309,2.152288,2.029384,2.209515,2.033424,2.033424,2.029384,1.949390,1.838849,2.068186,...,2.214844,2.071882,2.127105,2.158362,2.113943,2.079181,2.049218,2.012837,2.086360,1.838849
A1CF,2.258387,2.215618,2.275090,2.361580,2.336395,2.325736,2.125941,2.275871,2.282481,2.282451,...,2.159084,2.174736,2.309599,2.302383,2.312800,2.161667,2.235580,2.187923,2.321805,2.199543
A2M,3.241844,3.274122,3.237545,3.226205,3.260617,3.224270,3.173422,3.203154,3.231411,3.281136,...,3.021480,2.958363,3.048116,2.932255,2.986461,3.246799,2.960222,3.199634,3.262587,3.331146
A2M-AS1,2.758155,2.844477,2.843855,2.719331,2.836324,2.894870,2.625312,2.561101,2.684845,2.831870,...,2.799341,2.161368,2.703291,2.773786,3.175512,2.797268,2.649335,2.850646,2.676694,2.894870
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZYG11A,2.082785,2.222716,2.206826,2.139879,2.428135,2.267172,2.133539,2.139879,2.193125,2.285557,...,2.190332,2.332438,2.139879,2.056905,2.064458,2.195900,2.037426,2.225309,1.995635,2.167317
ZYG11B,3.290175,3.154617,3.074636,3.071638,3.251223,3.233694,3.158933,3.164545,3.286878,3.168897,...,3.123692,3.093839,3.288377,3.194621,3.121960,3.163507,3.073203,3.172625,3.054001,3.049166
ZYX,3.073090,3.153078,3.055331,3.152671,3.171448,3.329625,3.251829,3.499841,2.973627,3.033582,...,2.835737,2.869494,2.826289,2.711672,3.043238,3.207796,2.891020,2.983929,3.430660,3.085020
ZZEF1,2.613386,2.654351,2.645732,2.545531,2.657407,2.639701,2.723974,2.662660,2.614772,2.551009,...,3.047227,2.924887,2.622894,3.019029,2.529522,2.599417,2.869427,2.567629,2.702477,2.543037


In [16]:
merged_data = geo_merge_clinical_genetic_data(selected_clinical_data, genetic_data)
# The preprocessing runs through, which means is_available should be True
is_available = True

In [17]:
merged_data

Unnamed: 0,Adrenocortical Cancer,Age,Gender,A1BG,A1BG-AS1,A1CF,A2M,A2M-AS1,A2ML1,A2MP1,...,ZWILCH,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3
GSM825367,0.0,71.0,1.0,2.344392,2.225309,2.258387,3.241844,2.758155,2.177419,2.376577,...,2.531471,2.750508,2.662758,2.467995,2.540189,2.082785,3.290175,3.07309,2.613386,3.053991
GSM825368,0.0,58.0,1.0,2.274158,2.152288,2.215618,3.274122,2.844477,2.216885,2.287802,...,2.5887,2.812913,2.716838,2.359753,2.554861,2.222716,3.154617,3.153078,2.654351,3.050952
GSM825369,0.0,71.0,0.0,2.409933,2.029384,2.27509,3.237545,2.843855,2.306074,2.515874,...,2.500122,2.900913,2.682145,2.441456,2.402199,2.206826,3.074636,3.055331,2.645732,2.95849
GSM825370,0.0,44.0,0.0,2.220108,2.209515,2.36158,3.226205,2.719331,2.101516,2.350248,...,2.601973,2.93852,2.691081,2.340014,2.449002,2.139879,3.071638,3.152671,2.545531,3.008351
GSM825371,0.0,32.0,1.0,2.382017,2.033424,2.336395,3.260617,2.836324,2.11548,2.525045,...,2.582165,2.814913,2.699838,2.388292,2.532311,2.428135,3.251223,3.171448,2.657407,3.107854
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GSM825427,1.0,51.0,1.0,2.465383,2.079181,2.161667,3.246799,2.797268,2.164251,2.146128,...,2.739914,3.301247,2.647383,2.446042,2.946991,2.1959,3.163507,3.207796,2.599417,3.029347
GSM825428,1.0,71.0,0.0,2.100371,2.049218,2.23558,2.960222,2.649335,2.120561,2.217484,...,3.011802,3.423082,2.350248,2.336012,2.730562,2.037426,3.073203,2.89102,2.869427,2.954086
GSM825429,1.0,25.0,1.0,2.1959,2.012837,2.187923,3.199634,2.850646,2.136501,2.32838,...,2.911548,3.344981,2.40654,2.345784,2.62128,2.225309,3.172625,2.983929,2.567629,3.177093
GSM825430,1.0,47.0,1.0,2.841985,2.08636,2.321805,3.262587,2.676694,2.119374,2.20412,...,2.940211,3.49304,2.506505,2.330053,2.492831,1.995635,3.054001,3.43066,2.702477,2.988465


In [18]:
print(f"The merged dataset contains {len(merged_data)} samples.")
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

The merged dataset contains 63 samples.
For the feature 'Adrenocortical Cancer', the least common label is '1.0' with 18 occurrences. This represents 28.57% of the dataset.
The distribution of the feature 'Adrenocortical Cancer' in this dataset is fine.

Quartiles for 'Age':
  25%: 38.5
  50% (Median): 48.0
  75%: 56.0
Min: 19.0
Max: 87.0
The distribution of the feature 'Age' in this dataset is fine.

For the feature 'Gender', the least common label is '0.0' with 23 occurrences. This represents 36.51% of the dataset.
The distribution of the feature 'Gender' in this dataset is fine.



False

In [19]:
if is_available:
    save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data, note='')
else:
    save_cohort_info(cohort, JSON_PATH, is_available)
merged_data.head()
if not is_trait_biased:
    merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)

In [20]:
# error
cohort = accession_num = "GSE49280"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data.head()

!Series_title	"Integrated genomic analyses of adrenocortical tumors (SNP array, DNA methylation, mRNA and miRNA expression)."
!Series_summary	"This SuperSeries is composed of the SubSeries listed below."
!Series_overall_design	"Refer to individual Series"


Unnamed: 0,!Sample_geo_accession,GSM1196390,GSM1196391,GSM1196392,GSM1196393,GSM1196394,GSM1196395,GSM1196396,GSM1196397,GSM1196398,...,GSM1196418,GSM1196419,GSM1196420,GSM1196421,GSM1196422,GSM1196423,GSM1196424,GSM1196425,GSM1196426,GSM1196427
0,!Sample_characteristics_ch1,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,...,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma


In [21]:
tumor_stage_row = clinical_data.iloc[0]
tumor_stage_row.unique()

array(['!Sample_characteristics_ch1',
       'cell type: Adrenocortical carcinoma'], dtype=object)

In [22]:
genetic_data = get_genetic_data(matrix_file)
genetic_data

Unnamed: 0_level_0,GSM1196390,GSM1196391,GSM1196392,GSM1196393,GSM1196394,GSM1196395,GSM1196396,GSM1196397,GSM1196398,GSM1196399,...,GSM1196418,GSM1196419,GSM1196420,GSM1196421,GSM1196422,GSM1196423,GSM1196424,GSM1196425,GSM1196426,GSM1196427
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
rs1000000,0.1065,0.2021,-0.0258,0.0714,0.1821,0.3716,-0.0343,0.0632,0.4660,0.1609,...,0.0428,-0.0564,0.0930,0.1396,0.0075,0.0441,0.0616,0.1482,0.0390,0.0873
rs1000002,0.1184,-0.2697,-0.5366,-0.3392,0.1135,-0.5014,-0.2013,-0.3576,0.0717,-0.1686,...,-0.1097,0.1616,-0.2828,-0.2522,0.2335,-0.1791,0.2011,0.2515,-0.5280,-0.1804
rs10000023,-0.0627,0.0830,-0.1168,-0.0207,-0.0500,0.1033,-0.0169,0.0197,0.2101,0.0587,...,-0.1189,0.0611,0.0374,-0.0478,0.1226,-0.2908,0.1440,0.0704,-0.0951,-0.1550
rs1000003,-0.0874,-0.5206,-0.2224,-0.1689,-0.0666,-0.2013,-0.1730,0.1010,0.2544,-0.1112,...,0.0022,0.0930,-0.1605,-0.1667,0.0316,0.0493,-0.1910,0.0574,-0.1625,-0.1756
rs10000030,0.7767,-0.0285,0.4749,0.2115,-0.0876,-0.3540,0.4442,-0.5963,0.0009,-0.4139,...,0.3487,0.7111,0.5774,0.4109,0.3663,0.2121,0.6078,0.8196,0.2831,0.8315
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
VGXS34742,0.1940,0.2780,0.2586,-0.1372,0.3958,0.1807,0.2403,0.1901,0.3416,0.4133,...,0.2485,0.0686,0.1226,-0.0354,-0.1239,-0.3011,0.0032,-0.0218,-0.4481,0.0698
VGXS34743,0.2213,0.3447,0.2913,-0.2419,0.3323,0.1770,0.2181,0.0668,0.3055,0.4727,...,0.2333,0.2030,0.1948,0.0367,0.0499,-0.1797,0.1574,-0.0151,-0.3437,0.1108
VGXS34744,-0.1274,-0.0201,-0.0016,-0.4076,0.2235,-0.0196,-0.0525,-0.0919,0.2553,0.2095,...,0.0968,-0.0961,-0.1108,-0.4564,-0.4473,-0.7310,-0.1037,-0.5226,-0.7002,-0.2425
VGXS34761,-0.1124,0.0917,0.0726,-0.3928,0.2356,0.0827,-0.0679,0.1435,0.3957,0.2965,...,0.2568,0.0615,0.0917,-0.1877,-0.1994,-0.4965,-0.0968,-0.2921,-0.5193,-0.0352


In [23]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

gene_annotation

MemoryError: 

In [24]:
# Only gender --> SKIP
cohort = accession_num = "GSE143383"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file




from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data.head()

!Series_title	"Gene expression analysis of metastatic adrenocortical tumors"
!Series_summary	"Background: Adrenocortical carcinoma (ACC) is a rare, often-aggressive neoplasm of the adrenal cortex, with a 14.5-month median overall survival. We asked whether tumors from patients with advanced or metastatic ACC would offer clues as to putative genes that might have critical roles in disease progression or in more aggressive disease biology.   Methods: We conducted comprehensive genomic and expression analyses of 43 ACCs.  Results: Copy number gains and losses matched that previously reported. We identified a median mutation rate of 3.38 per megabase (Mb), somewhat higher than in a previous study possibly related to the more advanced disease. The mutational signature was characterized by a predominance of C>T, C>A and T>C transitions. As in previously reports, only cancer genes TP53 (26%) and beta-catenin (CTNNB1, 14%) were mutated in more than 10% of samples. The TCGA-identified putative 

Unnamed: 0,!Sample_geo_accession,GSM4258059,GSM4258060,GSM4258061,GSM4258062,GSM4258063,GSM4258064,GSM4258065,GSM4258066,GSM4258067,...,GSM4258112,GSM4258113,GSM4258114,GSM4258115,GSM4258116,GSM4258117,GSM4258118,GSM4258119,GSM4258120,GSM4258121
0,!Sample_characteristics_ch1,gender: M,gender: F,gender: F,gender: F,gender: F,gender: F,gender: F,gender: F,gender: M,...,gender: F,gender: M,gender: F,gender: M,gender: F,gender: M,gender: M,gender: F,gender: F,gender: M


In [25]:
# Finished????

cohort = accession_num = "GSE90713"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Expression data from human metastatic adrenocortical carcinoma"
!Series_summary	"CXCR4 expression by metastatic adrenocortical carcinoma is heterogeneous among patients and among lesions"
!Series_summary	"We used microarrays for 57 ACC metastases from 42 patients to evaluate gene expression in different lesions from same patients and over time, focusing on CXCR4 expression and other genes correlating with CXCR4 expression"
!Series_overall_design	"57 ACC metastases from 42 patients were used for RNA extraction and hybridization on Affymetrix microarrays. We sought to obtain data on CXCR4 expression by ACC metastases. Multiple lesion samples were aquired for 9 of the patients, labeled a thru i. Single samples were aquired from the other subjects."


Unnamed: 0,!Sample_geo_accession,GSM2411058,GSM2411059,GSM2411060,GSM2411061,GSM2411062,GSM2411063,GSM2411064,GSM2411065,GSM2411066,...,GSM2411111,GSM2411112,GSM2411113,GSM2411114,GSM2411115,GSM2411116,GSM2411117,GSM2411118,GSM2411119,GSM2411120
0,!Sample_characteristics_ch1,tissue: adrenocortical carcinoma,tissue: adrenocortical carcinoma,tissue: adrenocortical carcinoma,tissue: adrenocortical carcinoma,tissue: adrenocortical carcinoma,tissue: adrenocortical carcinoma,tissue: adrenocortical carcinoma,tissue: adrenocortical carcinoma,tissue: adrenocortical carcinoma,...,tissue: adrenocortical carcinoma,tissue: adrenocortical carcinoma,tissue: adrenocortical carcinoma,tissue: adrenocortical carcinoma,tissue: adrenocortical carcinoma,tissue: adrenocortical carcinoma,tissue: adrenocortical carcinoma,tissue: adrenocortical carcinoma,tissue: adrenocortical carcinoma,tissue: adrenocortical carcinoma
1,!Sample_characteristics_ch1,study: 426,study: 426,study: 426,study: 426,study: 426,study: 426,study: 426,study: 426,study: 426,...,study: 920,study: 920,study: 920,study: 920,study: 920,study: 920,study: 920,study: 920,study: 920,study: 920
2,!Sample_characteristics_ch1,condition: tumor,condition: tumor,condition: tumor,condition: tumor,condition: tumor,condition: tumor,condition: tumor,condition: tumor,condition: tumor,...,condition: tumor,condition: tumor,condition: tumor,condition: tumor,condition: tumor,condition: tumor,condition: tumor,condition: tumor,condition: tumor,condition: tumor
3,!Sample_characteristics_ch1,acc_num: 1,acc_num: 6,acc_num: 7,acc_num: 8,acc_num: 9,acc_num: 11,acc_num: 13,acc_num: 14,acc_num: 15,...,acc_num: 28b,acc_num: 29b,acc_num: 31b,acc_num: 30newb,acc_num: 32newb,acc_num: 3b,acc_num: 4b,acc_num: 5b,acc_num: 7b,acc_num: 8b
4,!Sample_characteristics_ch1,patient: a,patient: b,patient: c,patient: b,patient: d,patient: A_16,patient: A_17,patient: A_20,patient: A_22,...,patient: i,patient: i,patient: i,patient: i,patient: i,patient: B_32_1,patient: B_4,patient: B_5,patient: B_7,patient: B_8


In [26]:
tumor_stage_row = clinical_data.iloc[2]
tumor_stage_row.unique()

array(['!Sample_characteristics_ch1', 'condition: tumor',
       'condition: normal'], dtype=object)

In [27]:
genetic_data = get_genetic_data(matrix_file)
genetic_data

Unnamed: 0_level_0,GSM2411058,GSM2411059,GSM2411060,GSM2411061,GSM2411062,GSM2411063,GSM2411064,GSM2411065,GSM2411066,GSM2411067,...,GSM2411111,GSM2411112,GSM2411113,GSM2411114,GSM2411115,GSM2411116,GSM2411117,GSM2411118,GSM2411119,GSM2411120
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
11715100_at,4.841518,4.947266,4.903320,5.007812,5.004883,5.092773,4.888672,4.941406,5.827148,4.951172,...,4.987305,4.831055,4.747070,5.110352,4.948242,4.866211,5.023438,5.068359,5.014648,4.917969
11715101_s_at,5.293186,5.039062,5.093750,5.549805,5.667969,5.346680,5.472656,5.256836,5.132812,5.692383,...,5.365234,5.356445,5.077148,5.928711,5.514648,5.224609,5.111328,5.124023,5.298828,5.390625
11715102_x_at,4.914528,5.022461,4.991211,5.166992,5.165039,5.042969,4.958008,5.041992,5.163086,5.054688,...,5.144531,4.923828,4.839844,5.366211,5.056641,4.867188,5.034180,5.160156,5.103516,5.023438
11715103_x_at,5.504681,5.315430,5.303711,5.375000,5.358398,5.333984,5.676758,5.392578,6.496094,5.499023,...,5.378906,5.464844,5.498047,5.693359,5.537109,5.397461,5.589844,6.132812,5.736328,5.473633
11715104_s_at,5.056718,5.155273,5.315430,5.049805,5.089844,5.458008,5.041016,5.130859,5.394531,5.101562,...,5.104492,5.126953,5.010742,5.333008,5.175781,4.900391,5.298828,5.469727,5.416992,5.142578
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AFFX-ThrX-5_at,5.424495,6.557617,7.363281,6.172852,6.250977,5.279297,6.483398,5.523438,5.412109,5.427734,...,6.426758,6.051758,8.982422,5.268555,5.638672,6.011719,7.086914,6.926758,6.325195,5.761719
AFFX-ThrX-M_at,6.360383,8.201172,9.500000,7.684570,7.594727,5.778320,8.029297,6.305664,6.199219,5.916992,...,7.393555,6.994141,10.042969,6.012695,6.388672,7.298828,8.451172,8.498047,7.401367,6.551758
AFFX-TrpnX-3_at,4.642501,4.609375,4.632812,4.607422,4.633789,4.631836,4.649414,4.630859,4.643555,4.661133,...,4.631836,4.645508,4.626953,4.680664,4.624023,4.647461,4.639648,4.665039,4.630859,4.632812
AFFX-TrpnX-5_at,4.758898,4.708008,4.736328,4.763672,4.761719,4.729492,4.740234,4.726562,4.725586,4.826172,...,4.691406,4.776367,4.668945,4.723633,4.730469,4.751953,4.738281,4.726562,4.751953,4.698242


In [28]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

gene_annotation

{'ID': ['11715100_at', '11715101_s_at', '11715102_x_at', '11715103_x_at', '11715104_s_at'], 'GeneChip Array': ['Human Genome PrimeView Array', 'Human Genome PrimeView Array', 'Human Genome PrimeView Array', 'Human Genome PrimeView Array', 'Human Genome PrimeView Array'], 'Species Scientific Name': ['Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens'], 'Annotation Date': ['30-Mar-16', '30-Mar-16', '30-Mar-16', '30-Mar-16', '30-Mar-16'], 'Sequence Type': ['Consensus sequence', 'Consensus sequence', 'Consensus sequence', 'Consensus sequence', 'Consensus sequence'], 'Sequence Source': ['Affymetrix Proprietary Database', 'Affymetrix Proprietary Database', 'Affymetrix Proprietary Database', 'Affymetrix Proprietary Database', 'Affymetrix Proprietary Database'], 'Transcript ID(Array Design)': ['g21264570', 'g21264570', 'g21264570', 'g22748780', 'g30039713'], 'Target Description': ['g21264570 /TID=g21264570 /CNT=1 /FEA=FLmRNA /TIER=FL /STK=0 /DEF=g21264570 /REP_ORG=Ho

Unnamed: 0,ID,GeneChip Array,Species Scientific Name,Annotation Date,Sequence Type,Sequence Source,Transcript ID(Array Design),Target Description,GB_ACC,GI,...,Gene Ontology Biological Process,Gene Ontology Cellular Component,Gene Ontology Molecular Function,Pathway,InterPro,Annotation Description,Annotation Transcript Cluster,Transcript Assignments,Annotation Notes,SPOT_ID
0,11715100_at,Human Genome PrimeView Array,Homo sapiens,30-Mar-16,Consensus sequence,Affymetrix Proprietary Database,g21264570,g21264570 /TID=g21264570 /CNT=1 /FEA=FLmRNA /T...,,21264570.0,...,0000183 // chromatin silencing at rDNA // trac...,0000228 // nuclear chromosome // inferred from...,0003677 // DNA binding // inferred from electr...,---,IPR007125 // Histone H2A/H2B/H3 // 9.3E-34 ///...,This probe set was annotated using the Matchin...,"ENST00000614378(11),NM_003534(11),OTTHUMT00000...",ENST00000614378 // ensembl_havana_transcript:k...,---,
1,11715101_s_at,Human Genome PrimeView Array,Homo sapiens,30-Mar-16,Consensus sequence,Affymetrix Proprietary Database,g21264570,g21264570 /TID=g21264570 /CNT=1 /FEA=FLmRNA /T...,,21264570.0,...,0000183 // chromatin silencing at rDNA // trac...,0000228 // nuclear chromosome // inferred from...,0003677 // DNA binding // inferred from electr...,---,IPR007125 // Histone H2A/H2B/H3 // 9.3E-34 ///...,This probe set was annotated using the Matchin...,"ENST00000614378(11),NM_003534(11),OTTHUMT00000...",ENST00000614378 // ensembl_havana_transcript:k...,---,
2,11715102_x_at,Human Genome PrimeView Array,Homo sapiens,30-Mar-16,Consensus sequence,Affymetrix Proprietary Database,g21264570,g21264570 /TID=g21264570 /CNT=1 /FEA=FLmRNA /T...,,21264570.0,...,0000183 // chromatin silencing at rDNA // trac...,0000228 // nuclear chromosome // inferred from...,0003677 // DNA binding // inferred from electr...,---,IPR007125 // Histone H2A/H2B/H3 // 9.3E-34 ///...,This probe set was annotated using the Matchin...,"ENST00000614378(11),NM_003534(11),OTTHUMT00000...",ENST00000614378 // ensembl_havana_transcript:k...,GENSCAN00000029819 // ensembl // 4 // Cross Hy...,
3,11715103_x_at,Human Genome PrimeView Array,Homo sapiens,30-Mar-16,Consensus sequence,Affymetrix Proprietary Database,g22748780,g22748780 /TID=g22748780 /CNT=1 /FEA=FLmRNA /T...,,22748780.0,...,0032007 // negative regulation of TOR signalin...,0005737 // cytoplasm // not recorded /// 00057...,0005515 // protein binding // inferred from ph...,---,IPR008477 // Protein of unknown function DUF75...,This probe set was annotated using the Matchin...,"BC017672(11),BC044250(9),ENST00000327473(11),E...",BC017672 // Homo sapiens tumor necrosis factor...,---,
4,11715104_s_at,Human Genome PrimeView Array,Homo sapiens,30-Mar-16,Consensus sequence,Affymetrix Proprietary Database,g30039713,g30039713 /TID=g30039713 /CNT=1 /FEA=FLmRNA /T...,,30039713.0,...,---,0016020 // membrane // inferred from electroni...,---,---,IPR004878 // Otopetrin // 9.4E-43 /// IPR00487...,This probe set was annotated using the Matchin...,"ENST00000331427(11),ENST00000580223(11),NM_178...",ENST00000331427 // ensembl:known chromosome:GR...,---,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3161338,AFFX-r2-TagJ-5_at,4.615234375,,,,,,,,,...,,,,,,,,,,
3161339,AFFX-r2-TagO-3_at,4.705078125,,,,,,,,,...,,,,,,,,,,
3161340,AFFX-r2-TagO-5_at,4.608398438,,,,,,,,,...,,,,,,,,,,
3161341,AFFX-r2-TagQ-3_at,4.697265625,,,,,,,,,...,,,,,,,,,,


In [29]:
gene_annotation.columns

Index(['ID', 'GeneChip Array', 'Species Scientific Name', 'Annotation Date',
       'Sequence Type', 'Sequence Source', 'Transcript ID(Array Design)',
       'Target Description', 'GB_ACC', 'GI', 'Representative Public ID',
       'Archival UniGene Cluster', 'UniGene ID', 'Genome Version',
       'Alignments', 'Gene Title', 'Gene Symbol', 'Chromosomal Location',
       'Unigene Cluster Type', 'Ensembl', 'Entrez Gene', 'SwissProt', 'EC',
       'OMIM', 'RefSeq Protein ID', 'RefSeq Transcript ID',
       'Gene Ontology Biological Process', 'Gene Ontology Cellular Component',
       'Gene Ontology Molecular Function', 'Pathway', 'InterPro',
       'Annotation Description', 'Annotation Transcript Cluster',
       'Transcript Assignments', 'Annotation Notes', 'SPOT_ID'],
      dtype='object')

In [30]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'Gene Symbol'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)

In [31]:
genetic_data

Unnamed: 0_level_0,GSM2411058,GSM2411059,GSM2411060,GSM2411061,GSM2411062,GSM2411063,GSM2411064,GSM2411065,GSM2411066,GSM2411067,...,GSM2411111,GSM2411112,GSM2411113,GSM2411114,GSM2411115,GSM2411116,GSM2411117,GSM2411118,GSM2411119,GSM2411120
Gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ABCC6P1,5.180866,4.977051,5.195313,4.862793,4.852539,5.636719,6.330566,5.047852,5.034180,5.624512,...,4.866211,4.717285,5.069336,4.842285,4.742676,4.809570,4.792969,5.490723,5.702148,6.715332
ABCC6P2,5.471013,5.002930,5.356445,4.798828,4.750977,6.299805,7.727539,5.214844,5.150391,6.301758,...,4.853516,4.749023,5.251953,4.809570,4.689453,4.787109,4.773438,6.125977,6.534180,8.548828
ACOT2,7.713635,6.903320,7.163086,8.966797,8.912109,7.658203,7.423828,6.889648,7.335938,7.115234,...,7.334961,7.650391,8.146484,7.863281,7.859375,7.816406,8.263672,5.544922,6.828125,7.619141
ACSM2B,4.743009,4.749023,4.800000,4.703711,4.716211,4.777148,4.722656,4.741016,4.869922,4.773828,...,4.690039,4.709961,4.745898,4.784766,4.738672,4.766797,4.747461,4.816797,4.822461,4.763086
ACY1,6.272151,6.059570,6.395020,7.125488,7.118164,6.563965,6.569336,7.001953,6.913086,7.670410,...,5.946777,6.174805,6.546875,5.881836,6.570313,6.386230,5.936035,6.612793,7.480957,6.257324
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZYG11A,4.680315,4.662598,4.670898,4.657715,4.712891,4.703125,4.690918,4.664551,4.665039,4.683594,...,4.670410,4.671387,4.714355,4.662598,4.674316,4.694336,4.662598,4.666992,4.657227,4.693359
ZYG11B,6.612026,6.639160,6.410156,6.420166,6.500000,6.626953,6.472412,6.822266,6.800537,6.394043,...,6.310547,6.310059,6.320801,6.378174,6.380615,6.425537,6.324707,6.276855,6.554199,6.743652
ZYX,6.074064,7.055664,6.838867,6.343750,6.431152,6.651855,6.940430,6.520508,7.701660,6.390137,...,6.027832,6.108398,7.544434,6.250977,6.095703,6.333496,6.608398,6.477051,6.934570,6.090332
ZZEF1,5.904870,5.963867,6.434082,5.895508,5.866699,6.193359,5.788086,6.060547,5.915039,6.078613,...,6.214355,6.235352,6.056152,6.332520,5.993164,6.766602,5.933594,6.520996,6.223633,5.875000


In [32]:
genetic_data = normalize_gene_symbols_in_index(genetic_data)

In [33]:
genetic_data

Unnamed: 0,GSM2411058,GSM2411059,GSM2411060,GSM2411061,GSM2411062,GSM2411063,GSM2411064,GSM2411065,GSM2411066,GSM2411067,...,GSM2411111,GSM2411112,GSM2411113,GSM2411114,GSM2411115,GSM2411116,GSM2411117,GSM2411118,GSM2411119,GSM2411120
A1BG,5.106244,5.238281,5.280273,5.072266,5.064453,5.056641,6.063477,4.973633,5.327148,5.094727,...,5.244141,4.891602,5.267578,5.543945,5.151367,5.382812,5.319336,4.903320,5.154297,6.026367
A1CF,4.872861,4.875000,4.913086,4.846191,4.951660,4.916016,4.812012,4.852539,4.972656,4.843262,...,4.786621,4.854004,4.850098,4.797363,4.797852,4.824219,4.800293,4.812012,4.872070,4.813477
A2M,9.978159,11.861328,11.611328,10.238281,9.562500,10.841797,11.351562,10.564453,11.300781,9.648438,...,10.869141,10.480469,11.900391,11.300781,10.591797,11.982422,11.115234,11.400391,10.392578,10.582031
A2ML1,4.740273,5.305664,4.681152,4.795410,4.846680,4.712891,4.683105,4.719238,4.760254,4.727051,...,4.730957,4.722656,4.735352,4.744141,4.723633,4.699707,4.740234,4.731445,4.727539,4.713379
A3GALT2,4.875744,4.892578,4.951172,4.938477,4.877930,5.062500,4.871094,4.910156,4.992188,4.981445,...,4.886719,4.900391,4.869141,4.910156,4.896484,4.831055,4.923828,5.066406,5.002930,4.890625
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZYG11A,4.680315,4.662598,4.670898,4.657715,4.712891,4.703125,4.690918,4.664551,4.665039,4.683594,...,4.670410,4.671387,4.714355,4.662598,4.674316,4.694336,4.662598,4.666992,4.657227,4.693359
ZYG11B,6.612026,6.639160,6.410156,6.420166,6.500000,6.626953,6.472412,6.822266,6.800537,6.394043,...,6.310547,6.310059,6.320801,6.378174,6.380615,6.425537,6.324707,6.276855,6.554199,6.743652
ZYX,6.074064,7.055664,6.838867,6.343750,6.431152,6.651855,6.940430,6.520508,7.701660,6.390137,...,6.027832,6.108398,7.544434,6.250977,6.095703,6.333496,6.608398,6.477051,6.934570,6.090332
ZZEF1,5.904870,5.963867,6.434082,5.895508,5.866699,6.193359,5.788086,6.060547,5.915039,6.078613,...,6.214355,6.235352,6.056152,6.332520,5.993164,6.766602,5.933594,6.520996,6.223633,5.875000


In [34]:
is_gene_availabe = True
trait_row = 2
age_row = None
gender_row = None

trait_type = 'binary'

In [35]:
# Verify and use the functions generated by GPT

# 这个函数将组织类型（tissue type）转换为有关癫痫存在与否的二进制值。
# 它是基于特定的假设，即如果组织类型是“胰腺导管腺癌”（Pancreatic Ductal Adenocarcinoma），则认为癫痫存在（返回1）；否则，认为癫痫不存在（返回0）。
def convert_trait(tissue_type):
    """
    Convert tissue type to epilepsy presence (binary).
    Assuming epilepsy presence for 'Hippocampus' tissue.
    """
    if tissue_type == 'condition: tumor':
        return 1  # Epilepsy present
    else:
        return 0  # Epilepsy not present


# 这个函数的目的是将年龄的字符串表示转换为一个连续的数值型表示。如果年龄未知（例如，标记为'n.a.'），则返回None。
# 函数尝试从传入的字符串中提取出一个整数作为年龄值。如果字符串的格式不符合预期，导致提取失败，同样返回None。
def convert_age(age_string):
    """
    Convert age string to a continuous numerical value.
    Unknown values are converted to None.
    """
    if age_string.lower() == 'n.a.':
        return None
    try:
        # Extract age as an integer from the string
        age = int(age_string.split(': ')[1])
        return age
    except (ValueError, IndexError):
        # In case of any format error or unexpected string structure
        return None


# 这个函数将性别的字符串表示转换为二进制值，其中“female”对应1，“male”对应0。如果性别未知或字符串不符合预期格式，则返回None。
# It sometimes maps 'female' to 0, and sometimes 1. Does it matter?
def convert_gender(gender_string):
    """
    Convert gender string to a binary value.
    'female' is represented as 1, 'male' as 0.
    Unknown values are converted to None.
    """
    if (gender_string.lower() == 'sex: female' or gender_string.lower() == 'sex: f'):
        return 1
    elif (gender_string.lower() == 'sex: male' or gender_string.lower() == 'sex: m') :  # changeed 
        return 0
    else:
        return None

In [36]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()

  clinical_df = clinical_df.applymap(convert_fn)


Unnamed: 0,GSM2411058,GSM2411059,GSM2411060,GSM2411061,GSM2411062,GSM2411063,GSM2411064,GSM2411065,GSM2411066,GSM2411067,...,GSM2411111,GSM2411112,GSM2411113,GSM2411114,GSM2411115,GSM2411116,GSM2411117,GSM2411118,GSM2411119,GSM2411120
Adrenocortical Cancer,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1


In [37]:
merged_data = geo_merge_clinical_genetic_data(selected_clinical_data, genetic_data)
# The preprocessing runs through, which means is_available should be True
is_available = True

In [38]:
merged_data

Unnamed: 0,Adrenocortical Cancer,A1BG,A1CF,A2M,A2ML1,A3GALT2,A4GALT,A4GNT,AAAS,AACS,...,ZW10,ZWILCH,ZWINT,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3
GSM2411058,1.0,5.106244,4.872861,9.978159,4.740273,4.875744,5.195803,4.952567,6.198320,6.548022,...,7.962596,6.626484,9.489847,5.142927,5.985722,4.680315,6.612026,6.074064,5.904870,7.978461
GSM2411059,1.0,5.238281,4.875000,11.861328,5.305664,4.892578,5.055664,4.983398,6.668945,5.909424,...,7.197266,5.800293,7.072754,5.172852,5.607910,4.662598,6.639160,7.055664,5.963867,7.258789
GSM2411060,1.0,5.280273,4.913086,11.611328,4.681152,4.951172,5.188802,4.914062,7.054688,6.372559,...,6.713867,5.827881,7.602051,5.137695,6.056348,4.670898,6.410156,6.838867,6.434082,7.264160
GSM2411061,1.0,5.072266,4.846191,10.238281,4.795410,4.938477,5.052409,5.004883,6.350586,6.262939,...,7.292969,5.847900,7.145996,5.144043,5.725586,4.657715,6.420166,6.343750,5.895508,7.532715
GSM2411062,1.0,5.064453,4.951660,9.562500,4.846680,4.877930,5.063151,4.928711,6.232422,6.250488,...,7.392578,5.827148,7.178223,5.222168,5.692285,4.712891,6.500000,6.431152,5.866699,7.452637
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GSM2411116,1.0,5.382812,4.824219,11.982422,4.699707,4.831055,5.071615,4.986328,6.214844,6.644043,...,7.006836,6.724854,7.033691,5.236816,6.435742,4.694336,6.425537,6.333496,6.766602,7.829102
GSM2411117,1.0,5.319336,4.800293,11.115234,4.740234,4.923828,5.130534,4.795898,6.176758,6.421875,...,6.976562,5.992188,7.158203,5.168945,5.774121,4.662598,6.324707,6.608398,5.933594,6.866699
GSM2411118,1.0,4.903320,4.812012,11.400391,4.731445,5.066406,5.210286,4.906250,6.487305,5.935059,...,6.934570,7.105957,8.438965,5.033203,5.518555,4.666992,6.276855,6.477051,6.520996,6.866211
GSM2411119,1.0,5.154297,4.872070,10.392578,4.727539,5.002930,5.390625,4.973633,6.874023,6.340088,...,6.652344,5.964600,7.772949,5.078613,5.781250,4.657227,6.554199,6.934570,6.223633,7.709961


In [39]:
print(f"The merged dataset contains {len(merged_data)} samples.")
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

The merged dataset contains 63 samples.
For the feature 'Adrenocortical Cancer', the least common label is '0.0' with 5 occurrences. This represents 7.94% of the dataset.
The distribution of the feature 'Adrenocortical Cancer' in this dataset is fine.



False

In [40]:
if is_available:
    save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data, note='')
else:
    save_cohort_info(cohort, JSON_PATH, is_available)
merged_data.head()
if not is_trait_biased:
    merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)

In [41]:
# Finished
cohort = accession_num = "GSE36353"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data.head()

!Series_title	"Genome wide DNA Methylation analysis of benign and malignant adrenocortical tumors"
!Series_summary	"Genome wide DNA methylation profiling of normal adrenocortical tissue, adrenocortical adenomas and adrenocortical carcinomas. The Illumina Infinium 27k Human DNA methylation Beadchip v1.2 was used to obtain DNA methylation profiles. Samples included 6 normal adrenocortical tissue samples, 27 adenomas and 15 carcinomas."
!Series_overall_design	"Bisulphite converted DNA from the 48 samples were hybridised to the Illumina Infinium 27k Human Methylation Beadchip v1.2"


Unnamed: 0,!Sample_geo_accession,GSM889444,GSM889445,GSM889446,GSM889447,GSM889448,GSM889449,GSM889450,GSM889451,GSM889452,...,GSM889482,GSM889483,GSM889484,GSM889485,GSM889486,GSM889487,GSM889488,GSM889489,GSM889490,GSM889491
0,!Sample_characteristics_ch1,gender: Female,gender: Female,gender: Female,gender: Female,gender: Female,gender: Male,gender: Female,gender: Male,gender: Male,...,gender: Female,gender: Female,gender: Female,gender: Male,gender: Male,gender: Female,gender: Male,gender: Female,gender: Male,gender: Female
1,!Sample_characteristics_ch1,tissue: adrenal tissue,tissue: adrenal tissue,tissue: adrenal tissue,tissue: adrenal tissue,tissue: adrenal tissue,tissue: adrenal tissue,tissue: adrenal tissue,tissue: adrenal tissue,tissue: adrenal tissue,...,tissue: adrenal tissue,tissue: adrenal tissue,tissue: adrenal tissue,tissue: adrenal tissue,tissue: adrenal tissue,tissue: adrenal tissue,tissue: adrenal tissue,tissue: adrenal tissue,tissue: adrenal tissue,tissue: adrenal tissue
2,!Sample_characteristics_ch1,disease state: carcinoma,disease state: carcinoma,disease state: carcinoma,disease state: carcinoma,disease state: carcinoma,disease state: carcinoma,disease state: carcinoma,disease state: carcinoma,disease state: carcinoma,...,disease state: adenoma,disease state: adenoma,disease state: adenoma,disease state: adenoma,disease state: normal adrenal tissue,disease state: normal adrenal tissue,disease state: normal adrenal tissue,disease state: normal adrenal tissue,disease state: normal adrenal tissue,disease state: normal adrenal tissue


In [42]:
tumor_stage_row = clinical_data.iloc[2]
tumor_stage_row.unique()

array(['!Sample_characteristics_ch1', 'disease state: carcinoma',
       'disease state: adenoma', 'disease state: normal adrenal tissue'],
      dtype=object)

In [43]:
is_gene_availabe = True
trait_row = 2
age_row = None
gender_row = 0

trait_type = 'binary'

In [44]:
genetic_data = get_genetic_data(matrix_file)
genetic_data

Unnamed: 0_level_0,GSM889444,GSM889445,GSM889446,GSM889447,GSM889448,GSM889449,GSM889450,GSM889451,GSM889452,GSM889453,...,GSM889482,GSM889483,GSM889484,GSM889485,GSM889486,GSM889487,GSM889488,GSM889489,GSM889490,GSM889491
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
cg00000292,0.590574,0.513603,0.826171,0.867341,0.598491,0.785434,0.566842,0.801596,0.782646,0.713891,...,0.573703,0.464043,0.399872,0.648043,0.736842,0.601324,0.522049,0.534815,0.586788,0.606669
cg00002426,0.256898,0.105301,0.120422,0.047387,0.158108,0.103077,0.553597,0.270270,0.117484,0.130336,...,0.271988,0.279047,0.216480,0.335722,0.298562,0.293625,0.334871,0.321670,0.222039,0.293077
cg00003994,0.078937,0.095357,0.066558,0.148379,0.302606,0.073422,0.082058,0.103124,0.062960,0.303335,...,0.132663,0.222629,0.102947,0.080621,0.095720,0.093675,0.089851,0.082194,0.095572,0.083108
cg00005847,0.684397,0.808231,0.740177,0.798983,0.600528,0.769102,0.388342,0.522643,0.751418,0.213138,...,0.496777,0.231583,0.250337,0.429693,0.630685,0.597885,0.663874,0.678045,0.581554,0.648205
cg00006414,0.113754,0.094297,0.068810,0.095124,0.099132,0.045772,0.130852,0.078138,0.090765,0.092367,...,0.157187,0.103861,0.148690,0.086650,0.102798,0.095674,0.085801,0.095066,0.097189,0.121290
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
cg27657249,0.167368,0.163048,0.170360,0.206942,0.275417,0.182009,0.162453,0.198111,0.183755,0.193382,...,0.209191,0.212529,0.200447,0.225882,0.157167,0.159855,0.175443,0.169527,0.177136,0.149817
cg27657283,0.082578,0.110022,0.102603,0.111641,0.099657,0.078547,0.147287,0.096087,0.111524,0.118322,...,0.235684,0.106418,0.141741,0.097420,0.093500,0.113155,0.081989,0.092765,0.089083,0.116853
cg27661264,0.352964,0.230991,0.443879,0.258413,0.444090,0.490304,0.374893,0.559374,0.494604,0.399411,...,0.379100,0.509235,0.329875,0.565987,0.378737,0.298566,0.371765,0.343241,0.274428,0.362638
cg27662379,0.037021,0.035945,0.042062,0.025984,0.036624,0.026747,0.043812,0.030194,0.037006,0.024747,...,0.034847,0.034346,0.049740,0.018274,0.042089,0.040415,0.033850,0.037650,0.030767,0.030493


In [45]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

gene_annotation

{'ID': ['cg00000292', 'cg00002426', 'cg00003994', 'cg00005847', 'cg00006414'], 'Name': ['cg00000292', 'cg00002426', 'cg00003994', 'cg00005847', 'cg00006414'], 'IlmnStrand': ['TOP', 'TOP', 'TOP', 'BOT', 'BOT'], 'AddressA_ID': [990370.0, 6580397.0, 7150184.0, 4850717.0, 6980731.0], 'AlleleA_ProbeSeq': ['AAACATTAATTACCAACCACTCTTCCAAAAAACACTTACCATTAAAACCA', 'AATATAATAACATTACCTTACCCATCTTATAATCAAACCAAACAAAAACA', 'AATAATAATAATACCCCCTATAATACTAACTAACAAACATACCCTCTTCA', 'TACTATAATACACCCTATATTTAAAACACTAAACTTACCCCATTAAAACA', 'CTCAAAAACCAAACAAAACAAAACCCCAATACTAATCATTAATAAAATCA'], 'AddressB_ID': [6660678.0, 6100343.0, 7150392.0, 1260113.0, 4280093.0], 'AlleleB_ProbeSeq': ['AAACATTAATTACCAACCGCTCTTCCAAAAAACACTTACCATTAAAACCG', 'AATATAATAACATTACCTTACCCGTCTTATAATCAAACCAAACGAAAACG', 'AATAATAATAATACCCCCTATAATACTAACTAACAAACATACCCTCTTCG', 'TACTATAATACACCCTATATTTAAAACACTAAACTTACCCCATTAAAACG', 'CTCGAAAACCGAACAAAACAAAACCCCAATACTAATCGTTAATAAAATCG'], 'GenomeBuild': [36.0, 36.0, 36.0, 36.0, 36.0], 'Chr': ['16', '3

Unnamed: 0,ID,Name,IlmnStrand,AddressA_ID,AlleleA_ProbeSeq,AddressB_ID,AlleleB_ProbeSeq,GenomeBuild,Chr,MapInfo,...,Distance_to_TSS,CPG_ISLAND,CPG_ISLAND_LOCATIONS,MIR_CPG_ISLAND,RANGE_GB,RANGE_START,RANGE_END,RANGE_STRAND,GB_ACC,ORF
0,cg00000292,cg00000292,TOP,990370.0,AAACATTAATTACCAACCACTCTTCCAAAAAACACTTACCATTAAA...,6660678.0,AAACATTAATTACCAACCGCTCTTCCAAAAAACACTTACCATTAAA...,36.0,16,28797601.0,...,291.0,True,16:28797486-28797825,,NC_000016.8,28797486.0,28797825.0,+,NM_173201.2,487.0
1,cg00002426,cg00002426,TOP,6580397.0,AATATAATAACATTACCTTACCCATCTTATAATCAAACCAAACAAA...,6100343.0,AATATAATAACATTACCTTACCCGTCTTATAATCAAACCAAACGAA...,36.0,3,57718583.0,...,369.0,True,3:57716811-57718675,,NC_000003.10,57716811.0,57718675.0,+,NM_007159.2,7871.0
2,cg00003994,cg00003994,TOP,7150184.0,AATAATAATAATACCCCCTATAATACTAACTAACAAACATACCCTC...,7150392.0,AATAATAATAATACCCCCTATAATACTAACTAACAAACATACCCTC...,36.0,7,15692387.0,...,432.0,True,7:15691512-15693551,,NC_000007.12,15691512.0,15693551.0,-,NM_005924.3,4223.0
3,cg00005847,cg00005847,BOT,4850717.0,TACTATAATACACCCTATATTTAAAACACTAAACTTACCCCATTAA...,1260113.0,TACTATAATACACCCTATATTTAAAACACTAAACTTACCCCATTAA...,36.0,2,176737319.0,...,268.0,False,,,,,,,NM_006898.4,3232.0
4,cg00006414,cg00006414,BOT,6980731.0,CTCAAAAACCAAACAAAACAAAACCCCAATACTAATCATTAATAAA...,4280093.0,CTCGAAAACCGAACAAAACAAAACCCCAATACTAATCGTTAATAAA...,36.0,7,148453770.0,...,671.0,True,7:148453584-148455804,,NC_000007.12,148453584.0,148455804.0,+,NM_020781.2,57541.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1351365,cg27657283,0.1168531,3.68E-38,,,,,,,,...,,,,,,,,,,
1351366,cg27661264,0.3626379,3.68E-38,,,,,,,,...,,,,,,,,,,
1351367,cg27662379,0.03049343,3.68E-38,,,,,,,,...,,,,,,,,,,
1351368,cg27662877,0.05681526,3.68E-38,,,,,,,,...,,,,,,,,,,


In [46]:
gene_annotation.columns

Index(['ID', 'Name', 'IlmnStrand', 'AddressA_ID', 'AlleleA_ProbeSeq',
       'AddressB_ID', 'AlleleB_ProbeSeq', 'GenomeBuild', 'Chr', 'MapInfo',
       'Ploidy', 'Species', 'Source', 'SourceVersion', 'SourceStrand',
       'SourceSeq', 'TopGenomicSeq', 'Next_Base', 'Color_Channel',
       'TSS_Coordinate', 'Gene_Strand', 'Gene_ID', 'Symbol', 'Synonym',
       'Accession', 'GID', 'Annotation', 'Product', 'Distance_to_TSS',
       'CPG_ISLAND', 'CPG_ISLAND_LOCATIONS', 'MIR_CPG_ISLAND', 'RANGE_GB',
       'RANGE_START', 'RANGE_END', 'RANGE_STRAND', 'GB_ACC', 'ORF'],
      dtype='object')

In [47]:
gene_annotation['Symbol']

0          ATP2A1
1           SLMAP
2           MEOX2
3           HOXD3
4          ZNF398
            ...  
1351365       NaN
1351366       NaN
1351367       NaN
1351368       NaN
1351369       NaN
Name: Symbol, Length: 1351370, dtype: object

In [48]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'Symbol'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)

In [49]:
genetic_data

Unnamed: 0_level_0,GSM889444,GSM889445,GSM889446,GSM889447,GSM889448,GSM889449,GSM889450,GSM889451,GSM889452,GSM889453,...,GSM889482,GSM889483,GSM889484,GSM889485,GSM889486,GSM889487,GSM889488,GSM889489,GSM889490,GSM889491
Gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
39873,0.689279,0.426369,0.711409,0.107636,0.801374,0.333542,0.485201,0.689080,0.308070,0.429268,...,0.551247,0.516344,0.667079,0.695478,0.610036,0.669368,0.729632,0.660225,0.679691,0.660214
39874,0.064024,0.052674,0.061257,0.056175,0.056825,0.041682,0.082841,0.067413,0.070280,0.070044,...,0.114598,0.101350,0.094201,0.076669,0.067392,0.081730,0.053490,0.067670,0.055647,0.062393
39875,0.249730,0.438115,0.440215,0.441576,0.441734,0.247551,0.368339,0.409045,0.252633,0.203710,...,0.388586,0.430048,0.276230,0.268600,0.387995,0.307379,0.350970,0.283600,0.330482,0.310958
39877,0.057814,0.059389,0.070888,0.052026,0.048152,0.061054,0.091866,0.059028,0.071877,0.055394,...,0.092821,0.075314,0.072169,0.057845,0.062215,0.070240,0.045412,0.056299,0.053210,0.069765
39878,0.049203,0.059566,0.053505,0.048289,0.046017,0.044724,0.060468,0.046377,0.058192,0.047501,...,0.077212,0.072336,0.057878,0.049363,0.055631,0.059790,0.053111,0.054395,0.054666,0.049155
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
hCAP-D3,0.135123,0.167544,0.120367,0.074781,0.092117,0.226392,0.219823,0.273371,0.114962,0.117908,...,0.220372,0.249601,0.157433,0.137902,0.187125,0.139870,0.175214,0.196041,0.176399,0.191310
hCAP-H2,0.031693,0.037456,0.034644,0.031599,0.040484,0.028122,0.030194,0.026827,0.049008,0.043003,...,0.035374,0.036672,0.036632,0.036060,0.035811,0.034073,0.027685,0.033319,0.031560,0.029071
hfl-B5,0.031899,0.034486,0.035470,0.036843,0.034103,0.027757,0.037554,0.037559,0.037354,0.038608,...,0.036248,0.041694,0.043964,0.044827,0.039061,0.038787,0.036774,0.039394,0.043053,0.032806
mimitin,0.660241,0.706996,0.743767,0.604390,0.677875,0.775538,0.397100,0.621684,0.769524,0.732552,...,0.562751,0.561515,0.598229,0.598634,0.543056,0.612676,0.617838,0.579518,0.578914,0.560829


In [50]:
genetic_data = normalize_gene_symbols_in_index(genetic_data)

In [51]:
genetic_data

Unnamed: 0,GSM889444,GSM889445,GSM889446,GSM889447,GSM889448,GSM889449,GSM889450,GSM889451,GSM889452,GSM889453,...,GSM889482,GSM889483,GSM889484,GSM889485,GSM889486,GSM889487,GSM889488,GSM889489,GSM889490,GSM889491
A1BG,0.877513,0.790216,0.905854,0.926305,0.813567,0.935604,0.876431,0.900982,0.923077,0.916555,...,0.890737,0.775418,0.609464,0.909835,0.908542,0.770064,0.901580,0.836649,0.881275,0.840217
A2M,0.488576,0.616612,0.415186,0.450625,0.524828,0.255921,0.446866,0.419583,0.255085,0.402749,...,0.427695,0.473264,0.414352,0.301115,0.429467,0.366102,0.558921,0.368966,0.398305,0.449060
A2ML1,0.770778,0.822486,0.758853,0.844311,0.796473,0.622104,0.740103,0.874109,0.719518,0.821388,...,0.547426,0.732379,0.728085,0.755285,0.723343,0.746495,0.798775,0.800428,0.767000,0.759809
A4GALT,0.337080,0.623585,0.400251,0.422990,0.342717,0.503112,0.257814,0.359270,0.445166,0.171016,...,0.306420,0.382156,0.371309,0.335879,0.314983,0.302081,0.317394,0.357834,0.334070,0.320748
A4GNT,0.731916,0.760295,0.797876,0.610334,0.804778,0.720754,0.722959,0.862955,0.590652,0.819226,...,0.646292,0.768062,0.742481,0.785946,0.809538,0.741358,0.797128,0.759670,0.774546,0.740727
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZWINT,0.055689,0.054493,0.046531,0.050910,0.045079,0.039365,0.099023,0.051996,0.063345,0.051017,...,0.058510,0.056668,0.048632,0.050794,0.049920,0.060809,0.050605,0.049655,0.052283,0.048284
ZXDA,0.471868,0.394791,0.417707,0.432613,0.440079,0.430349,0.407978,0.394574,0.462678,0.421530,...,0.346426,0.405079,0.447421,0.421397,0.404630,0.384359,0.463203,0.400643,0.387745,0.397563
ZYX,0.028720,0.037781,0.039477,0.034675,0.032816,0.038400,0.133074,0.034445,0.042211,0.035093,...,0.061896,0.050344,0.037829,0.034190,0.034174,0.049541,0.040220,0.030688,0.031048,0.036458
ZZEF1,0.080601,0.086067,0.051746,0.062786,0.406175,0.085561,0.072221,0.066332,0.064198,0.065377,...,0.100295,0.118933,0.086164,0.100369,0.125893,0.076795,0.083921,0.100283,0.091715,0.083698


In [52]:
is_gene_availabe = True
trait_row = 2
age_row = None
gender_row = 0

trait_type = 'binary'

# Verify and use the functions generated by GPT

# 这个函数将组织类型（tissue type）转换为有关癫痫存在与否的二进制值。
# 它是基于特定的假设，即如果组织类型是“胰腺导管腺癌”（Pancreatic Ductal Adenocarcinoma），则认为癫痫存在（返回1）；否则，认为癫痫不存在（返回0）。
def convert_trait(tissue_type):
    """
    Convert tissue type to epilepsy presence (binary).
    Assuming epilepsy presence for 'Hippocampus' tissue.
    """
    if tissue_type == 'disease state: normal adrenal tissue':
        return 0  # Epilepsy present
    else:
        return 1  # Epilepsy not present


# 这个函数的目的是将年龄的字符串表示转换为一个连续的数值型表示。如果年龄未知（例如，标记为'n.a.'），则返回None。
# 函数尝试从传入的字符串中提取出一个整数作为年龄值。如果字符串的格式不符合预期，导致提取失败，同样返回None。
def convert_age(age_string):
    """
    Convert age string to a continuous numerical value.
    Unknown values are converted to None.
    """
    if age_string.lower() == 'n.a.':
        return None
    try:
        # Extract age as an integer from the string
        age = int(age_string.split(': ')[1])
        return age
    except (ValueError, IndexError):
        # In case of any format error or unexpected string structure
        return None


# 这个函数将性别的字符串表示转换为二进制值，其中“female”对应1，“male”对应0。如果性别未知或字符串不符合预期格式，则返回None。
# It sometimes maps 'female' to 0, and sometimes 1. Does it matter?
def convert_gender(gender_string):
    """
    Convert gender string to a binary value.
    'female' is represented as 1, 'male' as 0.
    Unknown values are converted to None.
    """
    if (gender_string.lower() == 'sex: female' or gender_string.lower() == 'sex: f' or gender_string.lower() == 'gender: female'):
        return 1
    elif (gender_string.lower() == 'sex: male' or gender_string.lower() == 'sex: m' or gender_string.lower() == 'gender: male') :  # changeed 
        return 0
    else:
        return None

In [53]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()

  clinical_df = clinical_df.applymap(convert_fn)
  clinical_df = clinical_df.applymap(convert_fn)


Unnamed: 0,GSM889444,GSM889445,GSM889446,GSM889447,GSM889448,GSM889449,GSM889450,GSM889451,GSM889452,GSM889453,...,GSM889482,GSM889483,GSM889484,GSM889485,GSM889486,GSM889487,GSM889488,GSM889489,GSM889490,GSM889491
Adrenocortical Cancer,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,0,0,0,0,0,0
Gender,1,1,1,1,1,0,1,0,0,1,...,1,1,1,0,0,1,0,1,0,1


In [54]:
merged_data = geo_merge_clinical_genetic_data(selected_clinical_data, genetic_data)
# The preprocessing runs through, which means is_available should be True
is_available = True

In [55]:
merged_data

Unnamed: 0,Adrenocortical Cancer,Gender,A1BG,A2M,A2ML1,A4GALT,A4GNT,AAAS,AACS,AADAC,...,ZSCAN2,ZSCAN4,ZSWIM1,ZW10,ZWILCH,ZWINT,ZXDA,ZYX,ZZEF1,ZZZ3
GSM889444,1.0,1.0,0.877513,0.488576,0.770778,0.33708,0.731916,0.152761,0.145945,0.883583,...,0.060651,0.861556,0.113921,0.039714,0.068936,0.055689,0.471868,0.02872,0.080601,0.050376
GSM889445,1.0,1.0,0.790216,0.616612,0.822486,0.623585,0.760295,0.15578,0.222403,0.916509,...,0.078385,0.597086,0.099773,0.04438,0.077357,0.054493,0.394791,0.037781,0.086067,0.04965
GSM889446,1.0,1.0,0.905854,0.415186,0.758853,0.400251,0.797876,0.135304,0.105125,0.853768,...,0.090061,0.935782,0.100392,0.050342,0.099744,0.046531,0.417707,0.039477,0.051746,0.061866
GSM889447,1.0,1.0,0.926305,0.450625,0.844311,0.42299,0.610334,0.123305,0.076724,0.69393,...,0.061925,0.113502,0.082969,0.042907,0.059283,0.05091,0.432613,0.034675,0.062786,0.06172
GSM889448,1.0,1.0,0.813567,0.524828,0.796473,0.342717,0.804778,0.091906,0.106987,0.864323,...,0.072116,0.926427,0.096655,0.049146,0.077697,0.045079,0.440079,0.032816,0.406175,0.057205
GSM889449,1.0,0.0,0.935604,0.255921,0.622104,0.503112,0.720754,0.104102,0.098242,0.771726,...,0.051829,0.922562,0.070781,0.041381,0.067504,0.039365,0.430349,0.0384,0.085561,0.049556
GSM889450,1.0,1.0,0.876431,0.446866,0.740103,0.257814,0.722959,0.163676,0.144049,0.902074,...,0.101449,0.865154,0.165601,0.050697,0.126225,0.099023,0.407978,0.133074,0.072221,0.104789
GSM889451,1.0,0.0,0.900982,0.419583,0.874109,0.35927,0.862955,0.117089,0.104022,0.932222,...,0.076846,0.939495,0.100747,0.043479,0.089807,0.051996,0.394574,0.034445,0.066332,0.054134
GSM889452,1.0,0.0,0.923077,0.255085,0.719518,0.445166,0.590652,0.124258,0.11054,0.822212,...,0.071035,0.911982,0.106951,0.056701,0.073805,0.063345,0.462678,0.042211,0.064198,0.061472
GSM889453,1.0,1.0,0.916555,0.402749,0.821388,0.171016,0.819226,0.106355,0.113525,0.896862,...,0.063157,0.915611,0.103295,0.045101,0.077925,0.051017,0.42153,0.035093,0.065377,0.068653


In [56]:
print(f"The merged dataset contains {len(merged_data)} samples.")
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

The merged dataset contains 48 samples.
For the feature 'Adrenocortical Cancer', the least common label is '0.0' with 6 occurrences. This represents 12.50% of the dataset.
The distribution of the feature 'Adrenocortical Cancer' in this dataset is fine.

For the feature 'Gender', the least common label is '0.0' with 16 occurrences. This represents 33.33% of the dataset.
The distribution of the feature 'Gender' in this dataset is fine.



False

In [57]:
if is_available:
    save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data, note='')
else:
    save_cohort_info(cohort, JSON_PATH, is_available)
merged_data.head()
if not is_trait_biased:
    merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)

In [58]:
# Finished: asked questions

cohort = accession_num = "GSE19776"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Adrenocortical Carcinoma Gene Expression Profiling"
!Series_summary	"This SuperSeries is composed of the SubSeries listed below."
!Series_overall_design	"Refer to individual Series"


Unnamed: 0,!Sample_geo_accession,GSM493251,GSM493252,GSM493253,GSM493254,GSM493255,GSM493256,GSM493257,GSM493258,GSM493259,...,GSM1094071,GSM1094072,GSM1094073,GSM1094074,GSM1094075,GSM1094076,GSM1094077,GSM1094078,GSM1094079,GSM1094080
0,!Sample_characteristics_ch1,Stage: NA,Stage: NA,Stage: NA,Stage: NA,Stage: 2,Stage: 4,Stage: 2,Stage: 4,Stage: 4,...,Stage: 1,Stage: Recurrence,Stage: 4,Stage: 2,Stage: Recurrence,Stage: Recurrence,Stage: Recurrence,Stage: 4,Stage: 4,Stage: NA
1,!Sample_characteristics_ch1,tumor grade: NA,tumor grade: NA,tumor grade: NA,tumor grade: NA,tumor grade: 3,tumor grade: 4,tumor grade: 4,tumor grade: 4,tumor grade: 4,...,tumor grade: 1,tumor grade: Unknown,tumor grade: Unknown,tumor grade: 4,tumor grade: 2,tumor grade: Unknown,tumor grade: Unknown,tumor grade: 4,tumor grade: Unknown,tumor grade: Unknown
2,!Sample_characteristics_ch1,functional: NA,functional: NA,functional: NA,functional: NA,functional: None,functional: None,functional: None,functional: Cushings,functional: Cushings,...,functional: None,functional: None,functional: Unknown,"functional: Cortisol, aldosterone, testosterone",functional: None,functional: aldosterone,functional: None,functional: None,functional: Unknown,functional: Unknown
3,!Sample_characteristics_ch1,gender: Unknown,gender: Unknown,gender: Unknown,gender: Unknown,gender: M,gender: F,gender: M,gender: M,gender: M,...,gender: M,gender: M,gender: F,gender: F,gender: F,gender: F,gender: M,gender: M,gender: F,gender: NA
4,!Sample_characteristics_ch1,age in years: Unknown,age in years: Unknown,age in years: Unknown,age in years: Unknown,age in years: 23.3,age in years: 56.5,age in years: 67.8,age in years: 72.1,age in years: 46.9,...,age in years: 57,age in years: 59,age in years: 59,age in years: 55,age in years: 51,age in years: 53,age in years: 69,age in years: 63,age in years: 28,age in years: NA
5,!Sample_characteristics_ch1,survival in years: NA,survival in years: NA,survival in years: NA,survival in years: NA,survival in years: 3,survival in years: 0.6,survival in years: 1.7,survival in years: 0.4,survival in years: 0.1,...,survival in years: 3,survival in years: 7.583,survival in years: Unknown,survival in years: 0.583,survival in years: 6,survival in years: 2.083,survival in years: 2.83,survival in years: 2.08,survival in years: Unknown,survival in years: NA
6,!Sample_characteristics_ch1,survival status: NA,survival status: NA,survival status: NA,survival status: NA,survival status: dead,survival status: dead,survival status: dead,survival status: dead,survival status: dead,...,survival status: alive,survival status: dead,survival status: Unknown,survival status: dead,survival status: alive,survival status: dead,survival status: dead,survival status: alive,survival status: Unknown,survival status: NA
7,!Sample_characteristics_ch1,tumor size in cm: NA,tumor size in cm: NA,tumor size in cm: NA,tumor size in cm: NA,tumor size in cm: 19,tumor size in cm: 9,tumor size in cm: 7.6,tumor size in cm: 9.5,tumor size in cm: 12,...,tumor size in cm: 4,tumor size in cm: 2.5,tumor size in cm: 10,tumor size in cm: 10.5,tumor size in cm: 14.5,tumor size in cm: 14.5,tumor size in cm: 7.8,tumor size in cm: 7.8,tumor size in cm: Unknown,tumor size in cm: Unknown
8,!Sample_characteristics_ch1,tumor weight in grams: NA,tumor weight in grams: NA,tumor weight in grams: NA,tumor weight in grams: NA,tumor weight in grams: 1100,tumor weight in grams: 190,tumor weight in grams: 150,tumor weight in grams: 175,tumor weight in grams: 235,...,tumor weight in grams: 39,tumor weight in grams: unknown,tumor weight in grams: 22,tumor weight in grams: 277,tumor weight in grams: 325,tumor weight in grams: 1243,tumor weight in grams: unknown,tumor weight in grams: 132,tumor weight in grams: unknown,tumor weight in grams: unknown
9,!Sample_characteristics_ch1,batch: 1,batch: 1,batch: 1,batch: 1,batch: 1,batch: 1,batch: 1,batch: 1,batch: 1,...,batch: 2,batch: 2,batch: 2,batch: 2,batch: 2,batch: 2,batch: 2,batch: 2,batch: 2,batch: 2


In [59]:
tumor_stage_row = clinical_data.iloc[1]
tumor_stage_row.unique()

array(['!Sample_characteristics_ch1', 'tumor grade: NA', 'tumor grade: 3',
       'tumor grade: 4', 'tumor grade: 2', 'tumor grade: 1',
       'tumor grade: Unknown'], dtype=object)

In [60]:
is_gene_availabe = True
trait_row = 1
age_row = 4
gender_row = 3

trait_type = 'binary'

# Verify and use the functions generated by GPT

# 这个函数将组织类型（tissue type）转换为有关癫痫存在与否的二进制值。
# 它是基于特定的假设，即如果组织类型是“胰腺导管腺癌”（Pancreatic Ductal Adenocarcinoma），则认为癫痫存在（返回1）；否则，认为癫痫不存在（返回0）。

def convert_trait(tumor_grade):
    if (tumor_grade == 'tumor grade: 2' or tumor_grade == 'tumor grade: 3' or tumor_grade == 'tumor grade: 4'):
        return 1  
    elif tumor_grade == 'tumor grade: 1':
        return 0  
    else:
        return None


# 这个函数的目的是将年龄的字符串表示转换为一个连续的数值型表示。如果年龄未知（例如，标记为'n.a.'），则返回None。
# 函数尝试从传入的字符串中提取出一个整数作为年龄值。如果字符串的格式不符合预期，导致提取失败，同样返回None。
def convert_age(age_string):
    """
    Convert age string to a continuous numerical value.
    Unknown values are converted to None.
    """
    if age_string.lower() == 'n.a.':
        return None
    try:
        # Extract age as an integer from the string
        age = int(age_string.split(': ')[1])
        return age
    except (ValueError, IndexError):
        # In case of any format error or unexpected string structure
        return None


# 这个函数将性别的字符串表示转换为二进制值，其中“female”对应1，“male”对应0。如果性别未知或字符串不符合预期格式，则返回None。
# It sometimes maps 'female' to 0, and sometimes 1. Does it matter?
def convert_gender(gender_string):
    """
    Convert gender string to a binary value.
    'female' is represented as 1, 'male' as 0.
    Unknown values are converted to None.
    """
    if (gender_string.lower() == 'sex: female' or gender_string.lower() == 'sex: f' or gender_string.lower() == 'gender: female' or gender_string.lower() == 'gender: f'):
        return 1
    elif (gender_string.lower() == 'sex: male' or gender_string.lower() == 'sex: m' or gender_string.lower() == 'gender: male' or gender_string.lower() == 'gender: m') :  # changeed 
        return 0
    else:
        return None

In [61]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()

  clinical_df = clinical_df.applymap(convert_fn)
  clinical_df = clinical_df.applymap(convert_fn)
  clinical_df = clinical_df.applymap(convert_fn)


Unnamed: 0,GSM493251,GSM493252,GSM493253,GSM493254,GSM493255,GSM493256,GSM493257,GSM493258,GSM493259,GSM493260,...,GSM1094071,GSM1094072,GSM1094073,GSM1094074,GSM1094075,GSM1094076,GSM1094077,GSM1094078,GSM1094079,GSM1094080
Adrenocortical Cancer,,,,,1.0,1.0,1.0,1.0,1.0,1.0,...,0,,,1,1,,,1,,
Age,,,,,,,,,,,...,57,59.0,59.0,55,51,53.0,69.0,63,28.0,
Gender,,,,,0.0,1.0,0.0,0.0,0.0,1.0,...,0,0.0,1.0,1,1,1.0,0.0,0,1.0,


In [62]:
genetic_data = get_genetic_data(matrix_file)
genetic_data

Unnamed: 0_level_0,GSM493251,GSM493252,GSM493253,GSM493254,GSM493255,GSM493256,GSM493257,GSM493258,GSM493259,GSM493260,...,GSM1094071,GSM1094072,GSM1094073,GSM1094074,GSM1094075,GSM1094076,GSM1094077,GSM1094078,GSM1094079,GSM1094080
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1007_s_at,101.10,48.86,100.83,104.84,566.78,418.37,499.31,432.83,212.42,336.85,...,139.79,35.72,79.18,154.74,166.63,29.67,72.48,10.21,151.86,137.73
1053_at,22.58,18.30,16.96,16.96,18.64,39.96,57.40,34.15,43.71,32.52,...,8.32,8.79,11.64,17.01,17.19,25.84,16.13,43.64,11.46,22.56
117_at,73.33,30.19,155.69,173.28,14.43,224.29,30.10,9.52,9.17,9.09,...,24.91,11.09,309.80,42.61,9.57,38.21,50.33,321.67,14.87,9.55
121_at,11.97,10.74,9.91,9.77,8.73,13.49,8.65,9.59,12.71,8.71,...,8.73,18.09,9.59,8.69,8.69,12.21,8.64,8.64,8.68,8.69
1255_g_at,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,...,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AFFX-ThrX-5_at,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,...,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50
AFFX-ThrX-M_at,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,...,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50
AFFX-TrpnX-3_at,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,...,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50
AFFX-TrpnX-5_at,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,...,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50


In [63]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

gene_annotation

{'ID': ['1007_s_at', '1053_at', '117_at', '121_at', '1255_g_at'], 'GB_ACC': ['U48705', 'M87338', 'X51757', 'X69699', 'L36861'], 'SPOT_ID': [nan, nan, nan, nan, nan], 'Species Scientific Name': ['Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens'], 'Annotation Date': ['Oct 6, 2014', 'Oct 6, 2014', 'Oct 6, 2014', 'Oct 6, 2014', 'Oct 6, 2014'], 'Sequence Type': ['Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence'], 'Sequence Source': ['Affymetrix Proprietary Database', 'GenBank', 'Affymetrix Proprietary Database', 'GenBank', 'Affymetrix Proprietary Database'], 'Target Description': ['U48705 /FEATURE=mRNA /DEFINITION=HSU48705 Human receptor tyrosine kinase DDR gene, complete cds', 'M87338 /FEATURE= /DEFINITION=HUMA1SBU Human replication factor C, 40-kDa subunit (A1) mRNA, complete cds', "X51757 /FEATURE=cds /DEFINITION=HSP70B Human heat-shock protein HSP70B' gene", 'X69699 /FEATURE= /DEFINITION=HSPAX8A H.sapiens

Unnamed: 0,ID,GB_ACC,SPOT_ID,Species Scientific Name,Annotation Date,Sequence Type,Sequence Source,Target Description,Representative Public ID,Gene Title,Gene Symbol,ENTREZ_GENE_ID,RefSeq Transcript ID,Gene Ontology Biological Process,Gene Ontology Cellular Component,Gene Ontology Molecular Function
0,1007_s_at,U48705,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,Affymetrix Proprietary Database,U48705 /FEATURE=mRNA /DEFINITION=HSU48705 Huma...,U48705,discoidin domain receptor tyrosine kinase 1 //...,DDR1 /// MIR4640,780 /// 100616237,NM_001202521 /// NM_001202522 /// NM_001202523...,0001558 // regulation of cell growth // inferr...,0005576 // extracellular region // inferred fr...,0000166 // nucleotide binding // inferred from...
1,1053_at,M87338,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,GenBank,M87338 /FEATURE= /DEFINITION=HUMA1SBU Human re...,M87338,"replication factor C (activator 1) 2, 40kDa",RFC2,5982,NM_001278791 /// NM_001278792 /// NM_001278793...,0000278 // mitotic cell cycle // traceable aut...,0005634 // nucleus // inferred from electronic...,0000166 // nucleotide binding // inferred from...
2,117_at,X51757,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,Affymetrix Proprietary Database,X51757 /FEATURE=cds /DEFINITION=HSP70B Human h...,X51757,heat shock 70kDa protein 6 (HSP70B'),HSPA6,3310,NM_002155,0000902 // cell morphogenesis // inferred from...,0005737 // cytoplasm // inferred from direct a...,0000166 // nucleotide binding // inferred from...
3,121_at,X69699,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,GenBank,X69699 /FEATURE= /DEFINITION=HSPAX8A H.sapiens...,X69699,paired box 8,PAX8,7849,NM_003466 /// NM_013951 /// NM_013952 /// NM_0...,0001655 // urogenital system development // in...,0005634 // nucleus // inferred from direct ass...,0000979 // RNA polymerase II core promoter seq...
4,1255_g_at,L36861,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,Affymetrix Proprietary Database,L36861 /FEATURE=expanded_cds /DEFINITION=HUMGC...,L36861,guanylate cyclase activator 1A (retina),GUCA1A,2978,NM_000409 /// XM_006715073,0007165 // signal transduction // non-traceabl...,0001750 // photoreceptor outer segment // infe...,0005509 // calcium ion binding // inferred fro...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2996730,AFFX-ThrX-5_at,5.5,,,,,,,,,,,,,,
2996731,AFFX-ThrX-M_at,5.5,,,,,,,,,,,,,,
2996732,AFFX-TrpnX-3_at,5.5,,,,,,,,,,,,,,
2996733,AFFX-TrpnX-5_at,5.5,,,,,,,,,,,,,,


In [64]:
gene_annotation.columns

Index(['ID', 'GB_ACC', 'SPOT_ID', 'Species Scientific Name', 'Annotation Date',
       'Sequence Type', 'Sequence Source', 'Target Description',
       'Representative Public ID', 'Gene Title', 'Gene Symbol',
       'ENTREZ_GENE_ID', 'RefSeq Transcript ID',
       'Gene Ontology Biological Process', 'Gene Ontology Cellular Component',
       'Gene Ontology Molecular Function'],
      dtype='object')

In [65]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'Gene Symbol'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)

In [66]:
genetic_data

Unnamed: 0_level_0,GSM493251,GSM493252,GSM493253,GSM493254,GSM493255,GSM493256,GSM493257,GSM493258,GSM493259,GSM493260,...,GSM1094071,GSM1094072,GSM1094073,GSM1094074,GSM1094075,GSM1094076,GSM1094077,GSM1094078,GSM1094079,GSM1094080
Gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ABCB4,1341.22,1980.88,1241.750,1284.08,455.520,15.020,8.490,2015.63,369.640,56.960,...,333.370,2458.34,180.17,736.710,460.15,878.59,460.98,8.25,161.38,526.21
ABCC6P1,7.83,10.19,13.070,11.64,76.920,11.640,23.720,13.40,5.810,23.780,...,5.770,9.76,7.01,5.500,8.54,6.73,7.71,10.09,11.64,11.64
ABCC6P2,7.83,10.19,13.070,11.64,76.920,11.640,23.720,13.40,5.810,23.780,...,5.770,9.76,7.01,5.500,8.54,6.73,7.71,10.09,11.64,11.64
ABCD1P2,5.50,5.83,5.500,5.50,5.500,5.500,5.500,5.50,5.500,5.500,...,5.500,5.50,5.50,5.500,5.50,5.50,5.50,5.50,5.50,5.50
AC078883.4,5.50,5.50,5.500,5.50,5.500,5.500,5.500,5.50,5.500,5.500,...,5.500,5.50,5.50,5.500,5.50,6.83,5.50,5.50,5.50,5.50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
abParts,9.04,9.04,9.040,9.04,35.270,6313.010,233.550,9.04,9.040,9.040,...,9.040,9.04,9.05,9.040,11.09,9.61,9.04,52.44,43.59,9.04
alpha,5.50,5.50,5.500,5.50,5.500,13.290,5.500,5.50,44.860,5.500,...,5.500,5.50,5.50,5.500,5.50,5.50,5.50,5.50,5.50,5.50
av27s1,5.50,5.72,5.640,5.50,5.500,5.500,5.500,5.50,5.500,5.500,...,5.500,5.50,5.50,5.500,5.50,5.50,5.50,5.50,5.50,5.50
hsa-let-7a-3,9.31,9.05,8.065,6.64,8.985,11.525,11.875,8.55,20.685,6.775,...,5.625,5.50,5.50,5.675,5.50,5.50,5.50,5.50,5.50,5.50


In [67]:
genetic_data = normalize_gene_symbols_in_index(genetic_data)

In [68]:
genetic_data

Unnamed: 0,GSM493251,GSM493252,GSM493253,GSM493254,GSM493255,GSM493256,GSM493257,GSM493258,GSM493259,GSM493260,...,GSM1094071,GSM1094072,GSM1094073,GSM1094074,GSM1094075,GSM1094076,GSM1094077,GSM1094078,GSM1094079,GSM1094080
A1BG,5.500000,5.640,5.500000,5.500000,5.500000,5.500000,18.570000,5.500000,5.500000,5.500000,...,5.500000,5.500000,6.410000,5.500000,5.500000,5.980000,5.500000,8.410000,5.500000,5.500000
A1BG-AS1,7.950000,12.980,13.340000,8.350000,8.840000,7.170000,10.140000,6.800000,6.540000,15.120000,...,10.540000,8.270000,16.220000,7.580000,9.610000,13.200000,6.500000,39.740000,7.510000,6.300000
A1CF,7.120000,5.500,6.805000,5.500000,5.500000,5.500000,5.930000,5.500000,6.555000,5.500000,...,5.500000,5.500000,5.500000,5.500000,5.500000,5.500000,5.500000,5.500000,5.500000,5.500000
A2M,3303.540000,1751.345,4304.375000,4467.760000,1225.825000,404.895000,276.810000,922.500000,184.890000,296.650000,...,503.735000,832.570000,1589.435000,924.025000,467.175000,984.970000,678.560000,249.025000,1012.500000,963.050000
A2M-AS1,13.250000,10.690,33.790000,23.860000,27.390000,9.860000,8.260000,49.400000,9.820000,14.360000,...,17.580000,24.270000,27.740000,16.180000,8.800000,16.650000,19.140000,9.390000,21.520000,18.600000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZYG11A,5.500000,5.500,5.500000,5.500000,5.500000,5.680000,41.110000,5.500000,5.500000,5.500000,...,5.500000,5.500000,5.500000,5.500000,5.500000,5.500000,5.500000,121.390000,5.500000,6.110000
ZYG11B,120.173333,135.650,144.273333,160.646667,192.403333,69.213333,164.600000,153.056667,118.900000,124.290000,...,52.633333,102.250000,73.350000,83.226667,84.833333,115.490000,64.146667,122.486667,66.980000,111.153333
ZYX,109.345000,103.390,115.915000,113.375000,33.390000,169.690000,97.565000,68.930000,105.195000,18.495000,...,18.360000,28.040000,86.410000,41.670000,31.345000,24.975000,43.145000,16.535000,16.925000,19.220000
ZZEF1,46.093333,37.000,36.803333,34.036667,53.216667,21.713333,37.506667,89.816667,77.906667,36.686667,...,58.003333,15.563333,8.223333,11.126667,9.826667,10.496667,13.913333,11.370000,11.766667,15.500000


In [69]:
merged_data = geo_merge_clinical_genetic_data(selected_clinical_data, genetic_data)
# The preprocessing runs through, which means is_available should be True
is_available = True

  merged_data = pd.concat([clinical_df, genetic_df], axis=0).T.dropna()


In [70]:
merged_data

Unnamed: 0,Adrenocortical Cancer,Age,Gender,A1BG,A1BG-AS1,A1CF,A2M,A2M-AS1,A2ML1,A2MP1,...,ZWILCH,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3
GSM493265,1.0,37.0,1.0,29.53,27.98,5.5,630.915,14.83,5.84,5.79,...,21.53,86.84,8.83,6.735,36.676667,5.5,98.686667,34.93,30.373333,71.49
GSM493270,1.0,58.0,1.0,5.5,6.77,5.5,968.32,34.95,5.5,5.79,...,30.695,90.66,13.94,11.26,33.223333,5.5,158.316667,70.065,61.263333,68.105
GSM1094056,0.0,20.0,1.0,5.5,9.68,5.5,380.405,9.32,5.5,5.84,...,48.445,191.94,12.93,9.78,41.331667,5.5,89.596667,15.265,16.493333,108.55
GSM1094057,1.0,68.0,1.0,5.5,7.95,5.5,1465.11,27.9,5.5,154.36,...,48.365,281.2,7.29,7.32,46.606667,5.5,50.603333,28.48,10.89,58.92
GSM1094060,0.0,32.0,1.0,5.5,7.95,5.5,680.91,14.43,5.5,12.75,...,47.235,28.79,10.14,11.98,45.533333,5.5,70.15,15.255,23.983333,101.235
GSM1094061,0.0,43.0,0.0,5.5,6.29,5.5,999.32,11.21,5.5,18.01,...,32.09,81.36,19.22,9.825,28.378333,5.56,76.246667,54.955,10.566667,67.585
GSM1094063,1.0,40.0,0.0,5.5,7.51,5.5,474.915,12.02,5.5,46.27,...,41.44,61.6,23.76,18.295,43.958333,5.5,87.03,23.975,10.793333,94.43
GSM1094066,0.0,27.0,1.0,10.71,22.0,5.945,365.945,11.49,5.56,60.55,...,35.47,65.86,32.64,25.8,55.596667,7.74,89.68,12.18,14.486667,95.235
GSM1094067,0.0,70.0,0.0,5.5,7.2,6.55,1191.28,23.69,5.5,9.22,...,33.985,37.59,24.58,22.49,27.738333,5.5,109.716667,42.44,10.15,61.765
GSM1094071,0.0,57.0,0.0,5.5,10.54,5.5,503.735,17.58,5.5,93.73,...,53.105,159.57,14.95,11.98,41.068333,5.5,52.633333,18.36,58.003333,80.365


In [71]:
print(f"The merged dataset contains {len(merged_data)} samples.")
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

The merged dataset contains 13 samples.
For the feature 'Adrenocortical Cancer', the least common label is '0.0' with 6 occurrences. This represents 46.15% of the dataset.
The distribution of the feature 'Adrenocortical Cancer' in this dataset is fine.

Quartiles for 'Age':
  25%: 37.0
  50% (Median): 51.0
  75%: 58.0
Min: 20.0
Max: 70.0
The distribution of the feature 'Age' in this dataset is fine.

For the feature 'Gender', the least common label is '0.0' with 5 occurrences. This represents 38.46% of the dataset.
The distribution of the feature 'Gender' in this dataset is fine.



False

In [72]:
if is_available:
    save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data, note='')
else:
    save_cohort_info(cohort, JSON_PATH, is_available)
merged_data.head()
if not is_trait_biased:
    merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)

In [73]:
# Stop: No trait

cohort = accession_num = "GSE35066"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"SNP array profiling of childhood adrenocortical tumours reveals distinct pathways of tumourigenesis and highlights candidate driver genes"
!Series_summary	"This SuperSeries is composed of the SubSeries listed below."
!Series_overall_design	"Refer to individual Series"


Unnamed: 0,!Sample_geo_accession,GSM610697,GSM610698,GSM610699,GSM610700,GSM610701,GSM610702
0,!Sample_characteristics_ch1,geographical origin: Brazil,geographical origin: Brazil,geographical origin: Brazil,geographical origin: Brazil,geographical origin: Brazil,geographical origin: Brazil
1,!Sample_characteristics_ch1,genger: Female,genger: Female,genger: Female,genger: Male,genger: Female,genger: Female
2,!Sample_characteristics_ch1,age (months): 42.96,histological type: Paired blood sample corresp...,histological type: Paired blood sample corresp...,histological type: Paired blood sample corresp...,histological type: Paired blood sample corresp...,histological type: Paired blood sample corresp...
3,!Sample_characteristics_ch1,histological type: Paired blood sample corresp...,tnm: --,tnm: --,tnm: --,tnm: --,tnm: --
4,!Sample_characteristics_ch1,tnm: --,tp53 mutation: --,tp53 mutation: --,tp53 mutation: --,tp53 mutation: --,tp53 mutation: --
5,!Sample_characteristics_ch1,tp53 mutation: --,virilization: --,virilization: --,virilization: --,virilization: --,virilization: --
6,!Sample_characteristics_ch1,virilization: --,cushing syndrome: --,cushing syndrome: --,cushing syndrome: --,cushing syndrome: --,cushing syndrome: --
7,!Sample_characteristics_ch1,cushing syndrome: --,treatment: --,treatment: --,treatment: --,treatment: --,treatment: --
8,!Sample_characteristics_ch1,treatment: --,follow-up (months): --,follow-up (months): --,follow-up (months): --,follow-up (months): --,follow-up (months): --
9,!Sample_characteristics_ch1,follow-up (months): --,,,,,


In [74]:
# Stop: No gene symbol (Pure miRNA data is not suitable)
cohort = accession_num = "GSE19856"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Study of miRNA expression in childhood adrenocortical tumors"
!Series_summary	"We studied the miRNA expression profile of a series of childhood adrenocortical tumors (ACT) and age-matched normal adrenal samples"
!Series_overall_design	"25 ACT - 5 normal"


Unnamed: 0,!Sample_geo_accession,GSM495879,GSM495880,GSM495881,GSM495882,GSM495883,GSM495884,GSM495885,GSM495886,GSM495887,...,GSM495899,GSM495900,GSM495901,GSM495902,GSM495903,GSM495904,GSM495905,GSM495906,GSM495907,GSM495908
0,!Sample_characteristics_ch1,tissue: Adrenal cortex,tissue: Adrenal cortex,tissue: Adrenal cortex,tissue: Adrenal cortex,tissue: Adrenal cortex,tissue: Adrenal cortex,tissue: Adrenal cortex,tissue: Adrenal cortex,tissue: Adrenal cortex,...,tissue: Adrenal cortex,tissue: Adrenal cortex,tissue: Adrenal cortex,tissue: Adrenal cortex,tissue: Adrenal cortex,tissue: Adrenal cortex,tissue: Adrenal cortex,tissue: Adrenal cortex,tissue: Adrenal cortex,tissue: Adrenal cortex
1,!Sample_characteristics_ch1,disease state: adrenocortical tumor,disease state: adrenocortical tumor,disease state: adrenocortical tumor,disease state: adrenocortical tumor,disease state: adrenocortical tumor,disease state: adrenocortical tumor,disease state: adrenocortical tumor,disease state: adrenocortical tumor,disease state: adrenocortical tumor,...,disease state: adrenocortical tumor,disease state: adrenocortical tumor,disease state: adrenocortical tumor,disease state: adrenocortical tumor,disease state: adrenocortical tumor,disease state: normal,disease state: normal,disease state: normal,disease state: normal,disease state: normal


In [75]:
tumor_stage_row = clinical_data.iloc[1]
tumor_stage_row.unique()

array(['!Sample_characteristics_ch1',
       'disease state: adrenocortical tumor', 'disease state: normal'],
      dtype=object)

In [76]:
is_gene_availabe = True
trait_row = 1
age_row = None
gender_row = None

trait_type = 'binary'

# Verify and use the functions generated by GPT

# 这个函数将组织类型（tissue type）转换为有关癫痫存在与否的二进制值。
# 它是基于特定的假设，即如果组织类型是“胰腺导管腺癌”（Pancreatic Ductal Adenocarcinoma），则认为癫痫存在（返回1）；否则，认为癫痫不存在（返回0）。
def convert_trait(tissue_type):
    """
    Convert tissue type to epilepsy presence (binary).
    Assuming epilepsy presence for 'Hippocampus' tissue.
    """
    if tissue_type == 'disease state: adrenocortical tumor':
        return 1  # Epilepsy present
    else:
        return 0  # Epilepsy not present


# 这个函数的目的是将年龄的字符串表示转换为一个连续的数值型表示。如果年龄未知（例如，标记为'n.a.'），则返回None。
# 函数尝试从传入的字符串中提取出一个整数作为年龄值。如果字符串的格式不符合预期，导致提取失败，同样返回None。
def convert_age(age_string):
    """
    Convert age string to a continuous numerical value.
    Unknown values are converted to None.
    """
    if age_string.lower() == 'n.a.':
        return None
    try:
        # Extract age as an integer from the string
        age = int(age_string.split(': ')[1])
        return age
    except (ValueError, IndexError):
        # In case of any format error or unexpected string structure
        return None


# 这个函数将性别的字符串表示转换为二进制值，其中“female”对应1，“male”对应0。如果性别未知或字符串不符合预期格式，则返回None。
# It sometimes maps 'female' to 0, and sometimes 1. Does it matter?
def convert_gender(gender_string):
    """
    Convert gender string to a binary value.
    'female' is represented as 1, 'male' as 0.
    Unknown values are converted to None.
    """
    if (gender_string.lower() == 'sex: female' or gender_string.lower() == 'sex: f' or gender_string.lower() == 'gender: female'):
        return 1
    elif (gender_string.lower() == 'sex: male' or gender_string.lower() == 'sex: m' or gender_string.lower() == 'gender: male') :  # changeed 
        return 0
    else:
        return None

In [77]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()

  clinical_df = clinical_df.applymap(convert_fn)


Unnamed: 0,GSM495879,GSM495880,GSM495881,GSM495882,GSM495883,GSM495884,GSM495885,GSM495886,GSM495887,GSM495888,...,GSM495899,GSM495900,GSM495901,GSM495902,GSM495903,GSM495904,GSM495905,GSM495906,GSM495907,GSM495908
Adrenocortical Cancer,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,0,0,0,0,0


In [78]:
genetic_data = get_genetic_data(matrix_file)
genetic_data

Unnamed: 0_level_0,GSM495879,GSM495880,GSM495881,GSM495882,GSM495883,GSM495884,GSM495885,GSM495886,GSM495887,GSM495888,...,GSM495899,GSM495900,GSM495901,GSM495902,GSM495903,GSM495904,GSM495905,GSM495906,GSM495907,GSM495908
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
DarkCorner,-7.21640,-6.898350,-9.59862,-4.718430,-9.114260,-5.497120,-8.27575,-0.418281,-7.752440,5.263860,...,-6.673030,-10.034100,-2.87338,-6.890780,-13.643800,-11.24230,-14.145600,-13.803400,-11.586800,-11.498000
dmr_285,-3.24370,0.029345,-2.96027,-3.947260,-2.508740,-2.449810,-2.76578,-2.222550,-2.485020,0.194956,...,-0.622680,-3.282510,-1.49847,-3.208890,-1.656840,-2.42992,-2.638710,-0.923110,-1.950690,1.098530
dmr_3,-4.14763,-2.179640,-1.86509,-2.264490,-1.913330,-3.938300,-2.03988,-0.926865,1.122930,-0.061020,...,0.368513,1.808110,2.45639,0.092283,-0.197860,1.94856,0.776828,4.097750,0.194788,2.004070
dmr_308,-2.56221,0.432007,-1.16884,0.419284,-1.334440,-2.821480,1.15786,2.676690,-0.005938,0.038004,...,1.137990,0.594746,2.45122,0.424864,1.923710,1.07483,0.986610,0.093465,0.045241,2.145100
dmr_316,7.50020,-0.016628,1.01214,14.927100,4.129020,1.911680,13.18620,1.360270,8.352440,0.329248,...,-1.084340,6.393690,2.88845,25.426200,22.933700,15.93210,11.058300,6.626770,12.788000,2.838900
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
NC2_00092197,-11.77950,-2.961070,-6.99260,-7.504920,-4.824750,-3.665080,-2.74892,-4.332460,-4.197160,-2.492510,...,-6.170910,-11.198800,-4.76573,-8.996460,-11.898900,-6.75238,-8.517520,-3.350240,-5.480380,-3.448570
NC2_00106057,-22.22970,-8.121470,-9.95687,-9.213080,-1.399430,-5.819780,-3.07508,-2.980690,-4.411350,-4.824840,...,-8.517060,-12.004400,-7.94205,-8.859080,-16.778100,-18.80480,-13.504700,-15.415900,-12.599400,-9.197450
NC2_00122731,-18.58320,-7.004070,-10.09400,-8.986480,-3.159230,-5.798770,-3.49107,-6.444480,-9.071440,-9.405050,...,-7.791460,-12.514200,-7.09567,-10.256600,-14.578500,-16.57800,-13.171400,-12.330000,-14.361300,-9.111960
NegativeControl,-166.04000,-118.832000,-130.24200,-124.836000,-107.220000,-124.721000,-111.83800,-113.765000,-116.104000,-110.328000,...,-103.125000,-127.329000,-98.16780,-124.172000,-141.184000,-137.20100,-128.727000,-134.198000,-123.992000,-112.417000


In [79]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

gene_annotation

{'ID': ['hsa-let-7a', 'hsa-let-7a*', 'hsa-let-7b', 'hsa-let-7b*', 'hsa-let-7c'], 'ControlType': ['0', '0', '0', '0', '0'], 'ORGANISM': ['Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens'], 'miRNA_ID': ['hsa-let-7a', 'hsa-let-7a*', 'hsa-let-7b', 'hsa-let-7b*', 'hsa-let-7c'], 'SPOT_ID': [nan, nan, nan, nan, nan], 'SPOT_ID.1': [nan, nan, nan, nan, nan]}

    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    ['DarkCorner', 'dmr_285', 'dmr_3', 'dmr_308', 'dmr_316', 'dmr_31a', 'dmr_6', 'ebv-miR-BART1-3p', 'ebv-miR-BART1-5p', 'ebv-miR-BART10', 'ebv-miR-BART10*', 'ebv-miR-BART11-3p', 'ebv-miR-BART11-5p', 'ebv-miR-BART12', 'ebv-miR-BART13', 'ebv-miR-BART13*', 'ebv-miR-BART14', 'ebv-miR-BART14*', 'ebv-miR-BART15', 'ebv-miR-BART16']
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database,

Unnamed: 0,ID,ControlType,ORGANISM,miRNA_ID,SPOT_ID,SPOT_ID.1
0,hsa-let-7a,0,Homo sapiens,hsa-let-7a,,
1,hsa-let-7a*,0,Homo sapiens,hsa-let-7a*,,
2,hsa-let-7b,0,Homo sapiens,hsa-let-7b,,
3,hsa-let-7b*,0,Homo sapiens,hsa-let-7b*,,
4,hsa-let-7c,0,Homo sapiens,hsa-let-7c,,
...,...,...,...,...,...,...
25536,kshv-miR-K12-8,26.114,,,,
25537,kshv-miR-K12-9,-0.642957,,,,
25538,kshv-miR-K12-9*,0.699658,,,,
25539,miRNABrightCorner30,106091,,,,


In [80]:
gene_annotation.columns

Index(['ID', 'ControlType', 'ORGANISM', 'miRNA_ID', 'SPOT_ID', 'SPOT_ID.1'], dtype='object')

In [81]:
# Stop: NO trait

cohort = accession_num = "GSE32206"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Identity by descent mapping of founder mutations in cancer using high-resolution tumor SNP data"
!Series_summary	"We present a computational tool, FounderTracker, for discovering founder mutations in cancer, based on the detection of significantly conserved haplotypes in tumor SNP profiles. We demonstrate the relevance of the approach by identifying founder mutations in two different cancers, and we show with simulated data that FounderTracker can detect rare founder mutations with high power and negligible false discovery rate. FounderTracker is a powerful tool for discovering novel founder mutations that may explain part of the ""missing"" heritability in cancer."
!Series_summary	""
!Series_summary	"This SuperSeries is composed of the SubSeries listed below."
!Series_overall_design	"Refer to individual Series."


Unnamed: 0,!Sample_geo_accession,GSM610697,GSM610698,GSM610699,GSM610700,GSM610701,GSM610702
0,!Sample_characteristics_ch1,geographical origin: Brazil,geographical origin: Brazil,geographical origin: Brazil,geographical origin: Brazil,geographical origin: Brazil,geographical origin: Brazil
1,!Sample_characteristics_ch1,genger: Female,genger: Female,genger: Female,genger: Male,genger: Female,genger: Female
2,!Sample_characteristics_ch1,age (months): 42.96,histological type: Paired blood sample corresp...,histological type: Paired blood sample corresp...,histological type: Paired blood sample corresp...,histological type: Paired blood sample corresp...,histological type: Paired blood sample corresp...
3,!Sample_characteristics_ch1,histological type: Paired blood sample corresp...,tnm: --,tnm: --,tnm: --,tnm: --,tnm: --
4,!Sample_characteristics_ch1,tnm: --,tp53 mutation: --,tp53 mutation: --,tp53 mutation: --,tp53 mutation: --,tp53 mutation: --
5,!Sample_characteristics_ch1,tp53 mutation: --,virilization: --,virilization: --,virilization: --,virilization: --,virilization: --
6,!Sample_characteristics_ch1,virilization: --,cushing syndrome: --,cushing syndrome: --,cushing syndrome: --,cushing syndrome: --,cushing syndrome: --
7,!Sample_characteristics_ch1,cushing syndrome: --,treatment: --,treatment: --,treatment: --,treatment: --,treatment: --
8,!Sample_characteristics_ch1,treatment: --,follow-up (months): --,follow-up (months): --,follow-up (months): --,follow-up (months): --,follow-up (months): --
9,!Sample_characteristics_ch1,follow-up (months): --,,,,,


In [82]:
# Finished

cohort = accession_num = "GSE75415"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Gene exrpression profiling of childhood adrenocortical tumors"
!Series_summary	"Pediatric adrenocortical tumors (ACT) are rare and often fatal malignancies; little is known regarding their etiology and biology. To provide additional insight into the nature of ACT, we determined the gene expression profiles of 24 pediatric tumors (five adenomas, 18 carcinomas, and one undetermined) and seven normal adrenal glands. Distinct patterns of gene expression, validated by quantitative real-time PCR and Western blot analysis, were identified that distinguish normal adrenal cortex from tumor. Differences in gene expression were also identified between adrenocortical adenomas and carcinomas. In addition, pediatric adrenocortical carcinomas were found to share similar patterns of gene expression when compared with those published for adult ACT. This study represents the first microarray analysis of childhood ACT. Our findings lay the groundwork for establishing gene expression profil

Unnamed: 0,!Sample_geo_accession,GSM1954726,GSM1954727,GSM1954728,GSM1954729,GSM1954730,GSM1954731,GSM1954732,GSM1954733,GSM1954734,...,GSM1954747,GSM1954748,GSM1954749,GSM1954750,GSM1954751,GSM1954752,GSM1954753,GSM1954754,GSM1954755,GSM1954756
0,!Sample_characteristics_ch1,gender: female,gender: female,gender: female,gender: female,gender: female,gender: male,gender: female,gender: male,gender: female,...,gender: female,gender: female,gender: unknown,gender: unknown,gender: unknown,gender: unknown,gender: unknown,gender: unknown,gender: unknown,gender: unknown
1,!Sample_characteristics_ch1,histologic type: adrenocortical adenoma,histologic type: adrenocortical adenoma,histologic type: adrenocortical adenoma,histologic type: adrenocortical adenoma,histologic type: adrenocortical adenoma,histologic type: adrenocortical carcinoma,histologic type: adrenocortical carcinoma,histologic type: adrenocortical carcinoma,histologic type: adrenocortical carcinoma,...,histologic type: adrenocortical carcinoma,histologic type: adrenocortical carcinoma,histologic type: unknown,histologic type: normal,histologic type: normal,histologic type: normal,histologic type: normal,histologic type: normal,histologic type: normal,histologic type: normal
2,!Sample_characteristics_ch1,tumor stage: not staged,tumor stage: not staged,tumor stage: not staged,tumor stage: not staged,tumor stage: not staged,tumor stage: 4,tumor stage: 2,tumor stage: 3,tumor stage: 1,...,tumor stage: 4,tumor stage: 3,tumor stage: unknown,tumor stage: not applicable,tumor stage: not applicable,tumor stage: not applicable,tumor stage: not applicable,tumor stage: not applicable,tumor stage: not applicable,tumor stage: not applicable


In [83]:
tumor_stage_row = clinical_data.iloc[1]
tumor_stage_row.unique()

array(['!Sample_characteristics_ch1',
       'histologic type: adrenocortical adenoma',
       'histologic type: adrenocortical carcinoma',
       'histologic type: unknown', 'histologic type: normal'],
      dtype=object)

In [84]:
is_gene_availabe = True
trait_row = 1
age_row = None
gender_row = 0

trait_type = 'binary'

# Verify and use the functions generated by GPT

# 这个函数将组织类型（tissue type）转换为有关癫痫存在与否的二进制值。
# 它是基于特定的假设，即如果组织类型是“胰腺导管腺癌”（Pancreatic Ductal Adenocarcinoma），则认为癫痫存在（返回1）；否则，认为癫痫不存在（返回0）。
def convert_trait(tissue_type):
    """
    Convert tissue type to epilepsy presence (binary).
    Assuming epilepsy presence for 'Hippocampus' tissue.
    """
    if tissue_type == 'histologic type: normal':
        return 0  # Epilepsy present
    elif tissue_type == 'histologic type: unknown':
        return None  # Epilepsy not present
    else:
        return 1


# 这个函数的目的是将年龄的字符串表示转换为一个连续的数值型表示。如果年龄未知（例如，标记为'n.a.'），则返回None。
# 函数尝试从传入的字符串中提取出一个整数作为年龄值。如果字符串的格式不符合预期，导致提取失败，同样返回None。
def convert_age(age_string):
    """
    Convert age string to a continuous numerical value.
    Unknown values are converted to None.
    """
    if age_string.lower() == 'n.a.':
        return None
    try:
        # Extract age as an integer from the string
        age = int(age_string.split(': ')[1])
        return age
    except (ValueError, IndexError):
        # In case of any format error or unexpected string structure
        return None


# 这个函数将性别的字符串表示转换为二进制值，其中“female”对应1，“male”对应0。如果性别未知或字符串不符合预期格式，则返回None。
# It sometimes maps 'female' to 0, and sometimes 1. Does it matter?
def convert_gender(gender_string):
    """
    Convert gender string to a binary value.
    'female' is represented as 1, 'male' as 0.
    Unknown values are converted to None.
    """
    if (gender_string.lower() == 'sex: female' or gender_string.lower() == 'sex: f' or gender_string.lower() == 'gender: female'):
        return 1
    elif (gender_string.lower() == 'sex: male' or gender_string.lower() == 'sex: m' or gender_string.lower() == 'gender: male') :  # changeed 
        return 0
    else:
        return None

In [85]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()

  clinical_df = clinical_df.applymap(convert_fn)
  clinical_df = clinical_df.applymap(convert_fn)


Unnamed: 0,GSM1954726,GSM1954727,GSM1954728,GSM1954729,GSM1954730,GSM1954731,GSM1954732,GSM1954733,GSM1954734,GSM1954735,...,GSM1954747,GSM1954748,GSM1954749,GSM1954750,GSM1954751,GSM1954752,GSM1954753,GSM1954754,GSM1954755,GSM1954756
Adrenocortical Cancer,1,1,1,1,1,1,1,1,1,1,...,1,1,,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Gender,1,1,1,1,1,0,1,0,1,0,...,1,1,,,,,,,,


In [86]:
genetic_data = get_genetic_data(matrix_file)
genetic_data

Unnamed: 0_level_0,GSM1954726,GSM1954727,GSM1954728,GSM1954729,GSM1954730,GSM1954731,GSM1954732,GSM1954733,GSM1954734,GSM1954735,...,GSM1954747,GSM1954748,GSM1954749,GSM1954750,GSM1954751,GSM1954752,GSM1954753,GSM1954754,GSM1954755,GSM1954756
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1007_s_at,8001.9,2372.1,2968.4,3541.5,1470.0,3469.1,4500.6,1744.5,3630.4,5535.4,...,3939.5,2257.8,3697.5,1613.3,1868.5,2791.8,2650.7,2430.2,2373.9,2853.3
1053_at,243.7,204.3,42.3,165.6,113.3,221.4,259.2,415.3,192.2,224.9,...,222.0,181.5,127.7,186.1,55.2,82.7,87.5,103.0,67.4,141.4
117_at,475.3,255.5,158.8,135.8,134.2,1113.7,392.2,166.6,124.8,227.2,...,281.0,1527.1,170.7,205.8,160.9,331.1,96.5,218.1,314.5,124.1
121_at,961.4,1164.1,1469.3,830.7,1145.5,1762.3,1189.0,976.4,808.3,1267.2,...,1751.6,1923.4,946.1,1332.6,1412.8,1611.8,2005.8,1548.6,2192.5,1804.3
1255_g_at,83.7,28.6,15.0,67.3,65.4,105.8,168.0,31.1,73.7,158.4,...,122.4,176.8,77.4,96.1,192.5,14.6,120.6,311.7,225.9,94.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AFFX-ThrX-5_at,10.1,15.7,71.6,17.3,29.5,11.9,23.7,13.8,11.3,72.7,...,10.5,30.8,18.1,8.5,170.8,42.4,31.5,26.3,18.3,34.9
AFFX-ThrX-M_at,146.3,95.0,80.4,13.0,2.3,37.5,176.1,5.2,5.2,94.8,...,130.0,37.4,56.3,64.2,28.4,62.1,73.8,29.1,79.0,56.5
AFFX-TrpnX-3_at,118.6,13.2,10.2,6.0,52.6,25.7,58.0,3.7,3.4,45.4,...,120.3,38.3,40.5,2.9,10.3,8.2,5.0,9.3,58.3,5.0
AFFX-TrpnX-5_at,73.0,18.0,63.3,5.9,27.4,11.8,74.6,10.1,10.3,75.1,...,20.4,108.6,59.1,4.5,26.4,19.5,11.2,65.8,38.3,129.2


In [87]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

gene_annotation

{'ID': ['1007_s_at', '1053_at', '117_at', '121_at', '1255_g_at'], 'GB_ACC': ['U48705', 'M87338', 'X51757', 'X69699', 'L36861'], 'SPOT_ID': [nan, nan, nan, nan, nan], 'Species Scientific Name': ['Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens'], 'Annotation Date': ['Oct 6, 2014', 'Oct 6, 2014', 'Oct 6, 2014', 'Oct 6, 2014', 'Oct 6, 2014'], 'Sequence Type': ['Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence'], 'Sequence Source': ['Affymetrix Proprietary Database', 'GenBank', 'Affymetrix Proprietary Database', 'GenBank', 'Affymetrix Proprietary Database'], 'Target Description': ['U48705 /FEATURE=mRNA /DEFINITION=HSU48705 Human receptor tyrosine kinase DDR gene, complete cds', 'M87338 /FEATURE= /DEFINITION=HUMA1SBU Human replication factor C, 40-kDa subunit (A1) mRNA, complete cds', "X51757 /FEATURE=cds /DEFINITION=HSP70B Human heat-shock protein HSP70B' gene", 'X69699 /FEATURE= /DEFINITION=HSPAX8A H.sapiens

Unnamed: 0,ID,GB_ACC,SPOT_ID,Species Scientific Name,Annotation Date,Sequence Type,Sequence Source,Target Description,Representative Public ID,Gene Title,Gene Symbol,ENTREZ_GENE_ID,RefSeq Transcript ID,Gene Ontology Biological Process,Gene Ontology Cellular Component,Gene Ontology Molecular Function
0,1007_s_at,U48705,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,Affymetrix Proprietary Database,U48705 /FEATURE=mRNA /DEFINITION=HSU48705 Huma...,U48705,discoidin domain receptor tyrosine kinase 1 //...,DDR1 /// MIR4640,780 /// 100616237,NM_001202521 /// NM_001202522 /// NM_001202523...,0001558 // regulation of cell growth // inferr...,0005576 // extracellular region // inferred fr...,0000166 // nucleotide binding // inferred from...
1,1053_at,M87338,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,GenBank,M87338 /FEATURE= /DEFINITION=HUMA1SBU Human re...,M87338,"replication factor C (activator 1) 2, 40kDa",RFC2,5982,NM_001278791 /// NM_001278792 /// NM_001278793...,0000278 // mitotic cell cycle // traceable aut...,0005634 // nucleus // inferred from electronic...,0000166 // nucleotide binding // inferred from...
2,117_at,X51757,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,Affymetrix Proprietary Database,X51757 /FEATURE=cds /DEFINITION=HSP70B Human h...,X51757,heat shock 70kDa protein 6 (HSP70B'),HSPA6,3310,NM_002155,0000902 // cell morphogenesis // inferred from...,0005737 // cytoplasm // inferred from direct a...,0000166 // nucleotide binding // inferred from...
3,121_at,X69699,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,GenBank,X69699 /FEATURE= /DEFINITION=HSPAX8A H.sapiens...,X69699,paired box 8,PAX8,7849,NM_003466 /// NM_013951 /// NM_013952 /// NM_0...,0001655 // urogenital system development // in...,0005634 // nucleus // inferred from direct ass...,0000979 // RNA polymerase II core promoter seq...
4,1255_g_at,L36861,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,Affymetrix Proprietary Database,L36861 /FEATURE=expanded_cds /DEFINITION=HUMGC...,L36861,guanylate cyclase activator 1A (retina),GUCA1A,2978,NM_000409 /// XM_006715073,0007165 // signal transduction // non-traceabl...,0001750 // photoreceptor outer segment // infe...,0005509 // calcium ion binding // inferred fro...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
713082,222380_s_at,336.9,P,0.008057,,,,,,,,,,,,
713083,222381_at,79.8,A,0.466064,,,,,,,,,,,,
713084,222382_x_at,10,A,0.696289,,,,,,,,,,,,
713085,222383_s_at,25.6,A,0.870361,,,,,,,,,,,,


In [88]:
gene_annotation.columns

Index(['ID', 'GB_ACC', 'SPOT_ID', 'Species Scientific Name', 'Annotation Date',
       'Sequence Type', 'Sequence Source', 'Target Description',
       'Representative Public ID', 'Gene Title', 'Gene Symbol',
       'ENTREZ_GENE_ID', 'RefSeq Transcript ID',
       'Gene Ontology Biological Process', 'Gene Ontology Cellular Component',
       'Gene Ontology Molecular Function'],
      dtype='object')

In [89]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'Gene Symbol'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)

In [90]:
genetic_data

Unnamed: 0_level_0,GSM1954726,GSM1954727,GSM1954728,GSM1954729,GSM1954730,GSM1954731,GSM1954732,GSM1954733,GSM1954734,GSM1954735,...,GSM1954747,GSM1954748,GSM1954749,GSM1954750,GSM1954751,GSM1954752,GSM1954753,GSM1954754,GSM1954755,GSM1954756
Gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ABCB4,147.500000,695.900000,466.20,1479.40,1155.70,676.0,1514.800000,35.9,202.900000,993.100000,...,405.800000,1016.00,141.5,6708.400000,1930.100000,1995.0,2711.300000,3506.00,4596.500000,2504.1
ABCC6P1,957.900000,384.200000,320.80,218.10,445.40,238.3,387.700000,495.9,154.200000,241.300000,...,67.700000,245.70,308.5,118.100000,424.400000,235.7,494.700000,563.50,501.000000,82.0
ABCC6P2,957.900000,384.200000,320.80,218.10,445.40,238.3,387.700000,495.9,154.200000,241.300000,...,67.700000,245.70,308.5,118.100000,424.400000,235.7,494.700000,563.50,501.000000,82.0
ABCD1P2,171.600000,30.100000,26.20,38.70,34.00,34.8,59.300000,82.4,30.200000,176.500000,...,38.700000,78.70,36.6,197.700000,41.700000,65.8,33.200000,101.40,78.100000,72.9
ACOT2,907.700000,313.500000,415.40,465.50,587.40,403.1,219.600000,256.4,305.500000,361.700000,...,205.000000,299.00,266.3,1301.300000,809.900000,711.1,777.500000,387.70,403.400000,916.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZYX,298.400000,414.150000,866.25,414.35,758.95,509.8,616.750000,516.8,438.100000,596.250000,...,709.850000,408.95,294.8,1057.650000,1041.700000,1492.7,783.600000,793.45,1040.650000,994.0
ZZEF1,223.833333,59.333333,228.60,138.20,223.80,148.4,336.166667,134.5,160.166667,193.266667,...,132.533333,253.60,270.0,149.333333,364.433333,257.6,231.633333,213.70,136.933333,210.3
ZZZ3,835.000000,525.000000,420.60,427.30,655.40,643.8,580.900000,807.2,741.900000,639.900000,...,723.700000,749.90,528.4,275.900000,425.200000,745.9,541.300000,312.50,736.500000,658.0
abParts,30.800000,29.900000,57.50,12.10,33.60,19.6,25.600000,40.2,11.900000,10.100000,...,71.200000,28.40,58.0,16.800000,22.800000,31.8,22.600000,53.10,29.500000,133.6


In [91]:
genetic_data = normalize_gene_symbols_in_index(genetic_data)

In [92]:
genetic_data

Unnamed: 0,GSM1954726,GSM1954727,GSM1954728,GSM1954729,GSM1954730,GSM1954731,GSM1954732,GSM1954733,GSM1954734,GSM1954735,...,GSM1954747,GSM1954748,GSM1954749,GSM1954750,GSM1954751,GSM1954752,GSM1954753,GSM1954754,GSM1954755,GSM1954756
A1CF,435.000000,344.600000,540.80,268.10,435.40,410.7,514.800000,279.2,275.300000,418.300000,...,441.300000,689.40,401.6,381.400000,323.100000,560.4,651.200000,1040.30,655.800000,872.7
A2M,4469.000000,7595.800000,5954.90,6257.10,6142.20,4218.6,2760.300000,3700.1,4408.800000,3862.600000,...,2561.200000,2880.80,3540.0,7180.500000,11459.700000,5347.7,4320.200000,5798.30,5584.700000,9061.2
A4GALT,42.600000,61.300000,101.80,140.70,142.60,29.3,279.300000,77.1,55.400000,63.300000,...,76.000000,190.00,208.3,37.400000,97.800000,220.8,88.700000,84.70,28.000000,92.7
A4GNT,352.800000,220.600000,201.60,110.00,181.60,136.6,380.400000,194.5,151.600000,198.000000,...,177.800000,363.60,175.6,154.000000,441.100000,442.7,406.300000,368.10,446.900000,342.7
AAAS,299.700000,268.300000,481.90,409.70,488.80,330.6,542.700000,715.2,638.300000,1012.200000,...,180.800000,420.70,497.0,311.900000,431.800000,339.8,439.400000,336.40,419.100000,249.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZXDB,117.400000,24.400000,15.90,20.10,9.50,52.2,261.400000,16.1,19.100000,113.400000,...,110.400000,30.30,100.1,54.800000,34.400000,100.2,189.500000,208.30,19.200000,70.5
ZXDC,135.200000,336.100000,300.30,233.10,289.70,234.8,108.600000,197.9,464.800000,308.600000,...,198.900000,161.20,291.1,203.400000,204.900000,238.4,63.400000,47.50,274.200000,98.3
ZYX,298.400000,414.150000,866.25,414.35,758.95,509.8,616.750000,516.8,438.100000,596.250000,...,709.850000,408.95,294.8,1057.650000,1041.700000,1492.7,783.600000,793.45,1040.650000,994.0
ZZEF1,223.833333,59.333333,228.60,138.20,223.80,148.4,336.166667,134.5,160.166667,193.266667,...,132.533333,253.60,270.0,149.333333,364.433333,257.6,231.633333,213.70,136.933333,210.3


In [93]:
merged_data = geo_merge_clinical_genetic_data(selected_clinical_data, genetic_data)
# The preprocessing runs through, which means is_available should be True
is_available = True

  merged_data = pd.concat([clinical_df, genetic_df], axis=0).T.dropna()


In [94]:
merged_data

Unnamed: 0,Adrenocortical Cancer,Gender,A1CF,A2M,A4GALT,A4GNT,AAAS,AACS,AADAC,AAGAB,...,ZSWIM1,ZSWIM8,ZW10,ZWILCH,ZWINT,ZXDB,ZXDC,ZYX,ZZEF1,ZZZ3
GSM1954726,1.0,1.0,435.0,4469.0,42.6,352.8,299.7,625.0,39.5,210.4,...,216.9,690.9,452.9,37.6,418.0,117.4,135.2,298.4,223.833333,835.0
GSM1954727,1.0,1.0,344.6,7595.8,61.3,220.6,268.3,650.2,323.2,297.45,...,309.1,338.35,487.8,256.0,1331.1,24.4,336.1,414.15,59.333333,525.0
GSM1954728,1.0,1.0,540.8,5954.9,101.8,201.6,481.9,813.1,597.2,367.55,...,215.8,915.85,300.9,200.8,651.5,15.9,300.3,866.25,228.6,420.6
GSM1954729,1.0,1.0,268.1,6257.1,140.7,110.0,409.7,3292.3,1980.8,266.05,...,34.3,518.05,379.8,320.5,1436.8,20.1,233.1,414.35,138.2,427.3
GSM1954730,1.0,1.0,435.4,6142.2,142.6,181.6,488.8,870.8,1894.6,267.3,...,222.2,671.9,309.6,146.6,365.9,9.5,289.7,758.95,223.8,655.4
GSM1954731,1.0,0.0,410.7,4218.6,29.3,136.6,330.6,2399.1,1062.9,545.9,...,220.4,355.0,442.8,288.1,615.4,52.2,234.8,509.8,148.4,643.8
GSM1954732,1.0,1.0,514.8,2760.3,279.3,380.4,542.7,1012.9,213.4,68.8,...,529.3,878.35,388.2,69.9,840.8,261.4,108.6,616.75,336.166667,580.9
GSM1954733,1.0,0.0,279.2,3700.1,77.1,194.5,715.2,1772.3,21.3,319.65,...,163.1,326.2,455.2,715.3,3433.5,16.1,197.9,516.8,134.5,807.2
GSM1954734,1.0,1.0,275.3,4408.8,55.4,151.6,638.3,457.7,32.6,301.3,...,306.0,1865.9,334.9,261.7,2088.8,19.1,464.8,438.1,160.166667,741.9
GSM1954735,1.0,0.0,418.3,3862.6,63.3,198.0,1012.2,2131.8,29.1,266.85,...,305.5,1125.05,565.3,589.1,2288.8,113.4,308.6,596.25,193.266667,639.9


In [95]:
print(f"The merged dataset contains {len(merged_data)} samples.")
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

The merged dataset contains 23 samples.
For the feature 'Adrenocortical Cancer', the least common label is '1.0' with 23 occurrences. This represents 100.00% of the dataset.
The distribution of the feature 'Adrenocortical Cancer' in this dataset is severely biased.

For the feature 'Gender', the least common label is '0.0' with 8 occurrences. This represents 34.78% of the dataset.
The distribution of the feature 'Gender' in this dataset is fine.



True

In [96]:
if is_available:
    save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data, note='')
else:
    save_cohort_info(cohort, JSON_PATH, is_available)
merged_data.head()
if not is_trait_biased:
    merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)

In [156]:
# Finished
cohort = accession_num = "GSE68950"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"caArray_golub-00327: Sanger cell line Affymetrix gene expression project"
!Series_summary	"The microarray gene expression pattern was studied using 798 different cancer cell lines. The cancer cell lines are obtained from different centers. Annotation information were provided in the supplementary file."
!Series_overall_design	"golub-00327"
!Series_overall_design	"Assay Type: Gene Expression"
!Series_overall_design	"Provider: Affymetrix"
!Series_overall_design	"Array Designs: HT_HG-U133A"
!Series_overall_design	"Organism: Homo sapiens (ncbitax)"
!Series_overall_design	"Tissue Sites: leukemia, Urinary tract, Lung, BiliaryTract, Autonomic Ganglion, Thyroid gland, Stomach, Breast, Pancreas, Head and Neck, Lymphoma, Colorectal, Placenta, Liver, Brain, Bone, pleura, Skin, endometrium, Ovary, cervix, Oesophagus, Connective and Soft Tissue, Muscle, Kidney, Prostate, Adrenal Gland, Eye, Testis, Smooth Muscle Tissue, Vulva, Unknow"
!Series_overall_design	"Material Types: cell, syn

Unnamed: 0,!Sample_geo_accession,GSM1687570,GSM1687571,GSM1687572,GSM1687573,GSM1687574,GSM1687575,GSM1687576,GSM1687577,GSM1687578,...,GSM1688358,GSM1688359,GSM1688360,GSM1688361,GSM1688362,GSM1688363,GSM1688364,GSM1688365,GSM1688366,GSM1688367
0,!Sample_characteristics_ch1,cosmic id: 924101,cosmic id: 906800,cosmic id: 687452,cosmic id: 924100,cosmic id: 910924,cosmic id: 906798,cosmic id: 906797,cosmic id: 906797,cosmic id: 910922,...,cosmic id: 909781,cosmic id: 909782,cosmic id: 909782,cosmic id: 909784,cosmic id: 909785,cosmic id: 909785,cosmic id: 909904,cosmic id: 909905,cosmic id: 687592,cosmic id: 909907
1,!Sample_characteristics_ch1,disease state: L2 Acute Lymphoblastic Leukemia,disease state: NS Acute Lymphoblastic Leukemia,disease state: carcinoma,disease state: adenocarcinoma,disease state: adenocarcinoma,disease state: transitional cell carcinoma,disease state: transitional cell carcinoma,disease state: transitional cell carcinoma,disease state: clear cell renal cell carcinoma,...,disease state: renal cell carcinoma,disease state: retinoblastoma,disease state: retinoblastoma,disease state: malignant melanoma,disease state: follicular lymphoma,disease state: follicular lymphoma,disease state: carcinoma,disease state: glioblastoma multiforme,disease state: glioblastoma multiforme,disease state: ductal carcinoma
2,!Sample_characteristics_ch1,disease location: Hematopoietic and Lymphoid T...,disease location: Hematopoietic and Lymphoid T...,disease location: bladder,disease location: prostate,disease location: stomach,disease location: ureter,disease location: bladder,disease location: bladder,disease location: kidney,...,disease location: kidney,disease location: retina,disease location: retina,disease location: skin,disease location: lymph node,disease location: lymph node,disease location: pancreas,disease location: brain,disease location: temporal lobe,disease location: breast
3,!Sample_characteristics_ch1,organism part: Leukemia,organism part: Leukemia,organism part: Urinary tract,organism part: Prostate,organism part: Stomach,organism part: Urinary tract,organism part: Urinary tract,organism part: Urinary tract,organism part: Kidney,...,organism part: Kidney,organism part: Eye,organism part: Eye,organism part: Skin,organism part: Lymphoma,organism part: Lymphoma,organism part: Pancreas,organism part: Brain,organism part: Brain,organism part: Breast
4,!Sample_characteristics_ch1,sample: 736,sample: 494,sample: 7,sample: 746,sample: 439,sample: 168,sample: 152,sample: 37,sample: 450,...,sample: 470,sample: 246,sample: 246,sample: 714,sample: 482,sample: 49,sample: 234,sample: 41,sample: 397,sample: 726
5,!Sample_characteristics_ch1,cell line code: 749,cell line code: 493,cell line code: 505,cell line code: 760,cell line code: 437,cell line code: 151,cell line code: 134,cell line code: 134,cell line code: 449,...,cell line code: 469,cell line code: 231,cell line code: 231,cell line code: 727,cell line code: 481,cell line code: 481,cell line code: 219,cell line code: 401,cell line code: 390,cell line code: 738
6,!Sample_characteristics_ch1,supplier: DSMZ,supplier: DSMZ,supplier: ATCC,supplier: DSMZ,supplier: DSMZ,supplier: DSMZ,supplier: DSMZ,supplier: DSMZ,supplier: Unspecified,...,supplier: HSRRB,supplier: ATCC,supplier: ATCC,supplier: ATCC,supplier: DSMZ,supplier: DSMZ,supplier: DSMZ,supplier: HSRRB,supplier: HSRRB,supplier: ATCC
7,!Sample_characteristics_ch1,affy_batch: 1,affy_batch: 1,affy_batch: 2,affy_batch: 1,affy_batch: 1,affy_batch: 1,affy_batch: 1,affy_batch: 2,affy_batch: 1,...,affy_batch: 1,affy_batch: 1,affy_batch: 2,affy_batch: 1,affy_batch: 1,affy_batch: 2,affy_batch: 1,affy_batch: 2,affy_batch: 1,affy_batch: 1
8,!Sample_characteristics_ch1,crna plate: 8,crna plate: 6,crna plate: 11,crna plate: 8,crna plate: 5,crna plate: 2,crna plate: 2,crna plate: 12,crna plate: 5,...,crna plate: 5,crna plate: 3,crna plate: 12,crna plate: 8,crna plate: 5,crna plate: 12,crna plate: 3,crna plate: 12,crna plate: 4,crna plate: 8


In [157]:
tumor_stage_row = clinical_data.iloc[3]
tumor_stage_row.unique()

array(['!Sample_characteristics_ch1', 'organism part: Leukemia',
       'organism part: Urinary tract', 'organism part: Prostate',
       'organism part: Stomach', 'organism part: Kidney',
       'organism part: Thyroid Gland', 'organism part: Brain',
       'organism part: Skin', 'organism part: Muscle',
       'organism part: Head and Neck', 'organism part: Ovary',
       'organism part: Lung', 'organism part: Autonomic Ganglion',
       'organism part: Endometrium', 'organism part: Pancreas',
       'organism part: Cervix', 'organism part: Breast',
       'organism part: Colorectal', 'organism part: Liver',
       'organism part: Vulva', 'organism part: Bone',
       'organism part: Oesophagus', 'organism part: BiliaryTract',
       'organism part: Connective and Soft Tissue',
       'organism part: Lymphoma', 'organism part: Pleura',
       'organism part: Testis', 'organism part: Placenta',
       'organism part: Adrenal Gland', 'organism part: Unknow',
       'organism part: Smoo

In [158]:
tumor_stage_counts = tumor_stage_row.value_counts()
tumor_stage_counts

3
organism part: Lung                          139
organism part: Leukemia                      116
organism part: Brain                          60
organism part: Skin                           46
organism part: Breast                         44
organism part: Autonomic Ganglion             41
organism part: Colorectal                     40
organism part: Bone                           32
organism part: Head and Neck                  25
organism part: Oesophagus                     25
organism part: Stomach                        25
organism part: Ovary                          24
organism part: Kidney                         24
organism part: Lymphoma                       20
organism part: Urinary tract                  20
organism part: Pancreas                       17
organism part: Cervix                         14
organism part: Thyroid Gland                  13
organism part: Liver                          11
organism part: Endometrium                    11
organism part: Con

In [159]:
is_gene_availabe = True
trait_row = 3
age_row = None
gender_row = None

trait_type = 'binary'

# Verify and use the functions generated by GPT

# 这个函数将组织类型（tissue type）转换为有关癫痫存在与否的二进制值。
# 它是基于特定的假设，即如果组织类型是“胰腺导管腺癌”（Pancreatic Ductal Adenocarcinoma），则认为癫痫存在（返回1）；否则，认为癫痫不存在（返回0）。
def convert_trait(tissue_type):
    if tissue_type == 'organism part: Adrenal Gland':
        return 1  # Epilepsy present
    else:
        return 0  # Epilepsy not present


# 这个函数的目的是将年龄的字符串表示转换为一个连续的数值型表示。如果年龄未知（例如，标记为'n.a.'），则返回None。
# 函数尝试从传入的字符串中提取出一个整数作为年龄值。如果字符串的格式不符合预期，导致提取失败，同样返回None。
def convert_age(age_string):
    """
    Convert age string to a continuous numerical value.
    Unknown values are converted to None.
    """
    if age_string.lower() == 'n.a.':
        return None
    try:
        # Extract age as an integer from the string
        age = int(age_string.split(': ')[1])
        return age
    except (ValueError, IndexError):
        # In case of any format error or unexpected string structure
        return None


# 这个函数将性别的字符串表示转换为二进制值，其中“female”对应1，“male”对应0。如果性别未知或字符串不符合预期格式，则返回None。
# It sometimes maps 'female' to 0, and sometimes 1. Does it matter?
def convert_gender(gender_string):
    """
    Convert gender string to a binary value.
    'female' is represented as 1, 'male' as 0.
    Unknown values are converted to None.
    """
    if (gender_string.lower() == 'sex: female' or gender_string.lower() == 'sex: f' or gender_string.lower() == 'gender: female'):
        return 1
    elif (gender_string.lower() == 'sex: male' or gender_string.lower() == 'sex: m' or gender_string.lower() == 'gender: male') :  # changeed 
        return 0
    else:
        return None

In [160]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()

  clinical_df = clinical_df.applymap(convert_fn)


Unnamed: 0,GSM1687570,GSM1687571,GSM1687572,GSM1687573,GSM1687574,GSM1687575,GSM1687576,GSM1687577,GSM1687578,GSM1687579,...,GSM1688358,GSM1688359,GSM1688360,GSM1688361,GSM1688362,GSM1688363,GSM1688364,GSM1688365,GSM1688366,GSM1688367
Adrenocortical Cancer,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [161]:
genetic_data = get_genetic_data(matrix_file)
genetic_data.head()

Unnamed: 0_level_0,GSM1687570,GSM1687571,GSM1687572,GSM1687573,GSM1687574,GSM1687575,GSM1687576,GSM1687577,GSM1687578,GSM1687579,...,GSM1688358,GSM1688359,GSM1688360,GSM1688361,GSM1688362,GSM1688363,GSM1688364,GSM1688365,GSM1688366,GSM1688367
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1007_s_at,282.4,109.1,332.4,519.1,1874.4,358.8,3092.0,858.2,102.6,60.8,...,287.3,433.0,124.4,777.6,55.8,13.4,1777.6,125.8,226.5,1193.4
1053_at,234.6,225.3,203.9,155.9,135.9,126.3,501.1,128.3,105.1,126.6,...,165.5,340.3,137.3,240.2,159.2,44.0,390.0,42.9,129.4,155.5
117_at,12.6,85.4,1.9,6.0,6.0,24.1,2.0,2.9,4.6,18.3,...,2.4,2624.0,580.9,28.9,6.9,1.3,9.3,0.9,2.2,2.0
121_at,71.9,102.1,57.6,138.5,443.9,112.5,75.7,18.1,2271.1,403.9,...,1281.2,55.6,18.1,78.8,70.6,16.5,69.4,35.7,40.7,67.0
1255_g_at,0.9,2.8,0.7,4.1,5.4,3.1,2.4,0.9,1.7,4.7,...,1.7,351.1,90.7,1.9,4.4,0.5,1.4,6.4,1.6,3.4


In [162]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

gene_annotation

{'ID': ['1007_s_at', '1053_at', '117_at', '121_at', '1255_g_at'], 'GB_ACC': ['U48705', 'M87338', 'X51757', 'X69699', 'L36861'], 'SPOT_ID': [nan, nan, nan, nan, nan], 'Species Scientific Name': ['Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens'], 'Annotation Date': ['Mar 8, 2007', 'Mar 8, 2007', 'Mar 8, 2007', 'Mar 8, 2007', 'Mar 8, 2007'], 'Sequence Type': ['Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence'], 'Sequence Source': [nan, nan, nan, nan, nan], 'Target Description': ['U48705 /FEATURE=mRNA /DEFINITION=HSU48705 Human receptor tyrosine kinase DDR gene, complete cds', 'M87338 /FEATURE= /DEFINITION=HUMA1SBU Human replication factor C, 40-kDa subunit (A1) mRNA, complete cds', "X51757 /FEATURE=cds /DEFINITION=HSP70B Human heat-shock protein HSP70B' gene", 'X69699 /FEATURE= /DEFINITION=HSPAX8A H.sapiens Pax8 mRNA', 'L36861 /FEATURE=expanded_cds /DEFINITION=HUMGCAPB Homo sapiens guanylate cyclase activat

Unnamed: 0,ID,GB_ACC,SPOT_ID,Species Scientific Name,Annotation Date,Sequence Type,Sequence Source,Target Description,Representative Public ID,Gene Title,Gene Symbol,ENTREZ_GENE_ID,RefSeq Transcript ID,Gene Ontology Biological Process,Gene Ontology Cellular Component,Gene Ontology Molecular Function
0,1007_s_at,U48705,,Homo sapiens,"Mar 8, 2007",Exemplar sequence,,U48705 /FEATURE=mRNA /DEFINITION=HSU48705 Huma...,U48705,"discoidin domain receptor family, member 1",DDR1,780,NM_001954 /// NM_013993 /// NM_013994,0006468 // protein amino acid phosphorylation ...,0005615 // extracellular space // inferred fro...,0000166 // nucleotide binding // inferred from...
1,1053_at,M87338,,Homo sapiens,"Mar 8, 2007",Exemplar sequence,,M87338 /FEATURE= /DEFINITION=HUMA1SBU Human re...,M87338,"replication factor C (activator 1) 2, 40kDa",RFC2,5982,NM_002914 /// NM_181471,0006260 // DNA replication // inferred from el...,0005634 // nucleus // inferred from electronic...,0000166 // nucleotide binding // inferred from...
2,117_at,X51757,,Homo sapiens,"Mar 8, 2007",Exemplar sequence,,X51757 /FEATURE=cds /DEFINITION=HSP70B Human h...,X51757,heat shock 70kDa protein 6 (HSP70B'),HSPA6,3310,NM_002155 /// XM_001134322,0006457 // protein folding // inferred from el...,,0000166 // nucleotide binding // inferred from...
3,121_at,X69699,,Homo sapiens,"Mar 8, 2007",Exemplar sequence,,X69699 /FEATURE= /DEFINITION=HSPAX8A H.sapiens...,X69699,paired box gene 8,PAX8,7849,NM_003466 /// NM_013951 /// NM_013952 /// NM_0...,0001656 // metanephros development // inferred...,0005634 // nucleus // inferred from electronic...,0003700 // transcription factor activity // tr...
4,1255_g_at,L36861,,Homo sapiens,"Mar 8, 2007",Exemplar sequence,,L36861 /FEATURE=expanded_cds /DEFINITION=HUMGC...,L36861,guanylate cyclase activator 1A (retina),GUCA1A,2978,NM_000409,0007165 // signal transduction // non-traceabl...,,0005509 // calcium ion binding // inferred fro...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17800116,AFFX-BioC-3_at,283.9,,,,,,,,,,,,,,
17800117,AFFX-BioDn-5_at,3035.0,,,,,,,,,,,,,,
17800118,AFFX-BioDn-3_at,5720.6,,,,,,,,,,,,,,
17800119,AFFX-CreX-5_at,14758.7,,,,,,,,,,,,,,


In [163]:
gene_annotation.columns

Index(['ID', 'GB_ACC', 'SPOT_ID', 'Species Scientific Name', 'Annotation Date',
       'Sequence Type', 'Sequence Source', 'Target Description',
       'Representative Public ID', 'Gene Title', 'Gene Symbol',
       'ENTREZ_GENE_ID', 'RefSeq Transcript ID',
       'Gene Ontology Biological Process', 'Gene Ontology Cellular Component',
       'Gene Ontology Molecular Function'],
      dtype='object')

In [164]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'Gene Symbol'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)

In [165]:
genetic_data

Unnamed: 0_level_0,GSM1687570,GSM1687571,GSM1687572,GSM1687573,GSM1687574,GSM1687575,GSM1687576,GSM1687577,GSM1687578,GSM1687579,...,GSM1688358,GSM1688359,GSM1688360,GSM1688361,GSM1688362,GSM1688363,GSM1688364,GSM1688365,GSM1688366,GSM1688367
Gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ABCB4,259.400000,180.500000,4.200000,26.000000,36.500000,50.600000,70.700000,17.90,166.100000,6.800000,...,13.300000,21.400000,14.500000,555.600000,13.600000,9.400000,26.300000,4.200000,23.300000,13.600000
ACOT1,54.300000,269.600000,35.900000,63.400000,57.500000,323.600000,53.000000,14.40,35.100000,43.200000,...,39.600000,124.400000,31.500000,201.800000,27.800000,9.500000,227.600000,21.600000,33.200000,441.600000
ACSM2,8.300000,4.900000,0.600000,5.400000,5.700000,2.500000,13.800000,8.10,102.700000,3.000000,...,3.000000,0.900000,0.800000,8.700000,1.900000,2.800000,19.000000,1.800000,14.800000,1.000000
ACSM3,13418.200000,17818.800000,4635.400000,13449.900000,12204.700000,16411.300000,16341.800000,4182.80,9947.700000,3564.800000,...,9886.500000,10840.700000,3032.000000,10719.200000,8010.500000,3099.900000,13392.000000,3407.400000,8757.700000,11117.600000
ADAM21P,3.700000,54.000000,4.900000,5.900000,3.700000,36.900000,31.100000,14.70,32.700000,15.800000,...,59.900000,15.500000,4.100000,45.000000,3.200000,1.000000,17.600000,11.100000,19.700000,25.500000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZYG11BL,16.200000,40.233333,21.966667,42.900000,30.533333,44.800000,90.466667,26.20,34.166667,22.400000,...,44.466667,31.666667,11.966667,19.133333,21.366667,11.933333,67.400000,20.833333,27.266667,19.433333
ZYX,54.400000,65.550000,123.800000,112.850000,174.700000,546.100000,433.400000,109.05,678.500000,80.300000,...,488.650000,48.700000,8.500000,125.550000,259.250000,58.500000,318.050000,384.800000,198.850000,66.600000
ZZEF1,55.566667,98.433333,22.566667,44.566667,61.533333,137.966667,150.300000,27.20,47.433333,48.366667,...,50.700000,79.233333,12.933333,74.133333,36.433333,14.100000,51.133333,16.133333,41.200000,37.966667
ZZZ3,231.000000,420.700000,415.200000,190.400000,415.400000,589.200000,1376.500000,260.10,251.500000,331.100000,...,257.500000,194.500000,60.000000,236.600000,120.500000,43.600000,465.400000,139.400000,180.400000,266.700000


In [166]:
genetic_data = normalize_gene_symbols_in_index(genetic_data)

In [167]:
genetic_data

Unnamed: 0,GSM1687570,GSM1687571,GSM1687572,GSM1687573,GSM1687574,GSM1687575,GSM1687576,GSM1687577,GSM1687578,GSM1687579,...,GSM1688358,GSM1688359,GSM1688360,GSM1688361,GSM1688362,GSM1688363,GSM1688364,GSM1688365,GSM1688366,GSM1688367
A2M,15.400000,3.500000,2.100000,37.600000,4.000000,24.300000,12.2,20.90,5.100000,6.300000,...,3.70,26.800000,0.900000,893.100000,27.600000,0.6,19.600000,8.700000,13.20,26.300000
A4GALT,1.900000,7.800000,20.400000,4.200000,27.100000,54.400000,18.3,2.80,26.400000,3.900000,...,8.40,23.000000,5.700000,30.600000,7.700000,1.4,44.200000,14.200000,11.40,14.200000
A4GNT,2.100000,34.900000,0.400000,1.300000,16.600000,26.000000,22.6,6.40,10.500000,22.800000,...,9.70,1.000000,1.600000,7.500000,3.200000,1.0,22.800000,5.000000,7.70,3.700000
AAAS,40.400000,94.200000,30.200000,88.000000,78.700000,181.900000,211.5,30.70,177.800000,16.300000,...,97.60,123.700000,31.500000,105.000000,84.200000,21.5,144.100000,31.100000,58.40,36.100000
AACS,140.800000,91.200000,167.200000,595.600000,481.200000,745.800000,901.4,178.30,402.700000,82.000000,...,164.20,384.000000,127.700000,265.100000,72.100000,10.6,548.100000,52.300000,115.90,527.100000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZXDB,42.800000,32.500000,2.200000,31.600000,35.200000,39.800000,29.6,1.10,4.000000,25.300000,...,2.90,1.600000,4.100000,9.200000,16.800000,7.3,19.500000,3.500000,16.90,4.200000
ZXDC,63.100000,111.400000,61.500000,8.900000,208.300000,327.000000,496.3,91.60,102.900000,24.300000,...,110.10,86.600000,15.500000,68.800000,73.900000,19.5,268.200000,19.300000,41.40,97.200000
ZYX,54.400000,65.550000,123.800000,112.850000,174.700000,546.100000,433.4,109.05,678.500000,80.300000,...,488.65,48.700000,8.500000,125.550000,259.250000,58.5,318.050000,384.800000,198.85,66.600000
ZZEF1,55.566667,98.433333,22.566667,44.566667,61.533333,137.966667,150.3,27.20,47.433333,48.366667,...,50.70,79.233333,12.933333,74.133333,36.433333,14.1,51.133333,16.133333,41.20,37.966667


In [168]:
merged_data = geo_merge_clinical_genetic_data(selected_clinical_data, genetic_data)
# The preprocessing runs through, which means is_available should be True
is_available = True

In [169]:
merged_data

Unnamed: 0,Adrenocortical Cancer,A2M,A4GALT,A4GNT,AAAS,AACS,AADAC,AAK1,AAMP,AANAT,...,ZSCAN2,ZSWIM1,ZW10,ZWILCH,ZWINT,ZXDB,ZXDC,ZYX,ZZEF1,ZZZ3
GSM1687570,0.0,15.4,1.9,2.1,40.4,140.8,4.8,9.400000,97.6,0.5,...,2.2,42.1,196.4,287.2,1944.3,42.8,63.1,54.40,55.566667,231.0
GSM1687571,0.0,3.5,7.8,34.9,94.2,91.2,25.5,33.766667,149.2,3.5,...,2.4,48.4,341.1,407.7,2480.6,32.5,111.4,65.55,98.433333,420.7
GSM1687572,0.0,2.1,20.4,0.4,30.2,167.2,5.9,2.200000,179.1,0.3,...,1.2,6.0,304.2,248.7,388.6,2.2,61.5,123.80,22.566667,415.2
GSM1687573,0.0,37.6,4.2,1.3,88.0,595.6,8.4,17.566667,203.7,2.2,...,14.1,1.3,140.5,271.4,1544.0,31.6,8.9,112.85,44.566667,190.4
GSM1687574,0.0,4.0,27.1,16.6,78.7,481.2,4.1,11.433333,684.4,7.8,...,14.5,31.0,282.9,253.0,3481.2,35.2,208.3,174.70,61.533333,415.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GSM1688363,0.0,0.6,1.4,1.0,21.5,10.6,2.7,3.233333,74.9,0.4,...,0.8,1.5,34.0,76.4,1042.9,7.3,19.5,58.50,14.100000,43.6
GSM1688364,0.0,19.6,44.2,22.8,144.1,548.1,142.3,6.133333,975.1,12.2,...,25.6,2.4,341.7,396.6,2340.7,19.5,268.2,318.05,51.133333,465.4
GSM1688365,0.0,8.7,14.2,5.0,31.1,52.3,2.2,2.733333,98.8,1.3,...,1.0,3.1,47.5,34.1,91.2,3.5,19.3,384.80,16.133333,139.4
GSM1688366,0.0,13.2,11.4,7.7,58.4,115.9,6.4,8.133333,175.9,0.6,...,4.9,3.7,110.5,472.5,2737.3,16.9,41.4,198.85,41.200000,180.4


In [170]:
print(f"The merged dataset contains {len(merged_data)} samples.")
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

The merged dataset contains 798 samples.
For the feature 'Adrenocortical Cancer', the least common label is '1.0' with 2 occurrences. This represents 0.25% of the dataset.
The distribution of the feature 'Adrenocortical Cancer' in this dataset is severely biased.



True

In [171]:
if is_available:
    save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data, note='')
else:
    save_cohort_info(cohort, JSON_PATH, is_available)
merged_data.head()
if not is_trait_biased:
    merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)

In [113]:
# Finished (tumor_grade问了TA)

cohort = accession_num = "GSE19750"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Adrenocortical Carcinoma Gene Expression Profiling [Affymetrix]"
!Series_summary	"Background: Adrenocortical carcinoma (ACC) is associated with poor survival rates.  The objective of the study was to analyze ACC gene expression profiling data prognostic biomarkers and novel therapeutic targets."
!Series_summary	"Methods: 44 ACC and 4 normal adrenal glands were profiled on Affymetrix U133 Plus 2 expression microarrays and pathway and transcriptional enrichment analysis performed.  Protein levels were determined by western blot.  Drug efficacy was assessed against ACC cell lines.  Previously published expression datasets were analyzed as validation data sets."
!Series_summary	"Results: Pathway enrichment analysis identified marked dysregulation of cyclin-dependent kinases and mitosis.   Over-expression of PTTG1, which encodes securin, a negative regulator of p53, was identified as a marker of poor survival.  Median survival for patients with tumors expressing high PTTG1 le

Unnamed: 0,!Sample_geo_accession,GSM493251,GSM493252,GSM493253,GSM493254,GSM493255,GSM493256,GSM493257,GSM493258,GSM493259,...,GSM1094071,GSM1094072,GSM1094073,GSM1094074,GSM1094075,GSM1094076,GSM1094077,GSM1094078,GSM1094079,GSM1094080
0,!Sample_characteristics_ch1,Stage: NA,Stage: NA,Stage: NA,Stage: NA,Stage: 2,Stage: 4,Stage: 2,Stage: 4,Stage: 4,...,Stage: 1,Stage: Recurrence,Stage: 4,Stage: 2,Stage: Recurrence,Stage: Recurrence,Stage: Recurrence,Stage: 4,Stage: 4,Stage: NA
1,!Sample_characteristics_ch1,tumor grade: NA,tumor grade: NA,tumor grade: NA,tumor grade: NA,tumor grade: 3,tumor grade: 4,tumor grade: 4,tumor grade: 4,tumor grade: 4,...,tumor grade: 1,tumor grade: Unknown,tumor grade: Unknown,tumor grade: 4,tumor grade: 2,tumor grade: Unknown,tumor grade: Unknown,tumor grade: 4,tumor grade: Unknown,tumor grade: Unknown
2,!Sample_characteristics_ch1,functional: NA,functional: NA,functional: NA,functional: NA,functional: None,functional: None,functional: None,functional: Cushings,functional: Cushings,...,functional: None,functional: None,functional: Unknown,"functional: Cortisol, aldosterone, testosterone",functional: None,functional: aldosterone,functional: None,functional: None,functional: Unknown,functional: Unknown
3,!Sample_characteristics_ch1,gender: Unknown,gender: Unknown,gender: Unknown,gender: Unknown,gender: M,gender: F,gender: M,gender: M,gender: M,...,gender: M,gender: M,gender: F,gender: F,gender: F,gender: F,gender: M,gender: M,gender: F,gender: NA
4,!Sample_characteristics_ch1,age in years: Unknown,age in years: Unknown,age in years: Unknown,age in years: Unknown,age in years: 23.3,age in years: 56.5,age in years: 67.8,age in years: 72.1,age in years: 46.9,...,age in years: 57,age in years: 59,age in years: 59,age in years: 55,age in years: 51,age in years: 53,age in years: 69,age in years: 63,age in years: 28,age in years: NA
5,!Sample_characteristics_ch1,survival in years: NA,survival in years: NA,survival in years: NA,survival in years: NA,survival in years: 3,survival in years: 0.6,survival in years: 1.7,survival in years: 0.4,survival in years: 0.1,...,survival in years: 3,survival in years: 7.583,survival in years: Unknown,survival in years: 0.583,survival in years: 6,survival in years: 2.083,survival in years: 2.83,survival in years: 2.08,survival in years: Unknown,survival in years: NA
6,!Sample_characteristics_ch1,survival status: NA,survival status: NA,survival status: NA,survival status: NA,survival status: dead,survival status: dead,survival status: dead,survival status: dead,survival status: dead,...,survival status: alive,survival status: dead,survival status: Unknown,survival status: dead,survival status: alive,survival status: dead,survival status: dead,survival status: alive,survival status: Unknown,survival status: NA
7,!Sample_characteristics_ch1,tumor size in cm: NA,tumor size in cm: NA,tumor size in cm: NA,tumor size in cm: NA,tumor size in cm: 19,tumor size in cm: 9,tumor size in cm: 7.6,tumor size in cm: 9.5,tumor size in cm: 12,...,tumor size in cm: 4,tumor size in cm: 2.5,tumor size in cm: 10,tumor size in cm: 10.5,tumor size in cm: 14.5,tumor size in cm: 14.5,tumor size in cm: 7.8,tumor size in cm: 7.8,tumor size in cm: Unknown,tumor size in cm: Unknown
8,!Sample_characteristics_ch1,tumor weight in grams: NA,tumor weight in grams: NA,tumor weight in grams: NA,tumor weight in grams: NA,tumor weight in grams: 1100,tumor weight in grams: 190,tumor weight in grams: 150,tumor weight in grams: 175,tumor weight in grams: 235,...,tumor weight in grams: 39,tumor weight in grams: unknown,tumor weight in grams: 22,tumor weight in grams: 277,tumor weight in grams: 325,tumor weight in grams: 1243,tumor weight in grams: unknown,tumor weight in grams: 132,tumor weight in grams: unknown,tumor weight in grams: unknown
9,!Sample_characteristics_ch1,batch: 1,batch: 1,batch: 1,batch: 1,batch: 1,batch: 1,batch: 1,batch: 1,batch: 1,...,batch: 2,batch: 2,batch: 2,batch: 2,batch: 2,batch: 2,batch: 2,batch: 2,batch: 2,batch: 2


In [114]:
tumor_stage_row = clinical_data.iloc[1]
tumor_stage_row.unique()

array(['!Sample_characteristics_ch1', 'tumor grade: NA', 'tumor grade: 3',
       'tumor grade: 4', 'tumor grade: 2', 'tumor grade: 1',
       'tumor grade: Unknown'], dtype=object)

In [115]:
tumor_stage_counts = tumor_stage_row.value_counts()
tumor_stage_counts

1
tumor grade: Unknown           16
tumor grade: 2                  9
tumor grade: 4                  8
tumor grade: 1                  7
tumor grade: NA                 4
tumor grade: 3                  4
!Sample_characteristics_ch1     1
Name: count, dtype: int64

In [116]:
is_gene_availabe = True
trait_row = 1
age_row = 4
gender_row = 3

trait_type = 'binary'

def convert_trait(tumor_grade):
    if (tumor_grade == 'tumor grade: 2' or tumor_grade == 'tumor grade: 3' or tumor_grade == 'tumor grade: 4'):
        return 1  
    elif tumor_grade == 'tumor grade: 1':
        return 0  
    else:
        return None


# 这个函数的目的是将年龄的字符串表示转换为一个连续的数值型表示。如果年龄未知（例如，标记为'n.a.'），则返回None。
# 函数尝试从传入的字符串中提取出一个整数作为年龄值。如果字符串的格式不符合预期，导致提取失败，同样返回None。
def convert_age(age_string):
    """
    Convert age string to a continuous numerical value.
    Unknown values are converted to None.
    """
    if age_string.lower() == 'n.a.':
        return None
    try:
        # Extract age as an integer from the string
        age = int(age_string.split(': ')[1])
        return age
    except (ValueError, IndexError):
        # In case of any format error or unexpected string structure
        return None


# 这个函数将性别的字符串表示转换为二进制值，其中“female”对应1，“male”对应0。如果性别未知或字符串不符合预期格式，则返回None。
# It sometimes maps 'female' to 0, and sometimes 1. Does it matter?
def convert_gender(gender_string):
    """
    Convert gender string to a binary value.
    'female' is represented as 1, 'male' as 0.
    Unknown values are converted to None.
    """
    if (gender_string.lower() == 'sex: female' or gender_string.lower() == 'sex: f' or gender_string.lower() == 'gender: female' or gender_string.lower() == 'gender: f'):
        return 1
    elif (gender_string.lower() == 'sex: male' or gender_string.lower() == 'sex: m' or gender_string.lower() == 'gender: male' or gender_string.lower() == 'gender: m') :  # changeed 
        return 0
    else:
        return None

In [117]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()

  clinical_df = clinical_df.applymap(convert_fn)
  clinical_df = clinical_df.applymap(convert_fn)
  clinical_df = clinical_df.applymap(convert_fn)


Unnamed: 0,GSM493251,GSM493252,GSM493253,GSM493254,GSM493255,GSM493256,GSM493257,GSM493258,GSM493259,GSM493260,...,GSM1094071,GSM1094072,GSM1094073,GSM1094074,GSM1094075,GSM1094076,GSM1094077,GSM1094078,GSM1094079,GSM1094080
Adrenocortical Cancer,,,,,1.0,1.0,1.0,1.0,1.0,1.0,...,0,,,1,1,,,1,,
Age,,,,,,,,,,,...,57,59.0,59.0,55,51,53.0,69.0,63,28.0,
Gender,,,,,0.0,1.0,0.0,0.0,0.0,1.0,...,0,0.0,1.0,1,1,1.0,0.0,0,1.0,


In [118]:
genetic_data = get_genetic_data(matrix_file)
genetic_data

Unnamed: 0_level_0,GSM493251,GSM493252,GSM493253,GSM493254,GSM493255,GSM493256,GSM493257,GSM493258,GSM493259,GSM493260,...,GSM1094071,GSM1094072,GSM1094073,GSM1094074,GSM1094075,GSM1094076,GSM1094077,GSM1094078,GSM1094079,GSM1094080
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1007_s_at,101.10,48.86,100.83,104.84,566.78,418.37,499.31,432.83,212.42,336.85,...,139.79,35.72,79.18,154.74,166.63,29.67,72.48,10.21,151.86,137.73
1053_at,22.58,18.30,16.96,16.96,18.64,39.96,57.40,34.15,43.71,32.52,...,8.32,8.79,11.64,17.01,17.19,25.84,16.13,43.64,11.46,22.56
117_at,73.33,30.19,155.69,173.28,14.43,224.29,30.10,9.52,9.17,9.09,...,24.91,11.09,309.80,42.61,9.57,38.21,50.33,321.67,14.87,9.55
121_at,11.97,10.74,9.91,9.77,8.73,13.49,8.65,9.59,12.71,8.71,...,8.73,18.09,9.59,8.69,8.69,12.21,8.64,8.64,8.68,8.69
1255_g_at,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,...,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AFFX-ThrX-5_at,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,...,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50
AFFX-ThrX-M_at,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,...,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50
AFFX-TrpnX-3_at,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,...,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50
AFFX-TrpnX-5_at,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,...,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50,5.50


In [119]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

gene_annotation

{'ID': ['1007_s_at', '1053_at', '117_at', '121_at', '1255_g_at'], 'GB_ACC': ['U48705', 'M87338', 'X51757', 'X69699', 'L36861'], 'SPOT_ID': [nan, nan, nan, nan, nan], 'Species Scientific Name': ['Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens'], 'Annotation Date': ['Oct 6, 2014', 'Oct 6, 2014', 'Oct 6, 2014', 'Oct 6, 2014', 'Oct 6, 2014'], 'Sequence Type': ['Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence'], 'Sequence Source': ['Affymetrix Proprietary Database', 'GenBank', 'Affymetrix Proprietary Database', 'GenBank', 'Affymetrix Proprietary Database'], 'Target Description': ['U48705 /FEATURE=mRNA /DEFINITION=HSU48705 Human receptor tyrosine kinase DDR gene, complete cds', 'M87338 /FEATURE= /DEFINITION=HUMA1SBU Human replication factor C, 40-kDa subunit (A1) mRNA, complete cds', "X51757 /FEATURE=cds /DEFINITION=HSP70B Human heat-shock protein HSP70B' gene", 'X69699 /FEATURE= /DEFINITION=HSPAX8A H.sapiens

Unnamed: 0,ID,GB_ACC,SPOT_ID,Species Scientific Name,Annotation Date,Sequence Type,Sequence Source,Target Description,Representative Public ID,Gene Title,Gene Symbol,ENTREZ_GENE_ID,RefSeq Transcript ID,Gene Ontology Biological Process,Gene Ontology Cellular Component,Gene Ontology Molecular Function
0,1007_s_at,U48705,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,Affymetrix Proprietary Database,U48705 /FEATURE=mRNA /DEFINITION=HSU48705 Huma...,U48705,discoidin domain receptor tyrosine kinase 1 //...,DDR1 /// MIR4640,780 /// 100616237,NM_001202521 /// NM_001202522 /// NM_001202523...,0001558 // regulation of cell growth // inferr...,0005576 // extracellular region // inferred fr...,0000166 // nucleotide binding // inferred from...
1,1053_at,M87338,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,GenBank,M87338 /FEATURE= /DEFINITION=HUMA1SBU Human re...,M87338,"replication factor C (activator 1) 2, 40kDa",RFC2,5982,NM_001278791 /// NM_001278792 /// NM_001278793...,0000278 // mitotic cell cycle // traceable aut...,0005634 // nucleus // inferred from electronic...,0000166 // nucleotide binding // inferred from...
2,117_at,X51757,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,Affymetrix Proprietary Database,X51757 /FEATURE=cds /DEFINITION=HSP70B Human h...,X51757,heat shock 70kDa protein 6 (HSP70B'),HSPA6,3310,NM_002155,0000902 // cell morphogenesis // inferred from...,0005737 // cytoplasm // inferred from direct a...,0000166 // nucleotide binding // inferred from...
3,121_at,X69699,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,GenBank,X69699 /FEATURE= /DEFINITION=HSPAX8A H.sapiens...,X69699,paired box 8,PAX8,7849,NM_003466 /// NM_013951 /// NM_013952 /// NM_0...,0001655 // urogenital system development // in...,0005634 // nucleus // inferred from direct ass...,0000979 // RNA polymerase II core promoter seq...
4,1255_g_at,L36861,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,Affymetrix Proprietary Database,L36861 /FEATURE=expanded_cds /DEFINITION=HUMGC...,L36861,guanylate cyclase activator 1A (retina),GUCA1A,2978,NM_000409 /// XM_006715073,0007165 // signal transduction // non-traceabl...,0001750 // photoreceptor outer segment // infe...,0005509 // calcium ion binding // inferred fro...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2679118,AFFX-ThrX-5_at,5.5,,,,,,,,,,,,,,
2679119,AFFX-ThrX-M_at,5.5,,,,,,,,,,,,,,
2679120,AFFX-TrpnX-3_at,5.5,,,,,,,,,,,,,,
2679121,AFFX-TrpnX-5_at,5.5,,,,,,,,,,,,,,


In [120]:
gene_annotation.columns

Index(['ID', 'GB_ACC', 'SPOT_ID', 'Species Scientific Name', 'Annotation Date',
       'Sequence Type', 'Sequence Source', 'Target Description',
       'Representative Public ID', 'Gene Title', 'Gene Symbol',
       'ENTREZ_GENE_ID', 'RefSeq Transcript ID',
       'Gene Ontology Biological Process', 'Gene Ontology Cellular Component',
       'Gene Ontology Molecular Function'],
      dtype='object')

In [121]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'Gene Symbol'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)

In [122]:
genetic_data

Unnamed: 0_level_0,GSM493251,GSM493252,GSM493253,GSM493254,GSM493255,GSM493256,GSM493257,GSM493258,GSM493259,GSM493260,...,GSM1094071,GSM1094072,GSM1094073,GSM1094074,GSM1094075,GSM1094076,GSM1094077,GSM1094078,GSM1094079,GSM1094080
Gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ABCB4,1341.22,1980.88,1241.750,1284.08,455.520,15.020,8.490,2015.63,369.640,56.960,...,333.370,2458.34,180.17,736.710,460.15,878.59,460.98,8.25,161.38,526.21
ABCC6P1,7.83,10.19,13.070,11.64,76.920,11.640,23.720,13.40,5.810,23.780,...,5.770,9.76,7.01,5.500,8.54,6.73,7.71,10.09,11.64,11.64
ABCC6P2,7.83,10.19,13.070,11.64,76.920,11.640,23.720,13.40,5.810,23.780,...,5.770,9.76,7.01,5.500,8.54,6.73,7.71,10.09,11.64,11.64
ABCD1P2,5.50,5.83,5.500,5.50,5.500,5.500,5.500,5.50,5.500,5.500,...,5.500,5.50,5.50,5.500,5.50,5.50,5.50,5.50,5.50,5.50
AC078883.4,5.50,5.50,5.500,5.50,5.500,5.500,5.500,5.50,5.500,5.500,...,5.500,5.50,5.50,5.500,5.50,6.83,5.50,5.50,5.50,5.50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
abParts,9.04,9.04,9.040,9.04,35.270,6313.010,233.550,9.04,9.040,9.040,...,9.040,9.04,9.05,9.040,11.09,9.61,9.04,52.44,43.59,9.04
alpha,5.50,5.50,5.500,5.50,5.500,13.290,5.500,5.50,44.860,5.500,...,5.500,5.50,5.50,5.500,5.50,5.50,5.50,5.50,5.50,5.50
av27s1,5.50,5.72,5.640,5.50,5.500,5.500,5.500,5.50,5.500,5.500,...,5.500,5.50,5.50,5.500,5.50,5.50,5.50,5.50,5.50,5.50
hsa-let-7a-3,9.31,9.05,8.065,6.64,8.985,11.525,11.875,8.55,20.685,6.775,...,5.625,5.50,5.50,5.675,5.50,5.50,5.50,5.50,5.50,5.50


In [123]:
genetic_data = normalize_gene_symbols_in_index(genetic_data)

In [124]:
genetic_data

Unnamed: 0,GSM493251,GSM493252,GSM493253,GSM493254,GSM493255,GSM493256,GSM493257,GSM493258,GSM493259,GSM493260,...,GSM1094071,GSM1094072,GSM1094073,GSM1094074,GSM1094075,GSM1094076,GSM1094077,GSM1094078,GSM1094079,GSM1094080
A1BG,5.500000,5.640,5.500000,5.500000,5.500000,5.500000,18.570000,5.500000,5.500000,5.500000,...,5.500000,5.500000,6.410000,5.500000,5.500000,5.980000,5.500000,8.410000,5.500000,5.500000
A1BG-AS1,7.950000,12.980,13.340000,8.350000,8.840000,7.170000,10.140000,6.800000,6.540000,15.120000,...,10.540000,8.270000,16.220000,7.580000,9.610000,13.200000,6.500000,39.740000,7.510000,6.300000
A1CF,7.120000,5.500,6.805000,5.500000,5.500000,5.500000,5.930000,5.500000,6.555000,5.500000,...,5.500000,5.500000,5.500000,5.500000,5.500000,5.500000,5.500000,5.500000,5.500000,5.500000
A2M,3303.540000,1751.345,4304.375000,4467.760000,1225.825000,404.895000,276.810000,922.500000,184.890000,296.650000,...,503.735000,832.570000,1589.435000,924.025000,467.175000,984.970000,678.560000,249.025000,1012.500000,963.050000
A2M-AS1,13.250000,10.690,33.790000,23.860000,27.390000,9.860000,8.260000,49.400000,9.820000,14.360000,...,17.580000,24.270000,27.740000,16.180000,8.800000,16.650000,19.140000,9.390000,21.520000,18.600000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZYG11A,5.500000,5.500,5.500000,5.500000,5.500000,5.680000,41.110000,5.500000,5.500000,5.500000,...,5.500000,5.500000,5.500000,5.500000,5.500000,5.500000,5.500000,121.390000,5.500000,6.110000
ZYG11B,120.173333,135.650,144.273333,160.646667,192.403333,69.213333,164.600000,153.056667,118.900000,124.290000,...,52.633333,102.250000,73.350000,83.226667,84.833333,115.490000,64.146667,122.486667,66.980000,111.153333
ZYX,109.345000,103.390,115.915000,113.375000,33.390000,169.690000,97.565000,68.930000,105.195000,18.495000,...,18.360000,28.040000,86.410000,41.670000,31.345000,24.975000,43.145000,16.535000,16.925000,19.220000
ZZEF1,46.093333,37.000,36.803333,34.036667,53.216667,21.713333,37.506667,89.816667,77.906667,36.686667,...,58.003333,15.563333,8.223333,11.126667,9.826667,10.496667,13.913333,11.370000,11.766667,15.500000


In [125]:
merged_data = geo_merge_clinical_genetic_data(selected_clinical_data, genetic_data)
# The preprocessing runs through, which means is_available should be True
is_available = True

  merged_data = pd.concat([clinical_df, genetic_df], axis=0).T.dropna()


In [126]:
merged_data

Unnamed: 0,Adrenocortical Cancer,Age,Gender,A1BG,A1BG-AS1,A1CF,A2M,A2M-AS1,A2ML1,A2MP1,...,ZWILCH,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3
GSM493265,1.0,37.0,1.0,29.53,27.98,5.5,630.915,14.83,5.84,5.79,...,21.53,86.84,8.83,6.735,36.676667,5.5,98.686667,34.93,30.373333,71.49
GSM493270,1.0,58.0,1.0,5.5,6.77,5.5,968.32,34.95,5.5,5.79,...,30.695,90.66,13.94,11.26,33.223333,5.5,158.316667,70.065,61.263333,68.105
GSM1094056,0.0,20.0,1.0,5.5,9.68,5.5,380.405,9.32,5.5,5.84,...,48.445,191.94,12.93,9.78,41.331667,5.5,89.596667,15.265,16.493333,108.55
GSM1094057,1.0,68.0,1.0,5.5,7.95,5.5,1465.11,27.9,5.5,154.36,...,48.365,281.2,7.29,7.32,46.606667,5.5,50.603333,28.48,10.89,58.92
GSM1094060,0.0,32.0,1.0,5.5,7.95,5.5,680.91,14.43,5.5,12.75,...,47.235,28.79,10.14,11.98,45.533333,5.5,70.15,15.255,23.983333,101.235
GSM1094061,0.0,43.0,0.0,5.5,6.29,5.5,999.32,11.21,5.5,18.01,...,32.09,81.36,19.22,9.825,28.378333,5.56,76.246667,54.955,10.566667,67.585
GSM1094063,1.0,40.0,0.0,5.5,7.51,5.5,474.915,12.02,5.5,46.27,...,41.44,61.6,23.76,18.295,43.958333,5.5,87.03,23.975,10.793333,94.43
GSM1094066,0.0,27.0,1.0,10.71,22.0,5.945,365.945,11.49,5.56,60.55,...,35.47,65.86,32.64,25.8,55.596667,7.74,89.68,12.18,14.486667,95.235
GSM1094067,0.0,70.0,0.0,5.5,7.2,6.55,1191.28,23.69,5.5,9.22,...,33.985,37.59,24.58,22.49,27.738333,5.5,109.716667,42.44,10.15,61.765
GSM1094071,0.0,57.0,0.0,5.5,10.54,5.5,503.735,17.58,5.5,93.73,...,53.105,159.57,14.95,11.98,41.068333,5.5,52.633333,18.36,58.003333,80.365


In [127]:
print(f"The merged dataset contains {len(merged_data)} samples.")
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

The merged dataset contains 13 samples.
For the feature 'Adrenocortical Cancer', the least common label is '0.0' with 6 occurrences. This represents 46.15% of the dataset.
The distribution of the feature 'Adrenocortical Cancer' in this dataset is fine.

Quartiles for 'Age':
  25%: 37.0
  50% (Median): 51.0
  75%: 58.0
Min: 20.0
Max: 70.0
The distribution of the feature 'Age' in this dataset is fine.

For the feature 'Gender', the least common label is '0.0' with 5 occurrences. This represents 38.46% of the dataset.
The distribution of the feature 'Gender' in this dataset is fine.



False

In [128]:
if is_available:
    save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data, note='')
else:
    save_cohort_info(cohort, JSON_PATH, is_available)
merged_data.head()
if not is_trait_biased:
    merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)

In [129]:
# Stoped: MemoryError

cohort = accession_num = "GSE108089"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Comprehensive molecular profiling of children with recurrent cancer"
!Series_summary	"This SuperSeries is composed of the SubSeries listed below."
!Series_overall_design	"Refer to individual Series"


Unnamed: 0,!Sample_geo_accession,GSM2889381,GSM2889382,GSM2889383,GSM2889384,GSM2889385,GSM2889386,GSM2889387,GSM2889388,GSM2889389,...,GSM2889414,GSM2889415,GSM2889416,GSM2889417,GSM2889418,GSM2889419,GSM2889420,GSM2889421,GSM2889422,GSM2889423
0,!Sample_characteristics_ch1,condition: Atypical meningioma,condition: Choroid plexus carcinoma / Malignan...,condition: Pilocytisc/pilomyxoid astrocytoma,condition: Pleomorphic xanthoastrocytoma,condition: Mesoblastisc nephroma,condition: Signetringcell carcinoma,condition: Ganglioglioma / Diffuse astrocytoma,condition: Chondrosarkoma,"condition: Chordoma, dedefferentiated/anaplati...",...,condition: Anaplastic ependymoma,condition: Enchodroms,condition: Pineoblastoma,condition: Osteochondroma,condition: Malignant peripheral nerve sheeth t...,condition: Ewing sarcoma,condition: Adrenocortical carcinoma,condition: Anaplastic ependymoma,condition: Rhabdomyosarcoma,condition: Ependymom


In [153]:
# Finished

cohort = accession_num = "GSE49277"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Genome-scale methylome profiling of adrenocortical carcinomas (ACC) and adenomas (ACA)"
!Series_summary	"Genome-scale DNA methylation was analyzed in a cohort of ACC and ACA to identify DNA methylation changes."
!Series_overall_design	"Bisulfite converted DNA from 51 fresh frozen ACC and 30 ACA samples were hybridized to Illumina HumanMethylation27 BeadChips."


Unnamed: 0,!Sample_geo_accession,GSM1196428,GSM1196429,GSM1196430,GSM1196431,GSM1196432,GSM1196433,GSM1196434,GSM1196435,GSM1196436,...,GSM1196501,GSM1196502,GSM1196503,GSM1196504,GSM1196505,GSM1196506,GSM1196507,GSM1196508,GSM1196509,GSM1196510
0,!Sample_characteristics_ch1,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,...,cell type: Adrenocortical adenoma,cell type: Adrenocortical adenoma,cell type: Adrenocortical adenoma,cell type: Adrenocortical adenoma,cell type: Adrenocortical adenoma,cell type: Adrenocortical adenoma,cell type: Adrenocortical adenoma,cell type: Adrenocortical adenoma,cell type: Adrenocortical adenoma,cell type: Adrenocortical adenoma


In [154]:
tumor_stage_row = clinical_data.iloc[0]
tumor_stage_row.unique()

array(['!Sample_characteristics_ch1',
       'cell type: Adrenocortical carcinoma',
       'cell type: Adrenocortical adenoma'], dtype=object)

In [155]:
is_gene_availabe = True
trait_row = 0
age_row = None
gender_row = None

trait_type = 'binary'

# Verify and use the functions generated by GPT

# 这个函数将组织类型（tissue type）转换为有关癫痫存在与否的二进制值。
# 它是基于特定的假设，即如果组织类型是“胰腺导管腺癌”（Pancreatic Ductal Adenocarcinoma），则认为癫痫存在（返回1）；否则，认为癫痫不存在（返回0）。
def convert_trait(tissue_type):
    """
    Convert tissue type to epilepsy presence (binary).
    Assuming epilepsy presence for 'Hippocampus' tissue.
    """
    if tissue_type == 'condition: Adrenocortical carcinoma':
        return 1  # Epilepsy present
    else:
        return 0  # Epilepsy not present


# def convert_trait(tumor_grade):
#     if (tumor_grade == 'tumor grade: 2' or tumor_grade == 'tumor grade: 3' or tumor_grade == 'tumor grade: 4'):
#         return 1  
#     elif tumor_grade == 'tumor grade: 1':
#         return 0  
#     else:
#         return None

# 这个函数的目的是将年龄的字符串表示转换为一个连续的数值型表示。如果年龄未知（例如，标记为'n.a.'），则返回None。
# 函数尝试从传入的字符串中提取出一个整数作为年龄值。如果字符串的格式不符合预期，导致提取失败，同样返回None。
def convert_age(age_string):
    """
    Convert age string to a continuous numerical value.
    Unknown values are converted to None.
    """
    if age_string.lower() == 'n.a.':
        return None
    try:
        # Extract age as an integer from the string
        age = int(age_string.split(': ')[1])
        return age
    except (ValueError, IndexError):
        # In case of any format error or unexpected string structure
        return None


# 这个函数将性别的字符串表示转换为二进制值，其中“female”对应1，“male”对应0。如果性别未知或字符串不符合预期格式，则返回None。
# It sometimes maps 'female' to 0, and sometimes 1. Does it matter?
def convert_gender(gender_string):
    """
    Convert gender string to a binary value.
    'female' is represented as 1, 'male' as 0.
    Unknown values are converted to None.
    """
    if (gender_string.lower() == 'sex: female' or gender_string.lower() == 'sex: f' or gender_string.lower() == 'gender: female' or gender_string.lower() == 'gender: f'):
        return 1
    elif (gender_string.lower() == 'sex: male' or gender_string.lower() == 'sex: m' or gender_string.lower() == 'gender: male' or gender_string.lower() == 'gender: m') :  # changeed 
        return 0
    else:
        return None

In [133]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()

  clinical_df = clinical_df.applymap(convert_fn)


Unnamed: 0,GSM1196428,GSM1196429,GSM1196430,GSM1196431,GSM1196432,GSM1196433,GSM1196434,GSM1196435,GSM1196436,GSM1196437,...,GSM1196501,GSM1196502,GSM1196503,GSM1196504,GSM1196505,GSM1196506,GSM1196507,GSM1196508,GSM1196509,GSM1196510
Adrenocortical Cancer,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [134]:
genetic_data = get_genetic_data(matrix_file)
genetic_data

Unnamed: 0_level_0,GSM1196428,GSM1196429,GSM1196430,GSM1196431,GSM1196432,GSM1196433,GSM1196434,GSM1196435,GSM1196436,GSM1196437,...,GSM1196501,GSM1196502,GSM1196503,GSM1196504,GSM1196505,GSM1196506,GSM1196507,GSM1196508,GSM1196509,GSM1196510
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
cg00000292,0.853864,0.873688,0.835080,0.598675,0.770182,0.435826,0.779509,0.832013,0.787016,0.727145,...,0.584722,0.407948,0.590603,0.567782,0.359869,0.653657,0.771832,0.349547,0.693256,0.438942
cg00002426,0.183064,0.173539,0.205998,0.209957,0.176864,0.227453,0.194177,0.214679,0.097024,0.366682,...,0.217829,0.206476,0.169502,0.150175,0.228998,0.277141,0.248073,0.319916,0.258845,0.210580
cg00003994,0.146456,0.123106,0.045628,0.081798,0.190403,0.100151,0.060844,0.056040,0.060491,0.219284,...,0.117100,0.054069,0.075588,0.076611,0.062579,0.188656,0.110269,0.058055,0.099765,0.113560
cg00005847,0.855151,0.811072,0.830661,0.512991,0.738095,0.367623,0.814609,0.681481,0.798670,0.616220,...,0.436448,0.807319,0.468079,0.474581,0.455604,0.473226,0.173221,0.703806,0.385992,0.438669
cg00006414,0.019880,0.036970,0.021106,0.026866,0.051002,0.038998,0.028451,0.029114,0.020005,0.043336,...,0.026824,0.018624,0.044337,0.022170,0.028007,0.035340,0.038400,0.033976,0.084529,0.069691
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
cg27657283,0.108293,0.072173,0.068453,0.106587,0.072691,0.093206,0.119652,0.081737,0.082157,0.114968,...,0.144649,0.068423,0.083626,0.052279,0.054980,0.118309,0.064773,0.051555,0.219870,0.126036
cg27661264,0.359134,0.337503,0.233228,0.584919,0.343895,0.469068,0.506269,0.609027,0.360137,0.374381,...,0.364477,0.355658,0.384973,0.356396,0.352938,0.434815,0.371330,0.563619,0.529970,0.117175
cg27662379,0.008482,0.008813,0.008864,0.010997,0.024967,0.014811,0.020630,0.017429,0.014848,0.012520,...,0.034058,0.009198,0.012508,0.014009,0.007969,0.010037,0.019814,0.014395,0.030977,0.025664
cg27662877,0.022540,0.018977,0.013427,0.031917,0.035480,0.047936,0.035445,0.027623,0.032830,0.087929,...,0.034659,0.030906,0.027370,0.027446,0.017592,0.028750,0.036526,0.035873,0.048063,0.095890


In [135]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')


{'ID': ['cg00000292', 'cg00002426', 'cg00003994', 'cg00005847', 'cg00006414'], 'Name': ['cg00000292', 'cg00002426', 'cg00003994', 'cg00005847', 'cg00006414'], 'IlmnStrand': ['TOP', 'TOP', 'TOP', 'BOT', 'BOT'], 'AddressA_ID': [990370.0, 6580397.0, 7150184.0, 4850717.0, 6980731.0], 'AlleleA_ProbeSeq': ['AAACATTAATTACCAACCACTCTTCCAAAAAACACTTACCATTAAAACCA', 'AATATAATAACATTACCTTACCCATCTTATAATCAAACCAAACAAAAACA', 'AATAATAATAATACCCCCTATAATACTAACTAACAAACATACCCTCTTCA', 'TACTATAATACACCCTATATTTAAAACACTAAACTTACCCCATTAAAACA', 'CTCAAAAACCAAACAAAACAAAACCCCAATACTAATCATTAATAAAATCA'], 'AddressB_ID': [6660678.0, 6100343.0, 7150392.0, 1260113.0, 4280093.0], 'AlleleB_ProbeSeq': ['AAACATTAATTACCAACCGCTCTTCCAAAAAACACTTACCATTAAAACCG', 'AATATAATAACATTACCTTACCCGTCTTATAATCAAACCAAACGAAAACG', 'AATAATAATAATACCCCCTATAATACTAACTAACAAACATACCCTCTTCG', 'TACTATAATACACCCTATATTTAAAACACTAAACTTACCCCATTAAAACG', 'CTCGAAAACCGAACAAAACAAAACCCCAATACTAATCGTTAATAAAATCG'], 'GenomeBuild': [36.0, 36.0, 36.0, 36.0, 36.0], 'Chr': ['16', '3

In [136]:
gene_annotation.columns

Index(['ID', 'Name', 'IlmnStrand', 'AddressA_ID', 'AlleleA_ProbeSeq',
       'AddressB_ID', 'AlleleB_ProbeSeq', 'GenomeBuild', 'Chr', 'MapInfo',
       'Ploidy', 'Species', 'Source', 'SourceVersion', 'SourceStrand',
       'SourceSeq', 'TopGenomicSeq', 'Next_Base', 'Color_Channel',
       'TSS_Coordinate', 'Gene_Strand', 'Gene_ID', 'Symbol', 'Synonym',
       'Accession', 'GID', 'Annotation', 'Product', 'Distance_to_TSS',
       'CPG_ISLAND', 'CPG_ISLAND_LOCATIONS', 'MIR_CPG_ISLAND', 'RANGE_GB',
       'RANGE_START', 'RANGE_END', 'RANGE_STRAND', 'GB_ACC', 'ORF'],
      dtype='object')

In [138]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'Symbol'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)

In [139]:
genetic_data

Unnamed: 0_level_0,GSM1196428,GSM1196429,GSM1196430,GSM1196431,GSM1196432,GSM1196433,GSM1196434,GSM1196435,GSM1196436,GSM1196437,...,GSM1196501,GSM1196502,GSM1196503,GSM1196504,GSM1196505,GSM1196506,GSM1196507,GSM1196508,GSM1196509,GSM1196510
Gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
39873,0.673936,0.767293,0.462837,0.723856,0.762069,0.667737,0.626288,0.565551,0.443497,0.342631,...,0.513662,0.660481,0.552440,0.520529,0.479534,0.420159,0.610865,0.699102,0.693876,0.663477
39874,0.021972,0.027994,0.021949,0.035239,0.043132,0.030231,0.040192,0.025203,0.034531,0.049580,...,0.069800,0.023893,0.033235,0.045726,0.035371,0.036387,0.038013,0.061310,0.068406,0.080097
39875,0.257827,0.335861,0.430710,0.261546,0.417659,0.235851,0.444144,0.425444,0.279365,0.418262,...,0.397831,0.425723,0.161505,0.417688,0.279023,0.414954,0.339570,0.422286,0.398828,0.105795
39877,0.029100,0.029369,0.020777,0.038653,0.048058,0.036818,0.031426,0.040999,0.031261,0.044107,...,0.061673,0.030315,0.035297,0.030667,0.038152,0.039511,0.034959,0.046571,0.067835,0.083983
39878,0.037687,0.036334,0.041640,0.041173,0.052805,0.061906,0.048494,0.041754,0.048721,0.049560,...,0.059706,0.043102,0.038224,0.047865,0.049129,0.050167,0.038020,0.059438,0.068282,0.077149
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
hCAP-D3,0.086622,0.293496,0.511238,0.346532,0.331784,0.211769,0.771208,0.387434,0.370193,0.178503,...,0.195070,0.139705,0.209928,0.274113,0.115736,0.183836,0.199735,0.245198,0.204434,0.156846
hCAP-H2,0.018871,0.033459,0.017983,0.032419,0.056566,0.035238,0.029841,0.019733,0.031213,0.032566,...,0.039907,0.023451,0.037152,0.022379,0.021213,0.031702,0.034143,0.051531,0.052094,0.045951
hfl-B5,0.025806,0.026141,0.023396,0.033722,0.041915,0.024440,0.029792,0.029799,0.024806,0.028791,...,0.037977,0.023297,0.023845,0.022994,0.027307,0.030210,0.029699,0.036275,0.039542,0.050389
mimitin,0.795290,0.764163,0.795372,0.662529,0.704436,0.654050,0.785667,0.690498,0.790545,0.636648,...,0.640861,0.639229,0.648629,0.687194,0.611046,0.588422,0.551924,0.418062,0.579748,0.585062


In [140]:
genetic_data = normalize_gene_symbols_in_index(genetic_data)

In [141]:
genetic_data

Unnamed: 0,GSM1196428,GSM1196429,GSM1196430,GSM1196431,GSM1196432,GSM1196433,GSM1196434,GSM1196435,GSM1196436,GSM1196437,...,GSM1196501,GSM1196502,GSM1196503,GSM1196504,GSM1196505,GSM1196506,GSM1196507,GSM1196508,GSM1196509,GSM1196510
A1BG,0.940233,0.955381,0.938320,0.872892,0.916779,0.856238,0.955121,0.934917,0.936212,0.921644,...,0.935269,0.913290,0.934775,0.876953,0.889109,0.939685,0.815038,0.561779,0.883203,0.890838
A2M,0.749546,0.245808,0.694354,0.460610,0.506125,0.281712,0.565541,0.474934,0.689960,0.745397,...,0.437034,0.584193,0.562674,0.578834,0.324707,0.527137,0.561500,0.482367,0.434855,0.641983
A2ML1,0.790944,0.862709,0.828572,0.857309,0.823975,0.823414,0.851734,0.866243,0.856738,0.832075,...,0.796702,0.762879,0.781257,0.700352,0.744846,0.712289,0.735023,0.824379,0.797928,0.724440
A4GALT,0.374478,0.458769,0.362774,0.277094,0.565643,0.452039,0.435517,0.336927,0.447911,0.367933,...,0.356745,0.279629,0.240643,0.425051,0.302600,0.342275,0.300820,0.312059,0.343013,0.261995
A4GNT,0.634482,0.861004,0.768237,0.732757,0.848816,0.802036,0.815582,0.811549,0.768936,0.744320,...,0.766266,0.729659,0.790821,0.782283,0.760710,0.810943,0.830674,0.865127,0.758200,0.620545
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZWINT,0.031971,0.037855,0.033131,0.036843,0.043600,0.045227,0.041920,0.032721,0.032561,0.053831,...,0.053125,0.034923,0.031704,0.028981,0.032775,0.031853,0.032822,0.043026,0.066233,0.072853
ZXDA,0.468787,0.468326,0.457744,0.577638,0.449071,0.473941,0.508964,0.514308,0.402810,0.496837,...,0.428336,0.532234,0.419962,0.428588,0.465011,0.403560,0.539247,0.501775,0.458448,0.518576
ZYX,0.029696,0.019655,0.029563,0.034270,0.032857,0.035739,0.027779,0.023625,0.027304,0.023680,...,0.040760,0.027923,0.029210,0.032056,0.032080,0.024537,0.036141,0.028701,0.040045,0.033771
ZZEF1,0.079243,0.072206,0.066787,0.174444,0.439872,0.073742,0.225078,0.072467,0.072189,0.117451,...,0.097018,0.065814,0.082594,0.078288,0.085644,0.085534,0.130871,0.056252,0.089391,0.095180


In [142]:
merged_data = geo_merge_clinical_genetic_data(selected_clinical_data, genetic_data)
# The preprocessing runs through, which means is_available should be True
is_available = True

In [143]:
merged_data

Unnamed: 0,Adrenocortical Cancer,A1BG,A2M,A2ML1,A4GALT,A4GNT,AAAS,AACS,AADAC,AADACL2,...,ZSCAN2,ZSCAN4,ZSWIM1,ZW10,ZWILCH,ZWINT,ZXDA,ZYX,ZZEF1,ZZZ3
GSM1196428,0.0,0.940233,0.749546,0.790944,0.374478,0.634482,0.052668,0.217289,0.830051,0.672389,...,0.040573,0.810611,0.046088,0.035631,0.037018,0.031971,0.468787,0.029696,0.079243,0.022379
GSM1196429,0.0,0.955381,0.245808,0.862709,0.458769,0.861004,0.062860,0.070667,0.920872,0.661086,...,0.054629,0.947878,0.069931,0.035145,0.036839,0.037855,0.468326,0.019655,0.072206,0.044275
GSM1196430,0.0,0.938320,0.694354,0.828572,0.362774,0.768237,0.062411,0.085289,0.801509,0.542964,...,0.040528,0.850798,0.063188,0.038473,0.043276,0.033131,0.457744,0.029563,0.066787,0.047593
GSM1196431,0.0,0.872892,0.460610,0.857309,0.277094,0.732757,0.070863,0.099313,0.895421,0.578508,...,0.083012,0.805824,0.102142,0.043941,0.059123,0.036843,0.577638,0.034270,0.174444,0.045351
GSM1196432,0.0,0.916779,0.506125,0.823975,0.565643,0.848816,0.084068,0.232778,0.901168,0.628704,...,0.172733,0.926969,0.104711,0.047560,0.068287,0.043600,0.449071,0.032857,0.439872,0.053067
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GSM1196506,0.0,0.939685,0.527137,0.712289,0.342275,0.810943,0.078483,0.089091,0.857550,0.629493,...,0.051485,0.727464,0.075486,0.031816,0.108818,0.031853,0.403560,0.024537,0.085534,0.053363
GSM1196507,0.0,0.815038,0.561500,0.735023,0.300820,0.830674,0.085613,0.076625,0.851060,0.626565,...,0.055969,0.875060,0.072526,0.044406,0.048262,0.032822,0.539247,0.036141,0.130871,0.039618
GSM1196508,0.0,0.561779,0.482367,0.824379,0.312059,0.865127,0.100088,0.102076,0.894283,0.600336,...,0.079980,0.787975,0.122041,0.045710,0.093470,0.043026,0.501775,0.028701,0.056252,0.035059
GSM1196509,0.0,0.883203,0.434855,0.797928,0.343013,0.758200,0.114223,0.184925,0.860588,0.568541,...,0.097542,0.762363,0.158570,0.054024,0.108634,0.066233,0.458448,0.040045,0.089391,0.090369


In [144]:
print(f"The merged dataset contains {len(merged_data)} samples.")
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

The merged dataset contains 81 samples.
For the feature 'Adrenocortical Cancer', the least common label is '0.0' with 81 occurrences. This represents 100.00% of the dataset.
The distribution of the feature 'Adrenocortical Cancer' in this dataset is severely biased.



True

In [152]:
if is_available:
    save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data, note='')
else:
    save_cohort_info(cohort, JSON_PATH, is_available)
merged_data.head()
if not is_trait_biased:
    merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)

In [146]:
# Stopped: MemoryError

cohort = accession_num = "GSE49276"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"SNP data from 45 adrenocortical carcinomas"
!Series_summary	"SNP array data from 45 adrenocortical carcinomas were used to detect recurrent copy number alterations."
!Series_overall_design	"7 tumors were analyzed with Illumina Human610-Quad v1.0 BeadChip. 38 tumors were analyzed with Illumina HumanOmniExpress BeadChip."


Unnamed: 0,!Sample_geo_accession,GSM1196390,GSM1196391,GSM1196392,GSM1196393,GSM1196394,GSM1196395,GSM1196396,GSM1196397,GSM1196398,...,GSM1196418,GSM1196419,GSM1196420,GSM1196421,GSM1196422,GSM1196423,GSM1196424,GSM1196425,GSM1196426,GSM1196427
0,!Sample_characteristics_ch1,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,...,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma


In [147]:
tumor_stage_row = clinical_data.iloc[0]
tumor_stage_row.unique()

array(['!Sample_characteristics_ch1',
       'cell type: Adrenocortical carcinoma'], dtype=object)

In [148]:
is_gene_availabe = True
trait_row = 0
age_row = None
gender_row = None

trait_type = 'binary'

# Verify and use the functions generated by GPT

# 这个函数将组织类型（tissue type）转换为有关癫痫存在与否的二进制值。
# 它是基于特定的假设，即如果组织类型是“胰腺导管腺癌”（Pancreatic Ductal Adenocarcinoma），则认为癫痫存在（返回1）；否则，认为癫痫不存在（返回0）。
def convert_trait(tissue_type):
    """
    Convert tissue type to epilepsy presence (binary).
    Assuming epilepsy presence for 'Hippocampus' tissue.
    """
    if tissue_type == 'cell type: Adrenocortical carcinoma':
        return 1  # Epilepsy present
    else:
        return 0  # Epilepsy not present


# def convert_trait(tumor_grade):
#     if (tumor_grade == 'tumor grade: 2' or tumor_grade == 'tumor grade: 3' or tumor_grade == 'tumor grade: 4'):
#         return 1  
#     elif tumor_grade == 'tumor grade: 1':
#         return 0  
#     else:
#         return None

# 这个函数的目的是将年龄的字符串表示转换为一个连续的数值型表示。如果年龄未知（例如，标记为'n.a.'），则返回None。
# 函数尝试从传入的字符串中提取出一个整数作为年龄值。如果字符串的格式不符合预期，导致提取失败，同样返回None。
def convert_age(age_string):
    """
    Convert age string to a continuous numerical value.
    Unknown values are converted to None.
    """
    if age_string.lower() == 'n.a.':
        return None
    try:
        # Extract age as an integer from the string
        age = int(age_string.split(': ')[1])
        return age
    except (ValueError, IndexError):
        # In case of any format error or unexpected string structure
        return None


# 这个函数将性别的字符串表示转换为二进制值，其中“female”对应1，“male”对应0。如果性别未知或字符串不符合预期格式，则返回None。
# It sometimes maps 'female' to 0, and sometimes 1. Does it matter?
def convert_gender(gender_string):
    """
    Convert gender string to a binary value.
    'female' is represented as 1, 'male' as 0.
    Unknown values are converted to None.
    """
    if (gender_string.lower() == 'sex: female' or gender_string.lower() == 'sex: f' or gender_string.lower() == 'gender: female' or gender_string.lower() == 'gender: f'):
        return 1
    elif (gender_string.lower() == 'sex: male' or gender_string.lower() == 'sex: m' or gender_string.lower() == 'gender: male' or gender_string.lower() == 'gender: m') :  # changeed 
        return 0
    else:
        return None

In [149]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()

  clinical_df = clinical_df.applymap(convert_fn)


Unnamed: 0,GSM1196390,GSM1196391,GSM1196392,GSM1196393,GSM1196394,GSM1196395,GSM1196396,GSM1196397,GSM1196398,GSM1196399,...,GSM1196418,GSM1196419,GSM1196420,GSM1196421,GSM1196422,GSM1196423,GSM1196424,GSM1196425,GSM1196426,GSM1196427
Adrenocortical Cancer,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1


In [150]:
genetic_data = get_genetic_data(matrix_file)
genetic_data

Unnamed: 0_level_0,GSM1196390,GSM1196391,GSM1196392,GSM1196393,GSM1196394,GSM1196395,GSM1196396,GSM1196397,GSM1196398,GSM1196399,...,GSM1196418,GSM1196419,GSM1196420,GSM1196421,GSM1196422,GSM1196423,GSM1196424,GSM1196425,GSM1196426,GSM1196427
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
rs1000000,0.1065,0.2021,-0.0258,0.0714,0.1821,0.3716,-0.0343,0.0632,0.4660,0.1609,...,0.0428,-0.0564,0.0930,0.1396,0.0075,0.0441,0.0616,0.1482,0.0390,0.0873
rs1000002,0.1184,-0.2697,-0.5366,-0.3392,0.1135,-0.5014,-0.2013,-0.3576,0.0717,-0.1686,...,-0.1097,0.1616,-0.2828,-0.2522,0.2335,-0.1791,0.2011,0.2515,-0.5280,-0.1804
rs10000023,-0.0627,0.0830,-0.1168,-0.0207,-0.0500,0.1033,-0.0169,0.0197,0.2101,0.0587,...,-0.1189,0.0611,0.0374,-0.0478,0.1226,-0.2908,0.1440,0.0704,-0.0951,-0.1550
rs1000003,-0.0874,-0.5206,-0.2224,-0.1689,-0.0666,-0.2013,-0.1730,0.1010,0.2544,-0.1112,...,0.0022,0.0930,-0.1605,-0.1667,0.0316,0.0493,-0.1910,0.0574,-0.1625,-0.1756
rs10000030,0.7767,-0.0285,0.4749,0.2115,-0.0876,-0.3540,0.4442,-0.5963,0.0009,-0.4139,...,0.3487,0.7111,0.5774,0.4109,0.3663,0.2121,0.6078,0.8196,0.2831,0.8315
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
VGXS34742,0.1940,0.2780,0.2586,-0.1372,0.3958,0.1807,0.2403,0.1901,0.3416,0.4133,...,0.2485,0.0686,0.1226,-0.0354,-0.1239,-0.3011,0.0032,-0.0218,-0.4481,0.0698
VGXS34743,0.2213,0.3447,0.2913,-0.2419,0.3323,0.1770,0.2181,0.0668,0.3055,0.4727,...,0.2333,0.2030,0.1948,0.0367,0.0499,-0.1797,0.1574,-0.0151,-0.3437,0.1108
VGXS34744,-0.1274,-0.0201,-0.0016,-0.4076,0.2235,-0.0196,-0.0525,-0.0919,0.2553,0.2095,...,0.0968,-0.0961,-0.1108,-0.4564,-0.4473,-0.7310,-0.1037,-0.5226,-0.7002,-0.2425
VGXS34761,-0.1124,0.0917,0.0726,-0.3928,0.2356,0.0827,-0.0679,0.1435,0.3957,0.2965,...,0.2568,0.0615,0.0917,-0.1877,-0.1994,-0.4965,-0.0968,-0.2921,-0.5193,-0.0352


In [151]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

gene_annotation

MemoryError: 

In [None]:
gene_annotation.columns

In [None]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'Gene Symbol'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)

In [None]:
genetic_data

In [None]:
genetic_data = normalize_gene_symbols_in_index(genetic_data)

In [None]:
genetic_data

In [172]:
# Stopped: MemoryError

cohort = accession_num = "GSE52296"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"SNP data from 76 adrenocortical carcinomas"
!Series_summary	"SNP array data from 76 adrenocortical carcinomas were used to detect recurrent copy number alterations."
!Series_overall_design	"76 tumors were analyzed with Illumina HumanCore-12v1 BeadChip."


Unnamed: 0,!Sample_geo_accession,GSM1262431,GSM1262432,GSM1262433,GSM1262434,GSM1262435,GSM1262436,GSM1262437,GSM1262438,GSM1262439,...,GSM1262497,GSM1262498,GSM1262499,GSM1262500,GSM1262501,GSM1262502,GSM1262503,GSM1262504,GSM1262505,GSM1262506
0,!Sample_characteristics_ch1,Sex: F,Sex: F,Sex: F,Sex: F,Sex: M,Sex: M,Sex: F,Sex: F,Sex: F,...,Sex: F,Sex: F,Sex: F,Sex: F,Sex: M,Sex: M,Sex: F,Sex: F,Sex: F,Sex: F
1,!Sample_characteristics_ch1,age: 68,age: 79,age: 45,age: 68,age: 31,age: 63,age: 47,age: 43,age: 76,...,age: 24,age: 72,age: 37,age: 49,age: 63,age: 60,age: 53,age: 49,age: 65,age: 29
2,!Sample_characteristics_ch1,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,...,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma


In [173]:
tumor_stage_row = clinical_data.iloc[2]
tumor_stage_row.unique()

array(['!Sample_characteristics_ch1',
       'cell type: Adrenocortical carcinoma'], dtype=object)

In [174]:
is_gene_availabe = True
trait_row = 2
age_row = 1
gender_row = 0

trait_type = 'binary'

# Verify and use the functions generated by GPT

# 这个函数将组织类型（tissue type）转换为有关癫痫存在与否的二进制值。
# 它是基于特定的假设，即如果组织类型是“胰腺导管腺癌”（Pancreatic Ductal Adenocarcinoma），则认为癫痫存在（返回1）；否则，认为癫痫不存在（返回0）。
def convert_trait(tissue_type):
    """
    Convert tissue type to epilepsy presence (binary).
    Assuming epilepsy presence for 'Hippocampus' tissue.
    """
    if tissue_type == 'cell type: Adrenocortical carcinoma':
        return 1  # Epilepsy present
    else:
        return 0  # Epilepsy not present


# def convert_trait(tumor_grade):
#     if (tumor_grade == 'tumor grade: 2' or tumor_grade == 'tumor grade: 3' or tumor_grade == 'tumor grade: 4'):
#         return 1  
#     elif tumor_grade == 'tumor grade: 1':
#         return 0  
#     else:
#         return None

# 这个函数的目的是将年龄的字符串表示转换为一个连续的数值型表示。如果年龄未知（例如，标记为'n.a.'），则返回None。
# 函数尝试从传入的字符串中提取出一个整数作为年龄值。如果字符串的格式不符合预期，导致提取失败，同样返回None。
def convert_age(age_string):
    """
    Convert age string to a continuous numerical value.
    Unknown values are converted to None.
    """
    if age_string.lower() == 'n.a.':
        return None
    try:
        # Extract age as an integer from the string
        age = int(age_string.split(': ')[1])
        return age
    except (ValueError, IndexError):
        # In case of any format error or unexpected string structure
        return None


# 这个函数将性别的字符串表示转换为二进制值，其中“female”对应1，“male”对应0。如果性别未知或字符串不符合预期格式，则返回None。
# It sometimes maps 'female' to 0, and sometimes 1. Does it matter?
def convert_gender(gender_string):
    """
    Convert gender string to a binary value.
    'female' is represented as 1, 'male' as 0.
    Unknown values are converted to None.
    """
    if (gender_string.lower() == 'sex: female' or gender_string.lower() == 'sex: f' or gender_string.lower() == 'gender: female' or gender_string.lower() == 'gender: f'):
        return 1
    elif (gender_string.lower() == 'sex: male' or gender_string.lower() == 'sex: m' or gender_string.lower() == 'gender: male' or gender_string.lower() == 'gender: m') :  # changeed 
        return 0
    else:
        return None

In [175]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()

  clinical_df = clinical_df.applymap(convert_fn)
  clinical_df = clinical_df.applymap(convert_fn)
  clinical_df = clinical_df.applymap(convert_fn)


Unnamed: 0,GSM1262431,GSM1262432,GSM1262433,GSM1262434,GSM1262435,GSM1262436,GSM1262437,GSM1262438,GSM1262439,GSM1262440,...,GSM1262497,GSM1262498,GSM1262499,GSM1262500,GSM1262501,GSM1262502,GSM1262503,GSM1262504,GSM1262505,GSM1262506
Adrenocortical Cancer,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
Age,68,79,45,68,31,63,47,43,76,37,...,24,72,37,49,63,60,53,49,65,29
Gender,1,1,1,1,0,0,1,1,1,0,...,1,1,1,1,0,0,1,1,1,1


In [176]:
genetic_data = get_genetic_data(matrix_file)
genetic_data.head()

Unnamed: 0_level_0,GSM1262431,GSM1262432,GSM1262433,GSM1262434,GSM1262435,GSM1262436,GSM1262437,GSM1262438,GSM1262439,GSM1262440,...,GSM1262497,GSM1262498,GSM1262499,GSM1262500,GSM1262501,GSM1262502,GSM1262503,GSM1262504,GSM1262505,GSM1262506
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1KG_10_101715768,-0.0171,-0.079,-0.0767,-0.0574,0.1267,0.0599,-0.0041,-0.0891,-0.0882,0.0274,...,0.039,-0.0518,-0.0336,-0.1126,-0.0117,-0.0319,-0.0696,0.0085,-0.0883,-0.0143
1KG_10_102265185,0.7398,0.2588,0.7608,-0.982,-0.2913,0.7423,0.9168,0.6207,1.065,1.2469,...,0.8523,0.9312,0.8324,0.7108,0.5138,0.6199,0.5734,0.8608,0.6259,0.62
1KG_10_102584498,0.1463,-0.0861,-0.0004,0.0137,0.3353,0.1679,0.0609,-0.0952,-0.0129,0.0825,...,-0.0055,-0.0705,-0.0535,-0.1776,-0.0815,-0.0657,-0.1601,-0.0177,-0.0909,0.0407
1KG_10_105796247,0.425,0.1208,0.4771,0.3891,0.6188,0.4318,0.4521,0.3531,0.4691,0.5237,...,0.5056,0.4456,0.376,0.2916,0.2201,0.2975,0.2365,0.3229,0.3548,0.5135
1KG_10_1066786,-0.0786,-0.2808,-0.4123,-0.0665,0.0945,-0.0363,0.0108,-0.2025,-0.1018,-0.013,...,-0.1158,-0.12,-0.2127,-0.2758,-0.1879,-0.0461,-0.2037,-0.1851,-0.2983,-0.1691


In [177]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

gene_annotation

MemoryError: 

In [None]:
gene_annotation.columns

In [None]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'Gene Symbol'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)

In [178]:
# Finished
cohort = accession_num = "GSE108088"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Comprehensive molecular profiling of children with recurrent cancer II"
!Series_summary	"to explore possible treatment targets and reasons for agressive children cacners by comprehensive molecular profiling on several platforms"
!Series_summary	"to explore copy number aberrations related to cancers"
!Series_overall_design	"diagnostics of children meeting the oncologist with recurrent or agressive cancers where treatment options have been exhausted"


Unnamed: 0,!Sample_geo_accession,GSM2889381,GSM2889382,GSM2889383,GSM2889384,GSM2889385,GSM2889386,GSM2889387,GSM2889388,GSM2889389,...,GSM2889414,GSM2889415,GSM2889416,GSM2889417,GSM2889418,GSM2889419,GSM2889420,GSM2889421,GSM2889422,GSM2889423
0,!Sample_characteristics_ch1,condition: Atypical meningioma,condition: Choroid plexus carcinoma / Malignan...,condition: Pilocytisc/pilomyxoid astrocytoma,condition: Pleomorphic xanthoastrocytoma,condition: Mesoblastisc nephroma,condition: Signetringcell carcinoma,condition: Ganglioglioma / Diffuse astrocytoma,condition: Chondrosarkoma,"condition: Chordoma, dedefferentiated/anaplati...",...,condition: Anaplastic ependymoma,condition: Enchodroms,condition: Pineoblastoma,condition: Osteochondroma,condition: Malignant peripheral nerve sheeth t...,condition: Ewing sarcoma,condition: Adrenocortical carcinoma,condition: Anaplastic ependymoma,condition: Rhabdomyosarcoma,condition: Ependymom


In [179]:
tumor_stage_row = clinical_data.iloc[0]
tumor_stage_row.unique()

array(['!Sample_characteristics_ch1', 'condition: Atypical meningioma',
       'condition: Choroid plexus carcinoma / Malignant peripheral nerve sheeth tumor',
       'condition: Pilocytisc/pilomyxoid astrocytoma',
       'condition: Pleomorphic xanthoastrocytoma',
       'condition: Mesoblastisc nephroma',
       'condition: Signetringcell carcinoma',
       'condition: Ganglioglioma / Diffuse astrocytoma',
       'condition: Chondrosarkoma',
       'condition: Chordoma, dedefferentiated/anaplatic type (INI1-loss)',
       'condition: Hepatoblastoma',
       'condition: Diffuse midline glioma H3K27M-mutated',
       'condition: Anaplastisc ependymoma',
       'condition: Juvenile xanthogranuloma',
       'condition: Anaplastisc pleomorfic xanthoastrocytoma / Glioblastoma',
       'condition: Alveolar rhabdomyosarcoma',
       'condition: Precursor T-lymphoblastic lymphoma',
       'condition: Glioblastoma',
       'condition: Malignant peripheral nerve sheeth tumor',
       'condition

In [180]:
is_gene_availabe = True
trait_row = 0
age_row = None
gender_row = None

trait_type = 'binary'

# Verify and use the functions generated by GPT

# 这个函数将组织类型（tissue type）转换为有关癫痫存在与否的二进制值。
# 它是基于特定的假设，即如果组织类型是“胰腺导管腺癌”（Pancreatic Ductal Adenocarcinoma），则认为癫痫存在（返回1）；否则，认为癫痫不存在（返回0）。
def convert_trait(tissue_type):
    """
    Convert tissue type to epilepsy presence (binary).
    Assuming epilepsy presence for 'Hippocampus' tissue.
    """
    if tissue_type == 'condition: Adrenocortical carcinoma':
        return 1  # Epilepsy present
    else:
        return 0  # Epilepsy not present


# def convert_trait(tumor_grade):
#     if (tumor_grade == 'tumor grade: 2' or tumor_grade == 'tumor grade: 3' or tumor_grade == 'tumor grade: 4'):
#         return 1  
#     elif tumor_grade == 'tumor grade: 1':
#         return 0  
#     else:
#         return None

# 这个函数的目的是将年龄的字符串表示转换为一个连续的数值型表示。如果年龄未知（例如，标记为'n.a.'），则返回None。
# 函数尝试从传入的字符串中提取出一个整数作为年龄值。如果字符串的格式不符合预期，导致提取失败，同样返回None。
def convert_age(age_string):
    """
    Convert age string to a continuous numerical value.
    Unknown values are converted to None.
    """
    if age_string.lower() == 'n.a.':
        return None
    try:
        # Extract age as an integer from the string
        age = int(age_string.split(': ')[1])
        return age
    except (ValueError, IndexError):
        # In case of any format error or unexpected string structure
        return None


# 这个函数将性别的字符串表示转换为二进制值，其中“female”对应1，“male”对应0。如果性别未知或字符串不符合预期格式，则返回None。
# It sometimes maps 'female' to 0, and sometimes 1. Does it matter?
def convert_gender(gender_string):
    """
    Convert gender string to a binary value.
    'female' is represented as 1, 'male' as 0.
    Unknown values are converted to None.
    """
    if (gender_string.lower() == 'sex: female' or gender_string.lower() == 'sex: f' or gender_string.lower() == 'gender: female' or gender_string.lower() == 'gender: f'):
        return 1
    elif (gender_string.lower() == 'sex: male' or gender_string.lower() == 'sex: m' or gender_string.lower() == 'gender: male' or gender_string.lower() == 'gender: m') :  # changeed 
        return 0
    else:
        return None

In [181]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()

  clinical_df = clinical_df.applymap(convert_fn)


Unnamed: 0,GSM2889381,GSM2889382,GSM2889383,GSM2889384,GSM2889385,GSM2889386,GSM2889387,GSM2889388,GSM2889389,GSM2889390,...,GSM2889414,GSM2889415,GSM2889416,GSM2889417,GSM2889418,GSM2889419,GSM2889420,GSM2889421,GSM2889422,GSM2889423
Adrenocortical Cancer,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0


In [182]:
genetic_data = get_genetic_data(matrix_file)
genetic_data

Unnamed: 0_level_0,GSM2889381,GSM2889382,GSM2889383,GSM2889384,GSM2889385,GSM2889386,GSM2889387,GSM2889388,GSM2889389,GSM2889390,...,GSM2889414,GSM2889415,GSM2889416,GSM2889417,GSM2889418,GSM2889419,GSM2889420,GSM2889421,GSM2889422,GSM2889423
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1007_s_at,11.632633,9.908213,11.827391,11.958660,8.923858,9.266323,11.089168,8.470315,10.613575,7.913814,...,11.752788,8.436344,9.346787,7.883638,9.192208,9.482101,10.579143,11.138810,10.251416,10.644898
1053_at,7.337010,10.201116,7.642338,7.346276,7.319823,7.954301,6.362022,6.081242,6.553467,7.600938,...,7.270631,6.396677,5.713165,6.357230,6.809926,7.412866,8.555444,8.216679,7.847031,7.769231
117_at,6.155619,5.734094,6.963633,6.015577,5.207735,5.381928,5.498432,5.975086,6.425805,7.771491,...,4.953888,5.893032,5.168161,6.312296,6.172038,5.837645,5.060681,6.882564,5.804388,4.921321
121_at,6.661803,7.240843,7.367418,7.191843,8.568599,7.485777,5.858277,7.495252,7.265906,6.939669,...,6.998448,7.804581,8.771655,7.507523,7.867807,7.402482,7.488434,7.705329,7.487017,7.048244
1255_g_at,8.660903,2.994038,4.527065,3.175358,3.188646,3.323390,3.846578,3.374182,3.540232,3.046619,...,3.346600,3.068174,4.218592,3.139722,3.774415,3.929085,3.140174,3.061480,4.168771,3.022849
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AFFX-ThrX-5_at,6.998247,4.717028,4.837961,4.430565,5.083317,4.684815,9.259063,8.409200,8.789234,11.294144,...,7.720076,9.150521,4.780660,9.496483,10.338204,7.174550,6.740168,7.458485,6.663228,6.122767
AFFX-ThrX-M_at,7.905597,4.778656,5.421778,4.496537,5.612980,4.905915,8.985258,9.384753,9.636073,11.218547,...,8.528286,10.324751,3.472218,10.550219,11.102371,7.850730,7.744783,8.414700,7.837027,7.030191
AFFX-TrpnX-3_at,2.632044,2.440085,2.370605,2.445787,2.425345,2.484874,2.594820,2.440124,2.819379,4.289839,...,2.576324,2.350293,2.389836,2.560649,2.454420,2.345227,2.506558,2.623461,2.471355,2.454033
AFFX-TrpnX-5_at,2.962745,3.074502,3.207309,3.055263,3.188459,2.964358,3.261142,3.112587,3.413876,3.218851,...,3.090914,3.072767,3.290665,3.125538,3.250400,2.972254,2.988300,3.147530,3.095629,2.929060


In [183]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

gene_annotation

{'ID': ['1007_s_at', '1053_at', '117_at', '121_at', '1255_g_at'], 'GB_ACC': ['U48705', 'M87338', 'X51757', 'X69699', 'L36861'], 'SPOT_ID': [nan, nan, nan, nan, nan], 'Species Scientific Name': ['Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens'], 'Annotation Date': ['Oct 6, 2014', 'Oct 6, 2014', 'Oct 6, 2014', 'Oct 6, 2014', 'Oct 6, 2014'], 'Sequence Type': ['Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence'], 'Sequence Source': ['Affymetrix Proprietary Database', 'GenBank', 'Affymetrix Proprietary Database', 'GenBank', 'Affymetrix Proprietary Database'], 'Target Description': ['U48705 /FEATURE=mRNA /DEFINITION=HSU48705 Human receptor tyrosine kinase DDR gene, complete cds', 'M87338 /FEATURE= /DEFINITION=HUMA1SBU Human replication factor C, 40-kDa subunit (A1) mRNA, complete cds', "X51757 /FEATURE=cds /DEFINITION=HSP70B Human heat-shock protein HSP70B' gene", 'X69699 /FEATURE= /DEFINITION=HSPAX8A H.sapiens

Unnamed: 0,ID,GB_ACC,SPOT_ID,Species Scientific Name,Annotation Date,Sequence Type,Sequence Source,Target Description,Representative Public ID,Gene Title,Gene Symbol,ENTREZ_GENE_ID,RefSeq Transcript ID,Gene Ontology Biological Process,Gene Ontology Cellular Component,Gene Ontology Molecular Function
0,1007_s_at,U48705,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,Affymetrix Proprietary Database,U48705 /FEATURE=mRNA /DEFINITION=HSU48705 Huma...,U48705,discoidin domain receptor tyrosine kinase 1 //...,DDR1 /// MIR4640,780 /// 100616237,NM_001202521 /// NM_001202522 /// NM_001202523...,0001558 // regulation of cell growth // inferr...,0005576 // extracellular region // inferred fr...,0000166 // nucleotide binding // inferred from...
1,1053_at,M87338,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,GenBank,M87338 /FEATURE= /DEFINITION=HUMA1SBU Human re...,M87338,"replication factor C (activator 1) 2, 40kDa",RFC2,5982,NM_001278791 /// NM_001278792 /// NM_001278793...,0000278 // mitotic cell cycle // traceable aut...,0005634 // nucleus // inferred from electronic...,0000166 // nucleotide binding // inferred from...
2,117_at,X51757,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,Affymetrix Proprietary Database,X51757 /FEATURE=cds /DEFINITION=HSP70B Human h...,X51757,heat shock 70kDa protein 6 (HSP70B'),HSPA6,3310,NM_002155,0000902 // cell morphogenesis // inferred from...,0005737 // cytoplasm // inferred from direct a...,0000166 // nucleotide binding // inferred from...
3,121_at,X69699,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,GenBank,X69699 /FEATURE= /DEFINITION=HSPAX8A H.sapiens...,X69699,paired box 8,PAX8,7849,NM_003466 /// NM_013951 /// NM_013952 /// NM_0...,0001655 // urogenital system development // in...,0005634 // nucleus // inferred from direct ass...,0000979 // RNA polymerase II core promoter seq...
4,1255_g_at,L36861,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,Affymetrix Proprietary Database,L36861 /FEATURE=expanded_cds /DEFINITION=HUMGC...,L36861,guanylate cyclase activator 1A (retina),GUCA1A,2978,NM_000409 /// XM_006715073,0007165 // signal transduction // non-traceabl...,0001750 // photoreceptor outer segment // infe...,0005509 // calcium ion binding // inferred fro...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2405738,AFFX-ThrX-5_at,6.122767291,,,,,,,,,,,,,,
2405739,AFFX-ThrX-M_at,7.030190594,,,,,,,,,,,,,,
2405740,AFFX-TrpnX-3_at,2.454033124,,,,,,,,,,,,,,
2405741,AFFX-TrpnX-5_at,2.929060487,,,,,,,,,,,,,,


In [184]:
gene_annotation.columns

Index(['ID', 'GB_ACC', 'SPOT_ID', 'Species Scientific Name', 'Annotation Date',
       'Sequence Type', 'Sequence Source', 'Target Description',
       'Representative Public ID', 'Gene Title', 'Gene Symbol',
       'ENTREZ_GENE_ID', 'RefSeq Transcript ID',
       'Gene Ontology Biological Process', 'Gene Ontology Cellular Component',
       'Gene Ontology Molecular Function'],
      dtype='object')

In [185]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'Gene Symbol'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)

In [186]:
genetic_data

Unnamed: 0_level_0,GSM2889381,GSM2889382,GSM2889383,GSM2889384,GSM2889385,GSM2889386,GSM2889387,GSM2889388,GSM2889389,GSM2889390,...,GSM2889414,GSM2889415,GSM2889416,GSM2889417,GSM2889418,GSM2889419,GSM2889420,GSM2889421,GSM2889422,GSM2889423
Gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ABCB4,6.961528,4.614781,7.307627,6.800363,5.353925,7.766458,7.201556,3.322718,4.282999,9.343120,...,6.685381,6.216733,5.839022,6.832801,4.510018,4.907617,10.999279,4.885687,5.614270,6.050458
ABCC6P1,7.857824,8.475545,8.122764,8.262756,6.239403,7.024476,5.726159,4.983406,4.794567,8.003902,...,7.986060,6.931884,6.373791,7.243752,4.903729,4.942779,6.825852,7.099728,6.772416,8.103741
ABCC6P2,7.857824,8.475545,8.122764,8.262756,6.239403,7.024476,5.726159,4.983406,4.794567,8.003902,...,7.986060,6.931884,6.373791,7.243752,4.903729,4.942779,6.825852,7.099728,6.772416,8.103741
ABCD1P2,3.282839,3.384542,3.583015,3.693280,3.659119,3.297807,3.077137,3.715791,3.521275,3.872615,...,3.373987,4.005146,3.765076,3.627202,3.900364,3.796208,3.813467,3.416828,3.381835,3.444124
AC078883.4,4.918088,5.089238,5.002341,5.127558,5.015107,5.063883,4.670841,5.151271,5.563431,6.110059,...,5.280227,5.371916,5.862932,5.396481,5.211429,5.196257,5.523391,5.867906,5.196955,5.182184
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
abParts,10.281154,4.546864,4.466047,4.828189,6.367943,10.036273,5.342937,4.842982,5.424726,6.899275,...,4.452045,6.235608,5.254296,8.209720,4.999726,4.764322,4.767630,4.665493,10.966828,4.681283
alpha,4.390690,3.319960,3.306106,3.474087,3.662178,5.427700,3.728478,3.518694,2.988476,5.461940,...,3.116597,3.572145,3.474087,5.356280,3.226779,3.634203,3.457508,3.818807,5.762159,3.293067
av27s1,2.995613,3.093084,2.893304,3.044181,3.068038,4.723471,3.093084,3.435453,3.606269,4.498578,...,2.989656,3.916891,4.056041,3.388610,3.430777,3.056446,3.026054,3.031052,2.940405,2.991152
hsa-let-7a-3,4.176371,4.526591,4.814560,4.763223,4.640421,4.582546,8.159591,5.007854,4.850381,7.443819,...,4.318515,5.343503,6.769091,4.882332,4.976009,4.807871,4.682796,4.735828,4.525317,4.347550


In [187]:
genetic_data = normalize_gene_symbols_in_index(genetic_data)

In [188]:
genetic_data

Unnamed: 0,GSM2889381,GSM2889382,GSM2889383,GSM2889384,GSM2889385,GSM2889386,GSM2889387,GSM2889388,GSM2889389,GSM2889390,...,GSM2889414,GSM2889415,GSM2889416,GSM2889417,GSM2889418,GSM2889419,GSM2889420,GSM2889421,GSM2889422,GSM2889423
A1BG,5.523549,6.286303,4.722706,6.001011,4.748103,5.784339,5.073090,5.322978,5.552850,10.028951,...,6.278403,6.272556,4.600967,4.657533,5.614995,6.211944,5.638128,6.302610,5.382192,5.806563
A1BG-AS1,5.312258,5.636363,4.377456,4.989505,4.427621,4.967411,6.596631,4.705341,4.475880,6.932284,...,5.400511,5.276987,5.441775,4.602252,5.203488,4.949106,4.697487,4.967411,4.467902,4.628504
A1CF,3.603395,3.772548,3.782952,3.666064,4.272122,4.168318,3.925859,4.190043,4.335184,9.638197,...,3.803316,4.356464,5.655094,3.978726,4.182107,3.903059,3.856601,3.738244,3.695734,3.464957
A2M,8.439460,6.234144,8.099240,7.572022,8.578123,7.410907,7.690870,7.885234,7.469755,9.430534,...,7.195041,7.508384,6.215682,8.150064,8.086973,7.541484,7.209844,7.469303,7.554861,6.911613
A2M-AS1,6.784101,4.090659,5.154216,7.076977,6.310073,5.152857,5.589711,4.979747,4.542321,5.548948,...,5.360263,6.652004,6.177088,7.062321,6.452616,5.194912,5.040409,4.631870,4.431435,6.644617
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZYG11A,3.875443,6.901420,3.827137,3.950170,3.911654,3.895866,2.871190,4.332635,3.780255,4.327800,...,4.271786,3.688504,5.309030,4.069993,4.565107,4.002745,4.078990,3.747381,3.798240,4.433770
ZYG11B,8.324267,8.563336,8.407677,8.419080,8.368285,7.756354,7.599136,7.606243,7.776769,4.288961,...,7.871825,9.108076,4.800296,7.931003,7.282819,7.921015,8.153173,7.890696,7.555687,8.743453
ZYX,6.495728,7.109820,6.811351,7.635857,8.528983,7.296132,9.704778,7.226106,6.748325,9.603028,...,6.175848,6.383634,5.994175,7.025328,6.692828,6.590379,6.212945,6.775403,7.361366,5.754192
ZZEF1,5.451428,6.005562,5.749764,5.674700,5.761255,5.862683,5.487030,5.811785,5.731223,6.066631,...,5.459600,5.614055,6.096445,5.627089,5.651303,5.808518,6.284363,5.593404,5.630627,5.498746


In [189]:
merged_data = geo_merge_clinical_genetic_data(selected_clinical_data, genetic_data)
# The preprocessing runs through, which means is_available should be True
is_available = True

In [190]:
merged_data

Unnamed: 0,Adrenocortical Cancer,A1BG,A1BG-AS1,A1CF,A2M,A2M-AS1,A2ML1,A2MP1,A4GALT,A4GNT,...,ZWILCH,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3
GSM2889381,0.0,5.523549,5.312258,3.603395,8.43946,6.784101,3.275863,3.821555,5.833153,3.929618,...,7.404137,8.066233,7.023525,5.423627,6.130173,3.875443,8.324267,6.495728,5.451428,8.522046
GSM2889382,0.0,6.286303,5.636363,3.772548,6.234144,4.090659,3.355778,3.929229,5.390436,4.152816,...,9.315457,11.280892,6.738129,5.336899,6.629826,6.90142,8.563336,7.10982,6.005562,8.770399
GSM2889383,0.0,4.722706,4.377456,3.782952,8.09924,5.154216,3.262292,4.774103,4.825479,4.381623,...,7.241154,7.623172,7.457557,5.477725,6.04995,3.827137,8.407677,6.811351,5.749764,8.118409
GSM2889384,0.0,6.001011,4.989505,3.666064,7.572022,7.076977,3.777149,4.317872,5.557962,4.262175,...,6.538005,6.921254,7.068052,5.379503,6.370581,3.95017,8.41908,7.635857,5.6747,8.082138
GSM2889385,0.0,4.748103,4.427621,4.272122,8.578123,6.310073,3.301956,4.272148,5.374137,4.317426,...,7.874055,7.974063,6.767503,4.873842,5.980389,3.911654,8.368285,8.528983,5.761255,7.635416
GSM2889386,0.0,5.784339,4.967411,4.168318,7.410907,5.152857,3.246321,3.977649,5.160725,6.856612,...,7.595532,9.056683,6.62446,5.334879,5.982906,3.895866,7.756354,7.296132,5.862683,8.03046
GSM2889387,0.0,5.07309,6.596631,3.925859,7.69087,5.589711,3.595074,3.510832,7.146452,3.627231,...,5.044171,2.99506,5.120191,6.015022,6.120873,2.87119,7.599136,9.704778,5.48703,5.420636
GSM2889388,0.0,5.322978,4.705341,4.190043,7.885234,4.979747,3.422809,3.838083,5.613246,4.328624,...,6.942579,7.837928,6.339747,4.788353,6.381847,4.332635,7.606243,7.226106,5.811785,7.711138
GSM2889389,0.0,5.55285,4.47588,4.335184,7.469755,4.542321,3.403456,4.072234,6.805424,5.046665,...,6.345089,8.777369,7.087475,5.236264,5.944496,3.780255,7.776769,6.748325,5.731223,6.964051
GSM2889390,0.0,10.028951,6.932284,9.638197,9.430534,5.548948,3.741756,4.228405,8.081186,4.556953,...,3.688915,2.822378,3.48363,5.67357,5.90322,4.3278,4.288961,9.603028,6.066631,4.419665


In [191]:
print(f"The merged dataset contains {len(merged_data)} samples.")
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

The merged dataset contains 43 samples.
For the feature 'Adrenocortical Cancer', the least common label is '1.0' with 1 occurrences. This represents 2.33% of the dataset.
The distribution of the feature 'Adrenocortical Cancer' in this dataset is severely biased.



True

In [192]:
if is_available:
    save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data, note='')
else:
    save_cohort_info(cohort, JSON_PATH, is_available)
merged_data.head()
if not is_trait_biased:
    merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)

In [193]:
# Stopped: No gene mapping
cohort = accession_num = "GSE49278"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file
from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Expression profiling by array of 44 adrenocortical carcinomas"
!Series_summary	"Gene expression profiles of adrenocortical carcinomas were analyzed using Affymetrix Human Gene 2.0 ST Array to identify homogeneous molecular subgroups"
!Series_overall_design	"Gene expression profiles of 44 adrenocortical carcinomas were hybridized using Affymetrix Human Gene 2.0 ST Array"


Unnamed: 0,!Sample_geo_accession,GSM1196511,GSM1196512,GSM1196513,GSM1196514,GSM1196515,GSM1196516,GSM1196517,GSM1196518,GSM1196519,...,GSM1196545,GSM1196546,GSM1196547,GSM1196548,GSM1196549,GSM1196550,GSM1196551,GSM1196552,GSM1196553,GSM1196554
0,!Sample_characteristics_ch1,age (years): 70,age (years): 26,age (years): 53,age (years): 73,age (years): 15,age (years): 51,age (years): 63,age (years): 26,age (years): 29,...,age (years): 79,age (years): 28,age (years): 40,age (years): 44,age (years): 28,age (years): 53,age (years): 28,age (years): 52,age (years): 30,age (years): 46
1,!Sample_characteristics_ch1,gender: F,gender: F,gender: F,gender: M,gender: F,gender: F,gender: M,gender: F,gender: M,...,gender: F,gender: F,gender: F,gender: F,gender: F,gender: F,gender: M,gender: M,gender: F,gender: F
2,!Sample_characteristics_ch1,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,...,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma,cell type: Adrenocortical carcinoma


In [194]:
tumor_stage_row = clinical_data.iloc[2]
tumor_stage_row.unique()

array(['!Sample_characteristics_ch1',
       'cell type: Adrenocortical carcinoma'], dtype=object)

In [195]:
is_gene_availabe = True
trait_row = 2
age_row = 0
gender_row = 1

trait_type = 'binary'

# Verify and use the functions generated by GPT

# 这个函数将组织类型（tissue type）转换为有关癫痫存在与否的二进制值。
# 它是基于特定的假设，即如果组织类型是“胰腺导管腺癌”（Pancreatic Ductal Adenocarcinoma），则认为癫痫存在（返回1）；否则，认为癫痫不存在（返回0）。
def convert_trait(tissue_type):
    """
    Convert tissue type to epilepsy presence (binary).
    Assuming epilepsy presence for 'Hippocampus' tissue.
    """
    if tissue_type == 'cell type: Adrenocortical carcinoma':
        return 1  # Epilepsy present
    else:
        return 0  # Epilepsy not present


# def convert_trait(tumor_grade):
#     if (tumor_grade == 'tumor grade: 2' or tumor_grade == 'tumor grade: 3' or tumor_grade == 'tumor grade: 4'):
#         return 1  
#     elif tumor_grade == 'tumor grade: 1':
#         return 0  
#     else:
#         return None

# 这个函数的目的是将年龄的字符串表示转换为一个连续的数值型表示。如果年龄未知（例如，标记为'n.a.'），则返回None。
# 函数尝试从传入的字符串中提取出一个整数作为年龄值。如果字符串的格式不符合预期，导致提取失败，同样返回None。
def convert_age(age_string):
    """
    Convert age string to a continuous numerical value.
    Unknown values are converted to None.
    """
    if age_string.lower() == 'n.a.':
        return None
    try:
        # Extract age as an integer from the string
        age = int(age_string.split(': ')[1])
        return age
    except (ValueError, IndexError):
        # In case of any format error or unexpected string structure
        return None


# 这个函数将性别的字符串表示转换为二进制值，其中“female”对应1，“male”对应0。如果性别未知或字符串不符合预期格式，则返回None。
# It sometimes maps 'female' to 0, and sometimes 1. Does it matter?
def convert_gender(gender_string):
    """
    Convert gender string to a binary value.
    'female' is represented as 1, 'male' as 0.
    Unknown values are converted to None.
    """
    if (gender_string.lower() == 'sex: female' or gender_string.lower() == 'sex: f' or gender_string.lower() == 'gender: female' or gender_string.lower() == 'gender: f'):
        return 1
    elif (gender_string.lower() == 'sex: male' or gender_string.lower() == 'sex: m' or gender_string.lower() == 'gender: male' or gender_string.lower() == 'gender: m') :  # changeed 
        return 0
    else:
        return None

In [196]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()

  clinical_df = clinical_df.applymap(convert_fn)
  clinical_df = clinical_df.applymap(convert_fn)
  clinical_df = clinical_df.applymap(convert_fn)


Unnamed: 0,GSM1196511,GSM1196512,GSM1196513,GSM1196514,GSM1196515,GSM1196516,GSM1196517,GSM1196518,GSM1196519,GSM1196520,...,GSM1196545,GSM1196546,GSM1196547,GSM1196548,GSM1196549,GSM1196550,GSM1196551,GSM1196552,GSM1196553,GSM1196554
Adrenocortical Cancer,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
Age,70,26,53,73,15,51,63,26,29,79,...,79,28,40,44,28,53,28,52,30,46
Gender,1,1,1,0,1,1,0,1,0,1,...,1,1,1,1,1,1,0,0,1,1


In [197]:
genetic_data = get_genetic_data(matrix_file)
genetic_data.head()

Unnamed: 0_level_0,GSM1196511,GSM1196512,GSM1196513,GSM1196514,GSM1196515,GSM1196516,GSM1196517,GSM1196518,GSM1196519,GSM1196520,...,GSM1196545,GSM1196546,GSM1196547,GSM1196548,GSM1196549,GSM1196550,GSM1196551,GSM1196552,GSM1196553,GSM1196554
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
16650001,3.11446,2.761934,3.1917,2.981038,3.113831,2.687413,3.468881,2.411585,3.761057,2.974074,...,2.440173,2.954796,3.445082,3.388275,2.45053,2.293126,3.136449,2.748609,3.587116,3.194252
16650003,2.070307,1.83154,2.303189,2.430376,1.507325,2.382929,2.808405,2.031501,2.797925,2.567698,...,1.833832,2.025689,2.493108,2.3134,1.594192,2.106784,3.733405,2.427485,3.297156,1.92065
16650005,2.532754,3.371765,2.26475,2.647668,2.559651,3.508271,1.959297,2.764491,2.65512,1.712738,...,2.16553,4.164357,3.455904,4.223868,2.515237,2.956488,3.047515,1.870629,2.264684,4.401433
16650007,1.968311,2.229541,1.762466,2.827752,1.62615,2.184046,1.214179,1.664709,1.55988,2.373817,...,3.381329,2.235444,2.027248,1.226888,1.948129,1.840212,2.096553,2.489499,2.316459,1.641595
16650009,1.418189,1.31471,1.571579,1.233351,1.753973,1.033928,1.259945,1.23922,1.104874,1.285327,...,1.765765,0.843827,1.600953,1.369317,0.956487,1.137052,1.658009,1.689291,1.196682,1.994568


In [198]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

gene_annotation

{'ID': ['16657436', '16657440', '16657445', '16657447', '16657450'], 'RANGE_STRAND': ['+', '+', '+', '+', '+'], 'RANGE_START': [12190.0, 29554.0, 69091.0, 160446.0, 317811.0], 'RANGE_END': [13639.0, 31109.0, 70008.0, 161525.0, 328581.0], 'total_probes': [25.0, 28.0, 8.0, 13.0, 36.0], 'GB_ACC': ['NR_046018', nan, nan, nan, 'NR_024368'], 'SPOT_ID': ['chr1:12190-13639', 'chr1:29554-31109', 'chr1:69091-70008', 'chr1:160446-161525', 'chr1:317811-328581'], 'RANGE_GB': ['NC_000001.10', 'NC_000001.10', 'NC_000001.10', 'NC_000001.10', 'NC_000001.10']}

    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    ['16650001', '16650003', '16650005', '16650007', '16650009', '16650011', '16650013', '16650015', '16650017', '16650019', '16650021', '16650023', '16650025', '16650027', '16650029', '16650031', '16650033', '16650035', '16650037', '16650041']
    To get the mapping from those identifiers to actual g

Unnamed: 0,ID,RANGE_STRAND,RANGE_START,RANGE_END,total_probes,GB_ACC,SPOT_ID,RANGE_GB
0,16657436,+,12190.0,13639.0,25.0,NR_046018,chr1:12190-13639,NC_000001.10
1,16657440,+,29554.0,31109.0,28.0,,chr1:29554-31109,NC_000001.10
2,16657445,+,69091.0,70008.0,8.0,,chr1:69091-70008,NC_000001.10
3,16657447,+,160446.0,161525.0,13.0,,chr1:160446-161525,NC_000001.10
4,16657450,+,317811.0,328581.0,36.0,NR_024368,chr1:317811-328581,NC_000001.10
...,...,...,...,...,...,...,...,...
2413168,17127713,5.932226625,,,,,,
2413169,17127715,13.32740757,,,,,,
2413170,17127717,4.268049934,,,,,,
2413171,17127719,13.16387444,,,,,,


In [199]:
gene_annotation.columns

Index(['ID', 'RANGE_STRAND', 'RANGE_START', 'RANGE_END', 'total_probes',
       'GB_ACC', 'SPOT_ID', 'RANGE_GB'],
      dtype='object')

In [200]:
# Finished
cohort = accession_num = "GSE76019"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Gene expression profiling of pediatric adrenocortical tumors of patients treated on the Children's Oncology Group XXX protocol."
!Series_summary	"We have previously observed that expression of HLA genes associate with histology of adrenocortical tumors (PMID 17234769)."
!Series_summary	"Here, we used gene expression microarrays to associate the diagnostic tumor expression of these genes with outcome among 34 patients treated on the COG ARAR0332 protocol."
!Series_overall_design	"We used microarrays to explore the expression profiles of a large group of uniformly-treated pediatric adrenocortical carcinomas."
!Series_overall_design	"Specimens were harvested during surgery and snap frozen in liquid nitrogen to preserve tissue integrity."


Unnamed: 0,!Sample_geo_accession,GSM1972883,GSM1972884,GSM1972885,GSM1972886,GSM1972887,GSM1972888,GSM1972889,GSM1972890,GSM1972891,...,GSM1972907,GSM1972908,GSM1972909,GSM1972910,GSM1972911,GSM1972912,GSM1972913,GSM1972914,GSM1972915,GSM1972916
0,!Sample_characteristics_ch1,histology: ACC,histology: ACC,histology: ACC,histology: ACC,histology: ACC,histology: ACC,histology: ACC,histology: ACC,histology: ACC,...,histology: ACC,histology: ACC,histology: ACC,histology: ACC,histology: ACC,histology: ACC,histology: ACC,histology: ACC,histology: ACC,histology: ACC
1,!Sample_characteristics_ch1,Stage: III,Stage: I,Stage: I,Stage: III,Stage: II,Stage: I,Stage: II,Stage: IV,Stage: I,...,Stage: III,Stage: II,Stage: I,Stage: IV,Stage: III,Stage: I,Stage: I,Stage: IV,Stage: III,Stage: II
2,!Sample_characteristics_ch1,efs.time: 5.07323750855578,efs.time: 5.17453798767967,efs.time: 4.33127994524298,efs.time: 4.50376454483231,efs.time: 4.29568788501027,efs.time: 5.48117727583847,efs.time: 4.290212183436,efs.time: 3.35112936344969,efs.time: 4.87063655030801,...,efs.time: 7.08829568788501,efs.time: 2.01232032854209,efs.time: 1.70841889117043,efs.time: 0.563997262149213,efs.time: 2.45311430527036,efs.time: 2.13004791238877,efs.time: 1.6290212183436,efs.time: 0.750171115674196,efs.time: 1.90828199863107,efs.time: 0.511978097193703
3,!Sample_characteristics_ch1,efs.event: 0,efs.event: 0,efs.event: 0,efs.event: 0,efs.event: 0,efs.event: 0,efs.event: 0,efs.event: 0,efs.event: 0,...,efs.event: 0,efs.event: 0,efs.event: 0,efs.event: 1,efs.event: 0,efs.event: 0,efs.event: 0,efs.event: 1,efs.event: 0,efs.event: 1


In [201]:
tumor_stage_row = clinical_data.iloc[0]
tumor_stage_row.unique()

array(['!Sample_characteristics_ch1', 'histology: ACC'], dtype=object)

In [10]:
is_gene_availabe = True
trait_row = 0
age_row = None
gender_row = None

trait_type = 'binary'

# Verify and use the functions generated by GPT

# 这个函数将组织类型（tissue type）转换为有关癫痫存在与否的二进制值。
# 它是基于特定的假设，即如果组织类型是“胰腺导管腺癌”（Pancreatic Ductal Adenocarcinoma），则认为癫痫存在（返回1）；否则，认为癫痫不存在（返回0）。
def convert_trait(tissue_type):
    """
    Convert tissue type to epilepsy presence (binary).
    Assuming epilepsy presence for 'Hippocampus' tissue.
    """
    if tissue_type == 'histology: ACC':
        return 1  # Epilepsy present
    else:
        return 0  # Epilepsy not present


# def convert_trait(tumor_grade):
#     if (tumor_grade == 'tumor grade: 2' or tumor_grade == 'tumor grade: 3' or tumor_grade == 'tumor grade: 4'):
#         return 1  
#     elif tumor_grade == 'tumor grade: 1':
#         return 0  
#     else:
#         return None

# 这个函数的目的是将年龄的字符串表示转换为一个连续的数值型表示。如果年龄未知（例如，标记为'n.a.'），则返回None。
# 函数尝试从传入的字符串中提取出一个整数作为年龄值。如果字符串的格式不符合预期，导致提取失败，同样返回None。
def convert_age(age_string):
    """
    Convert age string to a continuous numerical value.
    Unknown values are converted to None.
    """
    if age_string.lower() == 'n.a.':
        return None
    try:
        # Extract age as an integer from the string
        age = int(age_string.split(': ')[1])
        return age
    except (ValueError, IndexError):
        # In case of any format error or unexpected string structure
        return None


# 这个函数将性别的字符串表示转换为二进制值，其中“female”对应1，“male”对应0。如果性别未知或字符串不符合预期格式，则返回None。
# It sometimes maps 'female' to 0, and sometimes 1. Does it matter?
def convert_gender(gender_string):
    """
    Convert gender string to a binary value.
    'female' is represented as 1, 'male' as 0.
    Unknown values are converted to None.
    """
    if (gender_string.lower() == 'sex: female' or gender_string.lower() == 'sex: f' or gender_string.lower() == 'gender: female' or gender_string.lower() == 'gender: f'):
        return 1
    elif (gender_string.lower() == 'sex: male' or gender_string.lower() == 'sex: m' or gender_string.lower() == 'gender: male' or gender_string.lower() == 'gender: m') :  # changeed 
        return 0
    else:
        return None

In [203]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()

  clinical_df = clinical_df.applymap(convert_fn)


Unnamed: 0,GSM1972883,GSM1972884,GSM1972885,GSM1972886,GSM1972887,GSM1972888,GSM1972889,GSM1972890,GSM1972891,GSM1972892,...,GSM1972907,GSM1972908,GSM1972909,GSM1972910,GSM1972911,GSM1972912,GSM1972913,GSM1972914,GSM1972915,GSM1972916
Adrenocortical Cancer,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1


In [204]:
genetic_data = get_genetic_data(matrix_file)
genetic_data

Unnamed: 0_level_0,GSM1972883,GSM1972884,GSM1972885,GSM1972886,GSM1972887,GSM1972888,GSM1972889,GSM1972890,GSM1972891,GSM1972892,...,GSM1972907,GSM1972908,GSM1972909,GSM1972910,GSM1972911,GSM1972912,GSM1972913,GSM1972914,GSM1972915,GSM1972916
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1007_PM_s_at,11.198248,7.933991,7.960398,9.394266,10.052110,9.110888,10.281812,10.182810,7.545310,8.466753,...,10.139733,10.113103,7.980672,8.076911,9.057903,7.906223,9.819896,9.295794,9.881930,9.663820
1053_PM_at,6.636225,6.589175,7.139055,6.789730,6.894620,6.224652,6.027137,6.040128,7.009429,5.893735,...,5.476718,6.046398,6.339334,6.445096,7.831981,7.071057,6.239265,7.505965,6.833472,7.676311
117_PM_at,3.567993,3.623477,4.011449,3.738795,3.371988,5.343177,3.733957,3.447104,4.302283,3.524911,...,3.187503,2.973845,2.946495,3.472866,6.236267,3.505690,3.362644,4.042302,3.317488,4.425911
121_PM_at,5.908458,5.115113,6.658453,5.386049,5.481440,5.187689,5.126747,6.048818,6.691498,5.234772,...,6.683220,5.072790,6.936309,6.305064,4.728112,4.960299,6.832099,5.184749,5.206011,5.279768
1255_PM_g_at,2.419798,2.585927,2.580811,2.574198,2.587390,2.546555,2.569797,2.546564,2.604273,2.546547,...,2.418516,2.644498,2.739033,2.756333,2.813312,2.584168,2.686655,2.524916,2.545832,2.671332
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AFFX-ThrX-5_at,5.841242,5.736721,5.930751,5.975872,5.439575,5.470157,4.898815,6.075566,5.827021,6.247272,...,6.464635,6.477860,5.318148,5.011828,5.184427,5.472901,5.296042,6.166961,5.266876,5.218927
AFFX-ThrX-M_at,6.882192,6.936577,6.894621,7.019251,6.407738,6.998752,5.947718,6.902733,6.820512,7.359015,...,7.446475,7.737522,6.722462,6.484635,6.614603,6.902994,6.762439,7.857219,6.839717,6.623335
AFFX-TrpnX-3_at,2.554189,2.572302,2.573867,2.538020,2.481296,2.700988,2.531195,2.713880,2.710393,2.684654,...,2.604511,2.563804,2.426904,2.620220,2.485755,2.644209,2.569908,2.580324,2.593163,2.400546
AFFX-TrpnX-5_at,2.728991,2.746145,2.812163,2.769375,2.712377,2.834052,2.820497,3.123452,2.807358,2.787314,...,2.722211,2.884076,2.907535,2.882485,2.802728,2.752660,2.800514,2.869417,2.901859,2.733121


In [205]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

gene_annotation

{'ID': ['1007_PM_s_at', '1053_PM_at', '117_PM_at', '121_PM_at', '1255_PM_g_at'], 'GB_ACC': ['U48705', 'M87338', 'X51757', 'X69699', 'L36861'], 'SPOT_ID': [nan, nan, nan, nan, nan], 'Species Scientific Name': ['Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens'], 'Annotation Date': ['Aug 20, 2010', 'Aug 20, 2010', 'Aug 20, 2010', 'Aug 20, 2010', 'Aug 20, 2010'], 'Sequence Type': ['Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence'], 'Sequence Source': ['Affymetrix Proprietary Database', 'GenBank', 'Affymetrix Proprietary Database', 'GenBank', 'Affymetrix Proprietary Database'], 'Target Description': ['U48705 /FEATURE=mRNA /DEFINITION=HSU48705 Human receptor tyrosine kinase DDR gene, complete cds', 'M87338 /FEATURE= /DEFINITION=HUMA1SBU Human replication factor C, 40-kDa subunit (A1) mRNA, complete cds', "X51757 /FEATURE=cds /DEFINITION=HSP70B Human heat-shock protein HSP70B' gene", 'X69699 /FEATURE= /DEFINITI

Unnamed: 0,ID,GB_ACC,SPOT_ID,Species Scientific Name,Annotation Date,Sequence Type,Sequence Source,Target Description,Representative Public ID,Gene Title,Gene Symbol,ENTREZ_GENE_ID,RefSeq Transcript ID,Gene Ontology Biological Process,Gene Ontology Cellular Component,Gene Ontology Molecular Function
0,1007_PM_s_at,U48705,,Homo sapiens,"Aug 20, 2010",Exemplar sequence,Affymetrix Proprietary Database,U48705 /FEATURE=mRNA /DEFINITION=HSU48705 Huma...,U48705,discoidin domain receptor tyrosine kinase 1,DDR1,780,NM_001954 /// NM_013993 /// NM_013994,0001558 // regulation of cell growth // inferr...,0005576 // extracellular region // inferred fr...,0000166 // nucleotide binding // inferred from...
1,1053_PM_at,M87338,,Homo sapiens,"Aug 20, 2010",Exemplar sequence,GenBank,M87338 /FEATURE= /DEFINITION=HUMA1SBU Human re...,M87338,"replication factor C (activator 1) 2, 40kDa",RFC2,5982,NM_002914 /// NM_181471,0006260 // DNA replication // not recorded ///...,0005634 // nucleus // inferred from electronic...,0000166 // nucleotide binding // inferred from...
2,117_PM_at,X51757,,Homo sapiens,"Aug 20, 2010",Exemplar sequence,Affymetrix Proprietary Database,X51757 /FEATURE=cds /DEFINITION=HSP70B Human h...,X51757,heat shock 70kDa protein 6 (HSP70B'),HSPA6,3310,NM_002155,0006950 // response to stress // inferred from...,,0000166 // nucleotide binding // inferred from...
3,121_PM_at,X69699,,Homo sapiens,"Aug 20, 2010",Exemplar sequence,GenBank,X69699 /FEATURE= /DEFINITION=HSPAX8A H.sapiens...,X69699,paired box 8,PAX8,7849,NM_003466 /// NM_013951 /// NM_013952 /// NM_0...,0001656 // metanephros development // inferred...,0005634 // nucleus // inferred from electronic...,0003677 // DNA binding // inferred from direct...
4,1255_PM_g_at,L36861,,Homo sapiens,"Aug 20, 2010",Exemplar sequence,Affymetrix Proprietary Database,L36861 /FEATURE=expanded_cds /DEFINITION=HUMGC...,L36861,guanylate cyclase activator 1A (retina),GUCA1A,2978,NM_000409,0007165 // signal transduction // non-traceabl...,0016020 // membrane // inferred from electroni...,0005509 // calcium ion binding // inferred fro...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1915054,AFFX-ThrX-5_at,5.218926936,,,,,,,,,,,,,,
1915055,AFFX-ThrX-M_at,6.623334728,,,,,,,,,,,,,,
1915056,AFFX-TrpnX-3_at,2.400545959,,,,,,,,,,,,,,
1915057,AFFX-TrpnX-5_at,2.733121104,,,,,,,,,,,,,,


In [206]:
gene_annotation.columns

Index(['ID', 'GB_ACC', 'SPOT_ID', 'Species Scientific Name', 'Annotation Date',
       'Sequence Type', 'Sequence Source', 'Target Description',
       'Representative Public ID', 'Gene Title', 'Gene Symbol',
       'ENTREZ_GENE_ID', 'RefSeq Transcript ID',
       'Gene Ontology Biological Process', 'Gene Ontology Cellular Component',
       'Gene Ontology Molecular Function'],
      dtype='object')

In [207]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'Gene Symbol'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)

In [208]:
genetic_data

Unnamed: 0_level_0,GSM1972883,GSM1972884,GSM1972885,GSM1972886,GSM1972887,GSM1972888,GSM1972889,GSM1972890,GSM1972891,GSM1972892,...,GSM1972907,GSM1972908,GSM1972909,GSM1972910,GSM1972911,GSM1972912,GSM1972913,GSM1972914,GSM1972915,GSM1972916
Gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ABCB4,8.146158,10.977810,10.817974,8.328435,7.702278,9.928567,10.189140,10.418385,10.703676,6.957320,...,11.107936,7.462309,10.275823,10.943833,9.631248,9.684394,8.455126,8.191178,6.759635,6.429762
ABCC6P2,7.723734,8.084656,6.997038,6.348997,7.657459,5.774166,6.133703,4.234780,5.267539,5.438076,...,7.690533,5.694893,8.007111,6.423638,5.311714,6.375171,7.027140,7.036405,8.993887,7.220674
ACOT2,8.106406,7.817518,8.597533,7.436279,7.458640,6.967079,7.762308,7.919932,7.028197,8.076595,...,7.144096,8.266708,7.390831,8.582612,8.178542,7.883154,8.806078,7.615123,6.970443,7.752594
ACSM2B,2.661260,2.671672,2.593285,2.577828,2.405197,2.640131,2.487925,2.582874,2.530571,2.657053,...,2.613758,2.626113,2.800945,2.511766,2.525154,2.578395,2.667927,2.626038,2.625373,2.757021
ACSM2B,2.790370,2.759889,2.838374,2.866704,2.899911,2.740075,2.764121,2.775007,2.780009,3.104381,...,2.966712,2.824844,2.741376,2.803511,2.991721,2.802611,2.969826,2.868441,2.845737,2.812486
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZYX,3.707414,3.824522,3.970674,3.953966,3.314021,3.750447,4.481623,5.350892,4.075008,4.400005,...,3.663827,3.892046,3.643871,3.635140,2.974060,3.826820,4.372133,3.554603,3.158875,3.565214
ZZEF1,4.953882,4.852440,4.961707,5.035332,5.086926,5.210269,5.084559,5.412141,4.688110,5.128178,...,5.296990,4.951781,4.795016,4.912349,5.306210,4.846485,5.563509,4.626624,5.128579,4.568179
ZZZ3,8.010302,7.458704,7.640166,7.455243,7.439838,8.055964,7.422100,7.491325,7.031681,7.838766,...,7.369171,7.530654,7.545323,7.729064,7.862185,7.626514,7.703143,7.479374,7.334361,7.149404
psiTPTE22,4.675962,3.057613,4.286915,2.946979,5.315133,3.884333,3.142242,2.875372,3.487803,2.855120,...,2.890170,2.886104,3.314108,4.380311,3.229161,3.017693,3.086272,2.787978,2.905757,5.173520


In [209]:
genetic_data = normalize_gene_symbols_in_index(genetic_data)

In [210]:
genetic_data

Unnamed: 0,GSM1972883,GSM1972884,GSM1972885,GSM1972886,GSM1972887,GSM1972888,GSM1972889,GSM1972890,GSM1972891,GSM1972892,...,GSM1972907,GSM1972908,GSM1972909,GSM1972910,GSM1972911,GSM1972912,GSM1972913,GSM1972914,GSM1972915,GSM1972916
A1BG,3.527235,3.942945,3.671302,3.426890,4.123651,3.610940,3.854197,3.521270,3.702748,3.445015,...,3.452039,3.895396,3.793464,5.371881,3.303510,4.107289,3.424955,5.033872,3.835654,3.242641
A1CF,2.768377,3.032119,3.119842,3.150053,2.992697,3.102878,2.946598,2.842032,2.964397,3.112626,...,3.098948,2.943170,2.772752,2.973187,2.890580,3.198151,2.844137,2.723355,3.003175,3.194115
A2M,6.851029,7.004668,6.664395,7.381467,7.472109,6.890454,6.771337,7.154243,7.811227,7.096253,...,6.461667,7.453001,6.572939,6.456971,6.982855,6.699946,6.735206,5.936276,6.416536,6.893309
A2ML1,2.728755,2.687359,2.488260,2.902495,2.682164,2.781564,2.816926,2.812049,2.587465,2.584169,...,2.742156,4.398266,2.548673,2.714165,2.708875,2.685465,2.743197,2.585032,2.677635,2.906108
A4GALT,5.240525,4.727202,5.289459,4.964199,4.803111,5.253464,5.141669,5.316193,5.079566,5.296088,...,4.850719,4.863967,5.088200,4.287189,4.269244,4.386475,3.986574,4.226469,5.957873,4.183791
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZYG11A,3.654807,3.037838,3.700867,3.236223,2.817592,3.508038,3.216719,3.302665,4.821776,3.345955,...,2.993568,3.124257,3.321288,3.786787,2.562922,2.999410,3.342488,4.395671,3.291762,3.143920
ZYG11B,8.106748,7.736487,7.929907,7.365184,7.334866,7.647479,8.404628,8.471322,7.707527,8.667955,...,8.148548,7.750400,7.696247,7.931895,7.958301,8.363429,8.426547,7.308236,8.403576,7.374028
ZYX,3.707414,3.824522,3.970674,3.953966,3.314021,3.750447,4.481623,5.350892,4.075008,4.400005,...,3.663827,3.892046,3.643871,3.635140,2.974060,3.826820,4.372133,3.554603,3.158875,3.565214
ZZEF1,4.953882,4.852440,4.961707,5.035332,5.086926,5.210269,5.084559,5.412141,4.688110,5.128178,...,5.296990,4.951781,4.795016,4.912349,5.306210,4.846485,5.563509,4.626624,5.128579,4.568179


In [211]:
merged_data = geo_merge_clinical_genetic_data(selected_clinical_data, genetic_data)
# The preprocessing runs through, which means is_available should be True
is_available = True

In [212]:
merged_data

Unnamed: 0,Adrenocortical Cancer,A1BG,A1CF,A2M,A2ML1,A4GALT,A4GNT,AAA1,AAAS,AACS,...,ZWILCH,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3
GSM1972883,1.0,3.527235,2.768377,6.851029,2.728755,5.240525,3.477598,2.727,5.676317,7.434579,...,8.509048,10.628571,5.649859,4.77159,6.386965,3.654807,8.106748,3.707414,4.953882,8.010302
GSM1972884,1.0,3.942945,3.032119,7.004668,2.687359,4.727202,3.252023,2.865537,5.516883,10.974107,...,7.971477,8.86177,5.179289,4.474295,5.412766,3.037838,7.736487,3.824522,4.85244,7.458704
GSM1972885,1.0,3.671302,3.119842,6.664395,2.48826,5.289459,3.501842,2.831561,5.06409,9.689137,...,7.075808,7.899708,5.244801,4.730381,5.308705,3.700867,7.929907,3.970674,4.961707,7.640166
GSM1972886,1.0,3.42689,3.150053,7.381467,2.902495,4.964199,3.465034,2.848093,5.627305,10.187094,...,8.322642,10.53777,5.080376,4.545587,6.046721,3.236223,7.365184,3.953966,5.035332,7.455243
GSM1972887,1.0,4.123651,2.992697,7.472109,2.682164,4.803111,3.165443,2.904853,5.582517,8.040422,...,8.334319,10.887228,5.542302,4.696713,6.357251,2.817592,7.334866,3.314021,5.086926,7.439838
GSM1972888,1.0,3.61094,3.102878,6.890454,2.781564,5.253464,3.304358,2.92575,5.896222,8.707148,...,7.457277,8.960566,4.444406,4.576363,5.658594,3.508038,7.647479,3.750447,5.210269,8.055964
GSM1972889,1.0,3.854197,2.946598,6.771337,2.816926,5.141669,3.375946,2.828268,5.466029,7.212664,...,6.303927,7.350314,6.185353,5.333719,6.151752,3.216719,8.404628,4.481623,5.084559,7.4221
GSM1972890,1.0,3.52127,2.842032,7.154243,2.812049,5.316193,3.417647,2.882909,5.840752,7.864725,...,6.783898,8.696888,5.681972,4.933937,5.540337,3.302665,8.471322,5.350892,5.412141,7.491325
GSM1972891,1.0,3.702748,2.964397,7.811227,2.587465,5.079566,3.310055,2.862623,5.371422,10.000973,...,7.68534,8.678479,5.005627,4.676198,5.377851,4.821776,7.707527,4.075008,4.68811,7.031681
GSM1972892,1.0,3.445015,3.112626,7.096253,2.584169,5.296088,3.36417,2.8343,5.265393,6.611711,...,6.465606,7.714065,5.411703,4.579231,6.845369,3.345955,8.667955,4.400005,5.128178,7.838766


In [213]:
print(f"The merged dataset contains {len(merged_data)} samples.")
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

The merged dataset contains 34 samples.
For the feature 'Adrenocortical Cancer', the least common label is '1.0' with 34 occurrences. This represents 100.00% of the dataset.
The distribution of the feature 'Adrenocortical Cancer' in this dataset is severely biased.



True

In [214]:
if is_available:
    save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data, note='')
else:
    save_cohort_info(cohort, JSON_PATH, is_available)
merged_data.head()
if not is_trait_biased:
    merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)

In [215]:
# Stopped: No gene mapping
cohort = accession_num = "GSE169253"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"microRNA expression profile of pediatric adrenocortical tumors"
!Series_summary	"Here, we used a microarray technique to provide miRNA expression data of a set of 37 adrenocortical tumors (ACT) and 9 non-neoplastic adrenal controls from Brazilian patients assisted in two treatment centers in the state of São Paulo (HC-FMRP-USP and Centro Infantil Boldrini of Campinas)."
!Series_overall_design	"We identified miRNA signatures associated with pediatric adrenocortical tumors and patients' outcome."


Unnamed: 0,!Sample_geo_accession,GSM5191580,GSM5191581,GSM5191582,GSM5191583,GSM5191584,GSM5191585,GSM5191586,GSM5191587,GSM5191588,...,GSM5191616,GSM5191617,GSM5191618,GSM5191619,GSM5191620,GSM5191621,GSM5191622,GSM5191623,GSM5191624,GSM5191625
0,!Sample_characteristics_ch1,tissue: Tumor,tissue: Tumor,tissue: Tumor,tissue: Tumor,tissue: Tumor,tissue: Tumor,tissue: Tumor,tissue: Tumor,tissue: Tumor,...,tissue: Tumor,tissue: Non-neoplastic Adrenal,tissue: Non-neoplastic Adrenal,tissue: Non-neoplastic Adrenal,tissue: Non-neoplastic Adrenal,tissue: Non-neoplastic Adrenal,tissue: Non-neoplastic Adrenal,tissue: Non-neoplastic Adrenal,tissue: Non-neoplastic Adrenal,tissue: Non-neoplastic Adrenal
1,!Sample_characteristics_ch1,gender: Female,gender: Female,gender: Male,gender: Female,gender: Female,gender: Female,gender: Female,gender: Female,gender: Female,...,gender: Female,,,,,,,,,
2,!Sample_characteristics_ch1,age at diagnosis (months): 101,age at diagnosis (months): 13,age at diagnosis (months): 12,age at diagnosis (months): 29,age at diagnosis (months): 18,age at diagnosis (months): 137,age at diagnosis (months): 16,age at diagnosis (months): 95,age at diagnosis (months): 92,...,age at diagnosis (months): 34,,,,,,,,,
3,!Sample_characteristics_ch1,sandrin stage: 2,sandrin stage: 1,sandrin stage: 2,sandrin stage: 1,sandrin stage: 1,sandrin stage: 2,sandrin stage: 1,sandrin stage: 4,sandrin stage: 4,...,sandrin stage: 1,,,,,,,,,
4,!Sample_characteristics_ch1,metastasis: Absent,metastasis: Absent,metastasis: Absent,metastasis: Absent,metastasis: Absent,metastasis: Absent,metastasis: Absent,metastasis: Present,metastasis: Present,...,metastasis: Absent,,,,,,,,,
5,!Sample_characteristics_ch1,relapse: present,relapse: absent,relapse: absent,relapse: present,relapse: absent,relapse: present,relapse: absent,relapse: present,relapse: present,...,relapse: absent,,,,,,,,,
6,!Sample_characteristics_ch1,vital status: dead,vital status: alive,vital status: alive,vital status: dead,vital status: alive,vital status: alive,vital status: alive,vital status: dead,vital status: dead,...,vital status: alive,,,,,,,,,


In [216]:
tumor_stage_row = clinical_data.iloc[0]
tumor_stage_row.unique()

array(['!Sample_characteristics_ch1', 'tissue: Tumor',
       'tissue: Non-neoplastic Adrenal'], dtype=object)

In [217]:
is_gene_availabe = True
trait_row = 2
age_row = None
gender_row = None

trait_type = 'binary'

# Verify and use the functions generated by GPT

# 这个函数将组织类型（tissue type）转换为有关癫痫存在与否的二进制值。
# 它是基于特定的假设，即如果组织类型是“胰腺导管腺癌”（Pancreatic Ductal Adenocarcinoma），则认为癫痫存在（返回1）；否则，认为癫痫不存在（返回0）。
def convert_trait(tissue_type):
    """
    Convert tissue type to epilepsy presence (binary).
    Assuming epilepsy presence for 'Hippocampus' tissue.
    """
    if tissue_type == 'tissue: Tumor':
        return 1  # Epilepsy present
    else:
        return 0  # Epilepsy not present

In [218]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()

  clinical_df = clinical_df.applymap(convert_fn)


Unnamed: 0,GSM5191580,GSM5191581,GSM5191582,GSM5191583,GSM5191584,GSM5191585,GSM5191586,GSM5191587,GSM5191588,GSM5191589,...,GSM5191616,GSM5191617,GSM5191618,GSM5191619,GSM5191620,GSM5191621,GSM5191622,GSM5191623,GSM5191624,GSM5191625
Adrenocortical Cancer,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [219]:
genetic_data = get_genetic_data(matrix_file)
genetic_data

Unnamed: 0_level_0,GSM5191580,GSM5191581,GSM5191582,GSM5191583,GSM5191584,GSM5191585,GSM5191586,GSM5191587,GSM5191588,GSM5191589,...,GSM5191616,GSM5191617,GSM5191618,GSM5191619,GSM5191620,GSM5191621,GSM5191622,GSM5191623,GSM5191624,GSM5191625
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
(-)3xSLv1,6.118841,6.157061,6.335244,6.169506,6.240245,6.182555,6.234796,6.238807,6.326222,6.116632,...,6.275570,6.359939,6.111164,6.431596,6.424909,6.592276,6.277234,6.378845,6.257708,6.248218
A_25_P00010019,6.299119,8.455914,10.030732,8.614498,8.260569,6.156289,9.278782,9.181840,9.148011,7.061471,...,8.409829,6.443060,7.243682,6.375311,6.409273,6.444193,6.223879,6.458491,6.147493,6.358181
A_25_P00010020,6.083133,7.270414,8.654528,7.275610,7.228507,6.077727,7.992475,8.030643,7.993448,6.422413,...,7.339864,6.325392,6.493788,6.214819,6.177738,6.231950,6.111671,6.213652,6.293622,6.265370
A_25_P00010021,6.073802,6.217121,7.033983,6.302130,6.375919,6.176956,6.654326,7.128662,6.347221,6.175253,...,6.516075,6.196105,6.116335,6.210674,6.313971,6.436155,6.091414,6.143457,6.371254,6.194190
A_25_P00010023,6.255995,8.369547,8.928362,6.998827,7.619062,6.196523,7.914588,9.366658,8.620604,6.521392,...,7.135663,6.169613,6.351142,6.261724,6.388602,6.527635,6.128198,6.246212,6.260787,6.352774
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
NC1_00000215,6.267363,6.148166,6.403182,6.150068,6.271648,6.231808,6.283637,6.285456,6.302437,6.176041,...,6.357121,6.406852,6.180986,6.733853,6.699904,6.815947,6.316242,6.685653,6.484752,6.380198
NC2_00079215,6.097034,6.094926,6.254911,6.095361,6.160075,6.150899,6.162629,6.161875,6.243301,6.088003,...,6.222913,6.285008,6.098394,6.350706,6.190702,6.370142,6.253163,6.251502,6.238752,6.184274
NC2_00092197,6.228357,6.112642,6.317175,6.146962,6.252238,6.217719,6.215052,6.227773,6.254591,6.104834,...,6.312603,6.341783,6.105114,6.381271,6.328268,6.633423,6.198914,6.304024,6.285849,6.366904
NC2_00106057,6.137844,6.139085,6.449337,6.161739,6.262088,6.271165,6.260400,6.256848,6.328337,6.124522,...,6.368682,6.474196,6.168280,6.876222,6.709560,7.003939,6.514869,6.711632,6.346856,6.351947


In [220]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

gene_annotation

{'ID': ['(-)3xSLv1', 'A_25_P00010019', 'A_25_P00010020', 'A_25_P00010021', 'A_25_P00010023'], 'miRNA_ID': [nan, 'hsa-miR-329', 'hsa-miR-329', 'hsa-miR-655', 'hsa-miR-369-3p'], 'Accession_String': [nan, 'mir|hsa-miR-329|mir|MIMAT0001629', 'mir|hsa-miR-329|mir|MIMAT0001629', 'mir|hsa-miR-655|mir|MIMAT0003331', 'mir|hsa-miR-369-3p|mir|MIMAT0000721'], 'SPOT_ID': ['NegativeControl', nan, nan, nan, nan]}

    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    ['(-)3xSLv1', 'A_25_P00010019', 'A_25_P00010020', 'A_25_P00010021', 'A_25_P00010023', 'A_25_P00010037', 'A_25_P00010038', 'A_25_P00010039', 'A_25_P00010040', 'A_25_P00010041', 'A_25_P00010042', 'A_25_P00010043', 'A_25_P00010044', 'A_25_P00010047', 'A_25_P00010048', 'A_25_P00010053', 'A_25_P00010054', 'A_25_P00010062', 'A_25_P00010063', 'A_25_P00010070']
    To get the mapping from those identifiers to actual gene symbols, we extracted the ge

Unnamed: 0,ID,miRNA_ID,Accession_String,SPOT_ID
0,(-)3xSLv1,,,NegativeControl
1,A_25_P00010019,hsa-miR-329,mir|hsa-miR-329|mir|MIMAT0001629,
2,A_25_P00010020,hsa-miR-329,mir|hsa-miR-329|mir|MIMAT0001629,
3,A_25_P00010021,hsa-miR-655,mir|hsa-miR-655|mir|MIMAT0003331,
4,A_25_P00010023,hsa-miR-369-3p,mir|hsa-miR-369-3p|mir|MIMAT0000721,
...,...,...,...,...
126792,A_25_P00012251,6.060039071,,
126793,A_25_P00013186,6.008080699,,
126794,A_25_P00013312,5.980089392,,
126795,A_25_P00012716,6.063907312,,


In [221]:
gene_annotation.columns

Index(['ID', 'miRNA_ID', 'Accession_String', 'SPOT_ID'], dtype='object')

In [222]:
# Stopped: No trait 
cohort = accession_num = "GSE67766"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Cancer Cells Hijack PRC2 to Modify Multiple Cytokine Pathways"
!Series_summary	"This SuperSeries is composed of the SubSeries listed below."
!Series_overall_design	"Refer to individual Series"


Unnamed: 0,!Sample_geo_accession,GSM1652385,GSM1652386,GSM1652387,GSM1652388,GSM1652389,GSM1652390,GSM1652391,GSM1652392,GSM1652393,...,GSM1652399,GSM1652400,GSM1652401,GSM1652402,GSM1652403,GSM1652404,GSM1652405,GSM1652406,GSM1652407,GSM1652408
0,!Sample_characteristics_ch1,cell line: SW-13,cell line: SW-13,cell line: SW-13,cell line: SW-13,cell line: SW-13,cell line: SW-13,cell line: SW-13,cell line: SW-13,cell line: SW-13,...,cell line: SW-13,cell line: SW-13,cell line: SW-13,cell line: SW-13,cell line: SW-13,cell line: SW-13,cell line: SW-13,cell line: SW-13,cell line: SW-13,cell line: SW-13


In [223]:
# age + sex (problematic)
cohort = accession_num = "GSE68606"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

BadGzipFile: Not a gzipped file (b'!S')

In [11]:
# Finished
cohort = accession_num = "GSE21660"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Advancing a Clinically Relevant Perspective of the Clonal Nature of Cancer"
!Series_summary	"We used DNA content-based flow cytometry to distinguish and isolate nuclei from clonal populations in primary tissues from three disparate cancers with variable clinical histories. We then developed a methodology to adapt flow cytometrically purified nuclei samples for use with whole genome technologies including aCGH and next generation sequencing (NGS). Our results demonstrate that selected aberrations in the genomes of distinct clonal populations in each patient create clinically relevant contexts at least with respect to the cancer types profiled in this study."
!Series_overall_design	"We applied DNA content based flow sorting to isolate the nuclei of clonal populations from tumor biopsies. Genomic DNA from each sorted population was amplified with phi29 polymerase. A 1ug aliquot of each amplified sample was digested with DNAse 1 then labeled with Cy5 using a Klenow-based com

Unnamed: 0,!Sample_geo_accession,GSM540550,GSM540551,GSM540552,GSM540553,GSM540554,GSM540555,GSM540556,GSM540557,GSM540558,...,GSM540601,GSM540602,GSM540603,GSM540604,GSM540605,GSM540606,GSM540607,GSM540608,GSM540609,GSM540610
0,!Sample_characteristics_ch1,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,...,tissue: Adrenal Cortical Carcinoma,tissue: Adrenal Cortical Carcinoma,tissue: Adrenal Cortical Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma


In [12]:
tumor_stage_row = clinical_data.iloc[0]
tumor_stage_row.unique()

array(['!Sample_characteristics_ch1',
       'tissue: Pancreatic Ductal Adenocarcinoma',
       'tissue: Adrenal Cortical Carcinoma', 'tissue: Prostate Carcinoma'],
      dtype=object)

In [13]:
is_gene_availabe = True
trait_row = 0
age_row = None
gender_row = None

def convert_trait(tissue_type):
    """
    Convert tissue type to epilepsy presence (binary).
    Assuming epilepsy presence for 'Hippocampus' tissue.
    """
    if tissue_type == 'tissue: Adrenal Cortical Carcinoma':
        return 1  # Epilepsy present
    else:
        return 0  # Epilepsy not present

In [14]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()

  clinical_df = clinical_df.applymap(convert_fn)


Unnamed: 0,GSM540550,GSM540551,GSM540552,GSM540553,GSM540554,GSM540555,GSM540556,GSM540557,GSM540558,GSM540559,...,GSM540601,GSM540602,GSM540603,GSM540604,GSM540605,GSM540606,GSM540607,GSM540608,GSM540609,GSM540610
Adrenocortical Cancer,0,0,0,0,0,0,0,0,0,0,...,1,1,1,0,0,0,0,0,0,0


In [15]:
genetic_data = get_genetic_data(matrix_file)
genetic_data

Unnamed: 0_level_0,GSM540550,GSM540551,GSM540552,GSM540553,GSM540554,GSM540555,GSM540556,GSM540557,GSM540558,GSM540559,...,GSM540601,GSM540602,GSM540603,GSM540604,GSM540605,GSM540606,GSM540607,GSM540608,GSM540609,GSM540610
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0201,0.00599,0.0117,-0.0779,0.01020,-0.00558,-0.0871,-0.0895,-0.1140,-0.00695,...,0.0592,0.0563,0.002070,0.0208,0.00841,0.0122,-0.01280,0.0748,-0.000229,-0.0148
2,0.0000,0.00000,0.0000,-0.1730,-0.43800,0.00000,0.0000,-0.6140,-0.0190,0.00000,...,0.0000,0.0000,0.000000,0.0000,0.00000,0.0000,0.00000,-0.5670,-0.004620,-0.4700
3,-0.2000,0.00000,0.0000,0.0000,0.00000,0.00000,0.0000,-0.3190,0.0000,0.00000,...,0.0000,0.0000,0.000000,0.0000,0.00000,0.0000,0.00000,-0.4420,0.000000,-0.6850
4,0.1210,0.19900,0.1570,0.2230,0.01230,0.13200,0.0682,0.1330,0.1480,0.03560,...,0.1920,0.3650,0.364000,0.0321,-0.24800,0.0240,-0.08650,0.0336,0.167000,0.0337
5,-0.1050,-0.06010,-0.1140,-0.1780,-0.00527,0.00652,-0.1610,-0.1200,-0.1520,-0.02830,...,0.0991,-0.0738,-0.058400,0.0120,-0.06350,-0.1470,-0.24700,0.1170,-0.115000,0.0135
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
243500,-0.2810,-0.34800,-0.3390,-0.2320,0.02470,0.02690,-0.2440,0.0160,-0.1790,-0.34600,...,0.1480,-0.1520,-0.211000,-0.0559,0.08870,-0.0608,0.00423,-0.4000,-0.069500,-0.0305
243501,-0.1770,0.00000,-0.2670,0.0000,-0.28800,0.00000,0.0000,0.0000,0.0433,0.00000,...,0.2160,0.0000,0.122000,0.0000,0.00000,0.0000,0.00000,0.0000,0.000000,-1.0400
243502,-0.5300,1.30000,0.0179,0.0000,-0.66700,0.00000,0.0000,0.0000,0.0000,0.00000,...,0.0000,-0.2540,0.000000,0.0000,0.00000,0.0000,0.00000,-0.1480,-0.302000,-0.9700
243503,0.0344,0.02240,0.0275,-0.0692,0.03500,0.01120,-0.0938,-0.0330,-0.1030,0.00613,...,0.0709,0.0626,-0.000328,0.0182,0.01140,0.0342,-0.03030,0.0543,0.014000,-0.0138


In [16]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

gene_annotation

{'ID': ['1', '2', '3', '4', '5'], 'COL': ['267', '267', '267', '267', '267'], 'ROW': [912.0, 910.0, 908.0, 906.0, 904.0], 'SPOT_ID': ['HsCGHBrightCorner', 'DarkCorner', 'DarkCorner', 'A_16_P20527812', 'A_16_P01708709'], 'CONTROL_TYPE': ['pos', 'pos', 'pos', 'FALSE', 'FALSE'], 'GB_ACC': [nan, nan, nan, nan, 'NM_138295'], 'GENE_SYMBOL': [nan, nan, nan, nan, 'PKD1L1'], 'GENE_NAME': [nan, nan, nan, nan, 'polycystic kidney disease 1 like 1'], 'ACCESSION_STRING': [nan, nan, nan, nan, 'ref|NM_138295|ref|NM_025031'], 'CHROMOSOMAL_LOCATION': [nan, nan, nan, 'chr16:076331867-076331926', 'chr7:047626734-047626793'], 'CYTOBAND': [nan, nan, nan, 'hs|q23.1', 'hs|p12.3'], 'DESCRIPTION': [nan, nan, nan, nan, 'Homo sapiens polycystic kidney disease 1 like 1 (PKD1L1), mRNA.'], 'GB_RANGE': [nan, nan, nan, 'NC_000016.8[076331867..076331926]', 'NC_000007.11[047626734..047626793]']}

    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identif

Unnamed: 0,ID,COL,ROW,SPOT_ID,CONTROL_TYPE,GB_ACC,GENE_SYMBOL,GENE_NAME,ACCESSION_STRING,CHROMOSOMAL_LOCATION,CYTOBAND,DESCRIPTION,GB_RANGE
0,1,267,912.0,HsCGHBrightCorner,pos,,,,,,,,
1,2,267,910.0,DarkCorner,pos,,,,,,,,
2,3,267,908.0,DarkCorner,pos,,,,,,,,
3,4,267,906.0,A_16_P20527812,FALSE,,,,,chr16:076331867-076331926,hs|q23.1,,NC_000016.8[076331867..076331926]
4,5,267,904.0,A_16_P01708709,FALSE,NM_138295,PKD1L1,polycystic kidney disease 1 like 1,ref|NM_138295|ref|NM_025031,chr7:047626734-047626793,hs|p12.3,Homo sapiens polycystic kidney disease 1 like ...,NC_000007.11[047626734..047626793]
...,...,...,...,...,...,...,...,...,...,...,...,...,...
15092790,243500,-3.05E-02,,,,,,,,,,,
15092791,243501,-1.04E+00,,,,,,,,,,,
15092792,243502,-9.70E-01,,,,,,,,,,,
15092793,243503,-1.38E-02,,,,,,,,,,,


In [17]:
gene_annotation.columns

Index(['ID', 'COL', 'ROW', 'SPOT_ID', 'CONTROL_TYPE', 'GB_ACC', 'GENE_SYMBOL',
       'GENE_NAME', 'ACCESSION_STRING', 'CHROMOSOMAL_LOCATION', 'CYTOBAND',
       'DESCRIPTION', 'GB_RANGE'],
      dtype='object')

In [18]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'GENE_SYMBOL'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)

In [19]:
genetic_data

Unnamed: 0_level_0,GSM540550,GSM540551,GSM540552,GSM540553,GSM540554,GSM540555,GSM540556,GSM540557,GSM540558,GSM540559,...,GSM540601,GSM540602,GSM540603,GSM540604,GSM540605,GSM540606,GSM540607,GSM540608,GSM540609,GSM540610
Gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
,-0.091366,-0.009954,0.064026,0.223229,0.015514,0.133943,0.010000,0.036996,-0.109300,0.058823,...,0.152000,0.040586,0.114029,-0.021514,0.110771,0.135557,0.125829,0.009514,0.174671,0.049243
15E1.2,-0.018767,-0.036233,0.126533,0.015967,-0.001400,-0.229000,0.095173,-0.137293,-0.145300,0.137067,...,0.146700,0.267000,0.207333,0.039400,-0.047812,0.011360,-0.109767,0.300333,0.013167,-0.008633
2'-PDE,-0.095650,-0.038200,-0.273500,-0.188500,0.077200,-0.170500,-0.023250,-0.265000,-0.224050,0.028700,...,-0.189500,-0.117450,-0.166800,-0.037500,-0.123500,0.019000,-0.111050,0.182000,-0.032200,-0.052400
76P,-0.177225,-0.016975,-0.015500,-0.230250,-0.381500,-0.132625,-0.074625,0.013000,-0.158050,-0.093265,...,0.067575,0.068450,-0.156500,0.048800,-0.054375,-0.010340,-0.110975,0.131200,-0.107950,-0.035025
7A5,-0.060111,0.072104,0.055131,-0.020300,0.201556,0.109311,0.029308,0.029044,0.107101,-0.007541,...,0.151611,0.148789,0.090900,-0.000616,0.151533,0.079336,-0.004712,-0.077400,-0.000033,0.031200
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
bA9F11.1,0.150150,0.095300,-0.018000,0.263000,0.034750,-0.021791,0.147900,-0.086950,-0.105000,-0.064150,...,-0.048850,0.065850,0.111600,-0.127850,-0.217500,-0.151800,-0.283000,-0.072160,0.191000,0.003150
dJ341D10.1,0.006610,-0.161000,-0.120000,-0.133000,-0.138000,-0.177000,-0.132000,0.051200,-0.039200,-0.014300,...,-0.188000,-0.012400,0.119000,-0.193000,-0.092300,-0.124000,-0.271000,-0.176000,-0.100000,-0.269000
hCAP-D3,-0.104746,-0.010522,0.026106,0.072607,-0.014310,0.027170,0.043281,0.107378,0.029616,-0.188333,...,0.173267,-0.299222,-0.090567,-0.004926,0.055044,0.054733,0.012011,0.007011,0.017700,0.026992
hCAP-H2,0.053800,0.027662,0.011612,0.063725,-0.114200,0.026400,0.140775,-0.134775,-0.026850,0.035800,...,-0.031643,-0.244000,0.083365,-0.079825,-0.076025,-0.071883,-0.031462,0.068550,0.098075,0.060400


In [20]:
genetic_data = normalize_gene_symbols_in_index(genetic_data)

In [21]:
genetic_data

Unnamed: 0,GSM540550,GSM540551,GSM540552,GSM540553,GSM540554,GSM540555,GSM540556,GSM540557,GSM540558,GSM540559,...,GSM540601,GSM540602,GSM540603,GSM540604,GSM540605,GSM540606,GSM540607,GSM540608,GSM540609,GSM540610
A1BG,0.261000,0.023700,0.117000,0.154000,0.141000,0.239000,0.197000,0.216000,0.318000,0.085300,...,-0.164000,0.109000,0.205000,-0.053900,-0.069000,-0.034000,0.054000,-0.154000,0.242000,0.074100
A2M,-0.071700,-0.075717,0.184000,0.176317,0.034750,0.049178,-0.039567,0.045017,0.107967,0.116025,...,0.539667,0.245167,0.143050,0.039052,-0.059468,0.115183,0.112083,0.168233,0.139317,0.074317
A2ML1,-0.056388,-0.025460,0.121440,0.166813,0.002068,0.105455,0.087587,0.114687,0.150363,0.049269,...,0.537412,0.228625,0.226325,0.063661,0.065370,-0.013030,-0.029762,0.093887,0.101675,-0.023225
A4GALT,0.115900,0.056557,-0.057271,0.139629,-0.060130,-0.018657,0.206329,-0.027757,-0.015286,-0.037541,...,-0.090871,-0.217963,-0.032471,-0.165701,-0.124071,-0.120486,-0.127914,0.012549,0.150286,-0.038199
A4GNT,0.185500,-0.096250,-0.289000,-0.030060,-0.063700,-0.112750,-0.056290,0.054900,0.011600,0.062265,...,-0.253000,0.114100,-0.105850,-0.118550,-0.061350,0.013800,0.006250,-0.114600,-0.029010,-0.064800
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZYG11A,0.021000,-0.025660,-0.214200,-0.152360,-0.020380,-0.004088,-0.022060,0.012460,0.035980,-0.095848,...,-0.150060,-0.114840,-0.113584,-0.069040,-0.112720,-0.080224,-0.191920,-0.193660,-0.168400,-0.114120
ZYG11B,0.018978,-0.048632,-0.240545,-0.211436,-0.014087,-0.029066,0.012171,0.010093,0.011982,-0.161527,...,-0.117764,-0.096468,-0.157382,-0.115042,-0.050454,-0.110373,-0.230038,-0.178318,-0.209545,-0.195591
ZYX,-0.034700,0.021967,0.089867,0.084667,-0.098300,0.109900,0.071067,-0.106867,-0.011533,0.158667,...,0.127000,0.010613,0.163800,-0.020167,-0.006195,0.057200,0.041467,0.072733,0.206333,0.196667
ZZEF1,-0.129433,-0.158727,-0.175639,-0.123181,-0.021133,-0.288200,-0.128013,-0.138720,-0.097573,0.044883,...,-0.167607,-0.057523,-0.091887,0.025984,-0.078663,-0.026909,-0.100020,0.083072,-0.234480,-0.077160


In [22]:
merged_data = geo_merge_clinical_genetic_data(selected_clinical_data, genetic_data)
# The preprocessing runs through, which means is_available should be True
is_available = True

In [23]:
merged_data

Unnamed: 0,Adrenocortical Cancer,A1BG,A2M,A2ML1,A4GALT,A4GNT,AAA1,AAAS,AACS,AADAC,...,ZW10,ZWILCH,ZWINT,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3
GSM540550,0.0,0.2610,-0.071700,-0.056388,0.115900,0.18550,-0.074332,0.284667,-0.050671,0.14100,...,-0.011806,-0.069583,0.0818,-0.26000,0.19156,0.021000,0.018978,-0.034700,-0.129433,-0.079166
GSM540551,0.0,0.0237,-0.075717,-0.025460,0.056557,-0.09625,0.058537,0.048833,-0.124633,-0.00876,...,0.074980,0.105617,0.1890,-0.39000,-0.02402,-0.025660,-0.048632,0.021967,-0.158727,-0.043763
GSM540552,0.0,0.1170,0.184000,0.121440,-0.057271,-0.28900,0.049406,0.168933,0.118744,-0.20600,...,0.117400,0.014768,0.1560,-0.51600,-0.26312,-0.214200,-0.240545,0.089867,-0.175639,-0.262375
GSM540553,0.0,0.1540,0.176317,0.166813,0.139629,-0.03006,-0.036533,0.314667,0.071756,0.12700,...,0.120940,0.090387,0.0823,-0.42100,0.12542,-0.152360,-0.211436,0.084667,-0.123181,-0.159617
GSM540554,0.0,0.1410,0.034750,0.002068,-0.060130,-0.06370,0.322153,0.049267,-0.056361,0.14400,...,0.015800,0.022417,0.0821,-0.58000,0.07364,-0.020380,-0.014087,-0.098300,-0.021133,0.018503
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GSM540606,0.0,-0.0340,0.115183,-0.013030,-0.120486,0.01380,0.096020,-0.082280,-0.084410,0.14000,...,0.053585,0.004537,0.0128,0.00522,-0.05314,-0.080224,-0.110373,0.057200,-0.026909,-0.087575
GSM540607,0.0,0.0540,0.112083,-0.029762,-0.127914,0.00625,0.126520,-0.074100,-0.090500,0.15000,...,-0.000366,-0.128767,0.0391,0.53800,-0.00796,-0.191920,-0.230038,0.041467,-0.100020,-0.126593
GSM540608,0.0,-0.1540,0.168233,0.093887,0.012549,-0.11460,-0.061250,0.163033,0.201889,0.07380,...,0.075702,0.154717,0.1470,-0.50600,0.14348,-0.193660,-0.178318,0.072733,0.083072,-0.300383
GSM540609,0.0,0.2420,0.139317,0.101675,0.150286,-0.02901,0.000785,0.065833,-0.030744,0.07190,...,0.066480,-0.007978,0.1480,-0.48600,0.05238,-0.168400,-0.209545,0.206333,-0.234480,-0.298500


In [24]:
print(f"The merged dataset contains {len(merged_data)} samples.")
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

The merged dataset contains 61 samples.
For the feature 'Adrenocortical Cancer', the least common label is '1.0' with 25 occurrences. This represents 40.98% of the dataset.
The distribution of the feature 'Adrenocortical Cancer' in this dataset is fine.



False

In [25]:
if is_available:
    save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data, note='')
else:
    save_cohort_info(cohort, JSON_PATH, is_available)
merged_data.head()
if not is_trait_biased:
    merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)

In [26]:
# Finished

cohort = accession_num = "GSE21660"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Advancing a Clinically Relevant Perspective of the Clonal Nature of Cancer"
!Series_summary	"We used DNA content-based flow cytometry to distinguish and isolate nuclei from clonal populations in primary tissues from three disparate cancers with variable clinical histories. We then developed a methodology to adapt flow cytometrically purified nuclei samples for use with whole genome technologies including aCGH and next generation sequencing (NGS). Our results demonstrate that selected aberrations in the genomes of distinct clonal populations in each patient create clinically relevant contexts at least with respect to the cancer types profiled in this study."
!Series_overall_design	"We applied DNA content based flow sorting to isolate the nuclei of clonal populations from tumor biopsies. Genomic DNA from each sorted population was amplified with phi29 polymerase. A 1ug aliquot of each amplified sample was digested with DNAse 1 then labeled with Cy5 using a Klenow-based com

Unnamed: 0,!Sample_geo_accession,GSM540550,GSM540551,GSM540552,GSM540553,GSM540554,GSM540555,GSM540556,GSM540557,GSM540558,...,GSM540601,GSM540602,GSM540603,GSM540604,GSM540605,GSM540606,GSM540607,GSM540608,GSM540609,GSM540610
0,!Sample_characteristics_ch1,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,...,tissue: Adrenal Cortical Carcinoma,tissue: Adrenal Cortical Carcinoma,tissue: Adrenal Cortical Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma


In [None]:
is_gene_availabe = True
trait_row = 0
age_row = None
gender_row = None

def convert_trait(tissue_type):
    """
    Convert tissue type to epilepsy presence (binary).
    Assuming epilepsy presence for 'Hippocampus' tissue.
    """
    if tissue_type == 'tissue: Adrenal Cortical Carcinoma':
        return 1  # Epilepsy present
    else:
        return 0  # Epilepsy not present

In [None]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()

  clinical_df = clinical_df.applymap(convert_fn)


Unnamed: 0,GSM540550,GSM540551,GSM540552,GSM540553,GSM540554,GSM540555,GSM540556,GSM540557,GSM540558,GSM540559,...,GSM540601,GSM540602,GSM540603,GSM540604,GSM540605,GSM540606,GSM540607,GSM540608,GSM540609,GSM540610
Adrenocortical Cancer,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0


In [None]:
genetic_data = get_genetic_data(matrix_file)
genetic_data

Unnamed: 0_level_0,GSM540550,GSM540551,GSM540552,GSM540553,GSM540554,GSM540555,GSM540556,GSM540557,GSM540558,GSM540559,...,GSM540601,GSM540602,GSM540603,GSM540604,GSM540605,GSM540606,GSM540607,GSM540608,GSM540609,GSM540610
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0201,0.00599,0.0117,-0.0779,0.01020,-0.00558,-0.0871,-0.0895,-0.1140,-0.00695,...,0.0592,0.0563,0.002070,0.0208,0.00841,0.0122,-0.01280,0.0748,-0.000229,-0.0148
2,0.0000,0.00000,0.0000,-0.1730,-0.43800,0.00000,0.0000,-0.6140,-0.0190,0.00000,...,0.0000,0.0000,0.000000,0.0000,0.00000,0.0000,0.00000,-0.5670,-0.004620,-0.4700
3,-0.2000,0.00000,0.0000,0.0000,0.00000,0.00000,0.0000,-0.3190,0.0000,0.00000,...,0.0000,0.0000,0.000000,0.0000,0.00000,0.0000,0.00000,-0.4420,0.000000,-0.6850
4,0.1210,0.19900,0.1570,0.2230,0.01230,0.13200,0.0682,0.1330,0.1480,0.03560,...,0.1920,0.3650,0.364000,0.0321,-0.24800,0.0240,-0.08650,0.0336,0.167000,0.0337
5,-0.1050,-0.06010,-0.1140,-0.1780,-0.00527,0.00652,-0.1610,-0.1200,-0.1520,-0.02830,...,0.0991,-0.0738,-0.058400,0.0120,-0.06350,-0.1470,-0.24700,0.1170,-0.115000,0.0135
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
243500,-0.2810,-0.34800,-0.3390,-0.2320,0.02470,0.02690,-0.2440,0.0160,-0.1790,-0.34600,...,0.1480,-0.1520,-0.211000,-0.0559,0.08870,-0.0608,0.00423,-0.4000,-0.069500,-0.0305
243501,-0.1770,0.00000,-0.2670,0.0000,-0.28800,0.00000,0.0000,0.0000,0.0433,0.00000,...,0.2160,0.0000,0.122000,0.0000,0.00000,0.0000,0.00000,0.0000,0.000000,-1.0400
243502,-0.5300,1.30000,0.0179,0.0000,-0.66700,0.00000,0.0000,0.0000,0.0000,0.00000,...,0.0000,-0.2540,0.000000,0.0000,0.00000,0.0000,0.00000,-0.1480,-0.302000,-0.9700
243503,0.0344,0.02240,0.0275,-0.0692,0.03500,0.01120,-0.0938,-0.0330,-0.1030,0.00613,...,0.0709,0.0626,-0.000328,0.0182,0.01140,0.0342,-0.03030,0.0543,0.014000,-0.0138


In [None]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

gene_annotation

{'ID': ['1', '2', '3', '4', '5'], 'COL': ['267', '267', '267', '267', '267'], 'ROW': [912.0, 910.0, 908.0, 906.0, 904.0], 'SPOT_ID': ['HsCGHBrightCorner', 'DarkCorner', 'DarkCorner', 'A_16_P20527812', 'A_16_P01708709'], 'CONTROL_TYPE': ['pos', 'pos', 'pos', 'FALSE', 'FALSE'], 'GB_ACC': [nan, nan, nan, nan, 'NM_138295'], 'GENE_SYMBOL': [nan, nan, nan, nan, 'PKD1L1'], 'GENE_NAME': [nan, nan, nan, nan, 'polycystic kidney disease 1 like 1'], 'ACCESSION_STRING': [nan, nan, nan, nan, 'ref|NM_138295|ref|NM_025031'], 'CHROMOSOMAL_LOCATION': [nan, nan, nan, 'chr16:076331867-076331926', 'chr7:047626734-047626793'], 'CYTOBAND': [nan, nan, nan, 'hs|q23.1', 'hs|p12.3'], 'DESCRIPTION': [nan, nan, nan, nan, 'Homo sapiens polycystic kidney disease 1 like 1 (PKD1L1), mRNA.'], 'GB_RANGE': [nan, nan, nan, 'NC_000016.8[076331867..076331926]', 'NC_000007.11[047626734..047626793]']}

    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identif

Unnamed: 0,ID,COL,ROW,SPOT_ID,CONTROL_TYPE,GB_ACC,GENE_SYMBOL,GENE_NAME,ACCESSION_STRING,CHROMOSOMAL_LOCATION,CYTOBAND,DESCRIPTION,GB_RANGE
0,1,267,912.0,HsCGHBrightCorner,pos,,,,,,,,
1,2,267,910.0,DarkCorner,pos,,,,,,,,
2,3,267,908.0,DarkCorner,pos,,,,,,,,
3,4,267,906.0,A_16_P20527812,FALSE,,,,,chr16:076331867-076331926,hs|q23.1,,NC_000016.8[076331867..076331926]
4,5,267,904.0,A_16_P01708709,FALSE,NM_138295,PKD1L1,polycystic kidney disease 1 like 1,ref|NM_138295|ref|NM_025031,chr7:047626734-047626793,hs|p12.3,Homo sapiens polycystic kidney disease 1 like ...,NC_000007.11[047626734..047626793]
...,...,...,...,...,...,...,...,...,...,...,...,...,...
15092790,243500,-3.05E-02,,,,,,,,,,,
15092791,243501,-1.04E+00,,,,,,,,,,,
15092792,243502,-9.70E-01,,,,,,,,,,,
15092793,243503,-1.38E-02,,,,,,,,,,,


In [None]:
gene_annotation.columns

Index(['ID', 'COL', 'ROW', 'SPOT_ID', 'CONTROL_TYPE', 'GB_ACC', 'GENE_SYMBOL',
       'GENE_NAME', 'ACCESSION_STRING', 'CHROMOSOMAL_LOCATION', 'CYTOBAND',
       'DESCRIPTION', 'GB_RANGE'],
      dtype='object')

In [None]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'GENE_SYMBOL'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)

In [None]:
genetic_data

Unnamed: 0_level_0,GSM540550,GSM540551,GSM540552,GSM540553,GSM540554,GSM540555,GSM540556,GSM540557,GSM540558,GSM540559,...,GSM540601,GSM540602,GSM540603,GSM540604,GSM540605,GSM540606,GSM540607,GSM540608,GSM540609,GSM540610
Gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
,-0.091366,-0.009954,0.064026,0.223229,0.015514,0.133943,0.010000,0.036996,-0.109300,0.058823,...,0.152000,0.040586,0.114029,-0.021514,0.110771,0.135557,0.125829,0.009514,0.174671,0.049243
15E1.2,-0.018767,-0.036233,0.126533,0.015967,-0.001400,-0.229000,0.095173,-0.137293,-0.145300,0.137067,...,0.146700,0.267000,0.207333,0.039400,-0.047812,0.011360,-0.109767,0.300333,0.013167,-0.008633
2'-PDE,-0.095650,-0.038200,-0.273500,-0.188500,0.077200,-0.170500,-0.023250,-0.265000,-0.224050,0.028700,...,-0.189500,-0.117450,-0.166800,-0.037500,-0.123500,0.019000,-0.111050,0.182000,-0.032200,-0.052400
76P,-0.177225,-0.016975,-0.015500,-0.230250,-0.381500,-0.132625,-0.074625,0.013000,-0.158050,-0.093265,...,0.067575,0.068450,-0.156500,0.048800,-0.054375,-0.010340,-0.110975,0.131200,-0.107950,-0.035025
7A5,-0.060111,0.072104,0.055131,-0.020300,0.201556,0.109311,0.029308,0.029044,0.107101,-0.007541,...,0.151611,0.148789,0.090900,-0.000616,0.151533,0.079336,-0.004712,-0.077400,-0.000033,0.031200
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
bA9F11.1,0.150150,0.095300,-0.018000,0.263000,0.034750,-0.021791,0.147900,-0.086950,-0.105000,-0.064150,...,-0.048850,0.065850,0.111600,-0.127850,-0.217500,-0.151800,-0.283000,-0.072160,0.191000,0.003150
dJ341D10.1,0.006610,-0.161000,-0.120000,-0.133000,-0.138000,-0.177000,-0.132000,0.051200,-0.039200,-0.014300,...,-0.188000,-0.012400,0.119000,-0.193000,-0.092300,-0.124000,-0.271000,-0.176000,-0.100000,-0.269000
hCAP-D3,-0.104746,-0.010522,0.026106,0.072607,-0.014310,0.027170,0.043281,0.107378,0.029616,-0.188333,...,0.173267,-0.299222,-0.090567,-0.004926,0.055044,0.054733,0.012011,0.007011,0.017700,0.026992
hCAP-H2,0.053800,0.027662,0.011612,0.063725,-0.114200,0.026400,0.140775,-0.134775,-0.026850,0.035800,...,-0.031643,-0.244000,0.083365,-0.079825,-0.076025,-0.071883,-0.031462,0.068550,0.098075,0.060400


In [None]:
genetic_data = normalize_gene_symbols_in_index(genetic_data)

In [None]:
genetic_data

Unnamed: 0,GSM540550,GSM540551,GSM540552,GSM540553,GSM540554,GSM540555,GSM540556,GSM540557,GSM540558,GSM540559,...,GSM540601,GSM540602,GSM540603,GSM540604,GSM540605,GSM540606,GSM540607,GSM540608,GSM540609,GSM540610
A1BG,0.261000,0.023700,0.117000,0.154000,0.141000,0.239000,0.197000,0.216000,0.318000,0.085300,...,-0.164000,0.109000,0.205000,-0.053900,-0.069000,-0.034000,0.054000,-0.154000,0.242000,0.074100
A2M,-0.071700,-0.075717,0.184000,0.176317,0.034750,0.049178,-0.039567,0.045017,0.107967,0.116025,...,0.539667,0.245167,0.143050,0.039052,-0.059468,0.115183,0.112083,0.168233,0.139317,0.074317
A2ML1,-0.056388,-0.025460,0.121440,0.166813,0.002068,0.105455,0.087587,0.114687,0.150363,0.049269,...,0.537412,0.228625,0.226325,0.063661,0.065370,-0.013030,-0.029762,0.093887,0.101675,-0.023225
A4GALT,0.115900,0.056557,-0.057271,0.139629,-0.060130,-0.018657,0.206329,-0.027757,-0.015286,-0.037541,...,-0.090871,-0.217963,-0.032471,-0.165701,-0.124071,-0.120486,-0.127914,0.012549,0.150286,-0.038199
A4GNT,0.185500,-0.096250,-0.289000,-0.030060,-0.063700,-0.112750,-0.056290,0.054900,0.011600,0.062265,...,-0.253000,0.114100,-0.105850,-0.118550,-0.061350,0.013800,0.006250,-0.114600,-0.029010,-0.064800
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZYG11A,0.021000,-0.025660,-0.214200,-0.152360,-0.020380,-0.004088,-0.022060,0.012460,0.035980,-0.095848,...,-0.150060,-0.114840,-0.113584,-0.069040,-0.112720,-0.080224,-0.191920,-0.193660,-0.168400,-0.114120
ZYG11B,0.018978,-0.048632,-0.240545,-0.211436,-0.014087,-0.029066,0.012171,0.010093,0.011982,-0.161527,...,-0.117764,-0.096468,-0.157382,-0.115042,-0.050454,-0.110373,-0.230038,-0.178318,-0.209545,-0.195591
ZYX,-0.034700,0.021967,0.089867,0.084667,-0.098300,0.109900,0.071067,-0.106867,-0.011533,0.158667,...,0.127000,0.010613,0.163800,-0.020167,-0.006195,0.057200,0.041467,0.072733,0.206333,0.196667
ZZEF1,-0.129433,-0.158727,-0.175639,-0.123181,-0.021133,-0.288200,-0.128013,-0.138720,-0.097573,0.044883,...,-0.167607,-0.057523,-0.091887,0.025984,-0.078663,-0.026909,-0.100020,0.083072,-0.234480,-0.077160


In [None]:
merged_data = geo_merge_clinical_genetic_data(selected_clinical_data, genetic_data)
# The preprocessing runs through, which means is_available should be True
is_available = True

In [None]:
merged_data

Unnamed: 0,Adrenocortical Cancer,A1BG,A2M,A2ML1,A4GALT,A4GNT,AAA1,AAAS,AACS,AADAC,...,ZW10,ZWILCH,ZWINT,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3
GSM540550,1.0,0.2610,-0.071700,-0.056388,0.115900,0.18550,-0.074332,0.284667,-0.050671,0.14100,...,-0.011806,-0.069583,0.0818,-0.26000,0.19156,0.021000,0.018978,-0.034700,-0.129433,-0.079166
GSM540551,1.0,0.0237,-0.075717,-0.025460,0.056557,-0.09625,0.058537,0.048833,-0.124633,-0.00876,...,0.074980,0.105617,0.1890,-0.39000,-0.02402,-0.025660,-0.048632,0.021967,-0.158727,-0.043763
GSM540552,1.0,0.1170,0.184000,0.121440,-0.057271,-0.28900,0.049406,0.168933,0.118744,-0.20600,...,0.117400,0.014768,0.1560,-0.51600,-0.26312,-0.214200,-0.240545,0.089867,-0.175639,-0.262375
GSM540553,1.0,0.1540,0.176317,0.166813,0.139629,-0.03006,-0.036533,0.314667,0.071756,0.12700,...,0.120940,0.090387,0.0823,-0.42100,0.12542,-0.152360,-0.211436,0.084667,-0.123181,-0.159617
GSM540554,1.0,0.1410,0.034750,0.002068,-0.060130,-0.06370,0.322153,0.049267,-0.056361,0.14400,...,0.015800,0.022417,0.0821,-0.58000,0.07364,-0.020380,-0.014087,-0.098300,-0.021133,0.018503
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GSM540606,0.0,-0.0340,0.115183,-0.013030,-0.120486,0.01380,0.096020,-0.082280,-0.084410,0.14000,...,0.053585,0.004537,0.0128,0.00522,-0.05314,-0.080224,-0.110373,0.057200,-0.026909,-0.087575
GSM540607,0.0,0.0540,0.112083,-0.029762,-0.127914,0.00625,0.126520,-0.074100,-0.090500,0.15000,...,-0.000366,-0.128767,0.0391,0.53800,-0.00796,-0.191920,-0.230038,0.041467,-0.100020,-0.126593
GSM540608,0.0,-0.1540,0.168233,0.093887,0.012549,-0.11460,-0.061250,0.163033,0.201889,0.07380,...,0.075702,0.154717,0.1470,-0.50600,0.14348,-0.193660,-0.178318,0.072733,0.083072,-0.300383
GSM540609,0.0,0.2420,0.139317,0.101675,0.150286,-0.02901,0.000785,0.065833,-0.030744,0.07190,...,0.066480,-0.007978,0.1480,-0.48600,0.05238,-0.168400,-0.209545,0.206333,-0.234480,-0.298500


In [None]:
print(f"The merged dataset contains {len(merged_data)} samples.")
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

The merged dataset contains 61 samples.
For the feature 'Adrenocortical Cancer', the least common label is '1.0' with 29 occurrences. This represents 47.54% of the dataset.
The distribution of the feature 'Adrenocortical Cancer' in this dataset is fine.



False

In [None]:
if is_available:
    save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data, note='')
else:
    save_cohort_info(cohort, JSON_PATH, is_available)
merged_data.head()
if not is_trait_biased:
    merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)

### Initial filtering and clinical data preprocessing

In [None]:
import gzip

In [None]:
# 这段代码定义了一个名为line_generator的生成器函数，它的功能是逐行读取来自文件或字符串的内容，并逐行产生处理过的数据（去除了每行末尾的空白字符）。
def line_generator(source, source_type):
    """Generator that yields lines from a file or a string.

    Parameters:
    - source: File path or string content.
    - source_type: 'file' or 'string'.
    """
    if source_type == 'file':
        with gzip.open(source, 'rt') as f:
            for line in f:
                yield line.strip()
    elif source_type == 'string':
        for line in source.split('\n'):
            yield line.strip()
    else:
        raise ValueError("source_type must be 'file' or 'string'")

In [None]:
# 这个函数filter_content_by_prefix的目的是从一个文件或一串文本中过滤出满足特定前缀条件的行。
def filter_content_by_prefix(
    source: str,
    prefixes_a: List[str],
    prefixes_b: Optional[List[str]] = None,
    unselect: bool = False,
    source_type: str = 'file',
    return_df_a: bool = True,
    return_df_b: bool = True
) -> Tuple[Union[str, pd.DataFrame], Optional[Union[str, pd.DataFrame]]]:
    """
    Filters rows from a file or a list of strings based on specified prefixes.

    Parameters:
    - source (str): File path or string content to filter.
    - prefixes_a (List[str]): Primary list of prefixes to filter by.
    - prefixes_b (Optional[List[str]]): Optional secondary list of prefixes to filter by.
    - unselect (bool): If True, selects rows that do not start with the specified prefixes.
    - source_type (str): 'file' if source is a file path, 'string' if source is a string of text.
    - return_df_a (bool): If True, returns filtered content for prefixes_a as a pandas DataFrame.
    - return_df_b (bool): If True, and if prefixes_b is provided, returns filtered content for prefixes_b as a pandas DataFrame.

    Returns:
    - Tuple: A tuple where the first element is the filtered content for prefixes_a, and the second element is the filtered content for prefixes_b.
    """
    filtered_lines_a = []
    filtered_lines_b = []
    prefix_set_a = set(prefixes_a)
    if prefixes_b is not None:
        prefix_set_b = set(prefixes_b)

    # Use generator to get lines
    for line in line_generator(source, source_type):
        matched_a = any(line.startswith(prefix) for prefix in prefix_set_a)
        if matched_a != unselect:
            filtered_lines_a.append(line)
        if prefixes_b is not None:
            matched_b = any(line.startswith(prefix) for prefix in prefix_set_b)
            if matched_b != unselect:
                filtered_lines_b.append(line)

    filtered_content_a = '\n'.join(filtered_lines_a)
    if return_df_a:
        filtered_content_a = pd.read_csv(io.StringIO(filtered_content_a), delimiter='\t', low_memory=False, on_bad_lines='skip')
    filtered_content_b = None
    if filtered_lines_b:
        filtered_content_b = '\n'.join(filtered_lines_b)
        if return_df_b:
            filtered_content_b = pd.read_csv(io.StringIO(filtered_content_b), delimiter='\t', low_memory=False, on_bad_lines='skip')

    return filtered_content_a, filtered_content_b



In [None]:
# 这个get_background_and_clinical_data函数旨在从一个特定格式的文件中提取数据集的背景信息和样本特征数据。
# 这个函数使用前面定义的filter_content_by_prefix函数来筛选出文件中满足指定前缀条件的行。

def get_background_and_clinical_data(file_path,
                                     prefixes_a=['!Series_title', '!Series_summary', '!Series_overall_design'],
                                     prefixes_b=['!Sample_geo_accession', '!Sample_characteristics_ch1']):
    """Extract from a matrix file the background information about the dataset, and sample characteristics data"""
    background_info, clinical_data = filter_content_by_prefix(file_path, prefixes_a, prefixes_b, unselect=False,
                                                              source_type='file',
                                                              return_df_a=False, return_df_b=True)
    return background_info, clinical_data

In [None]:
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data.head()

!Series_title	"Advancing a Clinically Relevant Perspective of the Clonal Nature of Cancer"
!Series_summary	"We used DNA content-based flow cytometry to distinguish and isolate nuclei from clonal populations in primary tissues from three disparate cancers with variable clinical histories. We then developed a methodology to adapt flow cytometrically purified nuclei samples for use with whole genome technologies including aCGH and next generation sequencing (NGS). Our results demonstrate that selected aberrations in the genomes of distinct clonal populations in each patient create clinically relevant contexts at least with respect to the cancer types profiled in this study."
!Series_overall_design	"We applied DNA content based flow sorting to isolate the nuclei of clonal populations from tumor biopsies. Genomic DNA from each sorted population was amplified with phi29 polymerase. A 1ug aliquot of each amplified sample was digested with DNAse 1 then labeled with Cy5 using a Klenow-based com

Unnamed: 0,!Sample_geo_accession,GSM540550,GSM540551,GSM540552,GSM540553,GSM540554,GSM540555,GSM540556,GSM540557,GSM540558,...,GSM540601,GSM540602,GSM540603,GSM540604,GSM540605,GSM540606,GSM540607,GSM540608,GSM540609,GSM540610
0,!Sample_characteristics_ch1,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,...,tissue: Adrenal Cortical Carcinoma,tissue: Adrenal Cortical Carcinoma,tissue: Adrenal Cortical Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma


In [None]:
clinical_data.head()

Unnamed: 0,!Sample_geo_accession,GSM540550,GSM540551,GSM540552,GSM540553,GSM540554,GSM540555,GSM540556,GSM540557,GSM540558,...,GSM540601,GSM540602,GSM540603,GSM540604,GSM540605,GSM540606,GSM540607,GSM540608,GSM540609,GSM540610
0,!Sample_characteristics_ch1,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,...,tissue: Adrenal Cortical Carcinoma,tissue: Adrenal Cortical Carcinoma,tissue: Adrenal Cortical Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma


In [None]:
clinical_data

Unnamed: 0,!Sample_geo_accession,GSM540550,GSM540551,GSM540552,GSM540553,GSM540554,GSM540555,GSM540556,GSM540557,GSM540558,...,GSM540601,GSM540602,GSM540603,GSM540604,GSM540605,GSM540606,GSM540607,GSM540608,GSM540609,GSM540610
0,!Sample_characteristics_ch1,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,...,tissue: Adrenal Cortical Carcinoma,tissue: Adrenal Cortical Carcinoma,tissue: Adrenal Cortical Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma


In [None]:
show(clinical_data)

PandasGUI INFO — pandasgui.gui — Opening PandasGUI

Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`


Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`


Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`


Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`


Series.__getitem__ treating keys as positions is deprecated. In a future

<pandasgui.gui.PandasGui at 0x1f2304ac160>

In [None]:
# 这个get_unique_values_by_row函数的目的是为了从给定的pandas DataFrame中，按行整理每行中的唯一值，并将这些唯一值以字典的形式组织起来。
# 这个字典的键是DataFrame的行索引，而值是一个包含了该行中唯一值的列表
def get_unique_values_by_row(dataframe, max_len=30):
    """
    Organize the unique values in each row of the given dataframe, to get a dictionary
    :param dataframe:
    :param max_len:
    :return:
    """
    if '!Sample_geo_accession' in dataframe.columns:
        dataframe = dataframe.drop(columns=['!Sample_geo_accession'])
    unique_values_dict = {}
    for index, row in dataframe.iterrows():
        unique_values = list(row.unique())[:max_len]
        unique_values_dict[index] = unique_values
    return unique_values_dict

In [None]:
clinical_data_unique = get_unique_values_by_row(clinical_data)
clinical_data_unique

{0: ['tissue: Pancreatic Ductal Adenocarcinoma',
  'tissue: Adrenal Cortical Carcinoma',
  'tissue: Prostate Carcinoma']}

这个字典只显示了一个键值对，可能是因为示例只展示了一个简化的输出，或者clinical_data DataFrame实际上只有一行数据。在实际应用中，如果DataFrame有多行，这个字典将包含更多的键值对，每个键值对对应一行中的唯一值列表。

Analyze the metadata to determine data relevance and find ways to extract the clinical data.
Reference prompt:

In [None]:
f'''As a biomedical research team, we are selecting datasets to study the association between the human trait \'{TRAIT}\' and genetic factors, optionally considering the influence of age and gender. After searching the GEO database and parsing the matrix file of a series, we obtained background information and sample characteristics data. We will provide textual information about the dataset background, and a Python dictionary storing a list of unique values for each field of the sample characteristics data. Please carefully review the provided information and answer the following questions about this dataset:
1. Does this dataset contain gene expression data? (Note: Pure miRNA data is not suitable.)
2. For each of the traits \'{TRAIT}\', 'age', and 'gender', please address these points:
   (1) Is there human data available for this trait?
   (2) If so, identify the key in the sample characteristics dictionary where unique values of this trait is recorded. The key is an integer. The trait information might be explicitly recorded, or can be inferred from the field with some biomedical knowledge or understanding about the data collection process.
   (3) Choose an appropriate data type (either 'continuous' or 'binary') for each trait. Write a Python function to convert any given value of the trait to this data type. The function should handle inference about the trait value and convert unknown values to None.
   Name the functions 'convert_trait', 'convert_age', and 'convert_gender', respectively.

Background information about the dataset:
{background_info}

Sample characteristics dictionary (from "!Sample_characteristics_ch1", converted to a Python dictionary that stores the unique values for each field):
{clinical_data_unique}
'''

'As a biomedical research team, we are selecting datasets to study the association between the human trait \'Adrenocortical Cancer\' and genetic factors, optionally considering the influence of age and gender. After searching the GEO database and parsing the matrix file of a series, we obtained background information and sample characteristics data. We will provide textual information about the dataset background, and a Python dictionary storing a list of unique values for each field of the sample characteristics data. Please carefully review the provided information and answer the following questions about this dataset:\n1. Does this dataset contain gene expression data? (Note: Pure miRNA data is not suitable.)\n2. For each of the traits \'Adrenocortical Cancer\', \'age\', and \'gender\', please address these points:\n   (1) Is there human data available for this trait?\n   (2) If so, identify the key in the sample characteristics dictionary where unique values of this trait is record

Understand and verify the answer from GPT, to assign values to the below variables. Assign None to the 'row_id' variables if relevant data row was not found.
Later we need to let GPT format its answer to automatically do these. But given the complexity of this step, let's grow some insight from the free-text answers for now.

In [None]:
age_row = gender_row = None
convert_age = convert_gender = None

In [None]:
is_gene_availabe = True
trait_row = 0
age_row = None
gender_row = None

trait_type = 'binary'

is_available = is_gene_availabe and (trait_row is not None)
if not is_available:
    save_cohort_info(cohort, JSON_PATH, is_available)
    print("This cohort is not usable. Please skip the following steps and jump to the next accession number.")

In [None]:
is_available = is_gene_availabe and (trait_row is not None)
if not is_available:
    save_cohort_info(cohort, JSON_PATH, is_available)
    print("This cohort is not usable. Please skip the following steps and jump to the next accession number.")

In [None]:
# Verify and use the functions generated by GPT

# 这个函数将组织类型（tissue type）转换为有关癫痫存在与否的二进制值。
# 它是基于特定的假设，即如果组织类型是“胰腺导管腺癌”（Pancreatic Ductal Adenocarcinoma），则认为癫痫存在（返回1）；否则，认为癫痫不存在（返回0）。
def convert_trait(tissue_type):
    """
    Convert tissue type to epilepsy presence (binary).
    Assuming epilepsy presence for 'Hippocampus' tissue.
    """
    if tissue_type == 'tissue: Pancreatic Ductal Adenocarcinoma':
        return 1  # Epilepsy present
    else:
        return 0  # Epilepsy not present


# 这个函数的目的是将年龄的字符串表示转换为一个连续的数值型表示。如果年龄未知（例如，标记为'n.a.'），则返回None。
# 函数尝试从传入的字符串中提取出一个整数作为年龄值。如果字符串的格式不符合预期，导致提取失败，同样返回None。
def convert_age(age_string):
    """
    Convert age string to a continuous numerical value.
    Unknown values are converted to None.
    """
    if age_string.lower() == 'n.a.':
        return None
    try:
        # Extract age as an integer from the string
        age = int(age_string.split(': ')[1])
        return age
    except (ValueError, IndexError):
        # In case of any format error or unexpected string structure
        return None


# 这个函数将性别的字符串表示转换为二进制值，其中“female”对应1，“male”对应0。如果性别未知或字符串不符合预期格式，则返回None。
# It sometimes maps 'female' to 0, and sometimes 1. Does it matter?
def convert_gender(gender_string):
    """
    Convert gender string to a binary value.
    'female' is represented as 1, 'male' as 0.
    Unknown values are converted to None.
    """
    if (gender_string.lower() == 'sex: female' or gender_string.lower() == 'sex: f'):
        return 1
    elif (gender_string.lower() == 'sex: male' or gender_string.lower() == 'sex: m') :  # changeed 
        return 0
    else:
        return None

In [None]:
# 这个get_feature_data函数的目的是从一个样本特征的DataFrame中提取特定特征的数据，并使用给定的转换函数将这些数据转换为二进制或连续变量的格式。
def get_feature_data(clinical_df, row_id, feature, convert_fn):
    """select the row corresponding to a feature in the sample characteristics dataframe, and convert the feature into
    a binary or continuous variable"""
    clinical_df = clinical_df.iloc[row_id:row_id + 1].drop(columns=['!Sample_geo_accession'], errors='ignore')
    clinical_df.index = [feature]
    clinical_df = clinical_df.applymap(convert_fn)

    return clinical_df

In [None]:
# 这个geo_select_clinical_features函数是为了从GEO（Gene Expression Omnibus）数据库系列中代表样本特征的DataFrame提取并处理特定的临床特征。
# 函数通过集成不同的特征处理流程，使得从复杂的数据集中提取、转换和整合特定特征变得简单直接。

def geo_select_clinical_features(clinical_df: pd.DataFrame, trait: str, trait_row: int,
                                 convert_trait: Callable,
                                 age_row: Optional[int] = None,
                                 convert_age: Optional[Callable] = None,
                                 gender_row: Optional[int] = None,
                                 convert_gender: Optional[Callable] = None) -> pd.DataFrame:
    """
    Extracts and processes specific clinical features from a DataFrame representing
    sample characteristics in the GEO database series.

    Parameters:
    - clinical_df (pd.DataFrame): DataFrame containing clinical data.
    - trait (str): The trait of interest.
    - trait_row (int): Row identifier for the trait in the DataFrame.
    - convert_trait (Callable): Function to convert trait data into a desired format.
    - age_row (int, optional): Row identifier for age data. Default is None.
    - convert_age (Callable, optional): Function to convert age data. Default is None.
    - gender_row (int, optional): Row identifier for gender data. Default is None.
    - convert_gender (Callable, optional): Function to convert gender data. Default is None.

    Returns:
    pd.DataFrame: A DataFrame containing the selected and processed clinical features.
    """
    feature_list = []

    trait_data = get_feature_data(clinical_df, trait_row, trait, convert_trait)
    feature_list.append(trait_data)
    if age_row is not None:
        age_data = get_feature_data(clinical_df, age_row, 'Age', convert_age)
        feature_list.append(age_data)
    if gender_row is not None:
        gender_data = get_feature_data(clinical_df, gender_row, 'Gender', convert_gender)
        feature_list.append(gender_data)

    selected_clinical_df = pd.concat(feature_list, axis=0)
    return selected_clinical_df

In [None]:
clinical_data

Unnamed: 0,!Sample_geo_accession,GSM540550,GSM540551,GSM540552,GSM540553,GSM540554,GSM540555,GSM540556,GSM540557,GSM540558,...,GSM540601,GSM540602,GSM540603,GSM540604,GSM540605,GSM540606,GSM540607,GSM540608,GSM540609,GSM540610
0,!Sample_characteristics_ch1,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,tissue: Pancreatic Ductal Adenocarcinoma,...,tissue: Adrenal Cortical Carcinoma,tissue: Adrenal Cortical Carcinoma,tissue: Adrenal Cortical Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma,tissue: Prostate Carcinoma


In [None]:
is_gene_availabe = True
trait_row = 0
age_row = 0
gender_row = 1

trait_type = 'binary'

selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()

  clinical_df = clinical_df.applymap(convert_fn)
  clinical_df = clinical_df.applymap(convert_fn)


ValueError: Length mismatch: Expected axis has 0 elements, new values have 1 elements

In [None]:
from pandasgui import show

show(selected_clinical_data)

PandasGUI INFO — pandasgui.gui — Opening PandasGUI
  show(selected_clinical_data)
  show(selected_clinical_data)

Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`


Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`


Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`


Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`



<pandasgui.gui.PandasGui at 0x1f221489ea0>

### Genetic data preprocessing and final filtering

In [None]:
# 这个函数get_genetic_data是用来读取基因表达数据文件，并将其转换为pandas DataFrame对象的格式，同时进行一些格式调整。
def get_genetic_data(file_path):
    """Read the gene expression data into a dataframe, and adjust its format"""
    genetic_data = pd.read_csv(file_path, compression='gzip', skiprows=52, comment='!', delimiter='\t')
    genetic_data = genetic_data.dropna()
    genetic_data = genetic_data.rename(columns={'ID_REF': 'ID'}).astype({'ID': 'str'})
    genetic_data.set_index('ID', inplace=True)

    return genetic_data


In [None]:
genetic_data = get_genetic_data(matrix_file)
genetic_data.head()

Unnamed: 0_level_0,GSM540550,GSM540551,GSM540552,GSM540553,GSM540554,GSM540555,GSM540556,GSM540557,GSM540558,GSM540559,...,GSM540601,GSM540602,GSM540603,GSM540604,GSM540605,GSM540606,GSM540607,GSM540608,GSM540609,GSM540610
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0201,0.00599,0.0117,-0.0779,0.0102,-0.00558,-0.0871,-0.0895,-0.114,-0.00695,...,0.0592,0.0563,0.00207,0.0208,0.00841,0.0122,-0.0128,0.0748,-0.000229,-0.0148
2,0.0,0.0,0.0,-0.173,-0.438,0.0,0.0,-0.614,-0.019,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.567,-0.00462,-0.47
3,-0.2,0.0,0.0,0.0,0.0,0.0,0.0,-0.319,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.442,0.0,-0.685
4,0.121,0.199,0.157,0.223,0.0123,0.132,0.0682,0.133,0.148,0.0356,...,0.192,0.365,0.364,0.0321,-0.248,0.024,-0.0865,0.0336,0.167,0.0337
5,-0.105,-0.0601,-0.114,-0.178,-0.00527,0.00652,-0.161,-0.12,-0.152,-0.0283,...,0.0991,-0.0738,-0.0584,0.012,-0.0635,-0.147,-0.247,0.117,-0.115,0.0135


In [None]:
genetic_data

Unnamed: 0_level_0,GSM540550,GSM540551,GSM540552,GSM540553,GSM540554,GSM540555,GSM540556,GSM540557,GSM540558,GSM540559,...,GSM540601,GSM540602,GSM540603,GSM540604,GSM540605,GSM540606,GSM540607,GSM540608,GSM540609,GSM540610
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0201,0.00599,0.0117,-0.0779,0.01020,-0.00558,-0.0871,-0.0895,-0.1140,-0.00695,...,0.0592,0.0563,0.002070,0.0208,0.00841,0.0122,-0.01280,0.0748,-0.000229,-0.0148
2,0.0000,0.00000,0.0000,-0.1730,-0.43800,0.00000,0.0000,-0.6140,-0.0190,0.00000,...,0.0000,0.0000,0.000000,0.0000,0.00000,0.0000,0.00000,-0.5670,-0.004620,-0.4700
3,-0.2000,0.00000,0.0000,0.0000,0.00000,0.00000,0.0000,-0.3190,0.0000,0.00000,...,0.0000,0.0000,0.000000,0.0000,0.00000,0.0000,0.00000,-0.4420,0.000000,-0.6850
4,0.1210,0.19900,0.1570,0.2230,0.01230,0.13200,0.0682,0.1330,0.1480,0.03560,...,0.1920,0.3650,0.364000,0.0321,-0.24800,0.0240,-0.08650,0.0336,0.167000,0.0337
5,-0.1050,-0.06010,-0.1140,-0.1780,-0.00527,0.00652,-0.1610,-0.1200,-0.1520,-0.02830,...,0.0991,-0.0738,-0.058400,0.0120,-0.06350,-0.1470,-0.24700,0.1170,-0.115000,0.0135
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
243500,-0.2810,-0.34800,-0.3390,-0.2320,0.02470,0.02690,-0.2440,0.0160,-0.1790,-0.34600,...,0.1480,-0.1520,-0.211000,-0.0559,0.08870,-0.0608,0.00423,-0.4000,-0.069500,-0.0305
243501,-0.1770,0.00000,-0.2670,0.0000,-0.28800,0.00000,0.0000,0.0000,0.0433,0.00000,...,0.2160,0.0000,0.122000,0.0000,0.00000,0.0000,0.00000,0.0000,0.000000,-1.0400
243502,-0.5300,1.30000,0.0179,0.0000,-0.66700,0.00000,0.0000,0.0000,0.0000,0.00000,...,0.0000,-0.2540,0.000000,0.0000,0.00000,0.0000,0.00000,-0.1480,-0.302000,-0.9700
243503,0.0344,0.02240,0.0275,-0.0692,0.03500,0.01120,-0.0938,-0.0330,-0.1030,0.00613,...,0.0709,0.0626,-0.000328,0.0182,0.01140,0.0342,-0.03030,0.0543,0.014000,-0.0138


In [None]:
# 这些代码将获取DataFrame索引的前20个元素，并将它们转换为列表。在这个上下文中，这个索引代表基因的ID。因此，gene_row_ids将包含前20个基因的ID。
# 如果您在Jupyter Notebook或类似环境中运行这些代码，它们将打印出DataFrame的前几行，以及一个包含前20个基因ID的列表。
# 这可以帮助您了解哪些基因包含在数据集的顶部，并且对进行后续分析，比如特定基因的表达量分析提供一个起点。
gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

['1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 '10',
 '11',
 '13',
 '14',
 '15',
 '16',
 '17',
 '18',
 '19',
 '20',
 '21']

Check if the gene dataset requires mapping to get the gene symbols corresponding to each data row.

Reference prompt:

In [None]:
f'''
Below are the row headers of a gene expression dataset in GEO. Based on your biomedical knowledge, are they human gene symbols, or are they some other identifiers that need to be mapped to gene symbols? Your answer should be concluded by starting a new line and strictly following this format:
requires_gene_mapping = (True or False)

Row headers:
{gene_row_ids}
'''

"\nBelow are the row headers of a gene expression dataset in GEO. Based on your biomedical knowledge, are they human gene symbols, or are they some other identifiers that need to be mapped to gene symbols? Your answer should be concluded by starting a new line and strictly following this format:\nrequires_gene_mapping = (True or False)\n\nRow headers:\n['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '13', '14', '15', '16', '17', '18', '19', '20', '21']\n"


If not required, jump directly to the gene normalization step

In [None]:
requires_gene_mapping = True

In [None]:
if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

{'ID': ['1', '2', '3', '4', '5'], 'COL': ['267', '267', '267', '267', '267'], 'ROW': [912.0, 910.0, 908.0, 906.0, 904.0], 'SPOT_ID': ['HsCGHBrightCorner', 'DarkCorner', 'DarkCorner', 'A_16_P20527812', 'A_16_P01708709'], 'CONTROL_TYPE': ['pos', 'pos', 'pos', 'FALSE', 'FALSE'], 'GB_ACC': [nan, nan, nan, nan, 'NM_138295'], 'GENE_SYMBOL': [nan, nan, nan, nan, 'PKD1L1'], 'GENE_NAME': [nan, nan, nan, nan, 'polycystic kidney disease 1 like 1'], 'ACCESSION_STRING': [nan, nan, nan, nan, 'ref|NM_138295|ref|NM_025031'], 'CHROMOSOMAL_LOCATION': [nan, nan, nan, 'chr16:076331867-076331926', 'chr7:047626734-047626793'], 'CYTOBAND': [nan, nan, nan, 'hs|q23.1', 'hs|p12.3'], 'DESCRIPTION': [nan, nan, nan, nan, 'Homo sapiens polycystic kidney disease 1 like 1 (PKD1L1), mRNA.'], 'GB_RANGE': [nan, nan, nan, 'NC_000016.8[076331867..076331926]', 'NC_000007.11[047626734..047626793]']}


Observe the first few cells in the ID column of the gene annotation dataframe, to find the names of columns that store the gene probe IDs and gene symbols respectively.
Reference prompt:

In [None]:
if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')


    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '13', '14', '15', '16', '17', '18', '19', '20', '21']
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {'ID': ['1', '2', '3', '4', '5'], 'COL': ['267', '267', '267', '267', '267'], 'ROW': [912.0, 910.0, 908.0, 906.0, 904.0], 'SPOT_ID': ['HsCGHBrightCorner', 'DarkCorner', 'DarkCorner', 'A_16_P20527812', 'A_16_P01708709'], 'CONTROL_TYPE': ['pos', 'pos', 'pos', 'FALSE', 'FALSE'], 'GB_ACC': [

In [None]:
gene_annotation_summary

{'ID': ['1', '2', '3', '4', '5'],
 'COL': ['267', '267', '267', '267', '267'],
 'ROW': [912.0, 910.0, 908.0, 906.0, 904.0],
 'SPOT_ID': ['HsCGHBrightCorner',
  'DarkCorner',
  'DarkCorner',
  'A_16_P20527812',
  'A_16_P01708709'],
 'CONTROL_TYPE': ['pos', 'pos', 'pos', 'FALSE', 'FALSE'],
 'GB_ACC': [nan, nan, nan, nan, 'NM_138295'],
 'GENE_SYMBOL': [nan, nan, nan, nan, 'PKD1L1'],
 'GENE_NAME': [nan, nan, nan, nan, 'polycystic kidney disease 1 like 1'],
 'ACCESSION_STRING': [nan, nan, nan, nan, 'ref|NM_138295|ref|NM_025031'],
 'CHROMOSOMAL_LOCATION': [nan,
  nan,
  nan,
  'chr16:076331867-076331926',
  'chr7:047626734-047626793'],
 'CYTOBAND': [nan, nan, nan, 'hs|q23.1', 'hs|p12.3'],
 'DESCRIPTION': [nan,
  nan,
  nan,
  nan,
  'Homo sapiens polycystic kidney disease 1 like 1 (PKD1L1), mRNA.'],
 'GB_RANGE': [nan,
  nan,
  nan,
  'NC_000016.8[076331867..076331926]',
  'NC_000007.11[047626734..047626793]']}

In [None]:
gene_annotation


Unnamed: 0,ID,COL,ROW,SPOT_ID,CONTROL_TYPE,GB_ACC,GENE_SYMBOL,GENE_NAME,ACCESSION_STRING,CHROMOSOMAL_LOCATION,CYTOBAND,DESCRIPTION,GB_RANGE
0,1,267,912.0,HsCGHBrightCorner,pos,,,,,,,,
1,2,267,910.0,DarkCorner,pos,,,,,,,,
2,3,267,908.0,DarkCorner,pos,,,,,,,,
3,4,267,906.0,A_16_P20527812,FALSE,,,,,chr16:076331867-076331926,hs|q23.1,,NC_000016.8[076331867..076331926]
4,5,267,904.0,A_16_P01708709,FALSE,NM_138295,PKD1L1,polycystic kidney disease 1 like 1,ref|NM_138295|ref|NM_025031,chr7:047626734-047626793,hs|p12.3,Homo sapiens polycystic kidney disease 1 like ...,NC_000007.11[047626734..047626793]
...,...,...,...,...,...,...,...,...,...,...,...,...,...
15092790,243500,-3.05E-02,,,,,,,,,,,
15092791,243501,-1.04E+00,,,,,,,,,,,
15092792,243502,-9.70E-01,,,,,,,,,,,
15092793,243503,-1.38E-02,,,,,,,,,,,


In [None]:
gene_annotation_summary['GENE_SYMBOL']


[nan, nan, nan, nan, 'PKD1L1']

In [None]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'GENE_SYMBOL'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)

In [None]:
if NORMALIZE_GENE:
    genetic_data = normalize_gene_symbols_in_index(genetic_data)

1000 input query terms found no hit:	['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '13', '14', '15', '16', '17', '18', '19', 
1000 input query terms found no hit:	['1002', '1003', '1004', '1005', '1006', '1007', '1008', '1009', '1010', '1011', '1012', '1013', '10
1000 input query terms found no hit:	['2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '20
1000 input query terms found no hit:	['3005', '3006', '3007', '3008', '3009', '3010', '3011', '3012', '3013', '3014', '3015', '3016', '30
1000 input query terms found no hit:	['4005', '4006', '4007', '4008', '4009', '4010', '4011', '4012', '4013', '4014', '4015', '4016', '40
1000 input query terms found no hit:	['5005', '5006', '5007', '5008', '5009', '5010', '5011', '5012', '5013', '5014', '5015', '5016', '50
1000 input query terms found no hit:	['6005', '6006', '6007', '6008', '6009', '6010', '6011', '6012', '6013', '6014', '6015', '6016', '60
1000 input query terms found no hi

In [None]:
def geo_merge_clinical_genetic_data(clinical_df, genetic_df):
    """
    Merge the clinical features and gene expression features from two dataframes into one dataframe
    """
    if 'ID' in genetic_df.columns:
        genetic_df = genetic_df.rename(columns={'ID': 'Gene'})
    if 'Gene' in genetic_df.columns:
        genetic_df = genetic_df.set_index('Gene')
    merged_data = pd.concat([clinical_df, genetic_df], axis=0).T.dropna()
    return merged_data


In [None]:
merged_data = geo_merge_clinical_genetic_data(selected_clinical_data, genetic_data)
# The preprocessing runs through, which means is_available should be True
is_available = True

In [None]:
merged_data

Unnamed: 0,Adrenocortical Cancer,Unnamed: 2,15E1.2,2'-PDE,76P,7A5,A1BG,A2BP1,A2M,A2ML1,...,ZYG11BL,ZYX,ZZEF1,ZZZ3,bA16L21.2.1,bA9F11.1,dJ341D10.1,hCAP-D3,hCAP-H2,mimitin
GSM540550,1.0,-0.091366,-0.018767,-0.09565,-0.177225,-0.060111,0.2610,-0.110698,-0.071700,-0.056388,...,0.118914,-0.034700,-0.129433,-0.079166,0.029367,0.15015,0.00661,-0.104746,0.053800,-0.052605
GSM540551,1.0,-0.009954,-0.036233,-0.03820,-0.016975,0.072104,0.0237,0.088163,-0.075717,-0.025460,...,0.128233,0.021967,-0.158727,-0.043763,0.102400,0.09530,-0.16100,-0.010522,0.027662,0.082836
GSM540552,1.0,0.064026,0.126533,-0.27350,-0.015500,0.055131,0.1170,-0.057415,0.184000,0.121440,...,0.077617,0.089867,-0.175639,-0.262375,-0.004233,-0.01800,-0.12000,0.026106,0.011612,0.032056
GSM540553,1.0,0.223229,0.015967,-0.18850,-0.230250,-0.020300,0.1540,0.021842,0.176317,0.166813,...,-0.034995,0.084667,-0.123181,-0.159617,-0.040867,0.26300,-0.13300,0.072607,0.063725,-0.053923
GSM540554,1.0,0.015514,-0.001400,0.07720,-0.381500,0.201556,0.1410,0.034802,0.034750,0.002068,...,-0.074967,-0.098300,-0.021133,0.018503,-0.086867,0.03475,-0.13800,-0.014310,-0.114200,-0.005092
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GSM540606,0.0,0.135557,0.011360,0.01900,-0.010340,0.079336,-0.0340,0.066139,0.115183,-0.013030,...,-0.131533,0.057200,-0.026909,-0.087575,0.167833,-0.15180,-0.12400,0.054733,-0.071883,-0.019704
GSM540607,0.0,0.125829,-0.109767,-0.11105,-0.110975,-0.004712,0.0540,0.069542,0.112083,-0.029762,...,-0.153200,0.041467,-0.100020,-0.126593,0.107967,-0.28300,-0.27100,0.012011,-0.031462,0.012892
GSM540608,0.0,0.009514,0.300333,0.18200,0.131200,-0.077400,-0.1540,0.006996,0.168233,0.093887,...,0.372533,0.072733,0.083072,-0.300383,-0.159400,-0.07216,-0.17600,0.007011,0.068550,-0.295004
GSM540609,0.0,0.174671,0.013167,-0.03220,-0.107950,-0.000033,0.2420,0.019105,0.139317,0.101675,...,0.141417,0.206333,-0.234480,-0.298500,0.132000,0.19100,-0.10000,0.017700,0.098075,0.053736


In [None]:
merged_data.drop(merged_data.columns[1], axis=1, inplace=True)

In [None]:
merged_data

Unnamed: 0,Adrenocortical Cancer,15E1.2,2'-PDE,76P,7A5,A1BG,A2BP1,A2M,A2ML1,A4GALT,...,ZYG11BL,ZYX,ZZEF1,ZZZ3,bA16L21.2.1,bA9F11.1,dJ341D10.1,hCAP-D3,hCAP-H2,mimitin
GSM540550,1.0,-0.018767,-0.09565,-0.177225,-0.060111,0.2610,-0.110698,-0.071700,-0.056388,0.115900,...,0.118914,-0.034700,-0.129433,-0.079166,0.029367,0.15015,0.00661,-0.104746,0.053800,-0.052605
GSM540551,1.0,-0.036233,-0.03820,-0.016975,0.072104,0.0237,0.088163,-0.075717,-0.025460,0.056557,...,0.128233,0.021967,-0.158727,-0.043763,0.102400,0.09530,-0.16100,-0.010522,0.027662,0.082836
GSM540552,1.0,0.126533,-0.27350,-0.015500,0.055131,0.1170,-0.057415,0.184000,0.121440,-0.057271,...,0.077617,0.089867,-0.175639,-0.262375,-0.004233,-0.01800,-0.12000,0.026106,0.011612,0.032056
GSM540553,1.0,0.015967,-0.18850,-0.230250,-0.020300,0.1540,0.021842,0.176317,0.166813,0.139629,...,-0.034995,0.084667,-0.123181,-0.159617,-0.040867,0.26300,-0.13300,0.072607,0.063725,-0.053923
GSM540554,1.0,-0.001400,0.07720,-0.381500,0.201556,0.1410,0.034802,0.034750,0.002068,-0.060130,...,-0.074967,-0.098300,-0.021133,0.018503,-0.086867,0.03475,-0.13800,-0.014310,-0.114200,-0.005092
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GSM540606,0.0,0.011360,0.01900,-0.010340,0.079336,-0.0340,0.066139,0.115183,-0.013030,-0.120486,...,-0.131533,0.057200,-0.026909,-0.087575,0.167833,-0.15180,-0.12400,0.054733,-0.071883,-0.019704
GSM540607,0.0,-0.109767,-0.11105,-0.110975,-0.004712,0.0540,0.069542,0.112083,-0.029762,-0.127914,...,-0.153200,0.041467,-0.100020,-0.126593,0.107967,-0.28300,-0.27100,0.012011,-0.031462,0.012892
GSM540608,0.0,0.300333,0.18200,0.131200,-0.077400,-0.1540,0.006996,0.168233,0.093887,0.012549,...,0.372533,0.072733,0.083072,-0.300383,-0.159400,-0.07216,-0.17600,0.007011,0.068550,-0.295004
GSM540609,0.0,0.013167,-0.03220,-0.107950,-0.000033,0.2420,0.019105,0.139317,0.101675,0.150286,...,0.141417,0.206333,-0.234480,-0.298500,0.132000,0.19100,-0.10000,0.017700,0.098075,0.053736


In [None]:
print(f"The merged dataset contains {len(merged_data)} samples.")

The merged dataset contains 61 samples.


In [None]:
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

For the feature 'Adrenocortical Cancer', the least common label is '1.0' with 29 occurrences. This represents 47.54% of the dataset.
The distribution of the feature 'Adrenocortical Cancer' in this dataset is fine.



False

In [None]:
if is_available:
    save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data, note='')
else:
    save_cohort_info(cohort, JSON_PATH, is_available)

In [None]:
merged_data.head()
if not is_trait_biased:
    merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)

### 3. Do regression & Cross Validation

In [None]:
def read_json_to_dataframe(json_file: str) -> pd.DataFrame:
    """
    Reads a JSON file and converts it into a pandas DataFrame.

    Args:
    json_file (str): The path to the JSON file containing the data.

    Returns:
    DataFrame: A pandas DataFrame with the JSON data.
    """
    with open(json_file, 'r') as file:
        data = json.load(file)
    return pd.DataFrame.from_dict(data, orient='index').reset_index().rename(columns={'index': 'cohort_id'})

In [None]:
def filter_and_rank_cohorts(json_file: str, condition: Union[str, None] = None) -> Tuple[
    Union[str, None], pd.DataFrame]:
    """
    Reads a JSON file, filters cohorts based on usability and an optional condition, then ranks them by sample size.

    Args:
    json_file (str): The path to the JSON file containing the data.
    condition (str, optional): An additional condition for filtering. If None, only 'is_usable' is considered.

    Returns:
    Tuple: A tuple containing the best cohort ID (str or None if no suitable cohort is found) and
           the filtered and ranked DataFrame.
    """
    # Read the JSON file into a DataFrame
    df = read_json_to_dataframe(json_file)

    if condition:
        filtered_df = df[(df['is_usable'] == True) & (df[condition] == True)]
    else:
        filtered_df = df[df['is_usable'] == True]

    ranked_df = filtered_df.sort_values(by='sample_size', ascending=False)
    best_cohort_id = ranked_df.iloc[0]['cohort_id'] if not ranked_df.empty else None

    return best_cohort_id, ranked_df


In [None]:
# Check the information of usable cohorts
best_cohort, ranked_df = filter_and_rank_cohorts(JSON_PATH)
ranked_df

Unnamed: 0,cohort_id,is_usable,is_available,is_biased,has_age,has_gender,sample_size,note
0,GSE21660,True,True,False,False,False,61,


In [None]:
# If both age and gender have available cohorts, select 'age' as the condition.
condition = 'Age'
filter_column = 'has_' + condition.lower()

condition_best_cohort, condition_ranked_df = filter_and_rank_cohorts(JSON_PATH, filter_column)
condition_best_cohort

In [None]:
condition_ranked_df.head()

Unnamed: 0,cohort_id,is_usable,is_available,is_biased,has_age,has_gender,sample_size,note


In [None]:
merged_data = pd.read_csv(os.path.join(OUTPUT_DIR, condition_best_cohort + '.csv'))
merged_data.head()

TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

In [None]:
# Remove the other condition to prevent interference.
merged_data = merged_data.drop(columns=['Gender'], errors='ignore').astype('float')

X = merged_data.drop(columns=[TRAIT, condition]).values
Y = merged_data[TRAIT].values
Z = merged_data[condition].values

Select the appropriate regression model depending on whether the dataset shows batch effect.

In [None]:
has_batch_effect = detect_batch_effect(X)
has_batch_effect

In [None]:
# Select appropriate models based on whether the dataset has batch effect.
# We experiment on two models for each branch. We will decide which one to choose later.

if has_batch_effect:
    model_constructor1 = VariableSelection
    model_params1 = {'modified': True, 'lamda': 3e-4}
    model_constructor2 = VariableSelection
    model_params2 = {'modified': False}
else:
    model_constructor1 = Lasso
    model_params1 = {'alpha': 1.0, 'random_state': 42}
    model_constructor2 = VariableSelection
    model_params2 = {'modified': False}

In [None]:
trait_type = 'binary'  # Remember to set this properly, either 'binary' or 'continuous'
cv_mean1, cv_std1 = cross_validation(X, Y, Z, model_constructor1, model_params1, target_type=trait_type)

In [None]:
cv_mean2, cv_std2 = cross_validation(X, Y, Z, model_constructor2, model_params2, target_type=trait_type)

In [None]:
normalized_X, _ = normalize_data(X)
normalized_Z, _ = normalize_data(Z)

# Train regression model on the whole dataset to identify significant genes
model1 = ResidualizationRegressor(model_constructor1, model_params1)
model1.fit(normalized_X, Y, normalized_Z)

model2 = ResidualizationRegressor(model_constructor2, model_params2)
model2.fit(normalized_X, Y, normalized_Z)

### 4. Discussion and report

In [None]:
feature_cols = merged_data.columns.tolist()
feature_cols.remove(TRAIT)

threshold = 0.05
interpret_result(model1, feature_cols, TRAIT, condition, threshold=threshold, save_output=True,
                 output_dir=OUTPUT_DIR, model_id=1)

In [None]:
interpret_result(model2, feature_cols, TRAIT, condition, threshold=threshold, save_output=True,
                 output_dir=OUTPUT_DIR, model_id=2)