# Gold standard curation: Preprocessing and single-step regression

In this stage of gold standard curation, we preprocess and select the data and run single-step regression for the 153 traits in our question set. This file shows the reference steps using the trait "Kidney Chromophobe" as an example. The workflow consists of the following steps:

1. Preprocess all the cohorts related to this trait. Convert each cohort to tabular form and save it to a csv file whose columns are the genetic factors, the trait, and age and gender if available;
2. If at least one cohort has age or gender information, conduct regression analysis with the genetic features together with age or gender as regressors.


# 1. Basic setup

In [1]:
import os
import sys

sys.path.append('..')
from utils import *

# Set your preferred name
USER = "Jiayi"
# Set the data and output directories
DATA_ROOT = '/Users/legion/Desktop/Courses/IS389/data'   
OUTPUT_ROOT = '/Users/legion/Desktop/Courses/IS389/output'
TRAIT = 'Kidney Chromophobe'

OUTPUT_DIR = os.path.join(OUTPUT_ROOT, USER, '-'.join(TRAIT.split()))
JSON_PATH = os.path.join(OUTPUT_DIR, "cohort_info.json")
if not os.path.exists(OUTPUT_DIR):
    os.makedirs(OUTPUT_DIR, exist_ok=True)

# Gene symbol normalization may take 1-2 minutes. You may set it to False for debugging.
NORMALIZE_GENE = True

utils.py has been loaded


In [None]:
# This cell is only for use on Google Colab. Skip it if you run your code in other environments

"""import os
from google.colab import drive

drive.mount('/content/drive', force_remount=True)
proj_dir = '/content/drive/MyDrive/AI4Science_Public'
os.chdir(proj_dir)"""

# 2. Data preprocessing and selection

## 2.1. The TCGA Xena dataset

In TCGA Xena, there is either zero or one cohort related to the trait. We search the names of subdirectories to see if any matches the trait. If a match is found, we directly obtain the file paths.

In [2]:
dataset = 'TCGA'
dataset_dir = os.path.join(DATA_ROOT, dataset)
os.listdir(dataset_dir)[:10]

['TCGA_Adrenocortical_Cancer_(ACC)',
 'TCGA_Breast_Cancer_(BRCA)',
 'TCGA_Kidney_Chromophobe_(KICH)-20240402T170820Z-001',
 'TCGA_Kidney_Papillary_Cell_Carcinoma_(KIRP)',
 'TCGA_Melanoma_(SKCM)-20240402T170844Z-001']

If no match is found, jump directly to GEO in Part 2.2.
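
The directory search described above can be sketched as follows. `find_trait_subdir` is a hypothetical helper, not part of `utils`; it assumes that a match means the words of the trait appear consecutively in the directory name once case and punctuation are ignored.

```python
import re


def find_trait_subdir(trait, subdirs):
    """Return the first subdirectory name that contains the trait, or None."""

    def normalize(text):
        # Lowercase and collapse every run of non-alphanumeric characters into one space
        return re.sub(r'[^a-z0-9]+', ' ', text.lower()).strip()

    target = normalize(trait)
    for name in subdirs:
        if target in normalize(name):
            return name
    return None


# Directory names copied from the listing above
subdirs = [
    'TCGA_Adrenocortical_Cancer_(ACC)',
    'TCGA_Breast_Cancer_(BRCA)',
    'TCGA_Kidney_Chromophobe_(KICH)-20240402T170820Z-001',
    'TCGA_Kidney_Papillary_Cell_Carcinoma_(KIRP)',
    'TCGA_Melanoma_(SKCM)-20240402T170844Z-001',
]
matched = find_trait_subdir('Kidney Chromophobe', subdirs)
print(matched)
```

Because the helper returns only the first hit, a very generic trait name could match more than one directory, so the result is still worth checking by eye.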

In [3]:
trait_subdir = "TCGA_Kidney_Chromophobe_(KICH)"
cohort = 'Xena'
# All the cancer traits in Xena are binary
trait_type = 'binary'
# Once a relevant cohort is found in Xena, we can generally assume the gene and clinical data are available
is_available = True

clinical_data_file = os.path.join(dataset_dir, trait_subdir, 'TCGA.KICH.sampleMap_KICH_clinicalMatrix')
genetic_data_file = os.path.join(dataset_dir, trait_subdir, 'TCGA.KICH.sampleMap_HiSeqV2_PANCAN.gz')

In [5]:
import pandas as pd

clinical_data = pd.read_csv(clinical_data_file, sep='\t', index_col=0)
genetic_data = pd.read_csv(genetic_data_file, compression='gzip', sep='\t', index_col=0)
age_col = gender_col = None

In [6]:
def check_rows_and_columns(dataframe, display=False):
    """
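    Get the lists of row names and column names of a dataset, and optionally print a sample of them.
    :param dataframe: the pandas DataFrame to inspect
    :param display: if True, print the number of rows and columns and up to 20 names of each
    :return: a tuple (row names, column names), each as a list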
    """
    dataframe_rows = dataframe.index.tolist()
    if display:
        print(f"The dataset has {len(dataframe_rows)} rows, such as {dataframe_rows[:20]}")
    dataframe_cols = dataframe.columns.tolist()
    if display:
        print(f"\nThe dataset has {len(dataframe_cols)} columns, such as {dataframe_cols[:20]}")
    return dataframe_rows, dataframe_cols

In [7]:
_, clinical_data_cols = check_rows_and_columns(clinical_data)
clinical_data_cols[:10]

['_INTEGRATION',
 '_PATIENT',
 '_cohort',
 '_primary_disease',
 '_primary_site',
 'additional_pharmaceutical_therapy',
 'additional_radiation_therapy',
 'additional_surgery_locoregional_procedure',
 'additional_surgery_metastatic_procedure',
 'age_at_initial_pathologic_diagnosis']

Read all the column names in the clinical dataset to find the columns that record age or gender information.
Reference prompt:

In [8]:
f'''
Below is a list of column names from a biomedical dataset. Please examine it and identify the columns that are likely to contain information about patients' age. Additionally, please do the same for columns that may hold data on patients' gender. Please provide your answer by strictly following this format, without redundant words:
candidate_age_cols = [col_name1, col_name2, ...]
candidate_gender_cols = [col_name1, col_name2, ...]
If no columns match a criterion, please provide an empty list.

Column names:
{clinical_data_cols}
'''

"\nBelow is a list of column names from a biomedical dataset. Please examine it and identify the columns that are likely to contain information about patients' age. Additionally, please do the same for columns that may hold data on patients' gender. Please provide your answer by strictly following this format, without redundant words:\ncandidate_age_cols = [col_name1, col_name2, ...]\ncandidate_gender_cols = [col_name1, col_name2, ...]\nIf no columns match a criterion, please provide an empty list.\n\nColumn names:\n['_INTEGRATION', '_PATIENT', '_cohort', '_primary_disease', '_primary_site', 'additional_pharmaceutical_therapy', 'additional_radiation_therapy', 'additional_surgery_locoregional_procedure', 'additional_surgery_metastatic_procedure', 'age_at_initial_pathologic_diagnosis', 'bcr_followup_barcode', 'bcr_patient_barcode', 'bcr_sample_barcode', 'clinical_M', 'days_to_additional_surgery_metastatic_procedure', 'days_to_birth', 'days_to_death', 'days_to_initial_pathologic_diagnosis

In [9]:
candidate_age_cols = [ 'age_at_initial_pathologic_diagnosis',
                      'days_to_birth', 'year_of_initial_pathologic_diagnosis']
candidate_gender_cols = [ 'gender']

From the candidate columns, choose one column that records age and one that records gender.
If no column meets the requirement, leave 'age_col' or 'gender_col' as None.

In [10]:
def preview_df(df, n=5):
    return df.head(n).to_dict(orient='list')

In [11]:
preview_df(clinical_data[candidate_age_cols])

{'age_at_initial_pathologic_diagnosis': [57, 67, 67, 56, 69],
 'days_to_birth': [-20849, -24650, -24650, -20768, -25267],
 'year_of_initial_pathologic_diagnosis': [2000, 2000, 2000, 2000, 2001]}

In [12]:
age_col = 'age_at_initial_pathologic_diagnosis'
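
One way to sanity-check this choice is to compare the candidate column with `days_to_birth`. The sketch below reuses the five preview rows shown above and assumes 365.25 days per year, which is an approximation and not something the dataset states.

```python
import pandas as pd

# The five preview rows shown above
preview = pd.DataFrame({
    'age_at_initial_pathologic_diagnosis': [57, 67, 67, 56, 69],
    'days_to_birth': [-20849, -24650, -24650, -20768, -25267],
})

# days_to_birth is negative, so flip the sign before converting to whole years
age_from_days = (-preview['days_to_birth'] / 365.25).astype(int)
agrees = (age_from_days == preview['age_at_initial_pathologic_diagnosis']).all()
print(age_from_days.tolist(), agrees)
```

For these rows the two columns agree, which supports using the age column directly instead of deriving age from `days_to_birth`.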

In [13]:
preview_df(clinical_data[candidate_gender_cols])

{'gender': ['FEMALE', 'FEMALE', 'FEMALE', 'FEMALE', 'MALE']}

In [14]:
gender_col = 'gender'

In [15]:
def xena_select_clinical_features(clinical_df, trait, age_col=None, gender_col=None):
    feature_list = []
    trait_data = clinical_df.index.to_series().apply(xena_convert_trait).rename(trait)
    feature_list.append(trait_data)
    if age_col:
        age_data = clinical_df[age_col].apply(xena_convert_age).rename("Age")
        feature_list.append(age_data)
    if gender_col:
        gender_data = clinical_df[gender_col].apply(xena_convert_gender).rename("Gender")
        feature_list.append(gender_data)
    selected_clinical_df = pd.concat(feature_list, axis=1)
    return selected_clinical_df

In [16]:
def xena_convert_trait(row_index: str):
    """
    Convert the trait information from Sample IDs to labels depending on the last two digits.
    Tumor types range from 01 - 09, normal types from 10 - 19.
    :param row_index: the index value of a row
    :return: the converted value
    """
    last_two_digits = int(row_index[-2:])

    if 1 <= last_two_digits <= 9:
        return 1
    elif 10 <= last_two_digits <= 19:
        return 0
    else:
        return -1
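
As a quick illustration of the rule in the docstring, the sketch below applies the same last-two-digits logic to sample IDs that appear in this cohort. `label_from_sample_id` is a condensed stand-in written for this example, not a replacement for `xena_convert_trait`.

```python
def label_from_sample_id(sample_id):
    """Map a TCGA sample ID to 1 (tumor), 0 (normal), or -1 (other) from its last two digits."""
    sample_type = int(sample_id[-2:])
    if 1 <= sample_type <= 9:
        return 1
    if 10 <= sample_type <= 19:
        return 0
    return -1


sample_ids = ['TCGA-KL-8323-01', 'TCGA-KL-8324-01', 'TCGA-KL-8324-11']
labels = {sample_id: label_from_sample_id(sample_id) for sample_id in sample_ids}
print(labels)
```

Note that -1 is an ordinary number and not a missing value, so rows labeled -1 would survive a later `dropna()`; whether they need to be filtered out separately depends on the cohort.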

In [17]:
def xena_convert_age(cell: str):
    """Convert the cell content about age to a numerical value using regular expression
    """
    match = re.search(r'\d+', str(cell))
    if match:
        return int(match.group())
    else:
        return None

In [18]:
def xena_convert_gender(cell: str):
    """Convert the cell content about gender to a binary value
    """
    if isinstance(cell, str):
        cell = cell.lower()

    if cell == "female":
        return 0
    elif cell == "male":
        return 1
    else:
        return None
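
The converters return None for values they cannot interpret, and those entries become missing values that `dropna()` removes when the data are merged. The sketch below shows that flow on a toy table; the compact `convert_age` and `convert_gender` here are stand-ins that mirror the logic above, and the input values are invented for illustration.

```python
import re

import pandas as pd


def convert_age(cell):
    # Take the first run of digits in the cell, if any
    match = re.search(r'\d+', str(cell))
    return int(match.group()) if match else None


def convert_gender(cell):
    if isinstance(cell, str):
        cell = cell.lower()
    # Same encoding as xena_convert_gender: female -> 0, male -> 1
    return {'female': 0, 'male': 1}.get(cell)


# Toy clinical table (made-up values)
toy = pd.DataFrame(
    {'age': [57, '67 years', None], 'gender': ['FEMALE', 'Male', 'unknown']},
    index=['s1', 's2', 's3'],
)
converted = pd.DataFrame({
    'Age': toy['age'].apply(convert_age),
    'Gender': toy['gender'].apply(convert_gender),
})
cleaned = converted.dropna()
print(cleaned)
```

Sample `s3` is dropped because neither its age nor its gender could be converted.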

In [19]:
import re
selected_clinical_data = xena_select_clinical_features(clinical_data, TRAIT, age_col=age_col, gender_col=gender_col)

In [20]:
def normalize_gene_symbols_in_index(gene_df):
    """Normalize the human gene symbols at the index of a dataframe, and replace the index with its normalized version.
    Remove the rows where the index failed to be normalized."""
    normalized_gene_list = normalize_gene_symbols(gene_df.index.tolist())
    assert len(normalized_gene_list) == len(gene_df.index)
    gene_df.index = normalized_gene_list
    gene_df = gene_df[gene_df.index.notnull()]
    return gene_df

In [21]:
def normalize_gene_symbols(gene_symbols, batch_size=1000):
    """Normalize human gene symbols in batches using the 'mygene' library"""
    mg = mygene.MyGeneInfo()
    normalized_genes = {}

    # Process in batches
    for i in range(0, len(gene_symbols), batch_size):
        batch = gene_symbols[i:i + batch_size]
        results = mg.querymany(batch, scopes='symbol', fields='symbol', species='human')

        # Update the normalized_genes dictionary with results from this batch
        for gene in results:
            normalized_genes[gene['query']] = gene.get('symbol', None)

    # Return the normalized symbols in the same order as the input
    return [normalized_genes.get(symbol) for symbol in gene_symbols]
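
The batching and order-preserving logic can be tried offline with a plain dictionary in place of the remote query. In the sketch below, `normalize_with_lookup` and `toy_lookup` are assumptions made for illustration; they do not reproduce what the real service returns.

```python
import pandas as pd


def normalize_with_lookup(symbols, lookup, batch_size=2):
    """Normalize symbols batch by batch; symbols missing from `lookup` become None."""
    normalized = {}
    for start in range(0, len(symbols), batch_size):
        batch = symbols[start:start + batch_size]
        # Stand-in for the remote query: one dictionary lookup per symbol in the batch
        for symbol in batch:
            normalized[symbol] = lookup.get(symbol)
    # Preserve the input order so the result lines up with the dataframe index
    return [normalized.get(symbol) for symbol in symbols]


# Toy lookup table (an assumption for illustration, not real query results)
toy_lookup = {'A2M': 'A2M', 'ZYX': 'ZYX'}
gene_df = pd.DataFrame({'S1': [1.0, 2.0, 3.0]}, index=['A2M', 'LOC100272146', 'ZYX'])

gene_df.index = normalize_with_lookup(gene_df.index.tolist(), toy_lookup)
# Drop the rows whose symbol could not be normalized
gene_df = gene_df[gene_df.index.notnull()]
print(gene_df.index.tolist())
```

Returning the symbols in input order is what allows the list to be assigned straight back to the index before the unmatched rows are removed.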


In [22]:
import mygene

if NORMALIZE_GENE:
    genetic_data = normalize_gene_symbols_in_index(genetic_data)

12 input query terms found dup hits:	[('GTF2IP1', 2), ('RBMY1A3P', 3), ('RPL31P11', 2), ('HERC2P2', 3), ('WASH3P', 3), ('NUDT9P1', 2), ('
154 input query terms found no hit:	['C16orf13', 'C16orf11', 'LOC100272146', 'LOC339240', 'NACAP1', 'LOC441204', 'KLRA1', 'FAM183A', 'FA
10 input query terms found dup hits:	[('SUGT1P1', 2), ('PTPRVP', 2), ('SNORA62', 3), ('IFITM4P', 7), ('HLA-DRB6', 2), ('FUNDC2P2', 2), ('
190 input query terms found no hit:	['NARFL', 'NFKBIL2', 'LOC150197', 'TMEM84', 'LOC162632', 'PPPDE1', 'PPPDE2', 'C1orf38', 'C1orf31', '
11 input query terms found dup hits:	[('PIP5K1P1', 2), ('HBD', 2), ('PPP1R2P1', 9), ('HSD17B7P2', 2), ('RPSAP9', 2), ('SNORD68', 2), ('SN
149 input query terms found no hit:	['FAM153C', 'C9orf167', 'CLK2P', 'CCDC76', 'CCDC75', 'CCDC72', 'HIST3H2BB', 'PRAC', 'LOC285780', 'LO
15 input query terms found dup hits:	[('SNORD58C', 2), ('UOX', 2), ('UBE2Q2P1', 3), ('PPP4R1L', 2), ('SNORD63', 3), ('ESPNP', 2), ('HBBP1
158 input query terms found no hit:	[

In [23]:
merged_data = selected_clinical_data.join(genetic_data.T).dropna()
merged_data.head()

sampleID,Kidney Chromophobe,Age,Gender,ARHGEF10L,HIF3A,RNF17,RNF10,RNF11,RNF13,GTF2IP1,...,SLC7A10,PLA2G2C,TULP2,NPY5R,GNGT2,GNGT1,TULP3,BCL6B,GSTK1,SELP
TCGA-KL-8323-01,1,57,0,-0.295492,-3.487426,-0.531035,0.359428,0.850422,0.58019,-0.739994,...,-2.090786,0.397618,-0.748878,0.731583,-1.308333,-0.79709,-1.328077,0.933073,2.015005,0.025067
TCGA-KL-8324-01,1,67,0,0.581408,0.368474,-0.531035,1.217628,0.626922,0.06679,-0.058894,...,-2.090786,-0.086682,-0.193178,0.513683,-0.912933,-0.32549,-1.035777,0.221073,1.600605,2.456767
TCGA-KL-8324-11,0,67,0,1.119008,2.198374,-0.531035,0.341628,0.681022,0.46319,0.057806,...,-0.837786,-0.086682,0.006622,5.587783,0.113667,-0.03259,0.051323,1.403473,0.786105,3.303867
TCGA-KL-8325-01,1,56,0,0.572008,-1.889926,-0.531035,0.683828,0.971922,1.55029,-0.150194,...,-2.090786,-0.086682,-0.748878,0.414383,-2.250633,1.13531,-1.257677,-1.076027,1.424005,-0.464633
TCGA-KL-8326-01,1,69,1,0.259208,-0.380726,-0.531035,0.992728,0.311022,0.65699,0.050806,...,-0.295086,-0.086682,-0.748878,1.982883,-0.845733,1.40521,-1.536277,0.100373,1.883005,0.013067
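
The merge step relies on two details: the genetic table has genes as rows, so it is transposed before joining, and `join` keeps only the samples in the clinical table, after which `dropna()` discards any sample with a missing value. A toy sketch with made-up sample IDs and values:

```python
import pandas as pd

# Clinical table: samples as rows (S3 has no age)
clinical = pd.DataFrame({'Trait': [1, 0, 1], 'Age': [57, 67, None]}, index=['S1', 'S2', 'S3'])
# Genetic table: genes as rows, samples as columns (S3 is absent, S4 has no clinical record)
genes = pd.DataFrame(
    {'S1': [0.1, 1.2], 'S2': [0.3, 0.8], 'S4': [0.5, 0.9]},
    index=['GENE_A', 'GENE_B'],
)

# Transpose so samples become rows, left-join on the sample ID, then drop incomplete rows
merged = clinical.join(genes.T).dropna()
print(merged)
```

Only `S1` and `S2` remain: `S3` lacks both age and expression values, and `S4` never enters the left join.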


In [25]:
def judge_and_remove_biased_features(df, trait, trait_type):
    assert trait_type in ["binary", "continuous"], f"The trait must be either a binary or a continuous variable!"
    if trait_type == "binary":
        trait_biased = judge_binary_variable_biased(df, trait)
    else:
        trait_biased = judge_continuous_variable_biased(df, trait)
    if trait_biased:
        print(f"The distribution of the feature \'{trait}\' in this dataset is severely biased.\n")
    else:
        print(f"The distribution of the feature \'{trait}\' in this dataset is fine.\n")
    if "Age" in df.columns:
        age_biased = judge_continuous_variable_biased(df, 'Age')
        if age_biased:
            print(f"The distribution of the feature \'Age\' in this dataset is severely biased.\n")
            df = df.drop(columns='Age')
        else:
            print(f"The distribution of the feature \'Age\' in this dataset is fine.\n")
    if "Gender" in df.columns:
        gender_biased = judge_binary_variable_biased(df, 'Gender')
        if gender_biased:
            print(f"The distribution of the feature \'Gender\' in this dataset is severely biased.\n")
            df = df.drop(columns='Gender')
        else:
            print(f"The distribution of the feature \'Gender\' in this dataset is fine.\n")

    return trait_biased, df

In [26]:
def judge_binary_variable_biased(dataframe, col_name, min_proportion=0.1, min_num=5):
    """
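    Check if the distribution of a binary variable in the dataset is too biased to be usable for analysis
    :param dataframe: the pandas DataFrame that contains the variable
    :param col_name: the name of the binary column to check
    :param min_proportion: threshold on the proportion of the least common label
    :param min_num: threshold on the number of occurrences of the least common label
    :return: True if there are fewer than two labels, or if the least common label falls below both thresholds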
    """
    label_counter = dataframe[col_name].value_counts()
    total_samples = len(dataframe)
    rare_label_num = label_counter.min()
    rare_label = label_counter.idxmin()
    rare_label_proportion = rare_label_num / total_samples

    print(
        f"For the feature \'{col_name}\', the least common label is '{rare_label}' with {rare_label_num} occurrences. This represents {rare_label_proportion:.2%} of the dataset.")

    biased = (len(label_counter) < 2) or ((rare_label_proportion < min_proportion) and (rare_label_num < min_num))
    return bool(biased)
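
The rule flags a variable only when the rare label is below both thresholds, or when there is a single label. The sketch below restates the rule without the printing and runs it on a few toy distributions; `is_binary_biased` is a stand-in written for this example.

```python
import pandas as pd


def is_binary_biased(values, min_proportion=0.1, min_num=5):
    """Same decision rule as above, applied to a plain list of labels."""
    counts = pd.Series(values).value_counts()
    rare_num = counts.min()
    rare_proportion = rare_num / len(values)
    return bool(len(counts) < 2 or (rare_proportion < min_proportion and rare_num < min_num))


cases = {
    '25 of 91': [0] * 25 + [1] * 66,   # 27.47%, well above both thresholds
    '4 of 100': [0] * 4 + [1] * 96,    # 4% and fewer than 5 occurrences
    '8 of 100': [0] * 8 + [1] * 92,    # 8%, but at least 5 occurrences
    '4 of 20': [0] * 4 + [1] * 16,     # fewer than 5 occurrences, but 20%
    'single label': [1] * 30,          # only one label present
}
results = {name: is_binary_biased(values) for name, values in cases.items()}
print(results)
```

Only the "4 of 100" and "single label" cases are flagged, which shows that a small count alone or a small proportion alone is not enough.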


In [27]:
def judge_continuous_variable_biased(dataframe, col_name):
    """Check if the distribution of a continuous variable in the dataset is too biased to be usable for analysis.
    As a starting point, we consider it biased if all values are the same. For the next step, maybe ask GPT to judge
    based on quartile statistics combined with its common sense knowledge about this feature.
    """
    quartiles = dataframe[col_name].quantile([0.25, 0.5, 0.75])
    min_value = dataframe[col_name].min()
    max_value = dataframe[col_name].max()

    # Printing quartile information
    print(f"Quartiles for '{col_name}':")
    print(f"  25%: {quartiles[0.25]}")
    print(f"  50% (Median): {quartiles[0.5]}")
    print(f"  75%: {quartiles[0.75]}")
    print(f"Min: {min_value}")
    print(f"Max: {max_value}")

    biased = min_value == max_value

    return bool(biased)

In [28]:
print(f"The merged dataset contains {len(merged_data)} samples.")
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

The merged dataset contains 91 samples.
For the feature 'Kidney Chromophobe', the least common label is '0' with 25 occurrences. This represents 27.47% of the dataset.
The distribution of the feature 'Kidney Chromophobe' in this dataset is fine.

Quartiles for 'Age':
  25%: 42.5
  50% (Median): 51.0
  75%: 62.0
Min: 17
Max: 86
The distribution of the feature 'Age' in this dataset is fine.

For the feature 'Gender', the least common label is '0' with 39 occurrences. This represents 42.86% of the dataset.
The distribution of the feature 'Gender' in this dataset is fine.



False

In [29]:
merged_data.head()
if not is_trait_biased:
    merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)

In [30]:
from typing import Callable, Optional, List, Tuple, Union, Any

In [31]:
def save_cohort_info(cohort: str, info_path: str, is_available: bool, is_biased: Optional[bool] = None,
                     df: Optional[pd.DataFrame] = None, note: str = '') -> None:
    """
    Add or update information about the usability and quality of a dataset for statistical analysis.

    Parameters:
    cohort (str): A unique identifier for the dataset.
    info_path (str): File path to the JSON file where records are stored.
    is_available (bool): Indicates whether both the genetic data and trait data are available in the dataset, and can be
     preprocessed into a dataframe.
    is_biased (bool, optional): Indicates whether the dataset is too biased to be usable.
        Required if `is_available` is True.
    df (pandas.DataFrame, optional): The preprocessed dataset. Required if `is_available` is True.
    note (str, optional): Additional notes about the dataset.

    Returns:
    None: The function does not return a value but updates or creates a record in the specified JSON file.
    """
    if is_available:
        assert (df is not None) and (is_biased is not None), "'df' and 'is_biased' should be provided if this cohort " \
                                                             "is relevant."
    is_usable = is_available and (not is_biased)
    new_record = {"is_usable": is_usable,
                  "is_available": is_available,
                  "is_biased": is_biased if is_available else None,
                  "has_age": "Age" in df.columns if is_available else None,
                  "has_gender": "Gender" in df.columns if is_available else None,
                  "sample_size": len(df) if is_available else None,
                  "note": note}
    
    if not os.path.exists(info_path):
        with open(info_path, 'w') as file:
            json.dump({}, file)
        print(f"A new JSON file was created at: {info_path}")

    with open(info_path, "r") as file:
        records = json.load(file)
    records[cohort] = new_record

    temp_path = info_path + ".tmp"
    try:
        with open(temp_path, 'w') as file:
            json.dump(records, file)
        os.replace(temp_path, info_path)

    except Exception as e:
        print(f"An error occurred: {e}")
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
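
The write-to-a-temporary-file-then-rename pattern used above can be tried in isolation. The sketch below is a reduced version run inside a temporary directory; `update_json_record` and the cohort names and record fields are placeholders for illustration.

```python
import json
import os
import tempfile


def update_json_record(path, key, record):
    """Insert or overwrite `records[key]` in a JSON file using write-then-rename."""
    records = {}
    if os.path.exists(path):
        with open(path, 'r') as file:
            records = json.load(file)
    records[key] = record

    temp_path = path + '.tmp'
    with open(temp_path, 'w') as file:
        json.dump(records, file)
    # The finished temporary file replaces the old one in a single rename
    os.replace(temp_path, path)
    return records


with tempfile.TemporaryDirectory() as output_dir:
    info_path = os.path.join(output_dir, 'cohort_info.json')
    update_json_record(info_path, 'cohort_A', {'is_usable': True, 'sample_size': 10})
    update_json_record(info_path, 'cohort_B', {'is_usable': False, 'sample_size': None})
    with open(info_path) as file:
        stored = json.load(file)
    temp_left_behind = os.path.exists(info_path + '.tmp')

print(sorted(stored), temp_left_behind)
```

Writing to a separate file first means that a failure during `json.dump` leaves the existing records untouched.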

In [32]:
import json

save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data)

A new JSON file was created at: /Users/legion/Desktop/Courses/IS389/output\Jiayi\Kidney-Chromophobe\cohort_info.json


## 2.2. The GEO dataset

In GEO, there may be one or multiple cohorts for a trait. Each cohort is identified by an accession number. We iterate over all accession numbers in the corresponding subdirectory, preprocess the cohort data, and save them to csv files.

In [33]:
dataset = 'GEO'
trait_subdir = "Kidney-Chromophobe"

trait_path = os.path.join(DATA_ROOT, dataset, trait_subdir)
os.listdir(trait_path)

['GSE11024',
 'GSE11151',
 'GSE11447',
 'GSE144082',
 'GSE14670',
 'GSE15641',
 'GSE17746',
 'GSE19949',
 'GSE19982',
 'GSE26574',
 'GSE3',
 'GSE40911',
 'GSE40912',
 'GSE40914',
 'GSE4125',
 'GSE42977',
 'GSE57162',
 'GSE6280',
 'GSE68606',
 'GSE8271',
 'GSE95425']

Repeat the steps below for each accession number.

In [34]:
def get_relevant_filepaths(cohort_dir):
    """Find the file paths of a SOFT file and a matrix file from the given data directory of a cohort.
    If there are multiple SOFT files or matrix files, simply choose the first one. May be replaced by better
    strategies later.
    """
    files = os.listdir(cohort_dir)
    soft_files = [f for f in files if 'soft' in f.lower()]
    matrix_files = [f for f in files if 'matrix' in f.lower()]
    assert len(soft_files) > 0 and len(matrix_files) > 0
    soft_file_path = os.path.join(cohort_dir, soft_files[0])
    matrix_file_path = os.path.join(cohort_dir, matrix_files[0])

    return soft_file_path, matrix_file_path
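
When looping over many accession numbers, it may be more convenient to skip a cohort with missing files than to stop on an assertion. The sketch below shows one such loop on a toy directory tree; `find_soft_and_matrix`, the accession names, and the file names are all placeholders assumed for illustration.

```python
import os
import tempfile


def find_soft_and_matrix(cohort_dir):
    """Return (soft_path, matrix_path), or None if either file is missing."""
    # Sort the listing so that "the first one" is the same on every run
    files = sorted(os.listdir(cohort_dir))
    soft_files = [f for f in files if 'soft' in f.lower()]
    matrix_files = [f for f in files if 'matrix' in f.lower()]
    if not soft_files or not matrix_files:
        return None
    return (os.path.join(cohort_dir, soft_files[0]),
            os.path.join(cohort_dir, matrix_files[0]))


with tempfile.TemporaryDirectory() as trait_path:
    # Build a toy directory tree with one complete and one incomplete cohort
    layout = {
        'GSE_A': ['GSE_A_family.soft.gz', 'GSE_A_series_matrix.txt.gz'],
        'GSE_B': ['GSE_B_family.soft.gz'],  # matrix file missing
    }
    for accession, names in layout.items():
        os.makedirs(os.path.join(trait_path, accession))
        for name in names:
            open(os.path.join(trait_path, accession, name), 'w').close()

    usable, skipped = [], []
    for accession in sorted(os.listdir(trait_path)):
        paths = find_soft_and_matrix(os.path.join(trait_path, accession))
        (usable if paths else skipped).append(accession)

print(usable, skipped)
```

The skipped cohorts can then be recorded as unavailable instead of interrupting the run.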

In [36]:
# Finished
cohort = accession_num = "GSE15641"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Gene signatures of progression and metastasis in renal cell cancer"
!Series_summary	"In order to address the progression, metastasis, and clinical heterogeneity of renal cell cancer (RCC), transcriptional profiling with oligonucleotide microarrays (22,283 genes) was done on 49 RCC tumors, 20 non-RCC renal tumors, and 23 normal kidney samples. Samples were clustered based on gene expression profiles and specific gene sets for each renal tumor type were identified. Gene expression was correlated to disease progression and a metastasis gene signature was derived. Gene signatures were identified for each tumor type with 100% accuracy. Differentially expressed genes during early tumor formation and tumor progression to metastatic RCC were found. Subsets of these genes code for secreted proteins and membrane receptors and are both potential therapeutic or diagnostic targets. A gene pattern (""metastatic signature"") derived from primary tumors was very accurate in classifying 

Unnamed: 0,!Sample_geo_accession,GSM391107,GSM391108,GSM391109,GSM391110,GSM391111,GSM391112,GSM391113,GSM391114,GSM391115,...,GSM391189,GSM391190,GSM391191,GSM391192,GSM391193,GSM391194,GSM391195,GSM391196,GSM391197,GSM391198
0,!Sample_characteristics_ch1,tissue: Kidney,tissue: Kidney,tissue: Kidney,tissue: Kidney,tissue: Kidney,tissue: Kidney,tissue: Kidney,tissue: Kidney,tissue: Kidney,...,tissue: Kidney,tissue: Kidney,tissue: Kidney,tissue: Kidney,tissue: Kidney,tissue: Kidney,tissue: Kidney,tissue: Kidney,tissue: Kidney,tissue: Kidney


In [37]:
tumor_stage_row = clinical_data.iloc[0]
tumor_stage_row.unique()

array(['!Sample_characteristics_ch1', 'tissue: Kidney'], dtype=object)
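
In the output above every sample carries the same value, so this row on its own cannot separate cases from controls. A small check like the following can list the rows that vary across samples; `informative_rows` is a hypothetical helper, and the second row of the toy table is invented for illustration.

```python
import pandas as pd

# Toy clinical table in the same layout: first column is the field name, the rest are samples
clinical = pd.DataFrame(
    [['!Sample_characteristics_ch1', 'tissue: Kidney', 'tissue: Kidney', 'tissue: Kidney'],
     ['!Sample_characteristics_ch1', 'histology: A', 'histology: B', 'histology: A']],
    columns=['!Sample_geo_accession', 'GSM_1', 'GSM_2', 'GSM_3'],
)


def informative_rows(clinical_df):
    """Return {row position: sorted unique values} for rows with at least two distinct values."""
    result = {}
    for position in range(len(clinical_df)):
        # Skip the first column, which holds the field name and not a sample value
        values = clinical_df.iloc[position, 1:].dropna().unique().tolist()
        if len(values) >= 2:
            result[position] = sorted(values)
    return result


rows = informative_rows(clinical)
print(rows)
```

A row that does not appear in the result has a single value for every sample and would yield a constant trait, age, or gender.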

In [39]:
is_gene_available = True
trait_row = 0
age_row = None
gender_row = None

trait_type = 'binary'

# Verify and use the functions generated by GPT

# This function converts the sample characteristic (tissue type) to a binary value for the trait.
# In this cohort every sample has the value 'tissue: Kidney', so every sample is labeled 1.
def convert_trait(tissue_type):
    """
    Convert tissue type to trait presence (binary).
    Assuming the trait is present for 'tissue: Kidney'.
    """
    if tissue_type == 'tissue: Kidney':
        return 1  # Trait present
    else:
        return 0  # Trait not present


# def convert_trait(tumor_grade):
#     if (tumor_grade == 'tumor grade: 2' or tumor_grade == 'tumor grade: 3' or tumor_grade == 'tumor grade: 4'):
#         return 1  
#     elif tumor_grade == 'tumor grade: 1':
#         return 0  
#     else:
#         return None

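# This function converts the string representation of age into a continuous numerical value. If the age is unknown (e.g., marked as 'n.a.'), it returns None.
# The function tries to extract an integer from the input string as the age. If the string does not have the expected format and extraction fails, it also returns None.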
def convert_age(age_string):
    """
    Convert age string to a continuous numerical value.
    Unknown values are converted to None.
    """
    if age_string.lower() == 'n.a.':
        return None
    try:
        # Extract age as an integer from the string
        age = int(age_string.split(': ')[1])
        return age
    except (ValueError, IndexError):
        # In case of any format error or unexpected string structure
        return None


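# This function converts the string representation of gender into a binary value, where 'female' maps to 1 and 'male' maps to 0. If the gender is unknown or the string does not match the expected format, it returns None.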
# It sometimes maps 'female' to 0, and sometimes 1. Does it matter?
def convert_gender(gender_string):
    """
    Convert gender string to a binary value.
    'female' is represented as 1, 'male' as 0.
    Unknown values are converted to None.
    """
    if (gender_string.lower() == 'sex: female' or gender_string.lower() == 'sex: f' or gender_string.lower() == 'gender: female' or gender_string.lower() == 'gender: f'):
        return 1
    elif (gender_string.lower() == 'sex: male' or gender_string.lower() == 'sex: m' or gender_string.lower() == 'gender: male' or gender_string.lower() == 'gender: m'):  # changed
        return 0
    else:
        return None
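
The converters above assume that every cell is a string of the form `key: value`. A more defensive variant could split the cell once and tolerate non-string input such as NaN. The helpers below (`split_characteristic`, `safe_convert_age`, `safe_convert_gender`) are hypothetical names for this sketch, and the gender encoding follows the convention of this cell (female 1, male 0).

```python
def split_characteristic(cell):
    """Split a 'key: value' string into (key, value); return (None, None) otherwise."""
    if not isinstance(cell, str) or ': ' not in cell:
        return None, None
    key, value = cell.split(': ', 1)
    return key.strip().lower(), value.strip().lower()


def safe_convert_age(cell):
    _, value = split_characteristic(cell)
    try:
        return int(value)
    except (TypeError, ValueError):
        # TypeError: nothing to parse; ValueError: the value is not an integer (e.g. 'n.a.')
        return None


def safe_convert_gender(cell):
    key, value = split_characteristic(cell)
    if key not in ('sex', 'gender'):
        return None
    return {'female': 1, 'f': 1, 'male': 0, 'm': 0}.get(value)


print(safe_convert_age('age: 57'), safe_convert_age('age: n.a.'), safe_convert_age(float('nan')))
print(safe_convert_gender('Sex: F'), safe_convert_gender('gender: male'), safe_convert_gender('tissue: Kidney'))
```

Whichever encoding is chosen for gender, it should be the same across cohorts that are analyzed together.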

In [40]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()

  clinical_df = clinical_df.applymap(convert_fn)


Unnamed: 0,GSM391107,GSM391108,GSM391109,GSM391110,GSM391111,GSM391112,GSM391113,GSM391114,GSM391115,GSM391116,...,GSM391189,GSM391190,GSM391191,GSM391192,GSM391193,GSM391194,GSM391195,GSM391196,GSM391197,GSM391198
Kidney Chromophobe,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1


In [41]:
genetic_data = get_genetic_data(matrix_file)
genetic_data

ID,GSM391107,GSM391108,GSM391109,GSM391110,GSM391111,GSM391112,GSM391113,GSM391114,GSM391115,GSM391116,...,GSM391189,GSM391190,GSM391191,GSM391192,GSM391193,GSM391194,GSM391195,GSM391196,GSM391197,GSM391198
1007_s_at,679.53,349.90,711.66,946.14,1027.44,726.10,820.66,861.99,796.80,811.09,...,909.19,761.81,1159.35,850.96,862.82,840.91,1069.96,951.86,1140.19,1422.83
1053_at,190.75,179.89,192.90,182.62,168.39,150.40,148.28,145.87,136.88,151.87,...,204.78,158.85,175.79,157.64,151.13,150.52,143.85,186.49,158.97,150.36
117_at,154.05,163.18,158.30,155.36,153.08,140.50,133.20,121.55,130.79,147.74,...,145.93,137.33,161.83,139.11,137.51,131.97,133.61,152.76,149.31,131.13
121_at,1851.76,1347.33,1278.88,1530.74,1695.97,1427.47,1573.13,1658.89,1423.71,1585.53,...,1125.78,1358.90,1091.68,1310.38,1997.77,1711.60,2078.28,1865.49,789.54,916.89
1255_g_at,51.55,56.91,41.02,28.71,33.50,21.63,33.36,32.96,27.96,34.43,...,33.25,40.68,51.91,38.29,32.97,34.79,33.54,33.61,42.25,33.67
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AFFX-ThrX-5_at,54.49,65.92,38.78,38.31,39.09,43.80,45.69,50.13,45.67,45.27,...,187.14,153.28,188.94,130.51,127.00,139.74,115.42,139.34,214.32,144.41
AFFX-ThrX-M_at,22.43,25.05,17.14,13.20,16.62,20.58,20.76,18.94,17.82,21.95,...,20.92,20.26,20.38,20.16,15.27,13.72,17.47,19.24,23.35,21.44
AFFX-TrpnX-3_at,9.75,9.44,13.83,10.61,12.32,9.15,5.24,8.10,7.98,8.63,...,14.93,16.45,8.56,8.26,12.06,11.87,12.51,12.21,15.17,17.54
AFFX-TrpnX-5_at,56.94,50.68,54.77,48.93,47.31,46.67,51.77,55.94,48.45,50.82,...,46.18,48.27,46.73,41.90,38.26,46.66,40.75,35.47,35.27,40.27


In [42]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

gene_annotation.columns

{'ID': ['1007_s_at', '1053_at', '117_at', '121_at', '1255_g_at'], 'GB_ACC': ['U48705', 'M87338', 'X51757', 'X69699', 'L36861'], 'SPOT_ID': [nan, nan, nan, nan, nan], 'Species Scientific Name': ['Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens'], 'Annotation Date': ['Oct 6, 2014', 'Oct 6, 2014', 'Oct 6, 2014', 'Oct 6, 2014', 'Oct 6, 2014'], 'Sequence Type': ['Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence'], 'Sequence Source': ['Affymetrix Proprietary Database', 'GenBank', 'Affymetrix Proprietary Database', 'GenBank', 'Affymetrix Proprietary Database'], 'Target Description': ['U48705 /FEATURE=mRNA /DEFINITION=HSU48705 Human receptor tyrosine kinase DDR gene, complete cds', 'M87338 /FEATURE= /DEFINITION=HUMA1SBU Human replication factor C, 40-kDa subunit (A1) mRNA, complete cds', "X51757 /FEATURE=cds /DEFINITION=HSP70B Human heat-shock protein HSP70B' gene", 'X69699 /FEATURE= /DEFINITION=HSPAX8A H.sapiens

Index(['ID', 'GB_ACC', 'SPOT_ID', 'Species Scientific Name', 'Annotation Date',
       'Sequence Type', 'Sequence Source', 'Target Description',
       'Representative Public ID', 'Gene Title', 'Gene Symbol',
       'ENTREZ_GENE_ID', 'RefSeq Transcript ID',
       'Gene Ontology Biological Process', 'Gene Ontology Cellular Component',
       'Gene Ontology Molecular Function'],
      dtype='object')

In [43]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'Gene Symbol'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)
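
To make the mapping step concrete, the sketch below maps probe identifiers to gene symbols on toy data and combines probes that share a symbol. How `apply_gene_mapping` in `utils` aggregates such probes is not shown in this file; averaging is an assumption made here for illustration, and all identifiers and values are made up.

```python
import pandas as pd

# Toy expression matrix: probes as rows, samples as columns
expression = pd.DataFrame(
    {'S1': [10.0, 20.0, 30.0, 40.0], 'S2': [12.0, 18.0, 33.0, 44.0]},
    index=pd.Index(['probe_1', 'probe_2', 'probe_3', 'probe_4'], name='ID'),
)
# Toy annotation: probe_4 has no gene symbol
annotation = pd.DataFrame({
    'ID': ['probe_1', 'probe_2', 'probe_3', 'probe_4'],
    'Gene Symbol': ['GENE_A', 'GENE_A', 'GENE_B', None],
})

# Keep annotated probes only and index the mapping by probe identifier
mapping = annotation[['ID', 'Gene Symbol']].dropna().set_index('ID')['Gene Symbol']
# Attach the gene symbol to each probe, dropping probes without a symbol
mapped = expression.join(mapping, how='inner')
# Average the probes that map to the same gene (assumed aggregation)
gene_level = mapped.groupby('Gene Symbol').mean()
print(gene_level)
```

After this step the rows are indexed by gene symbol, which is the form that symbol normalization expects.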

In [44]:
if NORMALIZE_GENE:
    genetic_data = normalize_gene_symbols_in_index(genetic_data)

genetic_data

Unnamed: 0,GSM391107,GSM391108,GSM391109,GSM391110,GSM391111,GSM391112,GSM391113,GSM391114,GSM391115,GSM391116,...,GSM391189,GSM391190,GSM391191,GSM391192,GSM391193,GSM391194,GSM391195,GSM391196,GSM391197,GSM391198
A1CF,216.850,273.32,286.000000,270.790000,234.910000,354.430000,301.360000,291.760000,259.660000,373.820,...,221.61,210.39,238.120000,214.250000,342.850000,184.160000,186.550000,204.180000,208.02,181.880000
A2M,1234.580,1600.46,1138.550000,1222.280000,1633.940000,979.330000,1415.140000,1156.080000,1633.880000,1346.710,...,978.94,716.96,643.230000,1527.600000,2448.750000,2136.700000,1987.630000,1182.570000,665.23,954.210000
A4GALT,156.470,147.93,250.030000,209.760000,244.660000,224.050000,182.410000,220.980000,240.770000,214.970,...,100.28,102.09,101.090000,77.750000,92.030000,95.890000,95.840000,98.780000,92.76,96.770000
A4GNT,114.730,96.43,115.990000,82.230000,91.990000,93.310000,95.910000,94.970000,94.620000,84.930,...,200.55,182.43,178.820000,178.160000,189.040000,149.580000,151.390000,149.030000,183.64,166.220000
AAAS,205.970,161.96,186.940000,175.210000,169.830000,180.700000,161.470000,168.280000,158.630000,178.560,...,215.45,179.03,193.500000,221.900000,189.570000,204.720000,188.040000,216.820000,217.65,204.090000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZXDB,76.150,73.69,67.460000,59.640000,69.450000,55.650000,57.320000,52.890000,54.360000,68.740,...,68.87,77.84,71.570000,65.840000,68.270000,70.480000,56.580000,57.000000,77.04,76.120000
ZXDC,229.670,204.86,171.430000,198.630000,208.860000,206.790000,214.290000,212.040000,268.640000,194.250,...,187.55,154.18,174.780000,206.150000,210.740000,154.120000,160.570000,159.500000,196.24,145.630000
ZYX,364.805,319.76,403.445000,459.975000,665.935000,317.120000,315.620000,302.040000,295.910000,407.785,...,418.80,336.44,332.555000,385.960000,415.850000,466.205000,484.980000,406.645000,327.05,349.415000
ZZEF1,110.290,115.06,104.916667,96.573333,95.296667,98.516667,102.956667,93.953333,96.336667,104.350,...,131.18,125.87,119.086667,134.856667,119.343333,130.183333,124.223333,123.346667,117.47,123.816667


In [45]:
merged_data = geo_merge_clinical_genetic_data(selected_clinical_data, genetic_data)
# Preprocessing ran to completion without errors, so the cohort is marked as available
is_available = True

merged_data

Unnamed: 0,Kidney Chromophobe,A1CF,A2M,A4GALT,A4GNT,AAAS,AACS,AADAC,AAGAB,AAK1,...,ZSWIM1,ZSWIM8,ZW10,ZWILCH,ZWINT,ZXDB,ZXDC,ZYX,ZZEF1,ZZZ3
GSM391107,1.0,216.85,1234.58,156.47,114.73,205.97,275.87,89.78,166.720,181.854,...,139.83,136.890,164.60,88.83,124.41,76.15,229.67,364.805,110.290000,120.57
GSM391108,1.0,273.32,1600.46,147.93,96.43,161.96,237.88,74.43,202.045,166.652,...,137.27,127.165,131.09,77.84,164.05,73.69,204.86,319.760,115.060000,168.53
GSM391109,1.0,286.00,1138.55,250.03,115.99,186.94,243.51,72.76,204.975,149.652,...,129.74,175.435,142.49,50.41,112.01,67.46,171.43,403.445,104.916667,115.28
GSM391110,1.0,270.79,1222.28,209.76,82.23,175.21,187.69,72.96,161.780,151.998,...,108.38,171.055,166.73,54.93,80.23,59.64,198.63,459.975,96.573333,149.22
GSM391111,1.0,234.91,1633.94,244.66,91.99,169.83,216.02,68.80,172.450,129.822,...,95.99,150.990,159.86,76.54,100.93,69.45,208.86,665.935,95.296667,149.95
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GSM391194,1.0,184.16,2136.70,95.89,149.58,204.72,282.58,114.22,235.415,164.832,...,140.31,186.850,132.90,70.49,133.92,70.48,154.12,466.205,130.183333,169.52
GSM391195,1.0,186.55,1987.63,95.84,151.39,188.04,252.95,117.16,221.765,196.354,...,108.77,209.540,117.39,64.92,110.48,56.58,160.57,484.980,124.223333,178.07
GSM391196,1.0,204.18,1182.57,98.78,149.03,216.82,276.85,135.16,209.215,133.932,...,136.20,211.305,139.30,72.91,111.31,57.00,159.50,406.645,123.346667,125.93
GSM391197,1.0,208.02,665.23,92.76,183.64,217.65,268.78,128.02,190.850,130.122,...,159.44,164.295,158.91,63.49,122.13,77.04,196.24,327.050,117.470000,321.24


In [46]:
print(f"The merged dataset contains {len(merged_data)} samples.")
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

The merged dataset contains 92 samples.
For the feature 'Kidney Chromophobe', the least common label is '1.0' with 92 occurrences. This represents 100.00% of the dataset.
The distribution of the feature 'Kidney Chromophobe' in this dataset is severely biased.



True
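
The check above flags the trait as biased because every one of the 92 samples carries the label 1.0, so there is no contrast class for regression. A minimal sketch of such a check follows. The helper names `min_class_fraction` and `looks_biased` and the 5% threshold are assumptions made for illustration; this is not the implementation of `judge_and_remove_biased_features` in `utils`, whose exact criteria are not shown in this notebook.

```python
import pandas as pd

def min_class_fraction(labels: pd.Series) -> float:
    """Share of the rarer class among {0, 1}; 0.0 when one class is absent."""
    counts = labels.astype(float).value_counts().reindex([0.0, 1.0], fill_value=0)
    return float(counts.min()) / max(len(labels), 1)

def looks_biased(labels: pd.Series, threshold: float = 0.05) -> bool:
    """Flag a binary trait whose rarer class falls below `threshold` (illustrative value)."""
    return min_class_fraction(labels) < threshold

# A cohort in which every sample is a case, as in the output above
all_cases = pd.Series([1.0] * 92)
print(min_class_fraction(all_cases), looks_biased(all_cases))
```

Treating an absent class as a fraction of zero makes the single-class case explicit instead of reporting the only present label as the "least common" one.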

In [47]:
if is_available:
    save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data, note='')
    # Save the cohort only when the trait distribution is usable for regression
    if not is_trait_biased:
        merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)
else:
    save_cohort_info(cohort, JSON_PATH, is_available)

In [34]:
cohort = accession_num = "GSE17746"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

('/Users/zhangmianchen/Downloads/DATA/GEO/Kidney-Chromophobe/GSE17746/GSE17746_family.soft.gz',
 '/Users/zhangmianchen/Downloads/DATA/GEO/Kidney-Chromophobe/GSE17746/GSE17746-GPL9070_series_matrix.txt.gz')

In [59]:
# Status: preprocessing finished
cohort = accession_num = "GSE42977"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Sequential Binary Gene-Ratio Tests Define a Novel Molecular Diagnostic Strategy for Malignant Pleural Mesothelioma"
!Series_summary	"The gene-expression ratio  technique was used to design a molecular signature to diagnose MPM from among other potentially confounding diagnoses and differentiate the epithelioid from the sarcomatoid histological subtype of MPM."
!Series_overall_design	"Microarray analysis was performed on 113 specimens including MPMs and a spectrum of tumors and benign tissues comprising the differential diagnosis of MPM.  A sequential combination of binary gene-expression ratio tests was developed to discriminate MPM from other thoracic malignancies .  This method was compared to other bioinformatic tools and this signature was validated  in an independent set of 170 samples.  Functional enrichment analysis was performed to identify differentially expressed probes."


Unnamed: 0,!Sample_geo_accession,GSM1054230,GSM1054231,GSM1054232,GSM1054233,GSM1054234,GSM1054235,GSM1054236,GSM1054237,GSM1054238,...,GSM1054337,GSM1054338,GSM1054339,GSM1054340,GSM1054341,GSM1054342,GSM1054343,GSM1054344,GSM1054345,GSM1054346
0,!Sample_characteristics_ch1,tissue: control,tissue: control,tissue: control,tissue: control,tissue: Spindle Cell Sarcoma,tissue: Sarcoma,tissue: Sarcoma,tissue: Metastatic Melanoma,tissue: Pleomorphic Sarcoma,...,tissue: Normal Pleura,tissue: Normal Pleura,tissue: Normal Pleura,tissue: Normal Pleura,tissue: MPM Sarcomatoid,tissue: MPM Biphasic,tissue: MPM Biphasic,tissue: MPM Biphasic,tissue: MPM Biphasic,tissue: MPM Biphasic


In [60]:
tumor_stage_row = clinical_data.iloc[0]
tumor_stage_row.unique()

array(['!Sample_characteristics_ch1', 'tissue: control',
       'tissue: Spindle Cell Sarcoma', 'tissue: Sarcoma',
       'tissue: Metastatic Melanoma', 'tissue: Pleomorphic Sarcoma',
       'tissue: Renal Cell Carcinoma-Clear Cell',
       'tissue: Synovial Sarcoma', 'tissue: Metastatic Thymoma',
       'tissue: Metastatic Prostate Cancer',
       'tissue: Stomach Cancer-Stromal Sarcoma',
       'tissue: Non-Hodgkins Lymphoma', 'tissue: Hemangioendothelioma',
       'tissue: Papillary Thyroid Carcinoma',
       'tissue: Metastatic Thyroid Cancer',
       'tissue: Lymphocytic Lymphoma', 'tissue: Thymoma',
       'tissue: Melanoma-Malignant', 'tissue: Hemangiopericytoma',
       'tissue: Thyroid Carcinoma', 'tissue: Monophasic Synovial Sarcoma',
       'tissue: Metastatic Alveolar Soft Part Sarcoma',
       'tissue: Metastatic Meningeal Hemangiopericytoma',
       'tissue: Follicular Lymphoma', 'tissue: Rhabdomyosarcoma',
       'tissue: Myofibrosarcoma',
       'tissue: Renal Cell Carc

In [61]:
is_gene_available = True
trait_row = 0
age_row = None
gender_row = None

trait_type = 'binary'

# Verify and use the functions generated by GPT

# This function converts the tissue type into a binary indicator of the trait.
# A sample is labeled 1 if its tissue type is chromophobe renal cell carcinoma, and 0 otherwise.
def convert_trait(tissue_type):
    """
    Convert tissue type to Kidney Chromophobe presence (binary).
    Returns 1 for chromophobe renal cell carcinoma tissue, 0 otherwise.
    """
    if tissue_type == 'tissue: Renal Cell Carcinoma - Chromophobe':
        return 1  # Kidney Chromophobe present
    else:
        return 0  # Kidney Chromophobe not present

In [62]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()

  clinical_df = clinical_df.applymap(convert_fn)


Unnamed: 0,GSM1054230,GSM1054231,GSM1054232,GSM1054233,GSM1054234,GSM1054235,GSM1054236,GSM1054237,GSM1054238,GSM1054239,...,GSM1054337,GSM1054338,GSM1054339,GSM1054340,GSM1054341,GSM1054342,GSM1054343,GSM1054344,GSM1054345,GSM1054346
Kidney Chromophobe,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
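
The preview above shows only the first and last ten samples, all labeled 0, so it cannot tell us how many positive samples the conversion produced. A small sketch of counting the labels before merging is given below. The helper `count_trait_labels` and the toy frame are hypothetical and are not part of `utils`.

```python
import pandas as pd

def count_trait_labels(selected: pd.DataFrame, trait: str) -> dict:
    """Count each label in the trait row of a features-by-samples frame."""
    row = selected.loc[trait]
    return {label: int(n) for label, n in row.value_counts().items()}

# Toy frame with the same layout as the preview: one trait row, samples as columns
toy = pd.DataFrame([[0, 0, 1, 0]], index=["Kidney Chromophobe"],
                   columns=["S1", "S2", "S3", "S4"])
print(count_trait_labels(toy, "Kidney Chromophobe"))
```

Running such a count right after `geo_select_clinical_features` would reveal a near-empty positive class before the more expensive gene mapping steps.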


In [63]:
genetic_data = get_genetic_data(matrix_file)
genetic_data

ID,GSM1054230,GSM1054231,GSM1054232,GSM1054233,GSM1054234,GSM1054235,GSM1054236,GSM1054237,GSM1054238,GSM1054239,...,GSM1054337,GSM1054338,GSM1054339,GSM1054340,GSM1054341,GSM1054342,GSM1054343,GSM1054344,GSM1054345,GSM1054346
ILMN_10000,1.972137,1.511342,2.845652,1.183206,2.740848,3.472879,3.876484,7.543180,3.854739,1.886201,...,2.208187,1.678540,2.347179,2.308059,4.152523,4.434662,5.518134,15.222088,8.603092,4.540210
ILMN_100000,0.924305,0.812420,0.937917,0.942640,0.916636,0.870865,0.775436,0.872982,0.897949,0.925115,...,0.824272,0.884153,0.874214,0.856160,0.943487,0.948834,0.869908,0.857906,0.757040,0.872259
ILMN_100007,0.909786,0.795458,0.936011,0.857263,0.870995,0.775896,0.816944,0.830887,0.811600,0.820598,...,0.711004,0.835415,0.913357,0.890036,0.805675,0.930804,0.894138,0.817027,0.977993,0.944993
ILMN_100009,0.883831,0.857333,0.811424,0.907362,0.763149,0.764961,0.845055,0.898741,0.932021,0.951832,...,0.779348,0.912258,0.887889,0.879680,0.869804,0.685800,0.786594,0.752266,0.791208,0.868821
ILMN_10001,42.180716,28.295220,48.491675,15.108901,50.917271,12.615383,32.558544,48.668974,37.672912,16.421997,...,62.377845,71.203787,21.897301,29.824924,16.600328,15.631902,16.028539,23.609221,16.012058,36.767564
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ILMN_99987,0.734585,0.818516,0.767362,0.877341,0.790919,0.884766,0.762665,0.930539,0.877803,0.897816,...,0.850284,0.895458,0.845082,0.817136,0.811308,0.863365,0.797534,0.797442,0.836796,0.831163
ILMN_9999,2.186225,2.062302,1.901461,1.265443,1.524241,1.100508,2.845723,1.751512,1.252493,1.431054,...,2.442431,1.699486,1.168026,1.244256,1.859018,1.739583,1.604486,1.295576,1.220029,2.120357
ILMN_99990,1.003755,1.105511,0.912209,1.125055,1.005835,1.019736,1.077536,0.985023,1.006931,0.959392,...,1.095790,1.086057,1.050433,0.964199,1.093784,1.000962,0.997990,1.111851,0.976525,1.035464
ILMN_99995,0.867995,0.913600,1.069407,0.840991,0.966751,0.978569,0.851452,0.917244,0.944750,0.969894,...,0.879109,0.928430,0.867635,1.020188,0.820790,0.957369,0.846994,0.922516,0.958226,0.915698


In [64]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

gene_annotation.columns

{'ID': ['ILMN_89282', 'ILMN_35826', 'ILMN_25544', 'ILMN_132331', 'ILMN_105017'], 'GB_ACC': ['BU678343', 'XM_497527.2', 'NM_018433.3', 'AW629334', 'AI818233'], 'Symbol': [nan, 'LOC441782', 'JMJD1A', nan, nan], 'SEQUENCE': ['CTCTCTAAAGGGACAACAGAGTGGACAGTCAAGGAACTCCACATATTCAT', 'GGGGTCAAGCCCAGGTGAAATGTGGATTGGAAAAGTGCTTCCCTTGCCCC', 'CCAGGCTGTAAAAGCAAAACCTCGTATCAGCTCTGGAACAATACCTGCAG', 'CCAGACAGGAAGCATCAAGCCCTTCAGGAAAGAATATGCGAGAGTGCTGC', 'TGTGCAGAAAGCTGATGGAAGGGAGAAAGAATGGAAGTGGGTCACACAGC'], 'Definition': ['UI-CF-EC0-abi-c-12-0-UI.s1 UI-CF-EC0 Homo sapiens cDNA clone UI-CF-EC0-abi-c-12-0-UI 3, mRNA sequence', 'PREDICTED: Homo sapiens similar to spectrin domain with coiled-coils 1 (LOC441782), mRNA.', 'Homo sapiens jumonji domain containing 1A (JMJD1A), mRNA.', 'hi56g05.x1 Soares_NFL_T_GBC_S1 Homo sapiens cDNA clone IMAGE:2976344 3, mRNA sequence', 'wk77d04.x1 NCI_CGAP_Pan1 Homo sapiens cDNA clone IMAGE:2421415 3, mRNA sequence'], 'Ontology': [nan, nan, nan, nan, nan], 'Synonym': [nan, nan,

Index(['ID', 'GB_ACC', 'Symbol', 'SEQUENCE', 'Definition', 'Ontology',
       'Synonym'],
      dtype='object')
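
The prompt above asks a language model to choose the identifier column. As a cross-check, the choice can also be tested directly: the identifier column should be the one whose values overlap most with the row IDs of the expression matrix. The sketch below assumes a hypothetical helper `guess_identifier_key`; the toy annotation reuses two probe IDs from the output above, and the gene symbols in it are placeholders.

```python
import pandas as pd

def guess_identifier_key(annotation: pd.DataFrame, row_ids) -> str:
    """Return the annotation column sharing the most values with the expression row IDs."""
    ids = set(map(str, row_ids))
    overlap = {
        col: len(ids & set(annotation[col].dropna().astype(str)))
        for col in annotation.columns
    }
    return max(overlap, key=overlap.get)

# Toy annotation table; the symbols are placeholders, not real annotations
toy_annotation = pd.DataFrame({
    "ID": ["ILMN_10000", "ILMN_100000", "ILMN_89282"],
    "Symbol": ["GENE_A", None, "GENE_B"],
})
print(guess_identifier_key(toy_annotation, ["ILMN_10000", "ILMN_100000"]))
```

The gene symbol column still needs a manual or model-assisted decision, since symbols do not appear in the expression matrix before mapping.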

In [65]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'Symbol'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)
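
Several probes can map to the same gene, so the mapping step has to collapse probe-level rows into gene-level rows. The sketch below averages the probes of each gene. This aggregation rule is an assumption made for illustration, `map_probes_to_genes` is a hypothetical name, and the behavior of `apply_gene_mapping` in `utils` may differ.

```python
import pandas as pd

def map_probes_to_genes(expr: pd.DataFrame, mapping: pd.DataFrame,
                        id_col: str = "ID", gene_col: str = "Gene") -> pd.DataFrame:
    """Average the rows of `expr` (probes x samples) that map to the same gene symbol."""
    # Drop probes without a symbol and keep one symbol per probe
    mapping = mapping.dropna(subset=[gene_col]).drop_duplicates(subset=[id_col])
    genes = mapping.set_index(id_col)[gene_col]
    # Keep only probes that have a symbol, then average per gene
    joined = expr.join(genes, how="inner")
    return joined.groupby(gene_col).mean()

# Toy data: probes P1 and P2 map to gene A, P3 maps to gene B, P4 has no symbol
expr = pd.DataFrame({"S1": [1.0, 3.0, 10.0], "S2": [3.0, 5.0, 20.0]},
                    index=["P1", "P2", "P3"])
mapping = pd.DataFrame({"ID": ["P1", "P2", "P3", "P4"],
                        "Gene": ["A", "A", "B", None]})
print(map_probes_to_genes(expr, mapping))
```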



In [66]:
genetic_data = normalize_gene_symbols_in_index(genetic_data)

genetic_data

Unnamed: 0,GSM1054230,GSM1054231,GSM1054232,GSM1054233,GSM1054234,GSM1054235,GSM1054236,GSM1054237,GSM1054238,GSM1054239,...,GSM1054337,GSM1054338,GSM1054339,GSM1054340,GSM1054341,GSM1054342,GSM1054343,GSM1054344,GSM1054345,GSM1054346
A1BG,0.903276,0.821580,0.865428,0.890772,0.986434,1.197420,0.886271,1.117455,1.003215,0.865700,...,0.855990,0.897686,0.884723,0.925664,1.061455,0.868972,0.961211,0.802412,0.964033,1.097807
A2M,27.154642,17.070914,34.490890,9.773921,31.282731,20.569610,65.495478,77.842674,30.995838,41.236381,...,35.781895,43.580901,32.651689,25.655624,28.996303,22.165116,21.972497,22.618258,22.302836,34.703123
A2ML1,0.893226,0.963602,0.996714,1.108153,0.855068,1.076572,0.943271,1.017992,0.925794,0.964499,...,1.070994,1.050752,1.000574,1.029542,1.056357,1.102940,0.976782,1.146317,1.128793,0.950305
A3GALT2,1.357179,1.423179,1.518340,1.352150,1.415984,1.592975,1.484345,1.446382,1.389944,1.314923,...,1.488361,1.419631,1.593990,1.414248,1.362292,1.279548,1.113990,1.113299,1.135271,1.320155
A4GALT,2.008587,1.581049,2.094153,1.300390,11.355988,6.706682,7.649266,1.623299,2.566030,7.711680,...,2.302174,3.100832,2.500579,2.247663,4.225168,1.733522,4.810449,4.048876,2.373974,1.274872
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZYX,19.655399,15.911316,18.813045,9.908477,37.106860,32.012832,46.715755,25.140595,33.192002,11.446667,...,38.491333,31.592975,26.693239,21.626782,48.230134,33.685854,56.189855,58.047052,58.464519,60.301068
ZZEF1,7.226102,9.249937,7.148919,6.130382,9.664492,6.915493,11.620945,9.445520,6.164034,6.642245,...,10.193268,9.867647,6.958772,6.572962,7.505992,7.603477,9.829170,6.854160,9.044523,7.138543
ZZZ3,13.926936,8.321252,10.700753,4.641731,16.992292,6.677801,16.909833,13.829324,11.036007,6.636792,...,13.362898,11.496935,4.852836,5.190852,9.553233,11.981230,8.417731,7.689744,6.061908,14.245139
EIF2A,36.250133,21.393597,24.619330,10.641300,28.841662,15.424809,22.914809,28.124125,26.329730,15.381810,...,28.626723,33.113892,13.609499,14.907299,14.487761,16.448616,16.942906,25.439824,13.607890,31.558448


In [67]:
merged_data = geo_merge_clinical_genetic_data(selected_clinical_data, genetic_data)
# Preprocessing ran to completion without errors, so the cohort is marked as available
is_available = True

merged_data

Unnamed: 0,Kidney Chromophobe,A1BG,A2M,A2ML1,A3GALT2,A4GALT,A4GNT,AAA1,AAAS,AACS,...,ZXDA,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3,EIF2A,RAB1C
GSM1054230,0.0,0.903276,27.154642,0.893226,1.357179,2.008587,1.090213,1.113465,3.079174,4.593716,...,0.956936,5.579101,2.600057,1.052827,15.959863,19.655399,7.226102,13.926936,36.250133,0.815878
GSM1054231,0.0,0.821580,17.070914,0.963602,1.423179,1.581049,1.108882,0.962892,1.792671,8.201332,...,0.911934,6.643073,1.840664,0.810640,53.815354,15.911316,9.249937,8.321252,21.393597,0.698088
GSM1054232,0.0,0.865428,34.490890,0.996714,1.518340,2.094153,1.133496,1.238555,3.890664,4.255689,...,0.868369,4.749234,2.945843,1.003140,15.712620,18.813045,7.148919,10.700753,24.619330,0.820355
GSM1054233,0.0,0.890772,9.773921,1.108153,1.352150,1.300390,1.041011,0.957591,1.664235,4.674827,...,0.892851,3.495437,1.402277,0.902783,31.327785,9.908477,6.130382,4.641731,10.641300,0.759291
GSM1054234,0.0,0.986434,31.282731,0.855068,1.415984,11.355988,1.010757,0.926479,2.094559,2.133348,...,0.953112,3.221408,2.098021,1.003471,24.561863,37.106860,9.664492,16.992292,28.841662,0.779781
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GSM1054342,0.0,0.868972,22.165116,1.102940,1.279548,1.733522,1.089880,0.843426,1.531277,7.013802,...,0.981893,2.710474,2.191160,0.789249,21.725583,33.685854,7.603477,11.981230,16.448616,1.010288
GSM1054343,0.0,0.961211,21.972497,0.976782,1.113990,4.810449,1.236143,0.928271,1.706164,3.073721,...,0.952303,4.391199,2.251523,0.831166,24.281447,56.189855,9.829170,8.417731,16.942906,0.879999
GSM1054344,0.0,0.802412,22.618258,1.146317,1.113299,4.048876,1.340815,0.880647,2.355578,3.902926,...,0.888738,3.320270,2.060882,0.811653,15.325906,58.047052,6.854160,7.689744,25.439824,1.039451
GSM1054345,0.0,0.964033,22.302836,1.128793,1.135271,2.373974,1.257381,0.874247,2.393149,6.094806,...,0.886234,3.197014,2.586177,0.835287,17.117737,58.464519,9.044523,6.061908,13.607890,1.062880
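
The merged table has one row per sample, with the trait in the first column and the genes after it, while both inputs store samples as columns. A minimal sketch of such a merge is shown below. The function `merge_clinical_genetic` is a hypothetical stand-in and not the code of `geo_merge_clinical_genetic_data`.

```python
import pandas as pd

def merge_clinical_genetic(clinical: pd.DataFrame, genetic: pd.DataFrame) -> pd.DataFrame:
    """Join two features-by-samples frames into one samples-by-features frame."""
    # Transpose so that samples become rows, then keep samples present in both tables
    return pd.concat([clinical.T, genetic.T], axis=1, join="inner")

# Toy inputs: S3 has clinical data but no expression data, so it is dropped
clinical = pd.DataFrame([[0, 1, 0]], index=["Kidney Chromophobe"],
                        columns=["S1", "S2", "S3"])
genetic = pd.DataFrame([[27.2, 17.1], [19.7, 15.9]], index=["A2M", "ZYX"],
                       columns=["S1", "S2"])
print(merge_clinical_genetic(clinical, genetic))
```

An inner join is used here so that every remaining sample has both a trait label and expression values.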


In [69]:
print(f"The merged dataset contains {len(merged_data)} samples.")
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

The merged dataset contains 117 samples.
For the feature 'Kidney Chromophobe', the least common label is '1.0' with 1 occurrences. This represents 0.85% of the dataset.
The distribution of the feature 'Kidney Chromophobe' in this dataset is severely biased.



True

In [70]:
if is_available:
    save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data, note='')
    # Save the cohort only when the trait distribution is usable for regression
    if not is_trait_biased:
        merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)
else:
    save_cohort_info(cohort, JSON_PATH, is_available)

In [71]:
# Status: preprocessing finished
cohort = accession_num = "GSE26574"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"An antioxidant response phenotype is shared between hereditary and sporadic type 2 papillary renal cell carcinoma"
!Series_summary	"Fumarate hydratase (FH) mutation causes hereditary type 2 papillary renal cell carcinoma (HLRCC, Hereditary Leiomyomatosis and Renal Cell Cancer (MM ID # 605839)). The main effect of FH mutation is fumarate accumulation. The current paradigm posits that the main consequence of fumarate accumulation is HIF-a stabilization. Paradoxically, FH mutation differs from other HIF-a stabilizing mutations, such as VHL and SDH mutations, in its associated tumor types. We identified that fumarate can directly up-regulate antioxidant response element (ARE)-controlled genes. We demonstrated that AKR1B10 is an ARE-controlled gene and is up-regulated upon FH knockdown as well as in FH-null cell lines. AKR1B10 overexpression is also a prominent feature in both hereditary and sporadic PRCC2. This phenotype better explains the similarities between hereditary an

Unnamed: 0,!Sample_geo_accession,GSM655513,GSM655514,GSM655515,GSM655516,GSM655517,GSM655518,GSM655519,GSM655520,GSM655521,...,GSM655570,GSM655571,GSM655572,GSM655573,GSM655574,GSM655575,GSM655576,GSM655577,GSM655578,GSM655579
0,!Sample_characteristics_ch1,disease state: normal_tissue_from_ccRCC_patient,disease state: normal_tissue_from_ccRCC_patient,disease state: normal_tissue_from_ccRCC_patient,disease state: normal_tissue_from_ccRCC_patient,disease state: normal_tissue_from_ccRCC_patient,disease state: normal_tissue_from_ccRCC_patient,disease state: normal_tissue_from_ccRCC_patient,disease state: normal_tissue_from_ccRCC_patient,disease state: ccRCC,...,disease state: Pap_type2,disease state: Pap_type2,disease state: Pap_type2,disease state: HLRCC,disease state: HLRCC,disease state: HLRCC,disease state: HLRCC,disease state: HLRCC,disease state: normal_tissue_from_FH_patient,disease state: normal_tissue_from_FH_patient


In [72]:
tumor_stage_row = clinical_data.iloc[0]
tumor_stage_row.unique()

array(['!Sample_characteristics_ch1',
       'disease state: normal_tissue_from_ccRCC_patient',
       'disease state: ccRCC', 'disease state: Chromophobe',
       'disease state: Pap_type1', 'disease state: Pap_type2',
       'disease state: HLRCC',
       'disease state: normal_tissue_from_FH_patient'], dtype=object)
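
The unique values above all follow a `key: value` pattern, and an exact string comparison in `convert_trait` is sensitive to spacing and capitalization. The sketch below shows a more tolerant variant. The names `parse_characteristic` and `convert_trait_tolerant` are hypothetical, and the substring rule is an assumption that should be checked against each cohort, since it would also match any other label containing "chromophobe".

```python
def parse_characteristic(cell):
    """Split a GEO 'key: value' string into (key, value); return (None, None) otherwise."""
    if not isinstance(cell, str) or ":" not in cell:
        return None, None
    key, value = cell.split(":", 1)
    return key.strip().lower(), value.strip()

def convert_trait_tolerant(cell):
    """Return 1 if the value mentions chromophobe, 0 if not, None for non-data cells."""
    _, value = parse_characteristic(cell)
    if value is None:
        return None
    return 1 if "chromophobe" in value.lower() else 0

for cell in ["disease state: Chromophobe", "disease state: ccRCC",
             "!Sample_characteristics_ch1"]:
    print(cell, "->", convert_trait_tolerant(cell))
```

Unlike the exact-match version, this sketch returns `None` for the header cell instead of 0, which keeps non-data cells from being counted as controls.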

In [73]:
is_gene_available = True
trait_row = 0
age_row = None
gender_row = None

trait_type = 'binary'

# Verify and use the functions generated by GPT

# This function converts the disease state into a binary indicator of the trait.
# A sample is labeled 1 if its disease state is chromophobe renal cell carcinoma, and 0 otherwise.
def convert_trait(tissue_type):
    """
    Convert disease state to Kidney Chromophobe presence (binary).
    Returns 1 for the 'Chromophobe' disease state, 0 otherwise.
    """
    if tissue_type == 'disease state: Chromophobe':
        return 1  # Kidney Chromophobe present
    else:
        return 0  # Kidney Chromophobe not present

In [74]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()

  clinical_df = clinical_df.applymap(convert_fn)


Unnamed: 0,GSM655513,GSM655514,GSM655515,GSM655516,GSM655517,GSM655518,GSM655519,GSM655520,GSM655521,GSM655522,...,GSM655570,GSM655571,GSM655572,GSM655573,GSM655574,GSM655575,GSM655576,GSM655577,GSM655578,GSM655579
Kidney Chromophobe,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [75]:
genetic_data = get_genetic_data(matrix_file)
genetic_data

ID,GSM655513,GSM655514,GSM655515,GSM655516,GSM655517,GSM655518,GSM655519,GSM655520,GSM655521,GSM655522,...,GSM655570,GSM655571,GSM655572,GSM655573,GSM655574,GSM655575,GSM655576,GSM655577,GSM655578,GSM655579
1,5.668639,5.397250,5.533062,5.439878,5.455800,5.429943,5.433182,5.920859,5.403412,5.565034,...,6.223809,5.480571,6.147314,6.249860,5.163622,5.785643,5.347343,5.160851,5.617737,5.595228
2,12.464613,12.821652,12.966227,12.520486,13.089543,13.331817,12.606140,12.834138,12.828438,12.980657,...,11.032362,11.496659,9.691689,11.886333,9.689367,12.391459,10.935704,5.521329,12.925633,12.379518
9,7.919827,7.605824,7.350612,6.686158,6.956433,7.230992,7.650826,7.131763,7.106921,7.396125,...,7.089215,6.427457,6.477982,7.520532,6.666302,8.042353,8.071641,7.995131,8.272262,9.176026
10,7.004454,7.640562,6.839965,7.179755,6.764172,7.095299,6.867889,6.434765,6.639619,6.506634,...,7.201593,7.707917,7.483945,5.785292,6.091871,5.636089,5.852559,5.879160,6.439763,7.493414
12,8.897253,8.377712,12.165405,7.766924,10.101467,10.485897,8.890075,7.136956,7.642404,7.404853,...,9.228735,12.464738,9.500725,9.865214,9.082017,9.470265,8.840548,6.709212,10.095208,7.925754
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
730202,5.148511,5.253727,5.452233,5.347934,5.208044,5.377347,5.174600,5.093092,5.318383,5.183217,...,5.391429,5.373495,5.269745,5.462639,5.438494,5.183750,5.008280,5.471475,5.096284,5.147228
730236,5.727775,6.254866,6.060818,6.010149,5.910481,5.935546,5.788366,6.451025,5.519453,5.638241,...,6.160716,6.208490,6.142807,5.313290,5.270430,5.354565,5.255737,4.981816,5.380207,4.973487
730241,5.731454,5.074041,5.453450,5.466722,5.308568,5.155898,5.093306,5.153164,5.210335,5.156344,...,5.089060,5.277337,5.135769,5.644737,5.108351,5.047582,4.824020,5.129366,5.220636,5.371268
730249,6.085765,6.258911,6.260450,6.228555,5.961106,6.102829,6.139825,5.957617,6.112640,6.009320,...,6.222924,6.466923,6.312049,5.432837,5.689616,5.510738,5.775398,5.785600,5.828588,5.799976


In [76]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

gene_annotation.columns

{'ID': ['1', '10', '100', '1000', '10000'], 'CHR': ['19', '8', '20', '18', '1'], 'ORF': ['A1BG', 'NAT2', 'ADA', 'CDH2', 'AKT3'], 'GENE_ID': [1.0, 10.0, 100.0, 1000.0, 10000.0]}

    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    ['1', '2', '9', '10', '12', '13', '14', '15', '16', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28']
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {'ID': ['1', '10', '100', '1000', '10000'], 'CHR': ['19', '8', '20', '18', '1'], 'ORF': ['A1BG', 'NAT2', 

Index(['ID', 'CHR', 'ORF', 'GENE_ID'], dtype='object')

In [77]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'ORF'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)

In [78]:
genetic_data = normalize_gene_symbols_in_index(genetic_data)

genetic_data

Unnamed: 0,GSM655513,GSM655514,GSM655515,GSM655516,GSM655517,GSM655518,GSM655519,GSM655520,GSM655521,GSM655522,...,GSM655570,GSM655571,GSM655572,GSM655573,GSM655574,GSM655575,GSM655576,GSM655577,GSM655578,GSM655579
A1BG,5.668639,5.397250,5.533062,5.439878,5.455800,5.429943,5.433182,5.920859,5.403412,5.565034,...,6.223809,5.480571,6.147314,6.249860,5.163622,5.785643,5.347343,5.160851,5.617737,5.595228
A1CF,9.682636,8.724913,8.581043,6.410286,6.235908,7.171356,9.123482,8.831986,9.576187,8.656312,...,6.592295,7.518565,6.904563,5.550474,5.611425,5.478196,5.221965,5.535751,8.179168,9.224674
A2M,12.464613,12.821652,12.966227,12.520486,13.089543,13.331817,12.606140,12.834138,12.828438,12.980657,...,11.032362,11.496659,9.691689,11.886333,9.689367,12.391459,10.935704,5.521329,12.925633,12.379518
A2ML1,5.100448,5.234235,5.262083,5.108046,5.153058,5.005294,5.216862,4.988229,5.101942,5.133923,...,5.315465,5.292718,5.260594,5.120014,5.101100,5.063032,4.969653,5.054539,5.067441,5.026277
A4GALT,8.787488,7.887596,8.429632,8.313462,8.144748,8.193744,8.415532,8.161773,7.909114,7.967972,...,7.563385,8.597225,7.648621,7.797198,7.435960,8.229121,7.069481,7.245178,8.511372,7.888782
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZXDC,8.726466,8.622260,8.639624,8.487453,8.528002,8.808438,8.563243,8.759431,8.124979,8.527813,...,7.936377,8.133096,7.969016,8.531433,8.737743,8.471879,7.752223,7.706087,8.967396,8.762011
ZYG11B,8.857001,8.208841,8.259732,7.955382,8.092397,8.097794,8.159850,8.714595,8.282024,7.934380,...,7.860002,7.948557,7.781597,7.917889,8.242090,7.291694,7.943692,8.781280,8.781819,8.907930
ZYX,8.965583,9.547013,9.823586,9.714980,10.179625,10.434708,9.511976,8.886803,9.161385,9.508151,...,9.701222,10.041946,10.060402,9.926511,9.515641,10.052259,9.102406,9.584249,9.194147,8.390669
ZZEF1,7.983045,7.868412,8.190961,7.895259,7.864069,7.842018,7.840938,7.647074,7.597856,8.150198,...,8.143651,8.045217,7.738805,8.205857,8.328447,8.374914,8.196029,7.934793,8.317318,8.205114


In [79]:
merged_data = geo_merge_clinical_genetic_data(selected_clinical_data, genetic_data)
# Preprocessing ran to completion without errors, so the cohort is marked as available
is_available = True

merged_data

Unnamed: 0,Kidney Chromophobe,A1BG,A1CF,A2M,A2ML1,A4GALT,A4GNT,AAA1,AAAS,AACS,...,ZW10,ZWILCH,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11B,ZYX,ZZEF1,ZZZ3
GSM655513,0.0,5.668639,9.682636,12.464613,5.100448,8.787488,5.809985,5.371351,8.812667,8.677541,...,8.344515,6.125898,8.179752,5.803459,5.454540,8.726466,8.857001,8.965583,7.983045,9.034118
GSM655514,0.0,5.397250,8.724913,12.821652,5.234235,7.887596,5.884389,5.552994,8.726981,8.543634,...,8.441628,6.190711,7.928924,5.754128,5.740305,8.622260,8.208841,9.547013,7.868412,8.738175
GSM655515,0.0,5.533062,8.581043,12.966227,5.262083,8.429632,5.849380,5.497786,9.041170,8.688982,...,8.252270,5.488916,8.042933,5.734307,5.589368,8.639624,8.259732,9.823586,8.190961,8.905796
GSM655516,0.0,5.439878,6.410286,12.520486,5.108046,8.313462,5.709369,5.475360,9.044497,9.346827,...,8.370992,6.272480,9.006933,5.655809,5.579518,8.487453,7.955382,9.714980,7.895259,8.933261
GSM655517,0.0,5.455800,6.235908,13.089543,5.153058,8.144748,5.623220,5.497168,8.881100,8.633818,...,8.230474,6.430454,8.604294,5.667027,5.701254,8.528002,8.092397,10.179625,7.864069,9.227768
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GSM655575,0.0,5.785643,5.478196,12.391459,5.063032,8.229121,6.011650,4.933610,8.040029,10.093184,...,8.840957,6.061439,10.494962,5.642196,5.644396,8.471879,7.291694,10.052259,8.374914,10.172715
GSM655576,0.0,5.347343,5.221965,10.935704,4.969653,7.069481,5.365046,4.906179,8.122486,8.354842,...,9.702138,7.870020,11.444870,5.998233,5.740564,7.752223,7.943692,9.102406,8.196029,9.482569
GSM655577,0.0,5.160851,5.535751,5.521329,5.054539,7.245178,5.697122,5.170637,8.477883,9.645716,...,9.061160,9.209461,11.673874,5.224771,6.182081,7.706087,8.781280,9.584249,7.934793,10.650001
GSM655578,0.0,5.617737,8.179168,12.925633,5.067441,8.511372,5.065305,5.023733,8.062029,8.667378,...,8.833000,5.645277,8.825598,5.962845,5.752671,8.967396,8.781819,9.194147,8.317318,9.670159


In [80]:
print(f"The merged dataset contains {len(merged_data)} samples.")
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

The merged dataset contains 67 samples.
For the feature 'Kidney Chromophobe', the least common label is '1.0' with 10 occurrences. This represents 14.93% of the dataset.
The distribution of the feature 'Kidney Chromophobe' in this dataset is fine.



False

In [81]:
if is_available:
    save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data, note='')
    # Save the cohort only when the trait distribution is usable for regression
    if not is_trait_biased:
        merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)
else:
    save_cohort_info(cohort, JSON_PATH, is_available)

In [82]:
# Stopped: no sample characteristic maps clearly to the trait
cohort = accession_num = "GSE14670"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Virtual-Karyotyping with SNP microarrays in morphologically challenging renal cell neoplasms"
!Series_summary	"Genetic lesions characteristic for RCC subtypes can be identified by virtual karyotyping with SNP microarrays. In this study, we examined whether virtual karyotypes could be used to better classify a cohort of morphologically challenging/unclassified RCC."
!Series_overall_design	"Tumor resection specimens from 17 patients were profiled by virtual karyotyping with Affymetrix 10K 2.0 or 250K Nsp SNP Mapping arrays and were also evaluated independently by a panel of seven genito-urinary pathologists. Tumors were classified by the established pattern of genomic imbalances based on a reference cohort of 98 cases with classic morphology and compared to the morphologic diagnosis of the pathologist panel. In 3 cases, samples from areas with different morphologic appearance were also tested (n=5)."


Unnamed: 0,!Sample_geo_accession,GSM240385,GSM240386,GSM240387,GSM240388,GSM240389,GSM240390,GSM240391,GSM240392,GSM240393,...,GSM366129,GSM366130,GSM366131,GSM366132,GSM366133,GSM366135,GSM366136,GSM366137,GSM366138,GSM366139
0,!Sample_characteristics_ch1,sex: F,sex: M,sex: M,sex: M,sex: F,sex: M,sex: M,sex: F,sex: M,...,Sex: M,Sex: N/A,Sex: N/A,Sex: N/A,Sex: N/A,Sex: M,Sex: N/A,Sex: N/A,Sex: M,Sex: F
1,!Sample_characteristics_ch1,agedecade: 7,agedecade: 5,agedecade: 6,agedecade: 7,agedecade: 5,agedecade: 5,agedecade: 7,agedecade: 6,agedecade: 8,...,AgeDecade: 68,AgeDecade: N/A,AgeDecade: N/A,AgeDecade: N/A,AgeDecade: N/A,AgeDecade: 68,AgeDecade: N/A,AgeDecade: N/A,AgeDecade: 68,AgeDecade: 68
2,!Sample_characteristics_ch1,tumor size: 7.7,tumor size: 9.7,tumor size: 5.5,tumor size: 7,tumor size: 6,tumor size: 4,tumor size: 6,tumor size: 5.4,tumor size: 4.5,...,Tumor Size: 5.699999809,Tumor Size: N/A,Tumor Size: N/A,Tumor Size: N/A,Tumor Size: 6,Tumor Size: 5,Tumor Size: N/A,Tumor Size: N/A,Tumor Size: N/A,Tumor Size: 14.19999981
3,!Sample_characteristics_ch1,furhman nuclear grade: 2,furhman nuclear grade: 3,furhman nuclear grade: 3,furhman nuclear grade: 3,furhman nuclear grade: 2,furhman nuclear grade: 2,furhman nuclear grade: 2,furhman nuclear grade: 2,furhman nuclear grade: 3,...,Furhman Nuclear Grade: 4,Furhman Nuclear Grade: N/A,Furhman Nuclear Grade: N/A,Furhman Nuclear Grade: N/A,Furhman Nuclear Grade: 4,Furhman Nuclear Grade: 2,Furhman Nuclear Grade: N/A,Furhman Nuclear Grade: N/A,Furhman Nuclear Grade: 3,Furhman Nuclear Grade: 4
4,!Sample_characteristics_ch1,tnm stage: T2NXMX,tnm stage: T2NXMX,tnm stage: T1NXMX,tnm stage: T3bNXMX,tnm stage: T1bNXMX,tnm stage: T1aNXMX,tnm stage: T1bNXMX,tnm stage: T1bNXMX,tnm stage: T1bNXMX,...,TNM Stage: T3bNxMx,TNM Stage: N/A,TNM Stage: N/A,TNM Stage: N/A,TNM Stage: T1bNxMx,TNM Stage: T1NN0MX,TNM Stage: N/A,TNM Stage: N/A,TNM Stage: N/A,TNM Stage: T3aN0Mx
5,!Sample_characteristics_ch1,lymphovascular invasion: N,lymphovascular invasion: N,lymphovascular invasion: N,lymphovascular invasion: Y,lymphovascular invasion: N,lymphovascular invasion: N,lymphovascular invasion: N,lymphovascular invasion: N,lymphovascular invasion: N,...,LymphoVascular Invasion: FALSE,LymphoVascular Invasion: N/A,LymphoVascular Invasion: N/A,LymphoVascular Invasion: N/A,LymphoVascular Invasion: FALSE,LymphoVascular Invasion: FALSE,LymphoVascular Invasion: N/A,LymphoVascular Invasion: N/A,LymphoVascular Invasion: FALSE,LymphoVascular Invasion: FALSE
6,!Sample_characteristics_ch1,"gross appearance: WC solid, fleshy tan-black",gross appearance: Yellow orange granular lesio...,gross appearance: WC yellow variegated,gross appearance: Tan-brown with central scar ...,gross appearance: Bright yellow,gross appearance: lobulated yellow,gross appearance: variegated golden yellow,gross appearance: golden yellow orange,gross appearance: variegated golden yellow,...,,,,,,,,,,
7,!Sample_characteristics_ch1,"comment: IHC: pos CK7, EP4, CI; neg vim, CD10...",comment:,comment:,comment: .015 cm PRCC,comment:,"comment: Multifocal; IHC: pos RCC; neg CK7, E...",comment:,comment:,comment:,...,,,,,,,,,,


In [83]:
# Row 7 holds free-text pathology comments; list its unique values to check for a usable trait label
candidate_row = clinical_data.iloc[7]
candidate_row.unique()

array(['!Sample_characteristics_ch1',
       'comment: IHC:  pos CK7, EP4, CI; neg vim, CD10, ecad, inh; Ki67 pos < 10%',
       'comment: ', 'comment: .015 cm PRCC',
       'comment: Multifocal; IHC:  pos RCC; neg CK7, EP4, vim, inh, IC; CD10 pos memb; ecad focal weak memb',
       'comment: IHC:  RCC neg, CK7 scat pos and tubules.',
       'comment: Multifocal; no IHC; no FISH',
       'comment: 4cm OC, no ECE, no LVI',
       'comment: IHC:  pos CD10, ecad, inh; neg CI, vim, rcc, EP4; CK7 scat pos; Ki67 1+ of neoplastic cells',
       'comment: 21cm OC, no ECE, no LVI; FISH not c/w OC.',
       'comment: Small focus of clear cell; multifocal vs tangential cut.',
       'comment: Extensive necrosis.', nan], dtype=object)
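
Inspecting one row at a time with `iloc[...].unique()` works, but it can be quicker to list the unique values of every characteristics row at once. The helper below is a hypothetical sketch (`summarize_rows` is not part of `utils`) and assumes the clinical table has the layout shown above: one row per characteristic and one column per sample.

```python
import pandas as pd


def summarize_rows(clinical_df: pd.DataFrame, first_col: str = "!Sample_geo_accession"):
    """Map each row index to the sorted unique non-missing values in that row."""
    summary = {}
    # Drop the label column so only per-sample values remain.
    for idx, row in clinical_df.drop(columns=[first_col]).iterrows():
        values = row.dropna().astype(str).str.strip().unique()
        summary[idx] = sorted(values)
    return summary


# Toy table in the same shape as `clinical_data` (values are illustrative).
toy = pd.DataFrame({
    "!Sample_geo_accession": ["!Sample_characteristics_ch1"] * 2,
    "GSM1": ["sex: F", "agedecade: 7"],
    "GSM2": ["sex: M", "agedecade: 5"],
    "GSM3": ["sex: M", None],
})
for idx, values in summarize_rows(toy).items():
    print(idx, values)
```

Rows with a single unique value cannot serve as a trait, and rows with free text usually need manual review.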

In [None]:
is_gene_availabe = True
trait_row = 2
age_row = None
gender_row = None

trait_type = 'binary'

# Verify and use the functions generated by GPT

# This function converts a sample-characteristics value to a binary trait label.
# It returns 1 when the value is exactly 'condition: tumor' and 0 otherwise.
# None of the characteristics rows shown above has a 'condition' field, so this converter does not fit this cohort.
def convert_trait(tissue_type):
    """
    Convert a sample-characteristics value to a binary trait label.
    Returns 1 for 'condition: tumor' and 0 for any other value.
    """
    if tissue_type == 'condition: tumor':
        return 1  # Tumor sample
    else:
        return 0  # Any other value

In [84]:
# Finished: Biased
cohort = accession_num = "GSE40911"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Expression analysis and in silico characterization of intronic long noncoding RNAs in renal cell carcinoma: emerging functional associations (RCC malignancy)"
!Series_summary	"Intronic and intergenic long noncoding RNAs (lncRNAs) are emerging gene expression regulators. The molecular pathogenesis of renal cell carcinoma (RCC) is still poorly understood, and in particular, limited studies are available for intronic lncRNAs expressed in RCC. Microarray experiments were performed with two different custom-designed arrays enriched with probes for lncRNAs mapping to intronic genomic regions. Samples from 18 primary clear cell RCC tumors and 11 nontumor adjacent matched tissues were analyzed with 4k-probes microarrays. Oligoarrays with 44k-probes were used to interrogate 17 RCC samples (14 clear cell, 2 papillary, 1 chromophobe subtypes) split into four pools. Meta-analyses were performed by taking the genomic coordinates of the RCC-expressed lncRNAs, and cross-referencing the

Unnamed: 0,!Sample_geo_accession,GSM1004655,GSM1004656,GSM1004657,GSM1004658,GSM1004659,GSM1004660,GSM1004661,GSM1004662,GSM1004663,...,GSM1004689,GSM1004690,GSM1004691,GSM1004692,GSM1004693,GSM1004694,GSM1004695,GSM1004696,GSM1004697,GSM1004698
0,!Sample_characteristics_ch1,patient identifier: 3,patient identifier: 3,patient identifier: 5,patient identifier: 5,patient identifier: 8,patient identifier: 8,patient identifier: 9,patient identifier: 9,patient identifier: 10,...,patient identifier: 24,patient identifier: 24,patient identifier: 26,patient identifier: 26,patient identifier: 28,patient identifier: 28,patient identifier: 30,patient identifier: 30,patient identifier: 31,patient identifier: 31
1,!Sample_characteristics_ch1,disease: clear cell renal cell carcinoma (RCC),disease: clear cell renal cell carcinoma (RCC),disease: clear cell renal cell carcinoma (RCC),disease: clear cell renal cell carcinoma (RCC),disease: clear cell renal cell carcinoma (RCC),disease: clear cell renal cell carcinoma (RCC),disease: clear cell renal cell carcinoma (RCC),disease: clear cell renal cell carcinoma (RCC),disease: clear cell renal cell carcinoma (RCC),...,disease: clear cell renal cell carcinoma (RCC),disease: clear cell renal cell carcinoma (RCC),disease: clear cell renal cell carcinoma (RCC),disease: clear cell renal cell carcinoma (RCC),disease: clear cell renal cell carcinoma (RCC),disease: clear cell renal cell carcinoma (RCC),disease: clear cell renal cell carcinoma (RCC),disease: clear cell renal cell carcinoma (RCC),disease: clear cell renal cell carcinoma (RCC),disease: clear cell renal cell carcinoma (RCC)
2,!Sample_characteristics_ch1,tissue: adjacent nontumor kidney tissue,tissue: adjacent nontumor kidney tissue,tissue: adjacent nontumor kidney tissue,tissue: adjacent nontumor kidney tissue,tissue: adjacent nontumor kidney tissue,tissue: adjacent nontumor kidney tissue,tissue: adjacent nontumor kidney tissue,tissue: adjacent nontumor kidney tissue,tissue: adjacent nontumor kidney tissue,...,tissue: primary kidney tumor,tissue: primary kidney tumor,tissue: primary kidney tumor,tissue: primary kidney tumor,tissue: primary kidney tumor,tissue: primary kidney tumor,tissue: primary kidney tumor,tissue: primary kidney tumor,tissue: primary kidney tumor,tissue: primary kidney tumor
3,!Sample_characteristics_ch1,gender: female,gender: female,gender: male,gender: male,gender: female,gender: female,gender: male,gender: male,gender: female,...,gender: male,gender: male,gender: female,gender: female,gender: male,gender: male,gender: female,gender: female,gender: male,gender: male
4,!Sample_characteristics_ch1,age at surgery (yrs): 78,age at surgery (yrs): 78,age at surgery (yrs): 53,age at surgery (yrs): 53,age at surgery (yrs): 71,age at surgery (yrs): 71,age at surgery (yrs): 39,age at surgery (yrs): 39,age at surgery (yrs): 34,...,age at surgery (yrs): 75,age at surgery (yrs): 75,age at surgery (yrs): 40,age at surgery (yrs): 40,age at surgery (yrs): 51,age at surgery (yrs): 51,age at surgery (yrs): 51,age at surgery (yrs): 51,age at surgery (yrs): 50,age at surgery (yrs): 50
5,!Sample_characteristics_ch1,patient status: cancer-specific death,patient status: cancer-specific death,patient status: cancer-specific death,patient status: cancer-specific death,patient status: dead from other causes,patient status: dead from other causes,patient status: alive without cancer,patient status: alive without cancer,patient status: alive without cancer,...,fuhrman grade: IV,fuhrman grade: IV,fuhrman grade: II,fuhrman grade: II,fuhrman grade: III,fuhrman grade: III,fuhrman grade: II,fuhrman grade: II,fuhrman grade: III,fuhrman grade: III
6,!Sample_characteristics_ch1,,,,,,,,,,...,tumor size (cm): 15,tumor size (cm): 15,tumor size (cm): 7,tumor size (cm): 7,tumor size (cm): 5,tumor size (cm): 5,tumor size (cm): 8.5,tumor size (cm): 8.5,tumor size (cm): 8,tumor size (cm): 8
7,!Sample_characteristics_ch1,,,,,,,,,,...,necrosis: yes,necrosis: yes,necrosis: no,necrosis: no,necrosis: no,necrosis: no,necrosis: yes,necrosis: yes,necrosis: yes,necrosis: yes
8,!Sample_characteristics_ch1,,,,,,,,,,...,capsule infiltration: yes,capsule infiltration: yes,capsule infiltration: yes,capsule infiltration: yes,capsule infiltration: no,capsule infiltration: no,capsule infiltration: yes,capsule infiltration: yes,capsule infiltration: yes,capsule infiltration: yes
9,!Sample_characteristics_ch1,,,,,,,,,,...,tnm classification (t): 2,tnm classification (t): 2,tnm classification (t): 3b,tnm classification (t): 3b,tnm classification (t): 1,tnm classification (t): 1,tnm classification (t): 2,tnm classification (t): 2,tnm classification (t): 3b,tnm classification (t): 3b


In [85]:
# Row 1 holds the 'disease' field; list its unique values to check whether it can encode the trait
candidate_row = clinical_data.iloc[1]
candidate_row.unique()

array(['!Sample_characteristics_ch1',
       'disease: clear cell renal cell carcinoma (RCC)'], dtype=object)

In [88]:
is_gene_availabe = True
trait_row = 1
age_row = 4
gender_row = 3

trait_type = 'binary'

# Verify and use the functions generated by GPT

# This function converts the 'disease' value to a binary trait label.
# It returns 1 when the value is 'disease: clear cell renal cell carcinoma (RCC)' and 0 otherwise.
# Every sample in this cohort carries that value, so the resulting label is constant.
def convert_trait(tissue_type):
    """
    Convert a 'disease' characteristics value to a binary trait label.
    Returns 1 for clear cell renal cell carcinoma (RCC) and 0 for any other value.
    """
    if tissue_type == 'disease: clear cell renal cell carcinoma (RCC)':
        return 1  # Clear cell RCC
    else:
        return 0  # Any other value

In [89]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()



Unnamed: 0,GSM1004655,GSM1004656,GSM1004657,GSM1004658,GSM1004659,GSM1004660,GSM1004661,GSM1004662,GSM1004663,GSM1004664,...,GSM1004689,GSM1004690,GSM1004691,GSM1004692,GSM1004693,GSM1004694,GSM1004695,GSM1004696,GSM1004697,GSM1004698
Kidney Chromophobe,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
Age,78,78,53,53,71,71,39,39,34,34,...,75,75,40,40,51,51,51,51,50,50
Gender,1,1,0,0,1,1,0,0,1,1,...,0,0,1,1,0,0,1,1,0,0
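
The `convert_age` and `convert_gender` functions passed to `geo_select_clinical_features` are not shown in this section. The sketch below illustrates what such converters could look like for values such as `age at surgery (yrs): 78` and `gender: female`. The `_sketch` names are hypothetical, and the gender encoding (female as 1, male as 0) is inferred from the table above rather than taken from the original definitions.

```python
from typing import Optional


def _value_after_colon(cell) -> Optional[str]:
    """Return the text after the first ':' in a 'key: value' cell, or None."""
    if not isinstance(cell, str) or ":" not in cell:
        return None
    return cell.split(":", 1)[1].strip()


def convert_age_sketch(cell) -> Optional[float]:
    """Parse a numeric age; return None for missing or non-numeric values."""
    value = _value_after_colon(cell)
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def convert_gender_sketch(cell) -> Optional[int]:
    """Encode gender as 1 (female) or 0 (male); ASSUMPTION based on the output above."""
    value = _value_after_colon(cell)
    if value is None:
        return None
    value = value.lower()
    if value in ("female", "f"):
        return 1
    if value in ("male", "m"):
        return 0
    return None


print(convert_age_sketch("age at surgery (yrs): 78"))
print(convert_gender_sketch("gender: female"))
print(convert_age_sketch("AgeDecade: N/A"))
```

Returning `None` for unparseable cells keeps missing values explicit, so they can be handled later instead of being silently coded as 0.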


In [90]:
genetic_data = get_genetic_data(matrix_file)
genetic_data

Unnamed: 0_level_0,GSM1004655,GSM1004656,GSM1004657,GSM1004658,GSM1004659,GSM1004660,GSM1004661,GSM1004662,GSM1004663,GSM1004664,...,GSM1004689,GSM1004690,GSM1004691,GSM1004692,GSM1004693,GSM1004694,GSM1004695,GSM1004696,GSM1004697,GSM1004698
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
3,1.437273,1.674227,5.305341,4.193955,1.227295,3.416227,3.718273,4.329773,2.180568,1.933909,...,2.263227,1.810614,5.394136,5.185057,2.817341,2.631955,1.718864,0.911318,1.290841,1.184500
4,2.049591,2.043773,4.909659,4.769750,1.303795,2.549125,5.019545,5.817341,2.505500,2.462614,...,2.949727,2.252273,8.628886,9.682500,3.649705,3.624045,0.911318,1.885591,3.894159,2.964523
6,6.138932,5.641727,11.108727,10.532295,4.600114,9.868205,10.034773,12.374955,6.749909,6.394932,...,9.937273,9.117500,12.210477,14.781477,10.952080,10.515000,10.150250,8.656409,5.062489,4.377955
7,1.655091,1.115705,6.016773,6.079091,1.459250,2.676705,1.828693,2.787205,2.109114,2.007932,...,2.164375,1.674795,4.452432,5.394136,2.623705,2.234193,1.003591,1.744398,1.441341,1.398080
9,2.207614,2.803170,4.166545,3.309591,0.425273,0.958455,1.397477,2.026455,1.268511,1.314568,...,1.655739,1.375727,3.354136,4.187477,2.204364,1.943034,1.316227,1.314568,0.830182,1.011409
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4536,6.042318,5.310705,8.327068,8.044136,5.652591,5.437250,3.649705,6.107409,4.091625,3.248409,...,15.402841,18.166068,16.566273,16.050568,7.585966,0.185341,5.608455,5.310705,8.514114,5.644477
4537,3.140114,2.355761,2.947682,2.832591,3.312102,3.492955,0.493170,2.832591,2.075545,1.687136,...,2.699602,2.480773,3.833159,4.187477,2.078386,5.251841,2.068795,2.252273,2.667909,2.605636
4538,3.907591,3.279375,1.924352,1.674227,2.337955,2.564091,0.493170,3.134273,2.337159,2.073455,...,2.365898,2.209568,3.826091,4.673398,0.526318,2.274341,2.937318,2.974318,1.473125,2.510682
4541,6.722045,7.792432,7.052114,7.634705,11.214114,12.201534,9.317886,15.939886,8.500318,6.301864,...,3.204239,3.136705,5.142432,5.929659,4.157114,5.485727,4.972614,6.763159,3.852273,4.061705


In [91]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

gene_annotation.columns

{'ID': ['910', '4260', '1981', '2381', '4288'], 'GB_ACC': ['BE833259', 'BE702227', 'BF364095', 'BE081005', 'AW880607'], 'SPOT_TYPE': ['Exonic', 'Exonic', 'Exonic', 'Exonic', 'Exonic'], 'GENE_ID': [85439.0, 2776.0, 84131.0, 2776.0, 54768.0], 'GENE_SYMBOL': ['STON2', 'GNAQ', 'CEP78', 'GNAQ', 'HYDIN'], 'GENE_ANNOTATION': ['stonin 2', 'Guanine nucleotide binding protein (G protein), q polypeptide', 'centrosomal protein 78kDa', 'Guanine nucleotide binding protein (G protein), q polypeptide', 'hydrocephalus inducing homolog 2 (mouse); hydrocephalus inducing homolog (mouse)'], 'CPC_CODING_POTENTIAL': ['noncoding', 'noncoding', 'noncoding', 'noncoding', '-'], 'SEQUENCE': ['CTGATCCGCTTAAGCTTAGTATGTTTGAGTGTGTAATTTTAGTTTCTTTTCTGGTTGTATTTGTGGTAGTCAGATGTGTTGGATTGATTCCAACTGGACAGAGTAAGGAATTCCAGCATCCTCTTCCTGCTTGCTCGTGTTACCCCACAGATCAAACCCTCAATTCTAGTTGGGGATGCTGTCTAGCCCCACACCATGACTGAAGCCTTAAGCACTGTTGCGCCTCCATGTGCTTTGGATCAGCAACCCCAGTGGTATTCTACCAGAGCATTGTGGGAAAGCAGATGTATAGTCAGGTCCCAACAGCAAATTGTTGGGTGTGAGAG

Index(['ID', 'GB_ACC', 'SPOT_TYPE', 'GENE_ID', 'GENE_SYMBOL',
       'GENE_ANNOTATION', 'CPC_CODING_POTENTIAL', 'SEQUENCE', 'COORDINATES',
       'CLONE ID', 'SPOT_ID'],
      dtype='object')

In [93]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'GENE_SYMBOL'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)
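
To make the mapping step concrete, here is a minimal sketch of how probe identifiers could be translated to gene symbols with pandas. It is not the code of `get_gene_mapping` or `apply_gene_mapping` in `utils`; the function name `map_probes_to_genes` is hypothetical, and averaging probes that share a symbol is an assumption (other aggregation rules are possible).

```python
import pandas as pd


def map_probes_to_genes(expr: pd.DataFrame, annotation: pd.DataFrame,
                        id_col: str = "ID", symbol_col: str = "GENE_SYMBOL") -> pd.DataFrame:
    """Replace probe IDs in the index with gene symbols, averaging duplicate symbols."""
    mapping = annotation[[id_col, symbol_col]].dropna().astype({id_col: str})
    mapping = mapping.set_index(id_col)[symbol_col].rename("Gene")
    expr = expr.copy()
    expr.index = expr.index.astype(str)  # align index dtype with the annotation IDs
    joined = expr.join(mapping, how="inner")  # probes without a symbol are dropped
    # ASSUMPTION: probes mapping to the same gene are averaged.
    return joined.groupby("Gene").mean()


# Toy data; the ID-to-symbol pairs are taken from the annotation preview above.
annotation = pd.DataFrame({
    "ID": ["910", "4260", "2381"],
    "GENE_SYMBOL": ["STON2", "GNAQ", "GNAQ"],
})
expr = pd.DataFrame(
    {"GSM_A": [1.0, 2.0, 4.0, 9.0], "GSM_B": [2.0, 4.0, 6.0, 9.0]},
    index=["910", "4260", "2381", "9999"],
)
result = map_probes_to_genes(expr, annotation)
print(result)
```

In this toy example the two `GNAQ` probes collapse into a single row, and the unannotated probe `9999` is discarded.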

In [94]:
genetic_data = normalize_gene_symbols_in_index(genetic_data)

genetic_data

Unnamed: 0,GSM1004655,GSM1004656,GSM1004657,GSM1004658,GSM1004659,GSM1004660,GSM1004661,GSM1004662,GSM1004663,GSM1004664,...,GSM1004689,GSM1004690,GSM1004691,GSM1004692,GSM1004693,GSM1004694,GSM1004695,GSM1004696,GSM1004697,GSM1004698
A2ML1,11.611000,12.398614,8.044136,8.556591,7.704807,6.522909,9.999795,9.772500,7.132091,5.202318,...,8.696045,7.371818,8.591148,8.100955,11.280591,11.148239,11.108727,9.756568,8.489977,5.909795
AARSD1,5.463477,7.561534,4.762284,4.645347,8.187159,4.843648,5.484097,6.125852,4.627341,4.554318,...,5.583023,4.967341,5.021307,5.170159,4.953369,5.731830,5.867227,5.509068,6.256591,4.950045
AATF,5.866136,6.779500,7.152886,7.174977,5.923523,7.814091,5.647227,5.936205,5.132614,8.291773,...,6.651886,6.834841,5.958523,6.061705,6.756705,6.643341,8.088909,6.793795,6.476136,7.770227
ABCA3,2.304568,2.531636,4.169409,3.884727,0.992409,2.001591,2.025750,4.709955,2.187364,2.650977,...,4.423670,4.664216,5.718386,6.238136,4.003159,5.027864,2.009568,0.258432,2.962682,2.970159
ABCB10,3.128023,4.258580,5.121580,5.776841,5.195102,6.675045,5.629682,5.837705,6.154091,6.749909,...,1.439886,2.184136,1.479148,1.606227,4.489114,4.833023,5.506409,3.971659,3.641545,3.018000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZXDC,2.842023,5.670727,3.114318,3.013386,7.037091,2.927023,4.182432,4.116295,3.781341,4.459932,...,3.460523,3.260307,2.971625,3.644750,2.470284,3.244091,4.407591,3.291898,3.743045,2.431773
ZZEF1,1.811693,2.781568,2.068795,2.061159,2.292614,2.279068,2.309682,3.003500,3.627909,2.482614,...,1.370273,2.416659,1.479148,2.442682,2.807750,4.364205,3.128023,3.073705,2.769273,2.724955
ARL5A,5.899591,5.942318,5.942318,6.585159,6.675045,6.543614,1.284795,2.230034,3.418818,2.890795,...,3.263795,3.665227,4.000250,4.701227,7.104273,7.253159,8.872409,9.100227,4.407591,3.595500
NONO,18.844500,18.554023,21.731909,21.803500,14.853091,15.429045,18.751341,17.925205,21.613500,21.613500,...,15.005955,14.472636,22.955568,21.140386,17.286795,16.766295,19.003841,20.658341,17.032136,19.342568


In [95]:
merged_data = geo_merge_clinical_genetic_data(selected_clinical_data, genetic_data)
# The preprocessing runs through, which means is_available should be True
is_available = True

merged_data

Unnamed: 0,Kidney Chromophobe,Age,Gender,A2ML1,AARSD1,AATF,ABCA3,ABCB10,ABCB6,ABCB7,...,ZNF841,ZNHIT3,ZRANB3,ZSCAN18,ZW10,ZXDC,ZZEF1,ARL5A,NONO,PRKRIP1
GSM1004655,1.0,78.0,1.0,11.611,5.463477,5.866136,2.304568,3.128023,2.288523,3.816227,...,52.649727,428.064205,5.56325,8.566886,2.134705,2.842023,1.811693,5.899591,18.8445,3.841659
GSM1004656,1.0,78.0,1.0,12.398614,7.561534,6.7795,2.531636,4.25858,1.857784,5.336545,...,53.024932,428.064205,4.555818,10.59,1.646068,5.670727,2.781568,5.942318,18.554023,8.121682
GSM1004657,1.0,53.0,0.0,8.044136,4.762284,7.152886,4.169409,5.12158,2.802159,5.929659,...,56.87975,492.838,6.010773,7.792432,3.739795,3.114318,2.068795,5.942318,21.731909,3.080023
GSM1004658,1.0,53.0,0.0,8.556591,4.645347,7.174977,3.884727,5.776841,3.129,5.627193,...,53.339636,492.838,6.126432,8.121682,3.108023,3.013386,2.061159,6.585159,21.8035,3.413045
GSM1004659,1.0,71.0,1.0,7.704807,8.187159,5.923523,0.992409,5.195102,1.369386,4.128,...,61.150727,559.608205,4.766364,8.009045,1.378205,7.037091,2.292614,6.675045,14.853091,8.044136
GSM1004660,1.0,71.0,1.0,6.522909,4.843648,7.814091,2.001591,6.675045,3.592977,4.304909,...,57.500205,559.608205,4.132818,7.491523,2.551955,2.927023,2.279068,6.543614,15.429045,5.078136
GSM1004661,1.0,39.0,0.0,9.999795,5.484097,5.647227,2.02575,5.629682,4.240318,4.577591,...,62.049091,110.698523,4.736011,8.18225,2.531636,4.182432,2.309682,1.284795,18.751341,3.000057
GSM1004662,1.0,39.0,0.0,9.7725,6.125852,5.936205,4.709955,5.837705,3.957761,6.022886,...,52.649727,175.147705,4.940864,8.077614,2.713705,4.116295,3.0035,2.230034,17.925205,4.526136
GSM1004663,1.0,34.0,1.0,7.132091,4.627341,5.132614,2.187364,6.154091,1.673659,3.484932,...,61.539432,363.600432,6.819295,6.194659,1.844273,3.781341,3.627909,3.418818,21.6135,3.093068
GSM1004664,1.0,34.0,1.0,5.202318,4.554318,8.291773,2.650977,6.749909,3.503614,2.606432,...,53.883227,351.976114,7.99025,4.994,1.3405,4.459932,2.482614,2.890795,21.6135,3.139205
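
The merged table has one row per sample, with the clinical columns first and the gene columns after them. A minimal sketch of such a merge is shown below, assuming both inputs are laid out as features by samples, as in the previews above. The function name is hypothetical, and `geo_merge_clinical_genetic_data` in `utils` may handle missing values or ordering differently.

```python
import pandas as pd


def merge_clinical_and_genetic(clinical: pd.DataFrame, genetic: pd.DataFrame) -> pd.DataFrame:
    """Transpose both tables to samples x features and keep samples present in both."""
    merged = pd.concat([clinical.T, genetic.T], axis=1, join="inner")
    return merged.astype(float)


# Toy inputs; sample IDs and values are illustrative.
clinical = pd.DataFrame(
    {"GSM1": [1, 78, 1], "GSM2": [1, 53, 0]},
    index=["Kidney Chromophobe", "Age", "Gender"],
)
genetic = pd.DataFrame(
    {"GSM1": [11.6, 5.9], "GSM2": [8.0, 7.2], "GSM3": [7.7, 5.9]},
    index=["A2ML1", "AATF"],
)
merged = merge_clinical_and_genetic(clinical, genetic)
print(merged)
```

Using an inner join means a sample such as `GSM3`, which has expression values but no clinical record, is left out of the merged table.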


In [96]:
print(f"The merged dataset contains {len(merged_data)} samples.")
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

The merged dataset contains 44 samples.
For the feature 'Kidney Chromophobe', the least common label is '1.0' with 44 occurrences. This represents 100.00% of the dataset.
The distribution of the feature 'Kidney Chromophobe' in this dataset is severely biased.

Quartiles for 'Age':
  25%: 40.0
  50% (Median): 51.0
  75%: 71.0
Min: 34.0
Max: 78.0
The distribution of the feature 'Age' in this dataset is fine.

For the feature 'Gender', the least common label is '1.0' with 20 occurrences. This represents 45.45% of the dataset.
The distribution of the feature 'Gender' in this dataset is fine.



True

In [97]:
if is_available:
    save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data, note='')
else:
    save_cohort_info(cohort, JSON_PATH, is_available)
merged_data.head()
if not is_trait_biased:
    merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)

In [98]:
# Finished
cohort = accession_num = "GSE19982"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

('/Users/legion/Desktop/Courses/IS389/data\\GEO\\Kidney-Chromophobe\\GSE19982\\GSE19982_family.soft.gz',
 '/Users/legion/Desktop/Courses/IS389/data\\GEO\\Kidney-Chromophobe\\GSE19982\\GSE19982_series_matrix.txt.gz')

In [243]:
# Stopped: No obvious way to convert any row to the trait
cohort = accession_num = "GSE19949"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

('/Users/legion/Desktop/Courses/IS389/data\\GEO\\Kidney-Chromophobe\\GSE19949\\GSE19949_family.soft.gz',
 '/Users/legion/Desktop/Courses/IS389/data\\GEO\\Kidney-Chromophobe\\GSE19949\\GSE19949-GPL3921_series_matrix.txt.gz')

In [112]:
# Finished: Biased
cohort = accession_num = "GSE95425"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

('/Users/legion/Desktop/Courses/IS389/data\\GEO\\Kidney-Chromophobe\\GSE95425\\GSE95425_family.soft.gz',
 '/Users/legion/Desktop/Courses/IS389/data\\GEO\\Kidney-Chromophobe\\GSE95425\\GSE95425_series_matrix.txt.gz')

In [125]:
# Stopped: No gene mapping
cohort = accession_num = "GSE8271"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

('/Users/legion/Desktop/Courses/IS389/data\\GEO\\Kidney-Chromophobe\\GSE8271\\GSE8271_family.soft.gz',
 '/Users/legion/Desktop/Courses/IS389/data\\GEO\\Kidney-Chromophobe\\GSE8271\\GSE8271-GPL2004_series_matrix.txt.gz')

In [140]:
# Stopped: No gene mapping
cohort = accession_num = "GSE144082"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

('/Users/legion/Desktop/Courses/IS389/data\\GEO\\Kidney-Chromophobe\\GSE144082\\GSE144082_family.soft.gz',
 '/Users/legion/Desktop/Courses/IS389/data\\GEO\\Kidney-Chromophobe\\GSE144082\\GSE144082_series_matrix.txt.gz')

In [147]:
# Finished
cohort = accession_num = "GSE11024"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

('/Users/legion/Desktop/Courses/IS389/data\\GEO\\Kidney-Chromophobe\\GSE11024\\GSE11024_family.soft.gz',
 '/Users/legion/Desktop/Courses/IS389/data\\GEO\\Kidney-Chromophobe\\GSE11024\\GSE11024_series_matrix.txt.gz')

In [161]:
# Stopped: No obvious traits
cohort = accession_num = "GSE11447"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

('/Users/legion/Desktop/Courses/IS389/data\\GEO\\Kidney-Chromophobe\\GSE11447\\GSE11447_family.soft.gz',
 '/Users/legion/Desktop/Courses/IS389/data\\GEO\\Kidney-Chromophobe\\GSE11447\\GSE11447_series_matrix.txt.gz')

In [166]:
# Stopped: No clinical data
cohort = accession_num = "GSE3"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

('/Users/legion/Desktop/Courses/IS389/data\\GEO\\Kidney-Chromophobe\\GSE3\\GSE3_family.soft.gz',
 '/Users/legion/Desktop/Courses/IS389/data\\GEO\\Kidney-Chromophobe\\GSE3\\GSE3-GPL10_series_matrix.txt.gz')

In [170]:
# Finished
cohort = accession_num = "GSE6280"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

('/Users/legion/Desktop/Courses/IS389/data\\GEO\\Kidney-Chromophobe\\GSE6280\\GSE6280_family.soft.gz',
 '/Users/legion/Desktop/Courses/IS389/data\\GEO\\Kidney-Chromophobe\\GSE6280\\GSE6280-GPL96_series_matrix.txt.gz')

In [182]:
# Finished
cohort = accession_num = "GSE11151"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

('/Users/legion/Desktop/Courses/IS389/data\\GEO\\Kidney-Chromophobe\\GSE11151\\GSE11151_family.soft.gz',
 '/Users/legion/Desktop/Courses/IS389/data\\GEO\\Kidney-Chromophobe\\GSE11151\\GSE11151_series_matrix.txt.gz')

In [241]:
# Stopped: No obvious traits
cohort = accession_num = "GSE4125"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

('/Users/legion/Desktop/Courses/IS389/data\\GEO\\Kidney-Chromophobe\\GSE4125\\GSE4125_family.soft.gz',
 '/Users/legion/Desktop/Courses/IS389/data\\GEO\\Kidney-Chromophobe\\GSE4125\\GSE4125-GPL2649_series_matrix.txt.gz')

In [198]:
# Finished: Biased
cohort = accession_num = "GSE40912"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

('/Users/legion/Desktop/Courses/IS389/data\\GEO\\Kidney-Chromophobe\\GSE40912\\GSE40912_family.soft.gz',
 '/Users/legion/Desktop/Courses/IS389/data\\GEO\\Kidney-Chromophobe\\GSE40912\\GSE40912_series_matrix.txt.gz')

In [210]:
# Finished
cohort = accession_num = "GSE57162"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

('/Users/legion/Desktop/Courses/IS389/data\\GEO\\Kidney-Chromophobe\\GSE57162\\GSE57162_family.soft.gz',
 '/Users/legion/Desktop/Courses/IS389/data\\GEO\\Kidney-Chromophobe\\GSE57162\\GSE57162_series_matrix.txt.gz')

In [225]:
# Finished: Biased
cohort = accession_num = "GSE40914"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

('/Users/legion/Desktop/Courses/IS389/data\\GEO\\Kidney-Chromophobe\\GSE40914\\GSE40914_family.soft.gz',
 '/Users/legion/Desktop/Courses/IS389/data\\GEO\\Kidney-Chromophobe\\GSE40914\\GSE40914-GPL3985_series_matrix.txt.gz')

In [227]:
# Finished
cohort = accession_num = "GSE68606"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

('/Users/legion/Desktop/Courses/IS389/data\\GEO\\Kidney-Chromophobe\\GSE68606\\GSE68606_family.soft.gz',
 '/Users/legion/Desktop/Courses/IS389/data\\GEO\\Kidney-Chromophobe\\GSE68606\\GSE68606_series_matrix.txt.gz')
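
Each cohort cell above repeats the same lookup of a SOFT file and a series matrix file. The sketch below shows one way such a lookup could be written. It is not the code of `get_relevant_filepaths` in `utils`; the name `find_geo_files` is hypothetical, and picking the first matrix file in sorted order when a series has several platforms (for example `GSE19949-GPL3921_series_matrix.txt.gz`) is an assumption.

```python
import os
import tempfile


def find_geo_files(cohort_dir: str):
    """Return (soft_path, matrix_path) for a GEO cohort directory, or None if missing."""
    files = sorted(os.listdir(cohort_dir))
    soft = [f for f in files if f.endswith("_family.soft.gz")]
    matrix = [f for f in files if f.endswith("_series_matrix.txt.gz")]
    # ASSUMPTION: when several matrix files exist, take the first in sorted order.
    soft_path = os.path.join(cohort_dir, soft[0]) if soft else None
    matrix_path = os.path.join(cohort_dir, matrix[0]) if matrix else None
    return soft_path, matrix_path


# Demonstration on a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["GSE1_family.soft.gz", "GSE1-GPL2_series_matrix.txt.gz",
                 "GSE1-GPL1_series_matrix.txt.gz"]:
        open(os.path.join(demo_dir, name), "w").close()
    soft_file_demo, matrix_file_demo = find_geo_files(demo_dir)
    print(os.path.basename(soft_file_demo), os.path.basename(matrix_file_demo))
```

Building paths with `os.path.join` throughout also avoids the mixed `/` and `\\` separators visible in the outputs above.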

# Preprocessing





In [244]:
from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Integrative genome-wide expression profiling identifies three distinct molecular subgroups of renal cell carcinoma with different patient outcome"
!Series_summary	"Background: Renal cell carcinoma (RCC) is characterized by a number of diverse molecular aberrations that differ among individuals. Recent approaches to molecularly classify RCC were based on clinical, pathological as well as on single molecular parameters. As a consequence, gene expression patterns reflecting the sum of genetic aberrations in individual tumors may not have been recognized. In an attempt to uncover such molecular features in RCC, we used a novel, unbiased and integrative approach."
!Series_summary	"Methods: We integrated gene expression data from 97 primary RCCs of different pathologic parameters, 15 RCC metastases as well as 34 cancer cell lines for two-way nonsupervised hierarchical clustering using gene groups suggested by the PANTHER Classification System. We depicted the genomic landscape

Unnamed: 0,!Sample_geo_accession,GSM498450,GSM498451,GSM498452,GSM498453,GSM498454,GSM498455,GSM498456,GSM498457,GSM498458,...,GSM498587,GSM498588,GSM498589,GSM498590,GSM498591,GSM498592,GSM498593,GSM498594,GSM498595,GSM498596
0,!Sample_characteristics_ch1,grade: 2,grade: 2,grade: 2,grade: 2,grade: 1,grade: 2,grade: 1,grade: 2,grade: 3,...,cell line: SLR20,cell line: Caki-2,cell line: SLR21,cell line: KU19-20,cell line: SLR23,cell line: PC3hep27,cell line: PC3hep30,cell line: PC3vec1,cell line: PC3vec3,cell line: HK-2
1,!Sample_characteristics_ch1,stage: 2,stage: 2,stage: 2,stage: 1,stage: 2,stage: 2,stage: 1,stage: 2,stage: 3,...,grade: NA,grade: NA,grade: NA,grade: NA,grade: NA,grade: NA,grade: NA,grade: NA,grade: NA,grade: NA
2,!Sample_characteristics_ch1,sample type: neoplasia,sample type: neoplasia,sample type: neoplasia,sample type: neoplasia,sample type: neoplasia,sample type: neoplasia,sample type: neoplasia,sample type: neoplasia,sample type: neoplasia,...,stage: NA,stage: NA,stage: NA,stage: NA,stage: NA,stage: NA,stage: NA,stage: NA,stage: NA,stage: NA
3,!Sample_characteristics_ch1,icd-o 3 code: 8310/3,icd-o 3 code: 8310/3,icd-o 3 code: 8310/3,icd-o 3 code: 8310/3,icd-o 3 code: 8310/3,icd-o 3 code: 8310/3,icd-o 3 code: 8310/3,icd-o 3 code: 8310/3,icd-o 3 code: 8310/3,...,sample type: neoplasia,sample type: neoplasia,sample type: neoplasia,sample type: neoplasia,sample type: neoplasia,sample type: neoplasia,sample type: neoplasia,sample type: neoplasia,sample type: neoplasia,sample type: neoplasia
4,!Sample_characteristics_ch1,icd-o 3 diagnosis text: clear cell renal cell ...,icd-o 3 diagnosis text: clear cell renal cell ...,icd-o 3 diagnosis text: clear cell renal cell ...,icd-o 3 diagnosis text: clear cell renal cell ...,icd-o 3 diagnosis text: clear cell renal cell ...,icd-o 3 diagnosis text: clear cell renal cell ...,icd-o 3 diagnosis text: clear cell renal cell ...,icd-o 3 diagnosis text: clear cell renal cell ...,icd-o 3 diagnosis text: clear cell renal cell ...,...,icd-o 3 code: 8312/3,icd-o 3 code: 8312/3,icd-o 3 code: 8312/3,icd-o 3 code: 8312/3,icd-o 3 code: 8312/3,icd-o 3 code: 8140/3,icd-o 3 code: 8140/3,icd-o 3 code: 8140/3,icd-o 3 code: 8140/3,icd-o 3 code: 8312/3
5,!Sample_characteristics_ch1,organ site: kidney,organ site: kidney,organ site: kidney,organ site: kidney,organ site: kidney,organ site: kidney,organ site: kidney,organ site: kidney,organ site: kidney,...,icd-o 3 diagnosis text: renal cell carcinoma,icd-o 3 diagnosis text: renal cell carcinoma,icd-o 3 diagnosis text: renal cell carcinoma,icd-o 3 diagnosis text: renal cell carcinoma,icd-o 3 diagnosis text: renal cell carcinoma,"icd-o 3 diagnosis text: adenocarcinoma, NOS","icd-o 3 diagnosis text: adenocarcinoma, NOS","icd-o 3 diagnosis text: adenocarcinoma, NOS","icd-o 3 diagnosis text: adenocarcinoma, NOS",icd-o 3 diagnosis text: renal cell carcinoma
6,!Sample_characteristics_ch1,gender: male,gender: NA,gender: NA,gender: male,gender: NA,gender: NA,gender: NA,gender: female,gender: NA,...,organ site: kidney [cell line],organ site: kidney [cell line],organ site: kidney [cell line],organ site: kidney [cell line],organ site: kidney [cell line],organ site: prostate [cell line],organ site: prostate [cell line],organ site: prostate [cell line],organ site: prostate [cell line],organ site: kidney [cell line]
7,!Sample_characteristics_ch1,tissue type: renal cell carcinoma [clear cell ...,tissue type: renal cell carcinoma [clear cell ...,tissue type: renal cell carcinoma [clear cell ...,tissue type: renal cell carcinoma [clear cell ...,tissue type: renal cell carcinoma [clear cell ...,tissue type: renal cell carcinoma [clear cell ...,tissue type: renal cell carcinoma [clear cell ...,tissue type: renal cell carcinoma [clear cell ...,tissue type: renal cell carcinoma [clear cell ...,...,gender: NA,gender: NA,gender: NA,gender: NA,gender: NA,gender: male,gender: male,gender: male,gender: male,gender: NA
8,!Sample_characteristics_ch1,cluster id: B,cluster id: B,cluster id: A,cluster id: C,cluster id: B,cluster id: B,cluster id: A,cluster id: B,cluster id: B,...,tissue type: renal cell carcinoma [cell line S...,tissue type: renal cell carcinoma [cell line C...,tissue type: renal cell carcinoma [cell line S...,tissue type: renal cell carcinoma [cell line K...,tissue type: renal cell carcinoma [cell line S...,tissue type: prostate adenocarcinoma [cell lin...,tissue type: prostate adenocarcinoma [cell lin...,tissue type: prostate adenocarcinoma [cell lin...,tissue type: prostate adenocarcinoma [cell lin...,tissue type: renal cell carcinoma [cell line H...
9,!Sample_characteristics_ch1,,,,,,,,,,...,cluster id: NA,cluster id: NA,cluster id: NA,cluster id: NA,cluster id: NA,cluster id: NA,cluster id: NA,cluster id: NA,cluster id: NA,cluster id: NA


In [229]:
tumor_stage_row = clinical_data.iloc[1]
tumor_stage_row.unique()

array(['!Sample_characteristics_ch1', 'disease state: --',
       'disease state: Leiomyoma', 'disease state: Lung_Adenocarcinoma',
       'disease state: Conventional_Clear_Cell_Renal_Cell_Carcinoma',
       'disease state: Squamous Cell Carcinoma',
       'disease state: Stomach Adenocarcinoma',
       'disease state: Large Cell Lymphoma',
       'disease state: Malignant Melanoma',
       'disease state: Recurrent Renal Cell Carcinoma',
       'disease state: Adrenal Cortical Adenoma',
       'disease state: Ovarian Adenocarcinoma',
       'disease state: Gastrointestinal_Stromal_Tumor',
       'disease state: Metastatic Renal Cell Carcinoma',
       'disease state: Non neoplastic liver with cirrosis',
       'disease state: Malignant G1 Stromal Tumor',
       'disease state: melanoma'], dtype=object)

In [232]:
is_gene_available = True
trait_row = 1
age_row = None
gender_row = None

trait_type = 'binary'

# Verify and use the functions generated by GPT

# This function converts the 'disease state' field into a binary value for the trait.
# The renal cell carcinoma disease states listed below map to 1; every other value maps to 0.
def convert_trait(disease_state):
    """
    Convert a disease state string to trait presence (binary).
    Returns 1 for the renal cell carcinoma disease states listed below, 0 otherwise.
    """
    positive_states = {
        'disease state: Conventional_Clear_Cell_Renal_Cell_Carcinoma',
        'disease state: Recurrent Renal Cell Carcinoma',
        'disease state: Metastatic Renal Cell Carcinoma',
    }
    if disease_state in positive_states:
        return 1  # Trait present
    else:
        return 0  # Trait not present


# def convert_trait(tumor_grade):
#     if (tumor_grade == 'tumor grade: 2' or tumor_grade == 'tumor grade: 3' or tumor_grade == 'tumor grade: 4'):
#         return 1  
#     elif tumor_grade == 'tumor grade: 1':
#         return 0  
#     else:
#         return None

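# This function converts the string representation of age into a continuous numeric value. If the age is unknown (for example, marked 'n.a.'), it returns None.
# It tries to extract an integer age from the input string. If the string does not have the expected format and extraction fails, it also returns None.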
def convert_age(age_string):
    """
    Convert age string to a continuous numerical value.
    Unknown values are converted to None.
    """
    if age_string.lower() == 'n.a.':
        return None
    try:
        # Extract age as an integer from the string
        age = int(age_string.split(':')[1])
        return age
    except (ValueError, IndexError):
        # In case of any format error or unexpected string structure
        return None


# This function converts the string representation of gender into a binary value: 'female' maps to 1 and 'male' maps to 0. Unknown values and unexpected formats return None.
# GPT sometimes maps 'female' to 0, and sometimes to 1. Does it matter?
def convert_gender(gender_string):
    """
    Convert gender string to a binary value.
    'female' is represented as 1, 'male' as 0.
    Unknown values are converted to None.
    """
    value = gender_string.lower()
    if value in ('sex: female', 'sex: f', 'gender: female', 'gender: f'):
        return 1
    elif value in ('sex: male', 'sex: m', 'gender: male', 'gender: m'):  # changed
        return 0
    else:
        return None

In [233]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()



Unnamed: 0,GSM1676864,GSM1676865,GSM1676866,GSM1676867,GSM1676868,GSM1676869,GSM1676870,GSM1676871,GSM1676872,GSM1676873,...,GSM1676991,GSM1676992,GSM1676993,GSM1676994,GSM1676995,GSM1676996,GSM1676997,GSM1676998,GSM1676999,GSM1677000
Kidney Chromophobe,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [234]:
genetic_data = get_genetic_data(matrix_file)
genetic_data

Unnamed: 0_level_0,GSM1676864,GSM1676865,GSM1676866,GSM1676867,GSM1676868,GSM1676869,GSM1676870,GSM1676871,GSM1676872,GSM1676873,...,GSM1676991,GSM1676992,GSM1676993,GSM1676994,GSM1676995,GSM1676996,GSM1676997,GSM1676998,GSM1676999,GSM1677000
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1007_s_at,1932.38000,1032.65000,1282.80000,2688.61000,2189.27000,342.63900,254.99600,2225.93000,2785.70000,1785.58000,...,2212.02000,1105.37000,1429.44000,1034.43000,1963.04000,685.84300,1038.43000,267.58600,854.45600,1051.03000
1053_at,833.80500,1034.23000,647.45200,149.05600,315.93400,155.56400,312.14500,431.22300,325.09200,793.21600,...,199.23700,514.35200,879.66300,1314.85000,625.29200,727.30500,521.02900,205.01300,478.05300,466.55800
117_at,122.25500,59.32650,126.75000,139.89900,98.95540,66.88500,74.15530,131.67200,77.37710,120.16700,...,429.30400,101.45700,91.43080,83.34430,197.26200,115.29900,128.30800,84.83030,110.79500,94.88720
121_at,1134.54000,1058.97000,1107.48000,1712.00000,1175.17000,1004.85000,943.42200,1246.45000,1114.25000,1289.73000,...,1237.05000,848.33100,760.27900,608.74700,750.60200,815.26600,822.66700,513.12200,741.42200,630.62300
1255_g_at,26.65850,41.44600,58.22860,77.86010,70.63870,37.91410,16.55080,106.73700,77.79580,67.82810,...,19.03560,16.15000,8.93552,3.17154,93.98570,33.79360,25.44110,25.68730,32.24910,12.31470
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AFFX-ThrX-5_at,7.04814,8.57466,9.11362,12.67890,14.98140,15.68270,5.77798,9.34684,8.36869,8.44274,...,53.56280,7.69069,7.84491,4.28603,24.32590,7.12141,8.55965,18.17220,60.04510,5.14640
AFFX-ThrX-M_at,27.75430,24.09150,31.00100,25.64010,38.78030,16.97470,25.77740,18.24490,25.75680,28.71170,...,71.87460,26.04780,7.86226,5.50425,65.32950,30.76890,7.79044,33.48380,15.24170,23.51990
AFFX-TrpnX-3_at,3.24531,2.72867,20.23790,19.78960,13.26810,13.99090,3.08212,1.03356,29.03380,3.06343,...,2.39710,16.63280,1.82486,5.90471,5.78538,2.84332,2.08798,16.99070,4.29375,2.71848
AFFX-TrpnX-5_at,7.19261,7.76122,26.20500,12.48630,5.29735,4.88854,28.73010,9.19403,7.81333,43.42730,...,18.31710,27.25600,38.41590,4.80571,60.95720,32.79150,38.91350,7.47526,12.16950,31.59900


In [235]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

gene_annotation.columns

{'ID': ['1007_s_at', '1053_at', '117_at', '121_at', '1255_g_at'], 'GB_ACC': ['U48705', 'M87338', 'X51757', 'X69699', 'L36861'], 'SPOT_ID': [nan, nan, nan, nan, nan], 'Species Scientific Name': ['Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens'], 'Annotation Date': ['Oct 6, 2014', 'Oct 6, 2014', 'Oct 6, 2014', 'Oct 6, 2014', 'Oct 6, 2014'], 'Sequence Type': ['Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence'], 'Sequence Source': ['Affymetrix Proprietary Database', 'GenBank', 'Affymetrix Proprietary Database', 'GenBank', 'Affymetrix Proprietary Database'], 'Target Description': ['U48705 /FEATURE=mRNA /DEFINITION=HSU48705 Human receptor tyrosine kinase DDR gene, complete cds', 'M87338 /FEATURE= /DEFINITION=HUMA1SBU Human replication factor C, 40-kDa subunit (A1) mRNA, complete cds', "X51757 /FEATURE=cds /DEFINITION=HSP70B Human heat-shock protein HSP70B' gene", 'X69699 /FEATURE= /DEFINITION=HSPAX8A H.sapiens

Index(['ID', 'GB_ACC', 'SPOT_ID', 'Species Scientific Name', 'Annotation Date',
       'Sequence Type', 'Sequence Source', 'Target Description',
       'Representative Public ID', 'Gene Title', 'Gene Symbol',
       'ENTREZ_GENE_ID', 'RefSeq Transcript ID',
       'Gene Ontology Biological Process', 'Gene Ontology Cellular Component',
       'Gene Ontology Molecular Function'],
      dtype='object')
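
The prompt above asks for the answer in a strict `identifier_key = '...'` / `gene_symbol_key = '...'` format, which makes the reply easy to parse. The following is a minimal sketch of such a parser, not part of `utils.py`; the function name `parse_key_assignments` and the example answer text are assumptions made for illustration.

```python
import re


def parse_key_assignments(answer: str) -> dict:
    """Extract the two expected `name = 'value'` assignments from a free-text answer.

    Hypothetical helper: only lines that consist of exactly one assignment
    to `identifier_key` or `gene_symbol_key` are picked up.
    """
    pattern = re.compile(
        r"^\s*(identifier_key|gene_symbol_key)\s*=\s*['\"]([^'\"]+)['\"]\s*$",
        re.MULTILINE,
    )
    return {name: value for name, value in pattern.findall(answer)}


# Made-up answer text in the format requested by the prompt
example_answer = """
The 'ID' column matches the probe identifiers.
identifier_key = 'ID'
gene_symbol_key = 'Gene Symbol'
"""

keys = parse_key_assignments(example_answer)
print(keys)
```

If either key is missing from the returned dictionary, the answer did not follow the requested format and should be reviewed by hand.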

In [236]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'Gene Symbol'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)

In [237]:
genetic_data = normalize_gene_symbols_in_index(genetic_data)

genetic_data

Unnamed: 0,GSM1676864,GSM1676865,GSM1676866,GSM1676867,GSM1676868,GSM1676869,GSM1676870,GSM1676871,GSM1676872,GSM1676873,...,GSM1676991,GSM1676992,GSM1676993,GSM1676994,GSM1676995,GSM1676996,GSM1676997,GSM1676998,GSM1676999,GSM1677000
A1CF,411.658000,328.766000,283.038000,441.273000,456.17000,356.891000,370.538000,391.200000,517.63200,330.712000,...,492.9340,252.68000,214.051000,172.401000,360.6670,224.95800,196.6330,423.316000,604.971000,663.523000
A2M,9.933690,29.729700,31.517800,117.207000,66.05450,76.368500,8.343650,77.880200,96.81630,76.772900,...,1938.5400,34.16810,41.075100,78.744600,13.4254,113.81800,80.4261,90.634000,1469.130000,1450.950000
A4GALT,67.969200,42.261300,103.195000,34.880400,37.44720,20.832300,63.035600,70.497400,25.55710,77.666600,...,73.6854,85.04110,131.085000,155.877000,138.5640,170.43300,96.1404,115.264000,131.866000,22.694400
A4GNT,103.083000,19.840200,120.743000,233.940000,132.97200,54.878500,86.099100,86.712800,132.28300,123.381000,...,375.3930,129.99600,115.976000,82.167800,157.9120,148.86000,60.4659,58.103800,47.507900,122.016000
AAAS,253.048000,307.313000,301.599000,54.338500,292.66000,526.452000,285.639000,242.461000,157.34800,88.740300,...,168.3570,331.88900,455.359000,512.032000,378.5850,584.66500,611.1190,200.164000,299.390000,332.704000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZXDB,7.021830,25.621200,49.129700,56.345100,52.64970,9.202230,57.596900,38.391100,56.36770,7.179120,...,12.7427,82.66240,84.754400,15.316100,56.1309,43.45840,32.1018,13.704900,66.226700,72.470700
ZXDC,245.770000,223.404000,324.696000,408.354000,274.03900,207.566000,41.970100,96.083800,204.60200,277.355000,...,293.5240,105.07400,239.035000,242.894000,112.4240,226.59100,263.2760,15.799600,209.634000,202.095000
ZYX,864.049500,361.164500,368.090000,509.184500,699.28350,711.235500,831.348000,898.541000,529.82600,535.816500,...,956.2580,1056.60850,1895.635000,513.773000,1485.5750,625.40500,2031.3400,1590.970000,872.288000,982.831000
ZZEF1,95.892933,186.514933,183.754067,259.034713,168.38015,311.179743,224.061743,125.673357,188.30636,118.847337,...,187.8084,105.01919,100.009013,136.908567,82.2776,180.19429,119.2507,192.480833,127.771333,95.924913


In [238]:
merged_data = geo_merge_clinical_genetic_data(selected_clinical_data, genetic_data)
# The preprocessing runs through, which means is_available should be True
is_available = True

merged_data

Unnamed: 0,Kidney Chromophobe,A1CF,A2M,A4GALT,A4GNT,AAAS,AACS,AADAC,AAGAB,AAK1,...,ZSWIM1,ZSWIM8,ZW10,ZWILCH,ZWINT,ZXDB,ZXDC,ZYX,ZZEF1,ZZZ3
GSM1676864,0.0,411.658,9.93369,67.9692,103.0830,253.0480,693.969,52.00740,254.15790,169.690144,...,172.9030,118.234115,1172.070,937.975,4273.62,7.02183,245.7700,864.0495,95.892933,972.634
GSM1676865,0.0,328.766,29.72970,42.2613,19.8402,307.3130,565.707,52.20780,411.08050,138.574204,...,160.9360,164.939945,888.014,810.915,6408.85,25.62120,223.4040,361.1645,186.514933,1135.550
GSM1676866,0.0,283.038,31.51780,103.1950,120.7430,301.5990,614.077,20.14540,301.78795,171.979780,...,149.0170,182.287700,970.310,769.493,7430.47,49.12970,324.6960,368.0900,183.754067,818.291
GSM1676867,0.0,441.273,117.20700,34.8804,233.9400,54.3385,1193.600,26.59360,195.02295,411.052640,...,206.7570,454.626050,910.911,274.856,5882.10,56.34510,408.3540,509.1845,259.034713,659.049
GSM1676868,0.0,456.170,66.05450,37.4472,132.9720,292.6600,869.968,77.00920,291.42050,225.582720,...,188.0690,336.587970,796.490,342.181,4632.58,52.64970,274.0390,699.2835,168.380150,615.557
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GSM1676996,0.0,224.958,113.81800,170.4330,148.8600,584.6650,523.556,7.43009,450.45955,134.190732,...,188.5000,199.339950,657.216,1048.500,5762.04,43.45840,226.5910,625.4050,180.194290,707.989
GSM1676997,0.0,196.633,80.42610,96.1404,60.4659,611.1190,447.764,11.25050,342.95000,108.845170,...,205.8840,121.243740,685.352,1352.300,2760.27,32.10180,263.2760,2031.3400,119.250700,713.453
GSM1676998,0.0,423.316,90.63400,115.2640,58.1038,200.1640,860.323,9.11952,255.18895,120.683550,...,89.2204,201.062600,441.399,575.010,1340.71,13.70490,15.7996,1590.9700,192.480833,765.151
GSM1676999,0.0,604.971,1469.13000,131.8660,47.5079,299.3900,596.813,63.87370,294.99555,163.793346,...,135.0100,170.178150,469.864,1394.440,3616.59,66.22670,209.6340,872.2880,127.771333,486.239


In [239]:
print(f"The merged dataset contains {len(merged_data)} samples.")
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

The merged dataset contains 137 samples.
For the feature 'Kidney Chromophobe', the least common label is '1.0' with 11 occurrences. This represents 8.03% of the dataset.
The distribution of the feature 'Kidney Chromophobe' in this dataset is fine.



False
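
The percentage in the message above is the share of the least common label among all samples. The sketch below only reproduces that arithmetic for 11 positive samples out of 137. The decision rule used by `judge_and_remove_biased_features` lives in `utils.py` and is not shown here, so the helper `minority_fraction` is an illustrative assumption rather than that function's logic.

```python
from collections import Counter


def minority_fraction(labels):
    """Return (least common label, its count, its share of all samples)."""
    counts = Counter(labels)
    label, n = min(counts.items(), key=lambda kv: kv[1])
    return label, n, n / len(labels)


# 11 samples labelled 1.0 and 126 labelled 0.0, i.e. 137 samples in total
labels = [1.0] * 11 + [0.0] * 126
label, n, frac = minority_fraction(labels)
print(f"least common label {label}: {n} of {len(labels)} samples ({frac:.2%})")
```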

In [240]:
if is_available:
    save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data, note='')
else:
    save_cohort_info(cohort, JSON_PATH, is_available)
merged_data.head()
if not is_trait_biased:
    merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)

### Initial filtering and clinical data preprocessing

In [54]:
import gzip

In [55]:
def line_generator(source, source_type):
    """Generator that yields lines from a file or a string.

    Parameters:
    - source: File path or string content.
    - source_type: 'file' or 'string'.
    """
    if source_type == 'file':
        with gzip.open(source, 'rt') as f:
            for line in f:
                yield line.strip()
    elif source_type == 'string':
        for line in source.split('\n'):
            yield line.strip()
    else:
        raise ValueError("source_type must be 'file' or 'string'")

In [56]:
import io
from typing import Callable, Optional, List, Tuple, Union, Any
import pandas as pd

In [57]:
def filter_content_by_prefix(
    source: str,
    prefixes_a: List[str],
    prefixes_b: Optional[List[str]] = None,
    unselect: bool = False,
    source_type: str = 'file',
    return_df_a: bool = True,
    return_df_b: bool = True
) -> Tuple[Union[str, pd.DataFrame], Optional[Union[str, pd.DataFrame]]]:
    """
    Filters lines from a gzip-compressed file or from a string, based on specified prefixes.

    Parameters:
    - source (str): File path or string content to filter.
    - prefixes_a (List[str]): Primary list of prefixes to filter by.
    - prefixes_b (Optional[List[str]]): Optional secondary list of prefixes to filter by.
    - unselect (bool): If True, selects rows that do not start with the specified prefixes.
    - source_type (str): 'file' if source is a file path, 'string' if source is a string of text.
    - return_df_a (bool): If True, returns filtered content for prefixes_a as a pandas DataFrame.
    - return_df_b (bool): If True, and if prefixes_b is provided, returns filtered content for prefixes_b as a pandas DataFrame.

    Returns:
    - Tuple: A tuple where the first element is the filtered content for prefixes_a, and the second element is the filtered content for prefixes_b (None if prefixes_b is not provided or no line matched).
    """
    filtered_lines_a = []
    filtered_lines_b = []
    prefix_set_a = set(prefixes_a)
    if prefixes_b is not None:
        prefix_set_b = set(prefixes_b)

    # Use generator to get lines
    for line in line_generator(source, source_type):
        matched_a = any(line.startswith(prefix) for prefix in prefix_set_a)
        if matched_a != unselect:
            filtered_lines_a.append(line)
        if prefixes_b is not None:
            matched_b = any(line.startswith(prefix) for prefix in prefix_set_b)
            if matched_b != unselect:
                filtered_lines_b.append(line)

    filtered_content_a = '\n'.join(filtered_lines_a)
    if return_df_a:
        filtered_content_a = pd.read_csv(io.StringIO(filtered_content_a), delimiter='\t', low_memory=False, on_bad_lines='skip')
    filtered_content_b = None
    if filtered_lines_b:
        filtered_content_b = '\n'.join(filtered_lines_b)
        if return_df_b:
            filtered_content_b = pd.read_csv(io.StringIO(filtered_content_b), delimiter='\t', low_memory=False, on_bad_lines='skip')

    return filtered_content_a, filtered_content_b
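
The prefix matching in `filter_content_by_prefix` can be tried on a small in-memory string without any files. The sketch below is a simplified, standard-library-only stand-in: `split_by_prefix` is a hypothetical name, it skips the `unselect` option and the DataFrame conversion, and the sample text is toy data rather than a real series matrix.

```python
def split_by_prefix(text, prefixes_a, prefixes_b):
    """Collect the lines that start with any prefix in each of two prefix lists."""
    lines_a, lines_b = [], []
    for line in text.split('\n'):
        line = line.strip()
        # str.startswith accepts a tuple and returns True if any prefix matches
        if line.startswith(tuple(prefixes_a)):
            lines_a.append(line)
        if line.startswith(tuple(prefixes_b)):
            lines_b.append(line)
    return lines_a, lines_b


# Toy text that imitates the layout of a series matrix file
sample = '\n'.join([
    '!Series_title\t"Toy series"',
    '!Series_contact_name\t"Someone"',
    '!Sample_geo_accession\t"GSM1"\t"GSM2"',
    '!Sample_characteristics_ch1\t"gender: male"\t"gender: female"',
    '"ID_REF"\t"GSM1"\t"GSM2"',
])

background, clinical = split_by_prefix(
    sample,
    ['!Series_title', '!Series_summary', '!Series_overall_design'],
    ['!Sample_geo_accession', '!Sample_characteristics_ch1'],
)
print(background)
print(clinical)
```

Lines that match neither list, such as the expression table header, are dropped from both results.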



In [58]:
def get_background_and_clinical_data(file_path,
                                     prefixes_a=['!Series_title', '!Series_summary', '!Series_overall_design'],
                                     prefixes_b=['!Sample_geo_accession', '!Sample_characteristics_ch1']):
    """Extract from a matrix file the background information about the dataset, and sample characteristics data"""
    background_info, clinical_data = filter_content_by_prefix(file_path, prefixes_a, prefixes_b, unselect=False,
                                                              source_type='file',
                                                              return_df_a=False, return_df_b=True)
    return background_info, clinical_data

In [59]:
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

!Series_title	"caArray_dobbi-00100: Interlaboratory comparability study of cancer gene expression analysis using oligonucleotide microarrays"
!Series_summary	"A key step in bringing gene expression data into clinical practice is the conduct of large studies to confirm preliminary models. The performance of such confirmatory studies and the transition to clinical practice requires that microarray data from different laboratories are comparable and reproducible. We designed a study to assess the comparability of data from four laboratories that will conduct a larger microarray profiling confirmation project in lung adenocarcinomas. To test the feasibility of combining data across laboratories, frozen tumor tissues, cell line pellets, and purified RNA samples were analyzed at each of the four laboratories. Samples of each type and several subsamples from each tumor and each cell line were blinded before being distributed. The laboratories followed a common protocol for all steps of tissue

In [60]:
clinical_data.head()

Unnamed: 0,!Sample_geo_accession,GSM1676864,GSM1676865,GSM1676866,GSM1676867,GSM1676868,GSM1676869,GSM1676870,GSM1676871,GSM1676872,...,GSM1676991,GSM1676992,GSM1676993,GSM1676994,GSM1676995,GSM1676996,GSM1676997,GSM1676998,GSM1676999,GSM1677000
0,!Sample_characteristics_ch1,cell line: H2347,cell line: H2347,cell line: H1437,cell line: HCC78,cell line: HCC78,cell line: H2087,cell line: H2087,cell line: H2009,cell line: H2009,...,cell line: --,cell line: HCC78,cell line: H2347,cell line: H1437,cell line: H2009,cell line: H1437,cell line: H2347,cell line: H2087,cell line: --,cell line: --
1,!Sample_characteristics_ch1,disease state: --,disease state: --,disease state: Leiomyoma,disease state: --,disease state: --,disease state: --,disease state: --,disease state: --,disease state: --,...,disease state: Squamous Cell Carcinoma,disease state: --,disease state: --,disease state: --,disease state: --,disease state: --,disease state: --,disease state: --,disease state: --,disease state: --
2,!Sample_characteristics_ch1,tumor grading: --,tumor grading: --,tumor grading: --,tumor grading: --,tumor grading: --,tumor grading: --,tumor grading: --,tumor grading: --,tumor grading: --,...,tumor grading: --,tumor grading: --,tumor grading: --,tumor grading: --,tumor grading: --,tumor grading: --,tumor grading: --,tumor grading: --,tumor grading: --,tumor grading: --
3,!Sample_characteristics_ch1,disease stage: --,disease stage: --,disease stage: --,disease stage: --,disease stage: --,disease stage: --,disease stage: --,disease stage: --,disease stage: --,...,disease stage: --,disease stage: --,disease stage: --,disease stage: --,disease stage: --,disease stage: --,disease stage: --,disease stage: --,disease stage: --,disease stage: --
4,!Sample_characteristics_ch1,organism part: --,organism part: --,organism part: Uterus,organism part: --,organism part: --,organism part: --,organism part: --,organism part: --,organism part: --,...,organism part: Lung,organism part: --,organism part: --,organism part: --,organism part: --,organism part: --,organism part: --,organism part: --,organism part: --,organism part: --


In [61]:
def get_unique_values_by_row(dataframe, max_len=30):
    """
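    Collect the unique values in each row of the given dataframe into a dictionary.
    :param dataframe: sample characteristics dataframe, with one row per characteristics field
    :param max_len: maximum number of unique values kept per row
    :return: dictionary mapping each row index to a list of that row's unique values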
    """
    if '!Sample_geo_accession' in dataframe.columns:
        dataframe = dataframe.drop(columns=['!Sample_geo_accession'])
    unique_values_dict = {}
    for index, row in dataframe.iterrows():
        unique_values = list(row.unique())[:max_len]
        unique_values_dict[index] = unique_values
    return unique_values_dict
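
The effect of `max_len` is easiest to see on a tiny example. The sketch below mirrors the idea in plain Python, without pandas: `unique_values_by_row` is a hypothetical stand-in that takes a dictionary of rows instead of a DataFrame, and the rows are toy data.

```python
def unique_values_by_row(rows, max_len=30):
    """Map each row index to its unique values, in order of first appearance,
    keeping at most `max_len` values per row."""
    # dict.fromkeys removes duplicates while preserving insertion order
    return {idx: list(dict.fromkeys(values))[:max_len] for idx, values in rows.items()}


# Toy sample characteristics: row index -> one value per sample
rows = {
    0: ['gender: male', 'gender: NA', 'gender: male', 'gender: female'],
    1: ['stage: 2', 'stage: 2', 'stage: 1', 'stage: 3'],
}

summary = unique_values_by_row(rows, max_len=2)
print(summary)
```

Because the list is cut off at `max_len`, values that first appear late in a row may be missing from the summary, which is worth keeping in mind when the summary is used to write conversion functions.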

In [62]:
clinical_data_unique = get_unique_values_by_row(clinical_data)
clinical_data_unique

{0: ['cell line: H2347',
  'cell line: H1437',
  'cell line: HCC78',
  'cell line: H2087',
  'cell line: H2009',
  'cell line: --'],
 1: ['disease state: --',
  'disease state: Leiomyoma',
  'disease state: Lung_Adenocarcinoma',
  'disease state: Conventional_Clear_Cell_Renal_Cell_Carcinoma',
  'disease state: Squamous Cell Carcinoma',
  'disease state: Stomach Adenocarcinoma',
  'disease state: Large Cell Lymphoma',
  'disease state: Malignant Melanoma',
  'disease state: Recurrent Renal Cell Carcinoma',
  'disease state: Adrenal Cortical Adenoma',
  'disease state: Ovarian Adenocarcinoma',
  'disease state: Gastrointestinal_Stromal_Tumor',
  'disease state: Metastatic Renal Cell Carcinoma',
  'disease state: Non neoplastic liver with cirrosis',
  'disease state: Malignant G1 Stromal Tumor',
  'disease state: melanoma'],
 2: ['tumor grading: --',
  'tumor grading: G2/pT1pN0pMX',
  'tumor grading: G3/pT2pN0pMX',
  'tumor grading: G2/pT2pN0pMX',
  'tumor grading: G3/pT4pNXpMX'],
 3: ['d

Analyze the metadata to determine data relevance and find ways to extract the clinical data.
Reference prompt:

In [63]:
f'''As a biomedical research team, we are selecting datasets to study the association between the human trait \'{TRAIT}\' and genetic factors, optionally considering the influence of age and gender. After searching the GEO database and parsing the matrix file of a series, we obtained background information and sample characteristics data. We will provide textual information about the dataset background, and a Python dictionary storing a list of unique values for each field of the sample characteristics data. Please carefully review the provided information and answer the following questions about this dataset:
1. Does this dataset contain gene expression data? (Note: Pure miRNA data is not suitable.)
2. For each of the traits \'{TRAIT}\', 'age', and 'gender', please address these points:
   (1) Is there human data available for this trait?
   (2) If so, identify the key in the sample characteristics dictionary where unique values of this trait is recorded. The key is an integer. The trait information might be explicitly recorded, or can be inferred from the field with some biomedical knowledge or understanding about the data collection process.
   (3) Choose an appropriate data type (either 'continuous' or 'binary') for each trait. Write a Python function to convert any given value of the trait to this data type. The function should handle inference about the trait value and convert unknown values to None.
   Name the functions 'convert_trait', 'convert_age', and 'convert_gender', respectively.

Background information about the dataset:
{background_info}

Sample characteristics dictionary (from "!Sample_characteristics_ch1", converted to a Python dictionary that stores the unique values for each field):
{clinical_data_unique}
'''

'As a biomedical research team, we are selecting datasets to study the association between the human trait \'Kidney Chromophobe\' and genetic factors, optionally considering the influence of age and gender. After searching the GEO database and parsing the matrix file of a series, we obtained background information and sample characteristics data. We will provide textual information about the dataset background, and a Python dictionary storing a list of unique values for each field of the sample characteristics data. Please carefully review the provided information and answer the following questions about this dataset:\n1. Does this dataset contain gene expression data? (Note: Pure miRNA data is not suitable.)\n2. For each of the traits \'Kidney Chromophobe\', \'age\', and \'gender\', please address these points:\n   (1) Is there human data available for this trait?\n   (2) If so, identify the key in the sample characteristics dictionary where unique values of this trait is recorded. Th

Read and verify the answer from GPT, then assign values to the variables below. Assign None to a '*_row' variable if no relevant data row was found.
Later we will need GPT to format its answer so that this step can be automated. Given the complexity of this step, for now we build insight from the free-text answers.
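
One possible direction for the automation mentioned above is to ask for the row assignments as a small JSON object and parse it. The sketch below is an assumption about how that could look, not the project's implementation: the JSON field names and the helper `parse_cohort_answer` are hypothetical.

```python
import json


def parse_cohort_answer(answer_text):
    """Parse a JSON answer into the variables assigned by hand in the next cells.

    Hypothetical schema: {"is_gene_available": bool, "trait_row": int or null,
    "age_row": int or null, "gender_row": int or null}.
    """
    data = json.loads(answer_text)

    def row(name):
        value = data.get(name)
        # Accept only real integers; a missing key, null, or a bool becomes None
        if isinstance(value, int) and not isinstance(value, bool):
            return value
        return None

    return {
        'is_gene_available': bool(data.get('is_gene_available', False)),
        'trait_row': row('trait_row'),
        'age_row': row('age_row'),
        'gender_row': row('gender_row'),
    }


# Made-up answer in the assumed format
example = '{"is_gene_available": true, "trait_row": 1, "age_row": null}'
parsed = parse_cohort_answer(example)
print(parsed)
```

The conversion functions would still need manual review, since they encode biomedical judgment that a format check cannot verify.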

In [64]:
age_row = gender_row = None
convert_age = convert_gender = None

In [65]:
is_gene_available = True
trait_row = 7
age_row = 6
gender_row = 5
trait_type = 'binary'

In [66]:
is_available = is_gene_available and (trait_row is not None)
if not is_available:
    save_cohort_info(cohort, JSON_PATH, is_available)
    print("This cohort is not usable. Please skip the following steps and jump to the next accession number.")

In [67]:
# Verify and use the functions generated by GPT

def convert_trait(tissue_type):
    """
    Convert the trait field to a binary value.
    Returns 1 for 'histology: Leiomyoma' and 0 for any other value.
    """
    if tissue_type == 'histology: Leiomyoma':
        return 1  # Trait present
    else:
        return 0  # Trait not present


def convert_age(age_string):
    """
    Convert age string to a continuous numerical value.
    Unknown values are converted to None.
    """
    if age_string.lower() == 'n.a.':
        return None
    try:
        # Extract age as an integer from the string
        age = int(age_string.split(': ')[1])
        return age
    except (ValueError, IndexError):
        # In case of any format error or unexpected string structure
        return None


# It sometimes maps 'female' to 0, and sometimes 1. Does it matter?
def convert_gender(gender_string):
    """
    Convert gender string to a binary value.
    'female' is represented as 1, 'male' as 0.
    Unknown values are converted to None.
    """
    if gender_string.lower() == 'sex: male':
        return 0
    elif gender_string.lower() == 'sex: female':
        return 1
    else:
        return None
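
On the question in the comment above `convert_gender` (whether mapping 'female' to 0 or to 1 matters), here is a minimal sketch on made-up toy data. The helper `fit_simple_ols` and the numbers are hypothetical and are not part of `utils`. It fits a one-variable least-squares line with an intercept under both codings.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y ~ a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical toy data: gender coded with female = 1, plus some outcome y
gender_f1 = [0, 0, 1, 1, 0, 1]
y = [1.0, 1.2, 2.1, 1.9, 0.8, 2.0]

# The same samples with the opposite coding (female = 0)
gender_f0 = [1 - g for g in gender_f1]

a1, b1 = fit_simple_ols(gender_f1, y)
a0, b0 = fit_simple_ols(gender_f0, y)

# Fitted values under each coding
fitted1 = [a1 + b1 * g for g in gender_f1]
fitted0 = [a0 + b0 * g for g in gender_f0]

print(f"female=1 coding: intercept={a1:.2f}, slope={b1:.2f}")
print(f"female=0 coding: intercept={a0:.2f}, slope={b0:.2f}")
```

In this toy fit the slope only changes sign and the fitted values are identical, which suggests the choice of coding is harmless within one cohort as long as it is recorded. It would matter if cohorts with different codings were pooled, so a consistent convention across cohorts seems the safer choice.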

In [68]:
def get_feature_data(clinical_df, row_id, feature, convert_fn):
    """select the row corresponding to a feature in the sample characteristics dataframe, and convert the feature into
    a binary or continuous variable"""
    clinical_df = clinical_df.iloc[row_id:row_id + 1].drop(columns=['!Sample_geo_accession'], errors='ignore')
    clinical_df.index = [feature]
    clinical_df = clinical_df.applymap(convert_fn)

    return clinical_df

In [69]:
def geo_select_clinical_features(clinical_df: pd.DataFrame, trait: str, trait_row: int,
                                 convert_trait: Callable,
                                 age_row: Optional[int] = None,
                                 convert_age: Optional[Callable] = None,
                                 gender_row: Optional[int] = None,
                                 convert_gender: Optional[Callable] = None) -> pd.DataFrame:
    """
    Extracts and processes specific clinical features from a DataFrame representing
    sample characteristics in the GEO database series.

    Parameters:
    - clinical_df (pd.DataFrame): DataFrame containing clinical data.
    - trait (str): The trait of interest.
    - trait_row (int): Row identifier for the trait in the DataFrame.
    - convert_trait (Callable): Function to convert trait data into a desired format.
    - age_row (int, optional): Row identifier for age data. Default is None.
    - convert_age (Callable, optional): Function to convert age data. Default is None.
    - gender_row (int, optional): Row identifier for gender data. Default is None.
    - convert_gender (Callable, optional): Function to convert gender data. Default is None.

    Returns:
    pd.DataFrame: A DataFrame containing the selected and processed clinical features.
    """
    feature_list = []

    trait_data = get_feature_data(clinical_df, trait_row, trait, convert_trait)
    feature_list.append(trait_data)
    if age_row is not None:
        age_data = get_feature_data(clinical_df, age_row, 'Age', convert_age)
        feature_list.append(age_data)
    if gender_row is not None:
        gender_data = get_feature_data(clinical_df, gender_row, 'Gender', convert_gender)
        feature_list.append(gender_data)

    selected_clinical_df = pd.concat(feature_list, axis=0)
    return selected_clinical_df

In [70]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()

Unnamed: 0,GSM1676864,GSM1676865,GSM1676866,GSM1676867,GSM1676868,GSM1676869,GSM1676870,GSM1676871,GSM1676872,GSM1676873,...,GSM1676991,GSM1676992,GSM1676993,GSM1676994,GSM1676995,GSM1676996,GSM1676997,GSM1676998,GSM1676999,GSM1677000
Kidney Chromophobe,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Age,,,,,,,,,,,...,,,,,,,,,,
Gender,,,1.0,,,,,,,,...,,,,,,,,,,


### Genetic data preprocessing and final filtering

In [71]:
def get_genetic_data(file_path):
    """Read the gene expression data into a dataframe, and adjust its format"""
    genetic_data = pd.read_csv(file_path, compression='gzip', skiprows=52, comment='!', delimiter='\t')
    genetic_data = genetic_data.dropna()
    genetic_data = genetic_data.rename(columns={'ID_REF': 'ID'}).astype({'ID': 'str'})
    genetic_data.set_index('ID', inplace=True)

    return genetic_data


In [72]:
genetic_data = get_genetic_data(matrix_file)
genetic_data.head()

ID,GSM1676864,GSM1676865,GSM1676866,GSM1676867,GSM1676868,GSM1676869,GSM1676870,GSM1676871,GSM1676872,GSM1676873,...,GSM1676991,GSM1676992,GSM1676993,GSM1676994,GSM1676995,GSM1676996,GSM1676997,GSM1676998,GSM1676999,GSM1677000
1007_s_at,1932.38,1032.65,1282.8,2688.61,2189.27,342.639,254.996,2225.93,2785.7,1785.58,...,2212.02,1105.37,1429.44,1034.43,1963.04,685.843,1038.43,267.586,854.456,1051.03
1053_at,833.805,1034.23,647.452,149.056,315.934,155.564,312.145,431.223,325.092,793.216,...,199.237,514.352,879.663,1314.85,625.292,727.305,521.029,205.013,478.053,466.558
117_at,122.255,59.3265,126.75,139.899,98.9554,66.885,74.1553,131.672,77.3771,120.167,...,429.304,101.457,91.4308,83.3443,197.262,115.299,128.308,84.8303,110.795,94.8872
121_at,1134.54,1058.97,1107.48,1712.0,1175.17,1004.85,943.422,1246.45,1114.25,1289.73,...,1237.05,848.331,760.279,608.747,750.602,815.266,822.667,513.122,741.422,630.623
1255_g_at,26.6585,41.446,58.2286,77.8601,70.6387,37.9141,16.5508,106.737,77.7958,67.8281,...,19.0356,16.15,8.93552,3.17154,93.9857,33.7936,25.4411,25.6873,32.2491,12.3147


In [73]:
gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

['1007_s_at',
 '1053_at',
 '117_at',
 '121_at',
 '1255_g_at',
 '1294_at',
 '1316_at',
 '1320_at',
 '1405_i_at',
 '1431_at',
 '1438_at',
 '1487_at',
 '1494_f_at',
 '1598_g_at',
 '160020_at',
 '1729_at',
 '1773_at',
 '177_at',
 '179_at',
 '1861_at']

Check if the gene dataset requires mapping to get the gene symbols corresponding to each data row.

Reference prompt:

In [74]:
f'''
Below are the row headers of a gene expression dataset in GEO. Based on your biomedical knowledge, are they human gene symbols, or are they some other identifiers that need to be mapped to gene symbols? Your answer should be concluded by starting a new line and strictly following this format:
requires_gene_mapping = (True or False)

Row headers:
{gene_row_ids}
'''

"\nBelow are the row headers of a gene expression dataset in GEO. Based on your biomedical knowledge, are they human gene symbols, or are they some other identifiers that need to be mapped to gene symbols? Your answer should be concluded by starting a new line and strictly following this format:\nrequires_gene_mapping = (True or False)\n\nRow headers:\n['1007_s_at', '1053_at', '117_at', '121_at', '1255_g_at', '1294_at', '1316_at', '1320_at', '1405_i_at', '1431_at', '1438_at', '1487_at', '1494_f_at', '1598_g_at', '160020_at', '1729_at', '1773_at', '177_at', '179_at', '1861_at']\n"


If mapping is not required, skip directly to the gene normalization step.
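
Because the reference prompt asks for a final line in the strict form `requires_gene_mapping = (True or False)`, the verdict can be read back from a free-text answer mechanically. The following is a minimal sketch under that assumption. The function name `parse_requires_gene_mapping` and the example answer text are hypothetical.

```python
import re


def parse_requires_gene_mapping(answer: str):
    """Return True/False from a line 'requires_gene_mapping = ...', or None if no such line exists."""
    pattern = r"^\s*requires_gene_mapping\s*=\s*\(?\s*(True|False)\s*\)?\s*$"
    matches = re.findall(pattern, answer, flags=re.MULTILINE)
    if not matches:
        return None
    # Use the last match, since the prompt asks for the verdict to conclude the answer
    return matches[-1] == "True"


# Hypothetical answer text, for illustration only
example_answer = (
    "Some free-text reasoning about the row headers goes here.\n"
    "requires_gene_mapping = False"
)
flag = parse_requires_gene_mapping(example_answer)
print(flag)
```

Returning None when the line is missing keeps a malformed answer distinguishable from a genuine False, so the answer can still be reviewed by hand.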

In [75]:
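# Caution: identifiers such as '1007_s_at' look like microarray probe IDs rather than gene symbols.
# Revisit this flag if the normalization step below reports no hits.
requires_gene_mapping = False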

In [76]:
if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

Inspect the first few values of each column in the gene annotation dataframe to find the names of the columns that store the gene probe IDs and the gene symbols, respectively.
Reference prompt:

In [77]:
if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

In [78]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'UCSC_RefGene_Name'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)

In [79]:
def normalize_gene_symbols_in_index(gene_df):
    """Normalize the human gene symbols at the index of a dataframe, and replace the index with its normalized version.
    Remove the rows where the index failed to be normalized."""
    normalized_gene_list = normalize_gene_symbols(gene_df.index.tolist())
    assert len(normalized_gene_list) == len(gene_df.index)
    gene_df.index = normalized_gene_list
    gene_df = gene_df[gene_df.index.notnull()]
    return gene_df

In [80]:
def normalize_gene_symbols(gene_symbols, batch_size=1000):
    """Normalize human gene symbols in batches using the 'mygene' library"""
    mg = mygene.MyGeneInfo()
    normalized_genes = {}

    # Process in batches
    for i in range(0, len(gene_symbols), batch_size):
        batch = gene_symbols[i:i + batch_size]
        results = mg.querymany(batch, scopes='symbol', fields='symbol', species='human')

        # Update the normalized_genes dictionary with results from this batch
        for gene in results:
            normalized_genes[gene['query']] = gene.get('symbol', None)

    # Return the normalized symbols in the same order as the input
    return [normalized_genes.get(symbol) for symbol in gene_symbols]

In [81]:
import mygene

In [82]:
if NORMALIZE_GENE:
    genetic_data = normalize_gene_symbols_in_index(genetic_data)

1000 input query terms found no hit:	['1007_s_at', '1053_at', '117_at', '121_at', '1255_g_at', '1294_at', '1316_at', '1320_at', '1405_i_a
1000 input query terms found no hit:	['201473_at', '201474_s_at', '201475_x_at', '201476_s_at', '201477_s_at', '201478_s_at', '201479_at'
1000 input query terms found no hit:	['202473_x_at', '202474_s_at', '202475_at', '202476_s_at', '202477_s_at', '202478_at', '202479_s_at'
1000 input query terms found no hit:	['203474_at', '203475_at', '203476_at', '203477_at', '203478_at', '203479_s_at', '203480_s_at', '203
1000 input query terms found no hit:	['204474_at', '204475_at', '204476_s_at', '204477_at', '204478_s_at', '204479_at', '204480_s_at', '2
1000 input query terms found no hit:	['205474_at', '205475_at', '205476_at', '205477_s_at', '205478_at', '205479_s_at', '205480_s_at', '2
1000 input query terms found no hit:	['206475_x_at', '206476_s_at', '206477_s_at', '206478_at', '206479_at', '206480_at', '206481_s_at', 
1000 input query terms found no hi

In [83]:
def geo_merge_clinical_genetic_data(clinical_df, genetic_df):
    """
    Merge the clinical features and gene expression features from two dataframes into one dataframe
    """
    if 'ID' in genetic_df.columns:
        genetic_df = genetic_df.rename(columns={'ID': 'Gene'})
    if 'Gene' in genetic_df.columns:
        genetic_df = genetic_df.set_index('Gene')
    merged_data = pd.concat([clinical_df, genetic_df], axis=0).T.dropna()
    return merged_data


In [84]:
merged_data = geo_merge_clinical_genetic_data(selected_clinical_data, genetic_data)
# The preprocessing runs through, which means is_available should be True
is_available = True

In [85]:
print(f"The merged dataset contains {len(merged_data)} samples.")

The merged dataset contains 20 samples.


In [86]:
def judge_and_remove_biased_features(df, trait, trait_type):
    assert trait_type in ["binary", "continuous"], f"The trait must be either a binary or a continuous variable!"
    if trait_type == "binary":
        trait_biased = judge_binary_variable_biased(df, trait)
    else:
        trait_biased = judge_continuous_variable_biased(df, trait)
    if trait_biased:
        print(f"The distribution of the feature \'{trait}\' in this dataset is severely biased.\n")
    else:
        print(f"The distribution of the feature \'{trait}\' in this dataset is fine.\n")
    if "Age" in df.columns:
        age_biased = judge_continuous_variable_biased(df, 'Age')
        if age_biased:
            print(f"The distribution of the feature \'Age\' in this dataset is severely biased.\n")
            df = df.drop(columns='Age')
        else:
            print(f"The distribution of the feature \'Age\' in this dataset is fine.\n")
    if "Gender" in df.columns:
        gender_biased = judge_binary_variable_biased(df, 'Gender')
        if gender_biased:
            print(f"The distribution of the feature \'Gender\' in this dataset is severely biased.\n")
            df = df.drop(columns='Gender')
        else:
            print(f"The distribution of the feature \'Gender\' in this dataset is fine.\n")

    return trait_biased, df


In [87]:
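def judge_binary_variable_biased(dataframe, col_name, min_proportion=0.1, min_num=5):
    """
    Check if the distribution of a binary variable in the dataset is too biased to be usable for analysis
    :param dataframe: The dataset containing the variable
    :param col_name: Name of the column holding the binary variable
    :param min_proportion: Proportion threshold for the least common label
    :param min_num: Count threshold for the least common label
    :return: True if the variable has fewer than two distinct labels, or if the least common label falls below
     both thresholds
    """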
    label_counter = dataframe[col_name].value_counts()
    total_samples = len(dataframe)
    rare_label_num = label_counter.min()
    rare_label = label_counter.idxmin()
    rare_label_proportion = rare_label_num / total_samples

    print(
        f"For the feature \'{col_name}\', the least common label is '{rare_label}' with {rare_label_num} occurrences. This represents {rare_label_proportion:.2%} of the dataset.")

    biased = (len(label_counter) < 2) or ((rare_label_proportion < min_proportion) and (rare_label_num < min_num))
    return bool(biased)
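
To make the decision rule concrete, here is a dependency-free sketch that mirrors the condition above on a few made-up label lists. The function name `is_binary_biased` and the label lists are hypothetical. Note that the two thresholds are combined with `and`, so a variable is flagged only when the rare label is both proportionally and absolutely scarce, or when only one label is present.

```python
from collections import Counter


def is_binary_biased(labels, min_proportion=0.1, min_num=5):
    """Mirror of the rule above: single-label data, or a rare label below BOTH thresholds."""
    counts = Counter(labels)
    if len(counts) < 2:
        return True
    rare_num = min(counts.values())
    rare_proportion = rare_num / len(labels)
    return rare_proportion < min_proportion and rare_num < min_num


# Hypothetical label lists
single_label = [0] * 20               # only one label present
rare_both = [1] * 4 + [0] * 96        # 4% and 4 samples: below both thresholds
rare_prop_only = [1] * 6 + [0] * 94   # 6% but 6 samples: count threshold not violated
rare_num_only = [1] * 3 + [0] * 17    # 3 samples but 15%: proportion threshold not violated

for name, labels in [("single_label", single_label), ("rare_both", rare_both),
                     ("rare_prop_only", rare_prop_only), ("rare_num_only", rare_num_only)]:
    print(name, is_binary_biased(labels))
```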

In [88]:
def judge_continuous_variable_biased(dataframe, col_name):
    """Check if the distribution of a continuous variable in the dataset is too biased to be usable for analysis.
    As a starting point, we consider it biased if all values are the same. For the next step, maybe ask GPT to judge
    based on quartile statistics combined with its common sense knowledge about this feature.
    """
    quartiles = dataframe[col_name].quantile([0.25, 0.5, 0.75])
    min_value = dataframe[col_name].min()
    max_value = dataframe[col_name].max()

    # Printing quartile information
    print(f"Quartiles for '{col_name}':")
    print(f"  25%: {quartiles[0.25]}")
    print(f"  50% (Median): {quartiles[0.5]}")
    print(f"  75%: {quartiles[0.75]}")
    print(f"Min: {min_value}")
    print(f"Max: {max_value}")

    biased = min_value == max_value

    return bool(biased)

In [89]:
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

For the feature 'Kidney Chromophobe', the least common label is '0.0' with 20 occurrences. This represents 100.00% of the dataset.
The distribution of the feature 'Kidney Chromophobe' in this dataset is severely biased.

Quartiles for 'Age':
  25%: 56.0
  50% (Median): 66.0
  75%: 67.0
Min: 48.0
Max: 72.0
The distribution of the feature 'Age' in this dataset is fine.

For the feature 'Gender', the least common label is '1.0' with 8 occurrences. This represents 40.00% of the dataset.
The distribution of the feature 'Gender' in this dataset is fine.



True

In [90]:
def save_cohort_info(cohort: str, info_path: str, is_available: bool, is_biased: Optional[bool] = None,
                     df: Optional[pd.DataFrame] = None, note: str = '') -> None:
    """
    Add or update information about the usability and quality of a dataset for statistical analysis.

    Parameters:
    cohort (str): A unique identifier for the dataset.
    info_path (str): File path to the JSON file where records are stored.
    is_available (bool): Indicates whether both the genetic data and trait data are available in the dataset, and can be
     preprocessed into a dataframe.
    is_biased (bool, optional): Indicates whether the dataset is too biased to be usable.
        Required if `is_available` is True.
    df (pandas.DataFrame, optional): The preprocessed dataset. Required if `is_available` is True.
    note (str, optional): Additional notes about the dataset.

    Returns:
    None: The function does not return a value but updates or creates a record in the specified JSON file.
    """
    if is_available:
        assert (df is not None) and (is_biased is not None), "'df' and 'is_biased' should be provided if this cohort " \
                                                             "is relevant."
    is_usable = is_available and (not is_biased)
    new_record = {"is_usable": is_usable,
                  "is_available": is_available,
                  "is_biased": is_biased if is_available else None,
                  "has_age": "Age" in df.columns if is_available else None,
                  "has_gender": "Gender" in df.columns if is_available else None,
                  "sample_size": len(df) if is_available else None,
                  "note": note}
    
    if not os.path.exists(info_path):
        with open(info_path, 'w') as file:
            json.dump({}, file)
        print(f"A new JSON file was created at: {info_path}")

    with open(info_path, "r") as file:
        records = json.load(file)
    records[cohort] = new_record

    temp_path = info_path + ".tmp"
    try:
        with open(temp_path, 'w') as file:
            json.dump(records, file)
        os.replace(temp_path, info_path)

    except Exception as e:
        print(f"An error occurred: {e}")
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
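
The write path above first dumps the records to a temporary file and then swaps it into place with `os.replace`, so a failure during the write should leave the previous JSON file intact. A stripped-down sketch of that pattern follows. The function name `update_json_record`, the cohort IDs, and the record fields are hypothetical.

```python
import json
import os
import tempfile


def update_json_record(path, key, record):
    """Add or overwrite one record in a JSON file using write-then-replace."""
    records = {}
    if os.path.exists(path):
        with open(path, "r") as f:
            records = json.load(f)
    records[key] = record

    tmp_path = path + ".tmp"
    try:
        with open(tmp_path, "w") as f:
            json.dump(records, f)
        # Swap the fully written file into place in one step
        os.replace(tmp_path, path)
    except Exception:
        # Clean up the partial file and let the caller see the error
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Demonstration in a throwaway directory
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "cohort_info.json")
update_json_record(demo_path, "cohort_A", {"is_usable": True, "sample_size": 91})
update_json_record(demo_path, "cohort_B", {"is_usable": False, "sample_size": None})

with open(demo_path) as f:
    saved = json.load(f)
print(sorted(saved))
```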

In [91]:
import json
if is_available:
    save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data, note='')
else:
    save_cohort_info(cohort, JSON_PATH, is_available)

In [92]:
merged_data.head()
if not is_trait_biased:
    merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)

# 3. Regression and cross-validation

In [93]:
def read_json_to_dataframe(json_file: str) -> pd.DataFrame:
    """
    Reads a JSON file and converts it into a pandas DataFrame.

    Args:
    json_file (str): The path to the JSON file containing the data.

    Returns:
    DataFrame: A pandas DataFrame with the JSON data.
    """
    with open(json_file, 'r') as file:
        data = json.load(file)
    return pd.DataFrame.from_dict(data, orient='index').reset_index().rename(columns={'index': 'cohort_id'})

In [94]:
def filter_and_rank_cohorts(json_file: str, condition: Union[str, None] = None) -> Tuple[
    Union[str, None], pd.DataFrame]:
    """
    Reads a JSON file, filters cohorts based on usability and an optional condition, then ranks them by sample size.

    Args:
    json_file (str): The path to the JSON file containing the data.
    condition (str, optional): An additional condition for filtering. If None, only 'is_usable' is considered.

    Returns:
    Tuple: A tuple containing the best cohort ID (str or None if no suitable cohort is found) and
           the filtered and ranked DataFrame.
    """
    # Read the JSON file into a DataFrame
    df = read_json_to_dataframe(json_file)

    if condition:
        filtered_df = df[(df['is_usable'] == True) & (df[condition] == True)]
    else:
        filtered_df = df[df['is_usable'] == True]

    ranked_df = filtered_df.sort_values(by='sample_size', ascending=False)
    best_cohort_id = ranked_df.iloc[0]['cohort_id'] if not ranked_df.empty else None

    return best_cohort_id, ranked_df
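
The selection logic can be checked by hand on a small set of records. The sketch below reproduces the same filter-then-rank idea with plain dictionaries instead of a DataFrame. The cohort IDs, the record values, and the function name `rank_cohorts` are made up for illustration.

```python
# Hypothetical cohort records, in the same shape as the entries written to the JSON file
records = {
    "cohort_A": {"is_usable": True, "has_age": True, "has_gender": True, "sample_size": 91},
    "cohort_B": {"is_usable": True, "has_age": False, "has_gender": True, "sample_size": 150},
    "cohort_C": {"is_usable": False, "has_age": True, "has_gender": True, "sample_size": 300},
}


def rank_cohorts(records, condition=None):
    """Keep usable cohorts (optionally also requiring `condition`), largest sample size first."""
    usable = [
        (cohort_id, info) for cohort_id, info in records.items()
        if info["is_usable"] and (condition is None or info[condition])
    ]
    usable.sort(key=lambda item: item[1]["sample_size"], reverse=True)
    return [cohort_id for cohort_id, _ in usable]


ranked_overall = rank_cohorts(records)
ranked_with_age = rank_cohorts(records, "has_age")
print(ranked_overall)
print(ranked_with_age)
```

Here the largest cohort is excluded because it is not usable, and requiring age information changes which cohort comes first.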


In [95]:
# Check the information of usable cohorts
best_cohort, ranked_df = filter_and_rank_cohorts(JSON_PATH)
ranked_df

Unnamed: 0,cohort_id,is_usable,is_available,is_biased,has_age,has_gender,sample_size,note
0,Xena,True,True,False,True,True,91,


In [96]:
# If both age and gender have available cohorts, select 'age' as the condition.
condition = 'Age'
filter_column = 'has_' + condition.lower()

condition_best_cohort, condition_ranked_df = filter_and_rank_cohorts(JSON_PATH, filter_column)
condition_best_cohort

'Xena'

In [97]:
condition_ranked_df.head()

Unnamed: 0,cohort_id,is_usable,is_available,is_biased,has_age,has_gender,sample_size,note
0,Xena,True,True,False,True,True,91,


In [98]:
merged_data = pd.read_csv(os.path.join(OUTPUT_DIR, condition_best_cohort + '.csv'))
merged_data.head()

Unnamed: 0,Kidney Chromophobe,Age,Gender,ARHGEF10L,HIF3A,RNF17,RNF10,RNF11,RNF13,GTF2IP1,...,SLC7A10,PLA2G2C,TULP2,NPY5R,GNGT2,GNGT1,TULP3,BCL6B,GSTK1,SELP
0,1,57,0,-0.295492,-3.487426,-0.531035,0.359428,0.850422,0.58019,-0.739994,...,-2.090786,0.397618,-0.748878,0.731583,-1.308333,-0.79709,-1.328077,0.933073,2.015005,0.025067
1,1,67,0,0.581408,0.368474,-0.531035,1.217628,0.626922,0.06679,-0.058894,...,-2.090786,-0.086682,-0.193178,0.513683,-0.912933,-0.32549,-1.035777,0.221073,1.600605,2.456767
2,0,67,0,1.119008,2.198374,-0.531035,0.341628,0.681022,0.46319,0.057806,...,-0.837786,-0.086682,0.006622,5.587783,0.113667,-0.03259,0.051323,1.403473,0.786105,3.303867
3,1,56,0,0.572008,-1.889926,-0.531035,0.683828,0.971922,1.55029,-0.150194,...,-2.090786,-0.086682,-0.748878,0.414383,-2.250633,1.13531,-1.257677,-1.076027,1.424005,-0.464633
4,1,69,1,0.259208,-0.380726,-0.531035,0.992728,0.311022,0.65699,0.050806,...,-0.295086,-0.086682,-0.748878,1.982883,-0.845733,1.40521,-1.536277,0.100373,1.883005,0.013067


In [99]:
# Remove the other condition to prevent interference.
merged_data = merged_data.drop(columns=['Gender'], errors='ignore').astype('float')

X = merged_data.drop(columns=[TRAIT, condition]).values
Y = merged_data[TRAIT].values
Z = merged_data[condition].values

Select the appropriate regression model depending on whether the dataset shows a batch effect.

In [100]:
def detect_batch_effect(X):
    """
    Detect potential batch effects in a dataset using eigenvalues of XX^T.

    Args:
    X (numpy.ndarray): A feature matrix with shape (n_samples, n_features).

    Returns:
    bool: True if a potential batch effect is detected, False otherwise.
    """
    n_samples = X.shape[0]

    # Computing XX^T
    XXt = np.dot(X, X.T)

    # Compute the eigenvalues of XX^T
    eigen_values = np.linalg.eigvalsh(XXt)  # Using eigvalsh since XX^T is symmetric
    eigen_values = sorted(eigen_values, reverse=True)

    # Check for large gaps in the eigenvalues
    for i in range(len(eigen_values) - 1):
        gap = eigen_values[i] - eigen_values[i + 1]
        if gap > 1 / n_samples:  # You may need to adjust this threshold
            return True

    return False
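
As the inline comment says, the threshold `1 / n_samples` may need adjusting. The sketch below applies the same eigenvalue-gap test to pure Gaussian noise, which has no batch structure by construction, to see how the threshold behaves. The matrix size and random seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 20, 50
X = rng.standard_normal((n_samples, n_features))  # i.i.d. noise, no batch structure

# Eigenvalues of XX^T in descending order, as in the function above
eigen_values = np.sort(np.linalg.eigvalsh(X @ X.T))[::-1]
gaps = eigen_values[:-1] - eigen_values[1:]
threshold = 1 / n_samples

largest_gap = gaps.max()
flagged = bool((gaps > threshold).any())
print(f"threshold={threshold:.3f}, largest gap={largest_gap:.3f}, flagged={flagged}")
```

In this toy case the noise matrix is flagged as well, because the eigenvalues of XX^T scale with the number of features and the magnitude of the data. This suggests that a True result from the current threshold should be read with caution, and that a threshold relative to the eigenvalue scale might be worth trying.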

In [101]:
import numpy as np
has_batch_effect = detect_batch_effect(X)
has_batch_effect

True

In [102]:
from sparse_lmm import VariableSelection


In [103]:
# Select appropriate models based on whether the dataset has batch effect.
# We experiment on two models for each branch. We will decide which one to choose later.

if has_batch_effect:
    model_constructor1 = VariableSelection
    model_params1 = {'modified': True, 'lamda': 3e-4}
    model_constructor2 = VariableSelection
    model_params2 = {'modified': False}
else:
    model_constructor1 = Lasso
    model_params1 = {'alpha': 1.0, 'random_state': 42}
    model_constructor2 = VariableSelection
    model_params2 = {'modified': False}

In [104]:
def cross_validation(X, Y, Z, model_constructor, model_params, k=5, target_type='binary'):
    assert target_type in ['binary', 'continuous'], "The target type must be chosen from 'binary' or 'continuous'"
    indices = np.arange(X.shape[0])
    np.random.shuffle(indices)

    fold_size = len(X) // k
    performances = []

    for i in range(k):
        # Split data into train and test based on the current fold
        test_indices = indices[i * fold_size: (i + 1) * fold_size]
        train_indices = np.setdiff1d(indices, test_indices)

        X_train, X_test = X[train_indices], X[test_indices]
        Y_train, Y_test = Y[train_indices], Y[test_indices]
        Z_train, Z_test = Z[train_indices], Z[test_indices]

        normalized_X_train, normalized_X_test = normalize_data(X_train, X_test)
        normalized_Z_train, normalized_Z_test = normalize_data(Z_train, Z_test)

        # model = model_constructor(**model_params)
        model = ResidualizationRegressor(model_constructor, model_params)
        model.fit(normalized_X_train, Y_train, normalized_Z_train)
        predictions = model.predict(normalized_X_test, normalized_Z_test)

        if target_type == 'binary':
            predictions = (predictions > 0.5).astype(int)
            Y_test = (Y_test > 0.5).astype(int)
            performance = accuracy_score(Y_test, predictions)
        elif target_type == 'continuous':
            performance = mean_squared_error(Y_test, predictions)

        performances.append(performance)

    cv_mean = np.mean(performances)
    cv_std = np.std(performances)

    if target_type == 'binary':
        print(f'The cross-validation accuracy is {(cv_mean * 100):.2f}% ± {(cv_std * 100):.2f}%')
    else:
        print(f'The cross-validation MSE is {cv_mean:.2f} ± {cv_std:.2f}')

    return cv_mean, cv_std
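
One detail of the fold construction above: `fold_size` uses integer division, so when the sample count is not a multiple of `k`, the leftover samples never appear in a test fold (they stay in every training set). A small worked example with 91 samples and `k=5` follows; the shuffle is omitted for clarity.

```python
n_samples, k = 91, 5
fold_size = n_samples // k  # integer division, as in cross_validation

indices = list(range(n_samples))
tested = []
for i in range(k):
    # Same slicing as the test_indices computation above
    tested.extend(indices[i * fold_size:(i + 1) * fold_size])

never_tested = sorted(set(indices) - set(tested))
print(f"fold_size={fold_size}, tested={len(tested)}, never tested={never_tested}")
```

With shuffling, the untested sample is a random one rather than the last one, but the count is the same.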

In [105]:
def normalize_data(X_train, X_test=None):
    """Compute the mean and standard deviation statistics of the training data, use them to normalize the training data,
    and optionally the test data"""
    mean = np.mean(X_train, axis=0)
    std = np.std(X_train, axis=0)

    # Handling columns with std = 0
    std_no_zero = np.where(std == 0, 1, std)

    # Normalize X_train
    X_train_normalized = (X_train - mean) / std_no_zero
    # Set normalized values to 0 where std was 0
    X_train_normalized[:, std == 0] = 0

    if X_test is not None:
        X_test_normalized = (X_test - mean) / std_no_zero
        X_test_normalized[:, std == 0] = 0
    else:
        X_test_normalized = None

    return X_train_normalized, X_test_normalized
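
A small numeric sketch of what this normalization does, using made-up arrays. The test rows are scaled with the training mean and standard deviation, so no test statistics leak into the transform, and a column that is constant in the training data is set to zero.

```python
import numpy as np

# Hypothetical data: the second column is constant in the training set
X_train = np.array([[1.0, 5.0], [3.0, 5.0], [5.0, 5.0]])
X_test = np.array([[7.0, 6.0]])

mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
safe_std = np.where(std == 0, 1, std)  # avoid division by zero

X_train_norm = (X_train - mean) / safe_std
X_test_norm = (X_test - mean) / safe_std
X_test_norm[:, std == 0] = 0  # columns constant in training carry no usable signal

print(X_train_norm)
print(X_test_norm)
```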

In [106]:
class ResidualizationRegressor:
    def __init__(self, regression_model_constructor, params=None):
        if params is None:
            params = {}
        self.regression_model = regression_model_constructor(**params)
        self.beta_Z = None  # Coefficients for regression of Y on Z
        self.beta_X = None  # Coefficients for regression of residual on X
        self.neg_log_p_values = None  # Negative logarithm of p-values
        self.p_values = None  # Actual p-values

    def _reshape_data(self, data):
        """
        Reshape the data to ensure it's in the correct format (2D array).

        :param data: The input data (can be 1D or 2D array).
        :return: Reshaped 2D array.
        """
        if data.ndim == 1:
            return data.reshape(-1, 1)
        return data

    def _reshape_output(self, data):
        """
        Reshape the output data to ensure it's in the correct format (1D array).

        :param data: The output data (can be 1D or 2D array).
        :return: Reshaped 1D array.
        """
        if data.ndim == 2 and data.shape[1] == 1:
            return data.ravel()
        return data
    
    def fit(self, X, Y, Z):
        X = self._reshape_data(X)
        Y = self._reshape_data(Y)
        Z = self._reshape_data(Z)

        # Step 1: Linear regression of Y on Z
        Z_ones = np.column_stack((np.ones(Z.shape[0]), Z))
        self.beta_Z = np.linalg.pinv(Z_ones.T @ Z_ones) @ Z_ones.T @ Y
        Y_hat = Z_ones @ self.beta_Z
        e_Y = Y - Y_hat  # Residual of Y

        # Step 2: Regress the residual on X using the included regression model
        self.regression_model.fit(X, e_Y)

        # Obtain coefficients from the regression model
        if hasattr(self.regression_model, 'coef_'):
            self.beta_X = self.regression_model.coef_
        elif hasattr(self.regression_model, 'getBeta'):
            beta_output = self.regression_model.getBeta()
            self.beta_X = self._reshape_output(beta_output)

        # Obtain negative logarithm of p-values, if available
        if hasattr(self.regression_model, 'getNegLogP'):
            neg_log_p_output = self.regression_model.getNegLogP()
            if neg_log_p_output is not None:
                self.neg_log_p_values = self._reshape_output(neg_log_p_output)
                self.p_values = np.exp(-self.neg_log_p_values)
                # Concatenate the p-values of Z and X. The p-values of Z were not computed, mark with NaN.
                p_values_Z = np.full(Z.shape[1], np.nan)
                self.p_values = np.concatenate((p_values_Z, self.p_values))

    def predict(self, X, Z):
        X = self._reshape_data(X)
        Z = self._reshape_data(Z)

        Z_ones = np.column_stack((np.ones(Z.shape[0]), Z))
        ZX = np.column_stack((Z, X))
        combined_beta = np.concatenate((self.beta_Z[1:].ravel(), self.beta_X.ravel()))
        return ZX @ combined_beta + self.beta_Z[0]

    def get_coefficients(self):
        return np.concatenate((self.beta_Z[1:].ravel(), self.beta_X.ravel()))

    def get_p_values(self):
        return self.p_values
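
The two-step fit in `ResidualizationRegressor` can be traced on a tiny synthetic example. This is a sketch only: it uses ordinary least squares (`np.linalg.lstsq`) for the second step in place of the Lasso or `VariableSelection` model, and the data are constructed so that X is orthogonal to both the intercept and Z, which makes the recovery exact.

```python
import numpy as np

# Synthetic data: X is orthogonal to the all-ones vector and to Z by construction
Z = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
X = np.array([1.0, -2.0, 1.0, 0.0, 0.0, 0.0])
Y = 2.0 + 0.5 * Z + 3.0 * X

# Step 1: regress Y on Z (with intercept) and keep the residual
Z_ones = np.column_stack((np.ones_like(Z), Z))
beta_Z, *_ = np.linalg.lstsq(Z_ones, Y, rcond=None)
residual = Y - Z_ones @ beta_Z

# Step 2: regress the residual on X (plain OLS here, as a stand-in for the real model)
beta_X, *_ = np.linalg.lstsq(X.reshape(-1, 1), residual, rcond=None)

# Prediction combines both parts, as in predict()
predictions = Z_ones @ beta_Z + X * beta_X[0]
print(beta_Z, beta_X)
```

When X is correlated with Z, which is the usual case with real data, the coefficients from this two-step procedure generally differ from those of a joint regression on Z and X, because only Y is residualized.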


In [107]:
from sklearn.metrics import accuracy_score, mean_squared_error


In [108]:
trait_type = 'binary'  # Remember to set this properly, either 'binary' or 'continuous'
cv_mean1, cv_std1 = cross_validation(X, Y, Z, model_constructor1, model_params1, target_type=trait_type)

alpha for Lasso: 0.0003
alpha for Lasso: 0.0003
alpha for Lasso: 0.0003
alpha for Lasso: 0.0003
alpha for Lasso: 0.0003
The cross-validation accuracy is 64.44% ± 15.56%


In [109]:
cv_mean2, cv_std2 = cross_validation(X, Y, Z, model_constructor2, model_params2, target_type=trait_type)

  ts = beta / np.sqrt(var * sigma)
  ts = beta / np.sqrt(var * sigma)
  ts = beta / np.sqrt(var * sigma)
  ts = beta / np.sqrt(var * sigma)
  ts = beta / np.sqrt(var * sigma)


The cross-validation accuracy is 96.67% ± 4.44%


In [110]:
normalized_X, _ = normalize_data(X)
normalized_Z, _ = normalize_data(Z)

# Train regression model on the whole dataset to identify significant genes
model1 = ResidualizationRegressor(model_constructor1, model_params1)
model1.fit(normalized_X, Y, normalized_Z)

model2 = ResidualizationRegressor(model_constructor2, model_params2)
model2.fit(normalized_X, Y, normalized_Z)

alpha for Lasso: 0.0003


  ts = beta / np.sqrt(var * sigma)


# 4. Discussion and report

In [111]:
def interpret_result(model: Any, feature_names: List[str], trait: str, condition: str,
                     threshold: float = 0.05, save_output: bool = True,
                     output_dir: str = './output', model_id: int = 1) -> None:
    """This function interprets and reports the result of a trained linear regression model, where the regressor
    consists of one variable about condition and multiple variables about genetic factors.
    The function extracts coefficients and p-values from the model, and identifies the significant genes based on
    p-values or non-zero coefficients, depending on the availability of p-values.

    Parameters:
    model (Any): The trained regression Model.
    feature_names (List[str]): A list of feature names corresponding to the model's coefficients.
    trait (str): The target trait of interest.
    condition (str): The specific condition to examine within the model.
    threshold (float): Significance level for p-value correction. Defaults to 0.05.
    save_output (bool): Flag to determine whether to save the output to a file. Defaults to True.
    output_dir (str): Directory path where output files are saved. Defaults to './output'.
    model_id (int): The index of the model, 1 or 2.

    Returns:
    None: This function does not return anything but prints and optionally saves the output.
    """
    coefficients = model.get_coefficients().reshape(-1).tolist()
    p_values = model.get_p_values()
    if p_values is None:
        regression_df = pd.DataFrame({
            'Variable': feature_names,
            'Coefficient': coefficients
        })
    else:
        regression_df = pd.DataFrame({
            'Variable': feature_names,
            'Coefficient': coefficients,
            'p_value': p_values.reshape(-1).tolist()
        })
    condition_effect = regression_df[regression_df['Variable'] == condition].iloc[0]

    print("Effect of the condition on the target variable:")
    print(f"Variable: {condition}")
    print(f"Coefficient: {condition_effect['Coefficient']:.4f}")
    # Take an explicit copy so that adding columns below does not raise SettingWithCopyWarning
    gene_regression_df = regression_df[regression_df['Variable'] != condition].copy()
    if p_values is None:
        gene_regression_df['Absolute Coefficient'] = gene_regression_df['Coefficient'].abs()
        significant_genes = gene_regression_df[gene_regression_df['Coefficient'] != 0]
        significant_genes_sorted = significant_genes.sort_values(by='Absolute Coefficient', ascending=False)
        print(
            f"Found {len(significant_genes_sorted)} genes with non-zero coefficients associated with the trait '{trait}' "
            f"conditional on the factor '{condition}'. These genes are identified as significant based on the regression model.")
    else:
        # Apply the Benjamini-Hochberg correction, to get the corrected p-values
        corrected_p_values = multipletests(gene_regression_df['p_value'], alpha=threshold, method='fdr_bh')[1]
        gene_regression_df.loc[:, 'corrected_p_value'] = corrected_p_values
        significant_genes = gene_regression_df.loc[gene_regression_df['corrected_p_value'] < threshold]
        significant_genes_sorted = significant_genes.sort_values('corrected_p_value')
        print(
            f"Found {len(significant_genes_sorted)} significant genes associated with the trait '{trait}' conditional on "
            f"the factor '{condition}', with corrected p-value < {threshold}:")

    print(significant_genes_sorted.to_string(index=False))

    # Optionally, save this to a CSV file
    if save_output:
        significant_genes_sorted.to_csv(
            os.path.join(output_dir, f'significant_genes_condition_{condition}_{model_id}.csv'), index=False)
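
When the model exposes no p-values, `interpret_result` drops the condition row, adds an absolute-coefficient column, keeps the non-zero coefficients, and sorts by magnitude. Adding a column to a filtered DataFrame is safest on an explicit copy, since pandas cannot always tell whether the filtered frame is a view of the original. A minimal sketch of that branch on made-up data (the gene names and coefficients are hypothetical):

```python
import pandas as pd

# Toy regression table; gene names and values are invented for illustration.
toy_df = pd.DataFrame({
    'Variable': ['Age', 'GENE_A', 'GENE_B', 'GENE_C'],
    'Coefficient': [-0.04, 1.2, 0.0, -3.4],
})

# Drop the condition row and copy, so the new column is written to an independent frame.
genes = toy_df[toy_df['Variable'] != 'Age'].copy()
genes['Absolute Coefficient'] = genes['Coefficient'].abs()

# Keep genes the sparse model did not shrink to zero, largest effect first.
nonzero = genes[genes['Coefficient'] != 0]
ranked = nonzero.sort_values(by='Absolute Coefficient', ascending=False)
print(ranked.to_string(index=False))
```

The original `toy_df` is left untouched, which is the behaviour the copy is meant to guarantee.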

In [112]:
feature_cols = merged_data.columns.tolist()
feature_cols.remove(TRAIT)

threshold = 0.05
interpret_result(model1, feature_cols, TRAIT, condition, threshold=threshold, save_output=True,
                 output_dir=OUTPUT_DIR, model_id=1)

Effect of the condition on the target variable:
Variable: Age
Coefficient: -0.0364
Found 75 genes with non-zero coefficients associated with the trait 'Kidney Chromophobe' conditional on the factor 'Age'. These genes are identified as significant based on the regression model.
 Variable  Coefficient  Absolute Coefficient
    FOXF2    -6.475829              6.475829
    TAF1D    -6.007020              6.007020
  SLC30A2     4.455768              4.455768
   OR51A4    -4.380482              4.380482
     PID1    -3.880443              3.880443
  PGLYRP2    -3.482427              3.482427
   LRSAM1     3.342719              3.342719
    GJA10    -3.283148              3.283148
    GALR2     3.246703              3.246703
    MEX3B    -2.954832              2.954832
     LBX1    -2.931299              2.931299
     VAPA    -2.849938              2.849938
     IRS1    -2.795838              2.795838
 FAM41AY1     2.554983              2.554983
    SEH1L    -2.442089              2.442089
  


In [113]:
from statsmodels.stats.multitest import multipletests
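
`interpret_result` relies on `multipletests(..., method='fdr_bh')` to turn raw p-values into Benjamini-Hochberg corrected ones. To make the adjustment concrete, here is a plain-Python sketch of the standard procedure on made-up p-values. The function `benjamini_hochberg` and the `toy_*` names are hypothetical and only illustrate the idea; the notebook itself uses the statsmodels implementation.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return BH-adjusted p-values in the same order as the input (illustrative sketch)."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down; the running minimum keeps the
    # adjusted values monotone in the rank and caps them at 1.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


toy_p = [0.001, 0.008, 0.039, 0.041, 0.6]   # made-up raw p-values
toy_adj = benjamini_hochberg(toy_p)

# A test is called significant when its adjusted p-value is below the threshold.
significant = [p_adj < 0.05 for p_adj in toy_adj]
print(toy_adj)
print(significant)
```

Because of the running minimum, neighbouring tests can end up with the same adjusted value even though their raw p-values differ, which is why a BH-corrected table may show repeated entries in its corrected column.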


In [114]:
interpret_result(model2, feature_cols, TRAIT, condition, threshold=threshold, save_output=True,
                 output_dir=OUTPUT_DIR, model_id=2)

Effect of the condition on the target variable:
Variable: Age
Coefficient: -0.0364
Found 197 significant genes associated with the trait 'Kidney Chromophobe' conditional on the factor 'Age', with corrected p-value < 0.05:
Variable  Coefficient      p_value  corrected_p_value
  UGT2A3    -0.285658 5.341910e-15       4.750530e-11
   RALYL    -0.309340 5.479589e-15       4.750530e-11
  PTGER1    -0.275027 5.602681e-13       3.238163e-09
  UGT3A1    -0.240755 1.502332e-12       6.512235e-09
  MAPK15    -0.194648 1.964597e-12       6.812828e-09
  GABRA2    -0.253559 3.741700e-11       1.081289e-07
  CLDN19    -0.222469 1.784410e-09       3.437764e-06
    IRX1    -0.210113 1.665566e-09       3.437764e-06
    FUT6    -0.205566 1.655916e-09       3.437764e-06
    VIL1    -0.194274 2.204524e-09       3.822423e-06
   MORN5    -0.202149 3.074520e-09       4.846281e-06
   FOXJ1    -0.234478 3.442289e-09       4.973820e-06
   UPK1B    -0.180007 5.028668e-09       6.707082e-06
  SLC9A3    -0.205328 

