# Gold standard curation: Preprocessing and single-step regression

In this stage of gold standard curation, we will do the data preprocessing, selection, and single-step regression for the 153 traits in our question set. This file shows the reference steps using the trait "Breast Cancer" as an example. The workflow consists of the following steps:

1. Preprocess all the cohorts related to this trait. Each cohort should be converted to a tabular form and saved to a csv file, with columns being genetic factors, the trait, and age, gender if available;
2. If there exists at least one cohort with age or gender information, conduct regression analysis with genetic features together with age or gender as the regressors.


# 1. Basic setup

In [2]:
import os
import sys

sys.path.append('..')
from utils import *

# Set your preferred name
USER = "Jiayi"
# Set the data and output directories
DATA_ROOT = '/Users/legion/Desktop/Courses/IS389/data'
OUTPUT_ROOT = '/Users/legion/Desktop/Courses/IS389/output'
TRAIT = 'Ankylosing Spondylitis'

OUTPUT_DIR = os.path.join(OUTPUT_ROOT, USER, '-'.join(TRAIT.split()))
JSON_PATH = os.path.join(OUTPUT_DIR, "cohort_info.json")
if not os.path.exists(OUTPUT_DIR):
    os.makedirs(OUTPUT_DIR, exist_ok=True)

# Gene symbol normalization may take 1-2 minutes. You may set it to False for debugging.
NORMALIZE_GENE = True

In [None]:
# This cell is only for use on Google Colab. Skip it if you run your code in other environments

"""import os
from google.colab import drive

drive.mount('/content/drive', force_remount=True)
proj_dir = '/content/drive/MyDrive/AI4Science_Public'
os.chdir(proj_dir)"""

# 2. Data preprocessing and selection

## 2.1. The TCGA Xena dataset

In TCGA Xena, there is either zero or one cohort related to the trait. We search the names of subdirectories to see if any matches the trait. If a match is found, we directly obtain the file paths.

In [3]:
dataset = 'TCGA'
dataset_dir = os.path.join(DATA_ROOT, dataset)
os.listdir(dataset_dir)[:10]

['TCGA_Adrenocortical_Cancer_(ACC)',
 'TCGA_Breast_Cancer_(BRCA)',
 'TCGA_Kidney_Papillary_Cell_Carcinoma_(KIRP)']

If no match is found, jump directly to GEO in Part 2.2

In [4]:
trait_subdir = "TCGA_Adrenocortical_Cancer_(ACC)"
cohort = 'Xena'
# All the cancer traits in Xena are binary
trait_type = 'binary'
# Once a relevant cohort is found in Xena, we can generally assume the gene and clinical data are available
is_available = True

clinical_data_file = os.path.join(dataset_dir, trait_subdir, 'TCGA.ACC.sampleMap_ACC_clinicalMatrix')
genetic_data_file = os.path.join(dataset_dir, trait_subdir, 'TCGA.ACC.sampleMap_HiSeqV2_PANCAN.gz')

In [6]:
import pandas as pd

clinical_data = pd.read_csv(clinical_data_file, sep='\t', index_col=0)
genetic_data = pd.read_csv(genetic_data_file, compression='gzip', sep='\t', index_col=0)
age_col = gender_col = None

In [7]:
def check_rows_and_columns(dataframe, display=False):
    """
    Get the lists of row names and column names of a dataset, and optionally observe them.
    :param dataframe:
    :param display:
    :return:
    """
    dataframe_rows = dataframe.index.tolist()
    if display:
        print(f"The dataset has {len(dataframe_rows)} rows, such as {dataframe_rows[:20]}")
    dataframe_cols = dataframe.columns.tolist()
    if display:
        print(f"\nThe dataset has {len(dataframe_cols)} columns, such as {dataframe_cols[:20]}")
    return dataframe_rows, dataframe_cols

In [8]:
_, clinical_data_cols = check_rows_and_columns(clinical_data)
clinical_data_cols[:10]

['_INTEGRATION',
 '_PATIENT',
 '_cohort',
 '_primary_disease',
 '_primary_site',
 'additional_pharmaceutical_therapy',
 'additional_radiation_therapy',
 'age_at_initial_pathologic_diagnosis',
 'atypical_mitotic_figures',
 'bcr_followup_barcode']

Read all the column names in the clinical dataset, to find the columns that record information about age or gender.
Reference prompt:

In [9]:
f'''
Below is a list of column names from a biomedical dataset. Please examine it and identify the columns that are likely to contain information about patients' age. Additionally, please do the same for columns that may hold data on patients' gender. Please provide your answer by strictly following this format, without redundant words:
candidate_age_cols = [col_name1, col_name2, ...]
candidate_gender_cols = [col_name1, col_name2, ...]
If no columns match a criterion, please provide an empty list.

Column names:
{clinical_data_cols}
'''

"\nBelow is a list of column names from a biomedical dataset. Please examine it and identify the columns that are likely to contain information about patients' age. Additionally, please do the same for columns that may hold data on patients' gender. Please provide your answer by strictly following this format, without redundant words:\ncandidate_age_cols = [col_name1, col_name2, ...]\ncandidate_gender_cols = [col_name1, col_name2, ...]\nIf no columns match a criterion, please provide an empty list.\n\nColumn names:\n['_INTEGRATION', '_PATIENT', '_cohort', '_primary_disease', '_primary_site', 'additional_pharmaceutical_therapy', 'additional_radiation_therapy', 'age_at_initial_pathologic_diagnosis', 'atypical_mitotic_figures', 'bcr_followup_barcode', 'bcr_patient_barcode', 'bcr_sample_barcode', 'clinical_M', 'ct_scan_findings', 'cytoplasm_presence_less_than_equal_25_percent', 'days_to_birth', 'days_to_collection', 'days_to_death', 'days_to_initial_pathologic_diagnosis', 'days_to_last_fol

In [10]:
candidate_age_cols = [ 'age_at_initial_pathologic_diagnosis',
                      'days_to_birth', 'year_of_initial_pathologic_diagnosis']
candidate_gender_cols = [ 'gender']

Choose a single column from the candidate columns that record age and gender information respectively.
If no column meets the requirement, keep 'age_col' or 'gender_col' to None

In [11]:
def preview_df(df, n=5):
    return df.head(n).to_dict(orient='list')

In [12]:
preview_df(clinical_data[candidate_age_cols])

{'age_at_initial_pathologic_diagnosis': [58, 44, 23, 23, 30],
 'days_to_birth': [-21496, -16090, -8624, -8451, -11171],
 'year_of_initial_pathologic_diagnosis': [2000, 2004, 2008, 2000, 2000]}

In [13]:
age_col = 'age_at_initial_pathologic_diagnosis'

In [14]:
preview_df(clinical_data[candidate_gender_cols])

{'gender': ['MALE', 'FEMALE', 'FEMALE', 'FEMALE', 'MALE']}

In [15]:
gender_col = 'gender'

In [16]:
def xena_select_clinical_features(clinical_df, trait, age_col=None, gender_col=None):
    feature_list = []
    trait_data = clinical_df.index.to_series().apply(xena_convert_trait).rename(trait)
    feature_list.append(trait_data)
    if age_col:
        age_data = clinical_df[age_col].apply(xena_convert_age).rename("Age")
        feature_list.append(age_data)
    if gender_col:
        gender_data = clinical_df[gender_col].apply(xena_convert_gender).rename("Gender")
        feature_list.append(gender_data)
    selected_clinical_df = pd.concat(feature_list, axis=1)
    return selected_clinical_df

In [17]:
def xena_convert_trait(row_index: str):
    """
    Convert the trait information from Sample IDs to labels depending on the last two digits.
    Tumor types range from 01 - 09, normal types from 10 - 19.
    :param row_index: the index value of a row
    :return: the converted value
    """
    last_two_digits = int(row_index[-2:])

    if 1 <= last_two_digits <= 9:
        return 1
    elif 10 <= last_two_digits <= 19:
        return 0
    else:
        return -1

In [18]:
def xena_convert_age(cell: str):
    """Convert the cell content about age to a numerical value using regular expression
    """
    match = re.search(r'\d+', str(cell))
    if match:
        return int(match.group())
    else:
        return None

In [19]:
def xena_convert_gender(cell: str):
    """Convert the cell content about gender to a binary value
    """
    if isinstance(cell, str):
        cell = cell.lower()

    if cell == "female":
        return 0
    elif cell == "male":
        return 1
    else:
        return None

In [20]:
import re
selected_clinical_data = xena_select_clinical_features(clinical_data, TRAIT, age_col=age_col, gender_col=gender_col)

In [21]:
def normalize_gene_symbols_in_index(gene_df):
    """Normalize the human gene symbols at the index of a dataframe, and replace the index with its normalized version.
    Remove the rows where the index failed to be normalized."""
    normalized_gene_list = normalize_gene_symbols(gene_df.index.tolist())
    assert len(normalized_gene_list) == len(gene_df.index)
    gene_df.index = normalized_gene_list
    gene_df = gene_df[gene_df.index.notnull()]
    return gene_df

In [22]:
def normalize_gene_symbols(gene_symbols, batch_size=1000):
    """Normalize human gene symbols in batches using the 'mygenes' library"""
    mg = mygene.MyGeneInfo()
    normalized_genes = {}

    # Process in batches
    for i in range(0, len(gene_symbols), batch_size):
        batch = gene_symbols[i:i + batch_size]
        results = mg.querymany(batch, scopes='symbol', fields='symbol', species='human')

        # Update the normalized_genes dictionary with results from this batch
        for gene in results:
            normalized_genes[gene['query']] = gene.get('symbol', None)

    # Return the normalized symbols in the same order as the input
    return [normalized_genes.get(symbol) for symbol in gene_symbols]


In [23]:
import mygene

if NORMALIZE_GENE:
    genetic_data = normalize_gene_symbols_in_index(genetic_data)

12 input query terms found dup hits:	[('GTF2IP1', 2), ('RBMY1A3P', 3), ('RPL31P11', 2), ('HERC2P2', 3), ('WASH3P', 3), ('NUDT9P1', 2), ('
154 input query terms found no hit:	['C16orf13', 'C16orf11', 'LOC100272146', 'LOC339240', 'NACAP1', 'LOC441204', 'KLRA1', 'FAM183A', 'FA
10 input query terms found dup hits:	[('SUGT1P1', 2), ('PTPRVP', 2), ('SNORA62', 3), ('IFITM4P', 7), ('HLA-DRB6', 2), ('FUNDC2P2', 2), ('
190 input query terms found no hit:	['NARFL', 'NFKBIL2', 'LOC150197', 'TMEM84', 'LOC162632', 'PPPDE1', 'PPPDE2', 'C1orf38', 'C1orf31', '
11 input query terms found dup hits:	[('PIP5K1P1', 2), ('HBD', 2), ('PPP1R2P1', 9), ('HSD17B7P2', 2), ('RPSAP9', 2), ('SNORD68', 2), ('SN
149 input query terms found no hit:	['FAM153C', 'C9orf167', 'CLK2P', 'CCDC76', 'CCDC75', 'CCDC72', 'HIST3H2BB', 'PRAC', 'LOC285780', 'LO
15 input query terms found dup hits:	[('SNORD58C', 2), ('UOX', 2), ('UBE2Q2P1', 3), ('PPP4R1L', 2), ('SNORD63', 3), ('ESPNP', 2), ('HBBP1
158 input query terms found no hit:	[

In [24]:
merged_data = selected_clinical_data.join(genetic_data.T).dropna()
merged_data.head()

Unnamed: 0_level_0,Ankylosing Spondylitis,Age,Gender,ARHGEF10L,HIF3A,RNF17,RNF10,RNF11,RNF13,GTF2IP1,...,SLC7A10,PLA2G2C,TULP2,NPY5R,GNGT2,GNGT1,TULP3,BCL6B,GSTK1,SELP
sampleID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
TCGA-OR-A5J1-01,1,58,1,-0.641092,-0.325826,-0.531035,1.266428,0.355422,0.03719,0.243706,...,-1.520186,-0.086682,-0.182978,-0.615817,-0.281533,3.02111,-0.927577,-1.006227,1.119905,-2.185533
TCGA-OR-A5J2-01,1,44,0,-1.864792,2.766674,0.321165,1.000728,0.836122,0.35439,-0.436694,...,-0.318586,1.056018,0.393822,2.366583,-0.955033,-1.28139,1.020723,1.226373,1.164005,0.265067
TCGA-OR-A5J3-01,1,23,0,-0.723192,-0.362926,-0.531035,0.639828,-0.199578,-0.48331,0.143606,...,-0.574486,-0.086682,-0.748878,-0.113317,-3.803333,-0.61009,0.397623,-0.675227,1.196005,-3.161633
TCGA-OR-A5J5-01,1,30,1,-1.576792,-2.086226,2.463765,1.382228,-1.115678,-1.23621,0.615806,...,-0.279486,-0.086682,0.078622,1.095983,-0.908533,-1.28139,0.661823,0.458273,0.839605,-5.525533
TCGA-OR-A5J6-01,1,29,0,-2.311992,5.225974,-0.531035,0.967928,-0.393778,-0.38231,-0.060194,...,-2.090786,1.607218,2.481122,-0.946617,-0.570533,-1.28139,-0.425177,0.938573,0.495005,-1.733333


In [25]:
def judge_and_remove_biased_features(df, trait, trait_type):
    assert trait_type in ["binary", "continuous"], f"The trait must be either a binary or a continuous variable!"
    if trait_type == "binary":
        trait_biased = judge_binary_variable_biased(df, trait)
    else:
        trait_biased = judge_continuous_variable_biased(df, trait)
    if trait_biased:
        print(f"The distribution of the feature \'{trait}\' in this dataset is severely biased.\n")
    else:
        print(f"The distribution of the feature \'{trait}\' in this dataset is fine.\n")
    if "Age" in df.columns:
        age_biased = judge_continuous_variable_biased(df, 'Age')
        if age_biased:
            print(f"The distribution of the feature \'Age\' in this dataset is severely biased.\n")
            df = df.drop(columns='Age')
        else:
            print(f"The distribution of the feature \'Age\' in this dataset is fine.\n")
    if "Gender" in df.columns:
        gender_biased = judge_binary_variable_biased(df, 'Gender')
        if gender_biased:
            print(f"The distribution of the feature \'Gender\' in this dataset is severely biased.\n")
            df = df.drop(columns='Gender')
        else:
            print(f"The distribution of the feature \'Gender\' in this dataset is fine.\n")

    return trait_biased, df

In [26]:
def judge_binary_variable_biased(dataframe, col_name, min_proportion=0.1, min_num=5):
    """
    Check if the distribution of a binary variable in the dataset is too biased to be usable for analysis
    :param dataframe:
    :param col_name:
    :param min_proportion:
    :param min_num:
    :return:
    """
    label_counter = dataframe[col_name].value_counts()
    total_samples = len(dataframe)
    rare_label_num = label_counter.min()
    rare_label = label_counter.idxmin()
    rare_label_proportion = rare_label_num / total_samples

    print(
        f"For the feature \'{col_name}\', the least common label is '{rare_label}' with {rare_label_num} occurrences. This represents {rare_label_proportion:.2%} of the dataset.")

    biased = (len(label_counter) < 2) or ((rare_label_proportion < min_proportion) and (rare_label_num < min_num))
    return bool(biased)


In [27]:
def judge_continuous_variable_biased(dataframe, col_name):
    """Check if the distribution of a continuous variable in the dataset is too biased to be usable for analysis.
    As a starting point, we consider it biased if all values are the same. For the next step, maybe ask GPT to judge
    based on quartile statistics combined with its common sense knowledge about this feature.
    """
    quartiles = dataframe[col_name].quantile([0.25, 0.5, 0.75])
    min_value = dataframe[col_name].min()
    max_value = dataframe[col_name].max()

    # Printing quartile information
    print(f"Quartiles for '{col_name}':")
    print(f"  25%: {quartiles[0.25]}")
    print(f"  50% (Median): {quartiles[0.5]}")
    print(f"  75%: {quartiles[0.75]}")
    print(f"Min: {min_value}")
    print(f"Max: {max_value}")

    biased = min_value == max_value

    return bool(biased)

In [28]:
print(f"The merged dataset contains {len(merged_data)} samples.")
is_trait_biased, merge_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

The merged dataset contains 79 samples.
For the feature 'Ankylosing Spondylitis', the least common label is '1' with 79 occurrences. This represents 100.00% of the dataset.
The distribution of the feature 'Ankylosing Spondylitis' in this dataset is severely biased.

Quartiles for 'Age':
  25%: 35.0
  50% (Median): 49.0
  75%: 59.5
Min: 14
Max: 77
The distribution of the feature 'Age' in this dataset is fine.

For the feature 'Gender', the least common label is '1' with 31 occurrences. This represents 39.24% of the dataset.
The distribution of the feature 'Gender' in this dataset is fine.



True

In [29]:
merged_data.head()
if not is_trait_biased:
    merge_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)

In [30]:
from typing import Callable, Optional, List, Tuple, Union, Any

In [31]:
def save_cohort_info(cohort: str, info_path: str, is_available: bool, is_biased: Optional[bool] = None,
                     df: Optional[pd.DataFrame] = None, note: str = '') -> None:
    """
    Add or update information about the usability and quality of a dataset for statistical analysis.

    Parameters:
    cohort (str): A unique identifier for the dataset.
    info_path (str): File path to the JSON file where records are stored.
    is_available (bool): Indicates whether both the genetic data and trait data are available in the dataset, and can be
     preprocessed into a dataframe.
    is_biased (bool, optional): Indicates whether the dataset is too biased to be usable.
        Required if `is_available` is True.
    df (pandas.DataFrame, optional): The preprocessed dataset. Required if `is_available` is True.
    note (str, optional): Additional notes about the dataset.

    Returns:
    None: The function does not return a value but updates or creates a record in the specified JSON file.
    """
    if is_available:
        assert (df is not None) and (is_biased is not None), "'df' and 'is_biased' should be provided if this cohort " \
                                                             "is relevant."
    is_usable = is_available and (not is_biased)
    new_record = {"is_usable": is_usable,
                  "is_available": is_available,
                  "is_biased": is_biased if is_available else None,
                  "has_age": "Age" in df.columns if is_available else None,
                  "has_gender": "Gender" in df.columns if is_available else None,
                  "sample_size": len(df) if is_available else None,
                  "note": note}
    
    if not os.path.exists(info_path):
        with open(info_path, 'w') as file:
            json.dump({}, file)
        print(f"A new JSON file was created at: {info_path}")

    with open(info_path, "r") as file:
        records = json.load(file)
    records[cohort] = new_record

    temp_path = info_path + ".tmp"
    try:
        with open(temp_path, 'w') as file:
            json.dump(records, file)
        os.replace(temp_path, info_path)

    except Exception as e:
        print(f"An error occurred: {e}")
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise

In [32]:
import json

save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data)

A new JSON file was created at: /Users/legion/Desktop/Courses/IS389/output\Jiayi\Ankylosing-Spondylitis\cohort_info.json


## 2.2. The GEO dataset

In GEO, there may be one or multiple cohorts for a trait. Each cohort is identified by an accession number. We iterate over all accession numbers in the corresponding subdirectory, preprocess the cohort data, and save them to csv files.

In [33]:
dataset = 'GEO'
trait_subdir = "Ankylosing-Spondylitis"

trait_path = os.path.join(DATA_ROOT, dataset, trait_subdir)
os.listdir(trait_path)

['GSE100648', 'GSE11886', 'GSE25101', 'GSE73754']

Repeat the below steps for all the accession numbers

In [34]:
def get_relevant_filepaths(cohort_dir):
    """Find the file paths of a SOFT file and a matrix file from the given data directory of a cohort.
    If there are multiple SOFT files or matrix files, simply choose the first one. May be replaced by better
    strategies later.
    """
    files = os.listdir(cohort_dir)
    soft_files = [f for f in files if 'soft' in f.lower()]
    matrix_files = [f for f in files if 'matrix' in f.lower()]
    assert len(soft_files) > 0 and len(matrix_files) > 0
    soft_file_path = os.path.join(cohort_dir, soft_files[0])
    matrix_file_path = os.path.join(cohort_dir, matrix_files[0])

    return soft_file_path, matrix_file_path

In [85]:
# Completed
cohort = accession_num = "GSE100648"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Early arthritis B cell gene expression profiling"
!Series_summary	"242 patients recruited from an early arthritis clinic donated RNA and DNA from freshly isolated and purified peripheral blood CD19+ B cells. Global gene expression measurement was carried out using Illumina BeadChip HT12v4 microarrays. Objectives included the identification of B cell transcripts differentially expressed between disease phenotypes, where all patients were naive to immunomodulatory therapy. In addition an eQTL analysis was carried out with reference to known genotype data for this cohort of patients"
!Series_overall_design	"Cross sectional; diagnoses confirmed at >2 years' follow-up."


Unnamed: 0,!Sample_geo_accession,GSM2690230,GSM2690231,GSM2690232,GSM2690233,GSM2690234,GSM2690235,GSM2690236,GSM2690237,GSM2690238,...,GSM2690462,GSM2690463,GSM2690464,GSM2690465,GSM2690466,GSM2690467,GSM2690468,GSM2690469,GSM2690470,GSM2690471
0,!Sample_characteristics_ch1,tissue: peripheral blood,tissue: peripheral blood,tissue: peripheral blood,tissue: peripheral blood,tissue: peripheral blood,tissue: peripheral blood,tissue: peripheral blood,tissue: peripheral blood,tissue: peripheral blood,...,tissue: peripheral blood,tissue: peripheral blood,tissue: peripheral blood,tissue: peripheral blood,tissue: peripheral blood,tissue: peripheral blood,tissue: peripheral blood,tissue: peripheral blood,tissue: peripheral blood,tissue: peripheral blood
1,!Sample_characteristics_ch1,cell type: CD19+ B cells,cell type: CD19+ B cells,cell type: CD19+ B cells,cell type: CD19+ B cells,cell type: CD19+ B cells,cell type: CD19+ B cells,cell type: CD19+ B cells,cell type: CD19+ B cells,cell type: CD19+ B cells,...,cell type: CD19+ B cells,cell type: CD19+ B cells,cell type: CD19+ B cells,cell type: CD19+ B cells,cell type: CD19+ B cells,cell type: CD19+ B cells,cell type: CD19+ B cells,cell type: CD19+ B cells,cell type: CD19+ B cells,cell type: CD19+ B cells
2,!Sample_characteristics_ch1,first_diagnosis: Rheumatoid Arthritis,first_diagnosis: Undifferentiated Inflammatory...,first_diagnosis: Rheumatoid Arthritis,first_diagnosis: Non-Inflammatory,first_diagnosis: Undifferentiated Inflammatory...,first_diagnosis: Osteoarthritis,first_diagnosis: Undifferentiated Inflammatory...,first_diagnosis: Enteropathic Arthritis,first_diagnosis: Other Inflammatory Arthritis,...,first_diagnosis: Undifferentiated Inflammatory...,first_diagnosis: Undifferentiated Inflammatory...,first_diagnosis: Undifferentiated Inflammatory...,first_diagnosis: Undifferentiated Inflammatory...,first_diagnosis: Undifferentiated Inflammatory...,first_diagnosis: Rheumatoid Arthritis,first_diagnosis: Rheumatoid Arthritis,first_diagnosis: Rheumatoid Arthritis,first_diagnosis: Rheumatoid Arthritis,first_diagnosis: Rheumatoid Arthritis
3,!Sample_characteristics_ch1,working_diagnosis: Rheumatoid Arthritis,working_diagnosis: Osteoarthritis,working_diagnosis: Rheumatoid Arthritis,working_diagnosis: Non-Inflammatory,working_diagnosis: Reactive Arthritis,working_diagnosis: Osteoarthritis,working_diagnosis: Reactive Arthritis,working_diagnosis: Enteropathic Arthritis,working_diagnosis: Other Inflammatory Arthritis,...,working_diagnosis: Undifferentiated Inflammato...,working_diagnosis: Undifferentiated Inflammato...,working_diagnosis: Undifferentiated Inflammato...,working_diagnosis: Undifferentiated Inflammato...,working_diagnosis: Undifferentiated Inflammato...,working_diagnosis: Rheumatoid Arthritis,working_diagnosis: Rheumatoid Arthritis,working_diagnosis: Rheumatoid Arthritis,working_diagnosis: Rheumatoid Arthritis,working_diagnosis: Rheumatoid Arthritis
4,!Sample_characteristics_ch1,Sex: M,Sex: F,Sex: F,Sex: F,Sex: M,Sex: F,Sex: M,Sex: F,Sex: F,...,Sex: F,Sex: F,Sex: F,Sex: F,Sex: F,Sex: M,Sex: F,Sex: F,Sex: F,Sex: F


In [90]:
tumor_stage_row = clinical_data.iloc[3]
tumor_stage_row.unique()


array(['!Sample_characteristics_ch1',
       'working_diagnosis: Rheumatoid Arthritis',
       'working_diagnosis: Osteoarthritis',
       'working_diagnosis: Non-Inflammatory',
       'working_diagnosis: Reactive Arthritis',
       'working_diagnosis: Enteropathic Arthritis',
       'working_diagnosis: Other Inflammatory Arthritis',
       'working_diagnosis: Psoriatic Arthritis',
       'working_diagnosis: Undifferentiated Spondylo-Arthropathy',
       'working_diagnosis: Crystal Arthritis',
       'working_diagnosis: Unknown',
       'working_diagnosis: Undifferentiated Inflammatory Arthritis',
       'working_diagnosis: Ankylosing Spondylitis',
       'working_diagnosis: Lupus/Other CTD-Associated'], dtype=object)

In [91]:
is_gene_availabe = True
trait_row = 3
age_row = None
gender_row = 4

trait_type = 'binary'

# Verify and use the functions generated by GPT

# 这个函数将组织类型（tissue type）转换为有关癫痫存在与否的二进制值。
# 它是基于特定的假设，即如果组织类型是“胰腺导管腺癌”（Pancreatic Ductal Adenocarcinoma），则认为癫痫存在（返回1）；否则，认为癫痫不存在（返回0）。
def convert_trait(tissue_type):
    """
    Convert tissue type to epilepsy presence (binary).
    Assuming epilepsy presence for 'Hippocampus' tissue.
    """
    if tissue_type == 'working_diagnosis: Ankylosing Spondylitis':
        return 1  # Epilepsy present
    else:
        return 0  # Epilepsy not present


# def convert_trait(tumor_grade):
#     if (tumor_grade == 'tumor grade: 2' or tumor_grade == 'tumor grade: 3' or tumor_grade == 'tumor grade: 4'):
#         return 1  
#     elif tumor_grade == 'tumor grade: 1':
#         return 0  
#     else:
#         return None

# 这个函数的目的是将年龄的字符串表示转换为一个连续的数值型表示。如果年龄未知（例如，标记为'n.a.'），则返回None。
# 函数尝试从传入的字符串中提取出一个整数作为年龄值。如果字符串的格式不符合预期，导致提取失败，同样返回None。
def convert_age(age_string):
    """
    Convert age string to a continuous numerical value.
    Unknown values are converted to None.
    """
    if age_string.lower() == 'n.a.':
        return None
    try:
        # Extract age as an integer from the string
        age = int(age_string.split(': ')[1])
        return age
    except (ValueError, IndexError):
        # In case of any format error or unexpected string structure
        return None


# 这个函数将性别的字符串表示转换为二进制值，其中“female”对应1，“male”对应0。如果性别未知或字符串不符合预期格式，则返回None。
# It sometimes maps 'female' to 0, and sometimes 1. Does it matter?
def convert_gender(gender_string):
    """
    Convert gender string to a binary value.
    'female' is represented as 1, 'male' as 0.
    Unknown values are converted to None.
    """
    if (gender_string.lower() == 'sex: female' or gender_string.lower() == 'sex: f' or gender_string.lower() == 'gender: female' or gender_string.lower() == 'gender: f'):
        return 1
    elif (gender_string.lower() == 'sex: male' or gender_string.lower() == 'sex: m' or gender_string.lower() == 'gender: male' or gender_string.lower() == 'gender: m') :  # changeed 
        return 0
    else:
        return None

In [92]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()

  clinical_df = clinical_df.applymap(convert_fn)
  clinical_df = clinical_df.applymap(convert_fn)


Unnamed: 0,GSM2690230,GSM2690231,GSM2690232,GSM2690233,GSM2690234,GSM2690235,GSM2690236,GSM2690237,GSM2690238,GSM2690239,...,GSM2690462,GSM2690463,GSM2690464,GSM2690465,GSM2690466,GSM2690467,GSM2690468,GSM2690469,GSM2690470,GSM2690471
Ankylosing Spondylitis,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Gender,0,1,1,1,0,1,0,1,1,1,...,1,1,1,1,1,0,1,1,1,1


In [93]:
genetic_data = get_genetic_data(matrix_file)
genetic_data

Unnamed: 0_level_0,GSM2690230,GSM2690231,GSM2690232,GSM2690233,GSM2690234,GSM2690235,GSM2690236,GSM2690237,GSM2690238,GSM2690239,...,GSM2690462,GSM2690463,GSM2690464,GSM2690465,GSM2690466,GSM2690467,GSM2690468,GSM2690469,GSM2690470,GSM2690471
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ILMN_1343291,14.415838,14.420469,14.454755,14.548113,14.393246,14.476773,14.536234,14.655704,14.495031,14.532721,...,14.649794,14.700669,14.376907,14.617264,14.512447,14.295207,14.482401,14.478883,14.605284,14.556862
ILMN_1343295,12.501028,11.415616,11.695755,11.923042,11.554511,11.474396,12.061302,11.915192,11.952768,11.313382,...,11.580915,10.424846,10.646008,11.864454,11.133300,11.915684,11.448013,11.947450,11.322478,11.192065
ILMN_1651199,6.627739,6.607227,6.585315,6.676918,6.565506,6.706464,6.494084,6.584417,6.575962,6.658423,...,6.702769,6.486990,6.786289,6.622436,6.590972,6.481921,6.641028,6.683143,6.478434,6.537222
ILMN_1651209,6.556000,7.053035,6.802224,6.954742,6.890331,7.070821,6.954012,6.929918,6.918314,6.734604,...,6.800404,7.103074,6.853516,6.867996,6.674861,6.775554,6.766406,6.716863,6.870444,6.863519
ILMN_1651210,6.597989,6.552052,6.658103,6.666267,6.727378,6.667242,6.653112,6.619658,6.692328,6.622764,...,6.558407,6.635161,6.541176,6.612361,6.504983,6.669707,6.513736,6.523818,6.742682,6.671303
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ILMN_3311170,6.612094,6.658218,6.684843,6.737029,6.580584,6.616513,6.841786,6.555513,6.570710,6.673706,...,6.609160,6.714697,6.698058,6.727300,6.590967,6.690844,6.637789,6.635340,6.648415,6.549795
ILMN_3311175,6.650585,6.696714,6.680007,6.596399,6.775372,6.645024,6.636948,6.580950,6.587754,6.664373,...,6.706110,6.575635,6.578014,6.625735,6.599182,6.613412,6.723767,6.537973,6.747436,6.655572
ILMN_3311180,6.800705,6.635228,6.669978,6.597619,6.735752,6.652822,6.465435,6.728747,6.661455,6.540017,...,6.731054,6.616311,6.580413,6.639311,6.811333,6.688858,6.663167,6.675401,6.566967,6.683958
ILMN_3311185,6.652600,6.572627,6.685865,6.585365,6.652637,6.620325,6.667806,6.574783,6.688020,6.610882,...,6.573938,6.594607,6.529716,6.703237,6.568071,6.655248,6.584248,6.615269,6.664984,6.592061


In [94]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

gene_annotation

{'ID': ['ILMN_1343048', 'ILMN_1343049', 'ILMN_1343050', 'ILMN_1343052', 'ILMN_1343059'], 'Species': [nan, nan, nan, nan, nan], 'Source': [nan, nan, nan, nan, nan], 'Search_Key': [nan, nan, nan, nan, nan], 'Transcript': [nan, nan, nan, nan, nan], 'ILMN_Gene': [nan, nan, nan, nan, nan], 'Source_Reference_ID': [nan, nan, nan, nan, nan], 'RefSeq_ID': [nan, nan, nan, nan, nan], 'Unigene_ID': [nan, nan, nan, nan, nan], 'Entrez_Gene_ID': [nan, nan, nan, nan, nan], 'GI': [nan, nan, nan, nan, nan], 'Accession': [nan, nan, nan, nan, nan], 'Symbol': ['phage_lambda_genome', 'phage_lambda_genome', 'phage_lambda_genome:low', 'phage_lambda_genome:low', 'thrB'], 'Protein_Product': [nan, nan, nan, nan, 'thrB'], 'Probe_Id': [nan, nan, nan, nan, nan], 'Array_Address_Id': [5090180.0, 6510136.0, 7560739.0, 1450438.0, 1240647.0], 'Probe_Type': [nan, nan, nan, nan, nan], 'Probe_Start': [nan, nan, nan, nan, nan], 'SEQUENCE': ['GAATAAAGAACAATCTGCTGATGATCCCTCCGTGGATCTGATTCGTGTAA', 'CCATGTGATACGAGGGCGCGTAGTTTGCA

Unnamed: 0,ID,Species,Source,Search_Key,Transcript,ILMN_Gene,Source_Reference_ID,RefSeq_ID,Unigene_ID,Entrez_Gene_ID,...,Probe_Chr_Orientation,Probe_Coordinates,Cytoband,Definition,Ontology_Component,Ontology_Process,Ontology_Function,Synonyms,Obsolete_Probe_Id,GB_ACC
0,ILMN_1343048,,,,,,,,,,...,,,,,,,,,,
1,ILMN_1343049,,,,,,,,,,...,,,,,,,,,,
2,ILMN_1343050,,,,,,,,,,...,,,,,,,,,,
3,ILMN_1343052,,,,,,,,,,...,,,,,,,,,,
4,ILMN_1343059,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11459370,ILMN_2371169,7.855694502,0,,,,,,,,...,,,,,,,,,,
11459371,ILMN_1701875,8.956319227,0,,,,,,,,...,,,,,,,,,,
11459372,ILMN_1786396,8.314829409,0,,,,,,,,...,,,,,,,,,,
11459373,ILMN_1653618,8.968126225,0,,,,,,,,...,,,,,,,,,,


In [95]:
gene_annotation.columns

Index(['ID', 'Species', 'Source', 'Search_Key', 'Transcript', 'ILMN_Gene',
       'Source_Reference_ID', 'RefSeq_ID', 'Unigene_ID', 'Entrez_Gene_ID',
       'GI', 'Accession', 'Symbol', 'Protein_Product', 'Probe_Id',
       'Array_Address_Id', 'Probe_Type', 'Probe_Start', 'SEQUENCE',
       'Chromosome', 'Probe_Chr_Orientation', 'Probe_Coordinates', 'Cytoband',
       'Definition', 'Ontology_Component', 'Ontology_Process',
       'Ontology_Function', 'Synonyms', 'Obsolete_Probe_Id', 'GB_ACC'],
      dtype='object')

In [96]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'Symbol'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)

In [97]:
genetic_data

Unnamed: 0_level_0,GSM2690230,GSM2690231,GSM2690232,GSM2690233,GSM2690234,GSM2690235,GSM2690236,GSM2690237,GSM2690238,GSM2690239,...,GSM2690462,GSM2690463,GSM2690464,GSM2690465,GSM2690466,GSM2690467,GSM2690468,GSM2690469,GSM2690470,GSM2690471
Gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1-Dec,6.690225,6.645631,6.613515,6.639225,6.705352,6.642645,6.563353,6.563837,6.530434,6.581223,...,6.634416,6.510920,6.563834,6.627330,6.585572,6.701072,6.556956,6.683356,6.609795,6.650483
1-Mar,7.651063,9.571072,8.556212,8.587053,9.433715,9.594437,9.409010,9.165061,9.791250,9.178682,...,8.791651,9.323251,9.518603,8.650389,9.074401,9.056935,9.216715,8.608755,9.224840,8.558667
10-Mar,6.826650,6.588074,6.590686,6.597999,6.646043,6.676378,6.596264,6.579935,6.616365,6.645453,...,6.559374,6.610369,6.582099,6.742123,6.607611,6.592589,6.641961,6.484952,6.527867,6.685483
11-Mar,6.414740,6.579584,6.574483,6.563864,6.561773,6.560585,6.485683,6.580335,6.443181,6.738039,...,6.591454,6.522802,6.542323,6.517765,6.697750,6.657025,6.722615,6.793510,6.578635,6.436090
2-Mar,7.620253,7.269074,7.263967,7.332735,7.215201,7.068063,7.490501,7.255800,7.211440,7.370995,...,7.335041,7.171274,7.036233,7.365702,7.365079,7.148670,7.330081,7.514638,7.421067,7.287512
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
dJ341D10.1,7.872240,7.808094,7.804654,7.844383,7.853817,7.792229,8.296184,8.005402,7.756242,8.175150,...,8.009629,7.702242,7.744495,7.510275,7.649329,7.759432,7.742089,7.757284,7.602962,7.589611
gm127,6.652717,6.661334,6.766349,6.766286,6.726882,6.739457,6.691700,6.836504,6.715168,6.611255,...,6.781494,6.657049,6.701411,6.758011,6.802065,6.699023,6.653731,6.795169,6.829674,6.672595
psiTPTE22,6.543646,6.651050,6.592777,6.712322,6.651795,6.687181,6.722882,6.615713,6.647023,6.651957,...,6.658668,6.648473,6.719836,6.719294,6.723473,6.659385,6.640942,6.653864,6.602776,6.540232
rab1c,6.720996,6.614096,6.592334,6.602132,6.775020,6.598250,6.731486,6.685775,6.855869,6.709597,...,6.507397,6.568342,6.536055,6.523492,6.719801,6.596157,6.563001,6.661075,6.639940,6.565698


In [98]:
genetic_data = normalize_gene_symbols_in_index(genetic_data)

13 input query terms found dup hits:	[('ABCC13', 2), ('ABCC6P1', 2), ('ABCC6P2', 3), ('ADAM3A', 2), ('ADAM6', 3), ('AGAP11', 2), ('AK2P2'
106 input query terms found no hit:	['1-Dec', '1-Mar', '10-Mar', '11-Mar', '2-Mar', '3-Mar', '4-Mar', '5-Mar', '6-Mar', '7-Mar', '7A5', 
6 input query terms found dup hits:	[('ATP6AP1L', 2), ('ATXN8OS', 2), ('BAGE2', 2), ('BIRC8', 2), ('BRD7P3', 2), ('BRI3P1', 2)]
363 input query terms found no hit:	['ARMC4', 'ARMET', 'ARNTL', 'ARNTL2', 'ARP11', 'ARPM1', 'ARPM2', 'ARPP-21', 'ARS2', 'ARSE', 'ARVP612
809 input query terms found no hit:	['C15orf42', 'C15orf43', 'C15orf44', 'C15orf49', 'C15orf5', 'C15orf51', 'C15orf52', 'C15orf53', 'C15
7 input query terms found dup hits:	[('CATSPER2P1', 2), ('CCDC144NL', 2), ('CCNYL2', 2), ('CCNYL3', 2), ('CCT6P1', 2), ('CDR1', 2), ('CE
131 input query terms found no hit:	['C9orf90', 'C9orf91', 'C9orf93', 'C9orf96', 'C9orf98', 'CA5BP', 'CABC1', 'CAMSAP1L1', 'CARD17', 'CA
4 input query terms found dup hits:	[('CLEC4GP1',

In [99]:
genetic_data

Unnamed: 0,GSM2690230,GSM2690231,GSM2690232,GSM2690233,GSM2690234,GSM2690235,GSM2690236,GSM2690237,GSM2690238,GSM2690239,...,GSM2690462,GSM2690463,GSM2690464,GSM2690465,GSM2690466,GSM2690467,GSM2690468,GSM2690469,GSM2690470,GSM2690471
A1BG,6.848050,6.688162,6.796304,6.726489,6.786394,6.854404,6.674512,6.688926,6.807333,6.666278,...,6.783802,6.657357,6.717128,6.735756,6.748284,6.761963,6.747374,6.804050,6.778584,6.751946
A1CF,6.583041,6.650677,6.686538,6.677932,6.661399,6.609482,6.628126,6.613500,6.613940,6.596914,...,6.574616,6.561873,6.650160,6.606254,6.663059,6.624456,6.657874,6.672318,6.741131,6.641459
A2M,6.650136,6.521780,6.566705,6.618576,6.580874,6.569995,6.588907,6.559197,6.623999,6.567076,...,6.633357,6.453508,6.538286,6.437383,6.601649,6.726172,6.425379,6.620081,6.528318,6.530385
A2ML1,6.611510,6.646145,6.633586,6.460136,6.465241,6.422501,6.467049,6.607412,6.529374,6.381183,...,6.454338,6.698004,6.722807,6.648967,6.501302,6.586460,6.566908,6.641582,6.606379,6.488792
A3GALT2,6.527584,6.622752,6.591259,6.624255,6.512508,6.553509,6.691260,6.716285,6.605287,6.574475,...,6.770754,6.606267,6.758579,6.685279,6.702444,6.548548,6.623991,6.503292,6.727856,6.595299
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZYG11B,9.911613,10.017467,9.871995,9.707972,10.086790,10.151445,10.318409,10.080975,10.183896,10.038278,...,9.724662,10.242638,9.792358,9.719300,10.147318,10.715561,10.077286,9.726975,10.139342,10.125031
ZYX,10.858828,8.512090,8.711756,9.209957,8.739951,8.567852,9.859342,9.021956,9.131341,8.492129,...,8.395954,8.109818,8.257713,9.571841,8.877332,8.493656,8.613942,8.916826,8.144795,8.406007
ZZEF1,8.413154,8.307626,8.234364,8.298999,8.282404,8.633262,8.541087,8.412907,8.163264,8.233234,...,8.620517,8.503926,8.273694,8.654892,8.389683,8.012815,8.396138,8.400772,8.212944,8.314829
ZZZ3,8.525627,8.723545,8.754528,8.591482,8.868986,9.154676,8.968219,8.918800,8.966573,8.646385,...,8.854108,8.960751,9.038261,9.150526,8.888503,8.894667,8.722772,8.783000,8.705091,8.872893


In [100]:
merged_data = geo_merge_clinical_genetic_data(selected_clinical_data, genetic_data)
# The preprocessing runs through, which means is_available should be True
is_available = True

In [101]:
merged_data

Unnamed: 0,Ankylosing Spondylitis,Gender,A1BG,A1CF,A2M,A2ML1,A3GALT2,A4GALT,A4GNT,AAA1,...,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3,RAB1C
GSM2690230,0.0,0.0,6.848050,6.583041,6.650136,6.611510,6.527584,6.434655,7.058692,6.700324,...,6.880832,7.006752,7.284013,6.975131,6.629065,9.911613,10.858828,8.413154,8.525627,6.720996
GSM2690231,0.0,1.0,6.688162,6.650677,6.521780,6.646145,6.622752,6.618596,6.880522,6.620599,...,6.560931,7.133329,7.969672,7.378143,6.566949,10.017467,8.512090,8.307626,8.723545,6.614096
GSM2690232,0.0,1.0,6.796304,6.686538,6.566705,6.633586,6.591259,6.555226,6.825927,6.636943,...,6.619013,7.188181,7.688697,7.205052,6.687631,9.871995,8.711756,8.234364,8.754528,6.592334
GSM2690233,0.0,1.0,6.726489,6.677932,6.618576,6.460136,6.624255,6.605685,6.989004,6.624586,...,6.705561,7.072152,7.534276,7.056405,6.583996,9.707972,9.209957,8.298999,8.591482,6.602132
GSM2690234,0.0,0.0,6.786394,6.661399,6.580874,6.465241,6.512508,6.698219,6.666786,6.656601,...,6.697169,7.046910,7.638880,7.310057,6.686599,10.086790,8.739951,8.282404,8.868986,6.775020
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GSM2690467,0.0,0.0,6.761963,6.624456,6.726172,6.586460,6.548548,6.567630,6.679969,6.674690,...,6.979054,7.193730,8.066666,7.221787,6.595565,10.715561,8.493656,8.012815,8.894667,6.596157
GSM2690468,0.0,1.0,6.747374,6.657874,6.425379,6.566908,6.623991,6.576486,6.668170,6.618987,...,6.836740,7.007840,7.708474,7.048696,6.667023,10.077286,8.613942,8.396138,8.722772,6.563001
GSM2690469,0.0,1.0,6.804050,6.672318,6.620081,6.641582,6.503292,6.665137,6.746265,6.649298,...,6.718197,7.087727,7.800900,7.160791,6.633927,9.726975,8.916826,8.400772,8.783000,6.661075
GSM2690470,0.0,1.0,6.778584,6.741131,6.528318,6.606379,6.727856,6.466134,6.547827,6.558552,...,6.751371,7.033584,7.752131,6.960197,6.550888,10.139342,8.144795,8.212944,8.705091,6.639940


In [102]:
print(f"The merged dataset contains {len(merged_data)} samples.")
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

The merged dataset contains 242 samples.
For the feature 'Ankylosing Spondylitis', the least common label is '1.0' with 2 occurrences. This represents 0.83% of the dataset.
The distribution of the feature 'Ankylosing Spondylitis' in this dataset is severely biased.

For the feature 'Gender', the least common label is '0.0' with 69 occurrences. This represents 28.51% of the dataset.
The distribution of the feature 'Gender' in this dataset is fine.



True

In [103]:
if is_available:
    save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data, note='')
else:
    save_cohort_info(cohort, JSON_PATH, is_available)
merged_data.head()
if not is_trait_biased:
    merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)

In [104]:
# Finished
cohort = accession_num = "GSE25101"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Expression profiling in whole blood in ankylosing spondylitis patients and controls"
!Series_summary	"Introduction: A number of genetic-association studies have identified genes contributing to AS susceptibility but such approaches provide little information as to the gene activity changes occurring during the disease process. Transcriptional profiling generates a “snapshot” of the sampled cells activity and thus can provide insights into the molecular processes driving the disease process. We undertook a whole-genome microarray approach to identify candidate genes associated with AS and validated these gene-expression changes in a larger sample cohort.  Methods: 18 active AS patients, classified according to the New York criteria. and 18 gender-and age-matched controls were profiled using Illumina HT-12 Whole-Genome Expression BeadChips which carry cDNAs for 48000 genes and transcripts. Class comparison analysis identified a number of differentially expressed candidate 

Unnamed: 0,!Sample_geo_accession,GSM616668,GSM616669,GSM616670,GSM616671,GSM616672,GSM616673,GSM616674,GSM616675,GSM616676,...,GSM616690,GSM616691,GSM616692,GSM616693,GSM616694,GSM616695,GSM616696,GSM616697,GSM616698,GSM616699
0,!Sample_characteristics_ch1,tissue: Whole blood,tissue: Whole blood,tissue: Whole blood,tissue: Whole blood,tissue: Whole blood,tissue: Whole blood,tissue: Whole blood,tissue: Whole blood,tissue: Whole blood,...,tissue: Whole blood,tissue: Whole blood,tissue: Whole blood,tissue: Whole blood,tissue: Whole blood,tissue: Whole blood,tissue: Whole blood,tissue: Whole blood,tissue: Whole blood,tissue: Whole blood
1,!Sample_characteristics_ch1,cell type: PBMC,cell type: PBMC,cell type: PBMC,cell type: PBMC,cell type: PBMC,cell type: PBMC,cell type: PBMC,cell type: PBMC,cell type: PBMC,...,cell type: PBMC,cell type: PBMC,cell type: PBMC,cell type: PBMC,cell type: PBMC,cell type: PBMC,cell type: PBMC,cell type: PBMC,cell type: PBMC,cell type: PBMC
2,!Sample_characteristics_ch1,disease status: Ankylosing spondylitis patient,disease status: Ankylosing spondylitis patient,disease status: Ankylosing spondylitis patient,disease status: Ankylosing spondylitis patient,disease status: Ankylosing spondylitis patient,disease status: Ankylosing spondylitis patient,disease status: Ankylosing spondylitis patient,disease status: Ankylosing spondylitis patient,disease status: Ankylosing spondylitis patient,...,disease status: Normal control,disease status: Normal control,disease status: Normal control,disease status: Normal control,disease status: Normal control,disease status: Normal control,disease status: Normal control,disease status: Normal control,disease status: Normal control,disease status: Normal control


In [105]:
tumor_stage_row = clinical_data.iloc[2]
tumor_stage_row.unique()

array(['!Sample_characteristics_ch1',
       'disease status: Ankylosing spondylitis patient',
       'disease status: Normal control'], dtype=object)

In [106]:
is_gene_availabe = True
trait_row = 2
age_row = None
gender_row = None

trait_type = 'binary'

# Verify and use the functions generated by GPT

# 这个函数将组织类型（tissue type）转换为有关癫痫存在与否的二进制值。
# 它是基于特定的假设，即如果组织类型是“胰腺导管腺癌”（Pancreatic Ductal Adenocarcinoma），则认为癫痫存在（返回1）；否则，认为癫痫不存在（返回0）。
def convert_trait(tissue_type):
    """
    Convert tissue type to epilepsy presence (binary).
    Assuming epilepsy presence for 'Hippocampus' tissue.
    """
    if tissue_type == 'disease status: Ankylosing spondylitis patient':
        return 1  # Epilepsy present
    else:
        return 0  # Epilepsy not present

In [107]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()

  clinical_df = clinical_df.applymap(convert_fn)


Unnamed: 0,GSM616668,GSM616669,GSM616670,GSM616671,GSM616672,GSM616673,GSM616674,GSM616675,GSM616676,GSM616677,...,GSM616690,GSM616691,GSM616692,GSM616693,GSM616694,GSM616695,GSM616696,GSM616697,GSM616698,GSM616699
Ankylosing Spondylitis,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0


In [108]:
genetic_data = get_genetic_data(matrix_file)
genetic_data

Unnamed: 0_level_0,GSM616668,GSM616669,GSM616670,GSM616671,GSM616672,GSM616673,GSM616674,GSM616675,GSM616676,GSM616677,...,GSM616690,GSM616691,GSM616692,GSM616693,GSM616694,GSM616695,GSM616696,GSM616697,GSM616698,GSM616699
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ILMN_1343291,14.721687,14.910628,14.945053,14.868905,14.910628,14.994942,14.579192,14.889235,14.868905,14.789278,...,14.889235,14.889235,14.945053,14.683020,14.817801,14.910628,14.889235,14.776362,14.737311,14.868905
ILMN_1343295,11.818846,12.014962,12.020319,11.923485,11.674407,11.829631,11.492026,12.027219,11.738941,12.599862,...,12.233341,11.936598,12.282542,11.938787,11.631719,11.804583,11.701640,11.929974,11.738941,11.578072
ILMN_1651209,6.869540,7.095127,7.031877,7.133220,6.887904,6.918006,6.894105,6.886695,7.016599,6.924489,...,6.694988,6.839130,7.323969,6.973208,6.868359,7.265285,7.415702,7.236032,7.022915,7.205181
ILMN_1651228,11.300704,11.486897,11.337098,12.668199,11.107247,11.330551,12.863078,12.654738,12.784749,12.150922,...,10.968782,11.407495,13.174214,12.512700,13.083279,12.814878,12.476037,12.498150,12.313310,11.912799
ILMN_1651229,7.687854,8.152096,7.854758,7.793770,7.773122,7.649979,7.808541,7.788769,8.079810,8.140961,...,8.059401,7.950677,8.487722,8.039596,7.828835,8.255879,8.148248,8.321234,7.813087,7.871922
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ILMN_2415898,7.976265,7.467628,7.779078,7.799240,7.642209,7.737723,8.040922,8.059585,7.551395,7.646354,...,7.673213,7.906765,7.839494,7.832799,8.817948,7.376251,8.010885,7.484848,8.145814,8.037064
ILMN_2415911,7.753634,7.728451,7.741349,7.352076,7.736549,7.857763,7.634732,7.350687,7.762689,7.677580,...,7.860533,7.655432,7.467160,7.627300,7.769050,7.573400,7.615588,7.719455,7.598297,7.516465
ILMN_2415926,7.812243,7.828331,8.236525,8.052980,7.933305,8.180785,8.112408,7.564648,8.198021,7.351243,...,7.972885,8.229952,7.826864,7.976265,7.928434,8.080596,8.407442,7.698175,8.157596,7.569014
ILMN_2415949,7.937806,7.797597,7.981566,8.007248,7.969345,8.101712,7.859426,7.683960,8.171502,7.814201,...,7.941449,7.864130,7.956597,7.738982,7.996923,8.098316,7.850872,7.967910,8.114295,7.848682


In [109]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

gene_annotation

{'ID': ['ILMN_1725881', 'ILMN_1910180', 'ILMN_1804174', 'ILMN_1796063', 'ILMN_1811966'], 'nuID': ['rp13_p1x6D80lNLk3c', 'NEX0oqCV8.er4HVfU4', 'KyqQynMZxJcruyylEU', 'xXl7eXuF7sbPEp.KFI', '9ckqJrioiaej9_ajeQ'], 'Species': ['Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens'], 'Source': ['RefSeq', 'Unigene', 'RefSeq', 'RefSeq', 'RefSeq'], 'Search_Key': ['ILMN_44919', 'ILMN_127219', 'ILMN_139282', 'ILMN_5006', 'ILMN_38756'], 'Transcript': ['ILMN_44919', 'ILMN_127219', 'ILMN_139282', 'ILMN_5006', 'ILMN_38756'], 'ILMN_Gene': ['LOC23117', 'HS.575038', 'FCGR2B', 'TRIM44', 'LOC653895'], 'Source_Reference_ID': ['XM_933824.1', 'Hs.575038', 'XM_938851.1', 'NM_017583.3', 'XM_936379.1'], 'RefSeq_ID': ['XM_933824.1', nan, 'XM_938851.1', 'NM_017583.3', 'XM_936379.1'], 'Unigene_ID': [nan, 'Hs.575038', nan, nan, nan], 'Entrez_Gene_ID': [23117.0, nan, 2213.0, 54765.0, 653895.0], 'GI': [89040007.0, 10437021.0, 88952550.0, 29029528.0, 89033487.0], 'Accession': ['XM_933824.1', 'AK

Unnamed: 0,ID,nuID,Species,Source,Search_Key,Transcript,ILMN_Gene,Source_Reference_ID,RefSeq_ID,Unigene_ID,...,Probe_Chr_Orientation,Probe_Coordinates,Cytoband,Definition,Ontology_Component,Ontology_Process,Ontology_Function,Synonyms,Obsolete_Probe_Id,GB_ACC
0,ILMN_1725881,rp13_p1x6D80lNLk3c,Homo sapiens,RefSeq,ILMN_44919,ILMN_44919,LOC23117,XM_933824.1,XM_933824.1,,...,-,21766363-21766363:21769901-21769949,16p12.2a,"PREDICTED: Homo sapiens KIAA0220-like protein,...",,,,,,XM_933824.1
1,ILMN_1910180,NEX0oqCV8.er4HVfU4,Homo sapiens,Unigene,ILMN_127219,ILMN_127219,HS.575038,Hs.575038,,Hs.575038,...,,,,"Homo sapiens cDNA: FLJ21027 fis, clone CAE07110",,,,,,AK024680
2,ILMN_1804174,KyqQynMZxJcruyylEU,Homo sapiens,RefSeq,ILMN_139282,ILMN_139282,FCGR2B,XM_938851.1,XM_938851.1,,...,,,1q23.3b,"PREDICTED: Homo sapiens Fc fragment of IgG, lo...",,,,,,XM_938851.1
3,ILMN_1796063,xXl7eXuF7sbPEp.KFI,Homo sapiens,RefSeq,ILMN_5006,ILMN_5006,TRIM44,NM_017583.3,NM_017583.3,,...,+,35786070-35786119,11p13a,Homo sapiens tripartite motif-containing 44 (T...,intracellular [goid 5622] [evidence IEA],,zinc ion binding [goid 8270] [evidence IEA]; m...,MGC3490; MC7; HSA249128; DIPB,MGC3490; MC7; HSA249128; DIPB,NM_017583.3
4,ILMN_1811966,9ckqJrioiaej9_ajeQ,Homo sapiens,RefSeq,ILMN_38756,ILMN_38756,LOC653895,XM_936379.1,XM_936379.1,,...,,,10q11.23b,PREDICTED: Homo sapiens similar to protein ger...,,,,,,XM_936379.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
630979,ILMN_2371169,11.14937973,,,,,,,,,...,,,,,,,,,,
630980,ILMN_1701875,12.68724918,,,,,,,,,...,,,,,,,,,,
630981,ILMN_1786396,9.097121239,,,,,,,,,...,,,,,,,,,,
630982,ILMN_1653618,7.608833313,,,,,,,,,...,,,,,,,,,,


In [110]:
gene_annotation.columns

Index(['ID', 'nuID', 'Species', 'Source', 'Search_Key', 'Transcript',
       'ILMN_Gene', 'Source_Reference_ID', 'RefSeq_ID', 'Unigene_ID',
       'Entrez_Gene_ID', 'GI', 'Accession', 'Symbol', 'Protein_Product',
       'Array_Address_Id', 'Probe_Type', 'Probe_Start', 'SEQUENCE',
       'Chromosome', 'Probe_Chr_Orientation', 'Probe_Coordinates', 'Cytoband',
       'Definition', 'Ontology_Component', 'Ontology_Process',
       'Ontology_Function', 'Synonyms', 'Obsolete_Probe_Id', 'GB_ACC'],
      dtype='object')

In [111]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'Symbol'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)

In [112]:
genetic_data

Unnamed: 0_level_0,GSM616668,GSM616669,GSM616670,GSM616671,GSM616672,GSM616673,GSM616674,GSM616675,GSM616676,GSM616677,...,GSM616690,GSM616691,GSM616692,GSM616693,GSM616694,GSM616695,GSM616696,GSM616697,GSM616698,GSM616699
Gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
A26A1,6.964517,6.851497,6.821692,6.826913,6.788508,6.874905,6.896505,6.852490,6.828704,7.078653,...,7.063698,6.953264,7.004011,6.957736,6.984375,6.974978,6.945546,7.068300,6.788908,6.949239
AAAS,7.138664,7.288485,7.213284,7.165006,7.123239,7.024581,7.037256,6.882017,7.361308,7.100629,...,7.269105,7.292895,7.286026,7.194649,7.100249,7.406284,7.178009,7.214195,7.138752,7.372166
AACS,7.104400,7.329364,7.213284,7.033386,7.183104,7.352597,7.134465,6.929381,7.067596,7.140382,...,7.090643,7.198502,7.192792,7.197270,7.168820,7.046652,6.951822,7.075513,7.323013,7.038501
AACSL,7.033191,6.958621,6.788754,6.695595,7.003847,6.791738,6.751658,6.971259,6.715012,6.869732,...,6.825945,6.955751,6.838070,6.815473,6.796976,6.931289,6.823332,6.948809,6.959229,7.014102
AADACL1,7.121072,6.905127,7.059623,7.089688,6.907916,7.101289,7.101122,7.056745,7.134234,7.301244,...,7.143976,7.120851,6.888729,7.024350,7.141639,7.094998,7.081976,7.119085,7.075375,6.990618
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZZZ3,7.914881,7.396230,7.706191,7.630652,7.340271,7.637811,8.126432,7.835849,7.917731,7.631298,...,7.630288,7.499424,7.369628,7.676353,8.130350,7.288772,7.660359,7.366716,7.832213,7.455215
bA16L21.2.1,7.339529,7.211838,7.196417,7.031924,7.290146,7.293930,7.433933,7.527881,7.461130,7.567906,...,7.326622,7.134547,6.991170,7.081851,7.406407,6.949899,7.117732,6.900108,7.034636,6.967118
dJ222E13.2,7.573959,7.669524,7.750307,8.019474,7.682048,7.785646,7.746305,7.526654,7.781176,7.297188,...,7.184054,7.523621,8.238818,7.646125,7.379656,7.904087,7.627300,7.496995,7.870841,7.516465
dJ341D10.1,7.623617,7.090202,7.306702,7.393552,7.412091,7.578962,7.972756,8.398911,7.306329,7.674254,...,7.639648,7.459779,7.044920,7.447530,7.994782,7.202621,7.403683,7.309569,7.457038,7.588035


In [113]:
genetic_data = normalize_gene_symbols_in_index(genetic_data)

In [114]:
genetic_data

Unnamed: 0,GSM616668,GSM616669,GSM616670,GSM616671,GSM616672,GSM616673,GSM616674,GSM616675,GSM616676,GSM616677,...,GSM616690,GSM616691,GSM616692,GSM616693,GSM616694,GSM616695,GSM616696,GSM616697,GSM616698,GSM616699
AAAS,7.138664,7.288485,7.213284,7.165006,7.123239,7.024581,7.037256,6.882017,7.361308,7.100629,...,7.269105,7.292895,7.286026,7.194649,7.100249,7.406284,7.178009,7.214195,7.138752,7.372166
AACS,7.104400,7.329364,7.213284,7.033386,7.183104,7.352597,7.134465,6.929381,7.067596,7.140382,...,7.090643,7.198502,7.192792,7.197270,7.168820,7.046652,6.951822,7.075513,7.323013,7.038501
AAK1,8.125071,8.129137,8.741529,9.285954,8.532619,8.234013,8.562658,8.690536,7.937707,7.986583,...,7.925247,8.789428,9.701271,9.514471,9.337640,9.529326,9.779090,9.065612,9.901291,9.068633
AAMP,7.782783,8.029534,7.667971,7.552847,7.936913,7.741029,7.611396,7.966206,7.819900,7.863311,...,7.882847,7.834193,8.028064,7.736413,7.761745,8.097668,8.044268,8.274695,7.868418,7.934589
AARS2,7.587197,7.969007,7.796640,7.702655,7.887355,7.795298,7.696078,7.563307,8.145167,7.659180,...,7.877119,7.877955,8.552603,8.044010,7.658999,8.102704,8.030426,8.248636,7.996058,7.947954
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZXDC,7.378561,7.452878,7.461378,7.385134,7.467755,7.718804,7.296222,7.242961,7.242285,7.279282,...,7.569137,7.664972,7.229661,7.132043,7.310514,7.357571,7.227276,7.360502,7.446771,7.425161
ZYG11B,10.819175,11.302196,11.017737,10.724758,11.607532,11.321941,10.398137,10.223679,10.732550,11.002866,...,11.089995,11.514976,11.535714,10.719100,10.194335,11.138047,10.881106,10.925000,10.903630,11.353933
ZYX,11.578540,11.843250,11.460245,11.575239,12.079652,11.114155,10.822670,11.149511,11.332497,12.048542,...,12.249063,12.137544,11.776277,11.550189,11.181100,11.662707,11.189414,12.111736,11.342251,11.918314
ZZEF1,8.807526,8.832830,9.120240,8.602360,8.843867,9.354744,8.511277,8.494726,8.476937,8.467565,...,9.240523,9.445060,8.341862,8.362275,8.506950,9.161143,8.831561,8.752559,8.822759,9.097121


In [115]:
merged_data = geo_merge_clinical_genetic_data(selected_clinical_data, genetic_data)
# The preprocessing runs through, which means is_available should be True
is_available = True

In [116]:
merged_data

Unnamed: 0,Ankylosing Spondylitis,AAAS,AACS,AAK1,AAMP,AARS2,AARSD1,AASDH,AASDHPPT,AASS,...,ZSWIM5,ZSWIM6,ZW10,ZWILCH,ZXDB,ZXDC,ZYG11B,ZYX,ZZEF1,ZZZ3
GSM616668,1.0,7.138664,7.1044,8.125071,7.782783,7.587197,7.960803,7.524874,7.715052,6.83103,...,7.006623,9.150701,7.99491,7.820794,6.958386,7.378561,10.819175,11.57854,8.807526,7.914881
GSM616669,1.0,7.288485,7.329364,8.129137,8.029534,7.969007,8.500934,7.59868,7.167869,6.692567,...,6.884218,8.472918,8.275855,7.942832,6.977118,7.452878,11.302196,11.84325,8.83283,7.39623
GSM616670,1.0,7.213284,7.213284,8.741529,7.667971,7.79664,8.418613,7.633838,7.16268,6.661228,...,7.016038,8.628788,8.373534,7.972513,7.027813,7.461378,11.017737,11.460245,9.12024,7.706191
GSM616671,1.0,7.165006,7.033386,9.285954,7.552847,7.702655,8.364841,7.743772,7.285993,6.78848,...,7.097028,8.901087,7.959203,7.707246,6.907603,7.385134,10.724758,11.575239,8.60236,7.630652
GSM616672,1.0,7.123239,7.183104,8.532619,7.936913,7.887355,8.279008,7.696225,7.200697,6.693215,...,6.958621,8.599793,8.179619,7.543519,6.98067,7.467755,11.607532,12.079652,8.843867,7.340271
GSM616673,1.0,7.024581,7.352597,8.234013,7.741029,7.795298,8.448704,7.687576,7.496373,6.827851,...,7.0687,8.788205,8.206963,7.860533,7.144718,7.718804,11.321941,11.114155,9.354744,7.637811
GSM616674,1.0,7.037256,7.134465,8.562658,7.611396,7.696078,8.136888,7.667861,7.852657,6.651165,...,7.015444,8.687778,8.248844,7.886782,6.963159,7.296222,10.398137,10.82267,8.511277,8.126432
GSM616675,1.0,6.882017,6.929381,8.690536,7.966206,7.563307,7.77713,7.80942,7.672385,6.636449,...,6.788754,9.296139,7.68396,7.59461,6.8955,7.242961,10.223679,11.149511,8.494726,7.835849
GSM616676,1.0,7.361308,7.067596,7.937707,7.8199,8.145167,8.27772,7.96176,7.502574,6.642665,...,6.822207,8.4509,8.254313,7.997812,7.10077,7.242285,10.73255,11.332497,8.476937,7.917731
GSM616677,1.0,7.100629,7.140382,7.986583,7.863311,7.65918,7.916114,7.934813,7.337132,6.701745,...,6.763535,8.844864,8.011358,7.779078,6.895423,7.279282,11.002866,12.048542,8.467565,7.631298


In [117]:
print(f"The merged dataset contains {len(merged_data)} samples.")
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

The merged dataset contains 32 samples.
For the feature 'Ankylosing Spondylitis', the least common label is '1.0' with 16 occurrences. This represents 50.00% of the dataset.
The distribution of the feature 'Ankylosing Spondylitis' in this dataset is fine.



False

In [118]:
if is_available:
    save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data, note='')
else:
    save_cohort_info(cohort, JSON_PATH, is_available)
merged_data.head()
if not is_trait_biased:
    merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)

In [119]:
# Finished
cohort = accession_num = "GSE73754"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Sexual Dimorphism in the Th17 Signature of Ankylosing Spondylitis"
!Series_summary	"Male AS patients have an elevated Th17 cell frequency vs. female AS patients (Gracey et al, Arthritis and Rheumatology, 2015). This analysis was performed to further examine differences between male and female AS patients"
!Series_overall_design	"AS patients were compared to healthy controls (HC). For sex-specific anaylsis, three groups were compared: F-HC vs. M-HC, M-AS vs. M-HC and F-AS vs. F-HC. A one way ANOVA was performed to identify genes differentially regulated in male and female AS patients"


Unnamed: 0,!Sample_geo_accession,GSM1902130,GSM1902131,GSM1902132,GSM1902133,GSM1902134,GSM1902135,GSM1902136,GSM1902137,GSM1902138,...,GSM1902192,GSM1902193,GSM1902194,GSM1902195,GSM1902196,GSM1902197,GSM1902198,GSM1902199,GSM1902200,GSM1902201
0,!Sample_characteristics_ch1,Sex: Male,Sex: Male,Sex: Male,Sex: Male,Sex: Male,Sex: Male,Sex: Male,Sex: Male,Sex: Male,...,Sex: Female,Sex: Female,Sex: Female,Sex: Female,Sex: Female,Sex: Female,Sex: Female,Sex: Female,Sex: Female,Sex: Female
1,!Sample_characteristics_ch1,age (yr): 53,age (yr): 26,age (yr): 29,age (yr): 50,age (yr): 35,age (yr): 48,age (yr): 18,age (yr): 39,age (yr): 49,...,age (yr): 27,age (yr): 37,age (yr): 42,age (yr): 63,age (yr): 61,age (yr): 20,age (yr): 31,age (yr): 25,age (yr): 29,age (yr): 65
2,!Sample_characteristics_ch1,"hla-b27 (1=positive, 0=negative): 1","hla-b27 (1=positive, 0=negative): 1","hla-b27 (1=positive, 0=negative): 1","hla-b27 (1=positive, 0=negative): 1","hla-b27 (1=positive, 0=negative): 1","hla-b27 (1=positive, 0=negative): 0","hla-b27 (1=positive, 0=negative): 1","hla-b27 (1=positive, 0=negative): 1","hla-b27 (1=positive, 0=negative): 1",...,"hla-b27 (1=positive, 0=negative): unknown","hla-b27 (1=positive, 0=negative): unknown","hla-b27 (1=positive, 0=negative): unknown","hla-b27 (1=positive, 0=negative): unknown","hla-b27 (1=positive, 0=negative): unknown","hla-b27 (1=positive, 0=negative): unknown","hla-b27 (1=positive, 0=negative): unknown","hla-b27 (1=positive, 0=negative): unknown","hla-b27 (1=positive, 0=negative): unknown","hla-b27 (1=positive, 0=negative): unknown"
3,!Sample_characteristics_ch1,disease: Ankylosing Spondylitis,disease: Ankylosing Spondylitis,disease: Ankylosing Spondylitis,disease: Ankylosing Spondylitis,disease: Ankylosing Spondylitis,disease: Ankylosing Spondylitis,disease: Ankylosing Spondylitis,disease: Ankylosing Spondylitis,disease: Ankylosing Spondylitis,...,disease: healthy control,disease: healthy control,disease: healthy control,disease: healthy control,disease: healthy control,disease: healthy control,disease: healthy control,disease: healthy control,disease: healthy control,disease: healthy control


In [122]:
tumor_stage_row = clinical_data.iloc[3]
tumor_stage_row.unique()

array(['!Sample_characteristics_ch1', 'disease: Ankylosing Spondylitis',
       'disease: healthy control'], dtype=object)

In [123]:
is_gene_availabe = True
trait_row = 3
age_row = 1
gender_row = 0

trait_type = 'binary'

# Verify and use the functions generated by GPT

# 这个函数将组织类型（tissue type）转换为有关癫痫存在与否的二进制值。
# 它是基于特定的假设，即如果组织类型是“胰腺导管腺癌”（Pancreatic Ductal Adenocarcinoma），则认为癫痫存在（返回1）；否则，认为癫痫不存在（返回0）。
def convert_trait(tissue_type):
    """
    Convert tissue type to epilepsy presence (binary).
    Assuming epilepsy presence for 'Hippocampus' tissue.
    """
    if tissue_type == 'disease: Ankylosing Spondylitis':
        return 1  # Epilepsy present
    else:
        return 0  # Epilepsy not present

In [124]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()

  clinical_df = clinical_df.applymap(convert_fn)
  clinical_df = clinical_df.applymap(convert_fn)
  clinical_df = clinical_df.applymap(convert_fn)


Unnamed: 0,GSM1902130,GSM1902131,GSM1902132,GSM1902133,GSM1902134,GSM1902135,GSM1902136,GSM1902137,GSM1902138,GSM1902139,...,GSM1902192,GSM1902193,GSM1902194,GSM1902195,GSM1902196,GSM1902197,GSM1902198,GSM1902199,GSM1902200,GSM1902201
Ankylosing Spondylitis,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
Age,53,26,29,50,35,48,18,39,49,43,...,27,37,42,63,61,20,31,25,29,65
Gender,0,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1


In [125]:
genetic_data = get_genetic_data(matrix_file)
genetic_data

Unnamed: 0_level_0,GSM1902130,GSM1902131,GSM1902132,GSM1902133,GSM1902134,GSM1902135,GSM1902136,GSM1902137,GSM1902138,GSM1902139,...,GSM1902192,GSM1902193,GSM1902194,GSM1902195,GSM1902196,GSM1902197,GSM1902198,GSM1902199,GSM1902200,GSM1902201
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ILMN_1343291,14.723105,14.974196,14.875049,14.262832,14.380508,14.737535,14.994368,14.517866,14.751615,14.937985,...,14.546172,14.609772,14.502806,14.751615,15.046889,14.751615,14.890210,14.859273,14.937985,14.455832
ILMN_1343295,10.796875,10.545005,10.617579,10.663274,10.373557,11.070106,10.609179,10.614307,10.612894,10.263954,...,10.551885,10.677234,10.606567,10.688255,10.647923,10.527452,10.519320,10.395747,10.680664,10.368892
ILMN_1651199,6.625879,6.658650,6.614950,6.729367,6.670856,6.751337,6.658062,6.658457,6.644127,6.698253,...,6.663772,6.603572,6.649300,6.763689,6.677779,6.550622,6.693903,6.742800,6.584209,6.593124
ILMN_1651209,6.762562,6.775621,6.630290,6.822824,6.670856,6.713077,6.704106,6.854264,6.767462,6.659816,...,6.835929,6.729711,6.770642,6.811416,6.824817,6.736938,6.766212,6.765388,6.690326,6.719627
ILMN_1651210,6.800612,6.596698,6.613691,6.504107,6.708310,6.593606,6.674507,6.741474,6.683399,6.752126,...,6.612110,6.662521,6.708362,6.758281,6.690449,6.609686,6.743372,6.715278,6.601116,6.584874
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ILMN_3311170,6.598350,6.656326,6.745738,6.566161,6.572376,6.582429,6.613186,6.614191,6.634587,6.679710,...,6.665184,6.615608,6.665901,6.591221,6.563782,6.734294,6.627239,6.496727,6.703297,6.643118
ILMN_3311175,6.652280,6.639231,6.695448,6.540806,6.721645,6.662629,6.662562,6.680551,6.638602,6.637241,...,6.673474,6.650218,6.685569,6.740448,6.637933,6.638491,6.666933,6.588203,6.743888,6.641553
ILMN_3311180,6.864999,6.839016,6.642746,6.691449,6.716093,6.657982,6.662562,6.669022,6.671518,6.704380,...,6.770264,6.613962,6.695612,6.821868,6.648760,6.698223,6.615875,6.737687,6.725904,6.682509
ILMN_3311185,6.632191,6.655011,6.718722,6.732387,6.545948,6.675332,6.608312,6.635070,6.621255,6.705511,...,6.742424,6.655760,6.705492,6.636093,6.659603,6.696920,6.622859,6.716543,6.626361,6.704373


In [126]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

gene_annotation

{'ID': ['ILMN_1343048', 'ILMN_1343049', 'ILMN_1343050', 'ILMN_1343052', 'ILMN_1343059'], 'Species': [nan, nan, nan, nan, nan], 'Source': [nan, nan, nan, nan, nan], 'Search_Key': [nan, nan, nan, nan, nan], 'Transcript': [nan, nan, nan, nan, nan], 'ILMN_Gene': [nan, nan, nan, nan, nan], 'Source_Reference_ID': [nan, nan, nan, nan, nan], 'RefSeq_ID': [nan, nan, nan, nan, nan], 'Unigene_ID': [nan, nan, nan, nan, nan], 'Entrez_Gene_ID': [nan, nan, nan, nan, nan], 'GI': [nan, nan, nan, nan, nan], 'Accession': [nan, nan, nan, nan, nan], 'Symbol': ['phage_lambda_genome', 'phage_lambda_genome', 'phage_lambda_genome:low', 'phage_lambda_genome:low', 'thrB'], 'Protein_Product': [nan, nan, nan, nan, 'thrB'], 'Probe_Id': [nan, nan, nan, nan, nan], 'Array_Address_Id': [5090180.0, 6510136.0, 7560739.0, 1450438.0, 1240647.0], 'Probe_Type': [nan, nan, nan, nan, nan], 'Probe_Start': [nan, nan, nan, nan, nan], 'SEQUENCE': ['GAATAAAGAACAATCTGCTGATGATCCCTCCGTGGATCTGATTCGTGTAA', 'CCATGTGATACGAGGGCGCGTAGTTTGCA

Unnamed: 0,ID,Species,Source,Search_Key,Transcript,ILMN_Gene,Source_Reference_ID,RefSeq_ID,Unigene_ID,Entrez_Gene_ID,...,Probe_Chr_Orientation,Probe_Coordinates,Cytoband,Definition,Ontology_Component,Ontology_Process,Ontology_Function,Synonyms,Obsolete_Probe_Id,GB_ACC
0,ILMN_1343048,,,,,,,,,,...,,,,,,,,,,
1,ILMN_1343049,,,,,,,,,,...,,,,,,,,,,
2,ILMN_1343050,,,,,,,,,,...,,,,,,,,,,
3,ILMN_1343052,,,,,,,,,,...,,,,,,,,,,
4,ILMN_1343059,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3455430,ILMN_1908807,6.766003,0.17013,,,,,,,,...,,,,,,,,,,
3455431,ILMN_1701127,7.081023,0.0039,,,,,,,,...,,,,,,,,,,
3455432,ILMN_1751164,7.821392,0,,,,,,,,...,,,,,,,,,,
3455433,ILMN_3296040,6.7245464,0.27532,,,,,,,,...,,,,,,,,,,


In [127]:
gene_annotation.columns

Index(['ID', 'Species', 'Source', 'Search_Key', 'Transcript', 'ILMN_Gene',
       'Source_Reference_ID', 'RefSeq_ID', 'Unigene_ID', 'Entrez_Gene_ID',
       'GI', 'Accession', 'Symbol', 'Protein_Product', 'Probe_Id',
       'Array_Address_Id', 'Probe_Type', 'Probe_Start', 'SEQUENCE',
       'Chromosome', 'Probe_Chr_Orientation', 'Probe_Coordinates', 'Cytoband',
       'Definition', 'Ontology_Component', 'Ontology_Process',
       'Ontology_Function', 'Synonyms', 'Obsolete_Probe_Id', 'GB_ACC'],
      dtype='object')

In [128]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'Symbol'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)

In [129]:
genetic_data

Unnamed: 0_level_0,GSM1902130,GSM1902131,GSM1902132,GSM1902133,GSM1902134,GSM1902135,GSM1902136,GSM1902137,GSM1902138,GSM1902139,...,GSM1902192,GSM1902193,GSM1902194,GSM1902195,GSM1902196,GSM1902197,GSM1902198,GSM1902199,GSM1902200,GSM1902201
Gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1-Dec,6.583238,6.634187,6.736772,6.606144,6.592335,6.582429,6.727200,6.687715,6.815838,6.686119,...,6.730622,6.662521,6.536256,6.717677,6.567441,6.684505,6.635338,6.501363,6.668359,6.601340
1-Mar,7.541307,7.418936,7.723857,7.457889,8.063496,7.345468,7.918340,7.683813,7.517664,7.331091,...,7.727845,7.882471,7.646876,7.620931,7.521968,7.612260,7.677948,7.704807,7.747943,7.453524
10-Mar,6.637913,6.584239,6.664167,6.665900,6.645813,6.645117,6.615416,6.647268,6.654260,6.575992,...,6.565555,6.667528,6.646483,6.655234,6.561255,6.596798,6.652807,6.647284,6.658114,6.607802
11-Mar,6.803986,6.617751,6.677545,6.577254,6.725436,6.696657,6.768100,6.588689,6.622732,6.529279,...,6.641082,6.693208,6.656019,6.701447,6.622160,6.628750,6.712999,6.660842,6.612525,6.849221
2-Mar,7.515051,7.646111,7.551643,8.093898,7.642950,7.756180,7.660852,7.634819,8.171752,7.752842,...,7.408531,7.456631,7.849837,7.341353,7.283743,7.821662,7.364903,7.319059,7.650863,7.850871
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
dJ341D10.1,7.977261,8.056454,7.997424,8.412908,8.033030,7.717434,7.861945,8.140966,7.782730,7.924043,...,7.971788,8.230297,7.978352,7.938903,7.934387,7.668821,7.950748,7.760117,7.918305,7.816753
gm127,6.849804,6.806737,6.851868,6.970032,6.857639,6.631670,6.785138,6.753465,6.807225,6.670449,...,6.737225,6.884189,6.812922,6.720719,6.863619,6.722490,6.738746,6.873762,6.905868,6.845744
psiTPTE22,6.817363,6.632995,6.764498,6.673868,6.815591,6.803827,6.751432,6.836163,6.743703,6.660348,...,6.831514,6.734106,6.700657,6.713387,6.765148,6.750495,6.787045,6.755791,6.733667,6.791706
rab1c,6.741328,6.624208,6.761005,6.754579,6.889446,6.790038,6.850925,6.762543,6.914238,6.925648,...,6.799939,6.769505,6.811423,6.713453,6.852134,6.777274,6.651200,6.628250,6.637634,6.675023


In [130]:
genetic_data = normalize_gene_symbols_in_index(genetic_data)

In [131]:
genetic_data

Unnamed: 0,GSM1902130,GSM1902131,GSM1902132,GSM1902133,GSM1902134,GSM1902135,GSM1902136,GSM1902137,GSM1902138,GSM1902139,...,GSM1902192,GSM1902193,GSM1902194,GSM1902195,GSM1902196,GSM1902197,GSM1902198,GSM1902199,GSM1902200,GSM1902201
A1BG,6.721839,6.673086,6.712955,6.770714,6.712682,6.767250,6.680871,6.732433,6.645191,6.721483,...,6.663826,6.726559,6.688819,6.754805,6.705244,6.714713,6.752249,6.711358,6.759263,6.767227
A1CF,6.687158,6.624651,6.642954,6.664230,6.665765,6.677747,6.677909,6.644698,6.646873,6.668884,...,6.658412,6.649672,6.705482,6.721324,6.799274,6.621188,6.664272,6.739637,6.641758,6.708410
A2M,6.619832,6.542803,6.522699,6.633948,6.562571,6.568287,6.528718,6.529343,6.533412,6.462207,...,6.565053,6.575708,6.484352,6.526597,6.537614,6.520083,6.586989,6.711289,6.548265,6.591539
A2ML1,6.567780,6.488923,6.513175,6.599086,6.622063,6.612026,6.483853,6.690247,6.527108,6.492720,...,6.509235,6.595049,6.582683,6.524981,6.558844,6.676249,6.471403,6.618289,6.416780,6.467222
A3GALT2,6.614632,6.741219,6.615549,6.703858,6.693452,6.785090,6.611774,6.734135,6.758213,6.595013,...,6.613405,6.615288,6.633757,6.611420,6.658956,6.751886,6.724360,6.759662,6.663325,6.663378
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZYG11B,8.893455,9.275059,9.146032,8.504040,8.506646,9.501809,8.781733,8.942747,9.295257,8.939878,...,9.292432,8.983416,9.144604,8.913773,8.860433,8.674577,8.972839,8.776298,8.944420,8.943609
ZYX,10.224247,10.687603,10.355187,10.090020,10.490643,10.181591,10.305449,10.251164,10.484419,10.033065,...,9.807033,10.682264,10.716734,10.330274,10.143487,10.200066,10.003246,9.996454,10.237811,10.164364
ZZEF1,8.439836,8.507540,8.215377,8.177778,8.198018,8.288051,7.965330,8.336710,8.172971,8.184372,...,8.381381,8.294295,8.237680,8.304722,8.183370,8.010283,8.331811,8.261510,8.301020,8.169597
ZZZ3,7.975543,7.979056,7.773221,7.542429,7.725909,7.796597,7.886633,7.822113,7.877351,7.945854,...,7.869515,7.740685,7.638251,8.012779,7.999851,7.920978,7.926523,8.024448,7.776912,7.721161


In [132]:
merged_data = geo_merge_clinical_genetic_data(selected_clinical_data, genetic_data)
# The preprocessing runs through, which means is_available should be True
is_available = True

In [133]:
merged_data

Unnamed: 0,Ankylosing Spondylitis,Age,Gender,A1BG,A1CF,A2M,A2ML1,A3GALT2,A4GALT,A4GNT,...,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3,RAB1C
GSM1902130,1.0,53.0,0.0,6.721839,6.687158,6.619832,6.567780,6.614632,6.689058,6.681622,...,6.705402,6.664771,6.928788,6.942867,6.616599,8.893455,10.224247,8.439836,7.975543,6.741328
GSM1902131,1.0,26.0,0.0,6.673086,6.624651,6.542803,6.488923,6.741219,6.659898,6.792750,...,6.683924,6.626405,6.902658,6.945891,6.701322,9.275059,10.687603,8.507540,7.979056,6.624208
GSM1902132,1.0,29.0,0.0,6.712955,6.642954,6.522699,6.513175,6.615549,6.638833,6.765411,...,6.668817,6.638077,6.995390,6.998783,6.642820,9.146032,10.355187,8.215377,7.773221,6.761005
GSM1902133,1.0,50.0,0.0,6.770714,6.664230,6.633948,6.599086,6.703858,6.620400,7.051553,...,6.675158,6.668629,6.711156,7.044865,6.705459,8.504040,10.090020,8.177778,7.542429,6.754579
GSM1902134,1.0,35.0,0.0,6.712682,6.665765,6.562571,6.622063,6.693452,6.586829,6.802776,...,6.736839,6.653528,6.818604,6.877028,6.696736,8.506646,10.490643,8.198018,7.725909,6.889446
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GSM1902197,0.0,20.0,1.0,6.714713,6.621188,6.520083,6.676249,6.751886,6.580731,6.945715,...,6.731256,6.639304,6.856193,6.968547,6.762333,8.674577,10.200066,8.010283,7.920978,6.777274
GSM1902198,0.0,31.0,1.0,6.752249,6.664272,6.586989,6.471403,6.724360,6.556073,6.747668,...,6.723123,6.694954,6.942069,6.970328,6.651852,8.972839,10.003246,8.331811,7.926523,6.651200
GSM1902199,0.0,25.0,1.0,6.711358,6.739637,6.711289,6.618289,6.759662,6.596578,6.671845,...,6.684863,6.703465,6.790693,6.942059,6.648330,8.776298,9.996454,8.261510,8.024448,6.628250
GSM1902200,0.0,29.0,1.0,6.759263,6.641758,6.548265,6.416780,6.663325,6.565605,6.846892,...,6.649319,6.658996,6.859029,7.010749,6.641221,8.944420,10.237811,8.301020,7.776912,6.637634


In [134]:
print(f"The merged dataset contains {len(merged_data)} samples.")
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

The merged dataset contains 72 samples.
For the feature 'Ankylosing Spondylitis', the least common label is '0.0' with 20 occurrences. This represents 27.78% of the dataset.
The distribution of the feature 'Ankylosing Spondylitis' in this dataset is fine.

Quartiles for 'Age':
  25%: 28.75
  50% (Median): 41.5
  75%: 51.25
Min: 18.0
Max: 77.0
The distribution of the feature 'Age' in this dataset is fine.

For the feature 'Gender', the least common label is '1.0' with 35 occurrences. This represents 48.61% of the dataset.
The distribution of the feature 'Gender' in this dataset is fine.



False

In [135]:
if is_available:
    save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data, note='')
else:
    save_cohort_info(cohort, JSON_PATH, is_available)
merged_data.head()
if not is_trait_biased:
    merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)

In [136]:
# Stopped: How to convert trait? (control vs patient? IFN vs Do nothing?)
cohort = accession_num = "GSE11886"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Gene expression analysis of macrophages from ankylosing spondylitis patients reveals interferon-gamma dysregulation"
!Series_summary	"OBJECTIVE: To determine whether macrophages, a type of cell implicated in the pathogenesis of ankylosing spondylitis (AS), exhibit a characteristic gene expression pattern. METHODS: Macrophages were derived from the peripheral blood of 8 AS patients (median disease duration 13 years [range <1-43 years]) and 9 healthy control subjects over 7 days with the use of granulocyte-macrophage colony-stimulating factor. Cells were stimulated for 24 hours with interferon-gamma (IFNgamma; 100 units/ml), were left untreated for 24 hours, or were treated for 3 hours with lipopolysaccharide (LPS; 10 ng/ml). RNA was isolated and examined by microarray and real-time quantitative reverse transcription-polymerase chain reaction analysis. RESULTS: Microarray analysis revealed 198 probe sets detecting the differential expression of 141 unique genes in untreate

Unnamed: 0,!Sample_geo_accession,GSM300389,GSM300390,GSM300391,GSM300392,GSM300393,GSM300394,GSM300395,GSM300396,GSM300397,...,GSM300412,GSM300413,GSM300414,GSM300415,GSM300416,GSM300417,GSM300418,GSM300419,GSM300420,GSM300421
0,!Sample_characteristics_ch1,"Control #1, IFN treated","Control #5, IFN treated","Control #7, IFN treated","Control #2, IFN treated","Control #3, IFN treated","Control #6, IFN treated","Control #4, IFN treated","Control #8, IFN treated","Control #9, IFN treated",...,"Patient #6, IFN treated","Patient #7, IFN treated","Patient #2, Untreated","Patient #8, Untreated","Patient #4, Untreated","Patient #3, Untreated","Patient #5, Untreated","Patient #1, Untreated","Patient #6, Untreated","Patient #7, Untreated"


In [138]:
tumor_stage_row = clinical_data.iloc[0]
tumor_stage_row.unique()

array(['!Sample_characteristics_ch1', 'Control #1, IFN treated',
       'Control #5, IFN treated', 'Control #7, IFN treated',
       'Control #2, IFN treated', 'Control #3, IFN treated',
       'Control #6, IFN treated', 'Control #4, IFN treated',
       'Control #8, IFN treated', 'Control #9, IFN treated',
       'Control #1, Untreated', 'Control #5, Untreated',
       'Control #7, Untreated', 'Control #2, Untreated',
       'Control #3, Untreated', 'Control #6, Untreated',
       'Control #4, Untreated', 'Control #8, Untreated',
       'Control #9, Untreated', 'Patient #2, Untreated',
       'Patient 42, IFN treated', 'Patient #3, IFN treated',
       'Patient #5, IFN treated', 'Patient #1, IFN treated',
       'Patient #6, IFN treated', 'Patient #7, IFN treated',
       'Patient #8, Untreated', 'Patient #4, Untreated',
       'Patient #3, Untreated', 'Patient #5, Untreated',
       'Patient #1, Untreated', 'Patient #6, Untreated',
       'Patient #7, Untreated'], dtype=object)

### Initial filtering and clinical data preprocessing

In [39]:
import gzip

In [40]:
def line_generator(source, source_type):
    """Generator that yields lines from a file or a string.

    Parameters:
    - source: File path or string content.
    - source_type: 'file' or 'string'.
    """
    if source_type == 'file':
        with gzip.open(source, 'rt') as f:
            for line in f:
                yield line.strip()
    elif source_type == 'string':
        for line in source.split('\n'):
            yield line.strip()
    else:
        raise ValueError("source_type must be 'file' or 'string'")

In [41]:
from typing import Callable, Optional, List, Tuple, Union, Any
import pandas as pd

In [42]:
def filter_content_by_prefix(
    source: str,
    prefixes_a: List[str],
    prefixes_b: Optional[List[str]] = None,
    unselect: bool = False,
    source_type: str = 'file',
    return_df_a: bool = True,
    return_df_b: bool = True
) -> Tuple[Union[str, pd.DataFrame], Optional[Union[str, pd.DataFrame]]]:
    """
    Filters rows from a file or a list of strings based on specified prefixes.

    Parameters:
    - source (str): File path or string content to filter.
    - prefixes_a (List[str]): Primary list of prefixes to filter by.
    - prefixes_b (Optional[List[str]]): Optional secondary list of prefixes to filter by.
    - unselect (bool): If True, selects rows that do not start with the specified prefixes.
    - source_type (str): 'file' if source is a file path, 'string' if source is a string of text.
    - return_df_a (bool): If True, returns filtered content for prefixes_a as a pandas DataFrame.
    - return_df_b (bool): If True, and if prefixes_b is provided, returns filtered content for prefixes_b as a pandas DataFrame.

    Returns:
    - Tuple: A tuple where the first element is the filtered content for prefixes_a, and the second element is the filtered content for prefixes_b.
    """
    filtered_lines_a = []
    filtered_lines_b = []
    prefix_set_a = set(prefixes_a)
    if prefixes_b is not None:
        prefix_set_b = set(prefixes_b)

    # Use generator to get lines
    for line in line_generator(source, source_type):
        matched_a = any(line.startswith(prefix) for prefix in prefix_set_a)
        if matched_a != unselect:
            filtered_lines_a.append(line)
        if prefixes_b is not None:
            matched_b = any(line.startswith(prefix) for prefix in prefix_set_b)
            if matched_b != unselect:
                filtered_lines_b.append(line)

    filtered_content_a = '\n'.join(filtered_lines_a)
    if return_df_a:
        filtered_content_a = pd.read_csv(io.StringIO(filtered_content_a), delimiter='\t', low_memory=False, on_bad_lines='skip')
    filtered_content_b = None
    if filtered_lines_b:
        filtered_content_b = '\n'.join(filtered_lines_b)
        if return_df_b:
            filtered_content_b = pd.read_csv(io.StringIO(filtered_content_b), delimiter='\t', low_memory=False, on_bad_lines='skip')

    return filtered_content_a, filtered_content_b



In [43]:
def get_background_and_clinical_data(file_path,
                                     prefixes_a=['!Series_title', '!Series_summary', '!Series_overall_design'],
                                     prefixes_b=['!Sample_geo_accession', '!Sample_characteristics_ch1']):
    """Extract from a matrix file the background information about the dataset, and sample characteristics data"""
    background_info, clinical_data = filter_content_by_prefix(file_path, prefixes_a, prefixes_b, unselect=False,
                                                              source_type='file',
                                                              return_df_a=False, return_df_b=True)
    return background_info, clinical_data

In [45]:
from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

!Series_title	"Gene expression analysis of macrophages from ankylosing spondylitis patients reveals interferon-gamma dysregulation"
!Series_summary	"OBJECTIVE: To determine whether macrophages, a type of cell implicated in the pathogenesis of ankylosing spondylitis (AS), exhibit a characteristic gene expression pattern. METHODS: Macrophages were derived from the peripheral blood of 8 AS patients (median disease duration 13 years [range <1-43 years]) and 9 healthy control subjects over 7 days with the use of granulocyte-macrophage colony-stimulating factor. Cells were stimulated for 24 hours with interferon-gamma (IFNgamma; 100 units/ml), were left untreated for 24 hours, or were treated for 3 hours with lipopolysaccharide (LPS; 10 ng/ml). RNA was isolated and examined by microarray and real-time quantitative reverse transcription-polymerase chain reaction analysis. RESULTS: Microarray analysis revealed 198 probe sets detecting the differential expression of 141 unique genes in untreate

In [47]:
clinical_data.head()

Unnamed: 0,!Sample_geo_accession,GSM300389,GSM300390,GSM300391,GSM300392,GSM300393,GSM300394,GSM300395,GSM300396,GSM300397,...,GSM300412,GSM300413,GSM300414,GSM300415,GSM300416,GSM300417,GSM300418,GSM300419,GSM300420,GSM300421
0,!Sample_characteristics_ch1,"Control #1, IFN treated","Control #5, IFN treated","Control #7, IFN treated","Control #2, IFN treated","Control #3, IFN treated","Control #6, IFN treated","Control #4, IFN treated","Control #8, IFN treated","Control #9, IFN treated",...,"Patient #6, IFN treated","Patient #7, IFN treated","Patient #2, Untreated","Patient #8, Untreated","Patient #4, Untreated","Patient #3, Untreated","Patient #5, Untreated","Patient #1, Untreated","Patient #6, Untreated","Patient #7, Untreated"


In [48]:
def get_unique_values_by_row(dataframe, max_len=30):
    """
    Organize the unique values in each row of the given dataframe, to get a dictionary
    :param dataframe:
    :param max_len:
    :return:
    """
    if '!Sample_geo_accession' in dataframe.columns:
        dataframe = dataframe.drop(columns=['!Sample_geo_accession'])
    unique_values_dict = {}
    for index, row in dataframe.iterrows():
        unique_values = list(row.unique())[:max_len]
        unique_values_dict[index] = unique_values
    return unique_values_dict

In [49]:
clinical_data_unique = get_unique_values_by_row(clinical_data)
clinical_data_unique

{0: ['Control #1, IFN treated',
  'Control #5, IFN treated',
  'Control #7, IFN treated',
  'Control #2, IFN treated',
  'Control #3, IFN treated',
  'Control #6, IFN treated',
  'Control #4, IFN treated',
  'Control #8, IFN treated',
  'Control #9, IFN treated',
  'Control #1, Untreated',
  'Control #5, Untreated',
  'Control #7, Untreated',
  'Control #2, Untreated',
  'Control #3, Untreated',
  'Control #6, Untreated',
  'Control #4, Untreated',
  'Control #8, Untreated',
  'Control #9, Untreated',
  'Patient #2, Untreated',
  'Patient 42, IFN treated',
  'Patient #3, IFN treated',
  'Patient #5, IFN treated',
  'Patient #1, IFN treated',
  'Patient #6, IFN treated',
  'Patient #7, IFN treated',
  'Patient #8, Untreated',
  'Patient #4, Untreated',
  'Patient #3, Untreated',
  'Patient #5, Untreated',
  'Patient #1, Untreated']}

Analyze the metadata to determine data relevance and find ways to extract the clinical data.
Reference prompt:

In [50]:
f'''As a biomedical research team, we are selecting datasets to study the association between the human trait \'{TRAIT}\' and genetic factors, optionally considering the influence of age and gender. After searching the GEO database and parsing the matrix file of a series, we obtained background information and sample characteristics data. We will provide textual information about the dataset background, and a Python dictionary storing a list of unique values for each field of the sample characteristics data. Please carefully review the provided information and answer the following questions about this dataset:
1. Does this dataset contain gene expression data? (Note: Pure miRNA data is not suitable.)
2. For each of the traits \'{TRAIT}\', 'age', and 'gender', please address these points:
   (1) Is there human data available for this trait?
   (2) If so, identify the key in the sample characteristics dictionary where unique values of this trait is recorded. The key is an integer. The trait information might be explicitly recorded, or can be inferred from the field with some biomedical knowledge or understanding about the data collection process.
   (3) Choose an appropriate data type (either 'continuous' or 'binary') for each trait. Write a Python function to convert any given value of the trait to this data type. The function should handle inference about the trait value and convert unknown values to None.
   Name the functions 'convert_trait', 'convert_age', and 'convert_gender', respectively.

Background information about the dataset:
{background_info}

Sample characteristics dictionary (from "!Sample_characteristics_ch1", converted to a Python dictionary that stores the unique values for each field):
{clinical_data_unique}
'''

'As a biomedical research team, we are selecting datasets to study the association between the human trait \'Ankylosing Spondylitis\' and genetic factors, optionally considering the influence of age and gender. After searching the GEO database and parsing the matrix file of a series, we obtained background information and sample characteristics data. We will provide textual information about the dataset background, and a Python dictionary storing a list of unique values for each field of the sample characteristics data. Please carefully review the provided information and answer the following questions about this dataset:\n1. Does this dataset contain gene expression data? (Note: Pure miRNA data is not suitable.)\n2. For each of the traits \'Ankylosing Spondylitis\', \'age\', and \'gender\', please address these points:\n   (1) Is there human data available for this trait?\n   (2) If so, identify the key in the sample characteristics dictionary where unique values of this trait is reco

Understand and verify the answer from GPT, to assign values to the below variables. Assign None to the 'row_id' variables if relevant data row was not found.
Later we need to let GPT format its answer to automatically do these. But given the complexity of this step, let's grow some insight from the free-text answers for now.

In [51]:
age_row = gender_row = None
convert_age = convert_gender = None

In [52]:
is_gene_availabe = True
trait_row = 0
age_row = None
gender_row = None

trait_type = 'binary'

In [53]:
is_available = is_gene_availabe and (trait_row is not None)
if not is_available:
    save_cohort_info(cohort, JSON_PATH, is_available)
    print("This cohort is not usable. Please skip the following steps and jump to the next accession number.")

In [54]:
# Verify and use the functions generated by GPT

def convert_trait(tissue_type):
    """
    Convert tissue type to epilepsy presence (binary).
    Assuming epilepsy presence for 'Hippocampus' tissue.
    """
    if tissue_type == 'Control #1, IFN treated' or tissue_type == 'Control #5, IFN treated'or tissue_type == 'Control #7, IFN treated'or tissue_type == 'Control #2, IFN treated'or tissue_type == 'Control #3, IFN treated'or tissue_type == 'Control #6, IFN treated'or tissue_type == 'Control #4, IFN treated'or tissue_type == 'Control #8, IFN treated'or tissue_type == 'Control #9, IFN treated'or tissue_type == 'Control 42, IFN treated':
        return 1  # Epilepsy present
    else:
        return 0  # Epilepsy not present


def convert_age(age_string):
    """
    Convert age string to a continuous numerical value.
    Unknown values are converted to None.
    """
    if age_string.lower() == 'n.a.':
        return None
    try:
        # Extract age as an integer from the string
        age = int(age_string.split(': ')[1])
        return age
    except (ValueError, IndexError):
        # In case of any format error or unexpected string structure
        return None


# It sometimes maps 'female' to 0, and sometimes 1. Does it matter?
def convert_gender(gender_string):
    """
    Convert gender string to a binary value.
    'female' is represented as 1, 'male' as 0.
    Unknown values are converted to None.
    """
    if gender_string.lower() == 'sex: female':
        return 1
    elif gender_string.lower() == 'sex: male':
        return 0
    else:
        return None

In [55]:
def get_feature_data(clinical_df, row_id, feature, convert_fn):
    """select the row corresponding to a feature in the sample characteristics dataframe, and convert the feature into
    a binary or continuous variable"""
    clinical_df = clinical_df.iloc[row_id:row_id + 1].drop(columns=['!Sample_geo_accession'], errors='ignore')
    clinical_df.index = [feature]
    clinical_df = clinical_df.applymap(convert_fn)

    return clinical_df

In [56]:
def geo_select_clinical_features(clinical_df: pd.DataFrame, trait: str, trait_row: int,
                                 convert_trait: Callable,
                                 age_row: Optional[int] = None,
                                 convert_age: Optional[Callable] = None,
                                 gender_row: Optional[int] = None,
                                 convert_gender: Optional[Callable] = None) -> pd.DataFrame:
    """
    Extracts and processes specific clinical features from a DataFrame representing
    sample characteristics in the GEO database series.

    Parameters:
    - clinical_df (pd.DataFrame): DataFrame containing clinical data.
    - trait (str): The trait of interest.
    - trait_row (int): Row identifier for the trait in the DataFrame.
    - convert_trait (Callable): Function to convert trait data into a desired format.
    - age_row (int, optional): Row identifier for age data. Default is None.
    - convert_age (Callable, optional): Function to convert age data. Default is None.
    - gender_row (int, optional): Row identifier for gender data. Default is None.
    - convert_gender (Callable, optional): Function to convert gender data. Default is None.

    Returns:
    pd.DataFrame: A DataFrame containing the selected and processed clinical features.
    """
    feature_list = []

    trait_data = get_feature_data(clinical_df, trait_row, trait, convert_trait)
    feature_list.append(trait_data)
    if age_row is not None:
        age_data = get_feature_data(clinical_df, age_row, 'Age', convert_age)
        feature_list.append(age_data)
    if gender_row is not None:
        gender_data = get_feature_data(clinical_df, gender_row, 'Gender', convert_gender)
        feature_list.append(gender_data)

    selected_clinical_df = pd.concat(feature_list, axis=0)
    return selected_clinical_df

In [57]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()

  clinical_df = clinical_df.applymap(convert_fn)


Unnamed: 0,GSM300389,GSM300390,GSM300391,GSM300392,GSM300393,GSM300394,GSM300395,GSM300396,GSM300397,GSM300398,...,GSM300412,GSM300413,GSM300414,GSM300415,GSM300416,GSM300417,GSM300418,GSM300419,GSM300420,GSM300421
Ankylosing Spondylitis,1,1,1,1,1,1,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0


### Genetic data preprocessing and final filtering

In [58]:
def get_genetic_data(file_path):
    """Read the gene expression data into a dataframe, and adjust its format"""
    genetic_data = pd.read_csv(file_path, compression='gzip', skiprows=52, comment='!', delimiter='\t')
    genetic_data = genetic_data.dropna()
    genetic_data = genetic_data.rename(columns={'ID_REF': 'ID'}).astype({'ID': 'str'})
    genetic_data.set_index('ID', inplace=True)

    return genetic_data


In [59]:
genetic_data = get_genetic_data(matrix_file)
genetic_data.head()

Unnamed: 0_level_0,GSM300389,GSM300390,GSM300391,GSM300392,GSM300393,GSM300394,GSM300395,GSM300396,GSM300397,GSM300398,...,GSM300412,GSM300413,GSM300414,GSM300415,GSM300416,GSM300417,GSM300418,GSM300419,GSM300420,GSM300421
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1007_s_at,0.315248,0.11215,0.070344,0.248315,0.020277,-0.150602,0.060186,-0.001738,0.139716,0.103189,...,0.07653,-0.004617,0.0,0.154255,-0.010868,-0.270209,-0.05573,0.071297,0.110618,-0.171019
1053_at,-0.389087,-0.814719,-0.879279,-0.310306,-0.975432,-0.997373,-0.852638,-0.490865,-0.105638,0.232449,...,-0.721043,-0.932037,0.576449,0.601156,0.393908,0.478886,-0.25964,-0.210929,-0.120008,0.164282
117_at,0.08038,0.006952,0.128363,-0.112075,0.272227,0.141575,0.937242,0.018661,0.091574,-0.058224,...,-0.159909,0.0211,0.034181,0.0,0.004055,-0.16602,-0.105684,-0.039442,-0.310866,-0.257435
121_at,0.15904,-0.029855,-0.065653,-0.014381,-0.015624,0.008447,-0.228561,-0.146182,0.123858,0.208845,...,-0.204799,-0.251754,-0.09548,0.24511,0.043694,-0.265528,0.12426,-0.07581,0.0,-0.365727
1255_g_at,0.240606,0.073325,0.142995,0.058,0.006382,0.239335,0.062448,0.221553,0.279908,0.0,...,-0.003346,0.267826,0.149425,0.106815,0.047712,0.035424,0.04232,-0.025099,-0.018436,-0.038872


In [60]:
gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

['1007_s_at',
 '1053_at',
 '117_at',
 '121_at',
 '1255_g_at',
 '1294_at',
 '1316_at',
 '1320_at',
 '1405_i_at',
 '1431_at',
 '1438_at',
 '1487_at',
 '1494_f_at',
 '1552256_a_at',
 '1552257_a_at',
 '1552258_at',
 '1552261_at',
 '1552263_at',
 '1552264_a_at',
 '1552266_at']

Check if the gene dataset requires mapping to get the gene symbols corresponding to each data row.

Reference prompt:

In [61]:
f'''
Below are the row headers of a gene expression dataset in GEO. Based on your biomedical knowledge, are they human gene symbols, or are they some other identifiers that need to be mapped to gene symbols? Your answer should be concluded by starting a new line and strictly following this format:
requires_gene_mapping = (True or False)

Row headers:
{gene_row_ids}
'''

"\nBelow are the row headers of a gene expression dataset in GEO. Based on your biomedical knowledge, are they human gene symbols, or are they some other identifiers that need to be mapped to gene symbols? Your answer should be concluded by starting a new line and strictly following this format:\nrequires_gene_mapping = (True or False)\n\nRow headers:\n['1007_s_at', '1053_at', '117_at', '121_at', '1255_g_at', '1294_at', '1316_at', '1320_at', '1405_i_at', '1431_at', '1438_at', '1487_at', '1494_f_at', '1552256_a_at', '1552257_a_at', '1552258_at', '1552261_at', '1552263_at', '1552264_a_at', '1552266_at']\n"


If not required, jump directly to the gene normalization step

In [62]:
requires_gene_mapping = False

In [63]:
if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

Observe the first few cells in the ID column of the gene annotation dataframe, to find the names of columns that store the gene probe IDs and gene symbols respectively.
Reference prompt:

In [64]:
if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

In [65]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'UCSC_RefGene_Name'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)

In [66]:
def normalize_gene_symbols_in_index(gene_df):
    """Normalize the human gene symbols at the index of a dataframe, and replace the index with its normalized version.
    Remove the rows where the index failed to be normalized."""
    normalized_gene_list = normalize_gene_symbols(gene_df.index.tolist())
    assert len(normalized_gene_list) == len(gene_df.index)
    gene_df.index = normalized_gene_list
    gene_df = gene_df[gene_df.index.notnull()]
    return gene_df

In [67]:
def normalize_gene_symbols(gene_symbols, batch_size=1000):
    """Normalize human gene symbols in batches using the 'mygenes' library"""
    mg = mygene.MyGeneInfo()
    normalized_genes = {}

    # Process in batches
    for i in range(0, len(gene_symbols), batch_size):
        batch = gene_symbols[i:i + batch_size]
        results = mg.querymany(batch, scopes='symbol', fields='symbol', species='human')

        # Update the normalized_genes dictionary with results from this batch
        for gene in results:
            normalized_genes[gene['query']] = gene.get('symbol', None)

    # Return the normalized symbols in the same order as the input
    return [normalized_genes.get(symbol) for symbol in gene_symbols]

In [68]:
import mygene

In [69]:
if NORMALIZE_GENE:
    genetic_data = normalize_gene_symbols_in_index(genetic_data)

1000 input query terms found no hit:	['1007_s_at', '1053_at', '117_at', '121_at', '1255_g_at', '1294_at', '1316_at', '1320_at', '1405_i_a
1000 input query terms found no hit:	['1553625_at', '1553626_a_at', '1553627_s_at', '1553629_a_at', '1553630_at', '1553633_s_at', '155363
1000 input query terms found no hit:	['1555009_a_at', '1555011_at', '1555014_x_at', '1555015_a_at', '1555016_at', '1555018_at', '1555019_
1000 input query terms found no hit:	['1556425_a_at', '1556426_at', '1556427_s_at', '1556429_a_at', '1556432_at', '1556434_at', '1556435_
1000 input query terms found no hit:	['1557981_at', '1557984_s_at', '1557985_s_at', '1557986_s_at', '1557987_at', '1557991_at', '1557993_
1000 input query terms found no hit:	['1559795_at', '1559796_at', '1559800_a_at', '1559804_at', '1559806_at', '1559807_at', '1559808_at',
1000 input query terms found no hit:	['1561429_a_at', '1561430_s_at', '1561431_at', '1561432_at', '1561433_at', '1561434_at', '1561436_at
1000 input query terms found no hi

In [70]:
def geo_merge_clinical_genetic_data(clinical_df, genetic_df):
    """
    Merge the clinical features and gene expression features from two dataframes into one dataframe
    """
    if 'ID' in genetic_df.columns:
        genetic_df = genetic_df.rename(columns={'ID': 'Gene'})
    if 'Gene' in genetic_df.columns:
        genetic_df = genetic_df.set_index('Gene')
    merged_data = pd.concat([clinical_df, genetic_df], axis=0).T.dropna()
    return merged_data


In [71]:
merged_data = geo_merge_clinical_genetic_data(selected_clinical_data, genetic_data)
# The preprocessing runs through, which means is_available should be True
is_available = True

In [72]:
print(f"The merged dataset contains {len(merged_data)} samples.")

The merged dataset contains 33 samples.


In [73]:
def judge_and_remove_biased_features(df, trait, trait_type):
    assert trait_type in ["binary", "continuous"], f"The trait must be either a binary or a continuous variable!"
    if trait_type == "binary":
        trait_biased = judge_binary_variable_biased(df, trait)
    else:
        trait_biased = judge_continuous_variable_biased(df, trait)
    if trait_biased:
        print(f"The distribution of the feature \'{trait}\' in this dataset is severely biased.\n")
    else:
        print(f"The distribution of the feature \'{trait}\' in this dataset is fine.\n")
    if "Age" in df.columns:
        age_biased = judge_continuous_variable_biased(df, 'Age')
        if age_biased:
            print(f"The distribution of the feature \'Age\' in this dataset is severely biased.\n")
            df = df.drop(columns='Age')
        else:
            print(f"The distribution of the feature \'Age\' in this dataset is fine.\n")
    if "Gender" in df.columns:
        gender_biased = judge_binary_variable_biased(df, 'Gender')
        if gender_biased:
            print(f"The distribution of the feature \'Gender\' in this dataset is severely biased.\n")
            df = df.drop(columns='Gender')
        else:
            print(f"The distribution of the feature \'Gender\' in this dataset is fine.\n")

    return trait_biased, df


In [74]:
def judge_binary_variable_biased(dataframe, col_name, min_proportion=0.1, min_num=5):
    """
    Check if the distribution of a binary variable in the dataset is too biased to be usable for analysis
    :param dataframe:
    :param col_name:
    :param min_proportion:
    :param min_num:
    :return:
    """
    label_counter = dataframe[col_name].value_counts()
    total_samples = len(dataframe)
    rare_label_num = label_counter.min()
    rare_label = label_counter.idxmin()
    rare_label_proportion = rare_label_num / total_samples

    print(
        f"For the feature \'{col_name}\', the least common label is '{rare_label}' with {rare_label_num} occurrences. This represents {rare_label_proportion:.2%} of the dataset.")

    biased = (len(label_counter) < 2) or ((rare_label_proportion < min_proportion) and (rare_label_num < min_num))
    return bool(biased)

In [75]:
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

For the feature 'Ankylosing Spondylitis', the least common label is '1.0' with 9 occurrences. This represents 27.27% of the dataset.
The distribution of the feature 'Ankylosing Spondylitis' in this dataset is fine.



False

In [76]:
def save_cohort_info(cohort: str, info_path: str, is_available: bool, is_biased: Optional[bool] = None,
                     df: Optional[pd.DataFrame] = None, note: str = '') -> None:
    """
    Add or update information about the usability and quality of a dataset for statistical analysis.

    Parameters:
    cohort (str): A unique identifier for the dataset.
    info_path (str): File path to the JSON file where records are stored.
    is_available (bool): Indicates whether both the genetic data and trait data are available in the dataset, and can be
     preprocessed into a dataframe.
    is_biased (bool, optional): Indicates whether the dataset is too biased to be usable.
        Required if `is_available` is True.
    df (pandas.DataFrame, optional): The preprocessed dataset. Required if `is_available` is True.
    note (str, optional): Additional notes about the dataset.

    Returns:
    None: The function does not return a value but updates or creates a record in the specified JSON file.
    """
    if is_available:
        assert (df is not None) and (is_biased is not None), "'df' and 'is_biased' should be provided if this cohort " \
                                                             "is relevant."
    is_usable = is_available and (not is_biased)
    new_record = {"is_usable": is_usable,
                  "is_available": is_available,
                  "is_biased": is_biased if is_available else None,
                  "has_age": "Age" in df.columns if is_available else None,
                  "has_gender": "Gender" in df.columns if is_available else None,
                  "sample_size": len(df) if is_available else None,
                  "note": note}
    
    if not os.path.exists(info_path):
        with open(info_path, 'w') as file:
            json.dump({}, file)
        print(f"A new JSON file was created at: {info_path}")

    with open(info_path, "r") as file:
        records = json.load(file)
    records[cohort] = new_record

    temp_path = info_path + ".tmp"
    try:
        with open(temp_path, 'w') as file:
            json.dump(records, file)
        os.replace(temp_path, info_path)

    except Exception as e:
        print(f"An error occurred: {e}")
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise

In [77]:
import json
if is_available:
    save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data, note='')
else:
    save_cohort_info(cohort, JSON_PATH, is_available)

In [78]:
merged_data.head()
if not is_trait_biased:
    merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)

### 3. Do regression & Cross Validation

In [79]:
def read_json_to_dataframe(json_file: str) -> pd.DataFrame:
    """
    Reads a JSON file and converts it into a pandas DataFrame.

    Args:
    json_file (str): The path to the JSON file containing the data.

    Returns:
    DataFrame: A pandas DataFrame with the JSON data.
    """
    with open(json_file, 'r') as file:
        data = json.load(file)
    return pd.DataFrame.from_dict(data, orient='index').reset_index().rename(columns={'index': 'cohort_id'})

In [80]:
def filter_and_rank_cohorts(json_file: str, condition: Union[str, None] = None) -> Tuple[
    Union[str, None], pd.DataFrame]:
    """
    Reads a JSON file, filters cohorts based on usability and an optional condition, then ranks them by sample size.

    Args:
    json_file (str): The path to the JSON file containing the data.
    condition (str, optional): An additional condition for filtering. If None, only 'is_usable' is considered.

    Returns:
    Tuple: A tuple containing the best cohort ID (str or None if no suitable cohort is found) and
           the filtered and ranked DataFrame.
    """
    # Read the JSON file into a DataFrame
    df = read_json_to_dataframe(json_file)

    if condition:
        filtered_df = df[(df['is_usable'] == True) & (df[condition] == True)]
    else:
        filtered_df = df[df['is_usable'] == True]

    ranked_df = filtered_df.sort_values(by='sample_size', ascending=False)
    best_cohort_id = ranked_df.iloc[0]['cohort_id'] if not ranked_df.empty else None

    return best_cohort_id, ranked_df


In [81]:
# Check the information of usable cohorts
best_cohort, ranked_df = filter_and_rank_cohorts(JSON_PATH)
ranked_df

Unnamed: 0,cohort_id,is_usable,is_available,is_biased,has_age,has_gender,sample_size,note
1,GSE11886,True,True,False,False,False,33,


In [82]:
# If both age and gender have available cohorts, select 'age' as the condition.
condition = 'Age'
filter_column = 'has_' + condition.lower()

condition_best_cohort, condition_ranked_df = filter_and_rank_cohorts(JSON_PATH, filter_column)
condition_best_cohort

In [83]:
condition_ranked_df.head()

Unnamed: 0,cohort_id,is_usable,is_available,is_biased,has_age,has_gender,sample_size,note


In [84]:
merged_data = pd.read_csv(os.path.join(OUTPUT_DIR, condition_best_cohort + '.csv'))
merged_data.head()

TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

In [None]:
# Remove the other condition to prevent interference.
merged_data = merged_data.drop(columns=['Gender'], errors='ignore').astype('float')

X = merged_data.drop(columns=[TRAIT, condition]).values
Y = merged_data[TRAIT].values
Z = merged_data[condition].values

Select the appropriate regression model depending on whether the dataset shows batch effect.

In [None]:
has_batch_effect = detect_batch_effect(X)
has_batch_effect

In [None]:
# Select appropriate models based on whether the dataset has batch effect.
# We experiment on two models for each branch. We will decide which one to choose later.

if has_batch_effect:
    model_constructor1 = VariableSelection
    model_params1 = {'modified': True, 'lamda': 3e-4}
    model_constructor2 = VariableSelection
    model_params2 = {'modified': False}
else:
    model_constructor1 = Lasso
    model_params1 = {'alpha': 1.0, 'random_state': 42}
    model_constructor2 = VariableSelection
    model_params2 = {'modified': False}

In [None]:
trait_type = 'binary'  # Remember to set this properly, either 'binary' or 'continuous'
cv_mean1, cv_std1 = cross_validation(X, Y, Z, model_constructor1, model_params1, target_type=trait_type)

In [None]:
cv_mean2, cv_std2 = cross_validation(X, Y, Z, model_constructor2, model_params2, target_type=trait_type)

In [None]:
normalized_X, _ = normalize_data(X)
normalized_Z, _ = normalize_data(Z)

# Train regression model on the whole dataset to identify significant genes
model1 = ResidualizationRegressor(model_constructor1, model_params1)
model1.fit(normalized_X, Y, normalized_Z)

model2 = ResidualizationRegressor(model_constructor2, model_params2)
model2.fit(normalized_X, Y, normalized_Z)

### 4. Discussion and report

In [None]:
feature_cols = merged_data.columns.tolist()
feature_cols.remove(TRAIT)

threshold = 0.05
interpret_result(model1, feature_cols, TRAIT, condition, threshold=threshold, save_output=True,
                 output_dir=OUTPUT_DIR, model_id=1)

In [None]:
interpret_result(model2, feature_cols, TRAIT, condition, threshold=threshold, save_output=True,
                 output_dir=OUTPUT_DIR, model_id=2)