# Gold standard curation: Preprocessing and single-step regression

In this stage of gold standard curation, we perform data preprocessing, cohort selection, and single-step regression for the 153 traits in our question set. This file walks through the reference steps using the trait "Anxiety disorder" (the value of `TRAIT` in the setup cell) as an example. The workflow consists of the following steps:

1. Preprocess all the cohorts related to this trait. Convert each cohort to tabular form and save it to a CSV file whose columns are the genetic factors, the trait, and, if available, age and gender;
2. If at least one cohort has age or gender information, run a regression analysis with the genetic features together with age or gender as the regressors.


# 1. Basic setup

In [3]:
import os
import sys

sys.path.append('..')
from utils import *

# Set your preferred name
USER = "Jiayi"
# Set the data and output directories
DATA_ROOT = '/Users/legion/Desktop/Courses/IS389/data'   
OUTPUT_ROOT = '/Users/legion/Desktop/Courses/IS389/output'
TRAIT = 'Anxiety disorder'

OUTPUT_DIR = os.path.join(OUTPUT_ROOT, USER, '-'.join(TRAIT.split()))
JSON_PATH = os.path.join(OUTPUT_DIR, "cohort_info.json")
# exist_ok=True already makes this a no-op when the directory exists
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Gene symbol normalization may take 1-2 minutes. You may set it to False for debugging.
NORMALIZE_GENE = True

In [None]:
# This cell is only for use on Google Colab. Skip it if you run your code in other environments

"""import os
from google.colab import drive

drive.mount('/content/drive', force_remount=True)
proj_dir = '/content/drive/MyDrive/AI4Science_Public'
os.chdir(proj_dir)"""

# 2. Data preprocessing and selection

## 2.1. The TCGA Xena dataset

In TCGA Xena, there is either zero or one cohort related to the trait. We search the names of subdirectories to see if any matches the trait. If a match is found, we directly obtain the file paths.

In [4]:
dataset = 'TCGA'
dataset_dir = os.path.join(DATA_ROOT, dataset)
os.listdir(dataset_dir)[:10]

['TCGA_Adrenocortical_Cancer_(ACC)',
 'TCGA_Breast_Cancer_(BRCA)',
 'TCGA_Kidney_Papillary_Cell_Carcinoma_(KIRP)']

If no match is found, skip directly to the GEO dataset in Section 2.2.
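The subdirectory search described above can be sketched as follows. The helper name `find_trait_subdir` and its word-matching rule are assumptions made for illustration; they are not part of `utils`, and a match found this way may still need manual confirmation.

```python
import re

def find_trait_subdir(trait, subdirs):
    """Return the first subdirectory whose name contains every word of the trait, else None."""
    trait_words = [w.lower() for w in re.findall(r"[A-Za-z0-9]+", trait)]
    for name in sorted(subdirs):
        # Split the directory name on underscores, parentheses, and other separators
        name_words = {w.lower() for w in re.findall(r"[A-Za-z0-9]+", name)}
        if trait_words and all(w in name_words for w in trait_words):
            return name
    return None

subdirs = ['TCGA_Adrenocortical_Cancer_(ACC)', 'TCGA_Breast_Cancer_(BRCA)']
print(find_trait_subdir('Breast Cancer', subdirs))
print(find_trait_subdir('Anxiety disorder', subdirs))
```

The second call returns `None`, which corresponds to the case where we skip TCGA Xena and go to GEO.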

In [5]:
trait_subdir = "TCGA_Adrenocortical_Cancer_(ACC)"
cohort = 'Xena'
# All the cancer traits in Xena are binary
trait_type = 'binary'
# Once a relevant cohort is found in Xena, we can generally assume the gene and clinical data are available
is_available = True

clinical_data_file = os.path.join(dataset_dir, trait_subdir, 'TCGA.ACC.sampleMap_ACC_clinicalMatrix')
genetic_data_file = os.path.join(dataset_dir, trait_subdir, 'TCGA.ACC.sampleMap_HiSeqV2_PANCAN.gz')

In [6]:
import pandas as pd

clinical_data = pd.read_csv(clinical_data_file, sep='\t', index_col=0)
genetic_data = pd.read_csv(genetic_data_file, compression='gzip', sep='\t', index_col=0)
age_col = gender_col = None

In [7]:
def check_rows_and_columns(dataframe, display=False):
    """
    Get the lists of row names and column names of a dataset, and optionally observe them.
    :param dataframe:
    :param display:
    :return:
    """
    dataframe_rows = dataframe.index.tolist()
    if display:
        print(f"The dataset has {len(dataframe_rows)} rows, such as {dataframe_rows[:20]}")
    dataframe_cols = dataframe.columns.tolist()
    if display:
        print(f"\nThe dataset has {len(dataframe_cols)} columns, such as {dataframe_cols[:20]}")
    return dataframe_rows, dataframe_cols

In [8]:
_, clinical_data_cols = check_rows_and_columns(clinical_data)
clinical_data_cols[:10]

['_INTEGRATION',
 '_PATIENT',
 '_cohort',
 '_primary_disease',
 '_primary_site',
 'additional_pharmaceutical_therapy',
 'additional_radiation_therapy',
 'age_at_initial_pathologic_diagnosis',
 'atypical_mitotic_figures',
 'bcr_followup_barcode']

Read all the column names in the clinical dataset to find the columns that record age or gender information.
Reference prompt:

In [9]:
f'''
Below is a list of column names from a biomedical dataset. Please examine it and identify the columns that are likely to contain information about patients' age. Additionally, please do the same for columns that may hold data on patients' gender. Please provide your answer by strictly following this format, without redundant words:
candidate_age_cols = [col_name1, col_name2, ...]
candidate_gender_cols = [col_name1, col_name2, ...]
If no columns match a criterion, please provide an empty list.

Column names:
{clinical_data_cols}
'''

"\nBelow is a list of column names from a biomedical dataset. Please examine it and identify the columns that are likely to contain information about patients' age. Additionally, please do the same for columns that may hold data on patients' gender. Please provide your answer by strictly following this format, without redundant words:\ncandidate_age_cols = [col_name1, col_name2, ...]\ncandidate_gender_cols = [col_name1, col_name2, ...]\nIf no columns match a criterion, please provide an empty list.\n\nColumn names:\n['_INTEGRATION', '_PATIENT', '_cohort', '_primary_disease', '_primary_site', 'additional_pharmaceutical_therapy', 'additional_radiation_therapy', 'age_at_initial_pathologic_diagnosis', 'atypical_mitotic_figures', 'bcr_followup_barcode', 'bcr_patient_barcode', 'bcr_sample_barcode', 'clinical_M', 'ct_scan_findings', 'cytoplasm_presence_less_than_equal_25_percent', 'days_to_birth', 'days_to_collection', 'days_to_death', 'days_to_initial_pathologic_diagnosis', 'days_to_last_fol

In [10]:
candidate_age_cols = [ 'age_at_initial_pathologic_diagnosis',
                      'days_to_birth', 'year_of_initial_pathologic_diagnosis']
candidate_gender_cols = [ 'gender']
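In the cell above the candidate lists are entered by hand. If the reply is instead parsed programmatically, a minimal sketch could look like the following. The function name `parse_candidate_cols` is hypothetical, and the parser assumes the reply follows the requested format and that column names contain no commas.

```python
import re

def parse_candidate_cols(reply, var_name):
    """Extract the list assigned to `var_name` from a reply in the requested format."""
    match = re.search(rf"{re.escape(var_name)}\s*=\s*\[(.*?)\]", reply, flags=re.DOTALL)
    if not match:
        return []
    # Accept both quoted and unquoted column names
    items = [item.strip().strip("'\"") for item in match.group(1).split(",")]
    return [item for item in items if item]

reply = """candidate_age_cols = ['age_at_initial_pathologic_diagnosis', 'days_to_birth']
candidate_gender_cols = [gender]"""

print(parse_candidate_cols(reply, 'candidate_age_cols'))
print(parse_candidate_cols(reply, 'candidate_gender_cols'))
```

It is worth filtering the parsed names against `clinical_data_cols` afterwards, since a model may return names that are not in the dataset.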

From the candidate columns, choose a single column for age and a single column for gender.
If no column meets the requirement, leave `age_col` or `gender_col` as None.

In [11]:
def preview_df(df, n=5):
    return df.head(n).to_dict(orient='list')

In [12]:
preview_df(clinical_data[candidate_age_cols])

{'age_at_initial_pathologic_diagnosis': [58, 44, 23, 23, 30],
 'days_to_birth': [-21496, -16090, -8624, -8451, -11171],
 'year_of_initial_pathologic_diagnosis': [2000, 2004, 2008, 2000, 2000]}

In [13]:
age_col = 'age_at_initial_pathologic_diagnosis'
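The choice above was made by inspecting the preview. As an illustration of the reasoning, the sketch below keeps only the candidate columns whose previewed values fall in a plausible age range. The helper `plausible_age_cols` and the 0 to 120 range are assumptions for this example, not part of the reference workflow.

```python
def plausible_age_cols(preview, low=0, high=120):
    """Return the columns whose previewed numeric values all lie within [low, high]."""
    cols = []
    for col, values in preview.items():
        # Keep numeric entries only; `v == v` filters out NaN
        numeric = [v for v in values if isinstance(v, (int, float)) and v == v]
        if numeric and all(low <= v <= high for v in numeric):
            cols.append(col)
    return cols

preview = {
    'age_at_initial_pathologic_diagnosis': [58, 44, 23, 23, 30],
    'days_to_birth': [-21496, -16090, -8624, -8451, -11171],
    'year_of_initial_pathologic_diagnosis': [2000, 2004, 2008, 2000, 2000],
}
print(plausible_age_cols(preview))
```

Only `age_at_initial_pathologic_diagnosis` passes: `days_to_birth` is negative and counted in days, and the year column is far outside the range.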

In [14]:
preview_df(clinical_data[candidate_gender_cols])

{'gender': ['MALE', 'FEMALE', 'FEMALE', 'FEMALE', 'MALE']}

In [15]:
gender_col = 'gender'

In [16]:
def xena_select_clinical_features(clinical_df, trait, age_col=None, gender_col=None):
    feature_list = []
    trait_data = clinical_df.index.to_series().apply(xena_convert_trait).rename(trait)
    feature_list.append(trait_data)
    if age_col:
        age_data = clinical_df[age_col].apply(xena_convert_age).rename("Age")
        feature_list.append(age_data)
    if gender_col:
        gender_data = clinical_df[gender_col].apply(xena_convert_gender).rename("Gender")
        feature_list.append(gender_data)
    selected_clinical_df = pd.concat(feature_list, axis=1)
    return selected_clinical_df

In [17]:
def xena_convert_trait(row_index: str):
    """
    Convert the trait information from Sample IDs to labels depending on the last two digits.
    Tumor types range from 01 - 09, normal types from 10 - 19.
    :param row_index: the index value of a row
    :return: the converted value
    """
    last_two_digits = int(row_index[-2:])

    if 1 <= last_two_digits <= 9:
        return 1
    elif 10 <= last_two_digits <= 19:
        return 0
    else:
        return -1
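Note that `int(row_index[-2:])` raises a `ValueError` if a row index does not end in two digits. A tolerant variant, sketched here under the hypothetical name `xena_convert_trait_safe`, returns -1 in that case. Apart from the first one, the sample IDs below are made up for illustration.

```python
def xena_convert_trait_safe(row_index):
    """Same labeling rule as xena_convert_trait, but returns -1 for malformed IDs."""
    suffix = str(row_index)[-2:]
    if len(suffix) != 2 or not suffix.isdecimal():
        return -1
    code = int(suffix)
    if 1 <= code <= 9:
        return 1   # tumor sample
    if 10 <= code <= 19:
        return 0   # normal sample
    return -1      # any other sample type

for sample_id in ['TCGA-OR-A5J1-01', 'TCGA-XX-0000-11', 'TCGA-XX-0000-20', 'sampleID']:
    print(sample_id, xena_convert_trait_safe(sample_id))
```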

In [18]:
import re

def xena_convert_age(cell: str):
    """Convert the cell content about age to a numerical value using a regular expression.
    Returns the first run of digits as an int, or None if the cell contains no digits.
    """
    match = re.search(r'\d+', str(cell))
    if match:
        return int(match.group())
    else:
        return None
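Because the pattern `\d+` matches only the first run of digits, decimals are truncated and signs are dropped. The sketch below repeats the same extraction under the hypothetical name `first_integer` to make that behavior visible; it suggests why a column such as `days_to_birth` is not a good fit for this converter.

```python
import re

def first_integer(cell):
    """Return the first run of digits in the cell as an int, or None if there is none."""
    match = re.search(r'\d+', str(cell))
    return int(match.group()) if match else None

print(first_integer(58))            # plain integer age
print(first_integer('58.5'))        # the decimal part is ignored
print(first_integer(-21496))        # the minus sign is ignored
print(first_integer(float('nan')))  # no digits, so None
```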

In [19]:
def xena_convert_gender(cell: str):
    """Convert the cell content about gender to a binary value
    """
    if isinstance(cell, str):
        cell = cell.lower()

    if cell == "female":
        return 0
    elif cell == "male":
        return 1
    else:
        return None

In [20]:
import re
selected_clinical_data = xena_select_clinical_features(clinical_data, TRAIT, age_col=age_col, gender_col=gender_col)

In [21]:
def normalize_gene_symbols_in_index(gene_df):
    """Normalize the human gene symbols at the index of a dataframe, and replace the index with its normalized version.
    Remove the rows where the index failed to be normalized."""
    normalized_gene_list = normalize_gene_symbols(gene_df.index.tolist())
    assert len(normalized_gene_list) == len(gene_df.index)
    gene_df.index = normalized_gene_list
    gene_df = gene_df[gene_df.index.notnull()]
    return gene_df

In [22]:
def normalize_gene_symbols(gene_symbols, batch_size=1000):
    """Normalize human gene symbols in batches using the 'mygene' library"""
    mg = mygene.MyGeneInfo()
    normalized_genes = {}

    # Process in batches
    for i in range(0, len(gene_symbols), batch_size):
        batch = gene_symbols[i:i + batch_size]
        results = mg.querymany(batch, scopes='symbol', fields='symbol', species='human')

        # Update the normalized_genes dictionary with results from this batch
        for gene in results:
            normalized_genes[gene['query']] = gene.get('symbol', None)

    # Return the normalized symbols in the same order as the input
    return [normalized_genes.get(symbol) for symbol in gene_symbols]
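The batching and order-restoring logic can be exercised offline by swapping the `mygene` call for a stub. In the sketch below, `normalize_in_batches` and `fake_query` are hypothetical names, and the shape of the stub's results (a `query` key plus an optional `symbol` key) is an assumption modeled on how the function above reads them.

```python
def normalize_in_batches(symbols, query_fn, batch_size=1000):
    """Query symbols batch by batch and return results in the input order."""
    normalized = {}
    for i in range(0, len(symbols), batch_size):
        for hit in query_fn(symbols[i:i + batch_size]):
            # If a query has several hits, the last one overwrites the earlier ones
            normalized[hit['query']] = hit.get('symbol')
    return [normalized.get(s) for s in symbols]

def fake_query(batch):
    """Stand-in for the remote query: upper-cases known symbols, leaves others without a symbol."""
    known = {'tp53', 'brca1'}
    return [{'query': s, 'symbol': s.upper()} if s in known else {'query': s, 'notfound': True}
            for s in batch]

print(normalize_in_batches(['tp53', 'unknown1', 'brca1'], fake_query, batch_size=2))
```

Symbols without a hit come back as `None`, which is what `normalize_gene_symbols_in_index` relies on when it drops rows with a null index.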


In [23]:
import mygene

if NORMALIZE_GENE:
    genetic_data = normalize_gene_symbols_in_index(genetic_data)

12 input query terms found dup hits:	[('GTF2IP1', 2), ('RBMY1A3P', 3), ('RPL31P11', 2), ('HERC2P2', 3), ('WASH3P', 3), ('NUDT9P1', 2), ('
154 input query terms found no hit:	['C16orf13', 'C16orf11', 'LOC100272146', 'LOC339240', 'NACAP1', 'LOC441204', 'KLRA1', 'FAM183A', 'FA
10 input query terms found dup hits:	[('SUGT1P1', 2), ('PTPRVP', 2), ('SNORA62', 3), ('IFITM4P', 7), ('HLA-DRB6', 2), ('FUNDC2P2', 2), ('
190 input query terms found no hit:	['NARFL', 'NFKBIL2', 'LOC150197', 'TMEM84', 'LOC162632', 'PPPDE1', 'PPPDE2', 'C1orf38', 'C1orf31', '
11 input query terms found dup hits:	[('PIP5K1P1', 2), ('HBD', 2), ('PPP1R2P1', 9), ('HSD17B7P2', 2), ('RPSAP9', 2), ('SNORD68', 2), ('SN
149 input query terms found no hit:	['FAM153C', 'C9orf167', 'CLK2P', 'CCDC76', 'CCDC75', 'CCDC72', 'HIST3H2BB', 'PRAC', 'LOC285780', 'LO
15 input query terms found dup hits:	[('SNORD58C', 2), ('UOX', 2), ('UBE2Q2P1', 3), ('PPP4R1L', 2), ('SNORD63', 3), ('ESPNP', 2), ('HBBP1
158 input query terms found no hit:	[

In [24]:
merged_data = selected_clinical_data.join(genetic_data.T).dropna()
merged_data.head()

sampleID,Anxiety disorder,Age,Gender,ARHGEF10L,HIF3A,RNF17,RNF10,RNF11,RNF13,GTF2IP1,...,SLC7A10,PLA2G2C,TULP2,NPY5R,GNGT2,GNGT1,TULP3,BCL6B,GSTK1,SELP
TCGA-OR-A5J1-01,1,58,1,-0.641092,-0.325826,-0.531035,1.266428,0.355422,0.03719,0.243706,...,-1.520186,-0.086682,-0.182978,-0.615817,-0.281533,3.02111,-0.927577,-1.006227,1.119905,-2.185533
TCGA-OR-A5J2-01,1,44,0,-1.864792,2.766674,0.321165,1.000728,0.836122,0.35439,-0.436694,...,-0.318586,1.056018,0.393822,2.366583,-0.955033,-1.28139,1.020723,1.226373,1.164005,0.265067
TCGA-OR-A5J3-01,1,23,0,-0.723192,-0.362926,-0.531035,0.639828,-0.199578,-0.48331,0.143606,...,-0.574486,-0.086682,-0.748878,-0.113317,-3.803333,-0.61009,0.397623,-0.675227,1.196005,-3.161633
TCGA-OR-A5J5-01,1,30,1,-1.576792,-2.086226,2.463765,1.382228,-1.115678,-1.23621,0.615806,...,-0.279486,-0.086682,0.078622,1.095983,-0.908533,-1.28139,0.661823,0.458273,0.839605,-5.525533
TCGA-OR-A5J6-01,1,29,0,-2.311992,5.225974,-0.531035,0.967928,-0.393778,-0.38231,-0.060194,...,-2.090786,1.607218,2.481122,-0.946617,-0.570533,-1.28139,-0.425177,0.938573,0.495005,-1.733333


In [25]:
def judge_and_remove_biased_features(df, trait, trait_type):
    assert trait_type in ["binary", "continuous"], f"The trait must be either a binary or a continuous variable!"
    if trait_type == "binary":
        trait_biased = judge_binary_variable_biased(df, trait)
    else:
        trait_biased = judge_continuous_variable_biased(df, trait)
    if trait_biased:
        print(f"The distribution of the feature \'{trait}\' in this dataset is severely biased.\n")
    else:
        print(f"The distribution of the feature \'{trait}\' in this dataset is fine.\n")
    if "Age" in df.columns:
        age_biased = judge_continuous_variable_biased(df, 'Age')
        if age_biased:
            print(f"The distribution of the feature \'Age\' in this dataset is severely biased.\n")
            df = df.drop(columns='Age')
        else:
            print(f"The distribution of the feature \'Age\' in this dataset is fine.\n")
    if "Gender" in df.columns:
        gender_biased = judge_binary_variable_biased(df, 'Gender')
        if gender_biased:
            print(f"The distribution of the feature \'Gender\' in this dataset is severely biased.\n")
            df = df.drop(columns='Gender')
        else:
            print(f"The distribution of the feature \'Gender\' in this dataset is fine.\n")

    return trait_biased, df

In [26]:
def judge_binary_variable_biased(dataframe, col_name, min_proportion=0.1, min_num=5):
    """
    Check if the distribution of a binary variable in the dataset is too biased to be usable for analysis
    :param dataframe:
    :param col_name:
    :param min_proportion:
    :param min_num:
    :return:
    """
    label_counter = dataframe[col_name].value_counts()
    total_samples = len(dataframe)
    rare_label_num = label_counter.min()
    rare_label = label_counter.idxmin()
    rare_label_proportion = rare_label_num / total_samples

    print(
        f"For the feature \'{col_name}\', the least common label is '{rare_label}' with {rare_label_num} occurrences. This represents {rare_label_proportion:.2%} of the dataset.")

    biased = (len(label_counter) < 2) or ((rare_label_proportion < min_proportion) and (rare_label_num < min_num))
    return bool(biased)
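The rule above flags a binary variable as biased when it has a single label, or when the rarest label is both below `min_proportion` and below `min_num`. A minimal pure-Python restatement, under the hypothetical name `is_binary_biased`, makes the effect of the `and` explicit on toy label lists:

```python
from collections import Counter

def is_binary_biased(labels, min_proportion=0.1, min_num=5):
    """Apply the same rule as judge_binary_variable_biased to a plain list of labels."""
    counts = Counter(labels)
    rare_num = min(counts.values())
    rare_proportion = rare_num / len(labels)
    return len(counts) < 2 or (rare_proportion < min_proportion and rare_num < min_num)

print(is_binary_biased([1] * 79))             # a single label
print(is_binary_biased([1] * 92 + [0] * 8))   # 8% of samples, but 8 >= min_num
print(is_binary_biased([1] * 97 + [0] * 3))   # 3% of samples and 3 < min_num
print(is_binary_biased([1] * 20 + [0] * 3))   # 3 samples, but about 13% of the data
```

Only the first and third cases are flagged. A rare label that fails just one of the two thresholds is still accepted.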


In [27]:
def judge_continuous_variable_biased(dataframe, col_name):
    """Check if the distribution of a continuous variable in the dataset is too biased to be usable for analysis.
    As a starting point, we consider it biased if all values are the same. For the next step, maybe ask GPT to judge
    based on quartile statistics combined with its common sense knowledge about this feature.
    """
    quartiles = dataframe[col_name].quantile([0.25, 0.5, 0.75])
    min_value = dataframe[col_name].min()
    max_value = dataframe[col_name].max()

    # Printing quartile information
    print(f"Quartiles for '{col_name}':")
    print(f"  25%: {quartiles[0.25]}")
    print(f"  50% (Median): {quartiles[0.5]}")
    print(f"  75%: {quartiles[0.75]}")
    print(f"Min: {min_value}")
    print(f"Max: {max_value}")

    biased = min_value == max_value

    return bool(biased)

In [28]:
print(f"The merged dataset contains {len(merged_data)} samples.")
# Reassign merged_data so that any biased 'Age' or 'Gender' column is dropped from the data used downstream
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

The merged dataset contains 79 samples.
For the feature 'Anxiety disorder', the least common label is '1' with 79 occurrences. This represents 100.00% of the dataset.
The distribution of the feature 'Anxiety disorder' in this dataset is severely biased.

Quartiles for 'Age':
  25%: 35.0
  50% (Median): 49.0
  75%: 59.5
Min: 14
Max: 77
The distribution of the feature 'Age' in this dataset is fine.

For the feature 'Gender', the least common label is '1' with 31 occurrences. This represents 39.24% of the dataset.
The distribution of the feature 'Gender' in this dataset is fine.



True

In [29]:
merged_data.head()
if not is_trait_biased:
    merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)

In [30]:
from typing import Callable, Optional, List, Tuple, Union, Any

In [31]:
def save_cohort_info(cohort: str, info_path: str, is_available: bool, is_biased: Optional[bool] = None,
                     df: Optional[pd.DataFrame] = None, note: str = '') -> None:
    """
    Add or update information about the usability and quality of a dataset for statistical analysis.

    Parameters:
    cohort (str): A unique identifier for the dataset.
    info_path (str): File path to the JSON file where records are stored.
    is_available (bool): Indicates whether both the genetic data and trait data are available in the dataset, and can be
     preprocessed into a dataframe.
    is_biased (bool, optional): Indicates whether the dataset is too biased to be usable.
        Required if `is_available` is True.
    df (pandas.DataFrame, optional): The preprocessed dataset. Required if `is_available` is True.
    note (str, optional): Additional notes about the dataset.

    Returns:
    None: The function does not return a value but updates or creates a record in the specified JSON file.
    """
    if is_available:
        assert (df is not None) and (is_biased is not None), "'df' and 'is_biased' should be provided if this cohort " \
                                                             "is relevant."
    is_usable = is_available and (not is_biased)
    new_record = {"is_usable": is_usable,
                  "is_available": is_available,
                  "is_biased": is_biased if is_available else None,
                  "has_age": "Age" in df.columns if is_available else None,
                  "has_gender": "Gender" in df.columns if is_available else None,
                  "sample_size": len(df) if is_available else None,
                  "note": note}
    
    if not os.path.exists(info_path):
        with open(info_path, 'w') as file:
            json.dump({}, file)
        print(f"A new JSON file was created at: {info_path}")

    with open(info_path, "r") as file:
        records = json.load(file)
    records[cohort] = new_record

    temp_path = info_path + ".tmp"
    try:
        with open(temp_path, 'w') as file:
            json.dump(records, file)
        os.replace(temp_path, info_path)

    except Exception as e:
        print(f"An error occurred: {e}")
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
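The function above updates the JSON file by writing to a temporary file and then calling `os.replace`, so a failure during writing does not leave a half-written `cohort_info.json`. The pattern in isolation, as a sketch with the hypothetical helper `update_json_record` and a throwaway temporary directory:

```python
import json
import os
import tempfile

def update_json_record(path, key, record):
    """Insert or overwrite one record in a JSON file, replacing the file in a single step."""
    records = {}
    if os.path.exists(path):
        with open(path, 'r') as f:
            records = json.load(f)
    records[key] = record
    tmp_path = path + '.tmp'
    with open(tmp_path, 'w') as f:
        json.dump(records, f)
    # Swap the fully written temporary file into place
    os.replace(tmp_path, path)
    return records

with tempfile.TemporaryDirectory() as tmp_dir:
    info_path = os.path.join(tmp_dir, 'cohort_info.json')
    update_json_record(info_path, 'Xena', {'is_usable': False})
    result = update_json_record(info_path, 'GSE60190', {'is_usable': True})

print(sorted(result))
```

Calling the helper again with an existing cohort key overwrites that record, which matches the "add or update" behavior described in the docstring.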

In [32]:
import json

save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data)

A new JSON file was created at: /Users/legion/Desktop/Courses/IS389/output\Jiayi\Anxiety-disorder\cohort_info.json


## 2.2. The GEO dataset

In GEO, there may be one or multiple cohorts for a trait. Each cohort is identified by an accession number. We iterate over all accession numbers in the corresponding subdirectory, preprocess the cohort data, and save them to csv files.

In [33]:
dataset = 'GEO'
trait_subdir = "Anxiety-disorder"

trait_path = os.path.join(DATA_ROOT, dataset, trait_subdir)
os.listdir(trait_path)

['GSE119995',
 'GSE18123',
 'GSE60190',
 'GSE60491',
 'GSE61672',
 'GSE68526',
 'GSE78104',
 'GSE94119',
 'GSE98793']

Repeat the steps below for each accession number.

In [34]:
def get_relevant_filepaths(cohort_dir):
    """Find the file paths of a SOFT file and a matrix file from the given data directory of a cohort.
    If there are multiple SOFT files or matrix files, simply choose the first one. May be replaced by better
    strategies later.
    """
    files = os.listdir(cohort_dir)
    soft_files = [f for f in files if 'soft' in f.lower()]
    matrix_files = [f for f in files if 'matrix' in f.lower()]
    assert len(soft_files) > 0 and len(matrix_files) > 0
    soft_file_path = os.path.join(cohort_dir, soft_files[0])
    matrix_file_path = os.path.join(cohort_dir, matrix_files[0])

    return soft_file_path, matrix_file_path
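`os.listdir` returns entries in arbitrary order, so "the first one" can differ between machines when a cohort has several matrix or SOFT files. One possible refinement, sketched with the hypothetical helper `pick_first_sorted` and made-up file names, is to sort the matches before choosing:

```python
import os
import tempfile

def pick_first_sorted(cohort_dir, keyword):
    """Return the alphabetically first file whose name contains `keyword`, or None."""
    matches = sorted(f for f in os.listdir(cohort_dir) if keyword in f.lower())
    return os.path.join(cohort_dir, matches[0]) if matches else None

with tempfile.TemporaryDirectory() as d:
    # Made-up file names standing in for a cohort with two matrix files
    for name in ['GSE00000_series_matrix.txt.gz',
                 'GSE00000-GPL2_series_matrix.txt.gz',
                 'GSE00000_family.soft.gz']:
        open(os.path.join(d, name), 'w').close()
    matrix_name = os.path.basename(pick_first_sorted(d, 'matrix'))
    soft_name = os.path.basename(pick_first_sorted(d, 'soft'))
    missing = pick_first_sorted(d, 'annotation')

print(matrix_name, soft_name, missing)
```

Sorting only makes the choice reproducible; it does not decide which of several matrix files is the right one for the trait.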

In [205]:
# Finished
cohort = accession_num = "GSE60190"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Genetic Neuropathology of Obsessive Psychiatric Syndromes"
!Series_summary	"Anorexia nervosa (AN), bulimia nervosa (BN), and obsessive-compulsive disorder (OCD) are complex psychiatric disorders with shared obsessive features, thought to arise from the interaction of multiple genes of small effect with environmental factors.  Potential candidate genes for AN, BN, and OCD have been identified through clinical association and neuroimaging studies; however, recent genome-wide association studies of eating disorders (ED) so far have failed to report significant findings. Additionally, few if any studies have interrogated postmortem brain tissue for evidence of eQTLs associated with candidate genes, which has particular promise as an approach to elucidating molecular mechanisms of association. We therefore selected single nucleotide polymorphisms (SNPs) based on candidate gene studies for AN, BN, and OCD from the literature, and examined the association of these SNPs with gen

Unnamed: 0,!Sample_geo_accession,GSM1467273,GSM1467274,GSM1467275,GSM1467276,GSM1467277,GSM1467278,GSM1467279,GSM1467280,GSM1467281,...,GSM1467396,GSM1467397,GSM1467398,GSM1467399,GSM1467400,GSM1467401,GSM1467402,GSM1467403,GSM1467404,GSM1467405
0,!Sample_characteristics_ch1,rin: 7.4,rin: 8.6,rin: 7.8,rin: 8.2,rin: 8.5,rin: 8.3,rin: 8.1,rin: 8.8,rin: 8.7,...,rin: 8.4,rin: 8,rin: 8.3,rin: 7.5,rin: 8.1,rin: 8.5,rin: 6.6,rin: 7.1,rin: 9,rin: 8.3
1,!Sample_characteristics_ch1,ocd: ED,ocd: Control,ocd: OCD,ocd: Control,ocd: Control,ocd: ED,ocd: Control,ocd: Control,ocd: Control,...,ocd: ED,ocd: Control,ocd: Control,ocd: Control,ocd: Control,ocd: Control,ocd: ED,ocd: Control,ocd: Control,ocd: Control
2,!Sample_characteristics_ch1,rinmatched: 1,rinmatched: 0,rinmatched: 1,rinmatched: 0,rinmatched: 0,rinmatched: 1,rinmatched: 0,rinmatched: 0,rinmatched: 0,...,rinmatched: 1,rinmatched: 1,rinmatched: 0,rinmatched: 1,rinmatched: 1,rinmatched: 0,rinmatched: 1,rinmatched: 1,rinmatched: 0,rinmatched: 0
3,!Sample_characteristics_ch1,dx: Bipolar,dx: Control,dx: Bipolar,dx: Control,dx: Control,dx: Bipolar,dx: Control,dx: Control,dx: Control,...,dx: ED,dx: Control,dx: Control,dx: Control,dx: Control,dx: Control,dx: MDD,dx: Control,dx: Control,dx: Control
4,!Sample_characteristics_ch1,ph: 6.18,ph: 6.59,ph: 6.37,ph: 6.6,ph: 6.38,ph: 6.02,ph: 6.87,ph: 6.95,ph: 6.82,...,ph: 6.34,ph: 6.5,ph: 6.51,ph: 6.59,ph: 6.65,ph: 6.71,ph: 6.2,ph: 6.2,ph: 6.83,ph: 6.92
5,!Sample_characteristics_ch1,age: 50.421917,age: 27.49863,age: 30.627397,age: 61.167123,age: 32.69589,age: 39.213698,age: 58.605479,age: 49.2,age: 41.041095,...,age: 30.726027,age: 63.471232,age: 54.808219,age: 57.512328,age: 57.610958,age: 44.958904,age: 35.684931,age: 63,age: 38.780821,age: 45.978082
6,!Sample_characteristics_ch1,pmi: 27,pmi: 19.5,pmi: 71.5,pmi: 19.5,pmi: 22.5,pmi: 22.5,pmi: 64,pmi: 28,pmi: 18,...,pmi: 19,pmi: 37,pmi: 21,pmi: 24,pmi: 17,pmi: 20,pmi: 33,pmi: 45.5,pmi: 29,pmi: 34
7,!Sample_characteristics_ch1,Sex: F,Sex: M,Sex: M,Sex: M,Sex: M,Sex: F,Sex: M,Sex: M,Sex: M,...,Sex: F,Sex: M,Sex: M,Sex: M,Sex: M,Sex: M,Sex: F,Sex: M,Sex: M,Sex: M
8,!Sample_characteristics_ch1,race: CAUC,race: CAUC,race: CAUC,race: CAUC,race: CAUC,race: CAUC,race: CAUC,race: CAUC,race: CAUC,...,race: CAUC,race: CAUC,race: CAUC,race: CAUC,race: CAUC,race: CAUC,race: CAUC,race: CAUC,race: CAUC,race: CAUC
9,!Sample_characteristics_ch1,batch1: 16,batch1: 18,batch1: 18,batch1: 18,batch1: 19,batch1: 19,batch1: 20,batch1: 21,batch1: 21,...,batch1: 96,batch1: 96,batch1: 97,batch1: 98,batch1: 99,batch1: 99,batch1: 100,batch1: 102,batch1: 102,batch1: 102


In [206]:
# Inspect the unique values of the candidate trait row (row 1, the 'ocd' characteristic)
trait_row_values = clinical_data.iloc[1]
trait_row_values.unique()

array(['!Sample_characteristics_ch1', 'ocd: ED', 'ocd: Control',
       'ocd: OCD'], dtype=object)
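Each characteristics cell has the form `key: value`, and the unique values of a row tell us whether that row can serve as the trait, age, or gender. The sketch below shows the same inspection on a plain list; the helper name `summarize_characteristic` is an assumption for illustration.

```python
def summarize_characteristic(row_values):
    """Split 'key: value' cells and return the sorted unique keys and values of a row."""
    keys, values = set(), set()
    for cell in row_values:
        # Skip the row label and any missing cells
        if not isinstance(cell, str) or ': ' not in cell:
            continue
        key, value = cell.split(': ', 1)
        keys.add(key.strip())
        values.add(value.strip())
    return sorted(keys), sorted(values)

row = ['!Sample_characteristics_ch1', 'ocd: ED', 'ocd: Control', 'ocd: OCD', 'ocd: Control']
print(summarize_characteristic(row))
```

A row with a single key and a small set of values, as here, is a candidate for a binary trait once the values are mapped to 0 and 1.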

In [215]:
is_gene_availabe = True
trait_row = 1
age_row = 5
gender_row = 7

trait_type = 'binary'

# Verify and use the functions generated by GPT

# This function converts the 'ocd' sample characteristic into a binary trait label.
# Samples labelled 'ocd: OCD' are treated as cases (return 1); all other values,
# i.e. 'ocd: ED' and 'ocd: Control', are treated as non-cases (return 0).
def convert_trait(value):
    """
    Convert the 'ocd' characteristic string to a binary trait label.
    Returns 1 for 'ocd: OCD' and 0 for any other value.
    """
    if value == 'ocd: OCD':
        return 1  # OCD case
    else:
        return 0  # ED or control
    
def convert_age(age_string):
    """
    Convert age string to a continuous numerical value.
    Unknown values are converted to None.
    """
    if age_string.lower() == 'n.a.':
        return None
    try:
        # Parse the number after 'age: ' and round it to two decimal places
        age = round(float(age_string.split(': ')[1]),2)
        return age
    except (ValueError, IndexError):
        # In case of any format error or unexpected string structure
        return None
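The next cell also passes a `convert_gender` function that is not defined in this section. A minimal sketch for the `Sex: F` / `Sex: M` format seen above is given here under the hypothetical name `convert_gender_sketch`. Its coding (female 0, male 1) is an assumption that follows `xena_convert_gender`; the Gender row in the output of the next cell suggests that the function actually used in this run coded `F` as 1, so the coding should be checked before cohorts are combined.

```python
def convert_gender_sketch(gender_string):
    """Convert a 'Sex: F' / 'Sex: M' style string to 0 (female) or 1 (male); None if unknown."""
    if not isinstance(gender_string, str) or ':' not in gender_string:
        return None
    value = gender_string.split(':', 1)[1].strip().lower()
    if value in ('f', 'female'):
        return 0
    if value in ('m', 'male'):
        return 1
    return None

for cell in ['Sex: F', 'Sex: M', 'Sex: unknown', None]:
    print(cell, convert_gender_sketch(cell))
```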

In [216]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()



Unnamed: 0,GSM1467273,GSM1467274,GSM1467275,GSM1467276,GSM1467277,GSM1467278,GSM1467279,GSM1467280,GSM1467281,GSM1467282,...,GSM1467396,GSM1467397,GSM1467398,GSM1467399,GSM1467400,GSM1467401,GSM1467402,GSM1467403,GSM1467404,GSM1467405
Anxiety disorder,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Age,50.42,27.5,30.63,61.17,32.7,39.21,58.61,49.2,41.04,51.75,...,30.73,63.47,54.81,57.51,57.61,44.96,35.68,63.0,38.78,45.98
Gender,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [217]:
genetic_data = get_genetic_data(matrix_file)
genetic_data

ID,GSM1467273,GSM1467274,GSM1467275,GSM1467276,GSM1467277,GSM1467278,GSM1467279,GSM1467280,GSM1467281,GSM1467282,...,GSM1467396,GSM1467397,GSM1467398,GSM1467399,GSM1467400,GSM1467401,GSM1467402,GSM1467403,GSM1467404,GSM1467405
ILMN_1343291,14.631928,14.846268,14.949614,15.044821,14.790150,14.670395,14.730323,14.972534,14.826768,14.790150,...,14.749395,14.593207,14.558687,14.490661,14.540732,14.540732,14.972534,14.730323,14.790150,14.609839
ILMN_1343295,13.015539,13.199489,13.256768,13.030246,12.611440,13.030246,12.894845,12.668329,12.536053,12.684084,...,12.729559,12.628114,12.729559,12.526797,12.921417,12.536053,12.616903,12.685365,12.729559,12.894845
ILMN_1651199,7.407451,7.404480,7.384722,7.398199,7.388903,7.392256,7.411295,7.392441,7.402658,7.397018,...,7.386564,7.392892,7.387466,7.398703,7.387851,7.380980,7.391920,7.391737,7.384270,7.391287
ILMN_1651209,7.551069,7.489157,7.481180,7.454187,7.468150,7.538856,7.516444,7.428403,7.420693,7.435272,...,7.492618,7.517797,7.570776,7.491987,7.543178,7.536219,7.486087,7.445162,7.523485,7.524602
ILMN_1651210,7.389822,7.396480,7.392958,7.389106,7.399044,7.401760,7.412553,7.418004,7.383820,7.399614,...,7.383362,7.386934,7.383795,7.390598,7.383517,7.380075,7.383755,7.392136,7.383855,7.395560
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ILMN_2415911,8.200796,8.057956,8.244277,8.318973,8.585545,8.051581,8.212767,8.391308,8.099802,8.173629,...,8.142346,8.187782,8.310146,8.234435,8.165085,8.222113,7.763713,8.091295,8.353276,7.976434
ILMN_2415926,9.308517,9.261562,9.539295,8.953072,9.178453,9.364314,9.391644,9.428445,9.813872,9.575225,...,9.629529,9.005815,9.073252,9.540430,9.374462,9.614333,8.026149,8.558921,8.787474,9.170768
ILMN_2415949,8.258738,8.050959,8.363362,8.037019,8.059074,8.154139,8.092218,8.098232,8.231403,8.278169,...,8.041222,7.976685,7.960897,7.998456,8.063397,7.888010,7.992915,8.265580,7.960649,8.072490
ILMN_2415979,11.414900,9.318214,9.597813,9.561681,9.837778,9.649293,10.331787,10.224846,10.343168,10.534634,...,10.500506,10.723755,10.533912,10.080191,11.034512,10.533200,11.493401,10.312506,10.355448,10.214869


In [218]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

gene_annotation.columns

{'ID': ['ILMN_1725881', 'ILMN_1910180', 'ILMN_1804174', 'ILMN_1796063', 'ILMN_1811966'], 'nuID': ['rp13_p1x6D80lNLk3c', 'NEX0oqCV8.er4HVfU4', 'KyqQynMZxJcruyylEU', 'xXl7eXuF7sbPEp.KFI', '9ckqJrioiaej9_ajeQ'], 'Species': ['Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens'], 'Source': ['RefSeq', 'Unigene', 'RefSeq', 'RefSeq', 'RefSeq'], 'Search_Key': ['ILMN_44919', 'ILMN_127219', 'ILMN_139282', 'ILMN_5006', 'ILMN_38756'], 'Transcript': ['ILMN_44919', 'ILMN_127219', 'ILMN_139282', 'ILMN_5006', 'ILMN_38756'], 'ILMN_Gene': ['LOC23117', 'HS.575038', 'FCGR2B', 'TRIM44', 'LOC653895'], 'Source_Reference_ID': ['XM_933824.1', 'Hs.575038', 'XM_938851.1', 'NM_017583.3', 'XM_936379.1'], 'RefSeq_ID': ['XM_933824.1', nan, 'XM_938851.1', 'NM_017583.3', 'XM_936379.1'], 'Unigene_ID': [nan, 'Hs.575038', nan, nan, nan], 'Entrez_Gene_ID': [23117.0, nan, 2213.0, 54765.0, 653895.0], 'GI': [89040007.0, 10437021.0, 88952550.0, 29029528.0, 89033487.0], 'Accession': ['XM_933824.1', 'AK

Index(['ID', 'nuID', 'Species', 'Source', 'Search_Key', 'Transcript',
       'ILMN_Gene', 'Source_Reference_ID', 'RefSeq_ID', 'Unigene_ID',
       'Entrez_Gene_ID', 'GI', 'Accession', 'Symbol', 'Protein_Product',
       'Array_Address_Id', 'Probe_Type', 'Probe_Start', 'SEQUENCE',
       'Chromosome', 'Probe_Chr_Orientation', 'Probe_Coordinates', 'Cytoband',
       'Definition', 'Ontology_Component', 'Ontology_Process',
       'Ontology_Function', 'Synonyms', 'Obsolete_Probe_Id', 'GB_ACC'],
      dtype='object')

In [220]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'Symbol'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)
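How `apply_gene_mapping` combines several probes that map to the same gene is not shown in this section. One common choice is to average them, and the sketch below illustrates that idea with plain dictionaries. The function name, the probe IDs, and the averaging rule are all assumptions for illustration and may differ from what `utils` does.

```python
from collections import defaultdict

def map_probes_to_genes(expression, probe_to_gene):
    """Group probe rows by gene symbol and average them per sample; unmapped probes are dropped."""
    grouped = defaultdict(list)
    for probe, values in expression.items():
        gene = probe_to_gene.get(probe)
        if gene:
            grouped[gene].append(values)
    # zip(*rows) iterates over samples, so each mean is taken across probes of one gene
    return {gene: [sum(col) / len(col) for col in zip(*rows)] for gene, rows in grouped.items()}

expression = {'probe_a': [1.0, 3.0], 'probe_b': [3.0, 5.0],
              'probe_c': [7.0, 7.0], 'probe_d': [9.0, 9.0]}
probe_to_gene = {'probe_a': 'GENE1', 'probe_b': 'GENE1', 'probe_c': 'GENE2'}

print(map_probes_to_genes(expression, probe_to_gene))
```

After mapping, the index holds gene symbols instead of probe identifiers, which is the form that `normalize_gene_symbols_in_index` expects in the next cell.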

In [221]:
genetic_data = normalize_gene_symbols_in_index(genetic_data)

genetic_data

Unnamed: 0,GSM1467273,GSM1467274,GSM1467275,GSM1467276,GSM1467277,GSM1467278,GSM1467279,GSM1467280,GSM1467281,GSM1467282,...,GSM1467396,GSM1467397,GSM1467398,GSM1467399,GSM1467400,GSM1467401,GSM1467402,GSM1467403,GSM1467404,GSM1467405
A1BG,7.481047,7.459438,7.462799,7.453585,7.434372,7.435308,7.497011,7.434371,7.413374,7.439196,...,7.497660,7.457713,7.427958,7.443941,7.438771,7.455120,7.419787,7.442493,7.427613,7.424304
A1CF,7.405821,7.396154,7.399692,7.411053,7.427684,7.420371,7.409882,7.406598,7.423508,7.403380,...,7.420024,7.425435,7.405370,7.404043,7.407866,7.410821,7.398595,7.412058,7.404174,7.397455
A2M,8.369617,8.115136,7.834904,8.268345,8.132207,8.514582,8.681611,8.820405,8.701411,8.446774,...,8.942798,8.940076,8.665946,8.740932,8.765340,8.901560,9.056649,9.500376,8.691591,8.890048
A2ML1,7.381102,7.424990,7.406004,7.397059,7.408826,7.394649,7.410082,7.443830,7.402433,7.527460,...,7.387483,7.386391,7.434159,7.406756,7.388920,7.403031,7.630691,7.391827,7.423193,7.387081
A3GALT2,7.395020,7.384147,7.385292,7.387125,7.389030,7.395563,7.393727,7.381826,7.391831,7.383787,...,7.381678,7.383456,7.390774,7.385387,7.381505,7.391571,7.393197,7.380317,7.390618,7.384771
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZYG11B,11.755243,11.580051,11.472778,11.197515,11.357074,11.292039,11.431473,11.529512,11.584090,11.731810,...,11.920459,11.946019,11.600578,11.121539,11.563971,11.461402,11.607571,11.942620,12.143056,11.889138
ZYX,9.488777,8.958607,8.853385,9.023618,9.076320,8.989631,8.968002,8.859351,9.026985,9.067510,...,8.818570,9.242392,9.202663,9.317282,9.217690,9.172183,9.902353,8.986207,9.157025,9.066989
ZZEF1,9.066159,8.598233,8.774467,8.956653,8.843062,9.069212,8.888722,8.928407,9.072954,8.852790,...,8.847695,8.790686,8.876767,9.085703,8.807783,8.747593,8.930893,8.798290,8.892741,8.587646
ZZZ3,8.523295,8.584810,8.731412,8.739782,8.744687,8.432901,8.545003,8.718865,8.629061,8.707885,...,8.760058,8.684442,8.750618,8.786311,8.544695,8.787504,8.753251,8.602089,8.709299,8.472725


In [222]:
merged_data = geo_merge_clinical_genetic_data(selected_clinical_data, genetic_data)
# Preprocessing completed without errors, so this cohort is marked as available
is_available = True

merged_data

Unnamed: 0,Anxiety disorder,Age,Gender,A1BG,A1CF,A2M,A2ML1,A3GALT2,A4GALT,A4GNT,...,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3,RAB1C
GSM1467273,0.0,50.42,1.0,7.481047,7.405821,8.369617,7.381102,7.395020,7.445335,7.418757,...,7.422142,7.411947,7.688330,7.692514,7.397040,11.755243,9.488777,9.066159,8.523295,7.408587
GSM1467274,0.0,27.50,0.0,7.459438,7.396154,8.115136,7.424990,7.384147,7.413263,7.399258,...,7.438231,7.418272,7.734598,7.639503,7.393080,11.580051,8.958607,8.598233,8.584810,7.403482
GSM1467275,1.0,30.63,0.0,7.462799,7.399692,7.834904,7.406004,7.385292,7.481972,7.389239,...,7.442823,7.420557,7.897464,7.799014,7.396474,11.472778,8.853385,8.774467,8.731412,7.377490
GSM1467276,0.0,61.17,0.0,7.453585,7.411053,8.268345,7.397059,7.387125,7.463897,7.401978,...,7.415354,7.414003,7.809488,7.719431,7.408978,11.197515,9.023618,8.956653,8.739782,7.383207
GSM1467277,0.0,32.70,0.0,7.434372,7.427684,8.132207,7.408826,7.389030,7.414156,7.415771,...,7.389650,7.418803,7.653474,7.805373,7.393036,11.357074,9.076320,8.843062,8.744687,7.390069
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GSM1467401,0.0,44.96,0.0,7.455120,7.410821,8.901560,7.403031,7.391571,7.433895,7.459833,...,7.427711,7.424786,7.672185,7.568192,7.391293,11.461402,9.172183,8.747593,8.787504,7.389036
GSM1467402,0.0,35.68,1.0,7.419787,7.398595,9.056649,7.630691,7.393197,8.135820,7.411807,...,7.399261,7.415387,7.559787,7.702635,7.389805,11.607571,9.902353,8.930893,8.753251,7.382404
GSM1467403,0.0,63.00,0.0,7.442493,7.412058,9.500376,7.391827,7.380317,7.532850,7.447209,...,7.415293,7.420746,7.617346,7.715541,7.388253,11.942620,8.986207,8.798290,8.602089,7.406016
GSM1467404,0.0,38.78,0.0,7.427613,7.404174,8.691591,7.423193,7.390618,7.440411,7.416510,...,7.412564,7.397225,7.710904,7.607852,7.389100,12.143056,9.157025,8.892741,8.709299,7.383361


In [223]:
print(f"The merged dataset contains {len(merged_data)} samples.")
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

The merged dataset contains 133 samples.
For the feature 'Anxiety disorder', the least common label is '1.0' with 16 occurrences. This represents 12.03% of the dataset.
The distribution of the feature 'Anxiety disorder' in this dataset is fine.

Quartiles for 'Age':
  25%: 30.04
  50% (Median): 45.22
  75%: 54.42
Min: 16.18
Max: 84.06
The distribution of the feature 'Age' in this dataset is fine.

For the feature 'Gender', the least common label is '1.0' with 36 occurrences. This represents 27.07% of the dataset.
The distribution of the feature 'Gender' in this dataset is fine.



False
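The percentage reported for a binary feature can be reproduced by hand. The sketch below recomputes the share of the least common label from the counts printed above (16 of 133 samples). `minority_share` is a hypothetical helper, not part of `utils`, and the cutoff that `judge_and_remove_biased_features` applies to this share is not shown in this notebook, so none is assumed here.

```python
import pandas as pd

def minority_share(labels: pd.Series):
    """Return (label, count, percentage) for the least common label (hypothetical helper)."""
    labels = labels.dropna()
    counts = labels.value_counts()
    label = counts.idxmin()
    count = int(counts.min())
    return label, count, round(100 * count / len(labels), 2)

# Rebuild the trait column from the counts in the output above: 16 cases, 117 controls.
trait = pd.Series([1.0] * 16 + [0.0] * 117)
label, count, pct = minority_share(trait)
print(f"Least common label: {label} ({count} samples, {pct}%)")
```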

In [224]:
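# Record the cohort's availability and bias status in the shared JSON file
if is_available:
    save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data, note='')
else:
    save_cohort_info(cohort, JSON_PATH, is_available)
# Save the merged table only if the cohort is available and the trait is not biased
if is_available and not is_trait_biased:
    merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)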

In [192]:
# Finished
cohort = accession_num = "GSE78104"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"lncRNA and mRNA expression data in peripheral blood sampled from patients with Obsessive-Compulsive Disorder"
!Series_summary	"The aim of the study is to identify the global messenger RNA (mRNA) and long noncoding RNA (lncRNA) expression profiling in peripheral blood from thirty patients with Obsessive Compulsive Disorders (OCD) and thirty paired normal controls."
!Series_overall_design	"We quantified the gene transcripts in peripheral blood from thirty patients with OCD and thirty normal controls by the method of Microarray using Aglilent G3 lncRNA v4.04×180K."


Unnamed: 0,!Sample_geo_accession,GSM2067403,GSM2067404,GSM2067405,GSM2067406,GSM2067407,GSM2067408,GSM2067409,GSM2067410,GSM2067411,...,GSM2067453,GSM2067454,GSM2067455,GSM2067456,GSM2067457,GSM2067458,GSM2067459,GSM2067460,GSM2067461,GSM2067462
0,!Sample_characteristics_ch1,tissue: whole blood,tissue: whole blood,tissue: whole blood,tissue: whole blood,tissue: whole blood,tissue: whole blood,tissue: whole blood,tissue: whole blood,tissue: whole blood,...,tissue: whole blood,tissue: whole blood,tissue: whole blood,tissue: whole blood,tissue: whole blood,tissue: whole blood,tissue: whole blood,tissue: whole blood,tissue: whole blood,tissue: whole blood
1,!Sample_characteristics_ch1,disease state: Obsessive-Compulsive Disorder,disease state: Obsessive-Compulsive Disorder,disease state: Obsessive-Compulsive Disorder,disease state: Obsessive-Compulsive Disorder,disease state: Obsessive-Compulsive Disorder,disease state: Obsessive-Compulsive Disorder,disease state: Obsessive-Compulsive Disorder,disease state: Obsessive-Compulsive Disorder,disease state: Obsessive-Compulsive Disorder,...,disease state: normal control,disease state: normal control,disease state: normal control,disease state: normal control,disease state: normal control,disease state: normal control,disease state: normal control,disease state: normal control,disease state: normal control,disease state: normal control
2,!Sample_characteristics_ch1,gender: male,gender: female,gender: female,gender: male,gender: female,gender: male,gender: male,gender: female,gender: male,...,gender: female,gender: male,gender: male,gender: female,gender: male,gender: male,gender: male,gender: female,gender: male,gender: female
3,!Sample_characteristics_ch1,age: 25y,age: 23y,age: 18y,age: 26y,age: 27y,age: 19y,age: 22y,age: 27y,age: 18y,...,age: 43y,age: 40y,age: 32y,age: 28y,age: 27y,age: 30y,age: 24y,age: 35y,age: 56y,age: 56y


In [193]:
# Inspect the unique values in the candidate trait row
trait_values = clinical_data.iloc[1]
trait_values.unique()

array(['!Sample_characteristics_ch1',
       'disease state: Obsessive-Compulsive Disorder',
       'disease state: normal control'], dtype=object)
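Rather than inspecting one row at a time, the unique values of every characteristics row can be listed in one pass, which makes it easier to pick `trait_row`, `age_row`, and `gender_row`. The following is a small sketch under that assumption; `summarize_rows` and the sample names `GSM_A` and `GSM_B` are hypothetical, and the toy table only mimics the layout of `clinical_data`.

```python
import pandas as pd

def summarize_rows(clinical: pd.DataFrame, max_values: int = 5) -> dict:
    """Map each row position to its first few unique values (hypothetical helper).

    The first column holds the field label ('!Sample_characteristics_ch1'),
    so it is skipped.
    """
    summary = {}
    for pos in range(len(clinical)):
        values = clinical.iloc[pos, 1:].unique().tolist()
        summary[pos] = values[:max_values]
    return summary

# Toy table with the same layout as clinical_data; sample names are placeholders.
toy = pd.DataFrame(
    [
        ['!Sample_characteristics_ch1', 'tissue: whole blood', 'tissue: whole blood'],
        ['!Sample_characteristics_ch1',
         'disease state: Obsessive-Compulsive Disorder',
         'disease state: normal control'],
        ['!Sample_characteristics_ch1', 'gender: male', 'gender: female'],
    ],
    columns=['!Sample_geo_accession', 'GSM_A', 'GSM_B'],
)
rows = summarize_rows(toy)
for pos, values in rows.items():
    print(pos, values)
```

A row with a single unique value (such as the tissue row here) carries no information for the analysis, while a row with exactly two values is a candidate for a binary trait or gender.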

In [196]:
is_gene_available = True
trait_row = 1
age_row = 3
gender_row = 2

trait_type = 'binary'

# Verify and use the functions generated by GPT

# This function converts the 'disease state' field into a binary trait value:
# 1 if the sample is labeled Obsessive-Compulsive Disorder, 0 otherwise (normal control).
def convert_trait(disease_state):
    """
    Convert the 'disease state' field to a binary trait value.
    Returns 1 for Obsessive-Compulsive Disorder, 0 for normal controls.
    """
    if disease_state == 'disease state: Obsessive-Compulsive Disorder':
        return 1  # Case: OCD patient
    else:
        return 0  # Control
    
def convert_age(age_string):
    """
    Convert age string to a continuous numerical value.
    Unknown values are converted to None.
    """
    if age_string.lower() == 'n.a.':
        return None
    try:
        # Extract age as an integer from the string
        age = int(age_string.split(': ')[1].rstrip('y'))
        return age
    except (ValueError, IndexError):
        # In case of any format error or unexpected string structure
        return None
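The age field is formatted differently across cohorts in this notebook (`age: 25y`, `age: 35.8`, `age (yrs): 79`), so each cohort ends up with its own `convert_age`. As an alternative, the sketch below uses one regular expression that takes the first number after the colon. `parse_age` is a hypothetical name, and the pattern assumes the value always follows a colon, which holds for the examples shown here but may not hold for every GEO series.

```python
import re
from typing import Optional

# First integer or decimal number that follows a colon.
_AGE_PATTERN = re.compile(r':\s*(\d+(?:\.\d+)?)')

def parse_age(field) -> Optional[float]:
    """Extract the age from a characteristics field, or None if no number is found."""
    if not isinstance(field, str):
        return None  # e.g. NaN from a missing cell
    match = _AGE_PATTERN.search(field)
    return float(match.group(1)) if match else None

for example in ['age: 25y', 'age: 35.8', 'age (yrs): 79', 'age: n.a.']:
    print(example, '->', parse_age(example))
```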

In [197]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()



Unnamed: 0,GSM2067403,GSM2067404,GSM2067405,GSM2067406,GSM2067407,GSM2067408,GSM2067409,GSM2067410,GSM2067411,GSM2067412,...,GSM2067453,GSM2067454,GSM2067455,GSM2067456,GSM2067457,GSM2067458,GSM2067459,GSM2067460,GSM2067461,GSM2067462
Anxiety disorder,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
Age,25,23,18,26,27,19,22,27,18,25,...,43,40,32,28,27,30,24,35,56,56
Gender,0,1,1,0,1,0,0,1,0,0,...,1,0,0,1,0,0,0,1,0,1


In [198]:
genetic_data = get_genetic_data(matrix_file)
genetic_data

ID,GSM2067403,GSM2067404,GSM2067405,GSM2067406,GSM2067407,GSM2067408,GSM2067409,GSM2067410,GSM2067411,GSM2067412,...,GSM2067453,GSM2067454,GSM2067455,GSM2067456,GSM2067457,GSM2067458,GSM2067459,GSM2067460,GSM2067461,GSM2067462
(+)E1A_r60_1,13.442002,13.263885,13.225509,13.144137,12.986026,13.849309,13.541306,14.152843,13.809382,14.053410,...,13.750706,13.790617,13.543028,13.721816,14.025154,14.458712,14.075050,14.186631,14.150471,13.071623
(+)E1A_r60_3,5.764329,6.253843,6.890254,6.576528,6.647077,7.138355,7.172650,7.322565,6.515238,6.263576,...,6.291416,6.574324,6.940987,6.606686,5.639268,6.330926,6.834661,6.725825,5.632979,6.230266
(+)E1A_r60_a104,2.894064,2.899291,2.815752,2.822751,3.035972,3.451708,3.365927,2.915322,3.178784,2.909366,...,2.813052,2.971318,3.223145,3.114489,3.113753,3.152049,3.279744,3.089397,2.978466,3.149935
(+)E1A_r60_a107,3.237023,3.233810,3.052772,2.913015,3.037258,2.997480,3.694422,2.851734,3.155538,3.100223,...,2.784098,2.810281,2.952121,2.872968,2.855768,2.886406,2.929406,2.928709,2.821016,2.908678
(+)E1A_r60_a135,2.954736,2.860286,2.936522,2.731988,2.946573,2.921638,3.183307,2.877388,2.897539,2.898822,...,3.015113,3.371869,3.247685,3.419323,3.670872,3.799645,4.004826,3.943948,3.313756,3.302009
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Mouse-PGK1_5,6.652913,6.739501,7.098698,7.341454,7.097640,7.305539,6.441029,6.780557,7.133135,7.222959,...,7.267308,6.833162,7.242861,7.123768,7.757611,7.613537,7.108137,7.041728,7.224681,6.686666
Rat-GAPDH_3,9.665619,9.663037,9.832910,9.395596,9.562595,9.109386,9.993700,10.020462,9.721497,9.618459,...,9.284128,9.557268,9.594813,9.146444,9.382476,9.110078,9.928258,9.932738,8.526566,8.991116
Rat-GAPDH_5,9.923443,10.188957,10.370885,10.922882,10.819334,10.613005,9.511038,9.847263,10.111858,10.107303,...,10.379840,9.993458,10.705560,11.044461,10.144645,9.949533,9.954214,9.842153,10.088426,10.082981
Rat-PGK1_3,2.849316,3.204162,2.710797,2.904025,3.079639,3.073844,3.135625,3.499832,2.786003,2.844454,...,2.759463,3.272689,3.189556,2.902571,2.766141,2.941688,3.025670,2.826981,2.782573,2.770213


In [199]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

gene_annotation.columns

{'ID': ['A_19_P00315459', 'A_19_P00315492', 'A_19_P00315502', 'A_19_P00315506', 'A_19_P00315538'], 'CONTROL_TYPE': ['FALSE', 'FALSE', 'FALSE', 'FALSE', 'FALSE'], 'SEQUENCE': ['AGCCCCCACTGTTCCACTTATTGTGATGGTTTGTATATCTTTATTTCAAAGAAGATCTGT', 'AGGCAGCCTTGCTGTTGGGGGTTATTGGCAGCTGTTGGGGGTTAGAGACAGGACTCTCAT', 'AGCCGGGATCGGGTTGTTGTTAATTTCTTAAGCAATTTCTAAATTCTGTATTGACTCTCT', 'CAATGGATTCCATGTTTCTTTTTCTTGGGGGGAGCAGGGAGGGAGAAAGGTAGAAAAATG', 'CACAATGACCATCATTGAGGGCGATGTTTATGCTTCCATTGTTAGTTTAGATATTTTGTT'], 'TargetID': [nan, 'Q73P46', 'P01115', nan, nan], 'ncRNA_SeqID': [nan, nan, nan, nan, nan], 'Source': ['Agilent_humanG3V2', 'Agilent_humanG3V2', 'Agilent_humanG3V2', nan, nan], 'ncRNA_Accession': [nan, nan, nan, nan, nan], 'Chr': ['chrX', 'chr4', 'chr10', nan, nan], 'Start': [149131107.0, 129376376.0, 6780785.0, nan, nan], 'End': [149131166.0, 129376435.0, 6780844.0, nan, nan], 'strand': ['+', '+', '+', nan, nan], 'Description': [nan, 'Q73P46_TREDE (Q73P46) Branched-chain amino acid ABC transporter, 

Index(['ID', 'CONTROL_TYPE', 'SEQUENCE', 'TargetID', 'ncRNA_SeqID', 'Source',
       'ncRNA_Accession', 'Chr', 'Start', 'End', 'strand', 'Description',
       'Genome', 'GeneSymbol', 'Seq_type', 'ControlType', 'EntrezGeneID',
       'GenbankAccession', 'GeneName', 'Go', 'GB_ACC', 'UniGeneID', 'SPOT_ID'],
      dtype='object')
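The prompt above asks a language model to pick `identifier_key`. A simple cross-check is to count, for each annotation column, how many of its values also appear in the row index of the expression matrix, and take the column with the largest overlap. This is a sketch only: `guess_identifier_key` is a hypothetical helper, the gene symbols in the toy annotation are placeholders, and the heuristic says nothing about `gene_symbol_key`, which still has to be chosen by reading the annotation.

```python
import pandas as pd

def guess_identifier_key(annotation: pd.DataFrame, row_ids) -> str:
    """Return the annotation column whose values overlap most with the row IDs."""
    ids = set(map(str, row_ids))
    overlap = {
        col: int(annotation[col].astype(str).isin(ids).sum())
        for col in annotation.columns
    }
    return max(overlap, key=overlap.get)

# Toy annotation: probe IDs follow the format shown above; symbols are placeholders.
annotation = pd.DataFrame({
    'ID': ['A_19_P00315459', 'A_19_P00315492', 'A_19_P00315502'],
    'GeneSymbol': [None, 'GENE_X', 'GENE_Y'],
})
row_ids = ['A_19_P00315459', 'A_19_P00315502']
print(guess_identifier_key(annotation, row_ids))
```

If the first rows of the matrix are control probes that have no annotation entry, it helps to pass the full index rather than only the first 20 identifiers.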

In [200]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'GeneSymbol'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)

In [201]:
genetic_data = normalize_gene_symbols_in_index(genetic_data)

genetic_data

Unnamed: 0,GSM2067403,GSM2067404,GSM2067405,GSM2067406,GSM2067407,GSM2067408,GSM2067409,GSM2067410,GSM2067411,GSM2067412,...,GSM2067453,GSM2067454,GSM2067455,GSM2067456,GSM2067457,GSM2067458,GSM2067459,GSM2067460,GSM2067461,GSM2067462
A1BG,6.714729,6.119238,6.588651,6.389377,6.489448,6.229496,5.642367,5.565992,6.170604,6.437280,...,6.436784,6.354268,6.692356,6.509643,7.310910,6.487669,6.933303,6.728533,6.522120,6.859007
A1CF,3.938725,3.857275,4.320075,4.589062,4.668073,4.644432,6.507755,5.445320,3.228480,4.336737,...,4.451320,4.461054,4.306510,3.974989,3.481818,4.010231,4.262363,4.172008,3.728118,3.784718
A2M,4.105190,5.898689,5.163733,5.968357,4.846476,5.877680,5.523924,6.222440,6.007900,6.735513,...,6.042601,5.190834,5.677143,5.098562,5.875097,5.961306,5.425658,5.313376,6.113496,5.387186
A4GALT,13.030699,12.865675,13.568920,12.454192,13.582437,12.652796,12.800940,11.776719,12.721724,12.280159,...,12.211297,11.890785,12.882605,12.581439,11.951515,11.085049,12.906422,12.599278,12.247601,12.555675
A4GNT,2.704922,2.733084,3.909408,2.893169,2.692215,3.176106,3.038265,4.287126,2.768476,2.787102,...,3.411104,3.899433,4.059085,2.905740,2.999079,3.518736,3.275482,2.708914,2.969698,3.287521
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZXDC,6.206871,6.407975,6.309663,6.479846,6.401652,6.667216,6.080510,5.717164,6.664669,5.979232,...,6.647157,6.405932,6.357309,6.456726,6.496564,6.893305,6.206098,6.251587,6.796412,6.615562
ZYG11A,3.020767,3.026273,3.113522,2.773175,2.981554,3.513956,3.391591,3.025322,3.290877,3.548486,...,2.735566,2.748930,2.876586,2.828128,2.962675,2.874528,2.942508,3.252552,2.821534,3.412952
ZYG11B,5.951725,6.141165,5.639593,6.538147,5.849266,7.026554,5.296118,5.633491,6.056791,6.068389,...,6.347763,6.040958,6.040609,6.360076,6.197123,5.809800,5.440079,5.502089,6.425600,6.460133
ZYX,12.271689,12.290836,12.674829,13.083159,12.931067,12.483398,12.510457,12.603576,13.032246,13.294943,...,13.338604,12.802153,12.110378,12.578838,12.872514,12.954499,12.288324,12.264850,13.421458,12.969342


In [202]:
merged_data = geo_merge_clinical_genetic_data(selected_clinical_data, genetic_data)
# Preprocessing completed without errors, so this cohort is marked as available
is_available = True

merged_data

Unnamed: 0,Anxiety disorder,Age,Gender,A1BG,A1CF,A2M,A4GALT,A4GNT,AAAS,AACS,...,ZSWIM6,ZSWIM7,ZW10,ZWILCH,ZXDA,ZXDC,ZYG11A,ZYG11B,ZYX,ZZZ3
GSM2067403,1.0,25.0,0.0,6.714729,3.938725,4.10519,13.030699,2.704922,11.219065,6.193231,...,8.472954,4.146875,5.884128,3.879921,7.125728,6.206871,3.020767,5.951725,12.271689,3.938076
GSM2067404,1.0,23.0,1.0,6.119238,3.857275,5.898689,12.865675,2.733084,10.71892,5.529508,...,8.837225,5.017093,6.116274,4.135404,7.121595,6.407975,3.026273,6.141165,12.290836,4.905338
GSM2067405,1.0,18.0,1.0,6.588651,4.320075,5.163733,13.56892,3.909408,11.084585,6.278926,...,8.916005,5.16664,5.942174,2.728597,7.344893,6.309663,3.113522,5.639593,12.674829,4.189789
GSM2067406,1.0,26.0,0.0,6.389377,4.589062,5.968357,12.454192,2.893169,10.745497,5.896101,...,8.870725,4.644601,5.981628,4.986273,7.135086,6.479846,2.773175,6.538147,13.083159,5.28
GSM2067407,1.0,27.0,1.0,6.489448,4.668073,4.846476,13.582437,2.692215,10.799071,6.265001,...,8.773541,3.564052,6.142061,4.55893,7.214107,6.401652,2.981554,5.849266,12.931067,4.669816
GSM2067408,1.0,19.0,0.0,6.229496,4.644432,5.87768,12.652796,3.176106,9.883424,6.28517,...,8.755016,5.112708,6.385875,4.67687,6.74288,6.667216,3.513956,7.026554,12.483398,5.285511
GSM2067409,1.0,22.0,0.0,5.642367,6.507755,5.523924,12.80094,3.038265,10.526562,6.054632,...,8.459496,5.665781,6.182136,4.099351,6.705152,6.08051,3.391591,5.296118,12.510457,5.079488
GSM2067410,1.0,27.0,1.0,5.565992,5.44532,6.22244,11.776719,4.287126,10.491643,6.232655,...,8.678071,4.061905,5.991161,4.052005,7.071524,5.717164,3.025322,5.633491,12.603576,4.814769
GSM2067411,1.0,18.0,0.0,6.170604,3.22848,6.0079,12.721724,2.768476,11.095754,6.630271,...,8.630165,4.591509,6.099095,3.889528,6.957309,6.664669,3.290877,6.056791,13.032246,4.740775
GSM2067412,1.0,25.0,0.0,6.43728,4.336737,6.735513,12.280159,2.787102,10.937666,6.262236,...,8.848257,4.779469,6.101059,4.707293,7.179975,5.979232,3.548486,6.068389,13.294943,4.656932
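The merged table has one row per sample, with the clinical columns first and the gene columns after them. A rough sketch of how such a table can be built from two features-by-samples tables is shown below. It is not the code of `geo_merge_clinical_genetic_data`, whose details (for example how it handles missing values) are not shown in this notebook. The function name and toy values are assumptions.

```python
import pandas as pd

def merge_clinical_and_genetic(clinical: pd.DataFrame, genetic: pd.DataFrame) -> pd.DataFrame:
    """Join two features x samples tables into one samples x features table."""
    # Transpose both tables so that samples become rows, then align on sample ID.
    merged = clinical.T.join(genetic.T, how='inner')
    return merged.astype(float)

# Toy inputs, made up for illustration; columns are sample IDs.
clinical = pd.DataFrame(
    {'S1': [1, 25, 0], 'S2': [0, 43, 1]},
    index=['Trait', 'Age', 'Gender'],
)
genetic = pd.DataFrame(
    {'S1': [6.7, 3.9], 'S2': [6.4, 4.5]},
    index=['A1BG', 'A1CF'],
)
merged = merge_clinical_and_genetic(clinical, genetic)
print(merged)
```

The inner join keeps only the samples present in both tables, so a sample that lacks either clinical or expression data is dropped.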


In [203]:
print(f"The merged dataset contains {len(merged_data)} samples.")
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

The merged dataset contains 60 samples.
For the feature 'Anxiety disorder', the least common label is '1.0' with 30 occurrences. This represents 50.00% of the dataset.
The distribution of the feature 'Anxiety disorder' in this dataset is fine.

Quartiles for 'Age':
  25%: 18.75
  50% (Median): 27.0
  75%: 35.0
Min: 15.0
Max: 60.0
The distribution of the feature 'Age' in this dataset is fine.

For the feature 'Gender', the least common label is '1.0' with 20 occurrences. This represents 33.33% of the dataset.
The distribution of the feature 'Gender' in this dataset is fine.



False
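For a continuous feature such as age, the report above lists quartiles and the range. The same summary can be obtained with pandas, as sketched below. `describe_continuous` is a hypothetical helper, the ages are toy values, and the rule that `judge_and_remove_biased_features` uses to decide whether a continuous feature is biased is not visible here, so none is implemented.

```python
import pandas as pd

def describe_continuous(values: pd.Series) -> dict:
    """Return quartiles, minimum, and maximum of a numeric column."""
    values = values.dropna()
    q = values.quantile([0.25, 0.5, 0.75])
    return {
        '25%': q.loc[0.25],
        '50%': q.loc[0.5],
        '75%': q.loc[0.75],
        'min': values.min(),
        'max': values.max(),
    }

# Toy ages, made up for illustration.
ages = pd.Series([18, 22, 25, 27, 56])
stats = describe_continuous(ages)
print(stats)
```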

In [204]:
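# Record the cohort's availability and bias status in the shared JSON file
if is_available:
    save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data, note='')
else:
    save_cohort_info(cohort, JSON_PATH, is_available)
# Save the merged table only if the cohort is available and the trait is not biased
if is_available and not is_trait_biased:
    merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)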

In [108]:
# Stopped: no sample characteristics row can be converted to the trait
from utils import *
cohort = accession_num = "GSE94119"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Gene expression and response to psychological therapy"
!Series_summary	"This study represents the first investigation of genome-wide expression profiles with respect to psychological treatment outcome. Participants (n=102) with panic disorder or specific phobia received exposure-based CBT. Treatment outcome was defined as percentage reduction from baseline in clinician-rated severity of their primary anxiety diagnosis at post-treatment and six month follow-up. Gene expression was determined from whole blood samples at 3 time-points using the Illumina HT-12v4 BeadChip microarray. No changes in gene expression were significantly associated with treatment outcomes when correcting for multiple testing (q<0.05), although a small number of genes showed a suggestive association with treatment outcome (q<0.5, n=20). Study reports suggestive evidence for the role of a small number of genes in treatment outcome. Although preliminary, the findings contribute to a growing body of re

Unnamed: 0,!Sample_geo_accession,GSM2469746,GSM2469747,GSM2469748,GSM2469749,GSM2469750,GSM2469751,GSM2469752,GSM2469753,GSM2469754,...,GSM2470051,GSM2470052,GSM2470053,GSM2470054,GSM2470055,GSM2470056,GSM2470057,GSM2470058,GSM2470059,GSM2470060
0,!Sample_characteristics_ch1,gender: FEMALE,gender: FEMALE,gender: FEMALE,gender: MALE,gender: MALE,gender: MALE,gender: MALE,gender: MALE,gender: MALE,...,gender: MALE,gender: MALE,gender: FEMALE,gender: FEMALE,gender: FEMALE,gender: MALE,gender: FEMALE,gender: MALE,gender: FEMALE,gender: MALE
1,!Sample_characteristics_ch1,tissue: Blood,tissue: Blood,tissue: Blood,tissue: Blood,tissue: Blood,tissue: Blood,tissue: Blood,tissue: Blood,tissue: Blood,...,tissue: Blood,tissue: Blood,tissue: Blood,tissue: Blood,tissue: Blood,tissue: Blood,tissue: Blood,tissue: Blood,tissue: Blood,tissue: Blood
2,!Sample_characteristics_ch1,timepoint: pre,timepoint: post,timepoint: follow-up,timepoint: pre,timepoint: post,timepoint: follow-up,timepoint: pre,timepoint: post,timepoint: follow-up,...,timepoint: pre,timepoint: post,timepoint: follow-up,timepoint: pre,timepoint: pre,timepoint: follow-up,timepoint: pre,timepoint: pre,timepoint: post,timepoint: follow-up


In [109]:
cohort = accession_num = "GSE68526"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Peripheral blood transcriptome profiles from an RNA Pilot Study within the United States Health and Retirement Study (HRS)"
!Series_summary	"Individual differences in peripheral blood transcriptomes in older adults as a function of demographic, socio-economic, psychological, and health history characteristics."
!Series_overall_design	"Gene expression profiling was carried out on peripheral blood RNA samples collected from 121 community dwelling older adults participating in the United States Health and Retirement Study.  In addition to basic demographic characteristics (age, sex, race/ethnicity), participants were also assessed on health-related characteristics (body mass index/BMI; history of smoking or heavy alcohol consumption; history of chronic illnesses such as diabetes, cardiovascular disease, cancer, stroke), household annual income (log transformed), and measures of loneliness (UCLA Loneliness Scale; Russell D, Peplau LA, Cutrona CE: The revised UCLA Loneliness 

Unnamed: 0,!Sample_geo_accession,GSM1674313,GSM1674314,GSM1674315,GSM1674316,GSM1674317,GSM1674318,GSM1674319,GSM1674320,GSM1674321,...,GSM1674424,GSM1674425,GSM1674426,GSM1674427,GSM1674428,GSM1674429,GSM1674430,GSM1674431,GSM1674432,GSM1674433
0,!Sample_characteristics_ch1,age (yrs): 79,age (yrs): 79,age (yrs): 76,age (yrs): 70,age (yrs): 65,age (yrs): 64,age (yrs): 75,age (yrs): 70,age (yrs): 66,...,age (yrs): 83,age (yrs): 89,age (yrs): 85,age (yrs): 88,age (yrs): 87,age (yrs): 81,age (yrs): 72,age (yrs): 66,age (yrs): 71,age (yrs): 73
1,!Sample_characteristics_ch1,female: 0,female: 0,female: 0,female: 0,female: 1,female: 1,female: 1,female: 0,female: 1,...,female: 1,female: 0,female: 1,female: 1,female: 0,female: 0,female: 1,female: 1,female: 0,female: 0
2,!Sample_characteristics_ch1,black: 0,black: 0,black: 0,black: 0,black: 0,black: 0,black: 0,black: 0,black: 0,...,black: 0,black: 0,black: 0,black: 0,black: 0,black: 0,black: 1,black: 0,black: 0,black: 0
3,!Sample_characteristics_ch1,hispanic: 0,hispanic: 0,hispanic: 0,hispanic: 0,hispanic: 0,hispanic: 0,hispanic: 0,hispanic: 0,hispanic: 0,...,hispanic: 0,hispanic: 0,hispanic: 0,hispanic: 0,hispanic: 0,hispanic: 0,hispanic: 0,hispanic: 0,hispanic: 0,hispanic: 0
4,!Sample_characteristics_ch1,bmi: 22.7,bmi: 29.1,bmi: 25.8,bmi: 24.8,bmi: 42.1,bmi: 29.6,bmi: 21.4,bmi: 32.7,bmi: 30.7,...,bmi: 27.4,bmi: 23.6,bmi: 22.3,bmi: 18.6,bmi: 24.2,bmi: 24.5,bmi: 23.2,bmi: 24.4,bmi: 23.5,bmi: 29.0
5,!Sample_characteristics_ch1,diabcvdcastr: 1,diabcvdcastr: 1,diabcvdcastr: 1,diabcvdcastr: 0,diabcvdcastr: 1,diabcvdcastr: 1,diabcvdcastr: 0,diabcvdcastr: 1,diabcvdcastr: 0,...,diabcvdcastr: 1,diabcvdcastr: 1,diabcvdcastr: 1,diabcvdcastr: 1,diabcvdcastr: 0,diabcvdcastr: 1,diabcvdcastr: 0,diabcvdcastr: 1,diabcvdcastr: 1,diabcvdcastr: 0
6,!Sample_characteristics_ch1,ln_hh_income: 16.03,ln_hh_income: 15.49,ln_hh_income: 15.34,ln_hh_income: 15.52,ln_hh_income: 16.41,ln_hh_income: 15.34,ln_hh_income: 16.03,ln_hh_income: 14.20,ln_hh_income: 15.49,...,ln_hh_income: 14.03,ln_hh_income: 13.95,ln_hh_income: 13.96,ln_hh_income: 14.10,ln_hh_income: 13.96,ln_hh_income: 15.58,ln_hh_income: 14.35,ln_hh_income: 16.01,ln_hh_income: 17.90,ln_hh_income: 15.93
7,!Sample_characteristics_ch1,smoke: 1,smoke: 1,smoke: 1,smoke: 1,smoke: 0,smoke: 0,smoke: 1,smoke: 1,smoke: 1,...,smoke: 1,smoke: 1,smoke: 0,smoke: 0,smoke: 1,smoke: 1,smoke: 0,smoke: 1,smoke: 1,smoke: 1
8,!Sample_characteristics_ch1,alcohol: 0,alcohol: 0,alcohol: 0,alcohol: 1,alcohol: 0,alcohol: 0,alcohol: 0,alcohol: 0,alcohol: 0,...,alcohol: 0,alcohol: missing,alcohol: 0,alcohol: 0,alcohol: 0,alcohol: 0,alcohol: 0,alcohol: 0,alcohol: 0,alcohol: 0
9,!Sample_characteristics_ch1,loneliness: 1.00,loneliness: 1.00,loneliness: 2.00,loneliness: 1.00,loneliness: 1.67,loneliness: 1.33,loneliness: 1.00,loneliness: 1.00,loneliness: 1.00,...,loneliness: 1.33,loneliness: 2.00,loneliness: 1.67,loneliness: 1.00,loneliness: 1.33,loneliness: 1.00,loneliness: 1.67,loneliness: 1.33,loneliness: 1.00,loneliness: 1.00


In [121]:
# Finished
cohort = accession_num = "GSE98793"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Gene expression analysis in whole blood samples obtained from donors diagnosed with MDD compared to healthy controls"
!Series_summary	"This study investigated gene expression changes in whole blood samples obtained from donors diagnosed with major depressive disorder (MDD) compared to healthy controls. Micro-array data were available from whole blood on patients with MDD (N=128, 64 with generalised anxiety disorder, diagnosed by the MINI questionnaire, and 64 without anxiety disorder) and healthy controls (N=64). RNA was isolated from all samples using the standard PAXgene protocol on the Qiagen Biorobot 8000. All samples gave good quality RNA, as assessed by Agilent Bioanalyser. The yield range was 0.86-15.05ug with an average of 6.25ug. Samples were then randomised into batches, with each batch containing a representative number of controls, depression with anxiety and depression without anxiety, and the same ratio of females to males (3:1). 50ng of RNA from each sampl

Unnamed: 0,!Sample_geo_accession,GSM2612096,GSM2612097,GSM2612098,GSM2612099,GSM2612100,GSM2612101,GSM2612102,GSM2612103,GSM2612104,...,GSM2612278,GSM2612279,GSM2612280,GSM2612281,GSM2612282,GSM2612283,GSM2612284,GSM2612285,GSM2612286,GSM2612287
0,!Sample_characteristics_ch1,subject group: CNTL; healthy control,subject group: CNTL; healthy control,subject group: CNTL; healthy control,subject group: CNTL; healthy control,subject group: CNTL; healthy control,subject group: CNTL; healthy control,subject group: CNTL; healthy control,subject group: CNTL; healthy control,subject group: CNTL; healthy control,...,subject group: CNTL; healthy control,subject group: CNTL; healthy control,subject group: CNTL; healthy control,subject group: CASE; major depressive disorder...,subject group: CASE; major depressive disorder...,subject group: CASE; major depressive disorder...,subject group: CNTL; healthy control,subject group: CASE; major depressive disorder...,subject group: CNTL; healthy control,subject group: CNTL; healthy control
1,!Sample_characteristics_ch1,anxiety: no,anxiety: no,anxiety: no,anxiety: no,anxiety: no,anxiety: no,anxiety: no,anxiety: no,anxiety: no,...,anxiety: no,anxiety: no,anxiety: no,anxiety: yes,anxiety: no,anxiety: no,anxiety: no,anxiety: yes,anxiety: no,anxiety: no
2,!Sample_characteristics_ch1,gender: M,gender: F,gender: F,gender: F,gender: M,gender: F,gender: F,gender: F,gender: F,...,gender: F,gender: F,gender: F,gender: F,gender: F,gender: F,gender: F,gender: F,gender: F,gender: F
3,!Sample_characteristics_ch1,age: 35.8,age: 36.9,age: 62,age: 35.7,age: 53.4,age: 52.5,age: 55.2,age: 64.5,age: 52.9,...,age: 66.7,age: 52.6,age: 62.2,age: 48.6,age: 32,age: 62.3,age: 73.1,age: 61.3,age: 31.2,age: 61.2
4,!Sample_characteristics_ch1,tissue: whole blood,tissue: whole blood,tissue: whole blood,tissue: whole blood,tissue: whole blood,tissue: whole blood,tissue: whole blood,tissue: whole blood,tissue: whole blood,...,tissue: whole blood,tissue: whole blood,tissue: whole blood,tissue: whole blood,tissue: whole blood,tissue: whole blood,tissue: whole blood,tissue: whole blood,tissue: whole blood,tissue: whole blood
5,!Sample_characteristics_ch1,batch: 1,batch: 1,batch: 1,batch: 1,batch: 1,batch: 1,batch: 1,batch: 1,batch: 1,...,batch: 2,batch: 2,batch: 2,batch: 2,batch: 2,batch: 2,batch: 2,batch: 2,batch: 2,batch: 2


In [122]:
# Inspect the unique values in row 1 to confirm that it holds the anxiety labels
trait_values = clinical_data.iloc[1]
trait_values.unique()

array(['!Sample_characteristics_ch1', 'anxiety: no', 'anxiety: yes'],
      dtype=object)

In [129]:
is_gene_available = True
trait_row = 1
age_row = 3
gender_row = 2

trait_type = 'binary'

# Verify and use the functions generated by GPT

# This function converts the 'anxiety: yes' / 'anxiety: no' field into a binary label:
# 1 if anxiety is present, 0 otherwise.
def convert_trait(trait_string):
    """
    Convert the anxiety field to a binary value.
    'anxiety: yes' is represented as 1; any other value as 0.
    """
    if trait_string == 'anxiety: yes':
        return 1  # Anxiety present
    else:
        return 0  # Anxiety not present

def convert_age(age_string):
    """
    Convert age string to a continuous numerical value.
    Unknown values are converted to None.
    """
    if age_string.lower() == 'n.a.':
        return None
    try:
        # Extract age as a float from the string
        age = float(age_string.split(': ')[1])
        return age
    except (ValueError, IndexError):
        # In case of any format error or unexpected string structure
        return None
    
def convert_gender(gender_string):
    """
    Convert gender string to a binary value.
    'female' is represented as 1, 'male' as 0.
    Unknown values are converted to None.
    """
    value = gender_string.lower()
    if value in ('sex: female', 'sex: f', 'gender: female', 'gender: f'):
        return 1
    elif value in ('sex: male', 'sex: m', 'gender: male', 'gender: m'):
        return 0
    else:
        return None
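
The three converters above all parse GEO characteristic strings of the form `key: value`. A minimal sketch of a shared parsing helper, assuming that format holds; `split_field` and `parse_age` are hypothetical names and not part of `utils`:

```python
def split_field(cell):
    """Split a 'key: value' characteristic string into (key, value).

    Returns (None, None) for missing or malformed cells.
    """
    if not isinstance(cell, str) or ': ' not in cell:
        return None, None
    key, value = cell.split(': ', 1)
    return key.strip().lower(), value.strip()


def parse_age(cell):
    """Return the age as a float, or None if the cell is not a numeric age."""
    key, value = split_field(cell)
    if key != 'age':
        return None
    try:
        return float(value)
    except ValueError:
        return None


print(split_field('gender: F'))   # ('gender', 'F')
print(parse_age('age: 35.8'))     # 35.8
print(parse_age('age: n.a.'))     # None
```

Checking the key as well as the value guards against a row index that points at the wrong field.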

In [130]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()



Unnamed: 0,GSM2612096,GSM2612097,GSM2612098,GSM2612099,GSM2612100,GSM2612101,GSM2612102,GSM2612103,GSM2612104,GSM2612105,...,GSM2612278,GSM2612279,GSM2612280,GSM2612281,GSM2612282,GSM2612283,GSM2612284,GSM2612285,GSM2612286,GSM2612287
Anxiety disorder,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
Age,35.8,36.9,62.0,35.7,53.4,52.5,55.2,64.5,52.9,69.6,...,66.7,52.6,62.2,48.6,32.0,62.3,73.1,61.3,31.2,61.2
Gender,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [131]:
genetic_data = get_genetic_data(matrix_file)
genetic_data

ID,GSM2612096,GSM2612097,GSM2612098,GSM2612099,GSM2612100,GSM2612101,GSM2612102,GSM2612103,GSM2612104,GSM2612105,...,GSM2612278,GSM2612279,GSM2612280,GSM2612281,GSM2612282,GSM2612283,GSM2612284,GSM2612285,GSM2612286,GSM2612287
1007_s_at,5.850755,5.577231,5.663056,5.596154,5.242699,5.212191,4.872792,5.272285,5.338772,5.070973,...,6.414110,6.115670,5.679880,6.740236,5.972264,5.808947,5.867953,5.938182,6.000403,5.772250
1053_at,7.092003,6.618856,6.487570,6.565388,6.531346,6.073299,6.414285,6.099957,6.969604,6.837721,...,7.798097,7.802633,7.345928,7.758726,7.580560,7.444604,7.238209,7.513523,7.737975,7.702891
117_at,9.373934,9.315652,8.237757,8.877479,8.148008,8.805715,8.859519,7.861680,9.129116,8.725871,...,9.967406,10.193932,9.737181,10.029855,9.960287,10.432011,10.155011,9.821582,9.522425,10.838637
121_at,5.814709,5.643282,5.363979,5.340978,5.701092,5.293687,5.379093,5.975036,5.661394,5.439686,...,5.572871,5.547584,5.748475,5.630231,5.858684,5.945134,5.631375,5.695992,5.706208,6.016455
1255_g_at,2.728267,2.671652,2.206741,2.998085,2.594565,2.340774,2.754393,2.607237,2.532411,3.130581,...,2.157327,2.264339,2.443638,2.302708,2.523345,2.454866,2.416422,2.298950,2.235815,2.475537
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AFFX-ThrX-5_at,2.413047,2.499112,2.493270,2.176757,2.497286,2.473523,2.415532,2.959415,2.552571,2.484300,...,2.421918,2.395902,2.259330,2.362848,2.301254,2.554583,2.382011,2.411032,2.385052,2.573699
AFFX-ThrX-M_at,3.248725,2.952288,3.036010,3.018452,2.857945,2.867773,2.934452,2.851396,3.078947,2.719665,...,2.702217,2.721138,2.791224,2.854180,2.759138,3.055240,2.988927,2.825169,2.891791,3.178614
AFFX-TrpnX-3_at,2.054429,2.049808,1.899813,2.431576,2.212574,2.339308,2.407863,1.997606,2.412021,2.173443,...,1.929942,1.947370,1.879175,2.075299,1.967994,2.045717,1.961351,2.030025,1.918269,2.202680
AFFX-TrpnX-5_at,3.187857,2.929797,2.686815,2.938610,2.973220,2.737518,2.637788,2.782907,3.149515,2.969094,...,2.638295,2.821802,2.744246,2.854341,2.791133,3.237114,2.943271,2.863169,2.741495,3.075356


In [132]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

gene_annotation

{'ID': ['1007_s_at', '1053_at', '117_at', '121_at', '1255_g_at'], 'GB_ACC': ['U48705', 'M87338', 'X51757', 'X69699', 'L36861'], 'SPOT_ID': [nan, nan, nan, nan, nan], 'Species Scientific Name': ['Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens'], 'Annotation Date': ['Oct 6, 2014', 'Oct 6, 2014', 'Oct 6, 2014', 'Oct 6, 2014', 'Oct 6, 2014'], 'Sequence Type': ['Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence'], 'Sequence Source': ['Affymetrix Proprietary Database', 'GenBank', 'Affymetrix Proprietary Database', 'GenBank', 'Affymetrix Proprietary Database'], 'Target Description': ['U48705 /FEATURE=mRNA /DEFINITION=HSU48705 Human receptor tyrosine kinase DDR gene, complete cds', 'M87338 /FEATURE= /DEFINITION=HUMA1SBU Human replication factor C, 40-kDa subunit (A1) mRNA, complete cds', "X51757 /FEATURE=cds /DEFINITION=HSP70B Human heat-shock protein HSP70B' gene", 'X69699 /FEATURE= /DEFINITION=HSPAX8A H.sapiens

Unnamed: 0,ID,GB_ACC,SPOT_ID,Species Scientific Name,Annotation Date,Sequence Type,Sequence Source,Target Description,Representative Public ID,Gene Title,Gene Symbol,ENTREZ_GENE_ID,RefSeq Transcript ID,Gene Ontology Biological Process,Gene Ontology Cellular Component,Gene Ontology Molecular Function
0,1007_s_at,U48705,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,Affymetrix Proprietary Database,U48705 /FEATURE=mRNA /DEFINITION=HSU48705 Huma...,U48705,discoidin domain receptor tyrosine kinase 1 //...,DDR1 /// MIR4640,780 /// 100616237,NM_001202521 /// NM_001202522 /// NM_001202523...,0001558 // regulation of cell growth // inferr...,0005576 // extracellular region // inferred fr...,0000166 // nucleotide binding // inferred from...
1,1053_at,M87338,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,GenBank,M87338 /FEATURE= /DEFINITION=HUMA1SBU Human re...,M87338,"replication factor C (activator 1) 2, 40kDa",RFC2,5982,NM_001278791 /// NM_001278792 /// NM_001278793...,0000278 // mitotic cell cycle // traceable aut...,0005634 // nucleus // inferred from electronic...,0000166 // nucleotide binding // inferred from...
2,117_at,X51757,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,Affymetrix Proprietary Database,X51757 /FEATURE=cds /DEFINITION=HSP70B Human h...,X51757,heat shock 70kDa protein 6 (HSP70B'),HSPA6,3310,NM_002155,0000902 // cell morphogenesis // inferred from...,0005737 // cytoplasm // inferred from direct a...,0000166 // nucleotide binding // inferred from...
3,121_at,X69699,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,GenBank,X69699 /FEATURE= /DEFINITION=HSPAX8A H.sapiens...,X69699,paired box 8,PAX8,7849,NM_003466 /// NM_013951 /// NM_013952 /// NM_0...,0001655 // urogenital system development // in...,0005634 // nucleus // inferred from direct ass...,0000979 // RNA polymerase II core promoter seq...
4,1255_g_at,L36861,,Homo sapiens,"Oct 6, 2014",Exemplar sequence,Affymetrix Proprietary Database,L36861 /FEATURE=expanded_cds /DEFINITION=HUMGC...,L36861,guanylate cyclase activator 1A (retina),GUCA1A,2978,NM_000409 /// XM_006715073,0007165 // signal transduction // non-traceabl...,0001750 // photoreceptor outer segment // infe...,0005509 // calcium ion binding // inferred fro...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10552462,AFFX-ThrX-5_at,2.573698544,,,,,,,,,,,,,,
10552463,AFFX-ThrX-M_at,3.178614051,,,,,,,,,,,,,,
10552464,AFFX-TrpnX-3_at,2.202680092,,,,,,,,,,,,,,
10552465,AFFX-TrpnX-5_at,3.075355563,,,,,,,,,,,,,,


In [133]:
gene_annotation.columns

Index(['ID', 'GB_ACC', 'SPOT_ID', 'Species Scientific Name', 'Annotation Date',
       'Sequence Type', 'Sequence Source', 'Target Description',
       'Representative Public ID', 'Gene Title', 'Gene Symbol',
       'ENTREZ_GENE_ID', 'RefSeq Transcript ID',
       'Gene Ontology Biological Process', 'Gene Ontology Cellular Component',
       'Gene Ontology Molecular Function'],
      dtype='object')

In [134]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'Gene Symbol'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)
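
The internals of `get_gene_mapping` and `apply_gene_mapping` are not shown here. As a rough illustration of what such a step may involve, the sketch below maps probe IDs to gene symbols and averages the probes that share a symbol. Splitting multi-gene annotations on `///` (as in `DDR1 /// MIR4640`) and averaging are assumptions made for illustration, not necessarily what `utils` does; `map_probes_to_genes` and `probe_x` are hypothetical names.

```python
from collections import defaultdict


def map_probes_to_genes(expression, probe_to_symbol):
    """Aggregate probe-level rows into gene-level rows.

    expression: dict of probe ID -> list of per-sample values
    probe_to_symbol: dict of probe ID -> annotation string, e.g. 'DDR1 /// MIR4640'
    Returns a dict of gene symbol -> list of per-sample means over its probes.
    """
    rows_by_gene = defaultdict(list)
    for probe, values in expression.items():
        annotation = probe_to_symbol.get(probe)
        if not annotation:
            continue  # probe has no gene annotation; drop it
        for symbol in annotation.split('///'):
            symbol = symbol.strip()
            if symbol:
                rows_by_gene[symbol].append(values)

    # Average column-wise across all probes assigned to the same gene
    return {
        gene: [sum(col) / len(col) for col in zip(*rows)]
        for gene, rows in rows_by_gene.items()
    }


toy_expression = {
    '1007_s_at': [5.0, 6.0],
    '1053_at': [7.0, 8.0],
    'probe_x': [9.0, 10.0],   # made-up probe sharing a gene with 1053_at
}
toy_mapping = {
    '1007_s_at': 'DDR1 /// MIR4640',
    '1053_at': 'RFC2',
    'probe_x': 'RFC2',
}
gene_level = map_probes_to_genes(toy_expression, toy_mapping)
print(gene_level)
```

Under this scheme, a probe annotated with several symbols contributes the same values to each of them, so two genes covered only by one shared probe end up with identical rows.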

In [135]:
genetic_data

Gene,GSM2612096,GSM2612097,GSM2612098,GSM2612099,GSM2612100,GSM2612101,GSM2612102,GSM2612103,GSM2612104,GSM2612105,...,GSM2612278,GSM2612279,GSM2612280,GSM2612281,GSM2612282,GSM2612283,GSM2612284,GSM2612285,GSM2612286,GSM2612287
ABCB4,6.887209,6.181964,6.555487,5.873066,5.745004,6.641166,6.464039,6.133586,6.274664,6.054550,...,7.150558,7.406669,7.529819,6.864007,7.001945,7.050386,7.087407,6.896694,7.360574,7.115658
ABCC6P1,3.770740,3.849153,3.587279,3.026468,2.917429,3.113438,2.945742,3.397430,2.757350,3.446217,...,4.452625,5.330988,4.461095,3.202160,3.611992,4.006498,4.670841,4.149561,3.908489,3.980791
ABCC6P2,3.770740,3.849153,3.587279,3.026468,2.917429,3.113438,2.945742,3.397430,2.757350,3.446217,...,4.452625,5.330988,4.461095,3.202160,3.611992,4.006498,4.670841,4.149561,3.908489,3.980791
ABCD1P2,3.671268,3.251920,2.888037,3.098870,3.274688,3.004873,2.793291,3.240455,3.303735,3.518894,...,3.088957,3.423271,3.088001,3.194958,3.352198,3.436326,3.176444,3.377824,3.152157,3.445710
AC078883.4,4.540082,4.221583,4.406263,3.895509,4.212557,3.518138,3.875489,4.169540,4.052477,4.282009,...,4.374490,4.178730,4.138277,4.331086,4.361535,4.509868,4.303939,4.293775,4.281061,4.637451
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
abParts,6.840695,6.761038,8.097322,7.367817,5.648297,7.888521,6.200375,6.745986,6.421771,7.483332,...,7.742626,8.184422,8.277556,7.479552,8.518359,7.365643,8.075855,7.948421,8.453753,7.860338
alpha,7.993384,7.859073,7.443326,7.288378,5.391930,7.254210,6.638218,7.073584,7.923679,7.498770,...,6.715298,9.210556,8.251305,6.888582,8.485592,7.241490,8.236890,8.081402,8.295665,8.086276
av27s1,3.267696,5.357518,4.489498,5.203069,5.114636,4.838910,4.514805,4.865177,4.920292,4.544280,...,4.951748,5.418962,5.442874,5.258984,5.787806,4.993456,4.926423,6.270793,5.380123,4.624921
hsa-let-7a-3,4.173511,3.739451,3.689065,3.757212,4.251514,4.259409,3.850559,4.035009,4.156332,3.819965,...,3.731068,3.860831,3.787460,3.719467,3.674982,3.980256,3.852515,3.792280,3.986952,3.899352


In [136]:
genetic_data = normalize_gene_symbols_in_index(genetic_data)

In [137]:
genetic_data

Unnamed: 0,GSM2612096,GSM2612097,GSM2612098,GSM2612099,GSM2612100,GSM2612101,GSM2612102,GSM2612103,GSM2612104,GSM2612105,...,GSM2612278,GSM2612279,GSM2612280,GSM2612281,GSM2612282,GSM2612283,GSM2612284,GSM2612285,GSM2612286,GSM2612287
A1BG,3.678218,3.286122,2.978241,3.416928,3.915461,3.622674,3.348873,3.407293,3.475643,3.588239,...,3.845511,3.589142,3.502773,3.322457,3.618603,3.791884,3.453535,3.363853,3.454187,3.765669
A1BG-AS1,6.443778,6.439185,6.091514,6.160160,6.262001,6.650213,6.046788,6.930769,6.418877,6.679607,...,6.482534,6.767941,6.677439,6.602510,6.787397,7.079915,6.603517,6.683437,6.562028,7.033624
A1CF,3.282722,2.779510,2.701834,3.136804,2.722878,2.927107,2.806098,2.691950,2.971447,2.958255,...,2.643486,3.209170,2.591007,2.900855,2.649972,3.033433,2.871597,2.804416,2.773500,3.082072
A2M,4.952460,4.575449,4.801654,4.814918,4.836391,4.607112,4.566239,4.528694,4.940149,4.757857,...,4.389326,4.522569,4.750834,4.543339,4.872960,4.662414,4.716106,4.468409,4.588417,4.876001
A2M-AS1,8.313736,7.442635,8.522661,8.584944,8.582563,8.080346,8.787317,8.089630,7.931064,7.702357,...,7.133639,7.295100,8.266381,6.530439,7.978626,7.685621,7.938541,7.261100,8.017463,8.066153
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZYG11A,4.472377,4.687464,4.928949,4.379016,4.538363,4.160522,4.396658,4.667440,4.280527,3.729029,...,3.390104,3.393494,4.065496,3.282047,3.713914,3.856949,3.856006,3.690631,3.736156,4.090386
ZYG11B,7.702434,8.273578,8.332399,8.055039,8.177309,8.418922,8.559867,7.807338,8.284237,7.906655,...,8.116576,8.004243,7.810824,7.475349,7.764784,7.792606,7.508345,7.978566,7.914982,7.472140
ZYX,9.922776,9.761252,8.414962,9.092369,8.967648,9.341418,8.551150,9.326148,9.529829,9.574537,...,10.017479,10.863439,9.639776,10.481763,10.356037,10.074941,9.985552,10.279970,9.808788,10.101097
ZZEF1,6.453670,6.335933,6.446376,6.527935,6.428279,6.555135,6.586912,6.558270,6.884512,6.341421,...,6.501560,6.853390,6.646841,6.643384,6.619929,6.850845,6.910645,6.938110,6.824424,6.589108


In [138]:
merged_data = geo_merge_clinical_genetic_data(selected_clinical_data, genetic_data)
# The preprocessing runs through, which means is_available should be True
is_available = True

merged_data

Unnamed: 0,Anxiety disorder,Age,Gender,A1BG,A1BG-AS1,A1CF,A2M,A2M-AS1,A2ML1,A2MP1,...,ZWILCH,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3
GSM2612096,0.0,35.8,0.0,3.678218,6.443778,3.282722,4.952460,8.313736,2.896964,5.992351,...,4.365722,6.239195,5.691087,3.869996,6.513365,4.472377,7.702434,9.922776,6.453670,4.693944
GSM2612097,0.0,36.9,1.0,3.286122,6.439185,2.779510,4.575449,7.442635,2.811418,6.043174,...,4.564470,5.821275,5.678074,4.446895,6.941676,4.687464,8.273578,9.761252,6.335933,5.581965
GSM2612098,0.0,62.0,1.0,2.978241,6.091514,2.701834,4.801654,8.522661,2.564243,6.885544,...,5.097691,7.184773,6.748653,5.092760,7.343632,4.928949,8.332399,8.414962,6.446376,6.012590
GSM2612099,0.0,35.7,1.0,3.416928,6.160160,3.136804,4.814918,8.584944,2.754305,6.822452,...,5.084126,6.549417,5.525266,4.292599,7.140479,4.379016,8.055039,9.092369,6.527935,5.716714
GSM2612100,0.0,53.4,0.0,3.915461,6.262001,2.722878,4.836391,8.582563,3.218403,7.374226,...,4.983173,6.058277,6.216823,4.518410,6.689277,4.538363,8.177309,8.967648,6.428279,5.297856
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GSM2612283,0.0,62.3,1.0,3.791884,7.079915,3.033433,4.662414,7.685621,2.400548,4.939600,...,4.744505,5.960281,5.231191,4.844246,7.182306,3.856949,7.792606,10.074941,6.850845,5.711944
GSM2612284,0.0,73.1,1.0,3.453535,6.603517,2.871597,4.716106,7.938541,2.448653,5.530573,...,4.965975,6.443442,5.754605,4.754218,7.200340,3.856006,7.508345,9.985552,6.910645,5.948168
GSM2612285,1.0,61.3,1.0,3.363853,6.683437,2.804416,4.468409,7.261100,2.562233,5.239228,...,4.895647,6.270400,5.716105,4.821335,7.223925,3.690631,7.978566,10.279970,6.938110,5.699622
GSM2612286,0.0,31.2,1.0,3.454187,6.562028,2.773500,4.588417,8.017463,2.184304,5.584945,...,5.111829,7.238046,5.936044,5.046152,7.297916,3.736156,7.914982,9.808788,6.824424,6.270375


In [139]:
print(f"The merged dataset contains {len(merged_data)} samples.")
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

The merged dataset contains 192 samples.
For the feature 'Anxiety disorder', the least common label is '1.0' with 64 occurrences. This represents 33.33% of the dataset.
The distribution of the feature 'Anxiety disorder' in this dataset is fine.

Quartiles for 'Age':
  25%: 43.875
  50% (Median): 52.6
  75%: 62.05
Min: 31.0
Max: 73.1
The distribution of the feature 'Age' in this dataset is fine.

For the feature 'Gender', the least common label is '0.0' with 48 occurrences. This represents 25.00% of the dataset.
The distribution of the feature 'Gender' in this dataset is fine.



False
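
The report above describes a binary feature by the share of its least common label. The exact rule inside `judge_and_remove_biased_features` is not shown; the sketch below reproduces only the reported proportion and applies a hypothetical cutoff (`min_fraction`) chosen for illustration.

```python
from collections import Counter


def minority_fraction(labels):
    """Return (least common label, its count, its share of all labels)."""
    counts = Counter(labels)
    label, count = min(counts.items(), key=lambda item: item[1])
    return label, count, count / len(labels)


def is_binary_biased(labels, min_fraction=0.1):
    """Flag a binary feature as biased if one class is absent or too rare.

    min_fraction is an assumed threshold, not the value used in utils.
    """
    counts = Counter(labels)
    if len(counts) < 2:
        return True  # only one class present: nothing to contrast against
    return minority_fraction(labels)[2] < min_fraction


# 64 cases among 192 samples, as in the report above
labels = [1.0] * 64 + [0.0] * 128
print(minority_fraction(labels))
print(is_binary_biased(labels))
```

A feature with a single observed class is treated as biased regardless of the threshold, since a regression on it would have no variation to explain.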

In [140]:
if is_available:
    save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data, note='')
else:
    save_cohort_info(cohort, JSON_PATH, is_available)
merged_data.head()
if not is_trait_biased:
    merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)

In [141]:
# Finished 
cohort = accession_num = "GSE61672"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Blood gene expression profiles associated with symptoms of generalized anxiety disorder"
!Series_summary	"Prospective epidemiological studies found that generalized anxiety disorder (GAD) can impair immune function and increase risk for cardiovascular disease or events. Mechanisms underlying the physiological reververations of anxiety, however, are still elusive. Hence, we aimed to investigate molecular processes mediating effects of anxiety on physical health using blood gene expression profiles of 546 community participants. Of these, 179 met the status of controls and 157 cases of anxiety."
!Series_overall_design	"We examined genome-wide differential gene expression in anxiety, as well as associations between nine major modules of co-regulated transcripts in blood gene expression and anxiety. There were a total of 546 subjects."


Unnamed: 0,!Sample_geo_accession,GSM1510561,GSM1510562,GSM1510563,GSM1510564,GSM1510565,GSM1510566,GSM1510567,GSM1510568,GSM1510569,...,GSM1511097,GSM1511098,GSM1511099,GSM1511100,GSM1511101,GSM1511102,GSM1511103,GSM1511104,GSM1511105,GSM1511106
0,!Sample_characteristics_ch1,age: 44,age: 59,age: 44,age: 39,age: 64,age: 58,age: 45,age: 37,age: 40,...,age: 58,age: 58,age: 57,age: 56,age: 54,age: 59,age: 57,age: 56,age: 56,age: 37
1,!Sample_characteristics_ch1,Sex: F,Sex: F,Sex: F,Sex: F,Sex: F,Sex: M,Sex: M,Sex: M,Sex: M,...,body mass index: 39.5,body mass index: 24.9,body mass index: 29.2,body mass index: 22.2,body mass index: 26.6,body mass index: 22.1,body mass index: 28.8,body mass index: 23.9,body mass index: 27.1,body mass index: 20.4
2,!Sample_characteristics_ch1,body mass index: 22.2,body mass index: 33.1,body mass index: 22.4,body mass index: 20.6,body mass index: 27.5,body mass index: 21.9,body mass index: 26.1,body mass index: 34.8,body mass index: 20.8,...,ethnicity: AFR,ethnicity: AFR,ethnicity: CAU,ethnicity: CAU,ethnicity: CAU,ethnicity: CAU,ethnicity: CAU,ethnicity: CAU,ethnicity: CAU,ethnicity: CAU
3,!Sample_characteristics_ch1,ethnicity: CAU,ethnicity: AFR,ethnicity: CAU,ethnicity: CAU,ethnicity: CAU,ethnicity: CAU,ethnicity: CAU,ethnicity: AFR,ethnicity: CAU,...,gad7 score: 6,gad7 score: 1,gad7 score: 3,gad7 score: 8,gad7 score: 4,gad7 score: 7,gad7 score: 12,gad7 score: 7,gad7 score: 4,gad7 score: 6
4,!Sample_characteristics_ch1,gad7 score: 2,gad7 score: 0,gad7 score: 3,gad7 score: 0,gad7 score: 7,gad7 score: 0,gad7 score: 3,gad7 score: 3,gad7 score: 4,...,anxiety case/control: case,anxiety case/control: control,hybridization batch: D,anxiety case/control: case,hybridization batch: D,anxiety case/control: case,anxiety case/control: case,anxiety case/control: case,hybridization batch: D,anxiety case/control: case
5,!Sample_characteristics_ch1,hybridization batch: Z,anxiety case/control: control,hybridization batch: Z,anxiety case/control: control,anxiety case/control: case,anxiety case/control: control,hybridization batch: Z,hybridization batch: Z,hybridization batch: Z,...,hybridization batch: D,hybridization batch: D,rin: 9,hybridization batch: D,rin: 9.6,hybridization batch: D,hybridization batch: D,hybridization batch: D,rin: 9.4,hybridization batch: D
6,!Sample_characteristics_ch1,rin: 8.1,hybridization batch: Z,rin: 7.9,hybridization batch: Z,hybridization batch: Z,hybridization batch: Z,rin: 6.6,rin: 7.3,rin: 6.6,...,rin: 9.4,rin: 9.7,,rin: 9.1,,rin: 9.7,rin: 9.8,rin: 9.5,,rin: 9.4
7,!Sample_characteristics_ch1,,rin: 7.8,,rin: 8.1,rin: 8.1,rin: 6.6,,,,...,,,,,,,,,,


In [144]:
# Inspect the unique values in row 4 to see which fields it actually contains
trait_values = clinical_data.iloc[4]
trait_values.unique()

array(['!Sample_characteristics_ch1', 'gad7 score: 2', 'gad7 score: 0',
       'gad7 score: 3', 'gad7 score: 7', 'gad7 score: 4', 'gad7 score: 9',
       'gad7 score: 1', 'gad7 score: 10', 'gad7 score: 5',
       'gad7 score: 17', 'gad7 score: 6', 'gad7 score: 8',
       'gad7 score: 12', 'gad7 score: 11', 'gad7 score: 14',
       'gad7 score: .', 'hybridization batch: Z', 'gad7 score: 18',
       'hybridization batch: O', 'gad7 score: 13', 'gad7 score: 15',
       'gad7 score: 20', 'gad7 score: 21', 'gad7 score: 19',
       'anxiety case/control: case', 'anxiety case/control: control',
       'hybridization batch: B', nan, 'hybridization batch: C',
       'hybridization batch: D'], dtype=object)
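
The unique values above mix `gad7 score`, `hybridization batch`, and `anxiety case/control` entries in a single row. This suggests that samples with a missing field have their remaining fields shifted up, so a fixed row index does not point at the same variable for every sample. One way to handle this, sketched below under the assumption that every cell has the form `key: value`, is to look fields up by name within each sample. The helper name `fields_by_name` is hypothetical, and the two toy columns are modeled on the first and last visible sample groups.

```python
def fields_by_name(column):
    """Turn one sample's list of 'key: value' cells into a {key: value} dict.

    Cells that are missing (None/NaN) or lack the ': ' separator are skipped.
    """
    fields = {}
    for cell in column:
        if isinstance(cell, str) and ': ' in cell:
            key, value = cell.split(': ', 1)
            fields[key.strip().lower()] = value.strip()
    return fields


# Two toy samples: the second lacks 'Sex', so its later fields sit one row higher
sample_a = ['age: 44', 'Sex: F', 'body mass index: 22.2', 'ethnicity: CAU',
            'gad7 score: 2', 'hybridization batch: Z', 'rin: 8.1', None]
sample_b = ['age: 58', 'body mass index: 39.5', 'ethnicity: AFR',
            'gad7 score: 6', 'anxiety case/control: case',
            'hybridization batch: D', 'rin: 9.4', None]

for column in (sample_a, sample_b):
    record = fields_by_name(column)
    print(record.get('gad7 score'), record.get('anxiety case/control'))
```

Applied to each sample column of `clinical_data`, a lookup like this would let the trait, age, and gender be selected by field name instead of by row position.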

In [145]:
is_gene_available = True
trait_row = 4
age_row = 0
gender_row = 1

trait_type = 'binary'

# Verify and use the functions generated by GPT

# This function is meant to convert the value in the trait row into a binary anxiety label.
# NOTE: the condition below is always true. Only the first comparison involves the input;
# the remaining operands are non-empty string literals, which are truthy. Every sample is
# therefore labelled 1, which is why the trait is reported as 100% biased for this cohort.
def convert_trait(trait_string):
    """
    Convert the trait-row value to a binary anxiety label.
    As written, this returns 1 for every input.
    """
    if (trait_string == 'gad7 score: 2' or 'gad7 score: 3' or 'gad7 score: 4' or 'gad7 score: 1' or 'anxiety case/control: case' or 'anxiety case/control: control'):
        return 1  # Anxiety present
    else:
        return 0  # Anxiety not present (unreachable as written)
    
def convert_age(age_string):
    """
    Convert age string to a continuous numerical value.
    Unknown values are converted to None.
    """
    if age_string.lower() == 'n.a.':
        return None
    try:
        # Extract age as an integer from the string
        age = int(age_string.split(': ')[1])
        return age
    except (ValueError, IndexError):
        # In case of any format error or unexpected string structure
        return None
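
In Python, `x == 'a' or 'b'` evaluates as `(x == 'a') or 'b'`, and a non-empty string is truthy, so a comparison followed by bare string literals joined with `or` is always true. A membership test avoids this. The sketch below assumes the intended label is the `anxiety case/control` field (case as 1, control as 0) and that all other values should be treated as missing. That mapping is an assumption, not something the notebook confirms, and `convert_trait_case_control` is a hypothetical name.

```python
def convert_trait_case_control(value):
    """Map the 'anxiety case/control' field to 1 (case) or 0 (control).

    Any other value, including GAD-7 scores, batch labels, and NaN,
    is returned as None so it can be dropped downstream.
    """
    mapping = {
        'anxiety case/control: case': 1,
        'anxiety case/control: control': 0,
    }
    if not isinstance(value, str):
        return None
    return mapping.get(value.strip().lower())


# The always-true pattern versus a membership test
value = 'hybridization batch: Z'
print(bool(value == 'gad7 score: 2' or 'gad7 score: 3'))   # True
print(value in ('gad7 score: 2', 'gad7 score: 3'))         # False
print(convert_trait_case_control('anxiety case/control: case'))   # 1
```

Returning `None` for unrelated values keeps shifted rows from being counted as cases or controls by accident.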

In [146]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()



Unnamed: 0,GSM1510561,GSM1510562,GSM1510563,GSM1510564,GSM1510565,GSM1510566,GSM1510567,GSM1510568,GSM1510569,GSM1510570,...,GSM1511097,GSM1511098,GSM1511099,GSM1511100,GSM1511101,GSM1511102,GSM1511103,GSM1511104,GSM1511105,GSM1511106
Anxiety disorder,1,1,1,1,1,1,1,1,1,1,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
Age,44,59,44,39,64,58,45,37,40,39,...,58.0,58.0,57.0,56.0,54.0,59.0,57.0,56.0,56.0,37.0
Gender,1,1,1,1,1,0,0,0,0,1,...,,,,,,,,,,


In [147]:
genetic_data = get_genetic_data(matrix_file)
genetic_data

ID,GSM1510561,GSM1510562,GSM1510563,GSM1510564,GSM1510565,GSM1510566,GSM1510567,GSM1510568,GSM1510569,GSM1510570,...,GSM1511097,GSM1511098,GSM1511099,GSM1511100,GSM1511101,GSM1511102,GSM1511103,GSM1511104,GSM1511105,GSM1511106
ILMN_1343291,14.808,14.763,14.774,14.835,14.535,14.614,14.704,14.710,14.804,14.864,...,14.705,14.665,14.497,14.610,14.774,14.790,14.751,14.630,14.668,14.583
ILMN_1343295,9.518,9.386,9.254,9.774,9.406,9.295,9.527,9.304,9.177,9.303,...,8.862,9.346,8.962,9.040,8.982,9.450,8.852,8.930,9.035,8.847
ILMN_1651228,11.092,11.183,10.725,11.114,10.845,10.658,11.262,11.182,11.011,11.466,...,12.031,11.951,11.716,12.017,12.302,11.970,11.917,12.106,12.102,12.138
ILMN_1651229,6.451,6.321,6.434,6.502,6.655,6.618,6.666,6.409,6.364,6.391,...,6.523,6.662,6.649,6.466,6.686,6.609,6.529,6.416,6.303,6.484
ILMN_1651254,8.407,8.242,8.190,8.203,8.631,8.831,8.525,8.238,8.501,8.189,...,8.633,8.625,8.288,8.253,8.224,8.571,8.228,8.230,8.217,8.247
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ILMN_2415898,6.892,7.133,7.111,6.744,7.353,6.868,7.049,7.100,6.962,7.145,...,7.091,7.592,7.302,7.331,7.106,7.321,7.388,7.422,7.440,7.300
ILMN_2415911,5.529,5.597,5.637,5.653,5.714,5.578,5.613,5.850,5.849,5.779,...,5.690,5.525,5.547,5.623,5.666,5.631,5.580,5.802,5.649,5.655
ILMN_2415926,5.388,5.371,5.280,5.871,5.533,5.319,5.286,5.447,5.326,5.714,...,5.259,5.458,5.192,4.991,5.199,5.009,5.299,5.108,5.194,5.255
ILMN_2415949,6.907,6.995,6.738,7.140,6.872,6.813,7.040,6.790,6.908,6.730,...,6.857,6.740,6.866,6.781,6.892,6.756,6.846,6.924,6.849,6.757


In [148]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

gene_annotation.columns

{'ID': ['ILMN_1343048', 'ILMN_1343049', 'ILMN_1343050', 'ILMN_1343052', 'ILMN_1343059'], 'Species': [nan, nan, nan, nan, nan], 'Source': [nan, nan, nan, nan, nan], 'Search_Key': [nan, nan, nan, nan, nan], 'Transcript': [nan, nan, nan, nan, nan], 'ILMN_Gene': [nan, nan, nan, nan, nan], 'Source_Reference_ID': [nan, nan, nan, nan, nan], 'RefSeq_ID': [nan, nan, nan, nan, nan], 'Unigene_ID': [nan, nan, nan, nan, nan], 'Entrez_Gene_ID': [nan, nan, nan, nan, nan], 'GI': [nan, nan, nan, nan, nan], 'Accession': [nan, nan, nan, nan, nan], 'Symbol': ['phage_lambda_genome', 'phage_lambda_genome', 'phage_lambda_genome:low', 'phage_lambda_genome:low', 'thrB'], 'Protein_Product': [nan, nan, nan, nan, 'thrB'], 'Probe_Id': [nan, nan, nan, nan, nan], 'Array_Address_Id': [5090180.0, 6510136.0, 7560739.0, 1450438.0, 1240647.0], 'Probe_Type': [nan, nan, nan, nan, nan], 'Probe_Start': [nan, nan, nan, nan, nan], 'SEQUENCE': ['GAATAAAGAACAATCTGCTGATGATCCCTCCGTGGATCTGATTCGTGTAA', 'CCATGTGATACGAGGGCGCGTAGTTTGCA

Index(['ID', 'Species', 'Source', 'Search_Key', 'Transcript', 'ILMN_Gene',
       'Source_Reference_ID', 'RefSeq_ID', 'Unigene_ID', 'Entrez_Gene_ID',
       'GI', 'Accession', 'Symbol', 'Protein_Product', 'Probe_Id',
       'Array_Address_Id', 'Probe_Type', 'Probe_Start', 'SEQUENCE',
       'Chromosome', 'Probe_Chr_Orientation', 'Probe_Coordinates', 'Cytoband',
       'Definition', 'Ontology_Component', 'Ontology_Process',
       'Ontology_Function', 'Synonyms', 'Obsolete_Probe_Id', 'GB_ACC'],
      dtype='object')

In [149]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'Symbol'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)

In [150]:
genetic_data = normalize_gene_symbols_in_index(genetic_data)

genetic_data

Unnamed: 0,GSM1510561,GSM1510562,GSM1510563,GSM1510564,GSM1510565,GSM1510566,GSM1510567,GSM1510568,GSM1510569,GSM1510570,...,GSM1511097,GSM1511098,GSM1511099,GSM1511100,GSM1511101,GSM1511102,GSM1511103,GSM1511104,GSM1511105,GSM1511106
AACS,6.3000,6.2350,6.4810,6.3210,6.2640,6.3570,6.2440,6.2300,6.1600,6.2880,...,6.1430,6.0480,6.0820,6.1060,6.0790,6.1280,6.1210,6.0010,6.139,6.129
AAK1,7.9120,8.0340,7.6650,7.4960,8.0450,7.6370,8.3660,7.9090,7.5420,7.5680,...,7.5470,8.0470,7.7170,7.5850,7.3850,7.8180,7.7600,7.7860,7.761,7.806
AAMP,6.7600,6.6760,6.5100,6.7020,6.7100,6.5860,6.1300,6.5600,6.7190,6.4530,...,6.6090,6.9330,6.5270,6.3300,6.4850,6.2460,6.6000,6.4350,6.349,6.570
AARS2,5.9320,6.1460,6.1720,6.1280,6.3120,6.0410,6.3970,6.0420,6.1950,6.0040,...,5.9640,5.7230,6.1090,5.8720,5.9690,5.8380,5.8000,6.0040,5.923,5.911
AARSD1,6.5340,6.7350,6.5230,6.6050,6.9780,6.4210,6.6610,6.4630,6.1730,6.1030,...,6.1590,6.2260,6.2600,6.4460,6.2760,6.2790,6.3230,6.3910,6.152,6.246
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZXDC,6.0660,5.8830,6.2500,5.7180,5.6890,6.0330,5.8980,5.9980,6.1380,5.8640,...,5.8450,5.6210,5.8320,5.8400,5.7070,5.6790,5.7950,5.8330,5.805,5.582
ZYG11B,7.6940,7.1850,7.5830,7.2500,7.1850,7.6110,7.9210,7.4210,7.3520,7.5220,...,7.4310,7.4150,7.4470,7.8880,7.4550,7.7730,7.7200,7.5820,7.616,7.887
ZYX,9.4940,9.1505,9.1545,9.2205,8.9365,9.1875,9.5075,9.3245,9.2635,9.1475,...,9.2445,9.3425,9.3440,9.6660,9.1590,9.3025,9.3550,9.1835,9.101,9.120
ZZEF1,6.5570,6.7780,6.4080,6.6600,6.4880,6.6580,6.6810,6.5490,6.6980,6.8570,...,6.6000,6.4810,6.1780,6.4850,6.5820,6.4330,6.2890,6.3560,6.402,6.717


In [151]:
merged_data = geo_merge_clinical_genetic_data(selected_clinical_data, genetic_data)
# The preprocessing runs through, which means is_available should be True
is_available = True

merged_data

Unnamed: 0,Anxiety disorder,Age,Gender,AACS,AAK1,AAMP,AARS2,AARSD1,AASDH,AASDHPPT,...,ZSWIM4,ZSWIM6,ZW10,ZWILCH,ZXDB,ZXDC,ZYG11B,ZYX,ZZEF1,ZZZ3
GSM1510561,1.0,44.0,1.0,6.3,7.912,6.76,5.932,6.534,5.799,6.122,...,6.8215,7.775,5.7,5.966,5.689,6.066,7.694,9.494,6.557,5.9875
GSM1510562,1.0,59.0,1.0,6.235,8.034,6.676,6.146,6.735,6.143,6.319,...,6.7145,7.276,5.773,6.297,5.765,5.883,7.185,9.1505,6.778,6.527
GSM1510563,1.0,44.0,1.0,6.481,7.665,6.51,6.172,6.523,5.64,6.065,...,6.656,7.648,5.527,6.334,5.725,6.25,7.583,9.1545,6.408,6.131
GSM1510564,1.0,39.0,1.0,6.321,7.496,6.702,6.128,6.605,5.923,6.194,...,6.641,7.205,5.814,6.709,5.774,5.718,7.25,9.2205,6.66,6.24
GSM1510565,1.0,64.0,1.0,6.264,8.045,6.71,6.312,6.978,6.267,6.364,...,6.6235,7.285,5.833,6.437,5.678,5.689,7.185,8.9365,6.488,6.408
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GSM1510892,1.0,39.0,1.0,6.165,7.446,6.516,6.024,6.276,5.923,6.304,...,6.7275,7.342,5.707,6.173,5.776,5.889,7.292,9.207,6.507,6.241
GSM1510893,1.0,60.0,1.0,6.252,7.416,6.394,5.984,6.173,5.934,6.2085,...,6.672,7.301,5.604,6.092,5.802,5.849,7.494,8.9835,6.432,6.1945
GSM1510894,1.0,54.0,0.0,6.227,7.42,6.451,6.083,6.311,5.871,6.27,...,6.7825,7.387,5.46,6.046,5.753,5.801,7.392,9.092,6.411,6.429
GSM1510895,1.0,50.0,0.0,6.232,7.406,6.446,5.87,6.263,5.785,6.2825,...,6.663,7.478,5.646,6.08,5.694,5.884,7.536,9.6855,6.601,6.262
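
The merged table above has one row per sample, with the clinical features first and the gene columns after them. The internals of `geo_merge_clinical_genetic_data` are not shown in this notebook, so the following is only a minimal sketch of how such a merge could be done with pandas. The helper name `merge_clinical_and_expression` and the toy values are hypothetical.

```python
import pandas as pd

def merge_clinical_and_expression(clinical, expression):
    """Stack clinical rows on top of gene rows, then transpose to samples x features.

    Both inputs are assumed to be features x samples tables (hypothetical layout).
    """
    # Keep only the samples that appear in both tables, in clinical order
    shared = [c for c in clinical.columns if c in expression.columns]
    stacked = pd.concat([clinical[shared], expression[shared]], axis=0)
    return stacked.T

# Toy data: GSM_C has clinical information but no expression values
clinical = pd.DataFrame(
    {'GSM_A': [1, 1], 'GSM_B': [1, 0], 'GSM_C': [1, 1]},
    index=['Anxiety disorder', 'Gender'],
)
expression = pd.DataFrame(
    {'GSM_A': [6.3, 7.912], 'GSM_B': [6.235, 8.034]},
    index=['AACS', 'AAK1'],
)
merged = merge_clinical_and_expression(clinical, expression)
print(merged)
```

Samples missing from either table are dropped in this sketch; the real utility may treat them differently.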


In [152]:
print(f"The merged dataset contains {len(merged_data)} samples.")
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

The merged dataset contains 336 samples.
For the feature 'Anxiety disorder', the least common label is '1.0' with 336 occurrences. This represents 100.00% of the dataset.
The distribution of the feature 'Anxiety disorder' in this dataset is severely biased.

Quartiles for 'Age':
  25%: 43.0
  50% (Median): 50.0
  75%: 56.25
Min: 18.0
Max: 78.0
The distribution of the feature 'Age' in this dataset is fine.

For the feature 'Gender', the least common label is '0.0' with 112 occurrences. This represents 33.33% of the dataset.
The distribution of the feature 'Gender' in this dataset is fine.



True
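
The report above flags the trait as severely biased because every sample carries the same label. The implementation of `judge_and_remove_biased_features` is not shown here, so the sketch below only illustrates one plausible rule for a binary trait. The function names and the 10% threshold are assumptions, not the values used in `utils`.

```python
import pandas as pd

def minority_fraction(labels: pd.Series) -> float:
    """Return the share of samples that carry the least common label."""
    counts = labels.dropna().value_counts()
    return float(counts.min() / counts.sum())

def is_binary_trait_biased(labels: pd.Series, min_fraction: float = 0.1) -> bool:
    """Hypothetical rule: biased if one label is absent or rarer than min_fraction."""
    values = labels.dropna()
    if values.nunique() < 2:
        # Only one label present, so cases cannot be compared with controls
        return True
    return bool(minority_fraction(values) < min_fraction)

# Label counts taken from the report above: 336 cases, 0 controls
all_cases = pd.Series([1.0] * 336)
# Gender distribution from the same report: 112 of 336 samples have the rarer label
gender = pd.Series([1.0] * 224 + [0.0] * 112)

print(is_binary_trait_biased(all_cases))
print(is_binary_trait_biased(gender))
```

A cohort in which all samples share one trait value cannot support a case-control regression, which is why no CSV is written for it.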

In [153]:
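# Record the cohort's availability and bias status in the shared JSON file
if is_available:
    save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data, note='')
else:
    save_cohort_info(cohort, JSON_PATH, is_available)
# Only cohorts with a usable trait distribution are saved to disk.
# Here is_trait_biased is True, so no CSV is written for this cohort.
if not is_trait_biased:
    merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)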

In [154]:
# Finished
cohort = accession_num = "GSE119995"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Exposure-induced changes of plasma mRNA expression levels in patients with panic disorder"
!Series_summary	"Anxiety disorders including panic disorders with or without agoraphobia are the most prevalent mental disorders. Exposure is a core technique within the framework of cognitive behavioral therapy to treat phobia- and anxiety-related symptoms. The primary aim of this study was to trace specific anxiety-related plasma gene expression changes of subjects with PD at three time points in order to identify biomarkers for acute anxiety states. In this intervention, the patient is exposed to highly feared and mostly avoided situations."
!Series_overall_design	"Blood samples from individuals with panic disorder (n=24) were drawn at three time points during exposure: baseline, 1 hour post-exposure and 24 hours after exposure-onset."


Unnamed: 0,!Sample_geo_accession,GSM3391438,GSM3391439,GSM3391440,GSM3391441,GSM3391442,GSM3391443,GSM3391444,GSM3391445,GSM3391446,...,GSM3391500,GSM3391501,GSM3391502,GSM3391503,GSM3391504,GSM3391505,GSM3391506,GSM3391507,GSM3391508,GSM3391509
0,!Sample_characteristics_ch1,disease: panic disorder,disease: panic disorder,disease: panic disorder,disease: panic disorder,disease: panic disorder,disease: panic disorder,disease: panic disorder,disease: panic disorder,disease: panic disorder,...,disease: panic disorder,disease: panic disorder,disease: panic disorder,disease: panic disorder,disease: panic disorder,disease: panic disorder,disease: panic disorder,disease: panic disorder,disease: panic disorder,disease: panic disorder
1,!Sample_characteristics_ch1,tissue: blood plasma,tissue: blood plasma,tissue: blood plasma,tissue: blood plasma,tissue: blood plasma,tissue: blood plasma,tissue: blood plasma,tissue: blood plasma,tissue: blood plasma,...,tissue: blood plasma,tissue: blood plasma,tissue: blood plasma,tissue: blood plasma,tissue: blood plasma,tissue: blood plasma,tissue: blood plasma,tissue: blood plasma,tissue: blood plasma,tissue: blood plasma
2,!Sample_characteristics_ch1,Sex: female,Sex: female,Sex: female,Sex: female,Sex: female,Sex: female,Sex: male,Sex: male,Sex: male,...,Sex: not determined,Sex: female,Sex: female,Sex: female,Sex: not determined,Sex: not determined,Sex: not determined,Sex: not determined,Sex: not determined,Sex: not determined
3,!Sample_characteristics_ch1,medication: 0,medication: 0,medication: 0,medication: 0,medication: 0,medication: 0,medication: 0,medication: 0,medication: 0,...,medication: 1,medication: 0,medication: 0,medication: 0,medication: 1,medication: 1,medication: 1,medication: 1,medication: 1,medication: 1
4,!Sample_characteristics_ch1,timepoint: b1,timepoint: p24_1,timepoint: pe1,timepoint: p24_1,timepoint: pe1,timepoint: b1,timepoint: b1,timepoint: pe1,timepoint: p24_1,...,timepoint: pe1,timepoint: b1,timepoint: pe1,timepoint: p24_1,timepoint: pe1,timepoint: b1,timepoint: p24_1,timepoint: pe1,timepoint: p24_1,timepoint: b1
5,!Sample_characteristics_ch1,individual: 2,individual: 2,individual: 2,individual: 9,individual: 9,individual: 9,individual: 7,individual: 7,individual: 7,...,individual: 38,individual: 21,individual: 21,individual: 21,individual: 39,individual: 39,individual: 39,individual: 41,individual: 41,individual: 41


In [155]:
# Inspect the distinct values in the candidate trait row (row 0: disease label)
trait_values = clinical_data.iloc[0]
trait_values.unique()

array(['!Sample_characteristics_ch1', 'disease: panic disorder'],
      dtype=object)
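
Choosing `trait_row`, `age_row` and `gender_row` means looking at the distinct values of each characteristics row. As a convenience, all rows can be summarized in one pass. This is a minimal sketch; the helper `summarize_characteristics` is hypothetical and not part of `utils`, and it assumes the first column holds the field name, as in the table above.

```python
import pandas as pd

def summarize_characteristics(clinical_df: pd.DataFrame, max_values: int = 10) -> dict:
    """Map each row position to its distinct sample values.

    The first column (the field name) is skipped; at most max_values
    distinct values are kept per row.
    """
    summary = {}
    for pos in range(len(clinical_df)):
        values = clinical_df.iloc[pos, 1:].dropna().unique().tolist()
        summary[pos] = values[:max_values]
    return summary

# Toy table mimicking the layout of clinical_data
toy = pd.DataFrame({
    '!Sample_geo_accession': ['!Sample_characteristics_ch1'] * 2,
    'GSM_A': ['disease: panic disorder', 'Sex: female'],
    'GSM_B': ['disease: panic disorder', 'Sex: male'],
})
summary = summarize_characteristics(toy)
for pos, values in summary.items():
    print(pos, values)
```

A row with a single distinct value, such as row 0 here, cannot separate cases from controls.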

In [156]:
is_gene_availabe = True
trait_row = 0
age_row = None
gender_row = 2

trait_type = 'binary'

# Verify and use the functions generated by GPT

# This function converts the disease label into a binary value for the presence of anxiety disorder.
# It assumes that a sample labelled "panic disorder" has an anxiety disorder (returns 1); any other value returns 0.
def convert_trait(label):
    """
    Convert the disease label to anxiety disorder presence (binary).
    Panic disorder is treated as an anxiety disorder.
    """
    if label == 'disease: panic disorder':
        return 1  # Anxiety disorder present
    else:
        return 0  # Anxiety disorder not present

In [157]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()



Unnamed: 0,GSM3391438,GSM3391439,GSM3391440,GSM3391441,GSM3391442,GSM3391443,GSM3391444,GSM3391445,GSM3391446,GSM3391447,...,GSM3391500,GSM3391501,GSM3391502,GSM3391503,GSM3391504,GSM3391505,GSM3391506,GSM3391507,GSM3391508,GSM3391509
Anxiety disorder,1,1,1,1,1,1,1,1,1,1,...,1.0,1,1,1,1.0,1.0,1.0,1.0,1.0,1.0
Gender,1,1,1,1,1,1,0,0,0,1,...,,1,1,1,,,,,,


In [158]:
genetic_data = get_genetic_data(matrix_file)
genetic_data

ID,GSM3391438,GSM3391439,GSM3391440,GSM3391441,GSM3391442,GSM3391443,GSM3391444,GSM3391445,GSM3391446,GSM3391447,...,GSM3391500,GSM3391501,GSM3391502,GSM3391503,GSM3391504,GSM3391505,GSM3391506,GSM3391507,GSM3391508,GSM3391509
ILMN_1343291,13.951749,13.632986,13.718869,14.017101,14.325731,14.328347,14.156883,13.975844,14.355057,14.263098,...,13.452625,12.726963,13.041178,13.137704,13.001847,13.462342,12.910161,13.141218,13.247583,12.954122
ILMN_1343295,11.071988,10.894695,11.358074,11.219213,11.498586,11.661600,11.992226,12.115965,12.005301,11.571721,...,11.807002,11.583282,11.843324,11.818459,11.658436,11.825321,11.409416,11.742370,11.768177,11.643327
ILMN_1651199,5.212670,5.482078,5.357357,5.606711,4.839159,4.821573,5.241077,5.039115,4.730284,4.675609,...,4.826455,5.074913,5.010022,5.550813,4.759479,5.001379,4.736003,4.927063,4.735439,4.896824
ILMN_1651209,5.710210,5.611946,5.930514,5.307133,5.391922,5.643804,5.415717,5.227683,5.307049,5.712707,...,5.528267,5.145149,5.490204,4.951263,5.827258,5.639573,5.975935,5.417391,5.218732,5.548980
ILMN_1651210,5.385582,5.243548,5.347599,4.957354,5.386004,5.124224,4.952177,4.693552,5.285033,5.125894,...,5.045532,4.862499,5.216427,5.084498,5.025204,5.045238,5.134669,5.030738,4.727393,4.736943
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ILMN_3311170,5.391212,5.150167,4.968791,5.215904,4.876341,4.803347,4.772063,4.738366,4.726934,5.245909,...,5.425808,4.783792,5.465189,5.279209,4.872425,4.974837,5.462659,5.198998,5.215293,5.448491
ILMN_3311175,5.585412,5.642667,5.141310,5.137059,5.476385,5.103688,5.073684,5.257233,5.253794,4.717288,...,5.035802,5.373508,5.464113,5.085251,5.126689,5.093751,5.102440,5.267268,4.973795,5.219592
ILMN_3311180,5.662226,5.262558,5.342964,5.502404,5.329657,5.426351,5.281256,5.027729,5.411350,4.545742,...,5.359871,5.789463,5.239502,4.986382,5.340640,5.046709,5.336161,5.520774,5.237579,5.110814
ILMN_3311185,5.220874,5.488200,5.430758,5.099603,5.371403,5.266177,4.968190,4.920104,5.291698,5.333733,...,5.440144,5.063427,4.921783,5.496954,5.005685,5.378131,5.255420,5.402068,5.737316,5.465343


In [159]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

gene_annotation.columns

{'ID': ['ILMN_1343048', 'ILMN_1343049', 'ILMN_1343050', 'ILMN_1343052', 'ILMN_1343059'], 'Species': [nan, nan, nan, nan, nan], 'Source': [nan, nan, nan, nan, nan], 'Search_Key': [nan, nan, nan, nan, nan], 'Transcript': [nan, nan, nan, nan, nan], 'ILMN_Gene': [nan, nan, nan, nan, nan], 'Source_Reference_ID': [nan, nan, nan, nan, nan], 'RefSeq_ID': [nan, nan, nan, nan, nan], 'Unigene_ID': [nan, nan, nan, nan, nan], 'Entrez_Gene_ID': [nan, nan, nan, nan, nan], 'GI': [nan, nan, nan, nan, nan], 'Accession': [nan, nan, nan, nan, nan], 'Symbol': ['phage_lambda_genome', 'phage_lambda_genome', 'phage_lambda_genome:low', 'phage_lambda_genome:low', 'thrB'], 'Protein_Product': [nan, nan, nan, nan, 'thrB'], 'Probe_Id': [nan, nan, nan, nan, nan], 'Array_Address_Id': [5090180.0, 6510136.0, 7560739.0, 1450438.0, 1240647.0], 'Probe_Type': [nan, nan, nan, nan, nan], 'Probe_Start': [nan, nan, nan, nan, nan], 'SEQUENCE': ['GAATAAAGAACAATCTGCTGATGATCCCTCCGTGGATCTGATTCGTGTAA', 'CCATGTGATACGAGGGCGCGTAGTTTGCA

Index(['ID', 'Species', 'Source', 'Search_Key', 'Transcript', 'ILMN_Gene',
       'Source_Reference_ID', 'RefSeq_ID', 'Unigene_ID', 'Entrez_Gene_ID',
       'GI', 'Accession', 'Symbol', 'Protein_Product', 'Probe_Id',
       'Array_Address_Id', 'Probe_Type', 'Probe_Start', 'SEQUENCE',
       'Chromosome', 'Probe_Chr_Orientation', 'Probe_Coordinates', 'Cytoband',
       'Definition', 'Ontology_Component', 'Ontology_Process',
       'Ontology_Function', 'Synonyms', 'Obsolete_Probe_Id', 'GB_ACC'],
      dtype='object')

In [160]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'Symbol'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)
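
The two calls above turn probe-level rows (for example `ILMN_...` identifiers) into gene-level rows. Their implementation lives in `utils` and is not shown, so the following is only a sketch of one common approach: look up each probe's symbol and average the probes that map to the same gene. The function name `map_probes_to_genes` and the averaging rule are assumptions.

```python
import pandas as pd

def map_probes_to_genes(expression, annotation, id_col, symbol_col):
    """Collapse a probes x samples table into a genes x samples table.

    Probes without a symbol are dropped; probes sharing a symbol are averaged
    (an assumed aggregation rule, not necessarily the one used in utils).
    """
    pairs = annotation[[id_col, symbol_col]].dropna().drop_duplicates(subset=id_col)
    probe_to_symbol = pairs.set_index(id_col)[symbol_col]
    # Keep only the probes that have a known gene symbol
    mapped = expression[expression.index.isin(probe_to_symbol.index)]
    symbols = probe_to_symbol.loc[mapped.index].to_numpy()
    return mapped.groupby(symbols).mean()

# Toy data: two probes for GENE_A, one for GENE_B, one unannotated probe
expression = pd.DataFrame(
    {'S1': [1.0, 3.0, 5.0, 7.0], 'S2': [2.0, 4.0, 6.0, 8.0]},
    index=['p1', 'p2', 'p3', 'p4'],
)
annotation = pd.DataFrame({
    'ID': ['p1', 'p2', 'p3', 'p4'],
    'Symbol': ['GENE_A', 'GENE_A', 'GENE_B', None],
})
gene_level = map_probes_to_genes(expression, annotation, 'ID', 'Symbol')
print(gene_level)
```

In the toy data, `p1` and `p2` are averaged into one `GENE_A` row and the unannotated probe `p4` is dropped.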

In [161]:
genetic_data = normalize_gene_symbols_in_index(genetic_data)

genetic_data

Unnamed: 0,GSM3391438,GSM3391439,GSM3391440,GSM3391441,GSM3391442,GSM3391443,GSM3391444,GSM3391445,GSM3391446,GSM3391447,...,GSM3391500,GSM3391501,GSM3391502,GSM3391503,GSM3391504,GSM3391505,GSM3391506,GSM3391507,GSM3391508,GSM3391509
A1BG,5.549873,5.156616,5.547066,5.302168,5.185350,5.186735,5.319098,5.102512,5.078470,5.539373,...,5.382586,5.576582,5.413309,5.522419,5.247449,5.161875,5.481058,5.447319,5.551500,5.466158
A1CF,5.236366,4.993963,5.103459,5.089547,4.955472,5.042418,4.984974,5.009426,5.085790,5.284906,...,5.368677,5.368314,5.098376,5.195656,5.073764,5.052990,4.698542,5.235947,5.367221,5.286537
A2M,4.676289,4.170137,3.772850,3.713753,4.116266,3.496562,4.170447,4.677483,4.418204,3.900678,...,3.740830,4.377500,4.039470,3.831097,4.090791,3.959632,4.693917,4.111218,4.111810,4.263751
A2ML1,4.692780,4.125878,4.425559,3.463952,4.574096,4.156001,4.954480,4.528565,4.246560,4.486061,...,4.582576,3.784288,4.655316,3.924939,4.819901,3.122133,4.649517,4.163421,4.967549,4.784429
A3GALT2,5.148601,5.177554,5.003242,4.878174,4.777945,5.341176,4.879487,4.935164,5.069078,5.282842,...,4.682381,5.265869,5.196862,4.947796,5.090187,5.133800,4.693469,4.802838,5.274135,5.047789
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZYG11B,6.771469,6.336628,6.555160,6.129595,5.900300,6.191537,5.782648,6.413790,6.128815,6.406779,...,4.913304,5.037243,5.635179,5.186833,5.426234,4.902439,5.288266,4.998581,4.986692,5.337996
ZYX,9.569284,9.326883,9.911704,9.542863,9.504159,9.546907,9.848659,10.509720,9.726760,9.400222,...,9.122883,9.403109,9.491428,9.310131,9.807796,9.505578,9.945647,9.577237,9.028118,9.320740
ZZEF1,6.099535,5.957916,6.237110,6.430535,6.247545,6.389494,6.482721,6.755369,6.471711,6.180198,...,5.476549,4.864323,5.268544,5.469607,6.201858,5.980847,5.804774,5.236338,5.575930,5.052478
ZZZ3,5.464955,5.349479,5.307907,5.278758,5.418133,5.036321,5.517764,5.098876,5.583733,5.428541,...,5.559138,5.364978,5.011174,4.990195,5.108184,5.317062,5.322555,5.345468,5.269554,4.930351


In [162]:
merged_data = geo_merge_clinical_genetic_data(selected_clinical_data, genetic_data)
# Preprocessing completed without errors, so the cohort is marked as available
is_available = True

merged_data

Unnamed: 0,Anxiety disorder,Gender,A1BG,A1CF,A2M,A2ML1,A3GALT2,A4GALT,A4GNT,AAA1,...,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3,RAB1C
GSM3391438,1.0,1.0,5.549873,5.236366,4.676289,4.69278,5.148601,4.803079,5.458767,4.923727,...,4.964826,4.940303,5.073779,5.884059,4.705465,6.771469,9.569284,6.099535,5.464955,4.960658
GSM3391439,1.0,1.0,5.156616,4.993963,4.170137,4.125878,5.177554,5.119226,5.515966,4.804336,...,4.753434,4.591868,5.164072,6.149183,4.846971,6.336628,9.326883,5.957916,5.349479,5.100309
GSM3391440,1.0,1.0,5.547066,5.103459,3.77285,4.425559,5.003242,4.216696,5.356336,4.647138,...,4.705461,4.46119,4.788716,6.228165,4.89554,6.55516,9.911704,6.23711,5.307907,4.314269
GSM3391441,1.0,1.0,5.302168,5.089547,3.713753,3.463952,4.878174,4.084012,5.809323,5.224995,...,4.709073,4.607347,4.687015,6.370589,5.048973,6.129595,9.542863,6.430535,5.278758,4.822136
GSM3391442,1.0,1.0,5.18535,4.955472,4.116266,4.574096,4.777945,4.557257,5.611132,4.936271,...,4.883024,4.638102,5.072173,6.126019,4.675962,5.9003,9.504159,6.247545,5.418133,5.013245
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GSM3391496,1.0,0.0,5.506376,5.145666,3.867223,4.486624,5.333725,4.400585,5.74246,4.953283,...,4.669475,4.705037,5.28401,6.567619,4.826816,5.64686,10.161538,6.28309,5.190047,4.530331
GSM3391497,1.0,0.0,5.173176,5.040722,4.074068,4.549112,5.013721,4.583834,5.466677,4.70262,...,4.72745,4.840712,4.972748,6.626783,4.806269,5.501809,10.196891,6.33844,5.275097,4.90162
GSM3391501,1.0,1.0,5.576582,5.368314,4.3775,3.784288,5.265869,4.882237,6.312621,4.832429,...,4.479698,4.54106,5.122266,6.304932,4.705346,5.037243,9.403109,4.864323,5.364978,5.240203
GSM3391502,1.0,1.0,5.413309,5.098376,4.03947,4.655316,5.196862,3.692935,5.458813,4.765107,...,4.344076,4.844574,4.838627,6.203816,4.915858,5.635179,9.491428,5.268544,5.011174,4.78497


In [163]:
print(f"The merged dataset contains {len(merged_data)} samples.")
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

The merged dataset contains 63 samples.
For the feature 'Anxiety disorder', the least common label is '1.0' with 63 occurrences. This represents 100.00% of the dataset.
The distribution of the feature 'Anxiety disorder' in this dataset is severely biased.

For the feature 'Gender', the least common label is '0.0' with 18 occurrences. This represents 28.57% of the dataset.
The distribution of the feature 'Gender' in this dataset is fine.



True

In [164]:
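# Record the cohort's availability and bias status in the shared JSON file
if is_available:
    save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data, note='')
else:
    save_cohort_info(cohort, JSON_PATH, is_available)
# Only cohorts with a usable trait distribution are saved to disk.
# Here is_trait_biased is True, so no CSV is written for this cohort.
if not is_trait_biased:
    merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)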

In [165]:
# Finished 
cohort = accession_num = "GSE18123"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Blood gene expression signatures distinguish autism spectrum disorders from controls"
!Series_summary	"Autism Spectrum Disorder (ASD) is a common pediatric cognitive disorder with high heritability. Yet no single genetic variant has accounted for more than a small fraction of cases. We sought to determine whether we could classify patients as having ASD vs. controls solely based on a multi-gene expression profiling of their peripheral blood cells."
!Series_overall_design	"To test whether peripheral blood gene expression profiles could be used as a molecular diagnostic tool for distinguishing ASD from controls, we performed peripheral blood gene expression profiling on 170 patients with ASD and 115 controls collected from Boston area hospitals.  We developed a 55-gene prediction model with a sample cohort of 66 male patients with ASD and 33 age-matched male controls using cross-validation strategy.  Subsequently, 104 ASD and 82 controls were recruited and used as first an

Unnamed: 0,!Sample_geo_accession,GSM650510,GSM650512,GSM650513,GSM650514,GSM650515,GSM650516,GSM650517,GSM650518,GSM650519,...,GSM650644,GSM650645,GSM650646,GSM650647,GSM650648,GSM650649,GSM650651,GSM650652,GSM650653,GSM650654
0,!Sample_characteristics_ch1,diagnosis: PDD-NOS,diagnosis: PDD-NOS,diagnosis: AUTISM,diagnosis: AUTISM,diagnosis: AUTISM,diagnosis: PDD-NOS,diagnosis: PDD-NOS,diagnosis: AUTISM,diagnosis: ASPERGER'S DISORDER,...,diagnosis: CONTROL,diagnosis: CONTROL,diagnosis: CONTROL,diagnosis: CONTROL,diagnosis: CONTROL,diagnosis: CONTROL,diagnosis: CONTROL,diagnosis: CONTROL,diagnosis: CONTROL,diagnosis: CONTROL
1,!Sample_characteristics_ch1,gender: male,gender: male,gender: male,gender: male,gender: male,gender: male,gender: male,gender: male,gender: male,...,gender: male,gender: male,gender: male,gender: male,gender: male,gender: male,gender: male,gender: male,gender: male,gender: male
2,!Sample_characteristics_ch1,age at blood drawing (months): 118,age at blood drawing (months): 79,age at blood drawing (months): 169,age at blood drawing (months): 92,age at blood drawing (months): 78,age at blood drawing (months): 124,age at blood drawing (months): 81,age at blood drawing (months): 82,age at blood drawing (months): 96,...,age at blood drawing (months): 157,age at blood drawing (months): 160,age at blood drawing (months): 163,age at blood drawing (months): 156,age at blood drawing (months): 167,age at blood drawing (months): 168,age at blood drawing (months): 183,age at blood drawing (months): 185,age at blood drawing (months): 191,age at blood drawing (months): 192
3,!Sample_characteristics_ch1,race: Caucasian,race: Caucasian,race: Caucasian,race: Caucasian,"race: Mixed (Caucasian, Asian)",race: Caucasian,race: Caucasian,race: Caucasian,race: Caucasian,...,race: Caucasian,race: Caucasian,race: Caucasian,race: Caucasian,race: Black,race: Caucasian,race: Caucasian,race: Caucasian,race: Unknown,race: Caucasian
4,!Sample_characteristics_ch1,ethnicity: Non-Hispanic,ethnicity: Non-Hispanic,ethnicity: Non-Hispanic,ethnicity: Non-Hispanic,ethnicity: Non-Hispanic,ethnicity: Non-Hispanic,ethnicity: Non-Hispanic,ethnicity: Non-Hispanic,ethnicity: Non-Hispanic,...,ethnicity: Non-Hispanic,ethnicity: Non-Hispanic,ethnicity: Non-Hispanic,ethnicity: Hispanic,ethnicity: Non-Hispanic,ethnicity: Non-Hispanic,ethnicity: Non-Hispanic,ethnicity: Non-Hispanic,ethnicity: Non-Hispanic,ethnicity: Non-Hispanic
5,!Sample_characteristics_ch1,hours (minutes since last caloric intake): : 0:30,hours (minutes since last caloric intake): : 0:10,hours (minutes since last caloric intake): : 1:00,hours (minutes since last caloric intake): : 1:00,hours (minutes since last caloric intake): : 0:05,hours (minutes since last caloric intake): : 2:30,hours (minutes since last caloric intake): : 2:30,hours (minutes since last caloric intake): : 8:00,hours (minutes since last caloric intake): : 1:30,...,hours (minutes since last caloric intake): : 2:30,hours (minutes since last caloric intake): : 5:10,hours (minutes since last caloric intake): : 3:00,hours (minutes since last caloric intake): : u...,hours (minutes since last caloric intake): : u...,hours (minutes since last caloric intake): : u...,hours (minutes since last caloric intake): : u...,hours (minutes since last caloric intake): : u...,hours (minutes since last caloric intake): : u...,hours (minutes since last caloric intake): : 4:10
6,!Sample_characteristics_ch1,chronic diseases: -,chronic diseases: -,chronic diseases: -,chronic diseases: -,chronic diseases: -,chronic diseases: -,chronic diseases: -,chronic diseases: -,chronic diseases: -,...,chronic diseases: -,chronic diseases: -,chronic diseases: -,chronic diseases: -,chronic diseases: -,chronic diseases: -,chronic diseases: -,chronic diseases: -,chronic diseases: -,chronic diseases: -
7,!Sample_characteristics_ch1,allergies: -,allergies: -,allergies: -,allergies: -,allergies: -,allergies: -,allergies: -,allergies: -,allergies: -,...,allergies: -,allergies: -,allergies: -,allergies: -,allergies: -,allergies: -,allergies: -,allergies: -,allergies: -,allergies: -
8,!Sample_characteristics_ch1,developmental/speech disorder: -,developmental/speech disorder: -,developmental/speech disorder: yes (mild menta...,developmental/speech disorder: -,developmental/speech disorder: yes (mild menta...,developmental/speech disorder: -,developmental/speech disorder: -,developmental/speech disorder: yes (mild menta...,developmental/speech disorder: -,...,developmental/speech disorder: -,developmental/speech disorder: -,developmental/speech disorder: -,developmental/speech disorder: -,developmental/speech disorder: -,developmental/speech disorder: -,developmental/speech disorder: -,developmental/speech disorder: -,developmental/speech disorder: -,developmental/speech disorder: -
9,!Sample_characteristics_ch1,neurological disorder: -,neurological disorder: -,neurological disorder: -,neurological disorder: -,neurological disorder: -,neurological disorder: -,neurological disorder: -,neurological disorder: yes (Siezure/ convulsio...,neurological disorder: -,...,neurological disorder: -,neurological disorder: -,neurological disorder: -,neurological disorder: -,neurological disorder: -,neurological disorder: -,neurological disorder: -,neurological disorder: -,neurological disorder: -,neurological disorder: -


In [166]:
# Inspect the distinct values in the candidate trait row (row 10: psychiatric disorder)
trait_values = clinical_data.iloc[10]
trait_values.unique()

array(['!Sample_characteristics_ch1',
       'psychiatric disorder: yes (anxiety)', 'psychiatric disorder: -',
       'psychiatric disorder: yes (add)',
       'psychiatric disorder: yes (adhd/add)',
       'psychiatric disorder: yes (adhd)',
       'psychiatric disorder: yes (pica)', 'gastrointestinal disorder: -',
       'psychiatric disorder: unknown',
       'psychiatric disorder: yes (ADHD)',
       'psychiatric disorder: yes (behaviorial issues)',
       'psychiatric disorder: yes (ADD)'], dtype=object)

In [173]:
is_gene_availabe = True
trait_row = 10
age_row = 2
gender_row = 1

trait_type = 'binary'

# Verify and use the functions generated by GPT

# This function converts the psychiatric disorder field into a binary value for the presence of anxiety disorder.
# Only the value "yes (anxiety)" counts as anxiety disorder present (returns 1); every other value, including "unknown", returns 0.
def convert_trait(label):
    """
    Convert the psychiatric disorder field to anxiety disorder presence (binary).
    Only 'psychiatric disorder: yes (anxiety)' is treated as a case.
    """
    if label == 'psychiatric disorder: yes (anxiety)':
        return 1  # Anxiety disorder present
    else:
        return 0  # Anxiety disorder not present

def convert_age(age_string):
    """
    Convert an age string given in months to age in years (continuous value).
    Unknown values are converted to None.
    """
    if age_string.lower() == 'n.a.':
        return None
    try:
        # Extract the age in months and convert it to years, rounded to 2 decimals
        age = round(int(age_string.split(': ')[1])/12, 2)
        return age
    except (ValueError, IndexError):
        # In case of any format error or unexpected string structure
        return None

def convert_gender(gender_string):
    """
    Convert gender string to a binary value.
    'female' is represented as 1, 'male' as 0.
    Unknown values are converted to None.
    """
    gender = gender_string.lower()
    if gender in ('sex: female', 'sex: f', 'gender: female', 'gender: f'):
        return 1
    elif gender in ('sex: male', 'sex: m', 'gender: male', 'gender: m'):
        return 0
    else:
        return None
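
This cohort reports age in months (`age at blood drawing (months): 118`), so the age has to be divided by 12 before it is comparable with the other cohorts. The standalone sketch below checks that conversion on a few values from the clinical table above. The helper `months_field_to_years` is a hypothetical stand-in written for this check, not a function from `utils`.

```python
def months_field_to_years(field):
    """Parse a 'name: value' string whose value is an age in months.

    Returns the age in years rounded to 2 decimals, or None when the
    value is missing or not numeric.
    """
    try:
        months = float(field.split(':')[-1].strip())
    except (AttributeError, ValueError):
        # Non-string input or a non-numeric value such as 'unknown'
        return None
    return round(months / 12, 2)

samples = [
    'age at blood drawing (months): 118',
    'age at blood drawing (months): 79',
    'age at blood drawing (months): 169',
    'age at blood drawing (months): unknown',
]
years = [months_field_to_years(s) for s in samples]
print(years)
```

The first three results should match the first three non-elided entries of the `Age` row in the selected clinical data shown in the next cell.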

In [182]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)

selected_clinical_data.loc['Anxiety disorder'] = selected_clinical_data.loc['Anxiety disorder'].astype(int)
selected_clinical_data.loc['Gender'] = selected_clinical_data.loc['Gender'].astype(int)

selected_clinical_data.head()



Unnamed: 0,GSM650510,GSM650512,GSM650513,GSM650514,GSM650515,GSM650516,GSM650517,GSM650518,GSM650519,GSM650520,...,GSM650644,GSM650645,GSM650646,GSM650647,GSM650648,GSM650649,GSM650651,GSM650652,GSM650653,GSM650654
Anxiety disorder,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Age,9.83,6.58,14.08,7.67,6.5,10.33,6.75,6.83,8.0,9.75,...,13.08,13.33,13.58,13.0,13.92,14.0,15.25,15.42,15.92,16.0
Gender,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [183]:
genetic_data = get_genetic_data(matrix_file)
genetic_data

ID,GSM650510,GSM650512,GSM650513,GSM650514,GSM650515,GSM650516,GSM650517,GSM650518,GSM650519,GSM650520,...,GSM650644,GSM650645,GSM650646,GSM650647,GSM650648,GSM650649,GSM650651,GSM650652,GSM650653,GSM650654
1007_s_at,26.65640,107.61809,48.97026,72.65312,80.83264,40.13933,47.43693,57.90003,65.03053,43.14078,...,81.02742,67.71790,49.50204,64.66049,110.58686,18.62053,66.15737,76.31541,59.93721,45.99937
1053_at,39.00726,135.73965,51.37555,102.65829,78.93955,81.75657,79.74300,74.16394,131.21986,62.78481,...,157.17724,86.83151,106.72321,37.18311,45.04578,77.12382,92.94504,137.79575,26.70354,101.07505
117_at,296.01724,459.14382,378.86384,618.79730,266.87741,320.33815,685.72845,285.89160,996.76063,231.43584,...,913.09583,285.02298,539.54601,75.35680,350.91103,488.36033,252.50309,329.28063,145.30082,276.80480
121_at,209.03575,232.08352,190.15020,185.90196,216.47953,172.25500,188.12218,155.99192,172.30327,141.44389,...,212.27610,309.88049,206.78336,281.03506,189.41136,294.80112,274.37934,232.94968,204.35111,265.61533
1255_g_at,63.82730,50.20000,60.23913,22.03061,41.29200,74.49414,16.84841,48.88556,19.70762,17.45545,...,42.52692,43.31678,21.37225,61.94335,37.83045,19.08806,43.59119,49.46663,36.18051,16.70463
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
91703_at,28.58766,79.13262,22.96121,56.42256,45.11952,34.81763,57.04181,56.23074,105.49385,41.24448,...,69.51566,37.86502,64.40379,28.92133,74.31037,35.14159,43.61235,59.13127,40.32274,38.93599
91816_f_at,0.10766,15.94046,6.61843,11.72339,7.64792,9.16214,4.53755,14.51436,20.55896,12.60243,...,13.79920,10.53937,11.12295,11.67345,10.92976,5.50120,6.32524,17.96985,4.65278,8.73903
91826_at,72.29617,42.95838,71.19351,43.94883,47.81855,54.23739,51.09696,53.49288,29.99378,77.21878,...,28.16819,66.09037,47.22587,58.17597,35.79578,61.40504,52.68096,47.94715,93.90870,45.76559
91920_at,80.84484,76.71756,125.63001,72.40296,105.65854,96.91254,80.77544,66.64779,65.19188,67.64769,...,71.52543,71.21265,74.19817,122.47701,77.78997,114.24505,122.05100,77.77357,117.51711,121.57002


In [184]:
requires_gene_mapping = True

if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

gene_annotation.columns

{'ID': ['1007_s_at', '1053_at', '117_at', '121_at', '1255_g_at'], 'GB_ACC': ['U48705', 'M87338', 'X51757', 'X69699', 'L36861'], 'SPOT_ID': [nan, nan, nan, nan, nan], 'Species Scientific Name': ['Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens'], 'Annotation Date': ['Oct 6, 2014', 'Oct 6, 2014', 'Oct 6, 2014', 'Oct 6, 2014', 'Oct 6, 2014'], 'Sequence Type': ['Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence', 'Exemplar sequence'], 'Sequence Source': ['Affymetrix Proprietary Database', 'GenBank', 'Affymetrix Proprietary Database', 'GenBank', 'Affymetrix Proprietary Database'], 'Target Description': ['U48705 /FEATURE=mRNA /DEFINITION=HSU48705 Human receptor tyrosine kinase DDR gene, complete cds', 'M87338 /FEATURE= /DEFINITION=HUMA1SBU Human replication factor C, 40-kDa subunit (A1) mRNA, complete cds', "X51757 /FEATURE=cds /DEFINITION=HSP70B Human heat-shock protein HSP70B' gene", 'X69699 /FEATURE= /DEFINITION=HSPAX8A H.sapiens

Index(['ID', 'GB_ACC', 'SPOT_ID', 'Species Scientific Name', 'Annotation Date',
       'Sequence Type', 'Sequence Source', 'Target Description',
       'Representative Public ID', 'Gene Title', 'Gene Symbol',
       'ENTREZ_GENE_ID', 'RefSeq Transcript ID',
       'Gene Ontology Biological Process', 'Gene Ontology Cellular Component',
       'Gene Ontology Molecular Function'],
      dtype='object')

In [185]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'Gene Symbol'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)
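
The mapping step above can be illustrated on toy data. This is only a sketch: `map_probes_to_genes` is a hypothetical helper, not the `apply_gene_mapping` function from `utils`, and it assumes that probes sharing a gene symbol are combined by taking their mean.

```python
import pandas as pd

# Toy probe-level expression matrix (rows: probe IDs, columns: samples).
probe_data = pd.DataFrame(
    {"S1": [10.0, 20.0, 30.0], "S2": [1.0, 3.0, 5.0]},
    index=["p1", "p2", "p3"],
)

# Toy annotation table using the two keys chosen above.
annotation = pd.DataFrame(
    {"ID": ["p1", "p2", "p3"], "Gene Symbol": ["A1BG", "A1BG", "A2M"]}
)


def map_probes_to_genes(expr, annot, identifier_key, gene_symbol_key):
    """Hypothetical helper: label probes with gene symbols and average per gene."""
    mapping = annot[[identifier_key, gene_symbol_key]].dropna()
    mapping = mapping.set_index(identifier_key)[gene_symbol_key]
    # Keep only the probes that have a gene symbol.
    shared = expr.index.intersection(mapping.index)
    labelled = expr.loc[shared]
    # Assumption: probes that share a symbol are combined by their mean.
    return labelled.groupby(mapping.loc[shared].values).mean()


gene_data = map_probes_to_genes(probe_data, annotation, "ID", "Gene Symbol")
print(gene_data)
```

Other aggregation rules (maximum, median) are equally plausible; check the `utils` implementation before relying on a particular one.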

In [186]:
genetic_data = normalize_gene_symbols_in_index(genetic_data)

genetic_data

Unnamed: 0,GSM650510,GSM650512,GSM650513,GSM650514,GSM650515,GSM650516,GSM650517,GSM650518,GSM650519,GSM650520,...,GSM650644,GSM650645,GSM650646,GSM650647,GSM650648,GSM650649,GSM650651,GSM650652,GSM650653,GSM650654
A1BG,41.632590,83.282020,52.959000,44.688750,67.330120,60.513170,44.421540,26.769420,58.695100,46.864560,...,75.627900,63.707620,33.292570,46.106280,54.652120,49.665740,64.701680,69.443120,26.259120,59.140340
A1BG-AS1,8.174400,1.181510,10.394910,9.956910,16.326440,0.197270,7.871850,0.628880,4.194580,0.004840,...,10.395270,0.241240,15.681010,5.025300,7.780800,8.694080,1.798410,0.017850,17.689140,5.513350
A1CF,95.640040,37.203130,69.185405,42.768305,46.413910,60.622755,59.372435,39.065420,26.172005,51.082100,...,30.916605,48.519265,46.315170,95.476055,52.441430,73.571020,52.925425,36.266050,106.746100,63.756045
A2M,137.957640,81.007140,142.215430,62.315015,90.522845,96.440085,88.301515,54.964955,66.736025,55.187155,...,74.264400,124.375640,72.304515,171.866215,117.637435,111.514715,139.480970,79.037120,170.453820,139.043470
A2M-AS1,9.691710,25.732460,32.337940,31.629130,77.091910,26.903060,34.035480,140.055140,177.304150,118.263660,...,12.995710,34.047090,30.125770,98.480480,252.059920,32.989740,58.633670,95.902850,129.589230,41.255640
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZYG11A,9.732460,1.848590,7.718090,12.207260,8.464150,7.863050,6.603870,9.216050,7.730030,11.694620,...,5.714060,7.159580,5.538420,18.443030,6.041980,11.726640,9.879120,12.435280,6.669430,10.415620
ZYG11B,90.540157,305.102487,202.538620,342.711103,309.137360,230.600267,312.511813,314.228910,377.777233,259.726963,...,381.703610,302.457647,263.350383,205.547687,345.772560,157.514923,245.048360,386.281097,195.116173,225.857520
ZYX,798.485975,1022.778865,707.486985,856.898565,759.016425,780.765135,939.762930,633.710205,974.900610,760.832645,...,949.547505,700.315820,1044.017740,495.855975,1123.309535,867.789670,748.405040,575.659335,957.902505,767.074170
ZZEF1,271.942207,254.717260,341.249793,268.922863,320.187483,304.505493,289.604057,305.160403,280.125727,288.047563,...,270.668230,322.724987,305.778420,273.458767,299.192847,222.870827,280.068597,301.974207,290.095660,254.628487


In [187]:
merged_data = geo_merge_clinical_genetic_data(selected_clinical_data, genetic_data)
# Preprocessing ran to completion, so is_available should be True
is_available = True

merged_data

Unnamed: 0,Anxiety disorder,Age,Gender,A1BG,A1BG-AS1,A1CF,A2M,A2M-AS1,A2ML1,A2MP1,...,ZWILCH,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3
GSM650510,1.0,9.83,0.0,41.63259,8.17440,95.640040,137.957640,9.69171,37.034905,40.67379,...,48.976425,14.58944,23.76398,19.403560,37.706935,9.73246,90.540157,798.485975,271.942207,31.204940
GSM650512,0.0,6.58,0.0,83.28202,1.18151,37.203130,81.007140,25.73246,15.882900,38.60556,...,125.516175,114.41336,48.14110,60.779100,95.003493,1.84859,305.102487,1022.778865,254.717260,158.735530
GSM650513,0.0,14.08,0.0,52.95900,10.39491,69.185405,142.215430,32.33794,22.297505,28.38244,...,45.013960,22.81288,11.00663,30.196130,92.381738,7.71809,202.538620,707.486985,341.249793,47.567460
GSM650514,0.0,7.67,0.0,44.68875,9.95691,42.768305,62.315015,31.62913,28.699715,36.18376,...,88.122150,82.61474,30.05556,58.672835,108.238012,12.20726,342.711103,856.898565,268.922863,130.310685
GSM650515,0.0,6.50,0.0,67.33012,16.32644,46.413910,90.522845,77.09191,21.434480,33.22135,...,84.270565,34.40494,37.61480,56.821985,108.401883,8.46415,309.137360,759.016425,320.187483,110.856830
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GSM650649,0.0,14.00,0.0,49.66574,8.69408,73.571020,111.514715,32.98974,35.383350,39.55603,...,56.185930,39.00678,14.43924,16.463120,42.707600,11.72664,157.514923,867.789670,222.870827,57.563500
GSM650651,0.0,15.25,0.0,64.70168,1.79841,52.925425,139.480970,58.63367,24.866470,42.25776,...,70.175085,45.32123,41.18217,43.689005,85.825290,9.87912,245.048360,748.405040,280.068597,103.637035
GSM650652,0.0,15.42,0.0,69.44312,0.01785,36.266050,79.037120,95.90285,16.773860,29.24067,...,117.394060,74.97703,54.66800,66.129325,96.419222,12.43528,386.281097,575.659335,301.974207,166.960215
GSM650653,0.0,15.92,0.0,26.25912,17.68914,106.746100,170.453820,129.58923,51.762890,45.00520,...,28.036055,14.36849,19.99744,26.536545,52.304638,6.66943,195.116173,957.902505,290.095660,16.300080


In [188]:
print(f"The merged dataset contains {len(merged_data)} samples.")
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

The merged dataset contains 99 samples.
For the feature 'Anxiety disorder', the least common label is '1.0' with 3 occurrences. This represents 3.03% of the dataset.
The distribution of the feature 'Anxiety disorder' in this dataset is severely biased.

Quartiles for 'Age':
  25%: 4.96
  50% (Median): 7.5
  75%: 11.54
Min: 2.75
Max: 17.5
The distribution of the feature 'Age' in this dataset is fine.

For the feature 'Gender', the least common label is '0.0' with 99 occurrences. This represents 100.00% of the dataset.
The distribution of the feature 'Gender' in this dataset is severely biased.



True
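
The bias report above is based on how rare the least common label is. The sketch below reproduces that arithmetic for the 3-of-99 split. The function names and the `min_count=5` cutoff are assumptions made for illustration; the actual rule inside `judge_and_remove_biased_features` is not shown in this notebook.

```python
import pandas as pd


def minority_fraction(labels):
    """Return (least common label, its count, its fraction of non-missing samples)."""
    counts = pd.Series(labels).dropna().value_counts()
    label = counts.idxmin()
    count = int(counts.min())
    return label, count, count / int(counts.sum())


def looks_biased(labels, min_count=5):
    """Hypothetical rule: flag a feature with one class or too few minority samples."""
    values = pd.Series(labels).dropna()
    if values.nunique() < 2:
        return True
    _, count, _ = minority_fraction(values)
    return count < min_count


trait = [1.0] * 3 + [0.0] * 96  # mirrors the 3-of-99 split reported above
label, count, fraction = minority_fraction(trait)
print(f"least common label {label}: {count} samples, {fraction:.2%}")
```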

In [189]:
if is_available:
    save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data, note='')
else:
    save_cohort_info(cohort, JSON_PATH, is_available)
merged_data.head()
if not is_trait_biased:
    merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)

In [190]:
# Stopped: No trait convert
cohort = accession_num = "GSE60491"
cohort_dir = os.path.join(trait_path, accession_num)
soft_file, matrix_file = get_relevant_filepaths(cohort_dir)
soft_file, matrix_file

from utils import *
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

clinical_data

!Series_title	"Leukocyte gene expression variation as a function of Big 5 dimensions of human personality"
!Series_summary	"Individual differences in basal leukocyte gene expression profiles as a function of Big 5 personality dimensions"
!Series_overall_design	"Gene expression profiling was carried out on peripheral blood mononuclear cell RNA samples collected from 119 healthy adults measured for the 5 major dimensions of human personality (Neuroticism, Extraversion, Openness, Agreeableness, Conscientiousness) using the NEO-FFI 60-item personality inventory.  Personality measures are z-score standardized.  Analyses control for major demographic characteristics (age, sex, Caucasian vs Non-Caucasian race) as well as Body Mass Index (BMI), smoking (CigDay), alcohol consumption (AlcDay), and physical activity (ExerDay, hours per day).  Additional secondary analyses controlled for the presence of minor physical symptoms (MinorSymptom, e.g.,hayfever, headache), medication use (BirthControl, 

Unnamed: 0,!Sample_geo_accession,GSM1481100,GSM1481101,GSM1481102,GSM1481103,GSM1481104,GSM1481105,GSM1481106,GSM1481107,GSM1481108,...,GSM1481209,GSM1481210,GSM1481211,GSM1481212,GSM1481213,GSM1481214,GSM1481215,GSM1481216,GSM1481217,GSM1481218
0,!Sample_characteristics_ch1,age: 21,age: 22,age: 21,age: 21,age: 22,age: 23,age: 23,age: 23,age: 23,...,age: 24,age: 20,age: 26,age: 22,age: 32,age: 27,age: 22,age: 26,age: 18,age: 20
1,!Sample_characteristics_ch1,male: 0,male: 1,male: 0,male: 1,male: 0,male: 0,male: 0,male: 0,male: 0,...,male: 1,male: 1,male: 0,male: 0,male: 1,male: 1,male: 1,male: 1,male: 1,male: 0
2,!Sample_characteristics_ch1,bmi: 23,bmi: 30,bmi: 18,bmi: 21.6,bmi: 22,bmi: 22,bmi: 16,bmi: 26,bmi: 24,...,bmi: 20,bmi: 24,bmi: 20,bmi: 22,bmi: 21.9,bmi: 22,bmi: 24,bmi: 23,bmi: 24,bmi: 36
3,!Sample_characteristics_ch1,caucasian: 1,caucasian: 1,caucasian: 1,caucasian: 0,caucasian: 0,caucasian: 0,caucasian: 0,caucasian: 0,caucasian: 0,...,caucasian: 0,caucasian: 1,caucasian: 1,caucasian: 1,caucasian: 0,caucasian: missing,caucasian: 1,caucasian: 1,caucasian: 1,caucasian: 1
4,!Sample_characteristics_ch1,cigday: 0,cigday: 7,cigday: 12,cigday: 0,cigday: 0,cigday: 0,cigday: 0,cigday: 0,cigday: 0,...,cigday: 0,cigday: 0,cigday: 0,cigday: 0,cigday: 0,cigday: 0,cigday: 0,cigday: 0,cigday: 0,cigday: 0
5,!Sample_characteristics_ch1,alcday: 0.4,alcday: 0,alcday: 1.3,alcday: 0.9,alcday: 0,alcday: 0,alcday: 0,alcday: 0,alcday: 0,...,alcday: 6,alcday: 0.2,alcday: 0.6,alcday: 1,alcday: 0.1,alcday: 0.4,alcday: 0,alcday: 0,alcday: 0,alcday: 1.3
6,!Sample_characteristics_ch1,exerday: 0.5,exerday: 0.9,exerday: 0.6,exerday: 0.6,exerday: 1,exerday: 0.1,exerday: 0.1,exerday: 0.3,exerday: 0.5,...,exerday: 0.1,exerday: 0.1,exerday: 0.7,exerday: 1,exerday: 1.1,exerday: 0.7,exerday: 1,exerday: 1,exerday: 1,exerday: 0.4
7,!Sample_characteristics_ch1,minorsymptom: 0,minorsymptom: 0,minorsymptom: 0,minorsymptom: 0,minorsymptom: 0,minorsymptom: 0,minorsymptom: 0,minorsymptom: 0,minorsymptom: 0,...,minorsymptom: 0,minorsymptom: 0,minorsymptom: 0,minorsymptom: 0,minorsymptom: 0,minorsymptom: missing,minorsymptom: 0,minorsymptom: 0,minorsymptom: 0,minorsymptom: 0
8,!Sample_characteristics_ch1,birthcontrol: 0,birthcontrol: 0,birthcontrol: 0,birthcontrol: 0,birthcontrol: 0,birthcontrol: 0,birthcontrol: 0,birthcontrol: 0,birthcontrol: 0,...,birthcontrol: 0,birthcontrol: 0,birthcontrol: 1,birthcontrol: 0,birthcontrol: 0,birthcontrol: missing,birthcontrol: 0,birthcontrol: 0,birthcontrol: 0,birthcontrol: 0
9,!Sample_characteristics_ch1,antidepressant: 0,antidepressant: 0,antidepressant: 0,antidepressant: 0,antidepressant: 0,antidepressant: 0,antidepressant: 0,antidepressant: 1,antidepressant: 0,...,antidepressant: 0,antidepressant: 0,antidepressant: 0,antidepressant: 0,antidepressant: 0,antidepressant: missing,antidepressant: 0,antidepressant: 0,antidepressant: 0,antidepressant: 0


In [191]:
bmi_row = clinical_data.iloc[2]
bmi_row.unique()

array(['!Sample_characteristics_ch1', 'bmi: 23', 'bmi: 30', 'bmi: 18',
       'bmi: 21.6', 'bmi: 22', 'bmi: 16', 'bmi: 26', 'bmi: 24', 'bmi: 31',
       'bmi: 21', 'bmi: missing', 'bmi: 19', 'bmi: 25', 'bmi: 17.5',
       'bmi: 20', 'bmi: 28', 'bmi: 29', 'bmi: 35', 'bmi: 17', 'bmi: 27',
       'bmi: 21.9', 'bmi: 36'], dtype=object)

In [None]:
is_gene_available = True
trait_row = 2
age_row = None
gender_row = None

trait_type = 'binary'

# Verify and use the functions generated by GPT

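# This function converts the tissue type into a binary value for the presence or absence of epilepsy.
# It rests on a specific assumption: if the tissue type is "Pancreatic Ductal Adenocarcinoma", epilepsy is considered present (returns 1); otherwise, it is considered absent (returns 0).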
def convert_trait(tissue_type):
    """
    Convert tissue type to epilepsy presence (binary).
    Assuming epilepsy presence for 'Hippocampus' tissue.
    """
    if tissue_type == 'condition: tumor':
        return 1  # Epilepsy present
    else:
        return 0  # Epilepsy not present

### Initial filtering and clinical data preprocessing

In [44]:
import gzip

In [45]:
def line_generator(source, source_type):
    """Generator that yields lines from a file or a string.

    Parameters:
    - source: Path to a gzip-compressed file, or string content.
    - source_type: 'file' or 'string'.
    """
    if source_type == 'file':
        with gzip.open(source, 'rt') as f:
            for line in f:
                yield line.strip()
    elif source_type == 'string':
        for line in source.split('\n'):
            yield line.strip()
    else:
        raise ValueError("source_type must be 'file' or 'string'")

In [46]:
from typing import Callable, Optional, List, Tuple, Union, Any
import pandas as pd

In [47]:
def filter_content_by_prefix(
    source: str,
    prefixes_a: List[str],
    prefixes_b: Optional[List[str]] = None,
    unselect: bool = False,
    source_type: str = 'file',
    return_df_a: bool = True,
    return_df_b: bool = True
) -> Tuple[Union[str, pd.DataFrame], Optional[Union[str, pd.DataFrame]]]:
    """
    Filters lines from a gzip-compressed file or a string based on specified prefixes.

    Parameters:
    - source (str): File path or string content to filter.
    - prefixes_a (List[str]): Primary list of prefixes to filter by.
    - prefixes_b (Optional[List[str]]): Optional secondary list of prefixes to filter by.
    - unselect (bool): If True, selects rows that do not start with the specified prefixes.
    - source_type (str): 'file' if source is a file path, 'string' if source is a string of text.
    - return_df_a (bool): If True, returns filtered content for prefixes_a as a pandas DataFrame.
    - return_df_b (bool): If True, and if prefixes_b is provided, returns filtered content for prefixes_b as a pandas DataFrame.

    Returns:
    - Tuple: A tuple where the first element is the filtered content for prefixes_a, and the second element is the filtered content for prefixes_b.
    """
    filtered_lines_a = []
    filtered_lines_b = []
    prefix_set_a = set(prefixes_a)
    if prefixes_b is not None:
        prefix_set_b = set(prefixes_b)

    # Use generator to get lines
    for line in line_generator(source, source_type):
        matched_a = any(line.startswith(prefix) for prefix in prefix_set_a)
        if matched_a != unselect:
            filtered_lines_a.append(line)
        if prefixes_b is not None:
            matched_b = any(line.startswith(prefix) for prefix in prefix_set_b)
            if matched_b != unselect:
                filtered_lines_b.append(line)

    filtered_content_a = '\n'.join(filtered_lines_a)
    if return_df_a:
        filtered_content_a = pd.read_csv(io.StringIO(filtered_content_a), delimiter='\t', low_memory=False, on_bad_lines='skip')
    filtered_content_b = None
    if filtered_lines_b:
        filtered_content_b = '\n'.join(filtered_lines_b)
        if return_df_b:
            filtered_content_b = pd.read_csv(io.StringIO(filtered_content_b), delimiter='\t', low_memory=False, on_bad_lines='skip')

    return filtered_content_a, filtered_content_b
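
To make the prefix filtering concrete, here is a minimal sketch on an in-memory string rather than a gzipped file. The sample text and the `select_lines` helper are made up for illustration and are simpler than the function above (no `unselect` option, one prefix list at a time).

```python
import io
import pandas as pd

# Tiny stand-in for the text of a series matrix file (tab-separated).
sample_text = "\n".join([
    '!Series_title\t"Toy series"',
    "!Sample_geo_accession\tGSM1\tGSM2",
    "!Sample_characteristics_ch1\tage: 21\tage: 22",
    "!Sample_characteristics_ch1\tmale: 0\tmale: 1",
    "!Sample_platform_id\tGPL1\tGPL1",
])


def select_lines(text, prefixes):
    """Keep the lines of `text` that start with any of the given prefixes."""
    return [line for line in text.split("\n") if line.startswith(tuple(prefixes))]


background_lines = select_lines(sample_text, ["!Series_title"])
clinical_lines = select_lines(
    sample_text, ["!Sample_geo_accession", "!Sample_characteristics_ch1"]
)

# The first selected line becomes the header row of the dataframe.
clinical_df = pd.read_csv(io.StringIO("\n".join(clinical_lines)), delimiter="\t")
print(clinical_df)
```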



In [48]:
def get_background_and_clinical_data(file_path,
                                     prefixes_a=['!Series_title', '!Series_summary', '!Series_overall_design'],
                                     prefixes_b=['!Sample_geo_accession', '!Sample_characteristics_ch1']):
    """Extract from a matrix file the background information about the dataset, and sample characteristics data"""
    background_info, clinical_data = filter_content_by_prefix(file_path, prefixes_a, prefixes_b, unselect=False,
                                                              source_type='file',
                                                              return_df_a=False, return_df_b=True)
    return background_info, clinical_data

In [49]:
import io
background_prefixes = ['!Series_title', '!Series_summary', '!Series_overall_design']
clinical_prefixes = ['!Sample_geo_accession', '!Sample_characteristics_ch1']

background_info, clinical_data = get_background_and_clinical_data(matrix_file, background_prefixes, clinical_prefixes)
print(background_info)

!Series_title	"Leukocyte gene expression variation as a function of Big 5 dimensions of human personality"
!Series_summary	"Individual differences in basal leukocyte gene expression profiles as a function of Big 5 personality dimensions"
!Series_overall_design	"Gene expression profiling was carried out on peripheral blood mononuclear cell RNA samples collected from 119 healthy adults measured for the 5 major dimensions of human personality (Neuroticism, Extraversion, Openness, Agreeableness, Conscientiousness) using the NEO-FFI 60-item personality inventory.  Personality measures are z-score standardized.  Analyses control for major demographic characteristics (age, sex, Caucasian vs Non-Caucasian race) as well as Body Mass Index (BMI), smoking (CigDay), alcohol consumption (AlcDay), and physical activity (ExerDay, hours per day).  Additional secondary analyses controlled for the presence of minor physical symptoms (MinorSymptom, e.g.,hayfever, headache), medication use (BirthControl, 

In [50]:
clinical_data.head()

Unnamed: 0,!Sample_geo_accession,GSM1481100,GSM1481101,GSM1481102,GSM1481103,GSM1481104,GSM1481105,GSM1481106,GSM1481107,GSM1481108,...,GSM1481209,GSM1481210,GSM1481211,GSM1481212,GSM1481213,GSM1481214,GSM1481215,GSM1481216,GSM1481217,GSM1481218
0,!Sample_characteristics_ch1,age: 21,age: 22,age: 21,age: 21,age: 22,age: 23,age: 23,age: 23,age: 23,...,age: 24,age: 20,age: 26,age: 22,age: 32,age: 27,age: 22,age: 26,age: 18,age: 20
1,!Sample_characteristics_ch1,male: 0,male: 1,male: 0,male: 1,male: 0,male: 0,male: 0,male: 0,male: 0,...,male: 1,male: 1,male: 0,male: 0,male: 1,male: 1,male: 1,male: 1,male: 1,male: 0
2,!Sample_characteristics_ch1,bmi: 23,bmi: 30,bmi: 18,bmi: 21.6,bmi: 22,bmi: 22,bmi: 16,bmi: 26,bmi: 24,...,bmi: 20,bmi: 24,bmi: 20,bmi: 22,bmi: 21.9,bmi: 22,bmi: 24,bmi: 23,bmi: 24,bmi: 36
3,!Sample_characteristics_ch1,caucasian: 1,caucasian: 1,caucasian: 1,caucasian: 0,caucasian: 0,caucasian: 0,caucasian: 0,caucasian: 0,caucasian: 0,...,caucasian: 0,caucasian: 1,caucasian: 1,caucasian: 1,caucasian: 0,caucasian: missing,caucasian: 1,caucasian: 1,caucasian: 1,caucasian: 1
4,!Sample_characteristics_ch1,cigday: 0,cigday: 7,cigday: 12,cigday: 0,cigday: 0,cigday: 0,cigday: 0,cigday: 0,cigday: 0,...,cigday: 0,cigday: 0,cigday: 0,cigday: 0,cigday: 0,cigday: 0,cigday: 0,cigday: 0,cigday: 0,cigday: 0


In [51]:
def get_unique_values_by_row(dataframe, max_len=30):
    """
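    Collect the unique values in each row of the given dataframe into a dictionary.
    :param dataframe: Sample characteristics dataframe, with one field per row.
    :param max_len: Maximum number of unique values to keep for each row.
    :return: Dictionary mapping each row index to a list of its unique values.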
    """
    if '!Sample_geo_accession' in dataframe.columns:
        dataframe = dataframe.drop(columns=['!Sample_geo_accession'])
    unique_values_dict = {}
    for index, row in dataframe.iterrows():
        unique_values = list(row.unique())[:max_len]
        unique_values_dict[index] = unique_values
    return unique_values_dict
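
A toy run of the same idea, assuming a three-sample dataframe invented for illustration, shows the shape of the dictionary that is later pasted into the prompt:

```python
import pandas as pd

toy_clinical = pd.DataFrame(
    {
        "!Sample_geo_accession": ["!Sample_characteristics_ch1"] * 2,
        "GSM1": ["age: 21", "male: 0"],
        "GSM2": ["age: 22", "male: 1"],
        "GSM3": ["age: 21", "male: 0"],
    }
)

# Drop the label column, then collect the unique values of each row.
values_only = toy_clinical.drop(columns=["!Sample_geo_accession"])
unique_by_row = {idx: list(row.unique()) for idx, row in values_only.iterrows()}
print(unique_by_row)
```

The integer keys of this dictionary are the row identifiers that `trait_row`, `age_row`, and `gender_row` refer to.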

In [52]:
clinical_data_unique = get_unique_values_by_row(clinical_data)
clinical_data_unique

{0: ['age: 21',
  'age: 22',
  'age: 23',
  'age: 33',
  'age: 20',
  'age: 34',
  'age: 19',
  'age: 27',
  'age: 53',
  'age: 25',
  'age: 26',
  'age: 45',
  'age: 38',
  'age: 29',
  'age: 30',
  'age: 28',
  'age: 18',
  'age: 24',
  'age: 59',
  'age: 35',
  'age: 51',
  'age: 50',
  'age: 32'],
 1: ['male: 0', 'male: 1'],
 2: ['bmi: 23',
  'bmi: 30',
  'bmi: 18',
  'bmi: 21.6',
  'bmi: 22',
  'bmi: 16',
  'bmi: 26',
  'bmi: 24',
  'bmi: 31',
  'bmi: 21',
  'bmi: missing',
  'bmi: 19',
  'bmi: 25',
  'bmi: 17.5',
  'bmi: 20',
  'bmi: 28',
  'bmi: 29',
  'bmi: 35',
  'bmi: 17',
  'bmi: 27',
  'bmi: 21.9',
  'bmi: 36'],
 3: ['caucasian: 1', 'caucasian: 0', 'caucasian: missing'],
 4: ['cigday: 0',
  'cigday: 7',
  'cigday: 12',
  'cigday: 3',
  'cigday: 5',
  'cigday: 0.1',
  'cigday: 1.4',
  'cigday: 20'],
 5: ['alcday: 0.4',
  'alcday: 0',
  'alcday: 1.3',
  'alcday: 0.9',
  'alcday: 2',
  'alcday: 1',
  'alcday: 0.1',
  'alcday: 1.4',
  'alcday: 0.6',
  'alcday: 2.9',
  'alcday: 

Analyze the metadata to determine data relevance and find ways to extract the clinical data.
Reference prompt:

In [53]:
f'''As a biomedical research team, we are selecting datasets to study the association between the human trait \'{TRAIT}\' and genetic factors, optionally considering the influence of age and gender. After searching the GEO database and parsing the matrix file of a series, we obtained background information and sample characteristics data. We will provide textual information about the dataset background, and a Python dictionary storing a list of unique values for each field of the sample characteristics data. Please carefully review the provided information and answer the following questions about this dataset:
1. Does this dataset contain gene expression data? (Note: Pure miRNA data is not suitable.)
2. For each of the traits \'{TRAIT}\', 'age', and 'gender', please address these points:
   (1) Is there human data available for this trait?
   (2) If so, identify the key in the sample characteristics dictionary where unique values of this trait is recorded. The key is an integer. The trait information might be explicitly recorded, or can be inferred from the field with some biomedical knowledge or understanding about the data collection process.
   (3) Choose an appropriate data type (either 'continuous' or 'binary') for each trait. Write a Python function to convert any given value of the trait to this data type. The function should handle inference about the trait value and convert unknown values to None.
   Name the functions 'convert_trait', 'convert_age', and 'convert_gender', respectively.

Background information about the dataset:
{background_info}

Sample characteristics dictionary (from "!Sample_characteristics_ch1", converted to a Python dictionary that stores the unique values for each field):
{clinical_data_unique}
'''

'As a biomedical research team, we are selecting datasets to study the association between the human trait \'Anxiety disorder\' and genetic factors, optionally considering the influence of age and gender. After searching the GEO database and parsing the matrix file of a series, we obtained background information and sample characteristics data. We will provide textual information about the dataset background, and a Python dictionary storing a list of unique values for each field of the sample characteristics data. Please carefully review the provided information and answer the following questions about this dataset:\n1. Does this dataset contain gene expression data? (Note: Pure miRNA data is not suitable.)\n2. For each of the traits \'Anxiety disorder\', \'age\', and \'gender\', please address these points:\n   (1) Is there human data available for this trait?\n   (2) If so, identify the key in the sample characteristics dictionary where unique values of this trait is recorded. The ke

Read and verify the answer from GPT, then assign values to the variables below. Assign None to a row variable ('trait_row', 'age_row', or 'gender_row') if no relevant data row was found.
Later, GPT should format its answer so that these assignments can be made automatically. Given the complexity of this step, for now we build some insight from the free-text answers.

In [54]:
age_row = gender_row = None
convert_age = convert_gender = None

In [55]:
is_gene_available = True
trait_row = 9
age_row = 0
gender_row = 1
trait_type = 'binary'

In [56]:
is_available = is_gene_available and (trait_row is not None)
if not is_available:
    save_cohort_info(cohort, JSON_PATH, is_available)
    print("This cohort is not usable. Please skip the following steps and jump to the next accession number.")

In [57]:
# Verify and use the functions generated by GPT

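def convert_trait(value):
    """
    Convert the 'antidepressant' field (row 9) to a binary proxy for the trait.
    As written, 'antidepressant: 0' maps to 1 and every other value,
    including 'antidepressant: 1' and 'antidepressant: missing', maps to 0.
    """
    # TODO: verify the direction of this proxy and whether 'missing' should become None.
    if value == 'antidepressant: 0':
        return 1
    else:
        return 0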


def convert_age(age_string):
    """
    Convert age string to a continuous numerical value.
    Unknown values are converted to None.
    """
    if age_string.lower() == 'n.a.':
        return None
    try:
        # Extract age as an integer from the string
        age = int(age_string.split(': ')[1])
        return age
    except (ValueError, IndexError):
        # In case of any format error or unexpected string structure
        return None


# It sometimes maps 'female' to 0, and sometimes 1. Does it matter?
def convert_gender(gender_string):
    """
    Convert gender string to a binary value.
    The field is encoded as 'male: 0' or 'male: 1', so male is represented as 1 and female as 0.
    Unknown values are converted to None.
    """
    if gender_string.lower() == 'male: 0':
        return 0
    elif gender_string.lower() == 'male: 1':
        return 1
    else:
        return None
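
The reference prompt asks that unknown values be converted to None, and this cohort contains placeholders such as 'bmi: missing'. One possible way to handle that for 'name: value' cells is sketched below. Both helpers are hypothetical, and the set of placeholder strings is an assumption.

```python
from typing import Optional


def split_field(cell) -> Optional[str]:
    """Return the text after 'name:' in a sample-characteristics cell, or None."""
    if not isinstance(cell, str) or ":" not in cell:
        return None
    value = cell.split(":", 1)[1].strip()
    # Treat placeholder strings as unknown (assumed list of placeholders).
    return None if value.lower() in {"", "missing", "n.a.", "na"} else value


def to_float(cell) -> Optional[float]:
    """Hypothetical converter for continuous fields such as 'bmi: 21.6'."""
    value = split_field(cell)
    try:
        return float(value) if value is not None else None
    except ValueError:
        return None


print(to_float("bmi: 21.6"), to_float("bmi: missing"))
```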

In [58]:
def get_feature_data(clinical_df, row_id, feature, convert_fn):
    """select the row corresponding to a feature in the sample characteristics dataframe, and convert the feature into
    a binary or continuous variable"""
    clinical_df = clinical_df.iloc[row_id:row_id + 1].drop(columns=['!Sample_geo_accession'], errors='ignore')
    clinical_df.index = [feature]
    clinical_df = clinical_df.applymap(convert_fn)

    return clinical_df

In [59]:
def geo_select_clinical_features(clinical_df: pd.DataFrame, trait: str, trait_row: int,
                                 convert_trait: Callable,
                                 age_row: Optional[int] = None,
                                 convert_age: Optional[Callable] = None,
                                 gender_row: Optional[int] = None,
                                 convert_gender: Optional[Callable] = None) -> pd.DataFrame:
    """
    Extracts and processes specific clinical features from a DataFrame representing
    sample characteristics in the GEO database series.

    Parameters:
    - clinical_df (pd.DataFrame): DataFrame containing clinical data.
    - trait (str): The trait of interest.
    - trait_row (int): Row identifier for the trait in the DataFrame.
    - convert_trait (Callable): Function to convert trait data into a desired format.
    - age_row (int, optional): Row identifier for age data. Default is None.
    - convert_age (Callable, optional): Function to convert age data. Default is None.
    - gender_row (int, optional): Row identifier for gender data. Default is None.
    - convert_gender (Callable, optional): Function to convert gender data. Default is None.

    Returns:
    pd.DataFrame: A DataFrame containing the selected and processed clinical features.
    """
    feature_list = []

    trait_data = get_feature_data(clinical_df, trait_row, trait, convert_trait)
    feature_list.append(trait_data)
    if age_row is not None:
        age_data = get_feature_data(clinical_df, age_row, 'Age', convert_age)
        feature_list.append(age_data)
    if gender_row is not None:
        gender_data = get_feature_data(clinical_df, gender_row, 'Gender', convert_gender)
        feature_list.append(gender_data)

    selected_clinical_df = pd.concat(feature_list, axis=0)
    return selected_clinical_df
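
The row selection and conversion described above can be tried on a two-sample toy dataframe. This is a sketch with invented data. `select_row` is a stand-in written for this example, and it uses column-wise `Series.map` instead of `applymap` so that it does not depend on the pandas version.

```python
import pandas as pd

toy_clinical = pd.DataFrame(
    {
        "!Sample_geo_accession": ["!Sample_characteristics_ch1"] * 2,
        "GSM1": ["age: 21", "male: 0"],
        "GSM2": ["age: 32", "male: 1"],
    }
)


def select_row(df, row_id, name, convert_fn):
    """Pick one characteristics row, rename it, and convert every cell."""
    row = df.iloc[row_id:row_id + 1].drop(
        columns=["!Sample_geo_accession"], errors="ignore"
    )
    row.index = [name]
    # Apply the converter to each cell, one column at a time.
    return row.apply(lambda col: col.map(convert_fn))


age = select_row(toy_clinical, 0, "Age", lambda s: int(s.split(": ")[1]))
gender = select_row(toy_clinical, 1, "Gender", lambda s: int(s.split(": ")[1]))
selected = pd.concat([age, gender], axis=0)
print(selected)
```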

In [60]:
selected_clinical_data = geo_select_clinical_features(clinical_data, TRAIT, trait_row, convert_trait, age_row=age_row,
                                                      convert_age=convert_age, gender_row=gender_row,
                                                      convert_gender=convert_gender)
selected_clinical_data.head()



Unnamed: 0,GSM1481100,GSM1481101,GSM1481102,GSM1481103,GSM1481104,GSM1481105,GSM1481106,GSM1481107,GSM1481108,GSM1481109,...,GSM1481209,GSM1481210,GSM1481211,GSM1481212,GSM1481213,GSM1481214,GSM1481215,GSM1481216,GSM1481217,GSM1481218
Anxiety disorder,1,1,1,1,1,1,1,0,1,0,...,1,1,1,1,1,0,1,1,1,1
Age,21,22,21,21,22,23,23,23,23,23,...,24,20,26,22,32,27,22,26,18,20
Gender,0,1,0,1,0,0,0,0,0,0,...,1,1,0,0,1,1,1,1,1,0


### Genetic data preprocessing and final filtering

In [61]:
def get_genetic_data(file_path):
    """Read the gene expression data into a dataframe, and adjust its format"""
    genetic_data = pd.read_csv(file_path, compression='gzip', skiprows=52, comment='!', delimiter='\t')
    genetic_data = genetic_data.dropna()
    genetic_data = genetic_data.rename(columns={'ID_REF': 'ID'}).astype({'ID': 'str'})
    genetic_data.set_index('ID', inplace=True)

    return genetic_data
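
The function above skips a fixed number of header lines. An alternative is to search for the line that marks the start of the expression table. The sketch below assumes the file marks the table with a `!series_matrix_table_begin` line, as GEO series matrix files commonly do. The helper names and the toy text are made up for illustration.

```python
import io
import pandas as pd


def find_table_start(lines, marker="!series_matrix_table_begin"):
    """Return the number of lines to skip so that reading starts at the table header."""
    for i, line in enumerate(lines):
        if line.strip().lower().startswith(marker):
            return i + 1
    raise ValueError("table start marker not found")


def read_expression_table(text):
    """Parse the expression table from the text of a series matrix file."""
    skip = find_table_start(text.splitlines())
    # Lines starting with '!' after the header (such as the end marker) are ignored.
    return pd.read_csv(io.StringIO(text), skiprows=skip, comment="!", delimiter="\t")


toy_text = "\n".join([
    '!Series_title\t"Toy"',
    "!series_matrix_table_begin",
    '"ID_REF"\t"GSM1"\t"GSM2"',
    '"A1BG"\t1.5\t2.5',
    "!series_matrix_table_end",
])
table = read_expression_table(toy_text)
print(table)
```

For a gzipped file, the same search can be run over the lines read with `gzip.open(path, 'rt')` before calling `pd.read_csv` with the resulting `skiprows` value.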


In [62]:
genetic_data = get_genetic_data(matrix_file)
genetic_data.head()

ID,GSM1481100,GSM1481101,GSM1481102,GSM1481103,GSM1481104,GSM1481105,GSM1481106,GSM1481107,GSM1481108,GSM1481109,...,GSM1481209,GSM1481210,GSM1481211,GSM1481212,GSM1481213,GSM1481214,GSM1481215,GSM1481216,GSM1481217,GSM1481218
7A5,122.4265,131.7769,119.4328,118.3859,119.1651,127.3848,127.8088,124.0155,122.6773,123.7244,...,126.0626,133.5475,129.8606,120.7393,126.4916,137.7973,129.5558,121.2862,139.3874,122.1245
A1BG,878.7355,886.0469,949.6923,747.9395,874.6235,653.9559,831.1398,722.0098,1030.245,832.8301,...,1334.319,752.4141,907.3934,941.5509,1106.653,1092.782,1272.846,1060.915,936.6741,1060.601
A1CF,130.8007,132.5247,141.0343,138.1274,144.6797,130.9009,143.6457,144.2968,140.4412,135.7075,...,139.9214,136.9379,142.3792,151.1532,148.4035,154.3208,143.4808,139.8078,134.4173,134.2816
A26C3,119.3683,114.8981,118.6767,112.5082,108.6636,116.0207,113.3711,116.4169,113.982,112.319,...,118.6542,118.0074,123.5586,111.4767,120.3181,118.5838,120.2557,123.6906,115.9226,119.2732
A2BP1,119.3083,117.0984,115.3259,120.6321,117.4373,108.4148,115.9964,113.0969,119.0173,117.6048,...,121.1456,118.6,119.0057,127.2296,119.4768,117.457,116.342,119.2073,122.7845,121.0886


In [63]:
gene_row_ids = genetic_data.index[:20].tolist()
gene_row_ids

['7A5',
 'A1BG',
 'A1CF',
 'A26C3',
 'A2BP1',
 'A2LD1',
 'A2M',
 'A2ML1',
 'A3GALT2',
 'A4GALT',
 'A4GNT',
 'AAA1',
 'AAAS',
 'AACS',
 'AACSL',
 'AADAC',
 'AADACL1',
 'AADACL2',
 'AADACL3',
 'AADACL4']

Check if the gene dataset requires mapping to get the gene symbols corresponding to each data row.

Reference prompt:

In [64]:
f'''
Below are the row headers of a gene expression dataset in GEO. Based on your biomedical knowledge, are they human gene symbols, or are they some other identifiers that need to be mapped to gene symbols? Your answer should be concluded by starting a new line and strictly following this format:
requires_gene_mapping = (True or False)

Row headers:
{gene_row_ids}
'''

"\nBelow are the row headers of a gene expression dataset in GEO. Based on your biomedical knowledge, are they human gene symbols, or are they some other identifiers that need to be mapped to gene symbols? Your answer should be concluded by starting a new line and strictly following this format:\nrequires_gene_mapping = (True or False)\n\nRow headers:\n['7A5', 'A1BG', 'A1CF', 'A26C3', 'A2BP1', 'A2LD1', 'A2M', 'A2ML1', 'A3GALT2', 'A4GALT', 'A4GNT', 'AAA1', 'AAAS', 'AACS', 'AACSL', 'AADAC', 'AADACL1', 'AADACL2', 'AADACL3', 'AADACL4']\n"


If mapping is not required, skip ahead to the gene normalization step.

In [65]:
requires_gene_mapping = False
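
The reference prompt asks for a strictly formatted final line, so the flag can be read from the answer programmatically instead of being typed in by hand. A minimal sketch, assuming the answer is available as a plain string; the helper name `parse_requires_gene_mapping` and the example answer text are hypothetical and not part of `utils`:

```python
import re

def parse_requires_gene_mapping(answer: str) -> bool:
    """Return the value of the last 'requires_gene_mapping = ...' line in the answer."""
    pattern = r"^\s*requires_gene_mapping\s*=\s*\(?\s*(True|False)\s*\)?\s*$"
    matches = re.findall(pattern, answer, flags=re.MULTILINE)
    if not matches:
        raise ValueError("No 'requires_gene_mapping = ...' line found in the answer.")
    # Use the last match, since the prompt asks for the conclusion on a final new line
    return matches[-1] == "True"

# Hypothetical answer text, written by hand for illustration
example_answer = (
    "These row headers look like human gene symbols (A1BG, A2M, AAAS).\n"
    "requires_gene_mapping = False"
)
print(parse_requires_gene_mapping(example_answer))  # False
```

Raising an error when the line is missing makes a malformed answer visible immediately, rather than silently defaulting to one branch.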

In [66]:
if requires_gene_mapping:
    gene_annotation = get_gene_annotation(soft_file)
    gene_annotation_summary = preview_df(gene_annotation)
    print(gene_annotation_summary)

Inspect the first few values of each column in the gene annotation dataframe, starting with the ID column, to find the names of the columns that store the gene probe IDs and the gene symbols, respectively.
Reference prompt:

In [67]:
if requires_gene_mapping:
    print(f'''
    As a biomedical research team, we are analyzing a gene expression dataset, and find that its row headers are some identifiers related to genes:
    {gene_row_ids}
    To get the mapping from those identifiers to actual gene symbols, we extracted the gene annotation data from a series in the GEO database, and saved it to a Python dictionary. Please read the dictionary, and decide which key stores the identifiers, and which key stores the gene symbols. Please strictly follow this format in your answer:
    identifier_key = 'key_name1'
    gene_symbol_key = 'key_name2'

    Gene annotation dictionary:
    {gene_annotation_summary}
    ''')

In [68]:
if requires_gene_mapping:
    identifier_key = 'ID'
    gene_symbol_key = 'UCSC_RefGene_Name'
    gene_mapping = get_gene_mapping(gene_annotation, identifier_key, gene_symbol_key)
    genetic_data = apply_gene_mapping(genetic_data, gene_mapping)

In [69]:
def normalize_gene_symbols_in_index(gene_df):
    """Normalize the human gene symbols at the index of a dataframe, and replace the index with its normalized version.
    Remove the rows where the index failed to be normalized."""
    normalized_gene_list = normalize_gene_symbols(gene_df.index.tolist())
    assert len(normalized_gene_list) == len(gene_df.index)
    gene_df.index = normalized_gene_list
    gene_df = gene_df[gene_df.index.notnull()]
    return gene_df

In [70]:
def normalize_gene_symbols(gene_symbols, batch_size=1000):
    """Normalize human gene symbols in batches using the 'mygene' library."""
    mg = mygene.MyGeneInfo()
    normalized_genes = {}

    # Process in batches
    for i in range(0, len(gene_symbols), batch_size):
        batch = gene_symbols[i:i + batch_size]
        results = mg.querymany(batch, scopes='symbol', fields='symbol', species='human')

        # Update the normalized_genes dictionary with results from this batch
        for gene in results:
            normalized_genes[gene['query']] = gene.get('symbol', None)

    # Return the normalized symbols in the same order as the input
    return [normalized_genes.get(symbol) for symbol in gene_symbols]
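
The loop above overwrites `normalized_genes[gene['query']]` on every hit, so for a query with duplicate hits the last hit wins, and a query with no hit maps to `None`. The sketch below makes that policy explicit and switches it to "first hit wins" on a hand-written result list, so it runs without network access. The shape of `mock_results` (a `query` key plus either `symbol` or `notfound`) is an assumption about what the service returns, and `SECOND_HIT` is a placeholder rather than a real gene symbol:

```python
def collapse_hits(results, queries):
    """Map each query to a single symbol, keeping the first hit that carries a symbol."""
    normalized = {}
    for hit in results:
        query = hit["query"]
        symbol = hit.get("symbol")  # None for entries without a symbol (no hit)
        # Only fill the slot if it is still empty, so later duplicates do not overwrite it
        if normalized.get(query) is None:
            normalized[query] = symbol
    # Return symbols in the same order as the input queries
    return [normalized.get(q) for q in queries]

# Hand-written stand-in for a query result; not real service output
mock_results = [
    {"query": "A1BG", "symbol": "A1BG"},
    {"query": "ADAM6", "symbol": "ADAM6"},
    {"query": "ADAM6", "symbol": "SECOND_HIT"},  # duplicate hit for the same query
    {"query": "7A5", "notfound": True},
]
queries = ["7A5", "A1BG", "ADAM6"]
print(collapse_hits(mock_results, queries))  # [None, 'A1BG', 'ADAM6']
```

Whether the first or the last hit is the better choice depends on how the service orders its hits, which this sketch does not verify.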

In [71]:
import mygene

In [72]:
if NORMALIZE_GENE:
    genetic_data = normalize_gene_symbols_in_index(genetic_data)

13 input query terms found dup hits:	[('ABCC13', 2), ('ABCC6P1', 2), ('ABCC6P2', 3), ('ADAM3A', 2), ('ADAM6', 3), ('AGAP11', 2), ('AK2P2'
95 input query terms found no hit:	['7A5', 'A26C3', 'A2BP1', 'A2LD1', 'AACSL', 'AADACL1', 'AARS', 'ABCA11', 'ABHD7', 'ABHD9', 'ABP1', '
6 input query terms found dup hits:	[('ATP6AP1L', 2), ('ATXN8OS', 2), ('BAGE2', 2), ('BIRC8', 2), ('BRD7P3', 2), ('BRI3P1', 2)]
370 input query terms found no hit:	['ARMC4', 'ARMET', 'ARNTL', 'ARNTL2', 'ARP11', 'ARPM1', 'ARPM2', 'ARPP-21', 'ARS2', 'ARSE', 'ARVP612
812 input query terms found no hit:	['C15ORF51', 'C15ORF52', 'C15ORF53', 'C15ORF54', 'C15ORF55', 'C15ORF56', 'C15ORF57', 'C15ORF58', 'C1
7 input query terms found dup hits:	[('CATSPER2P1', 2), ('CCDC144NL', 2), ('CCNYL2', 2), ('CCNYL3', 2), ('CCT6P1', 2), ('CDR1', 2), ('CE
126 input query terms found no hit:	['CA5BP', 'CABC1', 'CAMSAP1L1', 'CARD17', 'CARKD', 'CARS', 'CART1', 'CASC1', 'CASC4', 'CASC5', 'CASR
4 input query terms found dup hits:	[('CLEC4GP1', 

In [73]:
def geo_merge_clinical_genetic_data(clinical_df, genetic_df):
    """
    Merge the clinical features and gene expression features from two dataframes into one dataframe
    """
    if 'ID' in genetic_df.columns:
        genetic_df = genetic_df.rename(columns={'ID': 'Gene'})
    if 'Gene' in genetic_df.columns:
        genetic_df = genetic_df.set_index('Gene')
    merged_data = pd.concat([clinical_df, genetic_df], axis=0).T.dropna()
    return merged_data


In [74]:
merged_data = geo_merge_clinical_genetic_data(selected_clinical_data, genetic_data)
# The preprocessing runs through, which means is_available should be True
is_available = True

In [75]:
print(f"The merged dataset contains {len(merged_data)} samples.")

The merged dataset contains 119 samples.


In [76]:
def judge_and_remove_biased_features(df, trait, trait_type):
    assert trait_type in ["binary", "continuous"], f"The trait must be either a binary or a continuous variable!"
    if trait_type == "binary":
        trait_biased = judge_binary_variable_biased(df, trait)
    else:
        trait_biased = judge_continuous_variable_biased(df, trait)
    if trait_biased:
        print(f"The distribution of the feature \'{trait}\' in this dataset is severely biased.\n")
    else:
        print(f"The distribution of the feature \'{trait}\' in this dataset is fine.\n")
    if "Age" in df.columns:
        age_biased = judge_continuous_variable_biased(df, 'Age')
        if age_biased:
            print(f"The distribution of the feature \'Age\' in this dataset is severely biased.\n")
            df = df.drop(columns='Age')
        else:
            print(f"The distribution of the feature \'Age\' in this dataset is fine.\n")
    if "Gender" in df.columns:
        gender_biased = judge_binary_variable_biased(df, 'Gender')
        if gender_biased:
            print(f"The distribution of the feature \'Gender\' in this dataset is severely biased.\n")
            df = df.drop(columns='Gender')
        else:
            print(f"The distribution of the feature \'Gender\' in this dataset is fine.\n")

    return trait_biased, df


In [77]:
def judge_binary_variable_biased(dataframe, col_name, min_proportion=0.1, min_num=5):
    """
    Check if the distribution of a binary variable in the dataset is too biased to be usable for analysis
    :param dataframe:
    :param col_name:
    :param min_proportion:
    :param min_num:
    :return:
    """
    label_counter = dataframe[col_name].value_counts()
    total_samples = len(dataframe)
    rare_label_num = label_counter.min()
    rare_label = label_counter.idxmin()
    rare_label_proportion = rare_label_num / total_samples

    print(
        f"For the feature \'{col_name}\', the least common label is '{rare_label}' with {rare_label_num} occurrences. This represents {rare_label_proportion:.2%} of the dataset.")

    biased = (len(label_counter) < 2) or ((rare_label_proportion < min_proportion) and (rare_label_num < min_num))
    return bool(biased)

In [78]:
def judge_continuous_variable_biased(dataframe, col_name):
    """Check if the distribution of a continuous variable in the dataset is too biased to be usable for analysis.
    As a starting point, we consider it biased if all values are the same. For the next step, maybe ask GPT to judge
    based on quartile statistics combined with its common sense knowledge about this feature.
    """
    quartiles = dataframe[col_name].quantile([0.25, 0.5, 0.75])
    min_value = dataframe[col_name].min()
    max_value = dataframe[col_name].max()

    # Printing quartile information
    print(f"Quartiles for '{col_name}':")
    print(f"  25%: {quartiles[0.25]}")
    print(f"  50% (Median): {quartiles[0.5]}")
    print(f"  75%: {quartiles[0.75]}")
    print(f"Min: {min_value}")
    print(f"Max: {max_value}")

    biased = min_value == max_value

    return bool(biased)

In [79]:
is_trait_biased, merged_data = judge_and_remove_biased_features(merged_data, TRAIT, trait_type=trait_type)
is_trait_biased

For the feature 'Anxiety disorder', the least common label is '0.0' with 11 occurrences. This represents 9.24% of the dataset.
The distribution of the feature 'Anxiety disorder' in this dataset is fine.

Quartiles for 'Age':
  25%: 20.0
  50% (Median): 22.0
  75%: 24.0
Min: 18.0
Max: 59.0
The distribution of the feature 'Age' in this dataset is fine.

For the feature 'Gender', the least common label is '1.0' with 35 occurrences. This represents 29.41% of the dataset.
The distribution of the feature 'Gender' in this dataset is fine.



False
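
The binary check flags a variable only when the rarer label is both below `min_proportion` and below `min_num`, which is why a trait with 11 of 119 samples (9.24%) in the rarer class is still reported as fine. A minimal standard-library sketch of the same rule; the function name `is_binary_biased` and the second label list are illustrative only:

```python
from collections import Counter

def is_binary_biased(labels, min_proportion=0.1, min_num=5):
    """Return True if the rarer label is both proportionally and absolutely rare."""
    counts = Counter(labels)
    if len(counts) < 2:
        return True  # only one label present, nothing to contrast
    rare_count = min(counts.values())
    rare_proportion = rare_count / len(labels)
    return rare_proportion < min_proportion and rare_count < min_num

labels_reported = [0.0] * 11 + [1.0] * 108   # same counts as the output above
labels_tiny = [0.0] * 3 + [1.0] * 116        # illustrative: 3 of 119 samples

print(is_binary_biased(labels_reported))  # False: below 10%, but 11 >= 5
print(is_binary_biased(labels_tiny))      # True: below 10% and 3 < 5
```

Because the two conditions are joined with `and`, a large cohort can pass with a very small minority proportion as long as the minority count reaches `min_num`.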

In [80]:
def save_cohort_info(cohort: str, info_path: str, is_available: bool, is_biased: Optional[bool] = None,
                     df: Optional[pd.DataFrame] = None, note: str = '') -> None:
    """
    Add or update information about the usability and quality of a dataset for statistical analysis.

    Parameters:
    cohort (str): A unique identifier for the dataset.
    info_path (str): File path to the JSON file where records are stored.
    is_available (bool): Indicates whether both the genetic data and trait data are available in the dataset, and can be
     preprocessed into a dataframe.
    is_biased (bool, optional): Indicates whether the dataset is too biased to be usable.
        Required if `is_available` is True.
    df (pandas.DataFrame, optional): The preprocessed dataset. Required if `is_available` is True.
    note (str, optional): Additional notes about the dataset.

    Returns:
    None: The function does not return a value but updates or creates a record in the specified JSON file.
    """
    if is_available:
        assert (df is not None) and (is_biased is not None), "'df' and 'is_biased' should be provided if this cohort " \
                                                             "is available."
    is_usable = is_available and (not is_biased)
    new_record = {"is_usable": is_usable,
                  "is_available": is_available,
                  "is_biased": is_biased if is_available else None,
                  "has_age": "Age" in df.columns if is_available else None,
                  "has_gender": "Gender" in df.columns if is_available else None,
                  "sample_size": len(df) if is_available else None,
                  "note": note}
    
    if not os.path.exists(info_path):
        with open(info_path, 'w') as file:
            json.dump({}, file)
        print(f"A new JSON file was created at: {info_path}")

    with open(info_path, "r") as file:
        records = json.load(file)
    records[cohort] = new_record

    temp_path = info_path + ".tmp"
    try:
        with open(temp_path, 'w') as file:
            json.dump(records, file)
        os.replace(temp_path, info_path)

    except Exception as e:
        print(f"An error occurred: {e}")
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
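
`save_cohort_info` writes to a temporary file and then calls `os.replace`, so a failure during writing should not leave a half-written JSON file at `info_path`. The pattern can be tried in isolation as follows; this is a sketch with hypothetical cohort names and a throwaway directory, not a replacement for the function above:

```python
import json
import os
import tempfile

def update_json_record(path, key, record):
    """Insert or overwrite one record in a JSON file using write-then-replace."""
    records = {}
    if os.path.exists(path):
        with open(path, "r") as f:
            records = json.load(f)
    records[key] = record

    tmp_path = path + ".tmp"
    try:
        with open(tmp_path, "w") as f:
            json.dump(records, f)
        os.replace(tmp_path, path)  # swap the finished file into place
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return records

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "cohort_info.json")
    update_json_record(demo_path, "COHORT_A", {"is_usable": True, "sample_size": 119})
    update_json_record(demo_path, "COHORT_B", {"is_usable": False, "sample_size": None})
    with open(demo_path) as f:
        demo_records = json.load(f)
    leftover_tmp = os.path.exists(demo_path + ".tmp")

print(sorted(demo_records))  # ['COHORT_A', 'COHORT_B']
print(leftover_tmp)          # False
```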

In [81]:
import json
if is_available:
    save_cohort_info(cohort, JSON_PATH, is_available, is_trait_biased, merged_data, note='')
else:
    save_cohort_info(cohort, JSON_PATH, is_available)

In [82]:
merged_data.head()
if not is_trait_biased:
    merged_data.to_csv(os.path.join(OUTPUT_DIR, cohort + '.csv'), index=False)

# 3. Regression and cross-validation

In [83]:
def read_json_to_dataframe(json_file: str) -> pd.DataFrame:
    """
    Reads a JSON file and converts it into a pandas DataFrame.

    Args:
    json_file (str): The path to the JSON file containing the data.

    Returns:
    DataFrame: A pandas DataFrame with the JSON data.
    """
    with open(json_file, 'r') as file:
        data = json.load(file)
    return pd.DataFrame.from_dict(data, orient='index').reset_index().rename(columns={'index': 'cohort_id'})

In [84]:
def filter_and_rank_cohorts(json_file: str, condition: Union[str, None] = None) -> Tuple[
    Union[str, None], pd.DataFrame]:
    """
    Reads a JSON file, filters cohorts based on usability and an optional condition, then ranks them by sample size.

    Args:
    json_file (str): The path to the JSON file containing the data.
    condition (str, optional): An additional condition for filtering. If None, only 'is_usable' is considered.

    Returns:
    Tuple: A tuple containing the best cohort ID (str or None if no suitable cohort is found) and
           the filtered and ranked DataFrame.
    """
    # Read the JSON file into a DataFrame
    df = read_json_to_dataframe(json_file)

    if condition:
        filtered_df = df[(df['is_usable'] == True) & (df[condition] == True)]
    else:
        filtered_df = df[df['is_usable'] == True]

    ranked_df = filtered_df.sort_values(by='sample_size', ascending=False)
    best_cohort_id = ranked_df.iloc[0]['cohort_id'] if not ranked_df.empty else None

    return best_cohort_id, ranked_df
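
The selection logic is: keep usable cohorts, optionally require a condition flag such as `has_age`, then prefer the largest sample size. The same logic on a plain dictionary, as a sketch with made-up records (the cohort names and sizes below are hypothetical, and `rank_usable_cohorts` is not part of `utils`):

```python
def rank_usable_cohorts(records, condition=None):
    """Return (best_cohort_id, ranked_ids) among usable cohorts, largest first."""
    usable = [
        (cohort_id, info)
        for cohort_id, info in records.items()
        if info.get("is_usable") is True
        and (condition is None or info.get(condition) is True)
    ]
    usable.sort(key=lambda item: item[1]["sample_size"], reverse=True)
    best = usable[0][0] if usable else None
    return best, [cohort_id for cohort_id, _ in usable]

# Hypothetical records in the same layout as cohort_info.json
hypothetical_records = {
    "COHORT_A": {"is_usable": True, "has_age": True, "has_gender": False, "sample_size": 80},
    "COHORT_B": {"is_usable": True, "has_age": False, "has_gender": True, "sample_size": 200},
    "COHORT_C": {"is_usable": False, "has_age": None, "has_gender": None, "sample_size": None},
}

print(rank_usable_cohorts(hypothetical_records))             # ('COHORT_B', ['COHORT_B', 'COHORT_A'])
print(rank_usable_cohorts(hypothetical_records, "has_age"))  # ('COHORT_A', ['COHORT_A'])
```

Note that adding a condition can change the best cohort: the largest usable cohort is not necessarily the largest one that also records age.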


In [85]:
# Check the information of usable cohorts
best_cohort, ranked_df = filter_and_rank_cohorts(JSON_PATH)
ranked_df

Unnamed: 0,cohort_id,is_usable,is_available,is_biased,has_age,has_gender,sample_size,note
1,GSE60491,True,True,False,True,True,119,


In [86]:
# If both age and gender have available cohorts, select 'age' as the condition.
condition = 'Age'
filter_column = 'has_' + condition.lower()

condition_best_cohort, condition_ranked_df = filter_and_rank_cohorts(JSON_PATH, filter_column)
condition_best_cohort

'GSE60491'

In [87]:
condition_ranked_df.head()

Unnamed: 0,cohort_id,is_usable,is_available,is_biased,has_age,has_gender,sample_size,note
1,GSE60491,True,True,False,True,True,119,


In [88]:
merged_data = pd.read_csv(os.path.join(OUTPUT_DIR, condition_best_cohort + '.csv'))
merged_data.head()

Unnamed: 0,Anxiety disorder,Age,Gender,A1BG,A1CF,A2M,A2ML1,A3GALT2,A4GALT,A4GNT,...,ZWILCH,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3
0,1.0,21.0,0.0,878.7355,130.8007,98.2296,120.827,127.74,107.5727,203.9451,...,143.9445,142.8054,137.5964,276.6226,286.5479,117.3781,243.3207,143.0981,126.1316,174.5524
1,1.0,22.0,1.0,886.0469,132.5247,107.4875,112.3008,133.3795,102.2422,252.595,...,141.7784,155.8628,132.0846,288.0483,256.291,116.3579,252.8782,138.5269,117.8299,185.881
2,1.0,21.0,0.0,949.6923,141.0343,94.14699,118.4021,122.5672,105.2472,234.2607,...,143.675,148.7864,128.6895,290.3251,250.7705,112.9105,236.6002,140.7121,112.9356,170.8842
3,1.0,21.0,1.0,747.9395,138.1274,110.0556,118.7682,130.9674,112.517,262.9443,...,147.4358,162.256,131.9409,232.3967,275.3432,105.5575,237.31,134.0878,113.5443,170.789
4,1.0,22.0,0.0,874.6235,144.6797,104.103,119.3217,149.0589,109.5576,225.8657,...,128.2727,145.9251,136.56,322.6942,289.0635,107.0224,241.3908,131.0249,103.9684,182.1901


In [89]:
# Remove the other condition to prevent interference.
merged_data = merged_data.drop(columns=['Gender'], errors='ignore').astype('float')

X = merged_data.drop(columns=[TRAIT, condition]).values
Y = merged_data[TRAIT].values
Z = merged_data[condition].values

Select the appropriate regression model depending on whether the dataset shows a batch effect.

In [90]:
def detect_batch_effect(X):
    """
    Detect potential batch effects in a dataset using eigenvalues of XX^T.

    Args:
    X (numpy.ndarray): A feature matrix with shape (n_samples, n_features).

    Returns:
    bool: True if a potential batch effect is detected, False otherwise.
    """
    n_samples = X.shape[0]

    # Computing XX^T
    XXt = np.dot(X, X.T)

    # Compute the eigenvalues of XX^T
    eigen_values = np.linalg.eigvalsh(XXt)  # Using eigvalsh since XX^T is symmetric
    eigen_values = sorted(eigen_values, reverse=True)

    # Check for large gaps in the eigenvalues
    for i in range(len(eigen_values) - 1):
        gap = eigen_values[i] - eigen_values[i + 1]
        if gap > 1 / n_samples:  # Absolute, scale-dependent threshold; unnormalized X will almost always exceed it, so adjust as needed
            return True

    return False

In [91]:
import numpy as np
has_batch_effect = detect_batch_effect(X)
has_batch_effect

True

In [92]:
from sparse_lmm import VariableSelection


In [93]:
# Select appropriate models based on whether the dataset has batch effect.
# We experiment on two models for each branch. We will decide which one to choose later.

if has_batch_effect:
    model_constructor1 = VariableSelection
    model_params1 = {'modified': True, 'lamda': 3e-4}
    model_constructor2 = VariableSelection
    model_params2 = {'modified': False}
else:
    model_constructor1 = Lasso
    model_params1 = {'alpha': 1.0, 'random_state': 42}
    model_constructor2 = VariableSelection
    model_params2 = {'modified': False}

In [94]:
def cross_validation(X, Y, Z, model_constructor, model_params, k=5, target_type='binary'):
    assert target_type in ['binary', 'continuous'], "The target type must be chosen from 'binary' or 'continuous'"
    indices = np.arange(X.shape[0])
    np.random.shuffle(indices)

    fold_size = len(X) // k
    performances = []

    for i in range(k):
        # Split data into train and test based on the current fold
        test_indices = indices[i * fold_size: (i + 1) * fold_size]
        train_indices = np.setdiff1d(indices, test_indices)

        X_train, X_test = X[train_indices], X[test_indices]
        Y_train, Y_test = Y[train_indices], Y[test_indices]
        Z_train, Z_test = Z[train_indices], Z[test_indices]

        normalized_X_train, normalized_X_test = normalize_data(X_train, X_test)
        normalized_Z_train, normalized_Z_test = normalize_data(Z_train, Z_test)

        # model = model_constructor(**model_params)
        model = ResidualizationRegressor(model_constructor, model_params)
        model.fit(normalized_X_train, Y_train, normalized_Z_train)
        predictions = model.predict(normalized_X_test, normalized_Z_test)

        if target_type == 'binary':
            predictions = (predictions > 0.5).astype(int)
            Y_test = (Y_test > 0.5).astype(int)
            performance = accuracy_score(Y_test, predictions)
        elif target_type == 'continuous':
            performance = mean_squared_error(Y_test, predictions)

        performances.append(performance)

    cv_mean = np.mean(performances)
    cv_std = np.std(performances)

    if target_type == 'binary':
        print(f'The cross-validation accuracy is {(cv_mean * 100):.2f}% ± {(cv_std * 100):.2f}%')
    else:
        print(f'The cross-validation MSE is {cv_mean:.2f} ± {cv_std:.2f}')

    return cv_mean, cv_std
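
One detail of the split above: with `fold_size = len(X) // k`, the last `len(X) % k` shuffled samples never land in a test fold (for 119 samples and `k=5`, that is 4 samples that are always used for training). A possible alternative, sketched with the standard library only, spreads the remainder over the first folds; `k_fold_indices` is an illustrative helper, not the function used in this notebook:

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Split shuffled indices into k folds whose sizes differ by at most one."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # first `extra` folds get one more sample
        folds.append(indices[start:start + size])
        start += size
    return folds

folds = k_fold_indices(119, k=5)
print([len(fold) for fold in folds])  # [24, 24, 24, 24, 23]
print(119 - 5 * (119 // 5))           # 4 samples skipped by the floor-division split
```

Every sample is tested exactly once with this split, at the cost of slightly unequal fold sizes.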

In [95]:
def normalize_data(X_train, X_test=None):
    """Compute the mean and standard deviation statistics of the training data, use them to normalize the training data,
    and optionally the test data"""
    mean = np.mean(X_train, axis=0)
    std = np.std(X_train, axis=0)

    # Handling columns with std = 0
    std_no_zero = np.where(std == 0, 1, std)

    # Normalize X_train
    X_train_normalized = (X_train - mean) / std_no_zero
    # Set normalized values to 0 where std was 0
    X_train_normalized[:, std == 0] = 0

    if X_test is not None:
        X_test_normalized = (X_test - mean) / std_no_zero
        X_test_normalized[:, std == 0] = 0
    else:
        X_test_normalized = None

    return X_train_normalized, X_test_normalized
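
The point of passing both arrays to `normalize_data` is that the test data is scaled with the mean and standard deviation of the training data only, so no test-set statistics leak into training. A single-column sketch of that idea using the standard library (`pstdev` is the population standard deviation, the same convention as the default of `np.std`; the helper name is illustrative):

```python
from statistics import fmean, pstdev

def standardize_with_train_stats(train_column, test_column):
    """Scale both columns with the mean and population std of the training column."""
    mean = fmean(train_column)
    std = pstdev(train_column)
    if std == 0:
        # Constant training column: map everything to 0, as normalize_data does
        return [0.0] * len(train_column), [0.0] * len(test_column)
    train_scaled = [(value - mean) / std for value in train_column]
    test_scaled = [(value - mean) / std for value in test_column]
    return train_scaled, test_scaled

train_z, test_z = standardize_with_train_stats([1.0, 2.0, 3.0], [4.0])
const_train, const_test = standardize_with_train_stats([5.0, 5.0], [7.0])
print(const_train, const_test)  # [0.0, 0.0] [0.0]
```

A test value can fall well outside the training range after scaling, as `test_z` does here; that is expected and is not clipped.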

In [96]:
class ResidualizationRegressor:
    def __init__(self, regression_model_constructor, params=None):
        if params is None:
            params = {}
        self.regression_model = regression_model_constructor(**params)
        self.beta_Z = None  # Coefficients for regression of Y on Z
        self.beta_X = None  # Coefficients for regression of residual on X
        self.neg_log_p_values = None  # Negative logarithm of p-values
        self.p_values = None  # Actual p-values

    def _reshape_data(self, data):
        """
        Reshape the data to ensure it's in the correct format (2D array).

        :param data: The input data (can be 1D or 2D array).
        :return: Reshaped 2D array.
        """
        if data.ndim == 1:
            return data.reshape(-1, 1)
        return data

    def _reshape_output(self, data):
        """
        Reshape the output data to ensure it's in the correct format (1D array).

        :param data: The output data (can be 1D or 2D array).
        :return: Reshaped 1D array.
        """
        if data.ndim == 2 and data.shape[1] == 1:
            return data.ravel()
        return data
    
    def fit(self, X, Y, Z):
        X = self._reshape_data(X)
        Y = self._reshape_data(Y)
        Z = self._reshape_data(Z)

        # Step 1: Linear regression of Y on Z
        Z_ones = np.column_stack((np.ones(Z.shape[0]), Z))
        self.beta_Z = np.linalg.pinv(Z_ones.T @ Z_ones) @ Z_ones.T @ Y
        Y_hat = Z_ones @ self.beta_Z
        e_Y = Y - Y_hat  # Residual of Y

        # Step 2: Regress the residual on X using the included regression model
        self.regression_model.fit(X, e_Y)

        # Obtain coefficients from the regression model
        if hasattr(self.regression_model, 'coef_'):
            self.beta_X = self.regression_model.coef_
        elif hasattr(self.regression_model, 'getBeta'):
            beta_output = self.regression_model.getBeta()
            self.beta_X = self._reshape_output(beta_output)

        # Obtain negative logarithm of p-values, if available
        if hasattr(self.regression_model, 'getNegLogP'):
            neg_log_p_output = self.regression_model.getNegLogP()
            if neg_log_p_output is not None:
                self.neg_log_p_values = self._reshape_output(neg_log_p_output)
                self.p_values = np.exp(-self.neg_log_p_values)
                # Concatenate the p-values of Z and X. The p-values of Z were not computed, mark with NaN.
                p_values_Z = np.full(Z.shape[1], np.nan)
                self.p_values = np.concatenate((p_values_Z, self.p_values))

    def predict(self, X, Z):
        X = self._reshape_data(X)
        Z = self._reshape_data(Z)

        Z_ones = np.column_stack((np.ones(Z.shape[0]), Z))
        ZX = np.column_stack((Z, X))
        combined_beta = np.concatenate((self.beta_Z[1:].ravel(), self.beta_X.ravel()))
        return ZX @ combined_beta + self.beta_Z[0]

    def get_coefficients(self):
        return np.concatenate((self.beta_Z[1:].ravel(), self.beta_X.ravel()))

    def get_p_values(self):
        return self.p_values
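
The class follows a two-step scheme: regress `Y` on the condition `Z` (with an intercept), then regress the residual on the genetic features `X`, and add both parts back together at prediction time. A toy, single-feature sketch of that scheme in plain Python; the data are made up, and `x` and `z` are chosen to be centered and uncorrelated so that both coefficients are recovered exactly, which will not hold in general:

```python
def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Toy data: z (condition) and x (one gene) are centered and orthogonal
z = [-1.0, 0.0, 1.0, 2.0, -2.0]
x = [1.0, -2.0, 1.0, 0.0, 0.0]
y = [0.5 + 2.0 * zi + 3.0 * xi for zi, xi in zip(z, x)]

# Step 1: regress y on z and keep the residual
intercept, beta_z = simple_ols(z, y)
residual = [yi - (intercept + beta_z * zi) for yi, zi in zip(y, z)]

# Step 2: regress the residual on x
_, beta_x = simple_ols(x, residual)

# Prediction combines both steps, mirroring ResidualizationRegressor.predict
predictions = [intercept + beta_z * zi + beta_x * xi for zi, xi in zip(z, x)]
print(round(beta_z, 6), round(beta_x, 6))  # 2.0 3.0
```

When `X` and `Z` are correlated, step 1 assigns their shared contribution to `Z`, so the coefficients on `X` describe only what the condition does not already explain.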


In [97]:
from sklearn.metrics import accuracy_score, mean_squared_error


In [98]:
trait_type = 'binary'  # Remember to set this properly, either 'binary' or 'continuous'
cv_mean1, cv_std1 = cross_validation(X, Y, Z, model_constructor1, model_params1, target_type=trait_type)

AttributeError: Module 'scipy' has no attribute 'exp'

In [None]:
cv_mean2, cv_std2 = cross_validation(X, Y, Z, model_constructor2, model_params2, target_type=trait_type)

The cross-validation accuracy is 48.70% ± 12.42%


In [None]:
normalized_X, _ = normalize_data(X)
normalized_Z, _ = normalize_data(Z)

# Train regression model on the whole dataset to identify significant genes
model1 = ResidualizationRegressor(model_constructor1, model_params1)
model1.fit(normalized_X, Y, normalized_Z)

model2 = ResidualizationRegressor(model_constructor2, model_params2)
model2.fit(normalized_X, Y, normalized_Z)

alpha for Lasso: 0.0003


# 4. Discussion and report

In [None]:
def interpret_result(model: Any, feature_names: List[str], trait: str, condition: str,
                     threshold: float = 0.05, save_output: bool = True,
                     output_dir: str = './output', model_id: int = 1) -> None:
    """This function interprets and reports the result of a trained linear regression model, where the regressor
    consists of one variable about condition and multiple variables about genetic factors.
    The function extracts coefficients and p-values from the model, and identifies the significant genes based on
    p-values or non-zero coefficients, depending on the availability of p-values.

    Parameters:
    model (Any): The trained regression Model.
    feature_names (List[str]): A list of feature names corresponding to the model's coefficients.
    trait (str): The target trait of interest.
    condition (str): The specific condition to examine within the model.
    threshold (float): Significance level for p-value correction. Defaults to 0.05.
    save_output (bool): Flag to determine whether to save the output to a file. Defaults to True.
    output_dir (str): Directory path where output files are saved. Defaults to './output'.
    model_id (int): The index of the model, 1 or 2.

    Returns:
    None: This function does not return anything but prints and optionally saves the output.
    """
    coefficients = model.get_coefficients().reshape(-1).tolist()
    p_values = model.get_p_values()
    if p_values is None:
        regression_df = pd.DataFrame({
            'Variable': feature_names,
            'Coefficient': coefficients
        })
    else:
        regression_df = pd.DataFrame({
            'Variable': feature_names,
            'Coefficient': coefficients,
            'p_value': p_values.reshape(-1).tolist()
        })
    condition_effect = regression_df[regression_df['Variable'] == condition].iloc[0]

    print(f"Effect of the condition on the target variable:")
    print(f"Variable: {condition}")
    print(f"Coefficient: {condition_effect['Coefficient']:.4f}")
    gene_regression_df = regression_df[regression_df['Variable'] != condition].copy()  # Copy so that adding columns below does not trigger SettingWithCopyWarning
    if p_values is None:
        gene_regression_df['Absolute Coefficient'] = gene_regression_df['Coefficient'].abs()
        significant_genes = gene_regression_df[gene_regression_df['Coefficient'] != 0]
        significant_genes_sorted = significant_genes.sort_values(by='Absolute Coefficient', ascending=False)
        print(
            f"Found {len(significant_genes_sorted)} genes with non-zero coefficients associated with the trait '{trait}' "
            f"conditional on the factor '{condition}'. These genes are identified as significant based on the regression model.")
    else:
        # Apply the Benjamini-Hochberg correction, to get the corrected p-values
        corrected_p_values = multipletests(gene_regression_df['p_value'], alpha=threshold, method='fdr_bh')[1]
        gene_regression_df.loc[:, 'corrected_p_value'] = corrected_p_values
        significant_genes = gene_regression_df.loc[gene_regression_df['corrected_p_value'] < threshold]
        significant_genes_sorted = significant_genes.sort_values('corrected_p_value')
        print(
            f"Found {len(significant_genes_sorted)} significant genes associated with the trait '{trait}' conditional on "
            f"the factor '{condition}', with corrected p-value < {threshold}:")

    print(significant_genes_sorted.to_string(index=False))

    # Optionally, save this to a CSV file
    if save_output:
        significant_genes_sorted.to_csv(
            os.path.join(output_dir, f'significant_genes_condition_{condition}_{model_id}.csv'), index=False)
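
When p-values are available, `interpret_result` relies on `multipletests(..., method='fdr_bh')` for the Benjamini-Hochberg correction. To show what the adjusted p-values mean, here is a small pure-Python sketch of the standard procedure; it is written for illustration and is not the statsmodels implementation, and the raw p-values are made up:

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity with a running minimum
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

raw_p = [0.01, 0.04, 0.03, 0.20]  # made-up p-values for four genes
adjusted_p = benjamini_hochberg(raw_p)
significant = [p < 0.05 for p in adjusted_p]
print(significant)  # [True, False, False, False]
```

Three of the four raw p-values are below 0.05, but only one survives the correction, which illustrates how a model with many tested genes can end up with no significant genes at all.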

In [None]:
feature_cols = merged_data.columns.tolist()
feature_cols.remove(TRAIT)

threshold = 0.05
interpret_result(model1, feature_cols, TRAIT, condition, threshold=threshold, save_output=True,
                 output_dir=OUTPUT_DIR, model_id=1)

Effect of the condition on the target variable:
Variable: Age
Coefficient: -0.0515
Found 53 genes with non-zero coefficients associated with the trait 'Anxiety disorder' conditional on the factor 'Age'. These genes are identified as significant based on the regression model.
    Variable  Coefficient  Absolute Coefficient
        BAG4     3.651880              3.651880
       TMOD4    -3.645051              3.645051
       TEKT5    -3.502467              3.502467
       ADH1B    -3.340335              3.340335
       BCAT1    -3.058921              3.058921
      TTTY13    -2.910795              2.910795
   KRTAP20-1     2.371920              2.371920
       LRRN4     2.309673              2.309673
       FBXL6    -2.205952              2.205952
      GOLGB1    -1.840282              1.840282
     ZDHHC24     1.769598              1.769598
      TSPAN7    -1.746164              1.746164
        RAG2    -1.686070              1.686070
     SLC46A1     1.652568              1.652568
LOC1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  gene_regression_df['Absolute Coefficient'] = gene_regression_df['Coefficient'].abs()


In [None]:
from statsmodels.stats.multitest import multipletests


In [None]:
interpret_result(model2, feature_cols, TRAIT, condition, threshold=threshold, save_output=True,
                 output_dir=OUTPUT_DIR, model_id=2)

Effect of the condition on the target variable:
Variable: Age
Coefficient: -0.0515
Found 0 significant genes associated with the trait 'Anxiety disorder' conditional on the factor 'Age', with corrected p-value < 0.05:
Empty DataFrame
Columns: [Variable, Coefficient, p_value, corrected_p_value]
Index: []


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  gene_regression_df.loc[:, 'corrected_p_value'] = corrected_p_values
