Column descriptions for the GSM table


1. installing and importing necessary dependencies

In [1]:
pip install GEOparse

Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip available: 22.3.1 -> 25.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [2]:
pip install bs4


In [3]:
import GEOparse
import os
import pandas as pd
import json

2. download and preprocess the dataset.<br>
this includes:<br>
        - downloading a single GSE record from GEO by its accession ID.<br>
        - saving its metadata to a separate JSON file.<br>
        - iterating through all GSM samples to extract and save their metadata.<br>
        - displaying one GSM sample to become familiar with the data.<br>
        - saving five randomly sampled rows from each GSM table to a CSV file.

In [4]:
# Downloading the data
gse = GEOparse.get_GEO(geo="GSE1563", destdir="./")

# Save gse metadata
gse_metadata = gse.metadata
with open("gse_metadata.json", "w") as f:
    json.dump(gse_metadata, f, indent=4)

print("GSE metadata saved successfully!")

20-Mar-2025 13:33:41 DEBUG utils - Directory ./ already exists. Skipping.
20-Mar-2025 13:33:41 INFO GEOparse - File already exist: using local version.
20-Mar-2025 13:33:41 INFO GEOparse - Parsing ./GSE1563_family.soft.gz: 
20-Mar-2025 13:33:41 DEBUG GEOparse - DATABASE: GeoMiame
20-Mar-2025 13:33:41 DEBUG GEOparse - SERIES: GSE1563
20-Mar-2025 13:33:41 DEBUG GEOparse - PLATFORM: GPL8300
20-Mar-2025 13:33:43 DEBUG GEOparse - SAMPLE: GSM26805
20-Mar-2025 13:33:43 DEBUG GEOparse - SAMPLE: GSM26806
20-Mar-2025 13:33:43 DEBUG GEOparse - SAMPLE: GSM26807
20-Mar-2025 13:33:43 DEBUG GEOparse - SAMPLE: GSM26808
20-Mar-2025 13:33:44 DEBUG GEOparse - SAMPLE: GSM26809
20-Mar-2025 13:33:44 DEBUG GEOparse - SAMPLE: GSM26810
20-Mar-2025 13:33:44 DEBUG GEOparse - SAMPLE: GSM26811
20-Mar-2025 13:33:44 DEBUG GEOparse - SAMPLE: GSM26812
20-Mar-2025 13:33:44 DEBUG GEOparse - SAMPLE: GSM26813
20-Mar-2025 13:33:44 DEBUG GEOparse - SAMPLE: GSM26814
20-Mar-2025 13:33:44 DEBUG GEOparse - SAMPLE: GSM26815
20-M

GSE metadata saved successfully!


In [5]:
# Extract all gsms
gsms = gse.gsms

# Save all gsm metadata
gsm_metadata = {gsm: gsms[gsm].metadata for gsm in gsms}


with open("gsm_metadata.json", "w") as f:
    json.dump(gsm_metadata, f, indent=4)

print("All gsm metadata saved successfully!")



All gsm metadata saved successfully!
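Each metadata field is stored as a list of strings, so the saved JSON nests a list under every key. The following is a minimal sketch of flattening that structure into one row per sample. The helper name `flatten_gsm_metadata` and the small inline dict are illustrative and not part of the notebook.

```python
import json

def flatten_gsm_metadata(gsm_metadata, sep="; "):
    """Turn {gsm_id: {field: [values]}} into a list of flat row dicts."""
    rows = []
    for gsm_id, fields in gsm_metadata.items():
        row = {"gsm_id": gsm_id}
        for key, values in fields.items():
            # Each field holds a list of strings; join them into one cell.
            row[key] = sep.join(values)
        rows.append(row)
    return rows

# Illustrative input shaped like gsm_metadata.json (only two fields shown).
example = {
    "GSM26805": {"title": ["C1PBL"], "source_name_ch1": ["PBL"]},
}
rows = flatten_gsm_metadata(example)
print(json.dumps(rows, indent=2))
```

The resulting rows could be passed to `pd.DataFrame(rows)` if a tabular view of the metadata is wanted.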


In [6]:
print()
print("GSM example:")
for gsm_name, gsm in gse.gsms.items():
    print("Name: ", gsm_name)
    print("Metadata:")
    for key, value in gsm.metadata.items():
        # each metadata value is a list of strings
        print(" - %s : %s" % (key, ", ".join(value)))
    print("Table data:")
    print(gsm.table.head())
    break  # only show the first sample



GSM example:
Name:  GSM26805
Metadata:
 - title : C1PBL
 - geo_accession : GSM26805
 - status : Public on Jul 14 2004
 - submission_date : Jul 14 2004
 - last_update_date : Mar 16 2009
 - type : RNA
 - channel_count : 1
 - source_name_ch1 : PBL
 - organism_ch1 : Homo sapiens
 - taxid_ch1 : 9606
 - molecule_ch1 : total RNA
 - description : Clinical status: control healthy blood donor, Age: unknown, Sex: unknown, Immunosupression: none, Histopathology: none, Donor type: NA, Scr (mg/dL): unknown, Days post transplant: NA, Abbreviations used in sample description: Abreviations used to describe patient samples include the following: BX - Biopsy; PBL- Peripheral Blood Lymphocytes; CsA -Cyclosporine; MMF - Mycophenolate Mofetil; P - Prednisone; FK - Tacrolimus;  SRL - Sirolimus; CAD -Cadaveric;  LD - Live Donor; Scr - Serum Creatinine; ATN - Acute Tubular Necrosis CNI - Calcineurin Inhibitor; FSGS - Focal Segmental Glomerulosclerosis, Keywords = DNA microarrays, gene expression, kidney, reje

In [7]:
sampled_rows = []

for gsm_id, gsm in gsms.items():
    table = gsm.table
    if len(table) >= 5: 
        sampled_rows.append(table.sample(5))  # Randomly sample 5 rows from each GSM

sampled_df = pd.concat(sampled_rows)

# Save to CSV
sampled_df.to_csv("gsm_sample_data.csv", index=False)

print("Sample rows saved to CSV!")


Sample rows saved to CSV!
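The concatenated sample does not record which GSM each row came from, and the sampled rows change on every run. A possible variant, assuming traceability and reproducibility matter for the prompt, tags each row with its sample ID and fixes the random seed. The function name `sample_gsm_tables`, the `GSM_ID` column, and the toy tables are assumptions made for illustration.

```python
import pandas as pd

def sample_gsm_tables(tables, n=5, seed=0):
    """Sample n rows from each table and tag them with the GSM ID."""
    sampled = []
    for gsm_id, table in tables.items():
        if len(table) < n:
            continue  # skip tables that are too small, as in the cell above
        rows = table.sample(n, random_state=seed).copy()
        rows.insert(0, "GSM_ID", gsm_id)  # keep track of the source sample
        sampled.append(rows)
    if not sampled:
        return pd.DataFrame()
    return pd.concat(sampled, ignore_index=True)

# Toy stand-ins for gsm.table; real tables would come from gse.gsms.
toy_tables = {
    "GSM_A": pd.DataFrame({"ID_REF": [f"p{i}" for i in range(10)], "VALUE": range(10)}),
    "GSM_B": pd.DataFrame({"ID_REF": ["p0", "p1"], "VALUE": [1.0, 2.0]}),
}
sampled_df = sample_gsm_tables(toy_tables)
print(sampled_df)
```

In the notebook, the input could be built as `{gsm_id: gsm.table for gsm_id, gsm in gsms.items()}`.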


3. collecting description information about 'GSM' from the official NCBI website.

In [11]:
import requests
from bs4 import BeautifulSoup

def fetch_gsm_description():
    """
    Scrape the NCBI GEO overview page and save every paragraph that mentions
    "sample", as a rough description of what a GSM (GEO Sample) is.
    """
    url = "https://www.ncbi.nlm.nih.gov/geo/info/overview.html"
    # a timeout keeps the cell from hanging if the server does not respond
    response = requests.get(url, timeout=30)

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, "html.parser")

        # keep only the paragraphs whose text mentions "sample"
        sections = soup.find_all("p")
        description = "\n".join(p.get_text(strip=True) for p in sections if "sample" in p.get_text().lower())

        if not description:
            description = "no detailed GSM description"

        # save to file
        with open("gsm_description.txt", "w") as f:
            f.write(description)

        print("GSM description saved successfully!")
    else:
        print(f"Failed to retrieve the page (HTTP {response.status_code}).")

# run the function
fetch_gsm_description()



GSM description saved successfully!


4. generating prompt

In [8]:
# Define prompt
prompt_text = """This dataset comes from the Gene Expression Omnibus (GEO) and includes multiple GSM (GEO Sample) records. Each GSM represents an individual biological sample, containing metadata, experimental conditions, and gene expression values. The dataset also includes GSE (GEO Series) metadata, which provides context for the overall study.

### Provided files:
1. gse_metadata.json – Metadata about the GSE dataset.  
2. gsm_metadata.json – Compiled metadata for all GSMs.  
3. gsm_sample_data.csv – Sampled rows from the GSM data tables.  
4. prompt.txt – This prompt document.  
5. gsm_description.txt – General description of GSMs and GSEs.  

### Task:
Using these files, generate structured column descriptions that:  
- Clearly explain each column’s meaning  
- Reflect the biological and experimental significance of the data  
- Maintain consistency across GSM samples  
- Provide concise yet informative descriptions  

### Output format:
The column descriptions should be in JSON format, where each column is mapped to a short but clear explanation.  

Example:  
{
  "gene_id": "Unique identifier for each gene in the dataset.",
  "expression_value": "The measured intensity of gene expression in the sample.",
  "sample_condition": "The biological or experimental condition associated with the sample."
}

Make sure the descriptions are accurate, easy to understand, and relevant to the dataset."""

# Save to a file
with open("prompt.txt", "w") as f:
    f.write(prompt_text)

print("Prompt saved successfully!")


Prompt saved successfully!


5. setting up and using the Google generative AI model Gemini to generate column descriptions for the GSM table.

In [9]:
pip install google-generativeai


In [10]:
pip install python-dotenv


In [11]:
import google.generativeai as genai

from dotenv import load_dotenv
load_dotenv()

API_KEY = os.getenv("GOOGLE_API_KEY")

genai.configure(api_key=API_KEY)
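`os.getenv` returns `None` when the variable is missing, so a forgotten `.env` file may only surface later, when the first request is made. A small guard, sketched here with a hypothetical helper `require_env` and a placeholder variable name, fails earlier with a clearer message.

```python
import os

def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to a .env file or export it first."
        )
    return value

# Demonstration with a placeholder variable name and value.
os.environ["EXAMPLE_API_KEY"] = "dummy-value"
print(require_env("EXAMPLE_API_KEY"))
```

In the notebook this would be called as `require_env("GOOGLE_API_KEY")` after `load_dotenv()`.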




In [12]:
# Load GSE metadata
with open("gse_metadata.json", "r") as f:
    gse_metadata = json.load(f)

# Load GSM metadata
with open("gsm_metadata.json", "r") as f:
    gsm_metadata = json.load(f)

# Load Sample Data
sampled_data = pd.read_csv("gsm_sample_data.csv")

# Load Prompt
with open("prompt.txt", "r") as f:
    prompt_message = f.read()

# Load Dataset Description
with open("gsm_description.txt", "r") as f:
    dataset_description = f.read()


In [13]:
# Define the Gemini model
model = genai.GenerativeModel("gemini-1.5-pro-latest")

# Create the structured input for the model
input_text = f"""{prompt_message}

### Dataset Description:
{dataset_description}

### GSE Metadata:
{json.dumps(gse_metadata, indent=2)}

### GSM Metadata (First 5 samples):
{json.dumps(dict(list(gsm_metadata.items())[:5]), indent=2)}

### Sampled Data:
{sampled_data.head().to_json()}

Generate a JSON mapping of column names to their descriptions.
"""

# Generate column descriptions
response = model.generate_content(input_text)

# Save the response
output_folder = "output"
os.makedirs(output_folder, exist_ok=True)
output_file_path = os.path.join(output_folder, "column_descriptions.json")

column_descriptions = response.text
with open(output_file_path, "w") as f:
    f.write(column_descriptions)

print(f"Column descriptions generated and saved successfully in {output_file_path}!")


Column descriptions generated and saved successfully in output\column_descriptions.json!
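`response.text` is written to disk unchanged, so if the model wraps its answer in a Markdown fence (as in the output printed in the next cell), the `.json` file will not parse with `json.load`. The following is a minimal sketch of stripping an optional fence and validating the result before saving. `extract_json` is a hypothetical helper, and the reply string is made up for the example.

```python
import json
import re

def extract_json(text):
    """Parse JSON from a model reply, tolerating a Markdown code fence."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload)  # raises json.JSONDecodeError if not valid JSON

# Illustrative reply; a real one would come from response.text.
reply = '```json\n{"ID_REF": "Probe identifier."}\n```'
parsed = extract_json(reply)
print(json.dumps(parsed, indent=2))
```

The parsed dict could then be saved with `json.dump(parsed, f, indent=2)` in place of `f.write(column_descriptions)`.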


In [14]:
# Open and read the JSON file
with open("./output/column_descriptions.json", "r") as f:
    column_descriptions = f.read()

# Print the contents of the file
print(column_descriptions)


```json
{
  "ID_REF": "Unique identifier for the gene, corresponding to a specific probe on the microarray platform (GPL8300).",
  "VALUE": "The measured expression level of the gene.  This value represents the intensity of the signal detected by the probe, reflecting the abundance of the corresponding mRNA transcript in the sample.",
  "ABS_CALL": "A qualitative assessment of the reliability of the expression measurement. 'A' indicates the signal is reliably detected as present (Absent/Present call). 'P' indicates the signal is marginally detected, potentially present, but with lower confidence. 'M' (if present in the full dataset, but not sampled data) indicates the signal is not detected and considered absent.",
  "DETECTION P-VALUE": "The probability that the observed expression signal is due to random noise. Lower p-values indicate higher confidence in the detection of gene expression. This value is used in determining the ABS_CALL."
}
```
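Generated descriptions still need review against the platform documentation. For Affymetrix detection calls the usual convention is P = present, M = marginal, A = absent, so the ABS_CALL entry above appears to have the codes mixed up. A mechanical check cannot catch that kind of error, but it can at least confirm that every table column received a description. The sketch below assumes the descriptions are already parsed into a dict; `check_coverage` is a hypothetical helper, and the column list is assumed to match `list(gsm.table.columns)`.

```python
def check_coverage(descriptions, columns):
    """Report columns lacking a description and descriptions with no column."""
    described = set(descriptions)
    actual = set(columns)
    return {
        "missing": sorted(actual - described),
        "unexpected": sorted(described - actual),
    }

# Keys taken from the output above; description texts are shortened to "...".
descriptions = {
    "ID_REF": "...",
    "VALUE": "...",
    "ABS_CALL": "...",
    "DETECTION P-VALUE": "...",
}
# Assumed column list; in the notebook use list(gsm.table.columns).
columns = ["ID_REF", "VALUE", "ABS_CALL", "DETECTION P-VALUE"]
report = check_coverage(descriptions, columns)
print(report)
```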

