# I. Data 

## 1. Extract Descriptions
Read and parse the compressed JSON file containing metadata about the experiments, then load its `experiments` section into a DataFrame.

In [9]:
import gzip
import json
import pandas as pd

# Path to the file
file_path = '/Users/souhatifour/Downloads/aggregated_metadata.json.gz'

# Open and read the compressed JSON file
with gzip.open(file_path, 'rt', encoding='utf-8') as f:
    data = json.load(f)

# Extract the 'experiments' section
experiments = data.get('experiments', {})

# Convert to a DataFrame with one row per experiment
df = pd.DataFrame.from_dict(experiments, orient='index')

df.head(3)
df.shape

(6461, 14)

## 2. Filtering Out Entries with 'No Description'

In [8]:
# Define a list of terms/phrases that indicate no description
no_description_terms = [
    'No description.',
    'No description available.',
    'N/A',
    'Not available',
    'none provided',
    '',  # Empty strings
    None  # Null values
]

# Filter out rows where the 'description' column matches any of the no_description_terms
filtered_df = df[~df['description'].str.strip().isin(no_description_terms)]
filtered_df.shape

(6403, 14)
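The exact-match filter above is case-sensitive, and its handling of missing values depends on how `isin` treats `None`. A more defensive variant might look like the sketch below. It assumes that placeholder phrases should match regardless of case and that missing descriptions should be dropped. The helper `is_placeholder` and the toy frame are hypothetical names used only for illustration; applied to the real data, this variant may give a row count different from the one reported above.

```python
import pandas as pd

# Placeholder phrases in lower case (the same phrases as in the list above)
PLACEHOLDERS = {
    "no description.",
    "no description available.",
    "n/a",
    "not available",
    "none provided",
    "",
}

def is_placeholder(descriptions: pd.Series) -> pd.Series:
    """Return a boolean mask that is True for missing or placeholder descriptions."""
    # Treat missing values as empty strings, then normalize whitespace and case
    normalized = descriptions.fillna("").astype(str).str.strip().str.lower()
    return normalized.isin(PLACEHOLDERS)

# Toy data standing in for df (illustration only)
toy = pd.DataFrame(
    {"description": ["A real study", " N/A ", None, "NO DESCRIPTION.", ""]}
)
toy_filtered = toy[~is_placeholder(toy["description"])]
```

Normalizing once and comparing against a lower-case set keeps the placeholder list short and avoids listing every capitalization separately.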

In [10]:
import pandas as pd

# Extract the 'description' column
descriptions = filtered_df['description']

# Save the descriptions to a TSV file
output_file = '/Users/souhatifour/Downloads/refinebio_descriptions_filtered.tsv'
descriptions.to_csv(output_file, sep='\t', index=False, header=False)

print(f"Extracted {len(descriptions)} descriptions and saved to {output_file}")

Extracted 6403 descriptions and saved to /Users/souhatifour/Downloads/refinebio_descriptions_filtered.tsv


## 3. Save the Accession Codes to a TSV File (after filtering out studies with no description)

In [None]:
# Extract the 'accession_code' column
accession_codes = filtered_df['accession_code']

# File path where the accession codes will be saved as a TSV
output_tsv_file = '/Users/souhatifour/Downloads/IDs.tsv'

# Save the accession codes, one per line, without header or index
accession_codes.to_csv(output_tsv_file, sep='\t', header=False, index=False)

# II. Preprocess the Extracted Descriptions

### 1. Preprocess Text Descriptions:
Run the **txt2onto2.0/src/preprocess.py** script to clean and preprocess the extracted descriptions: it removes URLs, specific strings, file names, and non-UTF-8 characters, and applies text normalization.

**Note:** Some descriptions contain newline characters (`\n` or `\n\n`). Because the TSV file stores one description per line, every embedded newline splits a description into several rows, so the preprocessed output ends up with more rows than there are descriptions. Solution: remove these newline characters from the text descriptions before running **preprocess.py**.
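The newline cleanup described in the note can be sketched as follows. This is a minimal example, assuming that replacing each run of whitespace with a single space is acceptable for the downstream preprocessing; `collapse_whitespace` and the toy series are hypothetical names, not part of txt2onto2.0.

```python
import pandas as pd

def collapse_whitespace(descriptions: pd.Series) -> pd.Series:
    """Replace every run of whitespace (newlines, tabs, spaces) with one space."""
    return (
        descriptions.astype(str)
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )

# Toy data standing in for the 'description' column (illustration only)
toy_descriptions = pd.Series(
    ["First line\nsecond line", "Para one\n\nPara two ", "Single line"]
)
cleaned = collapse_whitespace(toy_descriptions)

# Sanity check: after cleaning, each description occupies exactly one line
tsv_text = "\n".join(cleaned)
assert len(tsv_text.splitlines()) == len(toy_descriptions)
```

Applying a step like this to the descriptions before writing them to the TSV file keeps the number of rows equal to the number of accession codes. The pattern `\s+` also removes embedded tabs, which would otherwise be read as column separators.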

# III. Generate Embeddings for Processed Descriptions:
Use the **src/embedding_lookup_table.py** script to generate embeddings for the preprocessed descriptions with a pretrained language model (BiomedBERT).

# IV. Run Predictions Using MONDO Model Files:

First, check whether specific MONDO terms are present in the full dataset (including redundant terms) and whether each has at least three associated samples, to see if a model can be trained for that term.

In [15]:
# Load the dataset
file_path = '/Users/souhatifour/Downloads/true_label__inst_type=study__task=disease.csv.gz'
df2 = pd.read_csv(file_path, compression='gzip', index_col=0)


# List of MONDO terms to check
mondo_terms = [
    "MONDO:0011918",
    "MONDO:0004907",
    "MONDO:0007156",
    "MONDO:0008661",
    "MONDO:0005365",
    "MONDO:0005420",
    "MONDO:0005296",
    "MONDO:0100470",
    "MONDO:0005441"
]

# Check if MONDO terms are in the dataset and have at least three samples belonging to them
for term in mondo_terms:
    if term in df2.columns:
        # Count the number of samples with a value of 1 for this MONDO term
        count_ones = (df2[term] == 1).sum()
        if count_ones >= 3:
            # Enough positive samples to train a model for this term
            print(f"{term}  (Total: {count_ones} samples; meets the 3-sample threshold)")
        else:
            print(f"{term}  (Total: {count_ones} samples)")
    else:
        print(f"{term} is NOT in the dataset")

MONDO:0011918  (Total: 0 samples)
MONDO:0004907  (Total: 2 samples)
MONDO:0007156  (Total: 0 samples)
MONDO:0008661  (Total: 0 samples)
MONDO:0005365  (Total: 2 samples)
MONDO:0005420  (Total: 0 samples)
MONDO:0005296  (Total: 1 samples)
MONDO:0100470  (Total: 0 samples)
MONDO:0005441  (Total: 0 samples)
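The same check can also be written without an explicit loop, which gives a table that is easier to sort or save. The sketch below assumes, as in the cell above, that a positive label is encoded as `1`; `count_positive_labels`, the toy label matrix, and the term names `MONDO:A`, `MONDO:B`, `MONDO:C` are hypothetical and used only for illustration.

```python
import pandas as pd

def count_positive_labels(labels: pd.DataFrame, terms: list, min_positives: int = 3) -> pd.DataFrame:
    """Summarize, per term, whether it is a column and how many rows are labeled 1."""
    present = [term in labels.columns for term in terms]
    # reindex adds an all-NaN column for each absent term, so its count is 0
    counts = labels.reindex(columns=terms).eq(1).sum()
    summary = pd.DataFrame(
        {"in_dataset": present, "n_positive": counts.to_numpy()},
        index=terms,
    )
    summary["trainable"] = summary["n_positive"] >= min_positives
    return summary

# Toy label matrix standing in for df2 (rows are instances, columns are terms)
toy_labels = pd.DataFrame(
    {"MONDO:A": [1, 1, 1, 0], "MONDO:B": [1, 0, 0, 0]},
    index=["study1", "study2", "study3", "study4"],
)
summary = count_positive_labels(toy_labels, ["MONDO:A", "MONDO:B", "MONDO:C"])
```

Keeping `in_dataset` as a separate column distinguishes a term that is absent from the label matrix from one that is present but has no positive labels, which the count alone cannot do.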


###### 1. Ensure the following files are ready:
- refinebio_descriptions_filtered.tsv
- IDs.tsv
- my_custom_embeddings.npz
- disease_desc_embedding.npz
- 18 MONDO model files (MONDO_0004986__model.pkl, MONDO_0008903__model.pkl, etc.)
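
A quick way to confirm the checklist above before running predictions is sketched below. It assumes that all files sit in a single working directory and that the model files follow the `MONDO_*__model.pkl` naming pattern shown above; `check_inputs` is a hypothetical helper, and the temporary directory only stands in for the real one.

```python
from pathlib import Path
import tempfile

REQUIRED_FILES = [
    "refinebio_descriptions_filtered.tsv",
    "IDs.tsv",
    "my_custom_embeddings.npz",
    "disease_desc_embedding.npz",
]

def check_inputs(work_dir):
    """Return (names of missing required files, sorted list of model file paths)."""
    work_dir = Path(work_dir)
    missing = [name for name in REQUIRED_FILES if not (work_dir / name).is_file()]
    model_files = sorted(work_dir.glob("MONDO_*__model.pkl"))
    return missing, model_files

# Demonstration on a temporary directory (illustration only)
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    (tmp_path / "IDs.tsv").touch()
    (tmp_path / "MONDO_0004986__model.pkl").touch()
    missing, model_files = check_inputs(tmp_path)
    model_names = [path.name for path in model_files]
```

On the real directory, `missing` should be empty and `len(model_files)` should equal 18 before any prediction command is run.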

###### 2. Run Predictions for Each MONDO Model File:
Execute the predict.py script for each MONDO model file, updating the model file path in each command: