### Programming for Biomedical Informatics
#### Week 11 - Revision Session 2 - Pseudocoding Examples

We're going to go through a few different examples of pseudocoding to show you what we are looking for in exam answers in terms of structure, granularity, and commenting.

The guiding principle is that a suitably experienced researcher should be able to take the pseudocode prompts in your answer and write the specific lines of corresponding code. A high-level pseudocode line that would need several lines of code to implement is not ordinarily sufficient. For example, if you were writing a `for` loop to perform a calculation over the elements of a list, it would not be sufficient to describe the whole loop in one line; you would be expected to indent smaller, more specific pseudocode instructions for what would become each line of code in the final implementation. Alternatively, you may intend the coder to use a list comprehension, which can be implemented in one line of code. In that case the pseudocode comment should explain how the list comprehension needs to be constructed, because its syntax and formulation are harder to write than a step-by-step process implemented over multiple lines of a `for` loop.
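
To illustrate this level of granularity, here is a minimal sketch. The list `expression_values` and the log2 transform are hypothetical, chosen only for this illustration. It shows a `for` loop with one pseudocode-style comment per line of code, followed by the equivalent list comprehension with the fuller comment it would need.

```python
import math

# hypothetical input: a short list of expression values (illustrative only)
expression_values = [1.0, 2.0, 4.0, 8.0]

# create an empty list to hold the transformed values
log_values = []
# for each value in the list of expression values:
for value in expression_values:
    # calculate the log2 of the value and append it to the results list
    log_values.append(math.log2(value))

# use a list comprehension to build the same list in one line:
# the output expression is the log2 of the value, and the iteration
# is over each value in the list of expression values
log_values_comprehension = [math.log2(value) for value in expression_values]

# print both lists to confirm that they contain the same values
print(log_values)
print(log_values_comprehension)
```

Both forms produce the same list; the difference for pseudocode purposes is only how much explanation each line needs.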

Below we present a number of examples. We will use a reverse approach, starting with a complete commented code snippet and deconstructing it back down to pseudocode.

We recommend that you practice this by taking some of your own code or code from the various notebooks in the course and going through the same process.

In the following examples we will use:

In [None]:
''' to contain explanatory text and '''

# at the start of lines that contain the actual pseudocode text

In [None]:
# Example 1 - Original Code - Single Lines

'''This very simple example has no complex structure, only single lines of code. The key here is to make sure that the commenting and the pseudocode are clear and concise.
The code is simple enough that the pseudocode is not really necessary, but it is included for demonstration purposes.'''

'''This code reads in a gene coexpression network and a gene expression dataset, and prints the number of nodes in the gene coexpression network'''

# import the required libraries
import pandas as pd
import networkx as nx

# load the gene coexpression network

# define the data directory
data_dir = './data/'

# Define the path to the .gml network file from Section 1
# This path points to the gene correlation (coexpression) network.
G_gxp_path = data_dir + 'gene_coexpression_network.gml'

# Load the GML graphs into NetworkX graph objects
# nx.read_gml() function reads a graph from a GML file
G_gxp = nx.read_gml(G_gxp_path)  # Gene correlation network

# Get all nodes in the graph
# The nodes represent genes in this network
G_gxp_nodes_list = list(G_gxp.nodes())  # Nodes in the gene correlation network

# Define paths to the raw TCGA datasets
# tcga_dnam_path = 'section2_data/ISMB_TCGA_DNAm.pkl'  # TCGA DNA methylation data
tcga_gxp_path = data_dir + 'ISMB_TCGA_GE.pkl'  # TCGA Gene expression data

# Load the gene expression dataset
# pd.read_pickle() function loads a pickled pandas DataFrame or Series
tcga_gxp = pd.read_pickle(tcga_gxp_path)

# For this example, we'll use a CSV file that includes gene symbols
# pd.read_csv() function loads a CSV file into a pandas DataFrame
tcga_gxp_df = pd.read_csv(data_dir + 'tcga_ge_df_symbols_t.csv') # Dataset with gene symbols
# Set 'GENES' column as the index for easy access to gene-specific data
tcga_gxp_df.set_index('GENES', inplace=True)

# Extract metadata from the gene expression dataset
# Metadata might include information such as patient IDs, sample conditions, etc.
tcga_gxp_meta = tcga_gxp['datMeta']

# Print the number of nodes in the network
# This provides a quick overview of the size of the network
print(f"Number of nodes in gene correlation network: {len(G_gxp_nodes_list)}")

In [None]:
# Example 1 - Pseudocode

'''
In this example I have broadly generalised the specific code above, but it would also have been fine to use specific parts of the code in the pseudocode and to explicitly specify the names of files and variables.
The latter can be especially useful if you will later perform operations on those variables, as it can help you to keep track of what you are doing. In this case it's implicit which variable we will use because it's
a simple example, but in more complex examples it can be useful to be explicit.

It's good practice to use an explanatory comment at the start of the pseudocode to explain what the code does, and to use comments to explain the intended function of each line of code. This is the approach we want you to use in the exam.
'''

# PSEUDOCODE START

'''
This code reads in a gene coexpression network, a gene expression dataset, and the associated metadata, and prints the number of nodes in the gene coexpression network. It specifies the loading of a series of files that 
will be used later in a downstream analysis pipeline.

For this code we will need three data files:
- A gene coexpression network in GML format
- A gene expression dataset in pickle format, which also contains the metadata
- A gene expression table with gene symbols in CSV format

pandas will be needed to read in the pickle and CSV files
networkx will be needed to read in the gene coexpression network from a GML format file
'''

# import the pandas and networkx libraries

# set a variable to store the relative file path of the base directory where data is located

# define two variables to store the file paths of the gene coexpression network and the gene expression data, respectively

# Load the GML file for the gene coexpression network into NetworkX using the read_gml function and store the graph in a variable

# Get a list of all the nodes in the gene coexpression network using the nodes() method of the graph and store them in a variable

# Load the gene expression dataset from the pickle file using the read_pickle function of pandas and store in a variable

# Load the gene expression table that contains gene symbols from a CSV file using the read_csv function of pandas and store in a variable

# Re-index the dataframe so that the 'GENES' column is the index for easy access to gene-specific data using the set_index function of pandas

# Extract metadata from the gene expression dataset using the 'datMeta' key and store in a variable

# Print the number of nodes in the gene coexpression network using the length of the node list


# PSEUDOCODE END

In [None]:
# Example 2 - Simple nested code in a function

# the human phenotype ontology contains database_cross_reference entries that include the UMLS concept id
# we can use this to link the HPO terms to the UMLS concepts

# load the HPO data using pronto
import pronto

# load the HPO ontology
# fetch the Human Phenotype Ontology OBO file and parse it with pronto

# download the HPO ontology OBO file
import urllib.request

current_hpo_url = 'http://purl.obolibrary.org/obo/hp.obo'

# download the file
urllib.request.urlretrieve(current_hpo_url,'hpo.obo');

# parse the file
hpo = pronto.Ontology('hpo.obo')

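# we can look in the xrefs (i.e. cross-references) of a term to find the UMLS concept id
def hpo2concept(hpo_id):
    # retrieve the term for this HPO id from the ontology
    term = hpo[hpo_id]
    # collect the cross-reference ids as strings, e.g. 'UMLS:C0007222'
    xrefs = [xref.id for xref in term.xrefs]
    try:
        # take the first UMLS cross-reference and keep the part after the colon
        umls_id = [xref for xref in xrefs if xref.startswith('UMLS')][0].split(':')[1]
        return umls_id
    except IndexError:
        # no UMLS cross-reference was found for this term
        return None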

# let's test this function
print(hpo2concept('HP:0001695'))

In [None]:
# Example 2 - Pseudocode

'''
Note the indentations!
These should be used in loops the same way that they would in actual code.
'''

# PSEUDOCODE START

'''
This code reads in the Human Phenotype Ontology (HPO) and defines a function that takes an HPO term ID as input and returns the corresponding UMLS concept ID.

pronto will be needed to read in the Human Phenotype Ontology (HPO) data
urllib will be needed to download the HPO ontology file
'''

# load the relevant libraries

# store the URL for the HPO OBO file in a variable

# use urllib.request to download the HPO OBO file

# use pronto to load the ontology from the OBO file

'''
Define a function that will take an accession ID from the Human Phenotype Ontology (HPO) as input and return any cross-references from the Unified Medical Language System (UMLS).

Note: the function should use exception handling to return None if no cross-references are found.

To achieve this we will need to use:

- An OBO format file for the HPO ontology, which can be downloaded from the NCBO BioPortal
- The urllib.request library to download the HPO ontology file using the corresponding URL found on the NCBO BioPortal website
- The `pronto` library to parse the OBO file
'''

# Define a function called hpo2concept that takes a single argument, hpo_id

# def hpo2concept(hpo_id):
    # Retrieve the term corresponding to the hpo_id from the hpo ontology object

    # Extract the cross-reference IDs for the term and store them in a list called xrefs

    # open a try: except: block to handle the case where no UMLS cross-reference exists

    # try:
        # Use a list comprehension to retrieve the UMLS cross-references for the provided ontology term from the xrefs list
            # The cross-references are stored as strings in a CURIE format 'database:accession', e.g., 'UMLS:C0007222'
            # Keep only the strings that start with 'UMLS', take the first one, and split it on ':' to extract the UMLS concept ID
        # return the UMLS concept ID

    # except:
        # return None

# Test the function by calling it with a specific HPO ID that you know from checking on the NCBO BioPortal website has a UMLS cross-reference associated with it and print the result

# PSEUDOCODE END
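
The list comprehension step in the pseudocode above can be tried without downloading the ontology. The following is a minimal sketch, assuming the cross-references have already been extracted as CURIE strings. The helper name `first_umls_id` and the non-UMLS sample values are hypothetical and are not part of the course code.

```python
def first_umls_id(xrefs):
    """Return the accession of the first UMLS cross-reference, or None."""
    # keep only the CURIE strings whose database prefix is 'UMLS'
    umls_xrefs = [xref for xref in xrefs if xref.startswith('UMLS:')]
    try:
        # take the first match and split 'database:accession' on the colon
        return umls_xrefs[0].split(':')[1]
    except IndexError:
        # an empty list means that no UMLS cross-reference was found
        return None

# hypothetical cross-reference list (only the UMLS entry comes from the example above)
example_xrefs = ['MSH:D000000', 'UMLS:C0007222', 'SNOMEDCT_US:00000000']

# print the result for a list with, and a list without, a UMLS cross-reference
print(first_umls_id(example_xrefs))
print(first_umls_id(['MSH:D000000']))
```

Catching only `IndexError` keeps the exception handling tied to the one failure the pseudocode describes, namely a term with no UMLS cross-reference.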

In [None]:
# Example 3 - More complex nested code in a function

# load the libraries needed to download and parse the ontology
import urllib.request
import pronto

# the ASDPTO ontology is a custom ontology for autism spectrum disorder; the URL for the ontology file is:
current_asdpto_url = 'https://data.bioontology.org/ontologies/ASDPTO/submissions/1/download?apikey=4a2fbff0-ef88-432e-b1a1-dffc07e71146'

# download the file
urllib.request.urlretrieve(current_asdpto_url, 'autism.obo')

# parse the file using pronto
autism = pronto.Ontology('autism.obo')

# function to find the UMLS concept for a term in ASDPTO
def find_concept(term):
    for annotation in term.annotations:
        try:
            # if the string contains a cui= then it is a UMLS concept
            # extract the CUI
            if 'cui=' in annotation.resource:
                # split the string on 'cui=' and take the remainder
                concept = annotation.resource.split('cui=')[1]
                return concept
        except AttributeError:
            # skip annotations that do not have a resource attribute
            pass
    # no UMLS concept was found in any annotation
    return None

# we can now use this function to find the UMLS concept for each term in the ASDPTO ontology
asdpto2umls = {term.name:find_concept(term) for term in autism.terms()}

# remove any None entries
asdpto2umls = {k:v for k,v in asdpto2umls.items() if v is not None}

# print how many terms have been mapped to UMLS concepts
print(f'There are {len(asdpto2umls)} terms in the ASDPTO ontology that have been mapped to UMLS concepts')

# look at the first 10 entries
list(asdpto2umls.items())[:10]

In [None]:
# Example 3 - Pseudocode

'''
NB: Note the indentations, especially for the for:, if: else:, and try: except: blocks, which should be indented the same way that they would be in actual code.
'''

# PSEUDOCODE START
'''
This code reads in the Autism Spectrum Disorder Phenotype Ontology (ASDPTO) and defines a function that takes an ASDPTO term as input and returns the corresponding UMLS concept ID.

pronto will be needed to read in the Autism Spectrum Disorder Phenotype Ontology (ASDPTO) data
urllib will be needed to download the ASDPTO ontology file
'''

# load the relevant libraries

# store the URL for the ASDPTO OBO file in a variable

# use urllib.request to download the ASDPTO OBO file

# use pronto to load the ontology from the OBO file

'''
Define a function that will take a term from the Autism Spectrum Disorder Phenotype Ontology (ASDPTO) as input and return the corresponding concept ID from the Unified Medical Language System (UMLS), which is stored in the term's annotations.

Note: the function should use exception handling to skip any annotation that cannot be checked, and should return None if no UMLS concept is found.

To achieve this we will need to use:

- An OBO format file for the ASDPTO ontology, which can be downloaded from the NCBO BioPortal
- The urllib.request library to download the ASDPTO ontology file using the corresponding URL found on the NCBO BioPortal website
- The `pronto` library to parse the OBO file
'''

# Define a function called find_concept that takes a single argument, a term from the ASDPTO ontology

# function to find the UMLS concept for a term in ASDPTO

# def find_concept(term):
    # for each annotation that the term object has:
        # try:
            # check whether the resource slot of the annotation object is a UMLS concept
            # if the string contains 'cui=':
                # split the string on 'cui=' and take the remainder as the UMLS concept ID
                # return the concept ID
            # else:
                # pass
        # except:
            # pass

# use this function to find the UMLS concept for every term in the ASDPTO ontology and store the results in a dictionary
# the dictionary should have the term name as the key and the UMLS concept ID as the value
# use a dictionary comprehension to iterate over the terms in the ASDPTO ontology and call the find_concept function on each term

# check and remove any `None` (empty) entries from the dictionary using a dictionary comprehension

# print how many terms have been mapped to UMLS concepts using an f-string

# print the first 10 entries in the dictionary
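
The mapping and filtering steps described above can be practised on toy data without downloading the ontology. The following is a minimal sketch, assuming each term has been reduced to a name and a list of annotation resource strings. The helper `find_cui`, the term names, the URLs, and the CUI value are all invented for illustration.

```python
def find_cui(resources):
    """Return the text after 'cui=' in the first resource that contains it."""
    # for each resource string attached to the term:
    for resource in resources:
        # if the string contains 'cui=' then it refers to a UMLS concept
        if 'cui=' in resource:
            # split the string on 'cui=' and take the remainder
            return resource.split('cui=')[1]
    # no resource contained a CUI
    return None

# hypothetical stand-in for the ontology: term name -> annotation resources
toy_terms = {
    'term A': ['https://example.org/lookup?cui=C0000001'],
    'term B': ['https://example.org/other'],
    'term C': [],
}

# dictionary comprehension: the term name is the key, the CUI (or None) is the value
term2umls = {name: find_cui(resources) for name, resources in toy_terms.items()}

# second dictionary comprehension: keep only the entries whose value is not None
term2umls = {k: v for k, v in term2umls.items() if v is not None}

# report how many terms were mapped using an f-string
print(f'There are {len(term2umls)} terms mapped to UMLS concepts')
```

Building the full dictionary first and filtering afterwards mirrors the two separate pseudocode steps, which makes each comprehension simple enough to describe in a single comment.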