# Gene Uniqueness

This program requires either of the following files:

    Gene-Disease Associations, All GO IDs.csv
    OR
    Gene-Disease Associations, No GO ID Ancestors.csv

The program outputs the following file:

    Gene Uniqueness.csv
 
This program takes about 3 minutes to run and may seem unresponsive in the meantime.

## Define default filenames

In [1]:
# Use os module to access files in other directories. 
from os import path

# Default file names.
gene_disease_associations_path = path.abspath(
    '../Gene-Disease Associations/Gene-Disease Associations, All GO IDs.csv')

# Could have used the file below as well, since both files contain the
# same information about genes.
# gene_disease_associations_path = path.abspath(
#     '../Gene-Disease Associations/Gene-Disease Associations, No GO ID Ancestors.csv')

## Define a function to open gene-disease association files

In [2]:
# Import the pandas module for opening files.
import pandas

def get_disease_genes(filename):
    '''Open the gene-disease associations file and return it as a
    pandas data frame.
    
    Parameters:
    filename (str): Path to the gene-disease associations .csv file.
    
    Notes:
    Every value is read as a string (dtype = str), the first row is
    used as the header (header = 0), and only the 'DB ID', 'Disease',
    and 'Gene Symbol' columns are loaded (usecols).
    '''
    disease_genes = pandas.read_csv(
        filename,
        dtype = str,
        header = 0,
        usecols = ['DB ID', 'Disease', 'Gene Symbol'])
    
    # Return the data frame holding the selected columns.
    return disease_genes

## Open the gene-disease associations file to access the genes

In [3]:
# Open gene-disease file with all GO IDs.
disease_genes = get_disease_genes(gene_disease_associations_path)

### Display the contents of the gene-disease associations file

In [4]:
# For visualization only: may delete code line.
disease_genes

Unnamed: 0,DB ID,Disease,Gene Symbol
0,114500,Colorectal cancer with chromosomal instability...,TLR2 | DCC | FLCN | NRAS | BAX | PDGFRL | ODC1...
1,114480,"Breast cancer, somatic | {Breast cancer, prote...",ATM | CDH1 | RAD51 | BARD1 | CHEK2 | KRAS | XR...
2,125853,"Diabetes mellitus, noninsulin-dependent, late ...",SLC2A2 | IRS1 | NEUROD1 | PAX4 | IL6 | PPP1R3A...
3,611162,"{Malaria, resistance to} | {Malaria, protectio...",NOS2 | FCGR2B | HBB | SLC4A1 | TNF | ICAM1 | G...
4,167000,"Ovarian cancer, somatic",PIK3CA | CDH1 | AKT1 | PRKN | CTNNB1 | OPCML
...,...,...,...
5410,235550,Hepatic venoocclusive disease with immunodefic...,SP110
5411,615544,?Periventricular nodular heterotopia 6,ERMARD
5412,234050,"Trichothiodystrophy 4, nonphotosensitive",MPLKIP
5413,614700,"Immunodeficiency, common variable, 8, with aut...",LRBA


## Define a function that creates a dictionary that stores the number of diseases annotated with a gene

For a specific gene, increase the disease count every time the same gene is found.

In [5]:
def count_gene_associations(disease_genes):
    '''Return a dictionary with the number of diseases
    associated with each gene.
    
    Parameters:
    disease_genes: Pandas data frame containing the gene-disease
    associations file. The file must have a 'Gene Symbol' column.
    '''
    # Create a dictionary to count diseases associated with each gene.
    gene_associations = {}

    # Iterate through every row in the gene-disease associations file.
    for index, row in disease_genes.iterrows():

        # Get the genes and convert them into a list.
        genes = row['Gene Symbol'].split(' | ')

        # Iterate through every gene in the list.
        for gene in genes:
            
            try:

                # Increase number of diseases annotated with gene.
                gene_associations[gene] += 1

            except KeyError:

                # Error if gene key does not exist, so create key.
                gene_associations[gene] = 1
        
    # Return dictionary with number of diseases associated to genes.
    return gene_associations
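
The same tally can be sketched with `collections.Counter` from the standard library, which removes the need for the `try`/`except` branch. This is an illustrative alternative rather than the notebook's method; the function name `count_gene_associations_counter` and the toy gene strings are assumptions made for the example.

```python
from collections import Counter

def count_gene_associations_counter(gene_strings):
    """Count how many diseases each gene is listed under.

    gene_strings: iterable of ' | '-separated gene symbol strings,
    one string per disease (hypothetical input for this sketch).
    """
    counts = Counter()
    for genes in gene_strings:
        # Each gene in a row adds one disease to that gene's tally.
        counts.update(genes.split(' | '))
    return dict(counts)

# Toy data: three diseases that share some genes.
toy_rows = ['TP53 | KRAS', 'TP53 | BRCA2', 'KRAS | TP53']
toy_counts = count_gene_associations_counter(toy_rows)
print(toy_counts)
```

With the notebook's data, the `'Gene Symbol'` column of `disease_genes` could presumably be passed in place of `toy_rows`.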

## Create a dictionary that stores the number of diseases annotated with each gene

In [6]:
# Get number of annotated diseases from file with all GO IDs.
gene_associations = count_gene_associations(disease_genes)

### Display the number of diseases annotated with each gene

In [7]:
# For visualization only: may delete code line.
pandas.DataFrame.from_dict(gene_associations, orient = 'index', 
                           columns = ['Diseases'])

Unnamed: 0,Diseases
TLR2,3
DCC,4
FLCN,4
NRAS,8
BAX,2
...,...
RCBTB1,1
ERMARD,1
MPLKIP,1
LRBA,1


## Define a function to find the ***uniqueness*** of each gene

The method was presented by Carson et al. in the article ["A disease similarity matrix based on the uniqueness of shared genes"](https://doi.org/10.1186/s12920-017-0265-2), where gene ***uniqueness*** is found using the formula

$$ u_i = 1 - \sqrt{\dfrac{d_i}{d_n}}$$

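where $d_i$ is the number of diseases associated with gene $i$ and $d_n$ is the total number of diseases in the data set. A gene linked to few diseases therefore scores close to 1, and the score falls as the gene appears in more diseases.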
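
As a quick numeric check of the formula, the sketch below evaluates it for a gene found in one disease and a gene found in eight. The total of 5415 diseases is an assumption inferred from the row indexes 0 through 5414 shown in this notebook's outputs, and the helper name `gene_uniqueness` is made up for the example.

```python
import math

def gene_uniqueness(disease_count, total_diseases):
    """Apply u_i = 1 - sqrt(d_i / d_n) to a single gene."""
    return 1 - math.sqrt(disease_count / total_diseases)

# Assumed total: 5415 diseases (row indexes 0 through 5414).
total_diseases = 5415

rare_gene = gene_uniqueness(1, total_diseases)    # gene in 1 disease
common_gene = gene_uniqueness(8, total_diseases)  # gene in 8 diseases

print(round(rare_gene, 6), round(common_gene, 6))
```

Under that assumption, the two results round to the weights this notebook displays for genes with one associated disease (such as LRBA) and for NRAS, which has eight.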

In [8]:
import math

def get_uniqueness(gene_associations, total_disease_number):
    '''Return a dictionary storing the gene uniqueness value of
    every gene.
    
    Parameters:
    gene_associations: Dictionary with genes (the keys) and the 
    number of diseases annotated with each gene (the values).
    total_disease_number: Integer giving the total number of
    diseases in the data set. Example: there could be about 6000
    diseases in the data set, but about 4000 after merging diseases
    with the same gene symbols, database IDs, etc.
    '''
    # Create a dictionary to store the gene uniqueness of each gene.
    uniqueness = {}
    
    # Iterate through every gene (the keys) in the dictionary.
    for gene in gene_associations:
        
        # Get the number of diseases associated with the gene.
        count = gene_associations[gene]
        
        # Apply uniqueness formula and store uniqueness of the gene.
        uniqueness[gene] = 1 - math.sqrt(count / total_disease_number)
        
    # Return dictionary with the uniqueness of each gene.
    return uniqueness

## Find the uniqueness of each gene

In [9]:
# Get the total number of diseases in the data set.
diseases_number = len(disease_genes)

# Get the uniqueness of each gene.
uniqueness = get_uniqueness(
    gene_associations, diseases_number)

### Display the uniqueness of each gene

In [10]:
# For visualization only: may delete code line.
pandas.DataFrame.from_dict(uniqueness, orient = 'index', 
                           columns = ['Weight'])

Unnamed: 0,Weight
TLR2,0.976462
DCC,0.972821
FLCN,0.972821
NRAS,0.961563
BAX,0.980782
...,...
RCBTB1,0.986411
ERMARD,0.986411
MPLKIP,0.986411
LRBA,0.986411


## Create gene uniqueness score file

In [11]:
# Get 'DB ID' and 'Disease' columns for uniqueness score file.
uniqueness_score = disease_genes[['DB ID', 'Disease']]

### Display the gene uniqueness score file

In [12]:
# For visualization only: may delete code line.
uniqueness_score

Unnamed: 0,DB ID,Disease
0,114500,Colorectal cancer with chromosomal instability...
1,114480,"Breast cancer, somatic | {Breast cancer, prote..."
2,125853,"Diabetes mellitus, noninsulin-dependent, late ..."
3,611162,"{Malaria, resistance to} | {Malaria, protectio..."
4,167000,"Ovarian cancer, somatic"
...,...,...
5410,235550,Hepatic venoocclusive disease with immunodefic...
5411,615544,?Periventricular nodular heterotopia 6
5412,234050,"Trichothiodystrophy 4, nonphotosensitive"
5413,614700,"Immunodeficiency, common variable, 8, with aut..."


## Create a square matrix to store gene uniqueness scores

In [13]:
# Form a square matrix by concatenating with the transpose.
uniqueness_score = pandas.concat([uniqueness_score,
                                 uniqueness_score.transpose()])

### Display gene uniqueness score file

In [14]:
# For visualization only: may delete code line.
uniqueness_score

Unnamed: 0,DB ID,Disease,0,1,2,3,4,5,6,7,...,5405,5406,5407,5408,5409,5410,5411,5412,5413,5414
0,114500,Colorectal cancer with chromosomal instability...,,,,,,,,,...,,,,,,,,,,
1,114480,"Breast cancer, somatic | {Breast cancer, prote...",,,,,,,,,...,,,,,,,,,,
2,125853,"Diabetes mellitus, noninsulin-dependent, late ...",,,,,,,,,...,,,,,,,,,,
3,611162,"{Malaria, resistance to} | {Malaria, protectio...",,,,,,,,,...,,,,,,,,,,
4,167000,"Ovarian cancer, somatic",,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5412,234050,"Trichothiodystrophy 4, nonphotosensitive",,,,,,,,,...,,,,,,,,,,
5413,614700,"Immunodeficiency, common variable, 8, with aut...",,,,,,,,,...,,,,,,,,,,
5414,618425,Neurodevelopmental disorder with impaired spee...,,,,,,,,,...,,,,,,,,,,
DB ID,,,114500,114480,125853,611162,167000,114550,211980,601626,...,609432,140000,228600,617175,176305,235550,615544,234050,614700,618425
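
To see what concatenating a data frame with its own transpose produces, here is a minimal sketch on a two-disease toy frame. The values and the names `toy` and `toy_square` are hypothetical; only `pandas.concat` and `DataFrame.transpose` are taken from the cell above.

```python
import pandas

# Toy stand-in for the 'DB ID' and 'Disease' columns (hypothetical values).
toy = pandas.DataFrame({'DB ID': ['1', '2'],
                        'Disease': ['Disease A', 'Disease B']})

# The transpose has rows labeled 'DB ID' and 'Disease', and its columns
# are the original row indexes 0 and 1.
toy_square = pandas.concat([toy, toy.transpose()])

print(toy_square.shape)
print(toy_square)
```

The result has one row and one column per disease plus the two label rows and columns. Cells where a numeric row index meets a numeric column label start out empty (NaN), which leaves room for a score for each disease pair.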


## Get the genes associated with each disease based on the disease's index

In [15]:
# Get dictionary from the gene-disease file.
# Split each string and convert the resulting list into a set.
# Then store every set into a dictionary.
gene_dict = disease_genes['Gene Symbol'].apply(
    lambda term: set(term.split(' | '))).to_dict()

### Display dictionary with row index keys and gene list values

In [16]:
# For visualization only: may delete code line.
pandas.DataFrame.from_dict(gene_dict, orient = 'index')

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,18,19,20,21,22,23,24,25,26,27
0,DCC,PIK3CA,PLA2G2A,NRAS,MLH3,TP53,DLC1,APC,TLR2,CCND1,...,RAD54B,PDGFRL,AURKA,CTNNB1,BUB1B,EP300,BUB1,FLCN,AKT1,
1,XRCC3,PHB,RAD54L,PIK3CA,TP53,ESR1,SLC22A18,BRIP1,PALB2,CHEK2,...,CDH1,BRCA2,AKT1,,,,,,,
2,PPARG,IRS2,SLC2A2,PTPN1,IL6,PAX4,AKT2,IRS1,GCK,MTNR1B,...,IGF2BP2,GPD2,NEUROD1,WFS1,HMGA1,PDX1,ENPP1,LIPC,PPP1R3A,KCNJ11
3,GYPA,NOS2,HBB,ICAM1,G6PD,CD36,FCGR2A,ACKR1,CR1,FCGR2B,...,,,,,,,,,,
4,PIK3CA,PRKN,OPCML,CDH1,CTNNB1,AKT1,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5410,SP110,,,,,,,,,,...,,,,,,,,,,
5411,ERMARD,,,,,,,,,,...,,,,,,,,,,
5412,MPLKIP,,,,,,,,,,...,,,,,,,,,,
5413,LRBA,,,,,,,,,,...,,,,,,,,,,


## Define a function that takes two row indexes (the diseases) and calculates the ***uniqueness score*** between the two diseases

The method was presented by Carson et al. in the article ["A disease similarity matrix based on the uniqueness of shared genes"](https://doi.org/10.1186/s12920-017-0265-2), where the ***uniqueness score*** is defined as

$$d_{ij} = u_{s_1}+u_{s_2}+\dots + u_{s_n}$$

where $d_{ij}$ is the uniqueness score of the disease pair $(i, j)$ and $u_{s_k}$ is the uniqueness value of the $k$-th of the $n$ genes shared by the two diseases.
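
A small worked example may make the sum concrete. The gene sets and uniqueness values below are invented for illustration, and the sketch uses a set intersection to pick out the shared genes.

```python
# Hypothetical gene sets for two diseases and hypothetical uniqueness values.
genes_disease_i = {'TP53', 'KRAS', 'AKT1'}
genes_disease_j = {'TP53', 'AKT1', 'BRCA2'}
toy_uniqueness = {'TP53': 0.90, 'KRAS': 0.95, 'AKT1': 0.97, 'BRCA2': 0.98}

# Genes shared by both diseases: the set intersection.
shared_genes = genes_disease_i & genes_disease_j

# d_ij is the sum of the uniqueness values of the shared genes.
pair_score = sum(toy_uniqueness[gene] for gene in shared_genes)

print(sorted(shared_genes), round(pair_score, 2))
```

Only TP53 and AKT1 appear in both sets, so the score is 0.90 + 0.97 = 1.87. KRAS and BRCA2 contribute nothing because each belongs to only one of the two diseases.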

In [17]:
def get_pair_uniqueness_score(row1, row2):
    '''Return uniqueness score between two diseases. The uniqueness
    score is calculated by summing the uniqueness of the genes shared
    by both diseases. 
    
    Parameters:
    row1 (int): the row index number of disease 1
    row2 (int): the row index number of disease 2.
    
    Notes:
    The following variables are global variables. The function needs
    them in order to work. Passing variables as arguments is slower.
    
    gene_dict (dict): Stores row index keys and gene list values. 
    uniqueness (dict): Stores the gene uniqueness for each gene in 
    the gene-disease associations file.
    '''
    # Create accumulator for summing uniqueness of each shared gene.
    pair_uniqueness_score = 0

    # Iterate through every gene in row 1.
    for gene in gene_dict[row1]:

        # Check if gene also appears in row 2.
        if gene in gene_dict[row2]:

            # Add uniqueness of shared gene if gene is in row 1 and 2.
            pair_uniqueness_score += uniqueness[gene]

    # Return the uniqueness score of the genes shared by the diseases.
    return pair_uniqueness_score

## Define a function that stores the uniqueness score for every disease combination

In [18]:
def store_uniqueness_scores(file):
    '''Store the gene uniqueness score between two diseases, for
    every disease combination.
    
    Parameters:
    file: Pandas data frame storing a square matrix of every disease 
    combination.
    
    Notes: 
    The following variables are global variables. The function needs
    them in order to work. Passing variables as arguments is slower.
    
    gene_dict (dict): Stores row index keys and gene list values. 
    uniqueness (dict): Stores the gene uniqueness for each gene in 
    the gene-disease associations file.
    '''
    # The number of rows is equal to number of keys in dictionary.
    # This is also equal to the number of diseases.
    row_count = len(gene_dict)
    
    # Iterate through every row in the square matrix.
    for row in range(0, row_count):
        
        # Iterate through every column index equal to or larger than
        # the row index. This fills the upper half of the matrix.
        for column in range(row, row_count):
                
            # Store the uniqueness score.
            file.at[row, column] = get_pair_uniqueness_score(row, column)
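
The upper-triangle fill can be tried on a toy problem before running it on the full data set. The following is a sketch under assumed inputs: three hypothetical diseases, invented uniqueness values, and a plain numeric data frame in place of the notebook's matrix.

```python
import pandas

# Hypothetical gene sets keyed by row index, with invented uniqueness values.
toy_gene_dict = {0: {'TP53', 'KRAS'}, 1: {'TP53'}, 2: {'BRCA2'}}
toy_uniqueness = {'TP53': 0.9, 'KRAS': 0.95, 'BRCA2': 0.98}

size = len(toy_gene_dict)

# Empty (NaN-filled) square matrix with one row and column per disease.
toy_matrix = pandas.DataFrame(index=range(size), columns=range(size),
                              dtype=float)

for row in range(size):
    # Start at the diagonal so each unordered pair is scored only once.
    for column in range(row, size):
        shared = toy_gene_dict[row] & toy_gene_dict[column]
        toy_matrix.at[row, column] = sum(toy_uniqueness[g] for g in shared)

print(toy_matrix)
```

Because the score of a pair does not depend on the order of the two diseases, filling only the cells on and above the diagonal covers every pair with n(n + 1)/2 evaluations instead of n². The cells below the diagonal stay empty.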

## Store the uniqueness score for every disease combination

This takes about 2 minutes.

In [19]:
# Fill the square matrix with gene uniqueness scores.
store_uniqueness_scores(uniqueness_score)

### Display gene uniqueness score file

In [20]:
# For visualization only: may delete code line.
uniqueness_score

Unnamed: 0,DB ID,Disease,0,1,2,3,4,5,6,7,...,5405,5406,5407,5408,5409,5410,5411,5412,5413,5414
0,114500,Colorectal cancer with chromosomal instability...,26.3411,2.87354,0,0,2.88466,4.81837,0.951003,0,...,0,0,0,0,0,0,0,0,0,0
1,114480,"Breast cancer, somatic | {Breast cancer, prote...",,20.4521,0,0,2.88733,2.87675,3.85321,0.952925,...,0,0,0,0,0,0,0,0,0,0
2,125853,"Diabetes mellitus, noninsulin-dependent, late ...",,,27.422,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,611162,"{Malaria, resistance to} | {Malaria, protectio...",,,,13.6712,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,167000,"Ovarian cancer, somatic",,,,,5.81425,1.91505,1.92747,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5412,234050,"Trichothiodystrophy 4, nonphotosensitive",,,,,,,,,...,,,,,,,,0.986411,0,0
5413,614700,"Immunodeficiency, common variable, 8, with aut...",,,,,,,,,...,,,,,,,,,0.986411,0
5414,618425,Neurodevelopmental disorder with impaired spee...,,,,,,,,,...,,,,,,,,,,0.986411
DB ID,,,114500,114480,125853,611162,167000,114550,211980,601626,...,609432,140000,228600,617175,176305,235550,615544,234050,614700,618425


## Save the gene uniqueness file as a .csv file

In [21]:
# Specify the filename.
filename = 'Gene Uniqueness.csv'

# Make index = True so that index columns aren't dropped.
uniqueness_score.to_csv(filename, index = True)
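
For later analysis, a file saved this way can be loaded back with `pandas.read_csv`. The sketch below does a round trip on a toy frame in a temporary folder; the toy values and the file name `toy uniqueness.csv` are assumptions, not part of the notebook.

```python
import os
import tempfile

import pandas

# Toy matrix with the same mix of text and numeric column labels.
toy = pandas.DataFrame({'DB ID': ['1', '2'],
                        'Disease': ['Disease A', 'Disease B'],
                        0: [1.5, None],
                        1: [0.0, 2.5]})

with tempfile.TemporaryDirectory() as folder:
    toy_path = os.path.join(folder, 'toy uniqueness.csv')
    toy.to_csv(toy_path, index = True)

    # index_col = 0 restores the saved index instead of adding a new one.
    loaded = pandas.read_csv(toy_path, index_col = 0)

print(list(loaded.columns))
```

One thing to keep in mind is that column labels come back as strings, so a column written under the integer label `0` is read back as `'0'`.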