# Max Gene Uniqueness

This program requires the following files:

    Gene Uniqueness.csv
    Max Value Finder.ipynb

The program outputs the following file:

    Max Gene Uniqueness.xlsx

The program takes about 1 minute to produce this file and may seem unresponsive while it runs.

## Define the default filenames

In [1]:
# Use os module to access files in other directories. 
from os import path

gene_uniqueness_path = path.abspath(
    '../Gene Uniqueness/Gene Uniqueness.csv')
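
Because `path.abspath` resolves a relative path against the current working directory, a wrong working directory only surfaces later as a failed file read. A minimal sketch of an early check follows; `require_file` is a hypothetical helper and is not part of the original notebook.

```python
from os import path

def require_file(file_path):
    '''Hypothetical helper (not in the original notebook): return the
    absolute path of file_path, or raise if the file does not exist.
    '''
    # Resolve the path relative to the current working directory.
    absolute = path.abspath(file_path)

    # Fail early with a clear message instead of during the slow step.
    if not path.isfile(absolute):
        raise FileNotFoundError(f'Input file not found: {absolute}')

    return absolute
```

It could be called as `gene_uniqueness_path = require_file('../Gene Uniqueness/Gene Uniqueness.csv')`.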

## Import the function that will get the max gene uniqueness scores

In [2]:
# Run Jupyter Notebook that has the get_max_values and the 
# replace_index_rows functions. The notebook has reusable code that
# can be used to find the maximum values of half-full matrices.
%run "Max Value Finder.ipynb"

## Get the max gene uniqueness scores

In [3]:
# The get_max_values function will return the disease pairs with the 
# highest gene uniqueness sorted in descending order.
max_values = get_max_values(gene_uniqueness_path, sort = True)

Opening file.
Getting square matrix of size 5415x5415 as type float.
Replacing matrix diagonal with NaN values.
Finding max values vertically.
Finding max values horizontally.
Comparing vertical and horizontal values.
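
The steps in the log above can be illustrated on a toy matrix. This is only a sketch of the general idea, not the code in `Max Value Finder.ipynb`: the function name `max_partner_per_item` is hypothetical, and the sketch assumes that the empty half of the half-full matrix is stored as NaN.

```python
import numpy as np

def max_partner_per_item(matrix):
    '''Hypothetical sketch: for each item of a half-full square matrix,
    find the other item with the highest score and that score.
    Assumes the empty half of the matrix holds NaN values.
    '''
    scores = np.array(matrix, dtype=float)

    # Replace the diagonal with NaN so an item is never paired with itself.
    np.fill_diagonal(scores, np.nan)

    # Treat missing entries as -inf so they never win a comparison.
    filled = np.where(np.isnan(scores), -np.inf, scores)

    # Find max values vertically (down each column).
    col_idx, col_max = filled.argmax(axis=0), filled.max(axis=0)

    # Find max values horizontally (along each row).
    row_idx, row_max = filled.argmax(axis=1), filled.max(axis=1)

    # Compare vertical and horizontal values and keep the larger one.
    use_row = row_max >= col_max
    partner = np.where(use_row, row_idx, col_idx)
    value = np.where(use_row, row_max, col_max)
    return partner, value

# Toy upper-triangular matrix for three items (made-up scores).
toy = [[np.nan, 2.0, 5.0],
       [np.nan, np.nan, 1.0],
       [np.nan, np.nan, np.nan]]
partner, value = max_partner_per_item(toy)
```

Both directions have to be checked because, in a half-full matrix, the scores that involve a given item are split between its row and its column.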


### Display contents of the max gene uniqueness file

In [4]:
# For visualization only: may delete code line.
# Note that the index column and 'Index 1' are similar because each 
# disease may appear only once in 'Index 1'.
max_values

Unnamed: 0,Index 1,Index 2,Value
0,0,5,4.818368
5,5,0,4.818368
1,1,6,3.853211
6,6,1,3.853211
3815,3815,3846,2.938026
...,...,...,...
2752,2752,2753,0.000000
2753,2753,2754,0.000000
2755,2755,2756,0.000000
2756,2756,2757,0.000000


## Replace index columns with specified columns coming from the gene uniqueness file

In [5]:
# Replace columns with corresponding 'DB ID' and 'Disease' data.
max_values = replace_index_rows(max_values, gene_uniqueness_path, 
                                columns=['DB ID', 'Disease'])
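
To illustrate what this replacement does, here is a minimal sketch on toy data. It is not the actual `replace_index_rows` code: the function `replace_index_columns`, the toy IDs, and the assumption that 'Index 1' and 'Index 2' hold row positions in the source table are all hypothetical.

```python
import pandas as pd

def replace_index_columns(pairs, lookup, columns):
    '''Hypothetical stand-in for replace_index_rows. Assumes that
    'Index 1' and 'Index 2' hold row positions in the lookup table.
    '''
    result = pd.DataFrame(index=pairs.index)

    # Handle the first and the second disease of each pair.
    for n in (1, 2):
        positions = pairs[f'Index {n}'].to_numpy()

        # Copy each requested column, adding the disease number.
        for column in columns:
            result[f'{column} {n}'] = lookup[column].to_numpy()[positions]

    # Keep the score column unchanged.
    result['Value'] = pairs['Value']
    return result

# Toy data (made up for illustration).
lookup = pd.DataFrame({'DB ID': [100, 200, 300],
                       'Disease': ['A', 'B', 'C']})
pairs = pd.DataFrame({'Index 1': [0, 2], 'Index 2': [1, 0],
                      'Value': [1.5, 0.5]}, index=[0, 2])
named = replace_index_columns(pairs, lookup, ['DB ID', 'Disease'])
```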

## Rename 'Value' column to 'Max Gene Uniqueness'

In [6]:
# Rename the 'Value' column to 'Max Gene Uniqueness'.
max_values = max_values.rename(columns = {'Value': 'Max Gene Uniqueness'})

### Display contents of the max gene uniqueness file

In [7]:
# For visualization only: may delete code line.
max_values

Unnamed: 0,DB ID 1,Disease 1,DB ID 2,Disease 2,Max Gene Uniqueness
0,114500,Colorectal cancer with chromosomal instability...,114550,"Hepatocellular cancer, somatic | Hepatoblastom...",4.818368
5,114550,"Hepatocellular cancer, somatic | Hepatoblastom...",114500,Colorectal cancer with chromosomal instability...,4.818368
1,114480,"Breast cancer, somatic | {Breast cancer, prote...",211980,"Adenocarcinoma of lung, response to tyrosine k...",3.853211
6,211980,"Adenocarcinoma of lung, response to tyrosine k...",114480,"Breast cancer, somatic | {Breast cancer, prote...",3.853211
3815,242600,"Iminoglycinuria, digenic",138500,Hyperglycinuria,2.938026
...,...,...,...,...,...
2752,618356,Neurodevelopmental disorder with central and p...,125800,"Diabetes insipidus, nephrogenic",0.000000
2753,125800,"Diabetes insipidus, nephrogenic",615516,"Mental retardation, autosomal recessive 38",0.000000
2755,617899,"Leukodystrophy, hypomyelinating, 14",214500,Chediak-Higashi syndrome,0.000000
2756,214500,Chediak-Higashi syndrome,121850,Corneal fleck dystrophy,0.000000


## Define a function that removes duplicate rows

Some disease pairs appear twice, once as (A, B) and once as (B, A), so one row of each such pair should be removed. For example, two rows could have a value of 1.234567 and compare the same two diseases in opposite order:

| |DB ID 1 |Disease 1 |DB ID 2 |Disease 2 |Max Gene Uniqueness|
| -- | -- | -- | -- | -- | -- |
| **2378** |114550 | Hepatoblastoma, somatic \| Hepatocellular cance... |114500 |Colon cancer, somatic \| {?Colorectal cancer, s... | 1.234567 |
| **1370** |114500 | Colon cancer, somatic \| {?Colorectal cancer, s... |114550 |Hepatoblastoma, somatic \| Hepatocellular cance... | 1.234567 |

In [8]:
def drop_duplicates(max_values):
    '''Drop rows that contain the same information. For example,
    comparing diseases A and B will give the same values as comparing
    diseases B and A. Therefore, one of the duplicate rows can be
    dropped.
    
    max_values: Pandas dataframe containing the max value disease
    pairs. The dataframe must contain a 'DB ID 1' and a 'DB ID 2'
    column because the database IDs are used to drop duplicates.
    '''
    # Create an empty set to store the DB ID pairs already processed.
    seen = set()
    
    # Create a list that flags which rows should be kept.
    keep = []
    
    # Iterate through every disease pair in the max_values dataframe.
    for index, row in max_values.iterrows():
        
        # Get the unordered pair of DB IDs, so that (A, B) and (B, A)
        # compare as equal.
        db_id_pair = frozenset([row['DB ID 1'], row['DB ID 2']])
        
        # If the DB ID pair has not been processed before, then the
        # disease combination is the first of its kind: keep the row.
        keep.append(db_id_pair not in seen)
        
        # Since the DB ID pair was processed, add it to the seen set.
        seen.add(db_id_pair)
            
    # Return the max value disease pairs without duplicates by
    # selecting the flagged rows with boolean indexing.
    return max_values[keep]
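
As an alternative to the row-by-row loop, the same idea can be sketched without explicit iteration: sort the two DB IDs within each row so that (A, B) and (B, A) produce the same key, then keep the first occurrence of each key. The function name `drop_mirrored_pairs` is hypothetical, and the table below is a toy excerpt rather than the real data.

```python
import numpy as np
import pandas as pd

def drop_mirrored_pairs(table, col_a='DB ID 1', col_b='DB ID 2'):
    '''Hypothetical alternative: keep the first row of every unordered
    (col_a, col_b) pair.
    '''
    # Sort the two IDs within each row so the order no longer matters.
    ordered = np.sort(table[[col_a, col_b]].to_numpy(), axis=1)

    # Flag every row whose sorted ID pair was already seen above it.
    keys = pd.DataFrame(ordered, index=table.index)
    is_repeat = keys.duplicated(keep='first')

    return table[~is_repeat]

# Toy excerpt with one mirrored pair.
toy_pairs = pd.DataFrame(
    {'DB ID 1': [114500, 114550, 242600],
     'DB ID 2': [114550, 114500, 138500],
     'Max Gene Uniqueness': [4.818368, 4.818368, 2.938026]},
    index=[0, 5, 3815])
unique_pairs = drop_mirrored_pairs(toy_pairs)
```

This keeps the original row order and index labels, so on this toy table the rows labeled 0 and 3815 remain.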

## Remove duplicate rows

In [9]:
# Remove duplicate rows from the max_values table.
max_val_no_duplicates = drop_duplicates(max_values)

### Display the contents of the file without duplicate rows

In [10]:
# For visualization only: may delete code line.
max_val_no_duplicates

Unnamed: 0,DB ID 1,Disease 1,DB ID 2,Disease 2,Max Gene Uniqueness
0,114500,Colorectal cancer with chromosomal instability...,114550,"Hepatocellular cancer, somatic | Hepatoblastom...",4.818368
1,114480,"Breast cancer, somatic | {Breast cancer, prote...",211980,"Adenocarcinoma of lung, response to tyrosine k...",3.853211
3815,242600,"Iminoglycinuria, digenic",138500,Hyperglycinuria,2.938026
442,202400,"Afibrinogenemia, congenital | Hypofibrinogenem...",616004,"Dysfibrinogenemia, congenital | Hypodysfibrino...",2.938026
2192,226650,"Epidermolysis bullosa, generalized atrophic be...",226700,"Epidermolysis bullosa, junctional, Herlitz type",2.933707
...,...,...,...,...,...
2752,618356,Neurodevelopmental disorder with central and p...,125800,"Diabetes insipidus, nephrogenic",0.000000
2753,125800,"Diabetes insipidus, nephrogenic",615516,"Mental retardation, autosomal recessive 38",0.000000
2755,617899,"Leukodystrophy, hypomyelinating, 14",214500,Chediak-Higashi syndrome,0.000000
2756,214500,Chediak-Higashi syndrome,121850,Corneal fleck dystrophy,0.000000


## Save the max gene uniqueness file as an Excel file

In [11]:
# Specify the filename.
filename = 'Max Gene Uniqueness.xlsx'

# Make index = True so that index columns aren't dropped.
# Saving Excel files is much slower than saving .csv files, but this
# is not a problem because the files are very small.
with pandas.ExcelWriter(filename) as spreadsheet:  

    max_values.to_excel(
        spreadsheet, sheet_name='All Values',    index = True)
    max_val_no_duplicates.to_excel(
        spreadsheet, sheet_name='No Duplicates', index = True)