# Max BP-Weight

This program requires the following files:

    BP-Weight.csv
    Max Value Finder.ipynb

The program outputs the following file:

    Max BP-Weight.xlsx

The program takes about 2 minutes to produce this file and may appear unresponsive while it runs.

## Define the default filenames

In [1]:
# Use os module to access files in other directories. 
from os import path

bp_weight_path = path.abspath(
    '../BP-Weight/BP-Weight.csv')

## Import the function that will get the max BP-weight scores

In [2]:
# Run the Jupyter notebook that defines the get_max_values and
# replace_index_rows functions. The notebook holds reusable code for
# finding the maximum values of half-full matrices.
%run "Max Value Finder.ipynb"

## Get the max BP-weight scores

In [3]:
# The get_max_values function will return the disease pairs with the 
# highest BP-weights sorted in descending order.
max_values = get_max_values(bp_weight_path, sort = True)

Opening file.
Getting square matrix of size 5415x5415 as type float.
Replacing matrix diagonal with NaN values.
Finding max values vertically.
Finding max values horizontally.
Comparing vertical and horizontal values.
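
The log above hints at the steps `get_max_values` performs. Its code lives in `Max Value Finder.ipynb` and is not shown here, so the following is only a minimal sketch of how such steps could work, assuming the unused half of the matrix holds NaN values. The function name `max_pairs_sketch` and the toy matrix are hypothetical.

```python
import numpy as np
import pandas as pd

def max_pairs_sketch(matrix):
    """Hypothetical stand-in for get_max_values; not the notebook's code."""
    m = np.array(matrix, dtype=float)

    # Replace the diagonal with NaN so an item is never paired with itself.
    np.fill_diagonal(m, np.nan)

    # Treat every NaN (diagonal and empty half) as -inf so it never wins.
    filled = np.where(np.isnan(m), -np.inf, m)

    # Find max values vertically (per column) and horizontally (per row).
    col_max, col_idx = filled.max(axis=0), filled.argmax(axis=0)
    row_max, row_idx = filled.max(axis=1), filled.argmax(axis=1)

    # Compare both candidates and keep the larger one for each item.
    use_col = col_max >= row_max
    result = pd.DataFrame({
        'Index 1': np.arange(len(m)),
        'Index 2': np.where(use_col, col_idx, row_idx),
        'Value': np.where(use_col, col_max, row_max),
    })
    return result.sort_values('Value', ascending=False)

# Toy upper-triangular matrix: only the cells above the diagonal are filled.
nan = np.nan
toy = [[nan, 5.0, 2.0],
       [nan, nan, 7.0],
       [nan, nan, nan]]
pairs = max_pairs_sketch(toy)
```

Because every item gets its own row, a pair whose score is the maximum for both of its members shows up twice, once in each direction.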


### Display contents of the max BP-weight file

In [4]:
# For visualization only; this line may be deleted.
# Note that the index column and 'Index 1' hold the same values
# because each disease appears exactly once in 'Index 1'.
max_values

Unnamed: 0,Index 1,Index 2,Value
0,0,1,304.279769
1,1,0,304.279769
5,5,0,281.160848
4,4,0,274.562676
2,2,0,211.596550
...,...,...,...
5347,5347,0,0.893863
5379,5379,0,0.868246
5411,5411,0,0.710447
5413,5413,0,0.577852


## Replace index columns with specified columns coming from the BP term weight files

In [5]:
# Replace columns with corresponding 'DB ID' and 'Disease' data.
max_values = replace_index_rows(max_values, bp_weight_path, 
                                columns=['DB ID', 'Disease'])
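
`replace_index_rows` is defined in `Max Value Finder.ipynb`, so its implementation is not visible here. As a rough illustration of the idea only, the sketch below swaps positional indices for label columns, assuming the labels sit in a table whose row order matches the matrix order. The function name `replace_index_columns_sketch` and the toy data are hypothetical.

```python
import pandas as pd

def replace_index_columns_sketch(pairs, labels, columns):
    """Hypothetical stand-in for replace_index_rows; not the notebook's code."""
    out = pd.DataFrame(index=pairs.index)

    # Handle 'Index 1' first and 'Index 2' second.
    for side in (1, 2):
        positions = pairs[f'Index {side}'].to_numpy()

        # Look up each requested label column by row position.
        for column in columns:
            out[f'{column} {side}'] = labels[column].to_numpy()[positions]

    # Carry the score over unchanged.
    out['Value'] = pairs['Value']
    return out

# Toy data: two items that are each other's best match.
toy_pairs = pd.DataFrame(
    {'Index 1': [0, 1], 'Index 2': [1, 0], 'Value': [3.5, 3.5]})
toy_labels = pd.DataFrame(
    {'DB ID': [111, 222], 'Disease': ['Disease A', 'Disease B']})
labelled = replace_index_columns_sketch(
    toy_pairs, toy_labels, ['DB ID', 'Disease'])
```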

## Rename 'Value' column to 'Max BP-Weight'

In [6]:
# Rename the 'Value' column to 'Max BP-Weight'.
max_values = max_values.rename(columns = {'Value': 'Max BP-Weight'})

### Display contents of the max BP-weight file

In [7]:
# For visualization only; this line may be deleted.
max_values

Unnamed: 0,DB ID 1,Disease 1,DB ID 2,Disease 2,Max BP-Weight
0,114500.0,Colorectal cancer with chromosomal instability...,114480.0,"Breast cancer, somatic | {Breast cancer, prote...",304.279769
1,114480.0,"Breast cancer, somatic | {Breast cancer, prote...",114500.0,Colorectal cancer with chromosomal instability...,304.279769
5,114550.0,"Hepatocellular cancer, somatic | Hepatoblastom...",114500.0,Colorectal cancer with chromosomal instability...,281.160848
4,167000.0,"Ovarian cancer, somatic",114500.0,Colorectal cancer with chromosomal instability...,274.562676
2,125853.0,"Diabetes mellitus, noninsulin-dependent, late ...",114500.0,Colorectal cancer with chromosomal instability...,211.596550
...,...,...,...,...,...
5347,301035.0,"Hypothyroidism, congenital, nongoitrous, 9",114500.0,Colorectal cancer with chromosomal instability...,0.893863
5379,615355.0,Noonan syndrome 8,114500.0,Colorectal cancer with chromosomal instability...,0.868246
5411,615544.0,?Periventricular nodular heterotopia 6,114500.0,Colorectal cancer with chromosomal instability...,0.710447
5413,614700.0,"Immunodeficiency, common variable, 8, with aut...",114500.0,Colorectal cancer with chromosomal instability...,0.577852


## Define a function that removes duplicate rows

A disease pair can appear twice, once as (A, B) and once as (B, A), so the mirrored duplicates should be removed. For example, the two rows below compare the same two diseases in opposite order and share the value 123.4567:

| |DB ID 1 |Disease 1 |DB ID 2 |Disease 2 |Max BP-Weight|
| -- | -- | -- | -- | -- | -- |
| **2378** |114550 | Hepatoblastoma, somatic \| Hepatocellular cance... |114500 |Colon cancer, somatic \| {?Colorectal cancer, s... | 123.4567 |
| **1370** |114500 | Colon cancer, somatic \| {?Colorectal cancer, s... |114550 |Hepatoblastoma, somatic \| Hepatocellular cance... | 123.4567 |

In [8]:
def drop_duplicates(max_values):
    '''Drop rows that contain the same information. For example,
    comparing diseases A and B gives the same value as comparing
    diseases B and A. Therefore, one of the duplicate rows can be
    dropped. The first occurrence of each pair is kept.

    max_values: Pandas dataframe containing the max value disease pairs.
    The dataframe must contain a 'DB ID 1' and a 'DB ID 2' column
    because the database IDs are used to drop duplicates.
    '''
    # Create an empty set to store the DB ID pairs already processed.
    # A set gives constant-time lookups, unlike a list.
    seen = set()

    # Record, for every row, whether it should be kept.
    keep = []

    # Iterate through every disease pair in the max_values dataframe.
    for index, row in max_values.iterrows():

        # Get the unordered pair of DB IDs. A frozenset ignores the
        # order of the IDs and is hashable, so it can be stored in a set.
        db_id_pair = frozenset([row['DB ID 1'], row['DB ID 2']])

        # If the pair has not been processed before, then the disease
        # combination is the first of its kind and the row is kept.
        keep.append(db_id_pair not in seen)

        # Since the pair was processed, add it to the seen set.
        seen.add(db_id_pair)

    # Return the max value disease pairs without duplicates.
    # DataFrame.append was removed in pandas 2.0, so the rows are
    # selected with a boolean mask instead of being appended one by one.
    return max_values.loc[keep]
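
The same de-duplication can also be expressed without a Python-level loop. The sketch below is an alternative, not the notebook's method: it sorts the two IDs within each row so that (A, B) and (B, A) produce the same key, then keeps the first row per key. The function name `drop_mirrored_pairs` and the toy table are hypothetical, and the sketch assumes both ID columns hold mutually comparable values such as numbers.

```python
import numpy as np
import pandas as pd

def drop_mirrored_pairs(df, col_a='DB ID 1', col_b='DB ID 2'):
    """Keep the first row of every unordered (col_a, col_b) pair."""
    # Sort the two IDs within each row so the pair order no longer matters.
    ids = np.sort(df[[col_a, col_b]].to_numpy(), axis=1)

    # Flag every row whose sorted pair has already appeared further up.
    is_repeat = pd.DataFrame(ids).duplicated(keep='first').to_numpy()

    # Select the rows that are not repeats; the original index is kept.
    return df[~is_repeat]

# Toy table: rows 0 and 1 are mirror images of each other.
toy = pd.DataFrame({
    'DB ID 1': [114500.0, 114480.0, 114550.0],
    'DB ID 2': [114480.0, 114500.0, 114500.0],
    'Max BP-Weight': [304.3, 304.3, 281.2],
}, index=[0, 1, 5])
deduplicated = drop_mirrored_pairs(toy)
```

On a table of a few thousand rows either approach is fast; the vectorized form mainly pays off on much larger tables.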

## Remove duplicate rows

In [9]:
# Remove duplicate rows from the max_values table.
max_val_no_duplicates = drop_duplicates(max_values)

### Display the contents of the file without duplicate rows

In [10]:
# For visualization only; this line may be deleted.
max_val_no_duplicates

Unnamed: 0,DB ID 1,Disease 1,DB ID 2,Disease 2,Max BP-Weight
0,114500.0,Colorectal cancer with chromosomal instability...,114480.0,"Breast cancer, somatic | {Breast cancer, prote...",304.279769
5,114550.0,"Hepatocellular cancer, somatic | Hepatoblastom...",114500.0,Colorectal cancer with chromosomal instability...,281.160848
4,167000.0,"Ovarian cancer, somatic",114500.0,Colorectal cancer with chromosomal instability...,274.562676
2,125853.0,"Diabetes mellitus, noninsulin-dependent, late ...",114500.0,Colorectal cancer with chromosomal instability...,211.596550
6,211980.0,"Adenocarcinoma of lung, response to tyrosine k...",114500.0,Colorectal cancer with chromosomal instability...,206.873915
...,...,...,...,...,...
5347,301035.0,"Hypothyroidism, congenital, nongoitrous, 9",114500.0,Colorectal cancer with chromosomal instability...,0.893863
5379,615355.0,Noonan syndrome 8,114500.0,Colorectal cancer with chromosomal instability...,0.868246
5411,615544.0,?Periventricular nodular heterotopia 6,114500.0,Colorectal cancer with chromosomal instability...,0.710447
5413,614700.0,"Immunodeficiency, common variable, 8, with aut...",114500.0,Colorectal cancer with chromosomal instability...,0.577852


## Save the max BP-weight tables as an Excel file

In [11]:
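# pandas is needed to write the file; importing it again is harmless
# if the %run cell above already made it available.
import pandas

# Specify the filename.
filename = 'Max BP-Weight.xlsx'

# Keep index = True so that the index column is written to the file.
# Saving Excel files is much slower than saving .csv files, but this
# is not a problem because the tables are very small.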
with pandas.ExcelWriter(filename) as spreadsheet:  

    max_values.to_excel(
        spreadsheet, sheet_name='All Values', index = True)
    max_val_no_duplicates.to_excel(
        spreadsheet, sheet_name='No Duplicates', index = True)
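
If the slower Excel export ever becomes a concern, the same tables could be written as one .csv file each. This is a minimal sketch under that assumption; the helper name `save_tables_as_csv` and the toy table are hypothetical, and the Excel file remains the program's documented output.

```python
import os
import tempfile
import pandas as pd

def save_tables_as_csv(tables, directory):
    """Write each table to '<name>.csv' in directory and return the paths."""
    paths = {}
    for name, table in tables.items():
        path = os.path.join(directory, f'{name}.csv')

        # Keep index=True so the index column is written, as in the
        # Excel export.
        table.to_csv(path, index=True)
        paths[name] = path
    return paths

# Toy table written to a temporary directory and read back.
toy = pd.DataFrame({'Max BP-Weight': [1.5, 2.5]}, index=[0, 5])
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_paths = save_tables_as_csv({'No Duplicates': toy}, tmp_dir)
    round_trip = pd.read_csv(csv_paths['No Duplicates'], index_col=0)
```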