# KL Distance Cytoscape Visualization, Version 0.7
#### By Peter Rucker, ptrucker@ucdavis.edu

## Purpose:
The purpose of this script is to take data from multiple samples, calculate pair-wise KL distance for each sample pair, and then generate a format which can be inerpreted by Cytoscape for visualization.

## Modules Required:
Pandas, Numpy

## Input:

The first line of code with "current_file" should be updated to reflect a input data in the same folder as the script.

Input data will be in the form of a comma seperated file (.csv), the contents of which must obey this format:

| *Notes*         | Sample Name 1 | Sample Name 2 |
| ------------ | ------------- | ------------- |
| Event Type 1 | 10            | 12            |
| Event Type 2 | 1             | 23            |

* Sample Name will refer to the unique samples studied in this data
* Event Type refers to the events observed and counted, across all samples
* The values underneath each sample name should be whole numbers counting the observed event for each sample. **Noted: This must be observed counts, Not frequencies.**
* The "Notes" field is optional, and can be used for metadata

## Output:
Once all cells of the notebook have been ran, the script will output the name of the file created. The file will be a .fsi format ready to be imported as a "Network from file" in Cytoscape. When importing to cytoscape, the KL_


In [5]:
current_file="rearrangement_updated_32720.csv"

In [14]:
import numpy as np
import pandas as pd
import csv
from datetime import date
import os
todaydate = date.today()
outputname=("kl_cyto_" + current_file[:-4] + str(todaydate) + ".fsi")

In [15]:
KLdf=pd.read_csv(current_file)
KLdf.set_index("type", inplace=True)

In [16]:
#adding one to every field for psuedocounting purposes
KLdf = KLdf.astype("float64")
for key, row in KLdf.iteritems():
    for key2, row2 in row.iteritems():
        KLdf.loc[key2][key] += 1
        
for key, row in KLdf.iteritems():
    for key2, row2 in row.iteritems():
        if KLdf.loc[key2][key] <= 0:
            print("error in",key2,key)

In [17]:
#Convert dataframes to frequencies of each column
for key, row in KLdf.iteritems():
    for key2, row2 in row.iteritems():
        value = KLdf.loc[key2][key]
        KLdf.loc[key2][key] = (value / sum(row))

In [18]:
# calculating KL distance between two distrubtions. Distance(P||Q) = summation[Pi*Log2(Pi/Qi)]
KLresults = {}
for key, row in KLdf.iteritems():
    for key2, row2 in KLdf.iteritems():
        if key == key2:
            continue
        data = row * np.log2(row/row2)
        resultkey = key + " " + key2
        reversekey = key2 + " " + key
        if reversekey in KLresults:
            KLresults[reversekey] += sum(data)
        else:
            KLresults[resultkey] = sum(data)

In [19]:
resultframe = pd.DataFrame.from_dict(KLresults, orient = 'index', columns = ["KL_Distance"])
resultframe.reset_index(inplace=True)

# new data frame with split value columns 
new = resultframe["index"].str.split(" ", n = 1, expand = True)
# making separate first name column from new data frame 
resultframe["Node_1"]= new[0] 
  
# making separate last name column from new data frame 
resultframe["Node_2"]= new[1] 
  
# Dropping old Name columns 
resultframe.drop(columns =["index"], inplace = True) 

resultframe = resultframe[['Node_1','Node_2',"KL_Distance"]]

In [24]:
display(resultframe)
display(KLdf)

Unnamed: 0,Node_1,Node_2,KL_Distance
0,NHLC_WT,NHLC_1A,8.016299
1,NHLC_WT,NHLC_1B,7.265504
2,NHLC_WT,NHLC_2A,2.463260
3,NHLC_WT,NHLC_2B,3.193799
4,NHLC_WT,NHLC_3A,2.869800
...,...,...,...
775,NHLC_20A,NHLC_21A,17.417813
776,NHLC_20A,NHLC_21B,15.984040
777,NHLC_20B,NHLC_21A,17.471889
778,NHLC_20B,NHLC_21B,15.989879


Unnamed: 0_level_0,NHLC_WT,NHLC_1A,NHLC_1B,NHLC_2A,NHLC_2B,NHLC_3A,NHLC_3B,NHLC_4B,NHLC_4C,NHLC_5A,...,NHLC_17A,NHLC_17B,NHLC_18A,NHLC_18B,NHLC_19A,NHLC_19B,NHLC_20A,NHLC_20B,NHLC_21A,NHLC_21B
type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
p1p7del,0.005909,0.000525,0.011116,0.000415,0.002627,0.000485,0.002883,0.000377,0.000281,0.000539,...,0.000313,0.020182,0.001139,0.000606,0.00041,0.000686,0.000435,0.001707,0.000213,0.000651
p1p6del,0.016004,0.000525,0.008899,0.000416,0.004138,0.000485,0.007848,0.002637,0.000281,0.00054,...,0.001563,0.000997,0.020525,0.000607,0.003692,0.000686,0.000435,0.000428,0.012774,0.000217
p1p4del,0.000465,0.000525,0.000473,0.000416,0.000378,0.000485,0.005828,0.000378,0.000282,0.00054,...,0.000313,0.000333,0.000388,0.000607,0.000412,0.000687,0.000435,0.000428,0.000216,0.000217
p1p5del,0.000465,0.000525,0.000473,0.000416,0.000378,0.000485,0.000419,0.000378,0.000282,0.00054,...,0.001566,0.000333,0.000388,0.000608,0.000412,0.000687,0.000435,0.000428,0.000431,0.000217
p2p6del,0.006512,0.000526,0.000473,0.000416,0.000378,0.000486,0.000419,0.000378,0.000845,0.000541,...,0.000314,0.000333,0.001942,0.000608,0.000412,0.000688,0.325207,0.318493,0.038627,0.043469
p2p7del,0.000468,0.495002,0.477042,0.001249,0.003782,0.000486,0.000419,0.000378,0.000282,0.000541,...,0.000941,0.002996,0.440852,0.000608,0.000412,0.000688,0.00258,0.000628,0.000673,0.000682
p2p3del,0.000468,0.001041,0.000905,0.000417,0.00038,0.000486,0.000419,0.000378,0.000282,0.000541,...,0.000314,0.000334,0.000696,0.000609,0.000412,0.000689,0.000647,0.000628,0.000225,0.000227
p2p5del,0.000469,0.001042,0.000905,0.000417,0.00038,0.000486,0.000419,0.000378,0.000282,0.000541,...,0.000314,0.000334,0.000696,0.000609,0.000413,0.000689,0.000647,0.000629,0.000225,0.000227
p2p4del,0.000469,0.001043,0.000906,0.000417,0.00038,0.000487,0.00042,0.000379,0.000282,0.000542,...,0.14582,0.154359,0.001393,0.000609,0.000413,0.00069,0.000648,0.000629,0.000225,0.000227
p3p4del,0.000469,0.001044,0.000907,0.000417,0.00038,0.000487,0.00042,0.000379,0.000282,0.000542,...,0.000368,0.000395,0.000698,0.00061,0.000413,0.00069,0.000648,0.00063,0.000899,0.000228


In [21]:
os.mkdir("output")
f = open(outputname, "w")
f.write(resultframe.to_csv(index=False, sep='\t'))
f.close()
print(f"File {outputname} has been succesfully created")

File kl_cyto_rearrangement_updated_327202020-04-08.fsi has been succesfully created
