# Distance
Define distance(RNA1, RNA2, K) as the Euclidean distance
between the K-mer count vectors of RNA1 and RNA2.
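
As a rough illustration of this definition, here is a minimal sketch. It is not the `KmerCounter` class used below; the helper names `kmer_count_vector` and `kmer_distance` are hypothetical, and the sketch assumes sequences over the alphabet ACGT (cDNA-style, with T rather than U) and skips any window containing another symbol.

```python
from itertools import product
import numpy as np

def kmer_count_vector(seq, k):
    """Count every overlapping K-mer in seq; return a vector of length 4**k."""
    # Fixed lexicographic order over ACGT so vectors from different RNAs align.
    index = {''.join(p): i for i, p in enumerate(product('ACGT', repeat=k))}
    vec = np.zeros(len(index), dtype=int)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:  # assumption: skip windows with N or other symbols
            vec[index[kmer]] += 1
    return vec

def kmer_distance(rna1, rna2, k):
    """Euclidean distance between the K-mer count vectors of two sequences."""
    v1 = kmer_count_vector(rna1, k)
    v2 = kmer_count_vector(rna2, k)
    return float(np.linalg.norm(v1 - v2))

# Toy sequences, not real transcripts.
example_distance = kmer_distance('ACGTACGT', 'ACGTTTTT', 2)
print(example_distance)
```

Because the vectors hold raw counts, two RNAs of very different length are far apart under this definition even when their composition is similar.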

In MLP 15, we examined the two mRNA sequences with minimal distance.
These turned out to be from the same locus: 
a normal gene called CENPS, 
plus its overlapping "readthrough" called CENPS-CORT.
(Why does Ensembl consider these different genes?)

We may need to compute all-vs-all pairwise distances.
That will require HPC.
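
To give a sense of the cost, here is a minimal NumPy sketch of an all-vs-all distance matrix. The function name `pairwise_euclidean` is hypothetical, and the toy matrix stands in for real count vectors. With the 67,964 transcripts loaded later in this notebook, the full matrix would hold about 4.6 billion entries, roughly 37 GB as float64, so this sketch would only be practical on subsets or in blocks.

```python
import numpy as np

def pairwise_euclidean(X):
    """All-vs-all Euclidean distances between the rows of X."""
    X = np.asarray(X, dtype=float)
    sq = np.sum(X ** 2, axis=1)
    # Uses ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b for every pair of rows.
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.maximum(d2, 0.0, out=d2)  # clip tiny negatives caused by rounding
    return np.sqrt(d2)

# Toy data: three 2-dimensional "count vectors".
toy = np.array([[0, 0], [3, 4], [6, 8]])
D = pairwise_euclidean(toy)
print(D)
```

The result is symmetric with a zero diagonal, so only the upper triangle needs to be stored if memory is the limiting factor.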

Much less expensive would be all-vs-mean distances.
In this notebook, we try to find the mean K-mer vectors
for nuclear vs cytoplasmic, 
within the training set mRNA, at K=4.

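We got more separation by restricting the counts to RNAs of length 2,000 to 4,000
(the filter below uses each transcript's total K-mer count as a proxy for its length).
This is sensible when using raw counts (not percents), because raw counts grow with sequence length.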
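
The effect of raw counts versus percents can be sketched as follows. This is an illustration only, with a hypothetical helper `counts_to_fractions` and made-up three-element vectors; it is not a step this notebook performs.

```python
import numpy as np

def counts_to_fractions(counts):
    """Divide each row of counts by its row total."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid dividing by zero for empty rows
    return counts / totals

# Two toy RNAs with identical composition; one is ten times longer.
short_rna = np.array([10, 30, 60])
long_rna = short_rna * 10

raw_gap = np.linalg.norm(long_rna - short_rna)
frac = counts_to_fractions([short_rna, long_rna])
frac_gap = np.linalg.norm(frac[1] - frac[0])
print(raw_gap, frac_gap)
```

With raw counts the two toy vectors are far apart purely because of length; after conversion to fractions the gap vanishes. Restricting to a narrow length range is one way to limit this effect without normalizing.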

This notebook shows that, on average, the low-RCI K-mer count vectors are closer to the low-RCI mean vector (mean distance about 96) than to the high-RCI mean vector (about 140). 
Similarly, the high-RCI vectors are on average closer to the high-RCI mean (about 79) than to the low-RCI mean (about 132).
These are averages only; we haven't measured variance. Still, this suggests the two populations have separate distributions and that individuals could be assigned to one by their K-mer counts. 
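
Assigning individuals by their K-mer counts could be sketched as a nearest-mean rule. This is a hypothetical illustration on toy vectors, not something this notebook evaluates; the function name `nearest_mean_labels` is an assumption.

```python
import numpy as np

def nearest_mean_labels(X, low_mean, high_mean):
    """Label each row 'low' or 'high' by whichever mean vector is closer."""
    X = np.asarray(X, dtype=float)
    d_low = np.linalg.norm(X - np.asarray(low_mean, dtype=float), axis=1)
    d_high = np.linalg.norm(X - np.asarray(high_mean, dtype=float), axis=1)
    # Ties go to 'low'; an arbitrary choice for this sketch.
    return np.where(d_low <= d_high, 'low', 'high')

# Toy 2-dimensional vectors standing in for 256-dimensional K-mer counts.
toy_low = np.array([[1, 2], [2, 1]])
toy_high = np.array([[8, 9], [9, 8]])
labels = nearest_mean_labels(
    np.vstack([toy_low, toy_high]),
    toy_low.mean(axis=0),
    toy_high.mean(axis=0),
)
print(labels)
```

On real data, the means would need to come from a training subset and the labels checked on held-out transcripts; scoring the same vectors that defined the means would overstate the separation.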

In [1]:
from datetime import datetime
print(datetime.now())
from platform import python_version
print('Python',python_version())
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt 
import sklearn   # pip install --upgrade scikit-learn
print('sklearn',sklearn.__version__)

2022-10-23 15:54:06.065302
Python 3.10.0
sklearn 1.1.2


In [2]:
from KmerCounter import KmerCounter
K=4
counter=KmerCounter(K)
VOCABULARY_SIZE = counter.get_vocabulary_size() 
from cell_lines import Cell_Lines
CELL_LINE_NUMBER = 0
all_cell_lines = Cell_Lines.get_ordered_list()
cell_line_name = all_cell_lines[CELL_LINE_NUMBER]
print('Cell line for today:',CELL_LINE_NUMBER,'=',cell_line_name)

Cell line for today: 0 = A549


In [3]:
ATLAS_DIR = '/Users/jasonmiller/WVU/Localization/LncAtlas/'
RCI_FILE = 'CNRCI_coding_train_genes.csv'
GENCODE_DIR = '/Users/jasonmiller/WVU/Localization/GenCode/'
COUNTS_FILE='CNRCI_coding_train_counts.K4.csv'

In [4]:
from TrainValidSplit2 import Splitter2
splitter = Splitter2()
ATLAS_PATH = ATLAS_DIR+RCI_FILE
gene_rci = splitter.get_gene_universe(ATLAS_PATH,CELL_LINE_NUMBER)
COUNTS_PATH = GENCODE_DIR+COUNTS_FILE
gid_tid,ordered_counts = splitter.load_counts_universe(COUNTS_PATH)
splitter = None

Loaded RCI values for cell line 0
Selected 10354 values out of 13930 genes.
Loaded 67964 gid+tid combinations.
Loaded 67964 rows of K-mer counts.


In [5]:
print(type(gene_rci),type(gid_tid),type(ordered_counts))

<class 'dict'> <class 'list'> <class 'list'>


In [6]:
def split_genes(pairs,t1,t2):
    low=[]
    middle=[]
    high=[]
    for (gid,rci) in pairs.items():
        if rci<t1:
            low.append(gid)
        elif rci<t2:
            middle.append(gid)
        else:
            high.append(gid)
    return low,middle,high
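def split_counts(ordered_counts,ordered_ids,select_genes):
    # Keep the count vectors of transcripts whose gene is in select_genes.
    wanted = set(select_genes)   # set membership is much faster than list membership
    select_counts=[]
    for i in range(len(ordered_ids)):
        (gid,tid) = ordered_ids[i]
        if gid in wanted:
            select_counts.append(ordered_counts[i])
    return select_counts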

In [7]:
low_genes,middle_genes,high_genes = split_genes(gene_rci,-1,1)
low_counts    = split_counts(ordered_counts,gid_tid,low_genes)
print('Genes/Transcripts Low   :',len(low_genes),len(low_counts))
middle_counts = split_counts(ordered_counts,gid_tid,middle_genes)
print('Genes/Transcripts Middle:',len(middle_genes),len(middle_counts))
high_counts   = split_counts(ordered_counts,gid_tid,high_genes)
print('Genes/Transcripts High  :',len(high_genes),len(high_counts))

Genes/Transcripts Low   : 2085 10936
Genes/Transcripts Middle: 6392 34599
Genes/Transcripts High  : 1877 8816


In [8]:
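def length_filter(counts,minimum,maximum):
    # Keep vectors whose total K-mer count lies in [minimum,maximum].
    # The total is a proxy for RNA length: an RNA of length L has at most L-K+1 K-mers.
    filtered = []
    for count in counts:
        tot = np.sum(count)
        if tot>=minimum and tot<=maximum:
            filtered.append(count)
    return filtered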

In [9]:
# Only the transcript counts are filtered; the gene counts printed below are unchanged.
low_counts    = length_filter(low_counts,2000,4000)
print('Genes/Transcripts Low   :',len(low_genes),len(low_counts))
middle_counts = length_filter(middle_counts,2000,4000)
print('Genes/Transcripts Middle:',len(middle_genes),len(middle_counts))
high_counts   = length_filter(high_counts,2000,4000)
print('Genes/Transcripts High  :',len(high_genes),len(high_counts))

Genes/Transcripts Low   : 2085 2517
Genes/Transcripts Middle: 6392 5548
Genes/Transcripts High  : 1877 392


In [10]:
def compute_average_vector(counts):
    means = np.mean(counts,axis=0)
    return means
low_avg = compute_average_vector(low_counts)
middle_avg = compute_average_vector(middle_counts)
high_avg = compute_average_vector(high_counts)

In [11]:
def distance(a,b):
    # Euclidean distance between two equal-length count vectors.
    # For NumPy arrays, np.linalg.norm(a-b) gives the same value up to rounding.
    ss = 0
    for ai,bi in zip(a,b):
        ss += (ai-bi)**2
    return np.sqrt(ss)

In [13]:
print('dist(low,middle) =',distance(low_avg,middle_avg))
print('dist(middle,high)=',distance(middle_avg,high_avg))
print('dist(low,high)   =',distance(low_avg,high_avg))

dist(low,middle) = 70.54739087237033
dist(middle,high)= 37.45022931134671
dist(low,high)   = 106.52814939423472


In [14]:
def average_distance(data,center):
    tot = 0
    for counts in data:
        dist = distance(counts,center)
        tot += dist
    avg = tot / len(data)
    return avg

In [15]:
low_avg_dist = average_distance(low_counts,low_avg)
print('avg(low - avg(low))  ',low_avg_dist)
high_avg_dist = average_distance(high_counts,high_avg)
print('avg(high - avg(high))',high_avg_dist)

avg(low - avg(low))   95.94030416773376
avg(high - avg(high)) 78.62989749143739


In [16]:
low_high_avg_dist = average_distance(low_counts,high_avg)
print('avg(low - avg(high))',low_high_avg_dist)
high_low_avg_dist = average_distance(high_counts,low_avg)
print('avg(high - avg(low))',high_low_avg_dist)

avg(low - avg(high)) 139.8907798504645
avg(high - avg(low)) 132.26876624371477


Ideas:

Narrow the RNA set to just extremes, Gudenas-style.

The vectors below each have 256 counts (4^4 possible 4-mers).

Compute the mean vector for high and for low RCI.

Compute a presence/absence vector, i.e. which sequences contain each K-mer.

Compute the mean and standard deviation of the distance to the mean vector.

Are nuclear RNAs more like the nuclear mean vector?
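
The mean-and-stdev idea above could be sketched as follows. The function name `distance_stats` is hypothetical and the data are toy values; in this notebook the inputs would be a list such as `low_counts` and a center such as `low_avg`.

```python
import numpy as np

def distance_stats(data, center):
    """Mean and sample standard deviation of Euclidean distances to center."""
    data = np.asarray(data, dtype=float)
    center = np.asarray(center, dtype=float)
    dists = np.linalg.norm(data - center, axis=1)
    return dists.mean(), dists.std(ddof=1)  # ddof=1 gives the sample stdev

# Toy data: two vectors at distances 5 and 10 from the origin.
toy = np.array([[3, 4], [6, 8]])
mean_d, std_d = distance_stats(toy, [0, 0])
print(mean_d, std_d)
```

Comparing the spread of within-group distances against the gap between group means would indicate how much the low and high populations overlap.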
