# LncATLAS
Publication: LncATLAS database for subcellular localization of long noncoding RNAs (2017) David Mas-Ponte, Joana Carlevaro-Fita, Emilio Palumbo, Toni Hermoso Pulido, Roderic Guigo, and Rory Johnson. RNA 23:1080–1087

Publication [online](https://rnajournal.cshlp.org/content/23/7/1080)

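This notebook keeps only the genes that have a cytoplasmic:nuclear relative concentration index (CN-RCI) value for at least one cell line.  
It then transposes the table from one row per gene and cell line to one row per gene, with one column per cell line, and saves coding and noncoding genes to separate CSV files.  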

In [1]:
ATLAS_DIR='/Users/jasonmiller/WVU/MDPI/LncAtlas/'
ATLAS_DATA='lncATLAS_all_data_RCI.csv'
CODING_OUTFILE='CNRCI_coding_genes.csv'
NONCODING_OUTFILE='CNRCI_noncoding_genes.csv'

## Raw data

In [2]:
import pandas as pd
infile = ATLAS_DIR+ATLAS_DATA
df=pd.read_csv(infile)
df

Unnamed: 0,ENSEMBL ID,Data Source,Data Type,Value,Gene Name,Coding Type,Biotype
0,ENSG00000000003,A549,CNRCI,1.08068,TSPAN6,coding,coding
1,ENSG00000000003,GM12878,CNRCI,,TSPAN6,coding,coding
2,ENSG00000000003,H1.hESC,CNRCI,1.85734,TSPAN6,coding,coding
3,ENSG00000000003,HeLa.S3,CNRCI,1.86839,TSPAN6,coding,coding
4,ENSG00000000003,HepG2,CNRCI,2.29436,TSPAN6,coding,coding
...,...,...,...,...,...,...,...
714515,ENSG00000283125,NCI.H460,CNRCI,,RP11-299P2.2,nc,nc
714516,ENSG00000283125,NHEK,CNRCI,,RP11-299P2.2,nc,nc
714517,ENSG00000283125,SK.MEL.5,CNRCI,,RP11-299P2.2,nc,nc
714518,ENSG00000283125,SK.N.DZ,CNRCI,,RP11-299P2.2,nc,nc


In [3]:
# Coding Type same as Biotype in every case
print('Biotype')
print(df['Biotype'].value_counts())

Biotype
coding    395940
nc        318580
Name: Biotype, dtype: int64
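
The comment in cell 3 states that Coding Type matches Biotype in every row, but the cell itself only prints counts. A minimal sketch of how that claim could be checked directly is shown here on a small hypothetical frame (`toy`, not part of the original notebook) that reuses the two column names; in the notebook, `df` would take its place.

```python
import pandas as pd

# Hypothetical stand-in for the LncATLAS table; only the two columns matter here.
toy = pd.DataFrame({
    'Coding Type': ['coding', 'coding', 'nc', 'nc'],
    'Biotype':     ['coding', 'coding', 'nc', 'nc'],
})

# Row-wise comparison; True only if the two columns agree on every row.
columns_agree = bool((toy['Coding Type'] == toy['Biotype']).all())
print('Coding Type matches Biotype everywhere:', columns_agree)
```

If the check passes on the real data, either column can be used for the coding/noncoding split.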


## Filtered data

In [4]:
# Keep only the cytoplasmic-nuclear RCI rows, i.e. Data Type == 'CNRCI'
# One cell line (K562) also has other data types, such as RCIno
bf = df.loc[df['Data Type']=='CNRCI']

In [5]:
# Drop rows whose Value is NaN
qf = bf.loc[bf['Value'].notna()]
qf

Unnamed: 0,ENSEMBL ID,Data Source,Data Type,Value,Gene Name,Coding Type,Biotype
0,ENSG00000000003,A549,CNRCI,1.080680,TSPAN6,coding,coding
2,ENSG00000000003,H1.hESC,CNRCI,1.857340,TSPAN6,coding,coding
3,ENSG00000000003,HeLa.S3,CNRCI,1.868390,TSPAN6,coding,coding
4,ENSG00000000003,HepG2,CNRCI,2.294360,TSPAN6,coding,coding
5,ENSG00000000003,HT1080,CNRCI,0.866395,TSPAN6,coding,coding
...,...,...,...,...,...,...,...
714484,ENSG00000283122,HepG2,CNRCI,-2.584960,HYMAI,nc,nc
714485,ENSG00000283122,HT1080,CNRCI,-1.485430,HYMAI,nc,nc
714487,ENSG00000283122,IMR.90,CNRCI,-3.305810,HYMAI,nc,nc
714494,ENSG00000283122,MCF.7,CNRCI,-3.544320,HYMAI,nc,nc


In [6]:
coding = qf.loc[qf['Biotype']=='coding']
print('Coding values:',len(coding))
noncoding = qf.loc[qf['Biotype']=='nc']
print('Noncoding values:',len(noncoding))

Coding values: 169966
Noncoding values: 28217
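
The counts printed by cell 6 are numbers of values (one per gene and cell line), not numbers of genes. As a rough sketch, the number of distinct genes per biotype, which is the row count to expect after transposing, could be obtained with a group-by. The frame `toy` and its gene ids are hypothetical; in the notebook, `qf` would take its place.

```python
import pandas as pd

# Hypothetical rows in the same long format as the filtered table.
toy = pd.DataFrame({
    'ENSEMBL ID':  ['G1', 'G1', 'G2', 'G3', 'G3', 'G3'],
    'Data Source': ['A549', 'HepG2', 'A549', 'A549', 'HepG2', 'K562'],
    'Value':       [1.1, 2.3, -0.5, -2.6, -1.5, -3.3],
    'Biotype':     ['coding', 'coding', 'coding', 'nc', 'nc', 'nc'],
})

# Number of (gene, cell line) values per biotype: the quantity len() reports.
values_per_biotype = toy['Biotype'].value_counts()
# Number of distinct genes per biotype: the row count of the transposed table.
genes_per_biotype = toy.groupby('Biotype')['ENSEMBL ID'].nunique()
print(values_per_biotype)
print(genes_per_biotype)
```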


In [7]:
# Assume the coding and noncoding dataframes have the same cell lines
# Use the cell line names as csv column headers
cell_line_names = list(coding['Data Source'].unique())
column_names = ['gene_id']+cell_line_names
column_names

['gene_id',
 'A549',
 'H1.hESC',
 'HeLa.S3',
 'HepG2',
 'HT1080',
 'HUVEC',
 'MCF.7',
 'NCI.H460',
 'NHEK',
 'SK.MEL.5',
 'SK.N.DZ',
 'SK.N.SH',
 'GM12878',
 'K562',
 'IMR.90']

In [8]:
# It is sloppy, but the cell line names are hard-coded here
# for conversion from cell_line to array_index.
line_to_index={}
line_to_index['A549']    =0
line_to_index['H1.hESC'] =1
line_to_index['HeLa.S3'] =2
line_to_index['HepG2']   =3
line_to_index['HT1080']  =4
line_to_index['HUVEC']   =5
line_to_index['MCF.7']   =6
line_to_index['NCI.H460']=7
line_to_index['NHEK']    =8
line_to_index['SK.MEL.5']=9
line_to_index['SK.N.DZ'] =10
line_to_index['SK.N.SH'] =11
line_to_index['GM12878'] =12
line_to_index['K562']    =13
line_to_index['IMR.90']  =14
NUMBER_OF_LINES=15
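
The hard-coded mapping in cell 8 follows the order printed by cell 7, so it could instead be derived from that list. A minimal sketch, assuming the list order is the desired column order; the names `cell_lines`, `derived_index`, and `number_of_lines` are illustrative and do not appear in the original notebook.

```python
# Cell line names in the order printed by cell 7
# (copied here so the sketch is self-contained).
cell_lines = ['A549', 'H1.hESC', 'HeLa.S3', 'HepG2', 'HT1080', 'HUVEC',
              'MCF.7', 'NCI.H460', 'NHEK', 'SK.MEL.5', 'SK.N.DZ', 'SK.N.SH',
              'GM12878', 'K562', 'IMR.90']

# Assign each name its position in the list.
derived_index = {name: i for i, name in enumerate(cell_lines)}
number_of_lines = len(cell_lines)
print(derived_index)
print(number_of_lines)
```

Deriving the mapping this way keeps the CSV header and the array positions in step if the set of cell lines ever changes.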

In [9]:
class values_one_gene():
    '''Hold one gene_id plus its RCI value for each cell line.'''
    def __init__(self,gene):
        self.gene_id = gene
        # Cell lines with no value for this gene keep the placeholder 0
        self.values = [0] * NUMBER_OF_LINES
    def __repr__(self):
        return str(self.values)
    def add(self,cell_line,RCI_value):
        index = line_to_index[cell_line]
        self.values[index] = RCI_value
    def get_values(self):
        return self.values

In [10]:
def populate_gene_values(df):
    '''Transform the given dataframe (one value per gene+cell_line)
    to a dict (key = gene, value = values_one_gene)'''
    all_genes = {}
    for ndx,row in df.iterrows():
        gene_id   = row['ENSEMBL ID']
        RCI_value = row['Value']
        cell_line = row['Data Source']
        # Create the per-gene record on first sight, then store the value
        if gene_id not in all_genes:
            all_genes[gene_id] = values_one_gene(gene_id)
        all_genes[gene_id].add(cell_line,RCI_value)
    return all_genes
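
The row-by-row transformation above can also be expressed with pandas reshaping. The following is a sketch of an alternative, not the method used in this notebook: it runs `DataFrame.pivot` on a small hypothetical frame (`toy`) with a shortened, made-up column order (`cell_lines`).

```python
import pandas as pd

# Hypothetical long-format rows: one value per gene and cell line.
toy = pd.DataFrame({
    'ENSEMBL ID':  ['G1', 'G1', 'G2'],
    'Data Source': ['A549', 'HepG2', 'HepG2'],
    'Value':       [1.08, 2.29, -2.58],
})
cell_lines = ['A549', 'HepG2', 'K562']  # hypothetical, shortened column order

# One row per gene, one column per cell line; absent pairs become NaN.
wide = toy.pivot(index='ENSEMBL ID', columns='Data Source', values='Value')
# Impose a fixed column order; a cell line with no data (K562 here)
# appears as an all-NaN column.
wide = wide.reindex(columns=cell_lines)
print(wide)
```

Two differences from the loop are worth noting. Missing values come out as NaN rather than the placeholder 0, so `wide.fillna(0)` would be needed to reproduce the original output. Also, `pivot` raises a `ValueError` if the same gene and cell line pair occurs more than once, whereas the loop silently keeps the last value.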

In [11]:
print('coding...')
coding_gene_map = populate_gene_values(coding)
print('noncoding...')
noncoding_gene_map = populate_gene_values(noncoding)
print('done')

coding...
noncoding...
done


In [12]:
def save_to_csv(gene_to_RCI_map,fn):
    '''Write one header line, then one comma-separated line per gene.'''
    with open(fn,'w') as handle:
        handle.write(",".join(column_names) + '\n')
        for key,vog in gene_to_RCI_map.items():
            values = vog.get_values()
            # gene_id first, then one value per cell line
            handle.write(key + ',' + ",".join(str(v) for v in values) + '\n')

In [13]:
print('coding...')
save_to_csv(coding_gene_map, ATLAS_DIR+CODING_OUTFILE)
print('noncoding...')
save_to_csv(noncoding_gene_map, ATLAS_DIR+NONCODING_OUTFILE)
print('done')

coding...
noncoding...
done
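
If the per-gene table is held as a DataFrame rather than a dict, pandas can write and re-read the CSV itself. A minimal sketch under that assumption: the frame `wide` and its values are hypothetical, and an in-memory `io.StringIO` buffer stands in for the output file so that nothing is written to disk.

```python
import io
import pandas as pd

# Hypothetical wide table: gene ids as the index, one column per cell line.
wide = pd.DataFrame(
    {'A549': [1.08, float('nan')], 'HepG2': [2.29, -2.58]},
    index=pd.Index(['G1', 'G2'], name='gene_id'),
)

buffer = io.StringIO()             # stands in for the output file
wide.to_csv(buffer, na_rep='nan')  # missing values are written as 'nan'
csv_text = buffer.getvalue()
print(csv_text)

# Read the text back to confirm the table survives the round trip.
buffer.seek(0)
round_trip = pd.read_csv(buffer, index_col='gene_id')
print(round_trip)
```

Reading the file back is a cheap way to confirm that the header and the value columns line up.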
