## Rationale: deconvoluting human RNA-Seq data using mouse scRNA-Seq reference

The aim is to demonstrate that human RNA-Seq datasets can be deconvoluted into scRNA-Seq data using a non-human reference, in this case mouse data. If A.I. deconvolution performs well in this scenario, the method could potentially be used to generate single-cell data for human organs and tissues that have not been characterized yet.

## Load modules

In [1]:
import pandas as pd
import numpy as np
from tqdm import tqdm

## Load the input files

Depending on which tissue needs to be processed, run the appropriate snippet.

The first snippet is used to load the right atrium (RA) data.

The "RA_mouse_sc_data.csv" file is the sc data from the mouse heart, extracted from Tabula Muris. All cells present in the subtissue "RA" were selected. Importantly, "ventricular cardiomyocytes" were excluded, as they should not be present in the atrium.

The processing of "RA_mouse_sc_data.csv" is described in the "Processing sc_data and meta" R script.

"Mouse_to_human.txt" is the collection of mouse-to-human orthologs. These are genes that are conserved in both mice and humans. The file was retrieved from Ensembl's BioMart.

In [2]:
# load the RA mouse sc_data and mouse_to_human files
sc_data = pd.read_csv('Right Atrium/RA_mouse_sc_data.csv')
mus_to_hum_reference_genes = pd.read_csv('Mouse_to_human.txt')
sc_data

Unnamed: 0.1,Unnamed: 0,C_1,C_2,C_3,C_4,C_5,C_6,C_7,C_8,C_9,...,C_2890,C_2891,C_2892,C_2893,C_2894,C_2895,C_2896,C_2897,C_2898,C_2899
0,0610005C13Rik,0.000000,0.00000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1,0610007C21Rik,0.000000,0.00000,1.395622,0.910209,0.0,1.301201,1.439867,1.425038,1.614110,...,0.072375,0.0,0.042181,1.521544,1.528946,1.035905,0.069440,2.067902,1.839154,2.086276
2,0610007L01Rik,0.000000,0.00000,1.355221,1.527729,0.0,1.227175,1.583265,0.000000,0.000000,...,0.031661,0.0,0.042181,0.227084,0.000000,0.000000,0.000000,0.017039,0.000000,0.000000
3,0610007N19Rik,0.000000,0.00000,0.727492,1.127542,0.0,0.881903,0.000000,0.363286,0.000000,...,0.005347,0.0,0.000000,0.000000,0.000000,0.000000,0.702453,0.000000,0.000000,0.000000
4,0610007P08Rik,0.357854,0.00000,0.000000,0.000000,0.0,0.000000,0.348624,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.557381
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22961,Zyg11a,0.000000,0.00000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
22962,Zyg11b,0.000000,0.00000,0.000000,0.045357,0.0,0.000000,0.016547,0.049681,0.816359,...,0.000000,0.0,0.644347,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.177825
22963,Zyx,1.854661,2.90101,1.186330,0.000000,0.0,1.548572,0.710224,1.302701,1.656391,...,0.000000,0.0,2.249960,1.639064,1.910998,0.000000,0.934533,0.000000,0.000000,2.247689
22964,Zzef1,0.000000,0.00000,0.000000,0.000000,0.0,0.000000,1.318090,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000


The second snippet (below) is used to load the SN data.

The "SN_mouse_no_doublets_sc_data.csv" file is the sc data from the mouse sinus node (SN), extracted from Linscheid et al. (2019). All cells were selected and processed as described in the "Processing sc_data and meta" R script.

"Mouse_to_human.txt" is the same as described above.

In [22]:
# load the SN mouse sc_data and mouse_to_human files
sc_data = pd.read_csv('Sinus Node/SN_mouse_no_doublets_sc_data.csv')
mus_to_hum_reference_genes = pd.read_csv('Mouse_to_human.txt')
sc_data

Unnamed: 0.1,Unnamed: 0,C_1,C_2,C_3,C_4,C_5,C_6,C_7,C_8,C_9,...,C_4995,C_4996,C_4997,C_4998,C_4999,C_5000,C_5001,C_5002,C_5003,C_5004
0,Xkr4,0.000000,0.0,0.788676,1.337157,0.0,0.000000,1.278091,0.0,1.397498,...,1.058914,0.000000,0.0,0.0,1.060886,0.000000,1.426369,0.000000,0.000000,1.056583
1,Gm1992,0.000000,0.0,0.788676,0.000000,0.0,0.000000,0.000000,0.0,0.000000,...,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2,Gm37381,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,...,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
3,Rp1,0.000000,0.0,0.000000,0.000000,0.0,0.000000,1.003026,0.0,0.000000,...,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
4,Rp1.1,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,...,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27993,AC168977.1,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,...,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
27994,PISD,1.197756,0.0,0.788676,0.000000,0.0,1.892896,0.622328,0.0,0.000000,...,1.058914,1.876556,0.0,0.0,1.060886,1.433137,0.000000,0.000000,2.222655,1.056583
27995,DHRSX,0.000000,0.0,0.000000,0.000000,0.0,1.340063,0.000000,0.0,0.000000,...,1.058914,0.000000,0.0,0.0,0.000000,0.000000,0.000000,1.008082,0.000000,1.056583
27996,Vmn2r122,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,...,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000


## Convert gene names from mouse to human

Most mouse genes have a human counterpart with the same name, usually written in upper case.
For example, Acta2 in mouse corresponds to ACTA2 in humans.

The conversion is less straightforward when a gene does not map one-to-one: a mouse gene may correspond to several human genes, and vice versa (these cases are reported as "multiple isoforms" in the output below). In these cases it is unclear what the right mapping should be.
In the RA sc_data file, 473 mouse genes map to multiple human genes, out of the 15721 genes found in the ortholog table.

For simplicity, we chose to exclude these genes (i.e. about 3% of the genes found in the ortholog table). This is unlikely to make a big difference in the analysis.
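
The mapping rule described above can be sketched on a toy table. This is a minimal illustration under stated assumptions, not the implementation used in this notebook: the rows are made up, `ToyMulti` and `ToyMouseOnly` are hypothetical gene names, and only the column names 'Gene name' and 'Human gene name' follow the ortholog file. A mouse gene is renamed only when exactly one human gene is listed for it.

```python
import pandas as pd

# Toy ortholog table (made-up rows; ToyMulti is a hypothetical gene)
orthologs = pd.DataFrame({
    'Gene name': ['Acta2', 'Zyx', 'ToyMulti', 'ToyMulti'],
    'Human gene name': ['ACTA2', 'ZYX', 'TOYMULTI1', 'TOYMULTI2'],
})

# Toy list of mouse genes to convert (ToyMouseOnly is hypothetical)
mouse_genes = pd.Series(['Acta2', 'Zyx', 'ToyMulti', 'ToyMouseOnly'])

# Number of rows listed in the ortholog table for each mouse gene
n_human = orthologs.groupby('Gene name')['Human gene name'].size()

# Keep only the mouse genes with exactly one listed human gene
one_to_one = orthologs[orthologs['Gene name'].map(n_human) == 1]
mapping = one_to_one.set_index('Gene name')['Human gene name']

# Genes without an unambiguous mapping become NaN
human_names = mouse_genes.map(mapping)

converted = int(mouse_genes.isin(mapping.index).sum())
multiple = int(mouse_genes.isin(n_human[n_human > 1].index).sum())
mouse_only = int((~mouse_genes.isin(orthologs['Gene name'])).sum())
print(converted, multiple, mouse_only)
```

Building the mapping once and applying it with `map` avoids searching the ortholog table for every gene, which may matter for tables with tens of thousands of rows.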

In [3]:
# Convert mouse gene names into human gene names
mouse_genes = sc_data['Unnamed: 0'].tolist()

converted = 0
mouse_only = 0
multiple_genes = 0
for gene in tqdm(mouse_genes):
    # All human gene names listed for this mouse gene in the ortholog table
    human_orthologs = mus_to_hum_reference_genes.loc[
        mus_to_hum_reference_genes['Gene name'] == gene, 'Human gene name'].values
    if len(human_orthologs) == 1:
        # Unambiguous mapping: rename the gene. The result is assigned back
        # because replace(..., inplace=True) on a column selection may not
        # update sc_data in recent pandas versions.
        sc_data['Unnamed: 0'] = sc_data['Unnamed: 0'].replace(gene, human_orthologs[0])
        converted += 1
    elif len(human_orthologs) > 1:
        # Several human genes listed: excluded from the conversion
        multiple_genes += 1
    else:
        # Not present in the ortholog table
        mouse_only += 1
total_genes = mouse_only + converted + multiple_genes
percentage = (100*converted)/len(mouse_genes)

print('There are ', total_genes, 'total gene:')
print(' - ', converted, ' were converted (', percentage, '% )')
print(' - ', multiple_genes, ' had multiple isoforms and were excluded')
print(' - ', mouse_only, ' were mouse-specific')


  0%|          | 0/22966 [00:00<?, ?it/s]

100%|██████████| 22966/22966 [00:48<00:00, 469.09it/s]

There are  22966 total gene:
 -  15248  were converted ( 66.39379952973961 % )
 -  473  had multiple isoforms and were excluded
 -  7245  were mouse-specific





66% of genes were converted in RA:
- 15248 were converted (66.4%)
- 473 had multiple isoforms and were excluded
- 7245 were mouse-specific

58% of genes were converted in SN:
- 16290 were converted (58.2%)
- 601 had multiple isoforms and were excluded
- 11107 were mouse-specific
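
The percentages above follow directly from the three counts for each tissue. As a quick check, the sketch below recomputes them from the numbers printed in this notebook; the dictionary layout and variable names are only illustrative.

```python
# Counts copied from the conversion output for each tissue
counts = {
    'RA': {'converted': 15248, 'multiple': 473, 'mouse_only': 7245},
    'SN': {'converted': 16290, 'multiple': 601, 'mouse_only': 11107},
}

results = {}
for tissue, c in counts.items():
    total = sum(c.values())
    # Genes that have at least one entry in the ortholog table
    in_table = c['converted'] + c['multiple']
    pct_converted = 100 * c['converted'] / total
    pct_excluded = 100 * c['multiple'] / in_table
    results[tissue] = (total, round(pct_converted, 1), round(pct_excluded, 1))
    print(f"{tissue}: {total} genes, {pct_converted:.1f}% converted, "
          f"{pct_excluded:.1f}% of genes in the ortholog table excluded")
```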

## Polish the resulting dataset
- Trim rows with mouse genes that weren't converted
- Remove rows with NaN values
- Keep only the first occurrence of each duplicated gene (duplicates exist because different mouse genes may map to the same human gene... nothing is ever easy)
- Sort the dataframe alphabetically
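
The four steps listed above can be illustrated on a toy frame. This is a sketch with made-up values and hypothetical gene names (`ToyA`, `ToyB`, `ToyMouseOnly` and `ToyNoName` are not real genes); the column name 'Unnamed: 0' mirrors the gene-name column in this notebook.

```python
import numpy as np
import pandas as pd

# Gene names before conversion (toy example)
mouse_genes = ['Zyx', 'Acta2', 'ToyA', 'ToyB', 'ToyMouseOnly', 'ToyNoName']

# After conversion: ToyA and ToyB ended up under the same human name,
# ToyMouseOnly was left unchanged and ToyNoName received a missing name
toy = pd.DataFrame({
    'Unnamed: 0': ['ZYX', 'ACTA2', 'TOYAB', 'TOYAB', 'ToyMouseOnly', np.nan],
    'C_1': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# 1. Trim rows whose name is still one of the original mouse gene names
cleaned = toy.loc[~toy['Unnamed: 0'].isin(mouse_genes)]
# 2. Remove rows with a missing gene name
cleaned = cleaned.dropna(subset=['Unnamed: 0'])
# 3. Keep the first row of each duplicated gene
cleaned = cleaned.drop_duplicates(subset='Unnamed: 0', keep='first')
# 4. Sort alphabetically by gene name
cleaned = cleaned.sort_values('Unnamed: 0')
print(cleaned)
```

Note that keeping the first row discards the expression values of the other rows with the same human name: in this toy example, `TOYAB` keeps the value 3.0 and the row with 4.0 is dropped.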

In [24]:
# Trim the mouse genes not converted and remove NaNs
trimmed_sc_data = sc_data.loc[~sc_data['Unnamed: 0'].isin(mouse_genes)]
trimmed_sc_data = trimmed_sc_data.dropna(subset=['Unnamed: 0'])

# Deal with duplicates
if trimmed_sc_data['Unnamed: 0'].is_unique:
    print('All genes are unique')
else:
    print('These genes are not unique:')
    duplicates = trimmed_sc_data.duplicated(subset=['Unnamed: 0'])
    duplicates = trimmed_sc_data.loc[duplicates]
    display(duplicates)

    print('Before dropping duplicates the dataframe is ', trimmed_sc_data.shape)
    trimmed_sc_data = trimmed_sc_data.drop_duplicates(
        subset='Unnamed: 0', keep='first')
    print('After dropping duplicates the dataframe is ', trimmed_sc_data.shape)

# Sort alphabetically
trimmed_sc_data = trimmed_sc_data.sort_values('Unnamed: 0')
trimmed_sc_data


These genes are not unique:


Unnamed: 0.1,Unnamed: 0,C_1,C_2,C_3,C_4,C_5,C_6,C_7,C_8,C_9,...,C_4995,C_4996,C_4997,C_4998,C_4999,C_5000,C_5001,C_5002,C_5003,C_5004
291,DNAH7,0.000000,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0
330,DNAH7,0.000000,0.0,0.788676,0.0,0.0,0.0,0.000000,0.0,1.397498,...,0.0,0.000000,0.000000,0.992648,1.060886,0.0,0.0,0.0,0.000000,0.0
696,UGT1A6,0.000000,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0
697,UGT1A6,0.000000,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0
829,SLCO6A1,0.000000,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27622,IFIT1B,0.000000,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.000000,0.000000,0.992648,0.000000,0.0,0.0,0.0,0.000000,0.0
27623,IFIT1B,1.727283,0.0,0.788676,0.0,0.0,0.0,0.622328,0.0,0.000000,...,0.0,1.325877,1.132385,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0
27740,SCD,0.000000,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.973186,0.0
27826,INS,0.000000,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0


Before dropping duplicates the dataframe is  (16265, 5005)
After dropping duplicates the dataframe is  (15579, 5005)


Unnamed: 0.1,Unnamed: 0,C_1,C_2,C_3,C_4,C_5,C_6,C_7,C_8,C_9,...,C_4995,C_4996,C_4997,C_4998,C_4999,C_5000,C_5001,C_5002,C_5003,C_5004
23161,A1BG,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000
27590,A1CF,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000
11216,A2M,1.197756,0.000000,0.000000,0.000000,0.000000,0.000000,0.622328,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000
7686,A3GALT2,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000
23465,A4GALT,0.000000,0.000000,0.000000,0.000000,1.441949,0.000000,0.622328,0.000000,1.397498,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7368,ZYG11A,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000
7369,ZYG11B,0.000000,0.000000,1.224058,0.000000,0.000000,1.340063,1.278091,1.381439,1.958704,...,0.000000,1.876556,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.973186,0.000000
10425,ZYX,0.000000,0.000000,0.000000,1.337157,0.000000,0.000000,1.278091,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,1.060886,0.0,0.000000,0.000000,0.000000,1.558786
19983,ZZEF1,0.000000,1.827642,0.788676,1.337157,0.000000,0.000000,1.278091,0.000000,0.000000,...,1.561608,0.000000,1.649831,0.992648,1.060886,0.0,0.000000,1.008082,0.973186,1.056583


## Export the result to a .csv file

In [25]:
# Format column names with the first one being empty
columns = trimmed_sc_data.columns.to_list()
columns[0] = ''

# Export the converted sc_data
trimmed_sc_data.to_csv('SN_converted_human_no_doublets_sc_data.csv', index=False, header=columns)
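
To see what the blank first header does, the sketch below writes a toy frame to an in-memory buffer instead of a file and reads it back. The frame and its values are made up for illustration; only the `to_csv` arguments mirror the cell above.

```python
import io

import pandas as pd

# Toy frame with the same leading columns as the converted data
toy = pd.DataFrame({
    'Unnamed: 0.1': [0, 1],
    'Unnamed: 0': ['ACTA2', 'ZYX'],
    'C_1': [0.5, 1.5],
})

# Blank out the first column name, as in the export cell
columns = toy.columns.to_list()
columns[0] = ''

# Write to an in-memory buffer rather than to disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False, header=columns)
csv_text = buffer.getvalue()

# The header row starts with an empty field
header_line = csv_text.splitlines()[0]
print(header_line)

# Reading the text back: pandas auto-generates a label for the blank header
read_back = pd.read_csv(io.StringIO(csv_text))
print(read_back.columns.to_list())
```

When the file is read back, pandas assigns an auto-generated 'Unnamed: ...' label to the blank header, which likely explains column names such as those shown in the tables of this notebook.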
