### Single-cell count data for highly expressed genes in islet $\beta$ cells

Nuha BinTayyash, 2020

This notebook shows how to extract count data for genes that are highly expressed in islet $\beta$ cells from the [GSE87375 single-cell RNA-seq](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE87375) dataset.

In [1]:
import pandas as pd
import numpy as np

### Read scRNA-seq gene expression data

The gene expression data were quantified as transcripts per million (TPM). Download [GSE87375_Single_Cell_RNA-seq_Gene_TPM.txt.gz](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE87375) from GEO; the cell below assumes the table has been saved locally as `tpm.geo.csv`.

In [2]:
counts = pd.read_csv('tpm.geo.csv',index_col=[0]) 
counts.head()

ID,Symbol,bE17.5_1_01,bE17.5_1_02,bE17.5_1_03,bE17.5_1_04,bE17.5_1_05,bE17.5_2_01,bE17.5_2_02,bE17.5_2_03,bE17.5_2_04,...,aE17.5_2_22,aE17.5_2_23,aE17.5_4_07,aE17.5_4_08,aP0_2_12,aP0_2_13,aP0_2_14,aP0_3_15,aP0_3_16,aP18_2_14
ENSMUSG00000000001,Gnai3,106.11,147.1,67.15,113.42,114.49,18.11,99.58,128.12,166.99,...,60.14,103.79,145.43,162.15,39.02,54.64,86.86,66.66,157.06,18.9
ENSMUSG00000000003,Pbsn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ENSMUSG00000000028,Cdc45,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,27.61,...,0.0,0.0,22.83,28.15,0.0,0.0,0.0,0.0,0.0,0.0
ENSMUSG00000000031,H19,0.0,0.0,0.0,0.0,0.0,0.0,0.4,0.0,0.0,...,63.69,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ENSMUSG00000000037,Scml2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Drop ERCC spike-in 

In [5]:
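# Cast the symbols to strings; the ERCC spike-in rows appear at the end of the table.
# Nothing is dropped in this cell: the spike-ins are excluded later,
# when rows are selected by the Ensembl IDs of the highly expressed genes.
counts['Symbol'] = counts['Symbol'].astype(str)
counts['Symbol']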

ID
ENSMUSG00000000001         Gnai3
ENSMUSG00000000003          Pbsn
ENSMUSG00000000028         Cdc45
ENSMUSG00000000031           H19
ENSMUSG00000000037         Scml2
                         ...    
ERCC-00164            ERCC-00164
ERCC-00165            ERCC-00165
ERCC-00168            ERCC-00168
ERCC-00170            ERCC-00170
ERCC-00171            ERCC-00171
Name: Symbol, Length: 40916, dtype: object
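
The cell above only inspects the symbols. If the spike-in rows should be removed explicitly, one option is to filter on the `ERCC-` prefix visible in the output. The following is a minimal sketch on a toy table; `toy_counts`, `is_spike_in`, `genes_only`, and the spike-in values are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the `counts` table (spike-in values are made up).
toy_counts = pd.DataFrame(
    {
        "Symbol": ["Gnai3", "Pbsn", "ERCC-00164", "ERCC-00165"],
        "cell_1": [106.11, 0.0, 12.5, 3.2],
        "cell_2": [147.10, 0.0, 8.1, 0.0],
    },
    index=pd.Index(
        ["ENSMUSG00000000001", "ENSMUSG00000000003", "ERCC-00164", "ERCC-00165"],
        name="ID",
    ),
)

# Flag rows whose symbol starts with the ERCC prefix, then keep the rest.
is_spike_in = toy_counts["Symbol"].str.startswith("ERCC-")
genes_only = toy_counts.loc[~is_spike_in]
print(genes_only)
```

Filtering on the symbol rather than the index works here because the spike-in rows carry the same `ERCC-` name in both places.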

#### Get count data for highly expressed genes in islet $\beta$ cells [from Table S2, Differentially Expressed Genes between β Lineage and α Lineage](https://www.cell.com/cell-metabolism/fulltext/S1550-4131(17)30208-5)

Load the names of the genes that are highly expressed in $\beta$ cells

In [6]:
DE_beta = pd.read_csv('Sheet 2_Genes highly expressed in β-lineage.csv',index_col=[0])
DE_beta.head()

ID,Symbol,Description,Transcription Factor,log2FoldChange,Adjusted p-value
ENSMUSG00000044988,Ucn3,urocortin 3,0.0,9.191703,0.0
ENSMUSG00000024027,Glp1r,glucagon-like peptide 1 receptor,,8.919029,0.0
ENSMUSG00000027004,Frzb,frizzled-related protein,,8.882942,0.0
ENSMUSG00000035804,Ins1,insulin I,,8.78263,0.0
ENSMUSG00000027690,Slc2a2,solute carrier family 2 (facilitated glucose t...,,8.742927,0.0


Get counts for highly expressed genes in $\beta$ cells

In [7]:
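# Select the rows of `counts` whose Ensembl IDs appear in the DE gene table
DE_beta_counts = counts.loc[DE_beta.index.values]
# Index the result by gene symbol instead of Ensembl ID
DE_beta_counts = DE_beta_counts.set_index('Symbol')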
DE_beta_counts.head()

Symbol,bE17.5_1_01,bE17.5_1_02,bE17.5_1_03,bE17.5_1_04,bE17.5_1_05,bE17.5_2_01,bE17.5_2_02,bE17.5_2_03,bE17.5_2_04,bE17.5_2_05,...,aE17.5_2_22,aE17.5_2_23,aE17.5_4_07,aE17.5_4_08,aP0_2_12,aP0_2_13,aP0_2_14,aP0_3_15,aP0_3_16,aP18_2_14
Ucn3,54.1,43.55,0.0,112.1,0.58,25.32,0.0,146.84,196.01,3.28,...,0.0,0.0,0.0,0.0,0.0,0.0,31.95,35.18,0.0,0.0
Glp1r,82.15,132.06,93.1,63.18,178.76,13.31,87.67,18.83,8.96,17.05,...,0.0,0.0,0.0,0.0,0.0,0.0,38.74,0.0,0.0,7.75
Frzb,211.13,92.55,195.6,24.53,46.47,51.29,83.56,37.25,168.44,226.22,...,0.0,0.0,9.42,0.0,31.54,0.0,104.61,0.0,0.0,52.35
Ins1,84093.06,81966.83,88813.1,53633.93,96649.9,140977.56,84988.75,26767.14,127895.34,55580.3,...,1.51,60.59,106.52,19744.83,108.75,19.75,125395.57,15694.94,29.58,232.81
Slc2a2,127.63,398.18,622.37,432.56,323.57,425.24,394.28,604.61,451.2,285.56,...,9.56,0.0,0.0,0.0,0.0,0.0,30.67,14.74,0.0,0.0
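
Selecting rows with `.loc` and a list of labels raises a `KeyError` if any label is missing from the index, so it can help to check the overlap between the two tables first. The sketch below illustrates one way to do that on toy data; `toy_counts`, `toy_de_ids`, and the ID `ENSMUSG_HYPOTHETICAL` are invented for illustration.

```python
import pandas as pd

# Toy expression table indexed by Ensembl ID.
toy_counts = pd.DataFrame(
    {"Symbol": ["Ucn3", "Glp1r", "Gnai3"], "cell_1": [54.1, 82.15, 106.11]},
    index=pd.Index(
        ["ENSMUSG00000044988", "ENSMUSG00000024027", "ENSMUSG00000000001"],
        name="ID",
    ),
)

# Toy list of DE gene IDs; the last one is deliberately absent from the table.
toy_de_ids = pd.Index(
    ["ENSMUSG00000044988", "ENSMUSG00000024027", "ENSMUSG_HYPOTHETICAL"],
    name="ID",
)

# Split the requested IDs into those present in the table and those missing.
present = toy_de_ids.isin(toy_counts.index)
shared_ids = toy_de_ids[present]
missing_ids = toy_de_ids[~present]

# Select only the IDs that exist, keeping the order of the DE gene list.
selected = toy_counts.loc[shared_ids].set_index("Symbol")
print(selected)
print("Missing IDs:", list(missing_ids))
```

Reporting the missing IDs, rather than silently skipping them, makes it easier to spot annotation mismatches between the two files.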


### Pseudotime information 

In [8]:
beta_pseudotime = pd.read_csv('beta_pca.csv',index_col=[0])
beta_pseudotime.head()

Sample,Pseudotime
bE17.5_3_05,-19.127067
bE17.5_2_08,-17.841766
bE17.5_2_03,-17.585434
bE17.5_2_05,-17.477666
bE17.5_2_26,-16.961845


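Keep only the $\beta$ cells that have a pseudotime value (this drops all other cells, such as the $\alpha$-lineage columns), ordered as in `beta_pseudotime`, and truncate the TPM values to whole numbers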

In [15]:
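# Keep only the cells that have a pseudotime value, in pseudotime order
DE_beta_counts = DE_beta_counts[beta_pseudotime.index.values]
# Truncate the TPM values toward zero (e.g. 146.84 -> 146), then store them as floats
DE_beta_counts = DE_beta_counts.astype('int')
DE_beta_counts = DE_beta_counts.astype('float')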
DE_beta_counts.head()

Symbol,bE17.5_3_05,bE17.5_2_08,bE17.5_2_03,bE17.5_2_05,bE17.5_2_26,bE17.5_2_25,bE17.5_1_03,bE17.5_2_14,bE17.5_2_28,bE17.5_2_10,...,bP60_5_17,bP60_5_08,bP60_2_05,bP60_2_04,bP60_2_12,bP60_5_03,bP60_2_14,bP60_1_10,bP60_3_29,bP60_3_04
Ucn3,92.0,27.0,146.0,3.0,71.0,197.0,0.0,138.0,23.0,58.0,...,305.0,245.0,566.0,358.0,360.0,231.0,311.0,743.0,279.0,392.0
Glp1r,0.0,140.0,18.0,17.0,207.0,105.0,93.0,183.0,107.0,239.0,...,62.0,83.0,136.0,149.0,115.0,180.0,173.0,105.0,44.0,184.0
Frzb,350.0,166.0,37.0,226.0,121.0,209.0,195.0,156.0,366.0,132.0,...,66.0,49.0,40.0,66.0,40.0,111.0,73.0,56.0,99.0,89.0
Ins1,38745.0,120473.0,26767.0,55580.0,83616.0,120219.0,88813.0,117196.0,115911.0,12350.0,...,127884.0,157610.0,130417.0,137082.0,136779.0,123396.0,128148.0,144336.0,141662.0,147705.0
Slc2a2,210.0,575.0,604.0,285.0,446.0,979.0,622.0,396.0,605.0,424.0,...,328.0,396.0,232.0,277.0,203.0,160.0,278.0,225.0,172.0,361.0
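
Casting to `int` truncates toward zero rather than rounding, which is why the Ucn3 value 146.84 appears as 146.0 in the output. The short sketch below contrasts truncation with rounding on a toy table built from a few of the Ucn3 and Glp1r values shown earlier; `toy_tpm`, `truncated`, and `rounded` are illustrative names, and rounding is shown only for comparison, not as the notebook's method.

```python
import pandas as pd

# A few TPM values for two genes in two cells.
toy_tpm = pd.DataFrame(
    {"bE17.5_2_03": [146.84, 18.83], "bE17.5_2_05": [3.28, 17.05]},
    index=pd.Index(["Ucn3", "Glp1r"], name="Symbol"),
)

# Truncation: the int cast discards the fractional part.
truncated = toy_tpm.astype("int").astype("float")

# Rounding: values move to the nearest whole number instead.
rounded = toy_tpm.round().astype("float")

print(truncated)
print(rounded)
```

The two approaches differ by at most one unit per entry, so the choice matters mostly for genes with very low expression.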


In [10]:
DE_beta_counts.to_csv('beta_counts.csv')

Scale the $\beta$ pseudotime to the range $[0, 1]$ (min-max scaling)

In [11]:
X = beta_pseudotime['Pseudotime'].values
X_scaled_beta = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0)) 

In [12]:
time_points = pd.DataFrame(data=X_scaled_beta, index=beta_pseudotime.index, columns=['pseudotime'])
time_points.to_csv('beta_time_points.csv')
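
The scaling used above maps the smallest pseudotime to 0 and the largest to 1 via $(x - x_{min}) / (x_{max} - x_{min})$. As a minimal sketch, the same computation can be wrapped in a helper that also guards against a zero range; `min_max_scale` is a hypothetical helper, and the last value in `toy_pseudotime` is invented for illustration.

```python
import numpy as np


def min_max_scale(values):
    """Rescale a 1-D array linearly so its minimum is 0 and its maximum is 1."""
    values = np.asarray(values, dtype=float)
    value_range = values.max() - values.min()
    if value_range == 0:
        # All values identical: the formula would divide by zero.
        raise ValueError("min-max scaling is undefined when all values are equal")
    return (values - values.min()) / value_range


# First three values come from the pseudotime table; 5.0 is a made-up endpoint.
toy_pseudotime = np.array([-19.127067, -17.841766, -17.585434, 5.0])
scaled = min_max_scale(toy_pseudotime)
print(scaled)
```

Because the transformation is linear and increasing, the ordering of cells along pseudotime is unchanged.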