### Single cell count data for highly expressed genes in Islet $\alpha$ cells

Nuha BinTayyash, 2020

This notebook shows how to extract count data for genes highly expressed in Islet $\alpha$ cells from the [GSE87375 single-cell RNA-seq](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE87375) dataset.

In [1]:
import pandas as pd
import numpy as np

### Read scRNA-seq gene expression data

The gene expression data were quantified as transcripts per million (TPM). Download the data from [GSE87375_Single_Cell_RNA-seq_Gene_TPM.txt.gz](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE87375). The cell below reads it from a local copy saved as `GSE87375_Single_Cell_RNA-seq_Gene_TPM.csv`.

In [2]:
counts = pd.read_csv('GSE87375_Single_Cell_RNA-seq_Gene_TPM.csv',index_col=[0]) 
counts.head()

| ID | Symbol | bE17.5_1_01 | bE17.5_1_02 | bE17.5_1_03 | bE17.5_1_04 | bE17.5_1_05 | bE17.5_2_01 | bE17.5_2_02 | bE17.5_2_03 | bE17.5_2_04 | ... | aE17.5_2_22 | aE17.5_2_23 | aE17.5_4_07 | aE17.5_4_08 | aP0_2_12 | aP0_2_13 | aP0_2_14 | aP0_3_15 | aP0_3_16 | aP18_2_14 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ENSMUSG00000000001 | Gnai3 | 106.11 | 147.1 | 67.15 | 113.42 | 114.49 | 18.11 | 99.58 | 128.12 | 166.99 | ... | 60.14 | 103.79 | 145.43 | 162.15 | 39.02 | 54.64 | 86.86 | 66.66 | 157.06 | 18.9 |
| ENSMUSG00000000003 | Pbsn | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ENSMUSG00000000028 | Cdc45 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 27.61 | ... | 0.0 | 0.0 | 22.83 | 28.15 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ENSMUSG00000000031 | H19 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.4 | 0.0 | 0.0 | ... | 63.69 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ENSMUSG00000000037 | Scml2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
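
If only the downloaded `.txt.gz` file is at hand, pandas can read it without converting it to CSV first. The sketch below is a minimal illustration, assuming the GEO file is tab-delimited with gene IDs in the first column (this layout is an assumption, not confirmed here). It writes a tiny stand-in file with the hypothetical name `toy_TPM.txt.gz` so that it runs on its own.

```python
import gzip
import os
import tempfile

import pandas as pd

# Build a tiny tab-delimited, gzip-compressed stand-in for the GEO file.
# The tab delimiter and column layout are assumptions about the real download.
toy_text = (
    "ID\tSymbol\tbE17.5_1_01\tbE17.5_1_02\n"
    "ENSMUSG00000000001\tGnai3\t106.11\t147.1\n"
    "ENSMUSG00000000003\tPbsn\t0.0\t0.0\n"
)
toy_path = os.path.join(tempfile.mkdtemp(), "toy_TPM.txt.gz")
with gzip.open(toy_path, "wt") as handle:
    handle.write(toy_text)

# sep="\t" selects tab-delimited parsing; compression would also be inferred
# from the ".gz" suffix, but it is stated explicitly here for clarity.
toy_counts = pd.read_csv(toy_path, sep="\t", compression="gzip", index_col=0)
print(toy_counts)
```

For the real file, the same `read_csv` call would take the downloaded file name in place of `toy_path`, provided the delimiter assumption holds.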


#### Drop ERCC spike-ins

In [3]:
# Drop ERCC spike-in rows; this assumes their IDs in the index start with 'ERCC'
counts = counts[~counts.index.astype(str).str.startswith('ERCC')]
counts['Symbol'] = counts['Symbol'].astype(str)
counts.head()

| ID | Symbol | bE17.5_1_01 | bE17.5_1_02 | bE17.5_1_03 | bE17.5_1_04 | bE17.5_1_05 | bE17.5_2_01 | bE17.5_2_02 | bE17.5_2_03 | bE17.5_2_04 | ... | aE17.5_2_22 | aE17.5_2_23 | aE17.5_4_07 | aE17.5_4_08 | aP0_2_12 | aP0_2_13 | aP0_2_14 | aP0_3_15 | aP0_3_16 | aP18_2_14 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ENSMUSG00000000001 | Gnai3 | 106.11 | 147.1 | 67.15 | 113.42 | 114.49 | 18.11 | 99.58 | 128.12 | 166.99 | ... | 60.14 | 103.79 | 145.43 | 162.15 | 39.02 | 54.64 | 86.86 | 66.66 | 157.06 | 18.9 |
| ENSMUSG00000000003 | Pbsn | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ENSMUSG00000000028 | Cdc45 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 27.61 | ... | 0.0 | 0.0 | 22.83 | 28.15 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ENSMUSG00000000031 | H19 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.4 | 0.0 | 0.0 | ... | 63.69 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ENSMUSG00000000037 | Scml2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


#### Get count data for highly expressed genes in Islet $\alpha$ cells [from Table S2, Differentially Expressed Genes between β Lineage and α Lineage](https://www.cell.com/cell-metabolism/fulltext/S1550-4131(17)30208-5)

Load the names of genes highly expressed in $\alpha$ cells.

In [4]:
DE_alpha = pd.read_csv('Sheet 2_Genes highly expressed in α-lineage.csv',index_col=[0])
DE_alpha.head()

| ID | Symbol | Description | Transcription Factor | log2FoldChange | Adjusted p-value |
| --- | --- | --- | --- | --- | --- |
| ENSMUSG00000000394 | Gcg | glucagon | 0.0 | -11.282196 | 0.0 |
| ENSMUSG00000001504 | Irx2 | Iroquois related homeobox 2 (Drosophila) | 1.0 | -9.326386 | 0.0 |
| ENSMUSG00000004631 | Sgce | sarcoglycan, epsilon | 0.0 | -9.84103 | 0.0 |
| ENSMUSG00000027524 | Edn3 | endothelin 3 | NaN | -9.563078 | 0.0 |
| ENSMUSG00000027971 | Ndst4 | N-deacetylase/N-sulfotransferase (heparin gluc... | NaN | -7.36242 | 0.0 |


Get counts for highly expressed genes in $\alpha$ cells

In [5]:
DE_alpha_counts = counts.loc[DE_alpha.index.values]
DE_alpha_counts = DE_alpha_counts.set_index('Symbol')
DE_alpha_counts.head()

| Symbol | bE17.5_1_01 | bE17.5_1_02 | bE17.5_1_03 | bE17.5_1_04 | bE17.5_1_05 | bE17.5_2_01 | bE17.5_2_02 | bE17.5_2_03 | bE17.5_2_04 | bE17.5_2_05 | ... | aE17.5_2_22 | aE17.5_2_23 | aE17.5_4_07 | aE17.5_4_08 | aP0_2_12 | aP0_2_13 | aP0_2_14 | aP0_3_15 | aP0_3_16 | aP18_2_14 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gcg | 0.0 | 0.0 | 0.0 | 3.74 | 0.0 | 9.96 | 0.0 | 12.13 | 0.0 | 0.89 | ... | 41725.25 | 273684.9 | 103753.74 | 130910.21 | 310587.7 | 292315.62 | 151614.86 | 258446.33 | 247865.8 | 386980.13 |
| Irx2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 102.76 | 172.83 | 44.62 | 114.15 | 141.2 | 21.58 | 17.56 | 75.03 | 52.34 | 27.31 |
| Sgce | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 160.74 | 37.1 | 141.41 | 34.55 | 4.43 | 0.0 | 16.27 | 50.58 | 46.58 | 18.15 |
| Edn3 | 0.0 | 5.26 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 32.55 | 52.32 | 153.56 | 253.53 | 104.14 | 13.35 | 118.29 | 352.15 | 21.62 |
| Ndst4 | 1.73 | 0.21 | 4.55 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 2.69 | 15.22 | 29.69 | 18.02 | 0.25 | 13.48 | 18.77 | 22.15 | 36.46 | 28.46 |
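
Selecting rows with `.loc` and a list of labels raises a `KeyError` in recent pandas versions when any label is missing from the index, so it can help to check the overlap first. The following is a minimal sketch on toy data (the two small frames and the ID `ENSMUSG99999999999` are hypothetical) that uses `Index.intersection` to keep only the IDs present in both tables.

```python
import pandas as pd

# Toy stand-ins for `counts` and `DE_alpha`; the values are illustrative only.
toy_counts = pd.DataFrame(
    {"Symbol": ["Gcg", "Irx2", "Gnai3"], "cell_1": [41725.25, 102.76, 60.14]},
    index=pd.Index(
        ["ENSMUSG00000000394", "ENSMUSG00000001504", "ENSMUSG00000000001"],
        name="ID",
    ),
)
toy_de = pd.DataFrame(
    {"Symbol": ["Gcg", "Irx2", "Missing"]},
    index=pd.Index(
        ["ENSMUSG00000000394", "ENSMUSG00000001504", "ENSMUSG99999999999"],
        name="ID",
    ),
)

# IDs found in both tables, and IDs from the DE list absent from the counts.
shared_ids = toy_de.index.intersection(toy_counts.index)
missing_ids = toy_de.index.difference(toy_counts.index)
print("missing:", list(missing_ids))

# Selecting only the shared IDs avoids the KeyError for missing labels.
toy_de_counts = toy_counts.loc[shared_ids].set_index("Symbol")
print(toy_de_counts)
```

Reporting `missing_ids` before selecting makes it visible when a gene from the differential expression table has no row in the expression matrix.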


### Pseudotime information 

In [6]:
alpha_pseudotime = pd.read_csv('alpha_pca.csv',index_col=[0])
alpha_pseudotime.head()

| Sample | Pseudotime |
| --- | --- |
| aE17.5_2_09 | -14.189071 |
| aE17.5_2_16 | -14.073731 |
| aE17.5_1_11 | -13.706203 |
| aE17.5_3_07 | -13.696796 |
| aE17.5_4_06 | -13.539157 |


Drop $\beta$ cells from the counts of genes highly expressed in $\alpha$ cells by keeping only the cells that have a pseudotime value.

In [7]:
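DE_alpha_counts = DE_alpha_counts[alpha_pseudotime.index.values]  # keep only cells with pseudotime, in the order of alpha_pseudotime
DE_alpha_counts = DE_alpha_counts.astype('int')    # truncate TPM values to whole numbers
DE_alpha_counts = DE_alpha_counts.astype('float')  # store the truncated values as floats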
DE_alpha_counts.head()

| Symbol | aE17.5_2_09 | aE17.5_2_16 | aE17.5_1_11 | aE17.5_3_07 | aE17.5_4_06 | aE17.5_3_04 | aE17.5_2_11 | aE17.5_1_25 | aE17.5_4_01 | aE17.5_4_03 | ... | aP18_3_12 | aP60_1_11 | aP60_3_05 | aP15_1_15 | aP60_1_13 | aP60_3_08 | aP60_5_16 | aP18_1_17 | aP60_1_10 | aP60_5_05 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gcg | 245069.0 | 156201.0 | 174110.0 | 264359.0 | 96636.0 | 145083.0 | 160109.0 | 108540.0 | 261634.0 | 167305.0 | ... | 549279.0 | 468779.0 | 645307.0 | 456730.0 | 660118.0 | 450766.0 | 637129.0 | 414055.0 | 427924.0 | 654186.0 |
| Irx2 | 12.0 | 117.0 | 49.0 | 70.0 | 97.0 | 71.0 | 32.0 | 77.0 | 24.0 | 14.0 | ... | 4.0 | 45.0 | 27.0 | 17.0 | 0.0 | 0.0 | 0.0 | 2.0 | 22.0 | 26.0 |
| Sgce | 34.0 | 66.0 | 33.0 | 79.0 | 144.0 | 106.0 | 38.0 | 115.0 | 39.0 | 190.0 | ... | 0.0 | 54.0 | 44.0 | 51.0 | 0.0 | 44.0 | 0.0 | 42.0 | 0.0 | 35.0 |
| Edn3 | 103.0 | 9.0 | 0.0 | 56.0 | 6.0 | 35.0 | 0.0 | 39.0 | 78.0 | 0.0 | ... | 19.0 | 3.0 | 4.0 | 19.0 | 73.0 | 0.0 | 0.0 | 12.0 | 96.0 | 0.0 |
| Ndst4 | 12.0 | 0.0 | 0.0 | 0.0 | 15.0 | 4.0 | 0.0 | 10.0 | 23.0 | 0.0 | ... | 37.0 | 19.0 | 24.0 | 26.0 | 0.0 | 9.0 | 17.0 | 45.0 | 0.0 | 22.0 |
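
Casting to `int` truncates toward zero rather than rounding, so a TPM of 0.89 becomes 0 and 3.74 becomes 3. The sketch below illustrates this on four TPM values that appear in the `Gcg` row of the earlier `DE_alpha_counts.head()` output, and contrasts truncation with rounding to the nearest integer. The `round()` variant is shown only for comparison; it is not what this notebook does.

```python
import pandas as pd

# Four TPM values from the Gcg row shown earlier.
tpm = pd.Series([3.74, 9.96, 12.13, 0.89], name="Gcg")

# Same two-step cast as in the notebook: int drops the fractional part,
# float then stores the whole numbers as floating point.
truncated = tpm.astype("int").astype("float")

# Alternative for comparison only: round to the nearest integer.
rounded = tpm.round().astype("float")

print(truncated.tolist())  # [3.0, 9.0, 12.0, 0.0]
print(rounded.tolist())    # [4.0, 10.0, 12.0, 1.0]
```

The difference matters mostly for low-expression values, where truncation turns anything below 1 into a zero count.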


In [8]:
DE_alpha_counts.to_csv('alpha_counts.csv')

Scale the $\alpha$ pseudotime to the range [0, 1] with min-max scaling

In [9]:
X = alpha_pseudotime['Pseudotime'].values
X_scaled_alpha = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0)) 

In [10]:
time_points = pd.DataFrame(data= X_scaled_alpha,index= alpha_pseudotime['Pseudotime'].index,columns=['pseudotime'])
time_points.to_csv('alpha_time_points.csv')
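
The scaling used here is min-max normalisation, $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$, which maps the earliest cell to 0 and the latest to 1 while preserving the ordering of cells. A minimal check, using only the five pseudotime values shown in the `alpha_pseudotime.head()` output (so the scaled numbers differ from those of the full dataset):

```python
import numpy as np

# First five pseudotime values from the alpha_pseudotime.head() output.
x = np.array([-14.189071, -14.073731, -13.706203, -13.696796, -13.539157])

# Min-max scaling: subtract the minimum, divide by the range.
x_scaled = (x - x.min()) / (x.max() - x.min())

print(x_scaled.min(), x_scaled.max())  # 0.0 1.0
# The ordering of the cells is unchanged by the scaling.
print(bool(np.all(np.diff(x_scaled) > 0)))  # True
```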