Find genes differentially expressed between intermediate progenitors (IP) and outer radial glia (oRG) in scRNA-seq data, to serve as a benchmark for the Nanostring data.

In [1]:
import pandas as pd
import scanpy as sc
import numpy as np
import diffxpy.api as de
import pickle


In [2]:
raw_counts = pd.read_csv('/nfs/team283/brainData/human_fetal/Polioudakis2019/raw_counts_mat.csv')

In [10]:
# Use the first column (gene names) as the row index, then drop it from the data.
raw_counts.index = raw_counts['Unnamed: 0'].to_numpy()
raw_counts = raw_counts.iloc[:, 1:]

KeyError: 'Unnamed: 0'

In [4]:
metadata = pd.read_csv('/nfs/team283/brainData/human_fetal/Polioudakis2019/cell_metadata.csv')

In [9]:
metadata.index = metadata['Cell']
metadata = metadata.reindex(np.array(raw_counts.columns))
metadata['Cluster'] = metadata['Cluster'].astype(str)

ValueError: cannot reindex from a duplicate axis
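
pandas raises this ValueError when the frame being reindexed has repeated labels in its index. A minimal sketch on a made-up three-row table (hypothetical barcodes, not the Polioudakis metadata) shows how one might check for repeated barcodes before reindexing. Keeping the first row per barcode is only one possible workaround and is an assumption here, not the original analysis:

```python
import pandas as pd

# Toy metadata with hypothetical barcodes; 'AAA' appears twice on purpose.
toy_meta = pd.DataFrame({
    'Cell': ['AAA', 'CCC', 'AAA'],
    'Cluster': ['IP', 'oRG', 'IP'],
})

# Count barcodes that repeat an earlier row.
n_dupes = int(toy_meta['Cell'].duplicated().sum())
print(n_dupes)

# Reindexing a frame whose index has duplicate labels raises ValueError.
toy_meta.index = toy_meta['Cell']
try:
    toy_meta.reindex(['CCC', 'AAA'])
    reindex_failed = False
except ValueError:
    reindex_failed = True

# Possible workaround: keep the first row per barcode, then reindex.
deduped = toy_meta[~toy_meta.index.duplicated(keep='first')]
aligned = deduped.reindex(['CCC', 'AAA'])
print(aligned['Cluster'].tolist())
```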

In [11]:
total_counts = np.asarray(np.sum(raw_counts, axis = 0))
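# Counts per million: divide each cell (column) by its total; broadcasting avoids hard-coding the number of cells.
cpm_counts = np.asarray(raw_counts) / total_counts * 10**6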
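
The normalisation in this cell is counts per million (CPM): each cell's counts are divided by that cell's total and multiplied by 10^6. A small sketch on a made-up genes x cells matrix illustrates the arithmetic; the numbers are invented purely for illustration:

```python
import numpy as np

# Toy counts, genes x cells (3 genes, 2 cells); values are made up.
toy_counts = np.array([[10, 0],
                       [30, 5],
                       [60, 15]], dtype=float)

# Total counts per cell (column sums), shape (n_cells,).
totals = toy_counts.sum(axis=0)

# Broadcasting divides every column by its own total.
toy_cpm = toy_counts / totals * 1e6
print(toy_cpm)
```

After this step every column sums to one million, which gives a quick sanity check on the real matrix as well.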

In [25]:
adata = sc.AnnData(X=np.array(cpm_counts).T, obs=metadata, var=pd.DataFrame(index=raw_counts.index))

Observation names are not unique. To make them unique, call `.obs_names_make_unique`.


In [26]:
# Keep only the progenitor clusters; .copy() turns the view into an independent AnnData object.
adata = adata[adata.obs['Cluster'].isin(['oRG', 'IP', 'PgG2M', 'PgS']).to_numpy(), :].copy()

In [27]:
# Relabel progenitor subclusters as IP or oRG; assigning the whole column avoids chained indexing.
ip_subclusters = ['IP_0', 'IP_1', 'IP_2', 'IP_3', 'PgG2M_1', 'PgG2M_2', 'PgG2M_3', 'PgG2M_4',
                  'PgS_0', 'PgS_1', 'PgS_2', 'PgS_3']
cluster = adata.obs['Cluster'].to_numpy().copy()
cluster[adata.obs['Subcluster'].isin(ip_subclusters).to_numpy()] = 'IP'
cluster[adata.obs['Subcluster'].isin(['PgG2M_0', 'PgS_4']).to_numpy()] = 'oRG'
adata.obs['Cluster'] = cluster


In [28]:
adata.obs['Cluster']

Cell
GCGAGGCTCGTG     IP
CCAGCTGTCCAA    oRG
GCAACCTATACA    oRG
GCTCGACCTGAA     IP
TCGGATCTAGGG    oRG
               ... 
TAGTAGAGCGTA     IP
GCACGCATCTGA     IP
GGTACCTATGAC     IP
TCTATTGAGAAG     IP
CATACAGTTGAG     IP
Name: Cluster, Length: 5370, dtype: object

In [31]:
test_rank = de.test.rank_test(
    data=adata,
    grouping="Cluster",
    gene_names=raw_counts.index
)
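
To my understanding, `rank_test` in diffxpy runs a Wilcoxon rank-sum (Mann-Whitney U) test per gene between the two groups. The sketch below illustrates that kind of test with `scipy.stats.mannwhitneyu` on simulated data. It is not diffxpy's implementation, and the group sizes and distributions are made up:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Simulated expression, cells x genes, for two groups (made-up data).
# Gene 0 is shifted upwards in group A; gene 1 has the same distribution in both groups.
group_a = np.column_stack([rng.normal(5.0, 1.0, 200), rng.normal(2.0, 1.0, 200)])
group_b = np.column_stack([rng.normal(3.0, 1.0, 200), rng.normal(2.0, 1.0, 200)])

# One two-sided rank test per gene (column).
pvals = np.array([
    mannwhitneyu(group_a[:, g], group_b[:, g], alternative='two-sided').pvalue
    for g in range(group_a.shape[1])
])
print(pvals)
```

The test only uses the ranks of the values, so it makes no assumption about the shape of the expression distribution.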

In [32]:
res = test_rank.summary()

In [33]:
with open("/nfs/team283/aa16/KR_NAS/data/oRG_IP_scRNAseq_DE_cpm_results.p", "wb") as f:
    pickle.dump(res, f)

Briefly check how many cells these results are based on:

In [29]:
np.sum(adata.obs['Cluster'] == 'IP')

3865

In [30]:
np.sum(adata.obs['Cluster'] == 'oRG')

1505

Briefly check how many differentially expressed genes we have:

In [34]:
# Significant genes (q < 0.05) with log2fc > 0.5, then all significant genes.
print(((res['qval'] < 0.05) & (res['log2fc'] > 0.5)).sum())
print((res['qval'] < 0.05).sum())

4819
9140


In [35]:
res

                  gene          pval          qval        log2fc       mean  zero_mean  zero_variance
0               TSPAN6  3.494090e-25  8.984762e-24      0.895070  57.871228      False          False
1                 DPM1  9.441821e-01  9.605539e-01     -0.423651  76.870863      False          False
2                SCYL3  5.433284e-02  1.469695e-01      0.049111  11.085968      False          False
3             C1orf112  3.772491e-01  5.743615e-01     -0.277601  11.319888      False          False
4                  FGR           NaN           NaN      0.000000   0.000000       True           True
...                ...           ...           ...           ...        ...        ...            ...
35538  ENSG00000276144           NaN           NaN      0.000000   0.000000       True           True
35539       SNORD114-7  5.328915e-01  6.505033e-01  -1069.313793   0.027957      False          False
35540          ZNF965P           NaN           NaN      0.000000   0.000000       True           True
35541          GOLGA8K  1.091298e-01  2.386600e-01   1071.703595   0.057053      False          False
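
The two rows with log2fc around ±1070 (SNORD114-7 and GOLGA8K) have a mean expression below 0.1, so these fold changes presumably reflect a near-zero group mean rather than a real effect. A hedged sketch of how one might combine the q-value cut-off with an absolute fold-change threshold and a minimum mean follows. The `min_mean` value of 1.0 is an arbitrary assumption, not part of the original analysis, and the small table copies four rows from the output above:

```python
import pandas as pd

# Four rows copied from the summary shown above.
toy_res = pd.DataFrame({
    'gene':   ['TSPAN6', 'DPM1', 'SNORD114-7', 'GOLGA8K'],
    'qval':   [8.984762e-24, 9.605539e-01, 6.505033e-01, 2.386600e-01],
    'log2fc': [0.895070, -0.423651, -1069.313793, 1071.703595],
    'mean':   [57.871228, 76.870863, 0.027957, 0.057053],
})

min_mean = 1.0  # assumed threshold, chosen only for illustration

significant = toy_res['qval'] < 0.05
large_effect = toy_res['log2fc'].abs() > 0.5  # both directions of change
expressed = toy_res['mean'] > min_mean

hits = toy_res[significant & large_effect & expressed]
print(hits['gene'].tolist())
```

Using the absolute value of log2fc counts genes that are higher in either group, whereas the check above counts only genes with positive log2fc.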
