# Tutorial 1: Preprocessing upstream single-cell results

## 1. Introduction

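After some thought, I decided to write a complete analysis tutorial. Unlike previous analyses, this one starts from the files produced by `cellranger` alignment and explains the reasoning behind every step. I hope you find it useful.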
    
    by-泽华/胡磊

## 2. Importing Cellranger files

Before starting the analysis, let's look at what `cellranger` outputs. The count matrix generally comes in two forms:

- `filtered_feature_bc_matrix.h5`: an output file of `cellranger`, stored in the `outs` directory. We usually import it with `scanpy.read_10x_h5`.
- `filtered_feature_bc_matrix`: a folder containing three files, `barcodes.tsv.gz`, `features.tsv.gz`, and `matrix.mtx.gz`, also stored in the `outs` directory. We usually import it with `scanpy.read_10x_mtx`.

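Once a paper is published, one of these two forms can usually be found through the authors' Code and Data Availability section and downloaded. The data may also be provided as `.h5ad` or `.rds`: the former can be read directly by `scanpy`, while the latter is an R format, so we simply skip such data here.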
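
Since a downloaded dataset may come in either of the two forms, it can help to check which one is present before choosing a reader. The following is a minimal sketch using only the standard library; `detect_10x_format` is a hypothetical helper (not part of `scanpy` or `cellranger`), and the demo directories are created only for illustration.

```python
import tempfile
from pathlib import Path

# File names taken from the two output forms described above.
H5_NAME = "filtered_feature_bc_matrix.h5"
MTX_DIR = "filtered_feature_bc_matrix"
MTX_FILES = ("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz")


def detect_10x_format(outs_dir):
    """Return "h5", "mtx", or None depending on what `outs_dir` contains.

    Hypothetical helper: "h5" suggests scanpy.read_10x_h5, "mtx" suggests
    scanpy.read_10x_mtx. The .h5 file is preferred when both are present.
    """
    outs = Path(outs_dir)
    if (outs / H5_NAME).is_file():
        return "h5"
    mtx_dir = outs / MTX_DIR
    if mtx_dir.is_dir() and all((mtx_dir / f).is_file() for f in MTX_FILES):
        return "mtx"
    return None


# Demo on throwaway directories that mimic the two layouts.
with tempfile.TemporaryDirectory() as tmp:
    h5_outs = Path(tmp) / "h5_sample"
    h5_outs.mkdir()
    (h5_outs / H5_NAME).touch()

    mtx_outs = Path(tmp) / "mtx_sample"
    (mtx_outs / MTX_DIR).mkdir(parents=True)
    for fname in MTX_FILES:
        (mtx_outs / MTX_DIR / fname).touch()

    empty_outs = Path(tmp) / "empty_sample"
    empty_outs.mkdir()

    detected = [detect_10x_format(p) for p in (h5_outs, mtx_outs, empty_outs)]

print(detected)
```

Returning `None` instead of raising leaves the decision about missing data to the caller, which is convenient when scanning many sample directories.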

In this analysis we use the scRNA-seq and scATAC-seq data from GSE166173: four scRNA-seq runs ('SRR13633759', 'SRR13633760', 'SRR13633761', 'SRR13633762') and two scATAC-seq runs ('SRR13633766', 'SRR13633772').

I stored the outputs of cellranger and cellranger-atac in the `cellranger` and `cellranger-atac` directories under `data`, respectively.

I also compiled the metadata for each SRR run from NCBI and stored it in the `meta` directory under `data`.

In [1]:
# Import packages
import anndata
print('anndata(Ver): ',anndata.__version__)
import scanpy as sc
print('scanpy(Ver): ',sc.__version__)
import matplotlib.pyplot as plt
import matplotlib
print('matplotlib(Ver): ',matplotlib.__version__)
import seaborn as sns
print('seaborn(Ver): ',sns.__version__)
import numpy as np
print('numpy(Ver): ',np.__version__)
import pandas as pd
print('pandas(Ver): ',pd.__version__)
import scvelo as scv
print('scvelo(Ver): ',scv.__version__)
import gc
import os
current_path='/home/leihu/data/analysis/rb_tutorial/'

anndata(Ver):  0.8.0
scanpy(Ver):  1.9.1
matplotlib(Ver):  3.5.1
seaborn(Ver):  0.11.2
numpy(Ver):  1.22.3
pandas(Ver):  1.3.5
scvelo(Ver):  0.2.4


## 3. scRNA-seq preprocessing

We first load the scRNA-seq meta file. Note that the meta table is effectively the `obs` of our `anndata` object.

In [3]:
rna_meta=pd.read_csv(current_path+'data/meta/RB_rna_meta.csv',index_col=0)
rna_meta.head()

Unnamed: 0,Run,BioSample,AvgSpotLen,Bases,Bytes,Developmental_Stage,Experiment,GEO_Accession,Sample Name,source_name,Tissue
1,SRR13633752,SAMN17796859,125,62.73 G,18.89 Gb,10PCW,SRX10031184,GSM5065157,GSM5065157,Retina,Retina
2,SRR13633753,SAMN17796857,125,71.03 G,21.19 Gb,21PCW,SRX10031185,GSM5065158,GSM5065158,Retina,Retina
3,SRR13633754,SAMN17796856,132,41.03 G,16.60 Gb,20PCW,SRX10031186,GSM5065159,GSM5065159,Retina,Retina
4,SRR13633755,SAMN17796854,125,47.95 G,13.88 Gb,15PCW,SRX10031187,GSM5065160,GSM5065160,Retina,Retina
5,SRR13633756,SAMN17796853,125,22.91 G,6.58 Gb,12PCW,SRX10031188,GSM5065161,GSM5065161,Retina,Retina


Next we extract the sample information for the retinoblastoma samples.

In [4]:
srr_name=rna_meta.loc[rna_meta['Tissue']=='Retinoblastoma','Run'].values
srr_name

array(['SRR13633759', 'SRR13633760', 'SRR13633761', 'SRR13633762'],
      dtype=object)

In [17]:
# Read the first file from the cellranger output
name=srr_name[0]
print('Now read: ',name)
# Read the file
adata1=sc.read_10x_h5(current_path+'data/cellranger/{}/filtered_feature_bc_matrix.h5'.format(name))
#adata1=sc.read_10x_mtx(current_path+'data/cellranger/{}/filtered_feature_bc_matrix'.format(name))

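# We ran velocyto on every sequencing run to compute RNA velocity, so we load its output during preprocessing as well
# Note that the suffix of the velocyto output file name is not fixed, so we look the file name up with os.listdir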
a=os.listdir(current_path+'data/cellranger/{}/velocyto'.format(name))[0]
ldata = scv.read(current_path+'data/cellranger/{}/velocyto/{}'.format(name,a))


# ldata is the velocyto output; we combine it with adata1 using scv.utils.merge
scv.utils.clean_obs_names(adata1)
scv.utils.clean_obs_names(ldata)
adata1 = scv.utils.merge(adata1, ldata)

# Barcodes and gene names may contain duplicates, so we make them unique
adata1.var_names_make_unique()
adata1.obs_names_make_unique()

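# We will merge the different sequencing runs next. To avoid errors from duplicated barcodes, we append a run-specific suffix to obs.index for each SRR
# For SRR13633759, for example, the suffix is 759, obtained by splitting the run name on '633'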
adata1.obs.index=['{}-{}'.format(i,name.split('633')[1]) for i in adata1.obs.index]

# Add the meta information
adata1.obs['Tissue']='Retinoblastoma'
adata1.obs['Developmental_Stage']=rna_meta.loc[rna_meta['Run']==name,'Developmental_Stage'].values[0]
print('Read success: ',name)
adata1

Now read:  SRR13633759
Read success:  SRR13633759


AnnData object with n_obs × n_vars = 2452 × 64202
    obs: 'initial_size_spliced', 'initial_size_unspliced', 'initial_size', 'Tissue', 'Developmental_Stage'
    var: 'gene_ids', 'feature_types', 'genome', 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
    layers: 'matrix', 'ambiguous', 'spliced', 'unspliced'
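
The cell above takes the first entry returned by `os.listdir` as the velocyto output. If the `velocyto` folder ever contains other files, the first entry may not be the one intended. As a hedged alternative, the sketch below selects the file by its `.loom` extension and fails loudly when the choice is ambiguous. `find_loom` is a hypothetical helper, and the file names in the demo are made up.

```python
import tempfile
from pathlib import Path


def find_loom(velocyto_dir):
    """Return the single .loom file in `velocyto_dir`.

    Raises FileNotFoundError when zero or several .loom files are found,
    instead of silently picking whichever entry happens to come first.
    """
    looms = sorted(Path(velocyto_dir).glob("*.loom"))
    if len(looms) != 1:
        raise FileNotFoundError(
            f"expected exactly one .loom file in {velocyto_dir}, found {len(looms)}"
        )
    return looms[0]


# Demo on a throwaway directory; both file names are invented for illustration.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "notes.txt").touch()
    (Path(tmp) / "possorted_genome_bam_ABC12.loom").touch()
    loom_name = find_loom(tmp).name

print(loom_name)
```

The returned path could then be passed to `scv.read` in place of the `os.listdir(...)[0]` lookup.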

With one file read successfully, three remain. To avoid writing the same code three more times, we read them in a loop.

In [18]:
for name in srr_name:
    if 'SRR13633759' in name:
        continue
    print('Now read: ',name)
    a=os.listdir(current_path+'data/cellranger/{}/velocyto'.format(name))[0]
    ldata = scv.read(current_path+'data/cellranger/{}/velocyto/{}'.format(name,a))
    adata_test=sc.read_10x_h5(current_path+'data/cellranger/{}/filtered_feature_bc_matrix.h5'.format(name))
    scv.utils.clean_obs_names(adata_test)
    scv.utils.clean_obs_names(ldata)
    adata_test = scv.utils.merge(adata_test, ldata)
    adata_test.var_names_make_unique()
    adata_test.obs_names_make_unique()
    adata_test.obs.index=['{}-{}'.format(i,name.split('633')[1]) for i in adata_test.obs.index]
    adata_test.obs['Tissue']='Retinoblastoma'
    adata_test.obs['Developmental_Stage']=rna_meta.loc[rna_meta['Run']==name,'Developmental_Stage'].values[0]
    adata1=anndata.concat([adata1,adata_test],merge='same')
    print('Read success: ',name)
    
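    # gc.collect() forces a garbage-collection pass. Every file is read into the same adata_test variable,
    # so each overwrite leaves the previous object behind as garbage. Python would reclaim it eventually, but that can be slow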
    gc.collect()

Now read:  SRR13633760
Read success:  SRR13633760
Now read:  SRR13633761
Read success:  SRR13633761
Now read:  SRR13633762
Read success:  SRR13633762
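
To make the role of `gc.collect()` in the loop more concrete, here is a small standard-library sketch of CPython's behaviour. Objects whose reference count drops to zero are freed immediately, while objects caught in a reference cycle wait for the garbage collector. The `Node` class is a toy stand-in; whether the objects created in the loop above actually form cycles is an assumption, not something shown in this tutorial.

```python
import gc
import weakref


class Node:
    """Toy object that can point to itself, forming a reference cycle."""

    def __init__(self):
        self.me = None


gc.disable()  # keep the automatic collector out of the way for the demo
try:
    # Case 1: no cycle. Dropping the last reference frees the object at once.
    plain = Node()
    plain_ref = weakref.ref(plain)
    del plain
    plain_freed_immediately = plain_ref() is None

    # Case 2: a self-reference keeps the object alive after `del`.
    cyc = Node()
    cyc.me = cyc
    cyc_ref = weakref.ref(cyc)
    del cyc
    cycle_alive_before_collect = cyc_ref() is not None

    # An explicit collection pass breaks the cycle and frees the object.
    gc.collect()
    cycle_freed_after_collect = cyc_ref() is None
finally:
    gc.enable()

print(plain_freed_immediately, cycle_alive_before_collect, cycle_freed_after_collect)
```

Calling `gc.collect()` at the end of each iteration is therefore a cheap safeguard: it costs little when nothing is waiting and releases cyclic garbage right away when something is.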


In [20]:
# adata1 now has 22,048 cells, which shows that all four files were read successfully
adata1

AnnData object with n_obs × n_vars = 22048 × 64202
    obs: 'initial_size_spliced', 'initial_size_unspliced', 'initial_size', 'Tissue', 'Developmental_Stage'
    var: 'gene_ids', 'feature_types', 'genome', 'Accession', 'End', 'Start'
    layers: 'matrix', 'ambiguous', 'spliced', 'unspliced'
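
The barcode suffix step used in the loading cells can be illustrated in isolation. The sketch below is a simplified variation: it takes the last three characters of the run name rather than splitting on '633', which gives the same suffix for the accessions used here. `add_run_suffix` is a hypothetical helper and the barcodes are made up.

```python
def add_run_suffix(barcodes, run):
    """Append the last three characters of the run accession to each barcode."""
    suffix = run[-3:]
    return [f"{bc}-{suffix}" for bc in barcodes]


# Two samples that happen to share one barcode (invented sequences).
raw_a = ["AAACCTGAGAAACCAT", "AAACCTGAGACTAGGC"]
raw_b = ["AAACCTGAGAAACCAT", "AAACGGGTCTTTACGT"]

# Without a suffix, merging would produce a duplicated barcode.
has_duplicates_before = len(set(raw_a + raw_b)) < len(raw_a + raw_b)

sample_a = add_run_suffix(raw_a, "SRR13633759")
sample_b = add_run_suffix(raw_b, "SRR13633760")
merged = sample_a + sample_b

# With the suffix, every barcode in the merged list is unique.
all_unique_after = len(set(merged)) == len(merged)

print(merged[0], merged[2], has_duplicates_before, all_unique_after)
```

A check like `all_unique_after` can also be run on `adata1.obs_names` after concatenation to confirm that no barcodes collided.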

In [21]:
# Save the file
adata1.write_h5ad(current_path+'data/raw_data/RB-rna-Retinoblastoma.h5ad',compression='gzip')

## 4. scATAC-seq preprocessing

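Preprocessing scATAC-seq data is similar, but the files are handled differently when it comes to merging: here each sample is read and saved to its own file.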

In [23]:
atac_meta=pd.read_csv(current_path+'data/meta/RB_atac_meta.csv',index_col=0)
atac_meta.head()

Unnamed: 0,Run,BioSample,Bases,Bytes,Developmental_Stage,Experiment,GEO_Accession,Sample Name,source_name,Tissue
1,SRR13633765,SAMN17796941,36.97 G,11.22 Gb,20PCW,SRX10031197,GSM5065170,GSM5065170,Retina,Retina
2,SRR13633766,SAMN17796939,39.16 G,11.81 Gb,Retinoblastoma_4months,SRX10031198,GSM5065171,GSM5065171,Retinoblastoma,Retinoblastoma
3,SRR13633767,SAMN17796938,43.63 G,13.21 Gb,12PCW,SRX10031199,GSM5065172,GSM5065172,Retina,Retina
4,SRR13633768,SAMN17796936,44.50 G,13.35 Gb,16PCW,SRX10031200,GSM5065173,GSM5065173,Retina,Retina
5,SRR13633769,SAMN17796934,46.94 G,14.15 Gb,12PCW,SRX10031201,GSM5065174,GSM5065174,Retina,Retina


In [24]:
# Extract the sample information for the retinoblastoma samples
srr_name=atac_meta.loc[atac_meta['Tissue']=='Retinoblastoma','Run'].values
srr_name

array(['SRR13633766', 'SRR13633772'], dtype=object)

In [None]:
import episcanpy
name='SRR13633766'
adata=episcanpy.pp.read_ATAC_10x(current_path+'data/cellranger-atac/SRR13633766/filtered_peak_bc_matrix/matrix.mtx', \
                                cell_names=current_path+'data/cellranger-atac/SRR13633766/filtered_peak_bc_matrix/barcodes.tsv', \
                                var_names=current_path+'data/cellranger-atac/SRR13633766/filtered_peak_bc_matrix/peaks.bed')
adata.var_names_make_unique()
adata.obs_names_make_unique()
adata.obs.index=['{}-{}'.format(i,name.split('633')[1]) for i in adata.obs.index]
adata.obs['Tissue']='Retinoblastoma'


In [28]:
adata.obs['Developmental_Stage']=atac_meta.loc[atac_meta['Run']=='SRR13633766','Developmental_Stage'].values[0]
adata

AnnData object with n_obs × n_vars = 20001 × 234628
    obs: 'Tissue', 'Developmental_Stage'
    uns: 'omic'

In [29]:
# Save the file
adata.write_h5ad(current_path+'data/raw_data/RB-atac-SRR13633766.h5ad',compression='gzip')

In [30]:
name='SRR13633772'
adata=episcanpy.pp.read_ATAC_10x(current_path+'data/cellranger-atac/SRR13633772/filtered_peak_bc_matrix/matrix.mtx', \
                                cell_names=current_path+'data/cellranger-atac/SRR13633772/filtered_peak_bc_matrix/barcodes.tsv', \
                                var_names=current_path+'data/cellranger-atac/SRR13633772/filtered_peak_bc_matrix/peaks.bed')
adata.var_names_make_unique()
adata.obs_names_make_unique()
adata.obs.index=['{}-{}'.format(i,name.split('633')[1]) for i in adata.obs.index]
adata.obs['Tissue']='Retinoblastoma'
adata.obs['Developmental_Stage']=atac_meta.loc[atac_meta['Run']=='SRR13633772','Developmental_Stage'].values[0]
gc.collect()
adata

AnnData object with n_obs × n_vars = 20031 × 204576
    obs: 'Tissue', 'Developmental_Stage'
    uns: 'omic'

In [31]:
# Save the file
adata.write_h5ad(current_path+'data/raw_data/RB-atac-SRR13633772.h5ad',compression='gzip')
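
The two scATAC-seq cells differ only in the run name, so the three input paths could be built by a small function and reused in a loop, in the same spirit as the scRNA-seq loop earlier. This is a sketch under the directory layout used in this tutorial; `atac_inputs` is a hypothetical helper, and `PurePosixPath` is used so that the result does not depend on the operating system running the example.

```python
from pathlib import PurePosixPath


def atac_inputs(root, run):
    """Build the three cellranger-atac input paths for one run.

    The keys match the arguments used in the cells above:
    the matrix path, plus `cell_names` and `var_names`.
    """
    base = PurePosixPath(root) / "data" / "cellranger-atac" / run / "filtered_peak_bc_matrix"
    return {
        "matrix": str(base / "matrix.mtx"),
        "cell_names": str(base / "barcodes.tsv"),
        "var_names": str(base / "peaks.bed"),
    }


root = "/home/leihu/data/analysis/rb_tutorial"
all_inputs = {run: atac_inputs(root, run) for run in ["SRR13633766", "SRR13633772"]}

for run, paths in all_inputs.items():
    print(run, paths["matrix"])
```

Each dictionary could then be unpacked into the reader call shown above, for example `episcanpy.pp.read_ATAC_10x(paths["matrix"], cell_names=paths["cell_names"], var_names=paths["var_names"])`, followed by the same suffix, metadata, and save steps.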