# M.Musculus_mm10_Camk2aTTA_S1S2  

## Custom genome reference for 10X CellRanger: 

## B6_Tg_Camk2a_tTA_S1_S2 

_v2.0_ 

_VG 12.02.23_ 

Here, a custom _Mus musculus_ genome reference is generated for use with 10X Genomics pipelines (CellRanger, SpaceRanger). It includes the sequences of both transgenes carried by the double-transgenic B6/Tg Camk2a-tTA S1-S2 (alpha-synuclein) mice. 
The transgene sequences have to be included for accurate quantification of transgene-specific gene expression. 

### 1.  Generation of transgene sequences 

Two transgenes are present in a double-transgenic B6/Tg Camk2a-tTA S1-S2 mouse: 

- The tetracycline transactivator (tTA) protein under the control of the Camk2a promoter  
- The split-hGLuc (humanized _Gaussia princeps_ luciferase)-synuclein transgene under the control of the TRE promoter  


#### Camk2a-tTA transgene: 

The tTA driver mouse strain is commercially available from The Jackson Laboratory (https://www.jax.org/strain/007004). Although the generation, structure, and localization (Chr. 12) of the transgene are well described in the JAX documentation, the transgene sequence is provided neither by JAX nor in the original (or follow-up) publications: https://pubmed.ncbi.nlm.nih.gov/8939850/  

Therefore, a sequence found in the NCBI Nucleotide collection is used here: 
https://www.ncbi.nlm.nih.gov/nuccore/MK674482  

Its description matches the original description of the Camk2a-tTA mouse transgene: an artificial intron incorporating splice sites at the 5' end and an SV40 polyadenylation signal at the 3' end.  

**Including the 5' UTR and 3' UTR is essential for 10X GEX quantification, because coverage is concentrated at the 5' or 3' end of transcripts, depending on the chemistry.** In addition, the mouse and human SNCA CDS are highly conserved, so transgene reads can be captured more reliably via the 5' UTR and 3' UTR, which differ significantly. 

The Camk2a-tTA transgene FASTA was downloaded from: 

https://www.ncbi.nlm.nih.gov/nuccore/MK674482.1?report=fasta 
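The download can also be scripted. The sketch below is one possible approach, not the method used here: it assumes the record is fetched through the NCBI E-utilities `efetch` endpoint and that the contig should be named `Camk2aTTA`, as in the file shown below. The helper name `rename_fasta_header` is hypothetical.

```shell
# Replace the header of a single-record FASTA read from stdin.
# $1: new contig name (hypothetical helper, not part of any 10X tool)
rename_fasta_header() {
    awk -v name="$1" '
        /^>/ { print ">" name; next }   # swap the header line
        { print }                        # pass sequence lines through unchanged
    '
}

# Example (needs network access, so it is left commented out):
# curl -fsSL "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=MK674482.1&rettype=fasta&retmode=text" \
#     | rename_fasta_header Camk2aTTA > Input/Camk2a_tTA.fa
```

A short header without spaces keeps the contig name identical in the FASTA and in the first column of the GTF.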


In [1]:
head Input/Camk2a_tTA.fa 

>Camk2aTTA
GGGCGGCCTCGACGGTATCGATAAGCTTCGATCTTTTTTCCGTAAACTCAATACCAGGCTGATGTCCCAC
CGGATCTGATGGCTTAGGGTGGCAGGGAATCTCAGTTCCCCTCAGACACTCTCCCTTTGCTGGTTCTCAG
GGAGGAGGCAAGGTCAAGTCTTCATCTGTAGGCACGTGGAGGGAGGGCACAGAAGCCCTCAGCTGAATAG
GGTGGGACTTGGGGAAGGGCAGCAACCAGGCTGGGTTGCCTGGGTCACAATCCTGCCTCTTTCCTGATGA
GTTTCCTTTTTGCCCTCAGGTTACCTATAGCAGCATTCTGCCTCAATCTCACCCCTAAGATGAGCTCTGG
TGACTTTAGGACTCCAGTGTACACATGTGTCTGGGGCCATGGCAGGGTTTCTTGCTGACCTTGTCACCTT
CCAGACAACTTGAGTCCATGACCCTCTTTCCAGCTCTCTGTGGTGCTCTTGGATATCAGCTGGAGTATGG
CCAGCTGGCTGCTGCTCTGTTGAACAACTCAATGAGAGAACGGACAGGGTAGGCTCTGAGAAATCTTTAC
GTTCCTGGAGCCTCATGACTTGGGAGCCTAGTGGAATTCTTCTCTTTTGGTCCCCAACATCTGGGGGGAG


In [2]:
tail Input/Camk2a_tTA.fa

AGCTGCACTGCTATACAAGAAAATTATGGAAAAATATTTGATGTATAGTGCCTTGACTAGAGATCATAAT
CAGCCATACCACATTTGTAGAGGTTTTACTTGCTTTAAAAAACCTCCCACACCTCCCCCTGAACCTGAAA
CATAAAATGAATGCAATTGTTGTTGTTAACTTGTTTATTGCAGCTTATAATGGTTACAAATAAAGCAATA
GCATCACAAATTTCACAAATAAAGCATTTTTTTCACTGCATTCTAGTTGTGGTTTGTCCAAACTCATCAA
TGTATCTTATCATGTCTGGATCGATCCCGCCATGGTATCAACGCCATATTTCTATTTACAGTAGGGACCT
CTTCGTTGTGTAGGTACCGCTGTATTCCTAGGGAAATAGTAGAGGCACCTTGAACTGTCTGCATCAGCCA
TATAGCCCCCGCTGTTCGATTTACAAACACAGGCACAGTACTGACAAACCCATACACCTCCTCTGAAATA
CCCATAGTTGCTAGGGCTGTCTCCGAACTCATTACACCCTCCAAAGTCAGAGCTGTAATTTCGCCATCAA
GGGCAGCGAGGGCTTCTCCAGATAAAATAGCTTCTGCCGAGAGTCCCGTAAGGGTAGACACTTCAGCTAA
TCCCTCGAGCGCGGCCGCCACGGTCGAGGCCGCCC

 - Determine total contig length: 

In [3]:
cat Input/Camk2a_tTA.fa | grep -v "^>"  | tr -d "\n" | wc -c

10685
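As an alternative to the one-liner above, the length can be reported per record, which also works once several contigs share one file. This is a minimal sketch; `fasta_lengths` is a hypothetical helper name.

```shell
# Print "<contig><TAB><length>" for every record in a FASTA file.
fasta_lengths() {
    awk '
        /^>/ {
            if (name != "") print name "\t" len   # flush the previous record
            name = substr($1, 2)                   # contig name = first word after ">"
            len = 0
            next
        }
        { sub(/\r$/, ""); len += length($0) }      # drop a trailing carriage return, then count bases
        END { if (name != "") print name "\t" len }
    ' "$1"
}
```

Run on `Input/Camk2a_tTA.fa`, it should report the same length as the `wc -c` count, provided the file contains no carriage returns.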


- Generate a GTF file: 

The original description of the transgene (link above) defines the following segments: 

\- 23-8018: _promoter, Camk2 promoter segment from mouse_     
\- 8372-9379: _gene, tTA_    
\- 9932-10172: _regulatory, SV40 polyA terminator region_     

In addition, the following features were found by inspection of the DNA sequence:  



![title](./img/Camk2a_tTA.png)

The transcript is therefore defined as 8018-10172 to include the 5' UTR and 3' UTR; the gene record in the GTF starts further upstream, at 6729: 

In [9]:
head Input/Camk2a_tTA.gtf

Camk2aTTA	Veselin	gene	6729	10172	.	+	.	gene_id "Camk2aTTA"; gene_name "Camk2aTTA"; gene_biotype "protein_coding";
Camk2aTTA	Veselin	transcript	8018	10172	.	+	.	gene_id "Camk2aTTA"; transcript_id "tTA"; gene_name "Camk2aTTA"; transcript_name "tTA"; transcript_biotype "protein_coding"; gene_biotype "protein_coding";
Camk2aTTA	Veselin	exon	8018	8088	.	+	.	gene_id "Camk2aTTA"; transcript_id "tTA"; gene_name "Camk2aTTA"; transcript_name "tTA"; gene_biotype "protein_coding";
Camk2aTTA	Veselin	exon	8319	9550	.	+	.	gene_id "Camk2aTTA"; transcript_id "tTA"; gene_name "Camk2aTTA"; transcript_name "tTA"; gene_biotype "protein_coding";
Camk2aTTA	Veselin	exon	9617	10172	.	+	.	gene_id "Camk2aTTA"; transcript_id "tTA"; gene_name "Camk2aTTA"; transcript_name "tTA"; gene_biotype "protein_coding";
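Hand-written GTF files are easy to get subtly wrong (spaces instead of tabs, coordinates past the end of the contig). The following sketch checks for these cases; `gtf_check_bounds` is a hypothetical helper, and the contig length is passed in by hand (10685 for this transgene).

```shell
# Report GTF records that are malformed or fall outside the contig.
# $1: GTF file, $2: contig length in bp. Silent and exit status 0 if all records pass.
gtf_check_bounds() {
    awk -F '\t' -v len="$2" '
        /^#/ || NF == 0 { next }                      # skip comments and blank lines
        NF != 9 { print "line " NR ": expected 9 tab-separated fields"; bad = 1; next }
        ($4 + 0) < 1 || ($4 + 0) > ($5 + 0) { print "line " NR ": start/end out of order"; bad = 1 }
        ($5 + 0) > (len + 0) { print "line " NR ": end exceeds contig length " len; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}

# Example:
# gtf_check_bounds Input/Camk2a_tTA.gtf 10685 && echo "GTF coordinates OK"
```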


#### S1-S2 transgene: 

The transgene was generated by Dr. Björn von Einem in a pBI-5 vector, linearized at both BsrBI restriction sites, and injected into mouse oocytes. The exact insertion site of the transgene is not known. 

![title](img/S1S2v.png)

The transgene consists of two ORFs arranged linearly and controlled by a bidirectional TRE promoter. Therefore, the whole linear sequence between the two BsrBI restriction sites is used as the contig, and the two ORFs are included as separate transcripts of one gene in the GTF file:  

In [10]:
head Input/S1S2.fa | cut -c 1-100

>S1S2 
CGGATACATATTTGAATGTATTTAGAAAAATAAACAAATAGGGGTTCCGCGCACATTTCCCCGAAAAGTGCCACCTGACAACGCAGTCGAGTTCATAAGA


- Determine total contig length: 

In [11]:
cat Input/S1S2.fa | grep -v "^>" |  tr -d "\n" | wc -c

4701


- Generate a GTF file: 

![title](img/S1S2.png)

Since the exact TSS is not known in this case, each transcript is assigned a TSS at the first bp after TRE/pMinCMV and extends to the end of the B2G polyadenylation signal. 

In [16]:
cat Input/S1S2.gtf

S1S2	Veselin	exon	412	663	.	-	.	gene_id "S1S2"; transcript_id "SynLuc2"; gene_name "S1S2"; gene_biotype "protein_coding";
S1S2	Veselin	exon	1237	2153	.	-	.	gene_id "S1S2"; transcript_id "SynLuc2"; gene_name "S1S2"; gene_biotype "protein_coding";
S1S2	Veselin	exon	2495	3442	.	+	.	gene_id "S1S2"; transcript_id "SynLuc1"; gene_name "S1S2"; gene_biotype "protein_coding";
S1S2	Veselin	exon	4016	4267	.	+	.	gene_id "S1S2"; transcript_id "SynLuc1"; gene_name "S1S2"; gene_biotype "protein_coding";
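To double-check that the two transcripts sit on opposite strands and have the expected exon structure, the exon records can be summarised per transcript. This is a minimal sketch; `gtf_exon_summary` is a hypothetical helper name.

```shell
# Summarise exon records per transcript: id, strand, exon count, summed exon length.
gtf_exon_summary() {
    awk -F '\t' '
        $3 == "exon" && match($9, /transcript_id "[^"]+"/) {
            # skip the 15-character prefix (transcript_id, space, quote) and the closing quote
            id = substr($9, RSTART + 15, RLENGTH - 16)
            if (!(id in n)) order[++k] = id        # remember first-seen order
            n[id]++
            bp[id] += $5 - $4 + 1                  # GTF coordinates are 1-based and inclusive
            strand[id] = $7
        }
        END {
            for (i = 1; i <= k; i++)
                print order[i] "\t" strand[order[i]] "\t" n[order[i]] "\t" bp[order[i]]
        }
    ' "$1"
}
```

For the four records shown above, this reports `SynLuc2` on the minus strand with 2 exons (1169 bp) and `SynLuc1` on the plus strand with 2 exons (1200 bp).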


### 2. Build a custom genome with the transgene sequences 

- Get 10X GEX mm10 Genome: 

In [17]:
cp /opt/refdata-gex-mm10-2020-A/fasta/genome.fa Input/genome_mm10_Camk2aTTA_S1S2.fa

In [18]:
cp /opt/refdata-gex-mm10-2020-A/genes/genes.gtf Input/genes_mm10_Camk2aTTA_S1S2.gtf

- Append both transgene sequences to the mm10 genome FASTA (genome.fa and genes.gtf were extracted from the 10X mm10 reference **_refdata-gex-mm10-2020-A_**): 

In [19]:
cat Input/Camk2a_tTA.fa >> Input/genome_mm10_Camk2aTTA_S1S2.fa

In [20]:
echo -e "\n" >> Input/genome_mm10_Camk2aTTA_S1S2.fa

In [21]:
cat Input/S1S2.fa >> Input/genome_mm10_Camk2aTTA_S1S2.fa
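In bash, `echo -e "\n"` writes two newline characters, so a blank line ends up between the two transgene records. A possible alternative is to add a newline only when the target file does not already end with one. The sketch below assumes this behaviour is wanted; `append_fasta` is a hypothetical helper name.

```shell
# Append FASTA file $2 to FASTA file $1, adding a newline first only if $1 lacks one.
append_fasta() {
    # $(tail -c 1 file) expands to an empty string when the last byte is a newline
    if [ -s "$1" ] && [ -n "$(tail -c 1 "$1")" ]; then
        echo >> "$1"
    fi
    cat "$2" >> "$1"
}

# Example:
# append_fasta Input/genome_mm10_Camk2aTTA_S1S2.fa Input/Camk2a_tTA.fa
# append_fasta Input/genome_mm10_Camk2aTTA_S1S2.fa Input/S1S2.fa
```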

Verify custom genome contigs: 

In [22]:
grep ">" Input/genome_mm10_Camk2aTTA_S1S2.fa

>chr1 1
>chr10 10
>chr11 11
>chr12 12
>chr13 13
>chr14 14
>chr15 15
>chr16 16
>chr17 17
>chr18 18
>chr19 19
>chr2 2
>chr3 3
>chr4 4
>chr5 5
>chr6 6
>chr7 7
>chr8 8
>chr9 9
>chrM MT
>chrX X
>chrY Y
>JH584299.1 JH584299.1
>GL456233.1 GL456233.1
>JH584301.1 JH584301.1
>GL456211.1 GL456211.1
>GL456350.1 GL456350.1
>JH584293.1 JH584293.1
>GL456221.1 GL456221.1
>JH584297.1 JH584297.1
>JH584296.1 JH584296.1
>GL456354.1 GL456354.1
>JH584294.1 

In [23]:
cat Input/Camk2a_tTA.gtf >> Input/genes_mm10_Camk2aTTA_S1S2.gtf

In [24]:
cat Input/S1S2.gtf >> Input/genes_mm10_Camk2aTTA_S1S2.gtf

Verify custom genome GTF: 

In [25]:
tail Input/genes_mm10_Camk2aTTA_S1S2.gtf

JH584304.1	ENSEMBL	UTR	52691	54867	.	-	.	gene_id "ENSMUSG00000095041"; gene_version "7"; transcript_id "ENSMUST00000178343"; transcript_version "1"; gene_type "protein_coding"; gene_name "AC149090.1"; transcript_type "protein_coding"; transcript_name "AC149090.1-202"; exon_number 4; exon_id "ENSMUSE00001045433"; exon_version "1"; level 3; protein_id "ENSMUSP00000136649.1"; transcript_support_level "1"; tag "basic";
Camk2aTTA	Veselin	gene	6729	10172	.	+	.	gene_id "Camk2aTTA"; gene_name "Camk2aTTA"; gene_biotype "protein_coding";
Camk2aTTA	Veselin	transcript	8018	10172	.	+	.	gene_id "Camk2aTTA"; transcript_id "tTA"; gene_name "Camk2aTTA"; transcript_name "tTA"; transcript_biotype "protein_coding"; gene_biotype "protein_coding";
Camk2aTTA	Veselin	exon	8018	8088	.	+	.	gene_id "Camk2aTTA"; transcript_id "tTA"; gene_name "Camk2aTTA"; transcript_name "tTA"; gene_biotype "protein_coding";
Camk2aTTA	Veselin	exon	8319	9550	.	+	.	gene_id "Camk2aTTA"; transcript_id "tTA"; gene_name "Camk2aTTA"; tr
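A quick consistency check before building the index: every sequence name used in the first column of the GTF should correspond to a contig header in the FASTA. The sketch below lists any that do not; `gtf_contigs_missing_from_fasta` is a hypothetical helper name. It scans the whole FASTA, so it may take a while on a full genome.

```shell
# List sequence names used in a GTF that have no matching ">" header in a FASTA.
# $1: FASTA, $2: GTF. Prints nothing when every GTF contig is present.
gtf_contigs_missing_from_fasta() {
    awk -F '\t' '
        NR == FNR {                                 # first file: the FASTA
            if (/^>/) {
                split(substr($0, 2), h, /[ \t]/)    # contig name = first word of the header
                seen[h[1]] = 1
            }
            next
        }
        /^#/ || NF == 0 { next }                    # second file: skip comments and blank lines
        !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$1" "$2"
}

# Example:
# gtf_contigs_missing_from_fasta Input/genome_mm10_Camk2aTTA_S1S2.fa Input/genes_mm10_Camk2aTTA_S1S2.gtf
```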

### 3.0 Save Genome reference (SpaceRanger) 

In [32]:
tree M.Musculus_mm10_Camk2aTTA_S1S2_2.0.0

M.Musculus_mm10_Camk2aTTA_S1S2_2.0.0
├── fasta
│   ├── genome.fa
│   └── genome.fa.fai
├── genes
│   └── genes.gtf.gz
├── reference.json
└── star
    ├── chrLength.txt
    ├── chrNameLength.txt
    ├── chrName.txt
    ├── chrStart.txt
    ├── exonGeTrInfo.tab
    ├── exonInfo.tab
    ├── geneInfo.tab
    ├── Genome
    ├── genomeParameters.txt
    ├── SA
    ├── SAindex
    ├── sjdbInfo.txt
    ├── sjdbList.fromGTF.out.tab
    ├── sjdbList.out.tab
    └── transcriptInfo.tab

3 directories, 19 files


In [None]:
cp -r  M.Musculus_mm10_Camk2aTTA_S1S2_2.0.0 /opt/refdata_spaceranger/

In [33]:
cp -r M.Musculus_mm10_Camk2aTTA_S1S2_2.0.0 /home/workstation2/NAS/Bioinformatic_Data/Genome_References/SpaceRanger

In [38]:
tree /opt/refdata_spaceranger/M.Musculus_mm10_Camk2aTTA_S1S2_2.0.0

/opt/refdata_spaceranger/M.Musculus_mm10_Camk2aTTA_S1S2_2.0.0
├── fasta
│   ├── genome.fa
│   └── genome.fa.fai
├── genes
│   └── genes.gtf.gz
├── reference.json
└── star
    ├── chrLength.txt
    ├── chrNameLength.txt
    ├── chrName.txt
    ├── chrStart.txt
    ├── exonGeTrInfo.tab
    ├── exonInfo.tab
    ├── geneInfo.tab
    ├── Genome
    ├── genomeParameters.txt
    ├── SA
    ├── SAindex
    ├── sjdbInfo.txt
    ├── sjdbList.fromGTF.out.tab
    ├── sjdbList.out.tab
    └── transcriptInfo.tab

3 directories, 19 files



### 5.0 Build Genome with Cellranger mkref 

In [41]:
cellranger mkref --genome=M.Musculus_mm10_Camk2aTTA_S1S2_2.0.0 \
    --fasta=Input/genome_mm10_Camk2aTTA_S1S2.fa \
    --genes=Input/genes_mm10_Camk2aTTA_S1S2.gtf \
    --memgb=142 \
    --nthreads=24 

['/opt/cellranger-7.1.0/bin/rna/mkref', '--genome=M.Musculus_mm10_Camk2aTTA_S1S2_2.0.0', '--fasta=genome_mm10_Camk2aTTA_S1S2.fa', '--genes=genes_mm10_Camk2aTTA_S1S2.gtf', '--memgb=142', '--nthreads=24']
Creating new reference folder at /home/workstation2/Veselin/S1S2_Ref_v2/M.Musculus_mm10_Camk2aTTA_S1S2_2.0.0
...done

Writing genome FASTA file into reference folder...
...done

Indexing genome FASTA file...
...done

Writing genes GTF file into reference folder...
...done

Generating STAR genome index (may take over 8 core hours for a 3Gb genome)...
Feb 12 00:54:02 ..... started STAR run
Feb 12 00:54:02 ... starting to generate Genome files
Feb 12 00:54:55 ... starting to sort Suffix Array. This may take a long time...
Feb 12 00:55:03 ... sorting Suffix Array chunks and saving them to disk...
Feb 12 01:01:14 ... loading chunks from disk, packing SA...
Feb 12 01:01:57 ... finished generating suffix array
Feb 12 01:01:57 ... generating Suffix Array index
Feb 12 01:04:54 ... completed Suff
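Once the build has finished, one way to confirm that the transgene contigs made it into the index is to look for them in `star/chrName.txt` inside the reference folder. This is a sketch under the assumption that the file lists one contig name per line; `reference_has_contigs` is a hypothetical helper name.

```shell
# Check that each named contig is listed in the STAR index of a built reference.
# $1: reference directory; remaining arguments: contig names to look for.
reference_has_contigs() {
    ref=$1
    shift
    status=0
    for contig in "$@"; do
        # -x matches the whole line, -F treats the name as a fixed string
        if grep -qxF "$contig" "$ref/star/chrName.txt"; then
            echo "found: $contig"
        else
            echo "MISSING: $contig"
            status=1
        fi
    done
    return "$status"
}

# Example:
# reference_has_contigs M.Musculus_mm10_Camk2aTTA_S1S2_2.0.0 Camk2aTTA S1S2
```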

### 6.0 Save Genome reference (CellRanger)  

In [42]:
tree M.Musculus_mm10_Camk2aTTA_S1S2_2.0.0

M.Musculus_mm10_Camk2aTTA_S1S2_2.0.0
├── fasta
│   ├── genome.fa
│   └── genome.fa.fai
├── genes
│   └── genes.gtf.gz
├── reference.json
└── star
    ├── chrLength.txt
    ├── chrNameLength.txt
    ├── chrName.txt
    ├── chrStart.txt
    ├── exonGeTrInfo.tab
    ├── exonInfo.tab
    ├── geneInfo.tab
    ├── Genome
    ├── genomeParameters.txt
    ├── SA
    ├── SAindex
    ├── sjdbInfo.txt
    ├── sjdbList.fromGTF.out.tab
    ├── sjdbList.out.tab
    └── transcriptInfo.tab

3 directories, 19 files


In [None]:
cp -r  M.Musculus_mm10_Camk2aTTA_S1S2_2.0.0 /opt/ 

In [None]:
cp -r M.Musculus_mm10_Camk2aTTA_S1S2_2.0.0 /home/workstation2/NAS/Bioinformatic_Data/Genome_References/CellRanger

In [43]:
tree /opt/M.Musculus_mm10_Camk2aTTA_S1S2_2.0.0

/opt/M.Musculus_mm10_Camk2aTTA_S1S2_2.0.0
├── fasta
│   ├── genome.fa
│   └── genome.fa.fai
├── genes
│   └── genes.gtf.gz
├── reference.json
└── star
    ├── chrLength.txt
    ├── chrNameLength.txt
    ├── chrName.txt
    ├── chrStart.txt
    ├── exonGeTrInfo.tab
    ├── exonInfo.tab
    ├── geneInfo.tab
    ├── Genome
    ├── genomeParameters.txt
    ├── SA
    ├── SAindex
    ├── sjdbInfo.txt
    ├── sjdbList.fromGTF.out.tab
    ├── sjdbList.out.tab
    └── transcriptInfo.tab

3 directories, 19 files
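Before deleting the local build folder, it may be worth confirming that a copy is complete. A minimal sketch using a recursive `diff` follows; `verify_copy` is a hypothetical helper name, and comparing the large index files byte by byte can take some time.

```shell
# Succeeds only if the two directory trees have identical contents.
verify_copy() {
    diff -rq "$1" "$2" && echo "identical: $1 == $2"
}

# Example: remove the local folder only if the copy under /opt matches.
# verify_copy M.Musculus_mm10_Camk2aTTA_S1S2_2.0.0 /opt/M.Musculus_mm10_Camk2aTTA_S1S2_2.0.0 \
#     && rm -r M.Musculus_mm10_Camk2aTTA_S1S2_2.0.0
```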


Clean-up: 

In [None]:
rm -r M.Musculus_mm10_Camk2aTTA_S1S2_2.0.0

**# Changelog**

\# v1.0 VG 09.02.23 -> Initial commit  
\# v2.0 VG 12.02.23 -> v2.0 update: GTF re-formatted to encode both the SNCA_L1 and SNCA_L2 transcripts on the same gene, based on analysis of the alignments with v1.0 
