# Notebook for mapping ATAC-seq data

Raw data were downloaded from ArrayExpress accession E-MTAB-12919, see [**here**](https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-12919/sdrf).

    Developed by: Christian Eger
    Würzburg Institute for Systems Immunology - Faculty of Medicine - Julius-Maximilians-Universität Würzburg
    Created on: 240415
    Last modified: 240416

In [4]:
import os
import pandas as pd
import json
import matplotlib.pyplot as plt

## Creating the input file for the download and processing script

The SDRF metadata file `E-MTAB-12919.sdrf.txt` was downloaded from [**here**](https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-12919).

In [2]:
data_path = '../.data/meta_data'

In [5]:
meta_data = pd.read_csv(
    filepath_or_buffer=os.path.join(data_path, 'E-MTAB-12919.sdrf.txt'),
    sep='\t',
)
meta_data.head()

Unnamed: 0,Source Name,Comment[ENA_SAMPLE],Comment[BioSD_SAMPLE],Characteristics[organism],Characteristics[age],Characteristics[developmental stage],Characteristics[sex],Characteristics[individual],Characteristics[organism part],Characteristics[disease],...,Assay Name,Technology Type,Comment[ENA_EXPERIMENT],Scan Name,Comment[SUBMITTED_FILE_NAME],Comment[ENA_RUN],Comment[FASTQ_URI],Comment[read_index],Comment[read_type],Factor Value[organism part]
0,HCAHeart9508819,ERS15408182,SAMEA113413051,Homo sapiens,55 to 60,adult,male,D3,heart left ventricle,normal,...,HCAHeart9508819,sequencing assay,ERX10811516,HCAHeart9508819_S1_L001_I1_001.fastq.gz,HCAHeart9508819_S1_L001_I1_001.fastq.gz,ERR11403725,ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR114/ERR114...,index1,sample_barcode,heart left ventricle
1,HCAHeart9508819,ERS15408182,SAMEA113413051,Homo sapiens,55 to 60,adult,male,D3,heart left ventricle,normal,...,HCAHeart9508819,sequencing assay,ERX10811516,HCAHeart9508819_S1_L001_I2_001.fastq.gz,HCAHeart9508819_S1_L001_I2_001.fastq.gz,ERR11403725,ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR114/ERR114...,index2,cell_barcode,heart left ventricle
2,HCAHeart9508819,ERS15408182,SAMEA113413051,Homo sapiens,55 to 60,adult,male,D3,heart left ventricle,normal,...,HCAHeart9508819,sequencing assay,ERX10811516,HCAHeart9508819_S1_L001_R1_001.fastq.gz,HCAHeart9508819_S1_L001_R1_001.fastq.gz,ERR11403725,ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR114/ERR114...,read1,paired,heart left ventricle
3,HCAHeart9508819,ERS15408182,SAMEA113413051,Homo sapiens,55 to 60,adult,male,D3,heart left ventricle,normal,...,HCAHeart9508819,sequencing assay,ERX10811516,HCAHeart9508819_S1_L001_R2_001.fastq.gz,HCAHeart9508819_S1_L001_R2_001.fastq.gz,ERR11403725,ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR114/ERR114...,read2,paired,heart left ventricle
4,HCAHeart9508820,ERS15408183,SAMEA113413052,Homo sapiens,60 to 65,adult,male,D7,right cardiac atrium,normal,...,HCAHeart9508820,sequencing assay,ERX10811517,HCAHeart9508820_S1_L001_I1_001.fastq.gz,HCAHeart9508820_S1_L001_I1_001.fastq.gz,ERR11403726,ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR114/ERR114...,index1,sample_barcode,right cardiac atrium


In [11]:
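# Keep only the columns the download loop needs, in the order it reads them:
# sample name, FASTQ URL, submitted file name (the row index becomes the first CSV field).
meta_data = meta_data[['Source Name', 'Comment[FASTQ_URI]', 'Comment[SUBMITTED_FILE_NAME]']]
meta_data.to_csv(os.path.join(data_path, 'downloads_table.csv'), header=False)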
meta_data.head()

Unnamed: 0,Source Name,Comment[FASTQ_URI],Comment[SUBMITTED_FILE_NAME]
0,HCAHeart9508819,ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR114/ERR114...,HCAHeart9508819_S1_L001_I1_001.fastq.gz
1,HCAHeart9508819,ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR114/ERR114...,HCAHeart9508819_S1_L001_I2_001.fastq.gz
2,HCAHeart9508819,ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR114/ERR114...,HCAHeart9508819_S1_L001_R1_001.fastq.gz
3,HCAHeart9508819,ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR114/ERR114...,HCAHeart9508819_S1_L001_R2_001.fastq.gz
4,HCAHeart9508820,ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR114/ERR114...,HCAHeart9508820_S1_L001_I1_001.fastq.gz
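The bash cell at the end of this notebook reads `downloads_table.csv` positionally as row index, sample, URL and file name, so a row with a different number of fields would silently shift the values. A minimal sketch of a check for that, using only the standard library. The two rows, the `example.org` URLs and the helper name `read_download_rows` are illustrative assumptions, not part of the dataset.

```python
import csv
import io

# Two illustrative rows in the header-less layout written by to_csv:
# row index, sample name, FASTQ URL, submitted file name.
# Sample names and URLs are placeholders.
example_csv = (
    "0,sampleA,ftp://example.org/a_R1.fastq.gz,a_R1.fastq.gz\n"
    "1,sampleA,ftp://example.org/a_R2.fastq.gz,a_R2.fastq.gz\n"
)

FIELDS = ("index", "sample", "url", "file_name")


def read_download_rows(handle):
    """Yield one dict per CSV row; raise if a row is not four fields wide."""
    for line_number, row in enumerate(csv.reader(handle), start=1):
        if len(row) != len(FIELDS):
            raise ValueError(
                f"line {line_number}: expected {len(FIELDS)} fields, got {len(row)}"
            )
        yield dict(zip(FIELDS, row))


rows = list(read_download_rows(io.StringIO(example_csv)))
print(rows[0]["sample"], rows[0]["file_name"])
```

On the real file, the same function could be called with `open('../.data/meta_data/downloads_table.csv', newline='')` in place of the `StringIO` object.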


scATAK [**documentation**](https://github.com/pachterlab/scATAK/tree/main). The usage notes below are reproduced from it.

### Usage
```
$SCATAK_HOME/scATAK -h

######################################################################################

scATAK [options]
-module=quant [please choose 'quant' for single-cell quantification, 'track' for group bigwig track generation, 'hic' for HiC related analysis]
Please specify the following options:
-id --sample_id=sample_sheet.csv [a sample information sheet for fastq files, must be csv format]
-genome --genome_fasta=Mus_musculus.GRCm38.dna_rm.primary_assembly.fa [ENSEMBL genome fasta file for the organism of your interest]
-gene --gene_gtf=Mus_musculus.GRCm38.101.chr.gtf [ENSEMBL gene gtf file for the organism of your interest]
-bc --blen=16 [length of cell barcode, default:16]
-bf --flen=40 [length of biological feature, should not be longer than R2 read length, default:40]
-bg --bc_group=bc_group.txt [a two-column text file with Barcode and Group. First line should be 'Barcode' and 'Group'] (for -module=track or -module=hic)
-bam --bam_file=peak_calling/sampleX.bam [scATAK quant mapped bam file for sampleX] (for -module=track only)
-hic --hic_bedpe=sample_significant.bedpe [hic interaction bedpe file] (for -module=hic only)
-bin --hic_binsize=10000 [hic interaction bin size, default 10kb] (for -module=hic only)
-mtxdir --region_mtxdir=atac_regions/atac_sampleX [atac region matrix directory for sampleX] (for -module=hic only)
-t --thread=8 [threads to use, default:8]
-h --help [Help information]
```
### Input File Format
#### sample_sheet.csv
Columns, in order: sample ID, sample group, R1 biological read FASTQ, R2 cell barcode read FASTQ, R3 biological read FASTQ, barcode whitelist.
```
male1,male,B8_S1_L001_R1_001.fastq.gz,B8_S1_L001_R2_001.fastq.gz,B8_S1_L001_R3_001.fastq.gz,/home/fgao/scATAK/lib/737K-cratac-v1.txt
male2,male,B9_S1_L003_R1_001.fastq.gz,B9_S1_L003_R2_001.fastq.gz,B9_S1_L003_R3_001.fastq.gz,/home/fgao/scATAK/lib/737K-cratac-v1.txt
male3,male,B10_S1_L004_R1_001.fastq.gz,B10_S1_L004_R2_001.fastq.gz,B10_S1_L004_R3_001.fastq.gz,/home/fgao/scATAK/lib/737K-cratac-v1.txt
female1,female,4_F_S1_L004_R1_001.fastq.gz,4_F_S1_L004_R2_001.fastq.gz,4_F_S1_L004_R3_001.fastq.gz,/home/fgao/scATAK/lib/737K-cratac-v1.txt
female2,female,5_F_S1_L001_R1_001.fastq.gz,5_F_S1_L001_R2_001.fastq.gz,5_F_S1_L001_R3_001.fastq.gz,/home/fgao/scATAK/lib/737K-cratac-v1.txt
female3,female,6_F_S1_L002_R1_001.fastq.gz,6_F_S1_L002_R2_001.fastq.gz,6_F_S1_L002_R3_001.fastq.gz,/home/fgao/scATAK/lib/737K-cratac-v1.txt
```
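The files in E-MTAB-12919 are named `I1`, `I2`, `R1` and `R2` rather than `R1`, `R2` and `R3`, so a sample sheet for this dataset needs a mapping between the two naming schemes. The sketch below assumes the mapping suggested by the SDRF annotation (`index2` is labelled `cell_barcode`, `read1` and `read2` are labelled `paired`); whether scATAK accepts the files in this arrangement has not been verified here. The group label, the whitelist path and the helper name `sample_sheet_row` are hypothetical.

```python
import csv
import io

# FASTQ files of one sample, keyed by the SDRF Comment[read_index] value.
files = {
    "index1": "HCAHeart9508819_S1_L001_I1_001.fastq.gz",
    "index2": "HCAHeart9508819_S1_L001_I2_001.fastq.gz",
    "read1": "HCAHeart9508819_S1_L001_R1_001.fastq.gz",
    "read2": "HCAHeart9508819_S1_L001_R2_001.fastq.gz",
}


def sample_sheet_row(sample, group, files_by_read_index, whitelist):
    """Return one six-column sample-sheet row.

    Assumed mapping: scATAK R1 <- read1, R2 (cell barcode) <- index2,
    R3 <- read2. index1 (sample barcode) is not used.
    """
    return [
        sample,
        group,
        files_by_read_index["read1"],
        files_by_read_index["index2"],
        files_by_read_index["read2"],
        whitelist,
    ]


buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(
    sample_sheet_row(
        "HCAHeart9508819",
        "heart_left_ventricle",          # hypothetical group label
        files,
        "/path/to/737K-cratac-v1.txt",   # placeholder whitelist path
    )
)
sheet_text = buffer.getvalue()
print(sheet_text, end="")
```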
#### bc_group.txt
Columns, in order: cell barcode, cell type group.
```
Barcodes Groups
AAACGAAAGCCTCGCA Oligodendrocytes
AAACGAAAGGAAGAAC Oligodendrocytes
AAACGAACAGCAACGA OPC
AAACGAACATTACTCT Microglia
AAACGAATCACTCGGG Oligodendrocytes
AAACGAATCCTTACGC Oligodendrocytes
AAACGAATCGATCTTT Astrocytes
```
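Before passing a barcode-to-group file to the `track` or `hic` module, it can help to tally the cells per group and confirm that every barcode has the length given by `-bc`. A minimal sketch using the example rows above. It assumes a single header line and whitespace-separated columns; the helper name `parse_bc_group` is made up for illustration.

```python
from collections import Counter

example_text = """Barcodes Groups
AAACGAAAGCCTCGCA Oligodendrocytes
AAACGAAAGGAAGAAC Oligodendrocytes
AAACGAACAGCAACGA OPC
AAACGAACATTACTCT Microglia
AAACGAATCACTCGGG Oligodendrocytes
AAACGAATCCTTACGC Oligodendrocytes
AAACGAATCGATCTTT Astrocytes
"""


def parse_bc_group(text, barcode_len=16):
    """Return (cells per group, barcodes whose length differs from barcode_len)."""
    lines = [line for line in text.splitlines() if line.strip()]
    group_counts = Counter()
    bad_barcodes = []
    for line in lines[1:]:  # skip the header line
        # Split on the first run of whitespace so group names may contain spaces.
        barcode, group = line.split(None, 1)
        group_counts[group.strip()] += 1
        if len(barcode) != barcode_len:
            bad_barcodes.append(barcode)
    return group_counts, bad_barcodes


group_counts, bad_barcodes = parse_bc_group(example_text)
print(dict(group_counts))
print(bad_barcodes)
```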
### Quick Start
```
#Quant module
$SCATAK_HOME/scATAK -module=quant -id=sample_sheet.csv -genome=Homo_sapiens.GRCh38.dna_rm.primary_assembly.fa -gene=Homo_sapiens.GRCh38.101.chr.gtf -bc=16 -bf=40 -t=4

#Track module
$SCATAK_HOME/scATAK -module=track -bg=bc_group.txt -bam=peak_calling/sampleX.bam -genome=Homo_sapiens.GRCh38.dna_rm.primary_assembly.fa

#Hic module
$SCATAK_HOME/scATAK -module=hic -bg=bc_group.txt -hic=sample_significant.bedpe -bin=10000 -mtxdir=atac_regions/atac_sampleX -t=16
```
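When the quant module is launched from a notebook cell, building the argument list in Python keeps the paths in one place. The sketch below only assembles the command from the options listed in the help text above and does not run it. The function name `build_quant_command` and the install location `/opt/scATAK` are assumptions for illustration.

```python
import os


def build_quant_command(sample_sheet, genome_fasta, gene_gtf,
                        barcode_len=16, feature_len=40, threads=8,
                        scatak_home=None):
    """Assemble the scATAK quant call as an argument list (nothing is executed)."""
    # Fall back to the SCATAK_HOME environment variable used in the Quick Start.
    home = scatak_home if scatak_home is not None else os.environ.get("SCATAK_HOME", "")
    return [
        os.path.join(home, "scATAK"),
        "-module=quant",
        f"-id={sample_sheet}",
        f"-genome={genome_fasta}",
        f"-gene={gene_gtf}",
        f"-bc={barcode_len}",
        f"-bf={feature_len}",
        f"-t={threads}",
    ]


command = build_quant_command(
    "sample_sheet.csv",
    "Homo_sapiens.GRCh38.dna_rm.primary_assembly.fa",
    "Homo_sapiens.GRCh38.101.chr.gtf",
    threads=4,
    scatak_home="/opt/scATAK",  # hypothetical install location
)
print(" ".join(command))
```

The resulting list could then be handed to `subprocess.run(command, check=True)` once the input files exist.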

In [1]:
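%%bash

# Header-less CSV with the columns: row index, sample, FASTQ URL, file name
metadata='../.data/meta_data/downloads_table.csv'
output_path='../.data/mapping'
index_output='../.data/kb_index/'  # not used in this cell yet

if [ ! -f "$metadata" ]; then
    echo "File not found: $metadata" >&2
    exit 1
fi

# Files downloaded for the sample currently being collected
downloaded_files=()

while IFS=, read -r index sample url file_name; do
    mkdir -p "$output_path/$sample"

    # Download with up to 10 parallel connections
    axel -n 10 --output="$output_path/$sample/$file_name" "$url"

    downloaded_files+=("$output_path/$sample/$file_name")

    # Each sample is expected to have 4 FASTQ files (I1, I2, R1, R2)
    if [ "${#downloaded_files[@]}" -eq 4 ]; then
        echo "Processing files for sample: $sample"
        # The mapping step is not implemented yet

        echo "Deleting downloaded files for sample: $sample"
        # The rm is commented out, so nothing is deleted yet despite the message
        #rm "${downloaded_files[@]}"
        downloaded_files=()
    fi
done < "$metadata"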
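The loop above treats every four consecutive rows as one sample, which only holds if each sample has exactly four files and the rows are sorted by sample. A minimal sketch of a check that groups by sample name instead and reports samples with a different file count. The sample names and file names are toy values, and both helper names are hypothetical.

```python
from collections import defaultdict

# Toy (sample, file_name) pairs; sampleB is missing its R2 file on purpose.
rows = [
    ("sampleA", "sampleA_I1.fastq.gz"),
    ("sampleA", "sampleA_I2.fastq.gz"),
    ("sampleA", "sampleA_R1.fastq.gz"),
    ("sampleA", "sampleA_R2.fastq.gz"),
    ("sampleB", "sampleB_I1.fastq.gz"),
    ("sampleB", "sampleB_I2.fastq.gz"),
    ("sampleB", "sampleB_R1.fastq.gz"),
]


def group_files_by_sample(pairs):
    """Map each sample name to its file names, keeping the input order."""
    grouped = defaultdict(list)
    for sample, file_name in pairs:
        grouped[sample].append(file_name)
    return dict(grouped)


def incomplete_samples(grouped, expected=4):
    """Return {sample: file count} for samples that do not have `expected` files."""
    return {
        sample: len(file_names)
        for sample, file_names in grouped.items()
        if len(file_names) != expected
    }


grouped = group_files_by_sample(rows)
problems = incomplete_samples(grouped)
print(problems)
```

Running such a check on `downloads_table.csv` before starting the downloads would show whether the count-to-four logic is safe for this dataset.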