# **Genomic Data Science - Command Line Reference**


This notebook serves as a structured reference for essential **SAMtools** and **BEDTools** commands used in genomic data analysis. It includes categorized command lists, explanations, and additional useful commands for working with BAM, SAM, BED, and FASTA files.



## **1. SAMtools Commands**
SAMtools is a suite of programs for interacting with high-throughput sequencing data in SAM, BAM, and CRAM formats.

### **1.1 General Information**
```bash
samtools command
```
- Replace `command` with a subcommand name (for example `view` or `sort`). Run without further arguments, most `samtools` subcommands print their usage message.

<br />

### **1.2 Alignment Statistics**
```bash
samtools flagstat example.bam
```
- Counts and summarizes alignment statistics for a BAM file, including the number of mapped and unmapped reads.

<br />

### **1.3 Sorting and Indexing**
```bash
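# Current SAMtools releases take the output file via -o; the older "input prefix" form is no longer accepted.
nohup samtools sort -o example.sorted.bam example.bam &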
```
- Sorts a BAM file by genomic coordinates.
- `nohup` ensures the command continues running after the terminal is closed.
- The `&` runs the command in the background.

```bash
samtools index example.sorted.bam
```
- Creates an index file (`.bai`) for a sorted BAM file, allowing for fast retrieval of data.

<br />

### **1.4 Viewing Alignments**
```bash
samtools view example.bam | more
```
- Displays the alignments in a BAM file page by page.

```bash
samtools view -h example.bam | more
```
- Displays the alignments including the header (`-h` flag).

```bash
samtools view -bS example.sam > output.bam
```
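- Converts a SAM file into a BAM file (`-b` selects BAM output; `-S` is only needed by old SAMtools releases that could not auto-detect SAM input).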

```bash
samtools view -h example.bam > example.sam
```
- Converts a BAM file into a SAM file (human-readable format).

```bash
samtools view -H example.bam
```
- Extracts and displays only the header information from a BAM file.

```bash
samtools view example.bam | cut -f6 | grep "D" | wc -l
```
- Extracts and counts the alignments that contain a deletion (D) in the CIGAR column.

```bash
samtools view example.bam | cut -f6 | grep -c "[ID]"
```
- Extracts and counts the alignments that contain an insertion and/or deletion in the CIGAR column.
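
The CIGAR filter can be tried without a BAM file. The sketch below feeds three made-up SAM records (hypothetical read names and positions) through the same `cut` and `grep` steps; the bracket expression `[ID]` matches either an `I` or a `D` in column 6.

```bash
# Three hypothetical SAM records (11 mandatory tab-separated columns each).
sam_lines=$(printf '%s\n' \
  $'r1\t0\tChr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII' \
  $'r2\t0\tChr1\t200\t60\t4M2D6M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII' \
  $'r3\t0\tChr1\t300\t60\t3M1I2M1D4M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII')

# Column 6 is the CIGAR string; count the records that contain I or D.
indel_count=$(printf '%s\n' "$sam_lines" | cut -f6 | grep -c '[ID]')
echo "$indel_count"
```

Note that `grep -c` counts matching lines, so a read with several indels is still counted once.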

```bash
samtools view athal_wu_0_A.bam | cut -f6 | grep "N" | wc -l
```
- Extracts and counts the alignments that contain a skipped region, i.e. an intron (N), in the CIGAR column. This gives the number of spliced alignments.

```bash
samtools view athal_wu_0_A.bam | cut -f7 | grep -c '*'
```
- Extracts and counts the alignments whose mate reference name (column 7, RNEXT) is `*`, meaning the mate is unmapped or its position is unavailable.

```bash
samtools view athal_wu_0_A.bam | cut -f7 | grep -c '='
```
- Extracts and counts the alignments that show the read's mate mapped to the same chromosome.

```bash
samtools view athal_wu_0_A.bam | cut -f3 | grep -v '*' | wc -l
```
- Counts the alignments whose reference name (column 3) is not `*`, i.e. excludes unmapped reads.

```bash
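samtools view example.bam | cut -f3 | grep "Chr3" | wc -l
```
- Counts the number of alignments reported for Chr3. The pattern is a substring match, so it would also count any reference whose name merely contains `Chr3`.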
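
As a small illustration of substring versus whole-line matching, the sketch below runs both on a made-up list of reference names (the name `Chr30` is hypothetical and only there to show the difference). `grep -x` requires the whole line to match.

```bash
# Hypothetical column-3 values, one per alignment.
ref_names=$(printf 'Chr3\nChr3\nChr30\n*\nChr1\n')

# Substring match: also counts Chr30.
loose=$(printf '%s\n' "$ref_names" | grep -c 'Chr3')
# Whole-line match: counts only Chr3.
exact=$(printf '%s\n' "$ref_names" | grep -c -x 'Chr3')
echo "$loose $exact"
```

On a real file the same idea would be `samtools view example.bam | cut -f3 | grep -c -x "Chr3"`.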
 

<br />

### **1.5 Merging BAM Files**
```bash
nohup samtools merge NA12814.bam NA12814_1.bam NA12814_2.bam &
```
- Merges multiple BAM files into a single BAM file.
- `nohup` ensures the process continues after logout, and `&` runs it in the background.

### **1.6 Additional Commands**
```bash
zcat NA12814_1.fastq.gz | wc -l
```
- Decompresses a gzipped FASTQ file and counts the total number of lines.
- Since FASTQ files have 4 lines per sequence, dividing the output by 4 gives the number of reads.
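
The division by 4 can be wrapped in a small helper. This is a minimal sketch: `fastq_records` is a hypothetical function name, the two reads are made up, and it assumes every record occupies exactly four lines (no wrapped sequences).

```bash
# Reads FASTQ text on stdin and prints the line count divided by 4.
fastq_records() {
  local lines
  lines=$(wc -l)
  echo $(( lines / 4 ))
}

# Two hypothetical reads, four lines each.
reads=$(printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | fastq_records)
echo "$reads"
```

With a real file this would be used as `zcat NA12814_1.fastq.gz | fastq_records`.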

```bash
samtools view -h athal_wu_0_A.bam Chr3:11777000-11794000 > extracted_alignments.sam
```
- Extracts only the alignments in the given region and saves them, with the header, to a SAM file. Region queries require the BAM file to be coordinate-sorted and indexed.

```bash
samtools view -c -f 8 extracted_alignments.sam
```
- Counts the alignments whose mate is unmapped (`-c` counts records; `-f 8` keeps only records with flag bit 8 set).

```bash
samtools view -c -F 8 extracted_alignments.sam
```
- Counts the alignments whose mate is mapped (`-F 8` keeps only records with flag bit 8 unset).

```bash
samtools view -c -f 4 extracted_alignments.sam
```
- Counts the alignments where the read itself is unmapped (flag bit 4 set).
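
The `-f` and `-F` options test individual bits of the SAM FLAG field (column 2). The sketch below shows the underlying bit test in plain bash; `has_flag` is a hypothetical helper, and the flag values 73 and 99 are illustrative examples rather than values taken from the files above.

```bash
# has_flag FLAG BITS: succeeds when every bit in BITS is set in FLAG.
has_flag() {
  (( ($1 & $2) == $2 ))
}

# 73 = 1 (paired) + 8 (mate unmapped) + 64 (first in pair)
if has_flag 73 8; then mate_unmapped=yes; else mate_unmapped=no; fi
# 99 = 1 + 2 + 32 + 64: bit 4 (read unmapped) is not set
if has_flag 99 4; then read_unmapped=yes; else read_unmapped=no; fi
echo "$mate_unmapped $read_unmapped"
```

In these terms, `-f 8` keeps records for which the test succeeds and `-F 8` keeps those for which it fails.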
  
<br />

---

## **2. BEDTools Commands**
BEDTools is a powerful suite for genomic interval operations, allowing for manipulation of BED, GFF, and BAM files.

### **2.1 General Commands**
```bash
bedtools >& bedtools.log
vi bedtools.log
```
- Captures the usage message that `bedtools` prints when run without arguments (both standard output and standard error, via `>&`) in a log file and opens it in `vi` for review.

<br />

### **2.2 Finding Overlapping Regions**
```bash
bedtools intersect -wo -a REfSeq.gtf -b Alus.bed | more
```
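- Reports each feature in the `-a` file (RefSeq gene annotations in GTF format) that overlaps a feature in the `-b` file (`Alus.bed`).
- The `-wo` flag writes the original `-a` and `-b` entries followed by the number of overlapping bases, one line per overlapping pair.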

```bash
bedtools intersect -wo -a REfSeq.gtf -b Alus.bed | cut -f9 | cut -d ' ' -f2 | more
```
- Extracts the 9th column from the intersection output (the GTF attribute field) and then isolates the second space-separated word in it, which is typically the quoted `gene_id` value.

```bash
bedtools intersect -wo -a REfSeq.gtf -b Alus.bed | cut -f9 | cut -d ' ' -f2 | sort -u | wc -l
```
- Finds the unique gene names from overlapping intervals and counts them.
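
To see what the `cut -d ' ' -f2 | sort -u` steps operate on, the sketch below applies them to three made-up GTF attribute strings (the IDs `NM_001` and `NM_002` are hypothetical). It assumes `gene_id` is the first attribute, and additionally strips the quotes and semicolon with `tr`, which the pipeline above does not do.

```bash
# Hypothetical column-9 values as they would come out of `cut -f9`.
gtf_attrs=$(printf '%s\n' \
  'gene_id "NM_001"; transcript_id "NM_001";' \
  'gene_id "NM_001"; transcript_id "NM_001";' \
  'gene_id "NM_002"; transcript_id "NM_002";')

# Second word is the quoted gene ID; clean it, deduplicate, and count.
unique_genes=$(printf '%s\n' "$gtf_attrs" |
  cut -d ' ' -f2 | tr -d '";' | sort -u | awk 'END { print NR }')
echo "$unique_genes"
```

Counting with `awk 'END { print NR }'` instead of `wc -l` only avoids the leading spaces some `wc` implementations print.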

```bash
bedtools intersect -split -wo -a REfSeq.bed -b Alus.bed | more
```
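- Finds overlaps while treating each block of a BED12 entry (e.g., each exon) as a separate interval (`-split`), so that intronic gaps do not count as overlap.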
  
```bash
bedtools intersect -split -wao -a REfSeq.bed -b Alus.bed | more
```
- Finds block-level (`-split`) overlaps and, with `-wao`, also reports `-a` features that have no overlap at all, giving them an overlap length of 0.

<br />

### **2.3 Handling BAM and BED Conversions**
```bash
bedtools bamtobed -cigar -i NA12814.bam
```
- Converts a BAM file to BED format, adding each alignment's CIGAR string as an extra column (`-cigar` flag).

```bash
bedtools bamtobed -split -i NA12814.bam
```
- Converts BAM to BED while considering split alignments (e.g., exon junctions).

```bash
bedtools bedtobam -i REfSeq.gtf -g hg38c.hdrs > refseq.bam
```
- Converts an annotation file (here a GTF) to a BAM file, using a genome file (`hg38c.hdrs`) that supplies the reference sequence names and lengths.

```bash
bedtools bedtobam -bed12 -i REfSeq.bed -g hg38c.hdrs > refseq.bam
```
- Converts BED12 format to BAM format.

<br />

### **2.4 Extracting FASTA Sequences**
```bash
bedtools getfasta -fi /data1/igm3/genome/hg38/hg38c.fa -bed REfSeq.gtf -fo RefSeq.gtf.fasta
```
- Extracts FASTA sequences from a reference genome (`hg38c.fa`) based on genomic regions in a GTF file.

```bash
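bedtools getfasta -split -fi /data1/igm3/genome/hg38/hg38c.fa -bed REfSeq.bed -fo RefSeq.bed.fasta
```
- Extracts FASTA sequences from BED12 features, concatenating only the blocks (exons) of each feature and skipping the introns. `-split` needs BED12 input, hence the `.bed` file here.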

<br />

---

## **3. Additional Useful Commands**
Here are some extra commands that may be useful for genomic analysis:

```bash
bedtools intersect -a alignments.bed -b athal_wu_0_A_annot.gtf -wo | awk '$NF >= 10' | wc -l
```
- Counts how many of the overlaps are 10 bases or longer (with `-wo`, the last column is the overlap length).
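
The filter on the last column can be checked on inline data. The lines below are a made-up, shortened stand-in for `-wo` output (real output carries all columns of both input files); only the final column, the overlap length, matters for the filter.

```bash
# Hypothetical overlap records: A interval, B interval, overlap length.
wo_lines=$(printf 'Chr1\t100\t200\tChr1\t150\t400\t50\nChr1\t300\t320\tChr1\t315\t500\t5\nChr2\t10\t40\tChr2\t20\t30\t10\n')

# Count the records whose last field is at least 10.
long_overlaps=$(printf '%s\n' "$wo_lines" |
  awk '$NF >= 10 { n++ } END { print n + 0 }')
echo "$long_overlaps"
```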
  
### **3.1 Compute Read Depth**
```bash
samtools depth example.bam | awk '{sum+=$3} END {print sum/NR}'
```
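- Computes the average depth over the positions that `samtools depth` reports. By default positions with zero coverage are omitted, so add `-a` to average over all positions.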
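
A slightly more defensive variant of the averaging step is sketched below on made-up three-column input (chromosome, position, depth). It guards against empty input, which would otherwise cause a division by zero in `awk`.

```bash
# Hypothetical `samtools depth` output: chromosome, position, depth.
depth_lines=$(printf 'Chr1\t1\t10\nChr1\t2\t20\nChr1\t3\t30\n')

# Average column 3; print 0 when there are no input lines.
mean_depth=$(printf '%s\n' "$depth_lines" |
  awk '{ sum += $3 } END { if (NR > 0) print sum / NR; else print 0 }')
echo "$mean_depth"
```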

<br />

### **3.2 Extract Mapped Reads Count**
```bash
samtools stats example.bam | grep "reads mapped:"
```
- Extracts the number of mapped reads from `samtools stats` output.

<br />

### **3.3 Compute Coverage**
```bash
bedtools genomecov -ibam example.bam -bg > coverage.bedgraph
```
- Generates genome-wide coverage in bedGraph format (`-bg`) from a BAM file. The BAM file must be sorted by position.

<br />

### **3.4 Merge Overlapping Intervals**
```bash
bedtools merge -i sorted.bed
```
- Merges overlapping (and directly adjacent) BED intervals into larger contiguous regions. The input must be sorted by chromosome and then by start position.
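
To make the merging rule concrete, here is a rough `awk` sketch of the idea on made-up intervals. It is not how BEDTools is implemented, and `merge_intervals` is a hypothetical helper; it only assumes three-column input sorted by chromosome and start, and merges an interval into the current one when its start does not exceed the current end.

```bash
merge_intervals() {
  # Expects BED-like lines sorted by chromosome, then by start.
  awk 'BEGIN { OFS = "\t" }
       NR == 1 { chrom = $1; s = $2 + 0; e = $3 + 0; next }
       $1 == chrom && $2 + 0 <= e { if ($3 + 0 > e) e = $3 + 0; next }
       { print chrom, s, e; chrom = $1; s = $2 + 0; e = $3 + 0 }
       END { if (NR > 0) print chrom, s, e }'
}

# Four hypothetical intervals; the first two overlap.
merged=$(printf 'chr1\t100\t200\nchr1\t150\t300\nchr1\t400\t500\nchr2\t100\t150\n' | merge_intervals)
printf '%s\n' "$merged"
```

Unsorted input would break the single-pass comparison, which is why sorting comes first (for example with `sort -k1,1 -k2,2n`).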

<br />

---