## Crispy CRISPRs
David Tian, Audra Devoto, Zoey Werbin and Justine Albers

#### Project Overview: 

We studied bacterial CRISPR-Cas systems in the pink berry consortium of the Sippewissett Salt Marsh. We matched CRISPR spacer sequences from the full, unassembled metagenomic reads of Bacteroidetes, Alphaproteobacteria, purple sulfur bacteria, and sulfate-reducing bacteria found in the pink berries against sequences of potential target phage. We created a database of metagenomic reads that did not belong to any binned genome and used it to search for SNC (spacer-containing, non-CRISPR) reads. We then BLAST-searched the SNCs to investigate whether they matched phage sequences. We also constructed a phylogenetic tree for the pink berry bacteria.

#### Extracting CRISPR spacers:

We primarily used CRASS to identify CRISPR spacers in the pink berry metagenome. Download the following programs:
 
Apache (https://httpd.apache.org/download.cgi) - a collaborative software development effort aimed at creating a robust, commercial-grade, featureful, and freely-available source code implementation of an HTTP (Web) server

Zlib (http://zlib.net/) - software library used for data compression

Bedtools (https://github.com/arq5x/bedtools2/releases) - a Swiss-army knife of tools for a wide range of genomics analysis tasks. The most widely used tools enable genome arithmetic: that is, set theory on the genome. For example, bedtools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely used genomic file formats such as BAM, BED, GFF/GTF, and VCF.

CRASS (http://ctskennerton.github.io/crass/) - a tool for finding and assembling reads from genomic and metagenomic datasets that contain Clustered Regularly Interspaced Short Palindromic Repeats (CRISPRs). CRASS searches through the dataset and identifies reads that contain repeated k-mers of a specific length separated by a spacer sequence. These candidate direct repeats are curated internally to remove bad matches, and the reads containing direct repeats are then output for further analysis.

    CRASS input: Moleculo metagenome assembled into contigs

    CRASS output: a fasta file with spacer sequences and identifiers
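
To make the idea of "repeated k-mers separated by a spacer" concrete, here is a toy sketch in Python. It is not CRASS's algorithm, only an illustration of the pattern CRASS looks for; the function name, the parameter defaults, and the example read are all made up for this sketch.

```python
from collections import defaultdict


def find_repeat_kmers(read, k=8, min_spacer=4, max_spacer=50):
    """Toy illustration (hypothetical helper, not part of CRASS).

    Returns a dict mapping each k-mer that occurs at least twice in `read`,
    with min_spacer..max_spacer bases between consecutive copies, to the
    candidate spacer found between those two copies.
    """
    # Record every start position of every k-mer in the read.
    positions = defaultdict(list)
    for i in range(len(read) - k + 1):
        positions[read[i:i + k]].append(i)

    hits = {}
    for kmer, starts in positions.items():
        for a, b in zip(starts, starts[1:]):
            gap = b - (a + k)  # bases between the end of one copy and the start of the next
            if min_spacer <= gap <= max_spacer:
                hits[kmer] = read[a + k:b]  # candidate spacer sequence
                break
    return hits


# Made-up read: an 8 bp repeat, a 10 bp spacer, then the same repeat again.
example_read = "GTTTCCGA" + "ACTGATCGGA" + "GTTTCCGA"
print(find_repeat_kmers(example_read))
```

A real CRISPR finder also has to tolerate mismatches in the repeats and merge overlapping k-mers into full-length direct repeats, which this sketch deliberately skips.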

**Convert file from BAM to FastQ format:**

In [None]:
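# Convert the BAM file of corrected reads to FASTQ
bamToFastq -i srb1-otuB_mp-pacbio_correctedReads.bam -fq ../../srb1-otuB_mp-pacbio_correctedReads.fastq
# Run CRASS on the FASTQ file to find CRISPR-containing reads
crass srb1-otuB_mp-pacbio_correctedReads.fastq
# Build a nucleotide BLAST database from mydb.fsa
makeblastdb -in mydb.fsa -parse_seqids -dbtype nucl
# Unwrap multi-line FASTA records so each sequence sits on one line (prints to stdout)
sed -e 's/\(^>.*$\)/#\1#/' asmMeta.all.copy.fasta | tr -d "\r" | tr -d "\n" | sed -e 's/$/#/' | tr "#" "\n" | sed -e '/^$/d'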

Next, we need to remove all reads that contain CRISPRs from the metagenome. First, use the following command to **combine all CRISPR groups:**

In [None]:
cat Group_* > all_crispr_reads.fa

**Install BioPython**, a Python package for bioinformatics:

In [None]:
sudo pip install biopython

**Linearize the FASTA sequences:**

In [None]:
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < [YOUR FASTA] > [OUTPUT FILE NAME]

**Change format back to FASTA:**

In [None]:
tr "\t" "\n" < [OUTPUT FROM LAST STEP] > [FINAL OUTPUT NAME]
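
The two steps above (linearize, then convert back to FASTA) can also be sketched in plain Python. This is only an illustrative alternative to the `awk` and `tr` commands, assuming ordinary FASTA input; the function name and the example records are hypothetical.

```python
def linearize_fasta(lines):
    """Yield (header, sequence) pairs with wrapped sequence lines joined.

    Hypothetical helper for illustration: `lines` is any iterable of text
    lines, such as an open file handle.
    """
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line:
            chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)


# Made-up example: two records, the first wrapped over two lines.
wrapped = [">read1 desc\n", "ACGT\n", "TTGA\n", ">read2\n", "GG\n"]
for header, seq in linearize_fasta(wrapped):
    print(header)
    print(seq)
```

Passing an open file handle instead of the `wrapped` list and writing each header and sequence to an output file would give a FASTA file with exactly one sequence line per record.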

**Create the following script, called fasta_remove.py:**

In [None]:
fasta_remove.py

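#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Copy a FASTA file, skipping every record whose sequence is listed in a removal file."""

import sys
from Bio import SeqIO

fasta_file = sys.argv[1]   # Input FASTA file
remove_file = sys.argv[2]  # Sequences to remove, one per line (e.g. a linearized FASTA)
result_file = sys.argv[3]  # Output FASTA file

# Collect every non-empty line of the removal file. Header lines are stored too,
# but they never equal a nucleotide string, so they do not affect the filtering.
remove = set()
with open(remove_file) as f:
    for line in f:
        line = line.strip()
        if line:
            remove.add(line)

with open(result_file, "w") as out:
    for record in SeqIO.parse(fasta_file, "fasta"):
        # str(record.seq) replaces Seq.tostring(), which recent Biopython releases no longer provide
        nuc = str(record.seq)
        if nuc and nuc not in remove:
            SeqIO.write([record], out, "fasta")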

chmod +x fasta_remove.py

**Call the script we just created:**

In [None]:
./fasta_remove.py asmMeta.fasta crass/all_crispr_reads.fa no_crispr_asmMeta.fasta

**Create new BLAST database with this file:**

In [None]:
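# Build the database; -outfmt is not a makeblastdb option, so it is dropped here
makeblastdb -in no_crispr_asmMeta.fasta -parse_seqids -dbtype nucl -out asmMetaNoCRISPR
# Search the spacers against it; -outfmt 10 writes comma-separated hits, matching the .csv name
blastn -db asmMetaNoCRISPR -query crass/all_spacers.fa -outfmt 10 -out OUT.csv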

**Find all the matching IDs in the BLAST output and save them to a list file, one ID per line. Then extract those sequences from the full metagenome with:**

In [None]:
seqtk subseq in.fq name.lst > out.fq
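
One way to collect the matching IDs is sketched below. It assumes the BLAST output was written in comma-separated tabular form (`-outfmt 10`), where the second column is the subject sequence ID; the function name and the example rows are hypothetical, not real results.

```python
import csv
import io


def hit_ids(blast_csv):
    """Return the unique subject IDs (column 2) in first-seen order.

    Hypothetical helper: `blast_csv` is an open text handle on BLAST
    comma-separated output (assumed to come from -outfmt 10).
    """
    seen = {}
    for row in csv.reader(blast_csv):
        if len(row) >= 2:
            seen.setdefault(row[1], None)  # dict keys keep insertion order
    return list(seen)


# Made-up rows in the default 12-column layout:
# qseqid,sseqid,pident,length,mismatch,gapopen,qstart,qend,sstart,send,evalue,bitscore
example = io.StringIO(
    "spacer1,contigA,100.000,32,0,0,1,32,500,531,1e-10,60.2\n"
    "spacer2,contigB,96.875,32,1,0,1,32,90,121,1e-08,54.7\n"
    "spacer3,contigA,100.000,30,0,0,1,30,800,829,1e-09,56.5\n"
)
ids = hit_ids(example)
print("\n".join(ids))
```

Writing the printed IDs to a file, one per line, gives the kind of name list that `seqtk subseq` takes as its second argument.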

#### Identifying spacer-containing, non-CRISPR reads (SNCs): 

To create a database specific to the pink berry community that excluded sequences from the bacteria we extracted CRISPRs from, we filtered reads that already belonged to a binned genome out of the whole metagenome. When we searched our extracted spacers against this database, over 50% of the spacers hit metagenomic sequences. Following Andersson & Banfield (2008), we called these metagenomic sequences SNC (spacer-containing, non-CRISPR) reads.
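
The fraction of spacers with at least one hit can be computed along the following lines. This is a minimal sketch, not the exact calculation we ran; the function name and the example IDs are hypothetical, and it assumes you already have the list of all spacer IDs and the query IDs that appear in the BLAST output.

```python
def fraction_with_hits(spacer_ids, hit_query_ids):
    """Fraction of distinct spacers that appear at least once among the BLAST query IDs.

    Hypothetical helper: IDs in `hit_query_ids` that are not known spacers are ignored.
    """
    spacers = set(spacer_ids)
    if not spacers:
        return 0.0
    return len(spacers & set(hit_query_ids)) / len(spacers)


# Made-up IDs: four spacers, two of which (s1 and s3) have hits.
all_spacers = ["s1", "s2", "s3", "s4"]
queries_with_hits = ["s1", "s1", "s3"]
print(fraction_with_hits(all_spacers, queries_with_hits))
```

Counting distinct spacers rather than hit rows keeps a spacer that matches many reads from inflating the percentage.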

#### Literature Cited: 

1) Andersson, A. F., & Banfield, J. F. (2008). Virus population dynamics and acquired virus resistance in natural microbial communities. Science, 320(5879), 1047-1050.