# Example: paired-end IGH sequences
## Preparation
Prepare a file that lists the paths to the sample FASTQ files. Each file name is expected to follow this pattern:

SAMPLENAME_L001_R1_001.fastq.gz

SAMPLENAME_L001_R2_001.fastq.gz
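
As a minimal sketch of how this naming pattern can be checked, the snippet below strips the `_L001_R1_001.fastq.gz` or `_L001_R2_001.fastq.gz` suffix to recover the sample name. The function `sample_name_of` is hypothetical and not part of the pipeline; it assumes the lane is always `L001`, as in the pattern above.

```shell
# Hypothetical helper (not part of the pipeline): derive SAMPLENAME from a
# path that follows SAMPLENAME_L001_R[12]_001.fastq.gz
sample_name_of() {
    local file
    file=$(basename "$1")
    case "$file" in
        *_L001_R[12]_001.fastq.gz)
            # Drop the lane/read suffix, keep the sample name
            echo "${file%_L001_R[12]_001.fastq.gz}"
            ;;
        *)
            echo "unexpected file name: $file" >&2
            return 1
            ;;
    esac
}

sample_name_of TESTDATA/test_L001_R1_001.fastq.gz
```

For the test data this prints `test`; a name that does not match the pattern makes the function return a non-zero status.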

In [None]:
echo "TESTDATA/test_L001_R1_001.fastq.gz" > SAMPLES
echo "TESTDATA/test_L001_R2_001.fastq.gz" >> SAMPLES
wait

## Configuration

This example is implemented in 'execute-all.sh'

Define the sequencing run. The run name is only used to create a directory on a WebDAV server, if one is used to store results.

The variables mids, organism, cell and celltype are mandatory:
* mids: file with regular expression for the MIDs in this run (see below)
* organism: human or mouse
* cell: TRA, TRB, IGH, IGK or IGL
* celltype: cell_organism in uppercase, e.g. IGH_HUMAN
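
Since celltype follows from cell and organism, it could also be derived instead of typed by hand. This is only a sketch under the assumption that the rule "cell, underscore, organism, in uppercase" holds for every supported combination; the notebook itself sets celltype explicitly.

```shell
# Example values taken from this notebook
organism=human
cell=IGH

# Assumption: celltype is always ${cell}_${organism} in uppercase
celltype=$(echo "${cell}_${organism}" | tr '[:lower:]' '[:upper:]')
echo "${celltype}"
```

With the values above this prints `IGH_HUMAN`.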

In [None]:
cat MIDS-miseq.txt

In [None]:
run=test
mids=MIDS-miseq.txt
organism=human
cell=IGH
celltype=IGH_HUMAN
wait

Define reference database

In [None]:
refs="${cell}V_${organism}.fasta ${cell}J_${organism}.fasta"
v="${cell}V_${organism}"
j="${cell}J_${organism}"
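
To make the naming scheme concrete, the sketch below repeats the definitions with the example values from this notebook and prints what they expand to. Nothing here is new to the pipeline; it only shows the resulting strings.

```shell
cell=IGH
organism=human
refs="${cell}V_${organism}.fasta ${cell}J_${organism}.fasta"
v="${cell}V_${organism}"
j="${cell}J_${organism}"

# $refs is left unquoted on purpose so that it splits on whitespace,
# one reference file per loop iteration
for ref in $refs; do
    echo "${ref}"
done
echo "${v} ${j}"
```

This prints `IGHV_human.fasta` and `IGHJ_human.fasta` on separate lines, followed by `IGHV_human IGHJ_human`.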

## Data analysis

Get list with fastq files

In [None]:
samples=`cat SAMPLES`  # all paths listed in SAMPLES (R1 and R2)
r1_samples=`grep R1_001 SAMPLES`

In [None]:
echo $samples

Run [FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). The example takes about 12 seconds.

In [None]:
./run-fastqc.sh ${samples}
wait

[FastQC result test R1](./TESTDATA/test_L001_R1_001_fastqc.html)
[FastQC result test R2](./TESTDATA/test_L001_R2_001_fastqc.html)

Assemble both ends of the sequence pairs with [PEAR](http://sco.h-its.org/exelixis/web/software/pear/doc.html). This step can take a while (wait until you see "FINISHED").

In [None]:
./batch-pear.sh ${r1_samples}
wait
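
Only the R1 files are passed to the script, so the matching R2 file has to be found from the R1 name. The contents of `batch-pear.sh` are not shown here; the following dry run is a sketch under the assumption that the mate is found by replacing `_R1_001` with `_R2_001`. The function `pear_command` and the choice of output prefix are hypothetical. `-f`, `-r` and `-o` are PEAR's options for the forward reads, the reverse reads and the output prefix.

```shell
# Hypothetical dry run: print (do not execute) a PEAR command for one R1 file
pear_command() {
    local r1 r2 prefix
    r1=$1
    # Mate file, found by name substitution (assumption)
    r2=$(echo "${r1}" | sed 's/_R1_001/_R2_001/')
    # Output prefix without directory and read suffix (assumption)
    prefix=$(basename "${r1%_R1_001.fastq.gz}")
    echo "pear -f ${r1} -r ${r2} -o ${prefix}"
}

pear_command TESTDATA/test_L001_R1_001.fastq.gz
```

Printing the command first is a cheap way to check that every R1 file is paired with the intended R2 file before starting a long run.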

Continue with the assembled fastq files. If you have single-end reads you normally start from here.

In [None]:
samples=`ls *.assembled.fastq.gz`

Split sequences based on their Molecular IDentifier (MID). This is an extra control for contamination. Arguments: the MID file, the output directory and the list of samples.

In [None]:
python2 FastqSplitOnMid.py ${mids} split ${samples}
wait

Continue with the assembled FASTQ files, now split per MID

In [None]:
samples=`ls split/*.fastq.gz`

FastQC report on the split sample files

In [None]:
./run-fastqc.sh ${samples}
wait

Extract the CDR3 sequence

In [None]:
python2 TranslateAndExtractCdr3.py ${celltype} ${samples}
wait
echo "FINISHED"

Align sequences ([BWA](http://bio-bwa.sourceforge.net/) and [Picard tools](http://broadinstitute.github.io/picard/)) against [IMGT](http://imgt.org/) and call raw SNPs with [Samtools](http://samtools.sourceforge.net/) and [VarScan](http://dkoboldt.github.io/varscan/) (any mutation is accepted, including sequencing errors)

In [None]:
for ref in $refs; do
    ./batch-align.sh ${ref} ${samples} >> align.log 2>> align.err
done
wait
echo "FINISHED"

## Combine all information and generate reports

In [None]:
mkdir -p final  # -p: no error if the directory already exists

for sample in ${samples}; do
    mydir=`dirname ${sample}`
    prefix=`basename ${sample} .fastq.gz`

    # Combine MID, CDR3, V, J and sequence information
    midFile=`echo ${mydir}/${prefix}|perl -ne 's/(.+)-.+$/$1-report.txt/;print;'`
    cdr3File=${sample}-${celltype}-CDR3.csv
    vFile=${prefix}-${v}-easy-import.txt
    jFile=${prefix}-${j}-easy-import.txt
    seqFile=${sample}-${celltype}.csv
    outFile="final/${prefix}-${celltype}-all_info.csv"
    cloneFile="final/${prefix}-${celltype}-clones.csv"
    cloneSubsFile="final/${prefix}-${celltype}-clones-subs.csv"
    cloneMainsFile="final/${prefix}-${celltype}-clones-mains.csv"
    totalFile="final/${prefix}-${celltype}-productive.txt"
    python combine-immuno-data.py ${midFile} ${cdr3File} ${vFile} ${jFile} ${seqFile} ${outFile} ${cloneFile} ${cloneSubsFile} ${cloneMainsFile} ${totalFile}
    wait

done

echo "FINISHED"
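
The file names used in the loop are all derived from the path of one split sample. The sketch below works through that derivation for a single, made-up file name (the `MID01` part is hypothetical; the real names depend on what `FastqSplitOnMid.py` writes). It uses shell parameter expansion in place of the perl one-liner, which gives the same result for names that do not end in a hyphen.

```shell
# Hypothetical input: one split file; "MID01" is made up for illustration
sample=split/test_L001.assembled-MID01.fastq.gz
celltype=IGH_HUMAN
v=IGHV_human

mydir=$(dirname "${sample}")               # split
prefix=$(basename "${sample}" .fastq.gz)   # test_L001.assembled-MID01

# Drop the last "-..." part of the name and append "-report.txt"
base="${mydir}/${prefix}"
midFile="${base%-*}-report.txt"
cdr3File="${sample}-${celltype}-CDR3.csv"
vFile="${prefix}-${v}-easy-import.txt"
outFile="final/${prefix}-${celltype}-all_info.csv"

printf '%s\n' "${midFile}" "${cdr3File}" "${vFile}" "${outFile}"
```

Note that the MID report and the CDR3 file are looked up next to the sample (in `split/`), while the V and J files are looked up in the current directory.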

Guess which files contain the correct MID (file with most entries) and store that information in a separate directory.

In [None]:
ip_address=`hostname -I`
ips=($ip_address)
ip=${ips[0]}

wc -l final/*all_info.csv > wc-${ip}.txt
wait
python2 select-correct-mids.py wc-${ip}.txt > mv-samples-with-correct-mid.sh
wait
mkdir -p final/correct-mid
wait
cd final
bash ../mv-samples-with-correct-mid.sh
wait
mv correct-mid/*-productive.txt .
cd ..
echo "FINISHED"
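
The selection rule "file with most entries" can be illustrated with standard tools. The logic of `select-correct-mids.py` is not shown here, so this is only a sketch of the core idea for a single sample, using two made-up files in a temporary directory.

```shell
# Two made-up result files for one sample, with 3 and 1 lines
tmpdir=$(mktemp -d)
printf 'a\nb\nc\n' > "${tmpdir}/sample-MID1-all_info.csv"
printf 'a\n' > "${tmpdir}/sample-MID2-all_info.csv"

# Count lines per file, drop the "total" line that wc adds,
# sort by count (largest first) and keep the path of the top entry
best=$(wc -l "${tmpdir}"/*all_info.csv \
    | grep -v ' total$' \
    | sort -k1,1nr \
    | head -n 1 \
    | awk '{print $2}')

basename "${best}"
rm -rf "${tmpdir}"
```

Here the file for `MID1` is selected because it has the most lines. A real run has several samples, so the comparison would need to be done per sample rather than over all files at once.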

Correct V gene assignments

In [None]:
python2 re-assign-v-genes.py final/correct-mid/*-all_info.csv
wait
mv *.rr.* final/correct-mid
echo "FINISHED"

In [None]:
ls -l final/correct-mid