# Read mapping

Before performing downstream analyses (variant calling, expression analysis, etc.), you need to map your reads to the reference.
<br><br>
### 1. Create index
First, you will need to create an index of the reference genome (tip: use the `bowtie2-build` command).


In [3]:
cd 
cd data/refs

In [None]:
bowtie2-build genome.fna genome_index

In [4]:
ls

genes.gff           genome_index.2.bt2  genome_index.rev.1.bt2
genome.fna          genome_index.3.bt2  genome_index.rev.2.bt2
genome_index.1.bt2  genome_index.4.bt2  prots.faa


<br><br>
### 2. Map samples to ref genome
Next, map each of your samples to the reference genome using Bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2/). Tips: check `bowtie2 --help` for a parameter that lets you use fasta instead of fastq files as input, and redirect the stderr output of bowtie2 to a file with `2>` so that you can collect the bowtie2 mapping stats.


In [7]:
# create file to write out the mapping stats of bowtie alignment
cd
touch ./Genomics/mappingStats.txt

In [8]:
echo "\n\nHigh temp 1" >> ./Genomics/mappingStats.txt
bowtie2 -x ./data/refs/genome_index -f ./data/reads/hightemp_01.fasta -S ./Genomics/hightemp_01.sam 2>>./Genomics/mappingStats.txt

echo "\n\nHigh temp 2" >> ./Genomics/mappingStats.txt
bowtie2 -x ./data/refs/genome_index -f ./data/reads/hightemp_02.fasta -S ./Genomics/hightemp_02.sam 2>>./Genomics/mappingStats.txt

echo "\n\nNormal 1" >> ./Genomics/mappingStats.txt
bowtie2 -x ./data/refs/genome_index -f ./data/reads/normal_01.fasta -S ./Genomics/normal_01.sam 2>>./Genomics/mappingStats.txt

echo "\n\nNormal 2" >> ./Genomics/mappingStats.txt
bowtie2 -x ./data/refs/genome_index -f ./data/reads/normal_02.fasta -S ./Genomics/normal_02.sam 2>>./Genomics/mappingStats.txt
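
The four near-identical commands above could also be generated in a loop. The sketch below is a dry run under stated assumptions: `map_cmd` is a hypothetical helper (not part of bowtie2) that only prints the mapping command for each sample, using the same paths and flags as above, so the commands can be checked before anything is executed.

```shell
# Hypothetical helper: print the bowtie2 command for one sample name.
# Paths and flags mirror the commands used in the cell above.
map_cmd() {
    printf 'bowtie2 -x ./data/refs/genome_index -f ./data/reads/%s.fasta -S ./Genomics/%s.sam 2>> ./Genomics/mappingStats.txt\n' "$1" "$1"
}

# Dry run: print one command per sample instead of executing it.
for sample in hightemp_01 hightemp_02 normal_01 normal_02; do
    map_cmd "$sample"
done
```

Piping the loop output to `sh` would run the commands instead of printing them. For the labels, `printf '\n\nHigh temp 1\n'` writes real blank lines, whereas `echo "\n\n..."` in bash writes the backslashes literally, which is why the stats file shows `\n\n` in front of each sample name.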

In [9]:
cat ./Genomics/mappingStats.txt

\n\nHigh temp 1
	LANGUAGE = (unset),
	LC_ALL = (unset),
	LC_CTYPE = "UTF-8",
	LANG = "es_ES.UTF-8"
    are supported and installed on your system.
291814 reads; of these:
  291814 (100.00%) were unpaired; of these:
    17 (0.01%) aligned 0 times
    289275 (99.13%) aligned exactly 1 time
    2522 (0.86%) aligned >1 times
99.99% overall alignment rate
\n\nHigh temp 2
	LANGUAGE = (unset),
	LC_ALL = (unset),
	LC_CTYPE = "UTF-8",
	LANG = "es_ES.UTF-8"
    are supported and installed on your system.
289637 reads; of these:
  289637 (100.00%) were unpaired; of these:
    7 (0.00%) aligned 0 times
    287101 (99.12%) aligned exactly 1 time
    2529 (0.87%) aligned >1 times
100.00% overall alignment rate
\n\nNormal 1
	LANGUAGE = (unset),
	LC_ALL = (unset),
	LC_CTYPE = "UTF-8",
	LANG = "es_ES.UTF-8"
    are supported and installed on your system.
290331 reads; of these:
  290331 (100.00%) were unpaired; of these:
    8 (0.00%) aligned 0 times
    287957 (99.18%) aligned exactly 1 time
    2366 

<br><br>
__How many records are in your mapping (.sam/.bam) files? How many different reads are in your mapping (.sam/.bam) files? How do these numbers compare with the number of reads in your original samples and with the alignment statistics (stats from bowtie2)?__<br>
The number of records in each mapping file (291814, 289637, 290331 and 291324) equals the total number of reads reported in the bowtie2 stats above, and it is also the number of reads in the original fasta files. The number of different reads is the same: every read name is unique, because by default bowtie2 writes exactly one record per read, including the reads that did not align.


In [14]:
cd
cd ./Genomics
samtools view -c hightemp_01.sam
samtools view -c hightemp_02.sam
samtools view -c normal_01.sam
samtools view -c normal_02.sam

291814
289637
290331
291324


In [19]:
samtools view hightemp_01.sam | cut -f1 | sort | uniq | wc -l
samtools view hightemp_02.sam | cut -f1 | sort | uniq | wc -l
samtools view normal_01.sam | cut -f1 | sort | uniq | wc -l
samtools view normal_02.sam | cut -f1 | sort | uniq | wc -l

291814
289637
290331
291324
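
Both counts above can also be obtained in a single pass over the SAM text. The following is a minimal sketch, assuming plain-text SAM on stdin; `sam_counts` is a hypothetical helper name, and the three toy records are made up for illustration rather than taken from the real samples.

```shell
# Hypothetical helper: count alignment records and distinct read names
# (QNAME, column 1) in SAM text on stdin, skipping "@" header lines.
sam_counts() {
    awk -F '\t' '
        /^@/ { next }
        { records++; if (!($1 in seen)) { seen[$1] = 1; names++ } }
        END { printf "%d records, %d distinct reads\n", records, names }
    '
}

# Toy SAM (made-up reads): read2 appears in two records.
{
    printf '@HD\tVN:1.0\n'
    printf 'read1\t0\tchr\t10\t42\t5M\t*\t0\t0\tACGTA\t*\n'
    printf 'read2\t0\tchr\t50\t1\t5M\t*\t0\t0\tACGTA\t*\n'
    printf 'read2\t256\tchr\t90\t1\t5M\t*\t0\t0\tACGTA\t*\n'
} | sam_counts
# prints: 3 records, 2 distinct reads
```

On the real files this could be run as `sam_counts < hightemp_01.sam`. The two numbers only differ when a read has more than one record, which is not the case here.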


<br><br>
__How many reads map to a single location and how many to more than one (multiple mapping reads)?__<br>
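`XS:i:` is an optional field (tag) that bowtie2 adds to a SAM record only if the read aligned and more than one alignment was found for it.<br>Counting the records that carry this tag gives the number of multiple mapping reads; the remaining aligned reads map to a single location. The same numbers appear in the bowtie2 stats (`aligned exactly 1 time` and `aligned >1 times`).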

In [1]:
cd 
cd ./Genomics

In [6]:
samtools view hightemp_01.sam | grep 'XS:i:' | wc -l
samtools view hightemp_02.sam | grep 'XS:i:' | wc -l
samtools view normal_01.sam | grep 'XS:i:' | wc -l
samtools view normal_02.sam | grep 'XS:i:' | wc -l

2522
2529
2366
2307
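
The three categories from the bowtie2 stats (aligned 0 times, exactly 1 time, >1 times) could be recomputed from the SAM records themselves. This is a sketch under stated assumptions: `sam_classify` is a hypothetical helper, unaligned reads are recognised by FLAG bit 0x4, and multiple mapping reads by an `XS:i:` tag among the optional fields (column 12 onwards). Matching the tag as a whole field, rather than as a substring anywhere in the line, avoids counting a read whose name happens to contain the same letters. The toy records are made up.

```shell
# Hypothetical helper: classify SAM records on stdin as unaligned,
# uniquely aligned, or multiple mapping.
sam_classify() {
    awk -F '\t' '
        /^@/ { next }
        # FLAG bit 0x4 set: the read is unmapped
        int($2 / 4) % 2 == 1 { unaligned++; next }
        {
            multi_hit = 0
            # optional fields start at column 12
            for (i = 12; i <= NF; i++) if ($i ~ /^XS:i:/) multi_hit = 1
            if (multi_hit) multi++; else unique++
        }
        END { printf "unaligned=%d unique=%d multi=%d\n", unaligned, unique, multi }
    '
}

# Toy SAM (made-up reads); the last read has "XS" in its name but no XS:i: tag.
{
    printf 'r1\t0\tchr\t10\t42\t5M\t*\t0\t0\tACGTA\tIIIII\tAS:i:0\n'
    printf 'r2\t16\tchr\t50\t1\t5M\t*\t0\t0\tACGTA\tIIIII\tAS:i:-5\tXS:i:-5\n'
    printf 'r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII\tYT:Z:UU\n'
    printf 'XS_read\t0\tchr\t90\t42\t5M\t*\t0\t0\tACGTA\tIIIII\tAS:i:0\n'
} | sam_classify
# prints: unaligned=1 unique=2 multi=1
```

On the real files, `sam_classify < hightemp_01.sam` should reproduce the three lines of the corresponding bowtie2 stats block.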


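__How do you think that multiple mapping reads could affect downstream analyses (variant calling and RNAseq)?__<br>If a read that carries a feature such as a SNP maps to more than one location, its true origin is ambiguous. Depending on how such reads are placed or counted, the feature can appear more abundant than it really is: a variant may be called in the wrong copy of a repeated region, and the expression of genes that share sequence may be over- or underestimated.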
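
One common way to limit this effect is to drop records with a low mapping quality (MAPQ, column 5 of the SAM record), since reads with equally good alternative alignments typically receive a low MAPQ. The sketch below assumes plain-text SAM on stdin; `mapq_filter` is a hypothetical helper, the threshold of 10 is an arbitrary illustration, and the toy records are made up.

```shell
# Hypothetical helper: keep header lines and records whose MAPQ
# (column 5) is at least the threshold given as the first argument.
mapq_filter() {
    awk -F '\t' -v min="$1" '/^@/ || $5 + 0 >= min + 0'
}

# Toy SAM (made-up reads) with MAPQ 42, 1 and 0.
{
    printf '@HD\tVN:1.0\n'
    printf 'r1\t0\tchr\t10\t42\t5M\t*\t0\t0\tACGTA\tIIIII\n'
    printf 'r2\t0\tchr\t50\t1\t5M\t*\t0\t0\tACGTA\tIIIII\n'
    printf 'r3\t0\tchr\t90\t0\t5M\t*\t0\t0\tACGTA\tIIIII\n'
} | mapq_filter 10
# keeps the @HD header line and the r1 record only
```

With samtools available, `samtools view -h -q 10 hightemp_01.sam` applies the same kind of filter.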


<br><br>
__Could you use these mappings to perform an analysis of Copy Number Variation (https://en.wikipedia.org/wiki/Copy-number_variation)?__

I think it would not be possible, because our data is cDNA, which means that it comes from transcripts. Copy Number Variation (CNV) is the phenomenon in which the same gene is repeated in the genome a variable number of times, but from RNA transcripts it is not possible to tell whether they come from a single, highly expressed gene or from multiple copies of the same gene. Maybe there is a way to trace the region of the genome that the transcripts come from, but as far as I know it is not possible to analyse CNV using cDNA.