# Part 1  
---  
## Demultiplexing:   
---  
### 1) File Identification  

| File Name                    | ID      |  
|:----------------------------:|:-------:|  
| 1294_S1_L008_R1_001.fastq.gz | read1  |  
| 1294_S1_L008_R2_001.fastq.gz | index1 |  
| 1294_S1_L008_R3_001.fastq.gz | index2 |  
| 1294_S1_L008_R4_001.fastq.gz | read2  |  


### 2a) Python Code for Mean Q-score Analysis  
---  
Written by: Max Hills  
**See Mean Quality-Score Graphs and Tables in Local Folder.**   

In [54]:
%matplotlib inline
import matplotlib.pyplot as plt
import math
import numpy as np
def convert_phred(letter):
    """Convert a single quality character (str of length 1) into its Illumina Phred quality score, assuming Phred+33 encoding."""
    phred_score = ord(letter)-33
    return phred_score


In [76]:
def avg_qs(fastq):
    """INPUT: fastq - Path (str) to a standard-format fastq file. The file is opened as plain text, so a gzipped file must be decompressed first.
    OUTPUT: Writes a table (.txt) and a line graph (.png) of the mean q-score at each base position, both named after the input file."""
    with open(fastq,'r') as f:
        # Initialize a line-count.
        LC = 0
        # Initialize a read-count.
        RC = 0
        # Iterate through each line in the file.
        for line in f:        
            # Increment the line count.
            LC += 1        
            # If the line number is a multiple of 4, do the following...
            if LC%4 == 0:
                line = line.strip()
                readlen = len(line)
                if RC == 0:
                    all_qscores = np.zeros(readlen,dtype=int)
                # Increment the read-count.
                RC += 1
                # Add each position's Phred score to the running total for that position.
                for charIndex, char in enumerate(line):
                    all_qscores[charIndex] += convert_phred(char)
    mean_scores = all_qscores/RC
    table = fastq.replace('fastq','txt').replace('.gz','')
    graph = fastq.replace('fastq','png').replace('.gz','')
    with open(table, 'w') as t:
        # Print a header for the columns.
        print("# Base Pos\tMean Quality Score",file=t)
        print("===========\t==================",file=t)
        for i in range(readlen):
            template = "{}\t\t{:0.1f}"
            out = str(template.format(i,mean_scores[i]))
            print(out,file=t)
    # Pad the y-axis range by 2 on either side of the observed mean scores.
    ylow = math.floor(min(mean_scores))-2
    yhigh = math.ceil(max(mean_scores))+2
    # Place x tick marks every 2 positions and y tick marks every 1 score.
    x = list(range(0,readlen+1,2))
    y = list(range(ylow,yhigh))
    # Create the figure, with size and color specifications
    plt.figure(num=None, figsize=(16, 8), dpi=80, facecolor='w', edgecolor='k')
    # Plot the mean_scores, with a dark blue line.
    plt.plot(mean_scores, "darkblue")
    # Create the x,y tick marks
    plt.xticks(x)
    plt.yticks(y)
    # Label the x,y axes.
    plt.xlabel("Position of Base in Read")
    plt.ylabel("Mean Quality Score")
    # Set the range for the x,y axes.
    plt.axis([-1,(readlen+1), ylow, yhigh])
    # Title the plot
    plt.title("Average Quality Scores @ Base Position X")
    # Show the graph grid, then save the figure as a PNG.
    plt.grid(True)
    plt.savefig(graph,format='png')

**Read 1**  
![Read1](1294_S1_L008_R1_001.png)  
**Index 1**  
![Index1](1294_S1_L008_R2_001.png)  
**Read 2**  
![Read2](1294_S1_L008_R4_001.png)  
**Index 2**  
![Index2](1294_S1_L008_R3_001.png)  

In [77]:
%%bash 
ls

1294_S1_L008_R1_001.png
1294_S1_L008_R1_001.txt
1294_S1_L008_R2_001.png
1294_S1_L008_R2_001.txt
1294_S1_L008_R3_001.png
1294_S1_L008_R3_001.txt
1294_S1_L008_R4_001.png
1294_S1_L008_R4_001.txt
Demultiplexing_p1.pdf
demulti.py
demulti.sh
Hills_Demultiplexing.ipynb
index_1.fastq
index1LC.txt
index1Ncount.txt
index_1.png
index_1.txt
index_2.fastq
index2LC.txt
index2Ncount.txt
index_2.png
index_2.txt
lane1_NoIndex_L001_R1_003.fastq
lane1_NoIndex_L001_R1_003.png
lane1_NoIndex_L001_R1_003.txt
PhredQscoreErrors.png
read_1.fastq
read_1.png
read_1.txt
read_2.fastq
read_2.png
read_2.txt
README.md


### 2b) Phred Quality-Score Cut-off:  
---  
> A good quality-score cutoff depends on the particular application. A quality score of 20 or higher is often considered usable data, because it means each base call has at most a 1% chance of being wrong. Where greater certainty is needed, a cutoff of 30, meaning at most a 0.1% chance of error per base call, is a useful target. As the graph below shows, the error probability rises steeply as the quality score drops below 20. For this reason, sequences with a quality score below 20 are considered unreliable.  

![QscoreErrorRate](PhredQscoreErrors.png)
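
The percentages above follow from the standard Phred definition, Q = -10 · log10(P), where P is the probability that a base call is wrong. A minimal sketch of that conversion is below; the function names `phred_to_error_prob` and `error_prob_to_phred` are illustrative and do not come from the notebook.

```python
import math

def phred_to_error_prob(q):
    """Return the probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Return the Phred score that corresponds to error probability p."""
    return -10 * math.log10(p)

# Print the per-base error probability for a few common cutoffs.
for q in (10, 20, 30, 40):
    print(f"Q{q}: {phred_to_error_prob(q):.4%} chance of error per base call")
```

Under this definition, every 10-point drop in quality score multiplies the error probability by 10, which is why the curve climbs so quickly below Q20.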



### 2c) Number of Indexes Containing an Undetermined ("N") Base Call:  
---  
**The following shell commands were used to retrieve this information:**  
>zcat /projects/bgmp/shared/2017_sequencing/1294_S1_L008_R2_001.fastq.gz | sed -n '2~4p' | grep -cE 'N' > index1Ncount.txt  
>zcat /projects/bgmp/shared/2017_sequencing/1294_S1_L008_R3_001.fastq.gz | sed -n '2~4p' | grep -cE 'N' > index2Ncount.txt  

_Index 1:  
**3,976,613** indexes contained an "N"_  
_Index 2:  
**3,328,051** indexes contained an "N"_
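
For comparison, the same count can be sketched in Python. This is an illustrative equivalent of the `zcat | sed | grep -c` pipeline, not the method used to produce the numbers above, and the function name `count_n_indexes` is hypothetical.

```python
import gzip

def count_n_indexes(fastq_gz):
    """Count records whose sequence line contains at least one 'N'.

    The sequence is line 2 of every 4-line FASTQ record, which is the same
    set of lines that sed -n '2~4p' selects.
    """
    n_count = 0
    with gzip.open(fastq_gz, "rt") as fh:
        for line_number, line in enumerate(fh, start=1):
            if line_number % 4 == 2 and "N" in line:
                n_count += 1
    return n_count
```

Calling `count_n_indexes` on the R2 or R3 file path would return a single integer, matching what `grep -c` reports for that file.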




# Part 2  
---  
> Write a strategy for writing an algorithm for demultiplexing files and reporting index-hopping. That is, given four files (2 w/ biological reads, 2 w/ index reads) and known indexes, sort reads by index, outputting one forward file and one reverse file per index, plus a pair of files for unknown indexes. Additionally, your algorithm should report the number of properly matched indexes (per index) and the level of index hopping observed.  

1. Define the Problem:  

2. Determine/Describe What Output Would Be Informative:  

3. Write Examples (Unit Tests!):  
    i. Include four properly formatted input fastq files.  
    ii. Include the appropriate number of properly formatted output fastq files.  

4. Develop Your Algorithm Using Pseudocode:  

5. Determine High-Level Functions:  
    i. Description  
    ii. Function Headers  
    iii. Test Examples for Individual Functions  
    iv. Return Statement  
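
One possible starting point for the core sorting logic is sketched below. It rests on several assumptions that the prompt does not settle: records are already parsed into `(read1, index1, index2, read2)` tuples, index 2 is compared to index 1 directly (no reverse complementing), any index not in the known set (including one containing an "N") is unknown, and index-hopped pairs are counted separately but written to the unknown bucket. All names and the example indexes are hypothetical.

```python
from collections import Counter

def classify_pair(index1, index2, known_indexes):
    """Return 'matched', 'hopped', or 'unknown' for one pair of index reads."""
    if index1 not in known_indexes or index2 not in known_indexes:
        return "unknown"  # also covers indexes containing an 'N'
    return "matched" if index1 == index2 else "hopped"

def demultiplex(records, known_indexes):
    """Sort (read1, index1, index2, read2) tuples into per-index buckets.

    Returns the buckets, a per-index count of properly matched pairs,
    the number of index-hopped pairs, and the number of unknown pairs.
    """
    buckets = {}         # bucket name -> list of (read1, read2) pairs
    matched = Counter()  # index -> number of properly matched pairs
    hopped = 0
    unknown = 0
    for read1, index1, index2, read2 in records:
        status = classify_pair(index1, index2, known_indexes)
        if status == "matched":
            matched[index1] += 1
            bucket = index1
        elif status == "hopped":
            hopped += 1
            bucket = "unknown"  # assumption: hopped pairs go to the unknown files
        else:
            unknown += 1
            bucket = "unknown"
        buckets.setdefault(bucket, []).append((read1, read2))
    return buckets, matched, hopped, unknown

# Toy example with made-up indexes and placeholder read names.
known = {"AAAA", "CCCC"}
records = [
    ("r1_a", "AAAA", "AAAA", "r2_a"),  # properly matched
    ("r1_b", "AAAA", "CCCC", "r2_b"),  # both known but different: hopped
    ("r1_c", "ANAA", "AAAA", "r2_c"),  # contains an N: unknown
]
buckets, matched, hopped, unknown = demultiplex(records, known)
print(dict(matched), hopped, unknown)
```

In a full implementation each bucket would map to a pair of output fastq files (forward and reverse) rather than an in-memory list, and the four input files would be read in lockstep, one record from each at a time.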
    