# Problem
In flow-cell sequencing, index sequences are attached to DNA fragments to identify which sample each read came from. Because multiple samples are sequenced simultaneously, the indices are needed to separate the data by sample. Unfortunately, when flow cells are populated for analysis, "index hopping" occasionally occurs: a DNA strand gets paired with the wrong index, so the strand is attributed to the wrong sample.

# Output
Demultiplexing needs the following data for analysis:

For each of the paired end reads:

    Sequence
    Quality scores for the sequence
    Index for the sequence read
    Quality scores for the index

List of reads where:

    Indices don't match for the paired ends
    Quality scores are insufficient
    Index quality scores are insufficient

# Unit Tests & Expected Results
    Found in this repository

# Pseudocode:

    Open index file:
        strip/split line by tabs
        create dictionary of barcode to barcode name
    Create array of arrays (call "pairarray"):
        Identify sequence vs. index files using the os module and file size
        For each array in pairarray:
            Sequence file, associated index file
    Generate array of "trash" (for unknown indices) file names
    Start incremental value called "hopping"
    Open sets of index files and sequence files together at the same time.
        For each set of four lines:
            Save header string for sequence
            Identify indexes
            If the indices don't match, have low quality, or contain an "N":
                If the indices don't match, add 1 to hopping
                If they don't match but have good quality scores, mark the header
                Write header, sequence, + and quality to the associated trash files, then skip to the next header
            Append index string to header
            Append index string quality score to header
            Check quality scores for sequences. If either have too-low averages, add a mark to the header for both
            Write (new header, sequence, +, quality score) to file named using barcode library and which sequence is used
        print: "Properly matched: ", number of matched records, " Index hopping level: ", hopping / total records in index files * 100, "%", sep = ""
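
The counting part of the pseudocode above could be sketched roughly as follows. This is a minimal illustration, not the final implementation: `read_records` and `count_hopping` are hypothetical helper names, the inputs are assumed to be uncompressed four-line FASTQ records, and reads whose index contains an "N" are assumed to be set aside as unknown rather than counted as hopped.

```python
import io

def read_records(handle):
    """Yield (header, sequence, spacer, quality) tuples, four lines at a time."""
    while True:
        lines = [handle.readline().rstrip("\n") for _ in range(4)]
        if not lines[0]:
            return  # an empty header line is treated as end of file
        yield tuple(lines)

def count_hopping(index1_handle, index2_handle):
    """Return (matched, hopped, total) record counts for two index files."""
    matched = hopped = total = 0
    # zip walks both index files in lock-step, one record from each per iteration
    for rec1, rec2 in zip(read_records(index1_handle), read_records(index2_handle)):
        total += 1
        index1, index2 = rec1[1], rec2[1]
        if "N" in index1 or "N" in index2:
            continue  # assumption: unknown indices are neither matched nor hopped
        if index1 == index2:
            matched += 1
        else:
            hopped += 1
    return matched, hopped, total

# Tiny in-memory example: r1 and r4 match, r2 hops, r3 contains an N
index1 = io.StringIO("@r1\nACTG\n+\nIIII\n@r2\nACTG\n+\nIIII\n@r3\nANTG\n+\nIIII\n@r4\nCTGA\n+\nIIII\n")
index2 = io.StringIO("@r1\nACTG\n+\nIIII\n@r2\nCTGA\n+\nIIII\n@r3\nANTG\n+\nIIII\n@r4\nCTGA\n+\nIIII\n")
matched, hopped, total = count_hopping(index1, index2)
percent = hopped / total * 100 if total else 0.0
print(f"Properly matched: {matched} Index hopping level: {percent:.1f}%")
```

Counting records rather than lines keeps the percentage tied to reads, since every read occupies four lines of each file.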

## High level functions:
    getindex(file):
        '''Generates dictionary of index names and sequences'''
        opens files, splits and strips lines
        Generates dictionary key and value from each line
        returns: dictionary of indices
    example:
        file x = [A1    ACTG
                  B1    CTGA]
        returns:  {A1: ACTG
                   B1: CTGA}
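
A possible Python version of `getindex`, following the example above (index name as key, sequence as value). It is only a sketch: it assumes two tab-separated columns per line with no header row, and it takes an already-open file handle instead of a file name so that it can be tried on in-memory text.

```python
import io

def getindex(handle):
    """Build a dictionary mapping index name to index sequence."""
    indices = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: exactly two tab-separated fields, name then sequence
        name, sequence = line.split("\t")
        indices[name] = sequence
    return indices

# In-memory stand-in for the index file shown in the example
example = io.StringIO("A1\tACTG\nB1\tCTGA\n")
print(getindex(example))
```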
    

    obtain_files(folder):
        '''Creates list of 2 element arrays, pairing off each index and sequence file'''
        Opens folder, tests size of files.
        First large file paired with first small file
        Second large file paired with second small file
        returns: array of lists, organized by read
    example:
        folder: [Read1, Index1, Index2, Read2]
        returns: [[Read1, Index1], [Read2, Index2]]
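
One way `obtain_files` might be written is sketched below. It rests on assumptions that are not confirmed above: the folder contains only the read and index files, the larger half of the files (by size) are the sequence reads, and sorting by name puts matching read and index files in the same order.

```python
from pathlib import Path

def obtain_files(folder):
    """Pair each sequence (large) file with an index (small) file."""
    files = sorted(p for p in Path(folder).iterdir() if p.is_file())
    if len(files) % 2 != 0:
        raise ValueError("expected an even number of files")
    # Assumption: the smaller half of the files are index reads, the larger half sequence reads
    by_size = sorted(files, key=lambda p: p.stat().st_size)
    half = len(files) // 2
    index_files = sorted(by_size[:half])
    sequence_files = sorted(by_size[half:])
    # First large file with first small file, second with second, and so on
    return [[seq, idx] for seq, idx in zip(sequence_files, index_files)]
```

Called on a folder holding `Read1`, `Index1`, `Index2` and `Read2`, this would return the paths grouped as `[[Read1, Index1], [Read2, Index2]]`, provided the size assumption holds.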

    match_index(index1, index2):
        '''Reads both indices to see if they match'''
        Tests to see if it's an index line
        Compares indices
        returns: boolean
    example:
        index 1: ACTG
        Index 2: ACTG
        returns: True
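
A minimal sketch of `match_index` in Python. The "is it an index line" test is interpreted here as "contains only A, C, G, T or N", and raising `ValueError` for anything else is an assumption rather than part of the plan above.

```python
VALID_BASES = set("ACGTN")

def match_index(index1, index2):
    """Return True if both index reads are identical, False otherwise."""
    index1 = index1.strip().upper()
    index2 = index2.strip().upper()
    for index in (index1, index2):
        # Assumption: an index line is non-empty and made up of A, C, G, T or N only
        if not index or not set(index) <= VALID_BASES:
            raise ValueError(f"not an index line: {index!r}")
    return index1 == index2

print(match_index("ACTG", "ACTG"))
print(match_index("ACTG", "CTGA"))
```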

    write_out(index, boolean):
        '''Given an index and proof of match, writes out to files'''
        index_file.write(header)
        index_file.write(sequence)
        index_file.write(spacer)
        index_file.write(quality scores)
        returns: None
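
The four writes above could be collapsed into a single call, as in the sketch below. The signature is a hypothetical variation on `write_out(index, boolean)`: it assumes the caller has already chosen the output file (matched or trash) and passes the open handle along with the record fields.

```python
import io

def write_out(handle, header, sequence, quality):
    """Write one four-line FASTQ record to an open file handle."""
    # "+" is the spacer line between the sequence and its quality scores
    handle.write(f"{header}\n{sequence}\n+\n{quality}\n")

# In-memory stand-in for an output file
buffer = io.StringIO()
write_out(buffer, "@read1 ACTG-ACTG", "GATTACA", "IIIIIII")
print(buffer.getvalue(), end="")
```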

def convert_phred(letter):
    """Convert a single Phred+33 quality character into its numeric score."""
    # The score is the character's ASCII code minus the offset of 33
    return ord(letter) - 33
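
The average-quality check mentioned in the pseudocode could build on `convert_phred` along these lines. `mean_quality`, `passes_quality` and the cutoff of 30 are hypothetical and only illustrate the idea; the actual threshold is not specified here.

```python
def convert_phred(letter):
    """Convert a single Phred+33 quality character into its numeric score."""
    return ord(letter) - 33

def mean_quality(quality_string):
    """Return the average Phred score of a quality line."""
    if not quality_string:
        raise ValueError("empty quality string")
    return sum(convert_phred(char) for char in quality_string) / len(quality_string)

def passes_quality(quality_string, cutoff=30):
    """Return True if the average score meets the (assumed) cutoff."""
    return mean_quality(quality_string) >= cutoff

print(mean_quality("IIII"))    # "I" is ASCII 73, so each base scores 40
print(passes_quality("II##"))  # "#" is ASCII 35, so two bases score 2
```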