# 20190821 operator-barcode library mapping

(c) 2019 Manuel Razo. This work is licensed under a [Creative Commons Attribution License CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/). All code contained herein is licensed under an [MIT license](https://opensource.org/licenses/MIT).

In [78]:
import os
import glob
import itertools
import re
import numpy as np
import pandas as pd
from joblib import Parallel, delayed
import skbio

## Objective

The objective of this notebook is to process the sequences generated on `20190818` to map the `O1`, `O2`, and `O3` libraries with their corresponding random 20mer barcodes.

For this dataset we have two technical replicates, distinguished by two different sequencing indexes. In principle these libraries should be identical, since they were generated from the same library and the different indexes were added only in the last PCRs. Let's first find these files.

In [18]:
# Define data directory
datadir = '../../../data/processed_sequencing/' +\
          '20190821_operator_library_mapping/'

# List all fastq.gz files (sorted, since glob does not guarantee order)
fastq_files = sorted(glob.glob(f'{datadir}*.fastq.gz'))

fastq_files

['../../../data/processed_sequencing/20190821_operator_library_mapping/index1_merged.fastq.gz',
 '../../../data/processed_sequencing/20190821_operator_library_mapping/index2_merged.fastq.gz']

OK. Now that we have a list of the files, we can begin manipulating them. To get a feel for the data, let's read part of one of them into memory.

In [45]:
# Use skbio to get a generator that iterates over the fastq records.
# `fastq_files` already contains the full path, so `datadir` is not prepended
seqs = skbio.io.read(fastq_files[0],
                     format='fastq',
                     verify=False,
                     variant='illumina1.8')

# Define number of reads to sample
n_samples = 10000

# Collect one record per read; building the dataframe once at the end
# avoids the repeated copying done by `DataFrame.append`
names = ['id', 'index', 'sequence']
records = [{'id': seq.metadata['id'],
            'index': 'index1',
            'sequence': str(skbio.DNA(sequence=seq, validate=False))}
           for seq in itertools.islice(seqs, n_samples)]
df_seq = pd.DataFrame(records, columns=names)

# Add the length of each read
df_seq['seq_len'] = df_seq.sequence.apply(len)
df_seq.head()

Unnamed: 0,id,index,sequence,seq_len
0,M05340:171:000000000-D6YGV:1:1101:16708:1702,index1,TCTTGACCATTTAGGTTTGGGCATGTGAGACCGGATGCTAACTAAA...,113
1,M05340:171:000000000-D6YGV:1:1101:16708:1702,index1,TTTACACTTTATTCTTCCTGCTCGTCTCCTTTTTTGAAATGTGATC...,113
2,M05340:171:000000000-D6YGV:1:1101:13854:1711,index1,TTATCATGGGTTGTTAAGAGGCATGTGAGACCGGATGCTAACTAAA...,113
3,M05340:171:000000000-D6YGV:1:1101:13854:1711,index1,TTTACACTTTCTGCTTCCTGCTCGTATCCTTTTTTTAATTGTGCTC...,113
4,M05340:171:000000000-D6YGV:1:1101:15932:1711,index1,TACCTGAAGGCTATTTCCCTGCATGTGAGACCGGATGCTAACTAAA...,113
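
For readers without `skbio`, the same partial read can be sketched with the standard library alone. This is a minimal sketch and not the approach used in this notebook: the helper name `read_fastq` is hypothetical, and it assumes every record spans exactly four lines (no wrapped sequences).

```python
import gzip
import itertools


def read_fastq(path, n_records=None):
    """Yield (read_id, sequence) tuples from a gzipped FASTQ file.

    Hypothetical helper: assumes each record is exactly four lines.
    """
    with gzip.open(path, 'rt') as handle:
        # Passing the same file iterator four times groups lines in fours
        records = zip(*[handle] * 4)
        for header, seq, _, _ in itertools.islice(records, n_records):
            # Drop the leading '@' and any description after the first space
            yield header[1:].split()[0], seq.strip()
```

Calling `read_fastq(path, 10000)` would then play the role of `itertools.islice(seqs, n_samples)` above, without loading the whole file into memory.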


Now let's define the 3 sequences for the operators. We will reverse complement them since the reads start from the barcode and then go through the reverse complement of the operators.

In [46]:
O1 = skbio.DNA('aattgtgagcggataacaatt'.upper()).reverse_complement()
O2 = skbio.DNA('aaatgtgagcgagtaacaacc'.upper()).reverse_complement()
O3 = skbio.DNA('ggcagtgagcgcaacgcaatt'.upper()).reverse_complement()
operators = {'O1': str(O1),
             'O2': str(O2),
             'O3': str(O3)}
operators

{'O1': 'AATTGTTATCCGCTCACAATT',
 'O2': 'GGTTGTTACTCGCTCACATTT',
 'O3': 'AATTGCGTTGCGCTCACTGCC'}
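
As a sanity check on the output above, the reverse complement can also be computed without `skbio`. The sketch below uses only built-in string methods; the names `COMPLEMENT`, `reverse_complement`, and `lac_operators` are illustrative and do not appear elsewhere in this notebook.

```python
# Translation table mapping each base to its complement (N stays N)
COMPLEMENT = str.maketrans('ACGTN', 'TGCAN')


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    # Complement every base, then reverse the string
    return seq.upper().translate(COMPLEMENT)[::-1]


# Operator sequences as written in the cell above
lac_operators = {'O1': 'aattgtgagcggataacaatt',
                 'O2': 'aaatgtgagcgagtaacaacc',
                 'O3': 'ggcagtgagcgcaacgcaatt'}

rc_operators = {name: reverse_complement(seq)
                for name, seq in lac_operators.items()}
print(rc_operators)
```

The resulting dictionary should agree with the `operators` dictionary printed above.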

Now we define a function to find which operator is found in each of the sequences and apply it to all sequences.

In [47]:
# Define function to find operator
def op_match(seq):
    '''
    Find which operator, if any, is contained in a read.

    Returns a list [operator, begin, end] with the operator name and the
    span of the match, or ['None', 0, 0] if no operator is found.
    '''
    # Loop through operators
    for key, item in operators.items():
        # Search for an exact match of the operator in the read
        op_pos = re.search(item, seq)
        # If found, return the operator name and the span of the match
        if op_pos:
            return [key, *op_pos.span()]

    # If none match, return the placeholder entry
    return ['None', 0, 0]

op_map = list()
# Loop through rows
for seq in df_seq.sequence:
    op_map.append(op_match(seq))

df_seq = pd.concat([df_seq,
                    pd.DataFrame.from_records(op_map,
                                              columns=['operator',
                                                       'op_begin',
                                                       'op_end'])],
                   axis=1)
df_seq.head(10)

Unnamed: 0,id,index,sequence,seq_len,operator,op_begin,op_end
0,M05340:171:000000000-D6YGV:1:1101:16708:1702,index1,TCTTGACCATTTAGGTTTGGGCATGTGAGACCGGATGCTAACTAAA...,113,O2,56,77
1,M05340:171:000000000-D6YGV:1:1101:16708:1702,index1,TTTACACTTTATTCTTCCTGCTCGTCTCCTTTTTTGAAATGTGATC...,113,,0,0
2,M05340:171:000000000-D6YGV:1:1101:13854:1711,index1,TTATCATGGGTTGTTAAGAGGCATGTGAGACCGGATGCTAACTAAA...,113,O2,56,77
3,M05340:171:000000000-D6YGV:1:1101:13854:1711,index1,TTTACACTTTCTGCTTCCTGCTCGTATCCTTTTTTTAATTGTGCTC...,113,,0,0
4,M05340:171:000000000-D6YGV:1:1101:15932:1711,index1,TACCTGAAGGCTATTTCCCTGCATGTGAGACCGGATGCTAACTAAA...,113,O1,56,77
5,M05340:171:000000000-D6YGV:1:1101:15932:1711,index1,TTTCCACTTTCTTCTTCCTTCTCGTCTCCTTTTTTTATTTTTGCTC...,113,,0,0
6,M05340:171:000000000-D6YGV:1:1101:16589:1721,index1,TATGCTGTACTGGGTTATTCGCATGTGAGACCGGTTGCTAACTAAC...,113,,0,0
7,M05340:171:000000000-D6YGV:1:1101:16589:1721,index1,TTTCCCCTTTCTTCTTCCTTCTCTTCTCCTTTTTTTCTTTTTTCTC...,113,,0,0
8,M05340:171:000000000-D6YGV:1:1101:14894:1735,index1,TTCTTCTGTGTATTAGACTTGCATGTGAGACCGGATGCTAACTAAA...,113,O1,56,77
9,M05340:171:000000000-D6YGV:1:1101:14894:1735,index1,TTTCCACTTTATTCTTCCTTCTCGTCTCCTTTTTTTAATTGTGCTC...,113,,0,0
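
To make the three returned values concrete, here is a small self-contained sketch run on synthetic reads. The function `find_operator` is a hypothetical stand-in for `op_match` that uses `str.find` instead of `re.search`, which is equivalent here because the operators are literal strings. The toy reads are invented for illustration and are not taken from the data.

```python
# Reverse-complemented operator sequences, copied from the output above
operators = {'O1': 'AATTGTTATCCGCTCACAATT',
             'O2': 'GGTTGTTACTCGCTCACATTT',
             'O3': 'AATTGCGTTGCGCTCACTGCC'}


def find_operator(seq):
    """Hypothetical stand-in for `op_match` based on `str.find`."""
    for name, op in operators.items():
        begin = seq.find(op)
        if begin != -1:
            # Mirror the [operator, begin, end] layout of `op_match`
            return [name, begin, begin + len(op)]
    return ['None', 0, 0]


# Toy 113 bp read: 20 bp barcode + 36 bp spacer + O1 + 36 bp tail
toy_read = 'A' * 20 + 'C' * 36 + operators['O1'] + 'G' * 36
# Toy 113 bp read that contains none of the operators
blank_read = 'ACGT' * 28 + 'A'

hit = find_operator(toy_read)
miss = find_operator(blank_read)
print(hit, miss)
```

The end position is exclusive, as in Python slicing, so `seq[begin:end]` recovers the 21 bp operator.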


Let's look at a summary table of the counts for each operator.

In [48]:
df_seq.operator.value_counts()

O3      3045
O2      2787
O1      2659
None    1509
Name: operator, dtype: int64

There are a lot of `None` entries, which means that the exact search didn't find any of the three operators in those reads. We'll come back to those later.
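
One way to start investigating such reads is to ask how far the window at the expected operator position is from each operator. The sketch below is only an illustration: the notebook does not establish why these reads fail to match, and the helper names `hamming` and `closest_operator`, as well as the toy read, are hypothetical. The start position of 56 is the expected operator position used in the filtering step of this notebook.

```python
# Reverse-complemented operator sequences, copied from the output above
operators = {'O1': 'AATTGTTATCCGCTCACAATT',
             'O2': 'GGTTGTTACTCGCTCACATTT',
             'O3': 'AATTGCGTTGCGCTCACTGCC'}

# Expected start of the operator in a read (assumption from this notebook)
OP_START = 56


def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError('sequences must have equal length')
    return sum(x != y for x, y in zip(a, b))


def closest_operator(read, start=OP_START):
    """Return (operator, mismatches) for the best match at `start`."""
    scores = {}
    for name, op in operators.items():
        window = read[start:start + len(op)]
        # Skip reads too short to contain the full operator
        if len(window) == len(op):
            scores[name] = hamming(window, op)
    if not scores:
        return ('None', None)
    best = min(scores, key=scores.get)
    return (best, scores[best])


# Toy read: O2 with a single substitution at its 11th base
o2 = operators['O2']
mutated = o2[:10] + 'A' + o2[11:]
toy_read = 'A' * 56 + mutated + 'A' * 36
result = closest_operator(toy_read)
print(result)
```

An exact search misses this toy read, while the mismatch count shows it is one substitution away from `O2`. Whether to rescue such reads, and with what mismatch threshold, is a separate judgment call.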

First, let's count how many unique sequences we have for each operator. We will be very stringent with our filtering of the barcodes. We know that the operator should start exactly at position 56 of the read and that a full read is 113 bp long, so we will keep only the reads that satisfy both conditions.

In [59]:
# Keep only full-length reads that map to an operator at the expected position
df = df_seq[(df_seq.operator != 'None') &
            (df_seq.seq_len == 113) &
            (df_seq.op_begin == 56)]
df.groupby('operator')['sequence'].nunique()

operator
O1    1500
O2    1275
O3    1044
Name: sequence, dtype: int64

So about half of the sequences are unique. Let's now build a table of the sequences that passed the filter, together with the number of times each of them appears. We will also extract the first 20 bp of each sequence, which we know must correspond to the barcode.

In [76]:
# Group by operator
df_group = df.groupby('operator')

# Define column order and initialize list to save outcome
names = ['operator', 'sequence', 'barcode', 'counts']
df_list = list()

# Loop through operators
for group, data in df_group:
    # Count unique sequences and turn the result into a DataFrame
    df_op = data['sequence'].value_counts()\
                            .rename_axis('sequence')\
                            .reset_index(name='counts')
    # Add a column that contains operator
    df_op['operator'] = group
    # Extract barcodes (first 20 bp of the read)
    df_op['barcode'] = df_op['sequence'].str[:20]
    # Save to list
    df_list.append(df_op)

# Concatenate into a single dataframe with the desired column order
df_counts = pd.concat(df_list, ignore_index=True, sort=False)[names]
df_counts.head()

Unnamed: 0,operator,sequence,barcode,counts
0,O1,GCAGTAGTATCTTGTGCTGTGCATGTGAGACCGGATGCTAACTAAA...,GCAGTAGTATCTTGTGCTGT,8
1,O1,CGGAGTTCGTTGCTGGGAGCGCATGTGAGACCGGATGCTAACTAAA...,CGGAGTTCGTTGCTGGGAGC,7
2,O1,TAGCCTCTTCTTTACGATTGGCATGTGAGACCGGATGCTAACTAAA...,TAGCCTCTTCTTTACGATTG,6
3,O1,CCTTTCTTGGCACTCACCATGCATGTGAGACCGGATGCTAACTAAA...,CCTTTCTTGGCACTCACCAT,6
4,O1,CCACCGTTGACGTCGGCTCTGCATGTGAGACCGGATGCTAACTAAA...,CCACCGTTGACGTCGGCTCT,6
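
The counting step can also be illustrated without pandas. The sketch below counts `(operator, barcode)` pairs with `collections.Counter` on invented toy reads; `count_barcodes` and `BARCODE_LEN` are hypothetical names. Note that it groups by barcode rather than by full read, which is slightly looser than the table above, where two reads sharing a barcode but differing downstream are counted separately.

```python
from collections import Counter

# Barcode length stated in this notebook (first 20 bp of the read)
BARCODE_LEN = 20


def count_barcodes(reads):
    """Count (operator, barcode) pairs.

    `reads` is an iterable of (operator, sequence) tuples.
    """
    return Counter((op, seq[:BARCODE_LEN]) for op, seq in reads)


# Toy reads, invented for illustration only
toy_reads = [('O1', 'ACGT' * 5 + 'GCATG'),
             ('O1', 'ACGT' * 5 + 'GCATG'),
             ('O2', 'TTGA' * 5 + 'GCATG')]

counts = count_barcodes(toy_reads)
for (op, barcode), n in counts.most_common():
    print(op, barcode, n)
```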


Excellent! Now we have a working pipeline to process the sequences. The analysis for the full dataset is performed with the script `library_mapping.py`.