## Problem Statement:

1) Build a sample sheet generator to load the different sample sheets based upon the type of instrument requested. The sample sheet generator specification:

   a. Input: instrument type, string;

   b. Input: number of samples, int;

   c. Input: sample sheet name to be saved as output, string;

   d. Output: A sample sheet that contains 3 to 16 samples with unique indexes assigned to each sample. All other sections should be included except that the Application field should indicate the instrument type.

2) Build a sample sheet comparison module to identify the differences between the sample sheets for the three different instruments.

   a. Assume all three sample sheets have 16 identically named samples and that the Application value in the [Header] section is blank. Based on the internal sample sheets you have, how would you identify which instrument a particular sample sheet is intended for?

   b. How would you identify the unique values that are instrument specific based on the three sample sheets?

3) Write a Python script to help you identify whether a HiSeq 4000 sample sheet has been erroneously used on a NovaSeq instrument. For example, the Application field says NovaSeq, when in fact everything else is from a HiSeq 4000 template.

Note:

- Object oriented design is preferred but not required.

- Use of open source code is acceptable and should be indicated where applicable.

- Use of pseudo code is acceptable inside a function, i.e., no detailed implementation is required except for input, output parameters.

- Python is preferred, but other languages such as C#, R, bash, are acceptable.

- Indicate where the instructions may have been unclear and briefly explain your design decision without that additional information.

## Solution:

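We begin with the sample sheet generator. A single function is enough: it downloads the template for the requested instrument, writes the instrument type into the Application field, and keeps the requested number of samples.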

In [31]:
import os.path
import zipfile
from io import BytesIO
from urllib.request import urlopen

import pandas as pd

#sample sheet templates, keyed by instrument type
TEMPLATE_URLS = {
    'HiSeq 2500': 'https://support.illumina.com/content/dam/illumina-support/documents/downloads/productfiles/trusight/trusight-oncology-umi-reagents-hiseq-2500-sample-sheet-template-1000000036306-00.zip',
    'HiSeq 4000': 'https://support.illumina.com/content/dam/illumina-support/documents/downloads/productfiles/trusight/trusight-oncology-umi-reagents-hiseq-4000-sample-sheet-template-1000000036305-00.zip',
    'NovaSeq 6000': 'https://support.illumina.com/content/dam/illumina-support/documents/downloads/productfiles/trusight/trusight-oncology-umi-reagents-novaseq-6000-sample-sheet-template-1000000036332-00.zip',
}

def sample_sheet_generator(inst,n,outfile):
    """
    generate sample sheet according to specified instrument type
    outputs a tab-delimited .csv file and returns dataframe object of that file
    
    :type inst: str
    :param inst: instrument type
    :type n: int
    :param n: number of samples (3 to 16)
    :type outfile: str
    :param outfile: name of output file
    
    :rtype: Pandas Dataframe
    
    """
    assert isinstance(inst,str),'Instrument type needs to be a string!'
    assert isinstance(n,int),'Number of samples must be an integer!'
    assert isinstance(outfile,str),'Output file name must be a string!'
    assert inst in TEMPLATE_URLS,'Invalid instrument type.'
    assert 3 <= n <= 16,'Number of samples must be between 3 and 16.'
    
    #download the template archive for this instrument
    url = urlopen(TEMPLATE_URLS[inst])
    
    #unzip in memory and convert the first file in the archive to a dataframe
    zf = zipfile.ZipFile(BytesIO(url.read()))
    ss = zf.namelist()[0]
    df = pd.read_csv(zf.open(ss))
    df.set_index('[Header]',inplace=True)
    
    #write the instrument type into the Application field
    df.loc['Application',df.columns[0]] = inst
    df.reset_index(inplace=True)
    
    #keep the 24 rows that precede the sample rows, plus the first n samples
    df = df.iloc[0:24+n,:]
    
    #output tab delimited .csv file
    df.to_csv(outfile,sep='\t',index=False)
    
    #return Pandas dataframe
    return df

In [32]:
sample_sheet_generator('HiSeq 4000',4,'hello.csv')

Unnamed: 0,[Header],Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8
0,IEMFileVersion,4,,,,,,,
1,Investigator Name,User Name,,,,,,,
2,Experiment Name,Experiment,,,,,,,
3,Date,6/21/2016,,,,,,,
4,Workflow,GenerateFASTQ,,,,,,,
5,Application,HiSeq 4000,,,,,,,
6,Assay,,,,,,,,
7,Description,,,,,,,,
8,Chemistry,Default,,,,,,,
9,,,,,,,,,
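
The specification asks for unique indexes per sample, but the generator above relies on the template rows already being unique and does not check this. The sketch below shows one way such a check might look, assuming the (index, index2) pairs have already been pulled out of the sample rows. The helper name `duplicated_indexes` and the sequences are hypothetical placeholders, not values from the templates.

```python
from collections import Counter

def duplicated_indexes(index_pairs):
    """Return the (index, index2) pairs that are assigned to more than one sample."""
    counts = Counter(index_pairs)
    return sorted(pair for pair, count in counts.items() if count > 1)

# Placeholder sequences for illustration only, not taken from the templates
pairs = [("AAAA", "CCCC"), ("AAAA", "GGGG"), ("TTTT", "CCCC"), ("AAAA", "CCCC")]
print(duplicated_indexes(pairs))      # [('AAAA', 'CCCC')]
print(duplicated_indexes(pairs[:3]))  # []
```

An empty result means every sample has its own index pair. Sharing a single index between samples is allowed here as long as the pair differs.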


### Sample Sheet Comparison Module

Comparing the three sheets shows two indicators of the instrument type besides the Application field:

1. HiSeq 4000 and NovaSeq 6000 have a reads value of 151, while HiSeq 2500 has 125.
2. HiSeq 2500 and NovaSeq 6000 have identical index2 columns, while HiSeq 4000 differs.

Together these two checks separate all three instruments, so simple boolean logic and if statements are enough to identify which instrument a sheet is intended for.

The function below concatenates the sheets of two instruments and drops duplicate rows, so rows of the second sheet that also appear in the first are removed and only its differing rows remain.

In [33]:
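def sample_sheet_compare(file1,file2):
    """
    concatenate the sample sheets of two instruments and drop duplicate rows
    
    :type file1: str
    :param file1: instrument type of the first sheet
    :type file2: str
    :param file2: instrument type of the second sheet
    
    :rtype: Pandas Dataframe
    """
    n = 16
    
    #generate a 16-sample sheet for each instrument
    df1 = sample_sheet_generator('HiSeq 4000',n,'4000.csv')
    df2 = sample_sheet_generator('HiSeq 2500',n,'2500.csv')
    df3 = sample_sheet_generator('NovaSeq 6000',n,'6000.csv')

    dfdict = {'HiSeq 4000':df1,'HiSeq 2500':df2,'NovaSeq 6000':df3}
    
    #stack the two requested sheets, labelled by instrument, in the order given
    df = pd.concat([dfdict[file1],dfdict[file2]],keys=[file1,file2])
    
    #drop every row that repeats an earlier row
    df.drop_duplicates(keep='first',inplace=True)
    return df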

In [34]:
sample_sheet_compare('HiSeq 4000','NovaSeq 6000')

Unnamed: 0,Unnamed: 1.1,[Header],Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8
HiSeq 4000,0,IEMFileVersion,4,,,,,,,
HiSeq 4000,1,Investigator Name,User Name,,,,,,,
HiSeq 4000,2,Experiment Name,Experiment,,,,,,,
HiSeq 4000,3,Date,6/21/2016,,,,,,,
HiSeq 4000,4,Workflow,GenerateFASTQ,,,,,,,
HiSeq 4000,5,Application,HiSeq 4000,,,,,,,
HiSeq 4000,6,Assay,,,,,,,,
HiSeq 4000,7,Description,,,,,,,,
HiSeq 4000,8,Chemistry,Default,,,,,,,
HiSeq 4000,9,,,,,,,,,
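
The two indicators listed earlier in this section (the reads value and the index2 column) can be turned into a small decision function for sheets whose Application field is blank. This is a minimal sketch under stated assumptions: `identify_instrument` is a hypothetical helper, the reads value is passed in as an integer, and the index2 values shared by the HiSeq 2500 and NovaSeq 6000 templates are supplied by the caller. The sequences used below are placeholders, not real template values.

```python
def identify_instrument(read_length, index2_values, shared_index2):
    """Guess the instrument from the reads value and the index2 column.

    shared_index2 holds the index2 values common to the HiSeq 2500 and
    NovaSeq 6000 templates (assumed to be provided by the caller).
    Returns None when the reads value matches neither known template.
    """
    if read_length == 125:
        # only the HiSeq 2500 template uses 125
        return 'HiSeq 2500'
    if read_length == 151:
        # 151 is shared by HiSeq 4000 and NovaSeq 6000, so index2 decides
        if set(index2_values) <= set(shared_index2):
            return 'NovaSeq 6000'
        return 'HiSeq 4000'
    return None

# Placeholder index2 values for illustration only
shared = {"ACGT", "TGCA"}
print(identify_instrument(151, ["ACGT", "TGCA"], shared))  # NovaSeq 6000
print(identify_instrument(151, ["GGGG"], shared))          # HiSeq 4000
print(identify_instrument(125, ["ACGT"], shared))          # HiSeq 2500
```

The subset test mirrors the `issubset` check used in the error-detection function, so a sheet with fewer than 16 samples can still be classified.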


### Sample Sheet Error

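We can now write a function that checks whether an input sheet has been labelled incorrectly. It regenerates the template for the instrument named in the Application field and compares the two indicators identified above. This assumes that the template files on the website can be trusted.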

In [35]:
def sample_sheet_error(infile):
    """
    Given a tab-delimited .csv sample sheet, find out if it is labelled incorrectly
    
    :type infile: str
    :param infile: name of input sample sheet to be analyzed
    
    :rtype: bool
    :return: True if labelled correctly, False if labelled incorrectly
    """
    
    assert isinstance(infile,str),'Input file name must be a string!'
    assert os.path.exists(infile),'Input file does not exist.'
    outfile = 'trash.csv'
    n = 16
    
    #read file, find the instrument named in the Application field
    df = pd.read_csv(infile,sep='\t')
    df1 = df.set_index('[Header]')
    inst = df1.loc['Application'].iloc[0]
    
    #generate the reference sheet for that instrument
    truedf = sample_sheet_generator(inst,n,outfile)
    
    #compare reads
    if df.iloc[11,0] != truedf.iloc[11,0]:
        return False
    
    #compare index2 column: every value in the input must appear in the reference
    dfindex2 = df.iloc[24:,8]
    truedfindex2 = truedf.iloc[24:,8]
    return set(dfindex2).issubset(set(truedfindex2))

    

In [36]:
sample_sheet_error('hello.csv') #should return True
#test on modified file

True
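
The `#test on modified file` comment above points at a second test case: a sheet whose Application field has been changed while everything else is left alone. One way to produce such a file is sketched below. `relabel_application` is a hypothetical helper, and it assumes the tab-delimited layout written by `sample_sheet_generator`, with the field name in the first column and its value in the second.

```python
import csv

def relabel_application(src, dst, new_label):
    """Copy a tab-delimited sample sheet, overwriting only the Application value."""
    with open(src, newline='') as fin:
        rows = list(csv.reader(fin, delimiter='\t'))

    for row in rows:
        # the Application row holds the field name in column 0, the value in column 1
        if len(row) > 1 and row[0] == 'Application':
            row[1] = new_label

    with open(dst, 'w', newline='') as fout:
        csv.writer(fout, delimiter='\t').writerows(rows)
```

For example, `relabel_application('hello.csv', 'mislabelled.csv', 'NovaSeq 6000')` would relabel the HiSeq 4000 sheet generated earlier. Passing `'mislabelled.csv'` to `sample_sheet_error` should then return False, provided the index2 columns of the two templates differ as described above.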