> Author: @ ET 4/14/22 :goat:
>> edited: @ ET 4/20/22 
------

## AMD ID Quality Check of FASTQ Filenames ##

### Background: ### 

A standardized sample naming schema is used to capture all associated metadata prior to sequencing. Briefly, the `AMD ID` consists of 20 characters that capture information such as collection year, geographical and treatment information, sample type, and the molecular markers included for each sample. 

Capturing this information at the pre-processing stage reduces the need to maintain multiple documents, lowers the potential for mislabeling or tracking errors, and gives the bioinformatics team sufficient information to perform subsequent analyses and standardize analysis pipelines, including submission of data to NCBI. 

#### AMD ID Description #### 

* The AMD ID Key:  `<year> <country> <state/site> <day of failure> <treatment> <sample_id> <genus spp> <sample type> <mol marker bit code> <# sample processed>`. 
* Any missing metadata is replaced with a _lowercase_ `x` for each character position of the missing field. 

Example:
- `Individual` sequenced sample ID: `17GNDo00F0001PfF1290` = `<2017> <Guinea> <Dorota> <Day0> <AS+AQ> <0001> <P.falciparum> <FilterBloodSpot> <k13-crt-mdr-dhfr-dhps-cytB-cpmp-pfs47> <0>` 
                      
- `Pooled` sequenced sample ID: `17GNDo00x001p10F1290` = `<2017> <Guinea> <Dorota> <Day0> <missing info> <001> <Pooled Samples> <Samples in Pool> <FilterBloodSpot> <k13-crt-mdr-dhfr-dhps-cytB-cpmp-pfs47> <0>`
       
NOTE: If information is not available (na), **x** is used for each character position. For example, in the pooled samples Treatment has (2) character spaces, represented as a two digit number. This is replaced with (2) **xx** since it's a pool of samples that may have different treatment information. Moreover, for pooled samples, **Sample ID** is replaced with a **three digit number + the letter p** (for pooled), and **Genus** is replaced with the **total number of samples in the pool** as a (2) digit number. 
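
The positional layout described above can be sketched as a small parser. The field names and widths below are assumptions inferred from the two example IDs (where, for instance, the treatment field occupies a single character and the trailing four characters are read as a three-character marker code plus one digit); they are not taken from the AMD ID presentation, so adjust them if your key differs.

```python
# Field names and widths are ASSUMPTIONS inferred from the example IDs above.
AMD_ID_FIELDS = [
    ("year", 2),
    ("country", 2),
    ("site", 2),
    ("day", 2),
    ("treatment", 1),
    ("sample_id", 4),
    ("genus", 2),            # number of samples in the pool for pooled IDs
    ("sample_type", 1),
    ("marker_code", 3),
    ("times_processed", 1),
]


def parse_amd_id(amd_id):
    """Split an AMD ID into named fields by character position."""
    expected = sum(width for _, width in AMD_ID_FIELDS)
    if len(amd_id) != expected:
        raise ValueError(f"expected {expected} characters, got {len(amd_id)}")
    fields = {}
    start = 0
    for name, width in AMD_ID_FIELDS:
        fields[name] = amd_id[start:start + width]
        start += width
    return fields


individual = parse_amd_id("17GNDo00F0001PfF1290")
pooled = parse_amd_id("17GNDo00x001p10F1290")
print(individual)
print(pooled)
```

Reading fields by position rather than by substring search means a value such as `GN` is only accepted where the country code is expected to sit.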

Please see presentation at [AMD_ID presentation](https://github.com/CDCgov/MaRS/blob/master/Geneious_workflow/01_sample_ID_QC/files/AMD_ID_create_key.pptx) for more descriptive information. The `AMD ID` can be generated using the [AMD create template](https://github.com/CDCgov/MaRS/blob/master/Geneious_workflow/01_sample_ID_QC/files/AMD_ID_create_template.xlsx). 
     
### Code information ###

This code checks whether the `AMD ID` is the correct length and contains all the required elements described above. 

### Required packages ###
- [pandas](https://pandas.pydata.org/) 
- [numpy](https://numpy.org) 
- [os](https://docs.python.org/3/library/os.html)
- [re](https://docs.python.org/3/library/re.html) 

### Inputs ###
- .fastq filenames in the `AMD ID` convention 

### Outputs ### 
- List of .fastq files with an incorrect `AMD ID` 

### NOTE ###
This code will work on UNIX-based systems. For Windows-based systems, please change the path-handling parts of the code below accordingly. 
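
For readers who want to avoid editing paths by hand, here is a minimal sketch of a platform-independent alternative using the standard library's `pathlib`. It assumes, as the example paths in this notebook do, that the MaRS repo was cloned directly into the user's home directory; `example_dir` is a name introduced here for illustration.

```python
from pathlib import Path

# Path.home() resolves the user's home directory on UNIX-like systems and Windows alike
home_dir = Path.home()

# ASSUMPTION: the MaRS repo sits directly in the home directory
example_dir = home_dir / "MaRS" / "Geneious_workflow" / "02_geneious_analysis" / "example_data"

print("User home directory:", home_dir)
print("Example data directory:", example_dir)
print("Example data present:", example_dir.is_dir())
```

`os.chdir(example_dir)` accepts a `Path` object, so the rest of the notebook can stay unchanged.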

### Before you start, restart the kernel to clear dir() ###
> Kernel > Restart Kernel and Clear All Outputs 

In [2]:
## Import dependencies ## 
import numpy as np 
import pandas as pd
import os
import re

In [3]:
## Get current dir $PATHs ## 

current_dir = os.getcwd()
print("You are currently in:", current_dir)


You are currently in: /Users/eldintalundzic/MaRS/Geneious_workflow/01_sample_ID_QC


In [4]:
## Get user $HOME directory ## 

def slash_split(string):
    '''Takes string, at 3rd "/" keeps all characters to the left; if no "/" returns full string'''
    if string.count("/") == 0:
        return string 
    return "/".join(string.split("/", 3)[:3]) 

user_path = slash_split(current_dir)

print("User home directory:", user_path)

User home directory: /Users/eldintalundzic


### Choose example .fastq files ### 
- Run the block below to use the example data downloaded as part of the MaRS repo. 
- Otherwise, skip code block below

In [5]:
## Change dir to example .fastq data ## 

os.chdir(user_path + "/MaRS/Geneious_workflow/02_geneious_analysis/example_data")

print("Changing directory to:", os.getcwd())

Changing directory to: /Users/eldintalundzic/MaRS/Geneious_workflow/02_geneious_analysis/example_data


### OR choose your own .fastq files ### 
- Un-comment the block below and run it
- Provide the full path to the directory containing your `AMDID.fastq` files _(ex. /Users/"username"/Desktop/my_data)_

In [6]:
## Change dir to your own .fastq ## 

# fastq_filepath = input("Provide full path to fastq filenames:")                     # save user provided input string $PATH as var fastq_filepath 

# os.chdir(fastq_filepath)                                                            # change dir to user provided $PATH 
# print("Changing directory to:", os.getcwd()) 

### Check directory contents for fastq files ### 

In [7]:
## QC directory check ## 

file_list = os.listdir()                                                            # save contents of directory to file_list 

## Remove any None (null records) if any ##
clean_file_list = [] 
for item in file_list:
    if item is not None: clean_file_list.append(item)
    
# List comprehension to do same
# Keeps truthy values only, so None (and empty strings) are filtered out 
#clean_file_list = [ item for item in file_list if item ] 

test_string = "fastq"                                                               # define string to search for in file_list; here "fastq" filenames 

fastq_files = []                                                                    # create an empty list to save fastq filenames 
non_fastq_files = []                                                                # create an empty list to save all non-fastq filenames 
    
## Save fastq filenames into fastq_files var ## 
fastq_files = [ item for item in clean_file_list if test_string in item ] 
# filter() + lambda way to do it ^: 
#fastq_files = list(filter(lambda x: test_string in x, file_list)) 

non_fastq_files = [ item for item in clean_file_list if test_string not in item ] 


## Print a warning if no fastq files found in directory ##
def check_list(ls):                                                                 # create a func called check_list 
    '''Takes a list; if empty prints an error msg. Always returns the list''' 
    if not ls:                                                                      # empty list: nothing matched "fastq" 
        er_msg = '''Looks like your dir has no fastq files in it. Make sure
        the $PATH is correct and you have fastq files in it.'''
        print(er_msg) 
    return ls                                                                       # return the list so len() below works even when it is empty 

total_fastq_files = len(check_list(fastq_files))
print("Found", total_fastq_files, "*fastq files!")

Found 4 *fastq files!
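
Note that the substring test above keeps any filename containing `fastq`, including something like `fastq_notes.txt`. A stricter variant, sketched below, matches on the file extension instead. The set of suffixes and the filenames in `listing` are assumptions made up for illustration; in the notebook, `listing` would be the result of `os.listdir()`.

```python
# ASSUMPTION: these are the extensions your sequencing files use
FASTQ_SUFFIXES = (".fastq", ".fastq.gz", ".fq", ".fq.gz")


def split_fastq(filenames):
    """Return (fastq, other) lists, matching on the file extension only."""
    fastq, other = [], []
    for name in filenames:
        # str.endswith accepts a tuple and returns True if any suffix matches
        if name.lower().endswith(FASTQ_SUFFIXES):
            fastq.append(name)
        else:
            other.append(name)
    return fastq, other


# Hypothetical directory listing for illustration
listing = [
    "20MDAk00x001p05F0671_S1_L001_R1_001.fastq.gz",
    "20MDAk00x001p05F0671_S1_L001_R2_001.fastq.gz",
    "fastq_notes.txt",
    ".DS_Store",
]

fastq_files, non_fastq_files = split_fastq(listing)
print("Found", len(fastq_files), "fastq files;", len(non_fastq_files), "other files")
```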


### Capture only AMD IDs ### 

In [8]:
## Filter only filenames with underscores ## 
undscore_fastq_list = list(filter(lambda x: '_' in x, fastq_files))

## Get AMD ID ## 
amd_ids = [] 

## Takes list, keeps everything left of the first "_" in each filename ##
for items in undscore_fastq_list:
    amd_ids.append(items.split("_", 1)[0])                                          # split once at the 1st "_" and keep the left part 
        
print(len(amd_ids), "AMD IDs found out of", total_fastq_files, "*fastq files")


## Save correct length AMD IDs ## 

amdid_oklen =[] 
wrong_len = []

for item in amd_ids:
    if len(item) == 20: amdid_oklen.append(item)
    else: wrong_len.append(item)

if wrong_len:
    print("These were incorrect length:", wrong_len)

print("Found correct length AMD IDs:", len(amdid_oklen), "out of", len(amd_ids)) 



4 AMD IDs found out of 4 *fastq files
Found correct length AMD IDs: 4 out of 4


In [9]:
## Separate ind from pooled samples ## 

substring = "p"                                                 # p = pooled samples 

## Create empty lists for pooled and individual samples ## 
pool_fq = [] 
ind_fq = [] 

## Loop through AMD ID fastq filenames; separate = ind v pool 
for items in amdid_oklen:
    if substring in items: pool_fq.append(items)                # save pooled IDs in pool_fq list 
    else: ind_fq.append(items)                                  # save ind IDs in ind_fq list 
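
The substring test above treats any ID containing a lowercase `p` as pooled, so an individual sample whose site code happens to include `p` would be misfiled. A positional check is one way to tighten this. The index used below is an assumption read off the pooled example ID `17GNDo00x001p10F1290`, and the third test ID is made up for illustration.

```python
# ASSUMPTION: 0-based position of the "p" flag, read off the pooled example ID
POOL_FLAG_INDEX = 12


def is_pooled(amd_id):
    """True if a 20-character AMD ID carries the pooled flag at the expected position."""
    return len(amd_id) == 20 and amd_id[POOL_FLAG_INDEX] == "p"


ids = [
    "17GNDo00x001p10F1290",   # pooled example from the description
    "17GNDo00F0001PfF1290",   # individual example from the description
    "17GNAp00F0001PfF1290",   # hypothetical individual sample with a "p" in the site code
]

pool_fq = [i for i in ids if is_pooled(i)]
ind_fq = [i for i in ids if not is_pooled(i)]
print("Pooled:", pool_fq)
print("Individual:", ind_fq)
```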

In [13]:
## Create df of AMD IDs ## 

df_ind = pd.DataFrame(ind_fq, columns =['AMD_ID_IND'])

df_pool = pd.DataFrame(pool_fq, columns =['AMD_ID_POOL'])


### QC check of AMD ID contents ### 
Please provide the expected year, country, states/province, treatment and bit_marker information. 

In [15]:
## Modify inputs below as string ## 
year = '20'
country = 'GN'
state = 'Ma'
#state_2 = 'Ak'
treatment = 'A' 
bit_markers = '0671'
pool = 'p' 

## If need to add additional filters, modify above and in code below ## 

## Run QC test for each category for pooled samples & add column if T/F ##
df_pool[year] = df_pool['AMD_ID_POOL'].str.contains(year, regex=False, case=True)
df_pool[country] = df_pool['AMD_ID_POOL'].str.contains(country, regex=False, case=True)
df_pool[state] = df_pool['AMD_ID_POOL'].str.contains(state, regex=False, case=True)
df_pool[treatment] = df_pool['AMD_ID_POOL'].str.contains(treatment, regex=False, case=True)
#df_pool[state_2] = df_pool['AMD_ID_POOL'].str.contains(state_2, regex=False, case=True)
df_pool[bit_markers] = df_pool['AMD_ID_POOL'].str.contains(bit_markers, regex=False, case=True)
df_pool[pool] = df_pool['AMD_ID_POOL'].str.contains(pool, regex=False, case=True)

## Run QC test for each category for individual samples & add column if T/F ## 

df_ind[year] = df_ind['AMD_ID_IND'].str.contains(year, regex=False, case=True)
df_ind[country] = df_ind['AMD_ID_IND'].str.contains(country, regex=False, case=True)
df_ind[state] = df_ind['AMD_ID_IND'].str.contains(state, regex=False, case=True)
#df_ind[state_2] = df_ind['AMD_ID_IND'].str.contains(state_2, regex=False, case=True)
df_ind[treatment] = df_ind['AMD_ID_IND'].str.contains(treatment, regex=False, case=True)
df_ind[bit_markers] = df_ind['AMD_ID_IND'].str.contains(bit_markers, regex=False, case=True)

## TODO: make above more pythonic: 
## - Figure out a way to use a list of check variables iteratively and add columns 
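
One possible way to address the TODO is to keep the expected values in a single mapping and loop over it. The sketch below uses plain Python so it runs without any packages; `qc_row` and `pool_checks` are names introduced here for illustration, and the ID is the one shown in this notebook's output.

```python
# Expected values, keyed by a descriptive label (values copied from this notebook)
checks = {
    "year": "20",
    "country": "GN",
    "state": "Ma",
    "treatment": "A",
    "bit_markers": "0671",
}


def qc_row(amd_id, expected, id_column):
    """Build one result row: the ID plus a True/False entry per expected value."""
    row = {id_column: amd_id}
    for value in expected.values():
        # plain substring test, same result as str.contains(value, regex=False, case=True)
        row[value] = value in amd_id
    return row


pool_ids = ["20MDAk00x001p05F0671"]
pool_checks = {**checks, "pool": "p"}     # pooled samples get the extra "p" check
pool_rows = [qc_row(amd_id, pool_checks, "AMD_ID_POOL") for amd_id in pool_ids]
print(pool_rows)
```

Passing the rows to `pd.DataFrame(pool_rows)` gives a table with the same column layout as `df_pool`; adding a new filter then only requires a new entry in `checks`.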


### The code below identifies erroneous AMD IDs ### 
A .csv file will be created on the desktop listing the AMD IDs and the elements that didn't match. 

In [31]:
## Change dir to desktop to save erroneous IDs ## 

os.chdir(user_path + "/Desktop/")

## List any filenames that had any incorrect elements ## 

df_ind_check = df_ind[(df_ind == False).any(axis=1)]

df_pool_check = df_pool[(df_pool == False).any(axis=1)]

if len(df_ind_check) != 0:
    print('Errors found in individual samples. Check df_ind_check df')
    print(df_ind_check)
    df_ind_check.to_csv('Erroneous_Individual_AMDIDS.csv') 
else:
    print('No errors found in individual sequenced samples.') 

print(' ')
    
if len(df_pool_check) != 0:
    print('Errors found in pooled samples. Check df_pool_check df')
    print(df_pool_check)
    df_pool_check.to_csv('Erroneous_Pooled_AMDIDS.csv') 
else:
    print('No errors found in pooled samples.') 

No errors found in individual sequenced samples.
 
Errors found in pooled samples. Check df_pool_check df
            AMD_ID_POOL    20     GN     Ma     A  0671     p
0  20MDAk00x001p05F0671  True  False  False  True  True  True
1  20MDAk00x001p05F0671  True  False  False  True  True  True


### Below is example code to fix AMD_IDS ### 

In [29]:
## Fix erroneous AMD IDs ##






Unnamed: 0,AMD_ID_POOL,20,GN,Ma,A,0671,p
0,20MDAk00x001p05F0671,True,False,False,True,True,True
1,20MDAk00x001p05F0671,True,False,False,True,True,True
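
The fix cell above is empty in the source. As a rough sketch only, one way to correct an ID is to overwrite the mismatched field at its position. This assumes the ID itself is wrong (rather than the expected values entered for the QC check) and that country and site sit at 0-based positions 2-3 and 4-5, as inferred from the example IDs; `replace_field` is a hypothetical helper, not part of the MaRS code.

```python
def replace_field(amd_id, start, new_value):
    """Return amd_id with the characters beginning at `start` overwritten by new_value."""
    end = start + len(new_value)
    if end > len(amd_id):
        raise ValueError("replacement runs past the end of the ID")
    return amd_id[:start] + new_value + amd_id[end:]


bad_id = "20MDAk00x001p05F0671"          # ID flagged in the output above
fixed = replace_field(bad_id, 2, "GN")   # country field (ASSUMED position 2-3)
fixed = replace_field(fixed, 4, "Ma")    # site field (ASSUMED position 4-5)
print(bad_id, "->", fixed)
```

The corrected string could then be used to rename the matching .fastq file, for example with `os.rename`, after confirming the correct values with whoever assigned the IDs.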


In [450]:
## Clear variables, modules, etc ## 

# reset workspace; see %reset -h (keep comments on their own line: text after a magic is parsed as its arguments) 
#%reset -f

#dir()              # check dir is clean 

