> Author: @ ET 4/13/22 :goat:  
>> edited by: 
------

>#### Activity Name ####
 - [ ] Add code to grab the IDs from the .fastq files in example data 
>#### Completed Activity ✓ ####


------

## AMD ID Quality Check of FASTQ Filenames ##

### Background: ### 

A standardized sample naming schema is used to capture all associated metadata prior to sequencing. Briefly, the `AMD ID` consists of 20 characters that capture information such as collection year, geography, treatment, sample type, and the molecular markers included for each sample. 

Capturing this information at the pre-processing stage reduces the need for multiple documents holding the same information, removes the potential for mislabeling or tracking errors, and gives the bioinformatics team sufficient information to perform subsequent analyses and standardize analysis pipelines, including submission of data to NCBI. 

#### AMD ID Description #### 

* The AMD ID Key:  `<year> <country> <state/site> <day of failure> <treatment> <sample_id> <genus spp> <sample type> <mol marker bit code> <# sample processed>`. 
* Any missing metadata is replaced with a _lowercase_ `x` for each character position. 

Example:
- `Individual` sequenced sample ID: `17GNDo00F0001PfF1290` = `<2017> <Guinea> <Dorota> <Day0> <AS+AQ> <0001> <P.falciparum> <FilterBloodSpot> <k13-crt-mdr-dhfr-dhps-cytB-cpmp-pfs47> <0>` 

                       
- `Pooled` sequenced sample ID: `17GNDo00x001p10F1290` = `<2017> <Guinea> <Dorota> <Day0> <missing info> <001> <Pooled Samples> <Samples in Pool> <FilterBloodSpot> <k13-crt-mdr-dhfr-dhps-cytB-cpmp-pfs47> <0>`
       
NOTE: If information is not available (na), **x** is used for each character position. For example, in the pooled samples Treatment has (2) character spaces, represented as a two-digit integer code. This is replaced with (2) **xx**, since a pool may contain samples with different Treatment information. Moreover, for pooled samples, **Sample ID** is replaced with a **three-digit number + the letter p** (for pooled), and **Genus** is replaced with the **total number of SamplesInPool** as a (2) digit number. 

Please see presentation at `MaRS/Geneious_workflow/01_sample_ID_QC/files/AMD_ID_create_key.pptx` for more descriptive information. 
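
To make the key concrete, here is a minimal sketch of splitting an `AMD ID` into named fields by character position. The field names and widths (`AMD_ID_FIELDS`) are assumptions inferred from the two example IDs above, where Treatment occupies a single character; they are not taken from the presentation, so adjust them if the official key differs.

```python
# Field names and widths are inferred from the two example IDs (assumption).
AMD_ID_FIELDS = [
    ("year", 2),
    ("country", 2),
    ("site", 2),
    ("day", 2),
    ("treatment", 1),
    ("sample_id", 4),
    ("genus_spp", 2),
    ("sample_type", 1),
    ("marker_code", 3),
    ("replicate", 1),
]


def split_amd_id(amd_id):
    """Split an AMD ID into named fields using fixed character widths."""
    expected = sum(width for _, width in AMD_ID_FIELDS)
    if len(amd_id) != expected:
        raise ValueError(f"expected {expected} characters, got {len(amd_id)}")
    fields = {}
    start = 0
    for name, width in AMD_ID_FIELDS:
        fields[name] = amd_id[start:start + width]
        start += width
    return fields


individual = split_amd_id("17GNDo00F0001PfF1290")
pooled = split_amd_id("17GNDo00x001p10F1290")
print(individual)
print(pooled)
```

For the pooled example, the `sample_id` slot holds `001p` and the `genus_spp` slot holds the pool size `10`, matching the NOTE above.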
     
### Code information ###

This code checks whether the `AMD ID` is the correct length and contains all the required elements described above. 
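
As a rough illustration of such a check (a sketch, not the notebook's own implementation), the hypothetical `check_amd_id` below tests three things: the ID is 20 characters long, it contains only letters and digits, and the year is two digits or `xx`. The last two rules are assumptions based on the examples above.

```python
import re


def check_amd_id(amd_id):
    """Return a list of problems found in an AMD ID; an empty list means it passed."""
    problems = []
    if len(amd_id) != 20:
        problems.append(f"length is {len(amd_id)}, expected 20")
    # Assumption: IDs use only letters and digits, with "x" marking missing data.
    if not re.fullmatch(r"[A-Za-z0-9]*", amd_id):
        problems.append("contains characters other than letters and digits")
    # Assumption: the year is the first two characters, two digits or "xx".
    if not re.fullmatch(r"\d{2}|xx", amd_id[:2]):
        problems.append("year is not two digits or 'xx'")
    return problems


for candidate in ["17GNDo00F0001PfF1290", "17GNDo00F0001PfF129", "17GN-o00F0001PfF1290"]:
    print(candidate, check_amd_id(candidate))
```

Returning a list of problems rather than a single boolean makes it easier to report why a given filename was rejected.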

### Required packages ###
- [Pandas](https://pandas.pydata.org/) 
- [os](https://docs.python.org/3/library/os.html)
- [re](https://docs.python.org/3/library/re.html) 

### Inputs ###
- .fastq filenames in the `AMD ID` convention 

### Outputs ### 
- List of .fastq files with an incorrect `AMD ID` 
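
The input and output above could be connected roughly as follows. This sketch assumes the `AMD ID` is the part of the filename before the first `_` or `.` (for example a hypothetical `17GNDo00F0001PfF1290_S1_L001_R1_001.fastq.gz`); the function names are illustrative, and only the length is checked here.

```python
import os
import re


def amd_id_from_filename(filename):
    """Return the text before the first "_" or "." (assumed to be the AMD ID)."""
    return re.split(r"[_.]", os.path.basename(filename), maxsplit=1)[0]


def find_incorrect_ids(fastq_dir, expected_length=20):
    """List (filename, candidate_id) pairs whose candidate ID has the wrong length."""
    incorrect = []
    for name in sorted(os.listdir(fastq_dir)):
        if not name.endswith((".fastq", ".fastq.gz")):
            continue  # skip anything that is not a FASTQ file
        candidate = amd_id_from_filename(name)
        if len(candidate) != expected_length:
            incorrect.append((name, candidate))
    return incorrect
```

Calling `find_incorrect_ids(fastq_filepath)` with a directory of FASTQ files would return an empty list when every filename starts with a 20-character ID.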

In [10]:
## Clear variables, modules, etc ## 

# %reset -f          # reset workspace; see %reset -h

# dir()              # check dir is clean 

In [7]:
## Import dependencies ## 

import pandas as pd
import os
import re


In [25]:
## Get current dir $PATHs ## 

current_dir = os.getcwd()
print(current_dir)

## Get user $PATH ## 

def slash_split(string):
    '''Return everything to the left of the 3rd "/" in string; if string has no "/", return it unchanged.'''
    if string.count("/") == 0:
        return string 
    return "/".join(string.split("/", 3)[:3]) 

user_path = slash_split(current_dir)

## Change dir to example .fastq data ## 

#os.chdir(user_path + "/MaRS/Geneious_workflow/02_geneious_analysis/example_data")

/Users/eldintalundzic/MaRS/Geneious_workflow/02_geneious_analysis/example_data
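
The `slash_split` helper above works by counting `/` characters. A possible alternative, shown only as a sketch, uses `pathlib` to keep the root plus the first two directory names. The function name `first_components` and the example path are hypothetical.

```python
from pathlib import PurePosixPath


def first_components(path, n=2):
    """Keep the root plus the first n directory names of a POSIX-style path."""
    parts = PurePosixPath(path).parts  # e.g. ('/', 'Users', 'someuser', 'MaRS')
    return str(PurePosixPath(*parts[: n + 1]))


print(first_components("/Users/someuser/MaRS/Geneious_workflow"))
```

`PurePosixPath` only manipulates the path text and never touches the file system, so the sketch behaves the same on any operating system.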


In [28]:
## Provide the path to your own .fastq files ## 

fastq_filepath = input("Provide full path to fastq filenames:") 



Provide full path to fastq filenames: /Users/eldintalundzic/MaRS/Geneious_workflow/02_geneious_analysis/example_data


In [27]:
os.getcwd()

'/Users/eldintalundzic/MaRS/Geneious_workflow/02_geneious_analysis/example_data'

In [29]:
fastq_filepath

'/Users/eldintalundzic/MaRS/Geneious_workflow/02_geneious_analysis/example_data'