> Author: @ DP 
>> edited by: @DP 4/29/22 version 0.3

------

>#### To Do ####

>#### Completed Activity ####

- [x] Add code to grab the IDs from the .fastq files in example data 
- [x] Modify a regex of sample ID check for sample mismatch
- [x] Reviewed the code using csv and fastq files.


------


## AMD ID Quality Check ##

### Background: ### 

A standardized sample naming schema is used to capture all associated metadata prior to sequencing. Briefly, the `AMD ID` consists of 20 characters that capture information such as collection year, geographical and treatment information, sample type, and the molecular markers included for each sample. 

Capturing this information at the pre-processing stage reduces the need to maintain multiple documents, lowers the potential for mislabeling or tracking errors, and gives the bioinformatics team sufficient information to perform subsequent analyses and standardize analysis pipelines, including submission of data to NCBI. 

#### AMD ID Description #### 

* The AMD ID Key:  `<year> <country> <state/site> <day of failure> <treatment> <sample_id> <genus spp> <sample type> <mol marker bit code> <# sample processed>`. 
* Any missing metadata is replaced with a _lowercase_ `x` for each character position. 

Example:
- `Individual` sequenced sample ID: `17GNDo00F0001PfF1290` = `<2017> <Guinea> <Dorota> <Day0> <AS+AQ> <0001> <P.falciparum> <FilterBloodSpot> <k13-crt-mdr-dhfr-dhps-cytB-cpmp-pfs47> <0>` 

                       
- `Pooled` sequenced sample ID: `17GNDo00x001p10F1290` = `<2017> <Guinea> <Dorota> <Day0> <missing info> <001> <Pooled Samples> <Samples in Pool> <FilterBloodSpot> <k13-crt-mdr-dhfr-dhps-cytB-cpmp-pfs47> <0>`
       
NOTE: If information is not available (na), **x** is used for each character position. For example, in the pooled samples Treatment has (2) character spaces, represented as a two digit integer code. This is replaced with (2) **xx** since it's a pool of samples that may have different Treatment information. Moreover, for pooled samples, **Sample ID** is replaced with a **three digit number + the letter p** (for pooled), and **Genus** is replaced with the **total number of SamplesInPool** as a (2) digit number. 

Please see presentation at `MaRS/Geneious_workflow/01_sample_ID_QC/files/AMD_ID_create_key.pptx` for more descriptive information. The AMD ID can be generated using the `MaRS/Geneious_workflow/01_sample_ID_QC/files/AMD_ID_create_template.xlsx`. 
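
As a rough illustration of the key above, the sketch below slices a 20-character ID into its ten fields by position. The field names and the `split_amd_id` helper are hypothetical, and the field widths are an assumption inferred from the two example IDs in this section rather than taken from the presentation.

```python
# Field widths inferred from the example IDs above (assumption); they sum to 20.
AMD_ID_FIELDS = [
    ("year", 2),
    ("country", 2),
    ("site", 2),
    ("treatment_day", 2),
    ("treatment", 1),
    ("sample_id", 4),
    ("genus_or_pool_size", 2),
    ("sample_type", 1),
    ("marker_code", 3),
    ("repeat", 1),
]


def split_amd_id(amd_id):
    """Slice a 20-character AMD ID into a dict of named fields (hypothetical helper)."""
    if len(amd_id) != 20:
        raise ValueError(f"expected 20 characters, got {len(amd_id)}")
    fields = {}
    start = 0
    for name, width in AMD_ID_FIELDS:
        fields[name] = amd_id[start:start + width]
        start += width
    return fields


# The individual and pooled examples from this section
print(split_amd_id("17GNDo00F0001PfF1290"))
print(split_amd_id("17GNDo00x001p10F1290"))
```

For the pooled example, `sample_id` comes back as `001p` and `genus_or_pool_size` as `10`, which matches the pooled-sample convention described in the note.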

     
### Code information ###

This code checks whether the `AMD ID` is the correct length and contains all the required elements described above. 

### Required packages ###
- [Pandas](https://pandas.pydata.org/) 
- [os](https://docs.python.org/3/library/os.html)
- [re](https://docs.python.org/3/library/re.html) 
- [tabulate](https://pypi.org/project/tabulate/)

### Inputs ###
- A .csv file that includes the AMD_IDs, or fastq file names in AMD_ID format

### Outputs ### 
- List of incorrect AMD_IDs 
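
The input/output contract above can be sketched as a single function that takes a list of IDs and returns the ones that fail. This is a minimal sketch, not the notebook's implementation: `find_invalid_ids` is a hypothetical name, and the pattern restates the regular expression used in the validation cell, with `fullmatch` used so that the 20-character length is enforced by the pattern itself.

```python
import re

# Pattern restated from the notebook's validation cell, one piece per field.
AMD_ID_PATTERN = re.compile(
    r"[0-9]{2}"                # year
    r"[A-Zx]{2}"               # country
    r"[A-Za-z]{2}"             # state/site
    r"[0-9x]{2}"               # treatment day
    r"[A-Yx]"                  # treatment
    r"[0-9]{3}(?:[0-9]|p)"     # sample ID (ends in 'p' for pooled samples)
    r"(?:[0-9]{2}|[Pf]{2})"    # genus, or number of samples in the pool
    r"[A-Zx]"                  # sample type
    r"[0-9x]{3}"               # molecular marker bit code
    r"[0-9]"                   # number of times the sample was processed
)


def find_invalid_ids(ids):
    """Return the IDs that do not fully match the AMD ID pattern (hypothetical helper)."""
    return [i for i in ids if AMD_ID_PATTERN.fullmatch(i) is None]


candidates = [
    "17GNDo00F0001PfF1290",   # individual example from the description
    "17GNDo00x001p10F1290",   # pooled example from the description
    "17GNDo00F0001PfF129",    # one character short
    "NTC-DFR",                # control name, not an AMD ID
]
print(find_invalid_ids(candidates))
```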


### The code below asks the user for a file or directory and checks whether the input is a set of fastq files or a csv file of sample IDs. ###

**Note:** Use a .csv file as input when the IDs need to be checked before sequencing; otherwise, use the fastq files from a sample directory for QC.
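
To show how IDs are recovered from fastq file names, here is a small sketch that keeps the text before the first underscore of each file name and drops duplicates (paired reads share one ID). The `ids_from_fastq_names` helper and the example paths, including their Illumina-style suffixes, are assumptions made for illustration.

```python
import os


def ids_from_fastq_names(paths):
    """Return unique sample IDs (text before the first "_") in first-seen order."""
    seen = []
    for path in paths:
        sample_id = os.path.basename(path).split("_")[0]
        if sample_id not in seen:
            seen.append(sample_id)
    return seen


# Hypothetical file names; R1/R2 pairs collapse to a single ID
example_paths = [
    "/data/run1/17GNDo00F0001PfF1290_S1_L001_R1_001.fastq.gz",
    "/data/run1/17GNDo00F0001PfF1290_S1_L001_R2_001.fastq.gz",
    "/data/run1/17GNDo00x001p10F1290_S2_L001_R1_001.fastq.gz",
]
print(ids_from_fastq_names(example_paths))
```

Using `os.path.basename` instead of splitting on `/` keeps the extraction independent of the path separator.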

In [1]:
## Import dependencies ## 

import pandas as pd
import os
import re
import glob
import tabulate


print("specify the file type: \n")

while True:
    try:
        file_type = input("csv or raw_fastq")
        
        if file_type == "csv":                                                                       # If the file is csv (only ids)
            Sample_filepath  = input("Please enter a valid file path to a csv: ")                    # Ask user input for the path of .csv file with sample IDs

            while not os.path.isfile(Sample_filepath):
                    print("Error: That is not a valid file, try again...")                            # Error if file is not found
                    Sample_filepath = input("Please enter a valid file path to a csv: ")              # If error, enter again the file path
                    
            Sample_file = pd.read_csv(Sample_filepath)                                                # Read a csv file using pandas csv function
            
            break
            
                                                                                                      # If using a fastq files from a specific directory
        elif file_type == "raw_fastq":
            
            inputdirectory = input("Enter the full path of the folder containing your files: ")       # input directory name  
            
            
            while not os.path.isdir(inputdirectory):
                    print("Error: That is not a valid Folder, try again...")                                             # Error if folder is not found
                    inputdirectory = input("Please type in the full path of the folder containing your files:   ")       # If error, enter again the file path          
            
            inputfile_extensions = input("Please type in the file extension of your files: ")                            # input extension of files (gz or fastq.gz)
            files = os.path.join(inputdirectory, "*." + inputfile_extensions)                                            # join the directory path and the file extension into a glob pattern
           
            my_file = glob.glob(files)                                                                                   # use the glob function to list the matching files
            
            clean_filenames = [os.path.basename(doc_name).split("_")[0] for doc_name in my_file]                         # keep the part of each file name before the first underscore (the sample ID)
            Sample_file = pd.DataFrame(clean_filenames, columns=["AMD_ID"])                                              # add column name to data frame called AMD_ID
            Sample_file = Sample_file.drop_duplicates()                                                                  # drop duplicates from list
            print(Sample_file)
            break        
        
            
        else:
            print("Invalid file type.")      
   

    except ValueError:
        
        print("provide a correct file type...")
        continue



specify the file type: 



csv or raw_fastq csv
Please enter a valid file path to a csv:  /Users/dhruviben/Desktop/PARMA-SOP/test_script.csv


In [2]:
## Create empty lists for AMD_IDs 

Sample_no_match = []        # all IDs with no match will be saved in this list
Sample_with_match = []      # all IDs with length 20 will be saved in this list
 
## First part: check whether each Sample ID has length 20

Sample_name = Sample_file.rename(columns={'Sample':"Sample_ID", 'AMD_ID': "Sample_ID",'AMD ID (Pooled)': "Sample_ID", 'Document Name': "Sample_ID"})      # rename the ID column to Sample_ID, as different files might use different column names
  
SampleID_df = Sample_name[['Sample_ID']]                       # create a dataframe from the Sample_ID column

# remove US controls to avoid any errors in sample ID

SampleID_df = SampleID_df[SampleID_df['Sample_ID'].str.contains("USxxxx") == False]


for rows in SampleID_df.index:                                 # run a for loop on each rows
    
    sample_name = SampleID_df['Sample_ID'][rows].split('\n')   # split rows by newline
    for each_ID in sample_name:
        if len(each_ID) == 20 :                                # if length is 20, save the samples in Sample_with_match list
            Sample_with_match.append(each_ID) 
        else: 
            Sample_no_match.append(each_ID)                    # if length is not 20 then save the results in Sample_no_match list. 
            #print(each_ID,"has length", int(len(each_ID)))     # print the sample ID with its length if less than 20



In [3]:
## Second part: check whether every ID of length 20 matches the AMD ID regular expression, as shown in the description at the beginning.

for each_file  in Sample_with_match:                                     # Run a for loop for each file in Sample_with_match list
    
    AMD_ID =('([0-9]{2})([A-Zx]{2})([A-Za-z]{2})([0-9x]{2})([A-Yx]{1})([0-9]{3})(([0-9]{1})|([p]{1}))(([0-9]{2})|([Pf]{2}))([A-Zx]{1})([0-9x]{3})([0-9]{1})')
             
                                                                         # split AMD ID by its information using regular expression
   
    AMD_group = re.match(AMD_ID,each_file)                               # match each ID with pattern
    
    if AMD_group is None :                                               # if no match is found
            
        Sample_no_match.append(each_file)                                # append the ID to the list
        #print(each_file, "is not matching with ID")
        
    else:
       
        pass                                                             # if ID match with regex, pass
#print(Sample_with_match)



In [4]:
## Lastly, print all the IDs without a match so that the user can review and correct them before further processing.
print(len(Sample_no_match), "out of", len(SampleID_df), "samples did not match with AMD_ID")         # print the total number of samples that did not match 


if len(Sample_no_match) == 0:
    print("You are good to proceed with analysis: all the samples passed the QC test")
else:
    print("\nHere is the list of samples that did not match")
    print("\n".join(str(ID) for ID in Sample_no_match))                                             # print the list of IDs that did not match


11 out of 23 samples did not match with AMD_ID

Here is the list of samples that did not match
20MDAn00X002p5F0671
20MDAn00X00205F0671
19ANBe00A0009PfFxxx
19ANBe00A0010PfFxxx
NTC-DFR
17GNDo00F0001PfF129
17GNDo00F0001PfF12911
17GNDo00F0001PfF1
17GNDo0F0001PfF1291
191NBe00A0026PfFxxx0
19ANBe00A0031P1Fxxx0
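
For IDs that have the right length but still fail, it can help to report which field is at fault instead of reading it off by eye. The sketch below is one possible approach, not part of the original notebook: `failing_fields` and the field names are hypothetical, and the per-field patterns are an assumed decomposition of the validation regex used above.

```python
import re

# (field name, width, pattern) derived from the validation regex (assumed decomposition)
FIELD_RULES = [
    ("year", 2, r"[0-9]{2}"),
    ("country", 2, r"[A-Zx]{2}"),
    ("site", 2, r"[A-Za-z]{2}"),
    ("treatment_day", 2, r"[0-9x]{2}"),
    ("treatment", 1, r"[A-Yx]"),
    ("sample_id", 4, r"[0-9]{3}[0-9p]"),
    ("genus_or_pool_size", 2, r"[0-9]{2}|[Pf]{2}"),
    ("sample_type", 1, r"[A-Zx]"),
    ("marker_code", 3, r"[0-9x]{3}"),
    ("repeat", 1, r"[0-9]"),
]


def failing_fields(amd_id):
    """Return the names of the fields that break their rule, or ["length"]."""
    if len(amd_id) != 20:
        return ["length"]
    bad = []
    start = 0
    for name, width, pattern in FIELD_RULES:
        chunk = amd_id[start:start + width]
        if re.fullmatch(pattern, chunk) is None:
            bad.append(name)
        start += width
    return bad


# Two of the 20-character IDs from the list above, plus a short one
for sample in ["191NBe00A0026PfFxxx0", "19ANBe00A0031P1Fxxx0", "17GNDo00F0001PfF129"]:
    print(sample, failing_fields(sample))
```

An empty list means every field passed, so the same function doubles as a validity check.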


In [5]:
# This part of the code runs through the Sample_no_match list and creates a table keyed by AMD ID field, so the user can visually identify where an ID does not match the key.

data_regex_QC = []                               


# Loop through the list of unmatched sample IDs, split each ID by key using a regex, and create a dictionary.

for id in Sample_no_match:
    Dict_QC_re = {"name": id, "length_of_sample_ID": len(id)}                 # start each row with the sample name and its length
    if len(id) >= 15:
        match = re.match(r"(?P<year>\w{2})(?P<country>\w{2})(?P<Site>\w{2})(?P<Treatment_Day>\w{2})(?P<Treatment>\w{1})(?P<ID>\w{4})(?P<Genus_Pooled>\w{,2})(?P<Type>\w{,1})(?P<GenemarkerCode>\w{,3})(?P<Repeat>\w{0,})", id)
        if match is not None:                                                 # IDs with non-word characters (e.g. "-") in the first 13 positions cannot be split
            Dict_QC_re.update(match.groupdict())                              # add one key-value pair per AMD ID field
    data_regex_QC.append(Dict_QC_re)                                          # IDs shorter than 15 characters keep only name and length
                                   
        
if len(data_regex_QC) != 0:                                            # If length of list is not 0; 
    header = data_regex_QC[0].keys()                                   # header = keys of dict
    rows = [x.values() for x in data_regex_QC]                         # rows will be value of dict
    print(tabulate.tabulate(rows, header, tablefmt="grid"))            # use the tabulate module to create a table

else:
    print('\n',"All the samples are matching with AMD_ID","\n", "No errors found in samples")        # If all the IDs matched with AMD id no table will be created. 
     

+-----------------------+-----------------------+--------+-----------+--------+-----------------+-------------+------+----------------+--------+------------------+----------+
| name                  |   length_of_sample_ID |   year | country   | Site   | Treatment_Day   | Treatment   | ID   | Genus_Pooled   | Type   | GenemarkerCode   | Repeat   |
| 20MDAn00X002p5F0671   |                    19 |     20 | MD        | An     | 00              | X           | 002p | 5F             | 0      | 671              |          |
+-----------------------+-----------------------+--------+-----------+--------+-----------------+-------------+------+----------------+--------+------------------+----------+
| 20MDAn00X00205F0671   |                    19 |     20 | MD        | An     | 00              | X           | 0020 | 5F             | 0      | 671              |          |
+-----------------------+-----------------------+--------+-----------+--------+-----------------+-------------+------+-------