## Quality check of sample ID 

## Background: 

The CDC Malaria Next-Generation Sequencing lab assigns each sample an ID called the AMD ID. It is 20 characters long and encodes specific metadata, which makes it easier to standardize sample names and track samples. It also simplifies submitting samples to NCBI, which requires associated metadata (attributes) to be submitted with each fastq record.
  
   **AMD ID** is a standardized sample identification number that includes the ID number and the metadata associated with the sample.
 
   - The AMD ID and bit code key are found under MS Teams > SOPs - Lab > Files > Sample Naming > Sample_naming_key.pptx
   
   
   - Key: **Year Country State/Site DayofTreatment Treatment SampleID Genus SampleType GeneMarker-8bitcode SampleSequencingCount**

   
        - Example:
    
            - Individual sequenced sample ID: 17GNDo00F0001PfF1290 = 2017 Guinea Dorota Day0 AS+AQ 0001 P.falciparum FilterBloodSpot k13-crt-mdr-dhfr-dhps-cytB-cpmp-pfs47 0 
                          
            - Pooled sequenced sample ID: 17GNDo00X001p10F1290 = 2017 Guinea Dorota Day0 **X** 001**Pooled SamplesInPool** FilterBloodSpot k13-crt-mdr-dhfr-dhps-cytB-cpmp-pfs47 0  
           
              - NOTE: If information is not available (na), **x** is used. For pooled samples, Treatment is x because the ID represents a pool of multiple samples rather than a single sample.

              - NOTE: For pooled samples, **Sample ID** is replaced with a **three-digit sample ID + p (Pooled)** and **Genus** is replaced with the **total number of SamplesInPool**, to indicate that this is a pooled sequenced sample and to record the sample count in the pool.
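
Because every field has a fixed width, an AMD ID can be split by position alone. The following is a minimal sketch, assuming the field widths implied by the key and the two examples above (2, 2, 2, 2, 1, 4, 2, 1, 3, 1). The function name `split_amd_id` and the field names are hypothetical and are not part of the lab's code.

```python
# Field names are illustrative; the widths follow the key above and sum to 20.
AMD_FIELDS = [
    ("year", 2),
    ("country", 2),
    ("site", 2),
    ("treatment_day", 2),
    ("treatment", 1),
    ("sample_id", 4),            # for pooled samples: three digits + "p"
    ("genus_or_pool_size", 2),   # for pooled samples: number of samples in the pool
    ("sample_type", 1),
    ("gene_marker_code", 3),
    ("sequencing_count", 1),
]


def split_amd_id(amd_id):
    """Split a 20-character AMD ID into its fixed-width fields."""
    if len(amd_id) != 20:
        raise ValueError(f"expected 20 characters, got {len(amd_id)}")
    fields = {}
    position = 0
    for name, width in AMD_FIELDS:
        fields[name] = amd_id[position:position + width]
        position += width
    return fields


print(split_amd_id("17GNDo00F0001PfF1290"))   # individual sample from the example
print(split_amd_id("17GNDo00X001p10F1290"))   # pooled sample from the example
```

Slicing only separates the fields. It does not check that each field holds an allowed value, which is what the regular expression later in this notebook is for.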
              
## Code information 

This code reads sample IDs from a .csv file or from fastq file names and checks whether each sample ID matches the AMD ID format.

### Required packages
- [Pandas](https://pandas.pydata.org/) 
- [tabulate](https://pypi.org/project/tabulate/)

### Inputs
- AMD IDs in a .csv file, or fastq file names

### Outputs:
- A list of sample IDs that do not match the standardized AMD ID format.


### The code below asks the user for the input type (a .csv file of sample IDs or a folder of fastq files) and for the corresponding path.

Use a .csv file as input to QC the sample IDs before sequencing; otherwise, use the fastq files from a folder.

In [3]:
## Import dependencies ##

import glob
import os
import re

import pandas as pd
import tabulate


print("specify the file type: \n")

while True:
    try:
        file_type = input("csv or raw_fastq")

        if file_type == "csv":                                                    # Input is a .csv file that lists the sample IDs
            Sample_filepath = input("Please enter a valid file path to a csv: ")  # Ask for the path of the .csv file

            while not os.path.isfile(Sample_filepath):                            # Keep asking until the path points to an existing file
                print("Error: That is not a valid file, try again...")
                Sample_filepath = input("Please enter a valid file path to a csv: ")

            Sample_file = pd.read_csv(Sample_filepath)                            # Read the .csv file into a DataFrame
            break

        elif file_type == "raw_fastq":                                            # Input is a folder of fastq files
            inputdirectory = input("Enter the full path of the folder containing your files: ")

            while not os.path.isdir(inputdirectory):                              # Keep asking until the path points to an existing folder
                print("Error: That is not a valid Folder, try again...")
                inputdirectory = input("Please type in the full path of the folder containing your files:   ")

            inputfile_extensions = input("Please type in the file extension of your files: ")   # File extension, e.g. gz or fastq.gz
            files = os.path.join(inputdirectory, "*." + inputfile_extensions)     # Build the glob pattern <folder>/*.<extension>
            print(files)
            my_file = glob.glob(files)                                            # List the files that match the pattern

            # The sample ID is the part of the file name before the first underscore
            clean_filenames = [os.path.basename(doc_name).split("_")[0] for doc_name in my_file]
            Sample_file = pd.DataFrame(clean_filenames, columns=["AMD_ID"])       # One-column DataFrame named AMD_ID
            Sample_file = Sample_file.drop_duplicates()                           # Drop duplicate IDs
            print(Sample_file)
            break

        else:
            print("Invalid file type.")

    except ValueError:                                                            # e.g. pandas could not parse the .csv file
        print("provide a correct file type...")
        continue




specify the file type: 



csv or raw_fastq raw_fastq
Enter the full path of the folder containing your files:  /Users/dhruvipatel/Desktop/ET_training/ET_raw
Please type in the file extension of your files:  gz


/Users/dhruvipatel/Desktop/ET_training/ET_raw/*.gz
                  AMD_ID
0   22ETAm00x018p10B3201
1   22ETAm00Z0096PfB3201
2   22ETAm00D0083PfB3201
3   22ETAm00D0067PfB3201
4   22ETAm21D0067PfB3201
5   22ETAm00x014p10B3201
6   22ETAm00x007p10B3201
7   21USxxxxxx3D7PfD3201
8   22ETAm00Z0047PfB3201
9   22ETAm28Z0063PfB3201
10  22ETAm00D0078PfB3201
11  22ETAm00x015p10B3201
12  22ETAm00x001p02B3201
13  22ETAm42D0078PfB3201
14  20USxxxxxxHB3PfD3201
15  22ETAm00x020p04B3201
16  20USxxxxxxDD2PfD3201
17  22ETAm42Z0045PfB3201
18  22ETAm00x019p06B3201
19  22ETAm00Z0080PfB3201
20  21USxxxxxx7G8PfD3201
21  22ETAm00x016p10B3201
22  22ETAm00x017p10B3201
23  22ETAm00x013p10B3201
24  22ETAm00Z0048PfB3201
25  22ETAm42D0091PfB3201
26  22ETAm00Z0073PfB3201
27  22ETAm00x005p10B3201
28  22ETAm00x010p10B3201
29  22ETAm00D0044PfB3201
30  22ETAm00x002p07B3201
31  22ETAm00x004p10B3201
32  22ETAm00Z0063PfB3201
33  22ETAm00x003p10B3201
34  22ETAm28D0080PfB3201
35  22ETAm00x008p10B3201
36  22ETAm35Z0073PfB3201
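
The fastq branch of the cell above reduces each file name to the text before the first underscore and then drops duplicates. A minimal sketch of that step as a reusable function, using only the standard library. The function name `ids_from_filenames` and the example paths are hypothetical; the paths are made up to resemble paired fastq files and do not come from the run shown above.

```python
import os


def ids_from_filenames(paths):
    """Return the unique sample IDs found in a list of fastq paths, in first-seen order."""
    ids = [os.path.basename(path).split("_")[0] for path in paths]
    return list(dict.fromkeys(ids))   # dict keys keep insertion order and drop duplicates


# Hypothetical file names: two files for one sample, one file for another.
example_paths = [
    "/data/run1/22ETAm00Z0096PfB3201_R1.fastq.gz",
    "/data/run1/22ETAm00Z0096PfB3201_R2.fastq.gz",
    "/data/run1/22ETAm00x018p10B3201_R1.fastq.gz",
]
print(ids_from_filenames(example_paths))
```

This only works if the sample ID itself never contains an underscore, which holds for the AMD ID key described in the background section.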

In [4]:
## Create empty lists for the AMD_IDs

Sample_no_match = []        # IDs that fail any check are saved in this list
Sample_with_match = []      # IDs with a length of 20 are saved in this list

## The first part checks whether each sample ID has a length of 20

# Rename the ID column to Sample_ID, since different files might use different column names
Sample_name = Sample_file.rename(columns={'Sample': "Sample_ID", 'AMD_ID': "Sample_ID", 'AMD ID (Pooled)': "Sample_ID", 'Document Name': "Sample_ID"})

SampleID_df = Sample_name[['Sample_ID']]                       # Create a DataFrame from the Sample_ID column

# Remove US controls so that they are not reported as errors

SampleID_df = SampleID_df.dropna(subset=['Sample_ID'])         # Skip empty cells
SampleID_df = SampleID_df[~SampleID_df['Sample_ID'].str.contains("USxxxx")]


for rows in SampleID_df.index:                                 # Loop over each row

    sample_name = SampleID_df['Sample_ID'][rows].split('\n')   # Split the cell on newlines in case it holds several IDs
    for each_ID in sample_name:
        if len(each_ID) == 20:                                 # If the length is 20, save the ID in Sample_with_match
            Sample_with_match.append(each_ID)
        else:
            Sample_no_match.append(each_ID)                    # Otherwise save the ID in Sample_no_match
            print(each_ID, "has length", len(each_ID))         # Print the ID with its length



In [5]:
## The second part checks whether each ID of length 20 matches the AMD ID regular expression, which follows the key described at the beginning.

# One group per field of the AMD ID key; defined once, outside the loop
AMD_ID = ('([0-9]{2})([A-Zx]{2})([A-Za-z]{2})([0-9x]{2})([A-Zx]{1})([0-9]{3})(([0-9]{1})|([p]{1}))(([0-9]{2})|([Pf]{2}))([A-Zx]{1})([0-9x]{3})([0-9]{1})')

for each_file in Sample_with_match:                                      # Loop over each ID in the Sample_with_match list

    AMD_group = re.match(AMD_ID, each_file)                              # Match the ID against the pattern

    if AMD_group is None:                                                # If the ID does not match
        Sample_no_match.append(each_file)                                # Append the ID to the no-match list
        print(each_file, "does not match the AMD ID pattern")
#print(Sample_with_match)



In [6]:
## Lastly, print all the IDs without a match so that the user can review and correct them before further processing.
print(len(Sample_no_match), "out of", len(SampleID_df), "samples did not match with AMD_ID")        # Print the number of samples that did not match


if len(Sample_no_match) == 0:
    print("you are good to proceed with analysis: All the samples pass through QC test")
else:
    print("\nHere is the list of samples that did not match")
    ID_No_match = "\n".join(str(ID) for ID in Sample_no_match)                                      # One ID per line
    print(ID_No_match)


0 out of 41 samples did not match with AMD_ID
you are good to proceed with analysis: All the samples pass through QC test
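
The length check and the pattern check from the earlier cells can also be combined into one function that reports why an ID failed. This is a sketch rather than the notebook's own implementation: the pattern is copied from the notebook, while the function name `check_amd_id` and the returned labels are hypothetical. It uses `re.fullmatch`, which requires the whole string to match, whereas `re.match` only anchors at the start of the string.

```python
import re

# Pattern copied from the notebook, one group per field of the AMD ID key.
AMD_ID_PATTERN = re.compile(
    r"([0-9]{2})([A-Zx]{2})([A-Za-z]{2})([0-9x]{2})([A-Zx]{1})([0-9]{3})"
    r"(([0-9]{1})|([p]{1}))(([0-9]{2})|([Pf]{2}))([A-Zx]{1})([0-9x]{3})([0-9]{1})"
)


def check_amd_id(sample_id):
    """Return 'ok', 'wrong length' or 'pattern mismatch' for one sample ID."""
    if len(sample_id) != 20:
        return "wrong length"
    if AMD_ID_PATTERN.fullmatch(sample_id) is None:
        return "pattern mismatch"
    return "ok"


examples = [
    "17GNDo00F0001PfF1290",   # individual sample from the background section
    "17GNDo00X001p10F1290",   # pooled sample from the background section
    "17GNDo00F0001PfF129",    # one character short
    "17gnDo00F0001PfF1290",   # lower-case country code
]
for sample_id in examples:
    print(sample_id, "->", check_amd_id(sample_id))
```

Note that `[Pf]{2}` is a character class, so it also accepts combinations such as `PP` or `ff`; the sketch keeps it unchanged to stay consistent with the notebook.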



In [7]:
# This part of the code runs through the Sample_no_match list and creates a table split by the key. The user can then see in the table where an ID deviates from the key.

data_regex_QC = []

#Sample_no_match = ["17GNDo00F0001PfF1291", "17GNDo00F0001PfF129","17GNDo00F0001PfF12911",'17GNDo00F0001PfF1','17GNDo0F0001PfF1291', "NF54","NTC-DFR", "NTC-DHFR" ]

# Loop through the no-match list, split each ID by the key using a lenient regex, and create a dictionary per ID.

for sample_id in Sample_no_match:
    Dict_QC_re = {"name": sample_id, "length_of_sample_ID": len(sample_id)}   # Start with the sample name and its length
    if len(sample_id) >= 15:
        match = re.match(r"(?P<year>\w{2})(?P<country>\w{2})(?P<Site>\w{2})(?P<Treatment_Day>\w{2})(?P<Treatment>\w{1})(?P<ID>\w{4})(?P<Genus_Pooled>\w{,2})(?P<Type>\w{,1})(?P<GenemarkerCode>\w{,3})(?P<Repeat>\w{0,})", sample_id)
        if match is not None:                                                  # IDs with non-word characters (e.g. "-") cannot be split
            Dict_QC_re.update(match.groupdict())                               # Add one key-value pair per field of the key
    data_regex_QC.append(Dict_QC_re)                                           # Shorter IDs are listed with name and length only


if len(data_regex_QC) != 0:                                            # If the list is not empty
    # headers="keys" takes the column names from the dictionary keys and leaves missing fields blank
    print(tabulate.tabulate(data_regex_QC, headers="keys", tablefmt="grid"))

else:
    print('\n',"All the samples are matching with AMD_ID","\n", "No errors found in samples")        # If all the IDs match the AMD ID format, no table is created.
     


 All the samples are matching with AMD_ID 
 No errors found in samples
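
The lenient pattern in the last cell assigns characters to fields strictly from left to right, so a single missing character shifts every later field. A short sketch of that effect, using the notebook's pattern with `{,n}` written as `{0,n}` and two IDs from the commented-out test list, the second of which is one character shorter than the first. The helper name `lenient_split` is hypothetical.

```python
import re

# Same fields as the notebook's lenient pattern; trailing fields may be short or empty.
LENIENT = re.compile(
    r"(?P<year>\w{2})(?P<country>\w{2})(?P<Site>\w{2})(?P<Treatment_Day>\w{2})"
    r"(?P<Treatment>\w{1})(?P<ID>\w{4})(?P<Genus_Pooled>\w{0,2})(?P<Type>\w{0,1})"
    r"(?P<GenemarkerCode>\w{0,3})(?P<Repeat>\w*)"
)


def lenient_split(sample_id):
    """Return the fields as a dict, or None if the ID cannot be split at all."""
    match = LENIENT.match(sample_id)
    return match.groupdict() if match else None


full = lenient_split("17GNDo00F0001PfF1291")
short = lenient_split("17GNDo0F0001PfF1291")

# Print the two splits side by side to see where the fields start to drift.
for field in full:
    print(f"{field:15} {full[field]:6} {short[field]}")
```

In the side-by-side listing, the first field that looks wrong (here `Treatment_Day`) marks where the ID departs from the key; everything after it is shifted and should not be read as a separate error.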
