## Quality check of sample ID 

## Background: 

   The CDC Malaria Next Generation Sequencing lab assigns each sample an ID called the AMD ID. It is 20 characters long and encodes specific metadata, which makes it easier to standardize sample names and track samples. It also simplifies submission to NCBI, which requires the associated metadata (attributes) to be submitted with each fastq record.
  
   **AMD ID** is a standardized sample identification number that includes an ID number and the metadata associated with the sample.
 
   - The AMD ID and bit code key are found under MS Teams > SOPs - Lab > Files > Sample Naming > Sample_naming_key.pptx
   
   
   - Key: **Year Country State/Site DayofTreatment Treatment SampleID Genus SampleType GeneMarker-8bitcode SampleSequencingCount**

   
        - Example:
    
            - Individual sequenced sample ID: 17GNDo00F0001PfF1290 = 2017 Guinea Dorota Day0 AS+AQ 0001 P.falciparum FilterBloodSpot k13-crt-mdr-dhfr-dhps-cytB-cpmp-pfs47 0 
                          
            - Pooled sequenced sample ID: 17GNDo00X001p10F1290 = 2017 Guinea Dorota Day0 **X** 001**Pooled SamplesInPool** FilterBloodSpot k13-crt-mdr-dhfr-dhps-cytB-cpmp-pfs47 0  
           
              - NOTE: If information is not available (na), **x** is used. For pooled samples, Treatment is x because a pool combines multiple samples, each with its own treatment information.
          
              - NOTE: For pooled samples, **Sample ID** is replaced with a **three-digit sample ID + p (Pooled)** and **Genus** is replaced with the **total number of SamplesInPool**, to indicate that this is a pooled sequenced sample and how many samples the pool contains.
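
The key above can be read as a fixed-width layout. The following is a minimal sketch of slicing an AMD ID by position. The field names and the `split_amd_id` helper are hypothetical, and the widths are inferred from the two example IDs above rather than taken from the naming key itself.

```python
# Field names are illustrative; widths are inferred from the example IDs
# (2+2+2+2+1+4+2+1+3+1 = 20 characters), not copied from the naming key.
AMD_FIELDS = [
    ("year", 2),
    ("country", 2),
    ("site", 2),
    ("day_of_treatment", 2),
    ("treatment", 1),
    ("sample_id", 4),          # pooled IDs use three digits followed by "p"
    ("genus", 2),              # pooled IDs store the number of samples in the pool
    ("sample_type", 1),
    ("gene_marker_code", 3),
    ("sequencing_count", 1),
]


def split_amd_id(amd_id):
    """Slice a 20-character AMD ID into its positional fields."""
    expected = sum(width for _, width in AMD_FIELDS)
    if len(amd_id) != expected:
        raise ValueError(f"expected {expected} characters, got {len(amd_id)}")
    fields = {}
    start = 0
    for name, width in AMD_FIELDS:
        fields[name] = amd_id[start:start + width]
        start += width
    return fields


print(split_amd_id("17GNDo00F0001PfF1290"))   # individual sample
print(split_amd_id("17GNDo00X001p10F1290"))   # pooled sample
```

This sketch only splits by position. It does not check that each field contains allowed characters.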
     
## Code information   
  This code runs on a .csv file of sample IDs and checks whether each sample ID matches the AMD ID format.

### Required packages
- [Pandas](https://pandas.pydata.org/) 

### Inputs
- AMD IDs in .csv format
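
As a rough illustration of the expected input, the sketch below reads sample IDs from CSV text whose ID column uses one of the headers the notebook renames to `Sample_ID`. It uses the standard-library `csv` module instead of pandas so that it runs without extra packages. The `read_sample_ids` helper, the `Notes` column, and the in-memory example file are assumptions made for illustration.

```python
import csv
import io

# Headers the notebook maps to "Sample_ID", plus "Sample_ID" itself.
ACCEPTED_HEADERS = ("Sample_ID", "Sample", "AMD_ID", "AMD ID (Pooled)", "Document Name")


def read_sample_ids(handle):
    """Return the values of the first recognised sample ID column."""
    reader = csv.DictReader(handle)
    fieldnames = reader.fieldnames or []
    header = next((h for h in ACCEPTED_HEADERS if h in fieldnames), None)
    if header is None:
        raise ValueError(f"no sample ID column found; expected one of {ACCEPTED_HEADERS}")
    return [row[header] for row in reader]


# Hypothetical input file with one individual and one pooled sample.
example_csv = io.StringIO(
    "AMD_ID,Notes\n"
    "17GNDo00F0001PfF1290,individual\n"
    "17GNDo00X001p10F1290,pooled\n"
)
print(read_sample_ids(example_csv))
```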

### Outputs:
- A list of sample IDs that do not match the standardized AMD ID format.


In [43]:
## Import dependencies ## 

import pandas as pd
import os
import re

Sample_filepath = input("Please enter a valid file path to a csv: ")          # Ask the user for the path of the .csv file with sample IDs

while not os.path.isfile(Sample_filepath):
    print("Error: That is not a valid file, try again...")                    # Report an error if the file is not found
    Sample_filepath = input("Please enter a valid file path to a csv: ")      # Ask for the file path again

Sample_file = pd.read_csv(Sample_filepath)                                    # Read the csv file into a pandas DataFrame


Please enter a valid file path to a csv:  /Users/dhruviben/Desktop/PARMA-SOP/test_script.csv


In [44]:
## Create empty lists for the AMD IDs

Sample_no_match = []        # All IDs that fail a check are saved in this list
Sample_with_match = []      # All IDs with length 20 are saved in this list

## 1st part: check whether each sample ID has length 20

Sample_name = Sample_file.rename(columns={'Sample': "Sample_ID", 'AMD_ID': "Sample_ID", 'AMD ID (Pooled)': "Sample_ID", 'Document Name': "Sample_ID"})      # Rename the ID column to Sample_ID, as different files might use different column names

SampleID_df = Sample_name[['Sample_ID']]                       # Create a dataframe from the Sample_ID column

for rows in SampleID_df.index:                                 # Loop over each row

    sample_name = SampleID_df['Sample_ID'][rows].split('\n')   # Split the cell on newlines ('\n', not '/n')
    for each_ID in sample_name:
        if len(each_ID) == 20:                                 # If the length is 20, save the ID in Sample_with_match
            Sample_with_match.append(each_ID)
        else:
            Sample_no_match.append(each_ID)                    # Otherwise save the ID in Sample_no_match
            print(each_ID, "has length", len(each_ID))         # Print the sample ID with its length if it is not 20



20MDAn00X002p5F0671 has length 19
20MDAn00X00205F0671 has length 19
19ANBe00A0009PfFxxx has length 19
19ANBe00A0010PfFxxx has length 19
19 has length 2
12ab has length 4


In [45]:
## 2nd part: check whether each ID of length 20 matches the AMD ID regular expression, following the key described at the beginning.

# One capture group per piece of information in the AMD ID
AMD_ID = r'([0-9x]{2})([A-Zx]{2})([A-Za-z]{2})([0-9x]{2})([A-Yx]{1})([0-9]{3})(([0-9]{1})|([p]{1}))(([0-9]{2})|([Pf]{2}))([A-Zx]{1})([0-9x]{3})([0-9]{1})'

for each_file in Sample_with_match:                                      # Loop over each ID in the Sample_with_match list

    AMD_group = re.match(AMD_ID, each_file)                              # Match the ID against the pattern

    if AMD_group is None:                                                # If no match is found

        Sample_no_match.append(each_file)                                # Append the ID to the list
        print(each_file, "is not matching with ID")



191NBe00A0026PfFxxx0 is not matching with ID
19ANBe00A0031P1Fxxx0 is not matching with ID
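
For readers who want to see which field each part of the pattern covers, here is a sketch of the same character classes written with named groups and `re.fullmatch`. The group names and the `parse_amd_id` helper are hypothetical labels added for readability and are not part of the notebook. Note that `[Pf]{2}` is kept as in the original pattern, so it also accepts combinations such as `PP` or `ff`.

```python
import re

# Same character classes as the notebook's pattern; group names are illustrative.
AMD_PATTERN = re.compile(
    r"(?P<year>[0-9x]{2})"
    r"(?P<country>[A-Zx]{2})"
    r"(?P<site>[A-Za-z]{2})"
    r"(?P<day>[0-9x]{2})"
    r"(?P<treatment>[A-Yx])"
    r"(?P<sample_id>[0-9]{3}[0-9p])"     # four digits, or three digits + "p" for pools
    r"(?P<genus>[0-9]{2}|[Pf]{2})"       # two digits for pools, letters P/f otherwise
    r"(?P<sample_type>[A-Zx])"
    r"(?P<marker_code>[0-9x]{3})"
    r"(?P<seq_count>[0-9])"
)


def parse_amd_id(sample_id):
    """Return a dict of named fields, or None if the ID does not match."""
    match = AMD_PATTERN.fullmatch(sample_id)   # fullmatch also rejects trailing characters
    return match.groupdict() if match else None


for candidate in ("17GNDo00F0001PfF1290", "191NBe00A0026PfFxxx0", "19ANBe00A0031P1Fxxx0"):
    print(candidate, "->", parse_amd_id(candidate))
```

Because `fullmatch` anchors both ends of the string, this version does not rely on a separate length check to reject IDs that are too long.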


In [46]:
## Lastly, print all the IDs without a match so that the user can review and correct them before further processing.
print(len(Sample_no_match), "out of", len(SampleID_df), "samples did not match with AMD_ID")         # Print how many IDs failed, out of the number of rows read

print("\nHere is the list of samples that did not match")

ID_No_match = "\n".join(str(ID) for ID in Sample_no_match)         # Join the IDs that did not match, one per line
print(ID_No_match)


8 out of 20 samples did not match with AMD_ID

Here is the list of samples that did not match
20MDAn00X002p5F0671
20MDAn00X00205F0671
19ANBe00A0009PfFxxx
19ANBe00A0010PfFxxx
19
12ab
191NBe00A0026PfFxxx0
19ANBe00A0031P1Fxxx0
