### Download the test data from the MaRS BioProject (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA428490)

You can download five samples from the MaRS BioProject (SRR18918405, SRR18918415, SRR18918416, SRR18918242, SRR18918253) using the SRA Toolkit, rename them to AMD IDs using the code provided below, and then run the workflow.

#### Download the SRA Toolkit from https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software

#### If using Conda to install the SRA Toolkit, run

`    
conda install -c bioconda sra-tools
`  
    
#### After installing the toolkit, run the commands below to download the FASTQ files.

* `prefetch` downloads the SRA file for each SRR accession and saves it in the current directory
  

`
prefetch SRR18918405 SRR18918415 SRR18918416 SRR18918242 SRR18918253
`

* Convert to FASTQ: `fastq-dump` converts the SRA files to FASTQ; use the `--split-files` option for paired-end data

`
fastq-dump SRR18918405 SRR18918415 SRR18918416 SRR18918242 SRR18918253 --split-files
`

`mkdir SRA_Testdata
`

`
mv *.fastq SRA_Testdata
`   
    
All paired-end FASTQ reads are now saved in the `SRA_Testdata` folder.
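
As a quick sanity check after the download, the sketch below lists any paired-end file that is missing from the folder. It assumes that `fastq-dump --split-files` named its output `<accession>_1.fastq` and `<accession>_2.fastq`; the helper name `missing_fastq_files` is hypothetical and not part of the MaRS workflow.

```python
import os

# The five test accessions listed at the top of this section
ACCESSIONS = ["SRR18918405", "SRR18918415", "SRR18918416", "SRR18918242", "SRR18918253"]


def missing_fastq_files(folder, accessions=ACCESSIONS):
    """Return the expected paired-end FASTQ file names that are absent from folder.

    Assumption: files are named <accession>_1.fastq and <accession>_2.fastq.
    """
    present = set(os.listdir(folder)) if os.path.isdir(folder) else set()
    expected = [f"{acc}_{read}.fastq" for acc in accessions for read in (1, 2)]
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    missing = missing_fastq_files("SRA_Testdata")
    print("All paired files found" if not missing else f"Missing: {missing}")
```

An empty list means every accession has both read files in place.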


   

## Rename sample ID with AMD ID

## Background: 

   The CDC Malaria Next Generation Sequencing lab generates a sample ID called the AMD ID. It is 20 characters long and has specific metadata attached to it, which makes it easier to standardize sample names and track samples. It also makes it convenient to submit samples to NCBI, which requires associated metadata (attributes) to be submitted with each FASTQ record.
  
   **AMD ID** is a standardized sample identification number that includes an ID number and associated metadata related to the sample. 
 
   - AMD ID and bit code key is found under MS Teams > SOPs - Lab > Files > Sample Naming > Sample_naming_key.pptx <pre>
   
   
   - Key: **Year Country State/Site DayofTreatment Treatment SampleID Genus SampleType GeneMarker-8bitcode SampleSequencingCount**

   
        - Example:
    
            - Individual sequenced sample ID: 17GNDo00F0001PfF1290 = 2017 Guinea Dorota Day0 AS+AQ 0001 P.falciparum FilterBloodSpot k13-crt-mdr-dhfr-dhps-cytB-cpmp-pfs47 0 
                          
            - Pooled sequenced sample ID: 17GNDo00X001p10F1290 = 2017 Guinea Dorota Day0 **X** 001**Pooled SamplesInPool** FilterBloodSpot k13-crt-mdr-dhfr-dhps-cytB-cpmp-pfs47 0  
           
              - NOTE: If information is not available (na), **x** is used. For pooled samples, Treatment is x because it is a pool of multiple samples that each carry that information.
          
              - NOTE: For pooled samples, **Sample ID** is replaced with **three-digit sample ID + p (Pooled)** and **Genus** is replaced with the **total number of SamplesInPool**, to indicate that this is a pooled sequenced sample and to give the sample count in each pool. 
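
The key above can be applied by slicing an AMD ID at fixed positions. The following is a minimal sketch: the field widths are inferred from the two example IDs (they add up to 20 characters) and are an assumption, not an official specification. The names `AMD_FIELDS` and `parse_amd_id` are hypothetical, and the 8-bit gene marker code is returned as-is without decoding.

```python
# Field widths inferred from the two example IDs above (assumption, sums to 20)
AMD_FIELDS = [
    ("year", 2),
    ("country", 2),
    ("site", 2),
    ("day_of_treatment", 2),
    ("treatment", 1),
    ("sample_id", 4),
    ("genus", 2),
    ("sample_type", 1),
    ("gene_marker_code", 3),
    ("sequencing_count", 1),
]


def parse_amd_id(amd_id):
    """Split a 20-character AMD ID into named fields."""
    if len(amd_id) != 20:
        raise ValueError(f"AMD ID must be 20 characters, got {len(amd_id)}")
    fields = {}
    pos = 0
    for name, width in AMD_FIELDS:
        fields[name] = amd_id[pos:pos + width]
        pos += width
    # Pooled samples use a three-digit sample ID followed by "p"
    fields["pooled"] = fields["sample_id"].endswith("p")
    if fields["pooled"]:
        # For pooled samples the genus slot holds the number of samples in the pool
        fields["samples_in_pool"] = fields.pop("genus")
    return fields


print(parse_amd_id("17GNDo00F0001PfF1290"))
print(parse_amd_id("17GNDo00X001p10F1290"))
```

Keeping the widths in one list makes it easy to adjust the sketch if the official naming key differs from what the examples suggest.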
     
## Code information   
  **Samples_ids.csv** contains SRR IDs and AMD IDs. This code runs on the FASTQ files downloaded from the MaRS BioProject on SRA and renames them with the AMD IDs provided in the Samples_ids.csv file. It uses Python's os module to rename each FASTQ file with its new name. 

### Required packages
- [Pandas](https://pandas.pydata.org/) 
-  os 

### Inputs
- A folder containing the FASTQ files from SRA, and a .csv file with SRR names and AMD IDs.

### Outputs:
- Files renamed with standardized AMD IDs.



In [None]:
import os
import pandas as pd

# Folder that contains the FASTQ files downloaded from SRA (change to your own path)
My_files = "/Users/dhruvipatel/Desktop/Archive/Training_senegal/Test/SRA_Testdata"

# Read the CSV: column 1 holds the SRR ID, column 2 the AMD ID (the header row is skipped)
ids = pd.read_csv("Samples_ids.csv", dtype=str)
srr_to_amd = dict(zip(ids.iloc[:, 0].str.strip(), ids.iloc[:, 1].str.strip()))

for file in os.listdir(My_files):
    # Split e.g. "SRR18918405_1.fastq" into the SRR ID and the remainder ("1.fastq")
    SRR_ID, sep, suffix = file.partition("_")
    if not sep or SRR_ID not in srr_to_amd:     # exact ID match only; skip unrelated files
        continue

    newname = srr_to_amd[SRR_ID] + "_" + suffix
    old_path = os.path.join(My_files, file)       # old path of the sample
    new_path = os.path.join(My_files, newname)    # path with the new AMD ID
    print("OLD ", old_path)
    print("NEW ", new_path)

    os.rename(old_path, new_path)
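
Because `os.rename` changes files in place, it can help to preview the mapping first. The sketch below is a dry run that assumes, like the cell above, that each file name starts with the SRR accession followed by an underscore. `plan_renames` is a hypothetical helper, and the AMD ID in the example mapping is a placeholder, not a real ID from Samples_ids.csv.

```python
def plan_renames(filenames, srr_to_amd):
    """Return (old_name, new_name) pairs without touching the file system."""
    plan = []
    for name in sorted(filenames):
        srr_id, sep, suffix = name.partition("_")
        # Exact match on the accession; names without an underscore are skipped
        if sep and srr_id in srr_to_amd:
            plan.append((name, f"{srr_to_amd[srr_id]}_{suffix}"))
    return plan


# Placeholder mapping for illustration only; real pairs come from Samples_ids.csv
example_map = {"SRR18918405": "EXAMPLEAMDID"}
example_files = ["SRR18918405_1.fastq", "SRR18918405_2.fastq", "notes.txt"]
for old, new in plan_renames(example_files, example_map):
    print(old, "->", new)
```

Once the printed pairs look right, the same mapping can be passed to the renaming loop.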
            
            
            
            