

Bag2D is a series of Python programs and bash scripts designed to filter a multiple
FASTA format sequence file from de novo assembly by removing the redundant contigs,
and to carry out functional assignment for the unknown contigs. In outline:

Blast_Chlre4-1 - delete redundant Dt transcripts from the Dt_all_seq.fa database
       - Dt_non-redundant_seq_Chlre4 - Blast_Chlre4-2_Chlre4 - transcriptID, geneID
                                     - Blast_Chlre4-2_169 - protein_names-1 - delete redundant Dt transcripts from Dt_non-redundant_seq_Chlre4
       - Dt_non-redundant_seq_Chlre4_169 - Blast_Chlre4-2_169-2 - protein_names-2
       - merge Dt_annotation, namely transcriptID, geneID and protein_name, to produce a complete annotation file

The final output files are Dt_non-redundant_seq_Chlre4.fa (non-redundant Dt contigs) and Dt_non-redundant_all_annotation.txt (annotation information).


Bag2D consists of a total of 10 programs as listed below:

     Program 1: Bash shell script that uses BlastX to search a protein database with a
     translated nucleotide query

     Program 2: Python program to extract the transcriptID from the BlastX result

     Program 3: Python program to select, for each protein name of the reference species,
     the top-hit Dt contig name according to the E-value (a lower E-value means a better hit);
     the names of the non-top-hit Dt contigs are written to deleted_dtnames.txt

     Program 4: Python program to remove the redundant contigs from the multiple FASTA
     format sequence file

     Program 5: Python program to count the sequence lengths in a multiple FASTA format
     sequence file (one header line per sequence)

     Program 6: Python program to extract the geneID from the BlastX result

     Program 7: Python program to extract the protein names from the BlastX result

     Program 8: Python program to merge the annotation information (transcriptID, geneID, protein names)

     Program 9: Python program to extract the pacID from the BlastX result

     Program 10: Python program to merge the annotation information from the comparison reference species

(The reference protein files Chlre4_best_proteins.fasta, Creinhardtii_169_peptide.fa and Creinhardtii_281_v5.5.protein.fa
 have different formats and are used to generate the transcriptID/geneID, protein names and pacID respectively.)
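
The top-hit selection described above (Program 3 in the workflow) can be sketched as follows. This is only an illustration, not the actual Bag2D script: it assumes the BlastX result has already been reduced to (Dt contig name, reference protein name, E-value) triples, and the function name `split_top_hits` and the example names are hypothetical.

```python
def split_top_hits(hits):
    """Keep, for each reference protein, the Dt contig with the lowest E-value.

    `hits` is an iterable of (dt_name, protein_name, evalue) tuples
    (an assumed input layout). Returns (kept, deleted) as sorted lists of
    Dt contig names; `deleted` plays the role of deleted_dtnames.txt.
    """
    best = {}          # protein_name -> (evalue, dt_name) of the best hit so far
    all_names = set()  # every Dt contig that appears in the hits
    for dt_name, protein_name, evalue in hits:
        all_names.add(dt_name)
        current = best.get(protein_name)
        if current is None or evalue < current[0]:
            best[protein_name] = (evalue, dt_name)
    # A contig is kept if it is the top hit for at least one protein.
    kept = {dt_name for _, dt_name in best.values()}
    deleted = all_names - kept
    return sorted(kept), sorted(deleted)


# Made-up example data.
hits = [
    ("contig_A", "protein_1", 1e-50),
    ("contig_B", "protein_1", 1e-20),
    ("contig_C", "protein_2", 1e-10),
]
kept, deleted = split_top_hits(hits)
print(kept)
print(deleted)
```

In this sketch a contig that is the best hit for one protein is kept even if it is a weaker hit for another protein; the real program may apply a different rule.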



     Dunaliella tertiolecta FASTA sequences from de novo assembly (assembler: Velvet and Oases)

Creinhardtii_169_peptide.fa, or Chlre4_best_proteins.fasta, Creinhardtii_281_v5.5.protein.fa

     Protein database of the reference species Chlamydomonas reinhardtii (the file was
     formatted with the formatdb command below prior to use:

        formatdb -i Creinhardtii_169_peptide.fa -p T

     and the same command was applied to all the other protein database files)


     Annotation information for the reference protein database above
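
The database formatting and BlastX search can be expressed as command lines for the legacy NCBI BLAST tools. The sketch below only builds the argument lists and does not run anything. It assumes the search is done with `blastall -p blastx`, which ships alongside `formatdb`; the exact invocation used by the Bag2D shell script is not shown in this README, and the function names are hypothetical.

```python
def formatdb_command(protein_fasta):
    """Argument list for formatting a protein FASTA file (-p T means protein)."""
    return ["formatdb", "-i", protein_fasta, "-p", "T"]


def blastx_command(evalue_exponent, protein_db, query_fasta, out_file):
    """Argument list for a legacy blastall BlastX search (assumed invocation).

    `evalue_exponent` follows the convention used in this README:
    6 means an E-value cutoff of 1e-6.
    """
    return [
        "blastall",
        "-p", "blastx",                    # translated nucleotide query vs protein db
        "-d", protein_db,                  # database prepared with formatdb
        "-i", query_fasta,                 # nucleotide query sequences
        "-e", "1e-%d" % evalue_exponent,   # E-value cutoff
        "-o", out_file,                    # output file
    ]


db_cmd = formatdb_command("Creinhardtii_169_peptide.fa")
search_cmd = blastx_command(6, "Chlre4_best_proteins.fasta",
                            "Dt_all_seq.fa", "blast_out_Chlre4-1.txt")
print(" ".join(db_cmd))
print(" ".join(search_cmd))
```

With the legacy BLAST tools installed, either list could be passed to `subprocess.run(cmd, check=True)`.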


The above programs and data files are used in the context of a workflow.

Total RNA from Dunaliella tertiolecta was extracted from a pure culture
and cDNA was produced using standard molecular biology techniques.
The samples were subjected to NGS sequencing in-house using an Illumina MiSeq
system, and the raw datasets were submitted to Partek Flow Genomics for de novo
sequence assembly, resulting in 56926 contigs ranging from 100 nt to 17153 nt.
The Velvet-Oases program was used in this assembly with default parameters
without merging. It generated the dataset found in Dt_all_seq.fa.
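
The contig count and length range quoted above can be recomputed from a FASTA file with a short length counter. The sketch below is a hypothetical stand-in for this kind of check, not the Bag2D length-counting program itself. It assumes standard FASTA input and also tolerates sequences wrapped over several lines.

```python
import io


def fasta_lengths(handle):
    """Return {header: sequence length} for a FASTA file-like object."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]          # header text without the leading '>'
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


# Made-up two-record example; a real run would open Dt_all_seq.fa instead.
example = io.StringIO(">contig_a\nACGTAC\nGT\n>contig_b\nACG\n")
lengths = fasta_lengths(example)
print(len(lengths), min(lengths.values()), max(lengths.values()))
```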

As an initial step to remove redundant sequences, any contig that is an exact,
full-length reverse complement of another contig was removed.
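
A minimal sketch of this reverse-complement check is given below. It assumes that the first contig encountered in each reverse-complement pair is the one kept, which the README does not specify, and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def drop_reverse_complements(records):
    """Drop contigs that are exact full-length reverse complements of an earlier contig.

    `records` is a list of (name, sequence) pairs. Identical forward copies
    are deliberately left alone; only reverse complements are removed.
    """
    seen = set()
    kept = []
    for name, seq in records:
        seq_upper = seq.upper()
        if reverse_complement(seq_upper) in seen:
            continue  # an earlier contig is the reverse complement of this one
        seen.add(seq_upper)
        kept.append((name, seq))
    return kept


# Made-up example: c2 is the reverse complement of c1.
records = [("c1", "AAGC"), ("c2", "GCTT"), ("c3", "ACGG")]
kept_records = drop_reverse_complements(records)
print([name for name, _ in kept_records])
```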

To prevent any bias, we subjected the raw dataset to the following workflow.
Firstly, redundant sequences were removed based on sequence similarity to
proteomic datasets from JGI for the following species: C. reinhardtii
(being the species with the most comprehensive annotation available),
Coccomyxa subellipsoidea C-169 (Csu), Ostreococcus lucimarinus (Olu),
Volvox carteri (Vcar) and Chlorella NC64A.
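
Once the redundant contig names are known, removing them comes down to dropping the named records from the FASTA file. The sketch below illustrates that filtering step under the assumption that names are matched against the full header line (without the leading `>`). The function name `filter_fasta` is hypothetical and this is not the Bag2D filter program itself.

```python
import io


def filter_fasta(handle, deleted_names):
    """Yield the lines of every FASTA record whose header is not in `deleted_names`."""
    keep = True
    for line in handle:
        if line.startswith(">"):
            # Decide once per record, based on its header line.
            keep = line[1:].strip() not in deleted_names
        if keep:
            yield line


# Made-up example; real inputs would be Dt_all_seq.fa and deleted_dtnames.txt.
fasta = io.StringIO(">c1\nACGT\n>c2\nGGCC\n>c3\nTTAA\n")
filtered = "".join(filter_fasta(fasta, {"c2"}))
print(filtered, end="")
```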

Secondly, the Dt sequences that obtained annotations in this way
were matched with GO and KEGG terms to identify genes related to specific pathways.

Thirdly, the RNA transcript levels of the Dt wild type and the D9 high-lipid
producer were analysed in a separate workflow.

(Note: the sequence of the vector, a bleomycin-resistance marker gene, used in the mutagenesis work
leading to the isolation of D9 D. tertiolecta was searched with BLAST against the Dt_all_seq.fa
dataset, but there were no hits apart from a weak match of a 5' 30 nt stretch in the contig
Locus_5902_Transcript_1/1_Confidence_1.000_Length_1516, which matches
Cre62.g792700.t1.1 with an E-value of 1e-92, a gene annotated as
"Transducin family protein / WD-40 repeat family protein".)


This dataset was analysed using the Bag2D workflow software developed in-house
in the manner outlined below:
Step 1: Program 1: 6 Chlre4_best_proteins.fasta Dt_all_seq.fa (6 means the E-value cutoff was set at 10^-6)
Step 2: blast_out_Chlre4-1.txt --> Program 2: blast_out_Chlre4-1.txt
Step 3: transcriptID.txt --> Program 3: python transcriptID.txt
Step 4: deleted_dtnames.txt --> Program 4: python  Dt_all_seq.fa deleted_dtnames.txt
 Dt_non-redundant_seq_Chlre4.fa --> Step 5: Program 5: python Dt_non-redundant_seq_Chlre4.fa --> geneLength.txt
  |
  |   Step 6a: Program 1: 6 Creinhardtii_169_peptide.fa Dt_non-redundant_seq_Chlre4.fa
  |                     |
  |                     v
  |   blast_out_Chlre4-2_169.txt --> Program 7: python blast_out_Chlre4-2_169.txt Creinhardtii_169_annotation_info.txt
  |                     |
  |                     v
  |   protein_name-1.txt --> Program 3: python protein_name-1.txt
  |                     |
  |                     v
  |   deleted_dtnames2.txt --> Program 4: python  Dt_non-redundant_seq_Chlre4.fa deleted_dtnames2.txt
  |                     |
  |                     v
  |   Dt_non-redundant_seq_Chlre4_169.fa
  |                     |
  |                     v
  |   Program 1: 6 Creinhardtii_169_peptide.fa Dt_non-redundant_seq_Chlre4_169.fa
  |                     |
  |                     v
  |   blast_out_Chlre4-2_169-2.txt --> Program 7: python blast_out_Chlre4-2_169-2.txt Creinhardtii_169_annotation_info.txt
  |                     |
  |                     v
  |   protein_name-2.txt --> Program 8
  |
  |   Step 6b: Program 1: 6 Chlre4_best_proteins.fasta Dt_non-redundant_seq_Chlre4.fa
  |                     |
  |                     v
  |   blast_out_Chlre4-2_Chlre4.txt --> Program 6: python blast_out_Chlre4-2_Chlre4.txt --> geneID.txt --> Program 8
  |                     |
  |                     v
  |   Program 2: python blast_out_Chlre4-2_Chlre4.txt --> transcriptID.txt --> Program 8
  |
  |   Program 8: protein_name-2.txt, geneID.txt, transcriptID.txt --> Dt_annotation file
  |
                          feed into Partek Genomics Suite software (GO, KEGG)
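
The final merge of transcriptID, geneID and protein name into one annotation file can be sketched as a join on the Dt contig name. The sketch assumes each input has been read into a mapping from Dt contig name to a value; the actual file layouts are not documented here. The function name, the `NA` placeholder and the example values are hypothetical.

```python
def merge_annotations(transcript_ids, gene_ids, protein_names, missing="NA"):
    """Join three {dt_name: value} mappings into one row per Dt contig."""
    names = sorted(set(transcript_ids) | set(gene_ids) | set(protein_names))
    rows = []
    for name in names:
        rows.append((
            name,
            transcript_ids.get(name, missing),
            gene_ids.get(name, missing),
            protein_names.get(name, missing),
        ))
    return rows


# Made-up example values.
rows = merge_annotations(
    {"c1": "t100", "c2": "t200"},
    {"c1": "g10"},
    {"c2": "example protein"},
)
for row in rows:
    print("\t".join(row))
```

Using a placeholder for missing values keeps every row at four columns, which makes the merged file easier to load into downstream tools.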

For a comparison study with more reference species:
Step 7a: Program 1: 6 Creinhardtii_281_v5.5.protein.fa Dt_non-redundant_seq.fa
(note: this Dt_non-redundant_seq.fa was generated from Steps 1-4 using Creinhardtii_281_v5.5.protein.fa as the protein reference)
                   |
                   v
     Cre_blast_output_v5.txt --> Step 8a: Program 9: python Cre_blast_output_v5.txt --> Cre_pacID.txt
Step 7b: Program 1: 6 CsubellipsoideaC169_227_v2.0.protein.fa Dt_non-redundant_seq.fa
                   |
                   v
     Csu_blast_output_v5.txt --> Step 8b: Program 9: python Csu_blast_output_v5.txt --> Csu_pacID.txt
Step 7c: Program 1: 6 Olucimarinus_231_v2.0.protein.fa Dt_non-redundant_seq.fa
                   |
                   v
     Olu_blast_output_v5.txt --> Step 8c: Program 9: python Olu_blast_output_v5.txt --> Olu_pacID.txt
Step 7d: Program 1: 6 Vcarteri_199_v2.0.protein.fa Dt_non-redundant_seq.fa
                   |
                   v
     Vcar_blast_output_v5.txt --> Step 8d: Program 9: python Vcar_blast_output_v5.txt --> Vcar_pacID.txt
Step 7e: Program 1: 6 Chlorella_NC64A.best_proteins.fasta Dt_non-redundant_seq.fa
                   |
                   v
     Chlorella_blast_output_v5.txt --> Step 8e: Program 9: python Chlorella_blast_output_v5.txt --> Chlorella_pacID.txt
Step 9: Program 10: python Dt_all_seq.fa Cre_pacID.txt Csu_pacID.txt Olu_pacID.txt Vcar_pacID.txt Chlorella_pacID.txt
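
The last step combines the per-species pacID files into one comparison table. A minimal sketch of such a table builder follows. It assumes the Dt contig names come from the FASTA headers and that each pacID file has been read into a mapping from Dt contig name to pacID; the function name, the `-` placeholder and the pacID values are made up for illustration.

```python
def build_comparison_table(dt_names, pac_ids_by_species, missing="-"):
    """Return tab-separated lines: one header line, then one line per Dt contig.

    `pac_ids_by_species` maps a species label to a {dt_name: pacID} mapping;
    column order follows the insertion order of that dictionary.
    """
    species = list(pac_ids_by_species)
    lines = ["\t".join(["Dt_contig"] + species)]
    for name in dt_names:
        cells = [pac_ids_by_species[s].get(name, missing) for s in species]
        lines.append("\t".join([name] + cells))
    return lines


# Made-up example with two species and two contigs.
table = build_comparison_table(
    ["c1", "c2"],
    {"Cre": {"c1": "101"}, "Csu": {"c1": "202", "c2": "303"}},
)
print("\n".join(table))
```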