<div style="text-align: center;">
  <img src="./project_logo.jpg" alt="DNA-PROT project logo" width="400" height="400">
</div>

# DNA-PROT

---


## 1.0 Introduction

The DNA-PROT package is a Python package for analysing DNA and protein sequences. It includes functions to transcribe and translate DNA into protein sequences and to analyse those proteins by calculating the hydrophobicity score, molecular weight, secondary structure configuration likelihoods (beta-sheet, alpha-helix, and beta-turn), the retention coefficient in High-Performance Liquid Chromatography (HPLC), and the polarity score.

The package is a versatile tool for professionals, educators, and students in biochemistry, molecular biology, and genetics. It is a useful resource for analyzing and understanding protein sequences, predicting structural configurations, and supporting educational activities. Whether for research, drug development, or teaching, it offers essential tools for anyone working with proteins and DNA.

---

## 2.0 Functionality and features

This Python package is designed to facilitate the analysis of DNA sequences, with a specific focus on identifying Shine-Dalgarno sequences, processing and translating DNA to RNA and protein sequences, and calculating various properties of the resulting proteins. 

The package extracts DNA from a text file in the usual genome format by detecting a Shine-Dalgarno sequence, then translates the DNA into the protein sequences it encodes in *all possible reading frames*. It then analyses each protein and reports five major characteristics: molecular weight, configuration likelihoods, hydrophobicity, polarity score, and the HPLC retention coefficient in TFA. From these results, the package builds a dataset of these properties for every amino acid sequence read from the original DNA file.

The project_functions notebook lists every function in the project and explains how each one works. The main functions are described below:

### 2.1 Sequence Manipulation and Analysis

**Functions:**
- `find_shine_dalgarno(sequence, shine_dalgarno="AGGAGG")`
- `cut_sequence(sequence, shine_dalgarno="AGGAGG")`
- `translate_to_uppercase(sequence)`
- `filter_dna_sequence(sequence)`
- `read_dna_sequence(filename)`
- `separate_sections(filename)`

**Description:**
These functions handle various tasks related to the manipulation and analysis of DNA sequences. They include locating the Shine-Dalgarno sequence, cutting sequences based on this motif, translating sequences to uppercase, filtering sequences to retain only valid nucleotide characters, and reading sequences from files.
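As a rough illustration of the steps described above, the following sketch filters a raw sequence and cuts it at the Shine-Dalgarno motif. It is not the package's own code: the `sketch_*` names are hypothetical, and cutting immediately after the motif is an assumption about how `cut_sequence` behaves.

```python
# Illustrative sketch only: the sketch_* helpers are hypothetical stand-ins,
# not the package's implementations.
VALID_BASES = set("ACGT")


def sketch_filter_dna(sequence):
    """Uppercase the input and keep only A, C, G and T characters."""
    return "".join(base for base in sequence.upper() if base in VALID_BASES)


def sketch_find_motif(sequence, motif="AGGAGG"):
    """Return the index of the first occurrence of the motif, or -1 if absent."""
    return sequence.find(motif)


def sketch_cut_after_motif(sequence, motif="AGGAGG"):
    """Return the sequence that follows the first motif (assumed behaviour)."""
    position = sketch_find_motif(sequence, motif)
    if position == -1:
        return ""
    return sequence[position + len(motif):]


raw = "ttg aggagg\nacatATGgcc"
clean = sketch_filter_dna(raw)
print(clean)                          # TTGAGGAGGACATATGGCC
print(sketch_find_motif(clean))       # 3
print(sketch_cut_after_motif(clean))  # ACATATGGCC
```

Filtering before searching means whitespace and line breaks in the input file cannot split the motif.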

### 2.2 Transcription and Translation

**Functions:**
- `transcribe_dna_to_rna(dna_sequence)`
- `find_start_codons_rna(rna_sequence)`
- `translate_rna_to_protein(rna_sequence)`
- `complementary_sequences(dna_sequence)`
- `flip_rna_sequence(rna_sequence)`
- `translate_rna_to_proteins_all_frames(rna_sequence)`
- `read_genetic_code(filename)`

**Description:**
These functions manage the processes of transcribing DNA to RNA and translating RNA into protein sequences. They include finding start codons in RNA sequences, translating RNA to protein using a genetic code, generating complementary sequences, and translating in all possible reading frames.
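The sketch below shows one way transcription and multi-frame translation can be written. It makes several assumptions that may differ from the package: the input is treated as the coding strand (so transcription is a T to U substitution), the standard genetic code is built in rather than read from a file, "all frames" is taken to mean three offsets on each strand, and translation runs to the end of the sequence with `*` marking stop codons. All `sketch_*` names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def sketch_transcribe(dna):
    """Assumption: input is the coding strand, so only T -> U is needed."""
    return dna.upper().replace("T", "U")


def sketch_reverse_complement(rna):
    """Complement each base, then reverse the strand."""
    return rna.translate(str.maketrans("AUCG", "UAGC"))[::-1]


def sketch_translate(rna):
    """Translate complete codons only; '*' marks a stop codon."""
    codons = (rna[i:i + 3] for i in range(0, len(rna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def sketch_translate_all_frames(rna):
    """Translate three offsets on the given strand and on its reverse complement."""
    frames = {}
    for label, strand in (("+", rna), ("-", sketch_reverse_complement(rna))):
        for offset in range(3):
            frames[f"{label}{offset + 1}"] = sketch_translate(strand[offset:])
    return frames


rna = sketch_transcribe("ATGGCCTAA")
for frame, protein in sketch_translate_all_frames(rna).items():
    print(frame, protein)
```

The sketch expects input that contains only A, C, G, and U; any other character raises a `KeyError`, which is why filtering the sequence first matters.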

### 2.3 Protein Properties Calculation

**Functions:**
- `calculate_hydrophobicity(protein)`
- `calculate_molecular_weight(protein)`
- `calculate_retention_coefficient(protein)`
- `calculate_polarity_score(protein)`
- `calculate_configuration_likelihoods(protein)`

**Description:**
These functions calculate various properties of protein sequences, such as hydrophobicity, molecular weight, retention coefficient, polarity score, and the likelihood of different structural configurations (alpha-helix, beta-sheet, beta-turn).
These properties are computed from per-residue scales published in the scientific literature. Because the scores depend only on the amino acid sequence and not on a specific protein's conformation, the results are approximate: they are more reliable for short peptides and may begin to deviate for larger peptide structures. They remain useful for estimating the order of magnitude of these properties and for comparing values between proteins.
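To make the idea of a per-residue scale concrete, here is a minimal sketch for two of the properties. It assumes the Kyte-Doolittle hydropathy scale averaged over the sequence, and average free amino acid masses with one water removed per peptide bond. The package may use different scales or sum rather than average, and the `sketch_*` names are hypothetical.

```python
# Assumption: Kyte-Doolittle hydropathy values; the package's scale may differ.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Average masses of the free amino acids, in g/mol.
AMINO_ACID_MASS = {
    "A": 89.09, "R": 174.20, "N": 132.12, "D": 133.10, "C": 121.16,
    "Q": 146.15, "E": 147.13, "G": 75.07, "H": 155.16, "I": 131.17,
    "L": 131.17, "K": 146.19, "M": 149.21, "F": 165.19, "P": 115.13,
    "S": 105.09, "T": 119.12, "W": 204.23, "Y": 181.19, "V": 117.15,
}
WATER_MASS = 18.015


def sketch_mean_hydropathy(protein):
    """Average the per-residue hydropathy values over the sequence."""
    if not protein:
        return 0.0
    return sum(KYTE_DOOLITTLE[residue] for residue in protein) / len(protein)


def sketch_molecular_weight(protein):
    """Sum residue masses and subtract one water per peptide bond."""
    if not protein:
        return 0.0
    total = sum(AMINO_ACID_MASS[residue] for residue in protein)
    return total - (len(protein) - 1) * WATER_MASS


peptide = "MAGIC"
print(round(sketch_mean_hydropathy(peptide), 2))
print(round(sketch_molecular_weight(peptide), 2))
```

Both functions look only at which residues are present, which is exactly why the results ignore folding and should be read as estimates.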

### 2.4 Data Processing and Output

**Functions:**
- `ReadShineDalgarnoFromTxt(filename: str)`
- `translate_one_letter_to_three_letter_list(one_letter_sequences)`
- `get_unique_folder_path(base_folder)`
- `DNA_ToProtExcl_Analysis(sections, section_number=None, output_folder=None)`

**Description:**
These functions are responsible for processing data and generating output files. They handle reading and processing DNA sequences from text files, converting one-letter protein sequences to three-letter codes, ensuring unique folder paths for output, and performing comprehensive DNA to protein analysis while saving the results in Excel files.
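Two of the helper tasks described above can be sketched with the standard library alone. The code below is an assumption about the general approach, not the package's implementation: the `sketch_*` names and the `_1`, `_2` suffix scheme for folder names are hypothetical, and writing the Excel file itself is left out because it needs a third-party library.

```python
from pathlib import Path

ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def sketch_one_to_three(one_letter_sequences):
    """Convert each one-letter sequence into a hyphen-joined three-letter string."""
    return [
        "-".join(ONE_TO_THREE[residue] for residue in sequence)
        for sequence in one_letter_sequences
    ]


def sketch_unique_folder(base_folder):
    """Return base_folder, or base_folder_1, _2, ... if the path already exists."""
    base = Path(base_folder)
    candidate = base
    counter = 1
    while candidate.exists():
        candidate = base.with_name(f"{base.name}_{counter}")
        counter += 1
    return candidate


print(sketch_one_to_three(["MA", "GL"]))  # ['Met-Ala', 'Gly-Leu']
```

Choosing a fresh folder name for each run avoids overwriting the results of an earlier analysis.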

---

## 3.0 Limitations

The scoring system for determining protein properties, such as hydrophobicity, molecular weight, retention coefficient, configuration likelihoods, and polarity, has notable limitations. It relies on simplified and generalized values for amino acids, often ignoring the complex context-dependent nature of protein behavior and 3D structural influences. This approach provides a static snapshot that does not reflect the dynamic nature of proteins or account for post-translational modifications. Environmental factors, such as pH and temperature, which significantly impact protein properties, are typically not considered. Additionally, the empirical scales used for scores can vary due to differences in experimental conditions. Over-reliance on these numerical scores may lead to incomplete interpretations, highlighting the need to complement these predictions with experimental data and structural analysis for a comprehensive understanding of protein behavior.


---

## 4.0 Challenges

Throughout the project, we encountered several challenges. One significant hurdle was choosing an output format for the large volume of data generated. The package analyses a diverse range of protein properties, including hydrophobicity, molecular weight, and configuration likelihoods, for every DNA reading frame, and for long strands of DNA the number of resulting sequences can be significant. Selecting an appropriate format was therefore crucial. After deliberation, we settled on Excel files for their versatility and widespread compatibility. This decision has drawbacks: Excel's structure is highly adaptable but does not always present complex datasets intuitively, and its flexibility brings a risk of accidental alterations or misinterpretation, so care is needed during data handling and analysis.


---

## 5.0 Conclusion