<div style="hwidth: 100%; background-color: #ddd; overflow:hidden; ">
    <div style="display: flex; justify-content: center; align-items: center; border-bottom: 10px solid #80c4e7; padding: 3px;">
        <h2 style="position: relative; top: 3px; left: 8px;">S2 Project: DNA Classification - (part1: ETL)</h2>
        <img style="position: absolute; height: 68px; top: -2px;; right: 18px" src="./Content/Notebook-images/dna1.png"/>
    </div>
    <div style="padding: 3px 8px;">
        <h4>Objectives:</h4>
        The primary objective of this project is to develop predictive models for DNA sequence gene classification.
        <h4>Dataset:</h4>
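        The dataset files contain genetic sequence data in FASTA format. This notebook loads two files of coding sequences (CDS):
        <ul>
            <li>Ach_cds.fas (Actinidia chinensis, labeled "kiwi")</li>
            <li>Csi_cds.fas (labeled "Orange")</li>
        </ul>
        The Arabidopsis thaliana BHLH and CYP gene-family files remain available as commented-out alternatives in the loading cell.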
        <h4>Steps:</h4>
        <ol>
            <li>Read the genetic sequence data from the files.</li>
            <li>Vectorize the data to prepare it for modeling.</li>
            <li>Save the data in a usable CSV format.</li>
            <li>Define the analysis approaches we will take in this study.</li>
        </ol>
    </div>    
</div>

### 1 - Importing utils
The following code cells will import necessary libraries.

In [14]:
import numpy as np
import pandas as pd
from sklearn.utils import shuffle, resample
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf
from tensorflow.keras import models, layers, Input, Sequential
import matplotlib.pyplot as plt

### 2 - Importing Dataset
The following function will read each **.fasta file** and return a pandas dataframe in this format [**id** - **sequence** - **length** - **class**]

In [15]:
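def read_fasta_file(file_path, family):
    """Parse a FASTA file into a DataFrame with columns id, sequence, length, class."""
    sequences = []
    with open(file_path, 'r') as file:
        current_id = None
        current_sequence = ''
        for line in file:
            if line.startswith('>'):
                # A new header closes the previous record, if there is one
                if current_id:
                    sequences.append({'id': current_id, 'sequence':current_sequence, 'length':len(current_sequence), 'class': family})
                # Keep the header text before the first '|', without the leading '>'
                current_id = line.strip().split('|')[0][1:].strip()
                current_sequence = ''
            else:
                # Sequence data may span several lines; join them
                current_sequence += line.strip()
        # Store the last record, which has no header after it
        if current_id:
            sequences.append({'id': current_id, 'sequence':current_sequence, 'length':len(current_sequence), 'class': family})
    
    df = pd.DataFrame(sequences)
    return df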

In [16]:
# Data file path
#gene_family_1 = "./Content/Data/Arabidopsis_thaliana_BHLH_gene_Family.fasta"
#gene_family_2 = "./Content/Data/Arabidopsis_thaliana_CYP_gene_Family.fasta"
gene_family_1 = "./Content/Raw-Data/Ach_cds.fas"
gene_family_2 = "./Content/Raw-Data/Csi_cds.fas"

# Convert to dataframe:
dataset1 = read_fasta_file(gene_family_1, "kiwi")
dataset2 = read_fasta_file(gene_family_2, "Orange")

# Concatenate the two dataframes
dataset = pd.concat([dataset1, dataset2], ignore_index=True)

# Let's get a quick look at our dataset
dataset.head()

|   | id | sequence | length | class |
|---|----|----------|--------|-------|
| 0 | Achn199931 Actinidia chinensis | ATGGGAAGAGGAAAGATCGAGGTGAAGAGGATAGAGAACAACACAA... | 714 | kiwi |
| 1 | Achn100281 Actinidia chinensis | ATGACCGGCGACAGAGGGTTTTCTCCGATCGGCGGGGACCTACCGC... | 2295 | kiwi |
| 2 | Achn251771 Actinidia chinensis | ATGATCAACGGCTATCACAACCACAATCAGCATAATTTTACAGAGA... | 1146 | kiwi |
| 3 | Achn065501 Actinidia chinensis | ATGGAGGTCGTTTGTCTCAACAGTGAGCCAGTGTTTGACGACGGTG... | 1992 | kiwi |
| 4 | Achn103311 Actinidia chinensis | ATGGTAAAACATATTTCAAGCTCATCATCAGAAGGGGATGAGAGGT... | 399 | kiwi |


### 3 - Exploratory analysis

* Check for null values

In [17]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4551 entries, 0 to 4550
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        4551 non-null   object
 1   sequence  4551 non-null   object
 2   length    4551 non-null   int64 
 3   class     4551 non-null   object
dtypes: int64(1), object(3)
memory usage: 142.3+ KB


**Note**: As we can see, our dataset contains 4,551 entries. Each column has a uniform data type and no null values.

* Check for invalid characters in the sequences

In [18]:
pattern = r'^[ATCG]+$'
assert dataset['sequence'].str.match(pattern).all(), "Error: Invalid characters found in sequence column"

AssertionError: Error: Invalid characters found in sequence column

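**Note**: The assertion fails, which means at least one sequence contains a character other than 'A', 'T', 'C', or 'G'. The pattern is case-sensitive, so lowercase bases or ambiguity codes such as 'N' would also trigger it. These sequences need to be inspected and handled before modeling.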
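
When the assertion above fails, it helps to know which rows and which characters are responsible. The sketch below shows one way to list them. It runs on a small made-up DataFrame (`demo`) rather than the real `dataset`, and the helper name `invalid_chars` is an illustrative assumption, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the real `dataset`
demo = pd.DataFrame({
    "id": ["seq1", "seq2", "seq3"],
    "sequence": ["ATGCGT", "ATGNNT", "atgcgt"],
})

def invalid_chars(sequence, alphabet="ATCG"):
    """Return the characters of `sequence` that are not in `alphabet`, sorted."""
    return "".join(sorted(set(sequence) - set(alphabet)))

# One string of offending characters per row; empty string means the row is clean
demo["invalid"] = demo["sequence"].apply(invalid_chars)
bad_rows = demo[demo["invalid"] != ""]
print(bad_rows[["id", "invalid"]])
```

Applied to the real data, the same column would show whether the failures come from ambiguity codes, lowercase bases, or something else, which in turn suggests whether to uppercase, filter, or drop the affected sequences.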

* Handling imbalanced data

In [25]:
class_counts = dataset['class'].value_counts()
total_samples = len(dataset)
class_counts_df = pd.DataFrame(class_counts)
class_counts_df.columns = ['Count']
class_counts_df['Percentage'] = (class_counts_df['Count'] / total_samples * 100).round(2)

print("Class Distribution:")
print(class_counts_df)
print("\nTotal Samples:", total_samples)

Class Distribution:
        Count  Percentage
class                    
kiwi     2296       50.45
Orange   2255       49.55

Total Samples: 4551


In [26]:
# Calculate the imbalance ratio
imbalance_ratio = class_counts_df['Count'].max() / class_counts_df['Count'].min()
imbalance_threshold = 1.5
print("Imbalance Ratio:", imbalance_ratio)

Imbalance Ratio: 1.018181818181818


**Note**: The dataset is not significantly imbalanced: the imbalance ratio of about 1.02 is below the threshold of 1.5 that we set, so we do not need to rebalance it. The function below is applied only if the ratio reaches the threshold. It undersamples the majority class, with oversampling of the minority class kept as a commented-out alternative.

In [27]:
def balance_dataset(df):
    df_majority = df[df['class'] == 'kiwi']
    df_minority = df[df['class'] == 'Orange']
    
    #df_minority_upsampled = resample(df_minority, replace=True, n_samples=len(df_majority), random_state=42)    
    #df_balanced = pd.concat([df_majority, df_minority_upsampled], ignore_index=True)
    
    df_majority_undersampled = resample(df_majority, replace=False, n_samples=len(df_minority), random_state=42)    
    df_balanced = pd.concat([df_majority_undersampled, df_minority], ignore_index=True)
    return df_balanced

if imbalance_ratio >= imbalance_threshold:
    dataset = balance_dataset(dataset)
    print(dataset['class'].value_counts())

* Encode the labels

In [28]:
dataset['class'] = LabelEncoder().fit_transform(dataset['class'])

output_path   = "./Output/kiwi_orange_cds.csv"
dataset.to_csv(output_path, index=False)

dataset.head()

|   | id | sequence | length | class |
|---|----|----------|--------|-------|
| 0 | Achn199931 Actinidia chinensis | ATGGGAAGAGGAAAGATCGAGGTGAAGAGGATAGAGAACAACACAA... | 714 | 1 |
| 1 | Achn100281 Actinidia chinensis | ATGACCGGCGACAGAGGGTTTTCTCCGATCGGCGGGGACCTACCGC... | 2295 | 1 |
| 2 | Achn251771 Actinidia chinensis | ATGATCAACGGCTATCACAACCACAATCAGCATAATTTTACAGAGA... | 1146 | 1 |
| 3 | Achn065501 Actinidia chinensis | ATGGAGGTCGTTTGTCTCAACAGTGAGCCAGTGTTTGACGACGGTG... | 1992 | 1 |
| 4 | Achn103311 Actinidia chinensis | ATGGTAAAACATATTTCAAGCTCATCATCAGAAGGGGATGAGAGGT... | 399 | 1 |
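
The encoding cell creates the `LabelEncoder` inline, so the label-to-integer mapping is not kept anywhere. A minimal sketch of how that mapping could be recovered, assuming the same two labels; storing the encoder in a variable named `encoder` is an illustrative choice, not what the cell above does.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoded = encoder.fit_transform(["kiwi", "Orange", "kiwi"])

# classes_ holds the sorted unique labels; a label's position is its integer code.
# Uppercase letters sort before lowercase ones, so "Orange" comes before "kiwi".
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'Orange': 0, 'kiwi': 1}
```

This agrees with the table above, where the kiwi rows carry class 1.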


**Note**: <span style="color:red;">This marks the end of the Extract, Transform, and Load (ETL) process</span> we performed on our FASTA files to obtain a CSV file, a format commonly used in AI projects. In the next notebooks, we will use this CSV file to try different machine learning techniques easily.

### 4 - Analysis approach
To classify DNA sequences, we can explore several approaches, each with its own strengths and weaknesses. Here are some detailed methods we will examine in the next notebook:

<div style="background-color: #f5f5f5; padding: 1px .5em;">
    
<!-- ************************************************ -->
<h4 style="background-color: #80c4e6; padding: .5em;">
    Approach 1: k-mer Representation with Frequency Analysis on Classic ML Models
</h4>

1. **Description**:
   - Break the DNA sequence into k-mers (subsequences of length k).
   - Perform frequency analysis to create a feature vector based on the occurrence of each k-mer.
   - Use this feature vector as input to the model.

2. **Pros**:
   - Captures local context within each k-mer.
   - Simplifies the input representation by reducing it to frequency counts.

3. **Cons**:
   - Loses positional information beyond the k-mer length.
   - Treating it as a [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model) model ignores the order of k-mers.

4. **Todo**:
   - Try the model with various k-mer lengths.
   - Try feature selection and/or dimensionality reduction as k grows.<br><br>
   - [<span style="padding: .5em; background-color: #dddddd;">Click to Open Notebook 2</span>](./02-approach1_kmer_classic_ml.ipynb)
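
The k-mer frequency representation described in Approach 1 can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the next notebook: the function name `kmer_frequencies`, the default `k=3`, and the choice to normalize counts by the number of k-mers are all assumptions.

```python
from collections import Counter
from itertools import product

def kmer_frequencies(sequence, k=3, alphabet="ACGT"):
    """Return relative frequencies of overlapping k-mers in a fixed vocabulary order."""
    # Fixed vocabulary: every possible k-mer over the alphabet (4**k entries for DNA)
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    # Slide a window of width k over the sequence, one base at a time
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = max(len(sequence) - k + 1, 1)
    # k-mers containing characters outside the alphabet are not in the vocabulary,
    # so they are ignored in the output (but still counted in `total`)
    return [counts[kmer] / total for kmer in vocabulary]

vector = kmer_frequencies("ATGATG", k=2)
print(len(vector))  # 16 features for k=2
```

Every sequence maps to a vector of the same length regardless of its own length, which is what makes this representation convenient for classic models. The order of the k-mers within the sequence is lost, as noted in the cons above.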

<!-- ************************************************ -->
<h4 style="background-color: #80c4e6; padding: .5em;">
    Approach 2: k-mer Representation with Frequency Analysis on Neural Network Architecture
</h4>


1. <strong>Description</strong>:
   - In this approach, we represent DNA sequences using k-mer frequencies. Each sequence is encoded as a vector where each element represents the frequency of a specific k-mer in the sequence. This vector representation is then used as input to a neural network architecture for classification.

2. <strong>Pros</strong>:
   - Utilizes frequency analysis: By representing sequences based on the frequency of k-mers, the model can capture important patterns and motifs in the DNA sequences.
   - Flexible architecture: Neural networks provide a flexible framework for learning complex relationships between features, allowing the model to adapt to different types of data.

3. <strong>Cons</strong>:
   - Curse of dimensionality: Depending on the value of k and the size of the alphabet (e.g., DNA bases A, C, G, T), the feature space can become very large, leading to increased computational complexity and potential overfitting.
   - Loss of sequence information: By focusing solely on k-mer frequencies, the model may overlook important sequential dependencies and structural information present in the DNA sequences.

4. **Todo**:<br><br>
   - [<span style="padding: .5em; background-color: #dddddd;">Click to Open Notebook 3</span>](./03-approach2_kmer_neural_network.ipynb)
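
To make the dimensionality concern concrete, the short calculation below lists the number of possible k-mers (4 to the power k) for several values of k and compares it with the 4,551 sequences reported by `dataset.info()` earlier in this notebook. The range of k values is an arbitrary choice for illustration.

```python
ALPHABET_SIZE = 4   # A, C, G, T
N_SAMPLES = 4551    # number of sequences reported by dataset.info()

# Number of distinct k-mers, i.e. the length of the frequency vector, for each k
feature_counts = {k: ALPHABET_SIZE ** k for k in range(1, 9)}
for k, n_features in feature_counts.items():
    note = "(more features than samples)" if n_features > N_SAMPLES else ""
    print(f"k={k}: {n_features:>6} possible k-mers {note}")

# Smallest k at which the feature count exceeds the sample count
first_k_exceeding = min(k for k, n in feature_counts.items() if n > N_SAMPLES)
```

With this dataset size, the feature vector becomes longer than the number of samples from k = 7 onward (16,384 features), which is roughly where feature selection or dimensionality reduction starts to matter.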
     
<!-- ************************************************ -->
<h4 style="background-color: #80c4e6; padding: .5em;">
    Approach 3: Single Nucleotide as a Token for Neural Network Architecture
</h4>

1. **Description**:
   - Treat each base (A, T, C, G) as a single token.
   - Encode each base numerically (e.g., A=0, T=1, C=2, G=3).
   - Train a model on the sequence of encoded bases.

2. **Pros**:
   - Simple and straightforward.
   - Preserves the positional information of each base.

3. **Cons**:
   - Limited contextual information.
   - May not capture long-range dependencies well.

4. **Todo**:
   - [<span style="padding: .5em; background-color: #dddddd;">Click to Open Notebook 4</span>](./04-approach3_single_nucleotide_encoding.ipynb)
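
A minimal sketch of the single-nucleotide encoding from Approach 3, using NumPy only. The base-to-integer mapping is the one given in the description. The padding value `4`, the function name `encode_sequences`, and right-padding to a fixed `max_length` are assumptions made for illustration; padding with `0` would be indistinguishable from the base A under this mapping.

```python
import numpy as np

BASE_TO_INT = {"A": 0, "T": 1, "C": 2, "G": 3}  # mapping from the description
PAD_VALUE = 4  # assumption: an extra index reserved for padding

def encode_sequences(sequences, max_length):
    """Integer-encode sequences, then right-pad or truncate them to max_length."""
    encoded = np.full((len(sequences), max_length), PAD_VALUE, dtype=np.int64)
    for row, sequence in enumerate(sequences):
        # Raises KeyError on any character outside A, T, C, G
        codes = [BASE_TO_INT[base] for base in sequence[:max_length]]
        encoded[row, :len(codes)] = codes
    return encoded

batch = encode_sequences(["ATCG", "GG", "ATCGAT"], max_length=5)
print(batch)
```

The result is a rectangular integer array, one row per sequence, that preserves the position of every base and could be fed to an embedding layer with five input indices.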


<h4 style="background-color: #80c4e6; padding: .5em;">
    Approach 4: k-mer Representation with One-Hot Encoding and Pretrained embedding for Neural Network
</h4>

1. <strong>Description</strong>:
   - In this approach, DNA sequences are treated as a natural language processing problem, with k-mers playing the role of words.

2. <strong>Pros</strong>:
   - Incorporates pre-trained embeddings: By leveraging pre-trained embeddings, the model can benefit from knowledge learned from a large dataset.
   - We want to test whether the pretrained dna2vec model is effective: https://github.com/pnpnpn/dna2vec
   - We also want to evaluate a transformer model, DNABERT: https://github.com/jerryji1993/DNABERT

3. <strong>Cons</strong>:
   - Dimensionality of one-hot encoding: One-hot encoding results in high-dimensional input vectors.
   - Limited transferability of embeddings.
     
4. **Todo**:<br><br>
   - [<span style="padding: .5em; background-color: #dddddd;">Click to Open Notebook 5</span>](./05-approach4_kmer_onehot_and_dna2vec.ipynb)
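
The one-hot side of Approach 4 can be illustrated as follows: each overlapping k-mer becomes one row with a single 1 at the index of that k-mer in the vocabulary. This is a sketch under stated assumptions (the function name `one_hot_kmers` and the default `k=2` are hypothetical) and it does not use the dna2vec or DNABERT code. With pretrained embeddings, each one-hot row would be replaced by a dense vector looked up for the same k-mer.

```python
import numpy as np
from itertools import product

def one_hot_kmers(sequence, k=2, alphabet="ACGT"):
    """Return a (number of k-mers, 4**k) matrix with one one-hot row per k-mer."""
    # Index every possible k-mer over the alphabet
    vocabulary = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    # Overlapping k-mers, in the order they appear in the sequence
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    # Raises KeyError if a k-mer contains a character outside the alphabet
    indices = [vocabulary[kmer] for kmer in kmers]
    # Selecting rows of the identity matrix yields the one-hot vectors
    return np.eye(len(vocabulary), dtype=np.float32)[indices]

matrix = one_hot_kmers("ATGC", k=2)
print(matrix.shape)  # (3, 16)
```

Unlike the frequency vector of Approach 1, this representation keeps the order of the k-mers, at the cost of a row width that grows as 4 to the power k.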

<!--
<h4 style="background-color: #80c4e6; padding: .5em;">
    Approach 1: k-mer Representation with one-hot encoding for Neural Network architecture
</h4>
    
1. **Description**:
   - Treat each base (A, T, C, G) as a single token.
   - Encode each base numerically (e.g., A=0, T=1, C=2, G=3).
   - Train a model on the sequence of encoded bases.

2. **Pros**:
   - Simple and straightforward.
   - Preserves the positional information of each base.

3. **Cons**:
   - Limited contextual information.
   - May not capture long-range dependencies well.

4. **Todo**:
   - [<span style="padding: .5em; background-color: #dddddd;">Click to Open Notebook 1</span>](./02-approach1_single_nucleotide_position.ipynb)
-->
      
<!--4. **Todo**:
   - This approach can be effective for simpler classification tasks or when the sequences are short.
   - We will try models like Recurrent Neural Networks (RNNs) or Convolutional Neural Networks (CNNs) that can capture sequence information.-->

<!--
<h4 style="background-color: #80c4e6; padding: .5em;">
    Approach 3: k-mer Representation with Position Analysis
</h4>

1. **Description**:
   - Break the DNA sequence into k-mers.
   - Use embeddings to represent each k-mer, capturing more complex relationships and context.
   - Train a model using these embeddings.

2. **Pros**:
   - Captures richer contextual information through embeddings.
   - Can capture long-range dependencies if using advanced embedding techniques (e.g., BERT).

3. **Cons**:
   - More complex and computationally intensive.
   - Requires large amounts of data to train effective embeddings.

4. **Todo**:
   - [<span style="padding: .5em; background-color: #dddddd;">Click to Open Notebook 3 \[FROM SCRATCH MODEL\] </span>](./04-approach3_kmer_position(from_scratch).ipynb)
     
   - [<span style="padding: .5em; background-color: #dddddd;">Click to Open Notebook 4 [TRANSFERT LEARNING ] </span>](./05-approach3_kmer_position(transfert_learning).ipynb)
      -->


<!--4. **Recommendation**:
   - This approach is powerful for capturing complex patterns in DNA sequences.
   - Suitable for deep learning models like Transformers, which can handle long-range dependencies and positional encoding.
   - Consider pretraining embeddings on a large corpus of DNA sequences and fine-tuning for your specific task.-->

<!--
<h4 style="background-color: #80c4e6; padding: .5em;">
    Approach 4: Hybrid Models
</h4>

- **Hybrid Approaches**: Combining different approaches might yield better results. For example, you can use k-mers with embeddings and incorporate positional encoding to preserve sequence order.
- **Model Selection**: The choice of model significantly impacts the performance. Here are some suggestions:
  - **RNNs/LSTMs**: Good for capturing sequential dependencies.
  - **CNNs**: Effective for capturing local patterns and motifs in sequences.
  - **Transformers**: Excellent for capturing long-range dependencies and complex patterns with attention mechanisms.
- **Hyperparameter Tuning**: Experiment with different values of k (e.g., k=3, 4, 5, 6) and tune hyperparameters to find the optimal setup.
- **Evaluation**: Use cross-validation and metrics like accuracy, precision, recall, and F1-score to evaluate the performance of different approaches.
-->
</div>