<div style="width: 100%; background-color: #ddd; overflow:hidden; ">
    <div style="display: flex; justify-content: center; align-items: center; border-bottom: 10px solid #80c4e7; padding: 3px;">
        <h2 style="position: relative; top: 3px; left: 8px;">S2 Project: DNA Classification</h2>
        <img style="position: absolute; height: 68px; top: -2px; right: 18px" src="./Content/Notebook-images/dna1.png"/>
    </div>
    <div style="padding: 3px 8px;">
        <h4>Objectives:</h4>
        The primary objective of this project is to develop predictive models that classify DNA sequences by gene family.
        <h4>Dataset:</h4>
        The dataset consists of two files of genetic sequence data in FASTA format:
        <ul>
            <li>Arabidopsis_thaliana_BHLH_gene_Family.fasta</li>
            <li>Arabidopsis_thaliana_CYP_gene_Family.fasta</li>
        </ul>
        <h4>Steps:</h4>
        <ol>
            <li>Read the genetic sequence data from the files.</li>
            <li>Vectorize the data to prepare it for modeling.</li>
            <li>Implement classification models such as k-nearest neighbors (kNN), support vector machine (SVM), and random forest (RF).</li>
            <li>Evaluate the performance of the models using appropriate metrics.</li>
            <li>Iterate on model tuning and feature selection to improve classification accuracy.</li>
        </ol>
    </div>    
</div>

### 1 - Importing utils
The following code cell imports the libraries used in this notebook.

In [1]:
import numpy as np
import pandas as pd
from sklearn.utils import shuffle, resample
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf
from tensorflow.keras import models, layers, Input, Sequential
import matplotlib.pyplot as plt
from sklearn import model_selection
import warnings
from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_extraction import DictVectorizer
warnings.filterwarnings("ignore", category=ConvergenceWarning)

### 2 - Importing Dataset
The following cell reads our preprocessed **.csv file** into a pandas DataFrame.

In [2]:
dataset = pd.read_csv("./Output/Arabidopsis_thaliana_GHLH_and_CYP_gene.csv")

In [3]:
dataset.head()

|   | id | sequence | length | class |
|---|---|---|---|---|
| 0 | AT1G51140.1 | AAGTTTCTCTCACGTTCTCTTTTTTAATTTTAATTTCTCGCCGGAA... | 2297 | 0 |
| 1 | AT1G73830.1 | ACTTTCTATTTTCACCAATTTTCAAAAAAAAAATAAAAATTGAAAC... | 1473 | 0 |
| 2 | AT1G09530.1 | AGTTACAGACGATTTGGTCCCCTCTCTTCTCTCTCTGCGTCCGTCT... | 2958 | 0 |
| 3 | AT1G49770.1 | ATGACTAATGCTCAAGAGTTGGGGCAAGAGGGTTTTATGTGGGGCA... | 2205 | 0 |
| 4 | AT1G68810.1 | AAACTTTTGTCTCTTTTTAACTCTCTTAACTTTCGTTTCTTCTCCT... | 1998 | 0 |
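
The preprocessing that produced this CSV from the two FASTA files is not shown in this notebook. As a minimal sketch, assuming each record starts with a `>` header whose first whitespace-delimited token is the sequence id, rows with the same `id`, `sequence`, `length` and `class` columns could be built as follows. `read_fasta` and `build_rows` are hypothetical helper names, and which gene family gets which label is also an assumption.

```python
import io

def read_fasta(handle):
    """Yield (id, sequence) pairs from an open FASTA file handle."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            tokens = line[1:].split()
            # Keep only the first token of the header as the id.
            seq_id, chunks = (tokens[0] if tokens else ""), []
        else:
            chunks.append(line.upper())
    if seq_id is not None:
        yield seq_id, "".join(chunks)

def build_rows(handle, label):
    """Return one dict per record with the columns used in this notebook."""
    return [
        {"id": seq_id, "sequence": seq, "length": len(seq), "class": label}
        for seq_id, seq in read_fasta(handle)
    ]

# Tiny in-memory stand-in for a FASTA file (toy record, not real data).
demo = io.StringIO(">toy_id.1 some description\nACGT\nacg\n")
rows = build_rows(demo, label=0)
print(rows)
```

With the real files, calling `build_rows` once per family on an opened file and passing the combined list to `pd.DataFrame` would give a comparable table.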


### 3 - Preprocessing

In [4]:
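def kMer(sequence, k=3, step=1):
    """Return the relative frequency of each k-mer found in `sequence`."""
    kmers_count = {}
    s = 0  # total number of k-mer windows seen
    # Slide a window of length k over the sequence, advancing by `step`.
    for i in range(0, len(sequence) - k + 1, step):
        kmer = sequence[i:i + k]
        s += 1
        kmers_count[kmer] = kmers_count.get(kmer, 0) + 1
    # Normalize counts to frequencies so that sequences of different lengths are comparable.
    for key, value in kmers_count.items():
        kmers_count[key] = value / s

    return kmers_count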
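
As a worked example of what `kMer` computes: the sequence `ATGATG` with `k=3` and `step=1` has 4 windows (`ATG`, `TGA`, `GAT`, `ATG`), so `ATG` gets frequency 2/4 and the other two get 1/4 each. The sketch below reproduces this with `collections.Counter`. `kmer_frequencies` is a hypothetical, self-contained stand-in, not the function used in the rest of the notebook.

```python
from collections import Counter

def kmer_frequencies(sequence, k=3, step=1):
    """Relative k-mer frequencies; returns {} if the sequence is shorter than k."""
    windows = [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, step)]
    counts = Counter(windows)
    total = len(windows)
    return {kmer: n / total for kmer, n in counts.items()}

freqs = kmer_frequencies("ATGATG", k=3)
print(freqs)  # {'ATG': 0.5, 'TGA': 0.25, 'GAT': 0.25}
```

The frequencies of one sequence always sum to 1, which is why long and short genes end up on the same scale.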

In [5]:
sequences   = dataset['sequence']
kmers_count = []

for i in range(len(sequences)):
    kmers_count.append(kMer(sequences[i], k=6, step=1))

In [6]:
v = DictVectorizer(sparse=False)
feature_values = v.fit_transform(kmers_count)
feature_names = v.get_feature_names_out()
X = pd.DataFrame(feature_values, columns=feature_names)
X.head()

|   | AAAAAA | AAAAAC | AAAAAG | AAAAAT | AAAACA | AAAACC | AAAACG | AAAACT | AAAAGA | AAAAGC | ... | TTTTCG | TTTTCT | TTTTGA | TTTTGC | TTTTGG | TTTTGT | TTTTTA | TTTTTC | TTTTTG | TTTTTT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.002182 | 0.0 | 0.000436 | 0.000436 | 0.000436 | 0.000436 | 0.0 | 0.0 | 0.000436 | 0.000873 | ... | 0.000436 | 0.001745 | 0.000873 | 0.000873 | 0.002182 | 0.001309 | 0.001309 | 0.000873 | 0.001745 | 0.001745 |
| 1 | 0.00545 | 0.0 | 0.000681 | 0.002725 | 0.000681 | 0.0 | 0.0 | 0.000681 | 0.000681 | 0.0 | ... | 0.0 | 0.004087 | 0.0 | 0.001362 | 0.000681 | 0.001362 | 0.002044 | 0.001362 | 0.002044 | 0.004768 |
| 2 | 0.000339 | 0.000339 | 0.000339 | 0.001016 | 0.000677 | 0.0 | 0.000339 | 0.000339 | 0.001355 | 0.000339 | ... | 0.000339 | 0.00237 | 0.001693 | 0.000339 | 0.000339 | 0.00237 | 0.001355 | 0.001016 | 0.002032 | 0.002032 |
| 3 | 0.005 | 0.001364 | 0.001818 | 0.000455 | 0.000455 | 0.000455 | 0.0 | 0.000455 | 0.001818 | 0.000455 | ... | 0.000455 | 0.002727 | 0.0 | 0.000909 | 0.000455 | 0.000909 | 0.002727 | 0.001818 | 0.000455 | 0.004091 |
| 4 | 0.001004 | 0.001505 | 0.001505 | 0.000502 | 0.002007 | 0.0 | 0.000502 | 0.000502 | 0.0 | 0.001004 | ... | 0.0 | 0.002509 | 0.000502 | 0.0 | 0.001505 | 0.002007 | 0.003512 | 0.001004 | 0.000502 | 0.006021 |
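
With `k=6` there are 4^6 = 4096 possible k-mers over the alphabet A, C, G, T. `DictVectorizer` only creates columns for keys it sees during `fit`, so the column set depends on the data: a smaller dataset, or one containing ambiguity codes such as `N`, would produce a different set of columns. A minimal sketch of a data-independent alternative, assuming a plain A/C/G/T alphabet (`ALL_KMERS`, `INDEX` and `to_vector` are hypothetical names):

```python
from itertools import product

K = 6
# Every possible k-mer, in lexicographic order (AAAAAA ... TTTTTT).
ALL_KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(ALL_KMERS)}

def to_vector(kmer_freqs):
    """Map a {kmer: frequency} dict to a dense list in fixed column order."""
    vec = [0.0] * len(ALL_KMERS)
    for kmer, freq in kmer_freqs.items():
        idx = INDEX.get(kmer)
        if idx is not None:  # k-mers with non-ACGT symbols are skipped
            vec[idx] = freq
    return vec

vec = to_vector({"AAAAAA": 0.25, "TTTTTT": 0.75, "AANAAA": 0.1})
print(len(ALL_KMERS), vec[0], vec[-1])  # 4096 0.25 0.75
```

A fixed vocabulary guarantees that any future sequence maps to the same columns, at the cost of silently dropping k-mers that contain other symbols.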


In [7]:
y = dataset['class']
y.head()

0    0
1    0
2    0
3    0
4    0
Name: class, dtype: int64

In [8]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, stratify=y)

print("Shapes of train/test splits:")
print("X_train:", X_train.shape)
print("X_test:", X_test.shape)
print("y_train:", y_train.shape)
print("y_test:", y_test.shape)

Shapes of train/test splits:
X_train: (304, 4096)
X_test: (76, 4096)
y_train: (304,)
y_test: (76,)


### 4 - Training and Testing the Classification Algorithms

In [9]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV

In [38]:
names = ["Nearest Neighbors", "Gaussian Process",
         "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
         "Naive Bayes", "SVM Linear", "SVM RBF", "SVM Sigmoid"]

classifiers = [
    KNeighborsClassifier(n_neighbors = 3),
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1),
    AdaBoostClassifier(),
    GaussianNB(),
    SVC(kernel = 'linear'), 
    SVC(kernel = 'rbf'),
    SVC(kernel = 'sigmoid')
]
# Materialize the pairs: a bare zip is an iterator and is exhausted after one pass
models = list(zip(names, classifiers))

* Let's evaluate each model

In [39]:
results = []
names = []

for name, model in models:
    # Cross-validation
    kfold = KFold(n_splits=10, random_state=42, shuffle=True)
    cv_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')
    
    # Store results
    results.append(cv_results)
    names.append(name)
    msg = "[Train] - '{}' - acc: {} ±({})".format(name, cv_results.mean(), cv_results.std())
    print("{}\n{}\n{}".format('-'*80, msg, '-'*80))
    
    # Fit the model and make predictions
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

    # Print test results
    print("[Test]  - '{}' - acc : {}".format(name, accuracy_score(y_test, predictions)))
    print(classification_report(y_test, predictions), end="\n")

--------------------------------------------------------------------------------
[Train] - 'Nearest Neighbors' - acc: 0.7294623655913978 ±(0.08198485798227853)
--------------------------------------------------------------------------------
[Test]  - 'Nearest Neighbors' - acc : 0.75
              precision    recall  f1-score   support

           0       0.68      0.78      0.72        32
           1       0.82      0.73      0.77        44

    accuracy                           0.75        76
   macro avg       0.75      0.75      0.75        76
weighted avg       0.76      0.75      0.75        76

--------------------------------------------------------------------------------
[Train] - 'Gaussian Process' - acc: 0.8174193548387099 ±(0.15994478177706528)
--------------------------------------------------------------------------------
[Test]  - 'Gaussian Process' - acc : 0.9473684210526315
              precision    recall  f1-score   support

           0       0.94      0.94     



--------------------------------------------------------------------------------
[Train] - 'AdaBoost' - acc: 0.842258064516129 ±(0.06157460950386322)
--------------------------------------------------------------------------------
[Test]  - 'AdaBoost' - acc : 0.75
              precision    recall  f1-score   support

           0       0.70      0.72      0.71        32
           1       0.79      0.77      0.78        44

    accuracy                           0.75        76
   macro avg       0.74      0.75      0.74        76
weighted avg       0.75      0.75      0.75        76

--------------------------------------------------------------------------------
[Train] - 'Naive Bayes' - acc: 0.8258064516129032 ±(0.09185968075323626)
--------------------------------------------------------------------------------
[Test]  - 'Naive Bayes' - acc : 0.8026315789473685
              precision    recall  f1-score   support

           0       0.81      0.69      0.75        32
           1 



--------------------------------------------------------------------------------
[Train] - 'SVM RBF' - acc: 0.9048387096774194 ±(0.0659820028227145)
--------------------------------------------------------------------------------
[Test]  - 'SVM RBF' - acc : 0.8947368421052632
              precision    recall  f1-score   support

           0       0.88      0.88      0.88        32
           1       0.91      0.91      0.91        44

    accuracy                           0.89        76
   macro avg       0.89      0.89      0.89        76
weighted avg       0.89      0.89      0.89        76

--------------------------------------------------------------------------------
[Train] - 'SVM Sigmoid' - acc: 0.8848387096774195 ±(0.042110058465091235)
--------------------------------------------------------------------------------
[Test]  - 'SVM Sigmoid' - acc : 0.8947368421052632
              precision    recall  f1-score   support

           0       0.83      0.94      0.88        32
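
To make the report columns concrete, the sketch below recomputes precision, recall, F1 and accuracy from a 2x2 confusion matrix. The notebook does not print the confusion matrix. The counts used here are inferred from the 'Nearest Neighbors' report above (supports 32 and 44, recalls 0.78 and 0.73) and should be read as an assumption that reproduces the rounded figures. `binary_report` is a hypothetical helper.

```python
def binary_report(cm):
    """cm[i][j] = number of samples of true class i predicted as class j."""
    total = sum(sum(row) for row in cm)
    report = {}
    for c in (0, 1):
        tp = cm[c][c]
        predicted_c = cm[0][c] + cm[1][c]  # column sum: everything predicted as c
        actual_c = sum(cm[c])              # row sum: the "support" of class c
        precision = tp / predicted_c if predicted_c else 0.0
        recall = tp / actual_c if actual_c else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[c] = (round(precision, 2), round(recall, 2), round(f1, 2))
    accuracy = (cm[0][0] + cm[1][1]) / total
    return report, accuracy

# Assumed counts: 25 of 32 class-0 and 32 of 44 class-1 samples classified correctly.
cm = [[25, 7], [12, 32]]
report, accuracy = binary_report(cm)
print(report)    # {0: (0.68, 0.78, 0.72), 1: (0.82, 0.73, 0.77)}
print(accuracy)  # 0.75
```

When a model never predicts a class, the precision for that class has a zero denominator. The helper returns 0.0 in that case, and scikit-learn by default also reports 0 and emits a warning.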


* Let's tune our different models

In [40]:
# Define the hyperparameter grids
param_grids = {
    "Nearest Neighbors": {'n_neighbors': [3, 5, 7]},
    "Gaussian Process": {'kernel': [1.0 * RBF(1.0), 1.0 * RBF(0.5), 1.0 * RBF(2.0)]},
    "Decision Tree": {'max_depth': [3, 5, 7]},
    "Random Forest": {'max_depth': [3, 5, 7], 'n_estimators': [10, 50, 100], 'max_features': [1, 2, 3]},
    "Neural Net": {'alpha': [0.0001, 0.001, 0.01]},
    "AdaBoost": {'n_estimators': [50, 100, 200]},
    "Naive Bayes": {},  # Empty grid: GaussianNB is kept at its default settings
    "SVM Linear": {'C': [0.1, 1, 10]},
    "SVM RBF": {'C': [0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1]},
    "SVM Sigmoid": {'C': [0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1]}
}

# Models
names = ["Nearest Neighbors", "Gaussian Process", "Decision Tree", "Random Forest", "Neural Net", "AdaBoost", "Naive Bayes", "SVM Linear", "SVM RBF", "SVM Sigmoid"]
classifiers = [
    KNeighborsClassifier(),
    GaussianProcessClassifier(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    MLPClassifier(),
    AdaBoostClassifier(),
    GaussianNB(),
    SVC(kernel='linear'), 
    SVC(kernel='rbf'),
    SVC(kernel='sigmoid')
]
# Materialize the pairs: a bare zip is an iterator and is exhausted after one pass
models = list(zip(names, classifiers))
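
Grid search cost grows with the product of the list lengths in each grid, because every parameter combination is fitted once per fold. As a rough sketch, assuming the 10-fold cross-validation used in this notebook, the number of fits per model can be estimated like this. `n_candidates` is a hypothetical helper, and only four of the grids above are copied here for brevity.

```python
from math import prod

def n_candidates(param_grid):
    """Number of parameter combinations in a single grid dict (1 if empty)."""
    return prod(len(values) for values in param_grid.values())

example_grids = {
    "Nearest Neighbors": {"n_neighbors": [3, 5, 7]},
    "Random Forest": {"max_depth": [3, 5, 7],
                      "n_estimators": [10, 50, 100],
                      "max_features": [1, 2, 3]},
    "Naive Bayes": {},
    "SVM RBF": {"C": [0.1, 1, 10], "gamma": [0.001, 0.01, 0.1]},
}

n_splits = 10
fits = {name: n_candidates(grid) * n_splits for name, grid in example_grids.items()}
print(fits)
# {'Nearest Neighbors': 30, 'Random Forest': 270, 'Naive Bayes': 10, 'SVM RBF': 90}
```

With its default `refit=True`, `GridSearchCV` also refits the best combination once on the whole training set, so each total above grows by one fit.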

In [None]:
results = []
names = []
best_parameters = []

for name, model in models:
    print(f"Processing {name}...")
    
    # Get the hyperparameter grid for the current model
    param_grid = param_grids[name]
    
    # Cross-validation
    kfold = KFold(n_splits=10, random_state=42, shuffle=True)
    
    # Perform grid search
    grid_search = GridSearchCV(model, param_grid, cv=kfold, scoring='accuracy', n_jobs=-1)
    grid_search.fit(X_train, y_train)
    
    # Get the best model
    best_model = grid_search.best_estimator_
    best_params = grid_search.best_params_
    best_parameters.append((name, best_params))
    print('Best params found: ', best_params)
    
    # Cross-validation results
    cv_results = cross_val_score(best_model, X_train, y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    msg = "[Train] - '{}' - best params: {} - acc: {} ±({})".format(name, grid_search.best_params_, cv_results.mean(), cv_results.std())
    print("{}\n{}\n{}".format('-'*80, msg, '-'*80))
    
    # Fit the best model and make predictions
    best_model.fit(X_train, y_train)
    predictions = best_model.predict(X_test)
    
    # Print test results
    print("[Test]  - '{}' - acc : {}".format(name, accuracy_score(y_test, predictions)))
    print(classification_report(y_test, predictions), end="\n")
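
Once the loop has finished, `names` and `results` hold one sequence of fold accuracies per model. A minimal sketch for ranking the models by mean cross-validated accuracy, using only the standard library (`rank_models` and `format_ranking` are hypothetical helpers, not part of the notebook):

```python
from statistics import mean, pstdev

def rank_models(names, results):
    """Return (name, mean, std) tuples sorted by mean accuracy, best first."""
    summary = []
    for name, scores in zip(names, results):
        scores = [float(s) for s in scores]  # also accepts NumPy arrays
        # pstdev is the population standard deviation, matching NumPy's default std().
        summary.append((name, mean(scores), pstdev(scores)))
    return sorted(summary, key=lambda row: row[1], reverse=True)

def format_ranking(ranking):
    """Render one aligned text line per model."""
    return [f"{name:<20} {m:.3f} ±({s:.3f})" for name, m, s in ranking]
```

After the tuning loop, `print("\n".join(format_ranking(rank_models(names, results))))` would list the tuned models from best to worst. Since these scores come from the same folds that were used to pick the hyperparameters, they are likely to be somewhat optimistic compared with the held-out test accuracy.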