<div style="width: 100%; background-color: #ddd; overflow:hidden; ">
    <div style="display: flex; justify-content: center; align-items: center; border-bottom: 10px solid #80c4e7; padding: 3px;">
        <h2 style="position: relative; top: 3px; left: 8px;">S2 Project: DNA Classification</h2>
        <img style="position: absolute; height: 68px; top: -2px; right: 18px" src="./Content/Notebook-images/dna1.png"/>
    </div>
    <div style="padding: 3px 8px;">
        <h4>Objectives:</h4>
        The primary objective of this project is to develop predictive models for DNA sequence gene classification.
        <h4>Dataset:</h4>
        The dataset files contain genetic sequence data in FASTA format. The dataset consists of two files:
        <ul>
            <li>Arabidopsis_thaliana_BHLH_gene_Family.fasta</li>
            <li>Arabidopsis_thaliana_CYP_gene_Family.fasta</li>
        </ul>
        <h4>Steps:</h4>
        <ol>
            <li>Read the genetic sequence data from the files.</li>
            <li>Vectorize the data to prepare it for modeling.</li>
            <li>Implement classification models such as k-nearest neighbors (kNN), support vector machine (SVM), and random forest (RF).</li>
            <li>Evaluate the performance of the models using appropriate metrics.</li>
            <li>Iterate on model tuning and feature selection to improve classification accuracy.</li>
        </ol>
    </div>    
</div>
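
The notebook itself loads an already preprocessed CSV, so step 1 (reading the FASTA files) is not shown. A minimal sketch of that step, assuming plain-text FASTA with `>` header lines, might look like the following. The helper name `read_fasta` and the toy records are hypothetical and not taken from the project files.

```python
import io

def read_fasta(handle):
    """Parse FASTA text from a file-like object into (id, sequence) pairs."""
    records = []
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new record; flush the previous one first.
            if seq_id is not None:
                records.append((seq_id, "".join(chunks)))
            header = line[1:].split()
            seq_id = header[0] if header else ""  # first token after '>' is used as the id
            chunks = []
        else:
            # Sequence lines of one record may be wrapped over several lines.
            chunks.append(line.upper())
    if seq_id is not None:
        records.append((seq_id, "".join(chunks)))
    return records

# Toy FASTA text (made up for illustration, not from the dataset files)
example = io.StringIO(">seq1 demo\nACGT\nTTAA\n>seq2\nGGCC\n")
records = read_fasta(example)
print(records)
```

With a real file, the same function could be called on an open file handle, and the resulting pairs collected into a DataFrame with one class label per gene family.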

### 1 - Importing utils
The following code cell imports the necessary libraries.

In [66]:
import numpy as np
import pandas as pd
from sklearn.utils import shuffle, resample
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf
from tensorflow.keras import models, layers, Input, Sequential
import matplotlib.pyplot as plt
from sklearn import model_selection
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

### 2 - Importing Dataset
The following cell reads our preprocessed **.csv file** into a pandas DataFrame.

In [26]:
dataset = pd.read_csv("./Output/Arabidopsis_thaliana_GHLH_and_CYP_gene.csv")

In [27]:
dataset.head()

Unnamed: 0,id,sequence,length,class
0,AT1G51140.1,AAGTTTCTCTCACGTTCTCTTTTTTAATTTTAATTTCTCGCCGGAA...,2297,0
1,AT1G73830.1,ACTTTCTATTTTCACCAATTTTCAAAAAAAAAATAAAAATTGAAAC...,1473,0
2,AT1G09530.1,AGTTACAGACGATTTGGTCCCCTCTCTTCTCTCTCTGCGTCCGTCT...,2958,0
3,AT1G49770.1,ATGACTAATGCTCAAGAGTTGGGGCAAGAGGGTTTTATGTGGGGCA...,2205,0
4,AT1G68810.1,AAACTTTTGTCTCTTTTTAACTCTCTTAACTTTCGTTTCTTCTCCT...,1998,0


### 3 - Preprocessing

In [28]:
dataset.describe()

Unnamed: 0,length,class
count,380.0,380.0
mean,2080.078947,0.573684
std,727.650228,0.495193
min,468.0,0.0
25%,1684.25,0.0
50%,1947.0,1.0
75%,2367.5,1.0
max,4873.0,1.0


**Note**: As we can see, our DNA sequences do not all have the same length (from 468 to 4873 nucleotides), so we need to bring them to a common size. Here, each sequence is truncated to the length of the shortest one.

In [39]:
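# Common length = length of the shortest sequence (longer sequences are truncated)
pad_up_to = dataset['sequence'].str.len().min()
print(pad_up_to)
pad_seq   = "-"  # padding character; only used if a sequence is shorter than pad_up_to
sequences = list(dataset['sequence'])
classes   = list(dataset['class'])
rows = {}

# Truncate each sequence, split it into individual nucleotides, and append its class label
for i, seq in enumerate(sequences):
    nucleotides = list(seq[:pad_up_to].ljust(pad_up_to, pad_seq))
    nucleotides.append(classes[i])
    rows[i] = nucleotides

# One row per sequence, one column per position; the last column holds the class
df = pd.DataFrame(rows).T
df.rename(columns = {pad_up_to: 'Class'}, inplace = True)
df.head()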

468


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,459,460,461,462,463,464,465,466,467,Class
0,A,A,G,T,T,T,C,T,C,T,...,C,G,A,C,G,G,C,G,A,0
1,A,C,T,T,T,C,T,A,T,T,...,C,A,T,A,T,A,T,T,A,0
2,A,G,T,T,A,C,A,G,A,C,...,T,T,T,C,T,T,T,A,T,0
3,A,T,G,A,C,T,A,A,T,G,...,T,A,T,A,A,A,A,T,T,0
4,A,A,A,C,T,T,T,T,G,T,...,C,T,A,C,G,G,A,A,G,0
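
Truncating to the shortest length keeps only the first 468 nucleotides of every sequence. Padding to a longer target length is the usual alternative, at the cost of introducing the extra `-` symbol. The sketch below contrasts the two on toy strings; the helper name `fix_length` and the toy data are illustrative assumptions.

```python
def fix_length(seq, length, pad_char="-"):
    """Truncate seq to `length`, or right-pad it with pad_char if it is shorter."""
    return seq[:length].ljust(length, pad_char)

# Toy sequences (made up for illustration)
toy = ["ACGTAC", "ACG", "ACGTACGTAC"]
shortest = min(len(s) for s in toy)
longest = max(len(s) for s in toy)

truncated = [fix_length(s, shortest) for s in toy]  # every sequence cut to 3 symbols
padded = [fix_length(s, longest) for s in toy]      # every sequence extended to 10 symbols

print(truncated)  # ['ACG', 'ACG', 'ACG']
print(padded)     # ['ACGTAC----', 'ACG-------', 'ACGTACGTAC']
```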


In [40]:
df.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,459,460,461,462,463,464,465,466,467,Class
count,380,380,380,380,380,380,380,380,380,380,...,380,380,380,380,380,380,380,380,380,380
unique,4,4,4,4,4,4,4,4,4,4,...,4,4,4,4,4,4,4,4,4,2
top,A,T,G,A,A,A,A,A,A,A,...,A,T,T,A,A,A,A,T,T,1
freq,249,201,131,121,161,145,150,127,130,137,...,114,113,120,117,110,107,110,126,124,218


In [42]:
series = []
for name in df.columns:
    series.append(df[name].value_counts())
    
info = pd.DataFrame(series)
details = info.transpose()
details.head()

Unnamed: 0,count,count.1,count.2,count.3,count.4,count.5,count.6,count.7,count.8,count.9,...,count.10,count.11,count.12,count.13,count.14,count.15,count.16,count.17,count.18,count.19
A,249.0,90.0,100.0,121.0,161.0,145.0,150.0,127.0,130.0,137.0,...,114.0,107.0,117.0,117.0,110.0,107.0,110.0,119.0,108.0,
G,50.0,25.0,131.0,97.0,43.0,55.0,62.0,51.0,48.0,50.0,...,78.0,88.0,88.0,76.0,88.0,92.0,85.0,68.0,75.0,
C,49.0,64.0,72.0,65.0,80.0,64.0,62.0,96.0,90.0,76.0,...,76.0,72.0,55.0,74.0,80.0,84.0,76.0,67.0,73.0,
T,32.0,201.0,77.0,97.0,96.0,116.0,106.0,106.0,112.0,117.0,...,112.0,113.0,120.0,113.0,102.0,97.0,109.0,126.0,124.0,
1,,,,,,,,,,,...,,,,,,,,,,218.0
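
In the table above every column is labelled `count`, and the class column is mixed in with the positions. One possible way to get a table that keeps the position labels is to apply `value_counts` column-wise, as sketched here on a toy frame (the toy data is an assumption for illustration).

```python
import pandas as pd

# Toy frame: two sequence positions plus a class column (made up for illustration)
toy = pd.DataFrame({0: list("AAC"), 1: list("GTT"), "Class": [0, 1, 1]})

counts = (
    toy.drop(columns="Class")          # count nucleotides only, not class labels
       .apply(pd.Series.value_counts)  # one value_counts per position column
       .fillna(0)                      # symbols absent at a position get a count of 0
       .astype(int)
)
print(counts)
```

Rows are the observed symbols and columns are the original position labels, so `counts.loc["A", 0]` reads as "number of A at position 0".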


In [43]:
numerical_df = pd.get_dummies(df)

In [44]:
numerical_df

Unnamed: 0,0_A,0_C,0_G,0_T,1_A,1_C,1_G,1_T,2_A,2_C,...,466_A,466_C,466_G,466_T,467_A,467_C,467_G,467_T,Class_0,Class_1
0,True,False,False,False,True,False,False,False,False,False,...,False,False,True,False,True,False,False,False,True,False
1,True,False,False,False,False,True,False,False,False,False,...,False,False,False,True,True,False,False,False,True,False
2,True,False,False,False,False,False,True,False,False,False,...,True,False,False,False,False,False,False,True,True,False
3,True,False,False,False,False,False,False,True,False,False,...,False,False,False,True,False,False,False,True,True,False
4,True,False,False,False,True,False,False,False,True,False,...,True,False,False,False,False,False,True,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
375,False,False,True,False,False,False,True,False,True,False,...,True,False,False,False,False,True,False,False,False,True
376,False,False,True,False,False,True,False,False,False,True,...,False,True,False,False,False,False,False,True,False,True
377,True,False,False,False,False,False,False,True,False,False,...,False,True,False,False,True,False,False,False,False,True
378,False,True,False,False,False,True,False,False,False,True,...,False,False,False,True,False,True,False,False,False,True
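
`pd.get_dummies` creates columns only for the symbols it actually observes at each position. Here that gives 4 columns for each of the 468 positions, but the layout could change on data containing other symbols. A minimal sketch of one-hot encoding against a fixed alphabet, assuming the alphabet `ACGT` and a hypothetical helper `one_hot`, is shown below.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed alphabet; any other symbol (e.g. "N" or "-") maps to all zeros

def one_hot(seq, alphabet=ALPHABET):
    """Return a flat one-hot vector of length len(seq) * len(alphabet)."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    encoded = np.zeros((len(seq), len(alphabet)), dtype=np.uint8)
    for pos, ch in enumerate(seq):
        col = index.get(ch)
        if col is not None:
            encoded[pos, col] = 1
    # Flatten row by row: position 0 (A, C, G, T), then position 1, and so on
    return encoded.ravel()

x = one_hot("ACGN")
print(x)
```

The flattened order matches the `0_A, 0_C, 0_G, 0_T, 1_A, ...` column order of the table above, and the vector length depends only on the sequence length and the alphabet.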


In [50]:
df = numerical_df.drop(columns=['Class_0'])
df.rename(columns = {'Class_1': 'Class'}, inplace = True)
df.head()

Unnamed: 0,0_A,0_C,0_G,0_T,1_A,1_C,1_G,1_T,2_A,2_C,...,465_T,466_A,466_C,466_G,466_T,467_A,467_C,467_G,467_T,Class
0,True,False,False,False,True,False,False,False,False,False,...,False,False,False,True,False,True,False,False,False,False
1,True,False,False,False,False,True,False,False,False,False,...,True,False,False,False,True,True,False,False,False,False
2,True,False,False,False,False,False,True,False,False,False,...,True,True,False,False,False,False,False,False,True,False
3,True,False,False,False,False,False,False,True,False,False,...,False,False,False,False,True,False,False,False,True,False
4,True,False,False,False,True,False,False,False,True,False,...,False,True,False,False,False,False,False,True,False,False


* Let's split the data

In [52]:
# Split data
X = np.array(df.drop(['Class'], axis=1))
y = np.array(df['Class'])
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, stratify=y)

print("Shapes of train/test splits:")
print("X_train:", X_train.shape)
print("X_test:", X_test.shape)
print("y_train:", y_train.shape)
print("y_test:", y_test.shape)

Shapes of train/test splits:
X_train: (304, 1872)
X_test: (76, 1872)
y_train: (304,)
y_test: (76,)
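
The split above has no fixed seed, so each run produces a different partition and slightly different scores. A small sketch of a reproducible stratified split on synthetic data follows; the toy arrays and the seed value `42` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic binary features and labels with a 40/60 class balance
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(100, 8))
y_toy = np.array([0] * 40 + [1] * 60)

# stratify keeps the class balance in both parts; random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.8, stratify=y_toy, random_state=42
)

print(X_tr.shape, X_te.shape)
print("share of class 1 in train/test:", y_tr.mean(), y_te.mean())
```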


### 4 - Training and Testing the Classification Algorithms

In [68]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

In [69]:
names = ["Nearest Neighbors", "Gaussian Process",
         "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
         "Naive Bayes", "SVM Linear", "SVM RBF", "SVM Sigmoid"]

classifiers = [
    KNeighborsClassifier(n_neighbors = 3),
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1),
    AdaBoostClassifier(),
    GaussianNB(),
    SVC(kernel = 'linear'), 
    SVC(kernel = 'rbf'),
    SVC(kernel = 'sigmoid')
]
# Materialize the pairs as a list: a bare zip iterator is exhausted after one pass
models = list(zip(names, classifiers))
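
A dictionary is another way to pair names with estimators, and it can be iterated more than once. The sketch below compares two classifiers on the same stratified folds. It runs on synthetic data from `make_classification` rather than the DNA features, and the fold count and seeds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real feature matrix
X_toy, y_toy = make_classification(n_samples=120, n_features=20, random_state=0)

candidates = {
    "Nearest Neighbors": KNeighborsClassifier(n_neighbors=3),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

# The same folds are reused for every model so the scores are comparable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

summary = {}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X_toy, y_toy, cv=cv, scoring="accuracy")
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} ({scores.std():.3f})")
```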

* Let's evaluate each model

In [71]:
results = []
names = []

# Use the same 10 shuffled folds for every model so the scores are comparable
kfold = model_selection.KFold(n_splits=10, random_state=42, shuffle=True)

for name, model in models:
    # Cross-validated accuracy on the training split: mean (standard deviation)
    cv_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
    # Refit on the full training split and evaluate on the held-out test split
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print('Test-- ',name,': ',accuracy_score(y_test, predictions))
    print()
    print(classification_report(y_test, predictions))
    print('-'*100)
    print()

Decision Tree: 0.520108 (0.095681)
Test--  Decision Tree :  0.5657894736842105

              precision    recall  f1-score   support

       False       0.49      0.53      0.51        32
        True       0.63      0.59      0.61        44

    accuracy                           0.57        76
   macro avg       0.56      0.56      0.56        76
weighted avg       0.57      0.57      0.57        76

----------------------------------------------------------------------------------------------------

Random Forest: 0.586022 (0.068871)
Test--  Random Forest :  0.5657894736842105

              precision    recall  f1-score   support

       False       0.40      0.06      0.11        32
        True       0.58      0.93      0.71        44

    accuracy                           0.57        76
   macro avg       0.49      0.50      0.41        76
weighted avg       0.50      0.57      0.46        76

------------------------------------------------------------------------------------