<div style="hwidth: 100%; background-color: #ddd; overflow:hidden; ">
    <div style="display: flex; justify-content: center; align-items: center; border-bottom: 10px solid #80c4e7; padding: 3px;">
        <h2 style="position: relative; top: 3px; left: 8px;">S2 Project: DNA Classification - (<span style="color: red;">Model validation</span>)</h2>
        <img style="position: absolute; height: 68px; top: -2px;; right: 18px" src="./Content/Notebook-images/dna1.png"/>
    </div>
    <div style="padding: 3px 8px;">
        
1. **Objectives**:
   - In this section we run our previous best models on a new, unseen <span style="color: red;">bHLH</span> test set to assess their overall performance on new data.
    </div>    
</div>

### 1 - Importing utils
The following code cells will import necessary libraries.

In [17]:
import random
import string
import numpy as np
import pandas as pd
from sklearn.utils import shuffle, resample
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf
from tensorflow.keras import models, layers, Input, Sequential
import matplotlib.pyplot as plt
from sklearn.feature_extraction import DictVectorizer
import seaborn as sns
from IPython.display import display, HTML
from sklearn.metrics import (
    confusion_matrix, 
    classification_report, 
    accuracy_score, 
    f1_score, 
    recall_score, 
    precision_score
)
import joblib
import itertools

In [18]:
def read_fasta_file(file_path, family):
    """
    Utils: Convert fasta file to dataframe
    """
    sequences = []
    with open(file_path, 'r') as file:
        current_id = None
        current_sequence = ''
        for line in file:
            if line.startswith('>'):
                if current_id:
                    sequences.append({'id': current_id, 'sequence':current_sequence, 'length':len(current_sequence), 'class': family})
                current_id = line.strip().split('|')[0][1:].strip()
                current_sequence = ''
            else:
                current_sequence += line.strip()
        if current_id:
            sequences.append({'id': current_id, 'sequence':current_sequence, 'length':len(current_sequence), 'class': family})
    
    df = pd.DataFrame(sequences)
    return df
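
To illustrate the parsing rules used above (the header is cut at the first `|`, and sequence lines are concatenated until the next header), here is a minimal sketch that applies the same logic to an in-memory FASTA string. The helper `parse_fasta_text` and the toy records are hypothetical and only serve as an illustration; they are not part of the project code.

```python
import io

def parse_fasta_text(text, family):
    """Hypothetical helper: same parsing rules as read_fasta_file, but reads a string."""
    records = []
    current_id, chunks = None, []

    def flush():
        # Store the record collected so far, if any
        if current_id:
            seq = ''.join(chunks)
            records.append({'id': current_id, 'sequence': seq,
                            'length': len(seq), 'class': family})

    for line in io.StringIO(text):
        line = line.strip()
        if line.startswith('>'):
            flush()
            # Keep only the part before the first '|' and drop the leading '>'
            current_id = line.split('|')[0][1:].strip()
            chunks = []
        elif line:
            chunks.append(line)
    flush()
    return records

# Made-up records, for illustration only
toy = ">geneA | some description\nACGT\nAC\n>geneB\nTTGA\n"
records = parse_fasta_text(toy, 0)
for record in records:
    print(record)
```

Multi-line sequences are joined into a single string, so `length` reflects the full sequence rather than the length of one line.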

In [19]:
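def kmer_count(sequence, k=3, step=1):
    """
    Utils: Count k-mer occurrences in a DNA sequence and return their relative frequencies
    """
    kmers = [''.join(p) for p in itertools.product('ACGT', repeat=k)]
    kmers_count = {kmer: 0 for kmer in kmers}
    s = 0
    for i in range(0, len(sequence) - k + 1, step):
        kmer = sequence[i:i + k]
        # Skip windows containing characters outside ACGT (e.g. N) instead of raising KeyError
        if kmer not in kmers_count:
            continue
        s += 1
        kmers_count[kmer] += 1
    # Guard against sequences with no valid k-mer to avoid dividing by zero
    if s > 0:
        for key, value in kmers_count.items():
            kmers_count[key] = value / s

    return kmers_count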
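
As a quick sanity check of the k-mer frequency idea, the sketch below computes 2-mer frequencies for a short made-up sequence. `kmer_frequencies` is a hypothetical stand-alone re-implementation written for this illustration (it ignores windows that contain characters other than A, C, G, T, which is an assumption on our side), not the notebook's own function.

```python
import itertools

def kmer_frequencies(sequence, k=3, step=1):
    """Hypothetical stand-alone version of the k-mer frequency computation."""
    counts = {''.join(p): 0 for p in itertools.product('ACGT', repeat=k)}
    total = 0
    for i in range(0, len(sequence) - k + 1, step):
        kmer = sequence[i:i + k]
        if kmer in counts:  # assumption: ignore windows with ambiguous bases
            counts[kmer] += 1
            total += 1
    if total == 0:
        return counts
    return {kmer: n / total for kmer, n in counts.items()}

# Toy sequence: the sliding windows are AC, CG, GT, TA, AC
freqs = kmer_frequencies("ACGTAC", k=2)
non_zero = {kmer: f for kmer, f in freqs.items() if f > 0}
print(non_zero)
```

All 16 possible 2-mers are present as keys, so every sequence maps to a vector of the same length, and the frequencies sum to 1 whenever at least one valid window exists.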

In [20]:
def build_kmer_representation(df, k=3):
    """
    Utils: Build the k-mer frequency dataset for a given k and return its vectorized version
    """
    # Iterate over the values directly so the result does not depend on the DataFrame index
    kmers_count = [kmer_count(sequence, k=k, step=1) for sequence in df['sequence']]

    v = DictVectorizer(sparse=False)
    feature_values = v.fit_transform(kmers_count)
    feature_names = v.get_feature_names_out()
    X = pd.DataFrame(feature_values, columns=feature_names)
    # Use the labels of the DataFrame passed in, not the global `dataset`
    y = df['class']
    return X, y, feature_names

In [21]:
def test_report(X_test, y_test, model=None, args=("MODEL NAME", 0)):
    """
    Utils: For a given model and test data, run prediction and report performance metrics
    """
    class_names = ['Class 0 - (BHLH)', 'Class 1 - (CYP)']
    y_pred = model.predict(X_test)
    # Fix the label set so the matrix stays 2x2 even when only one class is present
    cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted', zero_division=1)
    recall = recall_score(y_test, y_pred, average='weighted', zero_division=1)
    precision = precision_score(y_test, y_pred, average='weighted', zero_division=1)
    tn, fp, fn, tp = cm.ravel()
    report = classification_report(y_test, y_pred, labels=[0, 1], target_names=class_names, zero_division=1)
    # Random suffix so each report writes its own confusion matrix image
    cf_id = ''.join(random.choices(string.ascii_uppercase+string.digits, k=8))
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Reds', xticklabels=class_names, yticklabels=class_names)
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.savefig(f"Output/CFMatrix/confusion_matrix_{cf_id}.png")
    plt.close()
    
    report_html = f"""
    <div style="border: 2px solid #ddd;">
        <div style="padding: 0.6em; background-color: #ffdddd; font-weight: bold;">MODEL: {args[0]}</div>
        <div style="display: flex;">
            <div style="padding: 10px; width: 240px;">
                <h2>Initial performance</h2>
                <ul>
                    <li>Cross-validation accuracy: {args[1]}</li>
                </ul>
            </div>
            <div style="flex: 1; padding: 10px;">
                <h2>Classification Report</h2>
                <pre>{report}</pre>
                <h3>Metrics</h3>
                <div style="display: flex;">
                    <ul>
                        <li>True Positives (TP): {tp}</li>
                        <li>True Negatives (TN): {tn}</li>
                    </ul>
                    <ul style="margin-left: 2em;">
                        <li>False Positives (FP): {fp}</li>
                        <li>False Negatives (FN): {fn}</li>
                    </ul>
                </div>
            </div>
            <div style="flex: 1; padding: 10px;">
                <h2 style="margin-left: 2em;">Confusion Matrix</h2>
                <img src="Output/CFMatrix/confusion_matrix_{cf_id}.png" width="400">
            </div>
        </div>
    </div>
    """

    # Display report and confusion matrix side by side
    display(HTML(report_html))
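
Because the validation set used in this notebook contains only class 0 (BHLH) sequences, every prediction is either a true negative (predicted 0) or a false positive (predicted 1) when class 1 is treated as the positive class. The pure-Python sketch below shows how the four counts behave in that situation. The helper `binary_confusion_counts` and the predictions are made up for illustration and are not results from the models.

```python
def binary_confusion_counts(y_true, y_pred):
    """Hypothetical helper: count (tn, fp, fn, tp) with label 1 as the positive class."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 1 and p == 0:
            fn += 1
        else:
            tp += 1
    return tn, fp, fn, tp

# Made-up example: ten class-0 sequences, two of them misclassified as class 1
y_true = [0] * 10
y_pred = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]

tn, fp, fn, tp = binary_confusion_counts(y_true, y_pred)
accuracy = (tn + tp) / len(y_true)
print(f"TN={tn} FP={fp} FN={fn} TP={tp} accuracy={accuracy:.2f}")
```

With a single-class test set, FN and TP are always zero, so accuracy reduces to the share of sequences predicted as class 0.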

### 2 - Load Model / Data


* **Load our fasta file**
    - Here we load the FASTA file as a pandas DataFrame
    - Instead of the label "BHLH" we use class 0, <span style="color: blue;">which corresponds to BHLH in the encoding used when training our models</span>

In [22]:
# Data file path
bHLH_Validation_data = "./Content/Validation-set/LsbHLH.fasta"

# Convert to dataframe and label every sequence as class 0 (BHLH):
dataset = read_fasta_file(bHLH_Validation_data, 0)

# Let's get a quick look at our dataset
dataset.head()

| | id | sequence | length | class |
|---|---|---|---|---|
| 0 | LsbHLHD2 | CTTTAGACTAAATAGTCATGAGTCAGAAAATTCGGATGCCTTGAAA... | 1099 | 0 |
| 1 | LsbHLHD3 | TCAATTTTATAACTGTGAGAGATCCAATGAAGCACACGTGCACGGA... | 1066 | 0 |
| 2 | LsbHLHD4 | ACTCAATGAATGATAAGCTGAAAAATAACATGTTACCCATTTATGA... | 2851 | 0 |
| 3 | LsbHLHD5 | TACACTTCAAGAGACAACCTCTTCAAACTAACCAAAAACACATAGA... | 2722 | 0 |
| 4 | LsbHLHD6 | CTAAGGTTCTTTTCTTCCTGCATCTATTAGATAGATACCTCAAATA... | 2335 | 0 |


* **Prepare k-mer representation for k={3, 4, 5, 6}**

In [23]:
X_test_3, y_test_3, features_list_3 = build_kmer_representation(dataset, k=3)
X_test_4, y_test_4, features_list_4 = build_kmer_representation(dataset, k=4)
X_test_5, y_test_5, features_list_5 = build_kmer_representation(dataset, k=5)
X_test_6, y_test_6, features_list_6 = build_kmer_representation(dataset, k=6)
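
Each value of k produces one feature per possible k-mer over the alphabet ACGT, so each feature matrix should have 4^k columns. The short check below is a sketch added for illustration, not part of the original notebook. It enumerates the k-mers the same way `kmer_count` does and lists the resulting sizes for the k values used here.

```python
import itertools

# Number of distinct k-mers over ACGT for each k used in this notebook
feature_dims = {}
for k in (3, 4, 5, 6):
    n_kmers = sum(1 for _ in itertools.product('ACGT', repeat=k))
    feature_dims[k] = n_kmers
    print(f"k={k}: {n_kmers} features")
```

The feature space grows by a factor of four with each increment of k, which is one reason feature selection becomes more relevant for larger k.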

* **Load best models & Plot performance**

<h4 style="background-color: #80c4e6; border-top: 4px solid #dddddd; display: flex; color: white;">
    <ul><li>Without feature selection</li></ul>
</h4>

In [24]:
# BEST MODEL WITHOUT FEATURE SELECTION
basic_k3 = joblib.load("./Output/ModelPickle/k3_[Gaussian Process]_basic.joblib")
basic_k4 = joblib.load("./Output/ModelPickle/k4_[Gaussian Process]_basic.joblib")
basic_k5 = joblib.load("./Output/ModelPickle/k5_[Gaussian Process]_basic.joblib")
basic_k6 = joblib.load("./Output/ModelPickle/k6_[Gaussian Process]_basic.joblib")

In [25]:
test_report(X_test_3, y_test_3, model=basic_k3, args=["Gaussian Process, k=3", "0.87"])

In [26]:
test_report(X_test_4, y_test_4, model=basic_k4, args=["Gaussian Process, k=4", "0.90"])

In [11]:
test_report(X_test_5, y_test_5, model=basic_k5, args=["Gaussian Process, k=5", "0.90"])

In [12]:
test_report(X_test_6, y_test_6, model=basic_k6, args=["Gaussian Process, k=6", "0.90"])

<h4 style="background-color: #80c4e6; border-top: 4px solid #dddddd; display: flex; color: white;">
    <ul><li>With feature selection: PCA</li></ul>
</h4>