<div style="width: 100%; background-color: #ddd; overflow:hidden; ">
    <div style="display: flex; justify-content: center; align-items: center; border-bottom: 10px solid #80c4e7; padding: 3px;">
        <h2 style="position: relative; top: 3px; left: 8px;">S2 Project: DNA Classification - (part2: Approach 2)</h2>
        <img style="position: absolute; height: 68px; top: -2px; right: 18px" src="./Content/Notebook-images/dna1.png"/>
    </div>
    <div style="padding: 3px 8px;">
        
1. **Description**:
   - **Idea**: k-mer Representation with Frequency Analysis
   - Break the DNA sequence into k-mers (subsequences of length k).
   - Perform frequency analysis to create a feature vector based on the occurrence of each k-mer.
   - Use this feature vector as input to the model.

2. **Pros**:
   - Captures local context within each k-mer.
   - Simplifies the input representation by reducing it to frequency counts.

3. **Cons**:
   - Loses positional information beyond the k-mer length.
   - Treating it as a [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model) model ignores the order of k-mers.

    </div>    
</div>
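
To make the last drawback concrete, here is a minimal sketch showing two different sequences that share exactly the same 2-mer counts, so a purely frequency-based representation cannot tell them apart. The helper name `kmer_multiset` and the toy sequences are illustrative assumptions, not part of the project code.

```python
from collections import Counter


def kmer_multiset(sequence, k):
    """Count every overlapping k-mer in `sequence` (hypothetical helper)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


seq_a = "AATA"  # 2-mers in order: AA, AT, TA
seq_b = "ATAA"  # 2-mers in order: AT, TA, AA

counts_a = kmer_multiset(seq_a, k=2)
counts_b = kmer_multiset(seq_b, k=2)

# The sequences differ, but their k-mer counts are identical.
print(seq_a != seq_b, counts_a == counts_b)
```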

### 1 - Importing utils
The following code cells will import necessary libraries.

In [1]:
import numpy as np
import pandas as pd

# Model
from sklearn.neighbors import KNeighborsClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.gaussian_process.kernels import RBF
from xgboost import XGBClassifier

# Metric and utils
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
from sklearn.feature_extraction import DictVectorizer

# Warning
import warnings
warnings.filterwarnings("ignore")

### 2 - Importing Dataset
The following cell reads our preprocessed **.csv file** into a pandas DataFrame.

In [2]:
dataset = pd.read_csv("./Output/Arabidopsis_thaliana_GHLH_and_CYP_gene.csv")
dataset.head()

Unnamed: 0,id,sequence,length,class
0,AT1G51140.1,AAGTTTCTCTCACGTTCTCTTTTTTAATTTTAATTTCTCGCCGGAA...,2297,0
1,AT1G73830.1,ACTTTCTATTTTCACCAATTTTCAAAAAAAAAATAAAAATTGAAAC...,1473,0
2,AT1G09530.1,AGTTACAGACGATTTGGTCCCCTCTCTTCTCTCTCTGCGTCCGTCT...,2958,0
3,AT1G49770.1,ATGACTAATGCTCAAGAGTTGGGGCAAGAGGGTTTTATGTGGGGCA...,2205,0
4,AT1G68810.1,AAACTTTTGTCTCTTTTTAACTCTCTTAACTTTCGTTTCTTCTCCT...,1998,0


### 3 - Preprocessing
Instead of taking each base as an individual feature, we transform DNA sequences using the k-mer representation, a widely adopted method in DNA sequence analysis. The k-mer approach captures richer contextual information for each nucleotide by concatenating it with its subsequent bases to form k-mers. For example, the DNA sequence "ATGCCA" can be transformed into four 3-mers: "ATG, TGC, GCC, CCA", or into three 4-mers: "ATGC, TGCC, GCCA". In our experiments, we will try the following k-mer lengths: **3, 4, 5, and 6**.
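
As a small illustration of the paragraph above, the sketch below enumerates the overlapping k-mers of the example sequence and computes how many distinct k-mers are possible over the four-letter DNA alphabet. A sequence of length L yields L - k + 1 overlapping k-mers, and at most 4^k distinct k-mers can occur, which bounds the number of feature columns for each k. The helper name `sliding_kmers` is an assumption made for this example.

```python
def sliding_kmers(sequence, k, step=1):
    """Return the k-mers obtained by sliding a window of length k (hypothetical helper)."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, step)]


example = "ATGCCA"
three_mers = sliding_kmers(example, k=3)
four_mers = sliding_kmers(example, k=4)
print(three_mers)  # ['ATG', 'TGC', 'GCC', 'CCA']
print(four_mers)   # ['ATGC', 'TGCC', 'GCCA']

# Upper bound on the number of distinct k-mers (and thus feature columns) per k.
vocabulary_sizes = {k: 4 ** k for k in (3, 4, 5, 6)}
print(vocabulary_sizes)  # {3: 64, 4: 256, 5: 1024, 6: 4096}
```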

In [3]:
# Utils: count k-mer occurrences in a DNA sequence and convert them to relative frequencies

def kmer_count(sequence, k=3, step=1):
    kmers_count = {}
    s = 0
    for i in range(0, len(sequence) - k + 1, step):
        kmer = sequence[i:i + k]
        s += 1
        if kmer in kmers_count:
            kmers_count[kmer] += 1
        else:
            kmers_count[kmer] = 1
    for key, value in kmers_count.items():
        kmers_count[key] = value / s

    return kmers_count

### 4 - Training and Testing

<h4 style="background-color: #80c4e6; display: flex;">
    <ul><li>k=3</li></ul>
</h4>

In [4]:
k = 3
sequences   = dataset['sequence']
kmers_count = []
for i in range(len(sequences)):
    kmers_count.append(kmer_count(sequences[i], k=k, step=1))

In [5]:
v = DictVectorizer(sparse=False)
feature_values = v.fit_transform(kmers_count)
feature_names = v.get_feature_names_out()
X_3 = pd.DataFrame(feature_values, columns=feature_names)
X_3.head()

Unnamed: 0,AAA,AAC,AAG,AAT,ACA,ACC,ACG,ACT,AGA,AGC,...,TCG,TCT,TGA,TGC,TGG,TGT,TTA,TTC,TTG,TTT
0,0.028758,0.018736,0.028322,0.024401,0.013508,0.008715,0.005664,0.018301,0.02963,0.012636,...,0.010022,0.025272,0.026144,0.013072,0.014815,0.016993,0.018301,0.024837,0.026144,0.043573
1,0.042148,0.015636,0.020394,0.040789,0.014956,0.004759,0.005438,0.015636,0.031271,0.007478,...,0.010197,0.03535,0.019714,0.008158,0.010877,0.010197,0.031271,0.03603,0.012916,0.062542
2,0.030108,0.019959,0.025034,0.021313,0.013532,0.008119,0.007442,0.016915,0.024357,0.011502,...,0.010487,0.032476,0.019959,0.015223,0.017591,0.022327,0.016238,0.02977,0.026725,0.046008
3,0.037222,0.016795,0.019065,0.029051,0.018157,0.009532,0.005447,0.011802,0.022696,0.006809,...,0.005901,0.024966,0.019065,0.01271,0.011348,0.020881,0.031321,0.027236,0.019519,0.059464
4,0.034068,0.019539,0.033567,0.018537,0.022044,0.011523,0.009519,0.012024,0.036072,0.014028,...,0.009018,0.028056,0.017535,0.006012,0.013026,0.019539,0.018537,0.019539,0.020541,0.049098


In [6]:
y = dataset['class']
y.head()

0    0
1    0
2    0
3    0
4    0
Name: class, dtype: int64

In [7]:
# Split data
X_3_train, X_3_test, y_train, y_test = train_test_split(X_3, y, train_size=0.8, stratify=y)

print("Shapes of train/test splits:")
print("X_train:", X_3_train.shape)
print("X_test:", X_3_test.shape)
print("y_train:", y_train.shape)
print("y_test:", y_test.shape)

Shapes of train/test splits:
X_train: (304, 64)
X_test: (76, 64)
y_train: (304,)
y_test: (76,)
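
Because `train_test_split` is called without a seed, every run of the split cell produces a different random partition, so results obtained from separate splits are not strictly comparable. One possible way to get a like-for-like comparison, sketched here on toy arrays, is to split the row indices once with a fixed `random_state` and reuse them for every feature matrix. The seed value 42 and the arrays `X_a`, `X_b`, `y_toy` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for two feature matrices built from the same 20 sequences.
y_toy = np.array([0] * 10 + [1] * 10)
X_a = np.random.default_rng(0).random((20, 4))
X_b = np.random.default_rng(1).random((20, 16))

# Split the row indices once, stratified by class, with a fixed seed.
indices = np.arange(len(y_toy))
train_idx, test_idx = train_test_split(
    indices, train_size=0.8, stratify=y_toy, random_state=42
)

# Reuse the same rows for every representation.
X_a_train, X_a_test = X_a[train_idx], X_a[test_idx]
X_b_train, X_b_test = X_b[train_idx], X_b[test_idx]
y_train_toy, y_test_toy = y_toy[train_idx], y_toy[test_idx]
print(X_a_train.shape, X_b_train.shape, y_test_toy.shape)
```

With pandas DataFrames, the same indices would be applied through `.iloc[train_idx]` and `.iloc[test_idx]`.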


* Let's tune our different models

In [8]:
# Define the hyperparameter grids
param_grids = {
    "Nearest Neighbors": {'n_neighbors': [3, 5, 7]},
    "Gaussian Process": {'kernel': [1.0 * RBF(1.0), 1.0 * RBF(0.5), 1.0 * RBF(2.0)]},
    "Random Forest": {'max_depth': [3, 5, 7], 'n_estimators': [10, 50, 100], 'max_features': [1, 2, 3]},
    "Neural Net": {'alpha': [0.0001, 0.001, 0.01]},
    "AdaBoost": {'n_estimators': [50, 100, 200]},
    "Naive Bayes": {},
    "SVM Linear": {'C': [0.1, 1, 10]},
    "SVM RBF": {'C': [0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1]},
    "MultinomialNB": {'alpha': [0.01, 0.001, 0.1, 1, 10]},
    "XGBClassifier": {'n_estimators':[2, 3, 5, 10, 20, 100, 200], 'max_depth':[2, 3, 5, 7]}
}

# Models
names = ["Nearest Neighbors", "XGBClassifier", "Gaussian Process", "Random Forest", "Neural Net", "AdaBoost", "Naive Bayes", "SVM Linear", "SVM RBF", "MultinomialNB"]
classifiers = [
    KNeighborsClassifier(),
    XGBClassifier(objective='binary:logistic'),
    GaussianProcessClassifier(),
    RandomForestClassifier(),
    MLPClassifier(max_iter=10000, early_stopping=False),
    AdaBoostClassifier(),
    GaussianNB(),
    SVC(kernel='linear'),
    SVC(kernel='rbf'),
    MultinomialNB(),
]
models = list(zip(names, classifiers))  # a list, so the training cell can be re-run

In [9]:
results = []
names   = []
best_parameters = []

for name, model in models:
    print(f"Processing {name}...")
    param_grid = param_grids[name]
    kfold = KFold(n_splits=10, random_state=42, shuffle=True)
    
    # Perform grid search & Get the best model
    grid_search = GridSearchCV(model, param_grid, cv=kfold, scoring='accuracy', n_jobs=-1)
    grid_search.fit(X_3_train, y_train)
    best_model = grid_search.best_estimator_
    best_params = grid_search.best_params_
    best_parameters.append((name, best_params))
    print('Best params found: ', best_params)
    
    # Cross-validation results
    cv_results = cross_val_score(best_model, X_3_train, y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    msg = "[Train] - '{}' - acc: {} ±({})".format(name, cv_results.mean(), cv_results.std())
    print("{}\n{}\n{}".format('-'*80, msg, '-'*80))
    
    # Fit the best model
    best_model.fit(X_3_train, y_train)
    
    # Make predictions and print test results
    predictions = best_model.predict(X_3_test)
    print("[Test]  - '{}' - acc : {}".format(name, accuracy_score(y_test, predictions)))
    print(classification_report(y_test, predictions), end="\n")

Processing Nearest Neighbors...
Best params found:  {'n_neighbors': 3}
--------------------------------------------------------------------------------
[Train] - 'Nearest Neighbors' - acc: 0.7996774193548387 ±(0.04596582401834304)
--------------------------------------------------------------------------------
[Test]  - 'Nearest Neighbors' - acc : 0.8289473684210527
              precision    recall  f1-score   support

           0       0.91      0.66      0.76        32
           1       0.79      0.95      0.87        44

    accuracy                           0.83        76
   macro avg       0.85      0.81      0.81        76
weighted avg       0.84      0.83      0.82        76

Processing XGBClassifier...
Best params found:  {'max_depth': 7, 'n_estimators': 100}
--------------------------------------------------------------------------------
[Train] - 'XGBClassifier' - acc: 0.8490322580645161 ±(0.06368315313742923)
--------------------------------------------------------------

<h4 style="background-color: #80c4e6; display: flex;">
    <ul><li>k=4</li></ul>
</h4>

In [10]:
k = 4
sequences   = dataset['sequence']
kmers_count = []
for i in range(len(sequences)):
    kmers_count.append(kmer_count(sequences[i], k=k, step=1))

In [11]:
v = DictVectorizer(sparse=False)
feature_values = v.fit_transform(kmers_count)
feature_names = v.get_feature_names_out()
X_4 = pd.DataFrame(feature_values, columns=feature_names)
X_4.head()

Unnamed: 0,AAAA,AAAC,AAAG,AAAT,AACA,AACC,AACG,AACT,AAGA,AAGC,...,TTCG,TTCT,TTGA,TTGC,TTGG,TTGT,TTTA,TTTC,TTTG,TTTT
0,0.008282,0.008718,0.006975,0.004795,0.006103,0.003487,0.001744,0.007411,0.011334,0.005667,...,0.003923,0.00959,0.007411,0.004795,0.006103,0.007847,0.006975,0.008718,0.010026,0.017873
1,0.019728,0.004762,0.002721,0.014966,0.004082,0.002041,0.002041,0.007483,0.010884,0.002041,...,0.003401,0.017007,0.005442,0.002721,0.002041,0.002721,0.010884,0.019048,0.006803,0.02585
2,0.007783,0.007783,0.010152,0.004399,0.007107,0.003723,0.002707,0.00643,0.010829,0.005076,...,0.004061,0.013875,0.008122,0.004399,0.005076,0.009137,0.007107,0.010491,0.011168,0.017259
3,0.015441,0.00545,0.006812,0.009537,0.006358,0.003633,0.001362,0.00545,0.009083,0.003179,...,0.002271,0.014078,0.004087,0.004541,0.004087,0.006812,0.01317,0.012262,0.009083,0.024977
4,0.011529,0.006015,0.010025,0.006516,0.009023,0.002506,0.003509,0.004511,0.01604,0.007519,...,0.002005,0.011028,0.004511,0.002005,0.006015,0.00802,0.007018,0.008521,0.011028,0.022556


In [12]:
# Split data
X_4_train, X_4_test, y_train, y_test = train_test_split(X_4, y, train_size=0.8, stratify=y)

print("Shapes of train/test splits:")
print("X_train:", X_4_train.shape)
print("X_test:", X_4_test.shape)
print("y_train:", y_train.shape)
print("y_test:", y_test.shape)

Shapes of train/test splits:
X_train: (304, 256)
X_test: (76, 256)
y_train: (304,)
y_test: (76,)


* Let's tune our different models

In [13]:
# Define the hyperparameter grids
param_grids = {
    "Nearest Neighbors": {'n_neighbors': [3, 5, 7]},
    "Gaussian Process": {'kernel': [1.0 * RBF(1.0), 1.0 * RBF(0.5), 1.0 * RBF(2.0)]},
    "Random Forest": {'max_depth': [3, 5, 7], 'n_estimators': [10, 50, 100], 'max_features': [1, 2, 3]},
    "Neural Net": {'alpha': [0.0001, 0.001, 0.01]},
    "AdaBoost": {'n_estimators': [50, 100, 200]},
    "Naive Bayes": {},
    "SVM Linear": {'C': [0.1, 1, 10]},
    "SVM RBF": {'C': [0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1]},
    "MultinomialNB": {'alpha': [0.01, 0.001, 0.1, 1, 10]},
    "XGBClassifier": {'n_estimators':[2, 3, 5, 10, 20, 100, 200], 'max_depth':[2, 3, 5, 7]}
}

# Models
names = ["Nearest Neighbors", "XGBClassifier", "Gaussian Process", "Random Forest", "Neural Net", "AdaBoost", "Naive Bayes", "SVM Linear", "SVM RBF", "MultinomialNB"]
classifiers = [
    KNeighborsClassifier(),
    XGBClassifier(objective='binary:logistic'),
    GaussianProcessClassifier(),
    RandomForestClassifier(),
    MLPClassifier(max_iter=10000, early_stopping=False),
    AdaBoostClassifier(),
    GaussianNB(),
    SVC(kernel='linear'),
    SVC(kernel='rbf'),
    MultinomialNB()
]
models = list(zip(names, classifiers))  # a list, so the training cell can be re-run

In [15]:
results = []
names   = []
best_parameters = []

for name, model in models:
    print(f"Processing {name}...")
    param_grid = param_grids[name]
    kfold = KFold(n_splits=10, random_state=42, shuffle=True)
    
    # Perform grid search & Get the best model
    grid_search = GridSearchCV(model, param_grid, cv=kfold, scoring='accuracy', n_jobs=-1)
    grid_search.fit(X_4_train, y_train)
    best_model = grid_search.best_estimator_
    best_params = grid_search.best_params_
    best_parameters.append((name, best_params))
    print('Best params found: ', best_params)
    
    # Cross-validation results
    cv_results = cross_val_score(best_model, X_4_train, y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    msg = "[Train] - '{}' - acc: {} ±({})".format(name, cv_results.mean(), cv_results.std())
    print("{}\n{}\n{}".format('-'*80, msg, '-'*80))
    
    # Fit the best model
    best_model.fit(X_4_train, y_train)
    
    # Make predictions and print test results
    predictions = best_model.predict(X_4_test)
    print("[Test]  - '{}' - acc : {}".format(name, accuracy_score(y_test, predictions)))
    print(classification_report(y_test, predictions), end="\n")

Processing XGBClassifier...
Best params found:  {'max_depth': 3, 'n_estimators': 100}
--------------------------------------------------------------------------------
[Train] - 'XGBClassifier' - acc: 0.8882795698924731 ±(0.03658901372735434)
--------------------------------------------------------------------------------
[Test]  - 'XGBClassifier' - acc : 0.881578947368421
              precision    recall  f1-score   support

           0       0.90      0.81      0.85        32
           1       0.87      0.93      0.90        44

    accuracy                           0.88        76
   macro avg       0.88      0.87      0.88        76
weighted avg       0.88      0.88      0.88        76

Processing Gaussian Process...
Best params found:  {'kernel': 1**2 * RBF(length_scale=1)}
--------------------------------------------------------------------------------
[Train] - 'Gaussian Process' - acc: 0.9146236559139785 ±(0.04180901801937289)
-------------------------------------------------

<h4 style="background-color: #80c4e6; display: flex;">
    <ul><li>k=5</li></ul>
</h4>

In [16]:
k = 5
sequences   = dataset['sequence']
kmers_count = []
for i in range(len(sequences)):
    kmers_count.append(kmer_count(sequences[i], k=k, step=1))

In [17]:
v = DictVectorizer(sparse=False)
feature_values = v.fit_transform(kmers_count)
feature_names = v.get_feature_names_out()
X_5 = pd.DataFrame(feature_values, columns=feature_names)
X_5.head()

Unnamed: 0,AAAAA,AAAAC,AAAAG,AAAAT,AAACA,AAACC,AAACG,AAACT,AAAGA,AAAGC,...,TTTCG,TTTCT,TTTGA,TTTGC,TTTGG,TTTGT,TTTTA,TTTTC,TTTTG,TTTTT
0,0.003053,0.000872,0.003053,0.001308,0.002617,0.002181,0.000436,0.003489,0.002181,0.001744,...,0.001308,0.003925,0.001744,0.002181,0.003925,0.002181,0.004361,0.002617,0.005233,0.005669
1,0.00885,0.001361,0.000681,0.00885,0.001361,0.000681,0.000681,0.002042,0.002042,0.0,...,0.000681,0.00953,0.001361,0.002042,0.001361,0.002042,0.004765,0.007488,0.003404,0.010211
2,0.002031,0.001354,0.00237,0.002031,0.00237,0.001693,0.000339,0.003385,0.004401,0.001693,...,0.000677,0.006432,0.003724,0.000677,0.00237,0.004401,0.00237,0.003724,0.004739,0.006432
3,0.008632,0.001363,0.003635,0.001817,0.001817,0.001817,0.000454,0.001363,0.004089,0.000909,...,0.001363,0.006361,0.001817,0.003635,0.001363,0.002272,0.007269,0.006361,0.002272,0.009087
4,0.004514,0.003009,0.002006,0.002006,0.003009,0.000502,0.000502,0.002006,0.003009,0.004012,...,0.001003,0.004514,0.001505,0.001003,0.004012,0.004514,0.004514,0.003009,0.004012,0.011033


In [19]:
# Split data
X_5_train, X_5_test, y_train, y_test = train_test_split(X_5, y, train_size=0.8, stratify=y)

print("Shapes of train/test splits:")
print("X_train:", X_5_train.shape)
print("X_test:", X_5_test.shape)
print("y_train:", y_train.shape)
print("y_test:", y_test.shape)

Shapes of train/test splits:
X_train: (304, 1024)
X_test: (76, 1024)
y_train: (304,)
y_test: (76,)


* Let's tune our different models

In [20]:
# Define the hyperparameter grids
param_grids = {
    "Nearest Neighbors": {'n_neighbors': [3, 5, 7]},
    "Gaussian Process": {'kernel': [1.0 * RBF(1.0), 1.0 * RBF(0.5), 1.0 * RBF(2.0)]},
    "Random Forest": {'max_depth': [3, 5, 7], 'n_estimators': [10, 50, 100], 'max_features': [1, 2, 3]},
    "Neural Net": {'alpha': [0.0001, 0.001, 0.01]},
    "AdaBoost": {'n_estimators': [50, 100, 200]},
    "Naive Bayes": {},
    "SVM Linear": {'C': [0.1, 1, 10]},
    "SVM RBF": {'C': [0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1]},
    "MultinomialNB": {'alpha': [0.01, 0.001, 0.1, 1, 10]},
    "XGBClassifier": {'n_estimators':[2, 3, 5, 10, 20, 100, 200], 'max_depth':[2, 3, 5, 7]} 
}

# Models
names = ["Nearest Neighbors", "XGBClassifier", "Gaussian Process", "Random Forest", "Neural Net", "AdaBoost", "Naive Bayes", "SVM Linear", "SVM RBF"]
classifiers = [
    KNeighborsClassifier(),
    XGBClassifier(objective='binary:logistic'),
    GaussianProcessClassifier(),
    RandomForestClassifier(),
    MLPClassifier(max_iter=10000, early_stopping=False),
    AdaBoostClassifier(),
    GaussianNB(),
    SVC(kernel='linear'),
    SVC(kernel='rbf')
]
models = list(zip(names, classifiers))  # a list, so the training cell can be re-run

In [21]:
results = []
names   = []
best_parameters = []

for name, model in models:
    print(f"Processing {name}...")
    param_grid = param_grids[name]
    kfold = KFold(n_splits=10, random_state=42, shuffle=True)
    
    # Perform grid search & Get the best model
    grid_search = GridSearchCV(model, param_grid, cv=kfold, scoring='accuracy', n_jobs=-1)
    grid_search.fit(X_5_train, y_train)
    best_model = grid_search.best_estimator_
    best_params = grid_search.best_params_
    best_parameters.append((name, best_params))
    print('Best params found: ', best_params)
    
    # Cross-validation results
    cv_results = cross_val_score(best_model, X_5_train, y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    msg = "[Train] - '{}' - acc: {} ±({})".format(name, cv_results.mean(), cv_results.std())
    print("{}\n{}\n{}".format('-'*80, msg, '-'*80))
    
    # Fit the best model
    best_model.fit(X_5_train, y_train)
    
    # Make predictions and print test results
    predictions = best_model.predict(X_5_test)
    print("[Test]  - '{}' - acc : {}".format(name, accuracy_score(y_test, predictions)))
    print(classification_report(y_test, predictions), end="\n")

Processing Nearest Neighbors...
Best params found:  {'n_neighbors': 5}
--------------------------------------------------------------------------------
[Train] - 'Nearest Neighbors' - acc: 0.7863440860215053 ±(0.08437270656518296)
--------------------------------------------------------------------------------
[Test]  - 'Nearest Neighbors' - acc : 0.8026315789473685
              precision    recall  f1-score   support

           0       0.70      0.94      0.80        32
           1       0.94      0.70      0.81        44

    accuracy                           0.80        76
   macro avg       0.82      0.82      0.80        76
weighted avg       0.84      0.80      0.80        76

Processing XGBClassifier...
Best params found:  {'max_depth': 2, 'n_estimators': 100}
--------------------------------------------------------------------------------
[Train] - 'XGBClassifier' - acc: 0.881505376344086 ±(0.053432962788091426)
--------------------------------------------------------------

<h4 style="background-color: #80c4e6; display: flex;">
    <ul><li>k=6</li></ul>
</h4>

In [22]:
k = 6
sequences   = dataset['sequence']
kmers_count = []
for i in range(len(sequences)):
    kmers_count.append(kmer_count(sequences[i], k=k, step=1))

In [23]:
v = DictVectorizer(sparse=False)
feature_values = v.fit_transform(kmers_count)
feature_names = v.get_feature_names_out()
X_6 = pd.DataFrame(feature_values, columns=feature_names)
X_6.head()

Unnamed: 0,AAAAAA,AAAAAC,AAAAAG,AAAAAT,AAAACA,AAAACC,AAAACG,AAAACT,AAAAGA,AAAAGC,...,TTTTCG,TTTTCT,TTTTGA,TTTTGC,TTTTGG,TTTTGT,TTTTTA,TTTTTC,TTTTTG,TTTTTT
0,0.002182,0.0,0.000436,0.000436,0.000436,0.000436,0.0,0.0,0.000436,0.000873,...,0.000436,0.001745,0.000873,0.000873,0.002182,0.001309,0.001309,0.000873,0.001745,0.001745
1,0.00545,0.0,0.000681,0.002725,0.000681,0.0,0.0,0.000681,0.000681,0.0,...,0.0,0.004087,0.0,0.001362,0.000681,0.001362,0.002044,0.001362,0.002044,0.004768
2,0.000339,0.000339,0.000339,0.001016,0.000677,0.0,0.000339,0.000339,0.001355,0.000339,...,0.000339,0.00237,0.001693,0.000339,0.000339,0.00237,0.001355,0.001016,0.002032,0.002032
3,0.005,0.001364,0.001818,0.000455,0.000455,0.000455,0.0,0.000455,0.001818,0.000455,...,0.000455,0.002727,0.0,0.000909,0.000455,0.000909,0.002727,0.001818,0.000455,0.004091
4,0.001004,0.001505,0.001505,0.000502,0.002007,0.0,0.000502,0.000502,0.0,0.001004,...,0.0,0.002509,0.000502,0.0,0.001505,0.002007,0.003512,0.001004,0.000502,0.006021


In [24]:
# Split data
X_6_train, X_6_test, y_train, y_test = train_test_split(X_6, y, train_size=0.8, stratify=y)

print("Shapes of train/test splits:")
print("X_train:", X_6_train.shape)
print("X_test:", X_6_test.shape)
print("y_train:", y_train.shape)
print("y_test:", y_test.shape)

Shapes of train/test splits:
X_train: (304, 4096)
X_test: (76, 4096)
y_train: (304,)
y_test: (76,)


* Let's tune our different models

In [25]:
# Define the hyperparameter grids
param_grids = {
    "Nearest Neighbors": {'n_neighbors': [3, 5, 7]},
    "Gaussian Process": {'kernel': [1.0 * RBF(1.0), 1.0 * RBF(0.5), 1.0 * RBF(2.0)]},
    "Random Forest": {'max_depth': [3, 5, 7], 'n_estimators': [10, 50, 100], 'max_features': [1, 2, 3]},
    "Neural Net": {'alpha': [0.0001, 0.001, 0.01]},
    "AdaBoost": {'n_estimators': [50, 100, 200]},
    "Naive Bayes": {},
    "SVM Linear": {'C': [0.1, 1, 10]},
    "SVM RBF": {'C': [0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1]},
    "XGBClassifier": {'n_estimators':[2, 3, 5, 10, 20, 100, 200], 'max_depth':[2, 3, 5, 7]}  
}

# Models
names = ["Nearest Neighbors", "XGBClassifier", "Gaussian Process", "Random Forest", "Neural Net", "AdaBoost", "Naive Bayes", "SVM Linear", "SVM RBF"]
classifiers = [
    KNeighborsClassifier(),
    XGBClassifier(objective='binary:logistic'),
    GaussianProcessClassifier(),
    RandomForestClassifier(),
    MLPClassifier(max_iter=10000, early_stopping=False),
    AdaBoostClassifier(),
    GaussianNB(),
    SVC(kernel='linear'),
    SVC(kernel='rbf')
]
models = list(zip(names, classifiers))  # a list, so the training cell can be re-run

In [26]:
results = []
names   = []
best_parameters = []

for name, model in models:
    print(f"Processing {name}...")
    param_grid = param_grids[name]
    kfold = KFold(n_splits=10, random_state=42, shuffle=True)
    
    # Perform grid search & Get the best model
    grid_search = GridSearchCV(model, param_grid, cv=kfold, scoring='accuracy', n_jobs=-1)
    grid_search.fit(X_6_train, y_train)
    best_model = grid_search.best_estimator_
    best_params = grid_search.best_params_
    best_parameters.append((name, best_params))
    print('Best params found: ', best_params)
    
    # Cross-validation results
    cv_results = cross_val_score(best_model, X_6_train, y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    msg = "[Train] - '{}' - acc: {} ±({})".format(name, cv_results.mean(), cv_results.std())
    print("{}\n{}\n{}".format('-'*80, msg, '-'*80))
    
    # Fit the best model
    best_model.fit(X_6_train, y_train)
    
    # Make predictions and print test results
    predictions = best_model.predict(X_6_test)
    print("[Test]  - '{}' - acc : {}".format(name, accuracy_score(y_test, predictions)))
    print(classification_report(y_test, predictions), end="\n")

Processing Nearest Neighbors...
Best params found:  {'n_neighbors': 5}
--------------------------------------------------------------------------------
[Train] - 'Nearest Neighbors' - acc: 0.7662365591397851 ±(0.054324385500287224)
--------------------------------------------------------------------------------
[Test]  - 'Nearest Neighbors' - acc : 0.7894736842105263
              precision    recall  f1-score   support

           0       0.81      0.66      0.72        32
           1       0.78      0.89      0.83        44

    accuracy                           0.79        76
   macro avg       0.79      0.77      0.78        76
weighted avg       0.79      0.79      0.79        76

Processing XGBClassifier...
Best params found:  {'max_depth': 2, 'n_estimators': 200}
--------------------------------------------------------------------------------
[Train] - 'XGBClassifier' - acc: 0.8981720430107527 ±(0.0516376508769629)
--------------------------------------------------------------

<h4 style="background-color: #80c4e6; border-top: 4px solid #dddddd; display: flex; color: white;">
    <ul><li>Testing various levels of granularity: k=3 & k=6</li></ul>
</h4>

* I want to test this feature selection method to keep only the relevant features for the k-mer frequency approach:
* https://link-springer-com.eressources.um6p.ma/chapter/10.1007/978-3-319-24462-4_9
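
The method from the linked chapter is not reproduced here. As a generic stand-in, the sketch below uses scikit-learn's `SelectKBest` with the chi-squared score, which accepts non-negative features such as k-mer frequencies, placed inside a `Pipeline` so that the selection is refit within each cross-validation fold. The toy data, the choice of 20 features, and the KNN classifier are all assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Toy non-negative features: 60 samples, 200 columns, first 5 made informative.
rng = np.random.default_rng(0)
X_toy = rng.random((60, 200))
y_toy = np.array([0] * 30 + [1] * 30)
X_toy[y_toy == 1, :5] += 1.0

# Selection lives inside the pipeline, so each CV fold selects on its own training part.
pipeline = Pipeline([
    ("select", SelectKBest(score_func=chi2, k=20)),
    ("clf", KNeighborsClassifier(n_neighbors=3)),
])
scores = cross_val_score(pipeline, X_toy, y_toy, cv=5, scoring="accuracy")
print("CV accuracy:", scores.mean())

# Inspect which columns survive when fitting on all toy data.
pipeline.fit(X_toy, y_toy)
selected = pipeline.named_steps["select"].get_support(indices=True)
print("Selected columns:", selected)
```

Keeping the selector inside the pipeline avoids choosing features with knowledge of the validation folds, which would otherwise inflate the cross-validated accuracy.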

In [59]:
# Combine X_3 and X_6, then split the data

X_comb_36 = pd.concat([X_3, X_6], axis=1)
X_36_train, X_36_test, y_train, y_test = train_test_split(X_comb_36, y, train_size=0.8, stratify=y)
X_36_train.head()

Unnamed: 0,AAA,AAC,AAG,AAT,ACA,ACC,ACG,ACT,AGA,AGC,...,TTTTCG,TTTTCT,TTTTGA,TTTTGC,TTTTGG,TTTTGT,TTTTTA,TTTTTC,TTTTTG,TTTTTT
292,0.023613,0.0183,0.022432,0.01889,0.011806,0.007084,0.014168,0.011806,0.028926,0.012397,...,0.000591,0.001183,0.0,0.0,0.000591,0.0,0.000591,0.000591,0.0,0.0
122,0.035915,0.016835,0.029742,0.027497,0.013468,0.007295,0.008979,0.015713,0.035915,0.010662,...,0.0,0.000562,0.001124,0.001686,0.0,0.000562,0.001124,0.0,0.001124,0.001124
277,0.033218,0.022837,0.022837,0.015917,0.021453,0.017993,0.005536,0.017301,0.032526,0.013149,...,0.0,0.00208,0.0,0.0,0.0,0.0,0.000693,0.0,0.0,0.0
191,0.026626,0.015277,0.022698,0.024443,0.016587,0.013531,0.019206,0.010039,0.021388,0.011785,...,0.0,0.002185,0.0,0.000437,0.0,0.000437,0.000437,0.001311,0.000874,0.002185
246,0.030612,0.022809,0.030612,0.02401,0.022809,0.009604,0.008403,0.015006,0.02581,0.012605,...,0.0,0.001203,0.001804,0.0,0.0,0.001203,0.0,0.000601,0.000601,0.0


In [55]:
# Ensure the feature names are consistent between train and test sets
X_36_test = X_36_test.reindex(columns=X_36_train.columns, fill_value=0)

In [56]:
# Define the hyperparameter grids
param_grids = {
    "Gaussian Process": {'kernel': [1.0 * RBF(1.0), 1.0 * RBF(0.5), 1.0 * RBF(2.0)]},
}

# Models
names = ["Gaussian Process"]
classifiers = [
    GaussianProcessClassifier(),
]
models = list(zip(names, classifiers))  # a list, so the training cell can be re-run

In [57]:
results = []
names   = []
best_parameters = []

for name, model in models:
    print(f"Processing {name}...")
    param_grid = param_grids[name]
    kfold = KFold(n_splits=10, random_state=42, shuffle=True)
    
    # Perform grid search & Get the best model
    grid_search = GridSearchCV(model, param_grid, cv=kfold, scoring='accuracy', n_jobs=-1)
    grid_search.fit(X_36_train, y_train)
    best_model = grid_search.best_estimator_
    best_params = grid_search.best_params_
    best_parameters.append((name, best_params))
    print('Best params found: ', best_params)
    
    # Cross-validation results
    cv_results = cross_val_score(best_model, X_36_train, y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    msg = "[Train] - '{}' - acc: {} ±({})".format(name, cv_results.mean(), cv_results.std())
    print("{}\n{}\n{}".format('-'*80, msg, '-'*80))
    
    # Fit the best model on the combined k=3 + k=6 features (the same columns used at predict time)
    best_model.fit(X_36_train, y_train)
    
    # Make predictions and print test results
    predictions = best_model.predict(X_36_test)
    print("[Test]  - '{}' - acc : {}".format(name, accuracy_score(y_test, predictions)))
    print(classification_report(y_test, predictions), end="\n")

Processing Gaussian Process...
Best params found:  {'kernel': 1**2 * RBF(length_scale=1)}
--------------------------------------------------------------------------------
[Train] - 'Gaussian Process' - acc: 0.8884946236559139 ±(0.05022564415965383)
--------------------------------------------------------------------------------
