<a href="https://colab.research.google.com/github/Megha-178/DataScienceEcosystem./blob/main/Machine_Learning_DNA_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DNA Classification
K-nearest neighbors, Markov models, support vector machines

The dataset comes from the UCI Machine Learning Repository (Molecular Biology, Promoter Gene Sequences). It contains 106 DNA sequences, each made up of 57 sequential nucleotides.

This project focused on developing and assessing machine learning models for a classification task on the dataset above. It began with data preprocessing: stripping tab characters from the raw sequences and one-hot encoding the categorical nucleotides. It then moved to model selection across several algorithms: K-Nearest Neighbors, Gaussian Process, Decision Tree, Random Forest, Neural Network, AdaBoost, Gaussian Naive Bayes, and Support Vector Machines with different kernels. Each model was evaluated with 10-fold cross-validation on the training data, and accuracy, precision, recall, and F1-score were then examined on the held-out test set. SVM with a linear kernel and Gaussian Naive Bayes achieved the best test-set results across these metrics. The findings underline the role of algorithm selection and parameter tuning in model performance, and the need to weigh model complexity, interpretability, and performance trade-offs when choosing a model for a classification task.

In [None]:
import sys
import numpy as np
import sklearn
import pandas as pd


In [None]:
url= 'https://archive.ics.uci.edu/ml/machine-learning-databases/molecular-biology/promoter-gene-sequences/promoters.data'
names= ['class','id','sequence']
data=pd.read_csv(url,names=names)

In [None]:
print(data.iloc[0]) #iloc is used to access a group of rows /columns by integer values

class                                                       +
id                                                        S10
sequence    \t\ttactagcaatacgcttgcgttcggtggttaagtatgtataat...
Name: 0, dtype: object


Generate a list of DNA sequences, then loop through the sequences and split each one into individual nucleotides.

In [None]:
#generating the list of DNA Sequences
sequences=list(data.loc[:, 'sequence'])
dataset={}

In [None]:
classes=data.loc[:,'class']
print(classes[:5])

0    +
1    +
2    +
3    +
4    +
Name: class, dtype: object


t = thymine, a = adenine, g = guanine, c = cytosine

In [None]:
#looping through sequences and split the individual nucleotides
for i, seq in enumerate(sequences):
  nucleotides=list(seq)  #define nucleotides

  #split nucleotides and remove the tab characters
  nucleotides=[x for x in nucleotides if x != '\t' ]

  #append class
  nucleotides.append(classes[i])

  #add to dataset
  dataset[i]=nucleotides

print(dataset[0])


['t', 'a', 'c', 't', 'a', 'g', 'c', 'a', 'a', 't', 'a', 'c', 'g', 'c', 't', 't', 'g', 'c', 'g', 't', 't', 'c', 'g', 'g', 't', 'g', 'g', 't', 't', 'a', 'a', 'g', 't', 'a', 't', 'g', 't', 'a', 't', 'a', 'a', 't', 'g', 'c', 'g', 'c', 'g', 'g', 'g', 'c', 't', 't', 'g', 't', 'c', 'g', 't', '+']


In [None]:
dframe=pd.DataFrame(dataset)  #[58 rows x 106 columns]
print(dframe)


In [None]:
#switch the rows and columns using the transpose function and show the first 5 instances
df=dframe.transpose()
print(df.iloc[:5])

  0  1  2  3  4  5  6  7  8  9   ... 48 49 50 51 52 53 54 55 56 57
0  t  a  c  t  a  g  c  a  a  t  ...  g  c  t  t  g  t  c  g  t  +
1  t  g  c  t  a  t  c  c  t  g  ...  c  a  t  c  g  c  c  a  a  +
2  g  t  a  c  t  a  g  a  g  a  ...  c  a  c  c  c  g  g  c  g  +
3  a  a  t  t  g  t  g  a  t  g  ...  a  a  c  a  a  a  c  t  c  +
4  t  c  g  a  t  a  a  t  t  a  ...  c  c  g  t  g  g  t  a  g  +

[5 rows x 58 columns]


In [None]:
#Rename the last column as class
df.rename(columns={57:'class'},inplace=True)
print(df.iloc[:5])

   0  1  2  3  4  5  6  7  8  9  ... 48 49 50 51 52 53 54 55 56 class
0  t  a  c  t  a  g  c  a  a  t  ...  g  c  t  t  g  t  c  g  t     +
1  t  g  c  t  a  t  c  c  t  g  ...  c  a  t  c  g  c  c  a  a     +
2  g  t  a  c  t  a  g  a  g  a  ...  c  a  c  c  c  g  g  c  g     +
3  a  a  t  t  g  t  g  a  t  g  ...  a  a  c  a  a  a  c  t  c     +
4  t  c  g  a  t  a  a  t  t  a  ...  c  c  g  t  g  g  t  a  g     +

[5 rows x 58 columns]


In [None]:
df.describe() #53 promoters and 53 non promoters, convert them to numeric values

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,class
count,106,106,106,106,106,106,106,106,106,106,...,106,106,106,106,106,106,106,106,106,106
unique,4,4,4,4,4,4,4,4,4,4,...,4,4,4,4,4,4,4,4,4,2
top,t,a,a,c,a,a,a,a,a,a,...,c,c,c,t,t,c,c,c,t,+
freq,38,34,30,30,36,42,38,34,33,36,...,36,42,31,33,35,32,29,29,34,53


In [None]:
series=[]
for name in df.columns:
  series.append(df[name].value_counts())
info=pd.DataFrame(series)
details=info.transpose()
print(details)

   count  count  count  count  count  count  count  count  count  count  ...  \
t   38.0   26.0   27.0   26.0   22.0   24.0   30.0   32.0   32.0   28.0  ...   
c   27.0   22.0   21.0   30.0   19.0   18.0   21.0   20.0   22.0   22.0  ...   
a   26.0   34.0   30.0   22.0   36.0   42.0   38.0   34.0   33.0   36.0  ...   
g   15.0   24.0   28.0   28.0   29.0   22.0   17.0   20.0   19.0   20.0  ...   
+    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN  ...   
-    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN  ...   

   count  count  count  count  count  count  count  count  count  count  
t   21.0   22.0   23.0   33.0   35.0   30.0   23.0   29.0   34.0    NaN  
c   36.0   42.0   31.0   32.0   21.0   32.0   29.0   29.0   17.0    NaN  
a   23.0   24.0   28.0   27.0   25.0   22.0   26.0   24.0   27.0    NaN  
g   26.0   18.0   24.0   14.0   25.0   22.0   28.0   24.0   28.0    NaN  
+    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   53

In [None]:
#new dataframe
numerical_df=pd.get_dummies(df)
numerical_df.loc[:5]


Unnamed: 0,0_a,0_c,0_g,0_t,1_a,1_c,1_g,1_t,2_a,2_c,...,55_a,55_c,55_g,55_t,56_a,56_c,56_g,56_t,class_+,class_-
0,False,False,False,True,True,False,False,False,False,True,...,False,False,True,False,False,False,False,True,True,False
1,False,False,False,True,False,False,True,False,False,True,...,True,False,False,False,True,False,False,False,True,False
2,False,False,True,False,False,False,False,True,True,False,...,False,True,False,False,False,False,True,False,True,False
3,True,False,False,False,True,False,False,False,False,False,...,False,False,False,True,False,True,False,False,True,False
4,False,False,False,True,False,True,False,False,False,False,...,True,False,False,False,False,False,True,False,True,False
5,True,False,False,False,False,False,True,False,False,False,...,False,False,True,False,False,False,False,True,True,False
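The one-hot encoding step can be seen more easily on a tiny table. The sketch below is only an illustration: the three-row frame and the column names `pos0` and `pos1` are made up, not taken from the promoter data. It shows how `pd.get_dummies` turns each (column, value) pair into its own indicator column, and why one of the two class columns is redundant.

```python
import pandas as pd

# Toy frame (hypothetical data): two sequence positions plus a class label.
toy = pd.DataFrame({
    'pos0': ['a', 'c', 'g'],
    'pos1': ['t', 't', 'a'],
    'class': ['+', '-', '+'],
})

# get_dummies creates one indicator column per (column, value) pair.
encoded = pd.get_dummies(toy)
print(list(encoded.columns))

# class_+ and class_- always sum to 1, so one of them can be dropped.
labels = encoded.drop(columns=['class_-']).rename(columns={'class_+': 'class'}).astype(int)
print(labels['class'].tolist())
```

Each position contributes as many columns as it has distinct values, which is why 57 positions with 4 nucleotides each expand to 228 feature columns in the real data.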


In [None]:
df=numerical_df.drop(columns=['class_-'])
df.rename(columns={'class_+':'class'}, inplace=True)

In [None]:
df=df.astype(int)
print(df.iloc[:5])

   0_a  0_c  0_g  0_t  1_a  1_c  1_g  1_t  2_a  2_c  ...  54_t  55_a  55_c  \
0    0    0    0    1    1    0    0    0    0    1  ...     0     0     0   
1    0    0    0    1    0    0    1    0    0    1  ...     0     1     0   
2    0    0    1    0    0    0    0    1    1    0  ...     0     0     1   
3    1    0    0    0    1    0    0    0    0    0  ...     0     0     0   
4    0    0    0    1    0    1    0    0    0    0  ...     1     1     0   

   55_g  55_t  56_a  56_c  56_g  56_t  class  
0     1     0     0     0     0     1      1  
1     0     0     1     0     0     0      1  
2     0     0     0     0     1     0      1  
3     0     1     0     1     0     0      1  
4     0     0     0     0     1     0      1  

[5 rows x 229 columns]


**Splitting the dataset into training and test sets**

In [None]:
!pip install scikit-learn

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import KFold, cross_val_score

In [None]:
from sklearn import model_selection
#create X and Y datasets for training
X = df.drop(['class'], axis=1).values

# Extract the 'class' column to create the target vector Y
Y = df['class'].values
#define seed for reproducibility
seed=1
#split data into training and test
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.25, random_state=seed)

from sklearn import model_selection
This imports the model_selection module from scikit-learn, which provides functions for splitting data into training and test sets.

X = df.drop(['class'], axis=1).values
Y = df['class'].values
This code creates the feature matrix X and the target vector Y for training the model. X contains every column of the DataFrame df except 'class' (the 228 one-hot nucleotide indicators), while Y contains only the 'class' column (1 for promoter, 0 for non-promoter).

Defining a seed for reproducibility:
seed = 1
This line defines a seed value that will be used by random number generators. Setting a seed ensures that the random splitting of data into training and test sets is reproducible, meaning you get the same split every time you run the code with the same seed value.

Splitting data into training and test sets:
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.25, random_state=seed)
With test_size=0.25, 27 of the 106 sequences are held out for testing and the remaining 79 are used for training.
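As a small check of the reproducibility point, the sketch below splits a made-up array twice with the same `random_state` and confirms that both splits are identical. The arrays `X_demo` and `y_demo` are placeholders, not the promoter data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): 20 samples with 2 features each.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0, 1] * 10)

# Two splits with the same seed and the same test_size.
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=1)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=1)

# Compare X_train, X_test, y_train, y_test pairwise.
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
print(split_a[1].shape)  # shape of the held-out X_test
```

Changing `random_state` to another integer would generally produce a different, but again repeatable, split.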


Scoring method and models to train

In [None]:
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)

In [None]:


# Define classifiers and their names
classifiers = [
    KNeighborsClassifier(n_neighbors=3),  # classify by majority vote of the 3 nearest neighbors
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),  # tree is limited to a maximum depth of 5 levels
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    # 10 trees of depth <= 5; only one randomly chosen feature is considered at each split
    MLPClassifier(alpha=1),
    # alpha is the L2 penalty strength; it discourages large weights to reduce overfitting
    AdaBoostClassifier(),  # uses depth-1 decision trees (stumps) as its default base estimator
    GaussianNB(),
    SVC(kernel='linear'),
    SVC(kernel='rbf'),
    SVC(kernel='sigmoid')
]

names = ['Nearest Neighbors', 'Gaussian Process', 'Decision Tree', 'Random Forest',
         'Neural Net', 'AdaBoost', 'Gaussian Naive Bayes', 'SVM Linear', 'SVM RBF', 'SVM Sigmoid']

# Define the scoring method used to evaluate the classifiers. Here we use
# accuracy, the proportion of correctly classified instances.

scoring = 'accuracy'
models = list(zip(names, classifiers))  # materialize the pairs before names is reused below
results = []
names = []
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

    model.fit(X_train, Y_train)

    # Make predictions on the test data
    predictions = model.predict(X_test)

    # Print classifier name
    print(name)

    # Print classification report
    print(classification_report(Y_test, predictions))

Nearest Neighbors: 0.810714 (0.099808)
Nearest Neighbors
              precision    recall  f1-score   support

           0       1.00      0.65      0.79        17
           1       0.62      1.00      0.77        10

    accuracy                           0.78        27
   macro avg       0.81      0.82      0.78        27
weighted avg       0.86      0.78      0.78        27





Gaussian Process: 0.855357 (0.160605)
Gaussian Process
              precision    recall  f1-score   support

           0       1.00      0.82      0.90        17
           1       0.77      1.00      0.87        10

    accuracy                           0.89        27
   macro avg       0.88      0.91      0.89        27
weighted avg       0.91      0.89      0.89        27

Decision Tree: 0.719643 (0.127788)
Decision Tree
              precision    recall  f1-score   support

           0       0.92      0.65      0.76        17
           1       0.60      0.90      0.72        10

    accuracy                           0.74        27
   macro avg       0.76      0.77      0.74        27
weighted avg       0.80      0.74      0.74        27

Random Forest: 0.707143 (0.141782)
Random Forest
              precision    recall  f1-score   support

           0       0.88      0.82      0.85        17
           1       0.73      0.80      0.76        10

    accuracy                 



Neural Net: 0.900000 (0.093541)
Neural Net
              precision    recall  f1-score   support

           0       1.00      0.82      0.90        17
           1       0.77      1.00      0.87        10

    accuracy                           0.89        27
   macro avg       0.88      0.91      0.89        27
weighted avg       0.91      0.89      0.89        27

AdaBoost: 0.875000 (0.147902)
AdaBoost
              precision    recall  f1-score   support

           0       1.00      0.76      0.87        17
           1       0.71      1.00      0.83        10

    accuracy                           0.85        27
   macro avg       0.86      0.88      0.85        27
weighted avg       0.89      0.85      0.85        27

Gaussian Naive Bayes: 0.837500 (0.112500)
Gaussian Naive Bayes
              precision    recall  f1-score   support

           0       1.00      0.88      0.94        17
           1       0.83      1.00      0.91        10

    accuracy                           0.93        27
   macro avg   

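KFold splits a dataset into k folds (shuffled first here, because shuffle=True). Each fold serves once as the validation set while the remaining k-1 folds are used for training. cross_val_score fits the model on each of these splits and returns the k validation scores.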
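To make the fold mechanics concrete, here is a minimal sketch on a made-up array of 10 samples (not the promoter data). It uses 5 folds instead of 10 so the output stays short, and checks that every sample lands in a validation fold exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder data (assumption): 10 samples with 2 features each.
X_demo = np.arange(20).reshape(10, 2)
kfold_demo = KFold(n_splits=5, shuffle=True, random_state=1)

fold_sizes = []
seen = []
for train_idx, test_idx in kfold_demo.split(X_demo):
    # Each iteration yields the row indices for training and validation.
    fold_sizes.append((len(train_idx), len(test_idx)))
    seen.extend(test_idx.tolist())

print(fold_sizes)    # (train size, validation size) per fold
print(sorted(seen))  # every row index appears once across the validation folds
```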

kernel='linear': The SVC uses a linear kernel, so the decision boundary is a hyperplane in the feature space.
kernel='rbf': The SVC uses a radial basis function (RBF) kernel, which is commonly used when the data is not linearly separable.
kernel='sigmoid': The SVC uses a sigmoid kernel, which is based on the hyperbolic tangent of the scaled dot product between samples.
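The three kernels can be compared side by side with the same cross-validation setup. The sketch below is an illustration only: it uses synthetic data from `make_classification` as a stand-in, so the printed scores say nothing about the promoter dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 100 samples, 20 numeric features.
X_demo, y_demo = make_classification(n_samples=100, n_features=20, random_state=1)
cv = KFold(n_splits=5, shuffle=True, random_state=1)

kernel_scores = {}
for kernel in ['linear', 'rbf', 'sigmoid']:
    # Same data and folds for every kernel, so only the kernel changes.
    scores = cross_val_score(SVC(kernel=kernel), X_demo, y_demo, cv=cv, scoring='accuracy')
    kernel_scores[kernel] = scores.mean()
    print("%s: %f (%f)" % (kernel, scores.mean(), scores.std()))
```

Reusing one `KFold` object with a fixed `random_state` keeps the folds identical across kernels, which makes the comparison fairer.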

cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
This line calculates the cross-validated scores for the current classifier using the cross_val_score function. It takes the model, training data (X_train and Y_train), cross-validation object (kfold), and scoring method (scoring) as inputs.

msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
print(msg)
This code prints the mean and standard deviation of the cross-validated scores for the current classifier. %s is a placeholder for the classifier name, %f is a placeholder for floating-point numbers (mean and standard deviation), and (name, cv_results.mean(), cv_results.std()) are the values to be substituted into the placeholders.
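The formatting step can be tried in isolation. This sketch uses four made-up fold scores and the standard library instead of NumPy. `statistics.pstdev` is used because it computes the population standard deviation, which is also what NumPy's `.std()` returns by default.

```python
import statistics

# Hypothetical fold scores, chosen only to illustrate the formatting.
name = "Demo"
scores = [0.75, 0.875, 1.0, 0.875]

mean = statistics.mean(scores)
std = statistics.pstdev(scores)  # population standard deviation

# %s inserts the name, %f inserts a float with six decimal places.
msg = "%s: %f (%f)" % (name, mean, std)
print(msg)  # Demo: 0.875000 (0.088388)

# An f-string with the same format specifier gives the same text.
msg_f = f"{name}: {mean:f} ({std:f})"
print(msg_f == msg)
```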

Neural Net:
Precision: The Neural Net classifier achieved perfect precision for class 0 (non-promoter, '-') and a precision of 0.77 for class 1 (promoter, '+'). This means that when the classifier predicted an instance to be in class 0, it was correct 100% of the time, and when it predicted an instance to be in class 1, it was correct 77% of the time.
Recall: The classifier achieved a recall of 0.82 for class 0 and perfect recall (1.00) for class 1. This indicates that the classifier correctly identified 82% of the instances belonging to class 0 and all instances belonging to class 1.
F1-score: The F1-score is the harmonic mean of precision and recall. For class 0, it is 0.90, and for class 1, it is 0.87.
Support: This is the number of instances of each class in the test set (17 for class 0 and 10 for class 1).
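These numbers can be reproduced by hand. The confusion counts below are not printed anywhere in the notebook; they are inferred from the Neural Net report (17 class-0 samples with recall 0.82, 10 class-1 samples with recall 1.00), so treat them as an assumption. The helper `prf` is a hypothetical name used only for this sketch.

```python
# Inferred counts (assumption): 14 of 17 class-0 samples predicted as 0,
# 3 predicted as 1; all 10 class-1 samples predicted as 1.
tp0, fn0 = 14, 3
tp1, fn1 = 10, 0
fp0 = fn1  # class-1 samples wrongly predicted as class 0
fp1 = fn0  # class-0 samples wrongly predicted as class 1

def prf(tp, fp, fn):
    """Return precision, recall and F1 for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

class0 = tuple(round(v, 2) for v in prf(tp0, fp0, fn0))
class1 = tuple(round(v, 2) for v in prf(tp1, fp1, fn1))
accuracy = round((tp0 + tp1) / (tp0 + fn0 + tp1 + fn1), 2)

print(class0)    # (1.0, 0.82, 0.9)
print(class1)    # (0.77, 1.0, 0.87)
print(accuracy)  # 0.89
```

With these counts the three misclassified sequences are all non-promoters predicted as promoters, which is why class 0 keeps perfect precision while class 1 keeps perfect recall.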

AdaBoost:
Similar to the Neural Net classifier, AdaBoost achieved high precision, recall, and F1-scores for both classes. However, it achieved a slightly lower accuracy of 0.85 compared to the Neural Net classifier.
Despite having a lower accuracy, AdaBoost maintains a balanced performance for both classes, with precision, recall, and F1-scores above 0.7 for both.

Gaussian Naive Bayes:
Gaussian Naive Bayes achieved a test accuracy of 0.93, matched by the RBF and sigmoid SVMs and exceeded only by the linear SVM.
It shows excellent performance in terms of precision, recall, and F1-score for both classes, with F1-scores of 0.94 and 0.91 for classes 0 and 1, respectively.

SVM Linear:
SVM with a linear kernel achieved the highest accuracy among all classifiers, with an accuracy of 0.96.
It shows excellent performance in terms of precision, recall, and F1-score for both classes, with F1-scores of 0.97 and 0.95 for classes 0 and 1, respectively.

SVM RBF and SVM Sigmoid:
SVM with RBF and sigmoid kernels achieved accuracies of 0.93, which is slightly lower than SVM with a linear kernel but still high.
They maintain a balanced performance for both classes, with precision, recall, and F1-scores above 0.8 for both.

In summary, the classifiers vary in performance, but most perform reasonably well. On the test set, SVM with a linear kernel, Gaussian Naive Bayes, the Neural Net, and AdaBoost stand out with high accuracy and balanced performance across both classes. Because the test set holds only 27 sequences, a single misclassification changes accuracy by about 0.04, so small gaps between models should be read with caution. These results can guide the choice of classifier according to the desired balance between precision, recall, and overall accuracy.
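When the choice depends on several metrics at once, scikit-learn's `cross_validate` can score a model on all of them in a single pass. The sketch below is one possible way to do that, using synthetic data as a stand-in and only two of the classifiers. The `summary` dictionary is a name introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in data (assumption), not the promoter sequences.
X_demo, y_demo = make_classification(n_samples=100, n_features=20, random_state=1)
cv = KFold(n_splits=5, shuffle=True, random_state=1)
metrics = ['accuracy', 'precision', 'recall', 'f1']

summary = {}
candidates = [('Gaussian Naive Bayes', GaussianNB()), ('SVM Linear', SVC(kernel='linear'))]
for label, model in candidates:
    # cross_validate returns one array of fold scores per metric, keyed 'test_<metric>'.
    out = cross_validate(model, X_demo, y_demo, cv=cv, scoring=metrics)
    summary[label] = {m: out['test_' + m].mean() for m in metrics}
    print(label, {m: round(v, 3) for m, v in summary[label].items()})
```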