## DNA Sequence Classification With Machine Learning

In this notebook, I will train a classification model that predicts a gene's function from its coding DNA sequence alone.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Path to the training CSV; adjust it for your environment
file_path = '/kaggle/input/covid-deeppredictor/TrainingData/Trainingdata.csv'

# Reading CSV into DataFrame
df = pd.read_csv(file_path, usecols=[1, 3])
df.columns = ['class', 'sequence']
# Displaying the DataFrame
print(df)

      class                                           sequence
0         1  GATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTA...
1         1  CATTCAGTACGGTCGTAGCGGTATAACACTGGGAGTACTCGTGCCA...
2         1  CACGCGCGGGCAAGTCAATGTGCACTCTTTCCGAACAACTTGATTA...
3         1  ATATTAGGTTTTTACCTACCCAGGAAAAGCCAACCAACCTCGATCT...
4         1  ATATTAGGTTTTTACCTACCCAGGAAAAGCCAACCAACCTCGATCT...
...     ...                                                ...
1495      6  AGTATGGAAAGAATAAAAGAACTACGGACCCTGATGTCGCAGTCTC...
1496      6  AAAGCAGGCAAACCATTTGAATGGATGTCAATCCGACTCTACTTTT...
1497      6  AGTATGGAAAGAATAAAAGAACTACGGACCCTGATGTCGCAGTCTC...
1498      6  ATTTGAATGGATGTCAATCCGACTCTACTTTTCCTAAAGGTTCCAG...
1499      6  AGTATGGAAAGAATAAAAGAACTACGGACCCTGATGTCGCAGTCTC...

[1500 rows x 2 columns]


### We have DNA sequences with an integer class label for training. A separate test file with the same two columns is loaded next.

In [3]:
test_file_path = '/kaggle/input/covid-deeppredictor/TestData/TestData/Testdata-1.csv'

# Reading the test CSV into a DataFrame and giving it the same column names as the training data
test = pd.read_csv(test_file_path, usecols=[1, 2])
test.columns = ['class', 'sequence']



### Here is a summary of the training data. The class column holds one of 6 integer labels (1 to 6), which this notebook treats as gene sequence function groups.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   class     1500 non-null   int64 
 1   sequence  1500 non-null   object
dtypes: int64(1), object(1)
memory usage: 23.6+ KB


### Let's define a function to collect all possible overlapping k-mers of a specified length from any sequence string. We will basically apply the k-mers to the complete sequences.

In [5]:
# function to convert sequence strings into k-mer words, default size = 6 (hexamer words)
def getKmers(sequence, size=6):
    return [sequence[x:x+size].lower() for x in range(len(sequence) - size + 1)]
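
To make the sliding window concrete, here is a small sketch of the same idea on a toy 9-base sequence. The name `get_kmers` and the toy input are illustrative only and are not part of the notebook's data.

```python
def get_kmers(sequence, size=6):
    """Return all overlapping k-mers of length `size`, lowercased."""
    return [sequence[i:i + size].lower() for i in range(len(sequence) - size + 1)]

# A 9-base toy sequence yields 9 - 6 + 1 = 4 overlapping hexamers.
toy_kmers = get_kmers("ATGCATGCA")
print(toy_kmers)  # ['atgcat', 'tgcatg', 'gcatgc', 'catgca']

# A sequence shorter than `size` yields no k-mers at all.
print(get_kmers("ACG"))  # []
```

Note that a sequence of length n produces n - size + 1 words, so very short sequences contribute nothing.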

In [6]:
human_data = pd.concat([df, test], ignore_index=True)

In [7]:
human_data

      class                                           sequence
0         1  GATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTA...
1         1  CATTCAGTACGGTCGTAGCGGTATAACACTGGGAGTACTCGTGCCA...
2         1  CACGCGCGGGCAAGTCAATGTGCACTCTTTCCGAACAACTTGATTA...
3         1  ATATTAGGTTTTTACCTACCCAGGAAAAGCCAACCAACCTCGATCT...
4         1  ATATTAGGTTTTTACCTACCCAGGAAAAGCCAACCAACCTCGATCT...
...     ...                                                ...
4638      6  TCAAATATATTCAATATGGAGAGAATAAAAGAGCTGAGAGATCTAA...
4639      6  TCAAATATATTCAATATGGAGAGAATAAAAGAGCTGAGAGATCTAA...
4640      6  CAAACCATTTGAATGGATGTCAATCCGACTCTACTTTTCTTAAAAA...
4641      6  TCAAATATATTCAATATGGAGAGAATAAAAGAGCTGAGAGATCTAA...


In [8]:
# df = pd.concat([df, tes], )

## Now we can convert the sequences into short overlapping k-mers of length 6. Let's do that for the combined data using our getKmers function.

In [9]:
human_data['words'] = human_data.apply(lambda x: getKmers(x['sequence']), axis=1)
human_data = human_data.drop('sequence', axis=1)


In [10]:
# tes['words'] = tes.apply(lambda x: getKmers(x['sequence']), axis=1)
# tes = tes.drop('sequence', axis=1)

### Now, our coding sequence data is changed to lowercase, split up into all possible k-mer words of length 6 and ready for the next step.  Let's take a look.

In [None]:
# # Assuming 'sequence' is the name of the column containing DNA sequences
# # Assuming df is your DataFrame with the 'sequence' column

# import pandas as pd
# import re

# # Assuming df is your DataFrame and 'sequence' is the column containing DNA sequences

# # Define a function to check if a sequence contains only DNA nucleotides
# def is_valid_sequence(sequence):
#     pattern = re.compile(r'[^ATGC]', re.IGNORECASE)  # Regular expression to match non-DNA nucleotides
#     return not bool(pattern.search(sequence))

# # Filter out rows with invalid sequences
# df = df[df['sequence'].apply(is_valid_sequence)]

# # Now df contains only rows with valid DNA sequences


### Since we are going to use scikit-learn natural language processing tools to do the k-mer counting, we need to now convert the lists of k-mers for each gene into string sentences of words that the count vectorizer can use.  We can also make a y variable to hold the class labels.  Let's do that now.

In [11]:
human_texts = list(human_data['words'])
for item in range(len(human_texts)):
    human_texts[item] = ' '.join(human_texts[item])
y_data = human_data.iloc[:, 0].values
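
As a rough illustration of why the k-mer lists are joined into sentences, the sketch below builds one such sentence and counts its words with the standard library. `Counter` here only stands in for what a bag-of-words vectorizer computes per document; the helper names and toy k-mers are made up for the example.

```python
from collections import Counter

def kmers_to_sentence(kmers):
    """Join a list of k-mers into one space-separated 'sentence'."""
    return " ".join(kmers)

def count_words(sentence):
    """Count whitespace-separated tokens, as a bag-of-words model would."""
    return Counter(sentence.split())

toy_sentence = kmers_to_sentence(["atgcat", "tgcatg", "atgcat"])
toy_counts = count_words(toy_sentence)

print(toy_sentence)                                # atgcat tgcatg atgcat
print(toy_counts["atgcat"], toy_counts["tgcatg"])  # 2 1
```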

In [None]:
# lol = list(tes['words'])
# for item in range(len(lol)):
#   lol[item] = ' '.join(lol[item])
# y_tes = tes.iloc[:, 0].values

In [13]:
np.unique(y_data)

array([1, 2, 3, 4, 5, 6])
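
The labels run from 1 to 6, while a later cell passes them through `LabelEncoder`, which maps the sorted unique labels to 0..n-1. That is why the confusion matrix near the end of the notebook is indexed 0 to 5. A minimal pure-Python sketch of that mapping (the function name and toy labels are illustrative):

```python
def encode_labels(labels):
    """Map each label to its rank among the sorted unique labels."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels], classes

encoded, classes = encode_labels([1, 6, 3, 1, 2, 5, 4])
print(encoded)  # [0, 5, 2, 0, 1, 4, 3]
print(classes)  # [1, 2, 3, 4, 5, 6]
```

To read a predicted index back as an original label, look it up in `classes` (with scikit-learn, `inverse_transform` plays that role).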

## The same steps would apply to chimpanzee and dog data; this notebook only uses the combined dataset above

In [13]:
# X=human_texts
X=human_data['words']

In [None]:
print(X[0])

In [None]:
# from tensorflow.keras.models import Sequential
# from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, Flatten, Dense
# from tensorflow.keras.optimizers import Adam

# all_words = [word for text in X for word in text.split()]

# # Determine the vocabulary size
# vocab_size = len(set(all_words))
# from tensorflow.keras.utils import to_categorical

# # Assuming y_data contains the categorical labels
# # Perform one-hot encoding
# y_data_encoded = to_categorical(y_data, num_classes=num_classes)
# # Determine the maximum sequence length
# max_length = max(len(text.split()) for text in X)
# num_classes = 82
# # Model Architecture
# model = Sequential([
#     Embedding(vocab_size, 100, input_length=max_length),
#     Conv1D(128, 5, activation='relu'),
#     MaxPooling1D(),
#     Flatten(),
#     Dense(64, activation='relu'),
#     Dense(82, activation='softmax')
# ])

# # Model Compilation
# model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
# X_train_padded = pad_sequences(X_train, maxlen=max_length)
# X_test_padded = pad_sequences(X_test, maxlen=max_length)
# # Model Training
# model.fit(X_train_padded, y_train, epochs=10, validation_data=(X_val, y_val))

# # Model Evaluation
# loss, accuracy = model.evaluate(X_test_padded, y_test)
# print(f'Test Accuracy: {accuracy}')

# # Model Deployment (optional)
# # Use the trained model for predictions on new data


In [14]:
from gensim.models import FastText
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import numpy as np

# Assuming X contains sequences split into k-mers and y_data contains labels

# Encode labels
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y_data)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.4, random_state=42)

# Train k-mer2vec model using FastText
kmer2vec_model = FastText(sentences=X_train, vector_size=100, window=5, min_count=1, workers=4)

# Function to convert sequences to vectors using k-mer2vec model
def sequence_to_vector(sequence, model):
    vectors = [model.wv[word] for word in sequence if word in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

# Convert sequences to vectors
X_train_vectors = np.array([sequence_to_vector(sequence, kmer2vec_model) for sequence in X_train])
X_test_vectors = np.array([sequence_to_vector(sequence, kmer2vec_model) for sequence in X_test])

# Now you can use X_train_vectors and X_test_vectors as input to your machine learning model
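
The pooling step in `sequence_to_vector` can be sketched without gensim. Below, a plain dict stands in for `model.wv` (an assumption made only for this example), and each sequence becomes the element-wise mean of its known k-mer vectors, falling back to zeros when none are known.

```python
def mean_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    # zip(*known) walks the vectors column by column
    return [sum(column) / len(known) for column in zip(*known)]

# Toy 2-dimensional embeddings for two k-mers (hypothetical values)
toy_embeddings = {"atgcat": [1.0, 0.0], "tgcatg": [0.0, 1.0]}

pooled = mean_vector(["atgcat", "tgcatg", "nnnnnn"], toy_embeddings, dim=2)
print(pooled)  # [0.5, 0.5]
```

Mean pooling discards word order, so two sequences with the same k-mer composition map to the same vector.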


In [None]:
# Creating the Bag of Words model using CountVectorizer()
# This is equivalent to k-mer counting
# The n-gram size of 4 was previously determined by testing
# from sklearn.feature_extraction.text import CountVectorizer
# cv = CountVectorizer(ngram_range=(4,4))
# X = cv.fit_transform(human_texts)
# X_test= cv.fit_transform(lol)

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(ngram_range=(1, 1))


X_train_= tfidf.fit_transform(human_texts)

# X_test= tfidf.fit_transform(lol)

# X_chimp = cv.transform(chimp_texts)
# X_dog = cv.transform(dog_texts)
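
For intuition about what the TF-IDF weights mean, here is a small standard-library sketch. It uses raw term counts, a smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2 normalisation per row, which to my understanding matches scikit-learn's defaults; treat it as an approximation for illustration rather than a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed idf and L2 row normalisation."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenised:
        weights = {t: c * idf[t] for t, c in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

# 'atgcat' occurs in both toy documents, so it is down-weighted
toy_rows = tfidf(["atgcat tgcatg", "atgcat gcatgc"])
print(toy_rows[0])
```

K-mers shared by every sequence carry little weight, while k-mers specific to a few sequences dominate the vector.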

In [None]:
# import numpy as np
# from sklearn.model_selection import train_test_split
# from sklearn.preprocessing import LabelEncoder
# from tensorflow.keras.models import Sequential
# from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense, Dropout

# # Convert sparse matrix X to a dense numpy array
# X_dense = X.toarray()
# label_encoder = LabelEncoder()

# # Fit and transform the 'Species' column
# y_data = label_encoder.fit_transform(y_data)
# # Split the data into train, test, and validation sets
# X_train, X_test, y_train, y_test = train_test_split(X_dense, y_data, test_size=0.2, random_state=42)
# X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=42)  # 0.25 * 0.8 = 0.2

In [None]:
# X_train

In [None]:


# # Define CNN model
# model = Sequential()

# # Embedding layer (optional, depends on the size and nature of your text data)
# #model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_seq_length))

# # Convolutional layers
# model.add(Conv1D(filters=64, kernel_size=5, activation='relu', input_shape=(X_train.shape[1], 1)))
# model.add(GlobalMaxPooling1D())
# num_classes = 82

# # Dense layers
# model.add(Dense(32, activation='relu'))
# model.add(Dropout(0.1))
# model.add(Dense(num_classes, activation='softmax'))  # num_classes is the number of output classes

# # Compile the model
# model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# # Reshape X_train, X_val, and X_test to match the input shape of the model
# X_train_reshaped = np.expand_dims(X_train, axis=-1)
# X_val_reshaped = np.expand_dims(X_val, axis=-1)
# X_test_reshaped = np.expand_dims(X_test, axis=-1)

# # Train the model
# model.fit(X_train_reshaped, y_train, epochs=30, batch_size=32, validation_data=(X_val_reshaped, y_val))

# # Evaluate the model on the test set
# test_loss, test_accuracy = model.evaluate(X_test_reshaped, y_test)
# print("Test Accuracy:", test_accuracy)


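### Let's look at class balance. The dataset is not evenly balanced: in the confusion matrix near the end of the notebook, one class has 927 test samples while most others have between 121 and 144.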

In [None]:
human_data['class'].value_counts().sort_index().plot.bar()

In [None]:
# Splitting the human dataset into the training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_train_,
                                                    y_data,
                                                    test_size = 0.20,
                                                    random_state=42)
# y_train=y_data
# y_test=y_tes
# X_test_ = X_test[:, :173952]

In [None]:
X_test.shape

In [30]:
import numpy as np
from sklearn.model_selection import train_test_split

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense, Embedding, GlobalMaxPooling1D, Bidirectional, LSTM

# Assuming y_train holds the encoded labels and X_train_vectors the averaged FastText vectors

# X_train_vectors is already a dense NumPy array, so no sparse-to-dense conversion is needed
X_train_dense = X_train_vectors

# Splitting the Data
X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(X_train_dense, y_train, test_size=0.2, random_state=42)

# Define the CNN Model
# model = Sequential([
#     Embedding(input_dim=X_train_dense.shape[1], output_dim=128),
#     Conv1D(filters=64, kernel_size=5, activation='relu'),
#     # MaxPooling1D(pool_size=5),
#     # Conv1D(filters=128, kernel_size=5, activation='relu'),
#     GlobalMaxPooling1D(),
#     Dense(64, activation='relu'),
#     Dense(len(np.unique(y_train)), activation='softmax')
# ])
model = Sequential([
    Embedding(input_dim=X_train_dense.shape[1], output_dim=128),
    Conv1D(filters=64, kernel_size=5, activation='relu'),
    MaxPooling1D(pool_size=5),
    Bidirectional(LSTM(units=64, return_sequences=True)),
    Bidirectional(LSTM(units=32)),
    Dense(64, activation='relu'),
    Dense(6, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Training the Model
model.fit(X_train_split, y_train_split, epochs=10, batch_size=32, validation_data=(X_val_split, y_val_split))

# Model Evaluation
loss, accuracy = model.evaluate(X_val_split, y_val_split)
print(f'Validation Accuracy: {accuracy}')


Epoch 1/10
93/93 ━━━━━━━━━━━━━━━━━━━━ 11s 24ms/step - accuracy: 0.5255 - loss: 1.4390 - val_accuracy: 0.7201 - val_loss: 0.9894
Epoch 2/10
93/93 ━━━━━━━━━━━━━━━━━━━━ 1s 14ms/step - accuracy: 0.7416 - loss: 0.8490 - val_accuracy: 0.7201 - val_loss: 0.7738
Epoch 3/10
93/93 ━━━━━━━━━━━━━━━━━━━━ 1s 14ms/step - accuracy: 0.7723 - loss: 0.7238 - val_accuracy: 0.7954 - val_loss: 0.6744
Epoch 4/10
93/93 ━━━━━━━━━━━━━━━━━━━━ 1s 14ms/step - accuracy: 0.8051 - loss: 0.6543 - val_accuracy: 0.7954 - val_loss: 0.6592
Epoch 5/10
93/93 ━━━━━━━━━━━━━━━━━━━━ 1s 14ms/step - accuracy: 0.7967 - loss: 0.6621 - val_accuracy: 0.7968 - val_loss: 0.6515
Epoch 6/10
93/93 ━━━━━━━━━━━━━━━━━━━━ 1s 14ms/step - accuracy: 0.7995 - loss: 0.6574 - val_accuracy: 0.7968 - val_loss: 0.6495
Epoch 7/10
93/93 ━━━

In [22]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense, Embedding, GlobalMaxPooling1D, LSTM

# Assuming y_train holds the encoded labels and X_train_vectors the averaged FastText vectors

# X_train_vectors is already a dense NumPy array, so no sparse-to-dense conversion is needed
X_train_dense = X_train_vectors

# Splitting the Data
X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(X_train_dense, y_train, test_size=0.2, random_state=42)

# Define the LSTM model
# Note: X_train_dense holds averaged FastText vectors (floats), whereas an Embedding
# layer expects integer token ids, so this layer is unlikely to be a good fit for the input.
max_words = X_train_dense.shape[1]  # Length of each input row (the FastText vector size)
num_classes = 6

model = Sequential([
    # `input_length` is omitted: the Keras version used here rejected it with
    # "ValueError: Unrecognized keyword arguments passed to Embedding: {'input_length': 100}"
    Embedding(input_dim=max_words, output_dim=128),
    # Conv1D(filters=64, kernel_size=5, activation='relu'),
    # MaxPooling1D(pool_size=10),
    # Conv1D(filters=64, kernel_size=5, activation='relu'),
    # GlobalMaxPooling1D(),
    # Dense(64, activation='relu'),
    # Dense(num_classes, activation='softmax')
    LSTM(units=64, return_sequences=True),
    LSTM(units=32),
    Dense(64, activation='relu'),
    Dense(num_classes, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Training the Model
model.fit(X_train_split, y_train_split, epochs=10, batch_size=32, validation_data=(X_val_split, y_val_split))

# Model Evaluation
loss, accuracy = model.evaluate(X_val_split, y_val_split)
print(f'Validation Accuracy: {accuracy}')

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, GlobalMaxPooling1D, LSTM, Dense

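# Example CNN model
# vocab_size, embedding_dim, and the *_sequences / *_labels arrays are placeholders that must be defined first.
# `input_length` is not passed because the Keras version used in this notebook rejects it.
cnn_model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim),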
    Conv1D(filters=64, kernel_size=3, activation='relu'),
    MaxPooling1D(pool_size=2),
    GlobalMaxPooling1D(),
    Dense(64, activation='relu'),
    Dense(num_classes, activation='softmax')
])

# Example RNN model (same placeholders; `input_length` omitted because the Keras version used here rejects it)
rnn_model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim),
    LSTM(units=64, return_sequences=True),
    LSTM(units=32),
    Dense(64, activation='relu'),
    Dense(num_classes, activation='softmax')
])

# Compile models
cnn_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
rnn_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train models
cnn_model.fit(train_sequences, train_labels, epochs=10, validation_data=(val_sequences, val_labels))
rnn_model.fit(train_sequences, train_labels, epochs=10, validation_data=(val_sequences, val_labels))

# Evaluate models
cnn_loss, cnn_accuracy = cnn_model.evaluate(test_sequences, test_labels)
rnn_loss, rnn_accuracy = rnn_model.evaluate(test_sequences, test_labels)
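
The example models above rely on names such as `vocab_size`, `max_length`, and `train_sequences` that the notebook never defines. One possible way to build them from k-mer lists is sketched below in plain Python; the helper names are hypothetical, id 0 is reserved for padding by assumption, and padding is added on the right (Keras' `pad_sequences` pads on the left by default).

```python
def build_vocab(tokenised_sequences):
    """Assign each distinct k-mer an integer id, reserving 0 for padding."""
    vocab = {}
    for tokens in tokenised_sequences:
        for token in tokens:
            vocab.setdefault(token, len(vocab) + 1)
    return vocab

def encode_and_pad(tokens, vocab, max_length):
    """Convert k-mers to ids, truncate to max_length, right-pad with 0.

    Unknown k-mers also map to 0 here, which conflates them with padding.
    """
    ids = [vocab.get(token, 0) for token in tokens][:max_length]
    return ids + [0] * (max_length - len(ids))

toy_tokenised = [["atgcat", "tgcatg"], ["atgcat"]]
toy_vocab = build_vocab(toy_tokenised)
toy_max_length = 3
toy_encoded = [encode_and_pad(t, toy_vocab, toy_max_length) for t in toy_tokenised]
toy_vocab_size = len(toy_vocab) + 1  # +1 for the padding id; would feed Embedding(input_dim=...)

print(toy_encoded)  # [[1, 2, 0], [1, 0, 0]]
```

Integer ids of this kind are what an `Embedding` layer expects as input, in contrast to the averaged float vectors used elsewhere in the notebook.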


In [None]:
# import numpy as np
# from sklearn.model_selection import train_test_split
# from sklearn.preprocessing import LabelEncoder
# from sklearn.feature_extraction.text import TfidfVectorizer
# from tensorflow.keras.models import Sequential
# from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
# from sklearn.metrics import accuracy_score

# # Assuming X_train is the TF-IDF transformed data and y_train is the encoded labels

# # Splitting the Data
# X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# # Define the CNN Model
# model = Sequential([
#     Conv1D(filters=64, kernel_size=3, activation='relu', input_shape=(X_train_split.shape[1], 1)),
#     MaxPooling1D(pool_size=2),
#     Flatten(),
#     Dense(128, activation='relu'),
#     Dense(len(np.unique(y_train)), activation='softmax')
# ])

# # Reshape input data to match CNN input shape
# X_train_split_reshaped = X_train_split.reshape(X_train_split.shape[0], X_train_split.shape[1], 1)
# X_val_split_reshaped = X_val_split.reshape(X_val_split.shape[0], X_val_split.shape[1], 1)

# # Compile the model
# model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# # Training the Model
# model.fit(X_train_split_reshaped, y_train_split, epochs=10, batch_size=32, validation_data=(X_val_split_reshaped, y_val_split))

# # Evaluate the Model
# y_pred = model.predict_classes(X_val_split_reshaped)
# accuracy = accuracy_score(y_val_split, y_pred)
# print(f'Validation Accuracy: {accuracy}')


### A classifier will now be trained on the averaged FastText vectors. Several alternatives (multinomial naive Bayes, linear and RBF SVC, MLP, XGBoost, random forest, extra trees, bagging) are left commented out in the cell below; the active code fits an SVC with a degree-2 polynomial kernel.

In [15]:
### Multinomial Naive Bayes Classifier ###
# The alpha parameter was determined by grid search previously
# from sklearn.naive_bayes import MultinomialNB
# classifier = MultinomialNB(alpha=0.01)
# classifier.fit(X_train_vectors, y_train)
# from sklearn.svm import SVC

# classifier = SVC(kernel='linear')
# classifier.fit(X_train_vectors, y_train)
# from sklearn.neural_network import MLPClassifier

# # Initialize MLP classifier
# classifier = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000)
# classifier.fit(X_train_vectors, y_train)

from sklearn.svm import SVC

# Initialize and train the Polynomial Support Vector Classifier (SVC)
classifier = SVC(kernel='poly', degree=2)  # Degree can be adjusted
classifier.fit(X_train_vectors, y_train)

# from sklearn.svm import SVC

# # Initialize and train the RBF Support Vector Classifier (SVC)
# classifier = SVC(kernel='rbf')
# classifier.fit(X_train, y_train)


# import xgboost as xgb

# # Initialize and train the Gradient Boosting classifier (XGBoost)
# classifier = xgb.XGBClassifier()
# classifier.fit(X_train, y_train)

# from sklearn.ensemble import RandomForestClassifier

# # Initialize and train the Random Forest classifier
# classifier = RandomForestClassifier(n_estimators=200, random_state=42)
# classifier.fit(X_train, y_train)

# from sklearn.ensemble import ExtraTreesClassifier

# # Initialize and train the Extra Trees classifier
# classifier = ExtraTreesClassifier(n_estimators=100, random_state=42)
# classifier.fit(X_train, y_train)

# from sklearn.ensemble import BaggingClassifier
# from sklearn.tree import DecisionTreeClassifier

# # Initialize base decision tree
# base_classifier = DecisionTreeClassifier()

# # Initialize and train Random Forest with Bagging
# classifier = BaggingClassifier(base_classifier, n_estimators=100, random_state=42)
# classifier.fit(X_train, y_train)
# from sklearn.ensemble import RandomForestClassifier

# # Initialize and train Random Forest classifier with feature subsampling (Random Subspace Method)
# classifier = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=42)
# classifier.fit(X_train, y_train)


In [16]:
y_pred = classifier.predict(X_test_vectors)

### Okay, so let's look at some model performance metrics like the confusion matrix, accuracy, precision, recall and F1 score. The scores on the held-out data are very high, which suggests the model did not simply overfit to the training data. That said, the printed rows above include sequences with identical displayed prefixes, so duplicates may sit on both sides of the split and inflate these numbers. In a real project I would go back and sample many more train/test splits, since we have a relatively small dataset.
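
A minimal sketch of sampling several train/test splits is shown below. As an extra precaution that is my assumption rather than something the notebook does, identical sequences are kept on the same side of each split. The function name and toy data are illustrative; scikit-learn's `GroupShuffleSplit` offers a similar grouped split.

```python
import random
from collections import defaultdict

def grouped_split(sequences, test_fraction=0.2, seed=0):
    """Split row indices so identical sequences never straddle train and test."""
    groups = defaultdict(list)
    for idx, seq in enumerate(sequences):
        groups[seq].append(idx)
    keys = sorted(groups)  # fixed order before shuffling, for reproducibility
    random.Random(seed).shuffle(keys)
    n_test = max(1, round(len(keys) * test_fraction))
    test_idx = [i for key in keys[:n_test] for i in groups[key]]
    train_idx = [i for key in keys[n_test:] for i in groups[key]]
    return train_idx, test_idx

toy_sequences = ["AAAA", "CCCC", "AAAA", "GGGG", "TTTT", "CCCC"]

# Three different random splits of the same toy data
splits = [grouped_split(toy_sequences, test_fraction=0.25, seed=s) for s in range(3)]
for train_idx, test_idx in splits:
    print(sorted(train_idx), sorted(test_idx))
```

Fitting and scoring the classifier once per split, then reporting the mean and spread of the scores, gives a steadier estimate than a single split.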

In [None]:
X_test_vectors

In [17]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
print("Confusion matrix\n")
print(pd.crosstab(pd.Series(y_test, name='Actual'), pd.Series(y_pred, name='Predicted')))
def get_metrics(y_test, y_predicted):
    accuracy = accuracy_score(y_test, y_predicted)
    precision = precision_score(y_test, y_predicted, average='weighted')
    recall = recall_score(y_test, y_predicted, average='weighted')
    f1 = f1_score(y_test, y_predicted, average='weighted')
    return accuracy, precision, recall, f1
accuracy, precision, recall, f1 = get_metrics(y_test, y_pred)
print("accuracy = %.3f \nprecision = %.3f \nrecall = %.3f \nf1 = %.3f" % (accuracy, precision, recall, f1))

Confusion matrix

Predicted    0    1    2    3    4    5
Actual                                 
0          142    0    2    0    0    0
1            0  123    0    0    0    0
2            0    0  927    0    0    0
3            0    0    0  121    0    0
4            0    0    0    0  134    0
5            0    0    0    0    0  409
accuracy = 0.999 
precision = 0.999 
recall = 0.999 
f1 = 0.999
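
As a cross-check, the reported accuracy and the weighted precision and recall can be recomputed by hand from the confusion matrix printed above. This sketch only uses the numbers in that output; rows are actual classes and columns are predicted classes.

```python
confusion = [
    [142, 0, 2, 0, 0, 0],
    [0, 123, 0, 0, 0, 0],
    [0, 0, 927, 0, 0, 0],
    [0, 0, 0, 121, 0, 0],
    [0, 0, 0, 0, 134, 0],
    [0, 0, 0, 0, 0, 409],
]

n = len(confusion)
total = sum(sum(row) for row in confusion)
support = [sum(row) for row in confusion]  # actual count per class
predicted = [sum(confusion[r][c] for r in range(n)) for c in range(n)]  # predicted count per class

accuracy = sum(confusion[i][i] for i in range(n)) / total
recall = [confusion[i][i] / support[i] for i in range(n)]
precision = [confusion[i][i] / predicted[i] if predicted[i] else 0.0 for i in range(n)]

# 'weighted' averaging weights each class by its number of actual samples
weighted_recall = sum(s * r for s, r in zip(support, recall)) / total
weighted_precision = sum(s * p for s, p in zip(support, precision)) / total

print(f"accuracy = {accuracy:.3f}, precision = {weighted_precision:.3f}, recall = {weighted_recall:.3f}")
# accuracy = 0.999, precision = 0.999, recall = 0.999
```

Weighted recall always equals accuracy, because weighting each class's recall by its support simply sums the correctly classified samples. The only errors are two class-0 samples predicted as class 2.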


In [None]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score, roc_curve, auc
from sklearn.preprocessing import LabelBinarizer
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Assuming you have y_test and y_pred defined earlier
# y_test = ...
# y_pred = ...

print("Confusion matrix\n")
print(pd.crosstab(pd.Series(y_test, name='Actual'), pd.Series(y_pred, name='Predicted')))

def get_metrics(y_test, y_predicted):
    accuracy = accuracy_score(y_test, y_predicted)
    precision = precision_score(y_test, y_predicted, average='weighted')
    recall = recall_score(y_test, y_predicted, average='weighted')
    f1 = f1_score(y_test, y_predicted, average='weighted')
    return accuracy, precision, recall, f1

accuracy, precision, recall, f1 = get_metrics(y_test, y_pred)
print("accuracy = %.3f \nprecision = %.3f \nrecall = %.3f \nf1 = %.3f" % (accuracy, precision, recall, f1))

# Compute ROC curve and ROC area for each class
lb = LabelBinarizer()
lb.fit(y_test)
y_test_bin = lb.transform(y_test)
y_pred_bin = lb.transform(y_pred)

n_classes = len(lb.classes_)

fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test_bin[:, i], y_pred_bin[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test_bin.ravel(), y_pred_bin.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

# Plot ROC curve
plt.figure()
plt.plot(fpr["micro"], tpr["micro"], color='darkorange', lw=2, label='micro-average ROC curve (area = {0:0.2f})'.format(roc_auc["micro"]))
for i in range(n_classes):
    plt.plot(fpr[i], tpr[i], lw=1, alpha=0.3)  # Plot each class ROC curve with lower opacity

plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
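
One caveat about the cell above: it binarizes hard class predictions, so each ROC curve has only a single interior operating point. Continuous scores (for an SVC, `decision_function` provides them) generally give a more informative curve. The sketch below shows the difference on made-up toy values, computing AUC as the probability that a random positive outranks a random negative.

```python
def binary_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win. Assumes both classes are present.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))

labels = [1, 1, 0, 0]
auc_from_scores = binary_auc(labels, [0.9, 0.4, 0.6, 0.1])  # continuous scores
auc_from_labels = binary_auc(labels, [1, 0, 1, 0])          # hard 0/1 predictions

print(auc_from_scores)  # 0.75
print(auc_from_labels)  # 0.5
```

With hard predictions, every tie between a positive and a negative contributes only half a win, so ranking information that the scores carry is lost.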

