<a href="https://colab.research.google.com/github/Intertangler/ML4biotech/blob/main/cb206v_exercise6_deepneuralnetworks_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## import the data
This artificial data set contains the expression levels (already normalized) of two separate genes. The class label attached to each data point indicates the presence (1) or absence (0) of a particular downstream phenotype influenced by the genes. Our goal here is to detect a nonlinear relationship between the two gene expression levels that strongly correlates with the downstream phenotype.
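
To make "nonlinear relationship" concrete, here is a minimal sketch. It assumes the data follow an XOR-like pattern, which the file name `gene_expression_XOR.csv` suggests but which is not verified here. The four corner points, the hand-picked weights, and the helper names `relu` and `tiny_network` are illustrative and are not part of the exercise. No single straight line separates XOR labels, yet a network with one small hidden layer reproduces them exactly.

```python
import numpy as np

# Four idealized points: each gene is either "low" (0) or "high" (1).
corners = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
xor_labels = np.array([0, 1, 1, 0])  # phenotype present when exactly one gene is high

def relu(z):
    """Rectified linear unit: pass positive values, clip negatives to zero."""
    return np.maximum(z, 0.0)

def tiny_network(x):
    """Hand-wired 2-2-1 network (weights chosen by hand, not learned)."""
    W1 = np.array([[1.0, 1.0],
                   [1.0, 1.0]])      # both hidden units read x1 + x2
    b1 = np.array([0.0, -1.0])       # the second unit only fires when both genes are high
    hidden = relu(x @ W1 + b1)
    w2 = np.array([1.0, -2.0])       # subtract the "both high" case twice
    return hidden @ w2

outputs = tiny_network(corners)
print(outputs)
```

The hidden layer is what makes this possible: the second unit detects the "both genes high" case, and the output layer subtracts it. A logistic regression has no hidden layer and so can only draw one linear boundary.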

In [None]:
import pandas as pd
url = "https://raw.githubusercontent.com/Intertangler/ML4biotech/main/gene_expression_XOR.csv"
df = pd.read_csv(url)

X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values


In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt


np.random.seed(428)

# Shuffle data
shuffle_idx = np.random.permutation(len(X))
X = X[shuffle_idx]
y = y[shuffle_idx]

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train a logistic regression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
logreg_prob = logreg.predict_proba(X_test)[:, 1]
# ROC data for logistic regression
logreg_fpr, logreg_tpr, _ = roc_curve(y_test, logreg_prob)
logreg_auc = roc_auc_score(y_test, logreg_prob)

# Plot data points
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], label="Class 0")
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], label="Class 1")
plt.xlabel("Gene 1")
plt.ylabel("Gene 2")
plt.legend()  # display the class labels set in the scatter calls
plt.show()

## exercise
Complete the missing lines.
Use the Keras library to construct a deep neural network: define the model type, choose its architecture (the number of layers and the number of nodes in each), and pick the activation functions. Then compile the model with an appropriate loss function and an optimizer that adapts the learning rate during training, such as adaptive moment estimation (Adam).

The code after this section fits the model to the training inputs and labels. The test set that was held out earlier is then used to score the model with an ROC curve. Compare the performance of your model to that of the logistic regression. If everything is set up correctly, the neural network should outperform the logistic regression.
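
As a hint, one possible way to fill in the missing lines is sketched below. This is only one of many workable designs. The two hidden layers, the 8 nodes per layer, the ReLU activations, and the learning rate of 0.01 are illustrative assumptions rather than tuned values, and `build_model` is a hypothetical helper name.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam

def build_model(n_features=2, hidden_units=8, learning_rate=0.01):
    """Small fully connected binary classifier (illustrative sizes, not tuned)."""
    model = Sequential([
        Input(shape=(n_features,)),              # one input per gene
        Dense(hidden_units, activation="relu"),  # first hidden layer
        Dense(hidden_units, activation="relu"),  # second hidden layer
        Dense(1, activation="sigmoid"),          # output: probability of class 1
    ])
    model.compile(
        optimizer=Adam(learning_rate=learning_rate),  # adaptive moment estimation
        loss="binary_crossentropy",                   # standard loss for 0/1 labels
    )
    return model

model = build_model()

# Smoke check on random inputs: one probability per row, before any training.
demo_probs = model.predict(np.random.rand(5, 2), verbose=0)
print(demo_probs.shape)
```

Try varying the number of layers and nodes and watch how the printed loss and the final AUC respond.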



In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.metrics import roc_curve, roc_auc_score
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.callbacks import LambdaCallback


#🌟🌟🌟🌟 YOUR CODE HERE 🌟🌟🌟🌟#  # Initialize the model
#🌟🌟🌟🌟 YOUR CODE HERE 🌟🌟🌟🌟# # Add one or more hidden layers with a certain number of nodes
#🌟🌟🌟🌟 YOUR CODE HERE 🌟🌟🌟🌟# # Add the output layer with a sigmoid activation
#🌟🌟🌟🌟 YOUR CODE HERE 🌟🌟🌟🌟# # Compile the model with binary cross-entropy loss and adaptive moment estimation




# Function to print the loss every 100 epochs
print_callback = LambdaCallback(
    on_epoch_end=lambda epoch, logs: print(f"Epoch {epoch}, Loss: {logs['loss']}")
    if epoch % 100 == 0 else None
)

# Training part
model.fit(
    X_train,
    y_train,
    epochs=1000,
    verbose=0,  # No output
    callbacks=[print_callback]  # Print using callback
)

# predict on test set
y_pred_test = model.predict(X_test)
mlp_prob_keras = y_pred_test.flatten()
#  ROC data
keras_mlp_fpr, keras_mlp_tpr, _ = roc_curve(y_test, mlp_prob_keras)
keras_mlp_auc = roc_auc_score(y_test, mlp_prob_keras)
# plotting part!
plt.figure()
plt.plot(logreg_fpr, logreg_tpr, label=f'Logistic regression (AUC = {logreg_auc:.2f})')
plt.plot(keras_mlp_fpr, keras_mlp_tpr, label=f'Keras MLP (AUC = {keras_mlp_auc:.2f})')
plt.plot([0, 1], [0, 1], linestyle='--', label='Random performance (AUC = 0.5)')
# ROC axes are rates, not raw counts
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()
