<a href="https://colab.research.google.com/github/Gyuheon-Song/Bioinformatics/blob/main/Bioinformatics_Machine_Learning_Practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to use this tutorial


This tutorial uses a Colab notebook, an interactive computational environment that combines live code, visualizations, and explanatory text. To run this notebook, you may first need to **sign in with your Google account** and make a copy by choosing **File > Save a Copy in Drive** from the menu bar (saving may take a few moments).

A key feature of Google Colab is free access to cloud GPUs. First, turn on the GPU from **Runtime > Change Runtime Type > Hardware Acceleration**. Then **click on the Connect button located at the top right of the page** to allocate server resources.

Once you are connected to a runtime, you need to **upload the sample data** to the server. Click on the **'Files'** tab on the left side of the page and press the **'upload'** button at the top to upload the data. Note that all uploaded data is deleted when the runtime disconnects, so you will have to upload it again after reconnecting.
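Before running the loading cells, it can help to confirm that the upload succeeded. The following is a minimal sketch, assuming the four TSV files read later in this notebook were uploaded to `/content`; the helper name `find_missing_files` is hypothetical and not part of the original tutorial.

```python
from pathlib import Path

# File names are taken from the pd.read_csv calls later in this notebook.
EXPECTED_FILES = [
    "LUAD_cancer_onlyCCDSpublic.tsv",
    "LUSC_cancer_onlyCCDSpublic.tsv",
    "LUAD_normal_onlyCCDSpublic.tsv",
    "LUSC_normal_onlyCCDSpublic.tsv",
]

def find_missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in expected if not (data_dir / name).is_file()]

# '/content' is the directory the loading cells read from.
missing = find_missing_files("/content")
if missing:
    print("Please upload these files before continuing:", missing)
else:
    print("All sample data files are in place.")
```

If the runtime was disconnected and the files were deleted, the first branch lists what needs to be uploaded again.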

The notebook is organized into a series of cells. You can modify the Python code and execute each cell as you would in a Jupyter notebook. To execute a cell, **click on the black run button located at the top left of the code block.**

# 0. Background

In this tutorial, we will be creating classifiers using Support Vector Machine (SVM), Multi-Layer Perceptron (MLP), Decision Tree, and Random Forest algorithms in Python 3.

We will train these classifiers using gene expression profiles of LUSC and LUAD cancer samples, aiming to assess their ability to effectively discriminate between different cancer subtypes.

Additionally, we will test whether these classifiers can distinguish matched normal samples of LUAD patients from matched normal samples of LUSC patients. This set of normal samples is referred to below as the "fake set".

In [32]:
from google.colab import drive
drive.mount('/content/drive')


# 1. Import packages

In [57]:
                                    # CELL 1
###########################################################################################################################
### TITLE: Bioinformatics Machine Learning Practice Python3 Code                                                        ###
### DESCRIPTION: Make Support Vector Machine, Decision Tree, Random Forest and Artificial Neural Network classifiers    ###
### to classify lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) according to gene expression profiles.###
### DATE: Version1 - Feb 22, 2018 / 1.01 May 10, 2019 / 1.02 June, 2020                                                 ###
###########################################################################################################################

###########################################################################################################################
###                888b    888              888      888888b.     d8b              888                  888             ###
###               8888b   888              888      888  "88b    Y8P              888                  888              ###
###              88888b  888              888      888  .88P                     888                  888               ###
###             888Y88b 888    .d88b.    888888   8888888K.    888    .d88b.    888        8888b.    88888b.            ###
###            888 Y88b888   d8P  Y8b   888      888  "Y88b   888   d88""88b   888           "88b   888 "88b            ###
###           888  Y88888   88888888   888      888    888   888   888  888   888       .d888888   888  888             ###
###          888   Y8888   Y8b.       Y88b.    888   d88P   888   Y88..88P   888       888  888   888 d88P              ###
###         888    Y888    "Y8888     "Y888   8888888P"    888    "Y88P"    88888888  "Y888888   88888P"                ###
###########################################################################################################################

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn import tree
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score

# Description of each import
# pandas, numpy: handle matrix data
# train_test_split: split data into training and test sets
# tree: provides export_graphviz for drawing the decision tree
# SVC: make Support Vector Machine classifier
# MLPClassifier: make multi-layer ANN
# DecisionTreeClassifier: make Decision Tree
# RandomForestClassifier: make Random Forest
# accuracy_score: measure accuracy of models

In [None]:
######### This part is deprecated #########
# CELL2
# Add the Graphviz executable to the system environment PATH (Windows only).
# YOU DON'T NEED TO UNDERSTAND THIS PART.
# from sys import platform
# import os

# changed = False
# if platform == "win32":
#     paths = os.environ["PATH"].split(os.pathsep)
#     for path in paths:
#         if path.endswith("Anaconda3") or path.endswith("anaconda3"):
#             path_to_anaconda = path
#             graphviz_path = path_to_anaconda + os.altsep + os.altsep.join(["Library", "bin", "graphviz"])
#             changed = True
#             break

#     if not paths[-1].endswith(os.altsep.join(["Library", "bin", "graphviz"])) and changed:
#         os.environ["PATH"] += (os.pathsep + graphviz_path)

# 2. Load file and show

In [58]:
# CELL3
LUAD_cancer_matrix = pd.read_csv('/content/LUAD_cancer_onlyCCDSpublic.tsv', index_col = 0, sep="\t")
# Show only first 10 lines of the matrix (LUAD_cancer_matrix)
LUAD_cancer_matrix.head(10)

Hybridization REF,TCGA-05-4244-01A-01R-1107-07,TCGA-05-4249-01A-01R-1107-07,TCGA-05-4250-01A-01R-1107-07,TCGA-05-4382-01A-01R-1206-07,TCGA-05-4384-01A-01R-1755-07,TCGA-05-4389-01A-01R-1206-07,TCGA-05-4390-01A-02R-1755-07,TCGA-05-4395-01A-01R-1206-07,TCGA-05-4396-01A-21R-1858-07,TCGA-05-4397-01A-01R-1206-07,...,TCGA-NJ-A4YG-01A-22R-A262-07,TCGA-NJ-A4YI-01A-11R-A262-07,TCGA-NJ-A4YP-01A-11R-A262-07,TCGA-NJ-A4YQ-01A-11R-A262-07,TCGA-NJ-A55A-01A-11R-A262-07,TCGA-NJ-A55O-01A-11R-A262-07,TCGA-NJ-A55R-01A-11R-A262-07,TCGA-NJ-A7XG-01A-12R-A39D-07,TCGA-O1-A52J-01A-11R-A262-07,TCGA-S2-AA1A-01A-12R-A39D-07
CXorf67,0.0,0.0,0.0,7.6577,0.0,0.0,0.0,0.0,0.0,1.3252,...,0.5774,1.3427,1.0163,0.0,2.6226,0.0,1.9223,0.0,3.3493,2.6208
GTPBP6,307.1821,388.9237,391.2791,486.0267,620.0367,602.0134,568.4177,561.5761,549.6588,780.7841,...,537.5289,669.3521,825.4573,605.1475,543.7538,767.2112,1081.5162,834.8025,1620.5742,567.3991
EFCAB12,33.1617,26.7224,20.2271,4.0682,9.774,4.0268,5.748,5.6372,42.4564,5.0801,...,6.351,20.141,14.2276,67.7966,729.0847,15.1692,21.8658,20.1488,7.177,36.0357
A1BG,26.0302,120.1349,50.8597,145.9037,127.3671,67.1409,164.7134,21.9745,17.4375,126.5687,...,190.3868,345.6395,72.533,165.6874,99.4055,42.007,273.9088,245.9714,209.4211,114.5543
A1CF,0.0,0.322,0.0,0.0,0.0,36.2416,0.0,0.0,15.9212,0.0,...,1.1547,0.6714,0.0,0.6277,0.0,0.0,61.9932,0.0,0.0,0.6552
RBFOX1,1.7454,1.6098,0.0,0.0,0.0,0.0,97.7167,0.0,0.0,0.0,...,10.97,0.0,0.0,0.0,1.7484,0.0,8.6502,15.1116,0.9569,0.0
GGACT,135.5022,89.0629,151.1332,112.0685,87.5748,111.8624,64.9465,180.7029,133.5785,228.6295,...,121.6801,74.2263,167.7363,81.6259,48.5619,105.9218,47.7996,100.0664,29.5598,122.9867
A2ML1,0.3491,1.6098,0.0,4.7861,0.0,36.5772,0.6387,390.8484,0.7582,19.2159,...,0.0,0.6714,14.2276,0.0,0.0,8.7515,0.2403,2.7476,0.0,0.0
A2M,9844.7858,25712.6617,16943.6359,23326.2727,48314.5571,7798.896,14147.7758,4611.4227,24726.2775,5078.8294,...,11778.2217,19484.3505,11934.1565,14199.5857,55734.6883,22401.3477,28060.0132,4287.9863,26192.7943,53159.237
A4GALT,130.9015,578.5562,356.4619,554.4677,152.7184,399.3289,657.8317,622.7803,210.7657,843.291,...,112.0092,60.423,473.5772,152.5424,422.2397,950.4084,233.075,378.2484,56.9378,260.1125


# 3. Load files and divide into test and training set

In [59]:
# CELL4
LUSC_cancer_matrix = pd.read_csv('/content/LUSC_cancer_onlyCCDSpublic.tsv', index_col = 0, sep="\t")

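# Join the two matrices side by side (genes as rows, samples as columns)
merged_cancer_matrix = pd.concat([LUAD_cancer_matrix,LUSC_cancer_matrix], axis = 1)

# Transpose so that each row is a sample and each column is a gene
X = merged_cancer_matrix.T
gene_names = list(X.columns.values)
# Labels in column order: 0 = LUAD, 1 = LUSC
y = np.array(LUAD_cancer_matrix.shape[1]*[0] + LUSC_cancer_matrix.shape[1]*[1])

# Hold out 30% of the samples as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)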
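The concatenate, transpose, and label steps can be traced on a toy example. The sketch below uses small made-up matrices (not the TCGA data) with hypothetical gene and sample names. It also passes `stratify`, an optional `train_test_split` argument that the cell above does not use, to keep the class ratio the same in both splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real matrices: rows are genes, columns are samples.
genes = ["GENE_A", "GENE_B", "GENE_C"]
toy_luad = pd.DataFrame(np.arange(18).reshape(3, 6), index=genes,
                        columns=[f"LUAD_{i}" for i in range(6)])
toy_lusc = pd.DataFrame(np.arange(12).reshape(3, 4), index=genes,
                        columns=[f"LUSC_{i}" for i in range(4)])

# Column-wise concatenation keeps genes as rows; transposing gives samples x genes.
X_toy = pd.concat([toy_luad, toy_lusc], axis=1).T
# Labels follow the column order: 0 for every LUAD sample, then 1 for every LUSC sample.
y_toy = np.array(toy_luad.shape[1] * [0] + toy_lusc.shape[1] * [1])

# stratify=y_toy keeps the LUAD/LUSC ratio similar in the training and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy)

print(X_toy.shape)                 # (number of samples, number of genes)
print(y_tr.mean(), y_te.mean())    # fraction of LUSC samples in each split
```

Because the labels are built purely from column counts, they are only correct as long as the LUAD columns come first in the concatenated matrix.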

# 4. Load files and process fake set

In [60]:
# CELL5
LUAD_normal_matrix = pd.read_csv('/content/LUAD_normal_onlyCCDSpublic.tsv', index_col = 0, sep="\t")
LUSC_normal_matrix = pd.read_csv('/content/LUSC_normal_onlyCCDSpublic.tsv', index_col = 0, sep="\t")

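merged_normal_matrix = pd.concat([LUAD_normal_matrix,LUSC_normal_matrix], axis = 1)

# Matched normal samples, labelled by the cancer type of the patient they came from:
# 0 = normal tissue from a LUAD patient, 1 = normal tissue from a LUSC patient
X_fake = merged_normal_matrix.T
y_fake = np.array(LUAD_normal_matrix.shape[1]*[0] + LUSC_normal_matrix.shape[1]*[1])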

# 5. SVM

In [37]:
# CELL6 - Fit the model on the training set
svm = SVC(kernel='linear', C =1.0, random_state=0)
svm.fit(X_train, y_train)

In [38]:
# CELL7 - Apply the model to the training and test sets and calculate its accuracy
y_pred_train = svm.predict(X_train)
y_pred_test = svm.predict(X_test)
print('Train-set accuracy of SVM: %.2f' % accuracy_score(y_train, y_pred_train))
print('Test-set accuracy of SVM: %.2f' % accuracy_score(y_test, y_pred_test))

Train-set accuracy of SVM: 1.00
Test-set accuracy of SVM: 0.93
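Accuracy alone does not show which class is being misclassified. As a small illustration, the sketch below builds a confusion matrix with `sklearn.metrics.confusion_matrix` on made-up labels; the arrays are hypothetical and are not the notebook's predictions. In the notebook, the same call could be applied to `y_test` and `y_pred_test`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, using the notebook's encoding: 0 = LUAD, 1 = LUSC.
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 1, 1, 1, 1, 1])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
demo_accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(cm)
print('Accuracy: %.2f' % demo_accuracy)
```

The diagonal holds the correctly classified samples, so accuracy equals the sum of the diagonal divided by the total number of samples.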


In [39]:
# CELL8 - Apply the model to the fake set and calculate its accuracy
y_pred_fake = svm.predict(X_fake)
print('Fake-set accuracy of SVM: %.2f' % accuracy_score(y_fake, y_pred_fake))

Fake-set accuracy of SVM: 0.46
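To judge whether a fake-set accuracy is better than guessing, it helps to compare it with the accuracy of always predicting the most frequent label. The class balance of the fake set is not shown in this notebook, so the sketch below uses a hypothetical label vector; the helper `majority_baseline_accuracy` is an illustrative name, and in the notebook it could be applied to `y_fake`.

```python
import numpy as np

def majority_baseline_accuracy(y):
    """Accuracy obtained by always predicting the most frequent label in y."""
    counts = np.bincount(np.asarray(y))
    return counts.max() / counts.sum()

# Hypothetical label vector; in the notebook this would be y_fake.
y_demo = np.array([0] * 6 + [1] * 4)
baseline = majority_baseline_accuracy(y_demo)
print('Majority-class baseline: %.2f' % baseline)
```

A classifier whose accuracy does not exceed this baseline has not learned anything useful about the labels in that set.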


# 6. MLP

In [40]:
# CELL9
mlp = MLPClassifier(activation='relu', max_iter=1000, learning_rate='constant',
                    random_state=0, learning_rate_init=0.01, hidden_layer_sizes=(10, 15, 30,10))
mlp.fit(X_train, y_train)
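The `hidden_layer_sizes=(10, 15, 30, 10)` argument defines four hidden layers between the input and output layers. The sketch below fits the same configuration on small synthetic data (not the TCGA matrices) and prints the shape of each weight matrix stored in the fitted model's `coefs_` attribute, which makes the layer structure visible.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic data (NOT the TCGA matrices): 40 samples, 8 features, 2 classes.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 8))
y_demo = np.array([0] * 20 + [1] * 20)
X_demo[y_demo == 1] += 1.0   # shift class 1 so the classes are separable

demo_mlp = MLPClassifier(activation='relu', max_iter=1000, learning_rate='constant',
                         random_state=0, learning_rate_init=0.01,
                         hidden_layer_sizes=(10, 15, 30, 10))
demo_mlp.fit(X_demo, y_demo)

# One weight matrix per pair of consecutive layers:
# input -> 10 -> 15 -> 30 -> 10 -> output.
layer_shapes = [w.shape for w in demo_mlp.coefs_]
for i, shape in enumerate(layer_shapes):
    print('Weights between layer %d and %d: %s' % (i, i + 1, shape))
```

With the real data, the first matrix has one row per gene, so the number of weights in the first layer grows with the number of genes in `X_train`.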

In [41]:
# CELL10
y_pred_train = mlp.predict(X_train)
y_pred_test = mlp.predict(X_test)
print('Train-set accuracy of ANN: %.2f' % accuracy_score(y_train, y_pred_train))
print('Test-set accuracy of ANN: %.2f' % accuracy_score(y_test, y_pred_test))

Train-set accuracy of ANN: 0.97
Test-set accuracy of ANN: 0.95


In [44]:
# CELL11
y_pred_fake = mlp.predict(X_fake)
print('Fake-set accuracy of ANN: %.2f' % accuracy_score(y_fake, y_pred_fake))

Fake-set accuracy of ANN: 0.53


# 7. Decision Tree

In [43]:
# CELL12
my_tree = DecisionTreeClassifier(criterion='entropy', max_depth=4, random_state=0)
my_tree.fit(X_train, y_train)

In [45]:
# CELL13
y_pred_train = my_tree.predict(X_train)
y_pred_test = my_tree.predict(X_test)
print('Train-set accuracy of Decision tree: %.2f' % accuracy_score(y_train, y_pred_train))
print('Test-set accuracy of Decision tree: %.2f' % accuracy_score(y_test, y_pred_test))

Train-set accuracy of Decision tree: 0.99
Test-set accuracy of Decision tree: 0.94


In [46]:
# CELL14
y_pred_fake = my_tree.predict(X_fake)
print('Fake-set accuracy of Decision tree: %.2f' % accuracy_score(y_fake, y_pred_fake))

Fake-set accuracy of Decision tree: 0.57


In [47]:
# CELL15
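import graphviz
# Export the fitted tree in DOT format, labelling split nodes with gene names
dot_data = tree.export_graphviz(my_tree, out_file=None, feature_names = gene_names)
graph = graphviz.Source(dot_data)
# Render the graph to 'tree.pdf' in the current working directory
graph.render("tree")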

'tree.pdf'
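If Graphviz is not available, the same decision rules can be printed as plain text with `sklearn.tree.export_text`. The sketch below is an alternative to the PDF rendering above, not part of the original tutorial. It uses synthetic data and hypothetical gene names; in the notebook the call would take `my_tree` and `gene_names`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data (NOT the TCGA matrices); gene names are hypothetical placeholders.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 4))
y_demo = (X_demo[:, 2] > 0).astype(int)   # only the third feature determines the label
demo_genes = ['GENE_A', 'GENE_B', 'GENE_C', 'GENE_D']

demo_tree = DecisionTreeClassifier(criterion='entropy', max_depth=4, random_state=0)
demo_tree.fit(X_demo, y_demo)

# Each indented line is one split or leaf, labelled with the feature name.
rules = export_text(demo_tree, feature_names=demo_genes)
print(rules)
```

The text output lists one threshold per split, which is often enough to see which genes the tree relies on.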

# 8. Random Forest

In [73]:
# CELL16
forest = RandomForestClassifier(criterion='entropy', n_estimators=20, max_depth=15,
                                max_features=None, random_state=1, n_jobs=-1)
forest.fit(X_train, y_train)

In [74]:
# CELL17
y_pred_train = forest.predict(X_train)
y_pred_test = forest.predict(X_test)
print('Train-set accuracy of Random Forest: %.2f' % accuracy_score(y_train, y_pred_train))
print('Test-set accuracy of Random Forest: %.2f' % accuracy_score(y_test, y_pred_test))

Train-set accuracy of Random Forest: 1.00
Test-set accuracy of Random Forest: 0.95
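A fitted random forest exposes a `feature_importances_` attribute that can be used to rank genes by how much they contribute to the splits. The sketch below shows the idea on synthetic data with hypothetical gene names, using the same forest settings as above except `n_jobs`; in the notebook, the attribute would be read from `forest` and paired with `gene_names`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (NOT the TCGA matrices); only GENE_B determines the label.
rng = np.random.default_rng(1)
demo_genes = ['GENE_A', 'GENE_B', 'GENE_C', 'GENE_D', 'GENE_E']
X_demo = pd.DataFrame(rng.normal(size=(100, 5)), columns=demo_genes)
y_demo = (X_demo['GENE_B'] > 0).astype(int).to_numpy()

demo_forest = RandomForestClassifier(criterion='entropy', n_estimators=20, max_depth=15,
                                     max_features=None, random_state=1)
demo_forest.fit(X_demo, y_demo)

# Importances sum to 1; larger values mean the gene was more useful for splitting.
importances = pd.Series(demo_forest.feature_importances_, index=demo_genes)
top_genes = importances.sort_values(ascending=False)
print(top_genes.head(3))
```

Impurity-based importances are a rough ranking only, so they are best treated as a starting point for further inspection.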


In [75]:
# CELL18
y_pred_fake = forest.predict(X_fake)
print('Fake-set accuracy of Random Forest: %.2f' % accuracy_score(y_fake, y_pred_fake))

Fake-set accuracy of Random Forest: 0.54
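The accuracies printed throughout this notebook can be collected into one table for comparison. The sketch below copies the reported values by hand into a pandas DataFrame; the column `train_minus_test` is an added, illustrative measure of the gap between training and test accuracy.

```python
import pandas as pd

# Accuracies copied from the cell outputs reported in this notebook.
results = pd.DataFrame(
    {
        'train': [1.00, 0.97, 0.99, 1.00],
        'test':  [0.93, 0.95, 0.94, 0.95],
        'fake':  [0.46, 0.53, 0.57, 0.54],
    },
    index=['SVM', 'MLP', 'Decision Tree', 'Random Forest'],
)

# A large gap between training and test accuracy can hint at overfitting.
results['train_minus_test'] = (results['train'] - results['test']).round(2)
print(results)
```

All four models score 0.93 or higher on the test set, while their fake-set accuracies stay between 0.46 and 0.57.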
