# MLP Classifier for Classifying Fungi Classes

<p style="text-align: justify;">According to <a title="Thomas Cavalier-Smith" href="https://en.wikipedia.org/wiki/Thomas_Cavalier-Smith">Cavalier-Smith (2015)</a>, living organisms are divided into 7 kingdoms: Bacteria, Archaea, Protozoa, Chromista, Plantae, Fungi, and Animalia.</p>
<p style="text-align: justify;"><img src="images/220px-Biological_classification_L_Pengo_vflip.svg.png" alt="" /></p>
<p style="text-align: justify;">Fungi comprises the phyla Ascomycota, Basidiomycota, Chytridiomycota, Glomeromycota, and Zygomycota.</p>
<p style="text-align: justify;">The phylum Ascomycota contains several classes, including Saccharomycetes and Leotiomycetes.</p>
<p style="text-align: justify;"><em>source: <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4418965/">https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4418965/</a></em></p>
<p style="text-align: justify;">In this material we train an MLP to classify 2 classes of the kingdom Fungi (phylum&nbsp;Ascomycota): Saccharomycetes and Leotiomycetes.</p>
<p style="text-align: justify;">We also tune the MLP's parameters to find the best MLP Classifier model, so that the model can be saved and later used to identify whether a given DNA sequence belongs to Saccharomycetes or Leotiomycetes.</p>

In [2]:
from Bio import SeqIO
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
import joblib  # sklearn.externals.joblib was removed from scikit-learn; use the standalone package
import numpy as np
import pickle

# Data Acquisition and Labelling

<p>In this step we read data taken from the public dataset at boldsystems.org. The DNA sequences of the 2 Fungi classes are stored in FASTA format.</p>
<p>Labelling:</p>
<ol>
<li>Saccharomycetes --&gt; 0</li>
<li>Leotiomycetes --&gt; 1</li>
</ol>

In [3]:
labels = []
sequences = []

# Map each class label to its FASTA file (0 = Saccharomycetes, 1 = Leotiomycetes)
path = {0: 'data/Saccharomycetes.fas', 1: 'data/Leotiomycetes.fas'}
for key, value in path.items():
    print(key, value)
    # Store every sequence in the file together with its class label
    for seq_record in SeqIO.parse(value, "fasta"):
        sequences.append(str(seq_record.seq))
        labels.append(key)

print('total sequences: ', len(sequences))

0 data/Saccharomycetes.fas
1 data/Leotiomycetes.fas
total sequences:  11842


In [17]:
print(sequences[10])

TACCTTATCTATGGTATGATTAGTGCTATGGTTGCTACAGCTATGTCTGTAATTATTAGATTAGAATTATCTGGACCAGGTGATCAGTTCTTACACGAAAATCGGCAAGTATATAATGTTCTTGTTACTGGTCATGCTATAGCAATGATTTTCTTATTT---GTTATGCCTGTATTAATAGGTGCATTTGGAAATTTCTTTTTACCTATTATGATAGGTGCTGTAGATATGGCATTTGCTAGATTAAATAATATTAGTTTTTGATGTTTACCTCCTGCATTAGTATGTATTATTGGTTCAATTTTAATTGAATCAGGAGCAGGTACAGGA------------TGAACTGTATATCCTCCACTATCATCAATTAGTGCACATTCAGGTCCATCAGTTGATTTAGCTATATTTGCATTACACCTTACAAGTATTTCATCATTATTAGGTGCTATTAACTTTATTGTTACTACTTTAAACATGCGTAGTATAGGAGTACACATGATAGATATGCCATTATTTGTTTGAGCTATATTCTTCACAGCTATATTATTATTATTATCATTACCTGTATTAACAGCTGGTGTTACATTATTATTAATGGACCGTAACTTTAACACAGGATTCTATGAAGTAGCTGCTGGTGGTGATCCAATCCTTTACGAACATTTA


In [18]:
print(labels)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

# K-Mer Encoding

<p>To encode the sequences we use k-mer encoding, where each word is 4 characters long.</p>

In [4]:
def getKmers(sequence, size):
    return [sequence[x:x+size].lower() for x in range(len(sequence) - size + 1)]
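
To make the sliding window concrete, here is a small sketch that applies the same function to the first seven bases of the sequence printed earlier. The short input string is chosen only for illustration.

```python
def getKmers(sequence, size):
    # Slide a window of `size` characters over the sequence, one base at a time
    return [sequence[x:x+size].lower() for x in range(len(sequence) - size + 1)]

# A 7-base sequence yields 7 - 4 + 1 = 4 overlapping 4-mers
kmers = getKmers("TACCTTA", 4)
print(kmers)            # ['tacc', 'acct', 'cctt', 'ctta']
print(' '.join(kmers))  # the space-separated "document" form used for vectorization
```

Because consecutive windows overlap by `size - 1` characters, a sequence of length `n` always produces `n - size + 1` words.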

In [5]:
documents = []
for seq in sequences:
    words = ' '.join(getKmers(seq, 4))
    documents.append(words)
print(len(documents))

11842


In [21]:
print(documents[10])

tacc acct cctt ctta ttat tatc atct tcta ctat tatg atgg tggt ggta gtat tatg atga tgat gatt atta ttag tagt agtg gtgc tgct gcta ctat tatg atgg tggt ggtt gttg ttgc tgct gcta ctac taca acag cagc agct gcta ctat tatg atgt tgtc gtct tctg ctgt tgta gtaa taat aatt atta ttat tatt atta ttag taga agat gatt atta ttag taga agaa gaat aatt atta ttat tatc atct tctg ctgg tgga ggac gacc acca ccag cagg aggt ggtg gtga tgat gatc atca tcag cagt agtt gttc ttct tctt ctta ttac taca acac cacg acga cgaa gaaa aaaa aaat aatc atcg tcgg cggc ggca gcaa caag aagt agta gtat tata atat tata ataa taat aatg atgt tgtt gttc ttct tctt cttg ttgt tgtt gtta ttac tact actg ctgg tggt ggtc gtca tcat catg atgc tgct gcta ctat tata atag tagc agca gcaa caat aatg atga tgat gatt attt tttt tttc ttct tctt ctta ttat tatt attt ttt- tt-- t--- ---g --gt -gtt gtta ttat tatg atgc tgcc gcct cctg ctgt tgta gtat tatt atta ttaa taat aata atag tagg aggt ggtg gtgc tgca gcat catt attt tttg ttgg tgga ggaa gaaa aaat aatt attt tttc ttct tctt cttt tttt tttt 

# Feature Extraction

In [6]:
vectorizer = CountVectorizer()
tf = vectorizer.fit_transform(documents)
print(tf.shape)

(11842, 1227)
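
With only the four bases a, c, g and t there are at most 4^4 = 256 distinct 4-mers, yet the vocabulary above has 1227 entries. One contributing factor, offered here as an explanation rather than something the notebook states, is that `CountVectorizer`'s default token pattern only keeps runs of two or more word characters. Alignment gaps (`-`) are therefore treated as separators, so k-mers that contain gaps are cut into shorter tokens. Any non-ACGT symbols in the data would add further entries. The sketch below uses a few gap-containing k-mers copied from the document printed earlier.

```python
from sklearn.feature_extraction.text import CountVectorizer

# K-mers around an alignment gap, taken from the printed example document
doc = "ttt- tt-- t--- ---g --gt -gtt gtta"

vectorizer = CountVectorizer()  # default token_pattern: r"(?u)\b\w\w+\b"
vectorizer.fit([doc])

# Gaps split the words; single-character leftovers are dropped entirely
tokens = sorted(vectorizer.vocabulary_)
print(tokens)  # ['gt', 'gtt', 'gtta', 'tt', 'ttt']
```

If strict 4-mers are wanted, one option is to drop gap characters before building k-mers, or to pass a custom `token_pattern` to the vectorizer. Whether that helps accuracy on this dataset is not something the notebook tests.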


In [26]:
print(tf)

  (0, 252)	1
  (0, 285)	1
  (0, 977)	1
  (0, 193)	1
  (0, 543)	1
  (0, 537)	2
  (0, 1024)	1
  (0, 443)	1
  (0, 131)	1
  (0, 366)	1
  (0, 990)	1
  (0, 348)	1
  (0, 985)	2
  (0, 439)	1
  (0, 539)	1
  (0, 103)	2
  (0, 677)	1
  (0, 686)	1
  (0, 712)	2
  (0, 551)	1
  (0, 104)	1
  (0, 1110)	1
  (0, 194)	3
  (0, 320)	1
  (0, 323)	3
  :	:
  (11841, 12)	1
  (11841, 491)	4
  (11841, 95)	4
  (11841, 939)	1
  (11841, 1095)	2
  (11841, 215)	3
  (11841, 278)	5
  (11841, 289)	2
  (11841, 58)	1
  (11841, 249)	2
  (11841, 970)	2
  (11841, 433)	1
  (11841, 83)	1
  (11841, 17)	4
  (11841, 243)	4
  (11841, 969)	2
  (11841, 192)	7
  (11841, 273)	4
  (11841, 975)	5
  (11841, 1100)	2
  (11841, 1128)	1
  (11841, 43)	2
  (11841, 495)	5
  (11841, 1016)	6
  (11841, 201)	1


# Split Data Training-Testing

In [7]:
X_train, X_test, Y_train, Y_test = train_test_split(tf, labels, test_size=.3)

print(X_train.shape)
print(X_test.shape)

(8289, 1227)
(3553, 1227)
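
The split above is random and unseeded, so the rows that land in each set (and the scores reported later) will differ between runs. As a sketch, assuming a reproducible split with the same class proportions in both sets is wanted, `train_test_split` accepts `random_state` and `stratify`. The toy data and the seed `42` are illustrative and not part of the original notebook.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-in for the real features: 60 samples of class 0, 40 of class 1
X = [[i] for i in range(100)]
y = [0] * 60 + [1] * 40

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,     # same 70/30 ratio as in the notebook
    stratify=y,        # keep the 60:40 class ratio in both subsets
    random_state=42,   # fixed seed (arbitrary value) for a repeatable split
)

print(Counter(y_train))  # 42 samples of class 0 and 28 of class 1
print(Counter(y_test))   # 18 samples of class 0 and 12 of class 1
```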


In [14]:
print(X_train)

  (0, 201)	2
  (0, 1016)	4
  (0, 495)	3
  (0, 43)	1
  (0, 221)	1
  (0, 1132)	1
  (0, 975)	1
  (0, 273)	2
  (0, 192)	1
  (0, 243)	2
  (0, 17)	1
  (0, 278)	1
  (0, 95)	1
  (0, 491)	2
  (0, 12)	1
  (0, 49)	1
  (0, 216)	1
  (0, 248)	1
  (0, 86)	1
  (0, 463)	1
  (0, 1129)	2
  (0, 1111)	1
  (0, 1037)	1
  (0, 630)	2
  (0, 928)	1
  :	:
  (8288, 695)	1
  (8288, 1033)	2
  (8288, 336)	3
  (8288, 981)	1
  (8288, 590)	1
  (8288, 23)	1
  (8288, 435)	2
  (8288, 607)	4
  (8288, 343)	2
  (8288, 600)	3
  (8288, 549)	3
  (8288, 304)	1
  (8288, 561)	2
  (8288, 556)	3
  (8288, 688)	1
  (8288, 105)	5
  (8288, 979)	2
  (8288, 302)	1
  (8288, 368)	3
  (8288, 303)	2
  (8288, 367)	1
  (8288, 557)	2
  (8288, 608)	5
  (8288, 601)	1
  (8288, 548)	2


# Training with MLPClassifier

In [8]:
classifier = MLPClassifier(hidden_layer_sizes=(10, 10, 10), max_iter=100)
classifier = classifier.fit(X_train, Y_train)

In [9]:
predicts = classifier.predict(X_test)
print(predicts)

[1 1 0 ... 0 1 0]


# Evaluation

<h2>Accuracy Score</h2>
<p><strong>Classification accuracy:</strong> the percentage of samples the classifier predicts correctly</p>

In [10]:
accuracy_score(Y_test, predicts)

0.9994370954123276

<p>Example comparison of the predicted labels with the actual labels</p>

In [11]:
print('Predicted: ', predicts[0:20].tolist())
print('Actual   : ', Y_test[0:20])

Predicted:  [1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
Actual   :  [1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0]


<h2>Confusion Matrix</h2>
<p>A table that describes the performance of a classification model.</p>
<p style="text-align: justify;"><strong>Basic terminology</strong></p>
<ul>
<li><strong>True Positive (TP):</strong> the model correctly predicts that the sample belongs to class "1"</li>
<li><strong>True Negative (TN):</strong> the model correctly predicts that the sample belongs to class "0"</li>
<li><strong>False Positive (FP):</strong> the model wrongly predicts that the sample belongs to class "1"</li>
<li><strong>False Negative (FN):</strong> the model wrongly predicts that the sample belongs to class "0"</li>
</ul>
<p>Example:</p>
<p><img src="images/confusion_matrix2.png" alt="" /></p>
<p>From the confusion matrix we can compute the following evaluation metrics:</p>
<ul>
<li style="padding: 0px; margin: 0px;"><strong style="padding: 0px; margin: 0px; font-weight: bold;">Accuracy:</strong> how often the classifier predicts correctly overall
<ul style="padding: 0px; margin: 0px 0px 0px 3em;">
<li style="padding: 0px; margin: 0px;">(TP+TN)/total = (100+50)/165 = 0.91</li>
</ul>
</li>
<li style="padding: 0px; margin: 0px;"><strong style="padding: 0px; margin: 0px; font-weight: bold;">Misclassification Rate:</strong> how often the classifier predicts incorrectly overall<br />
<ul style="padding: 0px; margin: 0px 0px 0px 3em;">
<li style="padding: 0px; margin: 0px;">(FP+FN)/total = (10+5)/165 = 0.09</li>
<li style="padding: 0px; margin: 0px;">equivalent to 1 minus Accuracy</li>
<li style="padding: 0px; margin: 0px;">also known as "Error Rate"</li>
</ul>
</li>
<li style="padding: 0px; margin: 0px;"><strong style="padding: 0px; margin: 0px; font-weight: bold;">True Positive Rate:</strong> when the actual class is "1", how often does the classifier predict "1"?
<ul style="padding: 0px; margin: 0px 0px 0px 3em;">
<li style="padding: 0px; margin: 0px;">TP/actual yes = 100/105 = 0.95</li>
<li style="padding: 0px; margin: 0px;">also known as "Sensitivity" or "Recall"</li>
</ul>
</li>
<li style="padding: 0px; margin: 0px;"><strong style="padding: 0px; margin: 0px; font-weight: bold;">False Positive Rate:</strong>&nbsp;when the actual class is "0", how often does the classifier predict "1"? <br />
<ul style="padding: 0px; margin: 0px 0px 0px 3em;">
<li style="padding: 0px; margin: 0px;">FP/actual no = 10/60 = 0.17</li>
</ul>
</li>
<li style="padding: 0px; margin: 0px;"><strong style="padding: 0px; margin: 0px; font-weight: bold;">True Negative Rate:</strong>&nbsp;when the actual class is "0", how often does the classifier predict "0"? <br />
<ul style="padding: 0px; margin: 0px 0px 0px 3em;">
<li style="padding: 0px; margin: 0px;">TN/actual no = 50/60 = 0.83</li>
<li style="padding: 0px; margin: 0px;">equivalent to 1 minus False Positive Rate</li>
<li style="padding: 0px; margin: 0px;">also known as "Specificity"</li>
</ul>
</li>
<li style="padding: 0px; margin: 0px;"><strong style="padding: 0px; margin: 0px; font-weight: bold;">Precision:</strong> when the classifier predicts class "1", how often is that prediction correct?
<ul style="padding: 0px; margin: 0px 0px 0px 3em;">
<li style="padding: 0px; margin: 0px;">TP/predicted yes = 100/110 = 0.91</li>
</ul>
</li>
</ul>
<p><i>source: <a href="https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/">https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/</a></i><br>
https://www.youtube.com/watch?v=85dtiMz9tSo
</p>
<p><img src=images/1_7EYylA6XlXSGBCF77j_rOA.png /><br>
<i>source: <a href="https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62">https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62</a></i>
</p>

In [12]:
tn, fp, fn, tp = confusion_matrix(Y_test, predicts).ravel()
print(tn, fp, fn, tp)

2183 1 1 1368
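
The unpacking order `tn, fp, fn, tp` follows from how scikit-learn lays out a binary confusion matrix: rows are actual classes and columns are predicted classes, both sorted by label, so `ravel()` flattens `[[TN, FP], [FN, TP]]` row by row. The sketch below checks this on a tiny hand-made example. The toy labels are illustrative only.

```python
from sklearn.metrics import confusion_matrix

# Actual labels: three samples of class 0, four of class 1
y_true = [0, 0, 0, 1, 1, 1, 1]
# One class-0 sample is predicted as 1 (FP); one class-1 sample is predicted as 0 (FN)
y_pred = [0, 0, 1, 1, 1, 1, 0]

matrix = confusion_matrix(y_true, y_pred)
print(matrix)
# [[2 1]
#  [1 3]]

tn, fp, fn, tp = matrix.ravel()
print(tn, fp, fn, tp)  # 2 1 1 3
```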


In [13]:
total         = tn+fp+fn+tp
accuracy      = (tp+tn) / total
misclass_rate = (fp+fn) / total

print('Total    = ', total)
print('Accuracy = ', accuracy)
print('Misclassification Rate = ', misclass_rate)

Total    =  3553
Accuracy =  0.9994370954123276
Misclassification Rate =  0.0005629045876723895
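
The remaining metrics from the list above can be computed from the same four counts. The following is a minimal sketch that reuses the counts printed by the notebook (`2183 1 1 1368`). The helper name `binary_metrics` is hypothetical and not part of the original code.

```python
def binary_metrics(tn, fp, fn, tp):
    """Return rate-based metrics derived from binary confusion-matrix counts."""
    return {
        "true_positive_rate": tp / (tp + fn),   # sensitivity / recall
        "false_positive_rate": fp / (fp + tn),
        "true_negative_rate": tn / (tn + fp),   # specificity
        "precision": tp / (tp + fp),
    }

# Counts reported by confusion_matrix(Y_test, predicts).ravel() in the notebook
metrics = binary_metrics(tn=2183, fp=1, fn=1, tp=1368)
for name, value in metrics.items():
    print(f"{name:20s} = {value:.6f}")
```

With a single false positive and a single false negative, all four rates sit very close to their ideal values, which is consistent with the accuracy reported above.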
