<a href="https://colab.research.google.com/github/MJ-best/-DACON-AI-/blob/main/%5BDACON%5DGenome_Info_clf_AI_20221230.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

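# [DACON] Genomic Information Breed Classification AI Competition
- https://dacon.io/competitions/official/236035/overview/description
- Develop an AI model that classifies breeds from individual and SNP information
- Single Nucleotide Polymorphism (SNP) information is genomic variation data obtained from genome sequences. Because variation patterns can differ between individuals and between breeds, SNPs can be used to identify an individual or to tell breeds apart. The goal of this competition is therefore to use individual and SNP information to classify breeds A, B, and C with the highest possible accuracy.

- In agriculture, livestock, and fisheries, research that distinguishes breeds by genomic variation is widely used to assess breed diversity and to prevent fraudulent distribution of breeds.

- A SNP is a difference in the base sequence (A, T, G, C) of DNA. Measuring how much the sequences differ between individuals makes it possible to confirm breed composition clearly at the molecular level.

- Achieving **high classification performance with fewer SNPs** matters more than classifying with many SNPs.

- This competition therefore requires **developing a breed classification model based on individual information and 15 preselected SNPs**.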



In [1]:
import datetime
from pytz import timezone

print("last update : ",datetime.datetime.now(timezone('Asia/Seoul')))

last update :  2022-12-30 23:22:30.195030+09:00


In [2]:
import io
import os
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt

from sklearn import preprocessing

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, f1_score
from sklearn.model_selection import GridSearchCV, cross_val_score, cross_validate

import tensorflow as tf

## Data preparation

In [3]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [4]:
path = '/content/drive/MyDrive/Colab Notebooks/[DACON]유전체정보_품좀_분류_AI_경진대회/data'
os.chdir(path)
df = pd.read_csv("train.csv", encoding='utf-8-sig', on_bad_lines='skip')


In [5]:
df

Unnamed: 0,id,father,mother,gender,trait,SNP_01,SNP_02,SNP_03,SNP_04,SNP_05,...,SNP_07,SNP_08,SNP_09,SNP_10,SNP_11,SNP_12,SNP_13,SNP_14,SNP_15,class
0,TRAIN_000,0,0,0,2,G G,A G,A A,G A,C A,...,A A,G G,A A,G G,A G,A A,A A,A A,A A,B
1,TRAIN_001,0,0,0,2,A G,A G,C A,A A,A A,...,A A,G A,A A,A G,A A,G A,G G,A A,A A,C
2,TRAIN_002,0,0,0,2,G G,G G,A A,G A,C C,...,A A,G A,G A,A G,A A,A A,A A,A A,A A,B
3,TRAIN_003,0,0,0,1,A A,G G,A A,G A,A A,...,G G,A A,G G,A G,G G,G G,G G,A A,G G,A
4,TRAIN_004,0,0,0,2,G G,G G,C C,A A,C C,...,A A,A A,A A,G G,A A,A A,A G,A A,G A,C
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
257,TRAIN_257,0,0,0,2,A G,A G,A A,G A,C C,...,A A,G A,A A,G G,A G,G A,A A,A A,A A,B
258,TRAIN_258,0,0,0,2,G G,A A,C A,A A,A A,...,G A,G A,A A,A G,A G,A A,A G,A A,G A,C
259,TRAIN_259,0,0,0,1,A G,G G,A A,G A,A A,...,G G,G A,G A,A A,G G,G G,G G,C A,G G,A
260,TRAIN_260,0,0,0,1,A A,G G,A A,G A,A A,...,G G,A A,G A,A G,A G,G A,G G,C A,G G,A


In [6]:
data = df.drop(['id', 'father', 'mother', 'gender', 'trait','class'], axis = 1)

In [7]:
target = df['class']

In [8]:
info = pd.read_csv("snp_info.csv", encoding='utf-8-sig', on_bad_lines='skip')


In [9]:
info

Unnamed: 0,SNP_id,name,chrom,cm,pos
0,SNP_01,BTA-19852-no-rs,2,67.0546,42986890
1,SNP_02,ARS-USMARC-Parent-DQ647190-rs29013632,6,31.1567,13897068
2,SNP_03,ARS-BFGL-NGS-117009,6,68.2892,44649549
3,SNP_04,ARS-BFGL-NGS-60567,6,77.8749,53826064
4,SNP_05,BovineHD0600017032,6,80.5015,61779512
5,SNP_06,BovineHD0600017424,6,80.5954,63048481
6,SNP_07,Hapmap49442-BTA-111073,6,80.78,64037334
7,SNP_08,BovineHD0600018638,6,82.6856,67510588
8,SNP_09,ARS-BFGL-NGS-37727,6,86.874,73092782
9,SNP_10,BTB-01558306,7,62.0692,40827112


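- name : SNP name
- chrom : chromosome
- cm : genetic distance
- pos : position of each marker on the genome

- As far as I know, the smaller the genetic distance, the more similar two breeds are, and the larger it is, the more they differ (by computing genetic distance, we can estimate how long ago two populations separated). Note that the `cm` column in this file most likely gives each marker's genetic map position in centimorgans, which is a different quantity from genetic distance between populations.

- A SNP refers to a single position in the DNA. Usually only two nucleotides can occur at that position, so if SNP-1 can be A or G, the possible genotypes at SNP-1 are AA, AG, and GG.

- http://www.incodom.kr/SNP#h_2c088d3d06d5a44395884bd694e4f8ac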
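As a small illustration of the genotype counting described above: for a biallelic SNP, the unordered pairs of its two alleles give three genotypes. The helper name `genotypes` is hypothetical and only meant as a sketch.

```python
from itertools import combinations_with_replacement

def genotypes(allele_a, allele_b):
    """Return the unordered genotypes of a biallelic SNP, e.g. 'A A', 'A G', 'G G'."""
    alleles = sorted([allele_a, allele_b])
    return [" ".join(pair) for pair in combinations_with_replacement(alleles, 2)]

print(genotypes("A", "G"))  # ['A A', 'A G', 'G G']
```

In the training table some columns write the heterozygote as `G A` rather than `A G`; the count of three genotypes per column is the same either way.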

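## Data preprocessing
- class (y_train, y_test) is a multiclass target, so apply label encoding followed by one-hot encoding
- Each SNP takes only three possible values, so encode each column separately
- Scale the resulting 0, 1, 2 codes to the 0~1 range with MinMax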

In [10]:
class CFG:
    SEED = 42

In [11]:
import random

def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
seed_everything(CFG.SEED) # fix the seed

In [12]:
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size = 0.2, random_state=CFG.SEED, stratify=target)

- Apply one-hot encoding to the y data (labels)

In [13]:
# First, the label encoder
target_label = preprocessing.LabelEncoder()
y_train = target_label.fit_transform(y_train)

In [14]:
# One-hot encoder
from keras.utils import np_utils
y_train = np_utils.to_categorical(y_train)
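To make the two-step encoding concrete, here is a minimal NumPy sketch of what label encoding followed by one-hot encoding produces for three classes, and how `np.argmax` recovers the integer labels later in the notebook. The sample labels are made up for illustration.

```python
import numpy as np

labels = np.array(["B", "C", "B", "A"])                     # illustrative class names
classes, encoded = np.unique(labels, return_inverse=True)   # sorted classes -> integer codes
one_hot = np.eye(len(classes))[encoded]                     # one row per sample, one column per class

decoded = classes[np.argmax(one_hot, axis=1)]               # reverse: one-hot -> code -> class name
print(encoded)   # [1 2 1 0]
print(decoded)   # ['B' 'C' 'B' 'A']
```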

- Encode each SNP separately

In [15]:
snp_1 =  preprocessing.LabelEncoder()
snp_2 =  preprocessing.LabelEncoder()
snp_3 =  preprocessing.LabelEncoder()
snp_4 =  preprocessing.LabelEncoder()
snp_5 =  preprocessing.LabelEncoder()
snp_6 =  preprocessing.LabelEncoder()
snp_7 =  preprocessing.LabelEncoder()
snp_8 =  preprocessing.LabelEncoder()
snp_9 =  preprocessing.LabelEncoder()
snp_10 =  preprocessing.LabelEncoder()
snp_11 =  preprocessing.LabelEncoder()
snp_12 =  preprocessing.LabelEncoder()
snp_13 =  preprocessing.LabelEncoder()
snp_14 =  preprocessing.LabelEncoder()
snp_15 =  preprocessing.LabelEncoder()

In [16]:
snp_1.fit(data['SNP_01'])
snp_2.fit(data['SNP_02'])
snp_3.fit(data['SNP_03'])
snp_4.fit(data['SNP_04'])
snp_5.fit(data['SNP_05'])
snp_6.fit(data['SNP_06'])
snp_7.fit(data['SNP_07'])
snp_8.fit(data['SNP_08'])
snp_9.fit(data['SNP_09'])
snp_10.fit(data['SNP_10'])
snp_11.fit(data['SNP_11'])
snp_12.fit(data['SNP_12'])
snp_13.fit(data['SNP_13'])
snp_14.fit(data['SNP_14'])
snp_15.fit(data['SNP_15'])

LabelEncoder()

In [17]:
X_train['SNP_01'] = snp_1.transform(X_train['SNP_01'])
X_train['SNP_02'] = snp_2.transform(X_train['SNP_02'])
X_train['SNP_03'] = snp_3.transform(X_train['SNP_03'])
X_train['SNP_04'] = snp_4.transform(X_train['SNP_04'])
X_train['SNP_05'] = snp_5.transform(X_train['SNP_05'])
X_train['SNP_06'] = snp_6.transform(X_train['SNP_06'])
X_train['SNP_07'] = snp_7.transform(X_train['SNP_07'])
X_train['SNP_08'] = snp_8.transform(X_train['SNP_08'])
X_train['SNP_09'] = snp_9.transform(X_train['SNP_09'])
X_train['SNP_10'] = snp_10.transform(X_train['SNP_10'])
X_train['SNP_11'] = snp_11.transform(X_train['SNP_11'])
X_train['SNP_12'] = snp_12.transform(X_train['SNP_12'])
X_train['SNP_13'] = snp_13.transform(X_train['SNP_13'])
X_train['SNP_14'] = snp_14.transform(X_train['SNP_14'])
X_train['SNP_15'] = snp_15.transform(X_train['SNP_15'])

In [18]:
X_train.head()

Unnamed: 0,SNP_01,SNP_02,SNP_03,SNP_04,SNP_05,SNP_06,SNP_07,SNP_08,SNP_09,SNP_10,SNP_11,SNP_12,SNP_13,SNP_14,SNP_15
123,1,1,0,2,0,2,2,0,2,0,2,1,2,1,1
189,2,0,1,0,0,1,0,1,0,1,0,0,1,0,0
49,0,2,0,1,0,2,2,0,1,0,2,1,2,1,2
198,2,1,1,0,2,0,0,2,1,2,2,0,1,0,1
29,1,2,0,1,0,2,2,0,1,0,2,2,2,0,2


In [19]:
X_test['SNP_01'] = snp_1.transform(X_test['SNP_01'])
X_test['SNP_02'] = snp_2.transform(X_test['SNP_02'])
X_test['SNP_03'] = snp_3.transform(X_test['SNP_03'])
X_test['SNP_04'] = snp_4.transform(X_test['SNP_04'])
X_test['SNP_05'] = snp_5.transform(X_test['SNP_05'])
X_test['SNP_06'] = snp_6.transform(X_test['SNP_06'])
X_test['SNP_07'] = snp_7.transform(X_test['SNP_07'])
X_test['SNP_08'] = snp_8.transform(X_test['SNP_08'])
X_test['SNP_09'] = snp_9.transform(X_test['SNP_09'])
X_test['SNP_10'] = snp_10.transform(X_test['SNP_10'])
X_test['SNP_11'] = snp_11.transform(X_test['SNP_11'])
X_test['SNP_12'] = snp_12.transform(X_test['SNP_12'])
X_test['SNP_13'] = snp_13.transform(X_test['SNP_13'])
X_test['SNP_14'] = snp_14.transform(X_test['SNP_14'])
X_test['SNP_15'] = snp_15.transform(X_test['SNP_15'])
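The per-column encoding above can also be written as a loop. The following is a sketch, assuming a dictionary of encoders keyed by column name (the name `encoders` and the toy frames are illustrative, not the competition data). As in the cells above, each encoder is fitted on the full data so that every genotype seen in either split has a code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for the SNP columns (illustrative values only)
full = pd.DataFrame({"SNP_01": ["A A", "A G", "G G", "A G"],
                     "SNP_02": ["G G", "A G", "A A", "G G"]})
train, test = full.iloc[:3].copy(), full.iloc[3:].copy()

encoders = {}
for col in full.columns:
    encoders[col] = LabelEncoder().fit(full[col])       # fit on all rows
    train[col] = encoders[col].transform(train[col])    # then transform each split
    test[col] = encoders[col].transform(test[col])

print(train)
print(test)
```

Keeping the fitted encoders in a dictionary also makes it straightforward to apply the same mapping to another file later.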

In [20]:
X_test.head(5)

Unnamed: 0,SNP_01,SNP_02,SNP_03,SNP_04,SNP_05,SNP_06,SNP_07,SNP_08,SNP_09,SNP_10,SNP_11,SNP_12,SNP_13,SNP_14,SNP_15
15,0,2,0,2,0,2,2,0,2,1,1,1,2,2,2
139,2,2,1,1,2,0,0,2,0,2,0,0,0,0,1
204,0,2,0,1,0,2,2,0,1,1,2,2,2,0,2
114,2,2,1,0,1,1,0,2,0,2,1,0,1,0,0
16,0,2,0,2,0,2,2,0,1,2,1,2,2,1,2


In [21]:
scaler = preprocessing.MinMaxScaler()
scaler.fit(X_train)
X_train_scaled_ = scaler.transform(X_train)
X_test_scaled_ = scaler.transform(X_test)

In [22]:
X_train_scaled = pd.DataFrame(X_train_scaled_, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled_, columns=X_test.columns, index=X_test.index)

In [23]:
X_train_scaled.head(5)

Unnamed: 0,SNP_01,SNP_02,SNP_03,SNP_04,SNP_05,SNP_06,SNP_07,SNP_08,SNP_09,SNP_10,SNP_11,SNP_12,SNP_13,SNP_14,SNP_15
123,0.5,0.5,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.5,1.0,0.5,0.5
189,1.0,0.0,0.5,0.0,0.0,0.5,0.0,0.5,0.0,0.5,0.0,0.0,0.5,0.0,0.0
49,0.0,1.0,0.0,0.5,0.0,1.0,1.0,0.0,0.5,0.0,1.0,0.5,1.0,0.5,1.0
198,1.0,0.5,0.5,0.0,1.0,0.0,0.0,1.0,0.5,1.0,1.0,0.0,0.5,0.0,0.5
29,0.5,1.0,0.0,0.5,0.0,1.0,1.0,0.0,0.5,0.0,1.0,1.0,1.0,0.0,1.0
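The scaled values follow directly from the min-max formula x' = (x - min) / (max - min). A worked sketch in plain Python, using the first five SNP_01 codes shown earlier (the helper name `min_max_scale` is illustrative):

```python
def min_max_scale(values):
    """Scale a list of numbers to the [0, 1] range (assumes max > min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([1, 2, 0, 2, 1]))  # [0.5, 1.0, 0.0, 1.0, 0.5]
```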


## Evaluation metrics
- Models are evaluated with macro F1

In [24]:
# Evaluation metrics for the machine learning models
def get_clf_eval(y_test, pred):
    confusion = confusion_matrix(y_test, pred)
    accuracy = accuracy_score(y_test, pred)*100
    precision = precision_score(y_test, pred, average='macro')*100
    recall = recall_score(y_test, pred, average='macro')*100
    f1_macro = f1_score(y_test, pred,  average='macro')
    
    print('Confusion matrix')
    print(confusion)
    print('Accuracy :',"%.1f"%accuracy+'%')
    print('Precision :',"%.1f"%precision+'%')
    print('Recall :', "%.1f"%recall+'%')
    print('Macro F1 score :', f1_macro)

In [25]:
# Evaluation metrics for deep learning (Keras)
from keras import backend as K
def recall(y_target, y_pred):
    # clip(t, clip_value_min, clip_value_max): clamps values outside [clip_value_min, clip_value_max] to the bounds
    # round: rounds to the nearest integer
    y_target_yn = K.round(K.clip(y_target, 0, 1)) # set the actual values to 0 (Negative) or 1 (Positive)
    y_pred_yn = K.round(K.clip(y_pred, 0, 1)) # set the predicted values to 0 (Negative) or 1 (Positive)

    # True Positive: both the actual and the predicted value are 1 (Positive)
    count_true_positive = K.sum(y_target_yn * y_pred_yn) 

    # (True Positive + False Negative) = all cases whose actual value is 1 (Positive)
    count_true_positive_false_negative = K.sum(y_target_yn)

    # Recall =  (True Positive) / (True Positive + False Negative)
    # K.epsilon() adds a small number to guard against division by zero
    recall = count_true_positive / (count_true_positive_false_negative + K.epsilon())

    # return a single tensor value
    return recall


def precision(y_target, y_pred):
    # clip(t, clip_value_min, clip_value_max): clamps values outside [clip_value_min, clip_value_max] to the bounds
    # round: rounds to the nearest integer
    y_pred_yn = K.round(K.clip(y_pred, 0, 1)) # set the predicted values to 0 (Negative) or 1 (Positive)
    y_target_yn = K.round(K.clip(y_target, 0, 1)) # set the actual values to 0 (Negative) or 1 (Positive)

    # True Positive: both the actual and the predicted value are 1 (Positive)
    count_true_positive = K.sum(y_target_yn * y_pred_yn) 

    # (True Positive + False Positive) = all cases whose predicted value is 1 (Positive)
    count_true_positive_false_positive = K.sum(y_pred_yn)

    # Precision = (True Positive) / (True Positive + False Positive)
    # K.epsilon() adds a small number to guard against division by zero
    precision = count_true_positive / (count_true_positive_false_positive + K.epsilon())

    # return a single tensor value
    return precision


def f1score(y_target, y_pred):
    _recall = recall(y_target, y_pred)
    _precision = precision(y_target, y_pred)
    # K.epsilon() adds a small number to guard against division by zero
    _f1score = ( 2 * _recall * _precision) / (_recall + _precision+ K.epsilon())
    
    # return a single tensor value
    return _f1score
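Because the functions above round the softmax outputs and pool the counts over every class in the batch, they behave like micro-averaged precision and recall with a 0.5 threshold, not like the macro-averaged scikit-learn metrics used in `get_clf_eval`. A NumPy sketch of the same arithmetic (the helper name `batch_precision_recall` and the sample arrays are illustrative):

```python
import numpy as np

def batch_precision_recall(y_true, y_prob, eps=1e-7):
    """NumPy mirror of the Keras-backend metrics: counts are pooled over all classes."""
    y_true_bin = np.round(np.clip(y_true, 0, 1))
    y_pred_bin = np.round(np.clip(y_prob, 0, 1))   # a probability counts as positive only above 0.5
    tp = np.sum(y_true_bin * y_pred_bin)
    precision = tp / (np.sum(y_pred_bin) + eps)
    recall = tp / (np.sum(y_true_bin) + eps)
    return precision, recall

y_true = np.array([[1, 0, 0], [0, 1, 0]])
y_prob = np.array([[0.90, 0.05, 0.05],    # confident and correct
                   [0.40, 0.35, 0.25]])   # no class above 0.5 -> no positive prediction
p, r = batch_precision_recall(y_true, y_prob)
print(round(p, 3), round(r, 3))  # 1.0 0.5
```

The second sample has the correct arg-max class, yet it contributes no positive prediction here, so these training-time numbers can differ from the final macro F1.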

## Machine learning algorithms

### Training with a random forest model

In [26]:
clf = RandomForestClassifier(random_state=CFG.SEED)
clf.fit(X_train_scaled, y_train)

RandomForestClassifier(random_state=42)

In [27]:
pred = clf.predict(X_test_scaled)

In [28]:
index = np.argmax(pred, axis=1)  # one-hot prediction -> integer label
preds = target_label.inverse_transform(index)  # integer label -> class name
get_clf_eval(y_test, preds)

Confusion matrix
[[14  0  0]
 [ 0 23  0]
 [ 2  2 12]]
Accuracy : 92.5%
Precision : 93.2%
Recall : 91.7%
Macro F1 score : 0.9162698412698412
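The macro F1 value can be checked by hand from the confusion matrix: compute F1 = 2TP / (2TP + FP + FN) per class, then take the unweighted mean. A plain-Python sketch (the function name `macro_f1` is illustrative; rows are true classes and columns are predicted classes, as in scikit-learn's `confusion_matrix`):

```python
def macro_f1(confusion):
    """Macro F1 from a square confusion matrix (rows: true class, columns: predicted class)."""
    n = len(confusion)
    scores = []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[r][k] for r in range(n)) - tp   # predicted k, actually another class
        fn = sum(confusion[k]) - tp                        # actually k, predicted another class
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n

cm = [[14, 0, 0], [0, 23, 0], [2, 2, 12]]
print(round(macro_f1(cm), 4))  # 0.9163
```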


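### Training only on the features most useful for classification
- December 30, 2022
- I tried this, but neither the machine learning model nor the deep learning model performed well
- Extra-Trees : Macro F1 score 0.17316017316017315
- Deep learning : Macro F1 score 0.07165532879818594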

In [29]:
# Select features by importance, then train on the selected feature matrix
from sklearn.feature_selection import SelectFromModel

selector = SelectFromModel(clf, threshold = 'median')

In [46]:
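X_train_feature_important = selector.fit_transform(X_train_scaled, y_train)
# Reuse the selector fitted on the training split. Calling fit_transform on the test split
# (as in the run whose scores are recorded below) refits it and can select different columns.
X_test_feature_important = selector.transform(X_test_scaled)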
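A feature selector is usually fitted on the training split only and then applied to the test split with `transform`, so that both splits keep the same columns. A minimal sketch on synthetic data (the arrays and variable names are illustrative, not the competition data):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(42)
X_tr = rng.integers(0, 3, size=(80, 6)).astype(float)   # toy genotype codes 0/1/2
y_tr = (X_tr[:, 0] + X_tr[:, 1] > 2).astype(int)        # label depends on the first two columns
X_te = rng.integers(0, 3, size=(20, 6)).astype(float)

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=42), threshold="median"
)
X_tr_sel = selector.fit_transform(X_tr, y_tr)   # learn which columns to keep from the training split
X_te_sel = selector.transform(X_te)             # apply the same column mask to the test split
mask = selector.get_support()

print(mask, X_tr_sel.shape, X_te_sel.shape)
```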

In [47]:
# Train Extra-Trees on the selected features only
from sklearn.ensemble import ExtraTreesClassifier
clf_important = ExtraTreesClassifier(n_estimators=100, random_state=CFG.SEED)
clf_important.fit(X_train_feature_important, y_train)

ExtraTreesClassifier(random_state=42)

In [48]:
pred1 = clf_important.predict(X_test_feature_important)

In [49]:
# Decode the one-hot output first, then map back with the label encoder
index1 = np.argmax(pred1, axis=1)

In [50]:
# Map back to class names with the label encoder
preds1 = pd.DataFrame()
preds1 = target_label.inverse_transform(index1)

In [51]:
get_clf_eval(y_test, preds1)

Confusion matrix
[[ 8  1  5]
 [21  2  0]
 [13  2  1]]
Accuracy : 20.8%
Precision : 25.2%
Recall : 24.0%
Macro F1 score : 0.17316017316017315


In [61]:
model = tf.keras.Sequential([
                             tf.keras.layers.Dense(units=8, activation='relu', input_dim=X_train_feature_important.shape[1]),
                             tf.keras.layers.Dense(units=100, activation='relu'),
                             tf.keras.layers.Dense(units=100, activation='relu'),
                             tf.keras.layers.Dense(units=3, activation='softmax')
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.07),
              loss='categorical_crossentropy', metrics=['accuracy', precision, recall, f1score])

model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_4 (Dense)             (None, 8)                 72        
                                                                 
 dense_5 (Dense)             (None, 100)               900       
                                                                 
 dense_6 (Dense)             (None, 100)               10100     
                                                                 
 dense_7 (Dense)             (None, 3)                 303       
                                                                 
Total params: 11,375
Trainable params: 11,375
Non-trainable params: 0
_________________________________________________________________


In [None]:
history = model.fit(X_train_feature_important, y_train, epochs=1000, batch_size=50, validation_split=0.25, callbacks=[tf.keras.callbacks.EarlyStopping(patience=10, monitor='val_loss')])

In [63]:
y_hat = model.predict(X_test_feature_important)



In [64]:
index = np.argmax(y_hat, axis=1)

In [65]:
preds = pd.DataFrame()
preds = target_label.inverse_transform(index)

In [66]:
get_clf_eval(y_test, preds)

Confusion matrix
[[ 2  2 10]
 [21  0  2]
 [12  2  2]]
Accuracy : 7.5%
Precision : 6.7%
Recall : 8.9%
Macro F1 score : 0.07165532879818594


### Finding the best model

In [52]:
pip install -q lazypredict

In [None]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(verbose=1, predictions=True)

# Pass 1-D integer labels: undo the one-hot encoding of y_train and encode y_test
# with the label encoder already fitted on the training labels
y_train_labels = np.argmax(y_train, axis=1)
y_test_labels = target_label.transform(y_test)

models, predictions = clf.fit(X_train, X_test, y_train_labels, y_test_labels)

models

## Solving the problem with a deep learning multiclass classifier

In [67]:
model = tf.keras.Sequential([
                             tf.keras.layers.Dense(units=125, activation='relu', input_dim=X_train_scaled.shape[1]),
                             tf.keras.layers.Dense(units=100, activation='relu'),
                             tf.keras.layers.Dense(units=100, activation='relu'),
                             tf.keras.layers.Dense(units=3, activation='softmax')
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.07),
              loss='categorical_crossentropy', metrics=['accuracy', precision, recall, f1score])

model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_8 (Dense)             (None, 125)               2000      
                                                                 
 dense_9 (Dense)             (None, 100)               12600     
                                                                 
 dense_10 (Dense)            (None, 100)               10100     
                                                                 
 dense_11 (Dense)            (None, 3)                 303       
                                                                 
Total params: 25,003
Trainable params: 25,003
Non-trainable params: 0
_________________________________________________________________


In [None]:
history = model.fit(X_train_scaled, y_train, epochs=1000, batch_size=50, validation_split=0.25, callbacks=[tf.keras.callbacks.EarlyStopping(patience=10, monitor='val_loss')])

In [69]:
y_hat = model.predict(X_test_scaled)



In [70]:
index = np.argmax(y_hat, axis=1)

In [71]:
preds = pd.DataFrame()
preds = target_label.inverse_transform(index)

In [72]:
get_clf_eval(y_test, preds)

Confusion matrix
[[14  0  0]
 [ 0 22  1]
 [ 0  1 15]]
Accuracy : 96.2%
Precision : 96.5%
Recall : 96.5%
Macro F1 score : 0.9646739130434782


## Building a much deeper model with dropout
- Added dropout layers
- Used the swish activation, which is said to suffer less from vanishing gradients than relu in deep models
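For reference, swish is defined as swish(x) = x * sigmoid(x). The plain-Python sketch below compares its derivative with that of relu; the function names are illustrative. For negative inputs the relu gradient is exactly zero, while the swish gradient stays small but nonzero, which is the property the note above refers to.

```python
import math

def relu_grad(x):
    """Derivative of relu: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def swish_grad(x):
    """Derivative of swish(x) = x * sigmoid(x)."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s + x * s * (1.0 - s)

for x in (-3.0, -1.0, 1.0):
    print(x, relu_grad(x), round(swish_grad(x), 4))
```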

In [None]:
model1 = tf.keras.Sequential([
                             tf.keras.layers.Dense(units=1050, activation='swish', input_dim=X_train_scaled.shape[1]),
                             tf.keras.layers.Dense(units=100, activation='swish'),
                             tf.keras.layers.Dropout(0.2), 
                             tf.keras.layers.Dense(units=100, activation='swish'),
                             tf.keras.layers.Dropout(0.2), 
                             tf.keras.layers.Dense(units=100, activation='swish'),
                             tf.keras.layers.Dropout(0.2), 
                             tf.keras.layers.Dense(units=100, activation='swish'),
                             tf.keras.layers.Dropout(0.2), 
                             tf.keras.layers.Dense(units=100, activation='swish'),
                             tf.keras.layers.Dropout(0.2), 
                             tf.keras.layers.Dense(units=100, activation='swish'),
                             tf.keras.layers.Dense(units=3, activation='softmax')
])

model1.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.07),
              loss='categorical_crossentropy', metrics=['accuracy', precision, recall, f1score])

model1.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 1050)              16800     
                                                                 
 dense_1 (Dense)             (None, 100)               105100    
                                                                 
 dropout (Dropout)           (None, 100)               0         
                                                                 
 dense_2 (Dense)             (None, 100)               10100     
                                                                 
 dropout_1 (Dropout)         (None, 100)               0         
                                                                 
 dense_3 (Dense)             (None, 100)               10100     
                                                                 
 dropout_2 (Dropout)         (None, 100)               0

In [None]:
history = model1.fit(X_train_scaled, y_train, epochs=1000, batch_size=32, validation_split=0.25, callbacks=[tf.keras.callbacks.EarlyStopping(patience=10, monitor='val_loss')])

Epoch 1/1000
Epoch 2/1000
Epoch 3/1000
Epoch 4/1000
Epoch 5/1000
Epoch 6/1000
Epoch 7/1000
Epoch 8/1000
Epoch 9/1000
Epoch 10/1000
Epoch 11/1000
Epoch 12/1000
Epoch 13/1000
Epoch 14/1000
Epoch 15/1000


In [None]:
y_hat1 = model1.predict(X_test_scaled)



In [None]:
index1 = np.argmax(y_hat1, axis=1)

In [None]:
preds1 = pd.DataFrame()
preds1 = target_label.inverse_transform(index1)

In [None]:
get_clf_eval(y_test, preds1)

Confusion matrix
[[ 0 14  0]
 [ 0 23  0]
 [ 0 16  0]]
Accuracy : 43.4%
Precision : 14.5%
Recall : 33.3%
Macro F1 score : 0.20175438596491227




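- Somehow the results came out even worse: the confusion matrix shows the model predicting the same class for every test sample. The high learning rate (0.07) combined with the much deeper network may be a contributing factor.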


## Data visualization

In [None]:
pip install -q autoviz


In [None]:
pip install matplotlib==3.1.3

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting matplotlib==3.1.3
  Downloading matplotlib-3.1.3-cp38-cp38-manylinux1_x86_64.whl (13.1 MB)
Installing collected packages: matplotlib
  Attempting uninstall: matplotlib
    Found existing installation: matplotlib 3.6.2
    Uninstalling matplotlib-3.6.2:
      Successfully uninstalled matplotlib-3.6.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
autoviz 0.1.58 requires matplotlib>=3.3.3, but you have matplotlib 3.1.3 which is incompatible.
Successfully installed matplotlib-3.1.3


In [None]:
new_data = pd.concat([X_train, X_test])

In [None]:
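# Combine 1-D integer labels for both splits (y_train is one-hot encoded, y_test holds class names)
new_target = np.concatenate((np.argmax(y_train, axis=1), target_label.transform(y_test)), axis=0)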
new_target

array([0, 2, 0, 1, 0, 0, 1, 0, 1, 1, 1, 2, 1, 1, 0, 2, 2, 1, 1, 0, 1, 0,
       1, 0, 2, 0, 2, 1, 1, 1, 1, 1, 1, 1, 2, 0, 1, 2, 2, 0, 1, 2, 1, 1,
       1, 1, 0, 1, 0, 1, 1, 1, 1, 2, 1, 1, 1, 0, 1, 1, 1, 1, 2, 1, 1, 1,
       1, 1, 2, 2, 1, 1, 0, 0, 0, 2, 1, 0, 0, 0, 1, 0, 1, 0, 2, 2, 0, 0,
       0, 0, 2, 0, 2, 2, 2, 2, 2, 2, 1, 0, 1, 0, 1, 2, 1, 2, 1, 1, 2, 2,
       1, 0, 0, 1, 2, 1, 0, 1, 1, 2, 2, 2, 1, 1, 0, 2, 2, 2, 0, 2, 1, 2,
       0, 1, 0, 0, 0, 0, 0, 2, 1, 1, 0, 1, 0, 2, 2, 0, 0, 0, 1, 0, 0, 1,
       1, 2, 2, 1, 1, 1, 2, 2, 2, 0, 1, 0, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1,
       2, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 2, 1, 0, 2, 2, 1, 2, 2, 1, 0, 2,
       0, 2, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 2, 0, 2, 1, 2, 0,
       1, 1, 1, 2, 1, 0, 1, 2, 1, 0, 2, 2, 0, 0, 1, 1, 0, 2, 1, 2, 1, 1,
       2, 0, 1, 2, 1, 1, 2, 1, 2, 1, 1, 0, 2, 2, 1, 2, 0, 1, 0, 1])

In [None]:
new_data['new_target'] = new_target

In [None]:
new_data

Unnamed: 0,SNP_01,SNP_02,SNP_03,SNP_04,SNP_05,SNP_06,SNP_07,SNP_08,SNP_09,SNP_10,SNP_11,SNP_12,SNP_13,SNP_14,SNP_15,new_target
123,1,1,0,2,0,2,2,0,2,0,2,1,2,1,1,0
189,2,0,1,0,0,1,0,1,0,1,0,0,1,0,0,2
49,0,2,0,1,0,2,2,0,1,0,2,1,2,1,2,0
198,2,1,1,0,2,0,0,2,1,2,2,0,1,0,1,1
29,1,2,0,1,0,2,2,0,1,0,2,2,2,0,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
183,1,0,0,0,1,1,0,0,0,1,1,1,2,0,1,2
248,1,2,0,2,0,2,2,0,2,1,2,2,2,1,1,0
192,2,1,0,0,2,1,0,0,0,2,2,0,1,0,0,1
202,0,2,0,2,0,2,2,0,1,0,1,2,2,1,2,0


In [None]:
# EDA with autoviz
from autoviz.AutoViz_Class import AutoViz_Class

AV =AutoViz_Class()

AV.AutoViz(
    filename = '',
    dfte=new_data,
    depVar = 'new_target',
    verbose = 1,
    max_rows_analyzed = data.shape[0],
    max_cols_analyzed = data.shape[1],
    chart_format='svg'
)

Shape of your Data Set loaded: (262, 16)
#######################################################################################
######################## C L A S S I F Y I N G  V A R I A B L E S  ####################
#######################################################################################
Classifying variables in data set...
Data cleaning improvement suggestions. Complete them before proceeding to ML modeling.


Unnamed: 0,Nuniques,dtype,Nulls,Nullpercent,NuniquePercent,Value counts Min,Data cleaning improvement suggestions
SNP_01,3,int64,0,0.0,1.15,0,
SNP_02,3,int64,0,0.0,1.15,0,
SNP_03,3,int64,0,0.0,1.15,0,
SNP_04,3,int64,0,0.0,1.15,0,
SNP_05,3,int64,0,0.0,1.15,0,
SNP_06,3,int64,0,0.0,1.15,0,
SNP_07,3,int64,0,0.0,1.15,0,
SNP_08,3,int64,0,0.0,1.15,0,
SNP_09,3,int64,0,0.0,1.15,0,
SNP_10,3,int64,0,0.0,1.15,0,


    15 Predictors classified...
        No variables removed since no ID or low-information variables found in data set
Since Number of Rows in data 262 exceeds maximum, randomly sampling 262 rows for EDA...

################ Multi_Classification problem #####################
Number of variables = 15 exceeds limit, finding top 15 variables through XGBoost
    No categorical feature reduction done. All 0 Categorical vars selected 
    Removing correlated variables from 15 numerics using SULO method
Selecting all (15) variables since none of them are highly correlated...
    Adding 0 categorical variables to reduced numeric variables  of 15
############## F E A T U R E   S E L E C T I O N  ####################
Current number of predictors = 15 
    Finding Important Features using Boosted Trees algorithm...
        using 15 variables...
        using 12 variables...
        using 9 variables...
        using 6 variables...
        using 3 variables...
Found 13 important features
########

Unnamed: 0,Nuniques,dtype,Nulls,Nullpercent,NuniquePercent,Value counts Min,Data cleaning improvement suggestions
SNP_10,3,int64,0,0.0,1.15,0,
SNP_07,3,int64,0,0.0,1.15,0,
SNP_12,3,int64,0,0.0,1.15,0,
SNP_04,3,int64,0,0.0,1.15,0,
SNP_14,3,int64,0,0.0,1.15,0,
SNP_05,3,int64,0,0.0,1.15,0,
SNP_01,3,int64,0,0.0,1.15,0,
SNP_09,3,int64,0,0.0,1.15,0,
SNP_02,3,int64,0,0.0,1.15,0,
SNP_11,3,int64,0,0.0,1.15,0,


    13 Predictors classified...
    No variables removed since no ID or low-information variables found in data
    List of variables removed: []
Total Number of Scatter Plots = 91
No categorical or boolean vars in data set. Hence no pivot plots...
No categorical or numeric vars in data set. Hence no bar charts.
All Plots done
Time to run AutoViz = 8 seconds 

 ###################### AUTO VISUALIZATION Completed ########################


Unnamed: 0,SNP_10,SNP_07,SNP_12,SNP_04,SNP_14,SNP_05,SNP_01,SNP_09,SNP_02,SNP_11,SNP_13,SNP_08,SNP_15,new_target
59,0,0,0,0,0,1,0,0,0,0,1,1,0,0
145,2,0,0,0,0,2,1,1,0,2,0,2,1,1
233,1,0,0,0,0,1,2,0,2,0,2,0,1,0
154,2,0,0,0,0,1,1,0,1,1,1,2,1,1
31,0,1,2,2,2,0,0,0,2,2,1,0,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41,2,0,1,0,0,2,2,0,1,1,1,2,0,1
160,2,0,1,1,0,2,1,0,0,0,1,1,0,1
39,2,0,0,0,0,1,2,0,0,2,2,2,1,1
195,2,0,0,1,0,2,2,0,1,0,1,1,0,1
