# **Gaussian Naive Bayes**

Import Libraries
- Import the entire `datasets` module from the scikit-learn machine learning library (used here to generate random data)
- Import `pyplot` from the matplotlib library under the alias `plt`

In [None]:
from sklearn import datasets
import matplotlib.pyplot as plt

Generate some (random) data

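make_blobs function
- 100 : number of data points (samples)
- 2 : number of features
- centers : number of classes (cluster centers)
- random_state : seed for the random number generator (the same seed reproduces the same data)
- cluster_std : standard deviation of each cluster (controls how spread out the data is)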

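plt.scatter function
- X holds the feature values and Y the class labels
- c : dot color
- s : marker size
- cmap : colormap (maps the values passed to `c` to dot colors automatically)
- A scatter plot is a convenient way to show how the data is distributed in space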

In [None]:
X, Y = datasets.make_blobs(100, 2, centers=2, random_state=2, cluster_std=1.5)

plt.scatter(X[:,0], X[:,1], c=Y, s=50, cmap='RdBu')
plt.show()
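
To see what `random_state` and `cluster_std` do in practice, the sketch below generates the blobs three times. The helper `within_class_std` and the variable names are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn import datasets

def within_class_std(X, Y):
    """Average per-feature standard deviation inside each class (illustrative helper)."""
    return float(np.mean([X[Y == k].std(axis=0).mean() for k in np.unique(Y)]))

# Same seed twice: the generated data should be identical.
Xa, Ya = datasets.make_blobs(n_samples=100, n_features=2, centers=2,
                             random_state=2, cluster_std=1.5)
Xb, Yb = datasets.make_blobs(n_samples=100, n_features=2, centers=2,
                             random_state=2, cluster_std=1.5)

# Same seed, larger cluster_std: the clusters should be wider.
Xw, Yw = datasets.make_blobs(n_samples=100, n_features=2, centers=2,
                             random_state=2, cluster_std=2.5)

same_data = np.array_equal(Xa, Xb)
spread_narrow = within_class_std(Xa, Ya)
spread_wide = within_class_std(Xw, Yw)

print(Xa.shape, Ya.shape)
print('Same data for same seed :', same_data)
print('Within-class std        :', spread_narrow, spread_wide)
```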

**Train a Naive Bayes Model**
- Import only `GaussianNB` from scikit-learn's `naive_bayes` submodule.

In [None]:
from sklearn.naive_bayes import GaussianNB

- Create a model object that can be fitted to data
- `.<name>` : call a method that the library provides on the object
   ex) `.fit` : train the model / X : data, Y : class

In [None]:
model = GaussianNB()
model.fit(X, Y)

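Let's check out the distribution of the classes
- theta : mean of each distribution
- sigma : variance of each distribution (the attribute is named `var_` in scikit-learn 1.0 and later, `sigma_` in older versions)
=> There are 2 features and 2 classes, so each result below is 2*2.
=> Rows correspond to classes, columns to features.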

In [None]:
model.theta_    # mean of each feature per class (n_classes, n_features)

In [None]:
model.var_    # variance of each feature per class (n_classes, n_features); named sigma_ before scikit-learn 1.0
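
The fitted parameters can be reproduced by hand, which makes clear what `fit` estimates. This is a minimal sketch, assuming the same blob data as above; the `manual_*` names are illustrative, and the `hasattr` check is only there so the sketch works whether the variance attribute is called `var_` or `sigma_`. scikit-learn adds a tiny smoothing term to the variances, so the comparison uses a tolerance.

```python
import numpy as np
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

X, Y = datasets.make_blobs(n_samples=100, n_features=2, centers=2,
                           random_state=2, cluster_std=1.5)
model = GaussianNB().fit(X, Y)

# Per-class statistics computed directly with NumPy.
manual_mean = np.array([X[Y == k].mean(axis=0) for k in model.classes_])
manual_var = np.array([X[Y == k].var(axis=0) for k in model.classes_])
manual_prior = np.array([(Y == k).mean() for k in model.classes_])

# Variance attribute name depends on the scikit-learn version.
fitted_var = model.var_ if hasattr(model, 'var_') else model.sigma_

means_match = np.allclose(manual_mean, model.theta_)
vars_match = np.allclose(manual_var, fitted_var)
priors_match = np.allclose(manual_prior, model.class_prior_)

print('Means match     :', means_match)
print('Variances match :', vars_match)
print('Priors match    :', priors_match)
```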

In [None]:
plt.scatter(X[:,0], X[:,1], c=Y, s=50, cmap='RdBu')
# Mark the mean of each class
plt.scatter(model.theta_[:,0], model.theta_[:,1], marker='d', c=['r', 'b'], s=200)
plt.show()

**Performance Evaluation**
- Evaluate performance with the `metrics` module.
- Predictions are made with the `.predict` method.

In [None]:
from sklearn import metrics

In [None]:
# Predict
pred = model.predict(X)
# Predicted labels (first 10)
print(pred[:10])
# True labels
print(Y[:10])

# Check the posterior probabilities
score = model.predict_proba(X)
print(score[:10,:])
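
The probabilities returned by `predict_proba` follow from Bayes' rule with one independent Gaussian per feature. The sketch below recomputes them from the fitted means, variances, and priors; `manual_posterior` is an illustrative helper written for this note, not a scikit-learn function.

```python
import numpy as np
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

def manual_posterior(model, X):
    """Posterior P(class | x) from the fitted Gaussian parameters (illustrative helper)."""
    var = model.var_ if hasattr(model, 'var_') else model.sigma_
    log_joint = []
    for k in range(len(model.classes_)):
        log_prior = np.log(model.class_prior_[k])
        # Sum of per-feature Gaussian log-densities (the "naive" independence assumption).
        log_lik = -0.5 * np.sum(np.log(2.0 * np.pi * var[k])
                                + (X - model.theta_[k]) ** 2 / var[k], axis=1)
        log_joint.append(log_prior + log_lik)
    log_joint = np.array(log_joint).T                   # shape (n_samples, n_classes)
    log_joint -= log_joint.max(axis=1, keepdims=True)   # for numerical stability
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)       # normalize each row to sum to 1

X, Y = datasets.make_blobs(n_samples=100, n_features=2, centers=2,
                           random_state=2, cluster_std=1.5)
model = GaussianNB().fit(X, Y)

manual = manual_posterior(model, X)
print(manual[:5, :])
print(model.predict_proba(X)[:5, :])
```

The predicted label is simply the class with the largest posterior in each row.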

Accuracy
- Once prediction is done, compute how accurate the predictions actually are

In [None]:
# (true labels, predictions)
acc = metrics.accuracy_score(Y, pred)
print('Accuracy : ', acc)
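
Accuracy is just the fraction of samples whose predicted label equals the true label. A minimal sketch, assuming the same blob data and model as above (`manual_acc` is an illustrative name):

```python
import numpy as np
from sklearn import datasets, metrics
from sklearn.naive_bayes import GaussianNB

X, Y = datasets.make_blobs(n_samples=100, n_features=2, centers=2,
                           random_state=2, cluster_std=1.5)
model = GaussianNB().fit(X, Y)
pred = model.predict(X)

# Fraction of correct predictions, computed by hand.
manual_acc = np.mean(pred == Y)
acc = metrics.accuracy_score(Y, pred)

print('Manual accuracy  : ', manual_acc)
print('accuracy_score   : ', acc)
```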

In [None]:
X2, Y2 = datasets.make_blobs(100, 2, centers=2, random_state=2, cluster_std=2.5)

pred2 = model.predict(X2)
print(pred2[:10])
print(Y2[:10])

score2 = model.predict_proba(X2)
print(score2[:10,:])

Accuracy

In [None]:
acc2 = metrics.accuracy_score(Y2, pred2)
print('Accuracy : ', acc2)

# **Breast Cancer Wisconsin (Diagnostic) Dataset**
*   569 instances (212 malignant, 357 benign)
*   30 numerical features (computed from a digitized image of a breast mass)
*   2 classes (Malignant, Benign)


Import Libraries
- NumPy is a Python library for numerical computation on vectors, matrices, and other arrays (linear algebra)

In [None]:
import numpy as np
from sklearn import datasets

Load dataset

In [None]:
wisconsin = datasets.load_breast_cancer()

In [None]:
# See what the dataset object contains.
wisconsin.keys()

In [None]:
wisconsin.data

In [None]:
# Check the dimensions (30 feature values for each of 569 patients)
wisconsin.data.shape

In [None]:
wisconsin.target_names

**Prepare Data**
Split the dataset into a training set and a held-out test set

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Use 30% of the data as the test set.
TrainX, TestX, TrainY, TestY = train_test_split(wisconsin.data, wisconsin.target, test_size=0.3, random_state=0)

In [None]:
print(TrainX.shape)
print(TrainY.shape)
print(TestX.shape)
print(TestY.shape)
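
Beyond the shapes, it can be worth checking that the split really holds out about 30% of the samples and that both classes appear in each part. This is a small sketch under the same settings as above; `test_fraction`, `train_counts`, and `test_counts` are illustrative names.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

wisconsin = datasets.load_breast_cancer()
TrainX, TestX, TrainY, TestY = train_test_split(
    wisconsin.data, wisconsin.target, test_size=0.3, random_state=0)

n_total = wisconsin.data.shape[0]
test_fraction = TestX.shape[0] / n_total

# Number of samples per class label (0 and 1) in each split.
train_counts = np.bincount(TrainY)
test_counts = np.bincount(TestY)

print('Test fraction :', test_fraction)
for name, n_tr, n_ts in zip(wisconsin.target_names, train_counts, test_counts):
    print(name, ': train', n_tr, '/ test', n_ts)
```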

**Train a Naive Bayes Model**

In [None]:
from sklearn.naive_bayes import GaussianNB

In [None]:
model = GaussianNB()
model.fit(TrainX, TrainY)

In [None]:
model.theta_

In [None]:
pred_train = model.predict(TrainX)
print(pred_train[:20])
print(TrainY[:20])

score_train = model.predict_proba(TrainX)
score_train[:10,:]

In [None]:
pred_test = model.predict(TestX)
print(pred_test[:20])
print(TestY[:20])

score_test = model.predict_proba(TestX)
score_test[:10,:]

**Performance Evaluation**

In [None]:
from sklearn import metrics

Accuracy

In [None]:
tr_acc = metrics.accuracy_score(TrainY, pred_train)
print('Training Accuracy : ', tr_acc)

ts_acc = metrics.accuracy_score(TestY, pred_test)
print('Test Accuracy : ', ts_acc)

ROC curve & AUC
- Judge which model is better by how large the area under the curve is.

In [None]:
# Pass the computed scores, not the predicted labels: the posterior probabilities returned by predict_proba.
tr_fpr, tr_tpr, tr_th = metrics.roc_curve(TrainY, score_train[:,1], pos_label=1)
ts_fpr, ts_tpr, ts_th = metrics.roc_curve(TestY, score_test[:,1], pos_label=1)
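
Each point on the ROC curve is the (FPR, TPR) pair obtained by calling a sample positive when its score is at least some threshold. The sketch below recomputes one such point by hand; the index `k` and the `manual_*` names are illustrative choices.

```python
import numpy as np
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

wisconsin = datasets.load_breast_cancer()
TrainX, TestX, TrainY, TestY = train_test_split(
    wisconsin.data, wisconsin.target, test_size=0.3, random_state=0)
model = GaussianNB().fit(TrainX, TrainY)
score_test = model.predict_proba(TestX)

ts_fpr, ts_tpr, ts_th = metrics.roc_curve(TestY, score_test[:, 1], pos_label=1)

# Pick one threshold from the curve (index 0 is a sentinel above every score).
k = len(ts_th) // 2
pred_pos = score_test[:, 1] >= ts_th[k]

# True positive rate and false positive rate at that threshold.
manual_tpr = np.sum(pred_pos & (TestY == 1)) / np.sum(TestY == 1)
manual_fpr = np.sum(pred_pos & (TestY == 0)) / np.sum(TestY == 0)

print('Threshold :', ts_th[k])
print('TPR       :', manual_tpr, ts_tpr[k])
print('FPR       :', manual_fpr, ts_fpr[k])
```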

In [None]:
import matplotlib.pyplot as plt

plt.plot(tr_fpr, tr_tpr, color='b', label='Train')
plt.plot(ts_fpr, ts_tpr, color='r', label='Test')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.legend(loc='best')
plt.show()

In [None]:
# AUC is computed differently from accuracy, so the two values are not equal.
tr_auc = metrics.roc_auc_score(TrainY, score_train[:,1])
print('Training AUC : ', tr_auc)

ts_auc = metrics.roc_auc_score(TestY, score_test[:,1])
print('Test AUC : ', ts_auc)
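
For a binary problem, the AUC is the area under the ROC curve plotted earlier, so integrating the curve with `metrics.auc` (trapezoidal rule) should give the same number as `roc_auc_score`. A minimal sketch under the same split and model as above; `area_from_curve` is an illustrative name.

```python
import numpy as np
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

wisconsin = datasets.load_breast_cancer()
TrainX, TestX, TrainY, TestY = train_test_split(
    wisconsin.data, wisconsin.target, test_size=0.3, random_state=0)
model = GaussianNB().fit(TrainX, TrainY)
score_test = model.predict_proba(TestX)

ts_fpr, ts_tpr, ts_th = metrics.roc_curve(TestY, score_test[:, 1], pos_label=1)

# Area under the (FPR, TPR) curve by the trapezoidal rule.
area_from_curve = metrics.auc(ts_fpr, ts_tpr)
ts_auc = metrics.roc_auc_score(TestY, score_test[:, 1])

print('Area under ROC curve : ', area_from_curve)
print('roc_auc_score        : ', ts_auc)
```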

# **Iris Plants Dataset**
*   150 instances (50 per class)
*   4 numerical features (sepal length, sepal width, petal length, petal width)
*   3 classes (setosa, versicolor, virginica)



In [None]:
import numpy as np
from sklearn import datasets
iris = datasets.load_iris()

In [None]:
iris.keys()

In [None]:
iris.target_names

**Prepare Data**

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
TrainX, TestX, TrainY, TestY = train_test_split(iris.data, iris.target, test_size=0.3, random_state=0)

In [None]:
print(TrainX.shape)
print(TrainY.shape)
print(TestX.shape)
print(TestY.shape)

**Train a Naive Bayes Model**

In [None]:
from sklearn.naive_bayes import GaussianNB

In [None]:
model = GaussianNB()
model.fit(TrainX, TrainY)

In [None]:
model.theta_

In [None]:
pred_train = model.predict(TrainX)
print(pred_train)
print(TrainY)

In [None]:
pred_test = model.predict(TestX)
print(pred_test)
print(TestY)

**Performance Evaluation**

In [None]:
from sklearn import metrics

Accuracy

In [None]:
tr_acc = metrics.accuracy_score(TrainY, pred_train)
print('Training Accuracy : ', tr_acc)

ts_acc = metrics.accuracy_score(TestY, pred_test)
print('Test Accuracy : ', ts_acc)

Confusion Matrix

In [None]:
tr_cmat = metrics.confusion_matrix(TrainY, pred_train)
print(tr_cmat)

ts_cmat = metrics.confusion_matrix(TestY, pred_test)
print(ts_cmat)
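
In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes, so the diagonal holds the correct predictions. The sketch below derives accuracy and per-class recall from the test matrix; `acc_from_cmat` and `recall_per_class` are illustrative names.

```python
import numpy as np
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
TrainX, TestX, TrainY, TestY = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)
model = GaussianNB().fit(TrainX, TrainY)
pred_test = model.predict(TestX)

cmat = metrics.confusion_matrix(TestY, pred_test)

# Accuracy: correct predictions (diagonal) over all predictions.
acc_from_cmat = np.trace(cmat) / cmat.sum()
# Recall per class: diagonal over the number of true samples in each row.
recall_per_class = np.diag(cmat) / cmat.sum(axis=1)

print(cmat)
print('Accuracy from matrix : ', acc_from_cmat)
for name, rec in zip(iris.target_names, recall_per_class):
    print('Recall', name, ':', rec)
```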