### sklearn.naive_bayes.GaussianNB
### sklearn.linear_model.BayesianRidge

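#### An algorithm that uses conditional probability P(A|B), the probability that event A occurs given that event B has occurred, together with Bayes' theorem
#### "Naive" refers to the simplifying assumption that the features X used for prediction are conditionally independent given the class, which makes the probability calculation tractable
#### "Bayes" refers to using Bayes' theorem to compute the probability that an observation with features X belongs to a particular class, relative to the probability distribution over all classes
- All features are assumed to play an equal role in classifying or predicting the label
- For classification, GaussianNB is the usual choice. "Gaussian" refers to the Gaussian (normal) distribution: for continuous features, the likelihood of each value is computed as its density under a normal distribution
- Regression problems do not fit the naive_bayes algorithms well, so BayesianRidge from linear_model is used instead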
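
The calculation described above can be sketched without any library. This is a minimal illustration on hypothetical toy data, not scikit-learn's implementation: the helper names (`gaussian_pdf`, `fit_gaussian_nb`, `posterior`) and the data are assumptions made for the example.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of the normal distribution N(mean, var) evaluated at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit_gaussian_nb(X, y):
    """Estimate, per class, the prior and the per-feature mean and variance."""
    params = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [
            sum((v - m) ** 2 for v in col) / n
            for col, m in zip(zip(*rows), means)
        ]
        params[label] = (n / len(X), means, variances)
    return params

def posterior(params, x):
    """Return P(class | x) for every class via Bayes' theorem."""
    joint = {}
    for label, (prior, means, variances) in params.items():
        likelihood = 1.0
        for value, m, v in zip(x, means, variances):
            # Naive assumption: per-feature likelihoods simply multiply
            likelihood *= gaussian_pdf(value, m, v)
        joint[label] = prior * likelihood
    evidence = sum(joint.values())  # P(x), the normalizing constant
    return {label: p / evidence for label, p in joint.items()}

# Hypothetical toy data: two continuous features, two classes
X_toy = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 4.1], [3.2, 3.9], [2.8, 4.0]]
y_toy = [0, 0, 0, 1, 1, 1]

params = fit_gaussian_nb(X_toy, y_toy)
probs = posterior(params, [1.1, 2.1])
predicted = max(probs, key=probs.get)
print(probs, predicted)
```

The predicted class is the one with the largest posterior. Real implementations typically work with log-probabilities to avoid underflow when many features are multiplied.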

#### Key Hyperparameters
##### GaussianNB
- var_smoothing : default 1e-9; the portion of the largest variance among all features that is added to every variance for numerical stability
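
To make the role of var_smoothing concrete, the sketch below computes the smoothing term by hand on hypothetical data. The helper `smoothed_variances` is illustrative only. It assumes the additive term is `var_smoothing` times the largest per-feature variance of the training data, which is how the parameter is described above.

```python
from statistics import pvariance

def smoothed_variances(class_rows, all_rows, var_smoothing=1e-9):
    """Add a small epsilon to each per-class feature variance."""
    # epsilon is a fraction of the largest feature variance in the training data
    epsilon = var_smoothing * max(pvariance(col) for col in zip(*all_rows))
    variances = [pvariance(col) + epsilon for col in zip(*class_rows)]
    return variances, epsilon

# Hypothetical data: the second feature is constant within class 0,
# so its raw variance is exactly 0 and the Gaussian density would be undefined
X_all = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0], [4.0, 9.0], [5.0, 7.0]]
X_class0 = X_all[:3]

variances, epsilon = smoothed_variances(X_class0, X_all)
print(variances, epsilon)
```

Because the term scales with the data's own variance, useful values of var_smoothing are small fractions rather than integers.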

##### BayesianRidge
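- alpha_1 : default 1e-6; shape parameter of the Gamma prior placed on alpha (the precision of the noise)
- lambda_1 : default 1e-6; shape parameter of the Gamma prior placed on lambda (the precision of the weights)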
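
For orientation, the probabilistic model behind these hyperparameters can be written as follows. This summarizes the standard Bayesian ridge formulation. The symbols $w$ (weights) and $I$ (identity matrix) are notation introduced here, and $\alpha_2$, $\lambda_2$ are the corresponding rate parameters that appear in the signature below.

```latex
\begin{aligned}
p(y \mid X, w, \alpha) &= \mathcal{N}\!\left(y \mid Xw,\ \alpha^{-1} I\right) \\
p(w \mid \lambda) &= \mathcal{N}\!\left(w \mid 0,\ \lambda^{-1} I\right) \\
\alpha &\sim \mathrm{Gamma}(\alpha_1, \alpha_2), \qquad
\lambda \sim \mathrm{Gamma}(\lambda_1, \lambda_2)
\end{aligned}
```

With the very small defaults (1e-6) the Gamma priors are nearly non-informative, so $\alpha$ and $\lambda$ are driven mainly by the data during fitting.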

##### GaussianNB(*, priors, ...)
##### BayesianRidge(*, n_iter=300, tol=0.001, alpha_1=1e-06, alpha_2=1e-06, lambda_1=1e-06, lambda_2=1e-06, alpha_init=None, lambda_init=None, compute_score=False, fit_intercept=True, normalize=False, copy_X=True, verbose=False)

# Analysis Code - Classification

In [1]:
# Load libraries and data
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import randint
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import BayesianRidge
from sklearn.metrics import classification_report, confusion_matrix, mean_squared_error

df = pd.read_csv('../input/big-data-certification-study/breast-cancer-wisconsin.csv', encoding='utf-8')
df.head()

Unnamed: 0,code,Clump_Thickness,Cell_Size,Cell_Shape,Marginal_Adhesion,Single_Epithelial_Cell_Size,Bare_Nuclei,Bland_Chromatin,Normal_Nucleoli,Mitoses,Class
0,1000025,5,1,1,1,2,1,3,1,1,0
1,1002945,5,4,4,5,7,10,3,2,1,0
2,1015425,3,1,1,1,2,2,3,1,1,0
3,1016277,6,8,8,1,3,4,3,7,1,0
4,1017023,4,1,1,3,2,1,3,1,1,0


In [2]:
# Split features and target
X=df.drop(columns=['code','Class'])
y=df['Class']  # 1-D target, avoids the column-vector conversion warning in fit()
X_train,X_test,y_train,y_test=train_test_split(X,y,stratify=y,random_state=42)

In [3]:
# Normalization (min-max scaling, fitted on the training set only)
scaler=MinMaxScaler()
scaler.fit(X_train)
mm_X_train=scaler.transform(X_train)
mm_X_test=scaler.transform(X_test)
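
The fit-on-train, transform-both pattern above can be illustrated by hand. This is a sketch of the min-max formula x' = (x - min) / (max - min) on hypothetical numbers. The helpers `fit_min_max` and `transform_min_max` are illustrative names, and the sketch assumes no feature is constant in the training data.

```python
def fit_min_max(rows):
    """Record the per-feature minimum and maximum of the training rows."""
    cols = list(zip(*rows))
    return [min(c) for c in cols], [max(c) for c in cols]

def transform_min_max(rows, mins, maxs):
    """Scale each feature with the stored training min/max (assumes max != min)."""
    return [
        [(v - lo) / (hi - lo) for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]

# Hypothetical training and test rows
train = [[1, 10], [5, 20], [10, 30]]
test = [[12, 5]]

mins, maxs = fit_min_max(train)
scaled_train = transform_min_max(train, mins, maxs)
scaled_test = transform_min_max(test, mins, maxs)
print(scaled_train)
print(scaled_test)
```

Test values can fall outside [0, 1] because the minimum and maximum come from the training set only. This is expected and avoids leaking test information into preprocessing.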

In [4]:
# Fit the model and score it on the training set
model=GaussianNB()
model.fit(mm_X_train, y_train)
pred_train=model.predict(mm_X_train)
model.score(mm_X_train,y_train)

0.966796875

In [5]:
# Confusion matrix and classification report
cm_train=confusion_matrix(y_train,pred_train)
cfr_train=classification_report(y_train,pred_train)
print('Confusion matrix :\n',cm_train,
      '\n\n\nClassification report :\n',cfr_train)

Confusion matrix :
 [[319  14]
 [  3 176]] 


Classification report :
               precision    recall  f1-score   support

           0       0.99      0.96      0.97       333
           1       0.93      0.98      0.95       179

    accuracy                           0.97       512
   macro avg       0.96      0.97      0.96       512
weighted avg       0.97      0.97      0.97       512
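
The figures in the report follow directly from the confusion matrix. The sketch below recomputes them from the training matrix printed above, with rows as true classes and columns as predicted classes. The helper `per_class_metrics` is an illustrative name.

```python
# Training confusion matrix from the output above
cm = [[319, 14],
      [3, 176]]

def per_class_metrics(cm, k):
    """Precision, recall and F1 for class k of a square confusion matrix."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything truly k
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))

for k in range(len(cm)):
    p, r, f = per_class_metrics(cm, k)
    print(k, round(p, 2), round(r, 2), round(f, 2))
print('accuracy', accuracy)
```

The accuracy, 495 / 512, equals the value returned by `model.score` on the training set.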



In [6]:
# Predict and score on the test set
pred_test=model.predict(mm_X_test)
model.score(mm_X_test, y_test)

0.9590643274853801

In [7]:
# Confusion matrix and classification report
cm_test=confusion_matrix(y_test,pred_test)
cfr_test=classification_report(y_test,pred_test)
print('Confusion matrix :\n',cm_test,'\n\n\nClassification report :\n',cfr_test)

Confusion matrix :
 [[106   5]
 [  2  58]] 


Classification report :
               precision    recall  f1-score   support

           0       0.98      0.95      0.97       111
           1       0.92      0.97      0.94        60

    accuracy                           0.96       171
   macro avg       0.95      0.96      0.96       171
weighted avg       0.96      0.96      0.96       171



In [8]:
# Hyperparameter Tuning
# Grid Search
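# Note: this grid tries the integers 0..10, which are far above the 1e-9 default;
# values near the default are usually searched on a log scale
param_g = {'var_smoothing':range(11)}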
grid= GridSearchCV(GaussianNB(), param_g, cv=5, return_train_score=True)
grid.fit(mm_X_train, y_train)

GridSearchCV(cv=5, estimator=GaussianNB(),
             param_grid={'var_smoothing': range(0, 11)},
             return_train_score=True)

In [9]:
print('Best Parameter :', grid.best_params_)
print('Best Score :',round(grid.best_score_,4))
print('Test Score :',round(grid.score(mm_X_test, y_test),4))

Best Parameter : {'var_smoothing': 0}
Best Score : 0.9649
Test Score : 0.9591
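
Since var_smoothing is a small fraction by default (1e-9), a log-spaced grid is one way to explore it. The sketch below uses synthetic data from `make_classification` as a stand-in, which is an assumption and not the breast cancer file used above, so its scores are not comparable to the results in this notebook. The names `X_syn`, `y_syn`, `param_log` and `grid_log` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption: 300 samples, 9 features)
X_syn, y_syn = make_classification(n_samples=300, n_features=9, random_state=42)

# 13 log-spaced candidates from 1e-12 to 1e0, bracketing the 1e-9 default
param_log = {'var_smoothing': np.logspace(-12, 0, 13)}

grid_log = GridSearchCV(GaussianNB(), param_log, cv=5)
grid_log.fit(X_syn, y_syn)

print('Best Parameter :', grid_log.best_params_)
print('Best Score :', round(grid_log.best_score_, 4))
```

On real data, the same `param_log` dictionary could be passed in place of `param_g`. Whether it improves on the default depends on the dataset.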


In [10]:
# Randomized Search
# Note: scipy.stats.randint samples integers only, so var_smoothing is drawn from 0..19 here
param_r={'var_smoothing':randint(low=0, high=20)}
random=RandomizedSearchCV(GaussianNB(), param_distributions=param_r, cv=5, n_iter=100, return_train_score=True)
random.fit(mm_X_train, y_train)

RandomizedSearchCV(cv=5, estimator=GaussianNB(), n_iter=100,
                   param_distributions={'var_smoothing': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f276e31c050>},
                   return_train_score=True)

In [11]:
print('Best Parameter :', random.best_params_) 
print('Best Score :',round(random.best_score_,4)) 
print('Test Score :',round(random.score(mm_X_test, y_test),4))

Best Parameter : {'var_smoothing': 0}
Best Score : 0.9649
Test Score : 0.9591


# Analysis Code - Regression

In [12]:
df2=pd.read_csv('../input/big-data-certification-study/house_price.csv', encoding='utf-8') 
df2.head()

Unnamed: 0,housing_age,income,bedrooms,households,rooms,house_value
0,23,6.777,0.141112,2.442244,8.10396,500000
1,49,6.0199,0.160984,2.726688,5.752412,500000
2,35,5.1155,0.249061,1.902676,3.888078,500000
3,32,4.7109,0.231383,1.913669,4.508393,500000
4,21,4.5625,0.255583,3.092664,4.667954,500000


In [13]:
# Split features and target
X=df2.drop(columns=['house_value']) 
y=df2['house_value']  # 1-D target, avoids the column-vector conversion warning in fit()
X_train, X_test, y_train, y_test=train_test_split(X,y,random_state=42)

In [14]:
# Normalization (min-max scaling, fitted on the training set only)
scale=MinMaxScaler() 
scale.fit(X_train) 
ms_X_train=scale.transform(X_train) 
ms_x_test=scale.transform(X_test)

In [15]:
# Fit the model and score it on the training set (score returns R^2 for regressors)
model_r=BayesianRidge() 
model_r.fit(ms_X_train, y_train) 
pred_x=model_r.predict(ms_X_train) 
model_r.score(ms_X_train, y_train)

0.5706920449333218

In [16]:
pred_y=model_r.predict(ms_x_test) 
model_r.score(ms_x_test,y_test)

0.5826111218474419

In [17]:
# RMSE 
rmse_train=np.sqrt(mean_squared_error(y_train,pred_x)) 
rmse_test=np.sqrt(mean_squared_error(y_test,pred_y)) 
print('Train RMSE :', round(rmse_train), '\nTest  RMSE :', round(rmse_test))

Train RMSE : 62537 
Test  RMSE : 61764
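
The two regression metrics used above, R^2 (returned by `score`) and RMSE, can be computed by hand as follows. This is a sketch on hypothetical numbers, and the helper `r2_and_rmse` is an illustrative name.

```python
import math

def r2_and_rmse(y_true, y_pred):
    """Coefficient of determination and root mean squared error."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1.0 - ss_res / ss_tot, math.sqrt(ss_res / n)

# Hypothetical targets and predictions
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]

r2, rmse = r2_and_rmse(y_true, y_pred)
print(round(r2, 3), rmse)
```

RMSE is expressed in the units of the target (here, house value), while R^2 is unitless, which is why the notebook reports both.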


In [18]:
# Hyperparameter Tuning 
# Grid Search 
g_param = {'alpha_1':[1e-06,1e-05,1e-04,1e-03,1e-02,1e-01,1,2,3,4], 'lambda_1':[1e-06,1e-05,1e-04,1e-03,1e-02,1e-01,1,2,3,4]} 
g_search= GridSearchCV(BayesianRidge(), g_param, cv=5, return_train_score=True) 
g_search.fit(ms_X_train, y_train)

GridSearchCV(cv=5, estimator=BayesianRidge(),
             param_grid={'alpha_1': [1e-06, 1e-05, 0.0001, 0.001, 0.01, 0.1, 1,
                                     2, 3, 4],
                         'lambda_1': [1e-06, 1e-05, 0.0001, 0.001, 0.01, 0.1, 1,
                                      2, 3, 4]},
             return_train_score=True)

In [19]:
print('Best Parameter :', g_search.best_params_) 
print('Best Score :',round(g_search.best_score_,4)) 
print('Test Score :',round(g_search.score(ms_x_test, y_test),4))

Best Parameter : {'alpha_1': 4, 'lambda_1': 1e-06}
Best Score : 0.5703
Test Score : 0.5826


In [20]:
# Randomized Search 
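# Note: scipy.stats.randint samples integers only, so alpha_1 and lambda_1 are drawn as whole numbers, not values near 1e-06
r_param={'alpha_1':randint(low=1e-06, high=10), 'lambda_1':randint(low=1e-06,high=10)}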
r_search=RandomizedSearchCV(BayesianRidge(), param_distributions=r_param, cv=5, n_iter=50, return_train_score=True) 
r_search.fit(ms_X_train, y_train)

RandomizedSearchCV(cv=5, estimator=BayesianRidge(), n_iter=50,
                   param_distributions={'alpha_1': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f276e37c590>,
                                        'lambda_1': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f276e37c510>},
                   return_train_score=True)

In [21]:
print('Best Parameter :', r_search.best_params_) 
print('Best Score :',round(r_search.best_score_,4)) 
print('Test Score :',round(r_search.score(ms_x_test, y_test),4))

Best Parameter : {'alpha_1': 8, 'lambda_1': 0}
Best Score : 0.5703
Test Score : 0.5826
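
Because alpha_1 and lambda_1 are continuous and span several orders of magnitude, a log-uniform distribution is one option for sampling them. The sketch below uses `scipy.stats.loguniform` on synthetic data from `make_regression`, which is an assumption and not the house price file used above, so its scores are not comparable to the results in this notebook. The names `X_reg`, `y_reg`, `param_cont` and `search_cont` are illustrative.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: 200 samples, 5 features)
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Continuous, log-uniform sampling between 1e-6 and 10
param_cont = {
    'alpha_1': loguniform(1e-6, 10),
    'lambda_1': loguniform(1e-6, 10),
}

search_cont = RandomizedSearchCV(
    BayesianRidge(),
    param_distributions=param_cont,
    n_iter=20,
    cv=5,
    random_state=42,  # fixed seed so the sampled candidates are reproducible
)
search_cont.fit(X_reg, y_reg)

print('Best Parameter :', search_cont.best_params_)
print('Best Score :', round(search_cont.best_score_, 4))
```

In the results above, the best score barely changes across searches (0.5703 in both), which suggests these prior hyperparameters have little influence on this dataset.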
