
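# Naive Bayes Classification

* A probabilistic classification algorithm based on Bayes' theorem
* Assumes that all features are conditionally independent given the class (the "naive" assumption)
* Three classifiers are available, depending on the type of input features
  * Gaussian naive Bayes classifier
  * Bernoulli naive Bayes classifier
  * Multinomial naive Bayes classifier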

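## Probability Model of the Naive Bayes Classifier

* Naive Bayes is a conditional probability model
* It takes a vector **x** of *n* features as input and outputs a probability for each of the *K* possible classes $C_k$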

\begin{equation}
p(C_k | x_1,...,x_n)
\end{equation}

* Applying Bayes' theorem to the expression above gives

\begin{equation}
p(C_k | \textbf{x}) = \frac{p(C_k)p(\textbf{x}|C_k)}{p(\textbf{x})}
\end{equation}

* Only the numerator depends on the class $C_k$; the denominator $p(\textbf{x})$ is the same for every class, so it can be treated as a constant

\begin{equation}
\begin{split}
p(C_k | \textbf{x}) & \propto p(C_k)p(\textbf{x}|C_k) \\
& \propto p(C_k, x_1, ..., x_n)
\end{split}
\end{equation}

* Using the chain rule, the joint probability above can be expanded as follows

\begin{equation}
\begin{split}
p(C_k, x_1, ..., x_n) & = p(C_k)p(x_1, ..., x_n | C_k) \\
& = p(C_k)p(x_1 | C_k)p(x_2, ..., x_n | C_k, x_1) \\
& = p(C_k)p(x_1 | C_k)p(x_2 | C_k, x_1)p(x_3, ..., x_n | C_k, x_1, x_2) \\
& = p(C_k)p(x_1 | C_k)p(x_2 | C_k, x_1)...p(x_n | C_k, x_1, x_2, ..., x_{n-1})
\end{split}
\end{equation}

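* Because the naive Bayes classifier assumes that the features are independent of each other given the class, each factor $p(x_i | C_k, x_1, ..., x_{i-1})$ reduces to $p(x_i | C_k)$, and the expression above becomes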

\begin{equation}
\begin{split}
p(C_k, x_1, ..., x_n) & = p(C_k)p(x_1|C_k)p(x_2|C_k)...p(x_n|C_k) \\
& = p(C_k) \prod_{i=1}^{n} p(x_i|C_k)
\end{split}
\end{equation}

* The class with the largest value of this expression is the predicted class

\begin{equation}
\hat{y} = \underset{k}{\arg\max} \; p(C_k) \prod_{i=1}^{n} p(x_i|C_k)
\end{equation}

In [2]:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
from sklearn.datasets import fetch_covtype, fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer
from sklearn import metrics

In [5]:
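# Class priors p(C_k) and per-feature likelihoods p(x_i | C_k) for four classes
prior = [0.45, 0.3, 0.15, 0.1]
likelihood = [[0.3, 0.3, 0.4], [0.7, 0.2, 0.1], [0.15, 0.5, 0.35], [0.6, 0.2, 0.2]]

for idx, (c, xs) in enumerate(zip(prior, likelihood), start=1):
    # Multiply the feature likelihoods, then weight by the class prior
    result = 1.
    for x in xs:
        result *= x
    result *= c

    print(f"Unnormalized posterior of class {idx}: {result}")

Unnormalized posterior of class 1: 0.0162
Unnormalized posterior of class 2: 0.0042
Unnormalized posterior of class 3: 0.0039375
Unnormalized posterior of class 4: 0.0024000000000000002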
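The four scores printed above are unnormalized, so they do not sum to 1. As a minimal sketch, the snippet below reuses the same `prior` and `likelihood` values, computes the scores in log space (a common way to avoid underflow when many small probabilities are multiplied), normalizes them into posterior probabilities, and applies the arg max rule. The names `log_scores`, `posterior`, and `predicted_class` are illustrative and not part of the original notebook.

```python
import numpy as np

# Same illustrative priors and per-feature likelihoods as in the cell above
prior = np.array([0.45, 0.3, 0.15, 0.1])
likelihood = np.array([[0.3, 0.3, 0.4],
                       [0.7, 0.2, 0.1],
                       [0.15, 0.5, 0.35],
                       [0.6, 0.2, 0.2]])

# log p(C_k) + sum_i log p(x_i | C_k): summing logs avoids numerical underflow
log_scores = np.log(prior) + np.log(likelihood).sum(axis=1)

# Normalize so the scores sum to 1 (subtract the max first for stability)
shifted = np.exp(log_scores - log_scores.max())
posterior = shifted / shifted.sum()

# arg max over classes; +1 to match the 1-based class numbering used above
predicted_class = int(np.argmax(log_scores)) + 1
print(posterior.round(4), predicted_class)
```

Because the logarithm is monotonic, the arg max over `log_scores` selects the same class as the arg max over the raw products.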


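## Forest Cover Type Data
* Feature data describing 30×30 m patches of forest
* The task is to predict the cover type (the dominant tree species) of each patch
* A detailed description of the data is available at https://archive.ics.uci.edu/ml/datasets/Covertype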

In [3]:
covtype = fetch_covtype()
print(covtype.DESCR)

.. _covtype_dataset:

Forest covertypes
-----------------

The samples in this dataset correspond to 30×30m patches of forest in the US,
collected for the task of predicting each patch's cover type,
i.e. the dominant species of tree.
There are seven covertypes, making this a multiclass classification problem.
Each sample has 54 features, described on the
`dataset's homepage <https://archive.ics.uci.edu/ml/datasets/Covertype>`__.
Some of the features are boolean indicators,
while others are discrete or continuous measurements.

**Data Set Characteristics:**

    Classes                        7
    Samples total             581012
    Dimensionality                54
    Features                     int

:func:`sklearn.datasets.fetch_covtype` will load the covertype dataset;
it returns a dictionary-like 'Bunch' object
with the feature matrix in the ``data`` member
and the target values in ``target``. If optional argument 'as_frame' is
set to 'True', it will return ``data`` and ``target`

In [4]:
pd.DataFrame(covtype.data)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,44,45,46,47,48,49,50,51,52,53
0,2596.0,51.0,3.0,258.0,0.0,510.0,221.0,232.0,148.0,6279.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2590.0,56.0,2.0,212.0,-6.0,390.0,220.0,235.0,151.0,6225.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2804.0,139.0,9.0,268.0,65.0,3180.0,234.0,238.0,135.0,6121.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2785.0,155.0,18.0,242.0,118.0,3090.0,238.0,238.0,122.0,6211.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2595.0,45.0,2.0,153.0,-1.0,391.0,220.0,234.0,150.0,6172.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
581007,2396.0,153.0,20.0,85.0,17.0,108.0,240.0,237.0,118.0,837.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
581008,2391.0,152.0,19.0,67.0,12.0,95.0,240.0,237.0,119.0,845.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
581009,2386.0,159.0,17.0,60.0,7.0,90.0,236.0,241.0,130.0,854.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
581010,2384.0,170.0,15.0,60.0,5.0,90.0,230.0,245.0,143.0,864.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [5]:
covtype.target

array([5, 5, 2, ..., 3, 3, 3], dtype=int32)

### Splitting into Training and Test Data

In [6]:
covtype_X = covtype.data
covtype_y = covtype.target  # labels come from `target`, not `data`

In [7]:
covtype_X_train, covtype_X_test, covtype_y_train, covtype_y_test = train_test_split(covtype_X, covtype_y, test_size=0.2)

In [8]:
print("Full data shape : {}".format(covtype_X.shape))
print("Training data shape : {}".format(covtype_X_train.shape))
print("Test data shape : {}".format(covtype_X_test.shape))

Full data shape : (581012, 54)
Training data shape : (464809, 54)
Test data shape : (116203, 54)
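When class frequencies are uneven, a purely random split can shift the class proportions between the training and test sets. The sketch below uses hypothetical toy labels (not the covertype data) to show two optional arguments of `train_test_split`: `stratify`, which preserves the class ratio in both parts, and `random_state`, which makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 samples of class 0 and 10 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the 9:1 class ratio in both parts;
# random_state fixes the shuffle so the split is repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Per-class counts in each part
print(np.bincount(y_train), np.bincount(y_test))
```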


### Preprocessing

#### Data Before Preprocessing

In [12]:
covtype_df=pd.DataFrame(data=covtype_X)
covtype_df.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,44,45,46,47,48,49,50,51,52,53
count,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,...,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0
mean,2959.365301,155.656807,14.103704,269.428217,46.418855,2350.146611,212.146049,223.318716,142.528263,1980.291226,...,0.044175,0.090392,0.077716,0.002773,0.003255,0.000205,0.000513,0.026803,0.023762,0.01506
std,279.984734,111.913721,7.488242,212.549356,58.295232,1559.25487,26.769889,19.768697,38.274529,1324.19521,...,0.205483,0.286743,0.267725,0.052584,0.056957,0.01431,0.022641,0.161508,0.152307,0.121791
min,1859.0,0.0,0.0,0.0,-173.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2809.0,58.0,9.0,108.0,7.0,1106.0,198.0,213.0,119.0,1024.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,2996.0,127.0,13.0,218.0,30.0,1997.0,218.0,226.0,143.0,1710.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,3163.0,260.0,18.0,384.0,69.0,3328.0,231.0,237.0,168.0,2550.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,3858.0,360.0,66.0,1397.0,601.0,7117.0,254.0,254.0,254.0,7173.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [13]:
covtype_train_df =pd.DataFrame(data=covtype_X_train)
covtype_train_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,44,45,46,47,48,49,50,51,52,53
0,3317.0,202.0,24.0,391.0,160.0,2557.0,202.0,253.0,174.0,949.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2461.0,174.0,25.0,277.0,91.0,1209.0,225.0,243.0,134.0,1405.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2664.0,207.0,28.0,323.0,60.0,1989.0,189.0,252.0,182.0,1627.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3157.0,295.0,20.0,285.0,51.0,210.0,160.0,230.0,209.0,799.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,3132.0,108.0,4.0,30.0,-2.0,1376.0,226.0,235.0,144.0,2254.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
464804,3258.0,268.0,6.0,331.0,12.0,4364.0,205.0,242.0,176.0,735.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
464805,3284.0,90.0,14.0,67.0,14.0,3309.0,240.0,218.0,105.0,1296.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
464806,3165.0,95.0,17.0,470.0,16.0,5943.0,245.0,213.0,91.0,1277.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
464807,2512.0,18.0,2.0,30.0,2.0,390.0,218.0,235.0,155.0,362.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [16]:
covtype_test_df =pd.DataFrame(data=covtype_X_test)
covtype_test_df.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,44,45,46,47,48,49,50,51,52,53
count,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,...,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0
mean,2959.426486,155.552129,14.142518,269.625724,46.534831,2353.166949,212.10708,223.24181,142.454997,1977.318477,...,0.0445,0.088784,0.077356,0.002496,0.003236,0.000215,0.000525,0.026987,0.024345,0.01518
std,280.561735,112.035488,7.504124,212.665522,58.246489,1562.722328,26.84151,19.813089,38.366415,1320.896096,...,0.206203,0.284433,0.267157,0.049894,0.056792,0.014666,0.022906,0.162047,0.15412,0.12227
min,1860.0,0.0,0.0,0.0,-161.0,0.0,0.0,40.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2810.0,58.0,9.0,108.0,7.0,1103.0,198.0,213.0,119.0,1022.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,2996.0,127.0,13.0,218.0,30.0,2002.0,218.0,226.0,143.0,1710.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,3163.0,260.0,18.0,384.0,69.0,3331.0,231.0,237.0,168.0,2550.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,3857.0,360.0,63.0,1390.0,597.0,7078.0,254.0,254.0,254.0,7141.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [24]:
scaler=StandardScaler()
covtype_X_train_scale = scaler.fit_transform(covtype_X_train)
covtype_X_test_scale = scaler.transform(covtype_X_test)

#### Data After Preprocessing
* Standardized so that each feature has a mean close to 0 and a standard deviation close to 1

In [26]:
covtype_train_df = pd.DataFrame(data=covtype_X_train_scale)
covtype_train_df.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,44,45,46,47,48,49,50,51,52,53
count,464809.0,464809.0,464809.0,464809.0,464809.0,464809.0,464809.0,464809.0,464809.0,464809.0,...,464809.0,464809.0,464809.0,464809.0,464809.0,464809.0,464809.0,464809.0,464809.0,464809.0
mean,7.886685e-16,3.166224e-17,2.783472e-16,-2.8441270000000005e-17,-3.985984e-16,-8.422006e-17,-4.476548e-16,8.646206e-16,-5.896579e-15,-9.879456e-17,...,4.010152e-15,-1.277361e-14,3.126401e-15,-5.9393e-15,-9.635631e-15,-3.07278e-15,-7.25496e-16,2.873137e-15,3.008826e-15,6.091202e-15
std,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,...,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001
min,-3.932064,-1.391477,-1.883158,-1.267544,-3.762643,-1.50758,-7.930481,-11.30394,-3.726562,-1.4951,...,-0.2147726,-0.3160081,-0.2904664,-0.05338659,-0.05718445,-0.01422232,-0.02258643,-0.1658097,-0.1555229,-0.123528
25%,-0.5372707,-0.8730796,-0.6806307,-0.7593574,-0.6755554,-0.7978714,-0.52915,-0.5232405,-0.6155727,-0.7222805,...,-0.2147726,-0.3160081,-0.2904664,-0.05338659,-0.05718445,-0.01422232,-0.02258643,-0.1658097,-0.1555229,-0.123528
50%,0.1309675,-0.2563653,-0.1461739,-0.2417597,-0.2982446,-0.2274094,0.2184592,0.1347363,0.01185367,-0.2045516,...,-0.2147726,-0.3160081,-0.2904664,-0.05338659,-0.05718445,-0.01422232,-0.02258643,-0.1658097,-0.1555229,-0.123528
75%,0.7277365,0.9413119,0.5218971,0.5393422,0.3877749,0.6273212,0.7044052,0.6914858,0.6654228,0.4294023,...,-0.2147726,-0.3160081,-0.2904664,-0.05338659,-0.05718445,-0.01422232,-0.02258643,-0.1658097,-0.1555229,-0.123528
max,3.211296,1.826163,6.935379,5.305946,9.511835,3.059325,1.564156,1.551917,2.913701,3.918412,...,4.656086,3.164476,3.442739,18.7313,17.48727,70.31199,44.27437,6.03101,6.429922,8.095329


In [27]:
covtype_test_df = pd.DataFrame(data=covtype_X_test_scale)
covtype_test_df.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,44,45,46,47,48,49,50,51,52,53
count,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,...,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0
mean,0.000273,-0.00117,0.006483,0.001162,0.002486,0.002423,-0.001821,-0.004866,-0.002394,-0.002804,...,0.001979,-0.006996,-0.00168,-0.006507,-0.000416,0.000908,0.000667,0.001425,0.004802,0.001237
std,1.002578,1.001361,1.002658,1.000683,0.998956,1.002783,1.003348,1.002812,1.003004,0.996889,...,1.004388,0.989966,0.997351,0.937246,0.996376,1.031416,1.014654,1.004175,1.014946,1.004922
min,-3.928491,-1.391477,-1.883158,-1.267544,-3.556838,-1.50758,-7.930481,-9.279393,-3.726562,-1.4951,...,-0.214773,-0.316008,-0.290466,-0.053387,-0.057184,-0.014222,-0.022586,-0.16581,-0.155523,-0.123528
25%,-0.533697,-0.87308,-0.680631,-0.759357,-0.675555,-0.799796,-0.52915,-0.52324,-0.615573,-0.72379,...,-0.214773,-0.316008,-0.290466,-0.053387,-0.057184,-0.014222,-0.022586,-0.16581,-0.155523,-0.123528
50%,0.130968,-0.256365,-0.146174,-0.24176,-0.281094,-0.222918,0.218459,0.134736,0.011854,-0.204552,...,-0.214773,-0.316008,-0.290466,-0.053387,-0.057184,-0.014222,-0.022586,-0.16581,-0.155523,-0.123528
75%,0.727736,0.932374,0.521897,0.539342,0.387775,0.629888,0.704405,0.691486,0.665423,0.429402,...,-0.214773,-0.316008,-0.290466,-0.053387,-0.057184,-0.014222,-0.022586,-0.16581,-0.155523,-0.123528
max,3.207722,1.826163,6.534536,5.273008,9.443233,3.034299,1.564156,1.551917,2.913701,3.894262,...,4.656086,3.164476,3.442739,18.731297,17.48727,70.311995,44.274365,6.03101,6.429922,8.095329
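The two summaries differ slightly because the scaler is fitted on the training data only and then reused on the test data, so the test means and standard deviations land near, but not exactly on, 0 and 1. The training standard deviation shows as 1.000001 rather than 1 because pandas `describe()` reports the sample standard deviation (`ddof=1`), while `StandardScaler` divides by the population standard deviation (`ddof=0`). A minimal sketch on synthetic data, used here as a stand-in for the covertype arrays:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins for the training and test feature matrices
X_train = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))
X_test = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean_ and scale_ from the training data
X_test_scaled = scaler.transform(X_test)        # reuses the training statistics unchanged

# Training data: mean ~0 and population std ~1 by construction
print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
# Test data: close to 0 and 1, but not exact
print(X_test_scaled.mean(axis=0), X_test_scaled.std(axis=0))
```

Fitting on the training data alone keeps information about the test set from leaking into preprocessing.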


## 20 Newsgroups Data
* Classify which newsgroup each news post belongs to
* Because the posts are text data, they require a dedicated preprocessing step

### Splitting into Training and Test Data
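The original cell for this step is not included here. One possible way to obtain the two parts is to use the predefined split that `fetch_20newsgroups` exposes through its `subset` argument. The helper name `load_newsgroups_split` below is hypothetical, and the first call downloads the dataset.

```python
from sklearn.datasets import fetch_20newsgroups


def load_newsgroups_split(categories=None):
    """Return (train, test) Bunch objects from the dataset's predefined split.

    Hypothetical helper for illustration; the first call downloads the data.
    Pass a list of newsgroup names in `categories` to load only a subset.
    """
    train = fetch_20newsgroups(subset="train", categories=categories)
    test = fetch_20newsgroups(subset="test", categories=categories)
    return train, test
```

Calling `train, test = load_newsgroups_split()` would give the raw posts in `train.data` (a list of strings), the integer labels in `train.target`, and the group names in `train.target_names`.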

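### Vectorization
* Raw text cannot be fed directly into a machine learning model
* Vectorization is the preprocessing step that converts text into numeric vectors that a model can accept
* Scikit-learn provides three approaches: Count, Tf-idf, and Hashing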

#### CountVectorizer
* Builds a vector by counting how many times each word appears in a document
* The result is represented as a sparse matrix
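A minimal sketch of the counting step, using a tiny hand-made corpus in place of the newsgroup posts. With the default settings, tokens shorter than two characters are ignored and the vocabulary is learned from the fitted documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny illustrative corpus (not taken from the 20 Newsgroups data)
docs = ["the cat sat", "the dog sat on the mat"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # SciPy sparse matrix, one row per document

vocab = vectorizer.vocabulary_       # maps each word to its column index
print(X.shape)                       # (number of documents, vocabulary size)
print(sorted(vocab))                 # the learned vocabulary
print(X.toarray())                   # dense view, for inspection only
```

Calling `toarray()` is fine for a toy corpus, but on a real corpus the matrix is best kept sparse because almost every entry is zero.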

#### HashingVectorizer
* Maps each word to a hash value
* Represents every document as a vector of a predetermined size
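A minimal sketch of the hashing approach on a tiny hand-made corpus. The values chosen for `n_features`, `alternate_sign`, and `norm` are illustrative assumptions rather than settings taken from the original notebook.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Tiny illustrative corpus (not taken from the 20 Newsgroups data)
docs = ["the cat sat", "the dog sat on the mat"]

# n_features fixes the output width; no vocabulary is learned or stored.
# alternate_sign=False keeps all values non-negative.
# norm=None keeps raw counts instead of L2-normalized rows.
vectorizer = HashingVectorizer(n_features=2**10, alternate_sign=False, norm=None)
X = vectorizer.transform(docs)       # stateless, so no fit step is required

print(X.shape)
print(X[0].sum(), X[1].sum())        # total token count per document
```

Since no vocabulary is kept, a column index cannot be mapped back to a word, and distinct words can collide in the same column. Non-negative values matter when the output feeds a model such as `MultinomialNB`, which rejects negative feature values.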

#### TfidfVectorizer
* Computed by multiplying the term frequency of a word in a document by its inverse document frequency
* Each frequency is usually log-scaled before use
* $tf(t, d) = \log(f(t, d) + 1)$
* $idf(t, D) = \log\frac{|D|}{|\{d \in D : t \in d\}| + 1}$
* $tfidf(t, d, D) = tf(t, d) \times idf(t, D)$
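The formulas above describe one common variant. With its default settings, scikit-learn's `TfidfVectorizer` uses a slightly different one: the raw count as the term frequency, $idf(t) = \ln\frac{1 + |D|}{1 + df(t)} + 1$, and L2 normalization of each row. The sketch below checks this on a tiny hand-made corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative corpus (not taken from the 20 Newsgroups data)
docs = ["the cat sat", "the dog sat on the mat"]

vectorizer = TfidfVectorizer()       # defaults: smooth_idf=True, norm="l2"
X = vectorizer.fit_transform(docs)

vocab = vectorizer.vocabulary_       # word -> column index
idf = vectorizer.idf_                # learned idf weight per column

# "the" occurs in both documents and "cat" in only one,
# so "cat" receives the larger idf weight
print(idf[vocab["the"]], idf[vocab["cat"]])

# Each row has unit L2 norm by default
row_norms = np.asarray(np.sqrt(X.multiply(X).sum(axis=1))).ravel()
print(row_norms)
```

Passing `sublinear_tf=True` replaces the raw count with $1 + \ln(tf)$, which is closer in spirit to the log-scaled term frequency listed above.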

## Gaussian Naive Bayes

* Assumes that the input features follow a Gaussian (normal) distribution within each class
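A minimal sketch of fitting and evaluating `GaussianNB`. To stay self-contained it uses synthetic two-class data (one Gaussian blob per class) as a stand-in for real continuous features, so the accuracy it prints says nothing about performance on the datasets in this notebook.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn import metrics

rng = np.random.default_rng(0)
# Synthetic two-class data: one Gaussian blob per class
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(5.0, 1.0, size=(200, 2))])
y = np.array([0] * 200 + [1] * 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = GaussianNB()
model.fit(X_train, y_train)          # estimates a mean and variance per class and feature

predict = model.predict(X_test)
accuracy = metrics.accuracy_score(y_test, predict)
print(f"Test accuracy: {accuracy:.3f}")
print(model.theta_)                  # per-class feature means learned from the data
```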

## Bernoulli Naive Bayes

* Assumes that the input features are binary values generated by a Bernoulli distribution

### Training and Evaluation (Count)
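The original cells for this step are not included here. As a minimal sketch, the snippet below trains `BernoulliNB` on word presence/absence features from a tiny hand-made corpus that stands in for the newsgroup posts. The documents and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Tiny illustrative corpus: label 1 = promotional, label 0 = work-related
train_docs = ["free prize money", "win free money now",
              "meeting schedule today", "project meeting notes"]
train_labels = [1, 1, 0, 0]
test_docs = ["free money", "meeting today"]

# binary=True records whether each word occurs, not how often
vectorizer = CountVectorizer(binary=True)
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)   # reuse the training vocabulary

model = BernoulliNB()                      # default alpha=1.0 (Laplace smoothing)
model.fit(X_train, train_labels)

predictions = model.predict(X_test)
print(predictions)
print(model.score(X_train, train_labels))  # accuracy on the training documents
```

`BernoulliNB` also binarizes its input by default (`binarize=0.0`), so passing plain counts would give the same features. Unlike the multinomial model, it also accounts for the words that are absent from a document.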

### Training and Evaluation (Hash)

### Training and Evaluation (Tf-idf)

### Visualization

## Multinomial Naive Bayes

* Assumes that the input features are frequency counts generated by a multinomial distribution

### Training and Evaluation (Count)
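The original cells for this step are not included here. As a minimal sketch, the snippet below trains `MultinomialNB` on raw word counts from a tiny hand-made corpus that stands in for the newsgroup posts. The documents and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny illustrative corpus: label 1 = promotional, label 0 = work-related
train_docs = ["free prize money", "win free money now",
              "meeting schedule today", "project meeting notes"]
train_labels = [1, 1, 0, 0]
test_docs = ["free money", "meeting today"]

vectorizer = CountVectorizer()             # word counts per document
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)   # reuse the training vocabulary

model = MultinomialNB()                    # default alpha=1.0 (Laplace smoothing)
model.fit(X_train, train_labels)

predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)  # one row per document, columns follow model.classes_
print(predictions)
print(probabilities.round(3))
```

The smoothing parameter `alpha` keeps a word that never appeared with a class during training from forcing that class's probability to zero.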

### Training and Evaluation (Tf-idf)
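The original cells for this step are not included here. `MultinomialNB` is defined for counts, but it also accepts the non-negative fractional weights that Tf-idf produces. The sketch below chains the two steps with `make_pipeline` on a tiny hand-made corpus. The pipeline is a convenience chosen for this example, not necessarily how the original notebook was organized.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn import metrics

# Tiny illustrative corpus: label 1 = promotional, label 0 = work-related
train_docs = ["free prize money", "win free money now",
              "meeting schedule today", "project meeting notes"]
train_labels = [1, 1, 0, 0]
test_docs = ["free money", "meeting today"]
test_labels = [1, 0]

# The pipeline fits the vectorizer on the training text only,
# then applies the same transformation at prediction time
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

predictions = model.predict(test_docs)
accuracy = metrics.accuracy_score(test_labels, predictions)
print(predictions, accuracy)
```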

### Visualization