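### Characteristics of the naive Bayes classifier
1) Classification based on Bayesian estimation, a statistical method that infers the posterior probability from a prior probability and additional information (the observed data)<br>
2) The prior probability, by default the relative frequency of each category of the dependent variable, is an important setting<br>
3) The posterior probability of each observation is computed from the prior and the likelihood of its feature values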
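
The three points above can be made concrete with a small hand calculation. The sketch below applies Bayes' rule to a single feature with Gaussian likelihoods; the priors, means, and variances are hypothetical numbers chosen only for illustration, not values taken from any dataset in this notebook. With several features, naive Bayes would multiply the per-feature likelihoods, treating the features as conditionally independent given the class.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical values: two classes with priors 2/3 and 1/3, and one feature
# whose per-class mean and variance are assumed to be known.
priors = {0: 2 / 3, 1: 1 / 3}
params = {0: (4.9, 0.68), 1: (1.46, 0.03)}  # class -> (mean, variance)

x = 1.4  # observed feature value

# Unnormalized posterior = prior * likelihood
joint = {c: priors[c] * gaussian_pdf(x, *params[c]) for c in priors}
evidence = sum(joint.values())  # normalizing constant P(x)
posterior = {c: joint[c] / evidence for c in joint}
print(posterior)
```

Even though class 1 has the smaller prior here, the observed value lies so close to its mean that its posterior dominates.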

#### sklearn - GaussianNB()

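- Pass the independent variables (X) and the dependent variable (y) to the fit() method of a GaussianNB() object
- The fitted model's predict_proba() method returns the predicted class probabilities
- For binary classification with classes 0 and 1, the second column of the output is the predicted probability of class 1
- Naive Bayes classifiers are commonly used for tasks such as spam filtering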
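
As a minimal sketch of the workflow described in the list above, the following fits GaussianNB on synthetic two-class data (generated here purely for illustration) and reads the class-1 probability from predict_proba(). The columns of predict_proba() follow the order of model.classes_, so looking up the column index from classes_ is a slightly more defensive alternative to hard-coding column 1.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic data (an assumption for illustration): two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(3.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

model = GaussianNB().fit(X, y)
proba = model.predict_proba(X)  # one row per sample, one column per class

print(model.classes_)  # column order of predict_proba

# Locate the column that belongs to class 1 instead of assuming its position
col_of_class1 = list(model.classes_).index(1)
p_class1 = proba[:, col_of_class1]
print(p_class1[:3])
```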

In [2]:
import pandas as pd
from sklearn.naive_bayes import GaussianNB

In [3]:
df = pd.read_csv("C:/Users/Python/Data/iris.csv")

In [4]:
df.head(3)

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa


In [5]:
df['is_setosa'] = (df['Species'] == 'setosa') + 0
df.head(2)

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species,is_setosa
0,5.1,3.5,1.4,0.2,setosa,1
1,4.9,3.0,1.4,0.2,setosa,1


In [6]:
df["is_setosa"].value_counts()

0    100
1     50
Name: is_setosa, dtype: int64

In [9]:
df["is_setosa"].value_counts(normalize = True)

0    0.666667
1    0.333333
Name: is_setosa, dtype: float64

In [10]:
model = GaussianNB().fit(X = df.iloc[:, :4],
                         y = df['is_setosa'])
model

GaussianNB()

In [11]:
model.class_prior_  # prior probability of each class of the dependent variable

array([0.66666667, 0.33333333])

In [12]:
model.theta_  # per-class mean of each feature (one row per class)

array([[6.262, 2.872, 4.906, 1.676],
       [5.006, 3.428, 1.462, 0.246]])
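
The two fitted attributes shown above can be checked by hand: class_prior_ should match the relative class frequencies, and theta_ should match the per-class feature means. The sketch below does this on a small synthetic DataFrame; the column names f1, f2, and label are hypothetical and stand in for the real feature and target columns.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (an assumption; not the iris file used above)
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "f1": rng.normal(size=90),
    "f2": rng.normal(size=90),
    "label": [0] * 60 + [1] * 30,
})
features = ["f1", "f2"]

nb = GaussianNB().fit(toy[features], toy["label"])

# Relative class frequencies, ordered by class label
manual_prior = toy["label"].value_counts(normalize=True).sort_index().to_numpy()
# Mean of each feature within each class, one row per class
manual_means = toy.groupby("label")[features].mean().to_numpy()

print(np.allclose(nb.class_prior_, manual_prior))
print(np.allclose(nb.theta_, manual_means))
```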

In [14]:
pred = model.predict_proba(df.iloc[:, :4])
pred = pred[:, 1]

In [16]:
pred[:3]

array([1., 1., 1.])

In [17]:
from sklearn.metrics import accuracy_score

In [18]:
pred_class = (pred > 0.5) + 0  # classify as 1 when the predicted probability exceeds the 0.5 threshold

In [19]:
accuracy_score(y_true = df['is_setosa'],
              y_pred = pred_class)

1.0
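
Note that the accuracy above is measured on the same rows the model was trained on, so it is a training accuracy. For a binary problem, thresholding the class-1 probability at 0.5 should give essentially the same labels as calling predict() directly, and the estimator's score() method reports accuracy as well. The sketch below illustrates this on synthetic data generated only for this example.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, size=(80, 2)),
               rng.normal(2.0, 1.0, size=(40, 2))])
y = np.array([0] * 80 + [1] * 40)

model = GaussianNB().fit(X, y)

# Route 1: threshold the class-1 probability manually
by_threshold = (model.predict_proba(X)[:, 1] > 0.5).astype(int)
# Route 2: let the model pick the most probable class
by_predict = model.predict(X)

print(accuracy_score(y, by_threshold))
print(accuracy_score(y, by_predict))
print(model.score(X, y))  # accuracy via the estimator's own score() method
```

Manual thresholding remains useful when a cutoff other than 0.5 is wanted.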

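#### Q1. Fit a naive Bayes classifier using only the rows with BMI > 0
- With Outcome as the dependent variable and all remaining variables as independent variables, what are the prior probabilities of the dependent variable?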

In [20]:
df = pd.read_csv("C:/Users/Python/Data/diabetes.csv")

In [21]:
df.head(2)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0


In [32]:
df_sub = df[df['BMI'] > 0]
df_sub.head(2)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0


In [33]:
model = GaussianNB().fit(X = df_sub.iloc[:, :-1],  # every column except Outcome (the last one)
                         y = df_sub['Outcome'])
model

GaussianNB()

In [34]:
model.class_prior_

array([0.64861295, 0.35138705])

In [35]:
df_sub['Outcome'].value_counts(normalize = True)

0    0.648613
1    0.351387
Name: Outcome, dtype: float64
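
As the two outputs above show, GaussianNB uses the empirical class frequencies as priors by default. The constructor also accepts a priors argument to override them. The sketch below, on synthetic data generated only for illustration, compares the default priors with uniform priors of 0.5 each and shows how the posterior for a single hypothetical point shifts.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic one-feature data with a 65/35 class split (an assumption)
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, size=(65, 1)),
               rng.normal(1.5, 1.0, size=(35, 1))])
y = np.array([0] * 65 + [1] * 35)

default_nb = GaussianNB().fit(X, y)                    # priors estimated from y
uniform_nb = GaussianNB(priors=[0.5, 0.5]).fit(X, y)   # priors supplied by hand

print(default_nb.class_prior_)  # empirical class frequencies
print(uniform_nb.class_prior_)  # the priors passed in

# Same likelihoods, different priors: compare the class-1 posterior at one point
x_new = np.array([[0.75]])
p1_default = default_nb.predict_proba(x_new)[0, 1]
p1_uniform = uniform_nb.predict_proba(x_new)[0, 1]
print(p1_default, p1_uniform)
```

Because both models estimate the same per-class means and variances, the only difference between the two posteriors comes from the prior.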

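#### Q2. Use Outcome as the dependent variable and Glucose, Age, and BloodPressure as independent variables
- Find the prior probabilities of the dependent variable and compute the model's accuracy.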

In [36]:
df.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [38]:
model = GaussianNB().fit(X = df[['Glucose', 'BloodPressure','Age']],
                         y = df['Outcome'])
model

GaussianNB()

In [39]:
pred = model.predict_proba(df.loc[:, ['Glucose', 'BloodPressure','Age']])

In [43]:
pred

array([[0.32535177, 0.67464823],
       [0.90063283, 0.09936717],
       [0.12024892, 0.87975108],
       ...,
       [0.77355188, 0.22644812],
       [0.59287801, 0.40712199],
       [0.91997342, 0.08002658]])

In [45]:
from sklearn.metrics import accuracy_score

pred_class = (pred[:, 1] > 0.5) + 0
pred_class[:4]

array([1, 0, 1, 0])

In [47]:
accuracy_score(y_pred = pred_class, y_true = df['Outcome'])

0.7552083333333334

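#### Q3. Use Outcome as the dependent variable and pregnancy status, age group, Glucose, and BMI as independent variables
- Fit both a naive Bayes model and a logistic regression model. What is the accuracy of the more accurate of the two?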

In [50]:
df_sub = df[df['BMI'] > 0].copy()  # copy so the new columns are added to an independent DataFrame
df_sub['Age_g'] = (df_sub['Age'] // 10) * 10  # age group in 10-year bands
df_sub['is_preg'] = (df_sub['Pregnancies'] > 0) + 0  # 1 if Pregnancies > 0, else 0
df_sub.head(10)


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome,Age_g,is_preg
0,6,148,72,35,0,33.6,0.627,50,1,50,1
1,1,85,66,29,0,26.6,0.351,31,0,30,1
2,8,183,64,0,0,23.3,0.672,32,1,30,1
3,1,89,66,23,94,28.1,0.167,21,0,20,1
4,0,137,40,35,168,43.1,2.288,33,1,30,0
5,5,116,74,0,0,25.6,0.201,30,0,30,1
6,3,78,50,32,88,31.0,0.248,26,1,20,1
7,10,115,0,0,0,35.3,0.134,29,0,20,1
8,2,197,70,45,543,30.5,0.158,53,1,50,1
10,4,110,92,0,0,37.6,0.191,30,0,30,1


In [56]:
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df_sub, train_size = 0.8, random_state = 123)  # fixed seed (123) for a reproducible split
df_train.head(2)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome,Age_g,is_preg
247,0,165,90,33,680,52.3,0.427,23,0,20,0
659,3,80,82,31,70,34.2,1.292,27,1,20,1


In [57]:
model = GaussianNB().fit(X = df_train.loc[:, ['is_preg', 'Glucose', 'BMI', 'Age_g']],
                         y = df_train['Outcome'])
pred = model.predict_proba(df_test.loc[:, ['is_preg', 'Glucose', 'BMI', 'Age_g']])
pred[:4,]

array([[0.09436402, 0.90563598],
       [0.74783283, 0.25216717],
       [0.11042961, 0.88957039],
       [0.57991266, 0.42008734]])

In [58]:
accuracy_score(y_pred = (pred[:, 1] > 0.5) + 0,
               y_true = df_test['Outcome'])


0.8026315789473685

In [59]:
# Logistic regression

from sklearn.linear_model import LogisticRegression

model_lr = LogisticRegression()
model_lr.fit(X = df_train.loc[:, ['is_preg', 'Glucose', 'BMI', 'Age_g']],
             y = df_train['Outcome'])

LogisticRegression()

In [67]:
pred_lr = model_lr.predict_proba(df_test.loc[:, ['is_preg', 'Glucose', 'BMI', 'Age_g']])
pred_lr_class = (pred_lr[:, 1] > 0.5) + 0  # threshold the class-1 probability at 0.5

In [68]:
accuracy_score(y_true = df_test["Outcome"],
              y_pred = pred_lr_class)

0.8289473684210527

##### ==> On the test set, the logistic regression model reaches about 83% accuracy, higher than the naive Bayes model's roughly 80%.
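
A comparison based on a single train/test split can change with the random seed, so one optional check is to compare the two models with cross-validation. The sketch below is an illustration only: it uses synthetic data from make_classification rather than the diabetes file, and the choice of 5 folds and max_iter=1000 are assumptions, not settings taken from the analysis above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (an assumption; not the diabetes file used above)
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

candidates = {
    "naive_bayes": GaussianNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

cv_accuracy = {}
for name, estimator in candidates.items():
    # Accuracy on each of 5 held-out folds, then averaged
    scores = cross_val_score(estimator, X, y, cv=5, scoring="accuracy")
    cv_accuracy[name] = scores.mean()
    print(f"{name}: mean accuracy over 5 folds = {scores.mean():.3f}")
```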