# Exercise

## Problem 1 (Gaussian Naive Bayes)

Classify the fetch_california_housing data using a Gaussian Naive Bayes model.
  * Use the validation_curve function to plot how the results change with the following hyperparameter.
    * var_smoothing
  * Find the parameter value that gives the highest accuracy.

```python
from sklearn.datasets import fetch_california_housing
import numpy as np
import pandas as pd

housing = fetch_california_housing()
housing_X = housing.data

# Round the target to the nearest integer and cast to int to obtain discrete class labels
housing_y = np.round(housing.target).astype(int)
print('Number of target: ',len(set(housing_y)))

pd.DataFrame(housing_X, columns=housing.feature_names).head(3)

```

## Problem 1 Solution

### Setup

In [None]:
# Common imports
import sklearn
import numpy as np

### Load data

#### California Housing dataset

* The target variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000).

In [None]:
from sklearn.datasets import fetch_california_housing
import pandas as pd

housing = fetch_california_housing()
housing_X = housing.data
housing_y = np.round(housing.target).astype(int) # make y discrete
print('Number of target: ',len(set(housing_y)))

pd.DataFrame(housing_X, columns=housing.feature_names).head(3)
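
The load step above turns a continuous target into class labels with `np.round(...).astype(int)`. The sketch below applies the same transformation to a few made-up values (they are illustrative, not taken from the dataset) to show what the labels look like. Note that `np.round` rounds exact halves to the nearest even integer, so 1.5 and 2.5 both land in class 2.

```python
import numpy as np

# Toy target values in units of $100,000 (made up for illustration).
toy_target = np.array([0.4, 1.49, 1.5, 2.5, 3.5, 4.8])

# Same transformation as in the load step: round, then cast to int.
toy_classes = np.round(toy_target).astype(int)
print(toy_classes)

# Count how many samples fall into each class label.
labels, counts = np.unique(toy_classes, return_counts=True)
class_counts = dict(zip(labels.tolist(), counts.tolist()))
print(class_counts)
```

Running the same `np.unique(..., return_counts=True)` call on `housing_y` would show how balanced the resulting classes are, which helps when interpreting the accuracy values reported later.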

#### Splitting

In [None]:
from sklearn.model_selection import train_test_split
housing_X_train, housing_X_test, housing_y_train, housing_y_test = train_test_split(housing_X, housing_y, random_state=42)

### GaussianNB model

**- sklearn.naive_bayes.[GaussianNB(*, priors=None, var_smoothing=1e-09)](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) : Gaussian Naive Bayes classifier. Its `fit` method returns the instance itself.**

Can perform online updates to model parameters via partial_fit.
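
As a minimal sketch of the online update mentioned above, the snippet below feeds synthetic data (an assumption, used only for illustration) to `partial_fit` in batches and compares the learned per-class feature means with those from a single `fit` call. The batch size of 50 is arbitrary.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic two-class data (hypothetical, not the housing data).
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 3)),
               rng.normal(2.0, 1.0, size=(100, 3))])
y = np.array([0] * 100 + [1] * 100)

# Shuffle so that the batches mix both classes.
order = rng.permutation(len(y))
X, y = X[order], y[order]

# Reference model trained on all data at once.
full_model = GaussianNB().fit(X, y)

# Online model trained batch by batch.
online_model = GaussianNB()
for start in range(0, len(y), 50):
    batch = slice(start, start + 50)
    # `classes` is required on the first call; repeating the same list later is allowed.
    online_model.partial_fit(X[batch], y[batch], classes=np.array([0, 1]))

# The per-class feature means should agree up to floating-point error.
means_match = np.allclose(full_model.theta_, online_model.theta_)
print(means_match)
```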

In [None]:
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(housing_X_train, housing_y_train)

#### Evaluation

In [None]:
from sklearn import metrics

predict = model.predict(housing_X_train)
acc = metrics.accuracy_score(housing_y_train, predict)
print('Train Accuracy: {}'.format(acc))


predict = model.predict(housing_X_test)
acc = metrics.accuracy_score(housing_y_test, predict)
print('Test Accuracy: {}'.format(acc))

### Validation_curve(param_name='var_smoothing')

In [None]:
from sklearn.model_selection import validation_curve

param_range= [10**i for i in range(-11,4)]

from sklearn.naive_bayes import GaussianNB

# validation_curve clones and refits the estimator for every parameter value,
# so the model does not need to be fitted beforehand
smooth_model = GaussianNB()

train_scores, test_scores = validation_curve(
                estimator=smooth_model, 
                X=housing_X_train, 
                y=housing_y_train, 
                param_name='var_smoothing', 
                param_range=param_range,
                cv=2)

train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
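
The exercise asks for the parameter with the highest accuracy. One way to read it off the validation scores programmatically, rather than from the plot, is to take the `argmax` of the mean validation score. The sketch below uses synthetic stand-in data from `make_classification` (an assumption; in the notebook the same steps would apply to `test_mean` and `param_range`), and all `demo_*` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (substitute housing_X_train / housing_y_train in the notebook).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0)

demo_range = [10.0 ** i for i in range(-11, 4)]

# Scores have shape (n_param_values, n_cv_folds).
_, demo_test_scores = validation_curve(
    estimator=GaussianNB(),
    X=X_demo,
    y=y_demo,
    param_name='var_smoothing',
    param_range=demo_range,
    cv=2)

# Average over folds, then pick the value with the best mean validation accuracy.
demo_test_mean = demo_test_scores.mean(axis=1)
best_index = int(np.argmax(demo_test_mean))
best_value = demo_range[best_index]
print(best_value, demo_test_mean[best_index])
```

`np.argmax` returns the first maximum, so if several values tie, the smallest one in the range is reported.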

#### Visualization

In [None]:
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

plt.plot(param_range, train_mean, 
         color='blue', marker='o', 
         markersize=5, label='Training accuracy')

plt.fill_between(param_range, train_mean + train_std,
                 train_mean - train_std, alpha=0.15,
                 color='blue')

plt.plot(param_range, test_mean, 
         color='green', linestyle='--', 
         marker='s', markersize=5, 
         label='Validation accuracy')

plt.fill_between(param_range, 
                 test_mean + test_std,
                 test_mean - test_std, 
                 alpha=0.15, color='green')


plt.grid()
plt.xscale('log')
plt.legend(loc='lower right')
plt.xlabel('var_smoothing')
plt.ylabel('Accuracy')
plt.ylim([np.min(train_mean)*0.8, np.max(train_mean)*1.2])
plt.tight_layout()
plt.show()

#### Evaluation
**Default model performance**
* Train Accuracy: 0.2554909560723514
* Test Accuracy: 0.27015503875968994

In [None]:
from sklearn.naive_bayes import GaussianNB

proper_model = GaussianNB(var_smoothing=10**-6)
proper_model.fit(housing_X_train, housing_y_train)

from sklearn import metrics

predict = proper_model.predict(housing_X_train)
acc = metrics.accuracy_score(housing_y_train, predict)
print('Train Accuracy: {}'.format(acc))


predict = proper_model.predict(housing_X_test)
acc = metrics.accuracy_score(housing_y_test, predict)
print('Test Accuracy: {}'.format(acc))

#### Result
**Best performance**
* Train Accuracy: 0.3303617571059432
* Test Accuracy: 0.3412790697674419

## Problem 2 (Bernoulli Naive Bayes)
Classify the Forest CoverType data (feature columns at index 10~53) using a Bernoulli Naive Bayes model.
  * Use the validation_curve function to plot how the results change with the following hyperparameter.
    * alpha
  * Find the parameter value that gives the highest accuracy.

```python
from sklearn.datasets import fetch_covtype
import pandas as pd

covtype = fetch_covtype()
# covtype columns 10..52 (the slice end is exclusive; use 10:54 to also include column 53)
bi_covtype_X = covtype.data[:,10:53] # discrete features for Bernoulli model
bi_covtype_y = covtype.target

covtype_feature_name = covtype.feature_names[10:53]
print('Number of targets: ',len(set(bi_covtype_y)))

pd.DataFrame(bi_covtype_X, columns=covtype_feature_name).head(3)
```



## Problem 2 Solution

### Setup

In [None]:
# Common imports
import sklearn
import numpy as np

### Load data

#### Forest CoverType dataset
* Feature data describing forest cover types
* Predict which cover type each sample belongs to
* https://archive.ics.uci.edu/ml/datasets/Covertype 
* $Y$: discrete
  * $X_{0 ∼ 9}$: continuous
  * $X_{10 ∼ 53}$: binary
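
Because Python slices exclude their end index, it is worth checking how many columns a given slice of the 54-column feature matrix actually selects. The sketch below uses a dummy array of zeros as a stand-in for `covtype.data` (an assumption, so that no download is needed).

```python
import numpy as np

# Stand-in for covtype.data: 5 rows, 54 columns indexed 0..53 (dummy values).
demo = np.zeros((5, 54))

# The end index is exclusive, so 10:53 selects columns 10..52.
shape_exclusive = demo[:, 10:53].shape
print(shape_exclusive)

# 10:54 selects columns 10..53, i.e. the full binary block described above.
shape_full = demo[:, 10:54].shape
print(shape_full)

# A quick check that a block contains only 0/1 values, as BernoulliNB expects.
is_binary = bool(np.isin(demo[:, 10:54], [0, 1]).all())
print(is_binary)
```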

In [None]:
from sklearn.datasets import fetch_covtype
import pandas as pd

covtype = fetch_covtype()
# covtype columns 10..52 (the slice end is exclusive; use 10:54 to also include column 53)
bi_covtype_X = covtype.data[:,10:53] # discrete features for Bernoulli model
bi_covtype_y = covtype.target

covtype_feature_name = covtype.feature_names[10:53]
print('Number of targets: ',len(set(bi_covtype_y)))

pd.DataFrame(bi_covtype_X, columns=covtype_feature_name).head(3)

#### Splitting

In [None]:
from sklearn.model_selection import train_test_split
bi_covtype_X_train, bi_covtype_X_test, bi_covtype_y_train, bi_covtype_y_test = train_test_split(bi_covtype_X, bi_covtype_y, random_state=42)

### BernoulliNB model

In [None]:
from sklearn.naive_bayes import BernoulliNB

bernoulli_model_cov = BernoulliNB()
bernoulli_model_cov.fit(bi_covtype_X_train, bi_covtype_y_train)

#### Evaluation

In [None]:
from sklearn import metrics

# use covtype binary data
predict = bernoulli_model_cov.predict(bi_covtype_X_train)
acc = metrics.accuracy_score(bi_covtype_y_train, predict)
print('Train Accuracy(covtype): {}'.format(acc))
predict = bernoulli_model_cov.predict(bi_covtype_X_test)
acc = metrics.accuracy_score(bi_covtype_y_test, predict)
print('Test Accuracy(covtype): {}'.format(acc))
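
`alpha` in BernoulliNB is an additive (Laplace/Lidstone) smoothing term. For a class $y$ and binary feature $i$, the smoothed estimate is $P(x_i = 1 \mid y) = \frac{N_{yi} + \alpha}{N_y + 2\alpha}$, where $N_{yi}$ is the number of class-$y$ samples with feature $i$ set and $N_y$ is the number of class-$y$ samples. The sketch below checks this on a tiny hand-made dataset (hypothetical values) against the fitted model's `feature_log_prob_`.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Tiny hand-made binary dataset (hypothetical).
X_toy = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 0, 1],
                  [0, 1, 1]])
y_toy = np.array([0, 0, 1, 1])
alpha = 1.0

toy_model = BernoulliNB(alpha=alpha).fit(X_toy, y_toy)

# Per-class counts of samples in which each feature equals 1.
feature_count = np.array([X_toy[y_toy == c].sum(axis=0) for c in toy_model.classes_])
# Number of samples in each class.
class_count = np.array([(y_toy == c).sum() for c in toy_model.classes_])

# Smoothed estimate: (N_yi + alpha) / (N_y + 2 * alpha).
manual_prob = (feature_count + alpha) / (class_count[:, None] + 2 * alpha)

formula_matches = np.allclose(np.exp(toy_model.feature_log_prob_), manual_prob)
print(manual_prob)
print(formula_matches)
```

A larger `alpha` pulls every estimate toward 0.5, which is why very large values eventually flatten the model's predictions.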

### Validation_curve(alpha)

In [None]:
from sklearn.model_selection import validation_curve

param_range= [10**i for i in range(-11,4)]

from sklearn.naive_bayes import BernoulliNB

# No prior fit is needed: validation_curve clones and refits the estimator
smooth_model = BernoulliNB()

train_scores, test_scores = validation_curve(
                estimator=smooth_model, 
                X=bi_covtype_X_train, 
                y=bi_covtype_y_train, 
                param_name='alpha', 
                param_range=param_range,
                cv=2)

train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

#### Visualization

In [None]:
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

plt.plot(param_range, train_mean, 
         color='blue', marker='o', 
         markersize=5, label='Training accuracy')

plt.fill_between(param_range, train_mean + train_std,
                 train_mean - train_std, alpha=0.15,
                 color='blue')

plt.plot(param_range, test_mean, 
         color='green', linestyle='--', 
         marker='s', markersize=5, 
         label='Validation accuracy')

plt.fill_between(param_range, 
                 test_mean + test_std,
                 test_mean - test_std, 
                 alpha=0.15, color='green')


plt.grid()
plt.xscale('log')
plt.legend(loc='lower right')
plt.xlabel('alpha')
plt.ylabel('Accuracy')
plt.ylim([np.min(train_mean)*0.8, np.max(train_mean)*1.2])
plt.tight_layout()
plt.show()

#### Evaluation
**Default model performance**
* Train Accuracy(covtype): 0.6237874604999553
* Test Accuracy(covtype): 0.6204071516595182

In [None]:
from sklearn.naive_bayes import BernoulliNB

proper_model_cov = BernoulliNB(alpha=10**2)
proper_model_cov.fit(bi_covtype_X_train, bi_covtype_y_train)

from sklearn import metrics

# use covtype binary data
predict = proper_model_cov.predict(bi_covtype_X_train)
acc = metrics.accuracy_score(bi_covtype_y_train, predict)
print('Train Accuracy(cov): {}'.format(acc))
predict = proper_model_cov.predict(bi_covtype_X_test)
acc = metrics.accuracy_score(bi_covtype_y_test, predict)
print('Test Accuracy(cov): {}'.format(acc))

#### Result
**Best performance**
* Train Accuracy(cov): 0.6338297086233445
* Test Accuracy(cov): 0.6310919567926307

## Problem 3 (Multinomial Naive Bayes)

Classify the 20 Newsgroup data using a Multinomial Naive Bayes model.
  * Use the Hashing Vectorizer to vectorize the text features.
    * Set the Hashing Vectorizer's binary parameter to True so that the data is vectorized as binary data.

  ```python
  from sklearn.feature_extraction.text import HashingVectorizer

  # Hashing Vectorizer
  hash_vectorizer = HashingVectorizer(n_features=1000, binary=True)
  ```
  * Use the validation_curve function to plot how the results change with the following hyperparameter.
    * alpha
  * Find the parameter value that gives the highest accuracy.

## Problem 3 Solution

### Setup

In [None]:
# Common imports
import sklearn
import numpy as np

### Load data

#### 20 Newsgroup dataset
* Classify which newsgroup each article belongs to

* News articles are text data, so they must be converted to numeric feature vectors before modeling.

In [None]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

newsgroup = fetch_20newsgroups()

newsgroup_X = newsgroup.data
newsgroup_y = newsgroup.target

pd.DataFrame(newsgroup.data).head(3)

#### Splitting

In [None]:
from sklearn.model_selection import train_test_split
newsgroup_X_train, newsgroup_X_test, newsgroup_y_train, newsgroup_y_test = train_test_split(newsgroup_X, newsgroup_y, random_state=42)

#### Vectorization for text data
* sklearn.feature_extraction.text.[CountVectorizer()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
  * Converts each document into a list of tokens
  * Counts how often each token occurs in each document
  * Encodes each document as a Bag of Words (BOW) vector
* sklearn.feature_extraction.text.[HashingVectorizer()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html#sklearn.feature_extraction.text.HashingVectorizer)
  * Uses a hash function to map each word to a feature index, which shortens run time
* sklearn.feature_extraction.text.[TfidfVectorizer()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer)
  * Instead of using raw counts, reduces the weight of words that appear in common across all documents
  * $\text{tf-idf}(d, t) = tf(d, t) \cdot idf(t)$
    * $d, t$ : document, term
    * $tf(d,t)$ : term frequency.
    * $idf(t)$ : inverse document frequency.
    * $idf(t) = \log\frac{n}{1+df(t)}$
      * $n$ : total number of documents
      * $df(t)$ : number of documents that contain term $t$
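
The tf-idf definition above can be worked through by hand on a toy corpus. The sketch below implements exactly the formula as written in this section, $tf(d,t) \cdot \log\frac{n}{1+df(t)}$, in plain Python. The corpus and the helper names are hypothetical, and scikit-learn's `TfidfVectorizer` uses a differently smoothed idf by default, so its numbers will not match these.

```python
import math

# Toy corpus: each document is already split into tokens (hypothetical data).
docs = [
    ["naive", "bayes", "model"],
    ["bayes", "rule"],
    ["text", "model", "model"],
]
n = len(docs)  # total number of documents


def tf(doc, term):
    """Term frequency: how many times `term` occurs in `doc`."""
    return doc.count(term)


def df(term):
    """Document frequency: number of documents that contain `term`."""
    return sum(term in doc for doc in docs)


def idf(term):
    """Inverse document frequency as defined in this section."""
    return math.log(n / (1 + df(term)))


def tf_idf(doc, term):
    return tf(doc, term) * idf(term)


# "rule" occurs in one document, so idf = log(3 / 2) > 0.
rule_score = tf_idf(docs[1], "rule")
# "bayes" occurs in two of three documents, so idf = log(3 / 3) = 0.
bayes_score = tf_idf(docs[0], "bayes")
print(rule_score, bayes_score)
```

With this definition a term that appears in all but one document gets zero weight, and a term that appears in every document gets a negative one, which is one reason library implementations adjust the formula.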


In [None]:
from sklearn.feature_extraction.text import HashingVectorizer

# Hashing Vectorizer
hash_vectorizer = HashingVectorizer(n_features=1000, binary=True)
X_train_hash = hash_vectorizer.fit_transform(newsgroup_X_train)
X_test_hash = hash_vectorizer.transform(newsgroup_X_test)
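
To see what the hashed representation looks like, the sketch below vectorizes two made-up sentences into a deliberately tiny feature space. `n_features=16` and `norm=None` are choices made here for readability (the cell above keeps the default L2 normalization), so the stored values stay exactly 1.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Two made-up documents (hypothetical).
toy_docs = ["the cat sat on the mat", "the dog sat"]

# binary=True stores presence/absence instead of counts;
# norm=None skips the row normalization so the entries remain 1.
toy_vectorizer = HashingVectorizer(n_features=16, binary=True, norm=None)

# HashingVectorizer is stateless, so transform works without a prior fit.
X_toy_hash = toy_vectorizer.transform(toy_docs)

print(X_toy_hash.shape)
print(X_toy_hash.toarray())
```

Because the index comes from a hash rather than a learned vocabulary, distinct words can collide in the same column. A small `n_features` makes collisions more likely, which is a trade-off to keep in mind with `n_features=1000` on the newsgroup text.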

### MultinomialNB model

**- sklearn.naive_bayes.[MultinomialNB(*, alpha=1.0, fit_prior=True, class_prior=None)](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) : Naive Bayes classifier for multinomial models. Its `fit` method returns the instance itself.**

The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification).

In [None]:
from sklearn.naive_bayes import MultinomialNB

model_hash = MultinomialNB()
model_hash.fit(X_train_hash, newsgroup_y_train)

#### Evaluation

In [None]:
from sklearn import metrics

# use Hash Vector
predict = model_hash.predict(X_train_hash)
acc = metrics.accuracy_score(newsgroup_y_train, predict)
print('Train Accuracy(hash): {}'.format(acc))
predict = model_hash.predict(X_test_hash)
acc = metrics.accuracy_score(newsgroup_y_test, predict)
print('Test Accuracy(hash): {}'.format(acc))

### Validation_curve(alpha)

In [None]:
from sklearn.model_selection import validation_curve
param_range= [10**i for i in range(-11,4)]

from sklearn.naive_bayes import MultinomialNB

# No prior fit is needed: validation_curve clones and refits the estimator
smooth_model = MultinomialNB()

train_scores, test_scores = validation_curve(
                estimator=smooth_model, 
                X=X_train_hash, 
                y=newsgroup_y_train, 
                param_name='alpha', 
                param_range=param_range,
                cv=2)

train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

#### Visualization

In [None]:
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

plt.plot(param_range, train_mean, 
         color='blue', marker='o', 
         markersize=5, label='Training accuracy')

plt.fill_between(param_range, train_mean + train_std,
                 train_mean - train_std, alpha=0.15,
                 color='blue')

plt.plot(param_range, test_mean, 
         color='green', linestyle='--', 
         marker='s', markersize=5, 
         label='Validation accuracy')

plt.fill_between(param_range, 
                 test_mean + test_std,
                 test_mean - test_std, 
                 alpha=0.15, color='green')


plt.grid()
plt.xscale('log')
plt.legend(loc='lower right')
plt.xlabel('alpha')
plt.ylabel('Accuracy')
plt.ylim([np.min(train_mean)*0.8, np.max(train_mean)*1.2])
plt.tight_layout()
plt.show()

#### Evaluation
**Default model performance**
* Train Accuracy(hash): 0.7721862109605185
* Test Accuracy(hash): 0.68045245669848

In [None]:
from sklearn.naive_bayes import MultinomialNB

proper_model_hash = MultinomialNB(alpha=10**-1)
proper_model_hash.fit(X_train_hash, newsgroup_y_train)


from sklearn import metrics

# use hash vector
predict = proper_model_hash.predict(X_train_hash)
acc = metrics.accuracy_score(newsgroup_y_train, predict)
print('Train Accuracy(hash): {}'.format(acc))
predict = proper_model_hash.predict(X_test_hash)
acc = metrics.accuracy_score(newsgroup_y_test, predict)
print('Test Accuracy(hash): {}'.format(acc))

#### Result
**Best performance**
* Train Accuracy(hash): 0.7959929286977018
* Test Accuracy(hash): 0.6938847649346058