# Application of Logistic Regression Classifier

* $0 \leq h_{\Theta}(x) \leq 1$
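
The bound above follows from the logistic (sigmoid) function that logistic regression applies to its linear score. A minimal sketch, assuming the usual definition $g(z) = 1 / (1 + e^{-z})$; the helper name `sigmoid` and the sample inputs are illustrative only:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A few sample scores, from strongly negative to strongly positive
z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
h = sigmoid(z)
print(h)  # values rise monotonically and stay strictly between 0 and 1
```

A score of 0 maps to exactly 0.5, which is why 0.5 is the natural classification threshold.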

## Logistic Regression with Iris dataset

### Setup

In [None]:
# common lib
import sklearn
import numpy as np

### Datasets



#### Iris dataset
* feature 3 (petal width)
* label: (1 if Iris virginica, else 0)

In [None]:
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
iris_X = iris["data"][:, 3:]  # petal width
iris_y = (iris["target"] == 2).astype(int)  # 1 if Iris virginica, else 0 (np.int was removed in NumPy 1.24)

pd.DataFrame(iris_X).head(5)

### Preprocess

#### train_test_split

In [None]:
from sklearn.model_selection import train_test_split
iris_X_train, iris_X_test, iris_y_train, iris_y_test = train_test_split(iris_X, iris_y, random_state=42)

#### Add outliers

In [None]:
iris_X_train_out=np.append(iris_X_train, np.array([10., 10.,-20, -30]))
iris_y_train_out=np.append(iris_y_train, np.array([1,1,0, 0]))
iris_X_train_out=iris_X_train_out.reshape(-1,1)

### Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

line_model = LinearRegression()
line_model.fit(iris_X_train_out, iris_y_train_out)

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

plt.plot(iris_X_train_out, iris_y_train_out, 'o')
plt.plot(iris_X_train_out,line_model.predict(iris_X_train_out.reshape(-1,1)), "k-",markersize=1, linewidth=2)

# Decision boundary: the x value where the fitted line crosses 0.5
x_boundary = (0.5 - line_model.intercept_) / line_model.coef_[0]
print(x_boundary)
plt.plot([x_boundary, x_boundary], [-1.1, 1.1], "r-")
plt.show()

### Sigmoid

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

plt.plot(iris_X_train_out, iris_y_train_out, 'o')

#sigmoid
t = np.linspace(-30, 30, 100)
b = 1.5
sig = 1 / (1 + np.exp(-t+b))
plt.plot(t, sig, "k-",markersize=1, linewidth=2)

#Decision boundary
plt.plot([b, b], [-0.1, 1.1], "r-")
plt.show()

### Logistic Regression Classifier

**- sklearn.linear_model.[LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) : Returns the instance itself.**


In [None]:
from sklearn.linear_model import LogisticRegression
log_reg_model = LogisticRegression(random_state=42)
log_reg_model.fit(iris_X_train_out, iris_y_train_out)

#### Evaluation

In [None]:
from sklearn import metrics

predict = log_reg_model.predict(iris_X_train_out)
acc = metrics.accuracy_score(iris_y_train_out, predict)
print('Train acc(iris): {}'.format(acc))


predict = log_reg_model.predict(iris_X_test)
acc = metrics.accuracy_score(iris_y_test, predict)
print('Test acc(iris): {}'.format(acc))


## Logistic Regression with Breast cancer dataset

### Setup

In [None]:
# common lib
import sklearn
import numpy as np

### Datasets



#### Breast cancer dataset
* The breast cancer dataset is a classic and very easy binary classification dataset.

In [None]:
from sklearn import datasets
import pandas as pd

breast = datasets.load_breast_cancer()
breast_X = breast["data"]
breast_y = breast["target"]
breast_feature_name = breast.feature_names

pd.DataFrame(breast_X, columns=breast_feature_name).head(5)

### Preprocess

#### Variance based feature selection

In [None]:
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler
%matplotlib inline
import matplotlib.pyplot as plt

selector = VarianceThreshold().fit(MinMaxScaler().fit_transform(breast_X))
variances = selector.variances_
var_sort = np.argsort(variances)[::-1]

plt.figure(figsize=(6, 2))
ypos = np.arange(5)[::-1]
plt.barh(ypos, variances[var_sort][:5], align='center')
plt.yticks(ypos, np.array(breast_feature_name)[var_sort][:5])
plt.xlabel("Variance");

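# Extract the 'worst concave points' column as a single-feature matrix
# (note: despite its name, breast_X_texture holds 'worst concave points')
index = np.where(breast_feature_name == 'worst concave points')[0][0]
breast_X_texture = breast["data"][:, index].reshape(-1, 1)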

In [None]:
breast_X_high_var =VarianceThreshold(0.04).fit_transform(MinMaxScaler().fit_transform(breast_X))

#### train_test_split

In [None]:
from sklearn.model_selection import train_test_split
breast_X_train, breast_X_test, breast_y_train, breast_y_test = train_test_split(breast_X_high_var, breast_y, random_state=42)

### Logistic Regression model

**- sklearn.linear_model.[LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) : Returns the instance itself.**


In [None]:
from sklearn.linear_model import LogisticRegression
log_reg_model_breast = LogisticRegression(random_state=42)
log_reg_model_breast.fit(breast_X_train, breast_y_train)

#### Evaluation

In [None]:
from sklearn import metrics

predict = log_reg_model_breast.predict(breast_X_train)
acc = metrics.accuracy_score(breast_y_train, predict)
print('Train acc(breast): {}'.format(acc))


predict = log_reg_model_breast.predict(breast_X_test)
acc = metrics.accuracy_score(breast_y_test, predict)
print('Test acc(breast): {}'.format(acc))


#### Set parameters (C, max_iter, solver)

**C**
  * float, default=1.0
  * Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
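  * If C is too small, the strong regularization makes underfitting more likely.
  * If C is too large, the weak regularization lets the model fit noise and outliers, making overfitting more likely.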
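
One way to see the effect of C is to compare the size of the learned coefficients: stronger regularization (smaller C) should shrink them toward zero. The following is a minimal sketch, not part of the original notebook; it assumes min-max scaled breast cancer features and a generous `max_iter` so that the solver converges, and the name `coef_norms` is illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = MinMaxScaler().fit_transform(X)  # put all features on [0, 1]

coef_norms = {}
for C in (1e-3, 1.0, 1e3):
    model = LogisticRegression(C=C, max_iter=10000, random_state=42)
    model.fit(X_scaled, y)
    # L2 norm of the weight vector: smaller means more heavily regularized
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:g}: ||coef|| = {coef_norms[C]:.3f}")
```

The coefficient norm grows with C, which matches the description of C as the inverse of regularization strength.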

**max_iter**
  * int, default=100
  * Maximum number of iterations taken for the solvers to converge.
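
To check whether `max_iter` was large enough, the fitted model exposes `n_iter_`, and scikit-learn typically emits a `ConvergenceWarning` when the solver stops early. A small sketch under the assumption of min-max scaled breast cancer features; `fit_and_report` is a hypothetical helper, not a scikit-learn function:

```python
import warnings
from sklearn.datasets import load_breast_cancer
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = MinMaxScaler().fit_transform(X)

def fit_and_report(max_iter):
    """Fit with the given iteration cap; return (iterations used, warned?)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model = LogisticRegression(max_iter=max_iter, random_state=42)
        model.fit(X_scaled, y)
    warned = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return int(model.n_iter_[0]), warned

few_iters, few_warned = fit_and_report(2)
many_iters, many_warned = fit_and_report(1000)
print("max_iter=2    ->", few_iters, "iterations, warning:", few_warned)
print("max_iter=1000 ->", many_iters, "iterations, warning:", many_warned)
```

If `n_iter_` equals `max_iter`, the solver most likely hit the cap before converging, and raising `max_iter` or scaling the features is worth trying.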

**solver**

* default='lbfgs'
* Solvers and the penalties each supports
  * newton-cg: ['l2', 'none']
  * lbfgs: ['l2', 'none']
  * liblinear: ['l1', 'l2']
  * sag: ['l2', 'none']
  * saga: ['elasticnet', 'l1', 'l2', 'none']
* Solver strengths and use cases
  * For small datasets, 'liblinear' is a good choice, whereas 'sag' and 'saga' are faster for large ones.
  * For multiclass problems, only 'newton-cg', 'sag', 'saga' and 'lbfgs' handle multinomial loss; 'liblinear' is limited to one-versus-rest schemes.
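
The solver and penalty compatibility listed above can be probed directly, since an unsupported combination is rejected when the model is fitted. This is a rough sketch; `can_fit` is a hypothetical helper, and newer scikit-learn releases may change how the penalty is specified or emit deprecation warnings for it.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = MinMaxScaler().fit_transform(X)

def can_fit(solver, penalty):
    """Return True if this solver/penalty pair fits without a ValueError."""
    try:
        LogisticRegression(solver=solver, penalty=penalty, max_iter=1000).fit(X_scaled, y)
        return True
    except ValueError:
        return False

for solver, penalty in [("liblinear", "l1"), ("lbfgs", "l2"), ("lbfgs", "l1")]:
    print(f"solver={solver!r}, penalty={penalty!r}: supported={can_fit(solver, penalty)}")
```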

* C 1e-3 VS 1e+3

In [None]:
# Default model
from sklearn.linear_model import LogisticRegression
log_reg_model_c = LogisticRegression(random_state=42)

# C=10**-3
log_reg_model_c.set_params(C=10**-3)
log_reg_model_c.fit(breast_X_train, breast_y_train)

predict = log_reg_model_c.predict(breast_X_train)
acc = metrics.accuracy_score(breast_y_train, predict)
print('train Accuracy(C=10**-3): {}'.format(acc))
predict = log_reg_model_c.predict(breast_X_test)
acc = metrics.accuracy_score(breast_y_test, predict)
print('Test Accuracy(C=10**-3): {}'.format(acc))

# C=10**3
log_reg_model_c.set_params(C=10**3)
log_reg_model_c.fit(breast_X_train, breast_y_train)

predict = log_reg_model_c.predict(breast_X_train)
acc = metrics.accuracy_score(breast_y_train, predict)
print('train Accuracy(C=10**3): {}'.format(acc))
predict = log_reg_model_c.predict(breast_X_test)
acc = metrics.accuracy_score(breast_y_test, predict)
print('Test Accuracy(C=10**3): {}'.format(acc))

* max_iter 2 VS 100

In [None]:
# Default model
from sklearn.linear_model import LogisticRegression
log_reg_model = LogisticRegression(random_state=42)

# max_iter=2
log_reg_model.set_params(max_iter=2)
log_reg_model.fit(breast_X_train, breast_y_train)

predict = log_reg_model.predict(breast_X_train)
acc = metrics.accuracy_score(breast_y_train, predict)
print('train Accuracy(max_iter=2): {}'.format(acc))
predict = log_reg_model.predict(breast_X_test)
acc = metrics.accuracy_score(breast_y_test, predict)
print('Test Accuracy(max_iter=2): {}'.format(acc))

# max_iter=100
log_reg_model.set_params(max_iter=100)
log_reg_model.fit(breast_X_train, breast_y_train)

predict = log_reg_model.predict(breast_X_train)
acc = metrics.accuracy_score(breast_y_train, predict)
print('train Accuracy(max_iter=100): {}'.format(acc))
predict = log_reg_model.predict(breast_X_test)
acc = metrics.accuracy_score(breast_y_test, predict)
print('Test Accuracy(max_iter=100): {}'.format(acc))

* solver 'lbfgs' VS 'sag'

In [None]:
# Default model
from sklearn.linear_model import LogisticRegression
log_reg_model = LogisticRegression(max_iter=2, random_state=42)

# solver="lbfgs"
log_reg_model.set_params(solver="lbfgs")
log_reg_model.fit(breast_X_train, breast_y_train)

predict = log_reg_model.predict(breast_X_train)
acc = metrics.accuracy_score(breast_y_train, predict)
print('train Accuracy(solver="lbfgs"): {}'.format(acc))
predict = log_reg_model.predict(breast_X_test)
acc = metrics.accuracy_score(breast_y_test, predict)
print('Test Accuracy(solver="lbfgs"): {}'.format(acc))

# solver="sag"
log_reg_model.set_params(solver="sag")
log_reg_model.fit(breast_X_train, breast_y_train)

predict = log_reg_model.predict(breast_X_train)
acc = metrics.accuracy_score(breast_y_train, predict)
print('train Accuracy(solver="sag"): {}'.format(acc))
predict = log_reg_model.predict(breast_X_test)
acc = metrics.accuracy_score(breast_y_test, predict)
print('Test Accuracy(solver="sag"): {}'.format(acc))

### Validation_curve


* Visualization functions

In [None]:
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

# Visualization function for numeric parameters
def viz_val_curve(param_range, train_mean, train_std, test_mean, test_std, param_name, xscale_log=False):
  plt.plot(param_range, train_mean, 
          color='blue', marker='o', 
          markersize=5, label='Training accuracy')

  plt.fill_between(param_range, train_mean + train_std,
                  train_mean - train_std, alpha=0.15,
                  color='blue')

  plt.plot(param_range, test_mean, 
          color='green', linestyle='--', 
          marker='s', markersize=5, 
          label='Validation accuracy')

  plt.fill_between(param_range, 
                  test_mean + test_std,
                  test_mean - test_std, 
                  alpha=0.15, color='green')


  plt.grid()
  plt.legend(loc='lower right')
  if xscale_log:
    plt.xscale('log')
  plt.xlabel(param_name)
  plt.ylabel('Accuracy')
  plt.ylim([np.min(test_mean)*0.8, np.max(train_mean)*1.2])
  plt.tight_layout()
  plt.show()

# Visualization function for categorical parameters
def viz_val_bar(param_range, train_mean, train_std, test_mean, test_std, param_name):
  idx = np.arange(len(param_range))
  plt.bar(idx, test_mean, width=0.3)
  plt.xlabel(param_name)
  plt.ylabel('Accuracy')
  plt.ylim([np.min(test_mean)*0.9, np.max(test_mean)*1.1])
  plt.xticks(idx, param_range, fontsize=15)
  plt.show()

#### Validation_curve(C)

#### **C**
  * float, default=1.0
  * Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
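  * If C is too small, the strong regularization makes underfitting more likely.
  * If C is too large, the weak regularization lets the model fit noise and outliers, making overfitting more likely.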

In [None]:
from sklearn.model_selection import validation_curve

param_range= [10**i for i in range(-9,-3)]
param_name='C'

from sklearn.linear_model import LogisticRegression
log_reg_model_c = LogisticRegression(random_state=42, solver='liblinear')

train_scores, test_scores = validation_curve(
                estimator=log_reg_model_c, 
                X=breast_X, 
                y=breast_y, 
                param_name=param_name, 
                param_range=param_range,
                cv=10)

train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

viz_val_curve(param_range, train_mean, train_std, test_mean, test_std, param_name, xscale_log=True)

#### Evaluation
**Default model performance**
* Train acc(breast): 0.7723004694835681
* Test acc(breast): 0.8251748251748252

In [None]:
from sklearn.linear_model import LogisticRegression
proper_model_c = LogisticRegression(C=10**2, random_state=42)
proper_model_c.fit(breast_X_train, breast_y_train)

from sklearn import metrics

predict = proper_model_c.predict(breast_X_train)
acc = metrics.accuracy_score(breast_y_train, predict)
print('Train Accuracy(c): {}'.format(acc))

predict = proper_model_c.predict(breast_X_test)
acc = metrics.accuracy_score(breast_y_test, predict)
print('Test Accuracy(c): {}'.format(acc))

# Exercise

## Problem 1
Classify the Wine data using Logistic Regression.

  * Use the validation_curve function to plot how the results change as each of the following hyperparameters varies.
    * C
    * max_iter
    * solver
  * Find the parameter combination that achieves the highest accuracy.

  ```python
  from sklearn.datasets import load_wine

  wine = load_wine()
  ```

## Problem 1 Answer

## Problem 2
Classify the California Housing data using Logistic Regression.

  * Use the validation_curve function to plot how the results change as each of the following hyperparameters varies.
    * C
    * max_iter
    * solver
  * Find the parameter combination that achieves the highest accuracy.

  

#### California Housing dataset

* The target variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000).
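* In the code below, the target is rounded to the nearest integer so that it can be used as a discrete class label.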

```python
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing

# load data
housing = fetch_california_housing()
housing_X = housing.data
housing_y = np.round(housing.target).astype(int)  # make y discrete
print('Number of target: ', len(set(housing_y)))

pd.DataFrame(housing_X, columns=housing.feature_names).head(3)
```



## Problem 2 Answer