## supervised learning

- classification
    - binary classification
    - multiclass classification

### Binary classification
Binary classification has a positive class and a negative class.
The positive class is the one we are trying to detect.

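### Regression
Regression predicts a continuous number, or in programming terms a floating-point value.

A simple way to tell regression and classification problems apart is to check whether the output values are continuous.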
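
The continuity check can be sketched as a rough heuristic. The helper `guess_task` and its `max_classes` threshold are hypothetical names invented for this illustration, not part of scikit-learn, and real data can defeat the heuristic (for example, integer-valued regression targets).

```python
import numbers


def guess_task(y, max_classes=10):
    """Guess 'classification' or 'regression' from target values.

    Heuristic assumption: any fractional numeric value means the output is
    continuous (regression); otherwise a small number of distinct labels
    suggests classification.
    """
    distinct = set(y)
    # Non-numeric labels (e.g. strings) can only be class labels.
    if any(not isinstance(v, numbers.Real) for v in distinct):
        return "classification"
    # Any fractional value implies a continuous output.
    if any(float(v) != int(v) for v in distinct):
        return "regression"
    return "classification" if len(distinct) <= max_classes else "regression"


print(guess_task([0, 1, 1, 0]))        # class labels
print(guess_task([1.37, -0.2, 2.81]))  # continuous targets
```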

In [59]:
from IPython.display import display
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import mglearn
from sklearn.datasets import make_blobs

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

%matplotlib notebook

In [60]:
# make datasets
X, y = mglearn.datasets.make_forge()

# plotting scatter
mglearn.discrete_scatter(X[:, 0], X[:, 1], y)
plt.title("scatter of forge datasets")
plt.legend(["class 0", "class 1"], loc=4)
plt.xlabel("1st feature")
plt.ylabel("2nd feature")
print("X.shape: ", X.shape)



[Figure: scatter plot titled "scatter of forge datasets"; x-axis "1st feature", y-axis "2nd feature"]

X.shape:  (26, 2)


The dataset above has shape (26, 2): 26 data points with 2 features each.

For regression algorithms we use the synthetic wave dataset.

The wave dataset has one input feature and a target variable (or response) to model.

In [61]:
X, y = mglearn.datasets.make_wave(n_samples=40)
plt.plot(X, y, 'o')
plt.ylim(-3, 3)
plt.xlabel("features")
plt.ylabel("targets")

[Figure: plot of the wave dataset; x-axis "features", y-axis "targets", y range -3 to 3]

In [62]:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print("cancer.keys():\n", cancer.keys())
# Datasets bundled with sklearn are Bunch objects.
# They behave like dictionaries, but you can write bunch.key instead of bunch['key'].

cancer.keys():
 dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
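
To illustrate how a dictionary can also expose its keys as attributes, here is a minimal sketch. `MiniBunch` is a hypothetical class written for this note; it is not scikit-learn's actual `Bunch` implementation.

```python
class MiniBunch(dict):
    """Dictionary whose keys can also be read as attributes (illustrative only)."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so dict methods
        # such as keys() keep working.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


bunch = MiniBunch(data=[[1.0, 2.0], [3.0, 4.0]], target=[0, 1])
print(bunch["target"] is bunch.target)  # both spellings reach the same object
print(list(bunch.keys()))
```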


In [63]:
print(cancer.data.shape)

(569, 30)


In [64]:
print("# of samples for class\n",
     {n: v for n, v in zip(cancer.target_names, np.bincount(cancer.target))})

# of samples for class
 {'malignant': 212, 'benign': 357}


In [65]:
print("name of features\n", cancer.feature_names)

name of features
 ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']


### boston housing price prediction

In [66]:
# load_boston was removed in scikit-learn 1.2; this cell needs an older release.
from sklearn.datasets import load_boston
boston = load_boston()
print(boston.data.shape)

(506, 13)


In [67]:
X, y = mglearn.datasets.load_extended_boston()
print(X.shape)

(506, 104)
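
The jump from 13 to 104 columns is consistent with adding every product of two features (including each feature with itself) to the original 13. A quick count, assuming that construction:

```python
from itertools import combinations_with_replacement

n_original = 13  # columns in the original Boston data, per the shape above

# Every unordered pair of features, including a feature paired with itself.
pairs = list(combinations_with_replacement(range(n_original), 2))

n_extended = n_original + len(pairs)
print(len(pairs), n_extended)
```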


In [68]:
mglearn.plots.plot_knn_classification(n_neighbors=1)



In [69]:
from sklearn.model_selection import train_test_split
X, y = mglearn.datasets.make_forge()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
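
With no `test_size` given, `train_test_split` holds out 25% of the samples by default and rounds the test count up. For the 26-point forge dataset that gives 7 test points, which matches the 7 predictions printed further down. The helper `simple_split` below is a hypothetical pure-Python sketch of a shuffled split; it does not reproduce scikit-learn's exact shuffling for a given `random_state`.

```python
import math
import random


def simple_split(samples, test_fraction=0.25, seed=0):
    """Shuffle indices and split them into train and test parts."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_fraction * len(samples))  # round the test size up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [samples[i] for i in train_idx], [samples[i] for i in test_idx]


train, test = simple_split(list(range(26)))
print(len(train), len(test))
```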



In [70]:
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=3)

In [71]:
clf.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=3)

In [72]:
print(clf.predict(X_test))

[1 0 1 0 1 0 0]
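
Each prediction above comes from a majority vote among the 3 nearest training points. A minimal sketch of that idea, assuming Euclidean distance and a tiny made-up dataset; `knn_predict` is a hypothetical helper, not scikit-learn's implementation.

```python
import math
from collections import Counter


def knn_predict(X_train, y_train, x, n_neighbors=3):
    """Predict the label of x by majority vote among its nearest neighbors."""
    # Pair each training point's distance to x with its label.
    distances = [(math.dist(p, x), label) for p, label in zip(X_train, y_train)]
    # Keep the labels of the n_neighbors closest points.
    nearest = [label for _, label in sorted(distances)[:n_neighbors]]
    return Counter(nearest).most_common(1)[0][0]


# Made-up 2-feature training data, for illustration only.
X_demo = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9)]
y_demo = [0, 0, 0, 1, 1]

print(knn_predict(X_demo, y_demo, (1.1, 1.0)))
print(knn_predict(X_demo, y_demo, (4.0, 4.0)))
```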


In [73]:
print("performance:", clf.score(X_test, y_test))

performance: 0.8571428571428571
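
For a classifier, `score` returns the accuracy: the fraction of test points whose predicted label equals the true label. The value 0.857... is what 6 correct predictions out of 7 gives. The true labels below are hypothetical (the notebook never prints `y_test`), chosen only so that exactly one prediction is wrong.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


y_pred = [1, 0, 1, 0, 1, 0, 0]       # predictions printed above
y_true_demo = [1, 0, 1, 0, 1, 0, 1]  # hypothetical labels, one mismatch
print(accuracy(y_true_demo, y_pred))
```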


In [75]:
fig, axes = plt.subplots(1, 3, figsize=(10,3))

for n_neighbors, ax in zip([1, 3, 9], axes):
    clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X, y)
    mglearn.plots.plot_2d_separator(clf, X, fill=True, eps=0.5, ax=ax, alpha=0.4)
    mglearn.discrete_scatter(X[:, 0], X[:, 1], y, ax=ax)
    ax.set_title("neighbors: {}".format(n_neighbors))
    ax.set_xlabel("feature 1")
    ax.set_ylabel("feature 2")
axes[0].legend(loc=3)

[Figure: three panels of KNeighborsClassifier decision boundaries on the forge dataset, titled "neighbors: 1", "neighbors: 3", "neighbors: 9"; x-axis "feature 1", y-axis "feature 2"]

In [76]:
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()