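Using the Wisconsin (USA) breast cancer dataset bundled with scikit-learn, perform binary classification of benign versus malignant tumors.
Evaluate the model with the F1 score, and improve that score by selecting the algorithm, choosing the explanatory variables, and tuning the hyperparameters.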

The ultimate goal is to build a model that generalizes well.
The only tool allowed is scikit-learn; other approaches such as deep learning must not be used.

In [1]:
from sklearn.datasets import load_breast_cancer
breast_cancer = load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target

The following is a sample answer.

In [2]:
X = X[:, :10]
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X, y)
y_pred = clf.predict(X)

In [3]:
from sklearn.metrics import accuracy_score
accuracy_score(y, y_pred)

0.9086115992970123

In [4]:
from sklearn.metrics import classification_report
print(classification_report(y, y_pred))

             precision    recall  f1-score   support

          0       0.92      0.83      0.87       212
          1       0.90      0.96      0.93       357

avg / total       0.91      0.91      0.91       569



# Answer

Add cells below and write your implementation.

In [14]:
from sklearn.model_selection import train_test_split

In [15]:
(X_train, X_test, y_train, y_test) = train_test_split(X, y, test_size=0.3)

In [16]:
from sklearn import svm
cls = svm.SVC()
cls.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [20]:
y_pred2 = cls.predict(X_test)
accuracy_score(y_test, y_pred2)

0.695906432748538

With the support vector machine (default `SVC`, unscaled features), accuracy is only about 0.70.
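An RBF-kernel SVM is sensitive to feature scale, so the low score may come from the unscaled inputs rather than from the algorithm itself. The sketch below is one way to check this: it standardizes the features inside a pipeline and reports the F1 score that the task asks for. The names `X_tr`, `X_te`, `scaled_svc`, and `svc_f1`, the stratified split, and `random_state=0` are assumptions made for illustration and do not appear in the notebook above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()
# Same first-10-feature subset as the sample answer (X = X[:, :10]).
X_all, y_all = data.data[:, :10], data.target

# Assumption: a stratified split with a fixed seed, so the run is repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, stratify=y_all, random_state=0
)

# The scaler is fitted on the training fold only, then applied to the test fold.
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_tr, y_tr)

# f1_score treats label 1 (benign in this dataset) as the positive class by default.
svc_f1 = f1_score(y_te, scaled_svc.predict(X_te))
print(f"F1 (scaled SVC): {svc_f1:.3f}")
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into training, which matters once cross-validation is used.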

In [21]:
from sklearn.ensemble import RandomForestClassifier
cls2 = RandomForestClassifier()
cls2.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [22]:
y_pred3 = cls2.predict(X_test)
accuracy_score(y_test, y_pred3)

0.9532163742690059

The random forest reaches an accuracy of about 0.95.

⇒ Consider tuning the random forest.
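A single 70/30 split can favor one algorithm by chance, so the choice of model can also be checked with cross-validated F1 scores. The following is a minimal sketch under stated assumptions: it uses all 30 features, 5 stratified folds, a fixed seed, and 100 trees for the forest. The dictionary names `candidates` and `cv_f1` are hypothetical and are not part of the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_all, y_all = load_breast_cancer(return_X_y=True)

# Candidate models; the linear model and the SVM get standardized inputs.
candidates = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "svc": make_pipeline(StandardScaler(), SVC()),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Assumption: 5 stratified, shuffled folds with a fixed seed.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_f1 = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_all, y_all, scoring="f1", cv=cv)
    cv_f1[name] = scores.mean()
    print(f"{name}: mean F1 = {scores.mean():.3f} (std {scores.std():.3f})")
```

Comparing means together with their standard deviations gives a rough sense of whether a gap between two models is larger than the fold-to-fold noise.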

In [25]:
print(classification_report(y_test, y_pred3))

             precision    recall  f1-score   support

          0       0.89      0.98      0.93        55
          1       0.99      0.94      0.96       116

avg / total       0.96      0.95      0.95       171



In [34]:
breast_cancer.feature_names

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')
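The task also asks for a choice of explanatory variables. One possible starting point, sketched below, is to rank the 30 features by the impurity-based importances of a random forest fitted on the training split only. The names `forest`, `order`, and `top_features`, the 200 trees, and the fixed seed are assumptions for illustration; impurity-based importances can also be misleading for strongly correlated features, which this dataset has (radius, perimeter, area).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.3, stratify=data.target, random_state=0
)

# Fit on the training split only, so the test split stays untouched.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_tr, y_tr)

# Sort feature indices from most to least important.
order = np.argsort(forest.feature_importances_)[::-1]
top_features = [
    (str(data.feature_names[i]), float(forest.feature_importances_[i]))
    for i in order[:10]
]
for name, importance in top_features:
    print(f"{name:25s} {importance:.3f}")
```

A model retrained on a reduced feature set should then be compared with the full-feature model using the same cross-validated F1 score, since a ranking alone does not show that dropping features helps.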

In [36]:
len(breast_cancer.data)

569

In [37]:
len(X_train)

398

In [41]:
# Parameters to tune
parameters = {
    'n_estimators' :[i for i in range(5, 15)],
    'max_depth' :[i for i in range(1, 10)],
    'min_samples_leaf': [i for i in range(1, 10)],
    'min_samples_split': [i for i in range(2, 20)]
}

In [42]:
# Grid search
from sklearn.model_selection import GridSearchCV
cls3 = GridSearchCV(estimator=RandomForestClassifier(), param_grid=parameters, cv=2, iid=False)
cls3.fit(X_train, y_train)

GridSearchCV(cv=2, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid=False, n_jobs=1,
       param_grid={'n_estimators': [5, 6, 7, 8, 9, 10, 11, 12, 13, 14], 'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9], 'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7, 8, 9], 'min_samples_split': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [43]:
y_pred4 = cls3.predict(X_test)
accuracy_score(y_test, y_pred4)

0.9473684210526315

In [44]:
print(classification_report(y_test, y_pred4))

             precision    recall  f1-score   support

          0       0.90      0.95      0.92        55
          1       0.97      0.95      0.96       116

avg / total       0.95      0.95      0.95       171



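Unfortunately, tuning did not improve the result: test-set accuracy moved from 0.953 to 0.947, and the average F1 score stayed at 0.95.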
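Two aspects of the search above may explain the lack of improvement: it optimizes accuracy (`scoring=None`) rather than F1, and `cv=2` gives noisy estimates on roughly 400 training samples. The sketch below shows one alternative setup, with `scoring="f1"` and 5 stratified folds. The grid values, the fixed seed, and the names `search` and `test_f1` are assumptions chosen to keep the example small, not tuned recommendations. The `iid` argument is omitted because it was removed from `GridSearchCV` in scikit-learn 0.24.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, stratify=y_all, random_state=0
)

# Assumption: a deliberately small grid so the search finishes quickly.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 3],
}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="f1",  # optimize the metric the task asks for
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_tr, y_tr)

# The test split is used once, after the search, to estimate generalization.
test_f1 = f1_score(y_te, search.predict(X_te))
print("best params:", search.best_params_)
print(f"CV F1: {search.best_score_:.3f}, test F1: {test_f1:.3f}")
```

Fixing `random_state` in the forest also matters here: without it, differences between grid points can be smaller than the run-to-run variation of the forest itself.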