
In [1]:
from sklearn.datasets import fetch_covtype

covtype = fetch_covtype()
description = covtype.DESCR
print(description)

.. _covtype_dataset:

Forest covertypes
-----------------

The samples in this dataset correspond to 30×30m patches of forest in the US,
collected for the task of predicting each patch's cover type,
i.e. the dominant species of tree.
There are seven covertypes, making this a multiclass classification problem.
Each sample has 54 features, described on the
`dataset's homepage <https://archive.ics.uci.edu/ml/datasets/Covertype>`__.
Some of the features are boolean indicators,
while others are discrete or continuous measurements.

**Data Set Characteristics:**

    Classes                        7
    Samples total             581012
    Dimensionality                54
    Features                     int

:func:`sklearn.datasets.fetch_covtype` will load the covertype dataset;
it returns a dictionary-like 'Bunch' object
with the feature matrix in the ``data`` member
and the target values in ``target``. If optional argument 'as_frame' is
set to 'True', it will return ``data`` and ``target`
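
The description above mentions two ways of getting the data: a dictionary-like `Bunch` with `data` and `target` members, or the two arrays directly. A minimal sketch of both access patterns follows. It assumes `load_iris` as a stand-in, since that loader ships with scikit-learn and needs no download; the variable names are illustrative only.

```python
from sklearn.datasets import load_iris

# load_iris is a stand-in (assumption): same Bunch interface, no download needed.
bunch = load_iris()

# A Bunch supports both attribute access and dictionary-style access.
X_attr = bunch.data
X_key = bunch["data"]
print(X_attr.shape, bunch.target.shape)

# return_X_y=True skips the Bunch and returns (data, target) directly.
X_demo, y_demo = load_iris(return_X_y=True)
print(X_demo.shape, y_demo.shape)
```

The same two patterns should apply to `fetch_covtype`, which is what the later cell with `return_X_y=True` relies on.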

In [2]:
import numpy as np
np.random.seed(42)

In [3]:
X, y = fetch_covtype(return_X_y=True, shuffle=True)
X.shape, y.shape

((581012, 54), (581012,))

References:
https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
https://sherbold.github.io/intro-to-data-science/exercises/Solution_Classification.html
https://www.kaggle.com/code/maostack/classifiers-tips/notebook#Results

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

In [5]:
classifiers = [
    KNeighborsClassifier(3),
    KNeighborsClassifier(5),
    KNeighborsClassifier(10),
    DecisionTreeClassifier(max_depth=5),
    DecisionTreeClassifier(max_depth=10),
    DecisionTreeClassifier(max_depth=20),
    RandomForestClassifier(n_estimators=1000, max_depth=3, n_jobs=-1),
    RandomForestClassifier(n_estimators=1000, max_depth=5, n_jobs=-1),
    LogisticRegression(max_iter=10000, solver='lbfgs', n_jobs=-1),
    SGDClassifier(loss="hinge", penalty="l2", max_iter=1000, n_jobs=-1),
    GaussianNB(),
    MLPClassifier(hidden_layer_sizes=(100, 100, 100), max_iter=10000, activation='relu'),
    MLPClassifier(hidden_layer_sizes=(100, 100, 100), max_iter=10000, activation='tanh')
]

clf_names = [
    "Nearest Neighbors (k=3)",
    "Nearest Neighbors (k=5)",
    "Nearest Neighbors (k=10)",
    "Decision Tree (Max Depth=5)",
    "Decision Tree (Max Depth=10)",
    "Decision Tree (Max Depth=20)",
    "Random Forest (Max Depth=3)",
    "Random Forest (Max Depth=5)",
    "Logistic Regression",
    "SGD Classifier",
    "Gaussian Naive Bayes",
    "MLP (ReLU)",
    "MLP (tanh)"
]
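
The two parallel lists above have to stay in the same order for each name to match its estimator. One possible alternative, sketched here under assumptions, is a single dict that keeps each display name next to its estimator. The data comes from `make_classification` as a small synthetic stand-in (not the covertype data), and `named_classifiers`, `results`, `X_small` and `y_small` are hypothetical names chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): small enough to run in seconds.
X_small, y_small = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)

# Each display name sits next to its estimator, so the two cannot drift apart.
named_classifiers = {
    "Decision Tree (Max Depth=5)": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Gaussian Naive Bayes": GaussianNB(),
}

results = {}
for name, clf in named_classifiers.items():
    # Scaling lives inside the pipeline, so it is refit on each training fold.
    model = Pipeline([("scaler", StandardScaler()), ("classifier", clf)])
    scores = cross_val_score(model, X_small, y_small, cv=5, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: {results[name]:.3f}")
```

Running a reduced version like this first can be a cheap way to catch typos or misconfigured estimators before committing to the full 581,012-sample run.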


In [6]:

for clf, clf_name in zip(classifiers, clf_names):
    try:
        # Scale the features, then fit the classifier, within each CV fold
        model = Pipeline([
            ('scaler', StandardScaler()),
            ('classifier', clf)
        ])

        # 5-fold cross-validated accuracy on the full dataset
        scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
        mean_score = scores.mean()

        print(f"Classifier: {clf_name}")
        print(f"Scores: {scores}")
        print(f"Mean cross-validation score: {mean_score}")
        print('-' * 100)

    except Exception as ex:
        # Report the failure and move on to the next classifier
        print(f"Error while running classifier {clf_name}: {ex}")

Classifier: Nearest Neighbors (k=3)
Scores: [0.93255768 0.93291912 0.93243662 0.93471713 0.93378771]
Mean cross-validation score: 0.9332836517843155
----------------------------------------------------------------------------------------------------
Classifier: Nearest Neighbors (k=5)
Scores: [0.92811717 0.92814299 0.92869314 0.92996678 0.929089  ]
Mean cross-validation score: 0.928801817076874
----------------------------------------------------------------------------------------------------
Classifier: Nearest Neighbors (k=10)
Scores: [0.91659424 0.91705894 0.91695496 0.91765202 0.91778111]
Mean cross-validation score: 0.9172082517457945
----------------------------------------------------------------------------------------------------
Classifier: Decision Tree (Max Depth=5)
Scores: [0.70310577 0.70004217 0.69991911 0.70593449 0.70223404]
Mean cross-validation score: 0.7022471159740118
-----------------------------