## Lesson 01 - Multi-label classification - Practical example

In this notebook, we walk through an example of using the scikit-multilearn library to solve a multi-label classification task.

Read the instructions and run the cells below.

In [11]:
# Optional: mount Google Drive when running on Colab
# from google.colab import drive
# drive.mount('/content/drive')

In [12]:
# Optional: install an earlier scikit-learn version for compatibility with the MLkNN method
# !pip install -U scikit-learn==0.24.1

Installing the scikit-multilearn library in the Colab environment

In [13]:
# !pip install scikit-multilearn

Dataset used: scene (multi-label scene classification)

http://mulan.sourceforge.net/datasets-mlc.html

download: http://sourceforge.net/projects/mulan/files/datasets/scene.rar


In [14]:
from scipy.io import arff
import pandas as pd

# Loading the training and test datasets

data_train, _ = arff.loadarff('scene-train.arff')
X_train = pd.DataFrame(data_train)

data_test, _ = arff.loadarff('scene-test.arff')
X_test = pd.DataFrame(data_test)

In [15]:
# Inspecting the training data
X_train

Unnamed: 0,Att1,Att2,Att3,Att4,Att5,Att6,Att7,Att8,Att9,Att10,...,Att291,Att292,Att293,Att294,Beach,Sunset,FallFoliage,Field,Mountain,Urban
0,0.646467,0.666435,0.685047,0.699053,0.652746,0.407864,0.150309,0.535193,0.555689,0.580782,...,0.157332,0.247298,0.014025,0.029709,b'1',b'0',b'0',b'0',b'1',b'0'
1,0.770156,0.767255,0.761053,0.745630,0.742231,0.688086,0.708416,0.757351,0.760633,0.740314,...,0.251454,0.137833,0.082672,0.036320,b'1',b'0',b'0',b'0',b'0',b'1'
2,0.793984,0.772096,0.761820,0.762213,0.740569,0.734361,0.722677,0.849128,0.839607,0.812746,...,0.017166,0.051125,0.112506,0.083924,b'1',b'0',b'0',b'0',b'0',b'0'
3,0.938563,0.949260,0.955621,0.966743,0.968649,0.869619,0.696925,0.953460,0.959631,0.966320,...,0.019267,0.031290,0.049780,0.090959,b'1',b'0',b'0',b'0',b'0',b'0'
4,0.512130,0.524684,0.520020,0.504467,0.471209,0.417654,0.364292,0.562266,0.588592,0.584449,...,0.198151,0.238796,0.164270,0.184290,b'1',b'0',b'0',b'0',b'0',b'0'
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1206,0.917676,0.918483,0.919618,0.920433,0.869668,0.873864,0.918827,0.939119,0.938986,0.941038,...,0.052553,0.053978,0.060440,0.043268,b'0',b'0',b'0',b'0',b'0',b'1'
1207,0.740662,0.724838,0.763497,0.811923,0.554132,0.616527,0.632875,0.646750,0.574011,0.612846,...,0.253392,0.294021,0.122148,0.301926,b'0',b'0',b'0',b'0',b'0',b'1'
1208,0.856390,1.000000,1.000000,1.000000,0.800653,0.692760,0.704245,0.750823,1.000000,1.000000,...,0.016016,0.019464,0.022167,0.043738,b'0',b'0',b'0',b'0',b'0',b'1'
1209,0.805592,0.804170,0.811438,0.838218,0.836633,0.840197,0.836530,0.492342,0.508497,0.547168,...,0.098371,0.346736,0.231481,0.332623,b'0',b'0',b'0',b'0',b'0',b'1'


Preparing the data

In [16]:
# Separating the labels (the last 6 columns) from the attribute values

y_train = X_train.iloc[:, -6:].copy()
X_train.drop(columns=list(y_train.columns), inplace=True)

y_test = X_test.iloc[:, -6:].copy()
X_test.drop(columns=list(y_test.columns), inplace=True)

In [17]:
# Converting the label values from bytes (b'0', b'1') to the integers 0 and 1

y_train = y_train.replace([b'0', b'1'], [0, 1]).astype(int)
y_test = y_test.replace([b'0', b'1'], [0, 1]).astype(int)
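The byte strings appear because the label columns are loaded as bytes (`b'0'`, `b'1'`), as the table above shows. As a minimal sketch of the same conversion without pandas, assuming a small hypothetical list of label rows (`raw_labels` is not part of the notebook):

```python
# Hypothetical label rows, in the bytes form shown in the loaded DataFrame
raw_labels = [
    (b'1', b'0', b'0'),
    (b'0', b'1', b'1'),
]

# int() accepts ASCII digit bytes, so each value can be converted directly
int_labels = [[int(value) for value in row] for row in raw_labels]

print(int_labels)
```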

In [18]:
y_train

Unnamed: 0,Beach,Sunset,FallFoliage,Field,Mountain,Urban
0,1,0,0,0,1,0
1,1,0,0,0,0,1
2,1,0,0,0,0,0
3,1,0,0,0,0,0
4,1,0,0,0,0,0
...,...,...,...,...,...,...
1206,0,0,0,0,0,1
1207,0,0,0,0,0,1
1208,0,0,0,0,0,1
1209,0,0,0,0,0,1


Now we can apply the multi-label classification strategies

In [19]:
# Using Binary Relevance
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.ensemble import RandomForestClassifier
import sklearn.metrics as metrics

br = BinaryRelevance(RandomForestClassifier())

br.fit(X_train.values, y_train.values)

pred = br.predict(X_test.values)

# For multi-label targets, accuracy_score is the subset accuracy (all labels must match)
print('Hamming loss: {}'.format(metrics.hamming_loss(y_test, pred)))
print('Accuracy: {}'.format(metrics.accuracy_score(y_test, pred)))


Hamming loss: 0.09113712374581939
Accuracy: 0.5284280936454849
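To make the idea behind Binary Relevance concrete, here is a minimal pure-Python sketch: one independent binary classifier per label column. The toy data and the 1-nearest-neighbour base classifier are assumptions chosen for brevity; this is not how scikit-multilearn implements it, and the notebook itself uses a random forest.

```python
def nearest_neighbor_predict(X_train, y_train, x):
    """Toy base classifier: return the target of the closest training row."""
    distances = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in X_train]
    return y_train[distances.index(min(distances))]


def binary_relevance_predict(X_train, Y_train, x):
    """Predict each label independently from the same features."""
    n_labels = len(Y_train[0])
    prediction = []
    for j in range(n_labels):
        # Binary problem j: "does label j apply?", ignoring all other labels
        column_j = [labels[j] for labels in Y_train]
        prediction.append(nearest_neighbor_predict(X_train, column_j, x))
    return prediction


# Hypothetical toy data: 3 samples, 2 features, 3 labels
X_toy = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
Y_toy = [[1, 0, 0], [0, 1, 1], [1, 1, 0]]

br_prediction = binary_relevance_predict(X_toy, Y_toy, [0.9, 0.8])
print(br_prediction)
```

Because each label is predicted on its own, this strategy cannot exploit correlations between labels.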


In [20]:
# Using Classifier Chains
from skmultilearn.problem_transform import ClassifierChain

cc = ClassifierChain(RandomForestClassifier())

cc.fit(X_train, y_train)

pred = cc.predict(X_test)

print('Hamming loss: {}'.format(metrics.hamming_loss(y_test, pred)))
print('Accuracy: {}'.format(metrics.accuracy_score(y_test, pred)))

Hamming loss: 0.08848940914158306
Accuracy: 0.5560200668896321
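A classifier chain differs from Binary Relevance in that the classifier for label j also receives labels 0 to j-1 as extra input features. The sketch below illustrates that augmentation in pure Python, again assuming hypothetical toy data and a 1-nearest-neighbour base classifier rather than the library's actual implementation.

```python
def nearest_neighbor_predict(X_train, y_train, x):
    """Toy base classifier: return the target of the closest training row."""
    distances = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in X_train]
    return y_train[distances.index(min(distances))]


def classifier_chain_predict(X_train, Y_train, x):
    """Predict labels in order, feeding earlier predictions to later classifiers."""
    n_labels = len(Y_train[0])
    predicted = []
    for j in range(n_labels):
        # Training inputs: original features plus the true values of labels 0..j-1
        augmented_train = [row + labels[:j] for row, labels in zip(X_train, Y_train)]
        column_j = [labels[j] for labels in Y_train]
        # Test input: original features plus the labels predicted so far
        augmented_x = x + predicted
        predicted.append(nearest_neighbor_predict(augmented_train, column_j, augmented_x))
    return predicted


# Hypothetical toy data: 3 samples, 2 features, 3 labels
X_toy = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
Y_toy = [[1, 0, 0], [0, 1, 1], [1, 1, 0]]

cc_prediction = classifier_chain_predict(X_toy, Y_toy, [0.9, 0.8])
print(cc_prediction)
```

The order of the labels in the chain matters, since an early mistake is passed on to every later classifier.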


In [21]:
# Using Label Powerset
from skmultilearn.problem_transform import LabelPowerset

lp = LabelPowerset(RandomForestClassifier())

lp.fit(X_train, y_train)

pred = lp.predict(X_test)

print('Hamming loss: {}'.format(metrics.hamming_loss(y_test, pred)))
print('Accuracy: {}'.format(metrics.accuracy_score(y_test, pred)))

Hamming loss: 0.08653846153846154
Accuracy: 0.7065217391304348
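The two metrics reported above measure different things, which is why the Hamming loss can be low while the accuracy stays modest. For multi-label targets, `accuracy_score` counts a sample as correct only if every label matches, whereas the Hamming loss is the fraction of individual label entries that are wrong. A minimal sketch computing both by hand on hypothetical toy predictions:

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label entries that were predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose whole label vector was predicted exactly."""
    exact = sum(true_row == pred_row for true_row, pred_row in zip(y_true, y_pred))
    return exact / len(y_true)


# Hypothetical toy example: 4 samples, 3 labels, two samples with one wrong label each
y_true_toy = [[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
y_pred_toy = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]

toy_hamming = hamming_loss(y_true_toy, y_pred_toy)    # 2 wrong entries out of 12
toy_accuracy = subset_accuracy(y_true_toy, y_pred_toy)  # 2 exact matches out of 4
print(toy_hamming, toy_accuracy)
```

Two wrong entries out of twelve give a small Hamming loss, yet half of the samples count as errors under subset accuracy.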




In [22]:
# Inspecting the transformation of the multi-label problem into a multi-class one
import numpy as np

multiclass_transformation = lp.transform(y_train)
np.unique(multiclass_transformation)  # Number of unique classes: 14.

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13])
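The 14 classes above correspond to the distinct label combinations found in the training set. As a rough sketch of that mapping, assuming hypothetical toy labels and class ids assigned in order of first appearance (the ids scikit-multilearn assigns may differ):

```python
def label_powerset_transform(Y):
    """Map each distinct label combination to a single multi-class id."""
    combination_to_class = {}
    classes = []
    for row in Y:
        key = tuple(row)
        if key not in combination_to_class:
            # First time this combination is seen: give it the next free id
            combination_to_class[key] = len(combination_to_class)
        classes.append(combination_to_class[key])
    return classes, combination_to_class


# Hypothetical toy labels: 4 samples, 3 labels, 3 distinct combinations
Y_toy = [[1, 0, 0], [0, 1, 1], [1, 0, 0], [1, 1, 0]]

toy_classes, toy_mapping = label_powerset_transform(Y_toy)
print(toy_classes)
print(len(toy_mapping))
```

With 6 labels there are up to 2^6 = 64 possible combinations, but only those that occur in the training data become classes.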

In [27]:
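# Using MLkNN
# Note: with recent scikit-learn versions this cell fails with the TypeError shown below.
# The commented-out install cell at the top (scikit-learn==0.24.1) is the compatibility workaround.
from skmultilearn.adapt import MLkNN

mlknn = MLkNN()

mlknn.fit(X_train.values, y_train.values)

pred = mlknn.predict(X_test.values)

print('Hamming loss: {}'.format(metrics.hamming_loss(y_test, pred)))
print('Accuracy: {}'.format(metrics.accuracy_score(y_test, pred)))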

TypeError: NearestNeighbors.__init__() takes 1 positional argument but 2 were given
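An error of this form is what Python raises when a constructor that accepts only keyword arguments is called with a positional one. The sketch below reproduces the pattern with a hypothetical stand-in class (`KeywordOnlyNeighbors` is not part of scikit-learn); it assumes, without confirming it from the traceback, that this is the cause of the failure above.

```python
class KeywordOnlyNeighbors:
    """Hypothetical stand-in: the bare * makes n_neighbors keyword-only."""

    def __init__(self, *, n_neighbors=5):
        self.n_neighbors = n_neighbors


# Passing the argument by keyword works
model = KeywordOnlyNeighbors(n_neighbors=10)

# Passing it positionally raises a TypeError
try:
    KeywordOnlyNeighbors(10)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

If that assumption holds, running the notebook with the older scikit-learn version pinned in the commented-out install cell should avoid the error.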