We will use the publicly available Biomedical PubMed Multilabel Classification dataset from Kaggle (https://www.kaggle.com/datasets/owaiskhan9654/pubmed-multilabel-text-classification). The dataset contains various features, but we will use only the abstractText feature together with its MeSH classification (A: Anatomy, B: Organism, C: Diseases, etc.).

In this dataset, each paper can belong to more than one category, which makes it a Multilabel Classification problem. With this dataset, we can build a Multilabel Classifier with Scikit-Learn.

In [2]:
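import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv('PubMed Multi Label Text Classification Dataset Processed.csv')
# Drop the columns we do not need, keeping the abstract text and the label columns
df = df.drop(['Title', 'meshMajor', 'pmid', 'meshid', 'meshroot'], axis=1)

X = df["abstractText"]
# All remaining columns after abstractText are used as the label matrix
y = np.asarray(df[df.columns[1:]])

# Keep at most 2500 terms and ignore terms that appear in more than 90% of the abstracts
vectorizer = TfidfVectorizer(max_features=2500, max_df=0.9)
vectorizer.fit(X)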

TfidfVectorizer(max_df=0.9, max_features=2500)

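In the code above, we fit a TF-IDF vectorizer on the text so that it can later be converted into a numerical representation that our Scikit-Learn model accepts. For now, let us skip data preprocessing steps such as stopword removal to keep the tutorial simple. Note that the vectorizer is fitted on the full dataset here; to avoid leaking test-set information, it could instead be fitted on the training split only.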
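To make the TF-IDF step more concrete, here is a minimal sketch on a tiny, made-up corpus (the three sentences and the names docs, toy_vectorizer, and toy_matrix are hypothetical and not part of the PubMed dataset). It only shows the shape of what the vectorizer produces: one row per document and one column per vocabulary term.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration
docs = [
    "heart disease in adult patients",
    "bacterial infection in mice",
    "heart failure and infection",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(docs)

# One row per document, one column per learned vocabulary term
print(toy_matrix.shape)
print(sorted(toy_vectorizer.vocabulary_))

# With the default settings, each row is L2-normalized
row_norms = np.sqrt(toy_matrix.multiply(toy_matrix).sum(axis=1))
print(row_norms)
```

The result is a sparse matrix, which is the same kind of object that vectorizer.transform returns for the abstracts in this tutorial.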

Next, we split the dataset into training and test sets and transform both with the fitted vectorizer.

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
  
X_train_tfidf = vectorizer.transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

After all the preparation, we can start training our Multilabel Classifier. In Scikit-Learn, we will use the MultiOutputClassifier object to train the Multilabel Classifier model. The strategy behind this model is to train one classifier per label, so each label has its own classifier.

We will use Logistic Regression in this example, and MultiOutputClassifier will extend it to all labels.

In [None]:
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression

clf = MultiOutputClassifier(LogisticRegression()).fit(X_train_tfidf, y_train)
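To illustrate the one-classifier-per-label strategy, the following is a minimal sketch on synthetic data. The data comes from make_multilabel_classification, which is an assumption made for illustration and stands in for the TF-IDF features and MeSH labels; the names X_toy, y_toy, wrapped, and manual are hypothetical. The sketch compares MultiOutputClassifier with a manual loop that fits one Logistic Regression per label column.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Synthetic multilabel data (stand-in for the real features and labels)
X_toy, y_toy = make_multilabel_classification(
    n_samples=300, n_features=20, n_classes=4, random_state=0
)

# Wrapper: fits one independent LogisticRegression per label column
wrapped = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X_toy, y_toy)

# Manual equivalent: loop over the label columns ourselves
manual = [
    LogisticRegression(max_iter=1000).fit(X_toy, y_toy[:, j])
    for j in range(y_toy.shape[1])
]
manual_pred = np.column_stack([m.predict(X_toy) for m in manual])

# Number of fitted per-label estimators and whether both approaches agree
n_estimators = len(wrapped.estimators_)
same_predictions = np.array_equal(wrapped.predict(X_toy), manual_pred)
print(n_estimators, same_predictions)
```

Because each label gets its own independent model, this approach does not capture correlations between labels.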

After the training, let’s use the model to predict the test data.

In [5]:
prediction = clf.predict(X_test_tfidf)
prediction

array([[1, 1, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 1, 1, 1],
       [0, 1, 1, ..., 1, 0, 0],
       ...,
       [1, 1, 0, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]], dtype=int64)

The prediction result is an array of labels for each MeSH category. Each row represents an abstract, and each column represents a label.

Lastly, we need to evaluate our Multilabel Classifier. We can use the accuracy metric to evaluate the model.

In [6]:
from sklearn.metrics import accuracy_score
print('Accuracy Score: ', accuracy_score(y_test, prediction))

Accuracy Score:  0.145

The accuracy score is 0.145, which shows that the model predicts the exact label combination only 14.5% of the time. However, the accuracy score has a weakness as a multilabel evaluation metric: it requires every label of a sample to be predicted correctly, in the exact position, or the whole sample is counted as wrong.

In [8]:
print(prediction[0])
print(y_test[0])

[1 1 0 1 1 0 1 0 0 0 0 0 0 0]
[1 1 0 1 1 0 1 1 0 0 0 0 0 0]


For example, the prediction for the first row differs from the test data by only one label.

The accuracy score still counts it as a wrong prediction because the label combination differs. That is why our model has a low metric score.
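As a quick check of this behavior, the sketch below copies the two rows printed above into small arrays (the names pred_row, true_row, labels_correct, and subset_acc are hypothetical) and scores them on their own.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# The two rows printed above: prediction[0] and y_test[0]
pred_row = np.array([[1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0]])
true_row = np.array([[1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0]])

# Count how many individual labels match
labels_correct = int((pred_row == true_row).sum())

# For multilabel input, accuracy_score only credits rows that match exactly
subset_acc = accuracy_score(true_row, pred_row)

print(labels_correct, "of", true_row.shape[1], "labels correct")
print("Accuracy Score: ", subset_acc)
```

Although 13 of the 14 labels match, the row contributes nothing to the accuracy score.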

To mitigate this problem, we should evaluate the individual label predictions rather than the label combination. In this case, we can rely on the Hamming Loss evaluation metric. Hamming Loss is the fraction of wrongly predicted labels out of the total number of label predictions (number of samples times number of labels). Because Hamming Loss is a loss function, the lower the score, the better (0 indicates no wrong predictions and 1 indicates that all predictions are wrong).
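The calculation can be sketched by hand on a small, made-up example (the arrays y_true_toy and y_pred_toy are hypothetical and not taken from the dataset). The manual value is simply the mean of the element-wise mismatches, and it should agree with Scikit-Learn's hamming_loss.

```python
import numpy as np
from sklearn.metrics import hamming_loss

# Hypothetical toy labels: 3 samples with 4 labels each
y_true_toy = np.array([[1, 0, 1, 0],
                       [0, 1, 0, 0],
                       [1, 1, 0, 1]])
y_pred_toy = np.array([[1, 0, 0, 0],
                       [0, 1, 0, 0],
                       [1, 0, 0, 1]])

# Fraction of individual label predictions that are wrong (2 of 12 here)
manual_loss = np.mean(y_true_toy != y_pred_toy)
sklearn_loss = hamming_loss(y_true_toy, y_pred_toy)

print("Manual Hamming Loss: ", manual_loss)
print("Sklearn Hamming Loss:", sklearn_loss)
```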

In [9]:
from sklearn.metrics import hamming_loss
print('Hamming Loss: ', round(hamming_loss(y_test, prediction),2))

Hamming Loss:  0.13


Our Multilabel Classifier has a Hamming Loss of 0.13, which means that about 13% of the individual label predictions are wrong. In other words, averaged over all labels, a label prediction is wrong about 13% of the time.
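Since Hamming Loss is an average, individual labels can have error rates above or below it. The sketch below shows one possible way to break the score down per label, using made-up arrays (y_true_demo and y_pred_demo are hypothetical); in this tutorial, the same expression could be applied to y_test and prediction.

```python
import numpy as np
from sklearn.metrics import hamming_loss

# Hypothetical toy labels: 4 samples with 3 labels each
y_true_demo = np.array([[1, 0, 1],
                        [0, 1, 1],
                        [1, 1, 0],
                        [0, 0, 1]])
y_pred_demo = np.array([[1, 0, 0],
                        [0, 1, 0],
                        [1, 0, 0],
                        [0, 0, 1]])

# Error rate of each label column, computed over the samples
per_label_error = (y_true_demo != y_pred_demo).mean(axis=0)

# Averaging the per-label error rates gives the Hamming Loss
overall = hamming_loss(y_true_demo, y_pred_demo)

print("Per-label error:", per_label_error)
print("Hamming Loss:   ", overall)
```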