Multilabel classification task on rock news articles

Overview

Based upon a previous rule-based text classification model, a hybrid multilabel classifier was developed to assign topic labels to a dataset of rock news headlines, with the aim of exploring this variant of the classification problem and improving accuracy. This repository presents the steps implemented to develop the multilabel classification task. Several classifiers were tested, including those following the problem transformation approach and the MultiOutputClassifier. In summary, the multioutput algorithms outperformed the problem transformation algorithms, achieving significantly higher Micro-average F1-scores, with tree-based models and ensemble methods showing inherent robustness for imbalanced datasets.

Table of Contents
  1. Exploratory data analysis
  2. About the methodology
  3. Results
  4. Rule-based text classification Vs. Machine Learning classification: final thoughts and further research
  5. References

1. Exploratory data analysis

  • The dataset contains 20,000 headlines, and the average number of labels per headline stands at 1.45 (see Table 1).
  • 36 predefined labels were derived from the rule-based text classification model (see Table 1).
  • The number of labels to which a headline can be assigned ranges from 1 to 7 (see Figure 1).
  • Two-thirds of the headlines are assigned to a single topic label, while nearly one-fourth are tagged with two topic labels (see Figure 1).
  • The cumulative percentage of headlines assigned to more than three labels is not significant (see Figure 1).
  • The text corpus shows high imbalance (see Figure 2).
  • Nearly one-third of the headlines (6,397 out of 20,000) are tagged with the class 'diverse', indicating topics other than the 35 predefined labels in this classification task (see Figure 2).
  • Core topic labels include: 'announce', 'release', 'album', 'tour', 'song', 'show', 'watch', 'video', 'single', 'death', 'play' and 'cover' (see Figure 2).
  • Most topic labels tend to co-occur with another label rather than being associated with multiple labels or appearing as a single label (see Figure 2).
  • Exceptions to this pattern include 'song', 'death' and 'cover', which tend to appear as single labels, and 'video' and 'single', which are more often associated with multiple labels (see Figure 2).
  • Strong correlations are observed among pairs of labels such as ['tour', 'announce'], ['album', 'announce'], ['album', 'release'], ['single', 'release'] and ['video', 'release'] (see Figure 3); a minimal sketch of how such label statistics can be computed follows this list.
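As a minimal, hypothetical sketch (not code from this repository), the label cardinality and co-occurrence counts described above can be derived from a binary indicator table of topic labels; the `df_labels` DataFrame and its column names below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical indicator table: one row per headline, one 0/1 column per topic label.
df_labels = pd.DataFrame(
    {"album": [1, 0, 1], "announce": [1, 0, 0], "tour": [0, 1, 0], "release": [1, 0, 1]}
)

# Label cardinality: average number of labels assigned per headline.
labels_per_headline = df_labels.sum(axis=1)
print("Average labels per headline:", round(labels_per_headline.mean(), 2))
print(labels_per_headline.value_counts().sort_index())

# Co-occurrence matrix: entry (i, j) counts headlines tagged with both label i and label j.
co_occurrence = df_labels.T.dot(df_labels)
print(co_occurrence)
```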

Table 1. Dataset descriptive statistics


Figure 1. Distribution of the number of topic labels



Figure 2. Frequency distribution of topic labels and respective co-occurrence



Figure 3. Co-occurrence of topic labels


2. About the methodology

  • The multilabel classification task builds on a rule-based text classification model designed to identify keywords and assign both topic labels and publication type categories. Details about the rule-based text classification model can be found here. The keywords generated by the manual rule-based model were the foundation for assigning topic labels to headlines: instead of directly using the derived topic labels, the multilabel classifier relies on the identified keywords.
  • To ensure a "well-balanced distribution of label relations", an iterative stratification technique was used to split the dataset into training and testing sets, as proposed by Szymański and Kajdanowicz (2016); the test size was set at 0.2 (a minimal split sketch is shown after this list).
  • To address potential class imbalance, re-sampling and re-weighting methods were deliberately avoided, as they tend to "result in oversampling of common labels" (Huang et al., 2021).
  • Various classifiers and estimators were evaluated using both the problem transformation and the MultiOutputClassifier approaches (see the classifier setup sketch after this list).
  • Emphasis was placed on inherently robust algorithms for imbalanced datasets, particularly tree-based and ensemble methods (Ganganwar, 2012; Mahani, 2022; Mulugeta et al., 2023).
  • Hyperparameter optimization was conducted using Grid Search for base models showing high performance.
  • To mitigate overfitting during the tuning of Logistic Regression (Jurafsky & Martin, 2024), an initial grid was set for the regularization strength parameter ('C') with values of [0.01, 0.001, 0.0001, 0.00001]. These small values led to technical issues, so a new grid with values of [0.1, 0.25, 0.5] was tested, but the issues persisted, specifically when the Classifier Chain was combined with Logistic Regression. Only when the grid values were set to >= 1 did the issues subside. Furthermore, this grid brought no improvement in model performance for Binary Relevance combined with Logistic Regression (a grid search sketch is included after this list).
  • Several methodologies were tested to optimize hyperparameters in Gradient Boosting, including Grid Search and Randomized Search on a balanced subset of around 6,000 records, random over-sampling, random under-sampling and class weighting. Despite these efforts, the results were not satisfactory. The inherent characteristics of the dataset, where 48% (435 out of 906) of label combinations consist of only one sample, coupled with the complexity of the Gradient Boosting algorithm, which involves a multitude of parameters (Guan et al., 2023), led to unsuccessful optimization. The limited size of the training dataset often presents a significant challenge in machine learning optimization, suggesting that expanding the text corpus by gathering additional news headlines could provide the model with more diverse examples to learn from.
  • In addition, a cost-sensitive learning experiment was carried out by adjusting the "class_weight" parameter of a tree-based classifier. However, no significant impact on the model's performance was observed.
  • After optimization, the selected base models were fine-tuned with the following hyperparameters: a) Logistic Regression ("C": 0.5, "penalty": "l2", "solver": "sag", "max_iter": 1000); b) Decision Tree ("criterion": "gini", "max_depth": None, "max_leaf_nodes": None, "min_samples_split": 2); c) Random Forest ("bootstrap": False, "max_depth": None, "max_features": None, "max_leaf_nodes": None, "n_estimators": 50); d) AdaBoost ("algorithm": "SAMME.R", "learning_rate": 1.04, "n_estimators": 50); e) Extra Trees ("criterion": "gini", "n_estimators": 20).
  • To assess model performance, evaluation metrics that are more informative for imbalanced datasets, such as the Micro-average F1-score, were employed. This metric aggregates the contributions of "all the units together, without taking into consideration possible differences between classes" (Grandini et al., 2020). A minimal evaluation sketch is included after this list.
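The split described above can be sketched as follows. This is a minimal, hypothetical example assuming the scikit-multilearn package; `X` and `y` are placeholders standing in for the vectorized headlines and the binary label matrix, not data from this repository.

```python
import numpy as np
from skmultilearn.model_selection import iterative_train_test_split

# Placeholder data: 100 "headlines" with 20 features and 5 possible labels.
X = np.random.rand(100, 20)
y = (np.random.rand(100, 5) > 0.7).astype(int)

# Iterative stratification (Szymanski & Kajdanowicz, 2016) keeps label combinations
# proportionally represented in both splits; the test size is set at 0.2.
X_train, y_train, X_test, y_test = iterative_train_test_split(X, y, test_size=0.2)
```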
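A hedged sketch of the Logistic Regression grid search follows: scikit-learn's OneVsRestClassifier is used here as a stand-in for Binary Relevance, the 'C' values echo the grids discussed above, and `X_train`/`y_train` are assumed to come from the split sketch.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# Regularization strengths to try; 'estimator__C' targets the nested Logistic Regression.
param_grid = {"estimator__C": [0.1, 0.25, 0.5, 1.0]}

grid = GridSearchCV(
    OneVsRestClassifier(LogisticRegression(solver="sag", max_iter=1000)),
    param_grid=param_grid,
    scoring="f1_micro",  # consistent with the evaluation metric used in this task
    cv=3,
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```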
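The two families of models can be set up roughly as shown below. This is a sketch using scikit-learn wrappers with the tuned hyperparameter values listed above, not the repository's exact code; `X_train` and `y_train` from the split sketch are assumed.

```python
from sklearn.multioutput import MultiOutputClassifier, ClassifierChain
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

# Problem transformation: Binary Relevance (one-vs-rest) and Classifier Chain.
binary_relevance = OneVsRestClassifier(
    LogisticRegression(C=0.5, penalty="l2", solver="sag", max_iter=1000)
)
classifier_chain = ClassifierChain(
    LogisticRegression(C=0.5, penalty="l2", solver="sag", max_iter=1000)
)

# MultiOutputClassifier: fits one independent copy of the estimator per label.
multioutput_forest = MultiOutputClassifier(
    RandomForestClassifier(
        bootstrap=False, max_depth=None, max_features=None,
        max_leaf_nodes=None, n_estimators=50,
    )
)
# The tuned AdaBoost used algorithm="SAMME.R"; that option is deprecated/removed
# in recent scikit-learn releases, so it is omitted in this sketch.
multioutput_adaboost = MultiOutputClassifier(
    AdaBoostClassifier(learning_rate=1.04, n_estimators=50)
)

for model in (binary_relevance, classifier_chain, multioutput_forest, multioutput_adaboost):
    model.fit(X_train, y_train)
```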
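Finally, a minimal sketch of the evaluation step: the Micro-average F1-score pools true and false positives and negatives over all labels before computing the score, so frequent and rare labels contribute on the same footing. The fitted models and the test split from the sketches above are assumed.

```python
from sklearn.metrics import f1_score, classification_report

y_pred = multioutput_adaboost.predict(X_test)

# Micro-averaging aggregates all label decisions before computing precision/recall/F1.
print("Micro-average F1:", round(f1_score(y_test, y_pred, average="micro"), 3))
print(classification_report(y_test, y_pred, zero_division=0))
```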

3. Results

  • Multioutput algorithms showed significantly higher Micro-average F1-score values compared to problem transformation algorithms (see Table 2).
  • In agreement with findings in academic literature (Ganganwar, 2012; Mahani, 2022; Mulugeta et al., 2023), tree-based models (Decision Tree) and ensemble methods (Random Forest, Extra Trees, AdaBoost, Gradient Boosting) demonstrated inherent robustness for imbalanced datasets, outperforming other algorithms as indicated by Micro-average F1-score (see Table 2).
  • AdaBoost stood out as the top performer, showing a Micro-average F1-score of 0.989 (see Table 2).
  • Following hyperparameter tuning, a very slight performance enhancement was observed for Random Forest (see Figure 4).

Table 2. Evaluation metrics by classifier

Figure 4. Tuned models vs. Base models: performance evaluation

References