# Multilabel Text Classification
- Difficulty: Advanced
- Project Purpose: Predict multiple tags (e.g., movie genres) from text. Stretch → compare One-vs-Rest Logistic Regression vs linear SVM.
- Points Examined: Multi-label setting, OvR strategy, evaluation with F1-micro vs F1-macro.
- Doc References: Scikit-learn multi-label docs, OvR/OvO strategies.
- Why Useful: Expands classification intuition from binary → multi-class → multi-label, which is very real-world (tags, recommendations, incident categories).
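
The stretch goal above (One-vs-Rest Logistic Regression vs linear SVM) could be sketched roughly as follows. This is a minimal sketch on synthetic data from `make_multilabel_classification`, not on the Kaggle dataset; the sample sizes, `max_iter=1000`, and the `results` dictionary are assumptions chosen for illustration only.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Synthetic multilabel data (assumption: stands in for TF-IDF features).
X, Y = make_multilabel_classification(
    n_samples=500, n_features=30, n_classes=4, random_state=0
)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.25, random_state=0)

candidates = [
    ("logreg", LogisticRegression(max_iter=1000)),
    ("linear_svm", LinearSVC()),
]

results = {}
for name, base in candidates:
    # One binary classifier per label, same wrapper for both base estimators.
    clf = OneVsRestClassifier(base).fit(X_tr, Y_tr)
    Y_hat = clf.predict(X_te)
    results[name] = {
        "f1_micro": f1_score(Y_te, Y_hat, average="micro", zero_division=0),
        "f1_macro": f1_score(Y_te, Y_hat, average="macro", zero_division=0),
    }

for name, scores in results.items():
    print(name, scores)
```

Reporting both micro and macro F1 for each model keeps the comparison honest when some labels are rare.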

# Part 0 Data Download

In [1]:
!curl -L -o ./multilabel-classification-dataset.zip https://www.kaggle.com/api/v1/datasets/download/shivanandmn/multilabel-classification-dataset

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 11.4M  100 11.4M    0     0  4080k      0  0:00:02  0:00:02 --:--:-- 6420k


In [5]:
import pandas as pd
import numpy as np

In [6]:
!unzip ./multilabel-classification-dataset.zip

Archive:  ./multilabel-classification-dataset.zip
  inflating: sample_submission.csv   
  inflating: test.csv                
  inflating: train.csv               


In [8]:
train, test = pd.read_csv('./train.csv'), pd.read_csv('./test.csv')

In [11]:
train.tail()

Unnamed: 0,ID,TITLE,ABSTRACT,Computer Science,Physics,Mathematics,Statistics,Quantitative Biology,Quantitative Finance
20967,20968,Contemporary machine learning: a guide for pra...,Machine learning is finding increasingly bro...,1,1,0,0,0,0
20968,20969,Uniform diamond coatings on WC-Co hard alloy c...,Polycrystalline diamond coatings have been g...,0,1,0,0,0,0
20969,20970,Analysing Soccer Games with Clustering and Con...,We present a new approach for identifying si...,1,0,0,0,0,0
20970,20971,On the Efficient Simulation of the Left-Tail o...,The sum of Log-normal variates is encountere...,0,0,1,1,0,0
20971,20972,Why optional stopping is a problem for Bayesians,"Recently, optional stopping has been a subje...",0,0,1,1,0,0


In [15]:
test.shape

(8989, 3)

In [17]:
train.shape

(20972, 9)

In [18]:
train.head()

Unnamed: 0,ID,TITLE,ABSTRACT,Computer Science,Physics,Mathematics,Statistics,Quantitative Biology,Quantitative Finance
0,1,Reconstructing Subject-Specific Effect Maps,Predictive models allow subject-specific inf...,1,0,0,0,0,0
1,2,Rotation Invariance Neural Network,Rotation invariance and translation invarian...,1,0,0,0,0,0
2,3,Spherical polyharmonics and Poisson kernels fo...,We introduce and develop the notion of spher...,0,0,1,0,0,0
3,4,A finite element approximation for the stochas...,The stochastic Landau--Lifshitz--Gilbert (LL...,0,0,1,0,0,0
4,5,Comparative study of Discrete Wavelet Transfor...,Fourier-transform infra-red (FTIR) spectra o...,1,0,0,1,0,0


In [None]:
# We will use only train.csv for both training and evaluation, because test.csv has no label columns to verify against.

# Part 1 Hypothesis and Planning

In [22]:
label_cols = ['Computer Science', 'Physics', 'Mathematics', 'Statistics', 'Quantitative Biology', 'Quantitative Finance']
featu_cols = ['TITLE', 'ABSTRACT']

In [27]:
train[label_cols].sum()

Computer Science        8594
Physics                 6013
Mathematics             5618
Statistics              5206
Quantitative Biology     587
Quantitative Finance     249
dtype: int64

In [28]:
train[label_cols].isnull().sum()

Computer Science        0
Physics                 0
Mathematics             0
Statistics              0
Quantitative Biology    0
Quantitative Finance    0
dtype: int64

In [30]:
# 1. The labels are highly imbalanced, so we also need to compute macro F1 to make sure no single label performs badly without being noticed.
# 2. We will compare OneVsRestClassifier and MultiOutputClassifier. Each label column is binary rather than multiclass, so we use logistic regression as the base classifier.
# 3. The hypothesis is that OneVsRestClassifier and MultiOutputClassifier will perform similarly, since both fit one binary classifier per label under the hood.
# 4. Although each label is fit separately, labels such as `Quantitative Biology` and `Quantitative Finance` have very few positive samples, which may bias the estimator towards predicting negative, so false negatives could be a serious issue. We will check with a confusion matrix and, if the issue is severe, add class balancing to the pipeline or increase the weight of the positive class.
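
To make point 1 concrete, here is a small sketch of how micro and macro F1 can diverge when a rare label is ignored. The toy arrays are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy ground truth: 6 samples, 2 labels. Label 0 is common, label 1 is rare.
y_true = np.array([[1, 0], [1, 0], [1, 0], [1, 1], [0, 0], [0, 0]])
# A model that gets the common label right but never predicts the rare one.
y_pred = np.array([[1, 0], [1, 0], [1, 0], [1, 0], [0, 0], [0, 0]])

# Micro F1 pools TP/FP/FN over all labels, so the common label dominates.
micro = f1_score(y_true, y_pred, average="micro", zero_division=0)
# Macro F1 averages the per-label F1 scores, so the ignored rare label counts equally.
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
per_label = f1_score(y_true, y_pred, average=None, zero_division=0)

print(f"micro={micro:.3f} macro={macro:.3f} per_label={per_label}")
```

Here micro F1 stays high while macro F1 is pulled down by the rare label, which is why macro F1 is the one to watch for `Quantitative Biology` and `Quantitative Finance`.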

# Part 3 Original Training (OvR) & MultiOutputClassifier

In [55]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train[featu_cols], train[label_cols], test_size=0.2, random_state=42)

In [69]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer

In [76]:
ovr = make_pipeline(
    ColumnTransformer(
        transformers=[
            ("tfidf_tit", TfidfVectorizer(), 'TITLE'),
            ("tfidf_abs", TfidfVectorizer(), 'ABSTRACT'),
        ]
    ),
    OneVsRestClassifier(LogisticRegression())
)

mul = make_pipeline(
    ColumnTransformer(
        transformers=[
            ("tfidf_tit", TfidfVectorizer(), 'TITLE'),
            ("tfidf_abs", TfidfVectorizer(), 'ABSTRACT'),
        ]
    ),
    MultiOutputClassifier(LogisticRegression())
)

In [78]:
ovr.fit(X_train, y_train)
mul.fit(X_train, y_train)
print(ovr.score(X_test, y_test))
print(mul.score(X_test, y_test)) # score here is the fraction of rows where all labels match

0.6390941597139451
0.6390941597139451


In [79]:
# They have the same score because under the hood they do the same thing: fit one logistic regression per label. Note that .score here is subset accuracy (a row counts as correct only if all six labels match), so 0.64 is far better than random guessing, but the performance is still fairly low. The hypothesis is that this is due to class imbalance (i.e. too many 0s).
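
As a rough illustration of how strict the all-labels-must-match score is, the sketch below compares it with Hamming loss on made-up arrays (not the notebook's data). For multilabel indicator targets, `accuracy_score` computes the same subset accuracy that `.score` reports.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Toy data: 4 samples, 3 labels.
y_true = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
# Two rows are fully correct; the other two each miss exactly one label.
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 0]])

subset_acc = accuracy_score(y_true, y_pred)  # fraction of rows with all labels correct
ham = hamming_loss(y_true, y_pred)           # fraction of individual label slots that are wrong

print(f"subset accuracy={subset_acc:.2f}, hamming loss={ham:.3f}")
```

Only 2 of 12 label slots are wrong, yet subset accuracy is 0.5, so a low `.score` does not necessarily mean the per-label predictions are poor.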

In [93]:
from sklearn.metrics import classification_report
def print_performance_per_class(model, X, y):
    y_pred = model.predict(X)
    for i, col in enumerate(y.columns):
        print("Performance for column {}:".format(col))
        print(classification_report(y[col], y_pred[:,i]))
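
Alongside the per-label reports, the confusion-matrix check planned in Part 1 could look roughly like this. It is a sketch on toy arrays; the label names `common` and `rare` are placeholders, not columns from the dataset.

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

# Toy data: 5 samples, 2 labels; the model never predicts the second label.
y_true = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 0]])
y_pred = np.array([[1, 0], [1, 0], [1, 0], [0, 0], [1, 0]])

# One 2x2 matrix per label, laid out as [[TN, FP], [FN, TP]].
mcm = multilabel_confusion_matrix(y_true, y_pred)

for name, cm in zip(["common", "rare"], mcm):
    tn, fp, fn, tp = cm.ravel()
    print(f"{name}: TN={tn} FP={fp} FN={fn} TP={tp}")
```

A label with many false negatives and almost no true positives, like `rare` here, is the pattern to look for on the minority labels.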

In [95]:
print_performance_per_class(mul, X_test, y_test)

Performance for column Computer Science:
              precision    recall  f1-score   support

           0       0.88      0.88      0.88      2503
           1       0.82      0.82      0.82      1692

    accuracy                           0.86      4195
   macro avg       0.85      0.85      0.85      4195
weighted avg       0.86      0.86      0.86      4195

Performance for column Physics:
              precision    recall  f1-score   support

           0       0.93      0.97      0.95      2969
           1       0.92      0.82      0.87      1226

    accuracy                           0.93      4195
   macro avg       0.93      0.90      0.91      4195
weighted avg       0.93      0.93      0.93      4195

Performance for column Mathematics:
              precision    recall  f1-score   support

           0       0.91      0.96      0.93      3045
           1       0.88      0.74      0.80      1150

    accuracy                           0.90      4195
   macro avg       

In [87]:
# Precision is fine for both classes on every label, but class 1 recall gets worse as the number of positive samples drops.
# We care most about class 1 recall because class 1 is the minority. With 8k+ positives (Computer Science) there is little difference, but recall degrades as the count falls: for Statistics (5206 positives) class 1 recall is already 0.68, and for the last two labels it is 0.05 and 0.13, meaning the model basically ignores the class unless it is very confident.


# Part 4 Dealing with imbalanced class

In [102]:
y_train[y_train.sum(axis=1) == 0] # Empty result: every row has at least one positive label

Unnamed: 0,Computer Science,Physics,Mathematics,Statistics,Quantitative Biology,Quantitative Finance


In [131]:
# First try balanced class weights; also try bigram-only features (ngram_range=(2, 2))
ovr = make_pipeline(
    ColumnTransformer(
        transformers=[
            ("tfidf_tit", TfidfVectorizer(), 'TITLE'),
            ("tfidf_abs", TfidfVectorizer(), 'ABSTRACT'),
        ]
    ),
    OneVsRestClassifier(LogisticRegression(class_weight='balanced'))
)

mul = make_pipeline(
    ColumnTransformer(
        transformers=[
            ("tfidf_tit", TfidfVectorizer(), 'TITLE'),
            ("tfidf_abs", TfidfVectorizer(), 'ABSTRACT'),
        ]
    ),
    MultiOutputClassifier(LogisticRegression(class_weight='balanced'))
)

ovr2 = make_pipeline(
    ColumnTransformer(
        transformers=[
            ("tfidf_tit", TfidfVectorizer(ngram_range=(2, 2)), 'TITLE'),
            ("tfidf_abs", TfidfVectorizer(ngram_range=(2, 2)), 'ABSTRACT'),
        ]
    ),
    OneVsRestClassifier(LogisticRegression(class_weight='balanced'))
)

mul2 = make_pipeline(
    ColumnTransformer(
        transformers=[
            ("tfidf_tit", TfidfVectorizer(ngram_range=(2, 2)), 'TITLE'),
            ("tfidf_abs", TfidfVectorizer(ngram_range=(2, 2)), 'ABSTRACT'),
        ]
    ),
    MultiOutputClassifier(LogisticRegression(class_weight='balanced'))
)

In [132]:
ovr.fit(X_train, y_train)
mul.fit(X_train, y_train)
ovr2.fit(X_train, y_train)
mul2.fit(X_train, y_train)
print(ovr.score(X_test, y_test))
print(mul.score(X_test, y_test)) # slightly better than the unweighted run (0.641 vs 0.639)
print(mul2.score(X_test, y_test))
print(ovr2.score(X_test, y_test))

0.6407628128724672
0.6407628128724672
0.6166865315852205
0.6166865315852205


In [105]:
print_performance_per_class(mul, X_test, y_test)

Performance for column Computer Science:
              precision    recall  f1-score   support

           0       0.91      0.86      0.88      2503
           1       0.80      0.87      0.84      1692

    accuracy                           0.86      4195
   macro avg       0.86      0.86      0.86      4195
weighted avg       0.87      0.86      0.86      4195

Performance for column Physics:
              precision    recall  f1-score   support

           0       0.95      0.95      0.95      2969
           1       0.87      0.87      0.87      1226

    accuracy                           0.92      4195
   macro avg       0.91      0.91      0.91      4195
weighted avg       0.92      0.92      0.92      4195

Performance for column Mathematics:
              precision    recall  f1-score   support

           0       0.94      0.91      0.93      3045
           1       0.79      0.84      0.81      1150

    accuracy                           0.89      4195
   macro avg       

In [109]:
# Quantitative Biology -> 1      0.67->0.49, 0.05->0.61
# Quantitative Finance -> 1      1.00->0.76, 0.57->0.69
# The trade-off here is good: we misclassify some samples, but we get a clear improvement on the rare labels, especially for Finance.

In [111]:
# However, the gap between the minority labels and the majority labels is still large, so we will train the Quantitative* labels separately from the others. The strategy is:
#  - Model 1: Predict the labels other than Quantitative*
#  - Model 2:
#    - Sub Model 1: Predict whether a paper is Quantitative or not
#    - Sub Model 2: Predict whether it is Biology or Finance
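
The two-stage idea for Model 2 could be sketched as below. This is only an illustration under stated assumptions: the toy corpus, the names `sub_model_1`, `sub_model_2`, and `predict_two_stage` are hypothetical, and the sketch treats Biology and Finance as mutually exclusive, which the real labels do not guarantee.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; the real notebook would use TITLE/ABSTRACT text.
texts = [
    "neural network image recognition",
    "convolutional network training",
    "quantum field scattering",
    "galaxy cluster observation",
    "protein folding gene expression",
    "cell population gene dynamics",
    "option pricing market volatility",
    "portfolio risk market returns",
]
is_quant = np.array([0, 0, 0, 0, 1, 1, 1, 1])    # Sub Model 1 target
is_finance = np.array([0, 0, 0, 0, 0, 0, 1, 1])  # Sub Model 2 target (1 = Finance, 0 = Biology)

# Sub Model 1: Quantitative or not, trained on all rows.
sub_model_1 = make_pipeline(TfidfVectorizer(), LogisticRegression(class_weight="balanced"))
sub_model_1.fit(texts, is_quant)

# Sub Model 2: Biology vs Finance, trained only on the Quantitative rows.
quant_texts = [t for t, q in zip(texts, is_quant) if q == 1]
sub_model_2 = make_pipeline(TfidfVectorizer(), LogisticRegression(class_weight="balanced"))
sub_model_2.fit(quant_texts, is_finance[is_quant == 1])

def predict_two_stage(docs):
    """Return 'other', 'biology', or 'finance' for each document."""
    docs = list(docs)
    labels = []
    for doc, q in zip(docs, sub_model_1.predict(docs)):
        if q == 0:
            labels.append("other")
        else:
            is_fin = sub_model_2.predict([doc])[0] == 1
            labels.append("finance" if is_fin else "biology")
    return labels

preds = predict_two_stage(["market volatility of option returns", "gene expression in cell"])
print(preds)
```

One design consequence worth noting: any positive that Sub Model 1 misses can never be recovered by Sub Model 2, so the first stage's recall caps the recall of the whole chain.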

In [112]:
# Model 1 is just a subset of the model above, so we skip it and focus only on Model 2.

In [114]:
train_cpy = train.copy()


In [116]:
featu_cols = ['TITLE', 'ABSTRACT']
label_cols = ['Quantitative Biology', 'Quantitative Finance']

In [117]:
train_cpy.head()

Unnamed: 0,ID,TITLE,ABSTRACT,Computer Science,Physics,Mathematics,Statistics,Quantitative Biology,Quantitative Finance
0,1,Reconstructing Subject-Specific Effect Maps,"Predictive models allow subject-specific inference when analyzing disease\nrelated alterations in neuroimaging data. Given a subject's data, inference can\nbe made at two levels: global, i.e. identifiying condition presence for the\nsubject, and local, i.e. detecting condition effect on each individual\nmeasurement extracted from the subject's data. While global inference is widely\nused, local inference, which can be used to form subject-specific effect maps,\nis rarely used because existing models often yield noisy detections composed of\ndispersed isolated islands. In this article, we propose a reconstruction\nmethod, named RSM, to improve subject-specific detections of predictive\nmodeling approaches and in particular, binary classifiers. RSM specifically\naims to reduce noise due to sampling error associated with using a finite\nsample of examples to train classifiers. The proposed method is a wrapper-type\nalgorithm that can be used with different binary classifiers in a diagnostic\nmanner, i.e. without information on condition presence. Reconstruction is posed\nas a Maximum-A-Posteriori problem with a prior model whose parameters are\nestimated from training data in a classifier-specific fashion. Experimental\nevaluation is performed on synthetically generated data and data from the\nAlzheimer's Disease Neuroimaging Initiative (ADNI) database. Results on\nsynthetic data demonstrate that using RSM yields higher detection accuracy\ncompared to using models directly or with bootstrap averaging. Analyses on the\nADNI dataset show that RSM can also improve correlation between\nsubject-specific detections in cortical thickness data and non-imaging markers\nof Alzheimer's Disease (AD), such as the Mini Mental State Examination Score\nand Cerebrospinal Fluid amyloid-$\beta$ levels. Further reliability studies on\nthe longitudinal ADNI dataset show improvement on detection reliability when\nRSM is used.\n",1,0,0,0,0,0
1,2,Rotation Invariance Neural Network,"Rotation invariance and translation invariance have great values in image\nrecognition tasks. In this paper, we bring a new architecture in convolutional\nneural network (CNN) named cyclic convolutional layer to achieve rotation\ninvariance in 2-D symbol recognition. We can also get the position and\norientation of the 2-D symbol by the network to achieve detection purpose for\nmultiple non-overlap target. Last but not least, this architecture can achieve\none-shot learning in some cases using those invariance.\n",1,0,0,0,0,0
2,3,Spherical polyharmonics and Poisson kernels for polyharmonic functions,"We introduce and develop the notion of spherical polyharmonics, which are a\nnatural generalisation of spherical harmonics. In particular we study the\ntheory of zonal polyharmonics, which allows us, analogously to zonal harmonics,\nto construct Poisson kernels for polyharmonic functions on the union of rotated\nballs. We find the representation of Poisson kernels and zonal polyharmonics in\nterms of the Gegenbauer polynomials. We show the connection between the\nclassical Poisson kernel for harmonic functions on the ball, Poisson kernels\nfor polyharmonic functions on the union of rotated balls, and the Cauchy-Hua\nkernel for holomorphic functions on the Lie ball.\n",0,0,1,0,0,0
3,4,A finite element approximation for the stochastic Maxwell--Landau--Lifshitz--Gilbert system,"The stochastic Landau--Lifshitz--Gilbert (LLG) equation coupled with the\nMaxwell equations (the so called stochastic MLLG system) describes the creation\nof domain walls and vortices (fundamental objects for the novel nanostructured\nmagnetic memories). We first reformulate the stochastic LLG equation into an\nequation with time-differentiable solutions. We then propose a convergent\n$\theta$-linear scheme to approximate the solutions of the reformulated system.\nAs a consequence, we prove convergence of the approximate solutions, with no or\nminor conditions on time and space steps (depending on the value of $\theta$).\nHence, we prove the existence of weak martingale solutions of the stochastic\nMLLG system. Numerical results are presented to show applicability of the\nmethod.\n",0,0,1,0,0,0
4,5,Comparative study of Discrete Wavelet Transforms and Wavelet Tensor Train decomposition to feature extraction of FTIR data of medicinal plants,"Fourier-transform infra-red (FTIR) spectra of samples from 7 plant species\nwere used to explore the influence of preprocessing and feature extraction on\nefficiency of machine learning algorithms. Wavelet Tensor Train (WTT) and\nDiscrete Wavelet Transforms (DWT) were compared as feature extraction\ntechniques for FTIR data of medicinal plants. Various combinations of signal\nprocessing steps showed different behavior when applied to classification and\nclustering tasks. Best results for WTT and DWT found through grid search were\nsimilar, significantly improving quality of clustering as well as\nclassification accuracy for tuned logistic regression in comparison to original\nspectra. Unlike DWT, WTT has only one parameter to be tuned (rank), making it a\nmore versatile and easier to use as a data processing tool in various signal\nprocessing applications.\n",1,0,0,1,0,0


In [None]:
(((train_cpy['Quantitative Biology'] + train_cpy['Quantitative Finance']) >= 1).astype(int)).sum() # not too bad, we still have 832 positive rows

In [133]:
train_cpy['Quantitative'] = ((train_cpy['Quantitative Biology'] + train_cpy['Quantitative Finance']) >= 1).astype(int) # target for Sub Model 1: Quantitative or not
train_cpy['Finance'] = train_cpy['Quantitative Finance'] # target for Sub Model 2: Finance vs Biology
train_cpy['Biology'] = train_cpy['Quantitative Biology'] # target for Sub Model 2: Finance vs Biology

In [135]:
X_train, X_test, y_train, y_test = train_test_split(train_cpy[featu_cols], train_cpy['Quantitative'], test_size=0.2, random_state=42, stratify=train_cpy['Quantitative'])

In [136]:
# First train the Quantitative-or-not model with balanced class weights; if it is not accurate enough, we may consider up-sampling
quantitative_check = make_pipeline(
    ColumnTransformer(
        transformers=[
            ("tfidf_ABS", TfidfVectorizer(sublinear_tf=True), 'ABSTRACT'),
            ("tfidf_TITLE", TfidfVectorizer(sublinear_tf=True), 'TITLE'),
        ]
    ),
    LogisticRegression(class_weight='balanced')
)
quantitative_check.fit(X_train, y_train)
quantitative_check.score(X_test, y_test)

0.9594755661501788

In [137]:
print(classification_report(y_test, quantitative_check.predict(X_test))) # :( recall for class 1 is still poor, let's try up-sampling

              precision    recall  f1-score   support

           0       0.98      0.97      0.98      4029
           1       0.49      0.61      0.54       166

    accuracy                           0.96      4195
   macro avg       0.74      0.79      0.76      4195
weighted avg       0.96      0.96      0.96      4195



In [139]:
from sklearn.utils import resample
X_train_maj, X_train_min = X_train[y_train == 0], X_train[y_train == 1]
y_train_maj, y_train_min = y_train[y_train == 0], y_train[y_train == 1]


In [146]:
X_train_min, y_train_min = resample(X_train_min, y_train_min, random_state=42, n_samples=len(X_train_maj), replace=True)

In [147]:
print(X_train_min.shape, y_train_min.shape)

(16111, 2) (16111,)


In [148]:
X_train, y_train = pd.concat([X_train_min, X_train_maj]), pd.concat([y_train_min, y_train_maj])

In [154]:
quantitative_check = make_pipeline(
    ColumnTransformer(
        transformers=[
            ("tfidf_ABS", TfidfVectorizer(sublinear_tf=True), 'ABSTRACT'),
            ("tfidf_TITLE", TfidfVectorizer(sublinear_tf=True), 'TITLE'),
        ]
    ),
    LogisticRegression()
)
quantitative_check.fit(X_train, y_train)
quantitative_check.score(X_test, y_test)

0.9611442193087009

In [155]:
print(classification_report(y_test, quantitative_check.predict(X_test))) # Even after up-sampling, class 1 still performs poorly. It is not clear why the model is still biased towards 0 after up-sampling, but since up-sampling only duplicates the existing positive samples, it is reasonable to conclude that we need more data.
# It is also interesting that precision/recall stay high for class 0 while class 1 remains weak: if the model is trained on a balanced dataset, shouldn't precision and recall for class 0 drop as well?

              precision    recall  f1-score   support

           0       0.98      0.98      0.98      4029
           1       0.51      0.57      0.54       166

    accuracy                           0.96      4195
   macro avg       0.75      0.77      0.76      4195
weighted avg       0.96      0.96      0.96      4195
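
One possible way to read the report above is to back out approximate counts from the rounded figures. The arithmetic below is a rough sketch based only on the printed precision, recall, and support values, so the results are approximate.

```python
# Values copied from the classification report above (rounded to 2 decimals there).
support_pos, support_neg = 166, 4029
precision_pos, recall_pos = 0.51, 0.57

# True positives implied by class 1 recall.
tp = recall_pos * support_pos
# False positives implied by class 1 precision: precision = tp / (tp + fp).
fp = tp * (1 - precision_pos) / precision_pos
# Those false positives are the only class 0 rows that were misclassified.
false_positive_rate = fp / support_neg
recall_neg = 1 - false_positive_rate

print(f"TP~{tp:.0f}, FP~{fp:.0f}, FPR~{false_positive_rate:.3f}, class 0 recall~{recall_neg:.2f}")
```

Under these figures, a false-positive rate of only about 2% on the 4029 negatives already produces roughly as many false positives as true positives. That would explain how class 0 can keep 0.98 recall while class 1 precision sits near 0.5: the test set still has the original class ratio even though the training set was balanced.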



In [156]:
y_train.value_counts()

Quantitative
1    16111
0    16111
Name: count, dtype: int64

In [None]:
# GPT's review below

TL;DR feedback

- In a multilabel setup, `.score` is subset accuracy (strict: every label in a row must match). Keep using label-level metrics (micro/macro F1, Hamming loss, Jaccard).
- Your ColumnTransformer + TfidfVectorizer approach is good. For text-only data, consider concatenating title + abstract, or keep both and weight the title higher.
- For rare labels (Quant Bio / Quant Finance), improve recall via per-label threshold tuning, class_weight='balanced', and possibly ClassifierChain. (I.e. after fitting the model, look for the threshold that gives the best F1.)
- Prefer LogisticRegression(solver='saga', max_iter=2000) for large sparse TF-IDF; try L1 / elastic-net to drop junk features.
- Add PR-AUC per label; optimize thresholds by F1@best-threshold per label.
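
The per-label threshold tuning and PR-AUC suggestions above could be sketched as follows. The helper name `best_f1_threshold` and the toy scores are assumptions for illustration; in the notebook the scores would come from something like `predict_proba` on a held-out split, one label at a time.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

def best_f1_threshold(y_true, scores):
    """Return (threshold, f1) maximizing F1 for one label (hypothetical helper)."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision/recall have one more entry than thresholds; drop the final (1, 0) point.
    p, r = precision[:-1], recall[:-1]
    f1 = 2 * p * r / np.maximum(p + r, 1e-12)
    best = int(np.argmax(f1))
    return float(thresholds[best]), float(f1[best])

# Toy rare label: 2 positives out of 8, all scores below the default 0.5 cut-off.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
scores = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.45, 0.35, 0.40])

thr, f1 = best_f1_threshold(y_true, scores)
ap = average_precision_score(y_true, scores)  # a common summary of the PR curve

print(f"best threshold={thr:.2f}, F1={f1:.2f}, average precision={ap:.3f}")
```

With the default 0.5 threshold this label would never be predicted; lowering the threshold to the tuned value recovers both positives at the cost of one false positive. Thresholds should be chosen on a validation split rather than the final test split to avoid optimistic estimates.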

In [161]:
# https://stackoverflow.com/questions/38640109/logistic-regression-python-solvers-definitions