# Multilabel Text Classification using Logistic Regression

This notebook demonstrates multilabel text classification using traditional machine learning methods.

We use the RCV1 dataset, which contains news articles labeled with multiple topics per sample. The task is to predict all relevant topics for a given article.

This notebook covers:
- Data loading and preprocessing
- TF-IDF features (the scikit-learn copy of RCV1 ships pre-vectorized, so no vectorizer is fit here)
- Multilabel classification using Logistic Regression wrapped in OneVsRestClassifier
- Model evaluation with multilabel metrics

In [2]:
import warnings
warnings.filterwarnings('ignore')

## 1. Dataset Loading & Exploration

The RCV1 dataset contains over 800,000 manually categorized news stories from Reuters. This notebook loads only the training subset (23,149 articles).

Each article can have one or more labels from a set of 103 topics.

We'll explore:
- Dataset size
- Label distribution
- Label vectors of individual samples (the scikit-learn loader provides feature vectors, not raw text)

In [1]:
from sklearn.datasets import fetch_rcv1

# RCV1 is already multilabel and sparse
rcv1 = fetch_rcv1(subset='train')

X = rcv1.data           # TF-IDF vectors (already processed)
y = rcv1.target         # Multilabel binary matrix (shape: [num_samples, num_classes])

print(X.shape, y.shape)

(23149, 47236) (23149, 103)
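The exploration goals listed for this section include the label distribution, which the loading cell does not compute. A minimal sketch of how it could be inspected on a sparse indicator matrix follows. The tiny `y_demo` matrix and `demo_names` array are made-up stand-ins for `rcv1.target` and `rcv1.target_names`, so the snippet runs without downloading anything.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical stand-in for rcv1.target: 4 samples x 3 labels
y_demo = csr_matrix(np.array([
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
]))
demo_names = np.array(["A", "B", "C"])  # stand-in for rcv1.target_names

# Column sums: number of samples carrying each label
samples_per_label = np.asarray(y_demo.sum(axis=0)).ravel()
# Row sums: number of labels attached to each sample
labels_per_sample = np.asarray(y_demo.sum(axis=1)).ravel()

for name, count in zip(demo_names, samples_per_label):
    print(f"{name}: {count} samples")
print("Mean labels per sample:", labels_per_sample.mean())
```

Replacing `y_demo` and `demo_names` with `y` and `rcv1.target_names` should give the same summary for the real data.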


In [9]:
# Access the first row of the csr_matrix 'y'
first_row = y[0]

# Convert the first row to a dense NumPy array
first_row_dense = first_row.toarray()

# Now you can work with first_row_dense like a regular NumPy array
print(first_row_dense)

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1]]
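A dense row of zeros and ones is hard to read. One possible way to turn a row into topic codes is to look up the nonzero column indices in the array of label names. The sketch below uses a small made-up matrix and name array in place of `y` and `rcv1.target_names`.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical stand-ins for rcv1.target and rcv1.target_names
y_demo = csr_matrix(np.array([[0, 1, 0, 1],
                              [1, 0, 0, 0]]))
demo_names = np.array(["C15", "E21", "GPOL", "M14"])

# nonzero() returns (row_indices, col_indices); the columns are the active labels
_, active_cols = y_demo[0].nonzero()
first_row_labels = demo_names[np.sort(active_cols)].tolist()
print(first_row_labels)
```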


## 2. Multilabel Classification Model - Logistic Regression

We use Logistic Regression as a base classifier wrapped in `OneVsRestClassifier` to handle multilabel outputs.

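`OneVsRestClassifier` fits one binary classifier per label. Each classifier applies its own sigmoid, so the predicted probabilities are independent across labels and need not sum to 1.

We train the model on the training subset and analyze its performance.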
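To illustrate what independent per-label probabilities mean, here is a small sketch with made-up decision scores for three labels. The scores and label names are hypothetical. The point is only that each score passes through its own sigmoid, so several labels can exceed the 0.5 threshold at once.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical decision scores from three independent binary classifiers
scores = {"label_a": 2.0, "label_b": 1.5, "label_c": -3.0}

probs = {name: sigmoid(z) for name, z in scores.items()}
predicted = [name for name, p in probs.items() if p >= 0.5]

print(probs)
print("Sum of probabilities:", sum(probs.values()))
print("Predicted labels:", predicted)
```

Unlike a softmax over the three scores, the probabilities here sum to more than 1, and two labels are predicted for the same sample.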

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Use a linear classifier in OneVsRest setting
model = OneVsRestClassifier(LogisticRegression(solver='liblinear'))

# Fit model
model.fit(X, y)

In [None]:
# Predict probabilities for multilabel
y_pred_prob = model.predict_proba(X)

# Convert probabilities to binary predictions (threshold = 0.5)
y_pred = (y_pred_prob >= 0.5).astype(int)
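The cells above fit and predict on the same matrix `X`, so the resulting predictions are training-set predictions. A hedged sketch of a held-out evaluation is shown below on synthetic data from `make_multilabel_classification`, used here only as a stand-in so the example runs quickly. For RCV1 itself, `fetch_rcv1(subset='test')` would be the natural source of unseen samples.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for RCV1 so the sketch runs without a download
X_demo, y_demo = make_multilabel_classification(
    n_samples=400, n_features=30, n_classes=5, n_labels=2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

clf = OneVsRestClassifier(LogisticRegression(solver="liblinear"))
clf.fit(X_tr, y_tr)

# Same 0.5 thresholding as in the cell above, applied to both splits
y_tr_pred = (clf.predict_proba(X_tr) >= 0.5).astype(int)
y_te_pred = (clf.predict_proba(X_te) >= 0.5).astype(int)

train_f1 = f1_score(y_tr, y_tr_pred, average="micro")
test_f1 = f1_score(y_te, y_te_pred, average="micro")
print("Train micro F1:", train_f1)
print("Test micro F1: ", test_f1)
```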

## 3. Evaluation Metrics

We evaluate the model using multilabel-specific metrics:

- Micro and macro averaged F1-score
- Precision and recall per label

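These metrics give insights into overall and per-label performance. Because predictions are made on the same samples the model was fit on, the scores describe training-set fit and likely overestimate performance on unseen articles.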
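The difference between the two averages can be worked through by hand. In the sketch below the true-positive, false-positive and false-negative counts are invented for illustration: one frequent label that is predicted well and one rare label that is never predicted. Micro averaging pools the counts before computing F1, while macro averaging computes F1 per label and then takes the unweighted mean.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical (tp, fp, fn) counts per label
counts = {
    "frequent": (90, 5, 10),
    "rare": (0, 0, 5),
}

per_label = {name: f1_from_counts(*c) for name, c in counts.items()}

# Macro: unweighted mean of the per-label F1 scores
macro_f1 = sum(per_label.values()) / len(per_label)

# Micro: pool the counts over all labels, then compute a single F1
total_tp = sum(c[0] for c in counts.values())
total_fp = sum(c[1] for c in counts.values())
total_fn = sum(c[2] for c in counts.values())
micro_f1 = f1_from_counts(total_tp, total_fp, total_fn)

print("Per-label F1:", per_label)
print("Micro F1:", micro_f1)
print("Macro F1:", macro_f1)
```

With these counts the micro F1 is 0.9 while the macro F1 is roughly 0.46, because the rare label contributes a full zero to the macro mean but only a few errors to the pooled counts.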

In [4]:
from sklearn.metrics import classification_report

print(classification_report(y, y_pred, target_names=rcv1.target_names))

              precision    recall  f1-score   support

         C11       0.97      0.11      0.20       674
         C12       0.95      0.38      0.55       381
         C13       0.93      0.20      0.32       947
         C14       0.86      0.04      0.07       160
         C15       0.98      0.86      0.92      4179
        C151       0.97      0.85      0.91      2366
       C1511       0.96      0.56      0.70       399
        C152       0.96      0.64      0.77      1930
         C16       0.00      0.00      0.00        49
         C17       0.96      0.60      0.74      1172
        C171       0.96      0.39      0.56       437
        C172       0.95      0.56      0.70       285
        C173       0.83      0.07      0.12        76
        C174       0.99      0.62      0.77       246
         C18       0.96      0.62      0.76      1462
        C181       0.94      0.56      0.70      1205
        C182       0.00      0.00      0.00       142
        C183       0.95    

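Labels such as C16 and C182 in the report above show a precision of 0.00, which is what scikit-learn reports when a label is never predicted and precision is therefore undefined. It also emits an `UndefinedMetricWarning` in that case. A minimal sketch of making this choice explicit with the `zero_division` parameter is shown below on a made-up two-label example.

```python
import numpy as np
from sklearn.metrics import classification_report, f1_score

# Hypothetical 2-label example: the label "rare" is never predicted
y_true_demo = np.array([[1, 0], [1, 1], [0, 1], [1, 0]])
y_pred_demo = np.array([[1, 0], [1, 0], [0, 0], [1, 0]])

# zero_division=0 scores undefined precision as 0 without emitting a warning
print(classification_report(
    y_true_demo, y_pred_demo,
    target_names=["common", "rare"],
    zero_division=0,
))

# average=None returns one F1 value per label
per_label_f1 = f1_score(y_true_demo, y_pred_demo, average=None, zero_division=0)
print(per_label_f1)
```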


In [None]:
from sklearn.metrics import f1_score

print("Micro F1:", f1_score(y, y_pred, average='micro'))
print("Macro F1:", f1_score(y, y_pred, average='macro'))

Micro F1: 0.819000959565621
Macro F1: 0.40321105838859317




## 4. Conclusion

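Logistic Regression with OneVsRest is a simple baseline for multilabel classification. On the training subset it reaches a micro F1 of about 0.82 but a macro F1 of only about 0.40, because many infrequent labels have low recall or are never predicted (for example C16 and C182).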

Future work could include:
- Hyperparameter tuning
- Trying other classifiers like Random Forest or SVM
- Using dimensionality reduction or embeddings
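As one possible starting point for the hyperparameter tuning item, the sketch below searches over the regularization strength `C` of the wrapped classifier with `GridSearchCV`. It runs on synthetic data as a stand-in for RCV1, and the grid values are arbitrary choices for illustration rather than recommendations.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for RCV1 so the sketch runs without a download
X_demo, y_demo = make_multilabel_classification(
    n_samples=300, n_features=20, n_classes=4, n_labels=2, random_state=0
)

# "estimator__C" reaches the C parameter of the wrapped LogisticRegression
search = GridSearchCV(
    OneVsRestClassifier(LogisticRegression(solver="liblinear")),
    param_grid={"estimator__C": [0.1, 1.0, 10.0]},  # arbitrary example grid
    scoring="f1_micro",
    cv=3,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated micro F1:", round(search.best_score_, 3))
```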