### Baseline Model


This notebook builds and evaluates a baseline multilabel text classification model with scikit-learn. The task is to assign one or more topics to each news article. Because the model runs in a client-facing setting, the primary metric is precision: false positives must be kept low to maintain client trust.

In [1]:
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import classification_report
from collections import Counter
from itertools import chain
import re



In [2]:
# Text Preprocessing
def preprocess_text(text: str) -> str:
    """Remove numbers, newlines, and special characters from text."""
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'\n', ' ', text)
    text = re.sub(r'[^\w\s]', '', text)
    return text

# Find Single Appearance Labels
def find_single_appearance_labels(y):
    """Find labels that appear only once in the dataset."""
    all_labels = list(chain.from_iterable(y))
    label_count = Counter(all_labels)
    single_appearance_labels = [label for label, count in label_count.items() if count == 1]
    return single_appearance_labels

# Remove Single Appearance Labels from Dataset
def remove_single_appearance_labels(dataset, single_appearance_labels):
    """Remove samples with single-appearance labels from both train and test sets."""
    rare_labels = set(single_appearance_labels)  # set membership checks are O(1)
    for split in ['train', 'test']:
        # Keep a sample only if none of its topics is a single-appearance label
        dataset[split] = dataset[split].filter(lambda x: rare_labels.isdisjoint(x['topics']))
    return dataset
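
To see what these helpers do without downloading the dataset, here is a minimal sketch on a hypothetical toy label list. It uses plain Python lists instead of the dataset object, so the filtering step is written as a list comprehension; the `toy_*` names are illustrative and not part of the notebook.

```python
from collections import Counter
from itertools import chain

# Hypothetical toy data: each inner list holds the topics of one article.
toy_topics = [["earn"], ["earn", "acq"], ["acq"], ["rye", "grain"], ["grain"]]

# Count how often each label occurs across all samples.
label_count = Counter(chain.from_iterable(toy_topics))
single = {label for label, count in label_count.items() if count == 1}

# Keep only samples that carry none of the single-appearance labels.
toy_kept = [topics for topics in toy_topics if single.isdisjoint(topics)]

print(single)
print(toy_kept)
```

Note that the whole sample is dropped, so any other label on that sample (grain in this toy case) loses an occurrence as well.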

In [3]:
# Load Dataset
dataset = load_dataset("reuters21578", "ModApte")

# Find and Remove Single Appearance Labels
print("Finding single appearance labels...")
y_train = [item['topics'] for item in dataset['train']]
single_appearance_labels = find_single_appearance_labels(y_train)
print(f"Single appearance labels: {single_appearance_labels}")

print("Removing samples with single-appearance labels...")
dataset = remove_single_appearance_labels(dataset, single_appearance_labels)

Finding single appearance labels...
Single appearance labels: ['lin-oil', 'rye', 'red-bean', 'groundnut-oil', 'citruspulp', 'rape-meal', 'corn-oil', 'peseta', 'cotton-oil', 'ringgit', 'castorseed', 'castor-oil', 'lit', 'rupiah', 'skr', 'nkr', 'dkr', 'sun-meal', 'lin-meal', 'cruzado']
Removing samples with single-appearance labels...


In [4]:
# Combine title and text, then preprocess
print("Preprocessing text...")
X_train = [item['title'] + ' ' + item['text'] for item in dataset['train']]
X_train = [preprocess_text(text) for text in X_train]
y_train = [item['topics'] for item in dataset['train']]

# TF-IDF Vectorization
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000, ngram_range=(1,2))
X_train_tfidf = vectorizer.fit_transform(X_train)

# Transform Labels to Binary Matrix
mlb = MultiLabelBinarizer()
y_train_bin = mlb.fit_transform(y_train)

Preprocessing text...


In [5]:
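# Transform the test split with the vectorizer and binarizer fitted on the training data
X_test = [preprocess_text(item['title'] + ' ' + item['text']) for item in dataset['test']]
X_test_tfidf = vectorizer.transform(X_test)
y_test_bin = mlb.transform([item['topics'] for item in dataset['test']])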



Note: scikit-learn's MultiLabelBinarizer is convenient here because its transform method ignores (with a warning) any test-set labels that did not appear during fitting. Transformer models offer no such handling out of the box, so additional preprocessing steps are needed to manage these unseen labels.
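
The behavior described in the note can be checked with a small sketch. The labels below are a hypothetical toy example; the warning is captured only so that it can be printed.

```python
import warnings
from sklearn.preprocessing import MultiLabelBinarizer

toy_mlb = MultiLabelBinarizer()
toy_mlb.fit([["earn"], ["acq", "earn"]])  # classes seen during fit: acq, earn

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # "rye" was never seen during fit, so it gets no column in the output.
    toy_test = toy_mlb.transform([["earn", "rye"]])

print(toy_test)
print([str(w.message) for w in caught])
```

The unseen label is silently left out of the binary matrix, which is why the warning is worth watching for when the train and test label sets differ.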

In [6]:
mlb.classes_

array(['acq', 'alum', 'austdlr', 'barley', 'bop', 'can', 'carcass',
       'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper',
       'copra-cake', 'corn', 'cornglutenfeed', 'cotton', 'cpi', 'cpu',
       'crude', 'dfl', 'dlr', 'dmk', 'earn', 'fishmeal', 'fuel', 'gas',
       'gnp', 'gold', 'grain', 'groundnut', 'heat', 'hog', 'housing',
       'income', 'instal-debt', 'interest', 'inventories', 'ipi',
       'iron-steel', 'jet', 'jobs', 'l-cattle', 'lead', 'lei', 'linseed',
       'livestock', 'lumber', 'meal-feed', 'money-fx', 'money-supply',
       'naphtha', 'nat-gas', 'nickel', 'nzdlr', 'oat', 'oilseed',
       'orange', 'palladium', 'palm-oil', 'palmkernel', 'pet-chem',
       'platinum', 'plywood', 'pork-belly', 'potato', 'propane', 'rand',
       'rape-oil', 'rapeseed', 'reserves', 'retail', 'rice', 'rubber',
       'saudriyal', 'ship', 'silver', 'sorghum', 'soy-meal', 'soy-oil',
       'soybean', 'stg', 'strategic-metal', 'sugar', 'sun-oil', 'sunseed',
       'tapioca', 'te

In [7]:
from sklearn.linear_model import LogisticRegression

# Train one binary logistic-regression classifier per label (one-vs-rest)
print("Training classifier...")
# Alternative baseline: clf = OneVsRestClassifier(MultinomialNB())
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X_train_tfidf, y_train_bin)

Training classifier...


In [8]:
# Predictions and Evaluation
print("Making predictions and evaluating...")
y_pred = clf.predict(X_test_tfidf)

Making predictions and evaluating...
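
The predict call above applies the default decision threshold, which for logistic regression corresponds to a probability of 0.5 per label. Since precision is the priority here, one option (not used in this notebook) is to raise that threshold via predict_proba. The sketch below shows the idea on synthetic data; the data, the `toy_*` names, and the 0.8 threshold are assumptions for illustration.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multilabel data as a stand-in for the TF-IDF features (assumption).
X_toy, Y_toy = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=4, random_state=0
)
toy_clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
toy_clf.fit(X_toy, Y_toy)

# Per-label probabilities, shape (n_samples, n_labels).
proba = toy_clf.predict_proba(X_toy)

threshold = 0.8  # hypothetical value; tune it on a validation split
Y_default = (proba >= 0.5).astype(int)
Y_strict = (proba >= threshold).astype(int)

print("labels predicted at 0.5:", Y_default.sum())
print("labels predicted at 0.8:", Y_strict.sum())
```

A stricter threshold can only remove predicted labels, never add them, so it trades recall for (usually) higher precision. How far to push it is a judgment call that depends on how costly missed topics are.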


In [9]:
print("Classification Report:")
# zero_division=1 reports undefined precision/recall (no predictions or no support) as 1.00
print(classification_report(y_test_bin, y_pred, target_names=mlb.classes_, zero_division=1))

Classification Report:
                 precision    recall  f1-score   support

            acq       0.98      0.87      0.92       719
           alum       1.00      0.00      0.00        23
        austdlr       1.00      1.00      1.00         0
         barley       1.00      0.00      0.00        12
            bop       1.00      0.30      0.46        30
            can       1.00      1.00      1.00         0
        carcass       1.00      0.06      0.11        18
          cocoa       1.00      0.61      0.76        18
        coconut       1.00      0.00      0.00         2
    coconut-oil       1.00      0.00      0.00         2
         coffee       0.94      0.59      0.73        27
         copper       1.00      0.22      0.36        18
     copra-cake       1.00      0.00      0.00         1
           corn       0.97      0.51      0.67        55
 cornglutenfeed       1.00      1.00      1.00         0
         cotton       1.00      0.06      0.11        18
       

---

Insight:

- In our client-facing news classification model, precision takes precedence over recall, because false positives are more damaging and harder to justify to clients than false negatives. When the model tags a news item with a wrong topic, the error is hard to explain. When it misses a topic, we can more easily argue that the topic was not sufficiently emphasized in the article.

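- High Precision, Low Recall: The model is cautious and predicts a label only when it is highly certain. This avoids false positives at the cost of missing many true positives, hence the low recall. Note that some of the 1.00 precision values (for example alum and barley, both with 0.00 recall) belong to labels the model never predicted on the test set: precision is undefined there and is reported as 1.00 only because of zero_division=1.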

- Macro vs Micro Averages: The micro avg F1-score is 0.77, which is decent, but the macro avg F1-score is only 0.29. The micro average is dominated by frequent labels, while the macro average weights every label equally, so this gap indicates that the model performs well on commonly occurring labels but fails to capture the minority classes.

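- Labels with Zero Support: Several labels (for example austdlr, can, and cornglutenfeed) have a 'support' of zero, meaning they did not appear in the test set. Their scores are undefined and are reported as 1.00 because of zero_division=1, so they say nothing about model quality and they inflate the macro averages.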
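
The effect of zero_division on per-label and macro-averaged precision can be sketched on a small hypothetical example. The arrays below are made up: column 0 is a frequent label, column 1 is a label that is never predicted, and column 2 has zero support.

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical toy labels (rows are samples, columns are labels).
y_true = np.array([[1, 1, 0], [1, 0, 0], [1, 1, 0], [0, 0, 0]])
y_pred = np.array([[1, 0, 0], [1, 0, 0], [0, 0, 0], [0, 0, 0]])

for zd in (0, 1):
    # Precision is undefined for columns 1 and 2 (no positive predictions),
    # so their reported value is whatever zero_division is set to.
    per_label = precision_score(y_true, y_pred, average=None, zero_division=zd)
    macro = precision_score(y_true, y_pred, average="macro", zero_division=zd)
    print(f"zero_division={zd}: per-label={per_label}, macro={macro:.2f}")

micro = precision_score(y_true, y_pred, average="micro", zero_division=0)
print(f"micro precision: {micro:.2f}")
```

In this toy case the macro precision moves from about 0.33 to 1.00 purely because of the zero_division setting, while the micro precision is unaffected. When reading the report above, it may therefore be safer to restrict the macro average to labels with nonzero support and at least one prediction.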