# Local Classifier per Parent Node (LCPN) using Random Forest

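This notebook implements the **LCPN strategy**: one multi-class classifier is trained at each non-leaf (parent) node of the hierarchy to choose among that node's children. Prediction then proceeds top-down, level by level, so each decision is restricted to the branch selected at the level above.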

In [3]:
!pip install -q transformers torch datasets accelerate scikit-learn

In [30]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, accuracy_score
from collections import defaultdict

## 🔹 Dataset: Simulated Hierarchy (20 Newsgroups)

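We map the 20 newsgroup categories onto a custom three-level tree below the root: Root → top-level category (5 classes) → mid-level category (9 classes) → newsgroup (20 classes). Each document's fine-grained label is mapped to its Level 1 and Level 2 ancestors so that its full path is available for training.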


- Level 1 classifier (1, at the root): predicts the top-level category (5 classes)
- Level 2 classifiers (5, one per Level 1 class): predict the mid-level class within that top-level category
- Level 3 classifiers (9, one per Level 2 class): predict the fine-grained newsgroup within that mid-level class
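
As a rough illustration of how the number of local classifiers follows from the tree shape, the sketch below walks a small parent-to-children dictionary breadth-first and lists the parent nodes at each depth. The toy `hierarchy` dict and the helper `parent_nodes_by_depth` are assumptions made for this example and are not part of the notebook; applied to the full tree used here, the same count would be 1 + 5 + 9 = 15 classifiers.

```python
# Toy subset of the hierarchy, written as parent -> children (illustrative only).
hierarchy = {
    "root": ["comp", "rec"],
    "comp": ["hardware", "software"],
    "rec": ["autos"],
    "hardware": ["comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware"],
    "software": ["comp.graphics"],
    "autos": ["rec.autos", "rec.motorcycles"],
}


def parent_nodes_by_depth(hierarchy, root="root"):
    """Return {depth: [non-leaf nodes at that depth]} using a breadth-first walk."""
    levels = {}
    frontier = [root]
    depth = 0
    while frontier:
        # Only nodes that appear as keys have children, i.e. need a classifier.
        parents = [node for node in frontier if node in hierarchy]
        if parents:
            levels[depth] = parents
        frontier = [child for parent in parents for child in hierarchy[parent]]
        depth += 1
    return levels


levels = parent_nodes_by_depth(hierarchy)
n_classifiers = sum(len(nodes) for nodes in levels.values())
print(levels)
print("classifiers needed:", n_classifiers)
```

In LCPN the classifier count equals the number of parent nodes, not the number of leaves, which is why adding a newsgroup under an existing mid-level class would not add a classifier.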

In [25]:
level1_map = {
    'comp': ['hardware', 'software'],
    'rec': ['autos', 'sports'],
    'sci': ['med', 'tech'],
    'talk': ['politics', 'religion'],
    'misc': ['miscellaneous']
}

level2_map = {
    'hardware': ['comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware'],
    'software': ['comp.graphics', 'comp.os.ms-windows.misc', 'comp.windows.x'],
    'autos': ['rec.autos', 'rec.motorcycles'],
    'sports': ['rec.sport.baseball', 'rec.sport.hockey'],
    'med': ['sci.med'],
    'tech': ['sci.crypt', 'sci.electronics', 'sci.space'],
    'politics': ['talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc'],
    'religion': ['talk.religion.misc', 'alt.atheism', 'soc.religion.christian'],  # alt.* and soc.* groups are placed under talk/religion by topic
    'miscellaneous': ['misc.forsale']
}

In [26]:
# Invert maps for quick lookup from fine to Level 2 and Level 1
fine_to_level2 = {}
for lvl2_cat, fine_cats in level2_map.items():
    for cat in fine_cats:
        fine_to_level2[cat] = lvl2_cat

fine_to_level1 = {}
for lvl1_cat, lvl2_cats in level1_map.items():
    for lvl2_cat in lvl2_cats:
        for fine_cat in level2_map[lvl2_cat]:
            fine_to_level1[fine_cat] = lvl1_cat

In [27]:
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

X_train = newsgroups_train.data
y_train_fine = [newsgroups_train.target_names[i] for i in newsgroups_train.target]

X_test = newsgroups_test.data
y_test_fine = [newsgroups_test.target_names[i] for i in newsgroups_test.target]

In [28]:
# Derive Level 1 and Level 2 labels from each fine-grained label

y_train_level1 = [fine_to_level1[label] for label in y_train_fine]
y_train_level2 = [fine_to_level2[label] for label in y_train_fine]

y_test_level1 = [fine_to_level1[label] for label in y_test_fine]
y_test_level2 = [fine_to_level2[label] for label in y_test_fine]

In [31]:
# Encode Level 1 labels as integers (the classifier itself is trained in a later cell)
le_lvl1 = LabelEncoder()
y_train_lvl1_enc = le_lvl1.fit_transform(y_train_level1)
y_test_lvl1_enc = le_lvl1.transform(y_test_level1)

In [32]:
# Vectorize text with TF-IDF (vocabulary is fitted on the training set only)
vectorizer = TfidfVectorizer(max_features=10000, stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

In [34]:
# Root classifier: predicts the Level 1 category from the full training set
clf_lvl1 = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
clf_lvl1.fit(X_train_tfidf, y_train_lvl1_enc)

In [35]:
# Predict Level 1 on test set
pred_lvl1 = clf_lvl1.predict(X_test_tfidf)
print("Level 1 Accuracy:", accuracy_score(y_test_lvl1_enc, pred_lvl1))
print("Level 1 Classification Report:\n", classification_report(y_test_lvl1_enc, pred_lvl1, target_names=le_lvl1.classes_))

Level 1 Accuracy: 0.7466808284652151
Level 1 Classification Report:
               precision    recall  f1-score   support

        comp       0.72      0.88      0.79      1955
        misc       0.82      0.55      0.66       390
         rec       0.72      0.79      0.75      1590
         sci       0.79      0.48      0.60      1579
        talk       0.77      0.84      0.80      2018

    accuracy                           0.75      7532
   macro avg       0.77      0.71      0.72      7532
weighted avg       0.75      0.75      0.74      7532



## 🔹 LCPN Design

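Each non-leaf node gets its own multi-class classifier, trained only on the documents that belong to that node's subtree. This notebook uses `RandomForestClassifier`; another scikit-learn classifier such as `DecisionTreeClassifier` would fit the same pattern. At prediction time we follow the decision path from the root to a leaf, so an error made at an upper level cannot be corrected further down.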
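
The design described above can be sketched generically on a toy problem. The example below is an illustration under stated assumptions, not the notebook's implementation: the two-level hierarchy (`A`, `B`, `a1`, ...), the synthetic 2-D clusters, and the helpers `fit_lcpn` and `predict_lcpn` are all made up for this sketch, and it uses `DecisionTreeClassifier` with raw string labels to stay small and fast.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy hierarchy: root -> {A, B}, A -> {a1, a2}, B -> {b1, b2}.
children = {"root": ["A", "B"], "A": ["a1", "a2"], "B": ["b1", "b2"]}
parent_of = {child: parent for parent, kids in children.items() for child in kids}


def path_from_root(leaf):
    """Return the full path from the root to the leaf, e.g. ['root', 'A', 'a1']."""
    path = [leaf]
    while path[-1] != "root":
        path.append(parent_of[path[-1]])
    return path[::-1]


def fit_lcpn(X, leaf_labels):
    """Train one classifier per parent node on the samples routed through it."""
    paths = [path_from_root(leaf) for leaf in leaf_labels]
    models = {}
    for node in children:
        rows, targets = [], []
        for i, path in enumerate(paths):
            if node in path:
                rows.append(i)
                # The target is the child taken at this node on the true path.
                targets.append(path[path.index(node) + 1])
        models[node] = DecisionTreeClassifier(random_state=0).fit(X[rows], targets)
    return models


def predict_lcpn(models, x):
    """Follow the predicted child at each parent node until a leaf is reached."""
    node = "root"
    while node in models:
        node = str(models[node].predict(x.reshape(1, -1))[0])
    return node


# Four well-separated synthetic clusters, one per leaf (illustrative data).
rng = np.random.default_rng(0)
centers = {"a1": (0, 0), "a2": (0, 5), "b1": (5, 0), "b2": (5, 5)}
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in centers.values()])
y = [leaf for leaf in centers for _ in range(30)]

models = fit_lcpn(X, y)
predictions = [predict_lcpn(models, np.array(c, dtype=float)) for c in centers.values()]
print(predictions)
```

The cells that follow apply the same idea with one dictionary of classifiers per level and a `LabelEncoder` per parent node.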


In [37]:
# Train one Level 2 classifier per Level 1 class

le_lvl2_dict = {}
clf_lvl2_dict = {}

for lvl1_class in le_lvl1.classes_:
    # Filter training samples for this Level 1 class
    idx_train = [i for i, lbl in enumerate(y_train_level1) if lbl == lvl1_class]
    X_sub_train = X_train_tfidf[idx_train]
    y_sub_train = [y_train_level2[i] for i in idx_train]

    # Encode Level 2 labels for this Level 1 class
    le_lvl2 = LabelEncoder()
    y_sub_train_enc = le_lvl2.fit_transform(y_sub_train)
    le_lvl2_dict[lvl1_class] = le_lvl2

    # Train classifier for Level 2 (trivial when the node has a single child, e.g. 'misc')
    clf_lvl2 = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
    clf_lvl2.fit(X_sub_train, y_sub_train_enc)
    clf_lvl2_dict[lvl1_class] = clf_lvl2


In [38]:
# Train one Level 3 (fine-grained newsgroup) classifier per Level 2 class

le_lvl3_dict = {}
clf_lvl3_dict = {}

for lvl2_class in level2_map.keys():
    # Filter training samples for this Level 2 class
    idx_train = [i for i, lbl in enumerate(y_train_level2) if lbl == lvl2_class]
    X_sub_train = X_train_tfidf[idx_train]
    y_sub_train = [y_train_fine[i] for i in idx_train]

    # Encode Level 3 (fine) labels for this Level 2 class
    le_lvl3 = LabelEncoder()
    y_sub_train_enc = le_lvl3.fit_transform(y_sub_train)
    le_lvl3_dict[lvl2_class] = le_lvl3

    # Train classifier for Level 3 (trivial for single-child nodes such as 'med')
    clf_lvl3 = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
    clf_lvl3.fit(X_sub_train, y_sub_train_enc)
    clf_lvl3_dict[lvl2_class] = clf_lvl3

In [39]:
# Predict on the test set by chaining Level 1 -> Level 2 -> Level 3, one document at a time

y_pred_fine = []
for i in range(X_test_tfidf.shape[0]):
    x = X_test_tfidf[i]

    # Level 1 prediction
    pred_l1 = clf_lvl1.predict(x)[0]
    l1_label = le_lvl1.inverse_transform([pred_l1])[0]

    # Level 2 prediction using Level 1 class classifier
    clf_l2 = clf_lvl2_dict[l1_label]
    le_l2 = le_lvl2_dict[l1_label]
    pred_l2_enc = clf_l2.predict(x)
    pred_l2 = le_l2.inverse_transform(pred_l2_enc)[0]

    # Level 3 prediction using Level 2 class classifier
    clf_l3 = clf_lvl3_dict[pred_l2]
    le_l3 = le_lvl3_dict[pred_l2]
    pred_l3_enc = clf_l3.predict(x)
    pred_l3 = le_l3.inverse_transform(pred_l3_enc)[0]

    y_pred_fine.append(pred_l3)
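
The loop above calls `predict` three times per document, which can be slow for several thousand rows. One possible alternative, sketched here as an assumption rather than as the notebook's method, is to group rows by their predicted parent and call each parent's classifier once per group. The helper `predict_children` and the tiny demo data are hypothetical; the function only relies on the `predict` and `inverse_transform` methods the notebook already uses.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier


def predict_children(X, parent_labels, clf_by_parent, le_by_parent):
    """Predict a child label for every row, calling each parent's classifier once."""
    parent_labels = np.asarray(parent_labels)
    child_labels = np.empty(len(parent_labels), dtype=object)
    for parent in np.unique(parent_labels):
        rows = np.flatnonzero(parent_labels == parent)
        encoded = clf_by_parent[str(parent)].predict(X[rows])
        child_labels[rows] = le_by_parent[str(parent)].inverse_transform(encoded)
    return child_labels


# Tiny illustrative setup: two parents, each with two children.
X_demo = np.array([[0.0], [1.0], [10.0], [11.0]])
demo_parents = ["low", "low", "high", "high"]
demo_children = ["l0", "l1", "h0", "h1"]

clf_by_parent, le_by_parent = {}, {}
for parent in sorted(set(demo_parents)):
    rows = [i for i, p in enumerate(demo_parents) if p == parent]
    le = LabelEncoder().fit([demo_children[i] for i in rows])
    targets = le.transform([demo_children[i] for i in rows])
    clf_by_parent[parent] = DecisionTreeClassifier(random_state=0).fit(X_demo[rows], targets)
    le_by_parent[parent] = le

demo_pred = predict_children(X_demo, demo_parents, clf_by_parent, le_by_parent)
print(list(demo_pred))
```

With the objects defined in this notebook, the chain would be roughly: decode `clf_lvl1.predict(X_test_tfidf)` with `le_lvl1.inverse_transform`, pass the result with `clf_lvl2_dict` and `le_lvl2_dict` to get Level 2 labels, then pass those with `clf_lvl3_dict` and `le_lvl3_dict` to get the fine labels. Row indexing with an integer array also works on the sparse TF-IDF matrix, and the predictions should match the per-document loop because each classifier scores rows independently.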

In [40]:
# Evaluate fine-level (newsgroup) predictions from the chained classifiers

print("Fine-level Accuracy:", accuracy_score(y_test_fine, y_pred_fine))
print("Fine-level Classification Report:\n", classification_report(y_test_fine, y_pred_fine))

Fine-level Accuracy: 0.5596123207647371
Fine-level Classification Report:
                           precision    recall  f1-score   support

             alt.atheism       0.38      0.38      0.38       319
           comp.graphics       0.39      0.63      0.48       389
 comp.os.ms-windows.misc       0.43      0.63      0.51       394
comp.sys.ibm.pc.hardware       0.57      0.56      0.57       392
   comp.sys.mac.hardware       0.59      0.64      0.61       385
          comp.windows.x       0.70      0.67      0.68       395
            misc.forsale       0.82      0.55      0.66       390
               rec.autos       0.37      0.64      0.47       396
         rec.motorcycles       0.70      0.60      0.65       398
      rec.sport.baseball       0.71      0.70      0.70       397
        rec.sport.hockey       0.93      0.74      0.83       399
               sci.crypt       0.87      0.49      0.63       396
         sci.electronics       0.44      0.29      0.35       393
