# Multilabel Text Classification Using a CNN

This notebook demonstrates multilabel classification using a Convolutional Neural Network (CNN).

Dataset: GoEmotions (simplified), a large dataset of Reddit comments labeled with 28 emotion categories.

The model uses:
- Embedding layer for word representations
- A Conv1D layer to capture local n-gram features
- Sigmoid activation for multilabel output
- Binary cross-entropy loss
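
To make the last two bullets concrete, here is a minimal pure-Python sketch of a sigmoid output and of binary cross-entropy averaged over the labels of one example. The logits and targets are made-up values for illustration, not outputs of the model in this notebook.

```python
import math

def sigmoid(z):
    """Map a logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over all labels of one example."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical logits for three labels; two of them are "on" in the target.
logits = [2.0, -1.0, 0.5]
targets = [1, 0, 1]

probs = [sigmoid(z) for z in logits]
loss = binary_cross_entropy(targets, probs)

# Unlike softmax, the probabilities are not forced to sum to 1,
# so several labels can be active at once.
print([round(p, 3) for p in probs], round(sum(probs), 3), round(loss, 3))
```

Because each label gets its own sigmoid, the loss treats the problem as one binary decision per label, which is what allows a comment to carry several emotions at once.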

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
!pip install datasets tensorflow scikit-learn --quiet

In [2]:
from datasets import load_dataset
from sklearn.preprocessing import MultiLabelBinarizer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

## 1. Dataset Loading & Exploration

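We load the training split of the GoEmotions simplified dataset. Each example has a `text` field and a `labels` field holding a list of integer class ids.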

In [8]:
# Load dataset
dataset = load_dataset("go_emotions", "simplified", split="train")

texts = dataset['text']
raw_labels = dataset['labels']  # Multilabel: list of integers

# Get the emotion class list
class_list = dataset.features['labels'].feature.names
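
The cell above only loads the data, so a quick look at label frequencies is a natural exploration step. The sketch below uses a tiny hand-written stand-in (`toy_labels`, `toy_classes`, both hypothetical) rather than the real dataset. Substituting `raw_labels` and `class_list` should work the same way, assuming they are a list of integer lists and a list of names.

```python
from collections import Counter

# Hypothetical stand-ins for `raw_labels` and `class_list`.
toy_classes = ["admiration", "amusement", "anger", "neutral"]
toy_labels = [[0], [3], [1, 2], [3], [0, 1], [3]]

# How often each label occurs across all examples.
label_counts = Counter(i for labels in toy_labels for i in labels)
for idx, count in label_counts.most_common():
    print(f"{toy_classes[idx]:<12} {count}")

# How many labels each example carries (1 = single label, 2+ = multilabel).
labels_per_example = Counter(len(labels) for labels in toy_labels)
print(dict(labels_per_example))
```

A strongly skewed label distribution is worth knowing about before training, since rare labels tend to be the ones a 0.5 threshold misses.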

## 2. Text Preprocessing & Tokenization

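Labels are converted to multi-hot vectors with `MultiLabelBinarizer`. Text is tokenized with the Keras `Tokenizer` (10,000-word vocabulary) and padded to 150 tokens.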

In [9]:
# Convert to multi-hot
mlb = MultiLabelBinarizer(classes=class_list)
y = mlb.fit_transform([[class_list[i] for i in lbl] for lbl in raw_labels])

# Tokenization
tokenizer = Tokenizer(num_words=10000, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)
X_seq = tokenizer.texts_to_sequences(texts)
X_pad = pad_sequences(X_seq, maxlen=150, padding='post')
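
For readers unfamiliar with these two transformations, the following pure-Python sketch mimics what the multi-hot encoding and the post-padding produce. The helper names `multi_hot` and `pad_post` are hypothetical and are not part of scikit-learn or Keras.

```python
def multi_hot(label_ids, num_classes):
    """Turn a list of class indices into a 0/1 vector of length num_classes."""
    row = [0] * num_classes
    for i in label_ids:
        row[i] = 1
    return row

def pad_post(seq, maxlen, value=0):
    """Pad at the end with `value`; overlong sequences keep their last `maxlen` ids."""
    seq = seq[-maxlen:]  # mirrors the Keras default truncating='pre'
    return seq + [value] * (maxlen - len(seq))

print(multi_hot([1, 3], num_classes=5))        # [0, 1, 0, 1, 0]
print(pad_post([7, 2, 9], maxlen=5))           # [7, 2, 9, 0, 0]
print(pad_post([7, 2, 9, 4, 4, 8], maxlen=5))  # [2, 9, 4, 4, 8]
```

Note that `padding='post'` only controls where zeros are added. Truncation is governed separately by `truncating`, which defaults to `'pre'`, so comments longer than 150 tokens lose their beginning unless `truncating='post'` is passed.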

## 3. CNN Model Architecture

The CNN model consists of:

- Embedding layer
- One Conv1D layer with ReLU activation
- Global max pooling, followed by dropout
- A Dense hidden layer (ReLU) and a Dense output layer with sigmoid activation

The model predicts probabilities for each emotion label independently.
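
As a sanity check on the architecture, the arithmetic below works out the sequence length after the convolution and the parameter count per layer. It assumes the hyperparameters used in the next cell (vocabulary 10,000, embedding size 64, 128 filters of width 5, 64 hidden units, 28 labels) and the Keras Conv1D defaults of `'valid'` padding and stride 1. The totals should line up with `model.summary()`, but treat them as a hand calculation rather than recorded output.

```python
vocab_size, embed_dim = 10000, 64
seq_len = 150
filters, kernel_size = 128, 5
hidden_units, num_labels = 64, 28

# 'valid' padding with stride 1 shortens the sequence by kernel_size - 1.
conv_out_len = seq_len - kernel_size + 1

# Weights plus biases per layer; pooling and dropout have no parameters.
params = {
    "embedding": vocab_size * embed_dim,
    "conv1d": kernel_size * embed_dim * filters + filters,
    "dense_hidden": filters * hidden_units + hidden_units,
    "dense_output": hidden_units * num_labels + num_labels,
}
total_params = sum(params.values())

print("conv output length:", conv_out_len)
for name, n in params.items():
    print(f"{name:<13} {n:>8,}")
print(f"{'total':<13} {total_params:>8,}")
```

The embedding table dominates the parameter count, which is one reason pre-trained or smaller embeddings are a common next experiment.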

In [10]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense, Dropout
from tensorflow.keras.metrics import AUC, Precision, Recall

model = Sequential([
    Embedding(input_dim=10000, output_dim=64),
    Conv1D(128, 5, activation='relu'),
    GlobalMaxPooling1D(),
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dense(len(class_list), activation='sigmoid')
])

model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy', AUC(name='auc'), Precision(name='precision'), Recall(name='recall')]
)

model.build(input_shape=(None, X_pad.shape[1]))
model.summary()

## 4. Model Training & Evaluation

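We train the CNN with binary cross-entropy loss, holding out the last 20% of the data for validation, and track AUC, precision, and recall. The `accuracy` column is less informative here: with a 28-unit output, Keras typically resolves the `'accuracy'` string to categorical accuracy, which only checks the top-scoring label.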

In [11]:
model.fit(X_pad, y, epochs=15, batch_size=32, validation_split=0.2)

Epoch 1/15
1086/1086 ━━━━━━━━━━━━━━━━━━━━ 35s 30ms/step - accuracy: 0.2957 - auc: 0.7345 - loss: 0.1888 - precision: 0.2442 - recall: 0.0737 - val_accuracy: 0.4656 - val_auc: 0.8826 - val_loss: 0.1155 - val_precision: 0.6531 - val_recall: 0.3480
Epoch 2/15
1086/1086 ━━━━━━━━━━━━━━━━━━━━ 41s 30ms/step - accuracy: 0.4767 - auc: 0.8890 - loss: 0.1130 - precision: 0.7033 - recall: 0.3199 - val_accuracy: 0.5039 - val_auc: 0.9084 - val_loss: 0.1058 - val_precision: 0.6816 - val_recall: 0.3740
Epoch 3/15
1086/1086 ━━━━━━━━━━━━━━━━━━━━ 40s 37ms/step - accuracy: 0.5283 - auc: 0.9253 - loss: 0.0985 - precision: 0.7377 - recall: 0.3845 - val_accuracy: 0.5189 - val_auc: 0.9129 - val_loss: 0.1040 - val_precision: 0.6751 - val_recall: 0.3990
Epoch 4/15
1086/1086 ━━━━━━━━━━━━━━━━━━━━ 31s 29ms/step - accuracy: 0.5710 - auc: 0.9427 - loss: 0.0882 - precision: 0.7682 -

<keras.src.callbacks.history.History at 0x7ac3b9530290>
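
The precision and recall columns in the log pool decisions across all labels. A minimal sketch of that micro-averaged calculation, extended with F1, is shown below on hand-written multi-hot rows. The function name `micro_prf` and the toy arrays are hypothetical, and the sketch assumes predictions have already been thresholded to 0/1.

```python
def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall and F1 over all (example, label) pairs."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy multi-hot targets and thresholded predictions (3 examples, 3 labels).
y_true_toy = [[1, 0, 0], [0, 1, 1], [0, 0, 1]]
y_pred_toy = [[1, 0, 0], [0, 1, 0], [1, 0, 1]]

precision, recall, f1 = micro_prf(y_true_toy, y_pred_toy)
print(precision, recall, f1)
```

Micro-averaging is dominated by frequent labels. Computing the same quantities per label and averaging them (macro-averaging) would give rare emotions equal weight.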

## 5. Inference on a Sample

In [12]:
sample_text = ["A young boy discovers he has magical powers and attends a school for wizards."]
sample_seq = tokenizer.texts_to_sequences(sample_text)
sample_pad = pad_sequences(sample_seq, maxlen=150, padding='post')

pred_probs = model.predict(sample_pad)
pred_labels = (pred_probs >= 0.5).astype(int)  # threshold each label independently
predicted_emotions = mlb.inverse_transform(pred_labels)
print("Predicted emotions:", predicted_emotions)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 84ms/step
Predicted emotions: [('neutral',)]
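
A fixed 0.5 threshold can return no label at all when every probability is low. One possible variation, sketched here in pure Python, falls back to the single highest-scoring label in that case. The function `decode_labels`, the class names, and the probabilities are all hypothetical illustrations, not outputs of the model above.

```python
def decode_labels(probs, class_names, threshold=0.5):
    """Return every label at or above the threshold, or the top label if none qualifies."""
    picked = [name for name, p in zip(class_names, probs) if p >= threshold]
    if not picked:
        best = max(range(len(probs)), key=lambda i: probs[i])
        picked = [class_names[best]]
    return picked

toy_names = ["joy", "sadness", "neutral"]

print(decode_labels([0.62, 0.08, 0.55], toy_names))  # ['joy', 'neutral']
print(decode_labels([0.31, 0.12, 0.44], toy_names))  # ['neutral']
```

Whether a fallback is desirable depends on the application: returning nothing can be the more honest answer when the model is unsure.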


## 6. Conclusion & Next Steps

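A small CNN is a workable baseline for multilabel text classification: in the epochs logged above, validation AUC reached about 0.91, with validation precision near 0.68 and recall near 0.40 at a 0.5 threshold. The low recall indicates that many true labels are still missed.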

Next steps could be:

- Adding pre-trained embeddings (e.g., GloVe)
- Using attention mechanisms
- Experimenting with more complex architectures (e.g., LSTM, transformers)
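
As a starting point for the first item, the sketch below builds an embedding matrix from GloVe-style text lines (a word followed by space-separated floats). The function `build_embedding_matrix`, the two-dimensional vectors, and the tiny `toy_word_index` are hypothetical. In practice the lines would come from a downloaded GloVe file and the word index from `tokenizer.word_index`.

```python
def build_embedding_matrix(glove_lines, word_index, num_words, dim):
    """Rows follow the tokenizer's indices; words without a vector stay all-zero."""
    vectors = {}
    for line in glove_lines:
        parts = line.split()
        vectors[parts[0]] = [float(x) for x in parts[1:]]

    matrix = [[0.0] * dim for _ in range(num_words)]
    for word, idx in word_index.items():
        if idx < num_words and word in vectors:
            matrix[idx] = vectors[word]
    return matrix

# Hypothetical 2-dimensional "GloVe" lines and a toy word index.
toy_glove = ["happy 0.1 0.2", "sad -0.3 0.4"]
toy_word_index = {"<OOV>": 1, "happy": 2, "angry": 3, "sad": 7}

matrix = build_embedding_matrix(toy_glove, toy_word_index, num_words=4, dim=2)
print(matrix)
```

Converted to a NumPy array, such a matrix could then be used to initialize the `Embedding` layer, for example through `embeddings_initializer=tf.keras.initializers.Constant(...)`, with `output_dim` set to the GloVe dimension.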