In [2]:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Looking in indexes: https://download.pytorch.org/whl/cu118
Note: you may need to restart the kernel to use updated packages.


In [3]:
# In your Python script or notebook
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\suyas\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\suyas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Problem Approach: Multi-Label Classification using BERT

## 1. **Understanding the Problem**
   The task is to assign each item in the dataset to one or more categories (Level 1 Factors). Because an item can belong to several categories at once, this is a multi-label classification problem: the model must predict the full set of categories from the item description.

## 2. **Data Exploration**
   - **Training Data:** The `bodywash-train.xlsx` file contains the `Core Item` column (text descriptions of items) and the `Level 1 Factors` column (comma-separated categories).
   - **Test Data:** The `bodywash-test.xlsx` file contains only the `Core Item` column and needs to be labeled based on the trained model.

## 3. **Data Preprocessing**
   To prepare the data for the BERT model:
   - **Text Cleaning:** Applied preprocessing techniques to clean the item descriptions:
     - Converted text to lowercase.
     - Removed special characters, numbers, and punctuation.
     - Removed stopwords using the NLTK library.
     - Removed extra whitespaces.
   - **Multi-Label Encoding:** The `Level 1 Factors` column (comma-separated labels) was transformed into a binary format using `MultiLabelBinarizer`, turning it into a format suitable for multi-label classification.
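
The encoding step can be illustrated without scikit-learn. The sketch below is a plain-Python approximation of what `MultiLabelBinarizer` produces, not its actual implementation; the helper names `fit_classes` and `to_multi_hot` and the sample label strings are hypothetical and chosen only for illustration.

```python
def fit_classes(label_strings):
    """Collect the sorted set of unique labels from comma-separated strings."""
    classes = set()
    for s in label_strings:
        classes.update(part.strip() for part in s.split(",") if part.strip())
    return sorted(classes)


def to_multi_hot(label_strings, classes):
    """Turn each comma-separated string into a 0/1 row, one column per class."""
    index = {name: i for i, name in enumerate(classes)}
    rows = []
    for s in label_strings:
        row = [0] * len(classes)
        for part in s.split(","):
            name = part.strip()
            if name:
                row[index[name]] = 1
        rows.append(row)
    return rows


# Hypothetical label strings, for illustration only
sample_labels = ["Fragrance", "Cleansing, Fragrance", "Price,Brand Value"]
classes = fit_classes(sample_labels)
encoded = to_multi_hot(sample_labels, classes)
print(classes)
print(encoded)
```

Each row has one column per category, and a row may contain several ones. Stripping whitespace around each label matters: without it, `"Cleansing, Fragrance"` would yield a separate `" Fragrance"` class.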

## 4. **Model Selection**
   I first tried **Logistic Regression**, but its F1-score was low, so I moved to **BERT (Bidirectional Encoder Representations from Transformers)** for the following reasons:
   - **Contextual Understanding**: BERT's attention mechanism allows it to understand context by considering both the left and right sides of a word.
   - **Pre-trained Weights**: The pre-trained `bert-base-uncased` model provides a strong starting point for fine-tuning on specific tasks, avoiding the need for training a model from scratch.
   - **Adaptability for Multi-Label Classification**: BERT can be easily adapted to multi-label tasks by modifying its classification head and loss function.

## 5. **Model Training**
   - **Tokenizer:** Used the pre-trained BERT tokenizer (`bert-base-uncased`) to convert the item descriptions into token IDs and attention masks.
   - **Custom Dataset:** Defined a custom PyTorch `Dataset` to feed tokenized data and labels into the model.
   - **Model Architecture:** Loaded a pre-trained BERT model (`BertForSequenceClassification`) with a classification head. The head was configured for multi-label classification by setting `num_labels` to the number of unique categories in the dataset and `problem_type` to `"multi_label_classification"`.
   - **Loss Function:** Used `BCEWithLogitsLoss` (Binary Cross-Entropy with Logits) since it's suited for multi-label classification tasks.
   - **Optimizer:** AdamW optimizer was chosen with a learning rate of `1e-5` to fine-tune the model.
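
To make the loss concrete, here is a minimal plain-Python sketch of the quantity that `BCEWithLogitsLoss` computes with its default mean reduction. It is an illustration under stated assumptions rather than PyTorch's implementation: the function name `bce_with_logits` and the sample logits and targets are hypothetical. Per element it uses the numerically stable form `max(x, 0) - x*y + log(1 + exp(-|x|))`, which equals binary cross-entropy applied to `sigmoid(x)`.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all (item, label) pairs, from raw logits."""
    total, count = 0.0, 0
    for row_logits, row_targets in zip(logits, targets):
        for x, y in zip(row_logits, row_targets):
            # Stable form of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))]
            total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
            count += 1
    return total / count


# Hypothetical values: two items, two labels each
logits = [[0.0, 0.0], [2.0, -2.0]]
targets = [[1, 0], [1, 0]]
loss = bce_with_logits(logits, targets)
print(loss)
```

Each label is scored independently, which is why this loss suits a task where several labels can be correct for the same item.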

## 6. **Training Loop**
   - **Batch Training:** Data was fed into the model in batches using PyTorch's `DataLoader`.
   - **Backpropagation:** For each batch, the model's logits (predictions) were compared to the true labels, and the loss was calculated. Gradients were then backpropagated to update model weights.
   - **Validation:** After each epoch, the model was evaluated on a validation set to monitor its performance.
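
The notebook monitors validation loss only. Since F1 was the metric used to judge the earlier baseline, a micro-averaged F1 over the validation predictions is one possible additional check. The sketch below is a plain-Python illustration with a hypothetical `micro_f1` helper and made-up label matrices; it is not part of the original training code.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool true/false positives over every (item, label) cell."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    denom = 2 * tp + fp + fn
    # With no positives anywhere, F1 is undefined; return 0.0 by convention here
    return 2 * tp / denom if denom else 0.0


# Hypothetical multi-hot matrices: two items, three labels
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
score = micro_f1(y_true, y_pred)
print(score)
```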

## 7. **Prediction on Test Data**
   - **Test Dataset:** A similar tokenization process was applied to the test dataset.
   - **Prediction:** The trained model was used to predict the categories (Level 1 Factors) for each item in the test set.
   - **Thresholding:** The model's output (logits) was passed through a sigmoid function to convert them into probabilities. A threshold of 0.5 was used to determine which categories to assign to each item.
   - **Inverse Transformation:** The predicted binary labels were transformed back into their corresponding category names using the `inverse_transform` method from `MultiLabelBinarizer`.
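
The thresholding and inverse-transformation steps can be sketched in plain Python as follows. The class list, the logits, and the `decode` helper are hypothetical; in the notebook the same roles are played by `mlb.classes_`, the model output, and `mlb.inverse_transform`.

```python
import math


def sigmoid(x):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode(logit_rows, classes, threshold=0.5):
    """Return, for each item, the tuple of class names whose probability exceeds the threshold."""
    results = []
    for row in logit_rows:
        results.append(
            tuple(name for name, x in zip(classes, row) if sigmoid(x) > threshold)
        )
    return results


# Hypothetical classes and logits: two items, four labels
classes = ["Brand Value", "Cleansing", "Fragrance", "Price"]
logit_rows = [[-3.0, 1.2, 2.5, -0.4], [0.3, -2.0, -1.5, -4.0]]
predicted = decode(logit_rows, classes)
print(predicted)
```

Because the sigmoid is applied to each label separately, an item can receive several labels or none. With a threshold of 0.5, a label is selected exactly when its logit is positive.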

## 8. **Saving Results**
   - The predictions were added as a new column in the test dataset.
   - The final results were saved to an Excel file (`bodywash_test_predictions.xlsx`).

## 9. **Conclusion**
   This approach fine-tunes a pre-trained BERT model for multi-label classification. The validation loss fell in each of the three training epochs, which suggests the model was learning the label structure. Because the test file has no labels, prediction quality on the test data was not measured here.


In [6]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification
from torch.utils.data import DataLoader, Dataset
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from tqdm import tqdm
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords

# Download stopwords if needed
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Load data (adapt this to your data loading method)
df_train = pd.read_excel('bodywash-train.xlsx')
df_test = pd.read_excel('bodywash-test.xlsx')

# Clean the text data
def clean_text(text):
    # Convert text to lowercase
    text = text.lower()
    
    # Remove special characters, numbers, and punctuation
    text = re.sub(r'[^a-z\s]', '', text)
    
    # Remove stopwords (optional)
    text = ' '.join([word for word in text.split() if word not in stop_words])
    
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

# Apply cleaning to the 'Core Item' column
df_train['Core Item'] = df_train['Core Item'].apply(clean_text)
df_test['Core Item'] = df_test['Core Item'].apply(clean_text)

# 'Core Item' holds the item text; 'Level 1 Factors' holds comma-separated labels
items = df_train['Core Item'].tolist()
# Strip whitespace around each label so that 'A, B' and 'A,B' yield the same classes
factors = df_train['Level 1 Factors'].apply(
    lambda x: [label.strip() for label in x.split(',')]
).tolist()

# MultiLabelBinarizer to encode the factors
mlb = MultiLabelBinarizer()
y_train = mlb.fit_transform(factors)  # Converts factors into a multi-hot (binary indicator) matrix
labels_list = mlb.classes_  # List of unique labels

# Load pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Custom dataset class for loading tokenized data
class CustomDataset(Dataset):
    def __init__(self, items, labels):
        self.items = items
        self.labels = labels
    
    def __len__(self):
        return len(self.items)
    
    def __getitem__(self, idx):
        item = self.items[idx]
        label = self.labels[idx]
        
        # Tokenize the item
        encoding = tokenizer.encode_plus(
            item,
            add_special_tokens=True,
            truncation=True,
            padding='max_length',
            max_length=128,
            return_tensors='pt',
            return_attention_mask=True
        )
        
        input_ids = encoding['input_ids'].squeeze()
        attention_mask = encoding['attention_mask'].squeeze()
        
        return input_ids, attention_mask, torch.tensor(label)

# Split dataset into train and validation
X_train, X_val, y_train, y_val = train_test_split(items, y_train, test_size=0.1, random_state=42)

# Create PyTorch DataLoader
train_dataset = CustomDataset(X_train, y_train)
val_dataset = CustomDataset(X_val, y_val)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)

# Load pre-trained BERT model with a classification head
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=y_train.shape[1], problem_type="multi_label_classification")

# Move model to GPU if available
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

# Define optimizer
optimizer = optim.AdamW(model.parameters(), lr=1e-5)

# Define BCEWithLogitsLoss for multi-label classification
loss_fn = nn.BCEWithLogitsLoss()

# Training loop
epochs = 3
for epoch in range(epochs):
    model.train()
    loop = tqdm(train_loader, leave=True)
    
    for batch in loop:
        input_ids, attention_mask, labels = [b.to(device) for b in batch]
        
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask)
        
        # Pass raw logits to the loss; BCEWithLogitsLoss applies the sigmoid internally
        logits = outputs.logits
        loss = loss_fn(logits, labels.float())  # Labels must be float for BCEWithLogitsLoss
        
        loss.backward()
        optimizer.step()
        
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())
    
    # Validation loop (optional)
    model.eval()
    val_loss = 0
    for batch in val_loader:
        input_ids, attention_mask, labels = [b.to(device) for b in batch]
        
        with torch.no_grad():
            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            loss = loss_fn(logits, labels.float())
            val_loss += loss.item()
    
    print(f"Validation Loss after epoch {epoch}: {val_loss / len(val_loader)}")

# Save the trained model (optional)
model.save_pretrained('./bert-multi-label-classifier')
tokenizer.save_pretrained('./bert-multi-label-classifier')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\suyas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Epoch 0: 100%|█████████████████████████████████████████████████████████| 436/436 [5:07:52<00:00, 42.37s/it, loss=0.203]


Validation Loss after epoch 0: 0.2055515193817567


Epoch 1: 100%|█████████████████████████████████████████████████████████| 436/436 [1:59:59<00:00, 16.51s/it, loss=0.182]


Validation Loss after epoch 1: 0.1950129690218945


Epoch 2: 100%|█████████████████████████████████████████████████████████| 436/436 [1:33:41<00:00, 12.89s/it, loss=0.212]


Validation Loss after epoch 2: 0.18517818682047785


('./bert-multi-label-classifier\\tokenizer_config.json',
 './bert-multi-label-classifier\\special_tokens_map.json',
 './bert-multi-label-classifier\\vocab.txt',
 './bert-multi-label-classifier\\added_tokens.json')

## Observation

After training the BERT model for multi-label classification over 3 epochs, the following key observations were made:

### 1. **Training Performance**:
   - The loss shown in the progress bar is the loss of the final batch of each epoch, not an epoch average, so it is a noisy indicator:
     - **Epoch 0**: final-batch loss = 0.203
     - **Epoch 1**: final-batch loss = 0.182
     - **Epoch 2**: final-batch loss = 0.212
   - The value fell from epoch 0 to epoch 1 and rose slightly in epoch 2. A single-batch fluctuation of this size is not by itself evidence that training degraded.

### 2. **Validation Performance**:
   - The validation loss, averaged over all validation batches, decreased in every epoch, which suggests the model was generalizing to the held-out validation data:
     - **Validation Loss after Epoch 0**: 0.2055
     - **Validation Loss after Epoch 1**: 0.1950
     - **Validation Loss after Epoch 2**: 0.1852
   - Because the validation loss was still falling after epoch 2, the model may continue to improve in later epochs. This was not tested here.
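
The training figures come from the progress bar, which displays the loss of whichever batch finished last (the value passed to `set_postfix`). A steadier training signal is the mean loss over all batches in an epoch, computed the same way the validation loop averages its loss. The sketch below shows the idea in plain Python; `epoch_mean` and the batch losses are hypothetical values for illustration, not results from this run.

```python
def epoch_mean(batch_losses):
    """Average the per-batch losses collected during one epoch."""
    return sum(batch_losses) / len(batch_losses)


# Hypothetical per-batch losses from a single epoch
batch_losses = [0.31, 0.24, 0.19, 0.26]
last_batch = batch_losses[-1]
mean_loss = epoch_mean(batch_losses)
print(f"last batch: {last_batch:.3f}, epoch mean: {mean_loss:.3f}")
```

In the training loop this would amount to appending `loss.item()` to a list for each batch and reporting the mean after the inner loop ends.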


In [61]:
import numpy as np
import torch.nn.functional as F
# Prepare test data for prediction
test_items = df_test['Core Item'].tolist()

# Create a CustomDataset for the test set (without labels)
class TestDataset(Dataset):
    def __init__(self, items):
        self.items = items
    
    def __len__(self):
        return len(self.items)
    
    def __getitem__(self, idx):
        item = self.items[idx]
        
        # Tokenize the item
        encoding = tokenizer.encode_plus(
            item,
            add_special_tokens=True,
            truncation=True,
            padding='max_length',
            max_length=128,
            return_tensors='pt',
            return_attention_mask=True
        )
        
        input_ids = encoding['input_ids'].squeeze()
        attention_mask = encoding['attention_mask'].squeeze()
        
        return input_ids, attention_mask

# Create DataLoader for test data
test_dataset = TestDataset(test_items)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

# Make predictions on the test set
model.eval()
predictions = []
with torch.no_grad():
    for batch in test_loader:
        input_ids, attention_mask = [b.to(device) for b in batch]
        
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        
        # Apply sigmoid to convert logits into independent per-label probabilities
        probs = torch.sigmoid(logits).cpu().numpy()
        predictions.append(probs)

# Stack the per-batch arrays into one (num_items, num_labels) array
probabilities = np.concatenate(predictions, axis=0)

# Threshold the sigmoid probabilities directly. Do not apply softmax here:
# it would force each row to sum to 1 and prevent several labels from being selected together.
threshold = 0.5
binary_predictions = (probabilities > threshold).astype(int)

# Convert binary predictions back to factors (multi-label)
predicted_factors = mlb.inverse_transform(binary_predictions)

# Add the predicted factors to the test DataFrame
df_test['Predicted Factors'] = [', '.join(factors) for factors in predicted_factors]

# Save the results to a new Excel file
df_test.to_excel('bodywash_test_predictions.xlsx', index=False)

print("Predictions saved to 'bodywash_test_predictions.xlsx'")

Predictions saved to 'bodywash_test_predictions.xlsx'


In [62]:
predicted_factors

[('Fragrance',),
 ('Fragrance',),
 ('Fragrance',),
 ('Fragrance',),
 ('Fragrance',),
 ('Cleansing',),
 ('Fragrance',),
 ('Fragrance',),
 ('Fragrance',),
 ('Cleansing',),
 ('Cleansing',),
 ('Brand Value',),
 ('Fragrance',),
 ('Fragrance',),
 ('Fragrance',),
 ('Fragrance',),
 ('Fragrance',),
 ('Fragrance',),
 ('Brand Value',),
 ('Brand Value',),
 ('Fragrance',),
 ('Fragrance',),
 ('Fragrance',),
 ('Fragrance',),
 ('Fragrance',),
 ('Fragrance',),
 ('Brand Value',),
 ('Fragrance',),
 ('Fragrance',),
 ('Fragrance',),
 ('Fragrance',),
 ('Brand Value',),
 ('Fragrance',),
 ('Fragrance',),
 ('Fragrance',),
 ('Fragrance',),
 ('Brand Value',),
 ('Fragrance',),
 ('Price',),
 ('Brand Value',),
 ('Brand Value',),
 ('Fragrance',),
 ('Brand Value',),
 ('Brand Value',),
 ('Fragrance',),
 ('Brand Value',),
 ('Fragrance',),
 ('Fragrance',),
 ('Brand Value',),
 ('Fragrance',),
 ('Brand Value',),
 ('Fragrance',),
 ('Fragrance',),
 ('Fragrance',),
 ('Fragrance',),
 ('Fragrance',),
 ('Fragrance',),
 ('Fragra