# NLP Ecommerce Text Classification
### By Shreeyansh
#### shreeyanshparihar@gmail.com | +971 529412388 | +91 9530056916

Prelimnary Step: Install required Python Libary. In case only transformaer is need to be installed

In [6]:
!pip install transformers

Defaulting to user installation because normal site-packages is not writeable


### Importing required libraries.

In [7]:
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
from transformers import EarlyStoppingCallback
import torch
from tqdm import tqdm
import os
os.environ["WANDB_DISABLED"] = "true"

### Create a batch tokenizer to encode tags in data set.

In [8]:
def get_batch_tokenizer(tokenizer, dataset):
    return tokenizer.batch_encode_plus(dataset,
                                       max_length=256,
                                       padding=True,
                                       truncation=True,
                                       add_special_tokens=True,
                                       return_attention_mask=True,
                                       return_tensors='pt')

### Process tokenized dataset and input generate to final dataset.

In [9]:
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val
                in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

### Method to compute precision, accuracy etc. after training the model and testing with predictions

In [10]:
def compute_metrics(p):
    prediction, labels = p
    preds_flat = np.argmax(prediction, axis=1).flatten()
    labels_flat = labels.flatten()
    f1 = f1_score(labels_flat, preds_flat, average='macro')
    return {"f1": f1}

### Load data from CSV and convert to a list

In [11]:
df = pd.read_csv("ecommerceDataset.csv", names=["labels", "descriptions"])
descriptions = df["descriptions"].map(str).values.tolist()
labels = df["labels"].values.tolist()

le = LabelEncoder()
labels = le.fit_transform(labels).tolist()

### Load BERT Sequence Classification pretrained model

In [14]:
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=4)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

### Load BERT Tokenizer

In [15]:
tokenizer = BertTokenizer.from_pretrained(
        "bert-base-uncased",
        do_lower_case=True)

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

### Split the dataset into training, validation and test Dataset

In [16]:
x_train, x_test, y_train, y_test = train_test_split(descriptions, labels, test_size=0.4, stratify=labels, random_state=42)
x_valid, x_test, y_valid, y_test = train_test_split(x_test, y_test, test_size=0.5, random_state=42)

### Tokenize the tags in batches

In [17]:
x_train_tokens = get_batch_tokenizer(tokenizer, x_train)
x_valid_tokens = get_batch_tokenizer(tokenizer, x_valid)
x_test_tokens = get_batch_tokenizer(tokenizer, x_test)

### Preparing Final Dataset with the tokenised labels

In [18]:
train_dataset = Dataset(x_train_tokens, y_train)
valid_dataset = Dataset(x_valid_tokens, y_valid)
test_dataset = Dataset(x_test_tokens, y_test)

### Preparing Training Arguments

In [19]:
args = TrainingArguments(output_dir="output",
                            evaluation_strategy="epoch",
                            metric_for_best_model="f1",
                            save_strategy="epoch",
                            num_train_epochs=3,
                            load_best_model_at_end=True
                            )

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


### Preparing the Trainer Object & Train model

In [20]:
trainer = Trainer(args=args,
                    model=model,
                    train_dataset=train_dataset,
                    eval_dataset=valid_dataset,
                    compute_metrics=compute_metrics,
                    callbacks=[EarlyStoppingCallback(
                            early_stopping_patience=3)]
                    )

In [21]:
trainer.train()

***** Running training *****
  Num examples = 30255
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 11346
  item = {key: torch.tensor(val[idx]) for key, val


Epoch,Training Loss,Validation Loss,F1
1,0.2093,0.200522,0.965516
2,0.0938,0.146698,0.97505
3,0.055,0.133567,0.977441


***** Running Evaluation *****
  Num examples = 10085
  Batch size = 8
Saving model checkpoint to output\checkpoint-3782
Configuration saved in output\checkpoint-3782\config.json
Model weights saved in output\checkpoint-3782\pytorch_model.bin
  item = {key: torch.tensor(val[idx]) for key, val
***** Running Evaluation *****
  Num examples = 10085
  Batch size = 8
Saving model checkpoint to output\checkpoint-7564
Configuration saved in output\checkpoint-7564\config.json
Model weights saved in output\checkpoint-7564\pytorch_model.bin
  item = {key: torch.tensor(val[idx]) for key, val
***** Running Evaluation *****
  Num examples = 10085
  Batch size = 8
Saving model checkpoint to output\checkpoint-11346
Configuration saved in output\checkpoint-11346\config.json
Model weights saved in output\checkpoint-11346\pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from output\checkpoint-11346 (score: 0.9774410552863159).


TrainOutput(global_step=11346, training_loss=0.13990063367225328, metrics={'train_runtime': 85003.9155, 'train_samples_per_second': 1.068, 'train_steps_per_second': 0.133, 'total_flos': 1.194085189020672e+16, 'train_loss': 0.13990063367225328, 'epoch': 3.0})

### Test the trained model with the test dataset

In [22]:
trainer = Trainer(model=model)
predictions = trainer.predict(test_dataset)

No `TrainingArguments` passed, using `output_dir=tmp_trainer`.
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
***** Running Prediction *****
  Num examples = 10085
  Batch size = 8
  item = {key: torch.tensor(val[idx]) for key, val


### Get the predicted value and true values, then prepare a classificaton report

In [23]:
preds = np.argmax(predictions.predictions, axis=1).flatten()
true_vals = predictions.label_ids

In [24]:
print(classification_report(true_vals, preds, target_names=list(le.classes_)))

                        precision    recall  f1-score   support

                 Books       0.99      0.97      0.98      2335
Clothing & Accessories       0.98      0.99      0.99      1772
           Electronics       0.98      0.97      0.97      2111
             Household       0.98      0.99      0.98      3867

              accuracy                           0.98     10085
             macro avg       0.98      0.98      0.98     10085
          weighted avg       0.98      0.98      0.98     10085



### Manual Testing

In [32]:
print(le.classes_)
print(true_vals)
print(preds)
print(x_test[0])
print(y_test[0])

['Books' 'Clothing & Accessories' 'Electronics' 'Household']
[3 3 0 ... 0 2 2]
[3 3 0 ... 0 2 2]
Healthgenie Water Bed (Colour May Vary) Healthgenie Water Bed ensure long life and comfort. It is made up of single textured rubberised fabric that is skin friendly and its variable firmness allows it to adjust according to your body which makes it perfect. It not only gives aesthetic pleasure but is also a complete solution to back ache, spinal problem, burns, bedsores, arthritis, general surgery, cardiac rehabilitation, cystic fibrosis, cerebral palsy and multiple sclerosis. It helps you sleep in your natural body position and foster better circulation. It comes in standard size that can be easily placed on your bed. Based on the water therapy it is known to relax both body and mind. Further it comes in various colours to add grace to your living room. It is easy to clean and maintain. And most importantly it is leak proof. Features, single textured rubberised fabric. Completely leak proo

In [39]:
manual_test = ['The iPhone 14 and iPhone 14 Plus will be available for pre-order starting today and will cost $799 and $899, respectively. The iPhone 14 will be more widely available on September 16th, while the iPhone 14 Plus will hit stores on October 7th.This year, the Pro phones have a noticeably different design than previous iPhones. The rumors about a pill-shaped cutout turned out to be true — the screen notch is now gone and has been replaced by a floating space that houses the front-facing cameras as well as Apple\'s privacy dots, which turn on when apps use your camera or microphone. From a software standpoint, that space is dubbed the "Dynamic Island" as it will change and expand to adapt to what you\'re doing on your iPhone, notifications you receive and more.The iPhone 14 Pro has a 6.1-inch display while the Pro Max has a 6.7-inch screen, and they\'re always-on for the first time ever. Apple designed the panel to be as power efficient as possible, dynamically adjusting the refresh rate down to as low as 1Hz when necessary. The new Lock Screen in iOS 16 can show a bunch of things on the display like the time, widgets, live activities and more, and the Pro screens will do things like automatically dim to preserve power while continuing to show you relevant information, Lock Screen photos and backgrounds and more.As expected, the Pro handsets run on Apple\'s new A16 Bionic chip and they have an updated rear camera array along with a new TrueDepth front-facing camera. The rear setup includes a new 12MP telephoto lens, a 12MP ultra wide camera and a 48-megapixel main shooter that has a 65-percent larger sensor than that in the iPhone 13 Pro. The Pro phones will also support all of the new features found on the standard iPhone 14 models, including 5G and eSIM connectivity, crash detection, Emergency SOS with Satellite and more.']

In [40]:
manual_test_token = get_batch_tokenizer(tokenizer,manual_test)
manual_test_dataset = Dataset(manual_test_token,[2])

In [45]:
manual_pred = trainer.predict(manual_test_dataset)
manual_test_prediction = np.argmax(manual_pred.predictions, axis=1).flatten()
print(le.classes_[manual_test_prediction[0]])

***** Running Prediction *****
  Num examples = 1
  Batch size = 8
  item = {key: torch.tensor(val[idx]) for key, val


Electronics


In [46]:
trainer.save_model("trained_model")

Saving model checkpoint to trained_model
Configuration saved in trained_model\config.json
Model weights saved in trained_model\pytorch_model.bin


## Program Ends Here 
### NLP Ecommerce Text Classification
#### By Shreeyansh
##### shreeyanshparihar@gmail.com | +971 529412388 | +91 9530056916