<h1 align="center">
  <a href="https://uptrain.ai">
    <img width="300" src="https://user-images.githubusercontent.com/108270398/214240695-4f958b76-c993-4ddd-8de6-8668f4d0da84.png" alt="uptrain">
  </a>
</h1>

<h1 style="text-align: center;">Fine-tuning a Large-Language Model [WIP]</h1>

### Install required packages
- [PyTorch](https://pytorch.org/get-started/locally/): Deep learning framework.
- [Hugging Face Transformers](https://huggingface.co/docs/transformers/installation): To use pretrained state-of-the-art models.
- [Hugging Face Datasets](https://pypi.org/project/datasets/): To use public Hugging Face datasets.
- [IPywidgets](https://ipywidgets.readthedocs.io/en/stable/user_install.html): For interactive notebook widgets.

In [1]:
!pip install torch transformers[torch] datasets ipywidgets nltk uptrain



In [49]:
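import json

import nltk
import pandas as pd
import uptrain
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Local helper modules shipped with this example; the star imports supply names
# used below such as model_checkpoint, test_model and create_dataset_from_csv.
from model_constants import *
from model_train import retrain_model
from helper_funcs import *

# SentimentIntensityAnalyzer needs the VADER lexicon; fetch it once if missing.
nltk.download("vader_lexicon", quiet=True)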

Define a few test cases to compare the model's predictions before and after retraining.

In [50]:
testing_texts = [
    "Nike shoes are very [MASK].", "Website is [MASK].",
    "customer service is [MASK].", "store is [MASK].",
    "Price is [MASK].", "delivery is [MASK].",
    "design is [MASK].", "fitting is [MASK].",
]
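
`test_model` comes from `helper_funcs` and is not shown in this notebook. Judging by the outputs further down, it returns the five most likely fillers for `[MASK]`. The selection step can be sketched as follows, assuming the model's scores for the masked position are already available as a token-to-score mapping. `top_k_tokens` and the toy scores are hypothetical and only illustrate the idea.

```python
import heapq


def top_k_tokens(mask_scores, k=5):
    """Return the k highest-scoring tokens, best first."""
    return heapq.nlargest(k, mask_scores, key=mask_scores.get)


# Invented scores for the [MASK] slot (illustrative only, not real model output).
toy_scores = {
    "popular": 7.1, "expensive": 6.8, "durable": 6.2,
    "common": 5.9, "comfortable": 5.7, "blue": 2.3, "the": 0.4,
}
top_five = top_k_tokens(toy_scores)
print(top_five)
```

In a real run the scores would be the logits the masked language model produces at the `[MASK]` position, with token ids mapped back to strings by the tokenizer.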

In [51]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
original_model_outputs = [test_model(model, x) for x in testing_texts]

loading configuration file config.json from cache at C:\Users\KANCHAN KUMAR KAITY/.cache\huggingface\hub\models--distilbert-base-uncased\snapshots\1c4513b2eedbda136f57676a34eea67aba266e5c\config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.26.1",
  "vocab_size": 30522
}

loading file vocab.txt from cache at C:\Users\KANCHAN KUMAR KAITY/.cache\huggingface\hub\models--distilbert-base-uncased\snapshots\1c4513b2eedbda136f57676a34eea67aba266e5c\vocab.txt
loading file tokenizer.json from cache at C:\Users\KANCHAN KUMA

In [52]:
# Create Nike review training dataset
nike_attrs = {
    "version": "0.1.0",
    'source': "nike review dataset",
    'url': 'https://www.kaggle.com/datasets/tinkuzp23/nike-onlinestore-customer-reviews?resource=download',
}
# Download the dataset from the url, unzip it and copy the csv file here
nike_reviews_dataset1 = create_dataset_from_csv("web_scraped.csv", "Content", "nike_reviews_data.json")

Let's use the Nike online store customer reviews from Kaggle and filter them with UpTrain signals to build the dataset we retrain our model on. Please download the data from the [link](https://www.kaggle.com/datasets/tinkuzp23/nike-onlinestore-customer-reviews?resource=download) and unzip it here.
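
The real `create_dataset_from_csv` lives in `helper_funcs` and is not reproduced here. From the way its output is read later (`all_data['data']`, then `sample['text']`), the JSON file appears to hold a `data` list of `{"text": ...}` records. A minimal sketch of such a conversion, under that assumption, is shown below. `csv_column_to_json` is a hypothetical stand-in and ignores the `min_samples` argument of the real helper.

```python
import csv
import json
import os
import tempfile


def csv_column_to_json(csv_path, column, json_path):
    """Copy one text column of a CSV into {"data": [{"text": ...}, ...]}."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = [
            {"text": row[column]}
            for row in csv.DictReader(f)
            if (row[column] or "").strip()  # skip empty cells
        ]
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump({"data": rows}, f)
    return json_path


# Tiny demo on a throwaway CSV (the two reviews are invented examples).
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "reviews.csv")
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerows([["Title", "Content"], ["a", "Great shoes"],
                          ["b", ""], ["c", "Late delivery"]])
    json_path = csv_column_to_json(csv_path, "Content", os.path.join(tmp, "reviews.json"))
    with open(json_path, encoding="utf-8") as f:
        demo_data = json.load(f)

print(demo_data)
```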
  

In [58]:
def nike_positive_sentiment_func(inputs, outputs, gts=None, extra_args={}):
    # Words that mark a review as negative regardless of its VADER score
    negative_words = ['expensive', 'worn', 'cheap', 'inexpensive', 'dirty', 'bad',
                      'worst', 'incomplete', 'defunct', 'not satisfactory']
    sia = SentimentIntensityAnalyzer()  # build the analyzer once, not per sample
    is_positives = []
    for text in inputs["text"]:
        txt = text.lower()
        score = sia.polarity_scores(txt)

        # Negative if VADER finds little positive content or a listed word appears
        is_negative = score['pos'] < 0.25 or any(word in txt for word in negative_words)
        is_positives.append(not is_negative)
    return is_positives

cfg = {
    'checks': [{
        'type': uptrain.Anomaly.EDGE_CASE,
        "signal_formulae": uptrain.Signal("Nike Positive Sentiment", nike_positive_sentiment_func)
    }],

    # Define where to save the retraining dataset
    'retraining_folder': "uptrain_smart_data",
    
    # Define when to retrain; use a very large number because we are using UpTrain only to create the retraining dataset
    'retrain_after': 10000000000
}

framework = uptrain.Framework(cfg)

Deleting the folder:  uptrain_smart_data
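
The signal above checks for negative words with plain substring matching, so a word like `bad` would also fire on `badge`. If that turns out to matter for your data, one possible refinement is to match whole words only. The sketch below is an illustration, not part of the original notebook. `contains_negative_word` is a hypothetical helper, and the word list is copied from the signal function.

```python
import re

NEGATIVE_WORDS = ["expensive", "worn", "cheap", "inexpensive", "dirty", "bad",
                  "worst", "incomplete", "defunct", "not satisfactory"]

# \b anchors every alternative at word boundaries, so "bad" cannot match inside "badge".
_NEGATIVE_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in NEGATIVE_WORDS) + r")\b"
)


def contains_negative_word(text):
    """True if any listed word occurs in the text as a whole word or phrase."""
    return _NEGATIVE_PATTERN.search(text.lower()) is not None


print("bad" in "nice badge on the heel")                  # substring check
print(contains_negative_word("nice badge on the heel"))   # whole-word check
print(contains_negative_word("The sole is worn out."))
```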


In [59]:
with open(nike_reviews_dataset1) as f:
    all_data = json.load(f)

for sample in all_data['data']:
    inputs = {'data': {'text': [sample['text']]}}
    framework.log(inputs = inputs, outputs = None)

50  edge cases identified out of  135  total samples


In [60]:
print("Number of samples filtered for retraining: ", len(pd.read_csv("uptrain_smart_data/1/smart_data.csv")))
retraining_dataset = create_dataset_from_csv("uptrain_smart_data/1/smart_data.csv", "text", "retrain_dataset.json", min_samples=1000)

Number of samples filtered for retraining:  82


In [61]:
retrain_model(model, retraining_dataset)
retrained_model_outputs = [test_model(model, x) for x in testing_texts]

PyTorch: setting up devices
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `DistilBertForMaskedLM.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 23
  Batch size = 64

The following columns in the training set don't have a corresponding argument in `DistilBertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `DistilBertForMaskedLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 204
  Num Epochs = 3
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 12
  Number of trainable parameters = 66985530

>>>Before training, Perplexity: 27.42

{'eval_loss': 3.236717462539673, 'eval_runtime': 3.7296, 'eval_samples_per_second': 6.167, 'eval_steps_per_second': 0.268, 'epoch': 1.0}
{'eval_loss': 2.964031934738159, 'eval_runtime': 4.9192, 'eval_samples_per_second': 4.676, 'eval_steps_per_second': 0.203, 'epoch': 2.0}

Training completed. Do not forget to share your model on huggingface.co/models =)

{'eval_loss': 3.0635790824890137, 'eval_runtime': 5.7526, 'eval_samples_per_second': 3.998, 'eval_steps_per_second': 0.174, 'epoch': 3.0}
{'train_runtime': 529.1098, 'train_samples_per_second': 1.157, 'train_steps_per_second': 0.023, 'train_loss': 3.102147420247396, 'epoch': 3.0}

>>>After training, Perplexity: 25.08
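
Perplexity is conventionally the exponential of the mean cross-entropy loss, so lower is better. The sketch below applies that formula to the `eval_loss` values logged during training. It assumes `retrain_model` (whose source is not shown here) uses the same definition. The printed before/after perplexities come from separate evaluation passes, so they need not equal the exponential of these per-epoch losses, for example if the masked positions are re-sampled on every pass.

```python
import math


def perplexity(mean_cross_entropy_loss):
    """Perplexity as exp of the mean cross-entropy loss (natural log base)."""
    return math.exp(mean_cross_entropy_loss)


# eval_loss values taken from the training log of the first dataset.
eval_loss_by_epoch = {
    1: 3.236717462539673,
    2: 2.964031934738159,
    3: 3.0635790824890137,
}
for epoch, loss in eval_loss_by_epoch.items():
    print(f"epoch {epoch}: eval_loss={loss:.3f} -> perplexity={perplexity(loss):.2f}")
```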


In [68]:
[original_model_outputs, retrained_model_outputs]

[[['popular', 'expensive', 'durable', 'common', 'comfortable'],
  ['closed', 'defunct', 'open', 'incomplete', 'available']],
 [['popular', 'expensive', 'comfortable', 'durable', 'good'],
  ['defunct', 'closed', 'available', 'open', 'incomplete']]]
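
Nested lists like the one above are hard to compare by eye. A small helper can report which words entered or left the top predictions for each test sentence. `diff_top_predictions` is a hypothetical convenience function, applied here to the two result lists printed above.

```python
def diff_top_predictions(before, after):
    """For each test sentence, list words that entered or left the top predictions."""
    return [
        {"added": [w for w in new if w not in old],
         "dropped": [w for w in old if w not in new]}
        for old, new in zip(before, after)
    ]


original = [
    ["popular", "expensive", "durable", "common", "comfortable"],
    ["closed", "defunct", "open", "incomplete", "available"],
]
retrained = [
    ["popular", "expensive", "comfortable", "durable", "good"],
    ["defunct", "closed", "available", "open", "incomplete"],
]

changes = diff_top_predictions(original, retrained)
for change in changes:
    print(change)
```

For the first sentence `good` replaces `common`. For the second, the same five words appear and only their order changes.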

Now repeat the same steps on a second dataset.

In [63]:
# Create Nike review training dataset
nike_attrs = {
    "version": "0.1.0",
    'source': "nike_feedback",
    'url': 'kaggle kernels output asjad2024/distilbert-based-uncased-fine-tuning-on-custom -p /path/to/dest',
}
# Download the dataset from the url, unzip it and copy the csv file here
nike_reviews_dataset2 = create_dataset_from_csv("nike_2020_04_13.csv", "Description", "nike_reviews_data2.json")

In [64]:
def nike_positive_sentiment_func(inputs, outputs, gts=None, extra_args={}):
    # Words that mark a review as negative regardless of its VADER score
    negative_words = ['expensive', 'worn', 'cheap', 'inexpensive', 'dirty', 'bad', 'worst']
    sia = SentimentIntensityAnalyzer()  # build the analyzer once, not per sample
    is_positives = []
    for text in inputs["text"]:
        txt = text.lower()
        score = sia.polarity_scores(txt)

        # Negative if VADER finds little positive content or a listed word appears
        is_negative = score['pos'] < 0.25 or any(word in txt for word in negative_words)
        is_positives.append(not is_negative)
    return is_positives

cfg = {
    'checks': [{
        'type': uptrain.Anomaly.EDGE_CASE,
        "signal_formulae": uptrain.Signal("Nike Positive Sentiment", nike_positive_sentiment_func)
    }],

    # Define where to save the retraining dataset
    'retraining_folder': "uptrain_smart_data2",
    
    # Define when to retrain; use a very large number because we are using UpTrain only to create the retraining dataset
    'retrain_after': 10000000000
}

framework = uptrain.Framework(cfg)

Deleting the folder:  uptrain_smart_data2


In [65]:
with open(nike_reviews_dataset2) as f:
    all_data = json.load(f)

for sample in all_data['data']:
    inputs = {'data': {'text': [sample['text']]}}
    framework.log(inputs = inputs, outputs = None)

50  edge cases identified out of  348  total samples


In [66]:
print("Number of samples filtered for retraining: ", len(pd.read_csv("uptrain_smart_data2/1/smart_data.csv")))
retraining_dataset2 = create_dataset_from_csv("uptrain_smart_data2/1/smart_data.csv", "text", "retrain_dataset2.json", min_samples=1000)

Number of samples filtered for retraining:  82


In [67]:
retrain_model(model, retraining_dataset2)
retrained_model_outputs2 = [test_model(model, x) for x in testing_texts]

PyTorch: setting up devices
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `DistilBertForMaskedLM.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 51
  Batch size = 64

The following columns in the training set don't have a corresponding argument in `DistilBertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `DistilBertForMaskedLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 455
  Num Epochs = 3
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 24
  Number of trainable parameters = 66985530

>>>Before training, Perplexity: 39.71

{'eval_loss': 3.2097129821777344, 'eval_runtime': 10.0262, 'eval_samples_per_second': 5.087, 'eval_steps_per_second': 0.1, 'epoch': 1.0}
{'eval_loss': 3.063210964202881, 'eval_runtime': 8.4339, 'eval_samples_per_second': 6.047, 'eval_steps_per_second': 0.119, 'epoch': 2.0}

Training completed. Do not forget to share your model on huggingface.co/models =)

{'eval_loss': 2.7324538230895996, 'eval_runtime': 10.1552, 'eval_samples_per_second': 5.022, 'eval_steps_per_second': 0.098, 'epoch': 3.0}
{'train_runtime': 1297.9011, 'train_samples_per_second': 1.052, 'train_steps_per_second': 0.018, 'train_loss': 3.3099263509114585, 'epoch': 3.0}

>>>After training, Perplexity: 20.74
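
The "Total optimization steps" reported in the two training logs follow directly from the number of training examples, the batch size and the number of epochs. The sketch below reproduces that arithmetic, assuming a single device, gradient accumulation of 1 and a final partial batch that is kept, which matches the settings printed in the logs. `total_optimization_steps` is a hypothetical helper name.

```python
import math


def total_optimization_steps(num_examples, batch_size, num_epochs):
    """Batches per epoch (last partial batch included) times number of epochs."""
    return math.ceil(num_examples / batch_size) * num_epochs


# Values taken from the two training logs in this notebook.
steps_first_dataset = total_optimization_steps(204, 64, 3)
steps_second_dataset = total_optimization_steps(455, 64, 3)
print(steps_first_dataset, steps_second_dataset)
```

With only 12 and 24 optimizer steps, these runs are very short, which is worth keeping in mind when reading the before/after comparisons.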


In [69]:
[original_model_outputs, retrained_model_outputs2]

[[['popular', 'expensive', 'durable', 'common', 'comfortable'],
  ['closed', 'defunct', 'open', 'incomplete', 'available']],
 [['popular', 'comfortable', 'durable', 'expensive', 'versatile'],
  ['free', 'open', 'available', 'defunct', 'closed']]]

In [None]:
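def remove_word_func(inputs, outputs, gts=None, extra_args={}):
    # Other brand names; lowercased because each review is lowercased before matching
    # (the original mixed-case names could never match the lowercased text)
    other_brands = [brand.lower() for brand in [
        'Adidas', 'Burberry', 'Gucci', 'Jimmy', 'Salvatore Ferragamo', 'Bugatti',
        'Airwalk', 'Lacoste', 'Lee Cooper', 'Red Tape', 'Fila', 'Balenciaga', 'Puma',
        'Levis', 'Tommy Hilfiger', 'Jordan', 'Reebok', 'Woodland',
        'Sparx', 'Red Chief', 'Diesel', 'Calvin Klein', 'US Polo']]
    sia = SentimentIntensityAnalyzer()  # build the analyzer once, not per sample
    is_positives = []
    for text in inputs["text"]:
        txt = text.lower()
        score = sia.polarity_scores(txt)

        # Reject if VADER finds little positive content or another brand is mentioned
        is_negative = score['pos'] < 0.25 or any(brand in txt for brand in other_brands)
        is_positives.append(not is_negative)
    return is_positives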

cfg = {
    'checks': [{
        'type': uptrain.Anomaly.EDGE_CASE,
        "signal_formulae": uptrain.Signal("Nike word remove", remove_word_func)
    }],

    # Define where to save the retraining dataset
    'retraining_folder': "uptrain_smart_data3",
    
    # Define when to retrain; use a very large number because we are using UpTrain only to create the retraining dataset
    'retrain_after': 10000000000
}

framework = uptrain.Framework(cfg)

In [None]:
with open(nike_reviews_dataset2) as f:
    all_data = json.load(f)

for sample in all_data['data']:
    inputs = {'data': {'text': [sample['text']]}}
    framework.log(inputs = inputs, outputs = None)

In [None]:
print("Number of samples filtered for retraining: ", len(pd.read_csv("uptrain_smart_data3/1/smart_data.csv")))
retraining_dataset3 = create_dataset_from_csv("uptrain_smart_data3/1/smart_data.csv", "text", "retrain_dataset3.json", min_samples=1000)

In [None]:
retrain_model(model, retraining_dataset3)
retrained_model_outputs3 = [test_model(model, x) for x in testing_texts]

In [None]:
[original_model_outputs, retrained_model_outputs3]