<h1 align="center">
  <a href="https://uptrain.ai">
    <img width="300" src="https://user-images.githubusercontent.com/108270398/214240695-4f958b76-c993-4ddd-8de6-8668f4d0da84.png" alt="uptrain">
  </a>
</h1>

<h1 style="text-align: center;">Fine-tuning a Large Language Model</h1>

### Install required packages
- [PyTorch](https://pytorch.org/get-started/locally/): Deep learning framework.
- [Hugging Face Transformers](https://huggingface.co/docs/transformers/installation): Pretrained state-of-the-art models.
- [Hugging Face Datasets](https://pypi.org/project/datasets/): Public Hugging Face datasets.
- [IPywidgets](https://ipywidgets.readthedocs.io/en/stable/user_install.html): Interactive notebook widgets.
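
Before running the cells below, it can help to confirm that the packages are actually installed. The following is a minimal sketch using only the standard library. The distribution names in `REQUIRED` are assumptions about how the packages were installed from PyPI; `vaderSentiment` and `uptrain` are included because the import cell uses them, although they are not in the list above.

```python
from importlib import metadata

# Assumed PyPI distribution names; adjust if your environment differs.
REQUIRED = ["torch", "transformers", "datasets", "ipywidgets",
            "vaderSentiment", "uptrain"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```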

In [2]:
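import json

import pandas as pd
import uptrain
from transformers import AutoModelForMaskedLM, AutoTokenizer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Local modules shipped with this example. The wildcard imports are expected
# to provide model_checkpoint, test_model and create_dataset_from_csv.
from model_constants import *
from model_train import retrain_model
from helper_funcs import *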

In [3]:
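# Load the pretrained tokenizer and masked-language model
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

# Record the model's predictions for the masked token before fine-tuning
testing_text = "Nike shoes are very [MASK]."
original_model_outputs = test_model(model, testing_text)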
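
The `test_model` helper is not shown in this notebook. As a rough illustration of what a masked-token test can look like, here is a sketch built on the Transformers `fill-mask` pipeline. The function name `top_mask_predictions` is hypothetical, and the checkpoint in the usage comment is an assumption that is merely consistent with the `DistilBertForMaskedLM` messages in the training log further down; the real `test_model` may work differently.

```python
def top_mask_predictions(text, checkpoint, k=5, mask_token="[MASK]"):
    """Return the k most likely fillers for the mask token as (token, score) pairs.

    Hypothetical helper for illustration; not the notebook's test_model.
    """
    if text.count(mask_token) != 1:
        raise ValueError(f"expected exactly one {mask_token} in the input text")

    # Imported lazily so the input check above runs even without transformers.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=checkpoint)
    predictions = fill_mask(text, top_k=k)
    return [(p["token_str"], p["score"]) for p in predictions]


# Example usage (downloads the checkpoint on first run; checkpoint is assumed):
# top_mask_predictions("Nike shoes are very [MASK].", "distilbert-base-uncased")
```

Calling this once before and once after fine-tuning gives two ranked lists that can be compared directly.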

In [4]:
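# Load the raw Nike review data and keep only the review-related columns
pd1 = pd.read_csv('raw_data/training_data.csv')
pd1.drop(columns=['id_', 'gender', 'title', 'url', 'category', 'price',
                  'description', 'description_long', 'size', 'r_date'],
         inplace=True)
# Drop rows that have neither a review title nor a review body
pd1.drop(pd1[pd1['r_body'].isna() & pd1['r_title'].isna()].index, inplace=True)
# Sort by rating, lowest first ('r_raiting' is the column name as spelled in the CSV)
pd1.sort_values(by='r_raiting', ascending=True, inplace=True, kind='quicksort')
pd1.reset_index(inplace=True, drop=True)
pd1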

| Unnamed: 0 | n_reviews | score | comfort | durability | r_title | r_raiting | r_body |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 169.0 | 4.6 | 90.0 | 76.0 | Not worth the money | 1.0 | Choose another shoe. Not worth it. Size too sm... |
| 1 | 62.0 | 4.3 | 81.0 | 50.0 | Disappointed | 1.0 | Nice look but eyelets broke in less than month... |
| 2 | 36.0 | 4.0 | 87.5 | 93.0 |  | 1.0 | First pair of Nike shoes that didn’t come clos... |
| 3 | 125.0 | 4.2 | 79.5 | 75.5 | very uncomfortable wearing and doesn't look good. | 1.0 | I bought these pair of nike shoes recently, on... |
| 4 | 51.0 | 3.7 | 70.5 | 61.0 | ripped on first use | 1.0 | The fabric at the toe ripped while skating the... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 23273 | 72.0 | 4.8 | 85.5 | 68.0 | Jordan 92's | 5.0 | Bought the shoes as a birthday gift and they l... |
| 23274 | 72.0 | 4.8 | 85.5 | 68.0 | Awesomely fast shipping. | 5.0 | Super fast shipping. Make sure you make a Nike... |
| 23275 | 72.0 | 4.8 | 85.5 | 68.0 |  | 5.0 | Nice shoe. |
| 23276 | 72.0 | 4.8 | 85.5 | 68.0 | Highly recommend | 5.0 | Shoes feel great! |


In [5]:
# Convert the r_title column of the real-world test CSV into a JSON dataset
real_world_test_cases = 'raw_data/real_world_testing_data.csv'
training_dataset = create_dataset_from_csv(real_world_test_cases, "r_title", "real_world_testing_data.json")

In [5]:
def is_negative_sentiment(inputs, outputs, gts=None, extra_args=None):
    """Flag each input text that is not clearly positive according to VADER."""
    sia = SentimentIntensityAnalyzer()  # build once and reuse for every text
    is_negatives = []
    for txt in inputs["text"]:
        score = sia.polarity_scores(txt.lower())
        # Negative if the positive share is low or the compound score is negative
        is_negative = score['pos'] < 0.25 or score['compound'] <= -0.05
        is_negatives.append(bool(is_negative))
    return is_negatives

uptrain_save_fold_name = "uptrain_smart_data_bert"
# Signal that marks negative-sentiment texts as edge cases
negative_sentiment_signal = uptrain.Signal("Negative Sentiment", is_negative_sentiment)

cfg = {
    'checks': [{
        'type': uptrain.Anomaly.EDGE_CASE,
        "signal_formulae": negative_sentiment_signal
    }],

    # Folder where the collected retraining dataset is saved
    'retraining_folder': uptrain_save_fold_name,

    # Retraining is triggered once this many edge cases have been collected
    'retrain_after': 100
}

framework = uptrain.Framework(cfg)

Deleting the folder:  uptrain_smart_data_bert
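
The edge-case check above combines two thresholds on the VADER scores: a text is flagged when its `pos` share is below 0.25 or its `compound` score is at most -0.05 (the latter is VADER's conventional cutoff for negative text). The sketch below isolates that rule so its behavior is easy to inspect. The score dictionaries are hypothetical values chosen for illustration, not real VADER outputs, and `is_negative` is a stand-in name.

```python
def is_negative(scores, pos_threshold=0.25, compound_threshold=-0.05):
    """Apply the notebook's flagging rule to one VADER-style score dictionary."""
    return (scores["pos"] < pos_threshold
            or scores["compound"] <= compound_threshold)


# Hypothetical score dictionaries, not produced by VADER.
clearly_positive = {"pos": 0.60, "compound": 0.70}
no_sentiment_words = {"pos": 0.00, "compound": 0.00}
mixed_but_negative = {"pos": 0.30, "compound": -0.40}

for label, scores in [("clearly positive", clearly_positive),
                      ("no sentiment words", no_sentiment_words),
                      ("mixed but negative", mixed_but_negative)]:
    print(f"{label}: flagged={is_negative(scores)}")
```

Note that the rule also flags texts with no sentiment-bearing words at all, because their `pos` share is below the threshold. In effect it selects everything that is not clearly positive, which is a broader set than strictly negative reviews.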


In [6]:
# raw_dataset = create_sample_dataset("data.json")
# Stream the real-world samples through the framework so it can collect edge cases
with open('real_world_testing_data.json', 'r') as f:
    all_data = json.load(f)

for sample in all_data['data']:
    inputs = {'data': {'text': [sample['text']]}}
    framework.log(inputs=inputs, outputs=None)

# Build the retraining dataset from the edge cases saved by the framework
retraining_dataset = create_dataset_from_csv(uptrain_save_fold_name + "/1/smart_data.csv", "text", "retrain_dataset.json")

50  edge cases identified out of  142  total samples
100  edge cases identified out of  263  total samples

Kicking off re-training


AttributeError: 'DatasetHandler' object has no attribute 'annotation_method'

In [6]:
# Fine-tune on the collected edge cases, then repeat the masked-token test
retrain_model(model, retraining_dataset)
retrained_model_outputs = test_model(model, testing_text)

Using custom data configuration default-2370e3cf0f5387dd


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Downloading and preparing dataset json/default to /Users/vipul/.cache/huggingface/datasets/json/default-2370e3cf0f5387dd/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...


Dataset json downloaded and prepared to /Users/vipul/.cache/huggingface/datasets/json/default-2370e3cf0f5387dd/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `DistilBertForMaskedLM.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 11
  Batch size = 64
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


The following columns in the training set don't have a corresponding argument in `DistilBertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `DistilBertForMaskedLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 92
  Num Epochs = 3
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 6
  Number of trainable parameters = 66985530


>>>Before training, Perplexity: 10.48


Epoch,Training Loss,Validation Loss
1,1.6319,0.936632
2,1.2835,1.153624
3,1.1519,0.998965


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `DistilBertForMaskedLM.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 11
  Batch size = 64


Training completed. Do not forget to shar

>>>After training, Perplexity: 2.63
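
The perplexity figures printed by the training helper are not derived in this notebook. A common convention for masked language models, assumed here, is that perplexity is the exponential of the mean cross-entropy loss on the evaluation set. The sketch below applies that convention to the validation losses from the epoch table above; whether `retrain_model` computes its numbers exactly this way is not confirmed by the source.

```python
import math


def perplexity(mean_cross_entropy_loss):
    """Perplexity under the assumed convention: exp of the mean loss."""
    return math.exp(mean_cross_entropy_loss)


# Validation losses copied from the epoch table in the training output.
validation_losses = {1: 0.936632, 2: 1.153624, 3: 0.998965}

for epoch, loss in validation_losses.items():
    print(f"epoch {epoch}: loss={loss:.4f}, perplexity={perplexity(loss):.2f}")
```

The values computed this way may differ from the printed before and after figures if those come from a separate evaluation pass, for example one in which the masked positions are sampled again.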


In [7]:
# print([original_model_outputs, retrained_model_outputs])

# # Create Nike review training dataset
# nike_attrs = {
#     "version": "0.1.0",
#     'source': "nike review dataset",
#     'url': 'https://www.kaggle.com/datasets/tinkuzp23/nike-onlinestore-customer-reviews?resource=download',
# }
# # Download the dataset from the URL, unzip it, and copy the CSV file here
# raw_nike_reviews_dataset = create_dataset_from_csv("web_scrapped.csv", "Content", "raw_nike_reviews_data.json")