## Installing Necessary Libraries

In [2]:
!pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git

In [3]:
!pip install -q datasets bitsandbytes einops wandb

## Loading the Data Set

In [4]:
import pandas as pd
df = pd.read_csv('dataforLLM.csv')

In [5]:
df.head()

Unnamed: 0.1,Unnamed: 0,merged_review
0,0,why does it look like someone spit on my food?...
1,1,it'd mcdonalds. it is what it is as far as the...
2,2,made a mobile order got to the speaker and che...
3,3,my mc. crispy chicken sandwich was customer se...
4,4,"i repeat my order 3 times in the drive thru, a..."


In [6]:
df.drop(['Unnamed: 0'], axis = 1, inplace = True)

In [7]:
df.head()

Unnamed: 0,merged_review
0,why does it look like someone spit on my food?...
1,it'd mcdonalds. it is what it is as far as the...
2,made a mobile order got to the speaker and che...
3,my mc. crispy chicken sandwich was customer se...
4,"i repeat my order 3 times in the drive thru, a..."


The dataset is structured to suit the instruction-style fine-tuning we will do:
each row has the form `question->:answer`, i.e. the review text, then `->:`, then its sentiment label.
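
As a minimal sketch of that format, assuming each raw record has a review text and a sentiment label, one training string can be assembled as below. The helper name `to_training_text` and the lowercasing step are illustrative assumptions, not taken from the notebook's own preprocessing.

```python
SEPARATOR = "->:"  # separator seen in the merged_review column


def to_training_text(review: str, label: str) -> str:
    """Join a review and its label into one `question->:answer` string."""
    # Lowercasing and stripping are assumptions made to match the sample rows.
    return f"{review.strip().lower()}{SEPARATOR}{label}"


# Two short hypothetical records.
examples = [("Good", "positive"), ("Poor", "negative")]
formatted = [to_training_text(review, label) for review, label in examples]
print(formatted)
```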

Here we split the dataset into training and test sets.

In [12]:
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.05, random_state=42)

In [13]:
train_df.head(10)

Unnamed: 0,merged_review
4877,good->:positive
4032,excellent->:positive
22739,"curbside is a joke waited for 15 mins, should ..."
9090,nice accurate and quick service. thanks!->:pos...
9279,"average service and food, long wait times at n..."
18809,poor->:negative
17950,neutral->:negative
19386,very modern mcdonald's. to go order pickup sti...
13875,if there is any mcdonalds world wide that dese...
14359,"fast, got my 2 grape jellies!->:positive"


In [14]:
test_df.head(10)

Unnamed: 0,merged_review
29153,this review is about the staff at this locatio...
30570,worst experience ever. customer service is the...
23381,disappointing. mcdonalds is my standby for fas...
23825,we went here for breakfast. the place could ha...
1398,good->:positive
8639,they forget everything and charge you for it->...
20984,good->:positive
6293,soooo slow!!!! came here to try and get a quic...
9845,month->:negative
22365,not from florida..but every mcdonald's i visit...


In [15]:
len(test_df)

1630

In [16]:
test_df.to_csv('test.csv')

Here we wrap the training split in a Hugging Face `DatasetDict`.

In [17]:
from datasets import Dataset,DatasetDict
train_dataset_dict = DatasetDict({
    "train": Dataset.from_pandas(train_df),
})



In [18]:
train_dataset_dict

DatasetDict({
    train: Dataset({
        features: ['merged_review', '__index_level_0__'],
        num_rows: 30970
    })
})
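
The row counts reported above can be cross-checked against the `test_size=0.05` argument. The arithmetic below is only a sanity check on the printed numbers (1630 test rows, 30970 training rows); it does not re-run the split.

```python
n_test = 1630    # len(test_df) printed above
n_train = 30970  # num_rows of the train split printed above

n_total = n_train + n_test
test_fraction = n_test / n_total

print(n_total)                  # total rows before splitting
print(round(test_fraction, 4))  # share of rows held out for testing
```

The extra `__index_level_0__` feature is the pandas index carried over by `Dataset.from_pandas`; passing `preserve_index=False` should drop it if it is not needed.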

## Loading the model

We load the [Falcon 7B model](https://huggingface.co/tiiuae/falcon-7b) (here from the sharded checkpoint `ybelkada/falcon-7b-sharded-bf16`), quantize it to 4-bit, and later attach LoRA adapters to it.

In [19]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "ybelkada/falcon-7b-sharded-bf16"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True
)
model.config.use_cache = False

Loading checkpoint shards: 100%|██████████| 8/8 [00:10<00:00,  1.29s/it]
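
To see why 4-bit loading matters on a single GPU, here is a rough back-of-the-envelope estimate of weight memory. It assumes about 7 billion parameters and ignores quantization metadata, layers kept in higher precision, activations, and optimizer state, so treat the numbers as approximate.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 7e9  # assumption: roughly 7 billion parameters

bf16_gib = weight_memory_gib(N_PARAMS, 16)
nf4_gib = weight_memory_gib(N_PARAMS, 4)

print(f"bf16 weights:  ~{bf16_gib:.1f} GiB")
print(f"4-bit weights: ~{nf4_gib:.1f} GiB")
```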


Loading the tokenizer for the model

In [20]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
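
The effect of these two settings can be illustrated without the tokenizer itself. The sketch below pads hypothetical token-id lists on the right with id 11 (the `eos_token_id` reported in the generation log later in this notebook) and builds the matching attention mask. `pad_right` is an illustrative helper, not a `transformers` function.

```python
PAD_ID = 11  # Falcon's eos_token_id, reused here as the pad token


def pad_right(batch, pad_id=PAD_ID):
    """Right-pad token-id lists to a common length and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return input_ids, attention_mask


# Hypothetical token ids for two sequences of different length.
input_ids, attention_mask = pad_right([[5, 6, 7], [8]])
print(input_ids)
print(attention_mask)
```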

## Base model predictions before fine-tuning

In [16]:
import transformers
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)


sequences = pipeline(
   ["“no restrooms, no seats to eat, no stars” ->:","good ->:","the nastiest mcdonald's i have ever been in!! ->:"],
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq[0]['generated_text']}")

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
The current implementation of Falcon calls `torch.scaled_dot_product_attention` directly, this will be deprecated in the future in favor of the `BetterTransformer` API. Please install the latest optimum library with `pip install -U optimum` and call `model.to_bettertransformer()` to benefit from `torch.scaled_dot_product_attention` and future performance optimizations.


Result: “no restrooms, no seats to eat, no stars” ->:
“no restrooms, no seats to eat, no stars, no service”
“It's a very good place. The staff is very friendly and they make it worth the money”
Result: good ->: [url][/url] and here is the "problem": when we try to run the game from the emulator (we have tried to run the game in several emulators like "nes emulator 0.85", "nes emulator 0.86.2 beta4", "nes emulator 0.85 beta 2", etc, we get the following errors from the emulator: - we tried to run the game with all the options we can configure in the emulator (like in "nes emulator 0.85 beta", we set "screen ratio" to "full screen", we have set "screen refresh rate" to 60hz (we can set up to 70hz), etc. the screen is not full and it is "jumping" a lot (it seems like a screen refresh rate is not set to 60hz and "it jumps" to 30
Result: the nastiest mcdonald's i have ever been in!! ->:
(Source: f-l-a-t-t-a-l-e-s)
I have a new love for the beach! I love swimming in the ocean and just sittin

As the output above shows, the pre-trained model produces essentially unrelated continuations drawn from its web pre-training data rather than a sentiment label. We have to fine-tune the model.
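
Part of the run-to-run variation comes from the decoding settings: with `do_sample=True` and `top_k=10`, each next token is drawn at random from the ten most likely candidates. A minimal sketch of that sampling step, using made-up logits over a toy vocabulary rather than the model's real outputs:

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample one token index from the k highest-scoring logits."""
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept logits only (shifted by the max for stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
toy_logits = [2.0, 0.5, 1.5, -1.0, 0.0]  # made-up scores for 5 tokens
samples = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(20)]
print(sorted(set(samples)))  # only indices from the top-2 set can appear
```

With `k=1` the function reduces to greedy decoding, which is one way to get repeatable labels at inference time.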

In the code block below we use the Parameter-Efficient Fine-Tuning (PEFT) library to configure LoRA for fine-tuning.

In [21]:
from peft import LoraConfig

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "query_key_value",
        "dense",
        "dense_h_to_4h",
        "dense_4h_to_h",
    ]
)
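
To make the effect of `r` and `lora_alpha` concrete: LoRA leaves a weight matrix of shape `d_out x d_in` frozen and learns two small matrices of shapes `d_out x r` and `r x d_in`, whose product is scaled by `lora_alpha / r`. The sketch below counts parameters for one square projection, assuming a hidden size of 4544 for Falcon 7B (an assumption; check `model.config.hidden_size`).

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter: A is r x d_in, B is d_out x r."""
    return r * d_in + d_out * r


HIDDEN = 4544           # assumption: Falcon 7B hidden size
R, LORA_ALPHA = 64, 16  # values from the LoraConfig above

full = HIDDEN * HIDDEN                          # parameters in the frozen square weight
adapter = lora_param_count(HIDDEN, HIDDEN, R)   # parameters in the LoRA adapter
scaling = LORA_ALPHA / R                        # factor applied to the low-rank update

print(full, adapter, f"{adapter / full:.2%}", scaling)
```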

In [22]:
from transformers import TrainingArguments

outputDir = "./results"
eval_accumulation_steps = 1

training_arguments = TrainingArguments(
    output_dir=outputDir,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=10,
    logging_steps=1,
    learning_rate=2e-4,
    fp16=True,
    max_grad_norm=0.3,
    max_steps=50,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
)
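
It helps to work out how much data this configuration actually touches. With a per-device batch size of 1, no gradient accumulation, and `max_steps=50`, the run sees 50 examples (assuming a single GPU), which is a tiny fraction of the 30,970 training rows:

```python
per_device_batch = 1
grad_accum_steps = 1
max_steps = 50
n_devices = 1         # assumption: single-GPU run
n_train_rows = 30970  # num_rows of the train split

effective_batch = per_device_batch * grad_accum_steps * n_devices
examples_seen = effective_batch * max_steps
epoch_fraction = examples_seen / n_train_rows

print(effective_batch, examples_seen, round(epoch_fraction, 4))
```

This is why the training log reports `'epoch': 0.0` at every step. Note also that with `lr_scheduler_type="constant"` the logged learning rate is 2e-4 from the first step, so `warmup_ratio` appears to have no effect here; `constant_with_warmup` would be the scheduler to try if warmup is wanted.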

Then finally pass everything to the trainer.

In [23]:
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset_dict['train'],
    peft_config=peft_config,
    dataset_text_field="merged_review",
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_arguments,
)

Map: 100%|██████████| 30970/30970 [00:00<00:00, 61914.46 examples/s]


We will also pre-process the model by upcasting the layer norms to float32 for more stable training.

In [24]:
for name, module in trainer.model.named_modules():
    if "norm" in name:
        module = module.to(torch.float32)

## Train the model

Now let's train the model by calling `trainer.train()`.

In [25]:
trainer.train()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: manojcsathreya. Use `wandb login --relogin` to force relogin


  0%|          | 0/50 [00:00<?, ?it/s]
You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
  2%|▏         | 1/50 [00:01<01:34,  1.94s/it]

{'loss': 3.0028, 'learning_rate': 0.0002, 'epoch': 0.0}


  4%|▍         | 2/50 [00:02<00:51,  1.08s/it]

{'loss': 3.0889, 'learning_rate': 0.0002, 'epoch': 0.0}


  6%|▌         | 3/50 [00:02<00:37,  1.25it/s]

{'loss': 3.539, 'learning_rate': 0.0002, 'epoch': 0.0}


  8%|▊         | 4/50 [00:03<00:30,  1.50it/s]

{'loss': 3.0936, 'learning_rate': 0.0002, 'epoch': 0.0}


 10%|█         | 5/50 [00:03<00:26,  1.68it/s]

{'loss': 3.0216, 'learning_rate': 0.0002, 'epoch': 0.0}


 12%|█▏        | 6/50 [00:04<00:24,  1.82it/s]

{'loss': 3.1341, 'learning_rate': 0.0002, 'epoch': 0.0}


 14%|█▍        | 7/50 [00:04<00:22,  1.92it/s]

{'loss': 3.3309, 'learning_rate': 0.0002, 'epoch': 0.0}


 16%|█▌        | 8/50 [00:05<00:20,  2.02it/s]

{'loss': 3.2059, 'learning_rate': 0.0002, 'epoch': 0.0}


 18%|█▊        | 9/50 [00:05<00:19,  2.08it/s]

{'loss': 2.9408, 'learning_rate': 0.0002, 'epoch': 0.0}


 20%|██        | 10/50 [00:06<00:18,  2.13it/s]

{'loss': 2.8298, 'learning_rate': 0.0002, 'epoch': 0.0}


 22%|██▏       | 11/50 [00:09<00:49,  1.26s/it]

{'loss': 2.9967, 'learning_rate': 0.0002, 'epoch': 0.0}


 24%|██▍       | 12/50 [00:09<00:38,  1.01s/it]

{'loss': 2.629, 'learning_rate': 0.0002, 'epoch': 0.0}


 26%|██▌       | 13/50 [00:10<00:31,  1.19it/s]

{'loss': 3.4977, 'learning_rate': 0.0002, 'epoch': 0.0}


 28%|██▊       | 14/50 [00:10<00:25,  1.39it/s]

{'loss': 4.2979, 'learning_rate': 0.0002, 'epoch': 0.0}


 30%|███       | 15/50 [00:10<00:22,  1.58it/s]

{'loss': 3.3466, 'learning_rate': 0.0002, 'epoch': 0.0}


 32%|███▏      | 16/50 [00:11<00:19,  1.74it/s]

{'loss': 3.9415, 'learning_rate': 0.0002, 'epoch': 0.0}


 34%|███▍      | 17/50 [00:11<00:17,  1.88it/s]

{'loss': 2.8541, 'learning_rate': 0.0002, 'epoch': 0.0}


 36%|███▌      | 18/50 [00:12<00:16,  1.99it/s]

{'loss': 2.3422, 'learning_rate': 0.0002, 'epoch': 0.0}


 38%|███▊      | 19/50 [00:12<00:14,  2.07it/s]

{'loss': 3.6921, 'learning_rate': 0.0002, 'epoch': 0.0}


 40%|████      | 20/50 [00:13<00:14,  2.14it/s]

{'loss': 3.1078, 'learning_rate': 0.0002, 'epoch': 0.0}


 42%|████▏     | 21/50 [00:15<00:30,  1.06s/it]

{'loss': 3.065, 'learning_rate': 0.0002, 'epoch': 0.0}


 44%|████▍     | 22/50 [00:15<00:24,  1.15it/s]

{'loss': 2.871, 'learning_rate': 0.0002, 'epoch': 0.0}


 46%|████▌     | 23/50 [00:16<00:19,  1.35it/s]

{'loss': 3.8499, 'learning_rate': 0.0002, 'epoch': 0.0}


 48%|████▊     | 24/50 [00:16<00:16,  1.54it/s]

{'loss': 1.6428, 'learning_rate': 0.0002, 'epoch': 0.0}


 50%|█████     | 25/50 [00:17<00:14,  1.71it/s]

{'loss': 3.4813, 'learning_rate': 0.0002, 'epoch': 0.0}


 52%|█████▏    | 26/50 [00:17<00:12,  1.86it/s]

{'loss': 2.7179, 'learning_rate': 0.0002, 'epoch': 0.0}


 54%|█████▍    | 27/50 [00:18<00:11,  1.97it/s]

{'loss': 3.5369, 'learning_rate': 0.0002, 'epoch': 0.0}


 56%|█████▌    | 28/50 [00:18<00:10,  2.06it/s]

{'loss': 2.5671, 'learning_rate': 0.0002, 'epoch': 0.0}


 58%|█████▊    | 29/50 [00:18<00:09,  2.13it/s]

{'loss': 2.604, 'learning_rate': 0.0002, 'epoch': 0.0}


 60%|██████    | 30/50 [00:19<00:09,  2.18it/s]

{'loss': 2.5468, 'learning_rate': 0.0002, 'epoch': 0.0}


 62%|██████▏   | 31/50 [00:21<00:19,  1.02s/it]

{'loss': 3.0788, 'learning_rate': 0.0002, 'epoch': 0.0}


 64%|██████▍   | 32/50 [00:22<00:15,  1.18it/s]

{'loss': 2.8037, 'learning_rate': 0.0002, 'epoch': 0.0}


 66%|██████▌   | 33/50 [00:22<00:12,  1.39it/s]

{'loss': 3.0116, 'learning_rate': 0.0002, 'epoch': 0.0}


 68%|██████▊   | 34/50 [00:23<00:10,  1.58it/s]

{'loss': 2.5272, 'learning_rate': 0.0002, 'epoch': 0.0}


 70%|███████   | 35/50 [00:23<00:08,  1.74it/s]

{'loss': 2.2237, 'learning_rate': 0.0002, 'epoch': 0.0}


 72%|███████▏  | 36/50 [00:23<00:07,  1.88it/s]

{'loss': 2.0364, 'learning_rate': 0.0002, 'epoch': 0.0}


 74%|███████▍  | 37/50 [00:24<00:06,  1.99it/s]

{'loss': 1.9987, 'learning_rate': 0.0002, 'epoch': 0.0}


 76%|███████▌  | 38/50 [00:24<00:05,  2.10it/s]

{'loss': 3.7327, 'learning_rate': 0.0002, 'epoch': 0.0}


 78%|███████▊  | 39/50 [00:25<00:05,  2.15it/s]

{'loss': 0.3635, 'learning_rate': 0.0002, 'epoch': 0.0}


 80%|████████  | 40/50 [00:25<00:04,  2.20it/s]

{'loss': 1.672, 'learning_rate': 0.0002, 'epoch': 0.0}


 82%|████████▏ | 41/50 [00:27<00:09,  1.02s/it]

{'loss': 1.799, 'learning_rate': 0.0002, 'epoch': 0.0}


 84%|████████▍ | 42/50 [00:28<00:06,  1.18it/s]

{'loss': 3.1661, 'learning_rate': 0.0002, 'epoch': 0.0}


 86%|████████▌ | 43/50 [00:28<00:05,  1.39it/s]

{'loss': 4.2681, 'learning_rate': 0.0002, 'epoch': 0.0}


 88%|████████▊ | 44/50 [00:29<00:03,  1.59it/s]

{'loss': 3.4958, 'learning_rate': 0.0002, 'epoch': 0.0}


 90%|█████████ | 45/50 [00:29<00:02,  1.75it/s]

{'loss': 3.5488, 'learning_rate': 0.0002, 'epoch': 0.0}


 92%|█████████▏| 46/50 [00:30<00:02,  1.88it/s]

{'loss': 1.8142, 'learning_rate': 0.0002, 'epoch': 0.0}


 94%|█████████▍| 47/50 [00:30<00:01,  1.99it/s]

{'loss': 0.052, 'learning_rate': 0.0002, 'epoch': 0.0}


 96%|█████████▌| 48/50 [00:30<00:00,  2.08it/s]

{'loss': 0.0022, 'learning_rate': 0.0002, 'epoch': 0.0}


 98%|█████████▊| 49/50 [00:31<00:00,  2.14it/s]

{'loss': 0.0066, 'learning_rate': 0.0002, 'epoch': 0.0}


100%|██████████| 50/50 [00:31<00:00,  2.18it/s]

{'loss': 0.0005, 'learning_rate': 0.0002, 'epoch': 0.0}


100%|██████████| 50/50 [00:34<00:00,  1.47it/s]

{'train_runtime': 36.165, 'train_samples_per_second': 1.383, 'train_steps_per_second': 1.383, 'train_loss': 2.687549252529861, 'epoch': 0.0}





TrainOutput(global_step=50, training_loss=2.687549252529861, metrics={'train_runtime': 36.165, 'train_samples_per_second': 1.383, 'train_steps_per_second': 1.383, 'train_loss': 2.687549252529861, 'epoch': 0.0})

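We stuck to the recommended hyperparameters and fine-tuned the model for 50 steps (`max_steps=50`), which is far less than one epoch. The per-step loss is noisy but trends downward: it starts around 3.0 and falls below 0.01 in the last three steps.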
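
Because each step uses a single example, individual loss values jump around a lot, so the trend is easier to judge from averages. As a small sketch, the snippet below compares the mean of the first five and the last five logged losses, copied from the output above:

```python
from statistics import mean

# Loss values copied from the training log above (steps 1-5 and 46-50).
first_losses = [3.0028, 3.0889, 3.539, 3.0936, 3.0216]
last_losses = [1.8142, 0.052, 0.0022, 0.0066, 0.0005]

first_mean = mean(first_losses)
last_mean = mean(last_losses)

print(f"mean loss, steps 1-5:   {first_mean:.3f}")
print(f"mean loss, steps 46-50: {last_mean:.3f}")
```

Fifty single-example steps are too few to draw firm conclusions; a held-out evaluation would be a more reliable signal than the training loss alone.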

In [26]:
lst_test_data = list(test_df['merged_review'])

In [27]:
len(lst_test_data)

1630

In [28]:
lst_test_data_short = lst_test_data[:1000]
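
Note that the strings in `test_df['merged_review']` still end with the gold label (for example `good->:positive`), so a prompt built directly from them already contains the answer. One possible way to handle this, sketched under the assumption that the last `->:` in each string separates review from label, is to split each entry into a prompt and an expected label. `split_example` is a hypothetical helper, not part of the notebook.

```python
SEPARATOR = "->:"


def split_example(text: str):
    """Split `review->:label` into (prompt ending in the separator, label)."""
    review, sep, label = text.rpartition(SEPARATOR)
    if not sep:
        # No separator found: treat the whole string as the review.
        return text + SEPARATOR, None
    return review + SEPARATOR, label.strip()


rows = ["good->:positive", "poor->:negative"]  # rows shown in the tables above
prompts, labels = zip(*(split_example(row) for row in rows))
print(prompts)
print(labels)
```

The prompts can then be passed to the pipeline, and the labels kept aside for comparing against the generated text.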

Inference takes a long time, so we limit it to the first 1,000 reviews. We also run it in chunks and write each chunk's results to disk, so that a failure partway through does not lose everything.
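
The chunk-and-persist pattern described here can also be written as a loop instead of one cell per slice. In this sketch `generate` is a stand-in for a call to the text-generation pipeline, and the `run_in_chunks` helper and file naming are illustrative assumptions:

```python
import csv
import os
import tempfile


def chunk_bounds(n_items: int, chunk_size: int):
    """Return non-overlapping (start, stop) index pairs covering n_items."""
    return [(start, min(start + chunk_size, n_items))
            for start in range(0, n_items, chunk_size)]


def run_in_chunks(prompts, generate, chunk_size, out_dir):
    """Generate chunk by chunk, writing each chunk to its own CSV file."""
    paths = []
    for start, stop in chunk_bounds(len(prompts), chunk_size):
        outputs = generate(prompts[start:stop])
        path = os.path.join(out_dir, f"{start}to{stop}.csv")
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["index", "generated_text"])
            for offset, text in enumerate(outputs):
                writer.writerow([start + offset, text])
        paths.append(path)
    return paths


# Stand-in generator so the sketch runs without a model.
def fake_generate(batch):
    return [prompt + "positive" for prompt in batch]


with tempfile.TemporaryDirectory() as tmp:
    written = run_in_chunks(["good->:"] * 10, fake_generate, chunk_size=4, out_dir=tmp)
    file_names = [os.path.basename(p) for p in written]
print(file_names)
```

Deriving the slice bounds in one place avoids overlapping or missing indices between chunks, and `csv.writer` takes care of quoting reviews that contain commas.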

In [32]:
import transformers
res = []

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto",
)

sequences = pipeline(
    lst_test_data_short[:250],
    max_length=100, 
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)

for ix, seq in enumerate(sequences):
    # Split on the separator; fall back to an empty answer if it is missing
    parts = seq[0]['generated_text'].split('->:')
    res.append([ix, parts[0] + "->" + (parts[1] if len(parts) > 1 else "")])

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.

In [37]:
res = []
sequences = pipeline(
    lst_test_data_short[500:750],
    max_length=100,  #200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)

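for ix, seq in enumerate(sequences):
    # Split on the separator; fall back to an empty answer if it is missing
    parts = seq[0]['generated_text'].split('->:')
    res.append([ix, parts[0] + "->" + (parts[1] if len(parts) > 1 else "")])

# Write one result per line
with open('500to750.csv', 'w') as f:
    for i in res:
        f.write(str(i) + "\n")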

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.

In [38]:
res = []
sequences = pipeline(
    lst_test_data_short[250:500],
    max_length=100,  #200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)

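for ix, seq in enumerate(sequences):
    # Split on the separator; fall back to an empty answer if it is missing
    parts = seq[0]['generated_text'].split('->:')
    res.append([ix, parts[0] + "->" + (parts[1] if len(parts) > 1 else "")])

# Write one result per line
with open('250to500.csv', 'w') as f:
    for i in res:
        f.write(str(i) + "\n")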

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.

In [39]:
res = []
sequences = pipeline(
    lst_test_data_short[750:1000],
    max_length=100,  #200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)

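for ix, seq in enumerate(sequences):
    # Split on the separator; fall back to an empty answer if it is missing
    parts = seq[0]['generated_text'].split('->:')
    res.append([ix, parts[0] + "->" + (parts[1] if len(parts) > 1 else "")])

# Write one result per line
with open('750to1000.csv', 'w') as f:
    for i in res:
        f.write(str(i) + "\n")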

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.