<a href="https://colab.research.google.com/github/Swagat-modder/Fine-tunings-and-AI/blob/main/Text_Completion_(Using_Fine_tuning_of_LLMs).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Step 1: Installation and Initial Setup**

In [None]:
%pip install transformers datasets



**Step 2: Loading and Sampling the Dataset**

In [None]:
from datasets import load_dataset
# Load the IMDb dataset and keep the first 1% of the training split (250 reviews)
data = load_dataset('imdb', split='train[:1%]')
print(data[0])

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

**Step 3: Data Preprocessing**

In [None]:
def preprocess(batch):
    batch['text']=[text.replace('\n',' ') for text in batch['text']]
    return batch

In [None]:
data=data.map(preprocess,batched=True)
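
The sample review printed in Step 2 contains HTML line breaks (`<br />`), which the newline replacement above leaves in place. If those tags are unwanted in the training text, a variant along the following lines could strip them as well. This is only a sketch: the name `preprocess_html` and the decision to collapse repeated whitespace are assumptions, not part of the original notebook.

```python
import re

# Hypothetical variant of preprocess(); the name and the whitespace collapsing are assumptions.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)

def preprocess_html(batch):
    cleaned = []
    for text in batch["text"]:
        text = BR_TAG.sub(" ", text)              # replace <br>, <br/>, <br /> with a space
        text = text.replace("\n", " ")            # same newline handling as preprocess()
        text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
        cleaned.append(text)
    batch["text"] = cleaned
    return batch

example = {"text": ["Line one.<br /><br />Line two.\nLine three."]}
print(preprocess_html(example)["text"][0])
```

It would be applied the same way as the original function, for example `data.map(preprocess_html, batched=True)`.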

**Step 4: Initializing the Model and Tokenizer**

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)  # model with a causal language-modelling head
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no pad token, so reuse the end-of-sequence token

Here, we load distilgpt2, a distilled and therefore smaller version of GPT-2 that is suited to causal language modelling. AutoTokenizer and AutoModelForCausalLM download and set up the tokenizer and model architecture that match the given model name. GPT-2's tokenizer does not define a padding token, so we reuse the end-of-sequence token (eos_token) as pad_token; without one, the tokenizer cannot pad sequences to a common length for batch processing.

**Step 5: Tokenizing the Data**

In [None]:
def tok_func(example):
    # Pad or truncate every review to exactly 128 tokens
    tokenized = tokenizer(example['text'], padding="max_length", truncation=True, max_length=128)
    tokenized['labels'] = tokenized['input_ids'].copy()  # labels are the same as input_ids
    return tokenized

tokenized_data = data.map(tok_func, batched=True)

This function tokenizes each text input, converting it into the integer IDs that the model processes. Using padding="max_length" together with truncation=True and max_length=128 gives every tokenized sequence a fixed length of 128 tokens, which keeps batch shapes uniform and bounds memory use. Setting labels to a copy of input_ids prepares the dataset for causal language modelling: the model shifts the labels by one position internally, so at each position it learns to predict the next token in the sequence.
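
To make the fixed-length behaviour concrete, here is a minimal pure-Python sketch of what padding to max_length and truncation do to a list of token IDs, using made-up IDs and a tiny max_length of 6 instead of 128. It also shows one common refinement that the cell above does not apply: replacing the labels at padded positions with -100, the value that PyTorch's cross-entropy loss ignores by default, so that padding does not contribute to the loss. The helper names `pad_or_truncate` and `mask_padding_labels` are hypothetical.

```python
PAD_ID = 50256        # distilgpt2's eos_token_id, reused as the pad token in this notebook
IGNORE_INDEX = -100   # label value skipped by PyTorch's cross-entropy loss by default

def pad_or_truncate(ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = ids[:max_length]                        # truncation=True
    n_pad = max_length - len(ids)                 # padding="max_length"
    return ids + [pad_id] * n_pad, [1] * len(ids) + [0] * n_pad

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids as labels, but mark padded positions as ignored."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]

short_ids, short_mask = pad_or_truncate([10, 11, 12], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(100, 110)), max_length=6)

print(short_ids, short_mask)
print(mask_padding_labels(short_ids, short_mask))
print(long_ids, long_mask)
```

The short sequence is padded on the right and the long one is cut to six tokens. Whether masking the padded labels is worthwhile here depends on how many reviews are shorter than 128 tokens.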

**Step 6: Configuring Training Parameters**

In [None]:
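from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",          # where checkpoints are written
    evaluation_strategy="epoch",     # evaluate after every epoch (named eval_strategy in newer transformers releases)
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=1,
    logging_dir="./logs",
    logging_steps=10,                # log the training loss every 10 steps
    save_total_limit=1,              # keep only the most recent checkpoint
)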



The TrainingArguments class defines the hyperparameters and settings for training. Key parameters include:

- output_dir: Directory where model checkpoints are saved.
- evaluation_strategy="epoch": Evaluate the model at the end of each epoch.
- per_device_train_batch_size and per_device_eval_batch_size: Number of samples processed per device in each batch during training and evaluation, respectively.
- num_train_epochs=1: Train the model for a single epoch.
- logging_dir: Directory where training logs are written.
- logging_steps=10: Log training information every 10 steps.
- save_total_limit=1: Keep at most one checkpoint on disk to limit storage use.
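
As a quick sanity check on how these settings interact, the number of optimizer steps follows from the training-set size, the batch size, and the epoch count. The sketch below assumes a single device, no gradient accumulation, and a 200-example training split (80% of the 250 sampled reviews).

```python
import math

n_train = int(0.8 * 250)            # 200 training examples (assumed 80% of the 1% IMDb sample)
per_device_train_batch_size = 4
num_train_epochs = 1
logging_steps = 10

steps_per_epoch = math.ceil(n_train / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
log_events = total_steps // logging_steps

print(steps_per_epoch, total_steps, log_events)
```

Under these assumptions the run takes 50 optimizer steps, which agrees with the global_step=50 reported in the training output further down.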

**Step 7: Splitting the Dataset**

In [None]:
shuffled = tokenized_data.shuffle(seed=42)  # shuffle once so the two slices below cannot overlap
split = int(0.8 * len(shuffled))
train_data = shuffled.select(range(split))
eval_data = shuffled.select(range(split, len(shuffled)))

Here, we randomly shuffle the dataset and split it into 80% training data and 20% evaluation data (200 and 50 examples, respectively, from the 250-example sample). The training portion is what the model learns from, while the evaluation portion serves as a validation set for assessing the model's performance.
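
One detail worth checking in any shuffle-then-slice split is that both slices come from the same permutation; slicing two independently shuffled copies can place the same example in both sets. The following sketch uses plain Python lists of indices in place of the datasets API, with an arbitrary seed of 42, to contrast the two cases.

```python
import random

n = 250                       # size of the sampled dataset
cut = int(0.8 * n)            # 200 train / 50 eval

# Case 1: one permutation, two slices -> disjoint by construction
rng = random.Random(42)       # arbitrary seed, used only for reproducibility
perm = list(range(n))
rng.shuffle(perm)
train_idx, eval_idx = perm[:cut], perm[cut:]
print(len(set(train_idx) & set(eval_idx)))   # number of shared examples

# Case 2: two independent permutations -> the slices usually share examples
perm_a, perm_b = list(range(n)), list(range(n))
rng.shuffle(perm_a)
rng.shuffle(perm_b)
overlap = set(perm_a[:cut]) & set(perm_b[cut:])
print(len(overlap))
```

With the datasets library, the first case corresponds to calling shuffle() once (optionally with a seed) and selecting both ranges from that result; Dataset.train_test_split(test_size=0.2) is another option.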

**Step 8: Setting Up the Trainer & Fine-Tuning the Model**

In [None]:
from transformers import Trainer
trainer=Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data
)

The Trainer class in transformers simplifies the training process by automating tasks like gradient updates and model evaluation. It uses training_args for hyperparameters and takes the train_data and eval_data datasets to structure the training and validation process.

In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 3.8506        | 3.651736        |


TrainOutput(global_step=50, training_loss=4.012131576538086, metrics={'train_runtime': 26.9368, 'train_samples_per_second': 7.425, 'train_steps_per_second': 1.856, 'total_flos': 6532418764800.0, 'train_loss': 4.012131576538086, 'epoch': 1.0})
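
The validation loss of a causal language model is an average negative log-likelihood per token (in nats), so it can be turned into a perplexity by exponentiating. The sketch below does that for the numbers reported above and also re-derives the throughput figures from the runtime; the 200-example training-set size is an assumption based on the 80/20 split of 250 reviews.

```python
import math

eval_loss = 3.651736        # validation loss after epoch 1, as reported above
train_runtime = 26.9368     # seconds, from TrainOutput
global_step = 50            # from TrainOutput
n_train = 200               # assumed: 80% of the 250 sampled reviews

perplexity = math.exp(eval_loss)
samples_per_second = n_train / train_runtime
steps_per_second = global_step / train_runtime

print(f"validation perplexity ~ {perplexity:.1f}")
print(f"samples/s ~ {samples_per_second:.3f}")
print(f"steps/s   ~ {steps_per_second:.3f}")
```

The recomputed throughput should agree with train_samples_per_second and train_steps_per_second in the TrainOutput line, which supports the assumed training-set size.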

**Step 9: Saving and Testing the Model**

In [None]:
model.save_pretrained("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")

('./fine_tuned_model/tokenizer_config.json',
 './fine_tuned_model/special_tokens_map.json',
 './fine_tuned_model/vocab.json',
 './fine_tuned_model/merges.txt',
 './fine_tuned_model/added_tokens.json',
 './fine_tuned_model/tokenizer.json')

In [None]:
prompt = "This is a prompt"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)  # keep the inputs on the same device as the model
output = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'], max_length=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))

This is a prompt to ask for help. I have no idea what the answer is. I have


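In this final section, we provide a sample prompt ("This is a prompt") to test the model's generative capabilities. The generate() function extends the prompt one token at a time. With the settings used here (no do_sample=True, and assuming the model's generation config does not enable sampling), it decodes greedily, choosing the most likely next token at each step until the sequence reaches max_length=20 tokens, prompt included. By decoding and printing the output, we can observe how closely the fine-tuned model's text resembles the IMDb reviews it was trained on.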
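
The decoding loop behind generate() can be illustrated without a neural network. The sketch below uses a made-up next-token probability table (entirely hypothetical, not derived from distilgpt2) and contrasts greedy decoding, which always picks the most likely next token, with sampling, which draws the next token at random in proportion to its probability.

```python
import random

# Hypothetical next-token probabilities; a real model computes these from the whole prefix.
NEXT = {
    "this": {"is": 0.9, "was": 0.1},
    "is": {"a": 0.6, "the": 0.4},
    "was": {"a": 0.5, "the": 0.5},
    "a": {"prompt": 0.7, "movie": 0.3},
    "the": {"movie": 0.8, "prompt": 0.2},
}

def greedy_decode(start, max_length):
    tokens = [start]
    while len(tokens) < max_length and tokens[-1] in NEXT:
        options = NEXT[tokens[-1]]
        tokens.append(max(options, key=options.get))   # most likely next token
    return tokens

def sample_decode(start, max_length, rng):
    tokens = [start]
    while len(tokens) < max_length and tokens[-1] in NEXT:
        options = NEXT[tokens[-1]]
        # draw one token at random, weighted by its probability
        choice = rng.choices(list(options), weights=list(options.values()), k=1)[0]
        tokens.append(choice)
    return tokens

print(" ".join(greedy_decode("this", max_length=4)))
print(" ".join(sample_decode("this", max_length=4, rng=random.Random(0))))
```

Greedy decoding is deterministic, so repeated calls return the same sequence. In transformers, sampling is requested by passing do_sample=True to generate().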

*Fine-tuning large language models (LLMs) means adapting a pre-trained model to perform well on a specific task or to reflect a specialized domain of language. Fine-tuning is essential when the model's general knowledge needs refinement to meet the precision required in a specific field or task.*